Hi spark-folk,

I have a directory full of files that I want to process using PySpark. There is some necessary metadata in the filename that I would love to attach to each record in that file.

In Java MapReduce, I would access ((FileSplit) context.getInputSplit()).getPath().getName() in the setup() method of the mapper. In Hadoop Streaming, I can access the environment variable map_input_file to get the filename. Is there something I can do in PySpark to get the filename?

One solution would be to get the list of files first, load each one as an RDD separately, and then union them together. But listing the files in HDFS is a bit annoying from Python, so I was wondering whether the filename is somehow attached to a partition.

Thanks!
Uri

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
laser...@cloudera.com
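For concreteness, a minimal sketch of the union workaround mentioned above. This assumes `sc` is an existing SparkContext and `paths` is a list of HDFS paths obtained out-of-band (both are assumptions, not part of the question); the tagging step itself is just plain Python:

```python
def tag_records(path, records):
    """Pair each record with the filename it came from."""
    return [(path, record) for record in records]

# With a live SparkContext `sc` and a list `paths` of HDFS paths
# (hypothetical here), the per-file-RDD-then-union approach would
# look roughly like:
#
#   rdds = [sc.textFile(p).map(lambda line, p=p: (p, line)) for p in paths]
#   tagged = sc.union(rdds)
#
# Demonstrating the tagging step locally, without a cluster:
print(tag_records("hdfs://data/part-0001.txt", ["a", "b"]))
```

Note the `p=p` default argument in the lambda: without it, every closure would capture the same loop variable and all records would be tagged with the last path.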