Hi all, TL;DR - is there a way to extract the source path from an RDD via the Scala API?
I have sequence files on S3 that look something like this:

    s3://data/customer=123/...
    s3://data/customer=456/...

I am using Spark Dataframes to convert these sequence files to Parquet. As part of the processing, I actually need to know the customer-id. I'm doing something like this:

    val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*", classOf[BytesWritable], classOf[Text])
    val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema, delimiter))
    val dataFrame = sql.createDataFrame(rowRdd, schema)

What I am trying to figure out is how to get the customer-id, which is part of the path. I am not sure if there's a way to extract the source path from the resulting HadoopRDD. Do I need to create one RDD per customer to get around this?

Thanks,
-Matt
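
P.S. To be clear about which part is missing: pulling the customer-id out of a path string is the easy half. Below is a minimal sketch of that step, assuming the path is already available as a plain String. `CustomerPath` and `customerIdFromPath` are hypothetical names made up for illustration and are not part of Spark. The open question is still how to get at the path for each record in the first place.

```scala
// Hypothetical helper (not a Spark API): extracts the value of the
// "customer=<id>" segment from a full file path.
object CustomerPath {
  // Captures everything after "customer=" up to the next "/".
  private val CustomerSegment = "customer=([^/]+)".r

  // Returns Some(id) if the path contains a customer segment, None otherwise.
  def customerIdFromPath(path: String): Option[String] =
    CustomerSegment.findFirstMatchIn(path).map(_.group(1))
}
```

Returning an Option keeps paths without a `customer=` segment from throwing, so the caller decides how to handle them.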