Re: extracting file path using dataframes
On Tuesday, September 1, 2015, Matt K wrote:
> Just want to add - I'm looking to partition the resulting Parquet files by
> customer-id, which is why I'm looking to extract the customer-id from the
> path.

You can write a Hadoop input format that passes the file name through with each record. In practice, though, I generally find it easier to query Hadoop directly for the file names and construct the RDDs from those.
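A minimal sketch of the second approach (list the directories through Hadoop, then build one RDD per customer), not a definitive implementation. It assumes a Hadoop 2.x `FileSystem` API and the `customer=<id>` directory layout from the original question; the names `PerCustomerRdds`, `customerIdOf`, and `load` are hypothetical, introduced only for illustration.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; none of these names come from the thread.
object PerCustomerRdds {
  // Matches a whole directory name of the form "customer=<id>".
  private val CustomerDir = "customer=(.+)".r

  /** Returns the id encoded in a directory name, or None for any other name. */
  def customerIdOf(dirName: String): Option[String] = dirName match {
    case CustomerDir(id) => Some(id)
    case _               => None
  }

  /** Builds one RDD per customer directory under `root` and unions them,
    * tagging every record with the customer-id taken from the directory name. */
  def load(sc: SparkContext, root: String): RDD[(String, String)] = {
    val rootPath = new Path(root)
    val fs = rootPath.getFileSystem(sc.hadoopConfiguration)

    val perCustomer: Seq[RDD[(String, String)]] =
      fs.listStatus(rootPath).toSeq
        .filter(_.isDirectory)
        .flatMap { status =>
          customerIdOf(status.getPath.getName).map { id =>
            sc.sequenceFile(s"${status.getPath}/*", classOf[BytesWritable], classOf[Text])
              // Hadoop reuses Writable instances, so copy the value out as a String.
              .map { case (_, value) => (id, value.toString) }
          }
        }

    sc.union(perCustomer)
  }
}
```

Each element of the result pairs the customer-id with the raw text row, so the id can be carried into the row conversion. With a very large number of customer directories, one RDD per directory may become expensive to plan, which is a trade-off of this approach.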
Re: extracting file path using dataframes
On Tue, Sep 1, 2015 at 7:00 PM, Matt K wrote:
> TL;DR - is there a way to extract the source path from an RDD via the
> Scala API?

Just want to add - I'm looking to partition the resulting Parquet files by customer-id, which is why I'm looking to extract the customer-id from the path.
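Once the customer-id is available as a column, the partitioned write itself is short. A minimal sketch, assuming Spark 1.4 or later (where `DataFrame.write` and `DataFrameWriter.partitionBy` are available) and a DataFrame that already carries a column named `customer`. The column name, the `PartitionedWrite` object, and the `outputRoot` parameter are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper; assumes `df` already has a "customer" column.
object PartitionedWrite {
  /** Writes `df` as Parquet under `outputRoot`, with one
    * "customer=<id>" subdirectory per distinct customer value. */
  def write(df: DataFrame, outputRoot: String): Unit =
    df.write
      .partitionBy("customer") // the partition column becomes part of the output path
      .parquet(outputRoot)
}
```

The partition column is encoded in the directory names rather than stored inside the Parquet files, which mirrors the `customer=...` layout of the input paths.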
extracting file path using dataframes
Hi all,

TL;DR - is there a way to extract the source path from an RDD via the Scala API?

I have sequence files on S3 that look something like this:

    s3://data/customer=123/...
    s3://data/customer=456/...

I am using Spark Dataframes to convert these sequence files to Parquet. As part of the processing, I actually need to know the customer-id. I'm doing something like this:

    val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*",
      classOf[BytesWritable],
      classOf[Text])

    val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema, delimiter))

    val dataFrame = sql.createDataFrame(rowRdd, schema)

What I am trying to figure out is how to get the customer-id, which is part of the path. I am not sure if there's a way to extract the source path from the resulting HadoopRDD. Do I need to create one RDD per customer to get around this?

Thanks,

-Matt
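For paths shaped like the ones above, the customer-id can be parsed out of the path string once the path is known. A minimal sketch in plain Scala, assuming the id is whatever follows `customer=` up to the next `/`; the `CustomerPath` object and `customerId` function are hypothetical names, and how the path string is obtained is left open here.

```scala
// Hypothetical helper; works on any path string containing a "customer=<id>" segment.
object CustomerPath {
  // A path segment that starts with "customer=", capturing everything up to the next "/".
  private val CustomerSegment = "(?:^|/)customer=([^/]+)".r

  /** Returns the customer-id found in `path`, or None when no such segment exists. */
  def customerId(path: String): Option[String] =
    CustomerSegment.findFirstMatchIn(path).map(_.group(1))
}
```

Returning an `Option` keeps paths without a `customer=` segment from failing the job; the caller can decide whether to drop or flag such records.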