Re: extracting file path using dataframes

2015-09-01 Thread Jonathan Coveney
You can make a Hadoop input format which passes through the name of the file. I generally find it easier to just hit Hadoop, get the file names, and construct the RDDs, though.

On Tuesday, September 1, 2015, Matt K wrote:
> Just want to add - I'm looking to partition the resulting Parquet

Re: extracting file path using dataframes

2015-09-01 Thread Matt K
Just want to add - I'm looking to partition the resulting Parquet files by customer-id, which is why I'm looking to extract the customer-id from the path.

On Tue, Sep 1, 2015 at 7:00 PM, Matt K wrote:
> Hi all,
>
> TL;DR - is there a way to extract the source path from an RDD via the
> Scala API
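
For illustration only, here is a minimal sketch of the path-parsing step, assuming the path of each input file is already known (for example from a file listing) and follows the customer=<id> layout described in this thread. The object name CustomerPath, the function customerId, and the file name part-00000 are hypothetical and are not part of Spark or Hadoop.

```scala
// Hypothetical helper, not a Spark or Hadoop API.
object CustomerPath {
  // Matches a whole path segment of the form customer=<id>,
  // where <id> is any non-empty run of characters without a slash.
  private val CustomerSegment = "(?:^|/)customer=([^/]+)".r

  // Returns Some(id) when the path contains a customer=<id> segment,
  // and None otherwise.
  def customerId(path: String): Option[String] =
    CustomerSegment.findFirstMatchIn(path).map(_.group(1))
}

object CustomerPathExample extends App {
  // The file name part-00000 is made up for this example.
  println(CustomerPath.customerId("s3://data/customer=123/part-00000"))
  println(CustomerPath.customerId("s3://data/other/part-00000"))
}
```

Returning an Option keeps paths without a customer segment from failing the whole job. Once each row carries the extracted id as a column, writing with the DataFrame writer's partitionBy on that column is one possible way to get Parquet output partitioned by customer.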

extracting file path using dataframes

2015-09-01 Thread Matt K
Hi all,

TL;DR - is there a way to extract the source path from an RDD via the Scala API?

I have sequence files on S3 that look something like this:

s3://data/customer=123/...
s3://data/customer=456/...

I am using Spark Dataframes to convert these sequence files to Parquet. As part of the process