You can write a custom Hadoop InputFormat that passes the file name through
with each record. I generally find it easier to just ask the Hadoop
FileSystem API for the file names and construct the RDDs from them, though.
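
A minimal sketch of the second approach: list the directories through the
Hadoop FileSystem API, build one RDD per directory, and tag each record with
the customer id taken from the directory name. Assumptions that are not in
this thread: keys and values are Text, the customer=<id> directories sit
directly under the root, and the names PathTaggedRdds and load are made up
for illustration.

```scala
import java.net.URI

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.io.Text
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark or Hadoop.
object PathTaggedRdds {

  // Returns (customerId, key, value) triples.
  // Assumption: keys and values are Text; substitute your real Writable types.
  def load(sc: SparkContext, root: String): RDD[(String, String, String)] = {
    val fs = FileSystem.get(new URI(root), sc.hadoopConfiguration)

    // globStatus can return null when nothing matches, hence the Option.
    val customerDirs: Array[Path] =
      Option(fs.globStatus(new Path(root, "customer=*")))
        .getOrElse(Array.empty[FileStatus])
        .filter(_.isDirectory)
        .map(_.getPath)

    val perCustomer = customerDirs.map { dir =>
      // "customer=123" -> "123"
      val customerId = dir.getName.stripPrefix("customer=")
      sc.sequenceFile(dir.toString, classOf[Text], classOf[Text])
        // Hadoop reuses Writable instances, so copy to String right away.
        .map { case (k, v) => (customerId, k.toString, v.toString) }
    }

    sc.union(perCustomer.toSeq)
  }
}
```

From there you could turn the triples into a DataFrame and, if your Spark
version has DataFrameWriter.partitionBy (I believe 1.4 and later), write the
Parquet output partitioned by the customer column.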
On Tuesday, September 1, 2015, Matt K wrote:
Just want to add - I'm looking to partition the resulting Parquet files by
customer-id, which is why I'm looking to extract the customer-id from the
path.
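
Once a path string is available, pulling the customer-id out of it could
look roughly like this. It is a sketch that assumes the customer=<id> layout
shown in the original message below and assumes ids are all digits;
CustomerPath, customerId and the part-00000 file name are hypothetical.

```scala
// Hypothetical helper for illustration only.
object CustomerPath {

  // A whole path segment of the form customer=<digits>.
  // Assumption: ids are numeric, as in the example paths.
  private val CustomerSegment = """(?:^|/)customer=(\d+)(?:/|$)""".r

  // Returns the id if the path contains such a segment, otherwise None.
  def customerId(path: String): Option[String] =
    CustomerSegment.findFirstMatchIn(path).map(_.group(1))
}

object CustomerPathDemo extends App {
  // part-00000 is a made-up file name.
  println(CustomerPath.customerId("s3://data/customer=123/part-00000"))
}
```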
On Tue, Sep 1, 2015 at 7:00 PM, Matt K wrote:
Hi all,
TL;DR - is there a way to extract the source path from an RDD via the Scala
API?
I have sequence files on S3 that look something like this:
s3://data/customer=123/...
s3://data/customer=456/...
I am using Spark DataFrames to convert these sequence files to Parquet. As
part of the process