Re: extracting file path using dataframes

2015-09-01 Thread Jonathan Coveney
You can write a Hadoop InputFormat that passes the file name through with
each record. I generally find it easier to just ask Hadoop for the file
names and construct the RDDs from those, though.
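
A rough sketch of that second approach, in case it helps. It assumes the
layout from the original question (one customer=<id> directory per customer
under a common root) and that each record's value can be carried as a String.
The names PerCustomerRdds, customerIdOf and load are made up for illustration
and are not part of Spark or Hadoop.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PerCustomerRdds {
  private val Prefix = "customer="

  // Hypothetical helper: "customer=123" -> Some("123"); anything else -> None.
  def customerIdOf(dirName: String): Option[String] =
    if (dirName.startsWith(Prefix) && dirName.length > Prefix.length)
      Some(dirName.stripPrefix(Prefix))
    else None

  // Lists the customer directories under `root` (e.g. "s3://data"), reads each
  // one as its own RDD, tags every record with the customer id, then unions.
  def load(sc: SparkContext, root: String): RDD[(String, String)] = {
    val pattern = new Path(root, Prefix + "*")
    val fs = pattern.getFileSystem(sc.hadoopConfiguration)
    // globStatus can return null, so wrap the result defensively.
    val dirs = Option(fs.globStatus(pattern)).map(_.toSeq).getOrElse(Seq.empty)

    val perCustomer = for {
      status <- dirs
      id <- customerIdOf(status.getPath.getName).toSeq
    } yield {
      sc.sequenceFile(status.getPath.toString + "/*",
          classOf[BytesWritable], classOf[Text])
        // Hadoop reuses Writable instances, so copy the value out as a String.
        .map { case (_, value) => (id, value.toString) }
    }
    sc.union(perCustomer)
  }
}
```

Calling PerCustomerRdds.load(sc, "s3://data") would give (customerId, line)
pairs that can then go through the existing row conversion.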

On Tuesday, September 1, 2015, Matt K wrote:

> Just want to add - I'm looking to partition the resulting Parquet files by
> customer-id, which is why I'm looking to extract the customer-id from the
> path.
>
> On Tue, Sep 1, 2015 at 7:00 PM, Matt K wrote:
>
>> Hi all,
>>
>> TL;DR - is there a way to extract the source path from an RDD via the
>> Scala API?
>>
>> I have sequence files on S3 that look something like this:
>> s3://data/customer=123/...
>> s3://data/customer=456/...
>>
>> I am using Spark DataFrames to convert these sequence files to Parquet.
>> As part of the processing, I actually need to know the customer-id. I'm
>> doing something like this:
>>
>> val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*", 
>> classOf[BytesWritable],
>> classOf[Text])
>>
>> val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema,
>> delimiter))
>>
>> val dataFrame = sql.createDataFrame(rowRdd, schema)
>>
>>
>> What I am trying to figure out is how to get the customer-id, which is
>> part of the path. I am not sure if there's a way to extract the source path
>> from the resulting HadoopRDD. Do I need to create one RDD per customer to
>> get around this?
>>
>>
>> Thanks,
>>
>> -Matt
>>


Re: extracting file path using dataframes

2015-09-01 Thread Matt K
Just want to add - I'm looking to partition the resulting Parquet files by
customer-id, which is why I'm looking to extract the customer-id from the
path.
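
For the partitioning step itself, a minimal sketch, assuming Spark 1.4 or
later (where DataFrameWriter.partitionBy is available) and assuming the
DataFrame already carries the customer id in a column. The column name
"customer" and the names PartitionedParquet and write are placeholders, not
something from the job described above.

```scala
import org.apache.spark.sql.DataFrame

object PartitionedParquet {
  // Assumes `df` has a column named "customer" (hypothetical name).
  // Each distinct value ends up in its own customer=<id> subdirectory
  // under `outputPath`.
  def write(df: DataFrame, outputPath: String): Unit =
    df.write
      .partitionBy("customer")
      .parquet(outputPath)
}
```

The write uses the default save mode, so it fails rather than overwrites if
the output path already exists.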

On Tue, Sep 1, 2015 at 7:00 PM, Matt K wrote:

> Hi all,
>
> TL;DR - is there a way to extract the source path from an RDD via the
> Scala API?
>
> I have sequence files on S3 that look something like this:
> s3://data/customer=123/...
> s3://data/customer=456/...
>
> I am using Spark DataFrames to convert these sequence files to Parquet. As
> part of the processing, I actually need to know the customer-id. I'm doing
> something like this:
>
> val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*", 
> classOf[BytesWritable],
> classOf[Text])
>
> val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema,
> delimiter))
>
> val dataFrame = sql.createDataFrame(rowRdd, schema)
>
>
> What I am trying to figure out is how to get the customer-id, which is
> part of the path. I am not sure if there's a way to extract the source path
> from the resulting HadoopRDD. Do I need to create one RDD per customer to
> get around this?
>
>
> Thanks,
>
> -Matt
>





extracting file path using dataframes

2015-09-01 Thread Matt K
Hi all,

TL;DR - is there a way to extract the source path from an RDD via the Scala
API?

I have sequence files on S3 that look something like this:
s3://data/customer=123/...
s3://data/customer=456/...

I am using Spark DataFrames to convert these sequence files to Parquet. As
part of the processing, I actually need to know the customer-id. I'm doing
something like this:

val rdd = sql.sparkContext.sequenceFile("s3://data/customer=*/*",
  classOf[BytesWritable],
  classOf[Text])

val rowRdd = rdd.map(x => convertTextRowToTypedRdd(x._2, schema, delimiter))

val dataFrame = sql.createDataFrame(rowRdd, schema)


What I am trying to figure out is how to get the customer-id, which is part
of the path. I am not sure if there's a way to extract the source path from
the resulting HadoopRDD. Do I need to create one RDD per customer to get
around this?


Thanks,

-Matt
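
One possible way to read the path off the HadoopRDD without creating one RDD
per customer is HadoopRDD.mapPartitionsWithInputSplit, which hands each
partition its input split. This is only a sketch: it assumes the RDD returned
by sequenceFile is a HadoopRDD at runtime and that its splits are FileSplit
instances, and as far as I know the method is marked as a developer API, so
it may change between releases. PathTaggedRecords, customerIdOf and load are
illustrative names.

```scala
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapred.FileSplit
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{HadoopRDD, RDD}

object PathTaggedRecords {
  private val CustomerDir = "customer=([^/]+)".r

  // Hypothetical helper: pulls "123" out of ".../customer=123/part-00000".
  def customerIdOf(path: String): Option[String] =
    CustomerDir.findFirstMatchIn(path).map(_.group(1))

  def load(sc: SparkContext, glob: String): RDD[(String, String)] = {
    val rdd = sc.sequenceFile(glob, classOf[BytesWritable], classOf[Text])
    // Assumption: sequenceFile is backed by a HadoopRDD, so this cast succeeds.
    val hadoopRdd = rdd.asInstanceOf[HadoopRDD[BytesWritable, Text]]
    hadoopRdd.mapPartitionsWithInputSplit { (split, records) =>
      // Assumption: file-based input formats produce FileSplit instances.
      val path = split.asInstanceOf[FileSplit].getPath.toString
      val id = customerIdOf(path).getOrElse("unknown")
      // Copy the reused Text instance into a String before it is overwritten.
      records.map { case (_, value) => (id, value.toString) }
    }
  }
}
```

With the glob from the question, PathTaggedRecords.load(sc,
"s3://data/customer=*/*") would yield (customerId, line) pairs in a single
RDD.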