Sure, and please post back if it works (or it does not :) )
On Wed, Sep 14, 2016 at 2:09 PM, Saliya Ekanayake wrote:
> Thank you, I'll try.
>
> saliya
Thank you, I'll try.
saliya
On Wed, Sep 14, 2016 at 12:07 AM, ayan guha wrote:
Depends on the join, but unless you are doing a cross join, it should not
blow up. 6M is not too much. I think what you may want to consider is (a)
the volume of your data files, and (b) reducing shuffling by using the same
partitioning on both RDDs.
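A toy sketch of the co-partitioning idea in (b), in plain Python rather than the Spark API (all data and helper names below are invented): if both datasets are partitioned with the same hash function and partition count, records with the same key land in the same partition index on both sides, so the join can proceed partition-by-partition with no cross-partition data movement.

```python
# Plain-Python model of hash co-partitioning (not the Spark API).
# When both sides use the same partitioner, matching keys are co-located,
# which is what lets Spark avoid a shuffle at join time.

def partition(records, num_partitions):
    """Split (key, value) pairs into hash partitions, HashPartitioner-style."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

rdd1 = [("f1.txt", "row-a"), ("f2.txt", "row-b"), ("f3.txt", "row-c")]
rdd2 = [("f1.txt", "<contents 1>"), ("f3.txt", "<contents 3>")]

parts1 = partition(rdd1, 4)
parts2 = partition(rdd2, 4)

# Join partition-by-partition: no record ever has to leave its partition.
joined = []
for p1, p2 in zip(parts1, parts2):
    lookup = dict(p2)
    joined.extend((k, (v, lookup[k])) for k, v in p1 if k in lookup)
```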
On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake wrote:
Thank you, but isn't that join going to be too expensive for this?
On Tue, Sep 13, 2016 at 11:55 PM, ayan guha wrote:
My suggestion:
1. Read the first text file into (say) RDD1 using textFile.
2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have
the signature (filename, filecontent).
3. Join RDD1 and RDD2 on the file name (or some other key).
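A plain-Python model of what steps 1-3 amount to (not actual PySpark; the paths, record format, and keying below are invented for illustration). One detail worth noting: wholeTextFiles keys RDD2 by the full HDFS path, so both sides usually need to be mapped to a common key, e.g. the base file name, before the join.

```python
import os.path

# Step 1: the lines of the first file (RDD1). Each line is assumed,
# for illustration only, to mention the data file it refers to.
lines = ["record-1 data/f1.txt", "record-2 data/f2.txt"]

# Step 2: (filename, filecontent) pairs as wholeTextFiles would give them,
# except that the filename is a full path.
files = [("hdfs://nn/data/f1.txt", "<contents 1>"),
         ("hdfs://nn/data/f2.txt", "<contents 2>")]

# Key both sides by base file name (in Spark: a map() on each RDD).
keyed_lines = [(os.path.basename(line.split()[1]), line) for line in lines]
keyed_files = [(os.path.basename(path), content) for path, content in files]

# Step 3: the join, modeled with a dict lookup.
lookup = dict(keyed_files)
joined = [(k, (rec, lookup[k])) for k, rec in keyed_lines if k in lookup]
```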
On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake wrote:
1.) What needs to be parallelized is the work for each of those 6M rows,
not the 80K files. Let me elaborate on this with a simple for loop, as if
we were to write this serially.

for each line L of the 6M lines in the first file {
    process the file (out of the 80K) corresponding to L
}
The 80K files
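The serial loop above can be sketched as runnable Python; the 80K files are stood in by an in-memory dict, and process() is a hypothetical placeholder for the real per-file work:

```python
# Runnable sketch of the serial version (all names invented).
files = {"f1.txt": "alpha", "f2.txt": "beta"}   # stand-in for the 80K files

def process(content):
    """Hypothetical per-file work; here it just measures the content."""
    return len(content)

# The first file's lines (6M of them in reality), each naming one file.
first_file_lines = ["f1.txt", "f2.txt", "f1.txt"]

# It is this per-line work that should be parallelized.
results = [process(files[line]) for line in first_file_lines]
```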
Questions:
1. Why can you not read all 80K files together? I.e., why do you have a
dependency on the first text file?
2. Your first text file has 6M rows, but the total number of files is ~80K.
Is there a scenario where there may not be a file in HDFS corresponding to
a row in the first text file?
3. May be a follow
The first text file is not that large; it has 6 million records (lines).
For each line I need to read a file out of those 80K files. They total
around 1.5TB. I didn't understand what you meant by "then again read text
files for each line and union all rdds."
On Tue, Sep 13, 2016 at 10:04 PM, Raghavend
How large is your first text file? The idea is: you read the first text
file, and if it is not large, you can collect all the lines on the driver
and then again read the text files for each line and union all the RDDs.
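In plain-Python terms (not the Spark API; the data and names are invented), the suggestion is roughly: with the small first file collected on the driver, build one dataset per line and union them all. In Spark, each inner list would instead be an RDD created with sc.textFile, combined via sc.union:

```python
# Per-path contents, standing in for files readable via sc.textFile.
datasets = {"f1.txt": ["a1", "a2"], "f2.txt": ["b1"]}

# The small first text file, collected on the driver.
collected_lines = ["f1.txt", "f2.txt"]

# One "RDD" per line, then the union of all of them.
rdds = [datasets[path] for path in collected_lines]
unioned = [row for rdd in rdds for row in rdd]
```

Note that this creates one RDD per line, which is only reasonable when the first file is small.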
On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" wrote:
Just wondering if this is possible with Spark?
On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake wrote:
Hi,
I've got a text file where each line is a record. For each record, I need
to process a file in HDFS.
So if I represent these records as an RDD and invoke a map() operation on
them, how can I access HDFS within that map()? Do I have to create a Spark
context within map(), or is there a better way?