Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
Sure, and please post back if it works (or it does not :) )

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Thank you, I'll try. saliya

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
Depends on the join, but unless you are doing a cross join, it should not blow up. 6M is not too much. I think what you may want to consider is (a) the volume of your data files and (b) reducing shuffling by using the same partitioning on both RDDs.
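A small Scala sketch of the partitioning point (placeholder data and a hypothetical partition count; the two pair RDDs stand in for the keyed index rows and file contents discussed further down this thread). If both sides are partitioned with the same partitioner before the join, the join itself does not reshuffle either of them:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("copartitioned-join"))

    // Tiny placeholder pair RDDs: (fileName -> indexLine) and (fileName -> fileContent)
    val indexByName = sc.parallelize(Seq("a.dat" -> "row for a", "b.dat" -> "row for b"))
    val filesByName = sc.parallelize(Seq("a.dat" -> "contents of a", "b.dat" -> "contents of b"))

    // Partition both sides with the same partitioner so the join causes no further shuffle
    val part = new HashPartitioner(200)        // number of partitions is a tuning knob
    val joined = indexByName.partitionBy(part).join(filesByName.partitionBy(part))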

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Thank you, but isn't that join going to be too expensive for this?

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
My suggestion:
1. Read the first text file into (say) RDD1 using textFile.
2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have the signature (filename, filecontent).
3. Join RDD1 and RDD2 on the file name (or some other key).
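A minimal Scala sketch of those three steps (the paths and key layout are assumptions: it takes the join key to be the bare file name, with the first comma-separated column of each index line naming the file):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("index-join"))

    // 1. RDD1: the 6M-row index file, keyed by the data-file name it refers to
    val rdd1 = sc.textFile("hdfs:///path/to/index.txt")
      .map(line => (line.split(",")(0), line))

    // 2. RDD2: (path, content) for the ~80K data files, re-keyed by bare file name
    val rdd2 = sc.wholeTextFiles("hdfs:///path/to/datafiles/*")
      .map { case (path, content) => (path.split("/").last, content) }

    // 3. Join them and do the per-record work on each (indexLine, fileContent) pair
    val results = rdd1.join(rdd2).map { case (fileName, (indexLine, content)) =>
      (fileName, indexLine.length + content.length)     // placeholder for the real processing
    }

One caveat: wholeTextFiles materializes each file as a single in-memory record, so this works best when individual files are modest in size (1.5 TB over ~80K files averages out to a few tens of MB per file, which is usually fine).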

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
1.) What needs to be parallelized is the work for each of those 6M rows, not the 80K files. Let me elaborate this with a simple for loop, as if we were to write this serially:

    for each line L (out of 6M) in the first file {
      process the file corresponding to L (out of those 80K files)
    }

The 80K
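In Spark terms that serial loop becomes a map over the 6M-line RDD, so the per-row work is what gets distributed. A hypothetical Scala sketch, where pathFor and processOneFile are stand-ins for deriving the data-file path from a line and for the actual per-file work:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("per-row-work"))

    // Hypothetical helpers
    def pathFor(line: String): String = "hdfs:///path/to/datafiles/" + line.split(",")(0)
    def processOneFile(path: String): Long = path.length   // placeholder for the real per-file work

    val lines = sc.textFile("hdfs:///path/to/index.txt")   // the 6M-row first file
    val results = lines.map(line => processOneFile(pathFor(line)))  // distributed per row, not per file

How to actually open the referenced HDFS file inside the task, without a second SparkContext, is sketched after the original question at the bottom of this thread.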

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread ayan guha
Questions:
1. Why can you not read all 80K files together? i.e., why do you have a dependency on the first text file?
2. Your first text file has 6M rows, but the total number of files is ~80K. Is there a scenario where there may not be a file in HDFS corresponding to a row in the first text file?
3. May be a follow

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
The first text file is not that large; it has 6 million records (lines). For each line I need to read a file out of those 80K files. They total around 1.5 TB. I didn't understand what you meant by "then again read text files for each line and union all rdds."

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Raghavendra Pandey
How large is your first text file? The idea is that you read the first text file and, if it is not large, you can collect all the lines on the driver, then read the text file for each line and union all the RDDs.
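A rough Scala sketch of that idea (hypothetical paths; it assumes each line of the first file names one HDFS file, and it only makes sense when the first file is small enough to collect on the driver):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("collect-and-union"))

    // Collect the (small) first file on the driver
    val lines = sc.textFile("hdfs:///path/to/index.txt").collect()

    // Read the file referenced by each line as its own RDD, then union them all
    val perLineRdds = lines.map(line => sc.textFile("hdfs:///path/to/datafiles/" + line.trim))
    val combined = sc.union(perLineRdds)

With 6M lines (and ~80K distinct files) this would create a very large number of RDDs, which is why the wholeTextFiles + join suggestion above scales better here.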

Re: Access HDFS within Spark Map Operation

2016-09-13 Thread Saliya Ekanayake
Just wondering if this is possible with Spark?

Access HDFS within Spark Map Operation

2016-09-11 Thread Saliya Ekanayake
Hi, I've got a text file where each line is a record. For each record, I need to process a file in HDFS. So if I represent these records as an RDD and invoke a map() operation on them, how can I access HDFS within that map()? Do I have to create a Spark context within map() or is there a
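For what it's worth, a hypothetical Scala sketch of one common pattern (not something confirmed in this thread): a SparkContext cannot be created or used inside map(), but a task can open HDFS files directly through the Hadoop FileSystem API, for example from mapPartitions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.io.Source

    val sc = new SparkContext(new SparkConf().setAppName("hdfs-inside-map"))
    val records = sc.textFile("hdfs:///path/to/index.txt")   // one record per line

    val processed = records.mapPartitions { iter =>
      val hadoopConf = new Configuration()      // created on the executor; no SparkContext needed here
      iter.map { record =>
        val path = new Path("hdfs:///path/to/datafiles/" + record.trim)  // assumes the record names a file
        val fs = path.getFileSystem(hadoopConf) // Hadoop caches FileSystem instances, so this is cheap
        val in = fs.open(path)
        try Source.fromInputStream(in).getLines().size   // placeholder per-file work: count its lines
        finally in.close()
      }
    }

    processed.take(5).foreach(println)

On a typical cluster the executors pick up the HDFS settings from the classpath; if not, the relevant configuration has to be passed to the tasks explicitly.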