Sure, and please post back if it works (or if it does not :) )

On Wed, Sep 14, 2016 at 2:09 PM, Saliya Ekanayake <esal...@gmail.com> wrote:
> Thank you, I'll try.
>
> saliya
>
> On Wed, Sep 14, 2016 at 12:07 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Depends on the join, but unless you are doing a cross join, it should not
>> blow up. 6M is not too much. I think what you may want to consider is
>> (a) the volume of your data files and (b) reducing shuffling by following
>> similar partitioning on both RDDs.
>>
>> On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> Thank you, but isn't that join going to be too expensive for this?
>>>
>>> On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> My suggestion:
>>>>
>>>> 1. Read the first text file into (say) RDD1 using textFile.
>>>> 2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will be
>>>> of signature (filename, filecontent).
>>>> 3. Join RDD1 and RDD2 based on the file name (or some other key).
>>>>
>>>> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com>
>>>> wrote:
>>>>
>>>>> 1. What needs to be parallelized is the work for each of those 6M
>>>>> rows, not the 80K files. Let me elaborate on this with a simple for loop,
>>>>> as if we were to write this serially.
>>>>>
>>>>> For each line L out of 6M in the first file{
>>>>> process the file corresponding to L out of those 80K files.
>>>>> }
>>>>>
>>>>> The 80K files are in HDFS, and reading all that content into each
>>>>> worker is not possible due to size.
>>>>>
>>>>> 2. No. Multiple rows may point to the same file, but they operate on
>>>>> different records within the file.
>>>>>
>>>>> 3. The end goal is to write back 6M rows of processed information.
>>>>>
>>>>> This is a simple map-only type of scenario. One workaround I can think
>>>>> of is to append all the 6M records to each of the data files.
>>>>>
>>>>> Thank you
>>>>>
>>>>> On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Question:
>>>>>>
>>>>>> 1. Why can you not read all 80K files together? I.e., why do you have a
>>>>>> dependency on the first text file?
>>>>>> 2. Your first text file has 6M rows, but the total number of files is
>>>>>> ~80K. Is there a scenario where there may not be a file in HDFS
>>>>>> corresponding to a row in the first text file?
>>>>>> 3. Maybe a follow-up to 1: what is your end goal?
>>>>>>
>>>>>> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The first text file is not that large; it has 6 million records
>>>>>>> (lines). For each line I need to read a file out of 80000 files. They
>>>>>>> total around 1.5TB. I didn't understand what you meant by "then again
>>>>>>> read text files for each line and union all rdds."
>>>>>>>
>>>>>>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>>>>>>> raghavendra.pan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How large is your first text file? The idea is that you read the first
>>>>>>>> text file and, if it is not large, collect all the lines on the driver
>>>>>>>> and then again read text files for each line and union all rdds.
>>>>>>>>
>>>>>>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Just wondering if this is possible with Spark?
>>>>>>>>>
>>>>>>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <
>>>>>>>>> esal...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've got a text file where each line is a record. For each
>>>>>>>>>> record, I need to process a file in HDFS.
>>>>>>>>>>
>>>>>>>>>> So if I represent these records as an RDD and invoke a map()
>>>>>>>>>> operation on them, how can I access HDFS within that map()? Do I
>>>>>>>>>> have to create a Spark context within map(), or is there a better
>>>>>>>>>> solution to that?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Saliya
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Saliya Ekanayake
>>>>>>>>>> Ph.D.
>>>>>>>>>> Candidate | Research Assistant
>>>>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>>>>> Indiana University, Bloomington

--
Best Regards,
Ayan Guha
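
For anyone finding this thread later, here is a rough PySpark-style sketch of the three-step suggestion quoted above (textFile, wholeTextFiles, join on file name). It is only a sketch under assumptions that the thread does not confirm: the index line layout (`<file_name>\t<record_id>`), the data file layout (`<record_id>,<payload>` per line), the helper names, and the `num_partitions` default are all hypothetical. Because several rows may point to the same file, the rows are grouped per file name before the join so that each file's content is paired with its rows once.

```python
import posixpath


def parse_index_line(line):
    """Split one index line into (file_name, record_id).

    Assumption: each line looks like "<file_name>\t<record_id>"; the real
    layout of the 6M-line file is not given in the thread.
    """
    file_name, record_id = line.rstrip("\n").split("\t", 1)
    return file_name, record_id


def key_by_base_name(path_and_content):
    """Re-key a (path, content) pair from its full URI to the bare file name."""
    path, content = path_and_content
    return posixpath.basename(path), content


def process_rows(joined):
    """Placeholder per-row work: look up each requested record in its file.

    Assumption: data files hold one "<record_id>,<payload>" entry per line.
    """
    file_name, (record_ids, content) = joined
    records = dict(
        entry.split(",", 1) for entry in content.splitlines() if "," in entry
    )
    for record_id in record_ids:
        yield file_name, record_id, records.get(record_id)


def build_pipeline(sc, index_path, data_dir, num_partitions=200):
    """Wire the steps together on an existing SparkContext `sc`.

    `num_partitions` is a hypothetical starting value, not a recommendation.
    """
    # Step 1: read the index file and group its rows by file name.
    rows_by_file = (
        sc.textFile(index_path)
        .map(parse_index_line)
        .groupByKey(num_partitions)
    )
    # Step 2: (path, content) pairs for the data files, keyed by file name.
    # The same partition count is used on both sides, following the advice
    # about similar partitioning; how much shuffle this saves is untested.
    files = (
        sc.wholeTextFiles(data_dir)
        .map(key_by_base_name)
        .partitionBy(num_partitions)
    )
    # Step 3: join on file name, then emit one result per index row.
    return rows_by_file.join(files, num_partitions).flatMap(process_rows)
```

The helpers are plain functions, so they can be checked locally without a cluster; `build_pipeline` only runs when handed a real SparkContext, and its result could then be written back with something like `saveAsTextFile`. Note that `wholeTextFiles` loads each file fully into memory as one record, so this only works if individual files fit comfortably on an executor.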