Thank you, I'll try.

Saliya
On Wed, Sep 14, 2016 at 12:07 AM, ayan guha <guha.a...@gmail.com> wrote:

> Depends on the join, but unless you are doing a cross join it should not
> blow up. 6M is not too much. I think what you may want to consider is
> (a) the volume of your data files and (b) reducing shuffling by using
> similar partitioning on both RDDs.

On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Thank you, but isn't that join going to be too expensive for this?

On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:

> My suggestion:
>
> 1. Read the first text file into (say) RDD1 using textFile.
> 2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have
>    the signature (filename, filecontent).
> 3. Join RDD1 and RDD2 on file name (or some other key).

On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> 1. What needs to be parallelized is the work for each of those 6M rows,
>    not the 80K files. Let me elaborate with a simple for loop, as if we
>    were to write this serially:
>
>    for each line L of the 6M in the first file {
>        process the file (out of the 80K) corresponding to L
>    }
>
>    The 80K files are in HDFS, and reading all that content into each
>    worker is not possible due to size.
> 2. No. Multiple rows may point to the same file, but they operate on
>    different records within that file.
> 3. The end goal is to write back the 6M processed records.
>
> This is a simple map-only scenario. One workaround I can think of is to
> append all 6M records to each of the data files.
>
> Thank you

On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:

> Questions:
>
> 1. Why can you not read all 80K files together? That is, why do you have
>    a dependency on the first text file?
> 2. Your first text file has 6M rows, but the total number of files is
>    ~80K. Is there a scenario where there is no file in HDFS corresponding
>    to a row in the first text file?
> 3. Maybe a follow-up of 1: what is your end goal?

On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> The first text file is not that large; it has 6 million records (lines).
> For each line I need to read one of the 80,000 files, which total around
> 1.5 TB. I didn't understand what you meant by "then again read text files
> for each line and union all rdds."

On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote:

> How large is your first text file? The idea is that you read the first
> text file, and if it is not large you can collect all the lines on the
> driver, then read a text file for each line and union all the RDDs.

On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com> wrote:

> Just wondering if this is possible with Spark?

On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Hi,
>
> I've got a text file where each line is a record. For each record, I need
> to process a file in HDFS.
>
> So if I represent these records as an RDD and invoke a map() operation on
> them, how can I access HDFS within that map()? Do I have to create a
> Spark context within map(), or is there a better solution?
>
> Thank you,
> Saliya

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
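[Editor's note] Ayan's suggested plan (textFile for the index, wholeTextFiles for the data files, then a join on file name) can be sketched as follows. This is a Spark-free Python illustration: plain lists and dicts stand in for RDDs, and the file names and record layout are made up purely to show the key-join logic.

```python
# "RDD1": the 6M-line index file, parsed into (filename, record) pairs.
# Assume (hypothetically) each line is "<filename>,<payload>".
index_lines = [
    "part-0001,recordA",
    "part-0001,recordB",
    "part-0002,recordC",
]
rdd1 = [tuple(line.split(",", 1)) for line in index_lines]

# "RDD2": what wholeTextFiles would produce -- (filename, filecontent) pairs.
rdd2 = {
    "part-0001": "contents of file 1",
    "part-0002": "contents of file 2",
}

# The join: pair each indexed record with its file's content.
# In Spark this would be rdd1.join(rdd2) after keying both by filename;
# note that rows sharing a file simply join to the same content.
joined = [
    (fname, (payload, rdd2[fname]))
    for fname, payload in rdd1
    if fname in rdd2
]

for fname, (payload, content) in joined:
    print(fname, payload, content)
```

As Ayan notes in the thread, the join cost depends mostly on shuffling; partitioning both RDDs by file name ahead of the join would keep matching keys co-located.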
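[Editor's note] On the original question of accessing HDFS inside map(): one does not create a SparkContext inside a task. A common alternative to the join is to use mapPartitions and open the referenced files with an HDFS client from within the task, grouping a partition's records by file so each distinct file is opened once. The sketch below illustrates only that grouping idea in plain Python; fake_hdfs_open is a hypothetical stand-in for a real HDFS read, and the record format is made up.

```python
from collections import defaultdict

# Track opens so the "one open per distinct file" property is visible.
opened = []

def fake_hdfs_open(path):
    # Hypothetical placeholder for an HDFS client read
    # (e.g. Hadoop's FileSystem.open in Scala/Java).
    opened.append(path)
    return f"<contents of {path}>"

def process_partition(records):
    """records: iterable of (filename, payload) pairs from the index file.
    Yields one processed result per input record."""
    by_file = defaultdict(list)
    for fname, payload in records:
        by_file[fname].append(payload)
    for fname, payloads in by_file.items():
        content = fake_hdfs_open(fname)  # one open per distinct file
        for payload in payloads:
            # "Processing" here is just recording the content length.
            yield (fname, payload, len(content))

# Two records referencing the same file: the file is opened only once.
partition = [("part-0001", "recordA"), ("part-0001", "recordB")]
results = list(process_partition(partition))
```

In actual Spark code this function would be passed to rdd.mapPartitions, with fake_hdfs_open replaced by a real HDFS client call inside the task; whether this beats the join depends on how many distinct files each partition touches.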