Questions:

1. Why can you not read all 80K files together? That is, why do you have a dependency on the first text file?
2. Your first text file has 6M rows, but the total number of files is ~80K. Is there a scenario where there is no file in HDFS corresponding to a row in the first text file?
3. Perhaps a follow-up to 1: what is your end goal?
On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> The first text file is not that large; it has 6 million records (lines).
> For each line I need to read a file out of 80,000 files. They total around
> 1.5 TB. I didn't understand what you meant by "then again read text files
> for each line and union all rdds."
>
> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> How large is your first text file? The idea is that you read the first
>> text file, and if it is not large you can collect all the lines on the
>> driver, then read a text file for each line and union all the RDDs.
>>
>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com> wrote:
>>
>>> Just wondering if this is possible with Spark?
>>>
>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've got a text file where each line is a record. For each record, I
>>>> need to process a file in HDFS.
>>>>
>>>> So if I represent these records as an RDD and invoke a map() operation
>>>> on them, how can I access HDFS within that map()? Do I have to create a
>>>> Spark context within map(), or is there a better solution?
>>>>
>>>> Thank you,
>>>> Saliya
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington

--
Best Regards,
Ayan Guha
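[For reference, a sketch of the pattern usually suggested for Saliya's question: do not create a SparkContext inside map(); instead, open a filesystem client once per partition inside mapPartitions() and read each record's file with ordinary client I/O. The sketch below is a minimal, hedged illustration — it uses Python's local filesystem as a stand-in for HDFS so it runs anywhere; in a real job, process_partition would be passed to rdd.mapPartitions(), and the open() calls would be replaced by an HDFS client (for example, pyarrow.fs.HadoopFileSystem) created once per partition. All file names and helper names here are hypothetical.]

```python
import os
import tempfile

def process_partition(lines, base_dir):
    """For each record (line), read the file it names and yield its size.

    In Spark this function would be the argument to rdd.mapPartitions(),
    and an HDFS client would be opened here, once per partition --
    never a new SparkContext inside map().
    """
    for line in lines:
        name = line.strip()
        path = os.path.join(base_dir, name)
        if not os.path.exists(path):
            # Question 2 above: a row whose file is missing from HDFS.
            yield (name, None)
            continue
        with open(path, "rb") as f:
            yield (name, len(f.read()))

# Simulate the setup from the thread: a "first text file" of records,
# each naming one of the data files.
base = tempfile.mkdtemp()
for name, payload in [("a.bin", b"xxx"), ("b.bin", b"yyyyy")]:
    with open(os.path.join(base, name), "wb") as f:
        f.write(payload)

records = ["a.bin", "b.bin", "missing.bin"]  # one row has no matching file
results = list(process_partition(records, base))
print(results)  # [('a.bin', 3), ('b.bin', 5), ('missing.bin', None)]
```

Opening the client per partition rather than per record matters at this scale (6M records over 80K files): it amortizes connection cost and keeps all file reads on the executors, so nothing but the driver ever needs a SparkContext.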