My suggestion:

1. Read the first text file into (say) RDD1 using textFile.
2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have the
signature (filename, filecontent).
3. Join RDD1 and RDD2 on the file name (or some other common key).
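
Roughly something like this (an untested sketch; it assumes each line of the
first file carries the name of the data file it refers to as its first
comma-separated field, and that file names are unique):

    // RDD1: the 6M-record file, keyed by the data file name each record refers to
    val rdd1 = sc.textFile("hdfs:///path/to/first_file.txt")
      .map(line => (line.split(",")(0), line))

    // RDD2: the 80K data files as (fileName, fileContent) pairs
    val rdd2 = sc.wholeTextFiles("hdfs:///path/to/data_files/*")
      .map { case (path, content) =>
        (path.substring(path.lastIndexOf('/') + 1), content)
      }

    // Each record now sits next to the content of its file
    val joined = rdd1.join(rdd2)   // (fileName, (record, fileContent))

Note that wholeTextFiles materializes each file's content as a single string,
so this works best when the individual files are not too big.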

On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> 1. What needs to be parallelized is the work for each of those 6M rows,
> not the 80K files. Let me elaborate with a simple for loop, as if we were
> writing this serially.
>
> for each line L (out of 6M) in the first file {
>     process the file corresponding to L out of those 80K files
> }
>
> The 80K files are in HDFS, and reading all of that content into every
> worker is not possible due to the size.
>
> 2. No. Multiple rows may point to the same file, but they operate on
> different records within that file.
>
> 3. The end goal is to write back the 6M processed records.
>
> This is a simple map-only type of scenario. One workaround I can think of
> is to append all 6M records to each of the data files.
>
> Thank you
>
> On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Question:
>>
>> 1. Why can you not read all 80K files together? I.e., why do you have a
>> dependency on the first text file?
>> 2. Your first text file has 6M rows, but the total number of files is
>> ~80K. Is there a scenario where there may not be a file in HDFS
>> corresponding to a row in the first text file?
>> 3. Maybe a follow-up to 1: what is your end goal?
>>
>> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> The first text file is not that large; it has 6 million records (lines).
>>> For each line I need to read one file out of the 80,000 files. Together
>>> they total around 1.5 TB. I didn't understand what you meant by "read a
>>> text file for each line and union all the RDDs."
>>>
>>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>>> raghavendra.pan...@gmail.com> wrote:
>>>
>>>> How large is your first text file? The idea is that you read the first
>>>> text file and, if it is not large, collect all the lines on the driver,
>>>> then read a text file for each line and union all the RDDs.
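>>>>
>>>> Roughly along these lines (an untested sketch; pathFor(line) is a
>>>> placeholder for however a line maps to its HDFS file path):
>>>>
>>>>     // Only sensible if the first file is small enough to collect on the driver
>>>>     val lines = sc.textFile("hdfs:///path/to/first_file.txt").collect()
>>>>
>>>>     // One RDD per line, then a single union of them all
>>>>     val perLineRdds = lines.map(line => sc.textFile(pathFor(line)))
>>>>     val all = sc.union(perLineRdds.toSeq)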
>>>>
>>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com>
>>>> wrote:
>>>>
>>>>> Just wondering whether this is possible with Spark?
>>>>>
>>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've got a text file where each line is a record. For each record, I
>>>>>> need to process a file in HDFS.
>>>>>>
>>>>>> So if I represent these records as an RDD and invoke a map() operation
>>>>>> on them, how can I access HDFS within that map()? Do I have to create a
>>>>>> SparkContext within map(), or is there a better solution?
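>>>>>>
>>>>>> For example, I was imagining something along the lines of the rough,
>>>>>> untested sketch below (pathFor and process stand in for my actual
>>>>>> logic), but I am not sure whether opening HDFS files inside
>>>>>> map()/mapPartitions() like this is the right approach:
>>>>>>
>>>>>>     import org.apache.hadoop.conf.Configuration
>>>>>>     import org.apache.hadoop.fs.{FileSystem, Path}
>>>>>>     import scala.io.Source
>>>>>>
>>>>>>     val records = sc.textFile("hdfs:///path/to/first_file.txt")
>>>>>>     val processed = records.mapPartitions { iter =>
>>>>>>       // One HDFS client per partition; no SparkContext needed on the workers
>>>>>>       val fs = FileSystem.get(new Configuration())
>>>>>>       iter.map { record =>
>>>>>>         val in = fs.open(new Path(pathFor(record)))
>>>>>>         try {
>>>>>>           val content = Source.fromInputStream(in).mkString
>>>>>>           process(record, content)
>>>>>>         } finally {
>>>>>>           in.close()
>>>>>>         }
>>>>>>       }
>>>>>>     }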
>>>>>>
>>>>>> Thank you,
>>>>>> Saliya
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Saliya Ekanayake
>>>>>> Ph.D. Candidate | Research Assistant
>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>> Indiana University, Bloomington
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>


-- 
Best Regards,
Ayan Guha
