Thank you, but isn't that join going to be too expensive for this?

On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:

> My suggestion:
>
> 1. Read the first text file into (say) RDD1 using textFile.
> 2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have
> the signature (filename, filecontent).
> 3. Join RDD1 and RDD2 on the file name (or some other key); a rough
> sketch follows below.
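>
> Something along these lines (untested sketch; fileNameFromLine and process
> are placeholders for your own parsing and per-record logic, and the paths
> are made up):
>
> // assuming sc is the SparkContext
> val rdd1 = sc.textFile("hdfs:///path/to/first_file.txt")
>   .map(line => (fileNameFromLine(line), line))  // key each record by its file name
>
> // (fileName, fileContent) pairs for the 80K data files
> val rdd2 = sc.wholeTextFiles("hdfs:///path/to/data_files")
>   .map { case (path, content) => (path.split('/').last, content) }
>
> // join on file name, then process each (record, fileContent) pair
> val processed = rdd1.join(rdd2).map { case (name, (record, content)) =>
>   process(record, content)
> }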
>
> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> 1. What needs to be parallelized is the work for each of those 6M rows,
>> not the 80K files. Let me elaborate with a simple for loop, as if we were
>> writing this serially:
>>
>> for each line L (out of 6M) in the first file {
>>     process the file corresponding to L (out of those 80K files)
>> }
>>
>> The 80K files are in HDFS, and reading all of that content into each
>> worker is not possible due to the size; the sketch at the end of this mail
>> shows what I would like to do instead.
>>
>> 2. No. Multiple rows may point to the same file, but they operate on
>> different records within the file.
>>
>> 3. The end goal is to write back the 6M processed records.
>>
>> This is a simple, map-only scenario. One workaround I can think of is to
>> append all 6M records to each of the data files.
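>>
>> One thing I was wondering is whether I could simply open the needed file
>> from within the task itself, roughly like this (untested sketch; the paths
>> are made up and parseFileName/processRecord stand in for my own logic):
>>
>> import org.apache.hadoop.conf.Configuration
>> import org.apache.hadoop.fs.{FileSystem, Path}
>>
>> // assuming sc is the SparkContext
>> val lines = sc.textFile("hdfs:///path/to/first_file.txt")  // the 6M records
>>
>> val results = lines.mapPartitions { iter =>
>>   // one FileSystem handle per partition, created on the worker side
>>   val fs = FileSystem.get(new Configuration())
>>   iter.map { line =>
>>     val path = new Path("hdfs:///path/to/data_files/" + parseFileName(line))
>>     val in = fs.open(path)
>>     try processRecord(line, in)  // read only what this row needs
>>     finally in.close()
>>   }
>> }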
>>
>> Thank you
>>
>> On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Question:
>>>
>>> 1. Why can you not read all 80K files together? That is, why do you have
>>> a dependency on the first text file?
>>> 2. Your first text file has 6M rows, but the total number of files is
>>> ~80K. Is there a scenario where there may not be a file in HDFS
>>> corresponding to a row in the first text file?
>>> 3. Maybe as a follow-up to 1: what is your end goal?
>>>
>>> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>>
>>>> The first text file is not that large; it has 6 million records
>>>> (lines). For each line I need to read one file out of the 80,000 files,
>>>> which together total around 1.5 TB. I didn't understand what you meant by
>>>> "then again read text files for each line and union all rdds."
>>>>
>>>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>>>> raghavendra.pan...@gmail.com> wrote:
>>>>
>>>>> How large is your first text file? The idea is that you read the first
>>>>> text file, and if it is not too large you collect all the lines on the
>>>>> driver, then read a text file for each line and union all the RDDs.
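>>>>>
>>>>> Roughly something like this (untested sketch; assumes sc is the
>>>>> SparkContext and pathForLine is a placeholder for mapping a line to its
>>>>> HDFS path):
>>>>>
>>>>> // collect the (small) first file on the driver
>>>>> val lines = sc.textFile("hdfs:///path/to/first_file.txt").collect()
>>>>>
>>>>> // one RDD per referenced file, then union them all
>>>>> val perFileRdds = lines.map(line => sc.textFile(pathForLine(line)))
>>>>> val all = sc.union(perFileRdds)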
>>>>>
>>>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Just wondering if this is possible with Spark?
>>>>>>
>>>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've got a text file where each line is a record. For each record, I
>>>>>>> need to process a file in HDFS.
>>>>>>>
>>>>>>> So if I represent these records as an RDD and invoke a map()
>>>>>>> operation on them, how can I access HDFS within that map()? Do I have
>>>>>>> to create a Spark context within map(), or is there a better solution?
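>>>>>>>
>>>>>>> Roughly, the shape of what I have so far (simplified; recordToPath is
>>>>>>> just a placeholder for mapping a record to its HDFS file):
>>>>>>>
>>>>>>> // assuming sc is the SparkContext
>>>>>>> val records = sc.textFile("hdfs:///path/to/records.txt")
>>>>>>> val processed = records.map { record =>
>>>>>>>   val path = recordToPath(record)  // the HDFS file this record refers to
>>>>>>>   // this is where I need to open and read `path`, inside the map()
>>>>>>>   record  // placeholder result for now
>>>>>>> }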
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Saliya
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Saliya Ekanayake
>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>> Indiana University, Bloomington
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Saliya Ekanayake
>>>>>> Ph.D. Candidate | Research Assistant
>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>> Indiana University, Bloomington
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
