Re: Access HDFS within Spark Map Operation

ayan guha Tue, 13 Sep 2016 21:21:01 -0700

Sure, and please post back if it works (or it does not :) )

On Wed, Sep 14, 2016 at 2:09 PM, Saliya Ekanayake <esal...@gmail.com> wrote:


> Thank you, I'll try.
>
> saliya
>
> On Wed, Sep 14, 2016 at 12:07 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> It depends on the join, but unless you are doing a cross join, it should
>> not blow up. 6M rows is not too many. What you may want to consider is
>> (a) the volume of your data files and (b) reducing shuffling by using the
>> same partitioning on both RDDs.
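
A minimal plain-Python sketch of why matching partitioning reduces shuffling: if both datasets are bucketed with the same function and the same number of partitions, every key lands in the same bucket index on both sides, so each bucket pair can be joined locally. The helper names, the CRC32 partition function, and the sample data are assumptions for illustration only; in PySpark the corresponding step would be calling `partitionBy` with the same number of partitions on both pair RDDs before the join.

```python
import zlib

def partition_of(key, num_partitions):
    """Deterministic partition index for a string key (illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_pairs(pairs, num_partitions):
    """Split (key, value) pairs into num_partitions buckets by key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_of(key, num_partitions)].append((key, value))
    return buckets

def local_join(left_bucket, right_bucket):
    """Join two buckets that share a bucket index; no data crosses buckets."""
    right = dict(right_bucket)
    return [(k, (v, right[k])) for k, v in left_bucket if k in right]

n = 4
rows = [("a.txt", 0), ("b.txt", 1), ("a.txt", 2)]      # hypothetical index rows
files = [("a.txt", "A"), ("b.txt", "B"), ("c.txt", "C")]  # hypothetical files
left = partition_pairs(rows, n)
right = partition_pairs(files, n)
joined = [pair for i in range(n) for pair in local_join(left[i], right[i])]
```

Because the same key always maps to the same bucket index, the join never has to look outside the matching bucket pair.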
>>
>> On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> Thank you, but isn't that join going to be too expensive for this?
>>>
>>> On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> My suggestion:
>>>>
>>>> 1. Read the first text file into (say) RDD1 using textFile.
>>>> 2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will
>>>> contain (filename, filecontent) pairs.
>>>> 3. Join RDD1 and RDD2 on the file name (or some other key).
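
A minimal plain-Python sketch of the keyed join in steps 1-3, written without a Spark dependency so it runs anywhere. The `file_name,record_id` line layout, the sample data, and the helper name `join_by_file` are assumptions for illustration; in PySpark the two inputs would come from `sc.textFile` and `sc.wholeTextFiles`, and the join from `join` on pair RDDs.

```python
from collections import defaultdict

def join_by_file(index_lines, files):
    """Inner-join index lines with file contents on the file name.

    index_lines: iterable of "file_name,record_id" strings (assumed layout).
    files: iterable of (file_name, file_content) pairs.
    Returns a list of (file_name, (record_id, file_content)) tuples.
    """
    # Key every index line by its file name, as a pair RDD would be keyed.
    by_name = defaultdict(list)
    for line in index_lines:
        file_name, record_id = line.split(",", 1)
        by_name[file_name].append(record_id)

    joined = []
    for file_name, content in files:
        # One output row per matching index line (inner-join semantics).
        for record_id in by_name.get(file_name, []):
            joined.append((file_name, (record_id, content)))
    return joined

index_lines = ["a.txt,0", "a.txt,2", "b.txt,1"]
files = [("a.txt", "r0\nr1\nr2"), ("b.txt", "s0\ns1"), ("c.txt", "unused")]
result = join_by_file(index_lines, files)
```

Note that wholeTextFiles keys each entry by the full file path, so the key taken from the first text file would need to be in the same form before joining.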
>>>>
>>>> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com>
>>>> wrote:
>>>>
>>>>> 1. What needs to be parallelized is the work for each of those 6M
>>>>> rows, not the 80K files. Let me illustrate with a simple for loop, as
>>>>> if we were to write this serially.
>>>>>
>>>>> For each line L out of 6M in the first file{
>>>>>      process the file corresponding to L out of those 80K files.
>>>>> }
>>>>>
>>>>> The 80K files are in HDFS, and reading all of that content into each
>>>>> worker is not possible because of its size.
>>>>>
>>>>> 2. No. Multiple rows may point to the same file, but they operate on
>>>>> different records within the file.
>>>>>
>>>>> 3. The end goal is to write back the processed information for all 6M
>>>>> rows.
>>>>>
>>>>> This is a simple map-only scenario. One workaround I can think of
>>>>> is to append all the 6M records to each of the data files.
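
The serial loop above could be sketched as a per-partition function that reads each referenced file at most once, which matters because several rows may point to the same file. This is only a sketch under stated assumptions: the `file_name,record_index` line layout and the `read_file` callable are hypothetical, and an in-memory dictionary stands in for HDFS so the example runs without a cluster.

```python
def process_partition(lines, read_file):
    """Process index lines, reading each referenced file at most once.

    lines: iterable of "file_name,record_index" strings (assumed layout).
    read_file: callable mapping a file name to a list of records; a
               hypothetical stand-in for whatever HDFS client is used.
    Yields (file_name, record_index, record) for every index line.
    """
    cache = {}  # file name -> records, local to this partition
    for line in lines:
        file_name, idx = line.split(",", 1)
        if file_name not in cache:
            cache[file_name] = read_file(file_name)
        yield file_name, int(idx), cache[file_name][int(idx)]

# In-memory stand-in for HDFS (illustrative data only).
fake_hdfs = {"a.txt": ["r0", "r1", "r2"], "b.txt": ["s0", "s1"]}
reads = []  # records which files were actually opened

def read_file(name):
    reads.append(name)
    return fake_hdfs[name]

out = list(process_partition(["a.txt,0", "b.txt,1", "a.txt,2"], read_file))
```

In PySpark a function of this shape could be passed to `mapPartitions`, which does not require a Spark context inside the task. The cache here is unbounded, so with large files it would need to be limited, for example by grouping the rows by file name first.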
>>>>>
>>>>> Thank you
>>>>>
>>>>> On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Questions:
>>>>>>
>>>>>> 1. Why can't you read all 80K files together? That is, why do you have
>>>>>> a dependency on the first text file?
>>>>>> 2. Your first text file has 6M rows, but the total number of files is
>>>>>> ~80K. Is there a scenario where there may not be a file in HDFS
>>>>>> corresponding to a row in the first text file?
>>>>>> 3. Maybe a follow-up to 1: what is your end goal?
>>>>>>
>>>>>> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake
>>>>>> <esal...@gmail.com> wrote:
>>>>>>
>>>>>>> The first text file is not that large; it has 6 million records
>>>>>>> (lines). For each line I need to read a file out of 80000 files. They
>>>>>>> total around 1.5TB. I didn't understand what you meant by "then again
>>>>>>> read text files for each line and union all rdds."
>>>>>>>
>>>>>>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>>>>>>> raghavendra.pan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How large is your first text file? The idea is you read first text
>>>>>>>> file and if it is not large you can collect all the lines on driver and
>>>>>>>> then again read text files for each line and union all rdds.
>>>>>>>>
>>>>>>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Just wondering if this is possible with Spark?
>>>>>>>>>
>>>>>>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <
>>>>>>>>> esal...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've got a text file where each line is a record. For each
>>>>>>>>>> record, I need to process a file in HDFS.
>>>>>>>>>>
>>>>>>>>>> So if I represent these records as an RDD and invoke a map()
>>>>>>>>>> operation on them, how can I access HDFS within that map()? Do I
>>>>>>>>>> have to create a Spark context within map(), or is there a better
>>>>>>>>>> solution?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Saliya
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Saliya Ekanayake
>>>>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>>>>> Indiana University, Bloomington
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>


-- 
Best Regards,
Ayan Guha
