Thank you, I'll try.

Saliya
On Wed, Sep 14, 2016 at 12:07 AM, ayan guha <guha.a...@gmail.com> wrote:

> Depends on the join, but unless you are doing a cross join it should not
> blow up. 6M is not too much. I think what you may want to consider is
> (a) the volume of your data files and (b) reducing shuffling by using
> similar partitioning on both RDDs.

On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Thank you, but isn't that join going to be too expensive for this?

On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:

> My suggestion:
>
> 1. Read the first text file into (say) RDD1 using textFile.
> 2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have
>    the signature (filename, filecontent).
> 3. Join RDD1 and RDD2 on file name (or some other key).

On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> 1. What needs to be parallelized is the work for each of those 6M rows,
>    not the 80K files. Let me elaborate with a simple for loop, as if we
>    were to write this serially:
>
>    for each line L of the 6M in the first file {
>        process the file (out of the 80K) corresponding to L
>    }
>
>    The 80K files are in HDFS, and reading all that content into each
>    worker is not possible due to size.
> 2. No. Multiple rows may point to the same file, but they operate on
>    different records within that file.
> 3. The end goal is to write back the 6M processed records.
>
> This is a simple map-only scenario. One workaround I can think of is to
> append all 6M records to each of the data files.
>
> Thank you

On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:

> Questions:
>
> 1. Why can you not read all 80K files together? That is, why do you have
>    a dependency on the first text file?
> 2. Your first text file has 6M rows, but the total number of files is
>    ~80K. Is there a scenario where there is no file in HDFS corresponding
>    to a row in the first text file?
> 3. Maybe a follow-up of 1: what is your end goal?

On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> The first text file is not that large; it has 6 million records (lines).
> For each line I need to read one of the 80,000 files, which total around
> 1.5 TB. I didn't understand what you meant by "then again read text files
> for each line and union all rdds."

On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote:

> How large is your first text file? The idea is that you read the first
> text file, and if it is not large you can collect all the lines on the
> driver, then read a text file for each line and union all the RDDs.

On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com> wrote:

> Just wondering if this is possible with Spark?

On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Hi,
>
> I've got a text file where each line is a record. For each record, I need
> to process a file in HDFS.
>
> So if I represent these records as an RDD and invoke a map() operation on
> them, how can I access HDFS within that map()? Do I have to create a
> Spark context within map(), or is there a better solution?
>
> Thank you,
> Saliya

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
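[Editor's note] Ayan's suggested plan (textFile for the index, wholeTextFiles for the data files, then a join on file name) can be sketched as follows. This is a Spark-free Python illustration: plain lists and dicts stand in for RDDs, and the file names and record layout are made up purely to show the key-join logic.

```python
# "RDD1": the 6M-line index file, parsed into (filename, record) pairs.
# Assume (hypothetically) each line is "<filename>,<payload>".
index_lines = [
    "part-0001,recordA",
    "part-0001,recordB",
    "part-0002,recordC",
]
rdd1 = [tuple(line.split(",", 1)) for line in index_lines]

# "RDD2": what wholeTextFiles would produce -- (filename, filecontent) pairs.
rdd2 = {
    "part-0001": "contents of file 1",
    "part-0002": "contents of file 2",
}

# The join: pair each indexed record with its file's content.
# In Spark this would be rdd1.join(rdd2) after keying both by filename;
# note that rows sharing a file simply join to the same content.
joined = [
    (fname, (payload, rdd2[fname]))
    for fname, payload in rdd1
    if fname in rdd2
]

for fname, (payload, content) in joined:
    print(fname, payload, content)
```

As Ayan notes in the thread, the join cost depends mostly on shuffling; partitioning both RDDs by file name ahead of the join would keep matching keys co-located.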
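[Editor's note] On the original question of accessing HDFS inside map(): one does not create a SparkContext inside a task. A common alternative to the join is to use mapPartitions and open the referenced files with an HDFS client from within the task, grouping a partition's records by file so each distinct file is opened once. The sketch below illustrates only that grouping idea in plain Python; fake_hdfs_open is a hypothetical stand-in for a real HDFS read, and the record format is made up.

```python
from collections import defaultdict

# Track opens so the "one open per distinct file" property is visible.
opened = []

def fake_hdfs_open(path):
    # Hypothetical placeholder for an HDFS client read
    # (e.g. Hadoop's FileSystem.open in Scala/Java).
    opened.append(path)
    return f"<contents of {path}>"

def process_partition(records):
    """records: iterable of (filename, payload) pairs from the index file.
    Yields one processed result per input record."""
    by_file = defaultdict(list)
    for fname, payload in records:
        by_file[fname].append(payload)
    for fname, payloads in by_file.items():
        content = fake_hdfs_open(fname)  # one open per distinct file
        for payload in payloads:
            # "Processing" here is just recording the content length.
            yield (fname, payload, len(content))

# Two records referencing the same file: the file is opened only once.
partition = [("part-0001", "recordA"), ("part-0001", "recordB")]
results = list(process_partition(partition))
```

In actual Spark code this function would be passed to rdd.mapPartitions, with fake_hdfs_open replaced by a real HDFS client call inside the task; whether this beats the join depends on how many distinct files each partition touches.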