It depends on the join, but unless you are doing a cross join, it should not
blow up. 6M rows is not too much. I think what you may want to consider is
(a) the volume of your data files and (b) reducing shuffling by using the
same partitioning on both RDDs.
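
A rough sketch of (b), assuming both RDDs are already keyed by file name
and sc is your SparkContext (the names and partition count are placeholders):

import org.apache.spark.HashPartitioner

// Use the same partitioner on both sides so the join can be done
// without reshuffling either RDD (200 partitions is just a guess).
val part = new HashPartitioner(200)
val rdd1ByFile = rdd1.partitionBy(part) // (fileName, record) pairs
val rdd2ByFile = rdd2.partitionBy(part) // (fileName, content) pairs
val joined = rdd1ByFile.join(rdd2ByFile)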

On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> Thank you, but isn't that join going to be too expensive for this?
>
> On Tue, Sep 13, 2016 at 11:55 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> My suggestion:
>>
>> 1. Read the first text file into (say) RDD1 using textFile.
>> 2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have
>> the signature (filename, filecontent).
>> 3. Join RDD1 and RDD2 on file name (or some other key).
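>>
>> A rough sketch of 1-3, assuming sc is your SparkContext and that each
>> line of the first file starts with the file name followed by a comma
>> (adjust the key extraction to your actual format):
>>
>> // 1. Each line of the first file, keyed by the file it refers to
>> val rdd1 = sc.textFile("hdfs:///path/to/first.txt")
>>   .map(line => (line.split(",")(0), line))
>>
>> // 2. (filename, filecontent) pairs for the 80K data files; note the
>> //    keys are full paths, so you may need to strip the directory part
>> val rdd2 = sc.wholeTextFiles("hdfs:///path/to/datafiles")
>>
>> // 3. Join on file name
>> val joined = rdd1.join(rdd2)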
>>
>> On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> 1.) What needs to be parallelized is the work for each of those 6M rows,
>>> not the 80K files. Let me elaborate with a simple for loop, as if we were
>>> writing this serially:
>>>
>>> for each line L of the 6M lines in the first file {
>>>     process the file (out of the 80K) that corresponds to L
>>> }
>>>
>>> The 80K files are in HDFS, and reading all of that content into each
>>> worker is not possible due to the size.
>>>
>>> 2. No. Multiple rows may point to the same file, but they operate on
>>> different records within that file.
>>>
>>> 3. The end goal is to write back the 6M processed records.
>>>
>>> This is a simple map-only scenario. One workaround I can think of is to
>>> append all 6M records to each of the data files.
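>>>
>>> Something like this sketch is what I'm after (fileNameOf and process
>>> are hypothetical helpers; it reads each referenced HDFS file directly
>>> inside the tasks):
>>>
>>> import org.apache.hadoop.conf.Configuration
>>> import org.apache.hadoop.fs.{FileSystem, Path}
>>>
>>> val out = sc.textFile("hdfs:///path/to/first.txt")
>>>   .mapPartitions { lines =>
>>>     // One FileSystem handle per partition, created on the worker
>>>     val fs = FileSystem.get(new Configuration())
>>>     lines.map { line =>
>>>       val in = fs.open(new Path(fileNameOf(line)))
>>>       try process(line, in) finally in.close()
>>>     }
>>>   }
>>> out.saveAsTextFile("hdfs:///path/to/output")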
>>>
>>> Thank you
>>>
>>> On Tue, Sep 13, 2016 at 11:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Questions:
>>>>
>>>> 1. Why can you not read all 80K files together? I.e., why do you have a
>>>> dependency on the first text file?
>>>> 2. Your first text file has 6M rows, but the total number of files is
>>>> ~80K. Is there a scenario where there may not be a file in HDFS
>>>> corresponding to a row in the first text file?
>>>> 3. Maybe a follow-up of 1: what is your end goal?
>>>>
>>>> On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com>
>>>> wrote:
>>>>
>>>>> The first text file is not that large; it has 6 million records
>>>>> (lines). For each line I need to read one file out of the 80,000 files,
>>>>> which total around 1.5 TB. I didn't understand what you meant by "then
>>>>> again read text files for each line and union all rdds."
>>>>>
>>>>> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
>>>>> raghavendra.pan...@gmail.com> wrote:
>>>>>
>>>>>> How large is your first text file? The idea is that you read the
>>>>>> first text file, and if it is not large, you can collect all the lines
>>>>>> on the driver, then read the text file for each line and union all the
>>>>>> RDDs.
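>>>>>>
>>>>>> A sketch of that idea (assumes the first file is small enough to
>>>>>> collect and that each of its lines is a file path; with millions of
>>>>>> lines this creates far too many RDDs, so it only fits small inputs):
>>>>>>
>>>>>> val paths = sc.textFile("hdfs:///path/to/first.txt").collect()
>>>>>> // One RDD per referenced file, unioned into a single RDD
>>>>>> val all = sc.union(paths.map(p => sc.textFile(p)).toSeq)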
>>>>>>
>>>>>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Just wondering if this is possible with Spark?
>>>>>>>
>>>>>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <
>>>>>>> esal...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I've got a text file where each line is a record. For each record,
>>>>>>>> I need to process a file in HDFS.
>>>>>>>>
>>>>>>>> So if I represent these records as an RDD and invoke a map()
>>>>>>>> operation on them, how can I access HDFS within that map()? Do I
>>>>>>>> have to create a SparkContext within map(), or is there a better
>>>>>>>> solution?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Saliya
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Saliya Ekanayake
>>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>>> Indiana University, Bloomington
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Saliya Ekanayake
>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>> Indiana University, Bloomington
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>


-- 
Best Regards,
Ayan Guha
