Questions:

1. Why can you not read all 80K files together? That is, why do you have a dependency on the first text file?
2. Your first text file has 6M rows, but the total number of files is ~80K. Is there a scenario where there is no file in HDFS corresponding to a row in the first text file?
3. Perhaps a follow-up to 1: what is your end goal?
On Wed, Sep 14, 2016 at 12:17 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> The first text file is not that large; it has 6 million records (lines).
> For each line I need to read a file out of 80,000 files. They total around
> 1.5 TB. I didn't understand what you meant by "then again read text files
> for each line and union all rdds."
>
> On Tue, Sep 13, 2016 at 10:04 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> How large is your first text file? The idea is that you read the first
>> text file, and if it is not large you can collect all the lines on the
>> driver, then read a text file for each line and union all the RDDs.
>>
>> On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" <esal...@gmail.com> wrote:
>>
>>> Just wondering if this is possible with Spark?
>>>
>>> On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've got a text file where each line is a record. For each record, I
>>>> need to process a file in HDFS.
>>>>
>>>> So if I represent these records as an RDD and invoke a map() operation
>>>> on them, how can I access HDFS within that map()? Do I have to create a
>>>> Spark context within map(), or is there a better solution?
>>>>
>>>> Thank you,
>>>> Saliya
>>>>
>>>> --
>>>> Saliya Ekanayake
>>>> Ph.D. Candidate | Research Assistant
>>>> School of Informatics and Computing | Digital Science Center
>>>> Indiana University, Bloomington

--
Best Regards,
Ayan Guha
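[For reference, a sketch of the pattern usually suggested for Saliya's question: do not create a SparkContext inside map(); instead, open a filesystem client once per partition inside mapPartitions() and read each record's file with ordinary client I/O. The sketch below is a minimal, hedged illustration — it uses Python's local filesystem as a stand-in for HDFS so it runs anywhere; in a real job, process_partition would be passed to rdd.mapPartitions(), and the open() calls would be replaced by an HDFS client (for example, pyarrow.fs.HadoopFileSystem) created once per partition. All file names and helper names here are hypothetical.]

```python
import os
import tempfile

def process_partition(lines, base_dir):
    """For each record (line), read the file it names and yield its size.

    In Spark this function would be the argument to rdd.mapPartitions(),
    and an HDFS client would be opened here, once per partition --
    never a new SparkContext inside map().
    """
    for line in lines:
        name = line.strip()
        path = os.path.join(base_dir, name)
        if not os.path.exists(path):
            # Question 2 above: a row whose file is missing from HDFS.
            yield (name, None)
            continue
        with open(path, "rb") as f:
            yield (name, len(f.read()))

# Simulate the setup from the thread: a "first text file" of records,
# each naming one of the data files.
base = tempfile.mkdtemp()
for name, payload in [("a.bin", b"xxx"), ("b.bin", b"yyyyy")]:
    with open(os.path.join(base, name), "wb") as f:
        f.write(payload)

records = ["a.bin", "b.bin", "missing.bin"]  # one row has no matching file
results = list(process_partition(records, base))
print(results)  # [('a.bin', 3), ('b.bin', 5), ('missing.bin', None)]
```

Opening the client per partition rather than per record matters at this scale (6M records over 80K files): it amortizes connection cost and keeps all file reads on the executors, so nothing but the driver ever needs a SparkContext.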