Sure, and please post back if it works (or it does not :) )
On Wed, Sep 14, 2016 at 2:09 PM, Saliya Ekanayake wrote:
> Thank you, I'll try.
>
> saliya
Thank you, I'll try.
saliya
On Wed, Sep 14, 2016 at 12:07 AM, ayan guha wrote:
Depends on the join, but unless you are doing a cross join, it should not
blow up. 6M is not too much. I think what you may want to consider is (a)
the volume of your data files, and (b) reducing shuffling by using the same
partitioning on both RDDs.
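A toy sketch of the co-partitioning idea in (b), in plain Python rather than the Spark API (all data and helper names below are invented): if both datasets are partitioned with the same hash function and partition count, records with the same key land in the same partition index on both sides, so the join can proceed partition-by-partition with no cross-partition data movement.

```python
# Plain-Python model of hash co-partitioning (not the Spark API).
# When both sides use the same partitioner, matching keys are co-located,
# which is what lets Spark avoid a shuffle at join time.

def partition(records, num_partitions):
    """Split (key, value) pairs into hash partitions, HashPartitioner-style."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

rdd1 = [("f1.txt", "row-a"), ("f2.txt", "row-b"), ("f3.txt", "row-c")]
rdd2 = [("f1.txt", "<contents 1>"), ("f3.txt", "<contents 3>")]

parts1 = partition(rdd1, 4)
parts2 = partition(rdd2, 4)

# Join partition-by-partition: no record ever has to leave its partition.
joined = []
for p1, p2 in zip(parts1, parts2):
    lookup = dict(p2)
    joined.extend((k, (v, lookup[k])) for k, v in p1 if k in lookup)
```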
On Wed, Sep 14, 2016 at 2:00 PM, Saliya Ekanayake wrote:
Thank you, but isn't that join going to be too expensive for this?
On Tue, Sep 13, 2016 at 11:55 PM, ayan guha wrote:
My suggestion:
1. Read the first text file into (say) RDD1 using textFile.
2. Read the 80K data files into RDD2 using wholeTextFiles. RDD2 will have
the signature (filename, filecontent).
3. Join RDD1 and RDD2 on the file name (or some other key).
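A plain-Python model of what steps 1-3 amount to (not actual PySpark; the paths, record format, and keying below are invented for illustration). One detail worth noting: wholeTextFiles keys RDD2 by the full HDFS path, so both sides usually need to be mapped to a common key, e.g. the base file name, before the join.

```python
import os.path

# Step 1: the lines of the first file (RDD1). Each line is assumed,
# for illustration only, to mention the data file it refers to.
lines = ["record-1 data/f1.txt", "record-2 data/f2.txt"]

# Step 2: (filename, filecontent) pairs as wholeTextFiles would give them,
# except that the filename is a full path.
files = [("hdfs://nn/data/f1.txt", "<contents 1>"),
         ("hdfs://nn/data/f2.txt", "<contents 2>")]

# Key both sides by base file name (in Spark: a map() on each RDD).
keyed_lines = [(os.path.basename(line.split()[1]), line) for line in lines]
keyed_files = [(os.path.basename(path), content) for path, content in files]

# Step 3: the join, modeled with a dict lookup.
lookup = dict(keyed_files)
joined = [(k, (rec, lookup[k])) for k, rec in keyed_lines if k in lookup]
```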
On Wed, Sep 14, 2016 at 1:41 PM, Saliya Ekanayake wrote:
1.) What needs to be parallelized is the work for each of those 6M rows,
not the 80K files. Let me elaborate on this with a simple for loop, as if
we were to write this serially.

for each line L of the 6M lines in the first file {
    process the file (out of the 80K) corresponding to L
}
The 80K files
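The serial loop above can be sketched as runnable Python; the 80K files are stood in by an in-memory dict, and process() is a hypothetical placeholder for the real per-file work:

```python
# Runnable sketch of the serial version (all names invented).
files = {"f1.txt": "alpha", "f2.txt": "beta"}   # stand-in for the 80K files

def process(content):
    """Hypothetical per-file work; here it just measures the content."""
    return len(content)

# The first file's lines (6M of them in reality), each naming one file.
first_file_lines = ["f1.txt", "f2.txt", "f1.txt"]

# It is this per-line work that should be parallelized.
results = [process(files[line]) for line in first_file_lines]
```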
Questions:
1. Why can you not read all 80K files together? I.e., why do you have a
dependency on the first text file?
2. Your first text file has 6M rows, but the total number of files is ~80K.
Is there a scenario where there may not be a file in HDFS corresponding to
a row in the first text file?
3. May be a follow
The first text file is not that large; it has 6 million records (lines).
For each line I need to read a file out of those 80K files. They total
around 1.5TB. I didn't understand what you meant by "then again read text
files for each line and union all rdds."
On Tue, Sep 13, 2016 at 10:04 PM, Raghavend
How large is your first text file? The idea is: you read the first text
file, and if it is not large, you can collect all the lines on the driver
and then again read the text files for each line and union all the RDDs.
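In plain-Python terms (not the Spark API; the data and names are invented), the suggestion is roughly: with the small first file collected on the driver, build one dataset per line and union them all. In Spark, each inner list would instead be an RDD created with sc.textFile, combined via sc.union:

```python
# Per-path contents, standing in for files readable via sc.textFile.
datasets = {"f1.txt": ["a1", "a2"], "f2.txt": ["b1"]}

# The small first text file, collected on the driver.
collected_lines = ["f1.txt", "f2.txt"]

# One "RDD" per line, then the union of all of them.
rdds = [datasets[path] for path in collected_lines]
unioned = [row for rdd in rdds for row in rdd]
```

Note that this creates one RDD per line, which is only reasonable when the first file is small.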
On 13 Sep 2016 11:39 p.m., "Saliya Ekanayake" wrote:
Just wondering if this is possible with Spark?
On Mon, Sep 12, 2016 at 12:14 AM, Saliya Ekanayake wrote:
Hi,
I've got a text file where each line is a record. For each record, I need
to process a file in HDFS.
So if I represent these records as an RDD and invoke a map() operation on
them, how can I access HDFS within that map()? Do I have to create a Spark
context within map(), or is there a better way?