+1 for the guidance from Nirvan. Also it would be better to repartition and
store the data in parquet format in case you are planning to do the joins
more than once or with other data sources. Parquet with SPARK works likes a
charm. Over S3 I have seen its performance being quite close to cached data
over a few million records.

Regards,
Gourav

On Wed, Jun 22, 2016 at 7:11 AM, Nirav Patel <npa...@xactlycorp.com> wrote:

> Can your domain list fit in memory of one executor. if so you can use
> broadcast join.
>
> You can always narrow down to inner join and derive rest from original set
> if memory is issue there. If you are just concerned about shuffle memory
> then to reduce amount of shuffle you can do following:
> 1) partition both rdd (dataframes) with same partitioner with same count
> so corresponding data will on on same node at least
> 2) increase shuffle.memoryfraction
>
> you can use dataframes with spark 1.6 or greater to further reduce memory
> footprint. I haven't tested that though.
>
>
> On Tue, Jun 21, 2016 at 6:16 AM, Rychnovsky, Dusan <
> dusan.rychnov...@firma.seznam.cz> wrote:
>
>> Hi,
>>
>>
>> can somebody please explain the way FullOuterJoin works on Spark? Does
>> each intersection get fully loaded to memory?
>>
>> My problem is as follows:
>>
>>
>> I have two large data-sets:
>>
>>
>> * a list of web pages,
>>
>> * a list of domain-names with specific rules for processing pages from
>> that domain.
>>
>>
>> I am joining these web-pages with processing rules.
>>
>>
>> For certain domains there are millions of web-pages.
>>
>>
>> Based on the memory demands the join is having it looks like the whole
>> intersection (i.e. a domain + all corresponding pages) are kept in memory
>> while processing.
>>
>>
>> What I really need in this case, though, is to hold just the domain and
>> iterate over all corresponding pages, one at a time.
>>
>>
>> What would be the best way to do this on Spark?
>>
>> Thank you,
>>
>> Dusan Rychnovsky
>>
>>
>>
>
>
>
> [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/>
>
> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
> <https://www.linkedin.com/company/xactly-corporation>  [image: Twitter]
> <https://twitter.com/Xactly>  [image: Facebook]
> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
> <http://www.youtube.com/xactlycorporation>

Reply via email to