From: Andrew Palumbo <ap@outlook.com>
Sent: Monday, August 21, 2017 2:44:35 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues
(SimilarityAnalysis.cooccurrencesIDSs)

We do currently have optimizations based on density analysis in use, e.g. in
AtB.
https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e5

[...] discussing a user-defined setting for this. Something that we have
not worked in yet.
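Andrew's point about density analysis can be illustrated generically. The sketch below is not Mahout's actual AtB code (the linked file is the authoritative source); the names and the 25% threshold are invented, and it only shows the general idea of dispatching a multiplication strategy on the fraction of non-zero entries:

```scala
// Generic illustration of density-based dispatch -- NOT Mahout's actual AtB
// implementation; the threshold and names here are invented for this sketch.
object DensityDispatch {
  // Fraction of non-zero entries in a rows x cols matrix with `nnz` non-zeros.
  def density(nnz: Long, rows: Long, cols: Long): Double =
    nnz.toDouble / (rows * cols)

  // Dense kernels tend to win when the matrix is mostly filled,
  // sparse ones when it is mostly zeros.
  def chooseStrategy(nnz: Long, rows: Long, cols: Long,
                     threshold: Double = 0.25): String =
    if (density(nnz, rows, cols) >= threshold) "dense" else "sparse"
}
```

A real implementation would key further decisions (partitioning, block layout) off the same measurement; this only shows the dispatch idea.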
From: Pat Ferrel <p...@occamsmachete.com>
Sent: Monday, August 21, 2017 2:26:58 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues
(SimilarityAnalysis.cooccurrencesIDSs)
That looks like ancient code from t[...]
>>> "Scruggs, Matt" <matt.scru...@bronto.com> wrote:
>>>>
>>>> Hi Pat,
>>>>
>>>> I've taken some screenshots of my Spark UI to hopefully shed some light
>>>> on the behavior I'm seeing. Do you mind if I send you a link via direct
>>>> email (would rather not post it here)? It's just a shared Dropbox folder.
>>>>
>>>> Thanks,
>>>> Matt
>>>
>>> On 8/14/17, 11:34 PM, "Scruggs, Matt" <matt.scru...@bronto.com> wrote:
>>>>
>>>> I'm running a custom Scala app (distributed in a shaded jar) directly
>>>> calling SimilarityAnalysis.cooccurrenceIDSs(), not using the CLI.
>>>>
>>>> The input data already gets explicitly repartitioned to spark.cores.max
>>>> (defaultParallelism) in our code. I'll try increa[...]
>>>>
>>>> [...]se even with all cores busy the whole time, which is why I've been
>>>> playing around with various values for spark.sql.shuffle.partitions.
>>>> The O(log n) operations I mentioned seem to take >95% of runtime.
>>>>
>>>> Thanks,
>>>> Matt
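The settings Matt mentions can be sketched on a SparkConf. This is a hypothetical configuration, not taken from the thread (the numeric values are examples only); one detail worth noting is that spark.sql.shuffle.partitions affects only DataFrame/SQL shuffles, while Mahout's DRM operations run on RDDs, where spark.default.parallelism governs shuffle partition counts instead:

```scala
// Hypothetical configuration sketch; values are illustrative examples.
// spark.sql.shuffle.partitions applies to DataFrame/SQL shuffles only;
// RDD-based code (such as Mahout's Spark bindings) takes its shuffle
// partition count from spark.default.parallelism.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("item-similarity")
  .set("spark.cores.max", "32")              // example: total executor cores
  .set("spark.default.parallelism", "128")   // example: ~4x cores
  .set("spark.sql.shuffle.partitions", "128")
```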

From: Pat Ferrel <p...@occamsmachete.com>
Sent: Monday, August 14, 2017 11:02:42 PM
To: user@mahout.apache.org
Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues
(SimilarityAnalysis.cooccurrencesIDSs)
Are you using the CLI? If so it’s likely that there is only one partition of
the data. If you use Mahout in the Spark shell or using it as a lib, do a
repartition on the input data before passing it into
SimilarityAnalysis.cooccurrencesIDSs. I repartition to 4*total cores to start
with and set [...]
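Pat's suggestion can be sketched roughly as below, assuming Mahout 0.13's Spark bindings (SimilarityAnalysis and IndexedDatasetSpark). `rawPairs`, an RDD of (user, item) interaction pairs, is a hypothetical input; the exact IndexedDatasetSpark factory signature should be checked against the Mahout version in use:

```scala
// Sketch only, under the assumptions stated above -- not verbatim from the thread.
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

// "4 * total cores" as a starting point; defaultParallelism is roughly
// the total number of cores available to the application.
val targetPartitions = 4 * sc.defaultParallelism

// Repartition the raw input BEFORE it becomes an IndexedDataset, so every
// downstream DRM operation inherits the higher parallelism.
val repartitioned = rawPairs.repartition(targetPartitions)

val interactions  = IndexedDatasetSpark(repartitioned)(sc)
val cooccurrences = SimilarityAnalysis.cooccurrencesIDSs(Array(interactions))
```

The point of repartitioning first is that a single-partition (or few-partition) input otherwise serializes the whole similarity computation onto a handful of tasks, regardless of cluster size.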