Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-30 Thread Pat Ferrel
Monday, August 21, 2017 2:44:35 PM To: user@mahout.apache.org Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs) We do currently have optimizations based on density analysis in use, e.g., in AtB. https://githu

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-23 Thread Dmitriy Lyubimov
discussing a user-defined setting for this. Something that we have > not worked in yet. > > > From: Andrew Palumbo <ap@outlook.com> > Sent: Monday, August 21, 2017 2:44:35 PM > To: user@mahout.apache.org > Subject: Re: spark-itemsimilarity s

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
have not worked in yet. From: Andrew Palumbo <ap@outlook.com> Sent: Monday, August 21, 2017 2:44:35 PM To: user@mahout.apache.org Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs) We do currently

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Andrew Palumbo
To: user@mahout.apache.org Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs) We do currently have optimizations based on density analysis in use, e.g., in AtB. https://github.com/apache/mahout/blob/08e02602e947ff945b9bd73ab5f0b45863df3e5

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Andrew Palumbo
From: Pat Ferrel <p...@occamsmachete.com> Sent: Monday, August 21, 2017 2:26:58 PM To: user@mahout.apache.org Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs) That looks like ancient code from t

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
<matt.scru...@bronto.com> >>> wrote: >>>> >>>> Hi Pat, >>>> >>>> I've taken some screenshots of my Spark UI to hopefully shed some light >>> on the behavior I'm seeing. Do you mind if I send you a link via direct >>> email (

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
" <matt.scru...@bronto.com> wrote: >>> >>>> I'm running a custom Scala app (distributed in a shaded jar) directly >> calling SimilarityAnalysis.cooccurrencesIDSs(), not using the CLI. >>>> >>>> The input data already gets explicitly repar

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Scruggs, Matt
taken some screenshots of my Spark UI to hopefully shed some light >>> on the behavior I'm seeing. Do you mind if I send you a link via direct >>> email (would rather not post it here)? It's just a shared Dropbox folder. >>>> >>>> >>>> Thanks, >>>

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-21 Thread Pat Ferrel
> Thanks, >>> Matt >>> >>> >>> On 8/14/17, 11:34 PM, "Scruggs, Matt" <matt.scru...@bronto.com> wrote: >>> >>>> I'm running a custom Scala app (distributed in a shaded jar) directly >> calling SimilarityAnalysis.cooccurrencesIDSs(), no

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-17 Thread Scruggs, Matt
custom Scala app (distributed in a shaded jar) directly >> calling SimilarityAnalysis.cooccurrencesIDSs(), not using the CLI. >> >> >> >> The input data already gets explicitly repartitioned to spark.cores.max >> (defaultParallelism) in our code. I'll try increa

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-16 Thread Pat Ferrel
se even with all cores busy the whole time, which is >> why I've been playing around with various values for >> spark.sql.shuffle.partitions. The O(log n) operations I mentioned seem to >> take >95% of runtime. >> >> Thanks, >> Matt >>

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-15 Thread Scruggs, Matt
>> spark.sql.shuffle.partitions. The O(log n) operations I mentioned seem to >> take >95% of runtime. >> >> Thanks, >> Matt >> ____ >> From: Pat Ferrel <p...@occamsmachete.com> >> Sent: Monday, August 14,

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-15 Thread Pat Ferrel
> O(log n) operations I mentioned seem to take >95% of runtime. > > Thanks, > Matt > > From: Pat Ferrel <p...@occamsmachete.com> > Sent: Monday, August 14, 2017 11:02:42 PM > To: user@mahout.apache.org > Subject: Re: spark-itemsimil

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-15 Thread Scruggs, Matt
11:02:42 PM >To: user@mahout.apache.org >Subject: Re: spark-itemsimilarity scalability / Spark parallelism issues >(SimilarityAnalysis.cooccurrencesIDSs) > >Are you using the CLI? If so it’s likely that there is only one partition of >the data. If you use Mahout i

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-14 Thread Scruggs, Matt
. The O(log n) operations I mentioned seem to take >95% of runtime. Thanks, Matt From: Pat Ferrel <p...@occamsmachete.com> Sent: Monday, August 14, 2017 11:02:42 PM To: user@mahout.apache.org Subject: Re: spark-itemsimilarity scalability / Spark paralleli

Re: spark-itemsimilarity scalability / Spark parallelism issues (SimilarityAnalysis.cooccurrencesIDSs)

2017-08-14 Thread Pat Ferrel
Are you using the CLI? If so, it’s likely that there is only one partition of the data. If you use Mahout in the Spark shell or as a lib, repartition the input data before passing it into SimilarityAnalysis.cooccurrencesIDSs. I repartition to 4*total cores to start with and set
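
A rough sketch of the "4*total cores" starting point mentioned in the message above, in plain Scala with no Spark dependency. The object name PartitionSizing, the helper targetPartitions, and the 16-core figure are all made up for illustration; none of them come from Mahout or Spark.

```scala
// Hypothetical helper (not part of Mahout or Spark): choose a starting
// partition count as a multiple of the total cores available to the job.
object PartitionSizing {

  // multiplier defaults to 4, following the "4*total cores" starting point
  // suggested in the thread; tune it for your own data and cluster.
  def targetPartitions(totalCores: Int, multiplier: Int = 4): Int = {
    require(totalCores > 0, "totalCores must be positive")
    require(multiplier > 0, "multiplier must be positive")
    totalCores * multiplier
  }

  def main(args: Array[String]): Unit = {
    val totalCores = 16 // assumed value, e.g. whatever spark.cores.max is set to
    val partitions = targetPartitions(totalCores)
    println(s"Repartition the input to $partitions partitions")
    // In a real Spark job (assumption: the input is still a plain RDD named
    // `input` at this point), the count would be applied with:
    //   val repartitioned = input.repartition(partitions)
  }
}
```

The repartition has to happen on the input before it is handed to SimilarityAnalysis.cooccurrencesIDSs; the multiplier is only a starting value to adjust while watching task sizes in the Spark UI.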