…we should probably not be hard-coding the
spark.kryoserializer.buffer.mb either.
On Mon, Oct 13, 2014 at 9:54 AM, Reinis Vicups wrote:
Hello,
When you set the Spark config as below do you still get one task?
Unfortunately yes.
Currently I am looking for the very first shuffle stage in
SimilarityAnalysis.
We could add options to the CLI to set them, like we do with -sem, if needed.
Let’s see what Dmitriy thinks about why only one task is being created.
On Oct 13, 2014, at 9:32 AM, Reinis Vicups wrote:
Hi,
Do you think that simply increasing this parameter is a safe and sane thing
to do?
Why would it be unsafe?
rdd.coalesce(numPartitions, shuffle = true)
else
rdd.coalesce(numPartitions = rdd.partitions.size)
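As a side note, here is a quick plain-Spark sketch (illustrative values, not from this thread) of why the shuffle flag matters in the snippet above: coalesce can only merge partitions unless shuffle = true, so growing the task count requires the shuffled variant.

import org.apache.spark.{SparkConf, SparkContext}

// Hedged illustration, not Mahout code: coalesce(n) without a shuffle can only
// shrink the partition count; growing it (and the downstream task count)
// requires shuffle = true.
val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("coalesce-demo"))
val narrow = sc.parallelize(1 to 1000, numSlices = 1)
val wide = narrow.coalesce(8, shuffle = true) // redistributes into 8 partitions
println((narrow.partitions.length, wide.partitions.length)) // (1, 8)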
Dmitriy, can you shed any light on the use of spark.default.parallelism, how to
increase it, or how to get more than one task created when performing ABt?
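For what it's worth, a rough sketch of passing these settings in when the context is created through the sparkbindings helper, rather than hard-coding them; the helper name is real, but the parameter names and values here are best-effort assumptions:

import org.apache.spark.SparkConf
import org.apache.mahout.sparkbindings._

// Sketch only; the parallelism and buffer values are illustrative.
implicit val mahoutCtx = mahoutSparkContext(
  masterUrl = "local[4]",
  appName = "row-similarity-test",
  sparkConf = new SparkConf()
    .set("spark.default.parallelism", "16")      // ask for more shuffle tasks
    .set("spark.kryoserializer.buffer.mb", "64") // instead of hard-coding it
)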
On Oct 13, 2014, at 8:56 AM, Reinis Vicups wrote:
Hi,
I am currently comparing against results back from the time when I used the "old"
RowSimilarityJob, and with some exceptions (I guess due to randomized
sparsification) I still get approximately the same values with my own row
similarity implementation.
reinis
On 13.10.2014 18:06, Ted Dunning wrote:
On Mon, Oct 13, 2014 at 11:56, Reinis Vicups wrote:
Hi,
I am currently testing SimilarityAnalysis.rowSimilarity and I am
wondering how I could increase the number of tasks used for the distributed
shuffle.
What I currently observe is that SimilarityAnalysis requires almost
20 minutes for my dataset on this stage alone:
combineByKey at ABt.
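For reference, a minimal sketch of the call being tested, assuming an implicit Mahout distributed context (e.g. from mahoutSparkContext) is already in scope; the tiny matrix and the partition count are illustrative:

import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings._

// Tiny in-memory matrix just to exercise the call; an implicit
// DistributedContext is assumed to be in scope.
val inCoreA = dense((1, 0, 1), (0, 1, 1), (1, 1, 0))
val drmA = drmParallelize(inCoreA, numPartitions = 4) // more partitions -> more tasks
val drmSims = SimilarityAnalysis.rowSimilarity(drmA)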
…they already have (and, in
fact, do in the case of Spark) all that tooling far better than we will ever
have on our own.
Sent from my phone.
On Oct 9, 2014 12:56 PM, "Reinis Vicups" wrote:
Hello,
I am currently looking into the new (DRM) mahout framework.
I find myself wondering why it is so that, on the one hand, a lot
of thought, effort, and design complexity is being invested into abstracting
engines, contexts, and algebraic operations,
but on the other hand, even abstract in
…Spark’s much better. For example, if a pipeline is long,
it requires lots of serialization and IO on Hadoop and none using Spark’s
in-memory RDDs. I doubt we’ll see that with RSJ, which is fairly simple.
Other comments inline
On Aug 5, 2014, at 9:38 PM, Reinis Vicups wrote:
Yes, that would make use…
…at most 100 similar items would be returned for any particular
item. Overall it would be 100 * the number of rows that you’d get, and they’d be the
items with the highest similarity scores of the total created. Do you think this
would help?
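A toy sketch of the downsampling idea described above (the names and data are made up): keep only the maxPerItem highest-scoring similarities for each item.

// Illustrative only: per-item top-k truncation of similarity pairs.
val maxPerItem = 100
val allPairs = Seq(("a", "b", 0.9), ("a", "c", 0.1), ("b", "c", 0.5)) // (item, other, score)
val topKPerItem: Map[String, Seq[(String, Double)]] =
  allPairs
    .groupBy(_._1)                 // bucket by the left-hand item
    .mapValues(_.sortBy(-_._3)     // highest score first
      .take(maxPerItem)
      .map { case (_, other, score) => (other, score) })
    .toMap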
On Aug 5, 2014, at 4:50 PM, Reinis Vicups wrote:
Hi,
we have had good results with RowSimilarityJob in our use case, with some
quality loss due to pruning and a decline in performance when setting
thresholds too high/low.
The current work on Mahout integration with Spark done by dlyubimov,
pferrel, and others is just amazing (although I would love
Hi,
I am humbly bumping this. Alas, up to now I haven't figured out why the heck
the error occurs.
Any hint on what direction I should look into is greatly appreciated.
Kind regards
reinis
On 12.07.2014 17:38, Reinis Vicups wrote:
Hi,
the log below shows an issue that started to occur just "recently" (I
haven't run tests with this somewhat larger dataset (320K documents) for
some time, and when I did today, I got this "all of a sudden").
I am using mahout 0.9-cdh5.2.0-SNAPSHOT (yes, it's Cloudera, but as far as I
can tell, t
Hi,
in mahout 0.8 I see that ClusterOutputPostProcessorMapper and -Reducer
are using Map<String, Integer> clusterMappings =
ClusterCountReader.getClusterIDs(clusterOutputPath, conf, ).
This map allows mapping clusterIds to an index from 0 to k-1, where k is the
number of clusters.
What is the purpose of this mapping?
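One plausible reading (my assumption, not from the thread) is that the dense 0..k-1 index lets arbitrary cluster ids address plain k-sized arrays or vector dimensions, along these lines:

// Illustrative sketch: map arbitrary cluster ids to dense indices so they
// can serve as offsets into a k-sized array (the ids are made up).
val clusterIds = Seq("C-101", "C-7", "C-42")
val clusterIdToIndex: Map[String, Int] = clusterIds.zipWithIndex.toMap
val pointsPerCluster = new Array[Long](clusterIds.size) // dense, k entries
pointsPerCluster(clusterIdToIndex("C-7")) += 1L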
…a huge number of pairs needs to be emitted here. In our collaborative
filtering code, we solve this through downsampling.
--sebastian
On 04/08/2014 10:08 AM, Reinis Vicups wrote:
Hi,
this is not a Mahout question directly, but I figured that you guys most
likely can answer it.
Actually I have two questions:
1. This: {(1,2); (1,3); (2,3)} is not a full Cartesian product, right? It
is missing (1,1); (2,2); (3,3); (2,1). My question is: what is it
called? A partial Cartesian product?
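For what it's worth, the set in question 1 can be generated as the i < j pairs, i.e. 2-element combinations, as opposed to a full Cartesian product:

// The set from the question, produced as 2-combinations: unordered, no self-pairs.
val items = Seq(1, 2, 3)
val pairs = items.combinations(2).map { case Seq(a, b) => (a, b) }.toList
// pairs == List((1,2), (1,3), (2,3)) -- n * (n - 1) / 2 of them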
Hi,
when I run "mahout resplit", I get this output:
support@hadoop1:~$ mahout resplit --input
.../final/clusteredPoints/part-m-* --output .../final/split --numSplits 4
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
/opt/cloudera/parcels/CDH-5.0.0-0.cdh5
…points within T2 are removed from the "input set", while points within T1 are added to the
cluster but NOT removed from the "input set" (and therefore may be added
to another cluster later in the process).
Scott
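A compact sketch of the rule Scott describes (simplified types, the distance function is a parameter): points within T1 join the canopy, but only points within T2 leave the input set.

// Hedged sketch of canopy assignment, not Mahout's actual implementation.
def buildCanopies(points: List[Array[Double]], t1: Double, t2: Double,
                  dist: (Array[Double], Array[Double]) => Double): List[List[Array[Double]]] = {
  var remaining = points
  var canopies = List.empty[List[Array[Double]]]
  while (remaining.nonEmpty) {
    val center = remaining.head
    canopies :+= remaining.filter(p => dist(center, p) < t1)   // within T1: added to canopy
    remaining = remaining.filterNot(p => dist(center, p) < t2) // within T2: removed from input set
    // Points between T2 and T1 were added above but remain available, so they
    // can join another canopy later; the center itself (distance 0) is always
    // consumed, which guarantees termination.
  }
  canopies
}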
On 3/24/14, 6:44 AM, "Reinis Vicups" wrote:
Hi,
apparently I am misunderstanding the way canopy works. I thought that
once a data point is added to a canopy, it is removed from the list of
to-be-clustered points, and thus one point is assigned to one canopy.
In the example below this is not the case:
:C-28{n=1 c=[70:11.686, 72:7.170, 236:8.182