Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity

2014-10-13 Thread Reinis Vicups
…should probably not be hard-coding the spark.kryoserializer.buffer.mb either. On Mon, Oct 13, 2014 at 9:54 AM, Reinis Vicups wrote: Hello, When you set the Spark config as below, do you still get one task? Unfortunately yes. Currently I am looking for the very first shuffle stage in SimilarityAnalysis…
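For readers following the thread: the settings under discussion are typically placed on the SparkConf before the context is created. A minimal sketch with standard Spark 1.x property names; the values are illustrative placeholders, not the thread's actual configuration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative Spark 1.x settings; values are placeholders.
    val conf = new SparkConf()
      .setMaster("local[4]")                          // illustrative; use your cluster master
      .setAppName("rowSimilarity-test")
      .set("spark.kryoserializer.buffer.mb", "64")    // Kryo buffer size in MB (Spark 1.x name)
      .set("spark.default.parallelism", "16")         // default number of shuffle tasks

    val sc = new SparkContext(conf)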

Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity

2014-10-13 Thread Reinis Vicups
…the CLI to set them, like we do with -sem, if needed. Let's see what Dmitriy thinks about why only one task is being created. On Oct 13, 2014, at 9:32 AM, Reinis Vicups wrote: Hi, Do you think that simply increasing this parameter is a safe and sane thing to do? Why would it be unsafe?

Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity

2014-10-13 Thread Reinis Vicups
…true) else rdd.coalesce(numPartitions = rdd.partitions.size) Dmitriy, can you shed any light on the use of spark.default.parallelism, how to increase it, or how to get more than one task created when performing ABt? On Oct 13, 2014, at 8:56 AM, Reinis Vicups wrote: Hi, I am currently…
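The quoted fragment appears to come from a branch that either repartitions or coalesces an RDD. As a rough sketch of the two operations being contrasted (assumed shapes, not Mahout's actual code):

    import org.apache.spark.rdd.RDD

    // Sketch of the repartition-vs-coalesce choice the fragment hints at.
    // repartition always shuffles; coalesce without shuffle can only reduce partitions.
    def adjustPartitions[T](rdd: RDD[T], target: Int): RDD[T] =
      if (target > rdd.partitions.size)
        rdd.repartition(target)                // full shuffle, can increase partition count
      else
        rdd.coalesce(numPartitions = target)   // no shuffle, merges existing partitions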

Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity

2014-10-13 Thread Reinis Vicups
…back from the time when I used the "old" RowSimilarityJob, and with some exceptions (I guess due to randomized sparsization) I still get approximately the same values with my own row similarity implementation. reinis On 13.10.2014 18:06, Ted Dunning wrote: On Mon, Oct 13, 2014 at 11:56…

Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity

2014-10-13 Thread Reinis Vicups
Hi, I am currently testing SimilarityAnalysis.rowSimilarity and I am wondering how I could increase the number of tasks used for the distributed shuffle. What I currently observe is that SimilarityAnalysis requires almost 20 minutes for my dataset in this stage alone: combineByKey at ABt…
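One lever relevant to the question: combineByKey accepts an explicit partition count, which forces more than one shuffle task for that stage even when spark.default.parallelism is not picked up. A hedged sketch using the standard Spark API on a toy RDD; this is not the ABt code itself:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark 1.x
    import org.apache.spark.rdd.RDD

    // combineByKey with an explicit numPartitions on a toy pair RDD.
    def sumByKey(sc: SparkContext): RDD[(Int, Long)] = {
      val pairs = sc.parallelize(Seq(1 -> 1L, 1 -> 2L, 2 -> 3L))
      pairs.combineByKey(
        (v: Long) => v,                    // createCombiner
        (acc: Long, v: Long) => acc + v,   // mergeValue
        (a: Long, b: Long) => a + b,       // mergeCombiners
        numPartitions = 8                  // forces 8 shuffle tasks for this stage
      )
    }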

Re: Mahout 1.0: is DRM too file-bound?

2014-10-09 Thread Reinis Vicups
…they already have (and, in fact, do in the case of Spark) all that tooling far better than we will ever have on our own. Sent from my phone. On Oct 9, 2014 12:56 PM, "Reinis Vicups" wrote: Hello, I am currently looking into the new (DRM) Mahout framework. I find myself wondering why it is…

Mahout 1.0: is DRM too file-bound?

2014-10-09 Thread Reinis Vicups
Hello, I am currently looking into the new (DRM) Mahout framework. I find myself wondering why it is that, on the one side, a lot of thought, effort and design complexity is being invested into abstracting engines, contexts and algebraic operations, while on the other side, even abstract in…
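For context on the "file-bound" point: the Spark bindings do expose a way to wrap an existing in-memory RDD as a DRM without touching HDFS. A sketch assuming the drmWrap helper from org.apache.mahout.sparkbindings; the signature is simplified from memory, so treat it as approximate:

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.mahout.sparkbindings._

    // Approximate sketch: wrap an RDD[(Int, Vector)] as a DRM in memory,
    // rather than reading the matrix from sequence files on HDFS.
    def wrapExample(sdc: SparkDistributedContext) = {
      val rows = sdc.sc.parallelize((0 until 100).map { i =>
        val v: Vector = new RandomAccessSparseVector(1000)
        v.setQuick(i % 1000, 1.0)
        i -> v
      })
      drmWrap(rows)   // DRM backed by the in-memory RDD, no file I/O required
    }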

Re: RowSimilarityJob implementation with Spark

2014-08-07 Thread Reinis Vicups
…Spark's much better. For example, if a pipeline is long, requiring lots of serialization and I/O on Hadoop and none using Spark's in-memory RDDs. I doubt we'll see that with RSJ, which is fairly simple. Other comments inline. On Aug 5, 2014, at 9:38 PM, Reinis Vicups wrote: Yes, that would make usea…
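A toy illustration of the "long pipeline" point: in the Scala DSL, chained algebra like the sketch below stays in Spark memory between steps, whereas each MapReduce job would serialize its intermediate result to HDFS. Hypothetical pipeline; operator names are from the DSL of this era and should be treated as approximate:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // Hypothetical chained pipeline; intermediates live in RDDs, not HDFS files.
    def pipeline(drmA: DrmLike[Int]): DrmLike[Int] = {
      val drmAtA = drmA.t %*% drmA   // Gramian, built lazily into one plan
      (drmAtA * 2.0).checkpoint()    // no per-step disk writes between operators
    }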

Re: RowSimilarityJob implementation with Spark

2014-08-05 Thread Reinis Vicups
…ed for any particular item. Overall it would be 100 * the number of rows that you'd get, and they'd be the items with the highest similarity scores of the total created. Do you think this would help? On Aug 5, 2014, at 4:50 PM, Reinis Vicups wrote: Hi, we have had good results with RowSimilarityJob in…
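The "top 100 per row" idea can be sketched with plain Spark operations. Assuming a hypothetical RDD of (rowId, (itemId, score)) pairs; names are illustrative, not from the thread:

    import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark 1.x
    import org.apache.spark.rdd.RDD

    // Hedged sketch: keep only the 100 highest-scoring similar items per row.
    def top100PerRow(similarities: RDD[(Int, (Int, Double))]): RDD[(Int, Seq[(Int, Double)])] =
      similarities
        .groupByKey()
        .mapValues(_.toSeq.sortBy(-_._2).take(100))  // fine for modest rows; use a bounded heap for huge ones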

RowSimilarityJob implementation with Spark

2014-08-05 Thread Reinis Vicups
Hi, we have had good results with RowSimilarityJob in our use case, with some quality loss due to pruning and a decline in performance when thresholds are set too high or too low. The current work on Mahout integration with Spark done by dlyubimov, pferrel and others is just amazing (although I would lo…

Re: SparseVectorsFromSequenceFiles: ArrayIndexOutOfBoundsException in DictionaryVectorizer

2014-07-18 Thread Reinis Vicups
Hi, I am humbly bumping this. Alas, up to now I haven't figured out why the heck the error occurs. Any hint on what direction I should look in is greatly appreciated. Kind regards reinis On 12.07.2014 17:38, Reinis Vicups wrote: Hi, the log below shows an issue that started to occur…

SparseVectorsFromSequenceFiles: ArrayIndexOutOfBoundsException in DictionaryVectorizer

2014-07-12 Thread Reinis Vicups
Hi, the log below shows an issue that started to occur just "recently" (I hadn't run tests with this somewhat larger dataset (320K documents) for some time, and when I did today, I got this "all of a sudden"). I am using mahout 0.9-cdh5.2.0-SNAPSHOT (yes, it's Cloudera, but as far as I can tell, t…

ClusterOutputPostProcessor: what is the purpose of clusterMappings

2014-05-14 Thread Reinis Vicups
Hi, in Mahout 0.8 I see that ClusterOutputPostProcessorMapper and -Reducer are using Map<Integer, Integer> clusterMappings = ClusterCountReader.getClusterIDs(clusterOutputPath, conf, …). This map allows mapping cluster IDs to an index from 0 to k-1, where k is the number of clusters. What is the purpose of this mapping…
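The mapping being asked about is just a densification of arbitrary cluster IDs into contiguous indices 0 to k-1, e.g. so each cluster can address an array slot or a reducer partition. A minimal sketch of what such a map does; this is not the Mahout code:

    // Densify arbitrary cluster IDs into indices 0..k-1.
    def clusterMappings(clusterIds: Seq[Int]): Map[Int, Int] =
      clusterIds.distinct.sorted.zipWithIndex.toMap

    // e.g. clusterMappings(Seq(28, 7, 113)) == Map(7 -> 0, 28 -> 1, 113 -> 2)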

ClusterOutputPostProcessor: what is the purpose of clusterMappings

2014-05-11 Thread Reinis Vicups
Hi, in Mahout 0.8 I see that ClusterOutputPostProcessorMapper and -Reducer are using Map<Integer, Integer> clusterMappings = ClusterCountReader.getClusterIDs(clusterOutputPath, conf, …). This map allows mapping cluster IDs to an index from 0 to k-1, where k is the number of clusters. What is the purpose of this mapping…

Re: Best practice for partial cartesian product

2014-04-08 Thread Reinis Vicups
…needs to be emitted here. In our collaborative filtering code, we solve this through downsampling. --sebastian On 04/08/2014 10:08 AM, Reinis Vicups wrote: Hi, this is not a Mahout question directly, but I figured that you guys most likely can answer it. Actually I have two questions: 1. This:…
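Sebastian's downsampling point, sketched: cap the number of interactions per user before forming pairs, so the per-user pair count stays bounded by the cap squared. Hypothetical shapes, not the actual collaborative filtering code:

    import scala.util.Random
    import org.apache.spark.SparkContext._   // pair-RDD implicits for Spark 1.x
    import org.apache.spark.rdd.RDD

    // Hedged sketch of interaction downsampling: at most maxPerUser items
    // per user survive, bounding the pairwise blow-up downstream.
    def downsample(interactions: RDD[(Int, Int)], maxPerUser: Int): RDD[(Int, Int)] =
      interactions
        .groupByKey()
        .flatMap { case (user, items) =>
          Random.shuffle(items.toSeq).take(maxPerUser).map(user -> _)
        }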

Best practice for partial cartesian product

2014-04-08 Thread Reinis Vicups
Hi, this is not a Mahout question directly, but I figured that you guys most likely can answer it. Actually I have two questions: 1. This: {(1,2); (1,3); (2,3)} is not a full Cartesian product, right? It is missing (1,1); (2,2); (3,3); (2,1); My question is: what is it called? A partial Cartesian…
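On the terminology: {(1,2); (1,3); (2,3)} is the set of 2-element combinations of {1, 2, 3} (unordered pairs without repetition) rather than a partial Cartesian product. Scala even has it built in:

    // The 2-combinations of {1, 2, 3}:
    val pairs = List(1, 2, 3).combinations(2).map { case Seq(a, b) => (a, b) }.toList
    // pairs == List((1,2), (1,3), (2,3))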

resplit not generating splits

2014-03-28 Thread Reinis Vicups
Hi, when I run "mahout resplit", I get this output: support@hadoop1:~$ mahout resplit --input .../final/clusteredPoints/part-m-* --output .../final/split --numSplits 4 MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath. Running on hadoop, using /opt/cloudera/parcels/CDH-5.0.0-0.cdh5

Re: Re: canopy creating canopies with the same points

2014-03-24 Thread Reinis Vicups
…removed from the "input set", while points within T1 are added to the cluster but NOT removed from the "input set" (and therefore may be added to another cluster later in the process). Scott On 3/24/14, 6:44 AM, "Reinis Vicups" wrote: Hi, apparently I am misunderstanding the…
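Scott's description, rendered as a sketch of the standard canopy loop (not Mahout's implementation; note T2 < T1):

    // Points within T2 of a center leave the input set for good;
    // points within T1 join the canopy but stay available for later canopies.
    def canopy(points: Seq[Array[Double]], t1: Double, t2: Double,
               dist: (Array[Double], Array[Double]) => Double): Seq[Seq[Array[Double]]] = {
      var remaining = points
      var canopies = Seq.empty[Seq[Array[Double]]]
      while (remaining.nonEmpty) {
        val center = remaining.head
        val members = remaining.filter(p => dist(center, p) < t1)   // within T1: joins canopy
        remaining = remaining.filterNot(p => dist(center, p) < t2)  // within T2: removed
        canopies :+= members
      }
      canopies
    }

This is exactly why one point can appear in several canopies, which answers the observation in the original post below.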

canopy creating canopies with the same points

2014-03-24 Thread Reinis Vicups
Hi, apparently I am misunderstanding the way canopy works. I thought that once a datapoint is added to a canopy, it is removed from the list of to-be-clustered points, and thus one point is assigned to exactly one canopy. In the example below this is not the case: :C-28{n=1 c=[70:11.686, 72:7.170, 236:8.182…