Hi there,
I have no experience with Mahout, but I know it will solve my problem :)!
I have the following requirements:
* No Hadoop setup should be necessary. I want a simple approach, and I
know this is possible with Mahout!
* I have lots of points (~100 million), but also a fair amount of RAM (32GB).
Hi
I'm a newbie regarding machine learning and, above all, Apache Mahout, so
pardon my basic questions.
I need to do some cluster analysis on some data. At the beginning this
data may not be very large, but after some time it can become really
huge (I did some calculations and
Hello,
I searched the wiki and the web but couldn't find the reason why theta
normalization is commented out for naive Bayes classification.
There is a TODO comment at the top stating that it will be enabled soon.
Is there any schedule for this?
Does anyone know the reason not to use theta normalization?
Hi All, I posted a Mahout canopy generation troubleshooting question
last week; however, I didn't get the problem solved. The message below
is the error I received. I'm trying to run canopy generation on about
900 MB of data, containing an estimated 120,000 vectors. I'm currently
Hello Ted,
No, the training also ran on one machine. What happens sometimes is that
each box executes one job at a time, but not together. For example, if
it runs 3 jobs, it runs the first job on box1, the next on box2, and the
next on box1 again.
The full dataset is a CSV of around 70 MB. I
Hi all,
In some recommender applications the system might recommend already
consumed items. For example, a hotel recommendation site might recommend
hotel A to a user who already stayed at hotel A before.
In order to recommend already consumed items we have to rank all of the
items (consumed and
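The ranking idea above can be sketched in a few lines. This is a hypothetical toy (the item names and scores are made up, and this is not Mahout's recommender API): rank every item by its predicted score instead of filtering already-consumed items out of the candidate set.

```python
# Toy recommender scores for one user; "hotelA" was already consumed.
consumed = {"hotelA"}
scores = {"hotelA": 0.9, "hotelB": 0.6, "hotelC": 0.4}

# Rank ALL items by score; consumed items stay in the ranking rather
# than being removed, so a previously visited hotel can still top the list.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['hotelA', 'hotelB', 'hotelC']
```

The only change from a typical recommender loop is the absence of a `if item not in consumed` filter before ranking.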
Gokhan
On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com wrote:
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi
vishal.santo...@gmail.com
Are we to assume that SGD is still a work in progress and the
implementations (Cross Fold, Online, Adaptive) are too flawed to
Peter,
What you say is a bit confusing to me.
You say you have centers already. But then you talk about algorithms which
find the centers.
Also, you say you want to assign points based on centers, but you also say
that clusters have different shapes, area, size and point count. Do you
mean
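One reading of "assign points based on centers" is plain nearest-center assignment, which works regardless of how the centers were obtained. A minimal sketch under that assumption (the centers and points here are invented, not Peter's data):

```python
import math

# Fixed centers (already known) and some points to assign.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.5), (9.0, 11.0), (0.2, 0.1)]

def nearest(p, centers):
    # Index of the center with the smallest Euclidean distance to p.
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

assignment = [nearest(p, centers) for p in points]
print(assignment)  # [0, 1, 0]
```

Note this ignores cluster shape, area, and point count entirely, which is exactly why the question about differently shaped clusters matters: nearest-center assignment implicitly assumes roughly spherical clusters of similar scale.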
Hi All, we are using Pig to build our data pipeline.
I came across the following: https://github.com/tdunning/pig-vector
The last commit was 2 years ago. Is there any information on whether there
will be any further work on this project?
You might also look into elephant-bird from Twitter; covers a lot of ground.
https://github.com/kevinweil/elephant-bird
On Mon, Dec 2, 2013 at 4:10 PM, Sameer Tilak ssti...@live.com wrote:
Hi All, we are using Pig to build our data pipeline.
I came across the
Hi All, we are using Apache Pig for building our data pipeline. We have data in
the following fashion:
userid, age, items {code 1, code 2, ….}, a few other features...
Each item has a unique alphanumeric code. I would like to use Mahout for
clustering it. Based on my current reading I see
Elephant-bird is distinctly superior to pig-vector for many things (it
moved forward; pig-vector did not).
I believe there is also a Twitter-internal project known as PigML which is
much more what pig-vector wanted to be.
There is also https://github.com/hanborq/pigml, but I think it is very
Cool! I am using it for sequence-file reading, so I will be happy to look into it.
Date: Mon, 2 Dec 2013 16:14:23 -0800
Subject: Re: Pig vector project
From: andrew.mussel...@gmail.com
To: user@mahout.apache.org
You might also look into elephant-bird from Twitter; covers a lot of ground.
I am looking for some input on how to vectorize my data.
From: ssti...@live.com
To: user@mahout.apache.org
Subject: Mahout for clustering
Date: Mon, 2 Dec 2013 16:22:03 -0800
Hi All, we are using Apache Pig for building our data pipeline. We have data
in the following fashion:
I would probably write a script to parse that out and stream to it from Pig.
http://pig.apache.org/docs/r0.11.0/basic.html#stream
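A Pig STREAM script along those lines could be a small Python program that reads tab-separated tuples (Pig's default streaming format) and pulls the item codes out of the bag. This is only a sketch; the field positions and the `{code1,code2,...}` bag rendering are assumptions about the data, not something stated in the thread.

```python
import re

def parse_line(line):
    """Parse one tab-separated Pig tuple: userid, age, items bag, ..."""
    fields = line.rstrip("\n").split("\t")
    userid, items_bag = fields[0], fields[2]   # field 1 is age, unused here
    # Pull the alphanumeric item codes out of the "{...}" bag text.
    codes = re.findall(r"[A-Za-z0-9]+", items_bag)
    return userid, codes

# In the real streaming script this would loop over sys.stdin and print
# each record; a sample line stands in for the stream here.
sample = "u42\t31\t{ab1,cd2,ef3}"
userid, codes = parse_line(sample)
print("\t".join([userid] + codes))  # userid followed by its item codes
```

The printed output can then be collected back into Pig, or written as input for Mahout's vectorization step.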
On Mon, Dec 2, 2013 at 4:30 PM, Sameer Tilak ssti...@live.com wrote:
I am looking for some input on how to vectorize my data.
From: ssti...@live.com
To:
Do you want to cluster users or items?
For items, the vectorization that you suggest will work reasonably well,
especially if you use TF.IDF weighting and normalize the resulting vectors.
You can also use one of the matrix decomposition techniques and cluster the
resulting vectors. The spectral
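The TF-IDF-plus-normalization suggestion can be sketched as follows. This is a toy stand-alone version (made-up item codes, plain dicts rather than Mahout's vector classes): weight each user's item codes by TF-IDF, then L2-normalize the resulting vector.

```python
import math

# Toy corpus: each user's bag of item codes.
docs = {
    "u1": ["a", "b", "a"],
    "u2": ["b", "c"],
    "u3": ["a", "c", "c"],
}

# Document frequency: how many users have each code at least once.
n_docs = len(docs)
df = {}
for codes in docs.values():
    for code in set(codes):
        df[code] = df.get(code, 0) + 1

def tfidf_vector(codes):
    # Term frequency within this user's bag.
    tf = {}
    for c in codes:
        tf[c] = tf.get(c, 0) + 1
    # TF-IDF weight per code, then L2-normalize so vectors are unit length.
    vec = {c: tf[c] * math.log(n_docs / df[c]) for c in tf}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {c: w / norm for c, w in vec.items()}

vectors = {u: tfidf_vector(codes) for u, codes in docs.items()}
```

Unit-length vectors make cosine distance equivalent to a dot product, which is why the normalization step helps most clustering distance measures.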
Inline
On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote:
... To accomplish this, we used AdaptiveLogisticRegression and trained 46
binary classification models. Our approach has been to do an 80/20 split
on the data, holding the 20% back for cross-validation of the
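The 80/20 split described above is, in its simplest form, a seeded shuffle with a 20% holdout. A minimal sketch with toy data (not the poster's actual pipeline or Mahout's AdaptiveLogisticRegression API):

```python
import random

data = list(range(100))            # stand-in for the labeled examples
rng = random.Random(42)            # fixed seed so the split is reproducible
rng.shuffle(data)

cut = int(len(data) * 0.8)
train, holdout = data[:cut], data[cut:]
print(len(train), len(holdout))    # 80 20
```

For 46 separate binary models, keeping the same seed per model makes the held-out 20% comparable across classifiers.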
Well, similarity between data points should be calculated taking into
account the following variables: weather, event, day of the week, month of
the year, and vacation.
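One way to turn those variables into a similarity is to encode them as a vector first: one-hot the categorical ones (weather, vacation), encode the cyclic ones (day of week, month) as sin/cos pairs so Sunday sits next to Monday, then compare with cosine similarity. The feature names, categories, and equal weighting here are all assumptions for illustration, not from the thread.

```python
import math

WEATHER = ["sunny", "rainy", "snowy"]  # assumed categories

def encode(weather, day_of_week, month, vacation):
    # One-hot weather.
    v = [1.0 if weather == w else 0.0 for w in WEATHER]
    # Cyclic encoding: day 6 ends up close to day 0, December close to January.
    v += [math.sin(2 * math.pi * day_of_week / 7),
          math.cos(2 * math.pi * day_of_week / 7)]
    v += [math.sin(2 * math.pi * (month - 1) / 12),
          math.cos(2 * math.pi * (month - 1) / 12)]
    v.append(1.0 if vacation else 0.0)
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a = encode("sunny", 0, 7, False)
b = encode("sunny", 1, 7, False)   # nearly the same conditions
c = encode("rainy", 3, 1, True)    # very different conditions
# cosine(a, b) comes out higher than cosine(a, c).
```

Per-variable weights could be added by scaling each feature group before comparing, which is where the domain knowledge about which variable matters most would go.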
2013/12/3 Ted Dunning ted.dunn...@gmail.com
The key first question is how you plan to compute similarity between data
points. It