Well, similarity between data points should be calculated by taking the
following variables into account: weather (meteo), event (manifestation),
day of the week, month of the year, and vacation.
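A similarity over the variables listed above could be sketched roughly as follows. This is only an illustration, not something proposed in the thread: the `Day` record, its field names, the reading of "meteo" as weather and "manifestation" as a public event, and the equal weighting of all five variables are assumptions. Day of week and month are treated as cyclic, so December and January count as neighbours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Day:
    # Hypothetical record layout; field names are assumptions.
    weather: str       # categorical, e.g. "sunny" or "rain"
    event: bool        # whether a public event took place
    day_of_week: int   # 0 = Monday ... 6 = Sunday
    month: int         # 1 ... 12
    vacation: bool


def cyclic_similarity(a: int, b: int, period: int) -> float:
    """Similarity in [0, 1] for values that wrap around (weekdays, months)."""
    d = abs(a - b) % period
    d = min(d, period - d)           # shortest way around the cycle
    return 1.0 - d / (period // 2)   # period // 2 is the largest possible d


def similarity(x: Day, y: Day) -> float:
    """Unweighted mean of per-variable similarities (equal weights assumed)."""
    parts = [
        1.0 if x.weather == y.weather else 0.0,
        1.0 if x.event == y.event else 0.0,
        cyclic_similarity(x.day_of_week, y.day_of_week, 7),
        cyclic_similarity(x.month, y.month, 12),
        1.0 if x.vacation == y.vacation else 0.0,
    ]
    return sum(parts) / len(parts)
```

In practice the weights would probably need tuning, since it is unlikely that all five variables matter equally.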
2013/12/3 Ted Dunning
> The key first question is how you plan to compute similarity between data
> points. It isn't clear how you
I recently updated my Mahout-0.9 snapshot version code and rebuilt from the
terminal. The process was successful with no build errors.
But when I try to build Mahout from Eclipse (Run As --> Maven build) I get
the following build error while Mahout-Integration is being built.
Failed to execute goa
Inline
On Mon, Dec 2, 2013 at 8:55 AM, optimusfan wrote:
> ... To accomplish this, we used AdaptiveLogisticRegression and trained 46
> binary classification models. Our approach has been to do an 80/20 split
> on the data, holding the 20% back for cross-validation of the models we
> generate.
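For reference, the 80/20 hold-out and the one-binary-model-per-category setup described in the quoted text can be sketched as below. This is plain Python rather than Mahout code; the function names and the fixed seed are assumptions made for the illustration, and it says nothing about how AdaptiveLogisticRegression itself is trained.

```python
import random


def split_80_20(examples, seed=42):
    """Shuffle a copy of the examples and hold back the last 20%."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def binary_targets(label_sets, category):
    """One-vs-rest targets: 1 if the document carries the category, else 0."""
    return [1 if category in labels else 0 for labels in label_sets]
```

With 46 categories, `binary_targets` would be called once per category to produce the target column for that category's binary model.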
Do you want to cluster users or items?
For items, the vectorization that you suggest will work reasonably well,
especially if you use TF.IDF weighting and normalize the resulting vectors.
You can also use one of the matrix decomposition techniques and cluster the
resulting vectors. The spectral
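A minimal sketch of the TF.IDF weighting and normalization step mentioned above, in plain Python. It uses one common variant (raw term count times log(N / document frequency), then L2 normalization); this is an assumption for illustration and not necessarily the weighting Mahout applies. The function name and the token-list input format are also hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one L2-normalized sparse dict per doc."""
    n = len(docs)
    # Document frequency: number of docs in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: count * math.log(n / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```

Terms that occur in every document get weight zero under this variant, so they do not influence the clustering.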
I would probably write a script to parse that out and stream to it from Pig.
http://pig.apache.org/docs/r0.11.0/basic.html#stream
On Mon, Dec 2, 2013 at 4:30 PM, Sameer Tilak wrote:
> I am looking for some input on how to vectorize my data.
>
> > From: ssti...@live.com
> > To: user@mahout.apac
I am looking for some input on how to vectorize my data.
> From: ssti...@live.com
> To: user@mahout.apache.org
> Subject: Mahout for clustering
> Date: Mon, 2 Dec 2013 16:22:03 -0800
>
>
>
>
> Hi All, We are using Apache Pig for building our data pipeline. We have data
> in the following fash
Cool! I am using it for sequence file reading so will be happy to look into it.
> Date: Mon, 2 Dec 2013 16:14:23 -0800
> Subject: Re: Pig vector project
> From: andrew.mussel...@gmail.com
> To: user@mahout.apache.org
>
> You might also look into elephant-bird from Twitter; covers a lot of ground
Elephant bird is distinctly superior to Pig Vector for many things (it
moved forward, Pig Vector did not).
I believe there is also a Twitter internal project known as PigML which is
much more what Pig Vector wanted to be.
There is also https://github.com/hanborq/pigml, but I think it is very
diffe
Hi All, We are using Apache Pig for building our data pipeline. We have data in
the following fashion:
userid, age, items {code 1, code 2, ….}, few other features...
Each item has a unique alphanumeric code. I would like to use mahout for
clustering it. Based on my current reading I see foll
You might also look into elephant-bird from Twitter; covers a lot of ground.
https://github.com/kevinweil/elephant-bird
On Mon, Dec 2, 2013 at 4:10 PM, Sameer Tilak wrote:
>
>
>
> Hi All, We are using Pig to build our data pipeline.
> I came across the following: https://github.com/tdunning/pig
Hi All, We are using Pig to build our data pipeline.
I came across the following: https://github.com/tdunning/pig-vector
The last commit was 2 years ago. Is there any information on whether there
will be further work on this project?
The key first question is how you plan to compute similarity between data
points. It isn't clear how you should do this with your data.
On Mon, Dec 2, 2013 at 1:31 AM, Angelo Immediata wrote:
> Hi
>
> I'm quite new to machine learning and above all to Apache Mahout, so
> pardon my l
Peter,
What you say is a bit confusing to me.
You say you have centers already. But then you talk about algorithms which
find the centers.
Also, you say you want to assign points based on centers, but you also say
that clusters have different shapes, area, size and point count. Do you
mean tha
Gokhan
On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning wrote:
> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> vishal.santo...@gmail.com>
>
> >
> >
> > Are we to assume that SGD is still a work in progress and
> implementations (
> > Cross Fold, Online, Adaptive ) are too flawed to be realis
Hi,
I'm trying to build a web service for Mahout.
If I use GlassFish 3.x there is no problem. However, when I change to
GlassFish 4, during deployment the following error occurs: (by the way,
GlassFish 4 works fine for my other web service implementations, my
suspicion is that there is some inconsi
Hi all,
In some recommender applications the system might recommend already
consumed items. For example, a hotel recommendation site might recommend
hotel A to a user who already stayed at hotel A before.
In order to recommend already consumed items we have to rank all of the
items (consumed and
Train and test set are in single files (part-r-0). Training file is
30MB and testing file is 2MB.
2013/12/2 Fernando Santos
> Hello Ted,
>
> No, the training also ran on one machine. What happens sometimes is that
> each box executes one job at a time, but not together. For example, if
>
Hello Ted,
No, the training also ran on one machine. What happens sometimes is that
each box executes one job at a time, but not together. For example, if
it has to run 3 jobs, it runs the first job on box 1, the next on box 2 and
the next on box 1 again.
The full dataset is a csv around 70MB. I t
Ted-
Thanks for the response. Just getting back after the holiday weekend and am
catching up on this. Let me be more specific in what we're doing and what
we're seeing in terms of results. Our goal was to create a classifier that
could assign one or more of 46 categories to various document
Hi All, I posted a troubleshooting question about Mahout canopy generation
last week; however, the problem was not solved. The message below
is the error I received. I'm trying to run canopy generation on about 900
MB worth of data. There are an estimated 120,000 vectors.
I'm currently runnin
I believe this was something that was to be fixed when the old Naive Bayes code
was replaced by the present implementation.
See MAHOUT-1001 for more info and history on this.
On Monday, December 2, 2013 7:55 AM, tuku wrote:
Hello,
I searched the wiki and the web but couldn't find the reaso
Hello,
I searched the wiki and the web but couldn't find the reason why theta
normalization is commented out for naive Bayes classification.
There is a TODO comment on top that states this will be enabled soon.
Is there any schedule for this?
Does anyone know the reason not to use theta normalization?
Hi
I'm quite new to machine learning and above all to Apache Mahout, so
pardon my low-level questions.
I need to do some cluster analysis on some data. At the beginning
this data may not be very large, but after some time it can become really
huge (I did some calculation and after
Hi there,
I have no experience with Mahout but I know that it will solve my
problem :) !
I have the following requirements:
* No Hadoop setup should be necessary. I want a simple approach and I
know this is possible with Mahout!
* I have lots of points (~100 million) but also some RAM (32GB)