Re: Write SequenceFile from custom data

2013-12-02 Thread Angelo Immediata
Well, similarity between data should be calculated by taking into account the following variables: weather, manifestation, day of the week, month of the year and vacation 2013/12/3 Ted Dunning > The key first question is how you plan to compute similarity between data > points. It isn't clear how you

Build Failure in Eclipse

2013-12-02 Thread Tharindu Rusira
I recently updated my Mahout-0.9 snapshot version code and rebuilt from the terminal. The process was successful with no build errors. But when I try to build Mahout from Eclipse (run as --> Maven build) I get the following build error while Mahout-Integration is being built. Failed to execute goa

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Ted Dunning
Inline On Mon, Dec 2, 2013 at 8:55 AM, optimusfan wrote: > ... To accomplish this, we used AdaptiveLogisticRegression and trained 46 > binary classification models. Our approach has been to do an 80/20 split > on the data, holding the 20% back for cross-validation of the models we > generate.

Re: Mahout for clustering

2013-12-02 Thread Ted Dunning
Do you want to cluster users or items? For items, the vectorization that you suggest will work reasonably well, especially if you use TF.IDF weighting and normalize the resulting vectors. You can also use one of the matrix decomposition techniques and cluster the resulting vectors. The spectral
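A minimal sketch of the TF.IDF weighting and vector normalization mentioned in this reply, written in plain Python rather than with Mahout's own vectorization tools. The function name `tfidf_vectors`, the sample item codes, and the specific IDF form `log(N / df)` are assumptions for illustration; Mahout's implementation may use a different formula.

```python
import math
from collections import Counter

def tfidf_vectors(rows):
    """Return one L2-normalized {token: weight} dict per row of tokens.

    Assumes weight = term count * log(N / document frequency); this is one
    common TF-IDF variant, not necessarily the one Mahout applies.
    """
    n = len(rows)
    df = Counter()
    for tokens in rows:
        df.update(set(tokens))  # count each token once per row
    vectors = []
    for tokens in rows:
        tf = Counter(tokens)
        vec = {t: count * math.log(n / df[t]) for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}  # unit length
        vectors.append(vec)
    return vectors

# Hypothetical item codes, one row per record.
rows = [["A1", "B2", "B2"], ["A1", "C3"], ["A1", "B2", "D4"]]
vectors = tfidf_vectors(rows)
```

A token that occurs in every row (here `A1`) gets weight zero under this IDF form, so it does not influence distances between the normalized vectors.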

Re: Mahout for clustering

2013-12-02 Thread Andrew Musselman
I would probably write a script to parse that out and stream to it from Pig. http://pig.apache.org/docs/r0.11.0/basic.html#stream On Mon, Dec 2, 2013 at 4:30 PM, Sameer Tilak wrote: > I am looking for some input on how to vectorize my data. > > > From: ssti...@live.com > > To: user@mahout.apac

RE: Mahout for clustering

2013-12-02 Thread Sameer Tilak
I am looking for some input on how to vectorize my data. > From: ssti...@live.com > To: user@mahout.apache.org > Subject: Mahout for clustering > Date: Mon, 2 Dec 2013 16:22:03 -0800 > > > > > Hi All, We are using Apache Pig for building our data pipeline. We have data > in the following fash

RE: Pig vector project

2013-12-02 Thread Sameer Tilak
Cool! I am using it for sequence file reading so will be happy to look into it. > Date: Mon, 2 Dec 2013 16:14:23 -0800 > Subject: Re: Pig vector project > From: andrew.mussel...@gmail.com > To: user@mahout.apache.org > > You might also look into elephant-bird from Twitter; covers a lot of ground

Re: Pig vector project

2013-12-02 Thread Ted Dunning
Elephant bird is distinctly superior to Pig Vector for many things (it moved forward, Pig Vector did not). I believe there is also a Twitter internal project known as PigML which is much more what Pig Vector wanted to be. There is also https://github.com/hanborq/pigml, but I think it is very diffe

Mahout for clustering

2013-12-02 Thread Sameer Tilak
Hi All, We are using Apache Pig for building our data pipeline. We have data in the following fashion: userid, age, items {code 1, code 2, ….}, few other features... Each item has a unique alphanumeric code. I would like to use Mahout for clustering it. Based on my current reading I see foll

Re: Pig vector project

2013-12-02 Thread Andrew Musselman
You might also look into elephant-bird from Twitter; covers a lot of ground. https://github.com/kevinweil/elephant-bird On Mon, Dec 2, 2013 at 4:10 PM, Sameer Tilak wrote: > > > > Hi All, We are using Pig to build our data pipeline. > I came across the following: https://github.com/tdunning/pig

Pig vector project

2013-12-02 Thread Sameer Tilak
Hi All, We are using Pig to build our data pipeline. I came across the following: https://github.com/tdunning/pig-vector The last commit was 2 years ago. Is there any information on whether there will be further work on this project?

Re: Write SequenceFile from custom data

2013-12-02 Thread Ted Dunning
The key first question is how you plan to compute similarity between data points. It isn't clear how you should do this with your data. On Mon, Dec 2, 2013 at 1:31 AM, Angelo Immediata wrote: > Hi > > I'm pretty newbie regarding machine learning and above all Apache Mahout, so > pardon me my l

Re: Clustering Spatial Data

2013-12-02 Thread Ted Dunning
Peter, What you say is a bit confusing to me. You say you have centers already. But then you talk about algorithms which find the centers. Also, you say you want to assign points based on centers, but you also say that clusters have different shapes, area, size and point count. Do you mean tha

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Gokhan Capan
Gokhan On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning wrote: > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi < > vishal.santo...@gmail.com> > > > > > > > Are we to assume that SGD is still a work in progress and > implementations ( > > Cross Fold, Online, Adaptive ) are too flawed to be realis

Mahout Web Service GlassFish 4 Deployment Error

2013-12-02 Thread Mario Levitin
Hi, I'm trying to build a web service for Mahout. If I use GlassFish 3.x there is no problem. However, when I change to GlassFish 4, during deployment the following error occurs: (by the way, GlassFish 4 works fine for my other web service implementations, my suspicion is that there is some inconsi

Recommending already consumed items

2013-12-02 Thread Mario Levitin
Hi all, In some recommender applications the system might recommend already consumed items. For example, a hotel recommendation site might recommend hotel A to a user who already stayed at hotel A before. In order to recommend already consumed items we have to rank all of the items (consumed and

Re: Test naivebayes task running really slowly and not in distributed mode

2013-12-02 Thread Fernando Santos
Train and test sets are in single files (part-r-0). Training file is 30MB and testing file is 2MB. 2013/12/2 Fernando Santos > Hello Ted, > > No, the training ran also in one machine. What happens sometimes is that > each box executes one job at a time, but not together. For example, if >

Re: Test naivebayes task running really slowly and not in distributed mode

2013-12-02 Thread Fernando Santos
Hello Ted, No, the training also ran on one machine. What happens sometimes is that each box executes one job at a time, but not together. For example, if it will run 3 jobs, it runs the first job in box1, the next in box2 and the next in box 1 again. The full dataset is a CSV around 70MB. I t

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread optimusfan
Ted- Thanks for the response. Just getting back after the holiday weekend and am catching up on this. Let me be more specific in what we're doing and what we're seeing in terms of results. Our goal was to create a classifier that could assign one or more of 46 categories to various document

Canopy generation out of memory troubleshooting

2013-12-02 Thread Chih-Hsien Wu
Hi All, I posted a troubleshooting question about Mahout canopy generation last week; however, I didn't get the problem solved. The message below is the error I received. I'm trying to run canopy generation on about 900 MB of data. There are an estimated 120,000 vectors. I'm currently runnin

Re: theta normalization for naive bayes is commented out

2013-12-02 Thread Suneel Marthi
I believe this was something that was to be fixed when the old Naive Bayes code was replaced by the present implementation. See Mahout-1001 for more info and history on this. On Monday, December 2, 2013 7:55 AM, tuku wrote: hello; i searched wiki and the web but couldn't find the reaso

theta normalization for naive bayes is commented out

2013-12-02 Thread tuku
Hello; I searched the wiki and the web but couldn't find the reason why theta normalization is commented out for naive bayes classification. There is a todo comment on top that states this will be enabled soon. Is there any schedule for this? Does anyone know the reason not to use theta normalization?

Write SequenceFile from custom data

2013-12-02 Thread Angelo Immediata
Hi, I'm pretty new to machine learning and above all to Apache Mahout, so pardon my low-level questions. I need to do some cluster analysis on some data. At the beginning this data may not be very large, but after some time it can become really huge (I did some calculation and after

Clustering Spatial Data

2013-12-02 Thread Peter K
Hi there, I have no experience with Mahout but I know that it will solve my problem :) ! I have the following requirements: * No Hadoop setup should be necessary. I want a simple approach and I know this is possible with Mahout! * I have lots of points (~100 million) but also some RAM (32GB)