Re: Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-19 Thread Piero Giacomelli
Dear Simon, thanks for informing us. I am now evaluating Prediction.io for creating a recommendation system. However, as I see it, the license is an Alfresco Limited one, so I do not understand what the limitations are. I mean, if I install prediction and I do make some changes to the source

Re: Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-19 Thread Bertrand Dechoux
Affero GPL: http://en.wikipedia.org/wiki/Affero_General_Public_License Alfresco is something else. It does imply that if you provide someone access to a custom version of the engine, then you must provide the sources. But it is only about the engine, i.e. not the clients, not the configuration, not

Re: Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-19 Thread Piero Giacomelli
Dear Bertrand, Yes, that was what I understood. But I am missing a step. Let us take a practical example. I use the SlopeRecommender engine and I implement its policy on how to evaluate the similarity. In this case I have made a custom version of the engine, right? The SDK is not customized. So my

Re: clusterdump samplePoints parameter

2014-03-19 Thread Terry Blankers
I understand that part. What I'm unclear on is whether there is any ranking or ordering of the points in each cluster before they are limited. In other words, are the points in each cluster randomly ordered? Or ordered alphabetically by the document ID or filename? Or ordered by some calculation as

Problem with mahout seqdirectory

2014-03-19 Thread Natalia Connolly
Hello, I have Mahout 0.9 and a single-node Hadoop 1.2.1 running on a Mac. I am trying to create a bunch of vectors for clustering from a collection of text documents. So I did: $MAHOUT_HOME/bin/mahout seqdirectory --input /Users/hadoop/fuzzyjoin-results/NOTES/progress_notes --output

Re: Problem with mahout seqdirectory

2014-03-19 Thread Pavan Kumar N
Hi Natalia, It appears you are referencing files in your local file system instead of files in HDFS. If you want to run Mahout under Hadoop, you need to access the input file stored in HDFS, and ideally the output would also be stored in an HDFS location. Here's how I would run:

Re: Using SSVD for dimensionality reduction on Mahout

2014-03-19 Thread Dmitriy Lyubimov
I am not sure if we have direct CSV converters to do that; CSV is not that expressive anyway. But it is not difficult to write such a converter on your own, I suppose. The steps you need to do are these: (1) prepare a set of data points in the form of (unique vector key, n-vector) tuples. Vector key

Fwd: Using SSVD for dimensionality reduction on Mahout

2014-03-19 Thread Vijay B
Hi All, I have a CSV file on which I have to perform dimensionality reduction. I'm new to Mahout; after some searching I understood that SSVD can be used for dimensionality reduction. I'm not sure of the steps that have to be executed before SSVD; please help me. Thanks, Vijay

Re: Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-19 Thread Pat Ferrel
I looked at the docs and the AGPL for the server is a problem for me, maybe even a blocker. Since the SDK is useless without the server, this may be a problem for you. I like the SDK idea. The alternative is logfiles to store prefs (not a bad architecture really) and a grow-your-own method for

Re: Using SSVD for dimensionality reduction on Mahout

2014-03-19 Thread Dmitriy Lyubimov
PS. The dspca method, which is an almost exact replica of SSVD --pca true, is also available on Spark, running on exactly the same sequence file DRM (there's no CLI though; it needs to be wrapped in Scala code) [1]. It may potentially perform a bit better than the MR version, although it is new. If you

Re: Using SSVD for dimensionality reduction on Mahout

2014-03-19 Thread Vijay B
Thanks a lot for the detailed explanation; it was very helpful. I will write a CSV-to-sequence-file converter, I just need some clarity on the key/value pairs in the sequence file. Suppose my CSV file contains the values below: 11,22,33,44,55 13,23,34,45,56 I assume that the sequence file would look
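
[Editorial sketch, not part of the original message.] The first step described in this thread, turning CSV rows into (unique vector key, n-vector) tuples, could look roughly like the Python below. It uses the two sample rows quoted above. The function name `rows_to_keyed_vectors` is hypothetical, and using the zero-based row index as the unique key is an assumption, since the thread does not say which key to use. Writing the tuples to a Hadoop sequence file is left out here.

```python
import csv
import io


def rows_to_keyed_vectors(csv_text):
    """Parse CSV text into (key, vector) tuples.

    Assumption: the zero-based row index serves as the unique vector key.
    Any other unique identifier would work equally well.
    """
    reader = csv.reader(io.StringIO(csv_text))
    pairs = []
    for row_index, row in enumerate(reader):
        if not row:
            # Skip blank lines rather than emitting empty vectors.
            continue
        pairs.append((row_index, [float(value) for value in row]))
    return pairs


# The two sample rows from the message, one CSV row per line.
sample = "11,22,33,44,55\n13,23,34,45,56\n"
pairs = rows_to_keyed_vectors(sample)

for key, vector in pairs:
    print(key, vector)
```

Each tuple would then become one key/value record in the sequence file, with the key identifying the row and the value holding that row's vector.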

Re: Text clustering with hashing vector encoders

2014-03-19 Thread Frank Scholten
On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Hashing vector encoders will preserve distances when used with multiple probes. So if a token occurs two times in a document the first token will be mapped to a given location and when the token is hashed the

Re: Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-19 Thread Simon Chan
Dear Piero, The AGPL is meant to encourage people who develop on PredictionIO to contribute back to the open community, even when they are using it to offer cloud services. We are seriously looking into the possibility of separating custom engines/algorithms from the main server code, so that

Re: Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-19 Thread Simon Chan
Thanks for the feedback. Happy to discuss how we can resolve the AGPL limitation for your work. There are a few Ruby gems for PredictionIO, contributed by developers as well as supported by the PredictionIO team, that you can choose from. We hope that the UI can help developers manage the data

Re: Text clustering with hashing vector encoders

2014-03-19 Thread Ted Dunning
With text hashing, you have an issue because of collisions. In spite of this, you get good results and can decrease the dimension of the data substantially using a single hashed location. If you use more than one probe, the probability that two words will hash to exactly the same two locations
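
[Editorial sketch, not part of the original message.] The idea of hashing each token to more than one location ("probes") can be illustrated with the minimal Python below. This is not Mahout's encoder implementation: the helper names `hashed_locations` and `encode`, the MD5-based hash, and the dimension of 64 are all assumptions chosen for illustration.

```python
import hashlib


def hashed_locations(token, dim, probes):
    """Return one bucket index per probe for a token (hypothetical helper).

    Each probe salts the token with its probe number, so a token maps to
    several locations that are unlikely to all collide with another token's.
    """
    locations = []
    for probe in range(probes):
        digest = hashlib.md5(f"{probe}:{token}".encode("utf-8")).digest()
        locations.append(int.from_bytes(digest[:8], "big") % dim)
    return locations


def encode(tokens, dim=64, probes=2):
    """Encode a token list as a fixed-size count vector using hashing."""
    vector = [0.0] * dim
    for token in tokens:
        for location in hashed_locations(token, dim, probes):
            vector[location] += 1.0
    return vector


doc = ["mahout", "cluster", "mahout"]
vec = encode(doc)
print(hashed_locations("mahout", 64, 2))
print(sum(vec))
```

With a single probe, two different words share a location whenever their one hash collides. With two probes, both hashes would have to collide for the words to become indistinguishable, which is much less likely.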

Re: Text clustering with hashing vector encoders

2014-03-19 Thread Ted Dunning
On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten fr...@frankscholten.nl wrote: On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Hashing vector encoders will preserve distances when used with multiple probes. So if a token occurs two times in a document

Re: Introducing PredictionIO: A developer-friendly Mahout stack for production

2014-03-19 Thread Ted Dunning
AGPL is a complete show-stopper for contributions even for dependencies. Apache software can't critically depend on GPL components of any sort. As such, it doesn't make any sense to have components of Mahout designed to run only on a server that is AGPL. On Wed, Mar 19, 2014 at 11:53 AM,

Re: debug mode

2014-03-19 Thread Pat Ferrel
If you are using a debugger like IntelliJ or Eclipse you just create a project that uses Mahout. By default it will run any Hadoop on the native local file system with all processes on your debug machine. That is as far as I’ve needed to go. Andrew is talking about how to debug while running