Cool, will look at this after the release.


On Feb 11, 2009, at 10:09, prasenjit mukherjee <[email protected]> wrote:

So I created a JIRA issue:
https://issues.apache.org/jira/browse/MAHOUT-106 and also submitted a
patch along with README instructions. Please feel free to try it out with
different input samples. The default behaviour is to run Pig in local
mode. I'd appreciate any suggestions/reviews.

-Prasen

On Wed, Feb 11, 2009 at 5:32 PM, Grant Ingersoll <[email protected]> wrote:
This is excellent, Prasen.

I see no reason not to include them. We are about ML first,
distributed/scalable ML second, and Hadoop-based third, IMO. Java would be a distant fourth in my mind. In other words, I don't feel particularly strongly about us being Java-only or even Hadoop-only. To me there is a significant need for community-developed machine learning capabilities with a commercially friendly license. Add in the ability to scale/run efficiently and you have
a home run.  In fact, those are the very reasons we founded Mahout.


On Feb 11, 2009, at 6:40 AM, prasenjit mukherjee wrote:

Pig is a higher-level language on top of Hadoop (more like Sawzall for
Google's MapReduce) which makes Hadoop easy to use.

It has SQL-like syntax and can break a command into separate
mapreduce tasks and also chain them. From an execution point of view,
a Pig script is as simple to run as a shell script, with very few
operators/commands.

Some of its commands are JOIN, GROUP, COGROUP, LOAD, etc.

For example, the following Pig script takes a logfile in the format
<txid>,<txt>,<user> and outputs a user-term-frequency file in the
following format: <user>\t<txt>\t<cnt>

raw = LOAD 'tx_log.csv' USING PigStorage(',') AS
    (transactionid:chararray, txt:chararray, user:chararray);
tokenized = FOREACH raw GENERATE user, FLATTEN(TOKENIZE(txt)) AS attribute;
grouped = GROUP tokenized BY (user, attribute);
user_term_freq = FOREACH grouped GENERATE FLATTEN(group), COUNT(tokenized);
STORE user_term_freq INTO 'user_term_freq.txt';

At runtime, Pig takes the input and breaks it into several map and
reduce tasks. It picks up hadoop-site.xml from its classpath.
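For anyone wanting to try it, a typical invocation would look like the following (a sketch; 'plsi.pig' is a hypothetical filename for the script above, and it assumes the pig launcher is on the PATH):

```shell
# Local mode (the default behaviour described in the patch);
# no Hadoop cluster needed.
pig -x local plsi.pig

# MapReduce mode: Pig reads hadoop-site.xml from its classpath and
# submits the generated map/reduce jobs to the configured cluster.
pig -x mapreduce plsi.pig
```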

-Prasen

On Wed, Feb 11, 2009 at 4:54 PM, Sean Owen <[email protected]> wrote:

Needs to go somewhere like trunk/core/src/pig/main, right, versus /java/?

I also see no harm in adding it, other than that it would remain
pretty isolated, right? It isn't part of the build, can't be integrated
with the other code, etc. Does it add value to package it with the
project, then?

Perhaps I misunderstand what Pig can do or how it can relate to Java?

On Wed, Feb 11, 2009 at 11:13 AM, Grant Ingersoll <[email protected]> wrote:

Hmm, hadn't really thought about it, but I see no reason why we wouldn't accept it and add it. I think our source tree can definitely handle it.

I'd propose it go somewhere under:
trunk/core/src/main/pig/plsi

I'm not familiar with Pig, but I can learn, and I know others are. Is it
a single file?

See http://cwiki.apache.org/MAHOUT/howtocontribute.html for instructions
on contributing.  Basically, just attach the file(s) to a JIRA issue.

On Feb 11, 2009, at 2:18 AM, prasenjit mukherjee wrote:

Hi,
I have implemented Hofmann's PLSI/EM algorithm in Pig, which I would like
to contribute back to the community for further
scrutiny/improvement. Let me know if Mahout is the appropriate
forum, or whether it should go to the Pig project.

I haven't seen any non-Java contributions to Mahout yet, which raises the
question: is Mahout Java-only?

-Thanks,
Prasen

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search





