Cool, will look at this after the release.


On Feb 11, 2009, at 10:09, prasenjit mukherjee <[email protected]> wrote:

So I created a JIRA issue:
https://issues.apache.org/jira/browse/MAHOUT-106 and also submitted a
patch along with README instructions. Please feel free to try it out with
different input samples. The default behaviour is to run Pig in local
mode. I'd appreciate any suggestions/reviews.

-Prasen

On Wed, Feb 11, 2009 at 5:32 PM, Grant Ingersoll <[email protected]> wrote:
This is excellent, Prasen.

I see no reason not to include them. We are about ML first,
distributed/scalable ML second, and Hadoop-based third, IMO. Java would be a distant fourth in my mind. In other words, I don't feel particularly strongly about us being Java-only or even Hadoop-only. To me there is a significant need for community-developed machine learning capabilities with a commercially friendly license. Add in the ability to scale/run efficiently and you have
a home run.  In fact, those are the very reasons we founded Mahout.


On Feb 11, 2009, at 6:40 AM, prasenjit mukherjee wrote:

Pig is a higher-level language on top of Hadoop (more like Sawzall for
Google's MapReduce) which makes Hadoop easy to use.

It has SQL-like syntax and can break a command into separate
mapreduce tasks and also chain them. From an execution point of view,
a Pig script is as simple to run as a shell script, with very few
operators/commands.

Some of its commands are JOIN, GROUP, COGROUP, LOAD, etc.

For example, the following Pig script takes a logfile in the format
<txid>,<txt>,<user> and outputs a user-term-frequency file in the
following format: <user>\t<txt>\t<cnt>

raw = LOAD 'tx_log.csv' USING PigStorage(',') AS
    (transactionid:chararray, txt:chararray, user:chararray);
tokenized = FOREACH raw GENERATE user, FLATTEN(TOKENIZE(txt)) AS attribute;
grouped = GROUP tokenized BY (user, attribute);
user_term_freq = FOREACH grouped GENERATE FLATTEN(group), COUNT(tokenized);
STORE user_term_freq INTO 'user_term_freq.txt';

At runtime, Pig takes the input and breaks it into several map and
reduce tasks. It picks up hadoop-site.xml from its classpath.
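For anyone wanting to try it, a typical invocation would look like the following (a sketch; 'plsi.pig' is a hypothetical filename for the script above, and it assumes the pig launcher is on the PATH):

```shell
# Local mode (the default behaviour described in the patch);
# no Hadoop cluster needed.
pig -x local plsi.pig

# MapReduce mode: Pig reads hadoop-site.xml from its classpath and
# submits the generated map/reduce jobs to the configured cluster.
pig -x mapreduce plsi.pig
```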

-Prasen

On Wed, Feb 11, 2009 at 4:54 PM, Sean Owen <[email protected]> wrote:

Needs to go somewhere like trunk/core/src/pig/main, right, versus /java/?

I also see no harm in adding it, other than that it would remain
pretty isolated, right? It isn't part of the build, can't be integrated
with the other code, etc. Does it add value to package it with the
project, then?

Perhaps I misunderstand what Pig can do or how it can relate to Java?

On Wed, Feb 11, 2009 at 11:13 AM, Grant Ingersoll <[email protected]> wrote:

Hmm, hadn't really thought about it, but I see no reason why we wouldn't accept it and add it. I think our source tree can definitely handle it.

I'd propose it go somewhere under:
trunk/core/src/main/pig/plsi

I'm not familiar with Pig, but I can learn, and I know others are. Is it
a single file?

See http://cwiki.apache.org/MAHOUT/howtocontribute.html for instructions
on contributing.  Basically, just attach the file(s) to a JIRA issue.

On Feb 11, 2009, at 2:18 AM, prasenjit mukherjee wrote:

Hi,
I have implemented Hofmann's PLSI/EM algorithm in Pig, which I would like
to contribute back to the community for further
scrutiny/improvement. Let me know if Mahout is the appropriate
forum, or whether it should go to the Pig project.

I haven't seen any non-Java contributions to Mahout yet, which raises the
question: is Mahout Java-only?

-Thanks,
Prasen

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search





