Re: San Francisco/Bay Area Mahout users group

2011-09-16 Thread Ted Dunning
Glad to hear that there could be a Chicago meetup. I doubt I will be there at the right time, but it is too cool to have enough interest in more than one city. I definitely will not have time next week in the bay area. I am lucky enough to have seen Grant recently elsewhere. On Fri, Sep 16, 201

Re: Graph Output formats

2011-09-16 Thread Ted Dunning
Indeed. I strongly prefer the other two for expressivity. On Fri, Sep 16, 2011 at 4:37 PM, Jake Mannix wrote: > On Fri, Sep 16, 2011 at 3:30 PM, Ted Dunning > wrote: > > > I think that Avro and protobufs are the current best options for large > data > > assets like this. > > > > (or serialized

Re: Apache Giraph?

2011-09-16 Thread Jake Mannix
On Fri, Sep 16, 2011 at 3:36 PM, Ted Dunning wrote: > Returning something halves performance or worse since you can't fire and > forget. In Pregel style, you should expect the message to be processed in > the next super step and a value returned in the super step after that. > I guess it depend

Re: Graph Output formats

2011-09-16 Thread Jake Mannix
On Fri, Sep 16, 2011 at 3:30 PM, Ted Dunning wrote: > I think that Avro and protobufs are the current best options for large data > assets like this. > (or serialized Thrift)

Re: Apache Giraph?

2011-09-16 Thread Ted Dunning
Returning something halves performance or worse since you can't fire and forget. In Pregel style, you should expect the message to be processed in the next super step and a value returned in the super step after that. On Fri, Sep 16, 2011 at 2:31 PM, Jake Mannix wrote: > On Fri, Sep 16, 2011 at
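
The superstep timing described above can be illustrated with a toy simulation. This is only a sketch of the general bulk-synchronous idea, not Giraph's or Pregel's actual API: the function `run_supersteps`, the vertex names "A" and "B", and the message strings are all made up for illustration. The one rule it assumes is that a message sent in superstep s is delivered in superstep s + 1.

```python
from collections import defaultdict

def run_supersteps(compute, initial_messages, max_steps=10):
    """Toy BSP loop: messages sent in superstep s are delivered in s + 1."""
    inbox = defaultdict(list)
    for target, msg in initial_messages:
        inbox[target].append(msg)
    log = []  # one (superstep, vertex, message) entry per delivery
    for step in range(max_steps):
        if not inbox:
            break  # no pending messages, so the computation halts
        outbox = defaultdict(list)
        for vertex, msgs in sorted(inbox.items()):
            for msg in msgs:
                log.append((step, vertex, msg))
                # compute() returns the (target, message) pairs to send
                for target, out in compute(vertex, msg):
                    outbox[target].append(out)
        inbox = outbox  # everything sent now arrives next superstep
    return log

def fire_and_forget(vertex, msg):
    # A sends an update to B and does not wait for anything back.
    if vertex == "A" and msg == "start":
        return [("B", "update")]
    return []

def request_reply(vertex, msg):
    # A asks B for a value; B answers; A sees the answer one step later.
    if vertex == "A" and msg == "start":
        return [("B", "request")]
    if vertex == "B" and msg == "request":
        return [("A", "reply")]
    return []

one_way = run_supersteps(fire_and_forget, [("A", "start")])
round_trip = run_supersteps(request_reply, [("A", "start")])
print(one_way[-1][0], round_trip[-1][0])
```

In this toy model the one-way update is finished once B handles it in superstep 1, while the round trip only completes when A receives the reply in superstep 2, which is the extra cost of returning a value.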

Re: Graph Output formats

2011-09-16 Thread Ted Dunning
I think that Avro and protobufs are the current best options for large data assets like this. On Fri, Sep 16, 2011 at 2:44 PM, Jake Mannix wrote: > Can I vote for whichever one isn't based on XML? :) > > I really can't imagine encoding a 10-billion node graph in XML. Or rather, > I can, and I'm

Re: Graph Output formats

2011-09-16 Thread Lance Norskog
What's your displayer? And what formats does it use? On Fri, Sep 16, 2011 at 2:29 PM, Grant Ingersoll wrote: > Yeah, I hear you. I've actually just modeled it like our VectorWriter and > it will be pluggable. I'm likely just going to do CSV and GML to start (the > latter being XML) Maybe we ne

Re: Graph Output formats

2011-09-16 Thread Grant Ingersoll
Yeah, I hear you. I've actually just modeled it like our VectorWriter and it will be pluggable. I'm likely just going to do CSV and GML to start (the latter being XML) Maybe we need YAGF (yet another graph format)? I used to do a lot of NLP processing and output XML and always felt like what

Re: Graph Output formats

2011-09-16 Thread Tanton Gibbs
I have used XML to represent very large graphs (billions of nodes). It is as bad as you would imagine. On Fri, Sep 16, 2011 at 1:44 PM, Jake Mannix wrote: > Can I vote for whichever one isn't based on XML? :) > > I really can't imagine encoding a 10-billion node graph in XML.  Or rather, > I can,

[jira] [Commented] (MAHOUT-811) Mahout examples try to write to examples/bin/work, which may not be writeable by current user

2011-09-16 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106811#comment-13106811 ] Hudson commented on MAHOUT-811: --- Integrated in Mahout-Quality #1043 (See [https://builds.ap

Re: Graph Output formats

2011-09-16 Thread Jake Mannix
Can I vote for whichever one isn't based on XML? :) I really can't imagine encoding a 10-billion node graph in XML. Or rather, I can, and I'm skeered. On Fri, Sep 16, 2011 at 1:02 PM, Grant Ingersoll wrote: > I'm going to write a converter to dump out clusters and their points to a > graph

Re: Apache Giraph?

2011-09-16 Thread Jake Mannix
On Fri, Sep 16, 2011 at 1:24 PM, Ted Dunning wrote: > Well, distributed memory to me would have fetch and store operations. Here > we can send a message, but we can't actually fetch or store data without > cooperation. > Funny you mention that - I've been considering suggesting that Giraph modi

Re: Apache Giraph?

2011-09-16 Thread Ted Dunning
Well, distributed memory to me would have fetch and store operations. Here we can send a message, but we can't actually fetch or store data without cooperation. On Fri, Sep 16, 2011 at 4:45 AM, Grant Ingersoll wrote: > > On Sep 16, 2011, at 12:27 AM, Ted Dunning wrote: > > > Actually, I don't th

Graph Output formats

2011-09-16 Thread Grant Ingersoll
I'm going to write a converter to dump out clusters and their points to a graph structure so they can be displayed. Gephi (and others) supports a myriad of formats: http://gephi.org/users/supported-graph-formats/ * GEXF * GDF * GML * GraphML * Pajek NET * GraphViz DOT * CSV * UCINET DL
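
As a rough illustration of the simplest option in that list, the sketch below writes clusters and their points as a CSV edge list, one "cluster,point" pair per line. It is not the Mahout converter being discussed: the function name `write_cluster_edges`, the dict-of-lists input, and the ids are assumptions made for this example, and whether a particular viewer accepts this exact layout should be checked against its documentation.

```python
import csv
import io

def write_cluster_edges(clusters, out):
    """Write one 'cluster_id,point_id' edge per line to the file-like object `out`."""
    writer = csv.writer(out, lineterminator="\n")
    for cluster_id, point_ids in clusters.items():
        for point_id in point_ids:
            writer.writerow([cluster_id, point_id])

# Hypothetical clustering result: cluster id -> ids of the points assigned to it.
buf = io.StringIO()
write_cluster_edges({"c0": ["p1", "p2"], "c1": ["p3"]}, buf)
print(buf.getvalue(), end="")
```

Because each edge is written as soon as it is produced, nothing about the graph has to be held in memory beyond the current cluster, which matters for the very large graphs raised later in the thread.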

[jira] [Commented] (MAHOUT-811) Mahout examples try to write to examples/bin/work, which may not be writeable by current user

2011-09-16 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106661#comment-13106661 ] Hudson commented on MAHOUT-811: --- Integrated in Mahout-Quality #1042 (See [https://builds.ap

[jira] [Commented] (MAHOUT-811) Mahout examples try to write to examples/bin/work, which may not be writeable by current user

2011-09-16 Thread Andrew Bayer (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106637#comment-13106637 ] Andrew Bayer commented on MAHOUT-811: - Yeah, I kept the rm -rf for consistency, but ch

[jira] [Commented] (MAHOUT-811) Mahout examples try to write to examples/bin/work, which may not be writeable by current user

2011-09-16 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106617#comment-13106617 ] Drew Farris commented on MAHOUT-811: {quote} Should be easy enough to do this without

[jira] [Commented] (MAHOUT-811) Mahout examples try to write to examples/bin/work, which may not be writeable by current user

2011-09-16 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13106535#comment-13106535 ] Sean Owen commented on MAHOUT-811: -- Should be easy enough to do this without any cd-ing a

RE: San Francisco/Bay Area Mahout users group

2011-09-16 Thread Alan Said
+1 for Chicago. October 28th? /Alan -- *** M.Sc.(Eng.) Alan Said Competence Center Information Retrieval & Machine Learning Technische Universität Berlin DAI-Labor Sekr. TEL 14 Ernst-Reuter-Platz 7 10587 Berlin / Germany Phone: 0049 - 30 - 314 74072 Fax:0

Re: San Francisco/Bay Area Mahout users group

2011-09-16 Thread Steven Bourke
How about one at RecSys in Chicago in October (recsys.acm.org)? There are definitely other researchers using Mahout, and some industry folks will be there too. I'll be attending the conference. On Fri, Sep 16, 2011 at 4:04 PM, Grant Ingersoll wrote: > We do from time to time, but they are usually ad h

Re: San Francisco/Bay Area Mahout users group

2011-09-16 Thread Grant Ingersoll
We do from time to time, but they are usually ad hoc at this point (usually when I am in town, which happens to be next week, although I don't think I have time to get together) On Sep 16, 2011, at 10:30 AM, Dhruv Kumar wrote: > Are there any events regularly scheduled in/near San Francisco for

San Francisco/Bay Area Mahout users group

2011-09-16 Thread Dhruv Kumar
Are there any events regularly scheduled in/near San Francisco for the users and devs of Mahout? I will be moving there next week and was curious to know about the networking opportunities with like-minded folks in the coming months.

[jira] [Reopened] (MAHOUT-811) Mahout examples try to write to examples/bin/work, which may not be writeable by current user

2011-09-16 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris reopened MAHOUT-811: Assignee: Drew Farris (was: Sean Owen) This patch introduces another problem, specifically with

Re: how to convert a text file to vector for kmeans

2011-09-16 Thread Grant Ingersoll
You need to specify the Lucene analyzer that will be used to tokenize the text. That being said, I thought there was a default. What version of Mahout are you using? On Sep 16, 2011, at 5:41 AM, Jack He wrote: > I've tried the command below: > mahout seqdirectory -i cluster/testdata -o cluster-se

Re: Apache Giraph?

2011-09-16 Thread Grant Ingersoll
On Sep 16, 2011, at 12:27 AM, Ted Dunning wrote: > Actually, I don't think that these really provide a distributed memory > layer. > > What they provide is multiple iterations without having to renegotiate JVM launches, > local memory that persists across iterations and decent message passing. > (and of

how to convert a text file to vector for kmeans

2011-09-16 Thread Jack He
I've tried the command below: mahout seqdirectory -i cluster/testdata -o cluster-seq -c UTF-8 The input file looks like: 1 2 3 4 5 6 7 8 9 10 11 12 ...etc Then I got a file named chunk-0 in the directory cluster-seq. It's almost the same as the input file. As the next step, I ran the command below: mahout