Re: Tuning LDA on Reuters

2010-05-19 Thread Grant Ingersoll
You might find http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28 informative. (BTW, LDA is only meant to run w/ TF) -Grant On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:

Re: Tuning LDA on Reuters

2010-05-19 Thread Drew Farris
On Wed, May 19, 2010 at 10:10 PM, Drew Farris wrote: > Of course this doesn't really address the root problem however -- why LDA > on reuters is slow. How long is it taking to run? > > Drew nm, saw it in the JIRA issue (5.5min vs. 1.5min)

Re: Tuning LDA on Reuters

2010-05-19 Thread Drew Farris
Jeff, Just curious, have you tried: ./bin/mahout seq2sparse -Dmapred.reduce.tasks=2 -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq The mahout script (MahoutDriver) allows arbitrary hadoop properties to be specified via -D arguments which a

[jira] Updated: (MAHOUT-397) SparseVectorsFromSequenceFiles only outputs a single vector file

2010-05-19 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-397: Status: Patch Available (was: Open) patch submitted runs on r946508 > SparseVectorsFromSequenceFi

[jira] Updated: (MAHOUT-397) SparseVectorsFromSequenceFiles only outputs a single vector file

2010-05-19 Thread Jeff Eastman (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Eastman updated MAHOUT-397: Attachment: MAHOUT-397.patch This patch seems to resolve the issue by propagating the number of red

[jira] Created: (MAHOUT-397) SparseVectorsFromSequenceFiles only outputs a single vector file

2010-05-19 Thread Jeff Eastman (JIRA)
SparseVectorsFromSequenceFiles only outputs a single vector file Key: MAHOUT-397 URL: https://issues.apache.org/jira/browse/MAHOUT-397 Project: Mahout Issue Type: Improvement

Re: Tuning LDA on Reuters

2010-05-19 Thread Jeff Eastman
On 5/19/10 3:19 PM, Jeff Eastman wrote: I tried propagating numReducers into its makePartialVectors driver; however, a single reducer is still all I get. I need to figure out how to tickle the elephant to give me more. Note to self: Use a real elephant. Running Hadoop in Eclipse is great f

Re: Tuning LDA on Reuters

2010-05-19 Thread Jeff Eastman
On 5/19/10 1:49 PM, Drew Farris wrote: On Wed, May 19, 2010 at 3:49 PM, Jeff Eastman wrote: I cannot imagine how one could ever get LDA to scale if it is always limited to a single input vector file. Is there a way to get multiple output vector files from seqtosparse? I don't know o

Re: Tuning LDA on Reuters

2010-05-19 Thread Drew Farris
On Wed, May 19, 2010 at 3:49 PM, Jeff Eastman wrote: > I cannot imagine how one could ever get LDA to scale if it is always > limited to a single input vector file. Is there a way to get multiple output > vector files from seqtosparse? > I don't know offhand, but is the default input split (mapr

Tuning LDA on Reuters

2010-05-19 Thread Jeff Eastman
I ran the Reuters dataset against LDA yesterday on a 2-node cluster and it took a really long time to converge (100 iterations * 10 min ea) extracting 20 topics. I was able to reduce the iteration time by 50% by using just TF and SeqAccSparseVectors but it was still only using a single mapper a

[jira] Commented: (MAHOUT-383) Investigate possibility of integration with Neuroph neural-net library

2010-05-19 Thread Zoran Sevarac (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869050#action_12869050 ] Zoran Sevarac commented on MAHOUT-383: -- Just to let you know that we've released the Neurop

[jira] Commented: (MAHOUT-364) [GSOC] Proposal to implement Neural Network with backpropagation learning on Hadoop

2010-05-19 Thread Zoran Sevarac (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869048#action_12869048 ] Zoran Sevarac commented on MAHOUT-364: -- Hi. Just to let you know that we've released the N