[jira] Updated: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-167:
-

  Component/s: Matrix
   Genetic Algorithms
   Frequent Itemset/Association Rule Mining
   Collaborative Filtering
   Classification
Fix Version/s: 0.4

I suggest we postpone this until Hadoop 0.21, as it fixes some new API issues 
that would prevent us from moving entirely to new APIs. And that will take a 
while.

 Convert clustering code to Hadoop 0.20 API
 --

 Key: MAHOUT-167
 URL: https://issues.apache.org/jira/browse/MAHOUT-167
 Project: Mahout
  Issue Type: Improvement
  Components: Classification, Clustering, Collaborative Filtering, 
 Frequent Itemset/Association Rule Mining, Genetic Algorithms, Matrix
Affects Versions: 0.1
Reporter: Jeff Eastman
Assignee: Jeff Eastman
 Fix For: 0.4

 Attachments: MAHOUT-167.patch


 We need to update the clustering implementations to remove the deprecated 
 Hadoop API calls.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-90:


 Priority: Minor  (was: Major)
Fix Version/s: 0.3
 Assignee: Isabel Drost

I'm sort of arbitrarily assigning this stale issue to Isabel since I think you 
had recently looked at things like nightly build scripts, etc.? AFAIK this one 
is up to you really.

 Adding all scripts (for nightly build) to SVN repository.
 -

 Key: MAHOUT-90
 URL: https://issues.apache.org/jira/browse/MAHOUT-90
 Project: Mahout
  Issue Type: New Feature
Reporter: Edward J. Yoon
Assignee: Isabel Drost
Priority: Minor
 Fix For: 0.3

 Attachments: mahout.tgz


 I made the scripts below for the Hudson continuous integration service on my 
 Hudson account. 
 mahout/hudsonBuildMahoutPatch.sh
 mahout/processMahoutPatchEmail.sh
 mahout/hudsonPatchQueueAdmin.sh
 They will be modified only by me, so they should be handled via SVN.




[jira] Updated: (MAHOUT-71) Dataset to Matrix Reader

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-71?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-71:


Fix Version/s: 0.3
 Assignee: Deneche A. Hakim

Deneche, do you think this issue is still live? Is it possible to read any input 
in general into matrix form?

 Dataset to Matrix Reader
 

 Key: MAHOUT-71
 URL: https://issues.apache.org/jira/browse/MAHOUT-71
 Project: Mahout
  Issue Type: New Feature
Reporter: Deneche A. Hakim
Assignee: Deneche A. Hakim
Priority: Minor
 Fix For: 0.3


 This component should allow the input datasets to be read as Matrix Rows.
 A Map-Reduce algorithm should handle any dataset in a matrix format, where 
 the columns are the attributes (and one of them is the label) and the rows 
 are the data instances.
 Working with Hadoop, we'll need to pass the dataset in the mapper's input, so 
 it must be a file (or many files). We'll then need a custom InputFormat to 
 feed the mappers with the data, and here comes the lovely-named row-wise 
 splitting matrix input format.
 Now we want to be able to work with any given dataset file format (including 
 the ARFF and my custom format), and thus the InputFormat needs a decoder that 
 converts the dataset lines into matrix rows.
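
A minimal sketch of the decoder idea described above, assuming a simple comma-separated numeric format. The names LineDecoderSketch and CsvLineDecoderSketch are hypothetical and are not part of Mahout; ARFF parsing and the custom InputFormat are left out.

```java
// Hypothetical decoder contract: one dataset line in, one matrix row out.
interface LineDecoderSketch {
  double[] decode(String line);
}

// Assumed format: numeric attributes separated by commas, one instance per line.
final class CsvLineDecoderSketch implements LineDecoderSketch {
  @Override
  public double[] decode(String line) {
    String[] tokens = line.trim().split(",");
    double[] row = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      // Each token becomes one column (attribute) of the matrix row.
      row[i] = Double.parseDouble(tokens[i].trim());
    }
    return row;
  }
}
```

An InputFormat could then be configured with whichever decoder matches the dataset file format, so the mappers only ever see matrix rows.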




[jira] Updated: (MAHOUT-70) no way to pass in weight filename to WeightedDistanceMeasure

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-70?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-70:


Affects Version/s: 0.1
Fix Version/s: 0.3
 Assignee: Grant Ingersoll

Grant, I'm just doing some housekeeping. What do you think of this old issue?

 no way to pass in weight filename to WeightedDistanceMeasure
 

 Key: MAHOUT-70
 URL: https://issues.apache.org/jira/browse/MAHOUT-70
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.1
Reporter: peter dapkus
Assignee: Grant Ingersoll
 Fix For: 0.3


 I might be missing something, but it doesn't seem that there's a way to pass 
 in the weights file for a weighted distance measure without modifying one of 
 the mahout classes (e.g. CanopyDriver, ClusterDriver, CanopyClusteringJob).   
  Seems like the runJob methods should have an option to include one, or that 
 maybe the distance measure should be passed as something other than just a 
 string.
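
One way to picture the suggestion above is a measure that receives its weights directly instead of being named by a string. The following is only a sketch under that assumption; WeightedEuclideanSketch is a hypothetical class and does not reflect Mahout's actual WeightedDistanceMeasure API or how the drivers load weights.

```java
// Hypothetical measure that is handed its weights as an object, not a class name.
final class WeightedEuclideanSketch {
  private final double[] weights;

  WeightedEuclideanSketch(double[] weights) {
    // Defensive copy so later changes by the caller do not affect the measure.
    this.weights = weights.clone();
  }

  // Weighted Euclidean distance: sqrt(sum_i w_i * (a_i - b_i)^2).
  double distance(double[] a, double[] b) {
    if (a.length != weights.length || b.length != weights.length) {
      throw new IllegalArgumentException("vector length must match weights length");
    }
    double sum = 0.0;
    for (int i = 0; i < weights.length; i++) {
      double diff = a[i] - b[i];
      sum += weights[i] * diff * diff;
    }
    return Math.sqrt(sum);
  }
}
```

A driver that accepted such an object (or a weights file path alongside the class name) would not need to be modified to use a weighted measure.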




[jira] Updated: (MAHOUT-85) Perceptron/Winnow Trainer

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-85:


Affects Version/s: 0.1
Fix Version/s: 0.3

More housekeeping for 0.3. Is this still pretty committable? I'd go for it if 
you think it's basically sound.

 Perceptron/Winnow Trainer
 -

 Key: MAHOUT-85
 URL: https://issues.apache.org/jira/browse/MAHOUT-85
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Affects Versions: 0.1
Reporter: Isabel Drost
Assignee: Isabel Drost
 Fix For: 0.3

 Attachments: perceptronWinnowTrainer.diff


 Please find attached a first sketch for perceptron and winnow training. 
 Please look very, very carefully at the patch, as I added the heart of the 
 algorithms in the emergency room at Charite Berlin (after I broke my leg when 
 cycling to the Hadoop Get Together ;) ). 
 The patch does not yet feature unit tests nor is it parallelised. Currently 
 my plan is to set up an example with the webKb dataset, add unit tests to the 
 code and after that go parallel. I would like to get some feedback early on, 
 in addition I would feel a lot better, if a second and third pair of eyes had 
 a look at the code to make sure all obvious mistakes are out as early as 
 possible.




[jira] Resolved: (MAHOUT-144) Some maven refactoring and prep for enforcing code style

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-144.
--

Resolution: Fixed

This was completed, right?

 Some maven refactoring and prep for enforcing code style
 

 Key: MAHOUT-144
 URL: https://issues.apache.org/jira/browse/MAHOUT-144
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.2
Reporter: Benson Margulies
Assignee: Grant Ingersoll
 Attachments: cs1.diff, miscbuildtools.patch.txt


 The attached does a few things:
 1) sorts out the maven parents: now the modules parent to 'maven', and 
 'maven' parents to the top-level project. 
 2) The release management in the top-level POM is in a profile.
 3) the version of 'maven' is consistent with other version numbers.
 4) the source control URLs are corrected.
 5) a new buildtools module to hold pmd and checkstyle config.
 6) dependencyManagement in the parent, initially just for lucene.
 7) backup to current lucene release. -Dlucene.version is there for those who 
 really want to use 2.9-SNAPSHOT.
 8) a profile, sourcecheck, that turns on checkstyle and pmd. This creates a 
 giant pile of complaints. 
 The next step in this process would be to come up with a set of checkstyle 
 and pmd rules consistent with the community's desires.




[jira] Resolved: (MAHOUT-119) Create an uber jar for use on Amazon Elastic M/R, etc.

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-119.
--

   Resolution: Fixed
Fix Version/s: 0.2
 Assignee: Sean Owen

I think this is done? We have long since generated all-inclusive .job files, 
which I've been using with Hadoop just fine.

 Create an uber jar for use on Amazon Elastic M/R, etc.
 --

 Key: MAHOUT-119
 URL: https://issues.apache.org/jira/browse/MAHOUT-119
 Project: Mahout
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Sean Owen
Priority: Minor
 Fix For: 0.2


 Some cloud resources have problems loading classes across JARs in the Job 
 jar.  See 
 http://www.lucidimagination.com/search/document/3a5680dfe567d812/running_dirichlet_example_on_aemr
 This can be fixed by adding a new target that creates a single Jar target.




[jira] Resolved: (MAHOUT-67) plus method and divide method in AbstractVector can be optimized for SparseVectors

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved MAHOUT-67.
-

   Resolution: Fixed
Fix Version/s: 0.2
 Assignee: Sean Owen

I checked, and at this point AbstractVector already does what this patch did, 
through use of iterateNonZero(). Marking this basically obsoleted.
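
To illustrate why iterating only over non-zero entries helps for sparse vectors, here is a small self-contained sketch. It does not use Mahout's Vector classes; SparseVectorSketch and its map-backed storage are hypothetical stand-ins, and the real AbstractVector code may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse vector that stores only its non-zero entries.
final class SparseVectorSketch {
  private final int cardinality;
  private final Map<Integer, Double> values = new HashMap<>();

  SparseVectorSketch(int cardinality) {
    this.cardinality = cardinality;
  }

  void set(int index, double value) {
    if (index < 0 || index >= cardinality) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    if (value == 0.0) {
      values.remove(index); // zeros are never stored
    } else {
      values.put(index, value);
    }
  }

  double get(int index) {
    Double v = values.get(index);
    return v == null ? 0.0 : v;
  }

  int nonZeroCount() {
    return values.size();
  }

  // Visits only the stored entries of both operands, not all cardinality slots.
  SparseVectorSketch plus(SparseVectorSketch other) {
    if (cardinality != other.cardinality) {
      throw new IllegalArgumentException("cardinality mismatch");
    }
    SparseVectorSketch result = new SparseVectorSketch(cardinality);
    result.values.putAll(values);
    for (Map.Entry<Integer, Double> e : other.values.entrySet()) {
      result.set(e.getKey(), result.get(e.getKey()) + e.getValue());
    }
    return result;
  }

  // Dividing a zero entry leaves it zero, so only stored entries need work.
  SparseVectorSketch divide(double x) {
    SparseVectorSketch result = new SparseVectorSketch(cardinality);
    for (Map.Entry<Integer, Double> e : values.entrySet()) {
      result.set(e.getKey(), e.getValue() / x);
    }
    return result;
  }
}
```

With this layout the cost of plus and divide grows with the number of non-zero entries rather than with the vector's cardinality.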

 plus method and divide method in AbstractVector can be optimized for 
 SparseVectors
 --

 Key: MAHOUT-67
 URL: https://issues.apache.org/jira/browse/MAHOUT-67
 Project: Mahout
  Issue Type: Bug
  Components: Matrix
Reporter: Pallavi Palleti
Assignee: Sean Owen
Priority: Minor
 Fix For: 0.2

 Attachments: MAHOUT-67.patch, MAHOUT-67.patch, MAHOUT-67.patch, 
 MAHOUT-67.patch







[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-191:
-

Fix Version/s: 0.3
 Assignee: Sean Owen

So, may I submit these on y'all's behalf? Not clear on the status here.

 NPE while creating term vectors with an index on a field that does not exist 
 in all the documents
 -

 Key: MAHOUT-191
 URL: https://issues.apache.org/jira/browse/MAHOUT-191
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
 Environment: mac, snow leopard, eclipse galileo, jdk 6
Reporter: Sushil Bajracharya
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-191-patch.txt, MAHOUT-191.patch


 (based on the message from here: 
 http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
 I checked out mahout from trunk and tried to create term frequency vector 
 from a lucene index and ran into this..
 09/10/27 17:36:10 INFO lucene.Driver: Output File: 
 /Users/shoeseal/DATA/luc2tvec.out
 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
 at 
 org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
 at 
 org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
 at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
 I am running this from Eclipse (snow leopard with JDK 6), on an index that 
 has field with stored term vectors..
 my input parameters for Driver are:
 --dir path/smallidx/ --output path/luc2tvec.out --idField id_field
  --field field_with_TV --dictOut path/luc2tvec.dict --max 50  --weight tf
 Luke shows the following info on the fields I am using:
  id_field is indexed, stored, omit norms
  field_with_TV is indexed, tokenized, stored, term vector 




Re: [jira] Updated: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API

2009-12-06 Thread Jeff Eastman
I agree. I've already posted a patch for Canopy that does most of the 
changes and compiles but could not get it to work correctly. Since you 
are breaking trail on 0.21 let me know when you think it makes sense for 
me to continue debugging it and when you think we should move to 0.21.  
I'll probably keep working on the other clustering apps in the interim.


Sean Owen (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/MAHOUT-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-167:
-

  Component/s: Matrix
   Genetic Algorithms
   Frequent Itemset/Association Rule Mining
   Collaborative Filtering
   Classification
Fix Version/s: 0.4

I suggest we postpone this until Hadoop 0.21, as it fixes some new API issues 
that would prevent us from moving entirely to new APIs. And that will take a 
while.

  

Convert clustering code to Hadoop 0.20 API
--

Key: MAHOUT-167
URL: https://issues.apache.org/jira/browse/MAHOUT-167
Project: Mahout
 Issue Type: Improvement
 Components: Classification, Clustering, Collaborative Filtering, 
Frequent Itemset/Association Rule Mining, Genetic Algorithms, Matrix
   Affects Versions: 0.1
   Reporter: Jeff Eastman
   Assignee: Jeff Eastman
Fix For: 0.4

Attachments: MAHOUT-167.patch


We need to update the clustering implementations to remove the deprecated 
Hadoop API calls.



  




[jira] Commented: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.

2009-12-06 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786678#action_12786678
 ] 

Isabel Drost commented on MAHOUT-90:


I did add a hudson job to upload maven snapshots of our projects to the apache 
repository on a nightly basis. No idea however how building and publishing 
nightly releases should work at Apache.

 Adding all scripts (for nightly build) to SVN repository.
 -

 Key: MAHOUT-90
 URL: https://issues.apache.org/jira/browse/MAHOUT-90
 Project: Mahout
  Issue Type: New Feature
Reporter: Edward J. Yoon
Assignee: Isabel Drost
Priority: Minor
 Fix For: 0.3

 Attachments: mahout.tgz


 I made the scripts below for the Hudson continuous integration service on my 
 Hudson account. 
 mahout/hudsonBuildMahoutPatch.sh
 mahout/processMahoutPatchEmail.sh
 mahout/hudsonPatchQueueAdmin.sh
 They will be modified only by me, so they should be handled via SVN.




[jira] Assigned: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.

2009-12-06 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost reassigned MAHOUT-90:
--

Assignee: (was: Isabel Drost)

 Adding all scripts (for nightly build) to SVN repository.
 -

 Key: MAHOUT-90
 URL: https://issues.apache.org/jira/browse/MAHOUT-90
 Project: Mahout
  Issue Type: New Feature
Reporter: Edward J. Yoon
Priority: Minor
 Fix For: 0.3

 Attachments: mahout.tgz


 I made the scripts below for the Hudson continuous integration service on my 
 Hudson account. 
 mahout/hudsonBuildMahoutPatch.sh
 mahout/processMahoutPatchEmail.sh
 mahout/hudsonPatchQueueAdmin.sh
 They will be modified only by me, so they should be handled via SVN.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer

2009-12-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786696#action_12786696
 ] 

Sean Owen commented on MAHOUT-85:
-

Sure, worth committing or shelving, you think? Just trying to review all the 
old issues that haven't seen activity in a year or so.

 Perceptron/Winnow Trainer
 -

 Key: MAHOUT-85
 URL: https://issues.apache.org/jira/browse/MAHOUT-85
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Affects Versions: 0.1
Reporter: Isabel Drost
Assignee: Isabel Drost
 Fix For: 0.3

 Attachments: perceptronWinnowTrainer.diff


 Please find attached a first sketch for perceptron and winnow training. 
 Please look very, very carefully at the patch, as I added the heart of the 
 algorithms in the emergency room at Charite Berlin (after I broke my leg when 
 cycling to the Hadoop Get Together ;) ). 
 The patch does not yet feature unit tests nor is it parallelised. Currently 
 my plan is to set up an example with the webKb dataset, add unit tests to the 
 code and after that go parallel. I would like to get some feedback early on, 
 in addition I would feel a lot better, if a second and third pair of eyes had 
 a look at the code to make sure all obvious mistakes are out as early as 
 possible.




[jira] Commented: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.

2009-12-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786695#action_12786695
 ] 

Sean Owen commented on MAHOUT-90:
-

Shall I mark this defunct then? It hasn't had activity in 14 months.

 Adding all scripts (for nightly build) to SVN repository.
 -

 Key: MAHOUT-90
 URL: https://issues.apache.org/jira/browse/MAHOUT-90
 Project: Mahout
  Issue Type: New Feature
Reporter: Edward J. Yoon
Priority: Minor
 Fix For: 0.3

 Attachments: mahout.tgz


 I made the scripts below for the Hudson continuous integration service on my 
 Hudson account. 
 mahout/hudsonBuildMahoutPatch.sh
 mahout/processMahoutPatchEmail.sh
 mahout/hudsonPatchQueueAdmin.sh
 They will be modified only by me, so they should be handled via SVN.




[jira] Updated: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper

2009-12-06 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-197:
---

Status: Patch Available  (was: Open)

 LDADriver: No job jar file set leads to ClassNotFoundException: 
 org.apache.mahout.clustering.lda.LDAMapper
 --

 Key: MAHOUT-197
 URL: https://issues.apache.org/jira/browse/MAHOUT-197
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.2
 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 
 0.20.1, build from r834311 of 
 http://svn.apache.org/repos/asf/lucene/mahout/trunk
Reporter: Drew Farris
Priority: Minor
 Attachments: LDADriver-setJar.patch


 hadoop jar 
 core/target/mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver
  -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 
 --maxIter 2
 [...]
 09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set.  User
 classes may not be found. See JobConf(Class) or
 JobConf#setJar(String).
 [...]
 09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
 09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
 09/11/09 22:02:02 INFO mapred.JobClient:  map 0% reduce 0%
 09/11/09 22:02:12 INFO mapred.JobClient: Task Id :
 attempt_200911091316_0005_m_00_0, Status : FAILED
 java.lang.RuntimeException: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
at 
 org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 Can be fixed by adding the following line to LDADriver after line 299 in 
 r831743:
 job.setJarByClass(LDADriver.class);
 (will attach trivial patch)




[jira] Commented: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents

2009-12-06 Thread Shashikant Kore (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786782#action_12786782
 ] 

Shashikant Kore commented on MAHOUT-191:


Yeah, this can be committed.

 NPE while creating term vectors with an index on a field that does not exist 
 in all the documents
 -

 Key: MAHOUT-191
 URL: https://issues.apache.org/jira/browse/MAHOUT-191
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
 Environment: mac, snow leopard, eclipse galileo, jdk 6
Reporter: Sushil Bajracharya
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-191-patch.txt, MAHOUT-191.patch


 (based on the message from here: 
 http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
 I checked out mahout from trunk and tried to create term frequency vector 
 from a lucene index and ran into this..
 09/10/27 17:36:10 INFO lucene.Driver: Output File: 
 /Users/shoeseal/DATA/luc2tvec.out
 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
 at 
 org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
 at 
 org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
 at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
 I am running this from Eclipse (snow leopard with JDK 6), on an index that 
 has field with stored term vectors..
 my input parameters for Driver are:
 --dir path/smallidx/ --output path/luc2tvec.out --idField id_field
  --field field_with_TV --dictOut path/luc2tvec.dict --max 50  --weight tf
 Luke shows the following info on the fields I am using:
  id_field is indexed, stored, omit norms
  field_with_TV is indexed, tokenized, stored, term vector 




[jira] Created: (MAHOUT-212) Need random sampler for use in reducers

2009-12-06 Thread Ted Dunning (JIRA)
Need random sampler for use in reducers
---

 Key: MAHOUT-212
 URL: https://issues.apache.org/jira/browse/MAHOUT-212
 Project: Mahout
  Issue Type: Bug
  Components: Utils
Affects Versions: 0.2
Reporter: Ted Dunning
 Fix For: 0.3



For a variety of mining algorithms, it helps to have a uniform way to only 
process a sub-set of the records in a reducer.

As such, I have written a simple generic sampler that filters an Iterator 
returning a fair sample of at most a specified size.
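
For readers unfamiliar with the idea, a sampler of this kind can be sketched with reservoir sampling. This is an assumption about the general technique, not the code in the attached patch; ReservoirSampleSketch and its sample method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper illustrating reservoir sampling; not the MAHOUT-212 class.
final class ReservoirSampleSketch {

  private ReservoirSampleSketch() {
  }

  // Returns a uniform random sample of at most maxSize items, reading the
  // iterator once and never holding more than maxSize items in memory.
  static <T> List<T> sample(Iterator<T> source, int maxSize, Random random) {
    if (maxSize < 0) {
      throw new IllegalArgumentException("maxSize must be non-negative");
    }
    List<T> reservoir = new ArrayList<>(maxSize);
    long seen = 0;
    while (source.hasNext()) {
      T item = source.next();
      seen++;
      if (reservoir.size() < maxSize) {
        // The first maxSize items always enter the reservoir.
        reservoir.add(item);
      } else {
        // Later items replace a random slot with probability maxSize / seen.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < maxSize) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }
}
```

In a reducer, the values iterator could be passed through such a method so that only a bounded subset of the records is processed, whatever the group size.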




[jira] Assigned: (MAHOUT-212) Need random sampler for use in reducers

2009-12-06 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning reassigned MAHOUT-212:
--

Assignee: Ted Dunning

 Need random sampler for use in reducers
 ---

 Key: MAHOUT-212
 URL: https://issues.apache.org/jira/browse/MAHOUT-212
 Project: Mahout
  Issue Type: Bug
  Components: Utils
Affects Versions: 0.2
Reporter: Ted Dunning
Assignee: Ted Dunning
 Fix For: 0.3


 For a variety of mining algorithms, it helps to have a uniform way to only 
 process a sub-set of the records in a reducer.
 As such, I have written a simple generic sampler that filters an Iterator 
 returning a fair sample of at most a specified size.




[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers

2009-12-06 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-212:
---

Status: Patch Available  (was: Open)


Code plus test cases.

Ready for use.  I think.

 Need random sampler for use in reducers
 ---

 Key: MAHOUT-212
 URL: https://issues.apache.org/jira/browse/MAHOUT-212
 Project: Mahout
  Issue Type: Bug
  Components: Utils
Affects Versions: 0.2
Reporter: Ted Dunning
Assignee: Ted Dunning
 Fix For: 0.3


 For a variety of mining algorithms, it helps to have a uniform way to only 
 process a sub-set of the records in a reducer.
 As such, I have written a simple generic sampler that filters an Iterator 
 returning a fair sample of at most a specified size.




[jira] Assigned: (MAHOUT-212) Need random sampler for use in reducers

2009-12-06 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning reassigned MAHOUT-212:
--

Assignee: Sean Owen  (was: Ted Dunning)

 Need random sampler for use in reducers
 ---

 Key: MAHOUT-212
 URL: https://issues.apache.org/jira/browse/MAHOUT-212
 Project: Mahout
  Issue Type: Bug
  Components: Utils
Affects Versions: 0.2
Reporter: Ted Dunning
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-212.patch


 For a variety of mining algorithms, it helps to have a uniform way to only 
 process a sub-set of the records in a reducer.
 As such, I have written a simple generic sampler that filters an Iterator 
 returning a fair sample of at most a specified size.




[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers

2009-12-06 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-212:
---

Attachment: MAHOUT-212.patch

Hmm... didn't get asked for where the patch file was when marking the bug as 
patch available. 

 Need random sampler for use in reducers
 ---

 Key: MAHOUT-212
 URL: https://issues.apache.org/jira/browse/MAHOUT-212
 Project: Mahout
  Issue Type: Bug
  Components: Utils
Affects Versions: 0.2
Reporter: Ted Dunning
Assignee: Ted Dunning
 Fix For: 0.3

 Attachments: MAHOUT-212.patch


 For a variety of mining algorithms, it helps to have a uniform way to only 
 process a sub-set of the records in a reducer.
 As such, I have written a simple generic sampler that filters an Iterator 
 returning a fair sample of at most a specified size.




[jira] Assigned: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper

2009-12-06 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall reassigned MAHOUT-197:
-

Assignee: David Hall

 LDADriver: No job jar file set leads to ClassNotFoundException: 
 org.apache.mahout.clustering.lda.LDAMapper
 --

 Key: MAHOUT-197
 URL: https://issues.apache.org/jira/browse/MAHOUT-197
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.2
 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 
 0.20.1, build from r834311 of 
 http://svn.apache.org/repos/asf/lucene/mahout/trunk
Reporter: Drew Farris
Assignee: David Hall
Priority: Minor
 Attachments: LDADriver-setJar.patch


 hadoop jar 
 core/target/mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver
  -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 
 --maxIter 2
 [...]
 09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set.  User
 classes may not be found. See JobConf(Class) or
 JobConf#setJar(String).
 [...]
 09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
 09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
 09/11/09 22:02:02 INFO mapred.JobClient:  map 0% reduce 0%
 09/11/09 22:02:12 INFO mapred.JobClient: Task Id :
 attempt_200911091316_0005_m_00_0, Status : FAILED
 java.lang.RuntimeException: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
at 
 org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 Can be fixed by adding the following line to LDADriver after line 299 in 
 r831743:
 job.setJarByClass(LDADriver.class);
 (will attach trivial patch)




[jira] Updated: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper

2009-12-06 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-197:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Fixed in 887843 

Thanks for the patch!

 LDADriver: No job jar file set leads to ClassNotFoundException: 
 org.apache.mahout.clustering.lda.LDAMapper
 --

 Key: MAHOUT-197
 URL: https://issues.apache.org/jira/browse/MAHOUT-197
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.2
 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 
 0.20.1, build from r834311 of 
 http://svn.apache.org/repos/asf/lucene/mahout/trunk
Reporter: Drew Farris
Assignee: David Hall
Priority: Minor
 Attachments: LDADriver-setJar.patch


 hadoop jar 
 core/target/mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver
  -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 
 --maxIter 2
 [...]
 09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set.  User
 classes may not be found. See JobConf(Class) or
 JobConf#setJar(String).
 [...]
 09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
 09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
 09/11/09 22:02:02 INFO mapred.JobClient:  map 0% reduce 0%
 09/11/09 22:02:12 INFO mapred.JobClient: Task Id :
 attempt_200911091316_0005_m_00_0, Status : FAILED
 java.lang.RuntimeException: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
at 
 org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.lda.LDAMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 Can be fixed by adding the following line to LDADriver after line 299 in 
 r831743:
 job.setJarByClass(LDADriver.class);
 (will attach trivial patch)




[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

2009-12-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786806#action_12786806
 ] 

Sean Owen commented on MAHOUT-212:
--

This kinda already existed as SamplingIterator -- does that do the same thing? 
If so, could the two be merged, pulling the class into a common location and 
combining aspects of both?

 Need random sampler for use in reducers
 ---

 Key: MAHOUT-212
 URL: https://issues.apache.org/jira/browse/MAHOUT-212
 Project: Mahout
  Issue Type: Bug
  Components: Utils
Affects Versions: 0.2
Reporter: Ted Dunning
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-212.patch


 For a variety of mining algorithms, it helps to have a uniform way to only 
 process a sub-set of the records in a reducer.
 As such, I have written a simple generic sampler that filters an Iterator 
 returning a fair sample of at most a specified size.
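
 One standard way to draw a fair sample of bounded size from an Iterator in a 
 single pass is reservoir sampling. The sketch below illustrates that 
 technique only; it is not the code in MAHOUT-212.patch, and the names 
 ReservoirSampleSketch and sample are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of reservoir sampling; not the MAHOUT-212 patch. */
final class ReservoirSampleSketch {

  private ReservoirSampleSketch() {}

  /**
   * Reads the iterator once and returns a uniform random sample of at most
   * maxSize elements, keeping no more than maxSize elements in memory.
   */
  static <T> List<T> sample(Iterator<T> source, int maxSize, Random random) {
    if (maxSize < 0) {
      throw new IllegalArgumentException("maxSize must be >= 0");
    }
    List<T> reservoir = new ArrayList<T>();
    long seen = 0;
    while (source.hasNext()) {
      T item = source.next();
      seen++;
      if (reservoir.size() < maxSize) {
        // Fill phase: the first maxSize items are always kept.
        reservoir.add(item);
      } else if (maxSize > 0) {
        // Replacement phase: keep the new item with probability maxSize / seen,
        // evicting a uniformly chosen current member of the reservoir.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < maxSize) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> records = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    // In a reducer, the iterator would be the values passed in for one key.
    List<Integer> picked = sample(records.iterator(), 3, new Random());
    System.out.println(picked);
  }
}
```

 Because the reservoir never grows past maxSize, memory use stays bounded no 
 matter how many values a reducer receives for a key.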




[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents

2009-12-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-191:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed a variant of the patches on behalf of the submitters. I did not 
override the standard setDocumentNumber() method, since it now seems to be 
available in Lucene and it seems undesirable to cut off calls to 
super.setDocumentNumber()?

 NPE while creating term vectors with an index on a field that does not exist 
 in all the documents
 -

 Key: MAHOUT-191
 URL: https://issues.apache.org/jira/browse/MAHOUT-191
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
 Environment: mac, snow leopard, eclipse galileo, jdk 6
Reporter: Sushil Bajracharya
Assignee: Sean Owen
 Fix For: 0.3

 Attachments: MAHOUT-191-patch.txt, MAHOUT-191.patch


 (based on the message from here: 
 http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
 I checked out Mahout from trunk and tried to create term frequency vectors 
 from a Lucene index and ran into this:
 09/10/27 17:36:10 INFO lucene.Driver: Output File: 
 /Users/shoeseal/DATA/luc2tvec.out
 09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
 at 
 org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
 at 
 org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
 at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
 I am running this from Eclipse (Snow Leopard with JDK 6), on an index that 
 has a field with stored term vectors.
 My input parameters for Driver are:
 --dir path/smallidx/ --output path/luc2tvec.out --idField id_field
  --field field_with_TV --dictOut path/luc2tvec.dict --max 50  --weight tf
 Luke shows the following info on the fields I am using:
  id_field is indexed, stored, omit norms
  field_with_TV is indexed, tokenized, stored, term vector 
