[jira] Updated: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API
[ https://issues.apache.org/jira/browse/MAHOUT-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-167: - Component/s: Matrix Genetic Algorithms Frequent Itemset/Association Rule Mining Collaborative Filtering Classification Fix Version/s: 0.4 I suggest we postpone this until Hadoop 0.21, as it fixes some new API issues that would prevent us from moving entirely to new APIs. And that will take a while. Convert clustering code to Hadoop 0.20 API -- Key: MAHOUT-167 URL: https://issues.apache.org/jira/browse/MAHOUT-167 Project: Mahout Issue Type: Improvement Components: Classification, Clustering, Collaborative Filtering, Frequent Itemset/Association Rule Mining, Genetic Algorithms, Matrix Affects Versions: 0.1 Reporter: Jeff Eastman Assignee: Jeff Eastman Fix For: 0.4 Attachments: MAHOUT-167.patch We need to update the clustering implementations to remove the deprecated Hadoop API calls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.
[ https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-90: Priority: Minor (was: Major) Fix Version/s: 0.3 Assignee: Isabel Drost I'm sort of arbitrarily assigning this stale issue to Isabel since I think you had recently looked at things like nightly build scripts, etc.? AFAIK this one is up to you really. Adding all scripts (for nightly build) to SVN repository. - Key: MAHOUT-90 URL: https://issues.apache.org/jira/browse/MAHOUT-90 Project: Mahout Issue Type: New Feature Reporter: Edward J. Yoon Assignee: Isabel Drost Priority: Minor Fix For: 0.3 Attachments: mahout.tgz I made below scripts for the hudson continuous integration service on my hudson account. mahout/hudsonBuildMahoutPatch.sh mahout/processMahoutPatchEmail.sh mahout/hudsonPatchQueueAdmin.sh They will be modified by only me, so It should be handled via SVN. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-71) Dataset to Matrix Reader
[ https://issues.apache.org/jira/browse/MAHOUT-71?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-71: Fix Version/s: 0.3 Assignee: Deneche A. Hakim Deneche, do you think this issue is still live? Is it possible to read any input in general into matrix form? Dataset to Matrix Reader Key: MAHOUT-71 URL: https://issues.apache.org/jira/browse/MAHOUT-71 Project: Mahout Issue Type: New Feature Reporter: Deneche A. Hakim Assignee: Deneche A. Hakim Priority: Minor Fix For: 0.3 This component should allow the input datasets to be read as Matrix Rows. A Map-Reduce Algorithm should handle any dataset in a matrix format, where the columns are the attributes (and one of them is the Label) and the rows are the data records. Working with Hadoop, we'll need to pass the dataset as the mappers' input, so it must be a file (or many files). We'll then need a custom InputFormat to feed the mappers with the data, and here comes the lovely-named row-wise splitting matrix input format. Now we want to be able to work with any given dataset file format (including the ARFF and my custom format), and thus the InputFormat needs a decoder that converts the dataset lines into matrix rows. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
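The decoder idea described in MAHOUT-71 could look roughly like the following. This is only a sketch under assumptions: MatrixRowDecoding, RowDecoder and DelimitedRowDecoder are hypothetical names that do not come from the issue or from Mahout, every column (the label included) is assumed to be numeric, and the Hadoop InputFormat wiring is left out entirely.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of "dataset line -> matrix row"; not Mahout code. */
final class MatrixRowDecoding {

  private MatrixRowDecoding() {}

  /** Turns one line of a dataset file into one matrix row. */
  interface RowDecoder {
    double[] decode(String line);
  }

  /** Decodes lines whose fields are separated by a fixed literal delimiter. */
  static final class DelimitedRowDecoder implements RowDecoder {

    private final Pattern delimiter;

    DelimitedRowDecoder(String delimiter) {
      // Quote the delimiter so characters such as '|' are not treated as regex syntax.
      this.delimiter = Pattern.compile(Pattern.quote(delimiter));
    }

    @Override
    public double[] decode(String line) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        throw new IllegalArgumentException("Cannot decode an empty line");
      }
      // Limit -1 keeps trailing empty fields, so a missing value fails loudly
      // in parseDouble instead of silently shortening the row.
      String[] fields = delimiter.split(trimmed, -1);
      double[] row = new double[fields.length];
      for (int i = 0; i < fields.length; i++) {
        row[i] = Double.parseDouble(fields[i].trim());
      }
      return row;
    }
  }
}
```

A reader for another file format (ARFF, for instance) would be a second RowDecoder implementation in this sketch, which is the kind of pluggability the issue asks for.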
[jira] Updated: (MAHOUT-70) no way to pass in weight filename to WeightedDistanceMeasure
[ https://issues.apache.org/jira/browse/MAHOUT-70?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-70: Affects Version/s: 0.1 Fix Version/s: 0.3 Assignee: Grant Ingersoll Grant just doing some housekeeping, what do you think of this old issue? no way to pass in weight filename to WeightedDistanceMeasure Key: MAHOUT-70 URL: https://issues.apache.org/jira/browse/MAHOUT-70 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.1 Reporter: peter dapkus Assignee: Grant Ingersoll Fix For: 0.3 I might be missing something, but it doesn't seem that there's a way to pass in the weights file for a weighted distance measure without modifying one of the mahout classes (e.g. CanopyDriver, ClusterDriver, CanopyClusteringJob). Seems like the runJob methods should have an option to include one, or that maybe the distance measure should be passed as something other than just a string. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-85: Affects Version/s: 0.1 Fix Version/s: 0.3 More housekeeping for 0.3. Is this still pretty commitable? I'd go for it if you think it's basically sound. Perceptron/Winnow Trainer - Key: MAHOUT-85 URL: https://issues.apache.org/jira/browse/MAHOUT-85 Project: Mahout Issue Type: New Feature Components: Classification Affects Versions: 0.1 Reporter: Isabel Drost Assignee: Isabel Drost Fix For: 0.3 Attachments: perceptronWinnowTrainer.diff Please find attached a first sketch for perceptron and winnow training. Please look very, very carefully at the patch, as I added the heart of the algorithms in the emergency room at Charite Berlin (after I broke my leg when cycling to the Hadoop Get Together ;) ). The patch does not yet feature unit tests nor is it parallelised. Currently my plan is to set up an example with the webKb dataset, add unit tests to the code and after that go parallel. I would like to get some feedback early on, in addition I would feel a lot better, if a second and third pair of eyes had a look at the code to make sure all obvious mistakes are out as early as possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-144) Some maven refactoring and prep for enforcing code style
[ https://issues.apache.org/jira/browse/MAHOUT-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-144. -- Resolution: Fixed This was completed right? Some maven refactoring and prep for enforcing code style Key: MAHOUT-144 URL: https://issues.apache.org/jira/browse/MAHOUT-144 Project: Mahout Issue Type: Bug Affects Versions: 0.2 Reporter: Benson Margulies Assignee: Grant Ingersoll Attachments: cs1.diff, miscbuildtools.patch.txt The attached does a few things: 1) sorts out the maven parents: now the modules parent to 'maven', and 'maven' parents to the top-level project. 2) The release management in the top-level POM is in a profile. 3) the version of 'maven' is consistent with other version numbers. 4) the source control URLs are corrected. 5) a new buildtools module to hold pmd and checkstyle config. 6) dependencyManagement in the parent, initially just for lucene. 7) backup to current lucene release. -Dlucene.version is there for those who really want to use 2.9-SNAPSHOT. 8) a profile, sourcecheck, that turns on checkstyle and pmd. This creates a giant pile of complaints. The next step in this process would be to come up with a set of checkstyle and pmd rules consistent with the community's desires. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-119) Create an uber jar for use on Amazon Elastic M/R, etc.
[ https://issues.apache.org/jira/browse/MAHOUT-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-119. -- Resolution: Fixed Fix Version/s: 0.2 Assignee: Sean Owen I think this is done? we have long since generated all-inclusive .job files which I've been using with Hadoop just fine. Create an uber jar for use on Amazon Elastic M/R, etc. -- Key: MAHOUT-119 URL: https://issues.apache.org/jira/browse/MAHOUT-119 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Sean Owen Priority: Minor Fix For: 0.2 Some cloud resources have problems loading classes across JARs in the Job jar. See http://www.lucidimagination.com/search/document/3a5680dfe567d812/running_dirichlet_example_on_aemr This can be fixed by adding a new target that creates a single Jar target. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-67) plus method and divide method in AbstractVector can be optimized for SparseVectors
[ https://issues.apache.org/jira/browse/MAHOUT-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-67. - Resolution: Fixed Fix Version/s: 0.2 Assignee: Sean Owen I checked, and at this point AbstractVector already does what this patch did, through use of iterateNonZero(). Marking this basically obsoleted. plus method and divide method in AbstractVector can be optimized for SparseVectors -- Key: MAHOUT-67 URL: https://issues.apache.org/jira/browse/MAHOUT-67 Project: Mahout Issue Type: Bug Components: Matrix Reporter: Pallavi Palleti Assignee: Sean Owen Priority: Minor Fix For: 0.2 Attachments: MAHOUT-67.patch, MAHOUT-67.patch, MAHOUT-67.patch, MAHOUT-67.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
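For readers unfamiliar with the optimization discussed in MAHOUT-67, the idea of touching only non-zero entries can be sketched as below. This is an assumption-laden illustration, not Mahout's AbstractVector or SparseVector: SparseVectorSketch and all of its methods are hypothetical, and a plain HashMap stands in for whatever storage the real classes use.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only; this is not Mahout's AbstractVector or SparseVector. */
final class SparseVectorSketch {

  private final int cardinality;
  // Only non-zero entries are stored, keyed by index.
  private final Map<Integer, Double> nonZero = new HashMap<>();

  SparseVectorSketch(int cardinality) {
    if (cardinality < 0) {
      throw new IllegalArgumentException("cardinality must be non-negative");
    }
    this.cardinality = cardinality;
  }

  void set(int index, double value) {
    checkIndex(index);
    if (value == 0.0) {
      nonZero.remove(index);
    } else {
      nonZero.put(index, value);
    }
  }

  double get(int index) {
    checkIndex(index);
    Double value = nonZero.get(index);
    return value == null ? 0.0 : value;
  }

  int nonZeroCount() {
    return nonZero.size();
  }

  /** Adds two vectors by visiting only the stored entries of each operand. */
  SparseVectorSketch plus(SparseVectorSketch other) {
    if (other.cardinality != cardinality) {
      throw new IllegalArgumentException("cardinality mismatch");
    }
    SparseVectorSketch result = new SparseVectorSketch(cardinality);
    result.nonZero.putAll(nonZero);
    for (Map.Entry<Integer, Double> entry : other.nonZero.entrySet()) {
      int index = entry.getKey();
      result.set(index, result.get(index) + entry.getValue());
    }
    return result;
  }

  /** Divides by a non-zero scalar; zero entries stay zero, so they are skipped. */
  SparseVectorSketch divide(double divisor) {
    if (divisor == 0.0) {
      throw new IllegalArgumentException("divisor must be non-zero");
    }
    SparseVectorSketch result = new SparseVectorSketch(cardinality);
    for (Map.Entry<Integer, Double> entry : nonZero.entrySet()) {
      result.set(entry.getKey(), entry.getValue() / divisor);
    }
    return result;
  }

  private void checkIndex(int index) {
    if (index < 0 || index >= cardinality) {
      throw new IndexOutOfBoundsException("index " + index);
    }
  }
}
```

In this sketch the cost of plus and divide depends on the number of stored entries rather than on the cardinality, which is the saving the issue was after.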
[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
[ https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-191: - Fix Version/s: 0.3 Assignee: Sean Owen So, may I submit these on y'all's behalf? Not clear on the status here. NPE while creating term vectors with an index on a field that does not exist in all the documents - Key: MAHOUT-191 URL: https://issues.apache.org/jira/browse/MAHOUT-191 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Environment: mac, snow leopard, eclipse galileo, jdk 6 Reporter: Sushil Bajracharya Assignee: Sean Owen Fix For: 0.3 Attachments: MAHOUT-191-patch.txt, MAHOUT-191.patch
(based on the message from here: http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
I checked out mahout from trunk and tried to create term frequency vector from a lucene index and ran into this..
09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
    at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
I am running this from Eclipse (snow leopard with JDK 6), on an index that has field with stored term vectors..
my input parameters for Driver are:
--dir path/smallidx/ --output path/luc2tvec.out --idField id_field --field field_with_TV --dictOut path/luc2tvec.dict --max 50 --weight tf
Luke shows the following info on the fields I am using:
id_field is indexed, stored, omit norms
field_with_TV is indexed, tokenized, stored, term vector
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [jira] Updated: (MAHOUT-167) Convert clustering code to Hadoop 0.20 API
I agree. I've already posted a patch for Canopy that does most of the changes and compiles but could not get it to work correctly. Since you are breaking trail on 0.21 let me know when you think it makes sense for me to continue debugging it and when you think we should move to 0.21. I'll probably keep working on the other clustering apps in the interim. Sean Owen (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-167: - Component/s: Matrix Genetic Algorithms Frequent Itemset/Association Rule Mining Collaborative Filtering Classification Fix Version/s: 0.4 I suggest we postpone this until Hadoop 0.21, as it fixes some new API issues that would prevent us from moving entirely to new APIs. And that will take a while. Convert clustering code to Hadoop 0.20 API -- Key: MAHOUT-167 URL: https://issues.apache.org/jira/browse/MAHOUT-167 Project: Mahout Issue Type: Improvement Components: Classification, Clustering, Collaborative Filtering, Frequent Itemset/Association Rule Mining, Genetic Algorithms, Matrix Affects Versions: 0.1 Reporter: Jeff Eastman Assignee: Jeff Eastman Fix For: 0.4 Attachments: MAHOUT-167.patch We need to update the clustering implementations to remove the deprecated Hadoop API calls.
[jira] Commented: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.
[ https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786678#action_12786678 ] Isabel Drost commented on MAHOUT-90: I did add a hudson job to upload maven snapshots of our projects to the apache repository on a nightly basis. No idea however how building and publishing nightly releases should work at Apache. Adding all scripts (for nightly build) to SVN repository. - Key: MAHOUT-90 URL: https://issues.apache.org/jira/browse/MAHOUT-90 Project: Mahout Issue Type: New Feature Reporter: Edward J. Yoon Assignee: Isabel Drost Priority: Minor Fix For: 0.3 Attachments: mahout.tgz I made below scripts for the hudson continuous integration service on my hudson account. mahout/hudsonBuildMahoutPatch.sh mahout/processMahoutPatchEmail.sh mahout/hudsonPatchQueueAdmin.sh They will be modified by only me, so It should be handled via SVN. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.
[ https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-90: -- Assignee: (was: Isabel Drost) Adding all scripts (for nightly build) to SVN repository. - Key: MAHOUT-90 URL: https://issues.apache.org/jira/browse/MAHOUT-90 Project: Mahout Issue Type: New Feature Reporter: Edward J. Yoon Priority: Minor Fix For: 0.3 Attachments: mahout.tgz I made below scripts for the hudson continuous integration service on my hudson account. mahout/hudsonBuildMahoutPatch.sh mahout/processMahoutPatchEmail.sh mahout/hudsonPatchQueueAdmin.sh They will be modified by only me, so It should be handled via SVN. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-85) Perceptron/Winnow Trainer
[ https://issues.apache.org/jira/browse/MAHOUT-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786696#action_12786696 ] Sean Owen commented on MAHOUT-85: - Sure, worth committing or shelving, you think? Just trying to review all the old issues that haven't seen activity in a year or so. Perceptron/Winnow Trainer - Key: MAHOUT-85 URL: https://issues.apache.org/jira/browse/MAHOUT-85 Project: Mahout Issue Type: New Feature Components: Classification Affects Versions: 0.1 Reporter: Isabel Drost Assignee: Isabel Drost Fix For: 0.3 Attachments: perceptronWinnowTrainer.diff Please find attached a first sketch for perceptron and winnow training. Please look very, very carefully at the patch, as I added the heart of the algorithms in the emergency room at Charite Berlin (after I broke my leg when cycling to the Hadoop Get Together ;) ). The patch does not yet feature unit tests nor is it parallelised. Currently my plan is to set up an example with the webKb dataset, add unit tests to the code and after that go parallel. I would like to get some feedback early on, in addition I would feel a lot better, if a second and third pair of eyes had a look at the code to make sure all obvious mistakes are out as early as possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-90) Adding all scripts (for nightly build) to SVN repository.
[ https://issues.apache.org/jira/browse/MAHOUT-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786695#action_12786695 ] Sean Owen commented on MAHOUT-90: - Shall I mark this defunct then? It hadn't had activity in 14 months. Adding all scripts (for nightly build) to SVN repository. - Key: MAHOUT-90 URL: https://issues.apache.org/jira/browse/MAHOUT-90 Project: Mahout Issue Type: New Feature Reporter: Edward J. Yoon Priority: Minor Fix For: 0.3 Attachments: mahout.tgz I made below scripts for the hudson continuous integration service on my hudson account. mahout/hudsonBuildMahoutPatch.sh mahout/processMahoutPatchEmail.sh mahout/hudsonPatchQueueAdmin.sh They will be modified by only me, so It should be handled via SVN. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
[ https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-197: --- Status: Patch Available (was: Open) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper -- Key: MAHOUT-197 URL: https://issues.apache.org/jira/browse/MAHOUT-197 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.2 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 0.20.1, build from r834311 of http://svn.apache.org/repos/asf/lucene/mahout/trunk Reporter: Drew Farris Priority: Minor Attachments: LDADriver-setJar.patch
hadoop jar core/target/mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 --maxIter 2
[...]
09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[...]
09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
09/11/09 22:02:02 INFO mapred.JobClient: map 0% reduce 0%
09/11/09 22:02:12 INFO mapred.JobClient: Task Id : attempt_200911091316_0005_m_00_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
Can be fixed by adding the following line to LDADriver after line 299 in r831743:
job.setJarByClass(LDADriver.class);
(will attach trivial patch)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
[ https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786782#action_12786782 ] Shashikant Kore commented on MAHOUT-191: Yeah, this can be committed. NPE while creating term vectors with an index on a field that does not exist in all the documents - Key: MAHOUT-191 URL: https://issues.apache.org/jira/browse/MAHOUT-191 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Environment: mac, snow leopard, eclipse galileo, jdk 6 Reporter: Sushil Bajracharya Assignee: Sean Owen Fix For: 0.3 Attachments: MAHOUT-191-patch.txt, MAHOUT-191.patch
(based on the message from here: http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
I checked out mahout from trunk and tried to create term frequency vector from a lucene index and ran into this..
09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
    at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
I am running this from Eclipse (snow leopard with JDK 6), on an index that has field with stored term vectors..
my input parameters for Driver are:
--dir path/smallidx/ --output path/luc2tvec.out --idField id_field --field field_with_TV --dictOut path/luc2tvec.dict --max 50 --weight tf
Luke shows the following info on the fields I am using:
id_field is indexed, stored, omit norms
field_with_TV is indexed, tokenized, stored, term vector
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-212) Need random sampler for use in reducers
Need random sampler for use in reducers --- Key: MAHOUT-212 URL: https://issues.apache.org/jira/browse/MAHOUT-212 Project: Mahout Issue Type: Bug Components: Utils Affects Versions: 0.2 Reporter: Ted Dunning Fix For: 0.3 For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer. As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
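One common way to draw a fair sample of bounded size from an Iterator in a single pass is reservoir sampling. The sketch below is not the code from the MAHOUT-212 patch, which is not shown in this thread; it assumes that technique, and ReservoirSampler, sample and their parameters are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper; not the class attached to MAHOUT-212. */
final class ReservoirSampler {

  private ReservoirSampler() {}

  /**
   * Returns a uniform random sample of at most maxSize items.
   * If the iterator yields maxSize items or fewer, all of them are returned in order.
   */
  static <T> List<T> sample(Iterator<T> items, int maxSize, Random random) {
    if (maxSize < 0) {
      throw new IllegalArgumentException("maxSize must be non-negative");
    }
    List<T> reservoir = new ArrayList<>(maxSize);
    long seen = 0;
    while (items.hasNext()) {
      T item = items.next();
      seen++;
      if (reservoir.size() < maxSize) {
        // Fill phase: the first maxSize items are always kept.
        reservoir.add(item);
      } else if (maxSize > 0) {
        // Replacement phase: pick a slot in [0, seen); the new item survives
        // only if that slot falls inside the reservoir, i.e. with
        // probability maxSize / seen.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < maxSize) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }
}
```

In a reducer, the values iterator for a key could be passed through such a helper so that downstream processing only sees a bounded number of records per key; memory use is bounded by maxSize regardless of how many values arrive.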
[jira] Assigned: (MAHOUT-212) Need random sampler for use in reducers
[ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reassigned MAHOUT-212: -- Assignee: Ted Dunning Need random sampler for use in reducers --- Key: MAHOUT-212 URL: https://issues.apache.org/jira/browse/MAHOUT-212 Project: Mahout Issue Type: Bug Components: Utils Affects Versions: 0.2 Reporter: Ted Dunning Assignee: Ted Dunning Fix For: 0.3 For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer. As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers
[ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-212: --- Status: Patch Available (was: Open) Code plus test cases. Ready for use. I think. Need random sampler for use in reducers --- Key: MAHOUT-212 URL: https://issues.apache.org/jira/browse/MAHOUT-212 Project: Mahout Issue Type: Bug Components: Utils Affects Versions: 0.2 Reporter: Ted Dunning Assignee: Ted Dunning Fix For: 0.3 For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer. As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-212) Need random sampler for use in reducers
[ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reassigned MAHOUT-212: -- Assignee: Sean Owen (was: Ted Dunning) Need random sampler for use in reducers --- Key: MAHOUT-212 URL: https://issues.apache.org/jira/browse/MAHOUT-212 Project: Mahout Issue Type: Bug Components: Utils Affects Versions: 0.2 Reporter: Ted Dunning Assignee: Sean Owen Fix For: 0.3 Attachments: MAHOUT-212.patch For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer. As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers
[ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-212: --- Attachment: MAHOUT-212.patch Hmm... didn't get asked for where the patch file was when marking the bug as patch available. Need random sampler for use in reducers --- Key: MAHOUT-212 URL: https://issues.apache.org/jira/browse/MAHOUT-212 Project: Mahout Issue Type: Bug Components: Utils Affects Versions: 0.2 Reporter: Ted Dunning Assignee: Ted Dunning Fix For: 0.3 Attachments: MAHOUT-212.patch For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer. As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
[ https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall reassigned MAHOUT-197: - Assignee: David Hall LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper -- Key: MAHOUT-197 URL: https://issues.apache.org/jira/browse/MAHOUT-197 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.2 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 0.20.1, build from r834311 of http://svn.apache.org/repos/asf/lucene/mahout/trunk Reporter: Drew Farris Assignee: David Hall Priority: Minor Attachments: LDADriver-setJar.patch
hadoop jar core/target/mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 --maxIter 2
[...]
09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[...]
09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
09/11/09 22:02:02 INFO mapred.JobClient: map 0% reduce 0%
09/11/09 22:02:12 INFO mapred.JobClient: Task Id : attempt_200911091316_0005_m_00_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
Can be fixed by adding the following line to LDADriver after line 299 in r831743:
job.setJarByClass(LDADriver.class);
(will attach trivial patch)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-197) LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
[ https://issues.apache.org/jira/browse/MAHOUT-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Hall updated MAHOUT-197: -- Resolution: Fixed Status: Resolved (was: Patch Available) Fixed in 887843 Thanks for the patch! LDADriver: No job jar file set leads to ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper -- Key: MAHOUT-197 URL: https://issues.apache.org/jira/browse/MAHOUT-197 Project: Mahout Issue Type: Bug Components: Clustering Affects Versions: 0.2 Environment: ubuntu 9.04, sun jdk 1.6.0_07, hadoop cluster running 0.20.1, build from r834311 of http://svn.apache.org/repos/asf/lucene/mahout/trunk Reporter: Drew Farris Assignee: David Hall Priority: Minor Attachments: LDADriver-setJar.patch
hadoop jar core/target/mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.lda.LDADriver -i mahout/foo/foo-vectors -o mahout/foo/lda-cluster -w -k 1000 -v 82342 --maxIter 2
[...]
09/11/09 22:02:00 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[...]
09/11/09 22:02:00 INFO input.FileInputFormat: Total input paths to process : 1
09/11/09 22:02:01 INFO mapred.JobClient: Running job: job_200911091316_0005
09/11/09 22:02:02 INFO mapred.JobClient: map 0% reduce 0%
09/11/09 22:02:12 INFO mapred.JobClient: Task Id : attempt_200911091316_0005_m_00_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:808)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:157)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:532)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.clustering.lda.LDAMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
Can be fixed by adding the following line to LDADriver after line 299 in r831743:
job.setJarByClass(LDADriver.class);
(will attach trivial patch)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers
[ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12786806#action_12786806 ] Sean Owen commented on MAHOUT-212: -- This kinda already existed as SamplingIterator -- does that do the same thing? could these be merged then, pulling the class into a common location and combining aspects of both? Need random sampler for use in reducers --- Key: MAHOUT-212 URL: https://issues.apache.org/jira/browse/MAHOUT-212 Project: Mahout Issue Type: Bug Components: Utils Affects Versions: 0.2 Reporter: Ted Dunning Assignee: Sean Owen Fix For: 0.3 Attachments: MAHOUT-212.patch For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer. As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
[ https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-191: - Resolution: Fixed Status: Resolved (was: Patch Available) Committed variant on patches on behalf of submitters. Did not override the standard setDocumentNumber() method since this seems to be available in Lucene now and undesirable to cut off calls to super.setDocumentNumber()? NPE while creating term vectors with an index on a field that does not exist in all the documents - Key: MAHOUT-191 URL: https://issues.apache.org/jira/browse/MAHOUT-191 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Environment: mac, snow leopard, eclipse galileo, jdk 6 Reporter: Sushil Bajracharya Assignee: Sean Owen Fix For: 0.3 Attachments: MAHOUT-191-patch.txt, MAHOUT-191.patch
(based on the message from here: http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)
I checked out mahout from trunk and tried to create term frequency vector from a lucene index and ran into this..
09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
    at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)
I am running this from Eclipse (snow leopard with JDK 6), on an index that has field with stored term vectors..
my input parameters for Driver are:
--dir path/smallidx/ --output path/luc2tvec.out --idField id_field --field field_with_TV --dictOut path/luc2tvec.dict --max 50 --weight tf
Luke shows the following info on the fields I am using:
id_field is indexed, stored, omit norms
field_with_TV is indexed, tokenized, stored, term vector
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.