Re: Success
On Tue Grant Ingersoll wrote:
> Isabel, any idea where those things actually go? That URL is not
> browseable.

http://maven.apache.org/developers/release/releasing.html (5th and 6th point) says that for others to be able to view the artifacts, you first need to log into Nexus and close the repository containing the release candidate for further deployments:

> Right click on this repository and select "Close". This will close the
> repository from future deployments and make it available for others to
> view.

Currently Nexus does not let me log in, so I cannot verify whether I can see your release :(

Isabel
[jira] Created: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
NPE while creating term vectors with an index on a field that does not exist in all the documents

Key: MAHOUT-191
URL: https://issues.apache.org/jira/browse/MAHOUT-191
Project: Mahout
Issue Type: Bug
Affects Versions: 0.3
Environment: mac, snow leopard, eclipse galileo, jdk 6
Reporter: Sushil Bajracharya

(based on the message from here: http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)

I checked out Mahout from trunk, tried to create a term frequency vector from a Lucene index, and ran into this:

09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
    at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)

I am running this from Eclipse (Snow Leopard with JDK 6), on an index that has a field with stored term vectors.

My input parameters for Driver are:

    --dir /smallidx/ --output /luc2tvec.out --idField id_field --field field_with_TV --dictOut /luc2tvec.dict --max 50 --weight tf

Luke shows the following info on the fields I am using:

    id_field is indexed, stored, omit norms
    field_with_TV is indexed, tokenized, stored, term vector

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
[ https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushil Bajracharya updated MAHOUT-191:

Status: Patch Available (was: Open)

It seems the problem is that not all the documents in my index have the field that I am using to get term vectors from. I made the following changes to make this work, but I am not sure if that's the right way. I wanted to get this working so that I can run the LDA topic modeling on the output from the Driver.

Index: utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
===
--- utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java (revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java (working copy)
@@ -42,7 +42,7 @@
         break;
       }
       //point.write(dataOut);
-      writer.append(new LongWritable(recNum++), point);
+      if(point!=null) writer.append(new LongWritable(recNum++), point);
     }
     return recNum;
Index: utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
===
--- utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java (revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java (working copy)
@@ -104,6 +104,10 @@
       try {
         indexReader.getTermFreqVector(doc, field, mapper);
         result = mapper.getVector();
+
+        if (result == null)
+          return null;
+
         if (idField != null) {
           String id = indexReader.document(doc, idFieldSelector).get(idField);
           result.setName(id);
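The null guard proposed in the patch can be illustrated in isolation. The following is a minimal sketch, not Mahout's actual code: `NullSkippingWriter`, `writeAll`, the `double[]` vector type, and the `List` sink are all hypothetical stand-ins for the writer loop, and a `null` entry is assumed to model a document that has no term vector for the requested field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a vector writer loop; not part of Mahout. */
public class NullSkippingWriter {

  /**
   * Appends every non-null vector to the sink and returns the number written.
   * A null entry models a document with no term vector for the chosen field.
   */
  public static long writeAll(Iterable<double[]> vectors, List<double[]> sink, long maxDocs) {
    long recNum = 0;
    for (double[] point : vectors) {
      if (recNum >= maxDocs) {
        break; // stop once the requested maximum has been written
      }
      if (point == null) {
        continue; // document lacks the field: skip it instead of failing
      }
      sink.add(point);
      recNum++; // only written records advance the counter
    }
    return recNum;
  }

  public static void main(String[] args) {
    List<double[]> input = Arrays.asList(new double[] {1.0, 2.0}, null, new double[] {3.0});
    List<double[]> sink = new ArrayList<>();
    long written = writeAll(input, sink, 50);
    System.out.println(written + " vectors written");
  }
}
```

As in the patch, the record counter advances only for vectors that are actually written, so record numbers stay contiguous when documents are skipped. Whether silently skipping such documents is the right behaviour for the Driver is a separate design question that the sketch does not settle.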
[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents
[ https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushil Bajracharya updated MAHOUT-191:

Attachment: MAHOUT-191-patch.txt
[jira] Commented: (MAHOUT-190) Make all instance fields private
[ https://issues.apache.org/jira/browse/MAHOUT-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770871#action_12770871 ]

Sean Owen commented on MAHOUT-190:

Yes, I am suggesting we (= I) go through now and create potentially all the getters/setters. It will take me 10 minutes with my IDE.

My personal preference is strongly to design for extension, but failing that, to prevent extension where it's not designed for. And a lot of stuff is not yet designed for extension.

I am surprised to hear we'd welcome some dependencies to weave their way into the internal representation of these classes, in ways we aren't tracking. Tens of small subtle bugs come to mind. Oops, now I want to synchronize on some internal object. But I've allowed callers to access it directly, and they are too. Maybe a deadlock occurs. Oops, I didn't expect the field to be nulled at this point.

Isn't just opening up the representation just punting on designing for extension? Should it not be intentional? The strong argument for complete extensibility sounds like an argument for no encapsulation, which can't be the idea. There's a line, and I thought encapsulating representation was one of the things farthest from that line.

I am sure that's the right thing given my own experience, but we all have different experience, and I'm not pushing this point of view.

One other thing: it's open source, right? This is the very case where the worst case is just that someone copies/pastes a class. It's not a closed library.

The least change would be to expose absolutely everything through getters/setters. I think you said it, Jake -- it's a crazy lot of methods added, most of which are not necessary. But these 'methods' already exist; they're part of the API, in the form of accessible fields. They're in the javadoc. This change is just a 'messenger'.

Why don't I make a patch that does in fact add all the getters/setters, for a look. I think in many cases it will just highlight that the fields aren't going to be useful to any extenders. And we chuck those. And we leave a lot of them. And in 3 versions, we can even review them again.

> Make all instance fields private
>
> Key: MAHOUT-190
> URL: https://issues.apache.org/jira/browse/MAHOUT-190
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.2
> Reporter: Sean Owen
> Assignee: Sean Owen
> Priority: Minor
> Fix For: 0.3
>
> This one may be more controversial but is useful and interesting enough to
> discuss.
> I personally believe instance fields should always be private. I think the
> pro- and con- debate goes like this:
> Making all fields private increases encapsulation. Fields must be made
> explicitly accessible via getters and setters, which is good -- default to
> hiding, rather than exposing. Not-hiding a field amounts to committing it to
> be a part of the API, which is rarely intended. Using getters/setters allows
> read/write access to be independently controlled and even allowed -- allows
> for read-only 'fields'. Getters/setters establish an API independent from the
> representation which is a Good Thing.
> But don't getters and setters slow things down?
> Trivially. JIT compilers will easily inline one-liners. Making fields private
> more readily allows fields to be marked final, and these two factors allow
> for optimizations by (Proguard or) JIT. It could actually speed things up.
> But isn't it messy to write all those dang getters/setters?
> Not really, and not at all if you use an IDE, which I think we all should be.
> But sometimes a class needs to share representation with its subclasses.
> Yes, and it remains possible with package-private / protected getters and
> setters. This is IMHO a rare situation anyway, and, the code is far easier to
> read when fields from a parent don't magically appear, or one doesn't wonder
> about where else a field may be accessed in subclasses. I also feel like
> sometimes making a field more visible is a shortcut enabler to some bad
> design. It usually is a bad smell.
> Thoughts on this narrative? Once again I volunteer to implement the consensus.
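The encapsulation argument in this issue can be made concrete with a small example. This is only an illustrative sketch under stated assumptions: `RunningMean` and its members are hypothetical names, not Mahout classes. It shows private state exposed through read-only getters, plus a private lock object that callers cannot reach, which is the situation the comment above describes when it mentions synchronizing on an internal object.

```java
/** Hypothetical example class; not part of Mahout. */
public class RunningMean {

  // Private lock: no caller can synchronize on it, so outside code cannot
  // take part in (or interfere with) this class's locking.
  private final Object lock = new Object();

  // Private representation: free to change without touching the public API.
  private double sum;
  private int count;

  /** The only way to modify the state, so invariants are enforced here. */
  public void add(double value) {
    synchronized (lock) {
      sum += value;
      count++;
    }
  }

  /** Read-only 'field': a getter with no matching setter. */
  public int getCount() {
    synchronized (lock) {
      return count;
    }
  }

  /** Derived value; callers never see sum and count separately out of sync. */
  public double getMean() {
    synchronized (lock) {
      return count == 0 ? Double.NaN : sum / count;
    }
  }

  public static void main(String[] args) {
    RunningMean mean = new RunningMean();
    mean.add(2.0);
    mean.add(4.0);
    System.out.println(mean.getCount() + " values, mean " + mean.getMean());
  }
}
```

If `sum`, `count`, or `lock` were public or protected, any caller or subclass could update one field without the other, or hold the lock in an unexpected order. Keeping them private means the class can later switch to, say, an incremental mean without breaking anyone.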
Re: Success
On Oct 28, 2009, at 3:02 AM, Isabel Drost wrote:

> On Tue Grant Ingersoll wrote:
>> Isabel, any idea where those things actually go? That URL is not
>> browseable.
>
> http://maven.apache.org/developers/release/releasing.html (5th and 6th point) says that for others to be able to view the artifacts, you first need to log into Nexus and close the repository containing the release candidate for further deployments:
>
>> Right click on this repository and select "Close". This will close the
>> repository from future deployments and make it available for others to
>> view.

Ah, OK. Done. View artifacts here: https://repository.apache.org/content/repositories/orgapachemahout-001/

Please look them over and give your thoughts on them, then if that looks good, we can call a vote.

> Currently Nexus does not let me log in, so I cannot verify whether I can see your release :(

It should be your SVN creds.
Feedback on release candidate for 0.2
Ran into this:

[INFO] [remote-resources:process {execution: default}]
[ERROR] Error loading supplemental data models: Could not find resource 'supplemental-models.xml'.
org.codehaus.plexus.resource.loader.ResourceNotFoundException: Could not find resource 'supplemental-models.xml'.

I know we solved this by adding a file, src/main/appended-resources/supplemental-models.xml. I guess it just needs to be packaged. I'll look at that -- Isabel, you might know more about this.
Re: Success
On Wed Grant Ingersoll wrote:
> Please look them over and give your thoughts on them, then if that
> looks good, we can call a vote.

First of all - a big thanks from me as well to all who helped get through the issues!

Looks good at first sight - I will have to dig deeper tomorrow. One thing I noticed - the 3rd party dependencies (hadoop, commons, kosmofs and the like) are not signed.

> > Currently Nexus does not let me login, so I cannot verify whether I
> > might see your release :(
>
> It should be your SVN creds.

Just found out: Nexus does not like Konqueror (at least not the version currently installed on my machine). Any other browser works.

Isabel
Re: Feedback on release candidate for 0.2
On Wed Sean Owen wrote:
> Ran into this --

Currently, when I try to build, one of the tests fails for me.

> [INFO] [remote-resources:process {execution: default}]
> [ERROR] Error loading supplemental data models: Could not find
> resource 'supplemental-models.xml'.
> org.codehaus.plexus.resource.loader.ResourceNotFoundException: Could
> not find resource 'supplemental-models.xml'.
>
> I know we solved this by adding a file,
> src/main/appended-resources/supplemental-models.xml. I guess it just
> needs to be packaged. I'll look at that -- Isabel you might know more
> about this.

That file should contain licensing information for all artifacts that we depend on through Maven and that have no description through Apache-deployed resources. However, I do see it when unpacking the tar.gz file - it is located under "mahout-0.2/src/main/appended-resources/".

More information on that: http://maven.apache.org/plugins/maven-remote-resources-plugin/supplemental-models.html

Isabel
Re: Success
On Oct 28, 2009, at 9:24 AM, Isabel Drost wrote:

> On Wed Grant Ingersoll wrote:
>> Please look them over and give your thoughts on them, then if that
>> looks good, we can call a vote.
>
> First of all - a big Thanks to all who helped get through the issues from me as well!
>
> Looks good on first sight - will have to digg deeper tomorrow. One thing I noticed - the 3rd party dependencies (hadoop, commons, kosmofs and the like) are not signed.

OK, I will look into this. I suspect we will need to add some extra calls. I may just sign them by hand, too.

>>> Currently Nexus does not let me login, so I cannot verify whether I
>>> might see your release :(
>>
>> It should be your SVN creds.
>
> Just found out: Nexus does not like Konqueror (at least not the version currently installed on my machine). Any other browser works.
>
> Isabel
Re: Feedback on release candidate for 0.2
On Wed, 28 Oct 2009 16:03:51 +0100 Isabel Drost wrote:
> On Wed Sean Owen wrote:
>
> > Ran into this --
>
> Currently when trying to build one of the tests fails for me.

Sorry - I forgot to mention the failing test in my last mail:

(org.apache.mahout.clustering.kmeans.TestKmeansClustering) Time elapsed: 18.9 sec <<< FAILURE!

I will test on my own laptop to see whether this is simply an environment issue.

Isabel
[jira] Commented: (MAHOUT-190) Make all instance fields private
[ https://issues.apache.org/jira/browse/MAHOUT-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771174#action_12771174 ]

Ted Dunning commented on MAHOUT-190:

I would prefer to make all instance variables private, and then add getters and setters *only* where used. Putting getters and setters on everything is not a good idea (in my opinion).