Re: Success

2009-10-28 Thread Isabel Drost
On Tue Grant Ingersoll  wrote:
> Isabel, any idea where those things actually go?  That URL is not  
> browseable.

http://maven.apache.org/developers/release/releasing.html (5th and 6th
point)

<- says that for others to be able to view the artifacts you first need
to log into Nexus, and close the repository containing the release
candidate for further deployments:

> Right click on this repository and select "Close". This will close the
> repository from future deployments and make it available for others to
> view. 

Currently Nexus does not let me login, so I cannot verify whether I
might see your release :(



Isabel


[jira] Created: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents

2009-10-28 Thread Sushil Bajracharya (JIRA)
NPE while creating term vectors with an index on a field that does not exist in 
all the documents
-

 Key: MAHOUT-191
 URL: https://issues.apache.org/jira/browse/MAHOUT-191
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
 Environment: mac, snow leopard, eclipse galileo, jdk 6
Reporter: Sushil Bajracharya


(based on the message from here: 
http://www.nabble.com/Creating-Vectors-from-Text-tt24298643.html#a26090263)

I checked out Mahout from trunk, tried to create term frequency vectors from 
a Lucene index, and ran into this:

09/10/27 17:36:10 INFO lucene.Driver: Output File: /Users/shoeseal/DATA/luc2tvec.out
09/10/27 17:36:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
09/10/27 17:36:11 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.NullPointerException
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:109)
    at org.apache.mahout.utils.vectors.lucene.LuceneIterable$TDIterator.next(LuceneIterable.java:1)
    at org.apache.mahout.utils.vectors.io.SequenceFileVectorWriter.write(SequenceFileVectorWriter.java:40)
    at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:200)

I am running this from Eclipse (Snow Leopard with JDK 6), on an index that has 
a field with stored term vectors.

my input parameters for Driver are:
--dir /smallidx/ --output /luc2tvec.out --idField id_field
 --field field_with_TV --dictOut /luc2tvec.dict --max 50  --weight tf

Luke shows the following info on the fields I am using:
 id_field is indexed, stored, omit norms
 field_with_TV is indexed, tokenized, stored, term vector 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents

2009-10-28 Thread Sushil Bajracharya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Bajracharya updated MAHOUT-191:
--

Status: Patch Available  (was: Open)

It seems the problem is that not all the documents in my index have the field 
that I am using to get term vectors from. I made the following changes to make 
this work, but I am not sure if that's the right way. I wanted to get this 
working so I can run the LDA topic modeling on the output from the Driver.

Index: utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java
===
--- utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java (revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/io/SequenceFileVectorWriter.java (working copy)
@@ -42,7 +42,7 @@
 break;
   }
   //point.write(dataOut);
-  writer.append(new LongWritable(recNum++), point);
+  if(point!=null) writer.append(new LongWritable(recNum++), point);
 
 }
 return recNum;
Index: utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java
===
--- utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java (revision 830343)
+++ utils/src/main/java/org/apache/mahout/utils/vectors/lucene/LuceneIterable.java (working copy)
@@ -104,6 +104,10 @@
   try {
 indexReader.getTermFreqVector(doc, field, mapper);
 result = mapper.getVector();
+
+if (result == null)
+ return null;
+
 if (idField != null) {
   String id = indexReader.document(doc, idFieldSelector).get(idField);
   result.setName(id);
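
For readers following the patch above, here is a minimal standalone sketch of
the same idea: skip entries that come back null (documents without the field)
and only advance the record number for entries that are actually written. The
class NullSkippingWriter and the Sink interface are hypothetical stand-ins made
up for this illustration. They are not Mahout, Lucene, or Hadoop classes, and
this is not the code that was committed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the writer loop; not a Mahout class.
final class NullSkippingWriter {

  /** Hypothetical sink that receives (record number, value) pairs. */
  interface Sink<T> {
    void append(long recNum, T value);
  }

  /**
   * Appends at most maxDocs non-null items to the sink and returns how many
   * were written. Null items are skipped and do not consume a record number.
   */
  static <T> long write(Iterable<T> items, long maxDocs, Sink<T> sink) {
    long recNum = 0;
    for (T item : items) {
      if (recNum >= maxDocs) {
        break;
      }
      if (item == null) {
        continue; // document had no value for the field, nothing to write
      }
      sink.append(recNum++, item);
    }
    return recNum;
  }

  public static void main(String[] args) {
    // The null entry plays the role of a document without the field.
    List<String> vectors = Arrays.asList("doc0", null, "doc2");
    List<String> written = new ArrayList<>();
    long count = write(vectors, 50, (n, v) -> written.add(n + ":" + v));
    System.out.println(count + " " + written);
  }
}
```

One design question the sketch makes visible: returning null from an iterator
pushes the null check onto every consumer. An alternative, not shown here, is
to have the iterator itself advance past documents that lack the field.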




[jira] Updated: (MAHOUT-191) NPE while creating term vectors with an index on a field that does not exist in all the documents

2009-10-28 Thread Sushil Bajracharya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushil Bajracharya updated MAHOUT-191:
--

Attachment: MAHOUT-191-patch.txt




[jira] Commented: (MAHOUT-190) Make all instance fields private

2009-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770871#action_12770871
 ] 

Sean Owen commented on MAHOUT-190:
--

Yes, I am suggesting we (= I) go through now and create potentially all the 
getters/setters. It will take me 10 minutes with my IDE.

My personal preference is strongly to design for extension, but failing that, 
to prevent extension where it's not designed for. And a lot of stuff is not yet 
designed for extension.

I am surprised to hear we'd welcome some dependencies to weave their way into 
the internal representation of these classes, in ways we aren't tracking. Tens 
of small subtle bugs come to mind. Oops, now I want to synchronize on some 
internal object. But I've allowed callers to access it directly, and they are 
too. Maybe a deadlock occurs. Oops I didn't expect the field to be nulled at 
this point.

Isn't opening up the representation just punting on designing for extension? 
Shouldn't it be intentional?

The strong argument for complete extensibility sounds like an argument for no 
encapsulation, which can't be the idea. There's a line, and I thought 
encapsulating representation was one of the things farthest from that line. I 
am sure that's the right thing given my own experience, but we all have 
different experience, and I'm not pushing this point of view.

One other thing: it's open source, right? This is the very case where the 
worst case is just that someone copies/pastes a class. It's not a closed 
library.

The least change would be to expose absolutely everything through 
getters/setters. I think you said it Jake -- it's a crazy lot of methods added, 
most of which are not necessary. But these 'methods' already exist, they're 
part of the API, in the form of accessible fields. They're in the javadoc. This 
change is just a 'messenger'.

Why don't I make a patch that does in fact add all the getters/setters, for a 
look. I think in many cases it will just highlight that the fields aren't going 
to be useful to any extenders. And we chuck those. And we leave a lot of them. 
And in 3 versions, can even review them again.
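
To make the concerns in this comment concrete, here is a small sketch of a
class with private fields: a read-only field with a getter and no setter, an
internal lock that callers can never synchronize on, and a getter that hands
out an unmodifiable copy instead of the internal list. The class Centroid and
its members are hypothetical, invented for this illustration, and are not
taken from the Mahout code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class for illustration only; not a Mahout class.
final class Centroid {
  // Read-only "field": exposed through a getter, with no setter.
  private final String name;
  private final List<Double> weights = new ArrayList<>();
  // Never exposed, so outside code cannot synchronize on it.
  private final Object lock = new Object();

  Centroid(String name) {
    this.name = name;
  }

  String getName() {
    return name;
  }

  void addWeight(double w) {
    synchronized (lock) {
      weights.add(w);
    }
  }

  /** Returns an unmodifiable snapshot; the internal list stays private. */
  List<Double> getWeights() {
    synchronized (lock) {
      return Collections.unmodifiableList(new ArrayList<>(weights));
    }
  }

  public static void main(String[] args) {
    Centroid c = new Centroid("c1");
    c.addWeight(0.5);
    List<Double> snapshot = c.getWeights();
    c.addWeight(1.5); // does not affect the earlier snapshot
    System.out.println(c.getName() + " " + snapshot + " " + c.getWeights());
  }
}
```

Because the lock and the list never leave the class, the representation can
change later without breaking callers, which is the trade-off being argued
for in this thread.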

> Make all instance fields private
> 
>
> Key: MAHOUT-190
> URL: https://issues.apache.org/jira/browse/MAHOUT-190
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.2
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 0.3
>
>
> This one may be more controversial but is useful and interesting enough to 
> discuss.
> I personally believe instance fields should always be private. I think the 
> pro- and con- debate goes like this:
> Making all fields private increases encapsulation. Fields must be made 
> explicitly accessible via getters and setters, which is good -- default to 
> hiding, rather than exposing. Not-hiding a field amounts to committing it to 
> be a part of the API, which is rarely intended. Using getters/setters allows 
> read/write access to be independently controlled and even allowed -- allows 
> for read-only 'fields'. Getters/setters establish an API independent from the 
> representation which is a Good Thing.
> But don't getters and setters slow things down?
> Trivially. JIT compilers will easily inline one-liners. Making fields private 
> more readily allows fields to be marked final, and these two factors allow 
> for optimizations by (Proguard or) JIT. It could actually speed things up.
> But isn't it messy to write all those dang getters/setters?
> Not really, and not at all if you use an IDE, which I think we all should be.
> But sometimes a class needs to share representation with its subclasses.
> Yes, and it remains possible with package-private / protected getters and 
> setters. This is IMHO a rare situation anyway, and, the code is far easier to 
> read when fields from a parent don't magically appear, or one doesn't wonder 
> about where else a field may be accessed in subclasses. I also feel like 
> sometimes making a field more visible is a shortcut enabler to some bad 
> design. It usually is a bad smell.
> Thoughts on this narrative. Once again I volunteer to implement the consensus.




Re: Success

2009-10-28 Thread Grant Ingersoll


On Oct 28, 2009, at 3:02 AM, Isabel Drost wrote:

> On Tue Grant Ingersoll  wrote:
> > Isabel, any idea where those things actually go?  That URL is not
> > browseable.
>
> http://maven.apache.org/developers/release/releasing.html (5th and 6th
> point)
>
> <- says that for others to be able to view the artifacts you first need
> to log into Nexus, and close the repository containing the release
> candidate for further deployments:
>
> > Right click on this repository and select "Close". This will close the
> > repository from future deployments and make it available for others to
> > view.

Ah, OK.  Done.  View artifacts here:
https://repository.apache.org/content/repositories/orgapachemahout-001/

Please look them over and give your thoughts on them, then if that
looks good, we can call a vote.

> Currently Nexus does not let me login, so I cannot verify whether I
> might see your release :(

It should be your SVN creds.


Feedback on release candidate for 0.2

2009-10-28 Thread Sean Owen
Ran into this --

[INFO] [remote-resources:process {execution: default}]
[ERROR] Error loading supplemental data models: Could not find
resource 'supplemental-models.xml'.
org.codehaus.plexus.resource.loader.ResourceNotFoundException: Could
not find resource 'supplemental-models.xml'.

I know we solved this by adding a file,
src/main/appended-resources/supplemental-models.xml. I guess it just
needs to be packaged. I'll look at that -- Isabel you might know more
about this.


Re: Success

2009-10-28 Thread Isabel Drost
On Wed Grant Ingersoll  wrote:

> Please look them over and give your thoughts on them, then if that  
> looks good, we can call a vote.

First of all - a big Thanks to all who helped get through the issues
from me as well!

Looks good at first sight - will have to dig deeper tomorrow. One
thing I noticed - the 3rd party dependencies (hadoop, commons, kosmofs
and the like) are not signed.


> > Currently Nexus does not let me login, so I cannot verify whether I
> > might see your release :(
> >
> 
> It should be your SVN creds.

Just found out: Nexus does not like Konqueror (at least not the version
currently installed on my machine). Any other browser works.

Isabel


Re: Feedback on release candidate for 0.2

2009-10-28 Thread Isabel Drost
On Wed Sean Owen  wrote:

> Ran into this --

Currently, when trying to build, one of the tests fails for me.

 
> [INFO] [remote-resources:process {execution: default}]
> [ERROR] Error loading supplemental data models: Could not find
> resource 'supplemental-models.xml'.
> org.codehaus.plexus.resource.loader.ResourceNotFoundException: Could
> not find resource 'supplemental-models.xml'.
> 
> I know we solved this by adding a file,
> src/main/appended-resources/supplemental-models.xml. I guess it just
> needs to be packaged. I'll look at that -- Isabel you might know more
> about this.

That file should contain licensing information for all artifacts that
we depend on through Maven and that have no description in the
Apache-deployed resources. However, I do see it when unpacking the tar.gz
file - it is located under "mahout-0.2/src/main/appended-resources/"

More information on that:

http://maven.apache.org/plugins/maven-remote-resources-plugin/supplemental-models.html

Isabel


Re: Success

2009-10-28 Thread Grant Ingersoll


On Oct 28, 2009, at 9:24 AM, Isabel Drost wrote:

> On Wed Grant Ingersoll  wrote:
>
> > Please look them over and give your thoughts on them, then if that
> > looks good, we can call a vote.
>
> First of all - a big Thanks to all who helped get through the issues
> from me as well!
>
> Looks good at first sight - will have to dig deeper tomorrow. One
> thing I noticed - the 3rd party dependencies (hadoop, commons, kosmofs
> and the like) are not signed.

OK, I will look into this.  I suspect we will need to add some extra
calls.  I may just sign them by hand, too.

> > > Currently Nexus does not let me login, so I cannot verify whether I
> > > might see your release :(
> >
> > It should be your SVN creds.
>
> Just found out: Nexus does not like Konqueror (at least not the version
> currently installed on my machine). Any other browser works.
>
> Isabel





Re: Feedback on release candidate for 0.2

2009-10-28 Thread Isabel Drost
On Wed, 28 Oct 2009 16:03:51 +0100
Isabel Drost  wrote:

> On Wed Sean Owen  wrote:
> 
> > Ran into this --
> 
> Currently, when trying to build, one of the tests fails for me.

Sorry - forgot to mention the failing test in my last mail:

(org.apache.mahout.clustering.kmeans.TestKmeansClustering) Time
elapsed: 18.9 sec  <<< FAILURE!

Will test on my own laptop to see whether this is simply an environment
issue.

Isabel


[jira] Commented: (MAHOUT-190) Make all instance fields private

2009-10-28 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771174#action_12771174
 ] 

Ted Dunning commented on MAHOUT-190:



I would prefer to make all instance variables private, and then add getters 
and setters *only* where used.

Putting getters and setters on everything is not a good idea (in my opinion).
