Re: [jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Ted Dunning
Transposing is actually a common need as you abstract away from users and
ratings.

On Thu, Jun 18, 2009 at 10:19 PM, Sean Owen (JIRA)  wrote:

> Looks OK to me -- I applied the patch locally and tweaked a few things.
> Seems like a rare use case but simple to implement anyway. Mind if I submit
> over here?
>
> > Allow FileDataModel to transpose users and items
>
>


[jira] Commented: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721653#action_12721653
 ] 

Sean Owen commented on MAHOUT-135:
--

Looks OK to me -- I applied the patch locally and tweaked a few things. Seems 
like a rare use case but simple to implement anyway. Mind if I submit over here?

> Allow FileDataModel to transpose users and items
> 
>
> Key: MAHOUT-135
> URL: https://issues.apache.org/jira/browse/MAHOUT-135
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-135.patch
>
>
> Sometimes it would be nice to flip around users and items in the 
> FileDataModel.  This patch adds a transpose boolean that flips userId and 
> itemId in the processLine method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [GSOC] Thoughts about Random forests map-reduce implementation

2009-06-18 Thread Ted Dunning
Very similar, but I was talking about building trees on each split of the
data (a la map reduce split).

That would give many small splits and would thus give very different results
from bagging because the splits would be small and contiguous rather than
large and random.


On Thu, Jun 18, 2009 at 1:37 AM, deneche abdelhakim wrote:

> "build multiple trees for different portions of the data"
>
> What's the difference with the basic bagging algorithm, which builds 'each
> tree' using a different portion (about 2/3) of the data ?
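The contrast Ted is drawing can be sketched in a few lines. This is a hypothetical illustration (not Mahout code, and the class and method names are made up): contiguous input splits partition the rows in file order, while classical bagging draws a bootstrap sample with replacement, which touches roughly 63% of the rows (the "about 2/3" mentioned above).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/**
 * Sketch contrasting the two ways of choosing per-tree training data:
 * contiguous map-reduce input splits vs. bagging's random bootstrap samples.
 */
public class SplitDemo {

  /** Cut [0, n) into k contiguous ranges, the way input splits partition a file. */
  public static List<int[]> contiguousSplits(int n, int k) {
    List<int[]> splits = new ArrayList<int[]>();
    int size = n / k;
    for (int i = 0; i < k; i++) {
      int start = i * size;
      // last split absorbs the remainder
      int end = (i == k - 1) ? n : start + size;
      splits.add(new int[] {start, end});
    }
    return splits;
  }

  /** Draw n row indices uniformly with replacement; about 63% of distinct rows appear. */
  public static Set<Integer> bootstrapSample(int n, Random rng) {
    Set<Integer> unique = new HashSet<Integer>();
    for (int i = 0; i < n; i++) {
      unique.add(rng.nextInt(n));
    }
    return unique;
  }
}
```

A tree built on one contiguous split never sees rows outside its range, whereas every bootstrap sample is spread across the whole dataset, which is why the resulting forests can differ substantially.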


[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721646#action_12721646
 ] 

Sean Owen commented on MAHOUT-121:
--

Since I am not hearing objections, and cognizant that people are waiting on 
this, going to commit. If there are issues we can roll back or tweak from there.

> Speed up distance calculations for sparse vectors
> -
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
>  Issue Type: Improvement
>  Components: Matrix
>Reporter: Shashikant Kore
> Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
> MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, 
> Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. 
> The complete dataset has a few tens of thousands of feature items, but each 
> vector has only a couple of hundred feature items. For this, there is an 
> optimization in distance calculation, a link to which I found in the archives 
> of the Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization.  The test setup had 2000 document vectors 
> with a few hundred items each.  I ran canopy generation with Euclidean 
> distance and t1, t2 values as 250 and 200.
>  
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know from experience that using Integer and Double objects instead of 
> primitives is computationally expensive. I changed the sparse vector 
> implementation to use primitive collections from Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
>  
> Licensing of Trove seems to be an issue which needs to be addressed.
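The algebraic trick behind the linked blog post can be sketched as follows. This is an illustration, not the actual MAHOUT-121 patch, and it assumes sparse vectors are represented as index-to-value maps: since ||x - y||^2 = ||x||^2 - 2*x.y + ||y||^2, caching the squared norms leaves only one sparse dot product per distance call.

```java
import java.util.Map;

/**
 * Sketch of the sparse-distance shortcut: precompute squared norms once,
 * then each pairwise distance costs only a dot product over nonzeros.
 */
public class SparseDistance {

  /** Squared L2 norm over the nonzero entries; cache this per vector. */
  public static double normSquared(Map<Integer, Double> v) {
    double sum = 0.0;
    for (double x : v.values()) {
      sum += x * x;
    }
    return sum;
  }

  /** Dot product: iterate only the smaller vector's nonzeros. */
  public static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
    if (a.size() > b.size()) {
      return dot(b, a);
    }
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        sum += e.getValue() * other;
      }
    }
    return sum;
  }

  /** Euclidean distance from two cached norms plus one sparse dot product. */
  public static double distance(double normA, Map<Integer, Double> a,
                                double normB, Map<Integer, Double> b) {
    // clamp to 0 to guard against tiny negative values from rounding
    return Math.sqrt(Math.max(0.0, normA - 2.0 * dot(a, b) + normB));
  }
}
```

The naive formulation iterates every dimension of both vectors per pair; this one touches only the nonzero overlap, which is where the 28-minute-to-98-second improvement reported above comes from.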




[jira] Updated: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-135:
---

Attachment: MAHOUT-135.patch

Patch that adds transpose and tests

> Allow FileDataModel to transpose users and items
> 
>
> Key: MAHOUT-135
> URL: https://issues.apache.org/jira/browse/MAHOUT-135
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-135.patch
>
>
> Sometimes it would be nice to flip around users and items in the 
> FileDataModel.  This patch adds a transpose boolean that flips userId and 
> itemId in the processLine method.




[jira] Created: (MAHOUT-135) Allow FileDataModel to transpose users and items

2009-06-18 Thread Grant Ingersoll (JIRA)
Allow FileDataModel to transpose users and items


 Key: MAHOUT-135
 URL: https://issues.apache.org/jira/browse/MAHOUT-135
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 0.2


Sometimes it would be nice to flip around users and items in the FileDataModel. 
 This patch adds a transpose boolean that flips userId and itemId in the 
processLine method.
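The described change can be sketched like this. Note this is an illustration of the idea only, with hypothetical names and a simplified "userID,itemID,rating" layout, not the actual FileDataModel patch:

```java
/**
 * Sketch of a transpose flag that swaps the user and item columns
 * while parsing a line, so items can be treated as "users" and vice versa.
 */
public class TransposingParser {

  private final boolean transpose;

  public TransposingParser(boolean transpose) {
    this.transpose = transpose;
  }

  /** Parse "userID,itemID,rating"; with transpose on, the first two swap. */
  public String[] processLine(String line) {
    String[] tokens = line.split(",");
    String userId = tokens[0];
    String itemId = tokens[1];
    if (transpose) {
      String tmp = userId;
      userId = itemId;
      itemId = tmp;
    }
    return new String[] {userId, itemId, tokens[2]};
  }
}
```

Keeping the swap inside line parsing means the rest of the data model is untouched, which is why the patch stays small.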




Re: MAHOUT-65

2009-06-18 Thread Jeff Eastman
Er, um, I see what you mean. How about just deleting the method? What 
really needs doing then is for all of the various clusters to themselves 
implement Writable so that they don't need to call asFormatString but 
can just emit themselves.

Jeff




Ted Dunning wrote:
> What does this method do?
>
> If the vector already implements Writable, what is the purpose of a
> conversion?
>
> On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman wrote:
>> Shall I change the method to asWritable()?


Re: MAHOUT-65

2009-06-18 Thread Ted Dunning
What does this method do?

If the vector already implements Writable, what is the purpose of a
conversion?

On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman wrote:

> Shall I change the method to asWritable()?




-- 
Ted Dunning, CTO
DeepDyve


Re: MAHOUT-65

2009-06-18 Thread David Hall
On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman wrote:
> Shall I change the method to asWritable()?

I'd just be for getting rid of it. Vector implements Writable, so
asWritable() could just be "return this;", which seems gratuitous

As for actual efficiency:
lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java
is currently dumping output values as text strings. If there's a
standard dataset, that would be an easy place to do the test.

- David

> I don't know of any situations where Vectors are used as keys. It hardly
> makes sense to use them as they are so unwieldy. Suggest we could change to
> just Writable and be ahead. In terms of the potential density improvement,
> it will be interesting to see what can typically be achieved.
>
> r786323 just removed all calls to asWritableComparable, replacing them with
> asFormatString which was correct anyway.
>

>
> Jeff
>
> David Hall wrote:
>>
>> How often does Mahout need the "Comparable" part for Vectors? Are
>> vectors commonly used as map output keys?
>>
>> In terms of space efficiency, I'd bet it's probably a bit better than
>> a factor of two in the average case, especially for densevectors. The
>> gson format is storing both the int index and the double as raw
>> strings, plus whatever boundary characters.  The writable
>> implementation stores just the bytes of the double, plus a length.
>>
>> -- David
>>
>> On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman
>> wrote:
>>
>>>
>>> +1 asWritableComparable is a simple implementation that uses
>>> asFormatString.
>>> It would be good to rewrite it for internal communication. A factor of
>>> two
>>> is still a factor of two.
>>>
>>> Jeff
>>>
>>>
>>> Grant Ingersoll wrote:
>>>> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>>>>> Writable should be plenty!
>>>>
>>>> +1.  Still nice to have JSON for user facing though.
>>>>
>>>>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote:
>>>>>> See my followup on another thread (sorry for the schizophrenic
>>>>>> posting); Vector already implements Writable, so that's all I really
>>>>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>>>>> it.


Re: MAHOUT-65

2009-06-18 Thread Jeff Eastman
I don't know of any situations where Vectors are used as keys. It hardly 
makes sense to use them as they are so unwieldy. Suggest we could change 
to just Writable and be ahead. In terms of the potential density 
improvement, it will be interesting to see what can typically be achieved.


r786323 just removed all calls to asWritableComparable, replacing them 
with asFormatString which was correct anyway.


Shall I change the method to asWritable()?

Jeff

David Hall wrote:
> How often does Mahout need the "Comparable" part for Vectors? Are
> vectors commonly used as map output keys?
>
> In terms of space efficiency, I'd bet it's probably a bit better than
> a factor of two in the average case, especially for densevectors. The
> gson format is storing both the int index and the double as raw
> strings, plus whatever boundary characters.  The writable
> implementation stores just the bytes of the double, plus a length.
>
> -- David
>
> On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman wrote:
>> +1 asWritableComparable is a simple implementation that uses asFormatString.
>> It would be good to rewrite it for internal communication. A factor of two
>> is still a factor of two.
>>
>> Jeff
>>
>> Grant Ingersoll wrote:
>>> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>>>> Writable should be plenty!
>>>
>>> +1.  Still nice to have JSON for user facing though.
>>>
>>>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote:
>>>>> See my followup on another thread (sorry for the schizophrenic
>>>>> posting); Vector already implements Writable, so that's all I really
>>>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>>>> it.


Re: MAHOUT-65

2009-06-18 Thread David Hall
How often does Mahout need the "Comparable" part for Vectors? Are
vectors commonly used as map output keys?

In terms of space efficiency, I'd bet it's probably a bit better than
a factor of two in the average case, especially for densevectors. The
gson format is storing both the int index and the double as raw
strings, plus whatever boundary characters.  The writable
implementation stores just the bytes of the double, plus a length.

-- David
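The length-plus-raw-bytes layout David describes can be sketched with plain DataOutput. This is an illustration of the size argument, not Mahout's actual DenseVector.write(): the binary form costs 4 + 8n bytes, versus the index and value strings (plus boundary characters) a JSON encoding stores.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/**
 * Writable-style binary codec sketch for a dense vector:
 * an int length prefix followed by the raw 8-byte doubles.
 */
public class DenseVectorCodec {

  public static byte[] write(double[] values) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(values.length);   // 4-byte length prefix
      for (double v : values) {
        out.writeDouble(v);          // 8 raw bytes per element
      }
      out.close();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new RuntimeException(e); // cannot happen for in-memory streams
    }
  }

  public static double[] readFields(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      double[] values = new double[in.readInt()];
      for (int i = 0; i < values.length; i++) {
        values[i] = in.readDouble();
      }
      return values;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
```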

On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman wrote:
> +1 asWritableComparable is a simple implementation that uses asFormatString.
> It would be good to rewrite it for internal communication. A factor of two
> is still a factor of two.
>
> Jeff
>
>
> Grant Ingersoll wrote:
>>
>> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>>
>>> Writable should be plenty!
>>>
>>
>> +1.  Still nice to have JSON for user facing though.
>>
>>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall  wrote:
>>>
>>>> See my followup on another thread (sorry for the schizophrenic
>>>> posting); Vector already implements Writable, so that's all I really
>>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>>> it.


[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

2009-06-18 Thread David Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hall updated MAHOUT-123:
--

Attachment: MAHOUT-123.patch

(Still in progress.)

It seems to work, but it's much too slow because I underestimated the badness of 
using DenseVectors. Switching to an element-wise system now.



> Implement Latent Dirichlet Allocation
> -
>
> Key: MAHOUT-123
> URL: https://issues.apache.org/jira/browse/MAHOUT-123
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.2
>Reporter: David Hall
>Assignee: Grant Ingersoll
> Fix For: 0.2
>
> Attachments: lda.patch, MAHOUT-123.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (For GSoC)
> Abstract:
> Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
> algorithm for automatically and jointly clustering words into "topics"
> and documents into mixtures of topics, and it has been successfully
> applied to model change in scientific fields over time (Griffiths and
> Steyvers, 2004; Hall, et al. 2008). In this project, I propose to
> implement a distributed variant of Latent Dirichlet Allocation using
> MapReduce, and, time permitting, to investigate extensions of LDA and
> possibly more efficient algorithms for distributed inference.
> Detailed Description:
> A topic model is, roughly, a hierarchical Bayesian model that
> associates with each document a probability distribution over
> "topics", which are in turn distributions over words. For instance, a
> topic in a collection of newswire might include words about "sports",
> such as "baseball", "home run", "player", and a document about steroid
> use in baseball might include "sports", "drugs", and "politics". Note
> that the labels "sports", "drugs", and "politics", are post-hoc labels
> assigned by a human, and that the algorithm itself only assigns
> associate words with probabilities. The task of parameter estimation
> in these models is to learn both what these topics are, and which
> documents employ them in what proportions.
> One of the promises of unsupervised learning algorithms like Latent
> Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take a 
> massive collection of documents and condense it down into a
> collection of easily understandable topics. However, all available
> open source implementations of LDA and related topics models are not
> distributed, which hampers their utility. This project seeks to
> correct this shortcoming.
> In the literature, there have been several proposals for parallelizing
> LDA. Newman, et al (2007) proposed to create an "approximate" LDA in
> which each processor gets its own subset of the documents to run
> Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
> its very nature, which is not advantageous for repeated runs. Instead,
> I propose to follow Nallapati, et al. (2007) and use a variational
> approximation that is fast and non-random.
> References:
> David M. Blei, J McAuliffe. Supervised Topic Models. NIPS, 2007.
> David M. Blei , Andrew Y. Ng , Michael I. Jordan, Latent dirichlet
> allocation, The Journal of Machine Learning Research, 3, p.993-1022,
> 3/1/2003
> T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
> Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
> David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
> the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
> Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
> variational EM for Latent Dirichlet Allocation: An experimental
> evaluation of speed and scalability, ICDM workshop on high performance
> data mining, 2007.
> Newman, D., Asuncion, A., Smyth, P., & Welling, M. Distributed
> Inference for Latent Dirichlet Allocation. NIPS, 2007.
> Xuerui Wang , Andrew McCallum, Topics over time: a non-Markov
> continuous-time model of topical trends. KDD, 2006
> Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
> large datasets. ICML, 2008.




Re: MAHOUT-65

2009-06-18 Thread Jeff Eastman
+1 asWritableComparable is a simple implementation that uses 
asFormatString. It would be good to rewrite it for internal 
communication. A factor of two is still a factor of two.


Jeff


Grant Ingersoll wrote:
> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>> Writable should be plenty!
>
> +1.  Still nice to have JSON for user facing though.
>
>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote:
>>> See my followup on another thread (sorry for the schizophrenic
>>> posting); Vector already implements Writable, so that's all I really
>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>> it.


Re: MAHOUT-65

2009-06-18 Thread Grant Ingersoll


On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:

> Writable should be plenty!

+1.  Still nice to have JSON for user facing though.

> On Thu, Jun 18, 2009 at 1:15 PM, David Hall wrote:
>> See my followup on another thread (sorry for the schizophrenic
>> posting); Vector already implements Writable, so that's all I really
>> can ask of it. Is there something more you'd like? I'd be happy to do
>> it.

Re: MAHOUT-65

2009-06-18 Thread Ted Dunning
Writable should be plenty!

On Thu, Jun 18, 2009 at 1:15 PM, David Hall  wrote:

> See my followup on another thread (sorry for the schizophrenic
> posting); Vector already implements Writable, so that's all I really
> can ask of it. Is there something more you'd like? I'd be happy to do
> it.
>
>


Re: MAHOUT-65

2009-06-18 Thread David Hall
See my followup on another thread (sorry for the schizophrenic
posting); Vector already implements Writable, so that's all I really
can ask of it. Is there something more you'd like? I'd be happy to do
it.

-- David

On Thu, Jun 18, 2009 at 1:11 PM, Ted Dunning wrote:
> +10!!!
>
> How would you like to do it?  Something like avro?  Thrift?  Homespun?
>
> On Thu, Jun 18, 2009 at 12:01 PM, David Hall  wrote:
>
>> Would anyone be interested in a "compressed" serialization for
>> DenseVector/SparseVector that follows in the vein of
>> hadoop.io.Writable? The space overhead for gson (parsing issues
>> not-withstanding) is pretty high, and it wouldn't be terribly hard to
>> implement a high-performance thing for vectors.
>>
>


Re: MAHOUT-65

2009-06-18 Thread Ted Dunning
+10!!!

How would you like to do it?  Something like avro?  Thrift?  Homespun?

On Thu, Jun 18, 2009 at 12:01 PM, David Hall  wrote:

> Would anyone be interested in a "compressed" serialization for
> DenseVector/SparseVector that follows in the vein of
> hadoop.io.Writable? The space overhead for gson (parsing issues
> not-withstanding) is pretty high, and it wouldn't be terribly hard to
> implement a high-performance thing for vectors.
>


Re: MAHOUT-65

2009-06-18 Thread David Hall
oh, wow, never mind. Vector implements Writable.

Sorry everyone.

-- David

On Thu, Jun 18, 2009 at 12:19 PM, David Hall wrote:
> actually, it looks like someone went to all the trouble to make both
> SparseVector and DenseVector have all the methods required by
> Writable, but they don't implement Writable.
>
> Could I just make Vector extend Writable?
>
> -- David
>
> On Thu, Jun 18, 2009 at 12:01 PM, David Hall wrote:
>> following up on my earlier email.
>>
>> Would anyone be interested in a "compressed" serialization for
>> DenseVector/SparseVector that follows in the vein of
>> hadoop.io.Writable? The space overhead for gson (parsing issues
>> not-withstanding) is pretty high, and it wouldn't be terribly hard to
>> implement a high-performance thing for vectors.
>>
>> -- David
>>
>> On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman 
>> wrote:
>>> +1, you added name constructors that I didn't have and the equals/equivalent
>>> stuff. Ya, Gson makes it all pretty trivial once you grok it.
>>>
>>>
>>> Grant Ingersoll wrote:
>>>> Shall I take that as approval of the approach?
>>>>
>>>> BTW, the Gson stuff seems like a winner for serialization.
>>>>
>>>> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:
>>>>> You gonna commit your patch? I agree with shortening the class name in
>>>>> the JsonVectorAdapter and will do it once you commit ur stuff.
>>>>> Jeff


Re: MAHOUT-65

2009-06-18 Thread David Hall
actually, it looks like someone went to all the trouble to make both
SparseVector and DenseVector have all the methods required by
Writable, but they don't implement Writable.

Could I just make Vector extend Writable?

-- David

On Thu, Jun 18, 2009 at 12:01 PM, David Hall wrote:
> following up on my earlier email.
>
> Would anyone be interested in a "compressed" serialization for
> DenseVector/SparseVector that follows in the vein of
> hadoop.io.Writable? The space overhead for gson (parsing issues
> not-withstanding) is pretty high, and it wouldn't be terribly hard to
> implement a high-performance thing for vectors.
>
> -- David
>
> On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman 
> wrote:
>> +1, you added name constructors that I didn't have and the equals/equivalent
>> stuff. Ya, Gson makes it all pretty trivial once you grok it.
>>
>>
>> Grant Ingersoll wrote:
>>>
>>> Shall I take that as approval of the approach?
>>>
>>> BTW, the Gson stuff seems like a winner for serialization.
>>>
>>> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:
>>>
>>>> You gonna commit your patch? I agree with shortening the class name in
>>>> the JsonVectorAdapter and will do it once you commit ur stuff.
>>>> Jeff
>>>
>>>
>>>
>>>
>>
>>
>


Re: MAHOUT-65

2009-06-18 Thread David Hall
following up on my earlier email.

Would anyone be interested in a "compressed" serialization for
DenseVector/SparseVector that follows in the vein of
hadoop.io.Writable? The space overhead for gson (parsing issues
not-withstanding) is pretty high, and it wouldn't be terribly hard to
implement a high-performance thing for vectors.

-- David

On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman wrote:
> +1, you added name constructors that I didn't have and the equals/equivalent
> stuff. Ya, Gson makes it all pretty trivial once you grok it.
>
>
> Grant Ingersoll wrote:
>>
>> Shall I take that as approval of the approach?
>>
>> BTW, the Gson stuff seems like a winner for serialization.
>>
>> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:
>>
>>> You gonna commit your patch? I agree with shortening the class name in
>>> the JsonVectorAdapter and will do it once you commit ur stuff.
>>> Jeff
>>
>>
>>
>>
>
>


GSON stack overflows

2009-06-18 Thread David Hall
GSON's parser is apparently not tail recursive. Opinions? In the
meantime, I'm going to consider an alternative implementation
that doesn't involve serializing huge vectors.

-- David

java.io.IOException: Spill failed
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:573)
  at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:65)
  at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:48)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: com.google.gson.JsonParseException: Failed parsing JSON
source: java.io.StringReader@558964ad to Json
  at com.google.gson.JsonParser.parse(JsonParser.java:59)
  at com.google.gson.Gson.fromJson(Gson.java:376)
  at com.google.gson.Gson.fromJson(Gson.java:329)
  at com.google.gson.Gson.fromJson(Gson.java:305)
  at 
org.apache.mahout.matrix.JsonVectorAdapter.deserialize(JsonVectorAdapter.java:69)
  at 
org.apache.mahout.matrix.JsonVectorAdapter.deserialize(JsonVectorAdapter.java:35)
  at 
com.google.gson.JsonDeserializerExceptionWrapper.deserialize(JsonDeserializerExceptionWrapper.java:50)
  at 
com.google.gson.JsonDeserializationVisitor.visitUsingCustomHandler(JsonDeserializationVisitor.java:65)
  at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:96)
  at 
com.google.gson.JsonDeserializationContextDefault.fromJsonObject(JsonDeserializationContextDefault.java:73)
  at 
com.google.gson.JsonDeserializationContextDefault.deserialize(JsonDeserializationContextDefault.java:49)
  at com.google.gson.Gson.fromJson(Gson.java:379)
  at com.google.gson.Gson.fromJson(Gson.java:329)
  at 
org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:326)
  at 
org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:310)
  at org.apache.mahout.clustering.lda.LDAReducer.reduce(LDAReducer.java:47)
  at org.apache.mahout.clustering.lda.LDAReducer.reduce(LDAReducer.java:40)
  at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:1116)
  at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:989)
  at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:401)
  at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:886)
Caused by: java.lang.StackOverflowError
  at com.google.gson.JsonParserJavacc.jj_3R_4(JsonParserJavacc.java:387)
  at com.google.gson.JsonParserJavacc.jj_3R_3(JsonParserJavacc.java:394)
  at com.google.gson.JsonParserJavacc.jj_3R_1(JsonParserJavacc.java:414)
  at com.google.gson.JsonParserJavacc.jj_3_1(JsonParserJavacc.java:400)
 at com.google.gson.JsonParserJavacc.jj_2_1(JsonParserJavacc.java:381)
  at com.google.gson.JsonParserJavacc.JsonNumber(JsonParserJavacc.java:229)
  at com.google.gson.JsonParserJavacc.JsonValue(JsonParserJavacc.java:166)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:142)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
  at com.google.gson.JsonParserJavacc.Elements(JsonParserJavacc.java:146)
 (etc)
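The repeated Elements frames in the trace show the failure mode: a recursive-descent grammar rule that consumes one array element and then calls itself grows the stack linearly with vector length. A toy illustration (not gson code) of the recursive versus iterative traversal:

```java
import java.util.List;

/**
 * Illustration of why per-element recursion overflows on huge vectors:
 * the recursive walk uses one stack frame per element (depth O(n)),
 * while the iterative walk over the same tokens uses constant stack depth.
 */
public class RecursionDepthDemo {

  /** Recursive descent in the style of a JavaCC Elements rule -- depth O(n). */
  public static int elementsRecursive(List<Double> tokens, int pos) {
    if (pos >= tokens.size()) {
      return 0;
    }
    // one new frame per remaining element
    return 1 + elementsRecursive(tokens, pos + 1);
  }

  /** Same traversal rewritten as a loop -- depth O(1). */
  public static int elementsIterative(List<Double> tokens) {
    int count = 0;
    for (int pos = 0; pos < tokens.size(); pos++) {
      count++;
    }
    return count;
  }
}
```

With a vector of tens of thousands of elements, the recursive form exceeds the default thread stack, which matches the StackOverflowError above; a binary format like Writable sidesteps the parse entirely.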


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721351#action_12721351
 ] 

Grant Ingersoll commented on MAHOUT-126:


Yep, you are right.  I committed your patch anyway.  We should probably add 
command-line support for setting minDF and maxDF.

> Prepare document vectors from the text
> --
>
> Key: MAHOUT-126
> URL: https://issues.apache.org/jira/browse/MAHOUT-126
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.2
>Reporter: Shashikant Kore
>Assignee: Grant Ingersoll
> Fix For: 0.2
>
> Attachments: mahout-126-benson.patch, 
> MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
> MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
> MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  
> Generating these document vectors from the text can be broken into two tasks. 
> 1. Create a Lucene index of the input plain-text documents. 
> 2. From the index, generate the (sparse) document vectors with weights as the 
> TF-IDF values of the terms. With a Lucene index, this value can be calculated 
> very easily. 
> Presently, I have created two separate utilities, which could possibly be 
> invoked from another class. 




[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-18 Thread David Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721346#action_12721346
 ] 

David Hall commented on MAHOUT-126:
---

That's not the only time. This constructor clearly lets certain things slip 
through.

{code}
  public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
    this.field = field;
    TermEnum te = reader.terms(new Term(field, ""));
    int count = 0;
    int numDocs = reader.numDocs();
    double percent = numDocs * maxDfPercent / 100.0;
    //Should we use a linked hash map so that we know terms are in order?
    termEntries = new LinkedHashMap<String, TermEntry>();
    do {
      Term term = te.term();
      if (term == null || term.field().equals(field) == false) {
        break;
      }
      int df = te.docFreq();
      if (df < minDf || df > percent) {
        continue;
      }
      TermEntry entry = new TermEntry(term.text(), count++, df);
      termEntries.put(entry.term, entry);
    } while (te.next());
    te.close();
  }
{code}

My code is essentially Lucene's demo indexing code (IndexFiles.java and 
FileDocument.java: 
http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.java&q=org.apache.lucene.demo.IndexFiles
), except that I replaced
{code}doc.add(new Field("contents", new FileReader(f)));{code}

with
{code}   doc.add(new Field("contents", new 
FileReader(f),Field.TermVector.YES));{code}

I then ran {code} java -cp  org.apache.lucene.demo.IndexFiles 
/Users/dlwh/txt-reuters/ {code}

and then {code} java -cp  org.apache.mahout.utils.vectors.Driver 
--dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t 
/Users/dlwh/dict --weight TF {code}

For what it's worth, it gives a null on "reuters", which is not usually a stop 
word, except that every single document ends with it, and so the IDF filtering 
above is catching it.
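The filtering condition in the constructor quoted above can be reduced to a tiny model (hypothetical names, same formula): a term survives only if minDf <= df <= numDocs * maxDfPercent / 100. A token like "reuters" that appears in every document has df == numDocs, so any maxDfPercent below 100 drops it, producing the null lookup.

```java
/**
 * Minimal model of the document-frequency filter: keep a term only if its
 * df falls between the minimum count and the maximum-percent threshold.
 */
public class DfFilter {

  public static boolean keep(int df, int numDocs, int minDf, int maxDfPercent) {
    double percent = numDocs * maxDfPercent / 100.0; // same formula as the snippet
    return df >= minDf && df <= percent;
  }
}
```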



> Prepare document vectors from the text
> --
>
> Key: MAHOUT-126
> URL: https://issues.apache.org/jira/browse/MAHOUT-126
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.2
>Reporter: Shashikant Kore
>Assignee: Grant Ingersoll
> Fix For: 0.2
>
> Attachments: mahout-126-benson.patch, 
> MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
> MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
> MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  
> Generating these document vectors from the text can be broken into two tasks. 
> 1. Create a Lucene index of the input plain-text documents. 
> 2. From the index, generate the (sparse) document vectors with weights as the 
> TF-IDF values of the terms. With a Lucene index, this value can be calculated 
> very easily. 
> Presently, I have created two separate utilities, which could possibly be 
> invoked from another class. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-121) Speed up distance calculations for sparse vectors

2009-06-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-121:
-

Attachment: MAHOUT-121.patch

Not sure if my very latest version of the patch got posted. Here it is. It is 
relative to the root rather than trunk/ -- it seems my hand editing didn't work.

> Speed up distance calculations for sparse vectors
> -
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
>  Issue Type: Improvement
>  Components: Matrix
>Reporter: Shashikant Kore
> Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, 
> MAHOUT-121.patch, MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, 
> Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. 
> The complete dataset has a few tens of thousands of feature items, but each 
> vector has only a couple of hundred feature items. For this, there is an 
> optimization in distance calculation, a link to which I found in the archives 
> of the Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization.  The test setup had 2000 document vectors 
> with a few hundred items each.  I ran canopy generation with Euclidean 
> distance and t1, t2 values of 250 and 200.
>  
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer and Double objects instead of primitives 
> is computationally expensive. I changed the sparse vector implementation to 
> use primitive collections from Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans. 
>  
> Licensing of Trove seems to be an issue which needs to be addressed.
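The optimization in the linked post expands the squared Euclidean distance as ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, precomputing each vector's norm once so that every pairwise distance only needs a sparse dot product over the indices present in both vectors. A stdlib-only sketch of the idea (Map-based and purely illustrative; the actual patch works against Mahout's Vector classes and, per the figures above, Trove primitive maps):

```java
import java.util.HashMap;
import java.util.Map;

public class SparseDistanceDemo {
    // Precompute ||v||^2 once per vector; reuse it for every pairwise distance.
    static double normSquared(Map<Integer, Double> v) {
        double s = 0.0;
        for (double x : v.values()) s += x * x;
        return s;
    }

    // ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2. The dot product iterates only
    // over the smaller vector's entries, skipping all the implicit zeros.
    static double distanceSquared(Map<Integer, Double> a, double normA,
                                  Map<Integer, Double> b, double normB) {
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        return normA + normB - 2.0 * dot;
    }

    public static void main(String[] args) {
        Map<Integer, Double> a = new HashMap<>();
        a.put(0, 1.0); a.put(5, 2.0);
        Map<Integer, Double> b = new HashMap<>();
        b.put(5, 2.0); b.put(9, 3.0);
        // Naive check: (1-0)^2 + (2-2)^2 + (0-3)^2 = 10
        System.out.println(distanceSquared(a, normSquared(a), b, normSquared(b)));
    }
}
```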




[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

2009-06-18 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721215#action_12721215
 ] 

Grant Ingersoll commented on MAHOUT-126:


Hey David,

I'm not sure what's going on here, because that value being null means the term 
is not in the index, yet it is in the Term Vector for that doc.  Are you sure 
you're loading the same field?  Can you share the indexing code?

This fix works, but I'd still like to know at a deeper level what's going on.

> Prepare document vectors from the text
> --
>
> Key: MAHOUT-126
> URL: https://issues.apache.org/jira/browse/MAHOUT-126
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.2
>Reporter: Shashikant Kore
>Assignee: Grant Ingersoll
> Fix For: 0.2
>
> Attachments: mahout-126-benson.patch, 
> MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
> MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
> MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  
> Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as 
> TF-IDF values of the term. With lucene index, this value can be calculated 
> very easily. 
> Presently, I have created two separate utilities, which could possibly be 
> invoked from another class. 




Re: [GSOC] Thoughts about Random forests map-reduce implementation

2009-06-18 Thread deneche abdelhakim

Ok then, I shall implement the easy mapreduce version and see how it behaves.

> Ultimately, I would think that it is also interesting to modify the 
> original algorithm to build multiple trees for different portions of the 
> data.  That loses some of the solidity of the original method, but could 
> actually do better if the splits exposed non-stationary behavior.

Very interesting, and it could make the map-reduce implementation capable of 
dealing with very large datasets. When you say:

"build multiple trees for different portions of the data"

what is the difference from the basic bagging algorithm, which builds each 
tree using a different portion (about 2/3) of the data?
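For contrast: bagging draws each tree's training set by sampling n rows uniformly with replacement from the whole dataset (leaving roughly a third of the rows out of any given sample), whereas the per-mapper idea trains each tree on one small contiguous input slice. An illustrative stdlib-only sketch of the two sampling schemes (not the Mahout implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SplitVsBootstrapDemo {
    // Bagging: each tree trains on n row indices drawn with replacement
    // from ALL n rows, so every region of the data can contribute.
    static List<Integer> bootstrap(int n, Random rng) {
        List<Integer> sample = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            sample.add(rng.nextInt(n));
        }
        return sample;
    }

    // Map-reduce split: tree t only ever sees one contiguous slice of rows.
    static List<Integer> contiguousSplit(int n, int numSplits, int t) {
        List<Integer> slice = new ArrayList<>();
        int size = n / numSplits;
        for (int i = t * size; i < (t + 1) * size; i++) {
            slice.add(i);
        }
        return slice;
    }

    public static void main(String[] args) {
        int n = 1000;
        // A bootstrap sample contains about 63% distinct rows, drawn from everywhere.
        long distinct = bootstrap(n, new Random(42)).stream().distinct().count();
        System.out.println(distinct > 550 && distinct < 700);
        // A contiguous split is purely local: split 1 of 4 is exactly rows 250..499.
        List<Integer> slice = contiguousSplit(n, 4, 1);
        System.out.println(slice.get(0) + ".." + slice.get(slice.size() - 1));
    }
}
```

If the input splits are both small and contiguous, each tree sees a narrow, possibly non-representative window of the data, which is exactly why the results can differ from bagging.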

--- En date de : Mer 17.6.09, Ted Dunning  a écrit :

> De: Ted Dunning 
> Objet: Re: [GSOC] Thoughts about Random forests map-reduce implementation
> À: mahout-dev@lucene.apache.org
> Date: Mercredi 17 Juin 2009, 21h10
> This is a classic problem of scaling
> a solution as the problem gets wide
> (number of trees) and tall (amount of data).
> 
> The problem of building a random forest on a large data set
> with N trees is
> N times the cost on a single node (as you point out) and N
> is typically
> about the number of cores available in a hadoop cluster or
> a small multiple
> thereof.  This means that your simple solution would
> give essentially
> perfect speed up if the data set still fits in
> memory.  That means that a
> simple implementation is likely to be of some use.
> 
> On the other hand, it sounds like your Information Gain
> computation has some
> real performance problems that probably should be
> addressed.
> 
> Ultimately, I would think that it is also interesting to
> modify the original
> algorithm to build multiple trees for different portions of
> the data.  That
> loses some of the solidity of the original method, but
> could actually do
> better if the splits exposed non-stationary behavior.
> 
> On Wed, Jun 17, 2009 at 3:45 AM, deneche abdelhakim wrote:
> 
> >
> > As we talked about in the following discussion (A),
> I'm considering two
> > ways to implement a distributed map-reduce builder.
> >
> > Given the reference implementation, the easiest
> implementation is the
> > following:
> >
> > * the data is distributed to the slave nodes using the
> DistributedCache
> > * each mapper loads the data in memory in
> > JobConfigurable.configure()
> > * each tree is built by one mapper
> > ...
> 
> * the main program builds the forest using
> DecisionTree.parse(String) for
> > each tree
> > ...
> > Cons:
> > * because it's based on the ref. implementation, it will be very slow when
> > dealing with large datasets
>