[jira] [Resolved] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions

2014-11-15 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan resolved MAHOUT-1616.
--
Resolution: Fixed

> Better support for hadoop dependencies of multiple versions 
> 
>
> Key: MAHOUT-1616
> URL: https://issues.apache.org/jira/browse/MAHOUT-1616
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Reporter: Gokhan Capan
>Assignee: Gokhan Capan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks

2014-11-15 Thread Gokhan Capan (JIRA)
Gokhan Capan created MAHOUT-1626:


 Summary: Support for required quasi-algebraic operations and 
starting with aggregating rows/blocks
 Key: MAHOUT-1626
 URL: https://issues.apache.org/jira/browse/MAHOUT-1626
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 1.0
Reporter: Gokhan Capan
 Fix For: 1.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-10-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155309#comment-14155309
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Correct

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
> 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-10-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154937#comment-14154937
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Jay, here is the documentation:

http://mahout.apache.org/developers/buildingmahout.html

The instructions apply to trunk, not to the 0.9 release.

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
> 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-10-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154918#comment-14154918
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Jay,

This is integrated in trunk, not in 0.9, and should work. Also, you may find 
MAHOUT-1616 useful; it tracks a recent simplification and further improvement 
effort.

Best

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
> 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions

2014-09-26 Thread Gokhan Capan (JIRA)
Gokhan Capan created MAHOUT-1616:


 Summary: Better support for hadoop dependencies of multiple 
versions 
 Key: MAHOUT-1616
 URL: https://issues.apache.org/jira/browse/MAHOUT-1616
 Project: Mahout
  Issue Type: Improvement
  Components: build
Reporter: Gokhan Capan
Assignee: Gokhan Capan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-07-15 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062041#comment-14062041
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

Sorry guys, I committed this 2 weeks ago, but I forgot to close the issue. 
Thank you, [~nravi]

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
> Fix For: 1.0
>
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-07-15 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan resolved MAHOUT-1565.
--

Resolution: Fixed

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
>Assignee: Gokhan Capan
> Fix For: 1.0
>
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-07-15 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan reassigned MAHOUT-1565:


Assignee: Gokhan Capan

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
>Assignee: Gokhan Capan
> Fix For: 1.0
>
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016998#comment-14016998
 ] 

Gokhan Capan commented on MAHOUT-1529:
--

Alright, I'm sold.

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016565#comment-14016565
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Brian, 
This was actually well-tested. But I'm gonna build and test it again, probably 
tomorrow. 
By the way, can you run
{{$ find . -name hadoop*.jar}}

after building Mahout, in the Mahout root directory?
Best

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
> 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016378#comment-14016378
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

We agree, conceptually, but this needs some further testing.

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
> Fix For: 1.0
>
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016372#comment-14016372
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Seems like the dependencies are correctly set. Are you certain that the cluster 
you're running Mahout against is a Hadoop 2 and MR2 cluster?

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
> 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 3:03 PM:
--

[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for memory-based algorithms such 
as neighborhood-based recommendation. This could be a new persistent storage 
engineered to preserve locality for kNN, the new Solr backend potentially cast 
to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine- (or data-structure-) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (for example, Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching, cache it; if partitioned in the same way, do this, else do that; if 
one matrix is small, broadcast it; etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it currently 
is), a BatchExecution trait with methods for partitioning and parallel 
execution behavior, a Caching trait with methods for caching/uncaching 
behavior, and in the future a RandomAccess trait with methods for accessing 
rows and columns (and possibly cells). 

Then a concrete DRM (or the like) would be a Matrix with BatchExecution and 
possibly Caching, a concrete RandomAccessMatrix would be a Matrix with 
RandomAccess, and so on. What do you think? And if you and others are positive, 
how do you think this should be handled?
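
A minimal Scala sketch of the trait layering described above (the trait, class, 
and method names here are illustrative assumptions, not existing Mahout API):

trait Matrix {
  def nrow: Long
  def ncol: Long
}

// Partitioned, parallel (batch) execution capability.
trait BatchExecution { self: Matrix =>
  def numPartitions: Int
}

// Caching/uncaching capability, e.g. backed by an engine's storage levels.
trait Caching { self: Matrix =>
  def cache(): this.type = this
  def uncache(): this.type = this
}

// Fast row/column access capability for memory-based algorithms.
trait RandomAccess { self: Matrix =>
  def row(i: Int): Array[Double]
  def col(j: Int): Array[Double]
}

// A DRM-like matrix mixes in batch execution and possibly caching;
// a random-access matrix mixes in RandomAccess instead.
class SparkBackedDrm(val nrow: Long, val ncol: Long, val numPartitions: Int)
  extends Matrix with BatchExecution with Caching

class InMemoryMatrix(data: Array[Array[Double]])
  extends Matrix with RandomAccess {
  def nrow: Long = data.length.toLong
  def ncol: Long = if (data.isEmpty) 0L else data(0).length.toLong
  def row(i: Int): Array[Double] = data(i)
  def col(j: Int): Array[Double] = data.map(_(j))
}

A planner could then branch on these capabilities (cache only matrices that mix 
in Caching, broadcast only small ones, and so on), which is the planning 
benefit mentioned above.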


was (Author: gokhancapan):
[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood based recommendation. This could be a new persistent storage 
engineered for locality preservation of kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we could want to add different types of distributed 
matrices with engine (or data structure) specific strengths in the future. I 
suggest turning each bahavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mixin to her 
concrete implementation (For example Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching cache it, if partitioned in the same way do this else do this, if one 
matrix is small broadcast it etc.). 

So I suggest a  a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
execution in parallel behavior, a Caching trait with methods for 
caching/uncaching behavior, in the future a RandomAccess trait with methods for 
accessing rows and columns (and possibly cells) functionality. 

Then a concrete DRM (like) would be a Matrix with BatchExecution and possibly 
Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and 
so on. What do you think and if you and others are positive, how do you think 
that should be handled?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 2:55 PM:
--

[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood-based recommendation. This could be a new persistent storage 
engineered to preserve locality for kNN, the new Solr backend potentially cast 
to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine- (or data-structure-) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (for example, Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching, cache it; if partitioned in the same way, do this, else do that; if 
one matrix is small, broadcast it; etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it currently 
is), a BatchExecution trait with methods for partitioning and parallel 
execution behavior, a Caching trait with methods for caching/uncaching 
behavior, and in the future a RandomAccess trait with methods for accessing 
rows and columns (and possibly cells). 

Then a concrete DRM (or the like) would be a Matrix with BatchExecution and 
possibly Caching, a concrete RandomAccessMatrix would be a Matrix with 
RandomAccess, and so on. What do you think? And if you and others are positive, 
how do you think this should be handled?


was (Author: gokhancapan):
[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood based recommendation. This could be a new persistent storage 
engineered for locality preservation of kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we could want to add different types of distributed 
matrices with engine (or data structure) specific strengths in the future. I 
suggest turning each bahavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mixin to her 
concrete implementation (For example Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching cache it, if partitioned in the same way do this else do this, if one 
matrix is small broadcast it etc.). 

So I suggest a  a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
execution in parallel behavior, a Caching trait with methods for 
caching/uncaching behavior, in the future a RandomAccess trait with methods for 
accessing rows and columns (and possibly cells) functionality. 

Then a concrete DRM (like) would be a Matrix with BatchOps and possibly 
CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, 
and so on. What do you think and if you and others are positive, how do you 
think that should be handled?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan commented on MAHOUT-1529:
--

[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood-based recommendation. This could be a new persistent storage 
engineered to preserve locality for kNN, the new Solr backend potentially cast 
to a Matrix, or something else. 

Anyway, my point is that we may want to add different types of distributed 
matrices with engine- (or data-structure-) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (for example, Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching, cache it; if partitioned in the same way, do this, else do that; if 
one matrix is small, broadcast it; etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it currently 
is), a BatchExecution trait with methods for partitioning and parallel 
execution behavior, a Caching trait with methods for caching/uncaching 
behavior, and in the future a RandomAccess trait with methods for accessing 
rows and columns (and possibly cells). 

Then a concrete DRM (or the like) would be a Matrix with BatchOps and possibly 
CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, 
and so on. What do you think? And if you and others are positive, how do you 
think this should be handled?

> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-05-29 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012140#comment-14012140
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

Sorry, now I can read the patch properly. The MR1 versions of those 
configurations are already set in bin/mahout, and you're suggesting adding the 
MR2 versions of them, too, right?

I am personally not a fan of setting such configurations in Mahout, and I would 
remove them as well.

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-05-28 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012126#comment-14012126
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

I think there is no point in configuring output compression, the number of 
reducers, etc. for Mahout.

> add MR2 options to MAHOUT_OPTS in bin/mahout
> 
>
> Key: MAHOUT-1565
> URL: https://issues.apache.org/jira/browse/MAHOUT-1565
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 1.0, 0.9
>Reporter: Nishkam Ravi
> Attachments: MAHOUT-1565.patch
>
>
> MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
> those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-05-22 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005719#comment-14005719
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Please check http://mahout.apache.org/developers/buildingmahout.html for 
instructions on building Mahout against hadoop-2.

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
> 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-21 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan resolved MAHOUT-1534.
--

Resolution: Fixed

The instructions are now available on the BuildingMahout page: 
http://mahout.apache.org/developers/buildingmahout.html

> Add documentation for using Mahout with Hadoop2 to the website
> --
>
> Key: MAHOUT-1534
> URL: https://issues.apache.org/jira/browse/MAHOUT-1534
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Gokhan Capan
> Fix For: 1.0
>
>
> MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
> We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005663#comment-14005663
 ] 

Gokhan Capan commented on MAHOUT-1534:
--

We might want to add the link to the Mahout News, but let's wait and see 
whether users can locate the page.

> Add documentation for using Mahout with Hadoop2 to the website
> --
>
> Key: MAHOUT-1534
> URL: https://issues.apache.org/jira/browse/MAHOUT-1534
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Gokhan Capan
> Fix For: 1.0
>
>
> MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
> We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-21 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan reassigned MAHOUT-1534:


Assignee: Gokhan Capan

> Add documentation for using Mahout with Hadoop2 to the website
> --
>
> Key: MAHOUT-1534
> URL: https://issues.apache.org/jira/browse/MAHOUT-1534
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
>Assignee: Gokhan Capan
> Fix For: 1.0
>
>
> MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
> We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004662#comment-14004662
 ] 

Gokhan Capan commented on MAHOUT-1534:
--

[~ssc] I added the directions to the BuildingMahout page. If you're happy with 
the staged version, I'll "Publish Site".

> Add documentation for using Mahout with Hadoop2 to the website
> --
>
> Key: MAHOUT-1534
> URL: https://issues.apache.org/jira/browse/MAHOUT-1534
> Project: Mahout
>  Issue Type: Task
>  Components: Documentation
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
> We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2

2014-05-15 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996351#comment-13996351
 ] 

Gokhan Capan commented on MAHOUT-1550:
--

Paul,

Did you try building Mahout using the hadoop 2 profile first? The way to do it is:
mvn clean package -DskipTests=true -Dhadoop2.version=

Let us know if this fails

> Naive Bayes training fails with Hadoop 2
> 
>
> Key: MAHOUT-1550
> URL: https://issues.apache.org/jira/browse/MAHOUT-1550
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0
> Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2
>Reporter: Paul Marret
>Priority: Minor
>  Labels: bayesian, training
> Attachments: mahout-snapshot.patch, stacktrace.txt
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> When using the trainnb option of the program, we get the following error:
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.hadoop.mapreduce.JobContext, but class was expected
> at 
> org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
> at 
> org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
> at 
> org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100)
> [...]
> It is possible to correct this by modifying the file 
> mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and 
> converting the instance job (line 174) to a Job object (it is a JobContext in 
> the current version).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2

2014-05-13 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996351#comment-13996351
 ] 

Gokhan Capan edited comment on MAHOUT-1550 at 5/13/14 1:10 PM:
---

Paul,

Did you try building Mahout using the hadoop 2 profile first? The way to do it is:
mvn clean package -DskipTests=true -Dhadoop2.version=

Let us know if this fails


was (Author: gokhancapan):
Paul,

Did you try build mahout using hadoop 2 profile first? The way to do it is:
mvn clean package -DskipTests=true -Dhadoop2.version=

Let us know if this fails

> Naive Bayes training fails with Hadoop 2
> 
>
> Key: MAHOUT-1550
> URL: https://issues.apache.org/jira/browse/MAHOUT-1550
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 1.0
> Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2
>Reporter: Paul Marret
>Priority: Minor
>  Labels: bayesian, training
> Attachments: mahout-snapshot.patch, stacktrace.txt
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> When using the trainnb option of the program, we get the following error:
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.hadoop.mapreduce.JobContext, but class was expected
> at 
> org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
> at 
> org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
> at 
> org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100)
> [...]
> It is possible to correct this by modifying the file 
> mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and 
> converting the instance job (line 174) to a Job object (it is a JobContext in 
> the current version).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968254#comment-13968254
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

The thing is, it just 'loads' a Lucene index into memory as a matrix. You 
construct a matrix with the Lucene index directory location and that's it. So 
it is not a fix for the incremental document management issue.

The alternative approach is querying the index whenever a row/column vector, or 
a cell, is required. I, however, am not sure whether the SolrMatrix thing is 
fast enough for that.

I haven't been available lately; now I'm reading through the changes in, and 
proposals for, Mahout's future, and trying to set up my perspective for 
Mahout2. We can probably come up with a better way of storing documents (still 
Lucene/Solr based). Let me leave this as is for now, and then we can discuss 
the input formats further.

Is that OK for you?
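
For illustration, here is a small Scala sketch contrasting the two access 
styles discussed above (the trait, class, and method names are hypothetical, 
not the API of the attached patch):

// A source of term vectors, e.g. backed by a Lucene/Solr index.
trait TermVectorSource {
  def numDocs: Int
  def termVector(docId: Int): Map[String, Double]
}

// Approach 1: load the whole index into memory up front. Simple and fast to
// read, but it does not address incremental document management.
class EagerIndexMatrix(source: TermVectorSource) {
  private val rows: IndexedSeq[Map[String, Double]] =
    (0 until source.numDocs).map(source.termVector)
  def row(docId: Int): Map[String, Double] = rows(docId)
}

// Approach 2: query the index on demand. Always up to date, but every row
// access pays the cost of an index lookup, which is the open speed question.
class LazyIndexMatrix(source: TermVectorSource) {
  def row(docId: Int): Map[String, Double] = source.termVector(docId)
}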

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968221#comment-13968221
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

I personally like the idea of integrating additional storage layers as matrix 
inputs, but not the way I implemented it here.
After agreeing on the new algorithm layers, we can later move on to the 
additional input formats. 

So my vote is also for "Won't Fix".

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968148#comment-13968148
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Well, I can add this, but considering the current status of the project, I 
think this is no longer of interest to people.
What do you say, [~ssc]: should we mark it 'won't fix' or commit it?

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-03-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918159#comment-13918159
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Let me get the pieces together and submit a patch in a few days.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>Assignee: Gokhan Capan
>  Labels: gsoc2013, mentor
> Fix For: 1.0
>
> Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-27 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914494#comment-13914494
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Sure, I can.

Although my vote would be for passing the version: considering the different 
distributions out there, people may want to build Mahout against whatever 
hadoop2 distro they use. (I am not very sure about my own argument, actually; 
it would be great to hear a counter-argument.)

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
> 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-25 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911436#comment-13911436
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

I committed this to trunk

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-25 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1329:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908443#comment-13908443
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Good news: I tried that too, on a 2.2.0 cluster.
seqdir, seq2sparse, and kmeans worked without a problem.

I'm gonna wait till Monday to commit this, in case folks want to verify that it 
works.



> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126
 ] 

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:59 AM:
---

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster? [EDIT: Sorry, I missed that you 
mentioned you ran the examples; great, then.]



was (Author: gokhancapan):
Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907480#comment-13907480
 ] 

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:52 AM:
---

Sergey, I modified your patch and produced a new version. Looking into the 
dependency tree, it seems it builds against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m so that the unit tests do not fail with an OOM.)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop2.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't have 
access to a hadoop test environment currently, so could you guys test whether 
this actually works (I'll do it tomorrow anyway)? 

Then we can commit it.


was (Author: gokhancapan):
Sergey, I modified your patch and produced a new version. Looking into the 
dependency tree, it seems it builds against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m in order not the unit tests to fail because of an OOM)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0

I unit tested this for both versions and saw the tests passed, but I don't have 
access to a hadoop test environment currently, so could you guys test if this 
actually work (I'll do it tomorrow anyway)? 

Then we can commit it.

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan reassigned MAHOUT-1329:


Assignee: Gokhan Capan  (was: Suneel Marthi)

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Gokhan Capan
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1329:
-

Attachment: 1329-3.patch

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907480#comment-13907480
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Sergey, I modified your patch and produced a new version. Looking into the 
dependency tree, it seems it builds against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m so that the unit tests do not fail with an OOM.)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't have 
access to a hadoop test environment currently, so could you guys test whether 
this actually works (I'll do it tomorrow anyway)? 

Then we can commit it.

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329-3.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907237#comment-13907237
 ] 

Gokhan Capan edited comment on MAHOUT-1329 at 2/20/14 5:50 PM:
---

Hi Sergey, thank you for that. I am copying from MAHOUT-1354:

Gokhan: "Looks like when the hadoop-2 profile is activated, this patch fails to 
apply the hadoop-2 related dependencies to the integration and examples 
modules, even though they both depend on core and core depends on hadoop-2. For 
me, moving the hadoop dependencies to the root solved the problem, but I think 
we wouldn't want that, since hadoop is not a common dependency for all modules 
of the project."

Ted: "It is important to keep modules like mahout math free of the massive 
Hadoop dependency."

I think pushing dependencies to the root is not something that we desire, but 
let me look into this further.



was (Author: gokhancapan):
Hi Sergey, thank you for that, I am copying from MAHOUT-1354:

Gokhan: "Looks like when hadoop-2 profile is activated, this patch fails to 
apply the hadoop-2 related dependencies to integration and examples modules, 
despite they are both dependent to core and core is dependent to hadoop-2. For 
me, moving hadoop dependencies to the root solved the problem, but I think we 
wouldn't want that since hadoop is not a common dependency for all modules of 
the project."

Ted: "It is important to keep modules like mahout math free of the massive 
Hadoop dependency."

I think pushing dependencies to the root is not something that we desire I 
think, but let me look into this further.


> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907237#comment-13907237
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Hi Sergey, thank you for that. I am copying from MAHOUT-1354:

Gokhan: "Looks like when the hadoop-2 profile is activated, this patch fails to 
apply the hadoop-2 related dependencies to the integration and examples 
modules, even though they both depend on core and core depends on hadoop-2. For 
me, moving the hadoop dependencies to the root solved the problem, but I think 
we wouldn't want that, since hadoop is not a common dependency for all modules 
of the project."

Ted: "It is important to keep modules like mahout math free of the massive 
Hadoop dependency."

I think pushing dependencies to the root is not something that we desire, but 
let me look into this further.


> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329-2.patch, 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906062#comment-13906062
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Is it OK to add hadoop dependencies to the project root, and to the math module 
(actually to all modules, even though they already depend on the core module)?

I remember that's what we wanted to avoid.

> Mahout for hadoop 2
> ---
>
> Key: MAHOUT-1329
> URL: https://issues.apache.org/jira/browse/MAHOUT-1329
> Project: Mahout
>  Issue Type: Task
>  Components: build
>Affects Versions: 0.9
>Reporter: Sergey Svinarchuk
>Assignee: Suneel Marthi
>  Labels: patch
> Fix For: 1.0
>
> Attachments: 1329.patch
>
>
> Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-09 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843226#comment-13843226
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Yeah, I agree

> Mahout Support for Hadoop 2 
> 
>
> Key: MAHOUT-1354
> URL: https://issues.apache.org/jira/browse/MAHOUT-1354
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 1.0
>
> Attachments: MAHOUT-1354_initial.patch
>
>
> Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-09 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842960#comment-13842960
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Looks like when the hadoop-2 profile is activated, this patch fails to apply the 
hadoop-2 related dependencies to the integration and examples modules, even though they 
both depend on core and core depends on hadoop-2. For me, moving the 
hadoop dependencies to the root solved the problem, but I don't think we 
want that, since hadoop is not a common dependency for all modules of the 
project. 

CC'ing [~frankscholten]

> Mahout Support for Hadoop 2 
> 
>
> Key: MAHOUT-1354
> URL: https://issues.apache.org/jira/browse/MAHOUT-1354
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 1.0
>
> Attachments: MAHOUT-1354_initial.patch
>
>
> Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-03 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1354:
-

Attachment: MAHOUT-1354_initial.patch

Could you guys test this initial patch against clusters of different versions 
to see if it works?

Usage:
mahout against hadoop1 (version 1.2.1): 
mvn package

mahout against hadoop2-stable (version 2.2.0, by default): 
mvn package -Phadoop2 

mahout against hadoop2-earlier: 
mvn package -Phadoop2 -Dhadoop.version=


> Mahout Support for Hadoop 2 
> 
>
> Key: MAHOUT-1354
> URL: https://issues.apache.org/jira/browse/MAHOUT-1354
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 1.0
>
> Attachments: MAHOUT-1354_initial.patch
>
>
> Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837933#comment-13837933
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Today I had some trouble with integration's transitive dependencies; let me 
dig further.

So this should still stay in the 1.0 queue

> Mahout Support for Hadoop 2 
> 
>
> Key: MAHOUT-1354
> URL: https://issues.apache.org/jira/browse/MAHOUT-1354
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-02 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836965#comment-13836965
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Let me submit a patch first, probably tomorrow.
Best

> Mahout Support for Hadoop 2 
> 
>
> Key: MAHOUT-1354
> URL: https://issues.apache.org/jira/browse/MAHOUT-1354
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-02 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836953#comment-13836953
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Well, I tried something and want to share.

Based on:
In hadoop-2-stable, compatibility with hadoop-1 is preferred over compatibility with 
hadoop-2-alpha 
(http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html).
 For example, the return type of ProgramDriver#driver(String) was void in hadoop-1 
(which we use in MahoutDriver), int in hadoop-2-alpha, and void again in 
hadoop-2-stable. It seems that if we select the right artifacts, there is nothing to 
worry about regarding compatibility. 

My conclusion was:
The current hadoop-0.20 and hadoop-0.23 profiles can be utilized: we can rename 
them to hadoop-1 and hadoop-2, respectively, then make hadoop-2 (stable) the 
default profile, then set the hadoop.version property to 2.2.0. We need to 
worry about some third-party dependencies though; for instance, hbase-client in 
mahout-integration depends on hadoop-1 (for that particular artifact, 
simply excluding hadoop-core did not break any tests, by the way).

> Mahout Support for Hadoop 2 
> 
>
> Key: MAHOUT-1354
> URL: https://issues.apache.org/jira/browse/MAHOUT-1354
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-02 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836661#comment-13836661
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Do you think we should support hadoop-1 and hadoop-2 at the same time?

> Mahout Support for Hadoop 2 
> 
>
> Key: MAHOUT-1354
> URL: https://issues.apache.org/jira/browse/MAHOUT-1354
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.8
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-12-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836102#comment-13836102
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Let's "Won't Fix" this issue.

I think what we need to do is implement more sparse matrix (or similar) data 
structures for different access patterns, beyond the current map-of-maps 
approach. The ideas would apply to the current DataModel based on 2 FastByIDMaps.
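
For instance, a compact row-major (CSR-style) layout along these lines (purely 
illustrative, not existing Mahout code; it assumes column indices are sorted within 
each row) trades a binary search per lookup for a much smaller footprint than a map 
of maps:

final class CsrSketch {

  // Compressed sparse row layout: rowPointers has (numRows + 1) entries, and the
  // non-zeros of row r live in columnIndices/values at [rowPointers[r], rowPointers[r + 1]).
  private final int[] rowPointers;
  private final int[] columnIndices;
  private final double[] values;

  CsrSketch(int[] rowPointers, int[] columnIndices, double[] values) {
    this.rowPointers = rowPointers;
    this.columnIndices = columnIndices;
    this.values = values;
  }

  // Element lookup: binary search inside the row's slice of the arrays.
  double get(int row, int column) {
    int lo = rowPointers[row];
    int hi = rowPointers[row + 1] - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (columnIndices[mid] == column) {
        return values[mid];
      } else if (columnIndices[mid] < column) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return 0.0;   // treated as a missing preference
  }
}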



 

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
> Semifinal-implementation-added.patch, benchmark.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-10-26 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806106#comment-13806106
 ] 

Gokhan Capan edited comment on MAHOUT-1286 at 10/26/13 2:13 PM:


Peng,

I am attaching a patch (not to be committed) that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.


was (Author: gokhancapan):
Peng,

I am attaching a patch --not to be committed-- that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: benchmark.patch, InMemoryDataModel.java, 
> InMemoryDataModelTest.java, Semifinal-implementation-added.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-10-26 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806106#comment-13806106
 ] 

Gokhan Capan edited comment on MAHOUT-1286 at 10/26/13 2:13 PM:


Peng,

I am attaching a patch --not to be committed-- that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.


was (Author: gokhancapan):
Peng,

I am attaching a patch -not to be committed- that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: benchmark.patch, InMemoryDataModel.java, 
> InMemoryDataModelTest.java, Semifinal-implementation-added.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-10-26 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1286:
-

Attachment: benchmark.patch

Peng,

I am attaching a patch -not to be committed- that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: benchmark.patch, InMemoryDataModel.java, 
> InMemoryDataModelTest.java, Semifinal-implementation-added.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-10-19 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799916#comment-13799916
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Hi [~smarthi], 

Although I'm not sure whether there is still interest, I have a Lucene matrix 
(in-memory) and a Solr matrix (which does not load the index into memory) 
implementation. I believe both can be committed after a couple of review rounds.



> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Fix For: Backlog
>
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-05 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759021#comment-13759021
 ] 

Gokhan Capan edited comment on MAHOUT-1286 at 9/5/13 12:22 PM:
---

Even if it is not an exact Matrix structure, we can start with 2d hash tables 
and proceed later. 

Let's start this. I tried to insert Netflix ratings into: i- DataModel backed 
by 2 matrices. ii- The one in this patch. Good news is insert performance is 
good enough. I am going to try gets and iterations, too. Tomorrow I am starting 
the 2d hash table based on your implementation with a matrix-like interface, I 
am going to share a github link with you.

  was (Author: gokhancapan):
There was a thread on updating "int" indices and "double" values in 
matrices, but there are simply too many consequences of that update that we 
can't deal with right now. Even if it is not an exact Matrix structure, we can 
start with 2d hash tables and proceed later. 

Let's start this. I tried to insert Netflix ratings into: i- DataModel backed 
by 2 matrices. ii- The one in this patch. Good news is insert performance is 
good enough. I am going to try gets and iterations, too. Tomorrow I am starting 
the 2d hash table based on your implementation with a matrix-like interface, I 
am going to share a github link with you.
  
> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
> Semifinal-implementation-added.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-05 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759021#comment-13759021
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

There was a thread on updating "int" indices and "double" values in matrices, 
but that update simply has too many consequences that we can't deal 
with right now. Even if it is not an exact Matrix structure, we can start with 
2d hash tables and proceed later. 

Let's start this. I tried to insert the Netflix ratings into: (i) a DataModel backed 
by 2 matrices, and (ii) the one in this patch. The good news is that insert performance is 
good enough. I am going to try gets and iterations, too. Tomorrow I am starting 
on the 2d hash table based on your implementation, with a matrix-like interface; I 
am going to share a github link with you.
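
Roughly, the timing loop looks like the following sketch (purely illustrative; 
TableUnderTest is a made-up stand-in for whichever structure is being measured, and 
the id ranges are only Netflix-scale approximations):

import java.util.Random;

final class InsertBenchmarkSketch {

  // Stand-in for whichever structure is being measured (a DataModel, a 2d hash table, ...).
  interface TableUnderTest {
    void set(long userID, long itemID, float rating);
  }

  // Times 'count' random inserts and returns the elapsed nanoseconds.
  static long timeInserts(TableUnderTest table, int count, long seed) {
    Random random = new Random(seed);
    long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
      long user = random.nextInt(480000);     // roughly Netflix-scale user id space
      long item = random.nextInt(17770);      // roughly Netflix-scale item id space
      float rating = 1 + random.nextInt(5);   // ratings in [1, 5]
      table.set(user, item, rating);
    }
    return System.nanoTime() - start;
  }
}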

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
> Semifinal-implementation-added.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-04 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757801#comment-13757801
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Here is what I think:

1- We should implement a matrix that uses your 2d Hopscotch hash table as the 
underlying data structure (or the current open addressing hash table 
implementation that already exists in Mahout, depending on benchmarks)

2- We should handle concurrency issues that might be introduced by that matrix 
implementation

3- We can then replace the FastByIDMap(s) with that matrix, trust the 
underlying matrix for concurrent updates, and never create a PreferenceArray 
unless there is an iteration over users (or items) (a rough sketch of this access 
pattern follows below)

What do you think?
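
A rough sketch of the kind of structure and access pattern I mean (purely 
illustrative: no resizing, no deletion, not thread safe, non-negative indices 
assumed; the real thing would be the Hopscotch or open-addressing table above, 
with concurrency handled inside it):

final class OpenAddressingPreferenceTable {

  private static final long EMPTY = Long.MIN_VALUE;

  private final long[] keys;
  private final float[] values;
  private final int mask;

  // capacity must be a power of two; user/item indices are assumed non-negative
  OpenAddressingPreferenceTable(int capacity) {
    keys = new long[capacity];
    values = new float[capacity];
    mask = capacity - 1;
    java.util.Arrays.fill(keys, EMPTY);
  }

  private static long pack(int userIndex, int itemIndex) {
    return ((long) userIndex << 32) | (itemIndex & 0xFFFFFFFFL);
  }

  void set(int userIndex, int itemIndex, float rating) {
    long key = pack(userIndex, itemIndex);
    int slot = (int) (key ^ (key >>> 32)) & mask;   // cheap hash of the packed key
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + 1) & mask;                     // linear probing
    }
    keys[slot] = key;
    values[slot] = rating;
  }

  // Point lookup served straight from the table; no PreferenceArray is materialized.
  Float get(int userIndex, int itemIndex) {
    long key = pack(userIndex, itemIndex);
    int slot = (int) (key ^ (key >>> 32)) & mask;
    while (keys[slot] != EMPTY) {
      if (keys[slot] == key) {
        return values[slot];
      }
      slot = (slot + 1) & mask;
    }
    return null;                                    // unknown (user, item) pair
  }
}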

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
> Semifinal-implementation-added.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-27 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751053#comment-13751053
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

By the way, it seems the link to the paper is broken, if it is not just me.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-27 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751049#comment-13751049
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Hi Peng, could you submit the diff files instead of .javas? That would be more 
convenient for me if it is possible.

> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737267#comment-13737267
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Peng,

With a SparseRowMatrix, column access (getPreferencesForItem) is slow, but row access 
is pretty fast (getPreferencesFromUsers). I agree with all the other problems you 
mentioned. 

In Mahout's SVD-based recommenders and FactorizablePreferences, while computing 
top-N recommendations, I believe we compute a prediction for each item and return 
the top-N (a minimal sketch of this follows below). So basically, an SVD-based 
recommender needs fast access to the rows of the matrix, but not to the columns 
(it still needs to iterate over item ids, though). Fast column access is only needed 
in an item-based recommender, or if a CandidateItemsStrategy is used.
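
To make the row-access point concrete, a minimal sketch (illustrative only, not 
Mahout API; plain arrays stand in for the factor matrices):

import java.util.Comparator;
import java.util.PriorityQueue;

final class TopNSketch {

  // Returns the ids of the n items with the highest predicted rating for one user.
  // userRow is that user's latent-factor row; itemFactors holds one factor row per item.
  static int[] topN(double[] userRow, double[][] itemFactors, int n) {
    // min-heap of {score, itemId}, kept at size n
    PriorityQueue<double[]> heap = new PriorityQueue<double[]>(Math.max(1, n),
        new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[0], b[0]);
          }
        });
    for (int item = 0; item < itemFactors.length; item++) {
      double score = 0.0;
      for (int f = 0; f < userRow.length; f++) {
        score += userRow[f] * itemFactors[item][f];   // dot product = predicted rating
      }
      heap.offer(new double[] {score, item});
      if (heap.size() > n) {
        heap.poll();                                  // drop the current minimum
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[1];               // best-scoring item ends up first
    }
    return result;
  }
}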

In my tests for Netflix data, I saw a 3G heap, too. Let me compare this 
particular approach with the SparseRowMatrix backed one. I will investigate 
your approach further.

Ted, 

Additionally, I recently implemented a read-only SolrMatrix, which might be 
beneficial while implementing the SolrRecommender, if we want to use existing 
mahout library for similarities etc. I will open a new thread for that.

Best


> Memory-efficient DataModel, supporting fast online updates and element-wise 
> iteration
> -
>
> Key: MAHOUT-1286
> URL: https://issues.apache.org/jira/browse/MAHOUT-1286
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Peng Cheng
>Assignee: Sean Owen
>  Labels: collaborative-filtering, datamodel, patch, recommender
> Fix For: 0.9
>
> Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Most DataModel implementation in current CF component use hash map to enable 
> fast 2d indexing and update. This is not memory-efficient for big data set. 
> e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
> Improved implementation of DataModel should use more compact data structure 
> (like arrays), this can trade a little of time complexity in 2d indexing for 
> vast improvement in memory efficiency. In addition, any online recommender or 
> online-to-batch converted recommender will not be affected by this in 
> training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1193) We may want a BlockSparseMatrix

2013-04-26 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642752#comment-13642752
 ] 

Gokhan Capan commented on MAHOUT-1193:
--

Sorry I missed that.

I modified the SparseMatrix code to handle dense rows and I am happy with that. 
The code is not patch-quality, but I can implement a flexible extension to the 
current implementation if that is desired (I believe that might be a common use 
case).

I personally liked the BlockSparseMatrix idea and its really flexible schema. I 
did a quick implementation to make it work with a configurable block size; in a 
few days I can submit an additional diff to the reviewboard so we can discuss 
the code. One thing to consider: I suspect my version's CPU usage is kind of 
high. 

I believe both versions are valuable and important, they have their own 
benefits, particularly as an input to online learning algorithms.

> We may want a BlockSparseMatrix
> ---
>
> Key: MAHOUT-1193
> URL: https://issues.apache.org/jira/browse/MAHOUT-1193
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
> Attachments: MAHOUT-1193.patch
>
>
> Here is an implementation.
> Is it good enough to commit?
> Is it useful?
> Is it redundant?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1193) We may want a BlockSparseMatrix

2013-04-24 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640268#comment-13640268
 ] 

Gokhan Capan commented on MAHOUT-1193:
--

Ok, here are the updates:

I modified the code a little (made it run and modified it as I had commented 
previously), and did some tests within the real application that I mentioned in 
the user list.

Performance of gets and sets (bigger is better):
DenseMatrix > SparseMatrix (with dense rows) > BlockSparseMatrix > SparseMatrix 
(with sparse rows) > SparseColumnMatrix


Performance difference between SparseMatrix with dense rows and 
BlockSparseMatrix is small.

One drawback of SparseMatrix might be that you need to specify the rowSize in 
advance (which means you need to set a boundary for your row indices). This 
wasn't a problem for me, but it's worth mentioning. With this version of 
BlockSparseMatrix, there might also be a memory overhead depending on 
blockSize. 

I decided to go with SparseMatrix with dense rows for now, but I am also working on 
the BlockSparseMatrix code (thanks to the flexible schema).

> We may want a BlockSparseMatrix
> ---
>
> Key: MAHOUT-1193
> URL: https://issues.apache.org/jira/browse/MAHOUT-1193
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
> Attachments: MAHOUT-1193.patch
>
>
> Here is an implementation.
> Is it good enough to commit?
> Is it useful?
> Is it redundant?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1193) We may want a BlockSparseMatrix

2013-04-18 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635670#comment-13635670
 ] 

Gokhan Capan commented on MAHOUT-1193:
--

Is it just me, or does it not compile because it does not have a constructor 
matching super's and cardinality is not declared?

What I understand from the implementation is that we create a Map keyed by block index, each Entry of which represents a block and the associated DenseMatrix.

If I didn't totally misunderstand the implementation: if the blockSize is always 
1, this associates a matrix with each row. 

Say I want to sacrifice some memory and try to set blockSize to 5, so if there 
were n actual rows in [row/blockSize, row/blockSize+5), there would be 5-n 
empty ones, and I am OK with that. Shouldn't we modify the extendToThisRow 
method such that:

int blockIndex = row / blockSize;
Matrix block = data.get(blockIndex);
if (block == null) {
  // allocate a new dense block covering blockSize rows
  data.put(blockIndex, new DenseMatrix(blockSize, columns));
} else if (!block.hasRow(row)) {
  // place the row at its local offset within the block (row % blockSize)
  block.assignRow(row % blockSize, new DenseVector(columns));
}
rows = Math.max(row + 1, rows);
cardinality[ROW] = rows;

> We may want a BlockSparseMatrix
> ---
>
> Key: MAHOUT-1193
> URL: https://issues.apache.org/jira/browse/MAHOUT-1193
> Project: Mahout
>  Issue Type: Bug
>Reporter: Ted Dunning
> Attachments: MAHOUT-1193.patch
>
>
> Here is an implementation.
> Is it good enough to commit?
> Is it useful?
> Is it redundant?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-17 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634477#comment-13634477
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Thanks for the valuable reviews. I updated the review request, but not the 
patch here. I will do it after another review round.

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-11 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629056#comment-13629056
 ] 

Gokhan Capan edited comment on MAHOUT-1178 at 4/11/13 4:21 PM:
---

Hi Sebastian,

I did, though I'm not sure if I did it correctly:) Anyway, if it is correct, 
the diff here and there are not the same (the base directories I created the 
diffs are different, and the one in reviewboard is in a single diff file. Code 
is same though, I hope this is not a problem)

Update: adding the link https://reviews.apache.org/r/10420/

  was (Author: gokhancapan):
Hi Sebastian,

I did, though I'm not sure if I did it correctly:) Anyway, if it is correct, 
the diff here and there are not the same (the base directories I created the 
diffs are different, and the one in reviewboard is in a single diff file. Code 
is same though, I hope this is not a problem)

  
> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-11 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629056#comment-13629056
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Hi Sebastian,

I did, though I'm not sure if I did it correctly :) Anyway, if it is correct, 
the diff here and the one there are not the same (the base directories I created the 
diffs from are different, and the one in reviewboard is a single diff file; the code 
is the same though, I hope this is not a problem)


> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-11 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1178:
-

Attachment: MAHOUT-1178.patch
MAHOUT-1178-TEST.patch

Hi,

I am adding a Matrix implementation that loads the entire data of a field of a 
Lucene index into an underlying SparseRowMatrix here.

It delegates the reading-from-index logic to the existing LuceneIterator.
When I changed the LuceneIterator code a little to make this support StringFields, 
it broke LuceneIteratorTest, so I am going to add a new version of 
LuceneIterator that supports StringFields later.

There is also an ongoing effort on another version of LuceneMatrix that 
lazy-loads from the index while iterating over the matrix. I am going to start a 
separate issue for that.

I put the code in the integration module, and the test and actual code are in 
different diff files. 

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-04-09 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626619#comment-13626619
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Ted, do you think this should load the entire index into memory as a matrix? Or 
should it query the index when a get request is made? (And if that is the 
option, should set methods also update the lucene index itself?)

> GSOC 2013: Improve Lucene support in Mahout
> ---
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Dan Filimon
>  Labels: gsoc2013, mentor
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix.  This would
> require that we standardize on a way to convert documents to rows.  There
> are many choices, the discussion of which should be deferred to the actual
> work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors
> for each document each to a line and converting that result using standard
> Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields that ought to work sensibly as well.
>  Two options include dumping multiple matrices or to convert the fields
> into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the
> correct document.  THis might be because we remember the Lucene doc number
> or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

2012-10-11 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1069:
-

Attachment: MAHOUT-1069.patch

Fixed a few minor bugs and updated the patch

> Multi-target, side-info aware, SGD-based recommender algorithms, examples, 
> and tools to run
> ---
>
> Key: MAHOUT-1069
> URL: https://issues.apache.org/jira/browse/MAHOUT-1069
> Project: Mahout
>  Issue Type: Improvement
>  Components: CLI, Collaborative Filtering
>Affects Versions: 0.8
>Reporter: Gokhan Capan
>Assignee: Sean Owen
>  Labels: cf, improvement, sgd
> Attachments: MAHOUT-1069.patch, MAHOUT-1069.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Upon our conversations on dev-list, I would like to state that I have 
> completed the merge of the recommender algorithms that is mentioned in 
> http://goo.gl/fh4d9 to mahout. 
> These are a set of learning algorithms for matrix factorization based 
> recommendation, which are capable of:
> * Recommending multiple targets:
> *# Numerical Recommendation with OLS Regression
> *# Binary Recommendation with Logistic Regression
> *# Multinomial Recommendation with Softmax Regression
> *# Ordinal Recommendation with Proportional Odds Model
> * Leveraging side info in mahout vector format where available
> *# User side information
> *# Item side information
> *# Dynamic side information (side info at feedback moment, such as proximity, 
> day of week etc.)
> * Online learning
> Some command-line tools are provided as mahout jobs, for pre-experiment 
> utilities and running experiments.
> Evaluation tools for numerical and categorical recommenders are added.
> A simple example for Movielens-1M data is provided, and it achieved pretty 
> good results (0.851 RMSE in a randomly generated test data after some 
> validation to determine learning and regularization rates on a separate 
> validation data)
> There is no modification in the existing Mahout code, except the added lines 
> in driver.class.props for command-line tools. However, that became a huge 
> patch with dozens of new source files.
> These algorithms are highly inspired from various influential Recommender 
> System papers, especially Yehuda Koren's. For example, the Ordinal model is 
> from Koren's OrdRec paper, except the cuts are not user-specific but global.
> Left for future:
> # The core algorithms are tested, but there probably exists some parts those 
> tests do not cover. I saw many of those in action without problem, but I am 
> going to add new tests regularly.
> # Not all algorithms have been tried on appropriate datasets, and they may 
> need some improvement. However, I use the algorithms also for my M.Sc. 
> thesis, which means I will eventually submit more experiments. As the 
> experimenting infrastructure exists, I believe community may provide more 
> experiments, too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

2012-09-18 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1069:
-

Attachment: MAHOUT-1069.patch

Attached is the patch.

> Multi-target, side-info aware, SGD-based recommender algorithms, examples, 
> and tools to run
> ---
>
> Key: MAHOUT-1069
> URL: https://issues.apache.org/jira/browse/MAHOUT-1069
> Project: Mahout
>  Issue Type: Improvement
>  Components: CLI, Collaborative Filtering
>Affects Versions: 0.8
>Reporter: Gokhan Capan
>Assignee: Sean Owen
>  Labels: cf, improvement, sgd
> Attachments: MAHOUT-1069.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Upon our conversations on dev-list, I would like to state that I have 
> completed the merge of the recommender algorithms that is mentioned in 
> http://goo.gl/fh4d9 to mahout. 
> These are a set of learning algorithms for matrix factorization based 
> recommendation, which are capable of:
> * Recommending multiple targets:
> *# Numerical Recommendation with OLS Regression
> *# Binary Recommendation with Logistic Regression
> *# Multinomial Recommendation with Softmax Regression
> *# Ordinal Recommendation with Proportional Odds Model
> * Leveraging side info in mahout vector format where available
> *# User side information
> *# Item side information
> *# Dynamic side information (side info at feedback moment, such as proximity, 
> day of week etc.)
> * Online learning
> Some command-line tools are provided as mahout jobs, for pre-experiment 
> utilities and running experiments.
> Evaluation tools for numerical and categorical recommenders are added.
> A simple example for Movielens-1M data is provided, and it achieved pretty 
> good results (0.851 RMSE in a randomly generated test data after some 
> validation to determine learning and regularization rates on a separate 
> validation data)
> There is no modification in the existing Mahout code, except the added lines 
> in driver.class.props for command-line tools. However, that became a huge 
> patch with dozens of new source files.
> These algorithms are highly inspired from various influential Recommender 
> System papers, especially Yehuda Koren's. For example, the Ordinal model is 
> from Koren's OrdRec paper, except the cuts are not user-specific but global.
> Left for future:
> # The core algorithms are tested, but there probably exists some parts those 
> tests do not cover. I saw many of those in action without problem, but I am 
> going to add new tests regularly.
> # Not all algorithms have been tried on appropriate datasets, and they may 
> need some improvement. However, I use the algorithms also for my M.Sc. 
> thesis, which means I will eventually submit more experiments. As the 
> experimenting infrastructure exists, I believe community may provide more 
> experiments, too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run

2012-09-18 Thread Gokhan Capan (JIRA)
Gokhan Capan created MAHOUT-1069:


 Summary: Multi-target, side-info aware, SGD-based recommender 
algorithms, examples, and tools to run
 Key: MAHOUT-1069
 URL: https://issues.apache.org/jira/browse/MAHOUT-1069
 Project: Mahout
  Issue Type: Improvement
  Components: CLI, Collaborative Filtering
Affects Versions: 0.8
Reporter: Gokhan Capan
Assignee: Sean Owen


Following our conversations on the dev-list, I would like to state that I have completed 
the merge of the recommender algorithms mentioned in 
http://goo.gl/fh4d9 into mahout. 

These are a set of learning algorithms for matrix factorization based 
recommendation, which are capable of:

* Recommending multiple targets:
*# Numerical Recommendation with OLS Regression
*# Binary Recommendation with Logistic Regression
*# Multinomial Recommendation with Softmax Regression
*# Ordinal Recommendation with Proportional Odds Model

* Leveraging side info in mahout vector format where available
*# User side information
*# Item side information
*# Dynamic side information (side info at feedback moment, such as proximity, 
day of week etc.)

* Online learning

Some command-line tools are provided as mahout jobs, for pre-experiment 
utilities and running experiments.

Evaluation tools for numerical and categorical recommenders are added.

A simple example for Movielens-1M data is provided, and it achieved pretty good 
results (0.851 RMSE on a randomly generated test set after some validation to 
determine learning and regularization rates on a separate validation set)

There is no modification in the existing Mahout code, except the added lines in 
driver.class.props for command-line tools. However, that became a huge patch 
with dozens of new source files.

These algorithms are highly inspired by various influential Recommender 
System papers, especially Yehuda Koren's. For example, the Ordinal model is 
from Koren's OrdRec paper, except that the cuts are not user-specific but global.

Left for future:
# The core algorithms are tested, but there probably exist some parts those 
tests do not cover. I saw many of them in action without problems, but I am 
going to add new tests regularly.
# Not all algorithms have been tried on appropriate datasets, and they may need 
some improvement. However, I also use the algorithms for my M.Sc. thesis, which 
means I will eventually submit more experiments. As the experimenting 
infrastructure exists, I believe the community may provide more experiments, too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1064) Weird behavior of vector dumper

2012-09-03 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1064:
-

Attachment: MAHOUT-1064.patch

Attached is a test that fails, and a quick fix.

> Weird behavior of vector dumper
> ---
>
> Key: MAHOUT-1064
> URL: https://issues.apache.org/jira/browse/MAHOUT-1064
> Project: Mahout
>  Issue Type: Bug
>  Components: Integration
>Affects Versions: 0.8
>Reporter: Gokhan Capan
>Priority: Minor
>  Labels: sort, vectordump
> Fix For: 0.8
>
> Attachments: MAHOUT-1064.patch
>
>
> When vectordump utility is executed with sort flag true, I expect the 
> resulting vector that is sorted by values. If that is the case, sometimes 
> VectorHelper.vectorToJson method returns unexpected results.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAHOUT-1064) Weird behavior of vector dumper

2012-09-03 Thread Gokhan Capan (JIRA)
Gokhan Capan created MAHOUT-1064:


 Summary: Weird behavior of vector dumper
 Key: MAHOUT-1064
 URL: https://issues.apache.org/jira/browse/MAHOUT-1064
 Project: Mahout
  Issue Type: Bug
  Components: Integration
Affects Versions: 0.8
Reporter: Gokhan Capan
Priority: Minor
 Fix For: 0.8


When the vectordump utility is executed with the sort flag set to true, I expect the 
resulting vector to be sorted by values. Even then, the 
VectorHelper.vectorToJson method sometimes returns unexpected results.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1051) InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs

2012-08-09 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432039#comment-13432039
 ] 

Gokhan Capan commented on MAHOUT-1051:
--

Jake, I've run the new version without any errors and checked a few documents to 
see whether they are related to the inferred topics.
It works for me. 

> InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs
> 
>
> Key: MAHOUT-1051
> URL: https://issues.apache.org/jira/browse/MAHOUT-1051
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Gokhan Capan
>Priority: Minor
>  Labels: cvb, lda
> Fix For: 0.8
>
> Attachments: MAHOUT-1051.patch, MAHOUT-1051.patch
>
>
> Based on the conversation with Jake on the user list, I have modified 
> o.a.m.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.loadVectors so 
> that it no longer ignores document ids in the input. To preserve backwards 
> compatibility, it behaves as it did before when a ClassCastException is 
> thrown, which happens when the ids are not integers and/or the document 
> vector (or its getDelegate() if it is a NamedVector) cannot be cast to a 
> RandomAccessSparseVector.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (MAHOUT-1051) InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs

2012-08-07 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1051:
-

Attachment: MAHOUT-1051.patch

Attached is the patch.

> InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs
> 
>
> Key: MAHOUT-1051
> URL: https://issues.apache.org/jira/browse/MAHOUT-1051
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.8
>Reporter: Gokhan Capan
>Priority: Minor
>  Labels: cvb, lda
> Fix For: 0.8
>
> Attachments: MAHOUT-1051.patch
>
>
> Based on the conversation with Jake on the user list, I have modified 
> o.a.m.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.loadVectors so 
> that it no longer ignores document ids in the input. To preserve backwards 
> compatibility, it behaves as it did before when a ClassCastException is 
> thrown, which happens when the ids are not integers and/or the document 
> vector (or its getDelegate() if it is a NamedVector) cannot be cast to a 
> RandomAccessSparseVector.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (MAHOUT-1051) InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs

2012-08-07 Thread Gokhan Capan (JIRA)
Gokhan Capan created MAHOUT-1051:


 Summary: InMemoryCollapsedVariationalBayes0 to load input vectors 
with docIDs
 Key: MAHOUT-1051
 URL: https://issues.apache.org/jira/browse/MAHOUT-1051
 Project: Mahout
  Issue Type: Improvement
  Components: Clustering
Affects Versions: 0.8
Reporter: Gokhan Capan
Priority: Minor
 Fix For: 0.8
 Attachments: MAHOUT-1051.patch

Based on the conversation with Jake on the user list, I have modified 
o.a.m.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.loadVectors so that 
it no longer ignores document ids in the input. To preserve backwards 
compatibility, it behaves as it did before when a ClassCastException is thrown, 
which happens when the ids are not integers and/or the document vector (or its 
getDelegate() if it is a NamedVector) cannot be cast to a 
RandomAccessSparseVector.
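
A minimal sketch of the fallback pattern described above; the method below and 
the way the key arrives are hypothetical, only the Mahout vector types are real:

{code:java}
// Illustrative sketch of the backwards-compatible fallback: try to use the
// integer id and the RandomAccessSparseVector payload from the input, and on a
// ClassCastException fall back to the old behavior of numbering documents sequentially.
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public final class LoadVectorsSketch {

  /** Returns the id to use for this document, or the sequential fallback id. */
  static int resolveDocId(Object key, Vector value, int sequentialId) {
    try {
      // New behavior: the document vector (or its delegate, for a NamedVector)
      // must be a RandomAccessSparseVector and the id must be an Integer.
      Vector delegate = value instanceof NamedVector ? ((NamedVector) value).getDelegate() : value;
      RandomAccessSparseVector sparse = (RandomAccessSparseVector) delegate;
      return (Integer) key;
    } catch (ClassCastException e) {
      // Old behavior: ignore the supplied id and number documents sequentially.
      return sequentialId;
    }
  }
}
{code}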

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira