[jira] [Resolved] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions
[ https://issues.apache.org/jira/browse/MAHOUT-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan resolved MAHOUT-1616. -- Resolution: Fixed > Better support for hadoop dependencies of multiple versions > > > Key: MAHOUT-1616 > URL: https://issues.apache.org/jira/browse/MAHOUT-1616 > Project: Mahout > Issue Type: Improvement > Components: build >Reporter: Gokhan Capan >Assignee: Gokhan Capan > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks
Gokhan Capan created MAHOUT-1626: Summary: Support for required quasi-algebraic operations and starting with aggregating rows/blocks Key: MAHOUT-1626 URL: https://issues.apache.org/jira/browse/MAHOUT-1626 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 1.0 Reporter: Gokhan Capan Fix For: 1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155309#comment-14155309 ] Gokhan Capan commented on MAHOUT-1329: -- Correct > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, > 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154937#comment-14154937 ] Gokhan Capan commented on MAHOUT-1329: -- Jay, here is the documentation: http://mahout.apache.org/developers/buildingmahout.html And the instructions apply to trunk, not to the 0.9 release > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, > 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154918#comment-14154918 ] Gokhan Capan commented on MAHOUT-1329: -- Jay, This is integrated in trunk, not in 0.9, and should work. Also, you can find MAHOUT-1616 useful for a recent simplification and further improvement effort. Best > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, > 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions
Gokhan Capan created MAHOUT-1616: Summary: Better support for hadoop dependencies of multiple versions Key: MAHOUT-1616 URL: https://issues.apache.org/jira/browse/MAHOUT-1616 Project: Mahout Issue Type: Improvement Components: build Reporter: Gokhan Capan Assignee: Gokhan Capan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062041#comment-14062041 ] Gokhan Capan commented on MAHOUT-1565: -- Sorry guys, I committed this 2 weeks ago, but I forgot to close the issue. Thank you, [~nravi] > add MR2 options to MAHOUT_OPTS in bin/mahout > > > Key: MAHOUT-1565 > URL: https://issues.apache.org/jira/browse/MAHOUT-1565 > Project: Mahout > Issue Type: Improvement >Affects Versions: 1.0, 0.9 >Reporter: Nishkam Ravi > Fix For: 1.0 > > Attachments: MAHOUT-1565.patch > > > MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add > those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan resolved MAHOUT-1565. -- Resolution: Fixed > add MR2 options to MAHOUT_OPTS in bin/mahout > > > Key: MAHOUT-1565 > URL: https://issues.apache.org/jira/browse/MAHOUT-1565 > Project: Mahout > Issue Type: Improvement >Affects Versions: 1.0, 0.9 >Reporter: Nishkam Ravi >Assignee: Gokhan Capan > Fix For: 1.0 > > Attachments: MAHOUT-1565.patch > > > MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add > those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan reassigned MAHOUT-1565: Assignee: Gokhan Capan > add MR2 options to MAHOUT_OPTS in bin/mahout > > > Key: MAHOUT-1565 > URL: https://issues.apache.org/jira/browse/MAHOUT-1565 > Project: Mahout > Issue Type: Improvement >Affects Versions: 1.0, 0.9 >Reporter: Nishkam Ravi >Assignee: Gokhan Capan > Fix For: 1.0 > > Attachments: MAHOUT-1565.patch > > > MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add > those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016998#comment-14016998 ] Gokhan Capan commented on MAHOUT-1529: -- Alright, I'm sold. > Finalize abstraction of distributed logical plans from backend operations > - > > Key: MAHOUT-1529 > URL: https://issues.apache.org/jira/browse/MAHOUT-1529 > Project: Mahout > Issue Type: Improvement >Reporter: Dmitriy Lyubimov >Assignee: Dmitriy Lyubimov > Fix For: 1.0 > > > We have a few situations when algorithm-facing API has Spark dependencies > creeping in. > In particular, we know of the following cases: > -(1) checkpoint() accepts Spark constant StorageLevel directly;- > -(2) certain things in CheckpointedDRM;- > -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.- > -(5) drmBroadcast returns a Spark-specific Broadcast object- > (6) Stratosphere/Flink conceptual api changes. > *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, > need new PR for remaining things once ready. > *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016565#comment-14016565 ] Gokhan Capan commented on MAHOUT-1329: -- Brian, This was actually well-tested. But I'm gonna build and test it again, probably tomorrow. By the way, can you run {{$ find . -name "hadoop*.jar"}} in the mahout root directory after building mahout? (The quotes keep the shell from expanding the glob before find sees it.) Best > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, > 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016378#comment-14016378 ] Gokhan Capan commented on MAHOUT-1565: -- We agree, conceptually, but this needs some further testing. > add MR2 options to MAHOUT_OPTS in bin/mahout > > > Key: MAHOUT-1565 > URL: https://issues.apache.org/jira/browse/MAHOUT-1565 > Project: Mahout > Issue Type: Improvement >Affects Versions: 1.0, 0.9 >Reporter: Nishkam Ravi > Fix For: 1.0 > > Attachments: MAHOUT-1565.patch > > > MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add > those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016372#comment-14016372 ] Gokhan Capan commented on MAHOUT-1329: -- Seems like the dependencies are correctly set. Are you certain that the cluster you're running mahout against is a hadoop-2 and M/R-2 cluster? > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, > 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985 ] Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 3:03 PM: -- [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for memory-based algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each behavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mix into her concrete implementation (for example, Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching, cache it; if partitioned in the same way, do this, else do that; if one matrix is small, broadcast it; etc.). So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and parallel execution behavior, a Caching trait with methods for caching/uncaching behavior, and, in the future, a RandomAccess trait with methods for accessing rows and columns (and possibly cells). Then a concrete DRM (or the like) would be a Matrix with BatchExecution and possibly Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think, and if you and others are positive, how do you think that should be handled? was (Author: gokhancapan): [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. 
This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each behavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mix into her concrete implementation (for example, Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching, cache it; if partitioned in the same way, do this, else do that; if one matrix is small, broadcast it; etc.). So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and parallel execution behavior, a Caching trait with methods for caching/uncaching behavior, and, in the future, a RandomAccess trait with methods for accessing rows and columns (and possibly cells). Then a concrete DRM (or the like) would be a Matrix with BatchExecution and possibly Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think, and if you and others are positive, how do you think that should be handled? > Finalize abstraction of distributed logical plans from backend operations > - > > Key: MAHOUT-1529 > URL: https://issues.apache.org/jira/browse/MAHOUT-1529 > Project: Mahout > Issue Type: Improvement >Reporter: Dmitriy Lyubimov >Assignee: Dmitriy Lyubimov > Fix For: 1.0 > > > We have a few situations when algorithm-facing API has Spark dependencies > creeping in. > In particular, we know of the following cases: > -(1) checkpoint() accepts Spark constant StorageLevel directly;- > -(2) certain things in CheckpointedDRM;- > -(3) drmParallelize etc. 
routines in the "drm" and "sparkbindings" package.- > -(5) drmBroadcast returns a Spark-specific Broadcast object- > (6) Stratosphere/Flink conceptual api changes. > *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, > need new PR for remaining things once ready. > *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985 ] Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 2:55 PM: -- [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each behavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mix into her concrete implementation (for example, Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching, cache it; if partitioned in the same way, do this, else do that; if one matrix is small, broadcast it; etc.). So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and parallel execution behavior, a Caching trait with methods for caching/uncaching behavior, and, in the future, a RandomAccess trait with methods for accessing rows and columns (and possibly cells). Then a concrete DRM (or the like) would be a Matrix with BatchExecution and possibly Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think, and if you and others are positive, how do you think that should be handled? was (Author: gokhancapan): [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. 
This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each behavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mix into her concrete implementation (for example, Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching, cache it; if partitioned in the same way, do this, else do that; if one matrix is small, broadcast it; etc.). So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and parallel execution behavior, a Caching trait with methods for caching/uncaching behavior, and, in the future, a RandomAccess trait with methods for accessing rows and columns (and possibly cells). Then a concrete DRM (or the like) would be a Matrix with BatchOps and possibly CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, and so on. What do you think, and if you and others are positive, how do you think that should be handled? > Finalize abstraction of distributed logical plans from backend operations > - > > Key: MAHOUT-1529 > URL: https://issues.apache.org/jira/browse/MAHOUT-1529 > Project: Mahout > Issue Type: Improvement >Reporter: Dmitriy Lyubimov >Assignee: Dmitriy Lyubimov > Fix For: 1.0 > > > We have a few situations when algorithm-facing API has Spark dependencies > creeping in. > In particular, we know of the following cases: > -(1) checkpoint() accepts Spark constant StorageLevel directly;- > -(2) certain things in CheckpointedDRM;- > -(3) drmParallelize etc. 
routines in the "drm" and "sparkbindings" package.- > -(5) drmBroadcast returns a Spark-specific Broadcast object- > (6) Stratosphere/Flink conceptual api changes. > *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, > need new PR for remaining things once ready. > *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014985#comment-14014985 ] Gokhan Capan commented on MAHOUT-1529: -- [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each behavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mix into her concrete implementation (for example, Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching, cache it; if partitioned in the same way, do this, else do that; if one matrix is small, broadcast it; etc.). So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and parallel execution behavior, a Caching trait with methods for caching/uncaching behavior, and, in the future, a RandomAccess trait with methods for accessing rows and columns (and possibly cells). Then a concrete DRM (or the like) would be a Matrix with BatchOps and possibly CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, and so on. What do you think, and if you and others are positive, how do you think that should be handled? 
> Finalize abstraction of distributed logical plans from backend operations > - > > Key: MAHOUT-1529 > URL: https://issues.apache.org/jira/browse/MAHOUT-1529 > Project: Mahout > Issue Type: Improvement >Reporter: Dmitriy Lyubimov >Assignee: Dmitriy Lyubimov > Fix For: 1.0 > > > We have a few situations when algorithm-facing API has Spark dependencies > creeping in. > In particular, we know of the following cases: > -(1) checkpoint() accepts Spark constant StorageLevel directly;- > -(2) certain things in CheckpointedDRM;- > -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.- > -(5) drmBroadcast returns a Spark-specific Broadcast object- > (6) Stratosphere/Flink conceptual api changes. > *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, > need new PR for remaining things once ready. > *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
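The trait decomposition proposed in the comment above could be sketched in Scala roughly as follows. This is a minimal sketch of the idea, not Mahout's actual API: the trait names (Matrix, BatchExecution, Caching, RandomAccess) follow the comment's proposal, while InMemoryMatrix and Planner are hypothetical stand-ins added for illustration.

```scala
// Sketch of the capability-trait design from the comment above.
// All names are illustrative, not real Mahout types.

trait Matrix {
  def nrows: Long
  def ncols: Long
}

// Batch-parallel execution capability (what a DRM on Spark would mix in).
trait BatchExecution { this: Matrix =>
  def npartitions: Int = 1
}

// Caching capability the logical planner can test for.
trait Caching { this: Matrix =>
  private var cached = false
  def cache(): Unit = { cached = true }
  def uncache(): Unit = { cached = false }
  def isCached: Boolean = cached
}

// Fast row/column access for memory-based algorithms such as kNN.
trait RandomAccess { this: Matrix =>
  def row(i: Int): Array[Double]
  def col(j: Int): Array[Double]
}

// A toy concrete implementation: a Matrix with RandomAccess.
class InMemoryMatrix(data: Array[Array[Double]])
    extends Matrix with RandomAccess {
  def nrows: Long = data.length.toLong
  def ncols: Long = if (data.isEmpty) 0L else data(0).length.toLong
  def row(i: Int): Array[Double] = data(i)
  def col(j: Int): Array[Double] = data.map(_(j))
}

// The planner can branch on capabilities via pattern matching:
// "if it supports caching, cache it" from the comment.
object Planner {
  def maybeCache(m: Matrix): Boolean = m match {
    case c: Caching => c.cache(); true  // engine supports caching
    case _          => false            // no caching support; skip
  }
}
```

An engine author could then mix capabilities in freely, e.g. `new InMemoryMatrix(data) with Caching`, and a hypothetical Spark-backed DRM would be declared as a Matrix with BatchExecution and Caching.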
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012140#comment-14012140 ] Gokhan Capan commented on MAHOUT-1565: -- Sorry, now I can read the patch properly. The MR1 versions of those configurations are already set in bin/mahout, and you're suggesting to add MR2 versions of them, too, right? I am personally not a fan of setting such configurations in Mahout, and I would remove them as well. > add MR2 options to MAHOUT_OPTS in bin/mahout > > > Key: MAHOUT-1565 > URL: https://issues.apache.org/jira/browse/MAHOUT-1565 > Project: Mahout > Issue Type: Improvement >Affects Versions: 1.0, 0.9 >Reporter: Nishkam Ravi > Attachments: MAHOUT-1565.patch > > > MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add > those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012126#comment-14012126 ] Gokhan Capan commented on MAHOUT-1565: -- I think there is no point in configuring output compression, number of reducers, etc. for Mahout. > add MR2 options to MAHOUT_OPTS in bin/mahout > > > Key: MAHOUT-1565 > URL: https://issues.apache.org/jira/browse/MAHOUT-1565 > Project: Mahout > Issue Type: Improvement >Affects Versions: 1.0, 0.9 >Reporter: Nishkam Ravi > Attachments: MAHOUT-1565.patch > > > MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add > those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005719#comment-14005719 ] Gokhan Capan commented on MAHOUT-1329: -- Please check http://mahout.apache.org/developers/buildingmahout.html for instructions on building mahout against hadoop-2 > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, > 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan resolved MAHOUT-1534. -- Resolution: Fixed The instructions are now available on the BuildingMahout page: http://mahout.apache.org/developers/buildingmahout.html > Add documentation for using Mahout with Hadoop2 to the website > -- > > Key: MAHOUT-1534 > URL: https://issues.apache.org/jira/browse/MAHOUT-1534 > Project: Mahout > Issue Type: Task > Components: Documentation >Reporter: Sebastian Schelter >Assignee: Gokhan Capan > Fix For: 1.0 > > > MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. > We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005663#comment-14005663 ] Gokhan Capan commented on MAHOUT-1534: -- We might want to add the link to the Mahout News, but let's wait and see if the users could locate the page. > Add documentation for using Mahout with Hadoop2 to the website > -- > > Key: MAHOUT-1534 > URL: https://issues.apache.org/jira/browse/MAHOUT-1534 > Project: Mahout > Issue Type: Task > Components: Documentation >Reporter: Sebastian Schelter >Assignee: Gokhan Capan > Fix For: 1.0 > > > MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. > We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan reassigned MAHOUT-1534: Assignee: Gokhan Capan > Add documentation for using Mahout with Hadoop2 to the website > -- > > Key: MAHOUT-1534 > URL: https://issues.apache.org/jira/browse/MAHOUT-1534 > Project: Mahout > Issue Type: Task > Components: Documentation >Reporter: Sebastian Schelter >Assignee: Gokhan Capan > Fix For: 1.0 > > > MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. > We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004662#comment-14004662 ] Gokhan Capan commented on MAHOUT-1534: -- [~ssc] I added the directions to the BuildingMahout page. If you're happy with the staged version, I'll "Publish Site" > Add documentation for using Mahout with Hadoop2 to the website > -- > > Key: MAHOUT-1534 > URL: https://issues.apache.org/jira/browse/MAHOUT-1534 > Project: Mahout > Issue Type: Task > Components: Documentation >Reporter: Sebastian Schelter > Fix For: 1.0 > > > MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. > We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996351#comment-13996351 ] Gokhan Capan commented on MAHOUT-1550: -- Paul, Did you try building mahout using hadoop 2 profile first? The way to do it is: mvn clean package -DskipTests=true -Dhadoop2.version= Let us know if this fails > Naive Bayes training fails with Hadoop 2 > > > Key: MAHOUT-1550 > URL: https://issues.apache.org/jira/browse/MAHOUT-1550 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 1.0 > Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2 >Reporter: Paul Marret >Priority: Minor > Labels: bayesian, training > Attachments: mahout-snapshot.patch, stacktrace.txt > > Original Estimate: 0h > Remaining Estimate: 0h > > When using the trainnb option of the program, we get the following error: > Exception in thread "main" java.lang.IncompatibleClassChangeError: Found > interface org.apache.hadoop.mapreduce.JobContext, but class was expected > at > org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174) > at > org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614) > at > org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100) > [...] > It is possible to correct this by modifying the file > mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and > converting the instance job (line 174) to a Job object (it is a JobContext in > the current version). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996351#comment-13996351 ] Gokhan Capan edited comment on MAHOUT-1550 at 5/13/14 1:10 PM: --- Paul, Did you try building mahout using hadoop 2 profile first? The way to do it is: mvn clean package -DskipTests=true -Dhadoop2.version= Let us know if this fails was (Author: gokhancapan): Paul, Did you try build mahout using hadoop 2 profile first? The way to do it is: mvn clean package -DskipTests=true -Dhadoop2.version= Let us know if this fails > Naive Bayes training fails with Hadoop 2 > > > Key: MAHOUT-1550 > URL: https://issues.apache.org/jira/browse/MAHOUT-1550 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 1.0 > Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2 >Reporter: Paul Marret >Priority: Minor > Labels: bayesian, training > Attachments: mahout-snapshot.patch, stacktrace.txt > > Original Estimate: 0h > Remaining Estimate: 0h > > When using the trainnb option of the program, we get the following error: > Exception in thread "main" java.lang.IncompatibleClassChangeError: Found > interface org.apache.hadoop.mapreduce.JobContext, but class was expected > at > org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174) > at > org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614) > at > org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100) > [...] > It is possible to correct this by modifying the file > mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and > converting the instance job (line 174) to a Job object (it is a JobContext in > the current version). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968254#comment-13968254 ] Gokhan Capan commented on MAHOUT-1178: -- The thing is, it just 'loads' a Lucene index into memory as a matrix. You construct a matrix with the lucene index directory location and that's it. So it is not a fix for the incremental document management issue. The alternative approach is querying the index when a row/column vector or cell is required. I, however, am not sure if the SolrMatrix thing is fast enough for that. I haven't been available lately, and now I'm reading through the changes in and proposals for Mahout's future, and trying to set up my perspective for Mahout2. We probably can come up with a better way of document storage (still Lucene/Solr based). Let me leave this as is for now, and then we can discuss the input formats further. Is that OK for you? > GSOC 2013: Improve Lucene support in Mahout > --- > > Key: MAHOUT-1178 > URL: https://issues.apache.org/jira/browse/MAHOUT-1178 > Project: Mahout > Issue Type: New Feature >Reporter: Dan Filimon >Assignee: Gokhan Capan > Labels: gsoc2013, mentor > Fix For: 1.0 > > Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch > > > [via Ted Dunning] > It should be possible to view a Lucene index as a matrix. This would > require that we standardize on a way to convert documents to rows. There > are many choices, the discussion of which should be deferred to the actual > work on the project, but there are a few obvious constraints: > a) it should be possible to get the same result as dumping the term vectors > for each document each to a line and converting that result using standard > Mahout methods. > b) numeric fields ought to work somehow. > c) if there are multiple text fields that ought to work sensibly as well. > Two options include dumping multiple matrices or to convert the fields > into a single row of a single matrix. 
> d) it should be possible to refer back from a row of the matrix to find the > correct document. This might be because we remember the Lucene doc number > or because a field is named as holding a unique id. > e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
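The "term vectors as rows" constraint above boils down to maintaining a shared term dictionary and emitting one sparse row per document, with the row index doubling as the document id (constraint d). A minimal, Lucene-free sketch of that mapping (class and method names here are hypothetical, not Mahout or Lucene APIs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of "documents as matrix rows": build a term dictionary on
// the fly, then map each document's term counts onto a sparse row keyed
// by term index. The row position serves as the doc id.
public class DocsAsRows {
    public static List<Map<Integer, Double>> toRows(List<List<String>> docs,
                                                    Map<String, Integer> dict) {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<Integer, Double> row = new TreeMap<>();
            for (String term : doc) {
                // assign the next free index to unseen terms
                int idx = dict.computeIfAbsent(term, t -> dict.size());
                row.merge(idx, 1.0, Double::sum);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        List<Map<Integer, Double>> rows = toRows(
            Arrays.asList(Arrays.asList("lucene", "index", "lucene"),
                          Arrays.asList("matrix", "index")),
            dict);
        System.out.println(rows); // [{0=2.0, 1=1.0}, {1=1.0, 2=1.0}]
    }
}
```

A real implementation would read per-document term vectors from the index instead of tokenized lists, but the dictionary-plus-sparse-row shape is the same.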
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968221#comment-13968221 ] Gokhan Capan commented on MAHOUT-1178: -- I personally like the idea of integrating additional storage layers as matrix inputs, but not the implementation I did here. After agreeing on the new algorithm layers, we can move on to the additional input formats later. So my vote is also for "Won't Fix" > GSOC 2013: Improve Lucene support in Mahout > --- > > Key: MAHOUT-1178 > URL: https://issues.apache.org/jira/browse/MAHOUT-1178 > Project: Mahout > Issue Type: New Feature >Reporter: Dan Filimon >Assignee: Gokhan Capan > Labels: gsoc2013, mentor > Fix For: 1.0 > > Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch > > > [via Ted Dunning] > It should be possible to view a Lucene index as a matrix. This would > require that we standardize on a way to convert documents to rows. There > are many choices, the discussion of which should be deferred to the actual > work on the project, but there are a few obvious constraints: > a) it should be possible to get the same result as dumping the term vectors > for each document each to a line and converting that result using standard > Mahout methods. > b) numeric fields ought to work somehow. > c) if there are multiple text fields that ought to work sensibly as well. > Two options include dumping multiple matrices or to convert the fields > into a single row of a single matrix. > d) it should be possible to refer back from a row of the matrix to find the > correct document. This might be because we remember the Lucene doc number > or because a field is named as holding a unique id. > e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968148#comment-13968148 ] Gokhan Capan commented on MAHOUT-1178: -- Well I can add this, but considering the current status of the project, I think this is no longer in people's interest. What do you say [~ssc], should we 'won't fix' it or commit? > GSOC 2013: Improve Lucene support in Mahout > --- > > Key: MAHOUT-1178 > URL: https://issues.apache.org/jira/browse/MAHOUT-1178 > Project: Mahout > Issue Type: New Feature >Reporter: Dan Filimon >Assignee: Gokhan Capan > Labels: gsoc2013, mentor > Fix For: 1.0 > > Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch > > > [via Ted Dunning] > It should be possible to view a Lucene index as a matrix. This would > require that we standardize on a way to convert documents to rows. There > are many choices, the discussion of which should be deferred to the actual > work on the project, but there are a few obvious constraints: > a) it should be possible to get the same result as dumping the term vectors > for each document each to a line and converting that result using standard > Mahout methods. > b) numeric fields ought to work somehow. > c) if there are multiple text fields that ought to work sensibly as well. > Two options include dumping multiple matrices or to convert the fields > into a single row of a single matrix. > d) it should be possible to refer back from a row of the matrix to find the > correct document. THis might be because we remember the Lucene doc number > or because a field is named as holding a unique id. > e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13918159#comment-13918159 ] Gokhan Capan commented on MAHOUT-1178: -- Let me get the pieces together and submit a patch in a few days. > GSOC 2013: Improve Lucene support in Mahout > --- > > Key: MAHOUT-1178 > URL: https://issues.apache.org/jira/browse/MAHOUT-1178 > Project: Mahout > Issue Type: New Feature >Reporter: Dan Filimon >Assignee: Gokhan Capan > Labels: gsoc2013, mentor > Fix For: 1.0 > > Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch > > > [via Ted Dunning] > It should be possible to view a Lucene index as a matrix. This would > require that we standardize on a way to convert documents to rows. There > are many choices, the discussion of which should be deferred to the actual > work on the project, but there are a few obvious constraints: > a) it should be possible to get the same result as dumping the term vectors > for each document each to a line and converting that result using standard > Mahout methods. > b) numeric fields ought to work somehow. > c) if there are multiple text fields that ought to work sensibly as well. > Two options include dumping multiple matrices or to convert the fields > into a single row of a single matrix. > d) it should be possible to refer back from a row of the matrix to find the > correct document. THis might be because we remember the Lucene doc number > or because a field is named as holding a unique id. > e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914494#comment-13914494 ] Gokhan Capan commented on MAHOUT-1329: -- Sure I can. Although my vote would be for passing the version: considering the different distributions out there, people may want to build mahout against whatever hadoop2 distro they use. (I am not very sure about my own argument, actually; it would be great to hear a counter-argument.) > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, > 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911436#comment-13911436 ] Gokhan Capan commented on MAHOUT-1329: -- I committed this to trunk > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1329: - Resolution: Fixed Status: Resolved (was: Patch Available) > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908443#comment-13908443 ] Gokhan Capan commented on MAHOUT-1329: -- Good news that I tried that too, on a 2.2.0 cluster. seqdir, seq2sparse, and kmeans worked without a problem. I'm gonna wait till Monday to commit this, in case folks want to verify that it works. > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126 ] Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:59 AM: --- Yeah, you're right, edit coming. Did you manage to run jobs against the cluster [EDIT:Sorry I missed you mentioned that you ran the examples, great then] was (Author: gokhancapan): Yeah, you're right, edit coming. Did you manage to run jobs against the cluster? > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907480#comment-13907480 ] Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:52 AM: --- Sergey, I modified your patch and produced a new version. Looking at the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m so that the unit tests would not fail with an OOM.) For hadoop version 1.2.1: mvn clean package; for hadoop version 2.2.0: mvn clean package -Dhadoop2.version=2.2.0. I unit tested this for both versions and saw the tests pass, but I don't have access to a hadoop test environment currently, so could you guys test whether this actually works (I'll do it tomorrow anyway)? Then we can commit it. was (Author: gokhancapan): Sergey, I modified your patch and produced a new version. Looking at the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m so that the unit tests would not fail with an OOM.) For hadoop version 1.2.1: mvn clean package; for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0. I unit tested this for both versions and saw the tests pass, but I don't have access to a hadoop test environment currently, so could you guys test whether this actually works (I'll do it tomorrow anyway)? Then we can commit it. > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908126#comment-13908126 ] Gokhan Capan commented on MAHOUT-1329: -- Yeah, you're right, edit coming. Did you manage to run jobs against the cluster? > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan reassigned MAHOUT-1329: Assignee: Gokhan Capan (was: Suneel Marthi) > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Gokhan Capan > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1329: - Attachment: 1329-3.patch > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Suneel Marthi > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907480#comment-13907480 ] Gokhan Capan commented on MAHOUT-1329: -- Sergey, I modified your patch and produced a new version. Looking at the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m so that the unit tests would not fail with an OOM.) For hadoop version 1.2.1: mvn clean package; for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0. I unit tested this for both versions and saw the tests pass, but I don't have access to a hadoop test environment currently, so could you guys test whether this actually works (I'll do it tomorrow anyway)? Then we can commit it. > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Suneel Marthi > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329-3.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907237#comment-13907237 ] Gokhan Capan edited comment on MAHOUT-1329 at 2/20/14 5:50 PM: --- Hi Sergey, thank you for that, I am copying from MAHOUT-1354: Gokhan: "Looks like when hadoop-2 profile is activated, this patch fails to apply the hadoop-2 related dependencies to integration and examples modules, despite they are both dependent to core and core is dependent to hadoop-2. For me, moving hadoop dependencies to the root solved the problem, but I think we wouldn't want that since hadoop is not a common dependency for all modules of the project." Ted: "It is important to keep modules like mahout math free of the massive Hadoop dependency." I think pushing dependencies to the root is not something that we desire, but let me look into this further. was (Author: gokhancapan): Hi Sergey, thank you for that, I am copying from MAHOUT-1354: Gokhan: "Looks like when hadoop-2 profile is activated, this patch fails to apply the hadoop-2 related dependencies to integration and examples modules, despite they are both dependent to core and core is dependent to hadoop-2. For me, moving hadoop dependencies to the root solved the problem, but I think we wouldn't want that since hadoop is not a common dependency for all modules of the project." Ted: "It is important to keep modules like mahout math free of the massive Hadoop dependency." I think pushing dependencies to the root is not something that we desire I think, but let me look into this further. > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Suneel Marthi > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. 
-- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907237#comment-13907237 ] Gokhan Capan commented on MAHOUT-1329: -- Hi Sergey, thank you for that, I am copying from MAHOUT-1354: Gokhan: "Looks like when hadoop-2 profile is activated, this patch fails to apply the hadoop-2 related dependencies to integration and examples modules, despite they are both dependent to core and core is dependent to hadoop-2. For me, moving hadoop dependencies to the root solved the problem, but I think we wouldn't want that since hadoop is not a common dependency for all modules of the project." Ted: "It is important to keep modules like mahout math free of the massive Hadoop dependency." I think pushing dependencies to the root is not something that we desire I think, but let me look into this further. > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Suneel Marthi > Labels: patch > Fix For: 1.0 > > Attachments: 1329-2.patch, 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906062#comment-13906062 ] Gokhan Capan commented on MAHOUT-1329: -- Is it OK to add hadoop dependencies to the project root, and to the math module (actually to all modules, even though they already depend on the core module)? I remember that's what we wanted to avoid. > Mahout for hadoop 2 > --- > > Key: MAHOUT-1329 > URL: https://issues.apache.org/jira/browse/MAHOUT-1329 > Project: Mahout > Issue Type: Task > Components: build >Affects Versions: 0.9 >Reporter: Sergey Svinarchuk >Assignee: Suneel Marthi > Labels: patch > Fix For: 1.0 > > Attachments: 1329.patch > > > Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843226#comment-13843226 ] Gokhan Capan commented on MAHOUT-1354: -- Yeah, I agree > Mahout Support for Hadoop 2 > > > Key: MAHOUT-1354 > URL: https://issues.apache.org/jira/browse/MAHOUT-1354 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Suneel Marthi >Assignee: Suneel Marthi > Fix For: 1.0 > > Attachments: MAHOUT-1354_initial.patch > > > Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842960#comment-13842960 ] Gokhan Capan commented on MAHOUT-1354: -- Looks like when the hadoop-2 profile is activated, this patch fails to apply the hadoop-2 related dependencies to the integration and examples modules, even though they both depend on core and core depends on hadoop-2. For me, moving the hadoop dependencies to the root solved the problem, but I think we wouldn't want that, since hadoop is not a common dependency for all modules of the project. CC'ing [~frankscholten] > Mahout Support for Hadoop 2 > > > Key: MAHOUT-1354 > URL: https://issues.apache.org/jira/browse/MAHOUT-1354 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Suneel Marthi >Assignee: Suneel Marthi > Fix For: 1.0 > > Attachments: MAHOUT-1354_initial.patch > > > Mahout support for Hadoop, now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1354: - Attachment: MAHOUT-1354_initial.patch Could you guys test this initial patch against different versions of clusters to see if that works? Usage: mahout against hadoop1 (version 1.2.1): mvn package mahout against hadoop2-stable (version 2.2.0, by default): mvn package -Phadoop2 mahout against hadoop2-earlier: mvn package -Phadoop2 -Dhadoop.version= > Mahout Support for Hadoop 2 > > > Key: MAHOUT-1354 > URL: https://issues.apache.org/jira/browse/MAHOUT-1354 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Suneel Marthi >Assignee: Suneel Marthi > Fix For: 1.0 > > Attachments: MAHOUT-1354_initial.patch > > > Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837933#comment-13837933 ] Gokhan Capan commented on MAHOUT-1354: -- Today I had some trouble with integration's transitive dependencies; let me dig further. So this should still stay in the 1.0 queue > Mahout Support for Hadoop 2 > > > Key: MAHOUT-1354 > URL: https://issues.apache.org/jira/browse/MAHOUT-1354 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Suneel Marthi >Assignee: Suneel Marthi > Fix For: 1.0 > > > Mahout support for Hadoop, now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836965#comment-13836965 ] Gokhan Capan commented on MAHOUT-1354: -- Let me submit a patch first, probably tomorrow. Best > Mahout Support for Hadoop 2 > > > Key: MAHOUT-1354 > URL: https://issues.apache.org/jira/browse/MAHOUT-1354 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Suneel Marthi >Assignee: Suneel Marthi > Fix For: 1.0 > > > Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836953#comment-13836953 ] Gokhan Capan commented on MAHOUT-1354: -- Well, I tried something and want to share. Based on: in hadoop-2-stable, compatibility with hadoop-1 is preferred over compatibility with hadoop-2-alpha (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html). For example, the return type of ProgramDriver#driver(String) was void in hadoop-1 (which we use in MahoutDriver), int in hadoop-2-alpha, and void again in hadoop-2-stable. It seems that if we select the right artifacts, there is nothing to worry about regarding compatibility. My conclusion was: the current hadoop-0.20 and hadoop-0.23 profiles can be utilized: we can rename them to hadoop-1 and hadoop-2, respectively, then make hadoop-2 (stable) the default profile, and set the hadoop.version property to 2.2.0. We need to worry about some third-party dependencies, though; for instance, hbase-client in mahout-integration depends on hadoop-1 (for that particular artifact, simply excluding hadoop-core did not break any tests, by the way). > Mahout Support for Hadoop 2 > > > Key: MAHOUT-1354 > URL: https://issues.apache.org/jira/browse/MAHOUT-1354 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Suneel Marthi >Assignee: Suneel Marthi > Fix For: 1.0 > > > Mahout support for Hadoop, now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
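The return-type drift described above matters because it is a binary-compatibility hazard: code compiled against one signature can fail with NoSuchMethodError when run against the other. A toy illustration using stand-in classes (these are hypothetical, not the real Hadoop ProgramDriver), inspecting the signature via reflection:

```java
import java.lang.reflect.Method;

// Hypothetical stand-ins: in hadoop-1 and hadoop-2-stable the driver
// method returns void; in hadoop-2-alpha it returns int.
class DriverV1 { public void driver(String[] args) { } }
class DriverAlpha { public int driver(String[] args) { return 0; } }

public class ReturnTypeCheck {
    // Reports the return type of the 'driver' method on a given class.
    static String returnTypeOf(Class<?> c) throws NoSuchMethodException {
        Method m = c.getMethod("driver", String[].class);
        return m.getReturnType().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(returnTypeOf(DriverV1.class));    // void
        System.out.println(returnTypeOf(DriverAlpha.class)); // int
    }
}
```

Picking the artifact whose signatures match what the calling code was compiled against (the point of the comment above) sidesteps this class of failure entirely.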
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836661#comment-13836661 ] Gokhan Capan commented on MAHOUT-1354: -- Do you think we should support hadoop-1 and hadoop-2 at the same time? > Mahout Support for Hadoop 2 > > > Key: MAHOUT-1354 > URL: https://issues.apache.org/jira/browse/MAHOUT-1354 > Project: Mahout > Issue Type: Improvement >Affects Versions: 0.8 >Reporter: Suneel Marthi >Assignee: Suneel Marthi > Fix For: 1.0 > > > Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836102#comment-13836102 ] Gokhan Capan commented on MAHOUT-1286: -- Let's "Won't Fix" this issue. I think what we need to do is implement more sparse matrix (or similar) data structures for different access patterns, beyond the current map-of-maps approach. The ideas would apply to the current DataModel based on 2 FastByIDMaps. > Memory-efficient DataModel, supporting fast online updates and element-wise > iteration > - > > Key: MAHOUT-1286 > URL: https://issues.apache.org/jira/browse/MAHOUT-1286 > Project: Mahout > Issue Type: Improvement > Components: Collaborative Filtering >Affects Versions: 0.9 >Reporter: Peng Cheng > Labels: collaborative-filtering, datamodel, patch, recommender > Fix For: 0.9 > > Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, > Semifinal-implementation-added.patch, benchmark.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > Most DataModel implementation in current CF component use hash map to enable > fast 2d indexing and update. This is not memory-efficient for big data set. > e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. > Improved implementation of DataModel should use more compact data structure > (like arrays), this can trade a little of time complexity in 2d indexing for > vast improvement in memory efficiency. In addition, any online recommender or > online-to-batch converted recommender will not be affected by this in > training process. -- This message was sent by Atlassian JIRA (v6.1#6144)
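The compact-arrays alternative to the map-of-maps layout mentioned in the issue description can be sketched as one user's preferences in two parallel arrays: sorted item IDs plus a ratings array, trading lookup cost (binary search) for memory. A rough sketch, with all class and method names hypothetical rather than Mahout APIs:

```java
import java.util.Arrays;

// Hypothetical sketch: one user's preferences as parallel arrays
// instead of a per-user FastByIDMap. Lookup is a binary search over
// the sorted item-ID array; storage is two flat arrays per user.
public class CompactPreferenceArray {
    private final long[] itemIds;  // must be sorted ascending
    private final float[] values;  // values[i] is the rating for itemIds[i]

    public CompactPreferenceArray(long[] itemIds, float[] values) {
        this.itemIds = itemIds;
        this.values = values;
    }

    // Returns the rating for itemId, or NaN if the user has not rated it.
    public float get(long itemId) {
        int idx = Arrays.binarySearch(itemIds, itemId);
        return idx >= 0 ? values[idx] : Float.NaN;
    }

    public int numPreferences() {
        return itemIds.length;
    }

    public static void main(String[] args) {
        CompactPreferenceArray prefs = new CompactPreferenceArray(
            new long[] {3L, 17L, 42L}, new float[] {4.0f, 2.5f, 5.0f});
        System.out.println(prefs.get(17L)); // 2.5
        System.out.println(prefs.get(99L)); // NaN
    }
}
```

The trade-off matches the issue description: O(log n) per lookup instead of O(1), but no per-entry boxing or hash-table overhead, which is where the map-of-maps memory goes.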
[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806106#comment-13806106 ] Gokhan Capan edited comment on MAHOUT-1286 at 10/26/13 2:13 PM: Peng, I am attaching a patch (not to be committed) that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline. was (Author: gokhancapan): Peng, I am attaching a patch --not to be committed-- that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline. > Memory-efficient DataModel, supporting fast online updates and element-wise > iteration > - > > Key: MAHOUT-1286 > URL: https://issues.apache.org/jira/browse/MAHOUT-1286 > Project: Mahout > Issue Type: Improvement > Components: Collaborative Filtering >Affects Versions: 0.9 >Reporter: Peng Cheng > Labels: collaborative-filtering, datamodel, patch, recommender > Fix For: 0.9 > > Attachments: benchmark.patch, InMemoryDataModel.java, > InMemoryDataModelTest.java, Semifinal-implementation-added.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > Most DataModel implementation in current CF component use hash map to enable > fast 2d indexing and update. This is not memory-efficient for big data set. > e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. > Improved implementation of DataModel should use more compact data structure > (like arrays), this can trade a little of time complexity in 2d indexing for > vast improvement in memory efficiency. In addition, any online recommender or > online-to-batch converted recommender will not be affected by this in > training process. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806106#comment-13806106 ] Gokhan Capan edited comment on MAHOUT-1286 at 10/26/13 2:13 PM: Peng, I am attaching a patch --not to be committed-- that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline. was (Author: gokhancapan): Peng, I am attaching a patch -not to be committed- that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline. > Memory-efficient DataModel, supporting fast online updates and element-wise > iteration > - > > Key: MAHOUT-1286 > URL: https://issues.apache.org/jira/browse/MAHOUT-1286 > Project: Mahout > Issue Type: Improvement > Components: Collaborative Filtering >Affects Versions: 0.9 >Reporter: Peng Cheng > Labels: collaborative-filtering, datamodel, patch, recommender > Fix For: 0.9 > > Attachments: benchmark.patch, InMemoryDataModel.java, > InMemoryDataModelTest.java, Semifinal-implementation-added.patch > > Original Estimate: 336h > Remaining Estimate: 336h > > Most DataModel implementation in current CF component use hash map to enable > fast 2d indexing and update. This is not memory-efficient for big data set. > e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. > Improved implementation of DataModel should use more compact data structure > (like arrays), this can trade a little of time complexity in 2d indexing for > vast improvement in memory efficiency. In addition, any online recommender or > online-to-batch converted recommender will not be affected by this in > training process. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1286: - Attachment: benchmark.patch

Peng, I am attaching a patch -not to be committed- that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline.
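The compact layout the issue description argues for can be illustrated with a small sketch (hypothetical code, not the attached benchmark.patch): ratings kept in parallel primitive arrays sorted by user id, with binary search standing in for the hash map's 2d indexing.

```java
import java.util.Arrays;

/** Hypothetical sketch of the array-backed DataModel idea from MAHOUT-1286:
 *  parallel primitive arrays sorted by user id, binary search instead of
 *  hash maps, roughly 20 bytes per rating (8 + 8 + 4) with no boxing. */
public class CompactPreferences {
  private final long[] userIds;   // sorted ascending, one entry per rating
  private final long[] itemIds;   // item id for each rating
  private final float[] values;   // rating value for each rating

  /** Arrays must already be sorted by userIds. */
  public CompactPreferences(long[] userIds, long[] itemIds, float[] values) {
    this.userIds = userIds;
    this.itemIds = itemIds;
    this.values = values;
  }

  /** O(log n + k) lookup: binary-search the user's slice, then scan its items. */
  public Float get(long userId, long itemId) {
    int i = Arrays.binarySearch(userIds, userId);
    if (i < 0) return null;
    // binarySearch may land anywhere in a run of equal user ids; widen to the run.
    int lo = i, hi = i;
    while (lo > 0 && userIds[lo - 1] == userId) lo--;
    while (hi < userIds.length - 1 && userIds[hi + 1] == userId) hi++;
    for (int j = lo; j <= hi; j++) {
      if (itemIds[j] == itemId) return values[j];
    }
    return null;
  }

  public int numPreferences() { return userIds.length; }
}
```

The trade-off is exactly the one the issue states: lookups pay a logarithm instead of a constant, but the heap footprint drops to three primitive arrays.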
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799916#comment-13799916 ] Gokhan Capan commented on MAHOUT-1178:

Hi [~smarthi], although I'm not sure whether there is still interest, I have a Lucene matrix implementation (in-memory) and a Solr matrix implementation (which does not load the index into memory). I believe both can be committed after a couple of review rounds.

> GSOC 2013: Improve Lucene support in Mahout
>
> Key: MAHOUT-1178
> URL: https://issues.apache.org/jira/browse/MAHOUT-1178
> Project: Mahout
> Issue Type: New Feature
> Reporter: Dan Filimon
> Labels: gsoc2013, mentor
> Fix For: Backlog
>
> Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch
>
> [via Ted Dunning]
> It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints:
> a) it should be possible to get the same result as dumping the term vectors for each document, each to a line, and converting that result using standard Mahout methods.
> b) numeric fields ought to work somehow.
> c) if there are multiple text fields, that ought to work sensibly as well. Two options include dumping multiple matrices or converting the fields into a single row of a single matrix.
> d) it should be possible to refer back from a row of the matrix to find the correct document. This might be because we remember the Lucene doc number or because a field is named as holding a unique id.
> e) named vectors and matrices should be used if plausible.

-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759021#comment-13759021 ] Gokhan Capan edited comment on MAHOUT-1286 at 9/5/13 12:22 PM:

Even if it is not an exact Matrix structure, we can start with 2d hash tables and proceed later. Let's start this. I tried to insert Netflix ratings into: i- a DataModel backed by 2 matrices; ii- the one in this patch. The good news is that insert performance is good enough. I am going to try gets and iterations, too. Tomorrow I am starting on the 2d hash table based on your implementation, with a matrix-like interface; I will share a github link with you.

was (Author: gokhancapan): There was a thread on updating "int" indices and "double" values in matrices, but there are simply too many consequences of that update that we can't deal with right now. Even if it is not an exact Matrix structure, we can start with 2d hash tables and proceed later. Let's start this. I tried to insert Netflix ratings into: i- DataModel backed by 2 matrices. ii- The one in this patch. Good news is insert performance is good enough. I am going to try gets and iterations, too. Tomorrow I am starting the 2d hash table based on your implementation with a matrix-like interface, I am going to share a github link with you.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759021#comment-13759021 ] Gokhan Capan commented on MAHOUT-1286:

There was a thread on updating "int" indices and "double" values in matrices, but there are simply too many consequences of that update that we can't deal with right now. Even if it is not an exact Matrix structure, we can start with 2d hash tables and proceed later. Let's start this. I tried to insert Netflix ratings into: i- a DataModel backed by 2 matrices; ii- the one in this patch. The good news is that insert performance is good enough. I am going to try gets and iterations, too. Tomorrow I am starting on the 2d hash table based on your implementation, with a matrix-like interface; I will share a github link with you.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757801#comment-13757801 ] Gokhan Capan commented on MAHOUT-1286:

Here is what I think:
1- We should implement a matrix that uses your 2d Hopscotch hash table as the underlying data structure (or the open-addressing hash table implementation that already exists in Mahout, depending on benchmarks).
2- We should handle the concurrency issues that might be introduced by that matrix implementation.
3- We can then replace the FastByIDMap(s) with that matrix, trust the underlying matrix for concurrent updates, and never create a PreferenceArray unless there is an iteration over users (or items).
What do you think?
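Point 1's "2d hash table with a matrix-like interface" can be sketched by packing (row, column) into a single long key over an open-addressing table. This is a hypothetical illustration (class and method names are invented), not Peng's Hopscotch implementation:

```java
/** Hypothetical sketch of a 2d hash table: (row, col) packed into one long key,
 *  linear-probing open addressing over primitive arrays (no boxing). */
public class Long2DMap {
  private static final long EMPTY = -1L;  // rows/cols are nonnegative, so no real key is -1
  private long[] keys;
  private double[] vals;
  private int size;

  public Long2DMap(int capacity) {
    keys = new long[capacity];
    vals = new double[capacity];
    java.util.Arrays.fill(keys, EMPTY);
  }

  private static long key(int row, int col) {
    return ((long) row << 32) | (col & 0xFFFFFFFFL);
  }

  public void set(int row, int col, double value) {
    if (size * 2 >= keys.length) grow();   // keep load factor under 50%
    long k = key(row, col);
    int i = probe(k);
    if (keys[i] == EMPTY) { keys[i] = k; size++; }
    vals[i] = value;
  }

  /** Returns 0.0 for absent cells, matching sparse-matrix semantics. */
  public double get(int row, int col) {
    int i = probe(key(row, col));
    return keys[i] == EMPTY ? 0.0 : vals[i];
  }

  /** Linear probing: walk until we find the key or an empty slot. */
  private int probe(long k) {
    long h = k * 0x9E3779B97F4A7C15L;      // Fibonacci hashing to mix the bits
    int i = (int) ((h ^ (h >>> 32)) & 0x7FFFFFFF) % keys.length;
    while (keys[i] != EMPTY && keys[i] != k) i = (i + 1) % keys.length;
    return i;
  }

  private void grow() {
    long[] ok = keys;
    double[] ov = vals;
    keys = new long[ok.length * 2];
    vals = new double[keys.length];
    java.util.Arrays.fill(keys, EMPTY);
    size = 0;
    for (int i = 0; i < ok.length; i++) {  // rehash every live entry
      if (ok[i] != EMPTY) set((int) (ok[i] >>> 32), (int) ok[i], ov[i]);
    }
  }
}
```

Concurrency (point 2) is deliberately left out of the sketch; that is the part a committable version would have to solve, e.g. by striping locks over the table.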
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751053#comment-13751053 ] Gokhan Capan commented on MAHOUT-1286:

By the way, it seems the link to the paper is broken, if it is not just me.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751049#comment-13751049 ] Gokhan Capan commented on MAHOUT-1286:

Hi Peng, could you submit diff files instead of the .java files? That would be more convenient for me, if possible.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737267#comment-13737267 ] Gokhan Capan commented on MAHOUT-1286:

Peng, with a SparseRowMatrix, column access (getPreferencesForItem) is slow, but row access (getPreferencesFromUser) is pretty fast. I agree with all the other problems you mentioned. In Mahout's SVD-based recommenders and FactorizablePreferences, while computing top-N recommendations, I believe we compute predictions for each item and return the top N. So basically, an SVD-based recommender needs fast access to the rows of the matrix but not to the columns (it still needs to iterate over item ids, though). Column access is only needed in an item-based recommender, or if a CandidateItemsStrategy is used. In my tests on the Netflix data, I saw a 3G heap, too. Let me compare this particular approach with the SparseRowMatrix-backed one. I will investigate your approach further.

Ted, additionally, I recently implemented a read-only SolrMatrix, which might be beneficial for implementing the SolrRecommender, if we want to use the existing Mahout library for similarities etc. I will open a new thread for that.

Best
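The asymmetry described in the comment above (fast row access, slow column access in a row-major sparse layout) boils down to one map lookup versus a scan over every row. A minimal hypothetical sketch, not Mahout's actual SparseRowMatrix:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical row-major sparse matrix: each row is its own sparse map. */
public class RowMajorSparse {
  private final Map<Integer, Map<Integer, Double>> rows = new HashMap<>();

  public void set(int row, int col, double v) {
    rows.computeIfAbsent(row, r -> new HashMap<>()).put(col, v);
  }

  /** Fast: a single lookup returns the whole row (cf. getPreferencesFromUser). */
  public Map<Integer, Double> viewRow(int row) {
    return rows.getOrDefault(row, Map.of());
  }

  /** Slow: must visit every row to collect one column (cf. getPreferencesForItem). */
  public Map<Integer, Double> viewColumn(int col) {
    Map<Integer, Double> out = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> e : rows.entrySet()) {
      Double v = e.getValue().get(col);
      if (v != null) out.put(e.getKey(), v);
    }
    return out;
  }
}
```

This is why a pure SVD-based recommender, which only scores a user's row against item factors, tolerates the layout, while item-based recommenders (which need columns) do not.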
[jira] [Commented] (MAHOUT-1193) We may want a BlockSparseMatrix
[ https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642752#comment-13642752 ] Gokhan Capan commented on MAHOUT-1193:

Sorry, I missed that. I modified the SparseMatrix code to handle dense rows and I am happy with that. The code is not patch-quality, but I can implement a flexible extension to the current implementation if that is desired (I believe it might be a common use case). I personally liked the BlockSparseMatrix idea and its really flexible schema. I did a quick implementation to make it work with a configurable block size; in a few days I can submit an additional diff to the reviewboard so we can discuss the code. One thing to consider: I suspect my version's CPU usage is kind of high. I believe both versions are valuable and important; they have their own benefits, particularly as input to online learning algorithms.

> We may want a BlockSparseMatrix
>
> Key: MAHOUT-1193
> URL: https://issues.apache.org/jira/browse/MAHOUT-1193
> Project: Mahout
> Issue Type: Bug
> Reporter: Ted Dunning
> Attachments: MAHOUT-1193.patch
>
> Here is an implementation.
> Is it good enough to commit?
> Is it useful?
> Is it redundant?
[jira] [Commented] (MAHOUT-1193) We may want a BlockSparseMatrix
[ https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640268#comment-13640268 ] Gokhan Capan commented on MAHOUT-1193:

Ok, here are the updates: I modified the code a little (made it run and modified it as I had commented previously), and did some tests within the real application that I mentioned in the user list.

Performance of gets and sets (higher is better):
DenseMatrix > SparseMatrix (with dense rows) > BlockSparseMatrix > SparseMatrix (with sparse rows) > SparseColumnMatrix

The performance difference between SparseMatrix with dense rows and BlockSparseMatrix is small. One drawback of SparseMatrix might be that you need to specify the rowSize in advance (which means you need to set a boundary for your row indices). This wasn't a problem for me, but it's worth mentioning. With this version of BlockSparseMatrix, there might also be a memory overhead depending on blockSize. I decided to go with SparseMatrix with dense rows for now, but I am also working on the BlockSparseMatrix code (thanks to the flexible schema).
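A get/set comparison of the kind reported above can be reproduced with a harness along these lines (hypothetical sketch; the matrix adapter and sizes are placeholders, and for trustworthy numbers a real benchmark harness such as JMH is preferable to single-shot timing):

```java
import java.util.Random;

/** Hypothetical micro-benchmark sketch for comparing matrix get/set throughput.
 *  Only shows the shape of the measurement; not the actual tests described above. */
public class MatrixBench {
  /** Minimal adapter so any matrix implementation can be timed the same way. */
  public interface Cell {
    void set(int r, int c, double v);
    double get(int r, int c);
  }

  /** Runs n random sets then n random gets against the same cells; returns nanos. */
  public static long time(Cell m, int rows, int cols, int n, long seed) {
    Random rnd = new Random(seed);
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) m.set(rnd.nextInt(rows), rnd.nextInt(cols), i);
    double sink = 0;
    rnd = new Random(seed);  // replay the same coordinates for the gets
    for (int i = 0; i < n; i++) sink += m.get(rnd.nextInt(rows), rnd.nextInt(cols));
    long elapsed = System.nanoTime() - start;
    if (sink < 0) throw new AssertionError();  // keep the JIT from eliding the gets
    return elapsed;
  }

  public static void main(String[] args) {
    double[][] dense = new double[1000][1000];
    Cell denseCell = new Cell() {
      public void set(int r, int c, double v) { dense[r][c] = v; }
      public double get(int r, int c) { return dense[r][c]; }
    };
    System.out.println("dense: " + time(denseCell, 1000, 1000, 100000, 42) + " ns");
  }
}
```

Each candidate (DenseMatrix, SparseMatrix with dense or sparse rows, BlockSparseMatrix, SparseColumnMatrix) would get its own small `Cell` adapter and be run with the same seed.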
[jira] [Commented] (MAHOUT-1193) We may want a BlockSparseMatrix
[ https://issues.apache.org/jira/browse/MAHOUT-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635670#comment-13635670 ] Gokhan Capan commented on MAHOUT-1193:

Is it just me, or does it not compile because it does not have a matching super constructor and cardinality is not declared? What I understand from the implementation is that we create a Map, each Entry of which represents a block and the associated DenseMatrix. If I didn't totally misunderstand the implementation: if the blockSize will always be 1, this associates a matrix with each row. Say I want to sacrifice some memory and set blockSize to 5; then, if there were n actual rows in [row/blockSize, row/blockSize+5), there would be 5-n empty ones, and I am OK with that. Shouldn't we modify the extendToThisRow method such that:

int blockIndex = row / blockSize;
Matrix block = data.get(blockIndex);
if (block == null) {
  data.put(blockIndex, new DenseMatrix(blockSize, columns));
} else if (!block.hasRow(row)) {
  block.assignRow(row % blockSize, new DenseVector(columns));
}
rows = Math.max(row + 1, rows);
cardinality[ROW] = rows;
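A standalone sketch of the block layout discussed above: `row / blockSize` selects the block, and `row % blockSize` is the local row within it. Names here are hypothetical, not the patch's actual API:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal block-sparse row store: rows are grouped into dense
 *  blocks of blockSize rows each, and a block is allocated lazily on first write. */
public class BlockRows {
  private final int blockSize;
  private final int columns;
  private final Map<Integer, double[][]> blocks = new HashMap<>();
  private int rows;

  public BlockRows(int blockSize, int columns) {
    this.blockSize = blockSize;
    this.columns = columns;
  }

  public void set(int row, int col, double v) {
    int blockIndex = row / blockSize;        // which block holds this row
    double[][] block = blocks.computeIfAbsent(
        blockIndex, b -> new double[blockSize][columns]);
    block[row % blockSize][col] = v;         // local row within the block
    rows = Math.max(row + 1, rows);          // grow the logical row count
  }

  public double get(int row, int col) {
    double[][] block = blocks.get(row / blockSize);
    return block == null ? 0.0 : block[row % blockSize][col];
  }

  public int rowSize() { return rows; }
  public int allocatedBlocks() { return blocks.size(); }
}
```

The memory overhead mentioned in the thread is visible here: one write into a block allocates all blockSize rows of that block, even if the neighbors stay empty.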
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634477#comment-13634477 ] Gokhan Capan commented on MAHOUT-1178:

Thanks for the valuable reviews. I updated the review request, but not the patch here. I will do it after another review round.
[jira] [Comment Edited] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629056#comment-13629056 ] Gokhan Capan edited comment on MAHOUT-1178 at 4/11/13 4:21 PM:

Hi Sebastian, I did, though I'm not sure if I did it correctly :) Anyway, if it is correct, the diffs here and there are not the same (the base directories against which I created the diffs are different, and the one on reviewboard is a single diff file; the code is the same though, so I hope this is not a problem).
Update: adding the link https://reviews.apache.org/r/10420/

was (Author: gokhancapan): Hi Sebastian, I did, though I'm not sure if I did it correctly:) Anyway, if it is correct, the diff here and there are not the same (the base directories I created the diffs are different, and the one in reviewboard is in a single diff file. Code is same though, I hope this is not a problem)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629056#comment-13629056 ] Gokhan Capan commented on MAHOUT-1178:

Hi Sebastian, I did, though I'm not sure if I did it correctly:) Anyway, if it is correct, the diff here and there are not the same (the base directories I created the diffs are different, and the one in reviewboard is in a single diff file. Code is same though, I hope this is not a problem)
[jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1178: - Attachment: MAHOUT-1178.patch MAHOUT-1178-TEST.patch

Hi, I am adding a Matrix implementation that loads the entire data of a field of a Lucene index into an underlying SparseRowMatrix. It delegates the index-reading logic to the existing LuceneIterator. When I changed the LuceneIterator code a little to make it support StringFields, it broke LuceneIteratorTest, so I am going to add a new version of LuceneIterator that supports StringFields later. There is also an ongoing effort on another version of LuceneMatrix that lazy-loads from the index while iterating over the matrix; I am going to start a separate issue for that. I put the code in the integration module; the test and the actual code are in different diff files.
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626619#comment-13626619 ] Gokhan Capan commented on MAHOUT-1178:

Ted, do you think this should load the entire index into memory as a matrix? Or should it query the index when a get request is made? (And if the latter, should the set methods also update the Lucene index itself?)
[jira] [Updated] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run
[ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1069: - Attachment: MAHOUT-1069.patch Fixed a few minor bugs and updated the patch. > Multi-target, side-info aware, SGD-based recommender algorithms, examples, > and tools to run > --- > > Key: MAHOUT-1069 > URL: https://issues.apache.org/jira/browse/MAHOUT-1069 > Project: Mahout > Issue Type: Improvement > Components: CLI, Collaborative Filtering >Affects Versions: 0.8 >Reporter: Gokhan Capan >Assignee: Sean Owen > Labels: cf, improvement, sgd > Attachments: MAHOUT-1069.patch, MAHOUT-1069.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > Upon our conversations on the dev-list, I would like to state that I have > completed the merge of the recommender algorithms that are mentioned in > http://goo.gl/fh4d9 to Mahout. > These are a set of learning algorithms for matrix-factorization-based > recommendation, which are capable of: > * Recommending multiple targets: > *# Numerical Recommendation with OLS Regression > *# Binary Recommendation with Logistic Regression > *# Multinomial Recommendation with Softmax Regression > *# Ordinal Recommendation with Proportional Odds Model > * Leveraging side info in Mahout vector format where available > *# User side information > *# Item side information > *# Dynamic side information (side info at feedback moment, such as proximity, > day of week etc.) > * Online learning > Some command-line tools are provided as mahout jobs, for pre-experiment > utilities and running experiments. > Evaluation tools for numerical and categorical recommenders are added. 
> A simple example for Movielens-1M data is provided, and it achieved pretty > good results (0.851 RMSE on a randomly generated test set, after some > validation to determine learning and regularization rates on a separate > validation set). > There is no modification in the existing Mahout code, except the added lines > in driver.class.props for command-line tools. However, that became a huge > patch with dozens of new source files. > These algorithms are highly inspired by various influential Recommender > System papers, especially Yehuda Koren's. For example, the Ordinal model is > from Koren's OrdRec paper, except the cuts are not user-specific but global. > Left for future: > # The core algorithms are tested, but there probably exist some parts that those > tests do not cover. I saw many of those in action without problem, but I am > going to add new tests regularly. > # Not all algorithms have been tried on appropriate datasets, and they may > need some improvement. However, I also use the algorithms for my M.Sc. > thesis, which means I will eventually submit more experiments. As the > experimenting infrastructure exists, I believe the community may provide more > experiments, too. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run
[ https://issues.apache.org/jira/browse/MAHOUT-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1069: - Attachment: MAHOUT-1069.patch Attached is the patch. > Multi-target, side-info aware, SGD-based recommender algorithms, examples, > and tools to run > --- > > Key: MAHOUT-1069 > URL: https://issues.apache.org/jira/browse/MAHOUT-1069 > Project: Mahout > Issue Type: Improvement > Components: CLI, Collaborative Filtering >Affects Versions: 0.8 >Reporter: Gokhan Capan >Assignee: Sean Owen > Labels: cf, improvement, sgd > Attachments: MAHOUT-1069.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > Upon our conversations on the dev-list, I would like to state that I have > completed the merge of the recommender algorithms that are mentioned in > http://goo.gl/fh4d9 to Mahout. > These are a set of learning algorithms for matrix-factorization-based > recommendation, which are capable of: > * Recommending multiple targets: > *# Numerical Recommendation with OLS Regression > *# Binary Recommendation with Logistic Regression > *# Multinomial Recommendation with Softmax Regression > *# Ordinal Recommendation with Proportional Odds Model > * Leveraging side info in Mahout vector format where available > *# User side information > *# Item side information > *# Dynamic side information (side info at feedback moment, such as proximity, > day of week etc.) > * Online learning > Some command-line tools are provided as mahout jobs, for pre-experiment > utilities and running experiments. > Evaluation tools for numerical and categorical recommenders are added. > A simple example for Movielens-1M data is provided, and it achieved pretty > good results (0.851 RMSE on a randomly generated test set, after some > validation to determine learning and regularization rates on a separate > validation set). > There is no modification in the existing Mahout code, except the added lines > in driver.class.props for command-line tools. 
However, that became a huge > patch with dozens of new source files. > These algorithms are highly inspired by various influential Recommender > System papers, especially Yehuda Koren's. For example, the Ordinal model is > from Koren's OrdRec paper, except the cuts are not user-specific but global. > Left for future: > # The core algorithms are tested, but there probably exist some parts that those > tests do not cover. I saw many of those in action without problem, but I am > going to add new tests regularly. > # Not all algorithms have been tried on appropriate datasets, and they may > need some improvement. However, I also use the algorithms for my M.Sc. > thesis, which means I will eventually submit more experiments. As the > experimenting infrastructure exists, I believe the community may provide more > experiments, too. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAHOUT-1069) Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run
Gokhan Capan created MAHOUT-1069: Summary: Multi-target, side-info aware, SGD-based recommender algorithms, examples, and tools to run Key: MAHOUT-1069 URL: https://issues.apache.org/jira/browse/MAHOUT-1069 Project: Mahout Issue Type: Improvement Components: CLI, Collaborative Filtering Affects Versions: 0.8 Reporter: Gokhan Capan Assignee: Sean Owen Upon our conversations on the dev-list, I would like to state that I have completed the merge of the recommender algorithms that are mentioned in http://goo.gl/fh4d9 to Mahout. These are a set of learning algorithms for matrix-factorization-based recommendation, which are capable of: * Recommending multiple targets: *# Numerical Recommendation with OLS Regression *# Binary Recommendation with Logistic Regression *# Multinomial Recommendation with Softmax Regression *# Ordinal Recommendation with Proportional Odds Model * Leveraging side info in Mahout vector format where available *# User side information *# Item side information *# Dynamic side information (side info at feedback moment, such as proximity, day of week etc.) * Online learning Some command-line tools are provided as mahout jobs, for pre-experiment utilities and running experiments. Evaluation tools for numerical and categorical recommenders are added. A simple example for Movielens-1M data is provided, and it achieved pretty good results (0.851 RMSE on a randomly generated test set, after some validation to determine learning and regularization rates on a separate validation set). There is no modification in the existing Mahout code, except the added lines in driver.class.props for command-line tools. However, that became a huge patch with dozens of new source files. These algorithms are highly inspired by various influential Recommender System papers, especially Yehuda Koren's. For example, the Ordinal model is from Koren's OrdRec paper, except the cuts are not user-specific but global. 
Left for future: # The core algorithms are tested, but there probably exist some parts that those tests do not cover. I saw many of those in action without problem, but I am going to add new tests regularly. # Not all algorithms have been tried on appropriate datasets, and they may need some improvement. However, I also use the algorithms for my M.Sc. thesis, which means I will eventually submit more experiments. As the experimenting infrastructure exists, I believe the community may provide more experiments, too. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
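For a sense of the mechanics behind the binary (logistic) target described above, here is a minimal sketch of one SGD step for logistic matrix factorization: user and item factor vectors, a sigmoid link, and L2 regularization. The class, method, and parameter names are hypothetical; this is not the attached patch's code, just an illustration of the standard technique.

```java
// Hypothetical sketch of one SGD step for logistic (binary) matrix
// factorization: predict with a sigmoid over the user-item factor dot
// product, then nudge both factor vectors along the log-loss gradient,
// shrinking them with an L2 penalty.
public class LogisticMfStep {

    public static double predict(double[] user, double[] item) {
        double dot = 0.0;
        for (int f = 0; f < user.length; f++) {
            dot += user[f] * item[f];
        }
        return 1.0 / (1.0 + Math.exp(-dot)); // sigmoid link
    }

    public static void sgdStep(double[] user, double[] item,
                               double label, double rate, double lambda) {
        // (label - predicted) is the gradient of log-loss w.r.t. the dot product.
        double err = label - predict(user, item);
        for (int f = 0; f < user.length; f++) {
            double u = user[f];
            double v = item[f];
            user[f] += rate * (err * v - lambda * u);
            item[f] += rate * (err * u - lambda * v);
        }
    }
}
```

The other targets in the list swap the link and loss: identity/squared error for the numerical case, softmax/cross-entropy for the multinomial case, and cumulative logits for the ordinal case.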
[jira] [Updated] (MAHOUT-1064) Weird behavior of vector dumper
[ https://issues.apache.org/jira/browse/MAHOUT-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1064: - Attachment: MAHOUT-1064.patch Attached is a test that fails, and a quick fix. > Weird behavior of vector dumper > --- > > Key: MAHOUT-1064 > URL: https://issues.apache.org/jira/browse/MAHOUT-1064 > Project: Mahout > Issue Type: Bug > Components: Integration >Affects Versions: 0.8 >Reporter: Gokhan Capan >Priority: Minor > Labels: sort, vectordump > Fix For: 0.8 > > Attachments: MAHOUT-1064.patch > > > When the vectordump utility is executed with the sort flag set to true, I expect the > resulting vector to be sorted by value. In that case, however, the > VectorHelper.vectorToJson method sometimes returns unexpected results. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAHOUT-1064) Weird behavior of vector dumper
Gokhan Capan created MAHOUT-1064: Summary: Weird behavior of vector dumper Key: MAHOUT-1064 URL: https://issues.apache.org/jira/browse/MAHOUT-1064 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.8 Reporter: Gokhan Capan Priority: Minor Fix For: 0.8 When the vectordump utility is executed with the sort flag set to true, I expect the resulting vector to be sorted by value. In that case, however, the VectorHelper.vectorToJson method sometimes returns unexpected results. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
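What the sort flag is expected to produce can be sketched as follows. This is a hypothetical illustration of ordering a sparse vector's non-zero (index, value) entries by value, descending; it is not Mahout's actual VectorHelper code, and the class and method names are invented.

```java
import java.util.*;

// Hypothetical sketch (not Mahout's VectorHelper): order a sparse vector's
// non-zero (index, value) entries by value, descending, the way a sorted
// vector dump is expected to come out.
public class SortByValue {

    public static List<Map.Entry<Integer, Double>> sortedByValue(Map<Integer, Double> nonZeros) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(nonZeros.entrySet());
        // Descending by value, so the heaviest entries are serialized first.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return entries;
    }
}
```

A unit test for the reported bug would then assert that the serialized output preserves exactly this ordering.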
[jira] [Commented] (MAHOUT-1051) InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs
[ https://issues.apache.org/jira/browse/MAHOUT-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432039#comment-13432039 ] Gokhan Capan commented on MAHOUT-1051: -- Jake, I've run the new version without errors and checked a few documents to see whether they are related to the inferred topics. It works for me. > InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs > > > Key: MAHOUT-1051 > URL: https://issues.apache.org/jira/browse/MAHOUT-1051 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Affects Versions: 0.8 >Reporter: Gokhan Capan >Priority: Minor > Labels: cvb, lda > Fix For: 0.8 > > Attachments: MAHOUT-1051.patch, MAHOUT-1051.patch > > > Based upon our conversation with Jake in the user-list, I have modified the > o.a.m.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.loadVectors so > that it does not ignore document ids in the input. To preserve backwards > compatibility, it behaves as it did earlier if a ClassCastException is > thrown, which occurs when ids are not integers, and/or the document vector > (or getDelegate() if it is a NamedVector) cannot be cast to a > RandomAccessSparseVector. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-1051) InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs
[ https://issues.apache.org/jira/browse/MAHOUT-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1051: - Attachment: MAHOUT-1051.patch Attached is the patch. > InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs > > > Key: MAHOUT-1051 > URL: https://issues.apache.org/jira/browse/MAHOUT-1051 > Project: Mahout > Issue Type: Improvement > Components: Clustering >Affects Versions: 0.8 >Reporter: Gokhan Capan >Priority: Minor > Labels: cvb, lda > Fix For: 0.8 > > Attachments: MAHOUT-1051.patch > > > Based upon our conversation with Jake in the user-list, I have modified the > o.a.m.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.loadVectors so > that it does not ignore document ids in the input. To preserve backwards > compatibility, it behaves as it did earlier if a ClassCastException is > thrown, which occurs when ids are not integers, and/or the document vector > (or getDelegate() if it is a NamedVector) cannot be cast to a > RandomAccessSparseVector. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (MAHOUT-1051) InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs
Gokhan Capan created MAHOUT-1051: Summary: InMemoryCollapsedVariationalBayes0 to load input vectors with docIDs Key: MAHOUT-1051 URL: https://issues.apache.org/jira/browse/MAHOUT-1051 Project: Mahout Issue Type: Improvement Components: Clustering Affects Versions: 0.8 Reporter: Gokhan Capan Priority: Minor Fix For: 0.8 Attachments: MAHOUT-1051.patch Based upon our conversation with Jake in the user-list, I have modified the o.a.m.clustering.lda.cvb.InMemoryCollapsedVariationalBayes0.loadVectors so that it does not ignore document ids in the input. To preserve backwards compatibility, it behaves as it did earlier if a ClassCastException is thrown, which occurs when ids are not integers, and/or the document vector (or getDelegate() if it is a NamedVector) cannot be cast to a RandomAccessSparseVector. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
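The backwards-compatible id handling described above can be sketched like this. The class and method names are hypothetical stand-ins, not the patch's code: interpret each vector's name as an integer doc id when possible, and fall back to sequential ids when any name is missing or non-numeric, which plays the role of catching the ClassCastException in loadVectors.

```java
// Hypothetical sketch of the fallback described above: use each vector's
// name as an integer doc id; if any name is missing or not an integer,
// revert to the old behavior of sequential ids (the analogue of catching
// ClassCastException in loadVectors).
public class DocIdResolver {

    public static int[] resolveIds(String[] names) {
        int[] ids = new int[names.length];
        try {
            for (int i = 0; i < names.length; i++) {
                // Integer.parseInt throws NumberFormatException for null
                // or non-numeric names, triggering the fallback below.
                ids[i] = Integer.parseInt(names[i]);
            }
        } catch (NumberFormatException e) {
            for (int i = 0; i < names.length; i++) {
                ids[i] = i; // backwards-compatible: ignore the names entirely
            }
        }
        return ids;
    }
}
```

An all-or-nothing fallback like this keeps the id space consistent: either every row is keyed by its declared id, or every row is keyed by position, never a mix.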