Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair
Congratulations, Andrew! - G > On Jul 18, 2018, at 22:30, Andrew Palumbo wrote: > > Please join me in congratulating Andrew Musselman as the new Chair of the > Apache Mahout Project Management Committee. I would like to thank Andrew > for stepping up; all of us who have worked with him over the years know his > dedication to the project to be invaluable. I look forward to Andrew > taking the project into the future. > > Thank you, > > Andy
Re: Welcome Anand Avati
Welcome Anand! Sent from my iPhone On Apr 22, 2015, at 20:47, Dmitriy Lyubimov dlie...@gmail.com wrote: congrats and thank you! -d On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: Welcome to the team Anand; thanks for your contributions! On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati av...@gluster.org wrote: Thank you Suneel, I am thrilled to join the team! I am a relative newbie to data mining and machine learning. I currently work at Red Hat, but have joined grad school (in machine learning) starting this fall. I look forward to continuing my contributions, and thank you once again for the opportunity. Anand On Wed, Apr 22, 2015, 08:08 Suneel Marthi smar...@apache.org wrote: In recognition of the contributions of Anand Avati to the Mahout project over the past year, the PMC is pleased to announce that he has accepted our invitation to join the Mahout project as a committer. As is customary, I will leave it to Anand to provide a little bit of background about himself. Congratulations and Welcome! -Suneel Marthi On Behalf of Mahout PMC
Re: TF-IDF, seq2sparse and DataFrame support
Andrew, Maybe making the class tag evident in mapBlock calls?, i.e.: val tfIdfMatrix = tfMatrix.mapBlock(..){ ...idf transformation, etc... }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]]) Best, Gokhan On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo ap@outlook.com wrote: This (last commit on this branch) should be the beginning of a workaround for the problem of reading and returning a Generic-Writable keyed Drm: https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14 However the keyClassTag of the DrmLike returned by the mapBlock() calls, and finally by the method itself, is somehow converted to Object. I'm not exactly sure why this is happening. I think that the implicit evidence is being dropped in the mapBlock call on an [Object]-cast CheckpointedDrm. Maybe calling it out of the scope of this method (breaking down the method) would fix it.

val tfMatrix = drmMetadata.keyClassTag match {
  case ct if ct == ClassTag.Int =>
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)(keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
  case ct if ct == ClassTag(classOf[String]) =>
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)(keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
  case ct if ct == ClassTag.Long =>
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)(keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
  case _ =>
    (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)(keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
}

tfMatrix.checkpoint()

// make sure that the classtag of the tf matrix matches the metadata keyClassTag
assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag) -- Passes here with e.g. String keys

val tfIdfMatrix = tfMatrix.mapBlock(..){ ...idf transformation, etc... }

assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag) -- Fails here for all, with tfIdfMatrix.keyClassTag as an Object.

I'll keep looking at it a bit. If anybody has any ideas please let me know. On 03/09/2015 02:12 PM, Gokhan Capan wrote: So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix:DrmLike, dictionary:Map): https://github.com/gcapan/mahout/tree/seq2sparse Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation. I put it directly to the SparkEngine ---not that I think this object is the most appropriate placeholder, it just seemed convenient to me. Best Gokhan On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel p...@occamsmachete.com wrote: IndexedDataset might suffice until real DataFrames come along. On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it IIRC. matrix definitely not a structure to hold those. On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo ap@outlook.com wrote: On 02/04/2015 11:13 AM, Pat Ferrel wrote: Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc column = token. A one row DataFrame is a slightly heavy weight string/document. A DataFrame with token counts would be perfect for input TF-IDF, no?
It would be a vector that maintains the tokens as ids for the counts, right? Yes - dataframes will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be. I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Collocation? As far as collocations, I believe that the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...) either way, I don't think I ever looked too closely and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before... Obviously we don't have to follow this algorithm but it's a nice starting point. [1]https
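[Editorial sketch] A minimal illustration of Gokhan's suggestion at the top of this thread: carry the key ClassTag through the call explicitly so the DRM returned by mapBlock keeps its key type instead of decaying to Object. This assumes the usual Mahout Samsara imports; weightIdf and the idf vector are hypothetical names, not code from the branch.

import scala.reflect.ClassTag
import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.math.scalabindings.RLikeOps._

// Hypothetical helper: the key ClassTag travels with the call,
// so the result DRM stays keyed by K rather than by Object.
def weightIdf[K: ClassTag](tfMatrix: DrmLike[K], idf: Vector): DrmLike[K] =
  tfMatrix.mapBlock() { case (keys, block) =>
    // multiply each row's term frequencies elementwise by the in-core idf vector
    for (r <- 0 until block.nrow) block(r, ::) := block(r, ::) * idf
    keys -> block
  }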
Re: TF-IDF, seq2sparse and DataFrame support
Some answers: - Non-integer document ids: The implementation does not use operations defined for DrmLike[Int]-only, so the row keys do not have to be Int's. I just couldn't manage to create the returning DrmLike with the correct key type. Although while wrapping into a DrmLike, I tried to pass the key-class using HDFS utils like they are being used in drmDfsRead, but I somehow wasn't successful. So non-int document ids is not an actual issue here. - Breaking the implementation out to smaller pieces: Let's just collect the requirements and adjust the implementation accordingly. I honestly didn't think very much about where the implementation fits in, architecturally, and what pieces are of public interest. Best Gokhan On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi suneel.mar...@gmail.com wrote: AP, How is your impl different from Gokhan's? On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo ap@outlook.com wrote: BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy. On 03/09/2015 09:29 PM, Andrew Palumbo wrote: I meant would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live? On 03/09/2015 09:07 PM, Pat Ferrel wrote: Does o.a.m.nlp in the spark module seem like a good place for this to live? I think you meant math-scala? Actually we should rename math to core. On Mar 9, 2015, at 3:15 PM, Andrew Palumbo ap@outlook.com wrote: Cool- This is great! I think this is really important to have in. +1 to a pull request for comments. I have pr#75 (https://github.com/apache/mahout/pull/75) open - It has very simple TF and TFIDF classes based on lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them. Does o.a.m.nlp in the spark module seem like a good place for this to live? Those classes may be of use to you- they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability with easier. One thought, having not been able to look at this too closely yet. // do we need to calculate df-vector? 1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus. On 03/09/2015 05:10 PM, Pat Ferrel wrote: Ah, you are doing all the lucene analyzer, ngrams and other tokenizing, nice. On Mar 9, 2015, at 2:07 PM, Pat Ferrel p...@occamsmachete.com wrote: Ah, I found the right button in Github, no PR necessary. On Mar 9, 2015, at 1:55 PM, Pat Ferrel p...@occamsmachete.com wrote: If you create a PR it’s easier to see what was changed. Wouldn’t it be better to read in files from a directory assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files. BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String -> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames.
This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet probably will be. On Mar 9, 2015, at 11:12 AM, Gokhan Capan gkhn...@gmail.com wrote: So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix:DrmLike, dictionary:Map): https://github.com/gcapan/mahout/tree/seq2sparse Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation. I put it directly to the SparkEngine ---not that I think of this object is the most appropriate placeholder, it just seemed convenient to me. Best Gokhan On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel p...@occamsmachete.com wrote: IndexedDataset might suffice until real DataFrames come along. On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it IIRC. matrix definitely not a structure to hold those. On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo ap@outlook.com wrote: On 02/04/2015 11:13 AM, Pat Ferrel
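[Editorial sketch] The TF/TFIDF classes referenced above (PR #75) are described as being based on Lucene's IDF calculation; the snippet below is only a sketch of that classic weighting, idf = 1 + ln(numDocs / (docFreq + 1)), not the code in the PR.

// Sketch of the classic Lucene-style weighting mentioned above; not the PR's actual classes.
class SketchTFIDF(numDocs: Long) {
  // weight of a term occurring termFreq times in a document and in docFreq documents overall
  def weight(termFreq: Int, docFreq: Int): Double = {
    val idf = 1.0 + math.log(numDocs.toDouble / (docFreq + 1.0))
    termFreq * idf
  }
}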
Re: TF-IDF, seq2sparse and DataFrame support
So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix:DrmLike, dictionary:Map): https://github.com/gcapan/mahout/tree/seq2sparse Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation. I put it directly to the SparkEngine ---not that I think this object is the most appropriate placeholder, it just seemed convenient to me. Best Gokhan On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel p...@occamsmachete.com wrote: IndexedDataset might suffice until real DataFrames come along. On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it IIRC. matrix definitely not a structure to hold those. On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo ap@outlook.com wrote: On 02/04/2015 11:13 AM, Pat Ferrel wrote: Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc column = token. A one row DataFrame is a slightly heavy weight string/document. A DataFrame with token counts would be perfect for input TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right? Yes - dataframes will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be. I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Collocation? As far as collocations, I believe that the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...) either way, I don't think I ever looked too closely and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before... Obviously we don't have to follow this algorithm but it's a nice starting point. [1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java [2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java [3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java On Feb 4, 2015, at 7:47 AM, Andrew Palumbo ap@outlook.com wrote: Just copied over the relevant last few messages to keep the other thread on topic... On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote: I'd suggest to consider this: remember all this talk about language-integrated spark ql being basically dataframe manipulation DSL? so now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (my told you so moment of sorts) What I am getting at: I'd suggest to make DRM and Spark's newly renamed DataFrame our two major structures.
In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to TF-IDF work when it deals with non-matrix content. Sounds like a worthy effort to me. We'd be basically implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct? On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel p...@occamsmachete.com wrote: Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in rdds. The dictionaries and counts could be either in-memory maps or rdds for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps but the plan is to make it more scalable using joins as an alternative with the same API allowing the user to trade-off footprint for speed. I think you're right- should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now and the stopper at the DSL level is that we don't have a distributed data structure for strings... Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem. The other issue I'm a little fuzzy on is the distributed collocation mapping- it's a part
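[Editorial sketch] As a rough illustration of the "keep the whole pipeline in RDDs, dictionaries as in-memory maps" idea discussed above (hypothetical names; this is not Gokhan's branch):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// tokenize -> build a dictionary -> emit sparse TF vectors, all in RDDs;
// docs is (docId, text); the dictionary is collected to an in-memory map and broadcast.
def tokensToTf(docs: RDD[(String, String)]): (RDD[(String, Vector)], Map[String, Int]) = {
  val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)
  val dictionary = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap
  val bcDict = docs.sparkContext.broadcast(dictionary)
  val tfVectors = tokenized.mapValues { tokens =>
    val v: Vector = new RandomAccessSparseVector(bcDict.value.size)
    tokens.foreach(t => bcDict.value.get(t).foreach(i => v.setQuick(i, v.getQuick(i) + 1.0)))
    v
  }
  (tfVectors, dictionary)
}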
Re: Codebase refactoring proposal
What I am saying is that for certain algorithms including both engine-specific (such as aggregation) and DSL stuff, what is the best way of handling them? i) should we add the distributed operations to the Mahout codebase as it is proposed in #62? ii) should we have [engine]-ml modules (like spark-bindings and h2o-bindings) where we can mix the DSL and engine-specific stuff? Picking i. has the advantage of writing an ML-algorithm once and then it can be run on alternative engines, but it requires wrapping/duplicating existing distributed operations. Picking ii. has the advantage of avoiding writing distributed operations, but since we're mixing the DSL and the engine-specific stuff, an ML-algorithm written for an engine would not be available for the others. I just wanted to hear some opinions. Gokhan On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: I took it Gokhan had objections himself, based on his comments, if we are talking about #62. He also expressed concerns about computing GSGD but I suspect it can still be algebraically computed. On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel p...@occamsmachete.com wrote: BTW Ted and Andrew have both expressed interest in the distributed aggregation stuff. It sounds like we are agreeing that non-algebra, computation-method-type things can be engine specific. So does anyone have an objection to Gokhan pushing his PR? On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo ap@outlook.com wrote: My thought was not to bring primitive engine specific aggregators, combiners, etc. into math-scala. Yeah. +1. I would like to support that as an experiment, see where it goes. Clearly some distributed use cases are simple enough while also pervasive enough.
Re: TF-IDF, seq2sparse and DataFrame support
I think I have a sketch of implementation for creating a drm from a sequence file of (Int, Text)s, a.k.a. seq2sparse, using Spark. Give me a couple of days and I will provide an initial implementation. Best Gokhan On Wed, Feb 4, 2015 at 7:16 PM, Andrew Palumbo ap@outlook.com wrote: On 02/04/2015 11:13 AM, Pat Ferrel wrote: Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc column = token. A one row DataFrame is a slightly heavy weight string/document. A DataFrame with token counts would be perfect for input TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right? Yes - dataframes will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be. I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Collocation? As far as collocations, I believe that the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...) either way, I don't think I ever looked too closely and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before... Obviously we don't have to follow this algorithm but it's a nice starting point. [1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java [2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java [3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java On Feb 4, 2015, at 7:47 AM, Andrew Palumbo ap@outlook.com wrote: Just copied over the relevant last few messages to keep the other thread on topic... On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote: I'd suggest to consider this: remember all this talk about language-integrated spark ql being basically dataframe manipulation DSL? so now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (my told you so moment of sorts) What I am getting at: I'd suggest to make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to TF-IDF work when it deals with non-matrix content. Sounds like a worthy effort to me. We'd be basically implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct? On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel p...@occamsmachete.com wrote: Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in rdds. The dictionaries and counts could be either in-memory maps or rdds for use with joins? This would get rid of sequence files completely from the pipeline.
Item similarity uses in-memory maps but the plan is to make it more scalable using joins as an alternative with the same API allowing the user to trade-off footprint for speed. I think you're right- should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now and the stopper at the DSL level is that we don't have a distributed data structure for strings... Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem. The other issue I'm a little fuzzy on is the distributed collocation mapping- it's a part of the seq2sparse code that I've not spent too much time in. I think that this would be a very worthy effort as well- I believe seq2sparse is a particularly strong mahout feature. I'll start another thread since we're now way off topic from the refactoring proposal. My use for TF-IDF is for row similarity and would take a DRM (actually IndexedDataset) and calculate row/doc similarities. It works now but only using LLR. This is OK when thinking of the items as tags or metadata but for text tokens something like cosine may be better. I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so
Re: Code quality questions
+1 for favoring native scala types. I think in terms of Scala code, we need a clear style standards definition to adhere to. Gokhan On Fri, Jan 23, 2015 at 9:38 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: in TextDelimitedReaderWriter.scala: ===

val itemList: collection.mutable.MutableList[org.apache.mahout.common.Pair[Integer, Double]] =
  new collection.mutable.MutableList[org.apache.mahout.common.Pair[Integer, Double]]
for (ve <- itemVector.nonZeroes) {
  val item: org.apache.mahout.common.Pair[Integer, Double] =
    new org.apache.mahout.common.Pair[Integer, Double](ve.index, ve.get)
  itemList += item
}

(1) why does scala code attempt to use common.Pair? What was wrong about the native Tuple type of scala? (I am trying to clean out mrlegacy dependencies from the spark module). (2) why is it so horribly styled (even for me)? Comments are misaligned, the lines routinely exceed 120 characters. Can these problems please be addressed? In particular, stuff like o.a.m.common.Pair? And why was it even signed off on in the first place by committers despite clear style violations? thank you.
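[Editorial sketch] To make the first point concrete, the same loop written with native Scala tuples and without the mutable list (assuming itemVector is an o.a.m.math.Vector, as in the original):

import scala.collection.JavaConversions._

// same content as the snippet above, but with Scala tuples instead of o.a.m.common.Pair
val itemList: Seq[(Int, Double)] =
  itemVector.nonZeroes.map(ve => ve.index -> ve.get).toSeq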
[jira] [Created] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks
Gokhan Capan created MAHOUT-1626: Summary: Support for required quasi-algebraic operations and starting with aggregating rows/blocks Key: MAHOUT-1626 URL: https://issues.apache.org/jira/browse/MAHOUT-1626 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 1.0 Reporter: Gokhan Capan Fix For: 1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions
[ https://issues.apache.org/jira/browse/MAHOUT-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan resolved MAHOUT-1616. -- Resolution: Fixed Better support for hadoop dependencies of multiple versions Key: MAHOUT-1616 URL: https://issues.apache.org/jira/browse/MAHOUT-1616 Project: Mahout Issue Type: Improvement Components: build Reporter: Gokhan Capan Assignee: Gokhan Capan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: SGD Implementation and Questions for mapBlock like functionality
Awesome. So we are going to implement certain required DistributedOperations, in a separate trait similar to, but other than, the DistributedEngine. I'll think about this a little more, and propose an initial implementation that hopefully we can agree on. Best, Gokhan On Thu, Nov 13, 2014 at 1:35 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Nov 12, 2014 at 1:44 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Wed, Nov 12, 2014 at 1:27 PM, Gokhan Capan gkhn...@gmail.com wrote: My only concern is to add certain loss minimization tools for people to write machine learning algorithms. mapBlock as you suggested can work equally, but I happened to have implemented the aggregate op while thinking. Apart from this SGD implementation, blockify-a-matrix-and-run-an-operation-in-parallel-on-blocks is, I believe, certainly required, since block level parallelization is really common in matrix computations. Plus, if we are to add, say, a descriptive statistics package, that would require a similar functionality, too. If mapBlocks for passing custom operators were more flexible, I'd be more than happy, but I understand the idea behind its requirement that mapping should be block-to-block with the same row size. Could you give a little more detail on the 'common distributed strategy' idea? The idea is simple. First, don't use logical plan construction. In practice it means that while, say, A.%*%(B) creates a logical plan element (which is subsequently run thru the optimizer), something like aggregate(..) does not do that. Instead, it just produces ... whatever it produces, directly. So it doesn't form any new logical nor physical plan. Second, it means that we can define an internal strategy trait, something like DistributedOperations, which will include this set of operations. Subsequently, we will define native implementations of this trait in the same way we defined some native stuff for the DistributedEngine trait. (but don't make it part of DistributedEngine trait please -- maybe an attribute perhaps). At run time we will have to ask the current engine to provide the distributed operation implementation and delegate execution of common fragments to it.
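[Editorial sketch] A rough sketch of what such a separate strategy trait might look like; names are hypothetical and this is an illustration of the idea, not a proposed patch:

import scala.reflect.ClassTag
import org.apache.mahout.math.{Matrix, Vector}
import org.apache.mahout.math.drm.DrmLike

// Engine-provided quasi-algebraic operations, executed directly with no logical/physical plan.
// Kept out of DistributedEngine; at run time the current engine supplies an implementation.
trait DistributedOperations {
  // Fold each vertical block of a DRM into a vector, then combine the per-block results.
  def aggregateBlocks[K: ClassTag](drm: DrmLike[K])(seqOp: Matrix => Vector,
                                                    combOp: (Vector, Vector) => Vector): Vector
}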
Re: SGD Implementation and Questions for mapBlock like functionality
My only concern is to add certain loss minimization tools for people to write machine learning algorithms. mapBlock as you suggested can work equally, but I happened to have implemented the aggregate op while thinking. Apart from this SGD implementation, blockify-a-matrix-and-run-an-operation-in-parallel-on-blocks is, I believe, certainly required, since block level parallelization is really common in matrix computations. Plus, if we are to add, say, a descriptive statistics package, that would require a similar functionality, too. If mapBlocks for passing custom operators were more flexible, I'd be more than happy, but I understand the idea behind its requirement that mapping should be block-to-block with the same row size. Could you give a little more detail on the 'common distributed strategy' idea? Aside: Do we have certain elementwise Math functions in Matrix DSL? That is, how can I do this? 1 + exp(drmA) Gokhan On Wed, Nov 12, 2014 at 7:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Yes, I usually follow #2 too. The thing is, pretty often an algorithm can define its own set of strategies the backend needs to support (like this distributedEngine strategy) and keep a lot of logic still common across all strategies. But then, if an all-reduce aggregate operation is incredibly common among such algorithm-specific strategies, it stands to reason to implement it only once. I have an idea. Maybe we need a common distributed strategy which is different from the algebraic optimizer. That way we don't have to mess with algebraic rewrites. how about that? On Wed, Nov 12, 2014 at 9:12 AM, Pat Ferrel p...@occamsmachete.com wrote: So you are following #2, which is good. #1 seems a bit like a hack. For a long time to come we will have to add things to the DSL if it is to be kept engine independent. Yours looks pretty general and simple. Are you familiar with the existing Mahout aggregate methods? They show up in the SGDHelper.java and other places in legacy code. I don’t know much about them but they seem to be a pre-functional programming attempt at this kind of thing. It looks like you are proposing a replacement for those based on rdd.aggregate; if so, that would be very useful. For one thing it looks like the old aggregate was not parallel, rdd.aggregate is. On Nov 11, 2014, at 1:18 PM, Gokhan Capan gkhn...@gmail.com wrote: So the alternatives are: 1- mapBlock to a matrix all of whose rows but the first are empty, then aggregate 2- depend on a backend 1 is obviously OK. I don't like the idea of depending on a backend since SGD is a generic loss minimization, on which other algorithms will possibly depend. In this context, client-side aggregation is not an overhead, but even if it happens to be so, it doesn't have to be a client-side aggregate at all. Alternative to 1, I am thinking of at least having an aggregation operation, which will return an accumulated value anyway, and shouldn't affect algebra optimizations. I quickly implemented a naive one (supporting only Spark- I know I said that I don't like depending on a backend, but at least the backends-wide interface is consistent, and as a client, I still don't have to deal with Spark primitives directly). Is this nice enough? Is it too bad to have in the DSL? https://github.com/gcapan/mahout/compare/accumulateblocks Best Gokhan On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Oh. The algorithm actually collects the vectors and runs another cycle in the client!
Still, technically, you can collect almost-empty blocks to the client (since they are mostly empty, it won't cause THAT huge overhead compared to collecting single vectors, after all, how many partitions are we talking about? 1000? ). On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi, Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper ( http://martin.zinkevich.org/publications/nips2010.pdf), I tried to implement SGD, and a regularized least squares solution for linear regression (can easily be extended to other GLMs, too). How the algorithm works is as follows: 1. Split data into partitions of T examples 2. in parallel, for each partition: 2.0. shuffle partition 2.1. initialize parameter vector 2.2. for each example in the shuffled partition 2.2.1 update the parameter vector 3. Aggregate all the parameter vectors and return I guess technically it is possible (transform each block to a SparseRowMatrix or SparseMatrix with only first valid row) and then invoke colSums() or colMeans() (whatever aggregate means). However, i am not sure it is worth the ugliness. isn't it easier to declare these things quasi-algebraic and just
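[Editorial sketch] On the aside near the top of this message ("how can I do this? 1 + exp(drmA)"): a sketch of one way to express it with mapBlock as it stood, assuming drmA: DrmLike[Int] and the usual Samsara imports; this is an illustration, not a claim about what the DSL shipped at the time.

// block-wise 1 + exp(drmA); only mapBlock is used, so the result stays a DRM
val drmB = drmA.mapBlock() { case (keys, block) =>
  val out = block.like()
  for (r <- 0 until block.rowSize(); c <- 0 until block.columnSize())
    out.setQuick(r, c, 1.0 + math.exp(block.getQuick(r, c)))
  keys -> out
}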
Re: SGD Implementation and Questions for mapBlock like functionality
Ted, Can we easily integrate t-digest for descriptives once we have block aggregates? This might count one more reason. Gokhan On Thu, Nov 13, 2014 at 12:04 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Nov 12, 2014 at 9:53 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: once we start mapping aggregate, there's no reason not to map other engine specific capabilities, which are vast. At this point dilemma is, no matter what we do we are losing coherency: if we map it all, then other engines will have trouble supporting all of it. If we don't map it all, then we are forcing capability reduction compared to what the engine actually can do. It is obvious to me that all-reduce aggregate will make a lot of sense -- even if it means math checkpoint. but then where do we stop in mapping those. E.g. do we do fold? cartesian? And what is that true reason we are remapping everything if it is already natively available? etc. etc. For myself, I still haven't figured a good answer to those . Actually, I disagree with the premise here. There *is* a reason not to map all other engine specific capabilities. That reason is we don't need them. Yet. So far, we *clearly* need some sort of block aggregate for a host of hog-wild sorts of algorithms. That doesn't imply that we need all kinds of mapping aggregates. It just means that we are clear on one need for now. So let's get this one in and see how far we can go. Also, having one kind of aggregation in the DSL does not restrict anyone from using engine specific capabilities. It just means that one kind of idiom can be done without engine specificity.
Re: SGD Implementation and Questions for mapBlock like functionality
So the alternatives are: 1- mapBlock to a matrix whose all rows-but-the first are empty, then aggregate 2- depend on a backend 1 is obviously OK. I don't like the idea of depending on a backend since SGD is a generic loss minimization, on which other algorithms will possibly depend. In this context, client-side aggregation is not an overhead, but even if it happens to be so, it doesn't have to be a client-side aggregate at all. Alternative to 1, I am thinking of at least having an aggregation operation, which will return an accumulated value anyway, and shouldn't affect algebra optimizations. I quickly implemented a naive one (supporting only Spark- I know I said that I don't like depending on a backend, but at least the backends-wide interface is consistent, and as a client, I still don't have to deal with Spark primitives directly). Is this nice enough? Is it too bad to have in the DSL? https://github.com/gcapan/mahout/compare/accumulateblocks Best Gokhan On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Oh. algorithm actually collects the vectors and runs another cycle in the client! Still, technically, you can collect almost-empty blocks to the client (since they are mostly empty, it won't cause THAT huge overhead compared to collecting single vectors, after all, how many partitions are we talking about? 1000? ). On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi, Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper ( http://martin.zinkevich.org/publications/nips2010.pdf), I tried to implement SGD, and a regularized least squares solution for linear regression (can easily be extended to other GLMs, too). How the algorithm works is as follows: 1. Split data into partitions of T examples 2. in parallel, for each partition: 2.0. shuffle partition 2.1. initialize parameter vector 2.2. for each example in the shuffled partition 2.2.1 update the parameter vector 3. Aggregate all the parameter vectors and return I guess technically it is possible (transform each block to a SparseRowMatrix or SparseMatrix with only first valid row) and then invoke colSums() or colMeans() (whatever aggregate means). However, i am not sure it is worth the ugliness. isn't it easier to declare these things quasi-algebraic and just do direct spark calls on the matrix RDD (map, aggregate)? The real danger is to introduce non-algebra things into algebra so that the rest of the algebra doesn't optimize any more.
Re: SGD Implementation and Questions for mapBlock like functionality
Well, in that specific case, I will accumulate on the client side; the collection of the intermediate parameters is not that big (numBlocks x X.ncol). What I need is just mapping (keys, block) to a vector (currently, a mapBlock has to map the block to a new block). From a general perspective, you are right, this is an accumulation. Gokhan On Mon, Nov 10, 2014 at 8:26 PM, Pat Ferrel p...@occamsmachete.com wrote: Do you need a reduce or could you use an accumulator? Either is not really supported in the DSL but clearly these are required for certain algos. Broadcast vals are supported but are read only. On Nov 8, 2014, at 12:42 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi, Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (http://martin.zinkevich.org/publications/nips2010.pdf), I tried to implement SGD, and a regularized least squares solution for linear regression (can easily be extended to other GLMs, too). How the algorithm works is as follows:
1. Split data into partitions of T examples
2. In parallel, for each partition:
2.0. shuffle the partition
2.1. initialize the parameter vector
2.2. for each example in the shuffled partition:
2.2.1. update the parameter vector
3. Aggregate all the parameter vectors and return
Here is an initial implementation to illustrate where I am stuck: https://github.com/gcapan/mahout/compare/optimization (See TODO in SGD.minimizeWithSgd[K]) I was thinking that using a blockified matrix of training instances, step 2 of the algorithm can run on blocks, and they can be aggregated client-side. However, the only operator that I know in the DSL is mapBlock, and it requires the BlockMapFunction to map a block to another block of the same row size. In this context, I want to map a block (numRows x n) to the parameter vector of size n. The question is: 1- Is it possible to easily implement the above algorithm using the DSL's current functionality? Could you tell me what I'm missing? 2- If there is no easy way other than using the currently-non-existing mapBlock-like method, shall we add such an operator? Best, Gokhan
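[Editorial sketch] A sketch of the "transform each block to a matrix with only the first valid row, then colSums/colMeans" workaround suggested earlier in this thread; sgdOnBlock, drmX and numBlocks are hypothetical names, and the usual Samsara imports are assumed.

// each block runs a local SGD sweep and writes its learned parameters into row 0;
// all other rows stay empty, so summing columns later just sums the per-block vectors
val perBlockParams = drmX.mapBlock() { case (keys, block) =>
  val params = sgdOnBlock(block)   // hypothetical: one pass of steps 2.0-2.2 over this block
  val out = block.like()
  out(0, ::) := params
  keys -> out
}
// step 3: aggregate; dividing the column sums by the number of blocks averages the parameters
val avgParams = perBlockParams.colSums() / numBlocks.toDouble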
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154918#comment-14154918 ] Gokhan Capan commented on MAHOUT-1329: -- Jay, This is integrated in trunk, not in 0.9, and should work. Also, you can find MAHOUT-1616 useful for a recent simplification and further improvement effort. Best Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: https://mahout.apache.org/developers/buildingmahout.html
By the way, I tried to simplify and improve things a bit here: MAHOUT-1616 Sent from my iPhone On Oct 1, 2014, at 15:26, Suneel Marthi suneel.mar...@gmail.com wrote: Mahout 0.9 doesn't support hadoop 2.x and was built with hadoop 1.2.1, hence the runtime errors you are seeing. The present codebase (unreleased) supports hadoop 2.x Sent from my iPhone On Oct 1, 2014, at 8:14 AM, Ted Dunning ted.dunn...@gmail.com wrote: I believe that the POM assumes particular versions as listed are version 2 and all others 1. Inspection of the top-level pom would provide the most authoritative answer. On Wed, Oct 1, 2014 at 7:08 AM, jay vyas jayunit100.apa...@gmail.com wrote: hi mahout: Can we use any hadoop version to build mahout? e.g. 2.4.1? It seems like if you give it a garbage hadoop version (e.g. 2.3.4.5), it still builds, yet at runtime it is clear that the version built is a 1.x version. thanks! FYI this is in relation to BIGTOP-1470, where we are just getting ready for our 0.8 release, so any feedback would be much appreciated! -- jay vyas
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154937#comment-14154937 ] Gokhan Capan commented on MAHOUT-1329: -- Jay, here is the documentation: http://mahout.apache.org/developers/buildingmahout.html And the instructions apply to trunk, not to the 0.9 release Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155309#comment-14155309 ] Gokhan Capan commented on MAHOUT-1329: -- Correct Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions
Gokhan Capan created MAHOUT-1616: Summary: Better support for hadoop dependencies of multiple versions Key: MAHOUT-1616 URL: https://issues.apache.org/jira/browse/MAHOUT-1616 Project: Mahout Issue Type: Improvement Components: build Reporter: Gokhan Capan Assignee: Gokhan Capan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Upgrade to spark 1.0.x
+1 to merging spark-1.0.x to master Sent from my iPhone On Aug 8, 2014, at 22:06, Dmitriy Lyubimov dlie...@gmail.com wrote: Current master is still at Spark 0.9.x. MAHOUT-1603 (PR #40) is making a number of valuable tweaks to enable Spark 1.0.x (and Spark SQL code, by extension; I did a quick test, and SQL seems to work for my simple tests in the Mahout environment). This squashed PR is pushed to the apache/mahout branch spark-1.0.x rather than master. Whenever (if) folks are ready, I can merge it to master. An alternative approach would be to maintain both 1.0.x and 0.9.x branches for some time. I don't see it as valuable, as the costs would likely overrun any benefit here, but if anyone still clings to the spark 0.9.x dependency, please let me know in this thread. thanks. -d
Re: standardizing minimal Matrix I/O capability
Pat, I was thinking of something like: https://github.com/gcapan/mahout/compare/cellin It's just an example of where I believe new input formats should go (the example is to input a DRM from a text file of row_id,col_id,value lines). Best Gokhan On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel p...@occamsmachete.com wrote: Some work on this is being done as part of MAHOUT-1568, which is currently very early and in https://github.com/apache/mahout/pull/36. The idea there only covers text-delimited files and proposes a standard DRM-ish format but supports a configurable schema. Default is: rowID<tab>itemID1:value1<space>itemID2:value2… The IDs can be mahout keys of any type since they are written as text, or they can be application specific IDs meaningful in a particular usage, like a user ID hash, or SKU from a catalog, or URL. As far as dataframe-ish requirements, it seems to me there are two different things needed. The dataframe is needed while performing an algorithm or calculation and is kept in distributed data structures. There probably won’t be a lot of files kept around with the new engines. Any text files can be used for pipelines in a pinch but generally would be for import/export. Therefore MAHOUT-1568 concentrates on import/export, not dataframes, though it could use them when they are ready. On Jul 30, 2014, at 7:53 AM, Gokhan Capan notificati...@github.com wrote: I believe the next step should be standardizing minimal Matrix I/O capability (i.e. a couple file formats other than [row_id, VectorWritable] SequenceFiles) required for a distributed computation engine, and adding data-frame-like structures that allow text columns.
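[Editorial sketch] A sketch of what such an input path could look like with plain Spark plus drmWrap, independent of the linked branch; ncol is assumed to be known beforehand, and the helper name is hypothetical.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.math.drm.CheckpointedDrm
import org.apache.mahout.sparkbindings._

// read "row_id,col_id,value" lines, group cells by row, and wrap the resulting row-vector RDD
def drmFromCellFile(sc: SparkContext, path: String, ncol: Int): CheckpointedDrm[Int] = {
  val cells = sc.textFile(path).map { line =>
    val Array(r, c, v) = line.split(",")
    r.toInt -> (c.toInt, v.toDouble)
  }
  val rows = cells.groupByKey().map { case (rowId, cs) =>
    val vec: Vector = new RandomAccessSparseVector(ncol)
    cs.foreach { case (c, v) => vec.setQuick(c, v) }
    rowId -> vec
  }
  drmWrap(rows, ncol = ncol)
}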
[jira] [Resolved] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan resolved MAHOUT-1565. -- Resolution: Fixed add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Assignee: Gokhan Capan Fix For: 1.0 Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan reassigned MAHOUT-1565: Assignee: Gokhan Capan add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Assignee: Gokhan Capan Fix For: 1.0 Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062041#comment-14062041 ] Gokhan Capan commented on MAHOUT-1565: -- Sorry guys, I committed this 2 weeks ago, but I forgot to close the issue. Thank you, [~nravi] add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Fix For: 1.0 Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: H2O integration - completion and review
I'll write longer, but in general, +1 to Anand Sent from my iPhone On Jul 11, 2014, at 20:54, Anand Avati av...@gluster.org wrote: On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel pat.fer...@gmail.com wrote: Duplicated from a comment on the PR: Beyond these details (specific merge issues) I have a bigger problem with merging this. Now every time the DSL is changed it may break things in h2o specific code. Merging this would require every committer who might touch the DSL to sign up for fixing any broken tests on both engines. To solve this the entire data prep pipeline must be virtualized to run on either engine so the tests for things like CF and ItemSimilarity or matrix factorization (and the multitude of others to come) pass and are engine independent. As it stands any DSL change that breaks the build will have to rely on a contributor's fix. Even if one of you guys were made a committer we will have this problem where a needed change breaks one or the other engine specific code. Unless 99% of the entire pipeline is engine neutral the build will be unmaintainable. For instance I am making a small DSL change that is required for cooccurrence and ItemSimilarity to work. This would break ItemSimilarity and its tests, which are in the spark module, but since I’m working on that I can fix everything. If someone working on an h2o specific thing had to change the DSL in a way that broke spark code like ItemSimilarity you might not be able to fix it, and I certainly do not want to fix stuff in h2o specific code when I change the DSL. I have a hard enough time keeping mine running :-) The way I interpret the above points, the problem you are trying to highlight is with having multiple backends in general, and not this backend specifically? Hypothetically, even if this backend is abandoned for the above problems, as more backends get added in the future, the same problems will continue to apply to all of them. Crudely speaking this means doing away with all references to a SparkContext and any use of it. So it's not just a matter of reproducing the spark module but reducing the need for one. Making it so small that breakages in one or the other engine's code will be infrequent and changes to neutral code will only rarely break an engine that the committer is unfamiliar with. I think things are already very close to this ideal situation you describe above. As a pipeline implementor we should just use DistributedContext, and not SparkContext. And we need an engine neutral way to get hold of a DistributedContext from within the math-scala module, like this pseudocode:

import org.apache.mahout.math.drm._
val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"), System.getenv("BACKEND_ID"), opts...)

If environment variables are not set, DistributedContextCreate could default to Spark and local. But all of the pipeline code should ideally exist outside any engine specific module. I raised this red flag a long time ago but in the heat of other issues it got lost. I don't think this can be ignored anymore. The only missing piece I think is having a DistributedContextCreate() call such as above? I don't think things are in such a dire state really... Am I missing something? I would propose that we should remain two separate projects with a mostly shared DSL until the maintainability issues are resolved. This seems way too early to merge. Call me an optimist, but I was hoping for more of a "let's work together now to make the DSL abstractions easier for future contributors."
I will explore such a DistributedContextCreate() method in math-scala. That might also be the answer for test cases to remain in math-scala. Thanks
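[Editorial sketch] A minimal sketch of the factory discussed above; the name comes from Anand's pseudocode, nothing like this existed in math-scala at the time, and a real version would likely resolve the engine-specific code reflectively so math-scala does not gain a compile-time Spark dependency.

import org.apache.mahout.math.drm.DistributedContext

// pick a backend by name, defaulting to Spark in local mode
object DistributedContextCreate {
  def apply(backend: String = "spark", masterUrl: String = "local"): DistributedContext =
    backend.toLowerCase match {
      case "spark" =>
        // direct call shown for brevity; a plan-neutral version would load this reflectively
        org.apache.mahout.sparkbindings.mahoutSparkContext(masterUrl = masterUrl, appName = "mahout")
      case "h2o" =>
        throw new UnsupportedOperationException("h2o context creation would be delegated here")
      case other =>
        throw new IllegalArgumentException("unknown backend: " + other)
    }
}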
Re: TF-IDF vector persistence with normalization enabled
That post implies that in order to have tf-idf vectors persisted, you need those options set in the tf vector creation phase. Or you can always run the Driver directly and easily, preferably from mahout's commandline, i.e. bin/mahout seq2sparse Gokhan On Tue, Jun 3, 2014 at 9:37 AM, David Noel david.i.n...@gmail.com wrote: I made an observation similar to what was pointed out in this mailing list post here: http://comments.gmane.org/gmane.comp.apache.mahout.user/17819; that TF-IDF vectors do not seem to persist when generating them with normalization enabled. According to Gokhan Capan: It seems that, to have tf-idf vectors later, you need to create tf vectors (DictionaryVectorizer.createTermFrequencyVectors) with the logNormalize option set to false, and the normPower option set to -1.0f. Is there some reason for this? It would seem useful if they persisted. Can someone explain the reasoning behind them not persisting? I figure there's a perfectly good reason, I just can't seem to figure out what it is.
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016372#comment-14016372 ] Gokhan Capan commented on MAHOUT-1329: -- Seems like the dependencies are correctly set. Are you certain that the cluster you're running mahout against is an hadoop-2 and M/R-2 cluster? Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016378#comment-14016378 ] Gokhan Capan commented on MAHOUT-1565: -- We agree, conceptually, but this needs some further testing. add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Fix For: 1.0 Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016565#comment-14016565 ] Gokhan Capan commented on MAHOUT-1329: -- Brian, This was actually well-tested. But I'm gonna build and test it again, probably tomorrow. By the way, can you run a {{$ find . -name hadoop*.jar}} after building mahout, in the mahout root directory? Best Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016998#comment-14016998 ] Gokhan Capan commented on MAHOUT-1529: -- Alright, I'm sold. Finalize abstraction of distributed logical plans from backend operations - Key: MAHOUT-1529 URL: https://issues.apache.org/jira/browse/MAHOUT-1529 Project: Mahout Issue Type: Improvement Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 We have a few situations when algorithm-facing API has Spark dependencies creeping in. In particular, we know of the following cases: -(1) checkpoint() accepts Spark constant StorageLevel directly;- -(2) certain things in CheckpointedDRM;- -(3) drmParallelize etc. routines in the drm and sparkbindings package.- -(5) drmBroadcast returns a Spark-specific Broadcast object- (6) Stratosphere/Flink conceptual api changes. *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, need new PR for remaining things once ready. *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014985#comment-14014985 ] Gokhan Capan commented on MAHOUT-1529: -- [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each behavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mix in to her concrete implementation (For example Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching, cache it; if partitioned in the same way, do this, else do that; if one matrix is small, broadcast it; etc.). So I suggest a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and execution in parallel behavior, a Caching trait with methods for caching/uncaching behavior, and in the future a RandomAccess trait with methods for accessing rows and columns (and possibly cells) functionality. Then a concrete DRM (like) would be a Matrix with BatchOps and possibly CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, and so on. What do you think, and if you and others are positive, how do you think that should be handled? Finalize abstraction of distributed logical plans from backend operations - Key: MAHOUT-1529 URL: https://issues.apache.org/jira/browse/MAHOUT-1529 Project: Mahout Issue Type: Improvement Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 We have a few situations when algorithm-facing API has Spark dependencies creeping in. In particular, we know of the following cases: -(1) checkpoint() accepts Spark constant StorageLevel directly;- -(2) certain things in CheckpointedDRM;- -(3) drmParallelize etc. routines in the drm and sparkbindings package.- -(5) drmBroadcast returns a Spark-specific Broadcast object- (6) Stratosphere/Flink conceptual api changes. *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, need new PR for remaining things once ready. *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
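[Editorial sketch] To make the proposed layering concrete, a purely illustrative sketch; the trait names are taken from the comment above, and this is not an actual patch.

import org.apache.mahout.math.Vector

// base trait plus mixin behaviors, as proposed above
trait MatrixLike {
  def nrow: Long
  def ncol: Int
}

trait BatchExecution { this: MatrixLike =>
  def blockify(): Unit                  // partitioning / run-in-parallel-on-blocks behavior
}

trait Caching { this: MatrixLike =>
  def cache(): this.type
  def uncache(): this.type
}

trait RandomAccess { this: MatrixLike =>
  def row(i: Long): Vector
  def column(j: Int): Vector
}

// e.g. a Spark DRM would be a MatrixLike with BatchExecution with Caching,
// while a RandomAccessMatrix would be a MatrixLike with RandomAccess.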
[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014985#comment-14014985 ] Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 2:55 PM: -- [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each bahavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mixin to her concrete implementation (For example Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching cache it, if partitioned in the same way do this else do this, if one matrix is small broadcast it etc.). So I suggest a a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and execution in parallel behavior, a Caching trait with methods for caching/uncaching behavior, in the future a RandomAccess trait with methods for accessing rows and columns (and possibly cells) functionality. Then a concrete DRM (like) would be a Matrix with BatchExecution and possibly Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think and if you and others are positive, how do you think that should be handled? was (Author: gokhancapan): [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each bahavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mixin to her concrete implementation (For example Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching cache it, if partitioned in the same way do this else do this, if one matrix is small broadcast it etc.). So I suggest a a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and execution in parallel behavior, a Caching trait with methods for caching/uncaching behavior, in the future a RandomAccess trait with methods for accessing rows and columns (and possibly cells) functionality. Then a concrete DRM (like) would be a Matrix with BatchOps and possibly CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, and so on. What do you think and if you and others are positive, how do you think that should be handled? 
Finalize abstraction of distributed logical plans from backend operations - Key: MAHOUT-1529 URL: https://issues.apache.org/jira/browse/MAHOUT-1529 Project: Mahout Issue Type: Improvement Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 We have a few situations when algorithm-facing API has Spark dependencies creeping in. In particular, we know of the following cases: -(1) checkpoint() accepts Spark constant StorageLevel directly;- -(2) certain things in CheckpointedDRM;- -(3) drmParallelize etc. routines in the drm and sparkbindings package.- -(5) drmBroadcast returns a Spark-specific Broadcast object- (6) Stratosphere/Flink conceptual api changes. *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, need new PR for remaining things once ready. *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations
[ https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014985#comment-14014985 ] Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 3:03 PM: -- [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for memory-based algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each bahavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mixin to her concrete implementation (For example Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching cache it, if partitioned in the same way do this else do this, if one matrix is small broadcast it etc.). So I suggest a a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and execution in parallel behavior, a Caching trait with methods for caching/uncaching behavior, in the future a RandomAccess trait with methods for accessing rows and columns (and possibly cells) functionality. Then a concrete DRM (like) would be a Matrix with BatchExecution and possibly Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think and if you and others are positive, how do you think that should be handled? was (Author: gokhancapan): [~dlyubimov], I imagine in the near future we will want to add a matrix implementation with fast row and column access for in-memory algorithms such as neighborhood based recommendation. This could be a new persistent storage engineered for locality preservation of kNN, the new Solr backend potentially cast to a Matrix, or something else. Anyway, my point is that we could want to add different types of distributed matrices with engine (or data structure) specific strengths in the future. I suggest turning each bahavior (such as Caching) into an additional trait, which the distributed execution engine (or data structure) author can mixin to her concrete implementation (For example Spark's matrix is one with Caching and Broadcasting). It might even help with easier logical planning (if it supports caching cache it, if partitioned in the same way do this else do this, if one matrix is small broadcast it etc.). So I suggest a a base Matrix trait with nrows and ncols methods (as it currently is), a BatchExecution trait with methods for partitioning and execution in parallel behavior, a Caching trait with methods for caching/uncaching behavior, in the future a RandomAccess trait with methods for accessing rows and columns (and possibly cells) functionality. Then a concrete DRM (like) would be a Matrix with BatchExecution and possibly Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and so on. What do you think and if you and others are positive, how do you think that should be handled? 
Finalize abstraction of distributed logical plans from backend operations - Key: MAHOUT-1529 URL: https://issues.apache.org/jira/browse/MAHOUT-1529 Project: Mahout Issue Type: Improvement Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 1.0 We have a few situations when algorithm-facing API has Spark dependencies creeping in. In particular, we know of the following cases: -(1) checkpoint() accepts Spark constant StorageLevel directly;- -(2) certain things in CheckpointedDRM;- -(3) drmParallelize etc. routines in the drm and sparkbindings package.- -(5) drmBroadcast returns a Spark-specific Broadcast object- (6) Stratosphere/Flink conceptual api changes. *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, need new PR for remaining things once ready. *Pull requests are welcome*. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012126#comment-14012126 ] Gokhan Capan commented on MAHOUT-1565: -- I think there is no point in configuring output compression, number of reducers, etc. for Mahout. add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012140#comment-14012140 ] Gokhan Capan commented on MAHOUT-1565: -- Sorry, now I can read the patch properly. The MR1 versions of those configurations are already set in bin/mahout, and you're suggesting to add MR2 versions of them, too, right? I am personally not a fan of setting such configurations in Mahout, and I would remove them as well. add MR2 options to MAHOUT_OPTS in bin/mahout Key: MAHOUT-1565 URL: https://issues.apache.org/jira/browse/MAHOUT-1565 Project: Mahout Issue Type: Improvement Affects Versions: 1.0, 0.9 Reporter: Nishkam Ravi Attachments: MAHOUT-1565.patch MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add those options. -- This message was sent by Atlassian JIRA (v6.2#6252)
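For context on what "MR2 versions" of such options look like, a hedged illustration only — the property names below are the standard Hadoop 1 keys and their Hadoop 2 renames, not a reproduction of the options actually set in bin/mahout or in the attached patch:

    # MR1-style properties of the kind passed to jobs through MAHOUT_OPTS
    export MAHOUT_OPTS="-Dmapred.reduce.tasks=10 -Dmapred.output.compress=true"

    # the MR2 equivalents of the same two settings
    export MAHOUT_OPTS="-Dmapreduce.job.reduces=10 -Dmapreduce.output.fileoutputformat.compress=true"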
Re: Hadoop 2 support in a real release?
My vote would be releasing mahout with hadoop1 and hadoop2 classifiers Gokhan On Fri, May 23, 2014 at 4:43 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: Big +1 Am 23.05.2014 15:33 schrieb Ted Dunning ted.dunn...@gmail.com: What do folks think about spinning out a new version of 0.9 that only changes which version of Hadoop the build uses? There have been quite a few questions lately on this topic. My suggestion would be that we use minor version numbering to maintain this and the normal 0.9 release simultaneously if we decide to do a bug fix release. Any thoughts?
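To make the "classifiers" proposal concrete, a hedged sketch of how a downstream project would pick a variant if such artifacts were published (the coordinates and the 0.9.x version are assumptions, not released artifacts):

    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.9.1</version>
      <classifier>hadoop2</classifier>  <!-- or hadoop1; omit for the default build -->
    </dependency>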
[jira] [Assigned] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan reassigned MAHOUT-1534: Assignee: Gokhan Capan Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Assignee: Gokhan Capan Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005663#comment-14005663 ] Gokhan Capan commented on MAHOUT-1534: -- We might want to add the link to the Mahout News, but let's wait and see if the users could locate the page. Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Assignee: Gokhan Capan Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan resolved MAHOUT-1534. -- Resolution: Fixed The instructions are now available on the BuildingMahout page: http://mahout.apache.org/developers/buildingmahout.html Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Assignee: Gokhan Capan Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005719#comment-14005719 ] Gokhan Capan commented on MAHOUT-1329: -- Please check http://mahout.apache.org/developers/buildingmahout.html for instructions on building Mahout against hadoop-2 Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Git Migration
Works for me as well Gokhan On Thu, May 22, 2014 at 9:23 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thanks; I just pushed successfully. On Thu, May 22, 2014 at 10:55 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: did you read Jake's email earlier at dev/infra discussion? he describes and makes references here. It is two-fold: first we can push whatever commits to master of https://git-wip-us.apache.org/repos/asf?p=mahout.git However the other side of the coin is that significant commits should go thru pull requests directly to (if i understand it correctly) apache/mahout mirror on github. Such pull requests are managed thru commits to git-wp as well by specific messages (again, see references in Jake's email). My understanding is that github integration features are not yet enabled, only commits to master of git-wp-us.a.o are at this point. At this point I simply would like everyone to verify they can push commits to master branch of git-wp-us.a.o per instructions in INFRA- and report back there (I can push). I guess someone (perhaps me) will have to write the manual for working with github pull requests (mainly, merging them to git-wp-us.o.a and closing them). On Thu, May 22, 2014 at 10:47 AM, Andrew Musselman andrew.mussel...@gmail.com wrote: What's the workflow to commit a change? I'm totally in the dark about that. On Thu, May 22, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, (1) git migration of the project is now complete. Any volunteers to verify per INFRA-? If you do, please report back to the issue. (2) Anybody knows what to do with jenkins now? i still don't have proper privileges on it. thanks. [1] https://issues.apache.org/jira/browse/INFRA-
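A hedged sketch of the commit workflow described above — the remote name, PR number, and exact closing phrase are illustrative; the canonical clone URL and PR conventions are the ones in Jake's infra post referenced earlier:

    # direct commits go to master of the ASF-hosted repository
    git remote add asf https://git-wip-us.apache.org/repos/asf/mahout.git
    git pull asf master
    git push asf master

    # merging a GitHub pull request against the apache/mahout mirror:
    # apply the change locally, reference the PR in the commit message so the
    # integration can close it, then push to the ASF master as usual
    git commit -am "MAHOUT-XXXX: short description (closes #12)"
    git push asf master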
Re: consensus statement?
I want to express my opinions for the vision, too. I tried to capture those words from various discussions in the dev-list, and hope that most, of them support the common sense of excitement the new Mahout arouses To me, the fundamental benefit of the shift that Mahout is undergoing is a better separation of the distributed execution engine, distributed data structures, matrix computations, and algorithms layers, which will allow the users/devs of Mahout with different roles focus on the relevant parts of the framework: 1. A machine learning scientist, independent from the underlying distributed execution engine, can utilize the matrix language and the decompositions to implement new algorithms (which implies that the current distributed mahout algorithms are to be rewritten in the matrix language) 2. A math-scala module contributor, for the benefit of higher level algorithms, can add new, or improve existing functions (the set of decompositions is an example) with optimization plans (such as if two matrices are partitioned in the same way, ...), where the concrete implementations of those optimizations are delegated to the distributed execution engine layer 3. A distributed execution engine author can add machine learning capabilities to her platform with i)concrete Matrix and Matrix I/O implementation ii)partitioning, checkpointing, broadcasting behaviors, iii)BLAS 4. A Mahout user with access to a cluster operated by a Mahout-supporting distributed execution engine can run machine learning algorithms implemented on top of the matrix language Best Gokhan On Tue, May 20, 2014 at 8:30 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: inline On Tue, May 20, 2014 at 12:42 AM, Sebastian Schelter s...@apache.org wrote: Let's take the next from our homepage as starting point. What should we add/remove/modify? The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. We are building our future implementations on top of a Scala DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel for Apache Spark. More platforms to be added in the future. Furthermore, there is an experimental contribution undergoing which aims to integrate the h20 platform into Mahout.
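To make the first item in Gokhan's list above concrete, a small, hedged illustration of what "engine-independent" code in the matrix DSL looks like. The operator names (%*%, t, collect, mapBlock) are the DSL's own, but imports are omitted because the package layout was still being refactored at the time (see MAHOUT-1529); treat this as a sketch, not a definitive API reference:

    // drmA: a distributed row matrix with Int keys, obtained from drmWrap/drmDfsRead
    // on whichever backend is in scope
    val gram = (drmA.t %*% drmA).collect   // distributed A'A, materialized in-core;
                                           // the optimizer decides how to execute it

    // an element-wise transformation written once, runnable on any engine
    val drmTransformed = drmA.mapBlock() { case (keys, block) =>
      // ... operate on the in-core block, e.g. apply an IDF weighting ...
      keys -> block
    }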
[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website
[ https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004662#comment-14004662 ] Gokhan Capan commented on MAHOUT-1534: -- [~ssc] I added the directions to the BuildingMahout page. If you're happy with the staged, I'll Publish Site Add documentation for using Mahout with Hadoop2 to the website -- Key: MAHOUT-1534 URL: https://issues.apache.org/jira/browse/MAHOUT-1534 Project: Mahout Issue Type: Task Components: Documentation Reporter: Sebastian Schelter Fix For: 1.0 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. We should have a page on the website describing this for our users. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: VOTE: moving commits to git-wp.o.a github PR features.
+1 Sent from my iPhone On May 16, 2014, at 21:38, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I would like to initiate a procedural vote moving to git as our primary commit system, and using github PRs as described in Jake Farrel's email to @dev [1] [1] https://blogs.apache.org/infra/entry/improved_integration_between_apache_and If voting succeeds, i will file a ticket with infra to commence necessary changes and to move our project to git-wp as primary source for commits as well as add github integration features [1]. (I assume pure git commits will be required after that's done, with no svn commits allowed). The motivation is to engage GIT and github PR features as described, and avoid git mirror history messes like we've seen associated with authors.txt file fluctations. PMC and committers have binding votes, so please vote. Lazy consensus with minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time for weekend (i.e. Tuesday afternoon PST) . here is my +1 -d
[jira] [Commented] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996351#comment-13996351 ] Gokhan Capan commented on MAHOUT-1550: -- Paul, Did you try build mahout using hadoop 2 profile first? The way to do it is: mvn clean package -DskipTests=true -Dhadoop2.version=YOUR_HADOOP_VERSION Let us know if this fails Naive Bayes training fails with Hadoop 2 Key: MAHOUT-1550 URL: https://issues.apache.org/jira/browse/MAHOUT-1550 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2 Reporter: Paul Marret Priority: Minor Labels: bayesian, training Attachments: mahout-snapshot.patch, stacktrace.txt Original Estimate: 0h Remaining Estimate: 0h When using the trainnb option of the program, we get the following error: Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174) at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614) at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100) [...] It is possible to correct this by modifying the file mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and converting the instance job (line 174) to a Job object (it is a JobContext in the current version). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996351#comment-13996351 ] Gokhan Capan edited comment on MAHOUT-1550 at 5/13/14 1:10 PM: --- Paul, Did you try building mahout using hadoop 2 profile first? The way to do it is: mvn clean package -DskipTests=true -Dhadoop2.version=YOUR_HADOOP_VERSION Let us know if this fails was (Author: gokhancapan): Paul, Did you try build mahout using hadoop 2 profile first? The way to do it is: mvn clean package -DskipTests=true -Dhadoop2.version=YOUR_HADOOP_VERSION Let us know if this fails Naive Bayes training fails with Hadoop 2 Key: MAHOUT-1550 URL: https://issues.apache.org/jira/browse/MAHOUT-1550 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 1.0 Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2 Reporter: Paul Marret Priority: Minor Labels: bayesian, training Attachments: mahout-snapshot.patch, stacktrace.txt Original Estimate: 0h Remaining Estimate: 0h When using the trainnb option of the program, we get the following error: Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174) at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614) at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100) [...] It is possible to correct this by modifying the file mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and converting the instance job (line 174) to a Job object (it is a JobContext in the current version). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968148#comment-13968148 ] Gokhan Capan commented on MAHOUT-1178: -- Well I can add this, but considering the current status of the project, I think this is no longer in people's interest. What do you say [~ssc], should we 'won't fix' it or commit? GSOC 2013: Improve Lucene support in Mahout --- Key: MAHOUT-1178 URL: https://issues.apache.org/jira/browse/MAHOUT-1178 Project: Mahout Issue Type: New Feature Reporter: Dan Filimon Assignee: Gokhan Capan Labels: gsoc2013, mentor Fix For: 1.0 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch [via Ted Dunning] It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be possible to get the same result as dumping the term vectors for each document each to a line and converting that result using standard Mahout methods. b) numeric fields ought to work somehow. c) if there are multiple text fields that ought to work sensibly as well. Two options include dumping multiple matrices or to convert the fields into a single row of a single matrix. d) it should be possible to refer back from a row of the matrix to find the correct document. THis might be because we remember the Lucene doc number or because a field is named as holding a unique id. e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968221#comment-13968221 ] Gokhan Capan commented on MAHOUT-1178: -- I personally like the idea of integrating additional storage layers as matrix inputs, but not the implementation I did here. After agreeing on the new algorithm layers, we can later move on to the additional input formats. So my vote also is for Won't Fix GSOC 2013: Improve Lucene support in Mahout --- Key: MAHOUT-1178 URL: https://issues.apache.org/jira/browse/MAHOUT-1178 Project: Mahout Issue Type: New Feature Reporter: Dan Filimon Assignee: Gokhan Capan Labels: gsoc2013, mentor Fix For: 1.0 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch [via Ted Dunning] It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be possible to get the same result as dumping the term vectors for each document each to a line and converting that result using standard Mahout methods. b) numeric fields ought to work somehow. c) if there are multiple text fields that ought to work sensibly as well. Two options include dumping multiple matrices or to convert the fields into a single row of a single matrix. d) it should be possible to refer back from a row of the matrix to find the correct document. THis might be because we remember the Lucene doc number or because a field is named as holding a unique id. e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968254#comment-13968254 ] Gokhan Capan commented on MAHOUT-1178: -- The thing is it just 'loads' a Lucene index in memory as a matrix. You construct a matrix with the lucene index directory location and that's it. So it is not a fix for incremental document management issue. The alternative approach is querying the index when a row/column vector, or cell is required. I, however, am not sure if the SolrMatrix thing is fast enough for that. I haven't been available lately, and now I'm reading through the changes in and proposals for Mahout's future, and trying to set up my perspective for Mahout2. We probably can come up with a better way of document storage (still Lucene/Solr based). Let me leave this as is now, and then we can discuss the input formats further. Is that OK for you? GSOC 2013: Improve Lucene support in Mahout --- Key: MAHOUT-1178 URL: https://issues.apache.org/jira/browse/MAHOUT-1178 Project: Mahout Issue Type: New Feature Reporter: Dan Filimon Assignee: Gokhan Capan Labels: gsoc2013, mentor Fix For: 1.0 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch [via Ted Dunning] It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be possible to get the same result as dumping the term vectors for each document each to a line and converting that result using standard Mahout methods. b) numeric fields ought to work somehow. c) if there are multiple text fields that ought to work sensibly as well. Two options include dumping multiple matrices or to convert the fields into a single row of a single matrix. d) it should be possible to refer back from a row of the matrix to find the correct document. THis might be because we remember the Lucene doc number or because a field is named as holding a unique id. e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
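Since the comment above only gestures at "constructing a matrix from a Lucene index directory location", here is a hedged, self-contained sketch of that idea. It is not the MAHOUT-1178 patch; it uses Lucene 4.x and mahout-math calls as I recall them (DirectoryReader, getTermVector, SparseRowMatrix, RandomAccessSparseVector), so signatures should be double-checked before reuse:

    import java.io.File
    import scala.collection.mutable
    import org.apache.lucene.index.DirectoryReader
    import org.apache.lucene.store.FSDirectory
    import org.apache.mahout.math.{Matrix, RandomAccessSparseVector, SparseRowMatrix}

    // Load the stored term vectors of one field into an in-memory Mahout matrix,
    // returning the matrix plus the term -> column dictionary.
    def luceneIndexAsMatrix(indexDir: String, field: String): (Matrix, Map[String, Int]) = {
      val reader = DirectoryReader.open(FSDirectory.open(new File(indexDir)))
      val numDocs = reader.maxDoc()
      val dict = mutable.LinkedHashMap[String, Int]()

      // pass 1: collect per-document term frequencies, building the dictionary as we go
      val docFreqs = (0 until numDocs).map { doc =>
        val freqs = mutable.Map[Int, Double]()
        val terms = reader.getTermVector(doc, field)   // null if no term vector was stored
        if (terms != null) {
          val te = terms.iterator(null)
          var term = te.next()
          while (term != null) {
            val col = dict.getOrElseUpdate(term.utf8ToString(), dict.size)
            freqs(col) = te.totalTermFreq().toDouble
            term = te.next()
          }
        }
        freqs
      }
      reader.close()

      // pass 2: materialize rows now that the number of columns is known
      val m = new SparseRowMatrix(numDocs, dict.size)
      docFreqs.zipWithIndex.foreach { case (freqs, row) =>
        val v = new RandomAccessSparseVector(dict.size)
        freqs.foreach { case (col, freq) => v.setQuick(col, freq) }
        m.assignRow(row, v)
      }
      (m, dict.toMap)
    }

The two-pass shape also shows why incremental document management is awkward here: the dictionary and column count are fixed once the matrix is built, which is the limitation the comment points out.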
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918159#comment-13918159 ] Gokhan Capan commented on MAHOUT-1178: -- Let me get the pieces together and submit a patch in a few days. GSOC 2013: Improve Lucene support in Mahout --- Key: MAHOUT-1178 URL: https://issues.apache.org/jira/browse/MAHOUT-1178 Project: Mahout Issue Type: New Feature Reporter: Dan Filimon Assignee: Gokhan Capan Labels: gsoc2013, mentor Fix For: 1.0 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch [via Ted Dunning] It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be possible to get the same result as dumping the term vectors for each document each to a line and converting that result using standard Mahout methods. b) numeric fields ought to work somehow. c) if there are multiple text fields that ought to work sensibly as well. Two options include dumping multiple matrices or to convert the fields into a single row of a single matrix. d) it should be possible to refer back from a row of the matrix to find the correct document. THis might be because we remember the Lucene doc number or because a field is named as holding a unique id. e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13914494#comment-13914494 ] Gokhan Capan commented on MAHOUT-1329: -- Sure I can. Although my vote would be passing the version, considering different distributions out there, people may want to build mahout against whatever hadoop2 distro they use (I am not very sure about my own argument actually, It would be great to hear a counter-argument) Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1329: - Resolution: Fixed Status: Resolved (was: Patch Available) Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13911436#comment-13911436 ] Gokhan Capan commented on MAHOUT-1329: -- I committed this to trunk Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907480#comment-13907480 ] Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:52 AM: --- Sergey, I modified your patch and produced a new version. Looking into the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m in order not the unit tests to fail because of an OOM) for hadoop version 1.2.1: mvn clean package for hadoop version 2.2.0: mvn clean package -Dhadoop2.version=2.2.0 I unit tested this for both versions and saw the tests passed, but I don't have access to a hadoop test environment currently, so could you guys test if this actually work (I'll do it tomorrow anyway)? Then we can commit it. was (Author: gokhancapan): Sergey, I modified your patch and produced a new version. Looking into the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m in order not the unit tests to fail because of an OOM) for hadoop version 1.2.1: mvn clean package for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0 I unit tested this for both versions and saw the tests passed, but I don't have access to a hadoop test environment currently, so could you guys test if this actually work (I'll do it tomorrow anyway)? Then we can commit it. Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
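For reference, "setting argLine to -Xmx1024m" refers to the heap of the JVM that Surefire forks for the unit tests; a hedged sketch of the corresponding pom.xml fragment (standard Surefire configuration, not the attached patch verbatim):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <argLine>-Xmx1024m</argLine>  <!-- heap for the forked test JVM, avoids the OOM mentioned above -->
      </configuration>
    </plugin>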
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908126#comment-13908126 ] Gokhan Capan commented on MAHOUT-1329: -- Yeah, you're right, edit coming. Did you manage to run jobs against the cluster? Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908126#comment-13908126 ] Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:59 AM: --- Yeah, you're right, edit coming. Did you manage to run jobs against the cluster [EDIT:Sorry I missed you mentioned that you ran the examples, great then] was (Author: gokhancapan): Yeah, you're right, edit coming. Did you manage to run jobs against the cluster? Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908443#comment-13908443 ] Gokhan Capan commented on MAHOUT-1329: -- Good news that I tried that too, on a 2.2.0 cluster. seqdir, seq2sparse, and kmeans worked without a problem. I'm gonna wait till Monday to commit this, in case folks want to verify that it works. Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907237#comment-13907237 ] Gokhan Capan commented on MAHOUT-1329: -- Hi Sergey, thank you for that, I am copying from MAHOUT-1354: Gokhan: Looks like when the hadoop-2 profile is activated, this patch fails to apply the hadoop-2 related dependencies to the integration and examples modules, even though they both depend on core and core depends on hadoop-2. For me, moving hadoop dependencies to the root solved the problem, but I think we wouldn't want that since hadoop is not a common dependency for all modules of the project. Ted: It is important to keep modules like mahout math free of the massive Hadoop dependency. I think pushing dependencies to the root is not something that we desire, but let me look into this further. Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Suneel Marthi Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1329: - Attachment: 1329-3.patch Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Suneel Marthi Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907480#comment-13907480 ] Gokhan Capan commented on MAHOUT-1329: -- Sergey, I modified your patch and produced a new version. Looking into the dependency tree, it seems it builds against the correct hadoop version. (This may seem irrelevant when looking at the patch, but I had to set argLine to -Xmx1024m in order not the unit tests to fail because of an OOM) for hadoop version 1.2.1: mvn clean package for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0 I unit tested this for both versions and saw the tests passed, but I don't have access to a hadoop test environment currently, so could you guys test if this actually work (I'll do it tomorrow anyway)? Then we can commit it. Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Suneel Marthi Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Assigned] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan reassigned MAHOUT-1329: Assignee: Gokhan Capan (was: Suneel Marthi) Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Gokhan Capan Labels: patch Fix For: 1.0 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: Mahout on Spark?
I imagine in Mahout offering an option to the users to select from different execution engines (just like we currently do by giving M/R or sequential options), and starting from Spark. I am not sure what changes needed in the codebase, though. Maybe following MLI (or alike) and implementing some more stuff, such as common interfaces for iterating over data (the M/R way and the Spark way). IMO, another effort might be porting pre-online machine learning (such transforming text into vector based on the dictionary generated by seq2sparse before), machine learning based on mini-batches, and streaming summarization stuff in Mahout to Spark-Streaming. Best, Gokhan On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.comwrote: PS I am moving along cost optimizer for spark-backed DRMs on some multiplicative pipelines that is capable of figuring different cost-based rewrites and R-Like DSL that mixes in-core and distributed matrix representations and blocks but it is painfully slow, i really only doing it like couple nights in a month. It does not look like i will be doing it on company time any time soon (and even if i did, the company doesn't seem to be inclined to contribute anything I do anything new on their time). It is all painfully slow, there's no direct funding for it anywhere with no string attached. That probably will be primary reason why Mahout would not be able to get much traction compared to university-based contributions. On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Unfortunately methinks the prospects of something like Mahout/MLLib merge seem very unlikely due to vastly diverged approach to the basics of linear algebra (and other things). Just like one cannot grow single tree out of two trunks -- not easily, anyway. It is fairly easy to port (and subsequently beat) MLib at this point from collection of algorithms point of view. But IMO goal should be more MLI-like first, and port second. And be very careful with concepts. Something that i so far don't see happening with MLib. MLib seems to be old-style Mahout-like rush to become a collection of basic algorithms rather than coherent foundation. Admittedly, i havent looked very closely. On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org wrote: I'm also convinced that Spark is a superior platform for executing distributed ML algorithms. We've had a discussion about a change from Hadoop to another platform some time ago, but at that point in time it was not clear which of the upcoming dataflow processing systems (Spark, Hyracks, Stratosphere) would establish itself amongst the users. To me it seems pretty obvious that Spark made the race. I concur with Ted, it would be great to have the communities work together. I know that at least 4 mahout committers (including me) are already following Spark's mailinglist and actively participating in the discussions. What are the ideas how a fruitful cooperation look like? Best, Sebastian PS: I ported LLR-based cooccurrence analysis (aka item-based recommendation) to Spark some time ago, but I haven't had time to test my code on a large dataset yet. I'd be happy to see someone help with that. On 02/19/2014 08:04 AM, Nick Pentreath wrote: I know the Spark/Mllib devs can occasionally be quite set in ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together. It may be too late, but perhaps a GSoC project to look at a port of some stuff like co occurrence recommender and streaming k-means? 
N -- Sent from Mailbox for iPhone On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.comwrote: My (admittedly heavily biased) view is Spark is a superior platform overall for ML. If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well as the fantastic depth of experience of Mahout devs) I think a lot can be achieved! It makes a lot of sense that Spark would be better than Hadoop for ML purposes given that Hadoop was intended to do web-crawl kinds of things and Spark was intentionally built to support machine learning. Given that Spark has been announced by a majority of the Hadoop-based distribution vendors, it makes sense that maybe Mahout should jump in. I really would prefer it if the two communities (MLib/MLI and Mahout) could work more closely together. There is a lot of good to be had on both sides.
[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906062#comment-13906062 ] Gokhan Capan commented on MAHOUT-1329: -- Is it OK to add hadoop dependencies to the project root, and to the math module (actually to all modules even they already depend on the core module)? I remember that's what we wanted to avoid Mahout for hadoop 2 --- Key: MAHOUT-1329 URL: https://issues.apache.org/jira/browse/MAHOUT-1329 Project: Mahout Issue Type: Task Components: build Affects Versions: 0.9 Reporter: Sergey Svinarchuk Assignee: Suneel Marthi Labels: patch Fix For: 1.0 Attachments: 1329.patch Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: MAHOUT 0.9 Release - New URL
Using CentOS 6.5 and hadoop 1.2.1, all passed. +1 from me Gokhan On Thu, Jan 23, 2014 at 6:01 PM, Andrew Palumbo ap@outlook.com wrote: a),b),c),d) all passed on CentOS for me Date: Thu, 23 Jan 2014 13:43:06 +0200 Subject: Re: MAHOUT 0.9 Release - New URL From: ssvinarc...@hortonworks.com To: dev@mahout.apache.org I did a), b), c), d) and all steps pass. +1 On Thu, Jan 23, 2014 at 1:40 PM, Grant Ingersoll gsing...@apache.org wrote: +1 from me. On Jan 22, 2014, at 5:55 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Fixed the issues that were reported this week and restored FP mining into the codebase. Here's the URL for the final release in staging:- https://repository.apache.org/content/repositories/orgapachemahout-1003/org/apache/mahout/mahout-distribution/0.9/ The artifacts have been signed with the following key: https://people.apache.org/keys/committer/smarthi.asc a) Verify that u can unpack the release (tar or zip) b) Verify u r able to compile the distro c) Run through the unit tests: mvn clean test d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through all the different options in each script. Committers and PMC, need a minimum of 3 '+1' votes for the release to be finalized. Grant Ingersoll | @gsingers http://www.lucidworks.com
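The four verification steps, written out as commands — a sketch only; the distribution file name follows the staging URL above, and the script shown is just one of several under examples/bin:

    tar xzf mahout-distribution-0.9.tar.gz && cd mahout-distribution-0.9   # a) unpack
    mvn clean test                                                         # b) compile, c) unit tests
    cd examples/bin && ./cluster-reuters.sh                                # d) run each example script,
                                                                           #    trying its different options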
Re: Mahout 0.9 release
+1 for 1.0. This is more challenging than expected (the old hadoop 0.23 profile support is misleading) Sent from my iPhone On Dec 19, 2013, at 19:48, Andrew Musselman andrew.mussel...@gmail.com wrote: +1 On Thu, Dec 19, 2013 at 9:20 AM, Suneel Marthi suneel_mar...@yahoo.comwrote: +1 Sent from my iPhone On Dec 19, 2013, at 12:17 PM, Frank Scholten fr...@frankscholten.nl wrote: I am looking at M-1329 (Support for Hadoop 2.x) as we speak. This change requires quite some testing and I prefer to push this to 1.0. I am thinking of creating a unit test that starts miniclusters for each versions and runs a job in them. On Thu, Dec 19, 2013 at 12:28 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: There's M-1329 that covers this. Hopefully it should make it for 0.9 Sent from my iPhone On Dec 18, 2013, at 6:20 PM, Isabel Drost-Fromm isa...@apache.org wrote: On Mon, 16 Dec 2013 23:16:36 +0200 Gokhan Capan gkhn...@gmail.com wrote: M-1354 (Support for Hadoop 2.x) - Patch available. Gokhan, any updates on this. Nope, still couldn't make it work. Should we push that for 1.0 then (if this is shortly before completion and there's too much in 1.0 to push for a release early next year, I'd also be happy to have a smaller release between now and Berlin Buzzwords that includes the fix...). Isabel
Re: Mahout 0.9 release
Gokhan On Mon, Dec 16, 2013 at 11:08 PM, Suneel Marthi suneel_mar...@yahoo.comwrote: Its time to freeze trunk the this week, here's the status of JIRAs:- Suneel -- M-1319 - Patch available, would appreciate if someone could review/test the patch before I commit to trunk. Pat - M-1288 Solr Recommender Pat, I see that you have the code in ur Github repo, could u create a patch that could be merged into Mahout trunk. Frank M-1364 (Upgrade to Lucene 4.6) - Patch available. Grant, do u have cycles to review this patch? Gokhan -- M-1354 (Support for Hadoop 2.x) - Patch available. Gokhan, any updates on this. Nope, still couldn't make it work. On Sunday, December 8, 2013 6:23 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: We need to freeze the trunk this coming week in preparation for 0.9 release, below are the pending JIRAs:- Wiki (not a show stopper for 0.9) - M-1245, M-1304, M-1305, M-1307, M-1326 Suneel --- M-1319 (i can work on this tomorrow) M-1265 (Multi Layer Perceptron) - Need to be merged into trunk, the code's available for review on ReviewBoard. It would help if another set of eyes reviewed the test cases (Isabel, Stevo.. ?) Pat M-1288 Solr Recommender (What's the status of this Pat, this needs to be in 0.9 Release.) Stevo --- M-1366 (this can be at time of 0.9 Release and has no impact on trunk) Frank M-1364 (Upgrade to Lucene 4.6) - Patch available. It would be nice to have this go in 0.9 The patch worked for me Frank, I agree that this needs to be reviewed by someone who's more familiar with Lucene. Gokhan -- M-1354 (Support for Hadoop 2.x) - Patch available. This is targeted for 1.0. The patch worked for me on Hadoop 1.2.1, it would be good if someone could try the patch on hadoop 2.x instance. Others -- M-1371 - This was reported on @user and a patch was submitted. If we don't hear from the author within this week, this can be deferred to 1.0 On Tuesday, December 3, 2013 8:13 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: JIRAs Update for 0.9 release:- Wiki - Isabel, Sebastian and other volunteers - M-1245, M-1304, M-1305, M-1307, M-1326 Suneel --- M-1319 M-1242 (Patch available to be committed to trunk) Pat --- M-1288 Solr Recommender Yexi, Suneel --- M-1265 - Multi Layer Perceptron Stevo, Isabel - M-1366 Andrew -- M-1030, M-1349 Ted -- M-1368 (Patch available to be committed to trunk) On Sunday, December 1, 2013 7:57 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Open JIRAs for 0.9 release :- Wiki - Isabel, Sebastian and other volunteers - M-1245, M-1304, M-1305, M-1307, M-1326 Suneel --- M-1319, M-1328 Pat --- M-1288 Solr Recommender Sebastian, Peng M-1286 Yexi, Suneel --- M-1265 - Multi Layer Perceptron Ted, do u have cycles to review this, the patch's up on Reviewboard. Stevo, Isabel - M-1366 - Please delete old releases from mirroring system M-1345 - Enable Randomized testing for all modules Andrew -- M-1030 Open Issues (any takers for these ???) M-1242 M-1349 On Friday, November 29, 2013 12:07 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: On 29.11.2013 17:59, Suneel Marthi wrote: Open JIRAs for 0.9: Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 - related to Wiki updates. Definitely appreciate more hands here to review/update the wiki M-1286 - Peng and Sebastian, no updates on this. Can this be included in 0.9? I will look into this over the weekend! M-1030 - Andrew Musselman M-1319, M-1328 - Suneel M-1347 - Suneel, patch has been committed to trunk. M-1265 - I have been working with Yexi on this. 
Ted, would u have time to review this; the code's on Reviewboard. M-1288 - Sole Recommender, Pat Ferrel M-1345: Isabel, Frank. I think we are good on this patch. Isabel, could u commit this to trunk? M-1312: Stevo, could u look at this? M-1349: Any takers for this?? Others: Spectral Kmeans clustering documentation (Shannon) On Thursday, November 28, 2013 10:38 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Adding Mahout-1349 to the list of JIRAs . On Thursday, November 28, 2013 10:37 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Update on Open JIRAs for 0.9: Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 - all related to Wiki updates, please see Isabel's updates. M-1286 - Peng and Sebastian, we had talked about this during the last hangout. Can this be included in 0.9? M-1030- Andrew Musselman,
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842960#comment-13842960 ] Gokhan Capan commented on MAHOUT-1354: -- Looks like when the hadoop-2 profile is activated, this patch fails to apply the hadoop-2 related dependencies to the integration and examples modules, even though they both depend on core and core depends on hadoop-2. For me, moving hadoop dependencies to the root solved the problem, but I think we wouldn't want that since hadoop is not a common dependency for all modules of the project. CC'ing [~frankscholten] Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Attachments: MAHOUT-1354_initial.patch Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843226#comment-13843226 ] Gokhan Capan commented on MAHOUT-1354: -- Yeah, I agree Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Attachments: MAHOUT-1354_initial.patch Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
Re: Welcome to Frank Scholten as new Mahout committer
Congratulations, Frank! Gokhan On Tue, Dec 3, 2013 at 3:27 PM, Isabel Drost-Fromm isa...@apache.orgwrote: Hi, this is to announce that the Project Management Committee (PMC) for Apache Mahout has asked Frank Scholten to become committer and we are pleased to announce that he has accepted. Being a committer enables easier contribution to the project since in addition to posting patches on JIRA it also gives write access to the code repository. That also means that now we have yet another person who can commit patches submitted by others to our repo *wink* Frank, you've been following the project for quite some time now - contributing valuable changes over and over again. I certainly look forward to working with you in the future. Welcome! Isabel
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837933#comment-13837933 ] Gokhan Capan commented on MAHOUT-1354: -- Today I had some troubles with integration's transitive dependencies, let me dig further. So this still should stay in 1.0 queue Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836661#comment-13836661 ] Gokhan Capan commented on MAHOUT-1354: -- Do you think we should support hadoop-1 and hadoop-2 at the same time? Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836953#comment-13836953 ] Gokhan Capan commented on MAHOUT-1354: -- Well, I tried something and want to share. Based on: In hadoop-2-stable, compatibility with hadoop-1 is preferred over with hadoop-2-alpha (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html). For example, return type for ProgramDriver#driver(String) was void in hadoop-1 (which we use in MahoutDriver), int in hadoop-2-alpha, void again in hadoop-2-stable. It seems if we select the right artifacts, there is nothing to worry about the compatibility. My conclusion was: The current hadoop-0.20 and hadoop-0.23 profiles can be utilized: we can rename them to hadoop-1 and hadoop-2, respectively, then make hadoop-2 (stable) the default profile, then set the hadoop.version property to 2.2.0. We need to worry about some third party dependencies though, for instance, hbase-client in mahout-integration is dependent to hadoop-1 (for that particular artifact, simply excluding hadoop-core did not break any tests, by the way). Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2
[ https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836965#comment-13836965 ] Gokhan Capan commented on MAHOUT-1354: -- Let me submit a patch first, probably tomorrow. Best Mahout Support for Hadoop 2 Key: MAHOUT-1354 URL: https://issues.apache.org/jira/browse/MAHOUT-1354 Project: Mahout Issue Type: Improvement Affects Versions: 0.8 Reporter: Suneel Marthi Assignee: Suneel Marthi Fix For: 1.0 Mahout support for Hadoop , now that Hadoop 2 is official. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836102#comment-13836102 ] Gokhan Capan commented on MAHOUT-1286: -- Let's Won't Fix this issue. I think what we need to do is implementing more sparse matrix (or alike) data structures for different access patterns, other than the current map of maps approach. The ideas would apply to current 2 FastByIDMaps based DataModel. Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, Semifinal-implementation-added.patch, benchmark.patch Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message was sent by Atlassian JIRA (v6.1#6144)
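A minimal sketch of the compact-arrays alternative to the map-of-maps layout discussed in this issue (a hypothetical class, not Mahout API): each row keeps a sorted int[] of column indices and a parallel double[] of values, so lookups are O(log n) binary searches and iteration is a cheap array walk.

    import java.util.Arrays;

    // Hypothetical compact sparse row: sorted column indices plus parallel values.
    final class CompactSparseRow {
      private int[] columns = new int[0];
      private double[] values = new double[0];

      // O(log n) lookup via binary search over the sorted column indices.
      double get(int column) {
        int pos = Arrays.binarySearch(columns, column);
        return pos >= 0 ? values[pos] : 0.0;
      }

      // Insert or overwrite, keeping the column array sorted.
      void set(int column, double value) {
        int pos = Arrays.binarySearch(columns, column);
        if (pos >= 0) {
          values[pos] = value;
          return;
        }
        int at = -pos - 1;
        int[] newColumns = new int[columns.length + 1];
        double[] newValues = new double[values.length + 1];
        System.arraycopy(columns, 0, newColumns, 0, at);
        System.arraycopy(values, 0, newValues, 0, at);
        newColumns[at] = column;
        newValues[at] = value;
        System.arraycopy(columns, at, newColumns, at + 1, columns.length - at);
        System.arraycopy(values, at, newValues, at + 1, values.length - at);
        columns = newColumns;
        values = newValues;
      }
    }

Such a row trades a small lookup cost for a large memory saving over nested hash maps, which is exactly the trade-off the issue description asks for.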
Re: [jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark
I'll look into this too, possibly in two days. Sent from my iPhone On Nov 26, 2013, at 22:30, Dmitriy Lyubimov (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy Lyubimov updated MAHOUT-1365: - Attachment: distributed-als-with-confidence.pdf Weighted ALS-WR iterator for Spark -- Key: MAHOUT-1365 URL: https://issues.apache.org/jira/browse/MAHOUT-1365 Project: Mahout Issue Type: Task Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: Backlog Attachments: distributed-als-with-confidence.pdf Given preference P and confidence C distributed sparse matrices, compute the ALS-WR solution for implicit feedback (Spark Bagel version). Following the Hu-Koren-Volinsky method (stripping off any concrete methodology to build the C matrix), with a parameterized test for convergence. The computational scheme follows the ALS-WR method (which should be slightly more efficient for sparser inputs). The best performance will be achieved if non-sparse anomalies are prefiltered (eliminated), such as an anomalously active user which doesn't represent a typical user anyway. The work is going on here: https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am porting away our (A1) implementation so there are a few issues associated with that. -- This message was sent by Atlassian JIRA (v6.1#6144)
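For reference, the Hu-Koren-Volinsky formulation referenced above reduces each half-iteration to independent regularized least-squares solves; a hedged sketch of the standard per-user update (the attached PDF may differ in details):

    x_u = (Y^\top C^u Y + \lambda I)^{-1} Y^\top C^u p(u)

where Y is the item-factor matrix, C^u is the diagonal confidence matrix for user u, and p(u) is the binary preference vector; the item update is symmetric. That per-row independence is what makes the distributed Spark iteration natural.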
[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806106#comment-13806106 ] Gokhan Capan edited comment on MAHOUT-1286 at 10/26/13 2:13 PM: Peng, I am attaching a patch --not to be committed-- that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline. was (Author: gokhancapan): Peng, I am attaching a patch -not to be committed- that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline. Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: benchmark.patch, InMemoryDataModel.java, InMemoryDataModelTest.java, Semifinal-implementation-added.patch Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gokhan Capan updated MAHOUT-1286: - Attachment: benchmark.patch Peng, I am attaching a patch -not to be committed- that includes some benchmarking code in case you need one, and 2 in-memory data models as a baseline. Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: benchmark.patch, InMemoryDataModel.java, InMemoryDataModelTest.java, Semifinal-implementation-added.patch Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
[ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799916#comment-13799916 ] Gokhan Capan commented on MAHOUT-1178: -- Hi [~smarthi], Although I'm not sure whether there is still interest, I have Lucene matrix (in-memory) and Solr matrix (one that does not load the index into memory) implementations. I believe both can be committed after a couple of review rounds. GSOC 2013: Improve Lucene support in Mahout --- Key: MAHOUT-1178 URL: https://issues.apache.org/jira/browse/MAHOUT-1178 Project: Mahout Issue Type: New Feature Reporter: Dan Filimon Labels: gsoc2013, mentor Fix For: Backlog Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch [via Ted Dunning] It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be possible to get the same result as dumping the term vectors for each document, one per line, and converting that result using standard Mahout methods. b) numeric fields ought to work somehow. c) if there are multiple text fields, that ought to work sensibly as well. Two options include dumping multiple matrices or converting the fields into a single row of a single matrix. d) it should be possible to refer back from a row of the matrix to find the correct document. This might be because we remember the Lucene doc number or because a field is named as holding a unique id. e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.1#6144)
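As a hedged illustration of constraint (a) above — one document becomes one sparse row — the sketch below reads a stored term vector with the Lucene 4.x API and fills a Mahout RandomAccessSparseVector. The field name, the externally built term dictionary, and the class name are assumptions; this is not the attached patch.

    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class LuceneRowSketch {
      // Turns one document's stored term vector into a sparse row; 'dictionary'
      // maps each term to its column index and must be built over the whole index.
      public static Vector termFrequencyRow(IndexReader reader, int docId, String field,
          Map<String, Integer> dictionary, int numTerms) throws java.io.IOException {
        Vector row = new RandomAccessSparseVector(numTerms);
        Terms terms = reader.getTermVector(docId, field);  // null if no term vector was stored
        if (terms == null) {
          return row;
        }
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
          Integer column = dictionary.get(term.utf8ToString());
          if (column != null) {
            row.setQuick(column, termsEnum.totalTermFreq());  // within-document frequency
          }
        }
        return row;
      }
    }

Constraint (d) is then just a matter of keeping the docId (or a stored unique-id field) alongside the produced row.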
Re: Mahout's future
I'll be traveling tomorrow, and will appreciate if the videos are gonna be accessible later. Best Sent from my iPhone On Oct 16, 2013, at 23:15, Suneel Marthi suneel_mar...@yahoo.com wrote: Thanks Dmitriy. Let me check if its possible to setup automatic calendar invites to PMC. I'll go ahead and send a hangout link for Thursday, Oct 16 from 6 - 7pm (Eastern Time). The purpose of this hangout would be to talk about Mahout 0.9 release which is tentatively being planned for Nov-Dec 2013. I'll send an email with what I see as being targeted for 0.9 and we can take it from there. There's been a discussion thread about Mahout Future Roadmap (interpreting this as post Mahout 0.9), we can get to that if time permits else we can have another hangout next week to talk about it. Suneel On Wednesday, October 16, 2013 4:05 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: 3 to 4 On Oct 16, 2013 1:02 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Dmitriy, what time works for you on thursday? On Wednesday, October 16, 2013 3:47 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Doesnt work for me. Friday is better, or thrusday earlier afternoon. I d also appreciate automatic calendar invitations to pmc if at all possible. D On Oct 14, 2013 10:21 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Will schedule a hangout for this Thursday - 7pm (Eastern Time) tentatively. I would like us to first discuss about Mahout 0.9 release, will send out an agenda once I schedule it. Regards, Suneel On Tuesday, October 15, 2013 12:24 AM, Saikat Kanjilal sxk1...@hotmail.com wrote: Following up , Suneel/Grant are we still on for meeting this week on a google hangout, would love to neet this week. From: sxk1...@hotmail.com To: dev@mahout.apache.org Subject: RE: Mahout's future Date: Sun, 6 Oct 2013 07:00:50 -0700 +1Can you send out a quick agenda (hopefully with my input incorporated) before the hangout?Regards Date: Sun, 6 Oct 2013 03:58:10 -0700 From: suneel_mar...@yahoo.com Subject: Re: Mahout's future To: dev@mahout.apache.org Grant would be available the week of Oct 14 for a hangout (tentatively). We could go ahead and schedule one next week if there's (and seems very much like it) enough response. I can go ahead and facilitate one. I will be 100% focused on Mahout from next week once I start at my new job from Monday. Regarding building something for Deep Learning, Yexi's patch for MLP (see M-1265) may be a good place to refactor/start thinking about the foundations. I guess Ted is alluring to build something like what's been described in the Google paper (see http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf ). Correct? Suneel From: Ted Dunning ted.dunn...@gmail.com To: dev@mahout.apache.org dev@mahout.apache.org Cc: dev@mahout.apache.org dev@mahout.apache.org Sent: Sunday, October 6, 2013 2:10 AM Subject: Re: Mahout's future Saikat These are all good suggestions. I would have a hard time suggesting a prioritization of them. Does anybody remember what grant said about having another hangout? 
Sent from my iPhone On Oct 6, 2013, at 7:15, Saikat Kanjilal sxk1...@hotmail.com wrote: I wanted to mention a few other things: 1) It might be useful to take and embed a few already productionalized use cases into the integration tests in mahout, this will help additional users get on board faster. 2) Deep learning is really interesting, however I'd like to help research some common use cases first before tying this into mahout. 3) It'd be good to put some thought into documenting when you would choose what type of algorithm given a production machine learning recommendation system to build, this would give more visibility for users into choosing the right mixture of algorithms to build a production ready recommender, often what I've found is that a bulk of the time in building productionalized recommenders is spent cleaning and filtering noisy data. 4) I'd like to also explore how to tie in machine learning algorithms into real time systems built using twitter storm (http://storm-project.net/), it seems that industry more and more is wanting to do real time analytics on the fly, I'm curious what type of algorithms we'd need for this and back propagate these into mahout. It'd be good to meet like minded devs together locally (Seattle) or over gtalk/conference to talk through possibilities. Regards From: ted.dunn...@gmail.com Date: Sat, 5 Oct 2013 18:13:40 -0700 Subject: Re: Mahout's future To: dev@mahout.apache.org On Sat, Oct 5, 2013 at 5:08 PM, Saikat Kanjilal sxk1...@hotmail.com wrote: Does it make sense to have a quick meeting of interested developers over google chat/conference rather than email to discuss and assign folks to specifics? Thoughts? Great idea. I think that Grant may have been
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759021#comment-13759021 ] Gokhan Capan commented on MAHOUT-1286: -- There was a thread on updating int indices and double values in matrices, but there are simply too many consequences of that update that we can't deal with right now. Even if it is not an exact Matrix structure, we can start with 2d hash tables and proceed later. Let's start this. I tried to insert Netflix ratings into: i- DataModel backed by 2 matrices. ii- The one in this patch. Good news is insert performance is good enough. I am going to try gets and iterations, too. Tomorrow I am starting the 2d hash table based on your implementation with a matrix-like interface, I am going to share a github link with you. Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, Semifinal-implementation-added.patch Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
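Purely to illustrate the kind of insert benchmark described in the comment above, a small harness that times setPreference against any Taste DataModel; the comma-separated userID,itemID,rating file format and the class name are assumptions, not the attached benchmark.patch.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class InsertBenchmarkSketch {
      // Streams a comma-separated ratings file into setPreference and reports throughput.
      public static void timeInserts(DataModel model, String ratingsFile) throws Exception {
        long start = System.currentTimeMillis();
        long count = 0;
        BufferedReader in = new BufferedReader(new FileReader(ratingsFile));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            String[] fields = line.split(",");
            model.setPreference(Long.parseLong(fields[0]),
                                Long.parseLong(fields[1]),
                                Float.parseFloat(fields[2]));
            count++;
          }
        } finally {
          in.close();
        }
        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println(count + " inserts in " + elapsedMs + " ms");
      }
    }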
[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759021#comment-13759021 ] Gokhan Capan edited comment on MAHOUT-1286 at 9/5/13 12:22 PM: --- Even if it is not an exact Matrix structure, we can start with 2d hash tables and proceed later. Let's start this. I tried to insert Netflix ratings into: i- DataModel backed by 2 matrices. ii- The one in this patch. Good news is insert performance is good enough. I am going to try gets and iterations, too. Tomorrow I am starting the 2d hash table based on your implementation with a matrix-like interface, I am going to share a github link with you. was (Author: gokhancapan): There was a thread on updating int indices and double values in matrices, but there are simply too many consequences of that update that we can't deal with right now. Even if it is not an exact Matrix structure, we can start with 2d hash tables and proceed later. Let's start this. I tried to insert Netflix ratings into: i- DataModel backed by 2 matrices. ii- The one in this patch. Good news is insert performance is good enough. I am going to try gets and iterations, too. Tomorrow I am starting the 2d hash table based on your implementation with a matrix-like interface, I am going to share a github link with you. Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, Semifinal-implementation-added.patch Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757801#comment-13757801 ] Gokhan Capan commented on MAHOUT-1286: -- Here is what I think: 1- We should implement a matrix that uses your 2d Hopscotch hash table as the underlying data structure (or the current open addressing hash table implementation that already exists in Mahout, depending on benchmarks) 2- We should handle concurrency issues that might be introduced by that matrix implementation 3- We then can replace the FastByIDMap(s) with that matrix, trust at the underlying matrix for concurrent updates, and never create a PreferenceArray unless there is an iteration over users (or items) What do you think? Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, Semifinal-implementation-added.patch Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
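On point 1, a hedged sketch of how the open-addressing primitives already in Mahout collections could back a 2d index: pack the (row, column) pair into a single long key and keep values in an OpenLongDoubleHashMap. The wrapper class below is hypothetical, and row/column indices are assumed to fit in 32 bits.

    import org.apache.mahout.math.map.OpenLongDoubleHashMap;

    // Hypothetical 2d index: combined long key over a primitive open-addressing map.
    final class TwoDHashSketch {
      private final OpenLongDoubleHashMap cells = new OpenLongDoubleHashMap();

      private static long key(int row, int column) {
        return (((long) row) << 32) | (column & 0xffffffffL);
      }

      void set(int row, int column, double value) {
        cells.put(key(row, column), value);
      }

      double get(int row, int column) {
        long k = key(row, column);
        return cells.containsKey(k) ? cells.get(k) : 0.0;
      }
    }

Whether this or the Hopscotch table wins would come down to exactly the kind of benchmark attached to this issue.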
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13751049#comment-13751049 ] Gokhan Capan commented on MAHOUT-1286: -- Hi Peng, could you submit the diff files instead of .javas? That would be more convenient for me if it is possible. Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Assignee: Sean Owen Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13751053#comment-13751053 ] Gokhan Capan commented on MAHOUT-1286: -- By the way, it seems the link to the paper is broken, if it is not just me. Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Assignee: Sean Owen Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: You are invited to Apache Mahout meet-up
Have a great day! On Aug 22, 2013, at 8:44 PM, Piero Giacomelli pgiac...@gmail.com wrote: Me too, so any online material could be very helpful. On Aug 22, 2013 19:31, Peng Cheng pc...@uowmail.edu.au wrote: Is the presentation going to be uploaded on Youtube or Slideshare? Sorry I cannot be there. On 13-08-22 08:46 AM, Yexi Jiang wrote: A great event. I wish I were in the Bay area. 2013/8/22 Shannon Quinn squ...@gatech.edu I'm only sorry I'm not in the Bay area. Sounds great! On 8/22/13 3:38 AM, Stevo Slavić wrote: Retweeted meetup invite. Have fun! Kind regards, Stevo Slavic. On Thu, Aug 22, 2013 at 8:34 AM, Ted Dunning ted.dunn...@gmail.com wrote: Very cool. Would love to see folks turn out for this. On Wed, Aug 21, 2013 at 9:38 PM, Ellen Friedman b.ellen.fried...@gmail.com wrote: The Apache Mahout user group has been re-activated. If you are in the Bay Area in California, join us on Aug 27 (Redwood City). Sebastian Schelter will be the main speaker, talking about new directions with Mahout recommendation. Grant Ingersoll, Ted Dunning and I will be there to do a short introduction for the meet-up and update on the 0.8 release. Here's the link to rsvp: http://bit.ly/16K32hg Hope you can come, and please spread the word. Ellen
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737267#comment-13737267 ] Gokhan Capan commented on MAHOUT-1286: -- Peng, With a SparseRowMatrix, column access (getPreferencesForItem) is slow, but row access is pretty fast (getPreferencesFromUser). I agree with all other problems you mentioned. In Mahout's SVD-based recommenders and FactorizablePreferences, while computing top-N recommendations, I believe we compute activeUser,item predictions for each item, and return the top-N. So basically, an SVD-based recommender needs fast access to the rows of the matrix, but not the columns (it still needs to iterate over item ids, though). Column access is only needed in an item-based recommender, or if a CandidateItemsStrategy is used. In my tests for Netflix data, I saw a 3G heap, too. Let me compare this particular approach with the SparseRowMatrix-backed one. I will investigate your approach further. Ted, Additionally, I recently implemented a read-only SolrMatrix, which might be beneficial while implementing the SolrRecommender, if we want to use the existing Mahout library for similarities etc. I will open a new thread for that. Best Memory-efficient DataModel, supporting fast online updates and element-wise iteration - Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Assignee: Sean Owen Labels: collaborative-filtering, datamodel, patch, recommender Fix For: 0.9 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java Original Estimate: 336h Remaining Estimate: 336h Most DataModel implementation in current CF component use hash map to enable fast 2d indexing and update. This is not memory-efficient for big data set. e.g. Netflix prize dataset takes 11G heap space as a FileDataModel. Improved implementation of DataModel should use more compact data structure (like arrays), this can trade a little of time complexity in 2d indexing for vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in training process. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
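To make the access-pattern point concrete, a sketch (plain arrays, hypothetical names) of the top-N loop of a factorization-based recommender: it touches the active user's factor row once and one item factor row per candidate, i.e. row access only, never column access.

    // Hypothetical top-N scoring over dense factor arrays.
    final class TopNSketch {
      // Predicted preference = dot product of user factor row and item factor row.
      static double predict(double[] userRow, double[] itemRow) {
        double dot = 0.0;
        for (int f = 0; f < userRow.length; f++) {
          dot += userRow[f] * itemRow[f];
        }
        return dot;
      }

      // Scores every item using row access only, then selects the best howMany.
      static int[] recommend(double[] userRow, double[][] itemFactors, int howMany) {
        int n = Math.min(howMany, itemFactors.length);
        double[] scores = new double[itemFactors.length];
        for (int item = 0; item < itemFactors.length; item++) {
          scores[item] = predict(userRow, itemFactors[item]);
        }
        int[] top = new int[n];
        boolean[] taken = new boolean[itemFactors.length];
        for (int k = 0; k < n; k++) {
          int best = -1;
          for (int item = 0; item < itemFactors.length; item++) {
            if (!taken[item] && (best < 0 || scores[item] > scores[best])) {
              best = item;
            }
          }
          top[k] = best;
          taken[best] = true;
        }
        return top;
      }
    }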
Re: Regarding Online Recommenders
Ok, I tested the MatrixBackedDataModel, and the heap size is reduced to 7G for the Netflix Data, still large. The same history is encoded in 2 SparseRowMatrices, one is row-indexed by users and one is by item. It has serious concurrency issues at several places, though (sets and removes need to be thread-safe). Best Gokhan On Sat, Jul 20, 2013 at 12:15 AM, Peng Cheng pc...@uowmail.edu.au wrote: Hi, Just one simple question: Is the org.apache.mahout.math.**BinarySearch.binarySearch() function an optimized version of Arrays.binarySearch()? If it is not, why implement it again? Yours Peng On 13-07-17 06:31 PM, Sebastian Schelter wrote: You are completely right, the simple interface would only be usable for readonly / batch-updatable recommenders. Online recommenders might need something different. I tried to widen the discussion here to discuss all kinds of API changes in the recommenders that would be necessary in the future. 2013/7/17 Peng Cheng pc...@uowmail.edu.au One thing that suddenly comes to my mind is that, for a simple interface like FactorizablePreferences, maybe sequential READ in real time is possible, but sequential WRITE in O(1) time is Utopia. Because you need to flush out old preference with same user and item ID (in worst case it could be an interpolation search), otherwise you are permitting a user rating an item twice with different values. Considering how FileDataModel suppose to work (new files flush old files), maybe using the simple interface has less advantages than we used to believe. On 13-07-17 04:58 PM, Sebastian Schelter wrote: Hi Peng, I never wanted to discard the old interface, I just wanted to split it up. I want to have a simple interface that only supports sequential access (and allows for very memory efficient implementions, e.g. by the use of primitive arrays). DataModel should *extend* this interface and provide sequential and random access (basically what is already does). Than a recommender such as SGD could state that it only needs sequential access to the preferences and you can either feed it a DataModel (so we dont break backwards compatibility) or a memory efficient sequential access thingy. Does that make sense for you? 2013/7/17 Peng Cheng pc...@uowmail.edu.au I see, OK so we shouldn't use the old implementation. But I mean, the old interface doesn't have to be discarded. The discrepancy between your FactorizablePreferences and DataModel is that, your model supports getPreferences(), which returns all preferences as an iterator, and DataModel supports a few old functions that returns preferences for an individual user or item. My point is that, it is not hard for each of them to implement what they lack of: old DataModel can implement getPreferences() just by a a loop in abstract class. Your new FactorizablePreferences can implement those old functions by a binary search that takes O(log n) time, or an interpolation search that takes O(log log n) time in average. So does the online update. It will just be a matter of different speed and space, but not different interface standard, we can use old unit tests, old examples, old everything. And we will be more flexible in writing ensemble recommender. Just a few thoughts, I'll have to validate the idea first before creating a new JIRA ticket. Yours Peng On 13-07-16 02:51 PM, Sebastian Schelter wrote: I completely agree, Netflix is less than one gigabye in a smart representation, 12x more memory is a nogo. 
The techniques used in FactorizablePreferences allow a much more memory efficient representation, tested on KDD Music dataset which is approx 2.5 times Netflix and fits into 3GB with that approach. 2013/7/16 Ted Dunning ted.dunn...@gmail.com Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100million observations should take 1GB. On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid, we should separate time-complexity first and space-complexity first implementation. What I'm not quite sure, is that if we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high right? Most new laptop can already handle that (emphasis on laptop). And if we replace hash map (the culprit of high memory consumption) with list/linkedList, it would simply degrade time complexity for a linear search to O(n), not too bad either. The current DataModel is a result of careful thoughts and has underwent extensive test, it is easier to expand on top of it instead of subverting it.
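A hedged sketch of the two-matrix layout being tested here, using org.apache.mahout.math.SparseRowMatrix; the wrapper class and the coarse synchronization are illustrative only, and ids are assumed to be pre-mapped to dense int indices. Row views make both per-user and per-item access cheap, at the cost of writing every preference twice and keeping the two matrices consistent under concurrent updates.

    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.SparseRowMatrix;
    import org.apache.mahout.math.Vector;

    // Hypothetical core of a matrix-backed preference store.
    final class DualMatrixPreferences {
      private final Matrix byUser;  // rows = users, columns = items
      private final Matrix byItem;  // rows = items, columns = users

      DualMatrixPreferences(int numUsers, int numItems) {
        this.byUser = new SparseRowMatrix(numUsers, numItems);
        this.byItem = new SparseRowMatrix(numItems, numUsers);
      }

      // Coarse-grained lock: the two matrices must never disagree.
      synchronized void setPreference(int userIndex, int itemIndex, double value) {
        byUser.setQuick(userIndex, itemIndex, value);
        byItem.setQuick(itemIndex, userIndex, value);
      }

      synchronized void removePreference(int userIndex, int itemIndex) {
        byUser.setQuick(userIndex, itemIndex, 0.0);
        byItem.setQuick(itemIndex, userIndex, 0.0);
      }

      Vector preferencesFromUser(int userIndex) {
        return byUser.viewRow(userIndex);  // fast: a row view, no copying
      }

      Vector preferencesForItem(int itemIndex) {
        return byItem.viewRow(itemIndex);  // fast for the same reason
      }
    }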
Re: MongoDBDataModel additions
Paul, Actually we are now working on an OnlineRecommender, and we plan to support new users and items. You can find the discussion in the Regarding Online Recommenders thread on the dev list. You may want to take a look at it. Best, Gokhan On Mon, Jul 22, 2013 at 8:40 AM, Paul Scott pscott...@gmail.com wrote: On 19/07/2013 19:40, Gokhan Capan wrote: Hi Paul, I am sure Sebastian will provide further information, but there was a JIRA ticket that you may find relevant. https://issues.apache.org/jira/browse/MAHOUT-1050 Thanks! OK, so the data model is immutable because of constant refreshing. Seems OK to me, although it may be a bit heavy with many millions of users, no? Anyway, I will leave it for now and look at other ways to help out this awesome project! Thanks for the reply and link -- Paul -- http://paulscott.co.za/blog/
Re: MongoDBDataModel additions
Hi Paul, I am sure Sebastian will provide further information, but there was a JIRA ticket that you may find relevant. https://issues.apache.org/jira/browse/MAHOUT-1050 Best Gokhan On Fri, Jul 19, 2013 at 9:43 AM, Paul Scott pscott...@gmail.com wrote: Hi all, Let me do a quick introduction. I am Paul and I work at DStv Online in South Africa. I would normally lurk on a list a lot longer than this, but I do feel that I can contribute almost immediately. Please excuse me if I am at all out of bounds here... I have noticed that in the MongoDBDataModel in mahout-integration that the methods: public void setPreference(long userID, long itemID, float value) and public void removePreference(long userID, long itemID) both throw UnsupportedOperationExceptions. Is this by design, or can I actually implement these methods and send through a patch? Also, obviously, I would need to open a Jira ticket. Do I need to sign up for that or what is the process there? As a second contribution, I would also like to start exploring/discussing a Neo4jDataModel for working with the Neo4j Graph database. Again, apologies if this has already been discussed, but I couldn't find any other references to this online. Many thanks! -- Paul http://paulscott.co.za/blog
Re: Regarding Online Recommenders
handled? Do you plan to require batch model refactorization for any update? Or perform some partial update by maybe just transforming new data into the LF space already in place then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow for passing the entire history vector or userID+incremental new history to the recommender? I hope so. For what it's worth we did a comparison of Mahout Item based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of data. The data was purchase data from a diverse ecom source with a large variety of products from electronics to clothes. We found Item based CF did far better than ALS. As we increased the number of latent factors the results got better but were never within 10% of item based (we used MAP as the offline metric). Not sure why but maybe it has to do with the diversity of the item types. I understand that a full item based online recommender has very different tradeoffs and anyway others may not have seen this disparity of results. Furthermore we don't have A/B test results yet to validate the offline metric. On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote: Peng, This is the reason I separated out the DataModel, and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)***noOfLatentFactors, and does not care where preferences are stored. I, kind of, agree with the multi-level DataModel approach: One for iterating over all preferences, one for if one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption, while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach) Gokhan On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote: I completely agree, Netflix is less than one gigabye in a smart representation, 12x more memory is a nogo. The techniques used in FactorizablePreferences allow a much more memory efficient representation, tested on KDD Music dataset which is approx 2.5 times Netflix and fits into 3GB with that approach. 2013/7/16 Ted Dunning ted.dunn...@gmail.com Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100million observations should take 1GB. On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid, we should separate time-complexity first and space-complexity first implementation. What I'm not quite sure, is that if we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high right? Most new laptop can already handle that (emphasis on laptop). And if we replace hash map (the culprit of high memory consumption) with list/linkedList, it would simply degrade time complexity for a linear search to O(n), not too bad either. The current DataModel is a result of careful thoughts and has underwent extensive test, it is easier to expand on top of it instead of subverting it.
Re: Regarding Online Recommenders
It is 2 SparseRowMatrices, Peng. But I don't want to comment on it before actually trying it. This is essentially a first step for me to choose my side on the DataModel implementation discussion:) Gokhan On Fri, Jul 19, 2013 at 2:25 AM, Peng Cheng pc...@uowmail.edu.au wrote: Wow, that's lightning fast. Is it a SparseMatrix or DenseMatrix? On 13-07-18 07:23 PM, Gokhan Capan wrote: I just started to implement a Matrix backed data model and pushed it, to check the performance and memory considerations. I believe I can try it on some data tomorrow. Best Gokhan On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng pc...@uowmail.edu.au wrote: I see, sorry I was too presumptuous. I only recently worked and tested SVDRecommender, never could have known its efficiency using an item-based recommender. Maybe there is space for algorithmic optimization. The online recommender Gokhan is working on is also an SVDRecommender. An online user-based or item-based recommender based on clustering technique would definitely be critical, but we need an expert to volunteer :) Perhaps Dr Dunning can have a few words? He announced the online clustering component. Yours Peng On 13-07-18 03:54 PM, Pat Ferrel wrote: No it was CPU bound not memory. I gave it something like 14G heap. It was running, just too slow to be of any real use. We switched to the hadoop version and stored precalculated recs in a db for every user. On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote: Strange, its just a little bit larger than limibseti dataset (17m ratings), did you encountered an outOfMemory or GCTimeOut exception? Allocating more heap space usually help. Yours Peng On 13-07-18 02:27 PM, Pat Ferrel wrote: It was about 2.5M users and 500K items with 25M actions over 6 months of data. On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote: If I remember right, a highlight of 0.8 release is an online clustering algorithm. I'm not sure if it can be used in item-based recommender, but this is definitely I would like to pursue. It's probably the only advantage a non-hadoop implementation can offer in the future. Many non-hadoop recommenders are pretty fast. But existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes, IMHO they are the culprit of scalability problem. May I ask about the scale of your dataset? how many rating does it have? Yours Peng On 13-07-18 12:14 PM, Sebastian Schelter wrote: Well, with itembased the only problem is new items. New users can immediately be served by the model (although this is not well supported by the API in Mahout). For the majority of usecases I saw, it is perfectly fine to have a short delay until new items enter the recommender, usually this happens after a retraining in batch. You have to care for cold-start and collect some interactions anyway. 2013/7/18 Pat Ferrel pat.fer...@gmail.com Yes, what Myrrix does is good. My last aside was a wish for an item-based online recommender not only factorized. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works but it does require a bit of tinkering and doesn't have quite the same set of options--no llr similarity for instance. On the same subject I recently attended a workshop in Seattle for UAI2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor number past where it would perform well. Along the way they saw increasing performance measuring precision offline. 
They eventually gave up on a factorized solution. This decision seems odd but anyway… In the case of Walmart and our data set they are quite diverse. The best idea is probably to create different recommenders for separate parts of the catalog but if you create one model on all items our intuition is that item-based works better than factorized. Again caveat--no A/B tests to support this yet. Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores so we moved to the hadoop version. Now we have use-cases for an online recommender that handles anonymous new users and that takes the story full circle. On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote: Hi Pat I think we should provide a simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as argument. For itembased recommenders, its straightforward to compute recommendations, for userbased you have to search through all users once, for latent factor models, you have to fold the user vector into the low dimensional space. I think Sean already added
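A hedged aside on the "fold the user vector into the low dimensional space" step mentioned above (the textbook ridge-regression fold-in, not necessarily what any particular Mahout recommender does): with V_I the item-factor rows of the items the anonymous user has acted on and r_I the corresponding preference values, an approximate user factor is

    \hat{u} = (V_I^\top V_I + \lambda I)^{-1} V_I^\top r_I

after which recommendation proceeds exactly as for a known user, by scoring items against \hat{u}. The solve is only numFactors x numFactors, so it is cheap enough to do per request.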
Re: Regarding Online Recommenders
Hi Pat, please see my response inline. Best, Gokhan On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote: May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is calculated offline still in batch mode, then there are periodic updates? How are the updates handled? If you are referring to the recommender of discussion here, no, updating the model can be done with a single preference, using stochastic gradient descent, by updating the particular user and item factors simultaneously. Do you plan to require batch model refactorization for any update? Or perform some partial update by maybe just transforming new data into the LF space already in place then doing full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow for passing the entire history vector or userID+incremental new history to the recommender? I hope so. For what it's worth we did a comparison of Mahout Item based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months of data. The data was purchase data from a diverse ecom source with a large variety of products from electronics to clothes. We found Item based CF did far better than ALS. As we increased the number of latent factors the results got better but were never within 10% of item based (we used MAP as the offline metric). Not sure why but maybe it has to do with the diversity of the item types. My first question, are those actions are only positive, like purchase as you mentioned? I understand that a full item based online recommender has very different tradeoffs and anyway others may not have seen this disparity of results. Furthermore we don't have A/B test results yet to validate the offline metric. I personally think an A/B test is the best way to evaluate a recommender, and if you will be able to share it, I personally look forward to see the results. I believe that would be a great contribution for some future decisions. On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote: Peng, This is the reason I separated out the DataModel, and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored. I, kind of, agree with the multi-level DataModel approach: One for iterating over all preferences, one for if one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption, while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach) Gokhan On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote: I completely agree, Netflix is less than one gigabye in a smart representation, 12x more memory is a nogo. The techniques used in FactorizablePreferences allow a much more memory efficient representation, tested on KDD Music dataset which is approx 2.5 times Netflix and fits into 3GB with that approach. 2013/7/16 Ted Dunning ted.dunn...@gmail.com Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100million observations should take 1GB. 
On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid, we should separate time-complexity first and space-complexity first implementation. What I'm not quite sure, is that if we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high right? Most new laptop can already handle that (emphasis on laptop). And if we replace hash map (the culprit of high memory consumption) with list/linkedList, it would simply degrade time complexity for a linear search to O(n), not too bad either. The current DataModel is a result of careful thoughts and has underwent extensive test, it is easier to expand on top of it instead of subverting it.
Re: Regarding Online Recommenders
Peng, This is the reason I separated out the DataModel, and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored. I, kind of, agree with the multi-level DataModel approach: One for iterating over all preferences, one for if one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption, while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach) Gokhan On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote: I completely agree, Netflix is less than one gigabye in a smart representation, 12x more memory is a nogo. The techniques used in FactorizablePreferences allow a much more memory efficient representation, tested on KDD Music dataset which is approx 2.5 times Netflix and fits into 3GB with that approach. 2013/7/16 Ted Dunning ted.dunn...@gmail.com Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100million observations should take 1GB. On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid, we should separate time-complexity first and space-complexity first implementation. What I'm not quite sure, is that if we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high right? Most new laptop can already handle that (emphasis on laptop). And if we replace hash map (the culprit of high memory consumption) with list/linkedList, it would simply degrade time complexity for a linear search to O(n), not too bad either. The current DataModel is a result of careful thoughts and has underwent extensive test, it is easier to expand on top of it instead of subverting it.
Regarding Online Recommenders
Based on the conversation in MAHOUT-1274, I put some code here: https://github.com/gcapan/mahout/tree/onlinerec I hope that would initiate a discussion on OnlineRecommender approaches. I think the OnlineRecommender would require (similar to what Sebastian commented there): 1- A DataModel that allows adding new users/items and performs fast iteration 2- An online learning interface that allows updating the model with a feedback, and make predictions based on the latest model The code is a very early effort for the latter, and it contains a matrix factorization-based implementation where training is done by SGD. The model is stored in a DenseMatrix --it should be replaced with a matrix that allows adding new rows and doesn't allocate space for empty rows (please search for DenseRowMatrix and BlockSparseMatrix in the dev-list, and see MAHOUT-1193 for relevant issue). I didn't try that on a dataset yet. The DataModel I imagine would follow the current API, where underlying preference storage is replaced with a matrix. A Recommender would then use the DataModel and the OnlineLearner, where Recommender#setPreference is delegated to DataModel#setPreference (like it does now), and DataModel#setPreference triggers OnlineLearner#train. Gokhan
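As a concrete, hedged illustration of the "update the particular user and item factors simultaneously" step described above — textbook regularized matrix-factorization SGD, with hypothetical names rather than the API of the linked branch:

    // Hypothetical single-preference SGD step over dense factor arrays.
    final class SgdStepSketch {
      private final double[][] userFactors;  // numUsers x numFactors
      private final double[][] itemFactors;  // numItems x numFactors
      private final double learningRate;
      private final double lambda;           // L2 regularization

      SgdStepSketch(double[][] userFactors, double[][] itemFactors,
                    double learningRate, double lambda) {
        this.userFactors = userFactors;
        this.itemFactors = itemFactors;
        this.learningRate = learningRate;
        this.lambda = lambda;
      }

      // Called once per observed (user, item, rating) preference.
      void train(int user, int item, double rating) {
        double[] u = userFactors[user];
        double[] v = itemFactors[item];
        double predicted = 0.0;
        for (int f = 0; f < u.length; f++) {
          predicted += u[f] * v[f];
        }
        double err = rating - predicted;
        for (int f = 0; f < u.length; f++) {
          double uf = u[f];
          double vf = v[f];
          u[f] += learningRate * (err * vf - lambda * uf);
          v[f] += learningRate * (err * uf - lambda * vf);
        }
      }
    }

An OnlineLearner#train(user, item, rating) delegating to something like this is all that DataModel#setPreference would need to trigger.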
Re: Welcome new committers Gokhan Capan and Stevo Slavic
Hi, Sorry I was on a vacation. Congratulations, Stevo! I think being a Mahout committer is a big deal, and I am really pleased that I am one now. I am a Researcher at Anadolu University, Turkey, and a Data Scientist at Dilisim, a company specialized in IR, NLP, and Data Science solutions. I hope I can participate well to committers' great efforts to empower users to perform massive, real-world machine learning. Thank you very much. Best regards, Gokhan On Tue, Jun 11, 2013 at 12:39 PM, Dmitriy Lyubimov dlie...@gmail.comwrote: congratulations! On Mon, Jun 10, 2013 at 10:22 PM, Dan Filimon dangeorge.fili...@gmail.comwrote: Congratulations to the both of you! :) It's great to have you on board! On Tue, Jun 11, 2013 at 3:58 AM, Stevo Slavić ssla...@gmail.com wrote: Thanks Grant, Suneel and rest of the team, I'm a Java software developer and OSS enthusiast from Serbia with 7 years of professional experience in IT industry. Together with teams I've been part of, I have designed, built and successfully delivered multiple applications and websites from various business domains (online media, e-government, telecommunications, e-commerce). In both small and large enterprise scale apps, open source technologies and communities around them were and remain to be one of the key components and ingredients for success. It's always a great pleasure for me to give back to OSS projects that I use, through submitting patches or just being good community member. So far I've contributed to and been involved the most on Spring framework and other associated projects from the Spring portfolio. Back in April last year I rediscovered my passion and interest in machine learning, AI and computer science in general through prof. Andrew Ng's Coursera machine learning MOOC https://www.coursera.org/course/ml which I successfully completed http://bit.ly/sslavic-coursera-ml. Going from ML theory to practice, through the mist of Big Data hype, lead me to the greatness of Apache Mahout project. You all do me great honor by accepting me into the team, team of exceptional individuals yet great team players, with such positive and creative atmosphere. My contributions to the project so far were rather limited, and in near future they are likely to remain so as I still have lots to learn first. At least in the beginning, more than anything else I expect that I'll be able to contribute to the project by making it even more approachable to general audience of IT practitioners like myself through actively promoting it, supporting users on the mailing list to my best, and working on the documentation. Level of commitment will surely increase with time. I thank you all once more for this wonderful opportunity, and wish us and the project lots of success! Kind regards, Stevo Slavic. On Tue, Jun 11, 2013 at 1:10 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Congrats Gokhan and Stevo!! From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org dev@mahout.apache.org Sent: Monday, June 10, 2013 5:04 PM Subject: Welcome new committers Gokhan Capan and Stevo Slavic Please join me in congratulating Mahout's newest committers, Gokhan Capan and Stevo Slavic, both of whom have been contributing to Mahout for some time now. Gokhan, Stevo, new committer tradition is to give a brief background on yourself, so you have the floor! Congrats, Grant
HBase backed matrices
Hi, For taking large matrices as input and persisting large models (like factor models), I created an HBase-backed version of the Mahout Matrix. It allows random access to cells and rows as well as assignment, and iteration over rows. viewRow returns a view, and the actual data is loaded lazily only when a get is invoked. I plan to add a VectorInputFormat on top of it, too. The code we need for our algorithms is tested, but there are still parts that are not. I am going to speak about this at HBaseCon, and I wanted to let you know that it can be contributed after some refactoring. Is there any interest? -- Gokhan
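As a rough illustration of the lazy viewRow behavior described above (not the contributed code), here is a hypothetical row view in Java that fetches nothing from HBase until the first element access and then caches the row in a Mahout sparse vector. It assumes the pre-1.0 HBase client API (HTable, Get) and a column family named "d" holding one cell per column index; all names are made up for the example.

import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Hypothetical lazy row view over an HBase-backed matrix row.
public class LazyHBaseRowView {

  private static final byte[] FAMILY = Bytes.toBytes("d");

  private final HTable table;
  private final int row;
  private final int numCols;
  private Vector cached;   // stays null until the first get

  public LazyHBaseRowView(HTable table, int row, int numCols) {
    this.table = table;
    this.row = row;
    this.numCols = numCols;
  }

  // The first access triggers a single HBase Get for the whole row.
  public double get(int column) throws IOException {
    if (cached == null) {
      load();
    }
    return cached.get(column);
  }

  private void load() throws IOException {
    Result result = table.get(new Get(Bytes.toBytes(row)));
    cached = new RandomAccessSparseVector(numCols);
    NavigableMap<byte[], byte[]> cells = result.getFamilyMap(FAMILY);
    if (cells != null) {
      for (Map.Entry<byte[], byte[]> cell : cells.entrySet()) {
        // qualifier = column index, value = the cell's double value
        cached.setQuick(Bytes.toInt(cell.getKey()), Bytes.toDouble(cell.getValue()));
      }
    }
  }
}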
Re: HBase backed matrices
2 options: 1- row index as the row key, column index as the column qualifier, and the cell value as the value 2- row index and column index combined as the row key, and the value in a single column called "value" Row indices are kept in a member variable in memory, to make iteration fast. On Wed, May 8, 2013 at 12:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: How did you store the matrix in HBase? On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi, For taking large matrices as input and persisting large models (like factor models), I created an HBase-backed version of Mahout matrix. It allows random access to cells and rows as well as assignment, and iteration over rows. viewRow returns a view, and lazy loads actual data if a get is actually invoked. I plan to add a VectorInputFormat on top of it, too. The code that we need to have for our algorithms is tested, but there are still parts of it that are not. I am going to speak about this at HBaseCon, and I wanted to let you know that it can be contributed after some refactoring. Is there any interest? -- Gokhan -- Gokhan
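As a sketch, the two layouts translate into HBase Puts roughly as follows; the column family "d", the "value" qualifier, and the pre-1.0 Put.add signature are assumptions made for the example.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical builders for the two cell layouts described above.
public final class CellLayouts {

  private static final byte[] FAMILY = Bytes.toBytes("d");
  private static final byte[] VALUE_QUALIFIER = Bytes.toBytes("value");

  // Option 1: row key = row index, one qualifier per column index.
  static Put option1(int row, int column, double value) {
    Put put = new Put(Bytes.toBytes(row));
    put.add(FAMILY, Bytes.toBytes(column), Bytes.toBytes(value));
    return put;
  }

  // Option 2: row key = (row index, column index) concatenated, single "value" column.
  static Put option2(int row, int column, double value) {
    byte[] key = Bytes.add(Bytes.toBytes(row), Bytes.toBytes(column));
    Put put = new Put(key);
    put.add(FAMILY, VALUE_QUALIFIER, Bytes.toBytes(value));
    return put;
  }

  private CellLayouts() {}
}

Option 1 keeps a whole matrix row inside one HBase row, which helps row iteration; option 2 spreads cells across HBase rows, so reading a matrix row becomes a scan over the row-index key prefix.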
Re: HBase backed matrices
Nope, I simply thought that would make accessing and setting individual cells more difficult. Should I? Do you think it would perform better? And I would like to hear if you have other design choices in mind. On Wed, May 8, 2013 at 12:22 AM, Ted Dunning ted.dunn...@gmail.com wrote: Have you experimented with, for instance, row number as id, value as binary serialized vector? On Tue, May 7, 2013 at 2:16 PM, Gokhan Capan gkhn...@gmail.com wrote: 2 options: 1- row index as the row key, column index as column identifier, and value as value 2- row index and column index combined as the row key, and value in a column called value Row indices are kept in a member variable in memory, to make iteration fast. On Wed, May 8, 2013 at 12:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: How did you store the matrix in HBase? On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi, For taking large matrices as input and persisting large models (like factor models), I created an HBase-backed version of Mahout matrix. It allows random access to cells and rows as well as assignment, and iteration over rows. viewRow returns a view, and lazy loads actual data if a get is actually invoked. I plan to add a VectorInputFormat on top of it, too. The code that we need to have for our algorithms is tested, but there are still parts of it that are not. I am going to speak about this at HBaseCon, and I wanted to let you know that it can be contributed after some refactoring. Is there any interest? -- Gokhan -- Gokhan -- Gokhan
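For reference, a hedged sketch of the layout Ted is suggesting: the entire matrix row serialized as a single binary blob with Mahout's VectorWritable. The family and qualifier names are made up, and the pre-1.0 Put.add signature is assumed.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical codec: one HBase row per matrix row, the row stored as one serialized vector.
public final class BlobRowCodec {

  private static final byte[] FAMILY = Bytes.toBytes("d");
  private static final byte[] ROW_QUALIFIER = Bytes.toBytes("row");

  static Put encode(int row, Vector rowVector) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    new VectorWritable(rowVector).write(new DataOutputStream(bytes));
    Put put = new Put(Bytes.toBytes(row));
    put.add(FAMILY, ROW_QUALIFIER, bytes.toByteArray());
    return put;
  }

  static Vector decode(byte[] blob) throws IOException {
    VectorWritable writable = new VectorWritable();
    writable.readFields(new DataInputStream(new ByteArrayInputStream(blob)));
    return writable.get();
  }

  private BlobRowCodec() {}
}

Setting a single cell in this layout means rewriting the whole blob, which is exactly the update cost Ted weighs in the next message.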
Re: HBase backed matrices
So if rows are small, blob is probably better; and if they get larger I can make blocks of blobs. I will experiment with this. On Wed, May 8, 2013 at 1:06 AM, Ted Dunning ted.dunn...@gmail.com wrote: It really depends on your access patterns. Blob storage of rows will be much faster for scans and will take much less space. Column storage of values may or may not make things faster, but it is conceptually nicer to not have to update so much. In practice, I am not convinced that you will notice the difference except for really big rows. Remember that you don't have to commit to a single choice. You could use a rolled up representation most of the time and then break the rollups into regions as they get bigger. On Tue, May 7, 2013 at 2:32 PM, Gokhan Capan gkhn...@gmail.com wrote: Nope, I simply thought that would make accessing and setting individual cells more difficult. Should I? Do you think it would perform better? And I would want to hear if you have more design choices in your mind. On Wed, May 8, 2013 at 12:22 AM, Ted Dunning ted.dunn...@gmail.com wrote: Have you experimented with, for instance, row number as id, value as binary serialized vector? On Tue, May 7, 2013 at 2:16 PM, Gokhan Capan gkhn...@gmail.com wrote: 2 options: 1- row index as the row key, column index as column identifier, and value as value 2- row index and column index combined as the row key, and value in a column called value Row indices are kept in a member variable in memory, to make iteration fast. On Wed, May 8, 2013 at 12:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: How did you store the matrix in HBase? On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi, For taking large matrices as input and persisting large models (like factor models), I created an HBase-backed version of Mahout matrix. It allows random access to cells and rows as well as assignment, and iteration over rows. viewRow returns a view, and lazy loads actual data if a get is actually invoked. I plan to add a VectorInputFormat on top of it, too. The code that we need to have for our algorithms is tested, but there are still parts of it that are not. I am going to speak about this at HBaseCon, and I wanted to let you know that it can be contributed after some refactoring. Is there any interest? -- Gokhan -- Gokhan -- Gokhan -- Gokhan
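Under the same assumptions as the sketches above, the "blocks of blobs" idea could look roughly like this: a large row is split into fixed-size column blocks, and each block is stored as its own serialized sub-vector with the block index as the qualifier. The block size and all names are arbitrary choices for the example.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical blocked-blob layout: qualifier = block index, value = serialized sub-vector.
public final class BlockedBlobRowCodec {

  private static final byte[] FAMILY = Bytes.toBytes("d");
  private static final int BLOCK_SIZE = 100000;  // columns per block; tune to the row width

  static Put encode(int row, Vector rowVector) throws IOException {
    Put put = new Put(Bytes.toBytes(row));
    for (int offset = 0; offset < rowVector.size(); offset += BLOCK_SIZE) {
      int length = Math.min(BLOCK_SIZE, rowVector.size() - offset);
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new VectorWritable(rowVector.viewPart(offset, length)).write(new DataOutputStream(bytes));
      put.add(FAMILY, Bytes.toBytes(offset / BLOCK_SIZE), bytes.toByteArray());
    }
    return put;
  }

  private BlockedBlobRowCodec() {}
}

A single-cell update then only rewrites the block containing the touched column, while scans still read a handful of large cells per row.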