Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair

2018-07-18 Thread Gokhan Capan
Congratulations, Andrew!

- G

> On Jul 18, 2018, at 22:30, Andrew Palumbo  wrote:
> 
> Please join me in congratulating Andrew Musselman as the new Chair of the
> Apache Mahout Project Management Committee. I would like to thank Andrew for
> stepping up; all of us who have worked with him over the years know his
> dedication to the project to be invaluable. I look forward to Andrew taking
> the project into the future.
> 
> Thank you,
> 
> Andy


Re: Welcome Anand Avati

2015-04-22 Thread Gokhan Capan
Welcome Anand!

Sent from my iPhone

 On Apr 22, 2015, at 20:47, Dmitriy Lyubimov dlie...@gmail.com wrote:

 congrats and thank you!

 -d

 On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman 
 andrew.mussel...@gmail.com wrote:

 Welcome to the team Anand; thanks for your contributions!

 On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati av...@gluster.org wrote:

 Thank you Suneel, I am thrilled to join the team!

 I am a relative newbie to data mining and machine learning. I currently
 work at Red Hat, but have joined grad school (in machine learning)
 starting
 this fall.

 I look forward to continuing my contributions, and thank you once again
 for
 the opportunity.

 Anand

 On Wed, Apr 22, 2015, 08:08 Suneel Marthi smar...@apache.org wrote:

 In recognition of the contributions of Anand Avati to the Mahout
 project
 over the past year, the PMC is pleased to announce that he has accepted
 our
 invitation to join the Mahout project as a committer.

 As is customary, I will leave it to Anand to provide a little bit of
 background about himself.

 Congratulations and Welcome!

 -Suneel Marthi
 On Behalf of Mahout PMC



Re: TF-IDF, seq2sparse and DataFrame support

2015-03-24 Thread Gokhan Capan
Andrew,

Maybe make the class tag evident in the mapBlock calls, i.e.:
val tfIdfMatrix = tfMatrix.mapBlock(..){
...idf transformation, etc...
  }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])

Best,
Gokhan

On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo ap@outlook.com wrote:


 This (last commit on this branch) should be the beginning of a workaround
 for the problem of reading and returning a Generic-Writable keyed Drm:

 https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

 However the keyClassTag of the DrmLike returned by the mapBlock() calls,
 and finally by the method itself, is somehow converted to Object.  I'm not
 exactly sure why this is happening.  I think that the implicit evidence is
 being dropped in the mapBlock call on a CheckpointedDrm cast to [Object].
 Maybe calling it out of the scope of this method (breaking down the
 method) would fix it.


  val tfMatrix = drmMetadata.keyClassTag match {

    case ct if ct == ClassTag.Int => {
      (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
        (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
    }
    case ct if ct == ClassTag(classOf[String]) => {
      (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
        (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
    }
    case ct if ct == ClassTag.Long => {
      (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
        (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
    }
    case _ => {
      (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
        (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
    }
  }

  tfMatrix.checkpoint()

  // make sure that the classtag of the tf matrix matches the metadata keyClassTag
  assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag)  // -- passes here with e.g. String keys

  val tfIdfMatrix = tfMatrix.mapBlock(..){
    ...idf transformation, etc...
  }

  assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag)  // -- fails here for all key types, with tfIdfMatrix.keyClassTag as Object


 I'll keep looking at it a bit.  If anybody has any ideas please let me
 know.







 On 03/09/2015 02:12 PM, Gokhan Capan wrote:

 So, here is a sketch of a Spark implementation of seq2sparse, returning a
 (matrix:DrmLike, dictionary:Map):

 https://github.com/gcapan/mahout/tree/seq2sparse

 Although it should be possible, I couldn't manage to make it process
 non-integer document ids. Any fix would be appreciated. There is a simple
 test attached, but I think there is more to do in terms of handling all
 parameters of the original seq2sparse implementation.

  I put it directly into the SparkEngine --- not that I think this object is
  the most appropriate placeholder, it just seemed convenient to me.

 Best


 Gokhan

  On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel p...@occamsmachete.com  wrote:

  IndexedDataset might suffice until real DataFrames come along.

  On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov dlie...@gmail.com  wrote:

 Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
 byproduct of it IIRC. matrix definitely not a structure to hold those.

  On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo ap@outlook.com wrote:

  On 02/04/2015 11:13 AM, Pat Ferrel wrote:

   Andrew, not sure what you mean about storing strings. If you mean
  something like a DRM of tokens, that is a DataFrame with row=doc, column =
  token. A one-row DataFrame is a slightly heavyweight string/document. A
  DataFrame with token counts would be perfect for input to TF-IDF, no? It
  would be a vector that maintains the tokens as ids for the counts, right?

   Yes- DataFrames will be perfect for this.  The problem that I was
  referring to was that we don't have a DSL data structure to do the
  initial distributed tokenizing of the documents [1] line:257, [2]. For
  this I believe we would need something like a distributed vector of Strings
  that could be broadcast to a mapBlock closure and then tokenized from there.
  Even there, mapBlock may not be perfect for this, but some of the new
  distributed functions that Gokhan is working on may be.

   I agree seq2sparse type input is a strong feature. Text files into an
  all-documents DataFrame basically. Collocation?

   As far as collocations, I believe the n-grams are computed and counted
  in the CollocDriver [3] (I might be wrong here... it's been a while since I
  looked at the code...). Either way, I don't think I ever looked too closely
  and I was a bit fuzzy on this...

  These were just some thoughts that I had when briefly looking at porting
  seq2sparse to the DSL before. Obviously we don't have to follow this
  algorithm but it's a nice starting point.

 [1]https

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-10 Thread Gokhan Capan
Some answers:

- Non-integer document ids:
The implementation does not use operations defined for DrmLike[Int] only, so
the row keys do not have to be Ints. I just couldn't manage to create the
returned DrmLike with the correct key type. While wrapping into a DrmLike, I
tried to pass the key class using the HDFS utils the way they are used in
drmDfsRead, but I somehow wasn't successful. So non-int document ids are not
an actual issue here (a sketch of that header-based route follows after these
answers).

- Breaking the implementation out to smaller pieces: Let's just collect the
requirements and adjust the implementation accordingly. I honestly didn't
think very much about where the implementation fits in, architecturally,
and what pieces are of public interest.
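
For reference, a rough sketch of the header-based route mentioned in the first
answer above. It assumes the spark-bindings helpers referenced in this thread
(Hadoop1HDFSUtil, DrmMetadata, drmWrap) and the tfVectors/numCols values from
the code posted earlier; the exact helper names and locations may differ from
the branch:

    import scala.reflect.ClassTag
    import org.apache.mahout.math.drm.CacheHint
    import org.apache.mahout.sparkbindings._

    // read the DRM header to learn the key Writable class, as drmDfsRead does
    val drmMetadata = Hadoop1HDFSUtil.readDrmHeader(path)

    // supply the recovered key ClassTag as the evidence when wrapping the RDD,
    // so the resulting DRM carries Int/Long/String keys instead of Object
    val tfMatrix = drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)(
      drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])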

Best

Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi suneel.mar...@gmail.com
wrote:

 AP, How is ur impl different from Gokhan's?

 On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo ap@outlook.com wrote:

   BTW, I'm not sure o.a.m.nlp is the best package name for either; I was
   using it because o.a.m.vectorizer, which is probably a better name, had
   conflicts in mrlegacy.
 
 
  On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
 
 
  I meant would o.a.m.nlp in the spark module be a good place for Gokhan's
  seq2sparse implementation to live.
 
  On 03/09/2015 09:07 PM, Pat Ferrel wrote:
 
  Does o.a.m.nlp  in the spark module seem like a good place for this to
  live?
 
  I think you meant math-scala?
 
  Actually we should rename math to core
 
 
  On Mar 9, 2015, at 3:15 PM, Andrew Palumbo ap@outlook.com wrote:
 
  Cool- This is great! I think this is really important to have in.
 
  +1 to a pull request for comments.
 
   I have PR #75 (https://github.com/apache/mahout/pull/75) open - It has
   very simple TF and TFIDF classes based on Lucene's IDF calculation and
   MLlib's. I just got a bad flu and haven't had a chance to push it.  It
   creates an o.a.m.nlp package in mahout-math. I will push that as soon
  as I
   can in case you want to use them.
 
  Does o.a.m.nlp  in the spark module seem like a good place for this to
  live?
 
   Those classes may be of use to you- they're very simple and are
  intended
   for new document vectorization once the legacy deps are removed from
  the
   spark module.  They also might make interoperability easier.
 
   One thought, not having been able to look at this too closely yet:

    //do we need do calculate df-vector?

   1.  We do need a document frequency map or vector to be able to
   calculate the IDF terms when vectorizing a new document outside of the
   original corpus.
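
For what it's worth, a minimal sketch of deriving such a document-frequency
vector from the tf DRM with the existing DSL. It assumes the standard Samsara
imports and the scalabindings functional assignment; documentFrequencies is a
hypothetical helper name, not something in the branch:

    import scala.reflect.ClassTag
    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // df(j) = number of documents in which term j occurs at least once
    def documentFrequencies[K: ClassTag](drmTf: DrmLike[K]): Vector =
      drmTf.mapBlock() { case (keys, block) =>
        // binarize the in-core block: a term either occurs in a document or not
        block := ((r, c, v) => if (v > 0) 1.0 else 0.0)
        keys -> block
      }.colSums()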


 
 
 
  On 03/09/2015 05:10 PM, Pat Ferrel wrote:
 
   Ah, you are doing all the Lucene analyzer, n-grams and other tokenizing,
   nice.
 
  On Mar 9, 2015, at 2:07 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
   Ah, I found the right button in GitHub, no PR necessary.
 
  On Mar 9, 2015, at 1:55 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
  If you create a PR it’s easier to see what was changed.
 
   Wouldn’t it be better to read in files from a directory, assigning
   doc-id = filename and term-ids = terms, or are there still Hadoop
  pipeline
   tools that are needed to create the sequence files? This sort of
  mimics the
   way Spark reads SchemaRDDs from JSON files.
 
   BTW this can also be done with a new reader trait on the
   IndexedDataset. It will give you two bidirectional maps (BiMap) and a
   DrmLike[Int]. One BiMap maps any String <-> Int for rows, the other
  does
   the same for columns (text tokens). This would be a few lines of code
  since
   the string mapping and DRM creation are already written. The only
  thing to
   do would be to map the doc/row ids to filenames. This allows you to take
  the
   non-int doc ids out of the DRM and replace them with a map. Not based
  on a
   Spark DataFrame yet, but probably will be.
 
  On Mar 9, 2015, at 11:12 AM, Gokhan Capan gkhn...@gmail.com wrote:
 
  So, here is a sketch of a Spark implementation of seq2sparse,
 returning
  a
  (matrix:DrmLike, dictionary:Map):
 
  https://github.com/gcapan/mahout/tree/seq2sparse
 
  Although it should be possible, I couldn't manage to make it process
  non-integer document ids. Any fix would be appreciated. There is a
  simple
  test attached, but I think there is more to do in terms of handling
 all
  parameters of the original seq2sparse implementation.
 
   I put it directly into the SparkEngine --- not that I think this object
   is the most appropriate placeholder, it just seemed convenient to me.
 
  Best
 
 
  Gokhan
 
  On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel p...@occamsmachete.com
  wrote:
 
   IndexedDataset might suffice until real DataFrames come along.
 
  On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
  Dealing with dictionaries is inevitably DataFrame for seq2sparse. It
  is a
  byproduct of it IIRC. matrix definitely not a structure to hold
 those.
 
  On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo ap@outlook.com
  wrote:
 
   On 02/04/2015 11:13 AM, Pat Ferrel

Re: TF-IDF, seq2sparse and DataFrame support

2015-03-09 Thread Gokhan Capan
So, here is a sketch of a Spark implementation of seq2sparse, returning a
(matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process
non-integer document ids. Any fix would be appreciated. There is a simple
test attached, but I think there is more to do in terms of handling all
parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is
the most appropriate placeholder, it just seemed convenient to me.

Best


Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel p...@occamsmachete.com wrote:

 IndexedDataset might suffice until real DataFrames come along.

 On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a
 byproduct of it IIRC. matrix definitely not a structure to hold those.

 On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo ap@outlook.com wrote:

 
  On 02/04/2015 11:13 AM, Pat Ferrel wrote:
 
   Andrew, not sure what you mean about storing strings. If you mean
   something like a DRM of tokens, that is a DataFrame with row=doc,
   column = token. A one-row DataFrame is a slightly heavyweight
   string/document. A DataFrame with token counts would be perfect for
   input to TF-IDF, no? It would be a vector that maintains the tokens as
   ids for the counts, right?
 
 
   Yes- DataFrames will be perfect for this.  The problem that I was
   referring to was that we don't have a DSL data structure to do the
   initial distributed tokenizing of the documents [1] line:257, [2]. For this
   I believe we would need something like a distributed vector of Strings that
   could be broadcast to a mapBlock closure and then tokenized from there.
   Even there, mapBlock may not be perfect for this, but some of the new
   distributed functions that Gokhan is working on may be.
 
 
   I agree seq2sparse type input is a strong feature. Text files into an
   all-documents DataFrame basically. Collocation?

   As far as collocations, I believe the n-grams are computed and counted
   in the CollocDriver [3] (I might be wrong here... it's been a while since I
   looked at the code...). Either way, I don't think I ever looked too closely
   and I was a bit fuzzy on this...

   These were just some thoughts that I had when briefly looking at porting
   seq2sparse to the DSL before. Obviously we don't have to follow this
   algorithm but it's a nice starting point.
 
  [1] https://github.com/apache/mahout/blob/master/mrlegacy/
  src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
  .java
  [2] https://github.com/apache/mahout/blob/master/mrlegacy/
  src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
  [3]https://github.com/apache/mahout/blob/master/mrlegacy/
  src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
  java
 
 
 
  On Feb 4, 2015, at 7:47 AM, Andrew Palumbo ap@outlook.com wrote:
 
  Just copied over the relevant last few messages to keep the other thread
  on topic...
 
 
  On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
 
   I'd suggest to consider this: remember all this talk about
   language-integrated Spark QL being basically a dataframe manipulation
   DSL?

   So now Spark devs are noticing this generality as well and are actually
   proposing to rename SchemaRDD into DataFrame and make it a mainstream
   data structure. (My "told you so" moment of sorts.)

   What I am getting at: I'd suggest to make DRM and Spark's newly renamed
   DataFrame our two major structures. In particular, standardize on using
   DataFrame for things that may include non-numerical data and require more
   grace about column naming and manipulation. Maybe relevant to the TF-IDF
   work when it deals with non-matrix content.
 
   Sounds like a worthy effort to me.  We'd basically be implementing an
   API at the math-scala level for SchemaRDD/DataFrame data structures,
   correct?
 
  On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
  Seems like seq2sparse would be really easy to replace since it takes
 text
  files to start with, then the whole pipeline could be kept in rdds.
 The
  dictionaries and counts could be either in-memory maps or rdds for use
  with
  joins? This would get rid of sequence files completely from the
  pipeline.
  Item similarity uses in-memory maps but the plan is to make it more
  scalable using joins as an alternative with the same API allowing the
  user
  to trade-off footprint for speed.
 
   I think you're right- should be relatively easy.  I've been looking at
   porting seq2sparse to the DSL for a bit now and the stopper at the DSL
   level is that we don't have a distributed data structure for strings.
   Seems like getting a DataFrame implemented as Dmitriy mentioned above
   would take care of this problem.

   The other issue I'm a little fuzzy on is the distributed collocation
   mapping-  it's a part 

Re: Codebase refactoring proposal

2015-02-05 Thread Gokhan Capan
What I am saying is: for certain algorithms that include both
engine-specific stuff (such as aggregation) and DSL stuff, what is the best way
of handling them?

i) should we add the distributed operations to Mahout codebase as it is
proposed in #62?

ii) should we have [engine]-ml modules (like spark-bindings and
h2o-bindings) where we can mix the DSL and engine-specific stuff?

Picking i. has the advantage of writing an ML-algorithm once and then it
can be run on alternative engines, but it requires wrapping/duplicating
existing distributed operations.

Picking ii. has the advantage of avoiding writing distributed operations,
but since we're mixing the DSL and the engine-specific stuff, an
ML-algorithm written for an engine would not be available for the others.

I just wanted to hear some opinions.

Gokhan

On Thu, Feb 5, 2015 at 4:11 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

  I took it Gokhan had objections himself, based on his comments, if we are
  talking about #62.

  He also expressed concerns about computing GSGD but I suspect it can still
  be algebraically computed.

 On Wed, Feb 4, 2015 at 5:52 PM, Pat Ferrel p...@occamsmachete.com wrote:

   BTW Ted and Andrew have both expressed interest in the distributed
   aggregation stuff. It sounds like we are agreeing that non-algebra,
   computation-method type things can be engine specific.
 
  So does anyone have an objection to Gokhan pushing his PR?
 
  On Feb 4, 2015, at 2:20 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 
  On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo ap@outlook.com
 wrote:
 
  
  
  
    My thought was not to bring primitive engine-specific aggregators,
    combiners, etc. into math-scala.
  
 
  Yeah. +1. I would like to support that as an experiment, see where it
 goes.
  Clearly some distributed use cases are simple enough while also pervasive
  enough.
 
 



Re: TF-IDF, seq2sparse and DataFrame support

2015-02-04 Thread Gokhan Capan
I think I have a sketch of an implementation for creating a DRM from a
sequence file of (Int, Text) pairs, a.k.a. seq2sparse, using Spark.

Give me a couple of days and I will provide an initial implementation.
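
As a rough illustration of just the input step (reading the (Int, Text) corpus
with Spark; the dictionary building and vectorization are the real work and
are omitted here, and the names below are illustrative only):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // (docId, raw document text) pairs from the seq2sparse-style input
    def readCorpus(sc: SparkContext, path: String): RDD[(Int, String)] =
      sc.sequenceFile(path, classOf[IntWritable], classOf[Text])
        .map { case (k, v) => (k.get, v.toString) }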

Best

Gokhan

On Wed, Feb 4, 2015 at 7:16 PM, Andrew Palumbo ap@outlook.com wrote:


 On 02/04/2015 11:13 AM, Pat Ferrel wrote:

  Andrew, not sure what you mean about storing strings. If you mean
  something like a DRM of tokens, that is a DataFrame with row=doc, column =
  token. A one-row DataFrame is a slightly heavyweight string/document. A
  DataFrame with token counts would be perfect for input to TF-IDF, no? It would
  be a vector that maintains the tokens as ids for the counts, right?


  Yes- DataFrames will be perfect for this.  The problem that I was
  referring to was that we don't have a DSL data structure to do the
  initial distributed tokenizing of the documents [1] line:257, [2]. For this
  I believe we would need something like a distributed vector of Strings that
  could be broadcast to a mapBlock closure and then tokenized from there.
  Even there, mapBlock may not be perfect for this, but some of the new
  distributed functions that Gokhan is working on may be.


  I agree seq2sparse type input is a strong feature. Text files into an
  all-documents DataFrame basically. Collocation?

  As far as collocations, I believe the n-grams are computed and counted
  in the CollocDriver [3] (I might be wrong here... it's been a while since I
  looked at the code...). Either way, I don't think I ever looked too closely
  and I was a bit fuzzy on this...

  These were just some thoughts that I had when briefly looking at porting
  seq2sparse to the DSL before. Obviously we don't have to follow this
  algorithm but it's a nice starting point.

 [1] https://github.com/apache/mahout/blob/master/mrlegacy/
 src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles
 .java
 [2] https://github.com/apache/mahout/blob/master/mrlegacy/
 src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
 [3]https://github.com/apache/mahout/blob/master/mrlegacy/
 src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.
 java



 On Feb 4, 2015, at 7:47 AM, Andrew Palumbo ap@outlook.com wrote:

 Just copied over the relevant last few messages to keep the other thread
 on topic...


 On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

  I'd suggest to consider this: remember all this talk about
  language-integrated Spark QL being basically a dataframe manipulation DSL?

  So now Spark devs are noticing this generality as well and are actually
  proposing to rename SchemaRDD into DataFrame and make it a mainstream data
  structure. (My "told you so" moment of sorts.)

  What I am getting at: I'd suggest to make DRM and Spark's newly renamed
  DataFrame our two major structures. In particular, standardize on using
  DataFrame for things that may include non-numerical data and require more
  grace about column naming and manipulation. Maybe relevant to the TF-IDF work
  when it deals with non-matrix content.

  Sounds like a worthy effort to me.  We'd basically be implementing an API
  at the math-scala level for SchemaRDD/DataFrame data structures, correct?

 On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Seems like seq2sparse would be really easy to replace since it takes text
 files to start with, then the whole pipeline could be kept in rdds. The
 dictionaries and counts could be either in-memory maps or rdds for use
 with
 joins? This would get rid of sequence files completely from the
 pipeline.
 Item similarity uses in-memory maps but the plan is to make it more
 scalable using joins as an alternative with the same API allowing the
 user
 to trade-off footprint for speed.

  I think you're right- should be relatively easy.  I've been looking at
  porting seq2sparse to the DSL for a bit now and the stopper at the DSL level
  is that we don't have a distributed data structure for strings. Seems like
  getting a DataFrame implemented as Dmitriy mentioned above would take care
  of this problem.

  The other issue I'm a little fuzzy on is the distributed collocation
  mapping-  it's a part of the seq2sparse code that I've not spent too much
  time in.

  I think that this would be a very worthy effort as well-  I believe
  seq2sparse is a particularly strong Mahout feature.

 I'll start another thread since we're now way off topic from the
 refactoring proposal.

 My use for TF-IDF is for row similarity and would take a DRM (actually
 IndexedDataset) and calculate row/doc similarities. It works now but only
 using LLR. This is OK when thinking of the items as tags or metadata but
 for text tokens something like cosine may be better.

 I’d imagine a downsampling phase that would precede TF-IDF using LLR a lot
 like how CF preferences are downsampled. This would produce a sparsified
 all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
 terms before row similarity uses cosine. This is not so 

Re: Code quality questions

2015-01-24 Thread Gokhan Capan
+1 for favoring native scala types.

I think in terms of Scala code, we need a clear style standards definition
to adhere to.


Gokhan

On Fri, Jan 23, 2015 at 9:38 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 in TextDelimitedReaderWriter.scala:

 ===
  val itemList:
 collection.mutable.MutableList[org.apache.mahout.common.Pair[Integer,
 Double]] = new
 collection.mutable.MutableList[org.apache.mahout.common.Pair[Integer,
 Double]]
 for (ve <- itemVector.nonZeroes) {
   val item: org.apache.mahout.common.Pair[Integer, Double] = new
 org.apache.mahout.common.Pair[Integer, Double](ve.index, ve.get)
   itemList += item
 }
 

 (1) Why does the Scala code attempt to use o.a.m.common.Pair? What was wrong
 with the native tuple type of Scala? (I am trying to clean out mrlegacy
 dependencies from the spark module.)

 (2) Why is it so horribly styled (even for me)? Comments are misaligned, and
 the lines routinely exceed 120 characters.

 Can these problems please be addressed? In particular, stuff like
 o.a.m.common.Pair? And why was it even signed off on in the first place by
 committers despite clear style violations?

 thank you.



[jira] [Created] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks

2014-11-15 Thread Gokhan Capan (JIRA)
Gokhan Capan created MAHOUT-1626:


 Summary: Support for required quasi-algebraic operations and 
starting with aggregating rows/blocks
 Key: MAHOUT-1626
 URL: https://issues.apache.org/jira/browse/MAHOUT-1626
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 1.0
Reporter: Gokhan Capan
 Fix For: 1.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions

2014-11-15 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan resolved MAHOUT-1616.
--
Resolution: Fixed

 Better support for hadoop dependencies of multiple versions 
 

 Key: MAHOUT-1616
 URL: https://issues.apache.org/jira/browse/MAHOUT-1616
 Project: Mahout
  Issue Type: Improvement
  Components: build
Reporter: Gokhan Capan
Assignee: Gokhan Capan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: SGD Implementation and Questions for mapBlock like functionality

2014-11-13 Thread Gokhan Capan
Awesome.

So we are going to implement certain required DistributedOperations in a
separate trait, similar to, but distinct from, the DistributedEngine.

I'll think about this a little more, and propose an initial implementation
that hopefully we can agree on.
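
For discussion, a minimal sketch of what such a trait could look like; the
name and the exact operation set are purely illustrative, not an agreed API:

    import org.apache.mahout.math.Matrix
    import org.apache.mahout.math.drm.DrmLike

    trait DistributedOperations {

      // all-reduce style aggregation over the in-core blocks of a DRM;
      // each backend (Spark, H2O, ...) would supply its own implementation,
      // e.g. Spark could delegate to rdd.aggregate
      def aggregateBlocks[K, U](drm: DrmLike[K])(zero: U)
                               (seqOp: (U, Matrix) => U,
                                combOp: (U, U) => U): U
    }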

Best,

Gokhan

On Thu, Nov 13, 2014 at 1:35 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 On Wed, Nov 12, 2014 at 1:44 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

 
 
  On Wed, Nov 12, 2014 at 1:27 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
  My only concern is to add certain loss minimization tools for people to
  write machine learning algorithms.
 
  mapBlock as you suggested can work equally, but I happened to have
  implemented the aggregate op while thinking.
 
  Apart from this SGD implementation,
  blockify-a-matrix-and-run-an-operation-in-parallel-on-blocks is, I
  believe,
  certainly required, since block level parallelization is really common
 in
  matrix computations. Plus, if we are to add, say, a descriptive
 statistics
  package, that would require a similar functionality, too.
 
   If mapBlock were more flexible for passing custom operators, I'd be more
   than happy, but I understand the idea behind its requirement that the
   mapping should be block-to-block with the same row size.
 
  Could you give a little more detail on the 'common distributed strategy'
  idea?
 
 
 the idea is simple.

 First, do not use logical plan construction. In practice it means that while
 say A.%*%(B) creates a logical plan element (which is subsequently run
 through the optimizer), something like aggregate(..) does not do that. Instead, it
 just produces ... whatever it produces, directly. So it doesn't form any
 new logical or physical plan.

 Second, it means that we can define an internal strategy trait, something like
 DistributedOperations, which will include this set of operations.
 Subsequently, we will define native implementations of this trait in the
 same way we defined some native stuff for the DistributedEngine trait. (But
 don't make it part of the DistributedEngine trait please -- maybe an attribute
 perhaps.) At run time we will have to ask the current engine to provide the
 distributed operation implementation and delegate execution of common
 fragments to it.



Re: SGD Implementation and Questions for mapBlock like functionality

2014-11-12 Thread Gokhan Capan
My only concern is to add certain loss minimization tools for people to
write machine learning algorithms.

mapBlock as you suggested can work equally, but I happened to have
implemented the aggregate op while thinking.

Apart from this SGD implementation,
blockify-a-matrix-and-run-an-operation-in-parallel-on-blocks is, I believe,
certainly required, since block level parallelization is really common in
matrix computations. Plus, if we are to add, say, a descriptive statistics
package, that would require a similar functionality, too.

If mapBlock were more flexible for passing custom operators, I'd be more
than happy, but I understand the idea behind its requirement that the
mapping should be block-to-block with the same row size.

Could you give a little more detail on the 'common distributed strategy'
idea?


Aside: Do we have elementwise math functions in the Matrix DSL? That
is, how can I do this?

1 + exp(drmA)
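
(There is no distributed elementwise exp in the DSL as of this thread; a
hedged workaround sketch via mapBlock and the in-core function library,
assuming the standard Samsara imports and an illustrative helper name:)

    import scala.reflect.ClassTag
    import org.apache.mahout.math.drm.DrmLike
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.function.Functions

    // 1 + exp(drmA), element by element, block by block
    def onePlusExp[K: ClassTag](drmA: DrmLike[K]): DrmLike[K] =
      drmA.mapBlock() { case (keys, block) =>
        // exp every cell of the in-core block, then add 1
        block.assign(Functions.EXP).assign(Functions.plus(1.0))
        keys -> block
      }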




Gokhan

On Wed, Nov 12, 2014 at 7:55 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Yes, I usually follow #2 too.

 The thing is, pretty often an algorithm can define its own set of strategies
 the backend needs to support (like this distributedEngine strategy) and keep
 a lot of logic still common across all strategies. But then if an all-reduce
 aggregate operation is incredibly common among such algorithm-specific
 strategies, then it stands to reason to implement it only once.

 I have an idea.

 Maybe we need a common distributed strategy which is different from
 algebraic optimizer. That way we don't have to mess with algebraic
 rewrites. how about that?

 On Wed, Nov 12, 2014 at 9:12 AM, Pat Ferrel p...@occamsmachete.com wrote:

  So you are following #2, which is good. #1 seems a bit like a hack. For a
  long time to come we will have to add things to the DSL if it is to be
 kept
  engine independent. Yours looks pretty general and simple.
 
  Are you familiar with the existing Mahout aggregate methods? They show up
  in the SGDHelper.java and other places in legacy code. I don’t know much
  about them but they seem to be a pre-functional programming attempt at
 this
  kind of thing. It looks like you are proposing a replacement for those
  based on rdd.aggregate, if so that would be very useful. For one thing it
  looks like the old aggregate was not parallel, rdd.aggregate is.
 
 
  On Nov 11, 2014, at 1:18 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
  So the alternatives are:
 
  1- mapBlock to a matrix whose all rows-but-the first are empty, then
  aggregate
  2- depend on a backend
 
  1 is obviously OK.
 
  I don't like the idea of depending on a backend since SGD is a generic
 loss
  minimization, on which other algorithms will possibly depend.
 
  In this context, client-side aggregation is not an overhead, but even if
 it
  happens to be so, it doesn't have to be a client-side aggregate at all.
 
  Alternative to 1, I am thinking of at least having an aggregation
  operation, which will return an accumulated value anyway, and shouldn't
  affect algebra optimizations.
 
  I quickly implemented a naive one (supporting only Spark- I know I said
  that I don't like depending on a backend, but at least the backends-wide
  interface is consistent, and as a client, I still don't have to deal with
  Spark primitives directly).
 
  Is this nice enough? Is it too bad to have in the DSL?
  https://github.com/gcapan/mahout/compare/accumulateblocks
 
  Best
 
  Gokhan
 
  On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
 
   Oh. algorithm actually collects the vectors and runs another cycle in
 the
   client!
  
   Still, technically, you can collect almost-empty blocks to the client
   (since they are mostly empty, it won't cause THAT huge overhead
 compared
  to
   collecting single vectors, after all, how many partitions are we
 talking
   about? 1000? ).
  
   On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
  
  
   On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan gkhn...@gmail.com
  wrote:
  
   Hi,
  
   Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
   http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
   implement SGD, and a regularized least squares solution for linear
   regression (can easily be extended to other GLMs, too).
  
   How the algorithm works is as follows:
   1. Split data into partitions of T examples
   2. in parallel, for each partition:
 2.0. shuffle partition
 2.1. initialize parameter vector
 2.2. for each example in the shuffled partition
 2.2.1 update the parameter vector
   3. Aggregate all the parameter vectors and return
  
  
   I guess technically it is possible (transform each block to a
   SparseRowMatrix or SparseMatrix with only first valid row) and then
  invoke
   colSums() or colMeans() (whatever aggregate means).
  
    However, I am not sure it is worth the ugliness. Isn't it easier to
    declare these things quasi-algebraic and just

Re: SGD Implementation and Questions for mapBlock like functionality

2014-11-12 Thread Gokhan Capan
Ted,

Can we easily integrate t-digest for descriptive statistics once we have block
aggregates? This might count as one more reason.

Gokhan

On Thu, Nov 13, 2014 at 12:04 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Wed, Nov 12, 2014 at 9:53 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  once we start mapping aggregate, there's no reason not to
  map other engine specific capabilities, which are vast. At this point
  dilemma is, no matter what we do we are losing coherency: if we map it
 all,
  then other engines will have trouble supporting all of it. If we don't
 map
  it all, then we are forcing capability reduction compared to what the
  engine actually can do.
 
  It is obvious to me that all-reduce aggregate will make a lot of sense --
  even if it means math checkpoint. but then where do we stop in mapping
  those. E.g. do we do fold? cartesian? And what is that true reason we are
  remapping everything if it is already natively available? etc. etc. For
  myself, I still haven't figured a good answer to those .
 

 Actually, I disagree with the premise here.

 There *is* a reason not to map all other engine specific capabilities.
 That reason is we don't need them.  Yet.

 So far, we *clearly* need some sort of block aggregate for a host of
 hog-wild sorts of algorithms.  That doesn't imply that we need all kinds of
 mapping aggregates.  It just means that we are clear on one need for now.

 So let's get this one in and see how far we can go.

 Also, having one kind of aggregation in the DSL does not restrict anyone
 from using engine specific capabilities.  It just means that one kind of
 idiom can be done without engine specificity.



Re: SGD Implementation and Questions for mapBlock like functionality

2014-11-11 Thread Gokhan Capan
So the alternatives are:

1- mapBlock to a matrix in which all rows but the first are empty, then
aggregate
2- depend on a backend

1 is obviously OK.

I don't like the idea of depending on a backend since SGD is a generic loss
minimization, on which other algorithms will possibly depend.

In this context, client-side aggregation is not an overhead, but even if it
happens to be so, it doesn't have to be a client-side aggregate at all.

Alternative to 1, I am thinking of at least having an aggregation
operation, which will return an accumulated value anyway, and shouldn't
affect algebra optimizations.

I quickly implemented a naive one (supporting only Spark- I know I said
that I don't like depending on a backend, but at least the backends-wide
interface is consistent, and as a client, I still don't have to deal with
Spark primitives directly).

Is this nice enough? Is it too bad to have in the DSL?
https://github.com/gcapan/mahout/compare/accumulateblocks

Best

Gokhan

On Tue, Nov 11, 2014 at 10:45 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

 Oh. algorithm actually collects the vectors and runs another cycle in the
 client!

 Still, technically, you can collect almost-empty blocks to the client
 (since they are mostly empty, it won't cause THAT huge overhead compared to
 collecting single vectors, after all, how many partitions are we talking
 about? 1000? ).

 On Tue, Nov 11, 2014 at 12:41 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:



 On Sat, Nov 8, 2014 at 12:42 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Hi,

 Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
 http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
 implement SGD, and a regularized least squares solution for linear
 regression (can easily be extended to other GLMs, too).

 How the algorithm works is as follows:
 1. Split data into partitions of T examples
 2. in parallel, for each partition:
2.0. shuffle partition
2.1. initialize parameter vector
2.2. for each example in the shuffled partition
2.2.1 update the parameter vector
 3. Aggregate all the parameter vectors and return


 I guess technically it is possible (transform each block to a
 SparseRowMatrix or SparseMatrix with only first valid row) and then invoke
 colSums() or colMeans() (whatever aggregate means).

  However, I am not sure it is worth the ugliness. Isn't it easier to
  declare these things quasi-algebraic and just do direct Spark calls on the
  matrix RDD (map, aggregate)?

 The real danger is to introduce non-algebra things into algebra so that
 the rest of the algebra doesn't optimize any more.





Re: SGD Implementation and Questions for mapBlock like functionality

2014-11-10 Thread Gokhan Capan
Well, in that specific case, I will accumulate on the client side; the
collection of the intermediate parameters is not that big (numBlocks x
X.ncol). What I need is just to map (keys, block) to a vector (currently,
mapBlock has to map the block to a new block).

From a general perspective, you are right, this is an accumulation.

Gokhan

On Mon, Nov 10, 2014 at 8:26 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Do you need a reduce or could you use an accumulator? Neither is really
 supported in the DSL but clearly these are required for certain algos.
 Broadcast vals are supported but are read-only.

 On Nov 8, 2014, at 12:42 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Hi,

 Based on Zinkevich et al.'s Parallelized Stochastic Gradient paper (
 http://martin.zinkevich.org/publications/nips2010.pdf), I tried to
 implement SGD, and a regularized least squares solution for linear
 regression (can easily be extended to other GLMs, too).

 How the algorithm works is as follows:
 1. Split data into partitions of T examples
 2. in parallel, for each partition:
   2.0. shuffle partition
   2.1. initialize parameter vector
   2.2. for each example in the shuffled partition
   2.2.1 update the parameter vector
 3. Aggregate all the parameter vectors and return

 Here is an initial implementation to illustrate where I am stuck:
 https://github.com/gcapan/mahout/compare/optimization

 (See TODO in SGD.minimizeWithSgd[K])

 I was thinking that, using a blockified matrix of training instances, step 2
 of the algorithm can run on blocks, and they can be aggregated on the
 client side. However, the only operator that I know in the DSL is mapBlock,
 and it requires the BlockMapFunction to map a block to another block of the
 same row size. In this context, I want to map a block (numRows x n) to the
 parameter vector of size n.

 The question is:
 1- Is it possible to easily implement the above algorithm using DSL's
 current functionality? Could you tell me what I'm missing?
 2- If there is not an easy way other than using the currently-non-existing
 mapBlock-like method, shall we add such an operator?
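
For reference, a minimal sketch of the first alternative discussed in this
thread: keep mapBlock's same-row-count contract by emitting a block whose only
non-empty row carries the per-partition parameter vector, then collapse with
colSums. It assumes the standard Samsara imports; sgdPassOverBlock is a
hypothetical stand-in for the real local SGD pass:

    import scala.reflect.ClassTag
    import org.apache.mahout.math.{Matrix, SparseRowMatrix, Vector}
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // hypothetical local SGD pass over one in-core block; column means stand
    // in here for a real per-partition parameter estimate
    def sgdPassOverBlock(block: Matrix): Vector = block.colMeans()

    // step 2 runs per block; summing columns collapses the almost-empty
    // blocks into the sum of per-partition parameter vectors (divide by the
    // number of blocks afterwards to average, i.e. step 3)
    def parameterVectorSum[K: ClassTag](drmX: DrmLike[K]): Vector =
      drmX.mapBlock() { case (keys, block) =>
        val theta = sgdPassOverBlock(block)
        val out = new SparseRowMatrix(block.nrow, block.ncol)
        out(0, ::) := theta // only row 0 is non-zero
        keys -> out
      }.colSums()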

 Best,

 Gokhan




[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-10-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154918#comment-14154918
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Jay,

This is integrated in trunk, not in 0.9, and should work. Also, you can find 
MAHOUT-1616 useful for a recent simplification and further improvement effort.

Best

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: https://mahout.apache.org/developers/buildingmahout.html

2014-10-01 Thread Gokhan Capan
By the way, I tried to simplify and improve things a bit here: MAHOUT-1616

Sent from my iPhone

 On Oct 1, 2014, at 15:26, Suneel Marthi suneel.mar...@gmail.com wrote:

 Mahout 0.9 doesn't support Hadoop 2.x and was built with Hadoop 1.2.1, and 
 hence the runtime errors you are seeing. The present codebase (unreleased) supports 
 Hadoop 2.x.

 Sent from my iPhone

 On Oct 1, 2014, at 8:14 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 I believe that the POM assumes the particular versions listed are version 2,
 and all others 1.

 Inspection of the top-level pom would provide the most authoritative answer.

 On Wed, Oct 1, 2014 at 7:08 AM, jay vyas jayunit100.apa...@gmail.com
 wrote:

 hi mahout:

 Can we use any Hadoop version to build Mahout, e.g. 2.4.1?
 It seems like if you give it a garbage Hadoop version, e.g. 2.3.4.5, it
 still builds, yet
 at runtime it is clear that the version built is a 1.x version.

 thanks !

 FYI this is in relation to BIGTOP-1470, where we are just getting ready
 for our 0.8 release, so any feedback would be much appreciated!

 --
 jay vyas



[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-10-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14154937#comment-14154937
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Jay, here is the documentation:

http://mahout.apache.org/developers/buildingmahout.html

And the instructions apply to trunk, not to the 0.9 release

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-10-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14155309#comment-14155309
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Correct

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1616) Better support for hadoop dependencies of multiple versions

2014-09-26 Thread Gokhan Capan (JIRA)
Gokhan Capan created MAHOUT-1616:


 Summary: Better support for hadoop dependencies of multiple 
versions 
 Key: MAHOUT-1616
 URL: https://issues.apache.org/jira/browse/MAHOUT-1616
 Project: Mahout
  Issue Type: Improvement
  Components: build
Reporter: Gokhan Capan
Assignee: Gokhan Capan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Upgrade to spark 1.0.x

2014-08-08 Thread Gokhan Capan
+1 to merging spark-1.0.x to master

Sent from my iPhone

 On Aug 8, 2014, at 22:06, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Current master is still at Spark 0.9.x. MAHOUT-1603 (PR #40) is making a
 number of valuable tweaks to enable Spark 1.0.x (and Spark SQL code, by
 extension; I did a quick test, and SQL seems to work for my simple tests in
 the Mahout environment).

 This squashed PR is pushed to the apache/mahout branch spark-1.0.x rather than
 master. Whenever (if) folks are ready, I can merge it to master.

 Alternative approach would be to maintain both 1.0.x and 0.9.x branches for
 some time. I don't see it as valuable as the costs would likely overrun any
 benefit here, but if anyone still clings to spark 0.9.x dependency, please
 let me know in this thread.

 thanks.
 -d


Re: standardizing minimal Matrix I/O capability

2014-08-04 Thread Gokhan Capan
Pat,

I was thinking of something like:
https://github.com/gcapan/mahout/compare/cellin

It's just an example of where I believe new input formats should go (the
example is to input a DRM from a text file of row_id,col_id,value lines).
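
For illustration, a rough sketch of what such a reader could do on Spark
(names are illustrative only; it assumes a SparkContext, a known column count,
and the spark-bindings drmWrap helper):

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.SparkContext

    // build a DRM from "row_id,col_id,value" text lines
    def drmFromTextCells(sc: SparkContext, path: String, ncol: Int) = {
      val rows = sc.textFile(path)
        .map { line =>
          val Array(r, c, v) = line.split(",")
          (r.toInt, (c.toInt, v.toDouble))
        }
        .groupByKey()
        .map { case (rowId, cells) =>
          val vec: Vector = new RandomAccessSparseVector(ncol)
          cells.foreach { case (col, value) => vec.setQuick(col, value) }
          rowId -> vec
        }
      drmWrap(rdd = rows, ncol = ncol)
    }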

Best


Gokhan


On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Some work on this is being done as part of MAHOUT-1568, which is currently
 very early and in https://github.com/apache/mahout/pull/36

 The idea there only covers text-delimited files and proposes a standard
 DRM-ish format but supports a configurable schema. Default is:

 rowID<tab>itemID1:value1<space>itemID2:value2…

 The IDs can be mahout keys of any type since they are written as text or
 they can be application specific IDs meaningful in a particular usage, like
 a user ID hash, or SKU from a catalog, or URL.

 As far as dataframe-ish requirements, it seems to me there are two
 different things needed. The dataframe is needed while performing an
 algorithm or calculation and is kept in distributed data structures. There
 probably won’t be a lot of files kept around with the new engines. Any text
 files can be used for pipelines in a pinch but generally would be for
 import/export. Therefore MAHOUT-1568 concentrates on import/export not
 dataframes, though it could use them when they are ready.


 On Jul 30, 2014, at 7:53 AM, Gokhan Capan notificati...@github.com
 wrote:

 I believe the next step should be standardizing minimal Matrix I/O
 capability (i.e. a couple of file formats other than [row_id, VectorWritable]
 SequenceFiles) required for a distributed computation engine, and adding
 data-frame-like structures that allow text columns.





[jira] [Resolved] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-07-15 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan resolved MAHOUT-1565.
--

Resolution: Fixed

 add MR2 options to MAHOUT_OPTS in bin/mahout
 

 Key: MAHOUT-1565
 URL: https://issues.apache.org/jira/browse/MAHOUT-1565
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0, 0.9
Reporter: Nishkam Ravi
Assignee: Gokhan Capan
 Fix For: 1.0

 Attachments: MAHOUT-1565.patch


 MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
 those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-07-15 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan reassigned MAHOUT-1565:


Assignee: Gokhan Capan

 add MR2 options to MAHOUT_OPTS in bin/mahout
 

 Key: MAHOUT-1565
 URL: https://issues.apache.org/jira/browse/MAHOUT-1565
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0, 0.9
Reporter: Nishkam Ravi
Assignee: Gokhan Capan
 Fix For: 1.0

 Attachments: MAHOUT-1565.patch


 MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
 those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-07-15 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14062041#comment-14062041
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

Sorry guys, I committed this 2 weeks ago, but I forgot to close the issue. 
Thank you, [~nravi]

 add MR2 options to MAHOUT_OPTS in bin/mahout
 

 Key: MAHOUT-1565
 URL: https://issues.apache.org/jira/browse/MAHOUT-1565
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0, 0.9
Reporter: Nishkam Ravi
 Fix For: 1.0

 Attachments: MAHOUT-1565.patch


 MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
 those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: H2O integration - completion and review

2014-07-11 Thread Gokhan Capan
I'll write longer, but in general, +1 to Anand

Sent from my iPhone

 On Jul 11, 2014, at 20:54, Anand Avati av...@gluster.org wrote:

 On Fri, Jul 11, 2014 at 9:29 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Duplicated from a comment on the PR:

 Beyond these details (specific merge issues)  I have a bigger problem with
 merging this. Now every time the DSL is changed it may break things in h2o
 specific code. Merging this would require every committer who might touch
 the DSL to sign up for fixing any broken tests on both engines.

 To solve this the entire data prep pipeline must be virtualized to run on
 either engine so the tests for things like CF and ItemSimilarity or matrix
 factorization (and the multitude of others to come) pass and are engine
 independent. As it stands any DSL change that breaks the build will have to
 rely on a contributor's fix. Even if one of you guys was made a committer
 we will have this problem where a needed change breaks one or the other
 engine specific code. Unless 99% of the entire pipeline is engine neutral
 the build will be unmaintainable.

 For instance I am making a small DSL change that is required for
 cooccurrence and ItemSimilarity to work. This would break ItemSimilarity
 and its tests, which are in the spark module but since I’m working on that
 I can fix everything. If someone working on an h2o specific thing had to
 change the DSL in a way that broke spark code like ItemSimilarity you might
 not be able to fix it and I certainly do not want to fix stuff in h2o
 specific code when I change the DSL. I have a hard enough time keeping mine
 running :-)

 The way I interpret the above points, the problem you are trying to
 highlight is with having multiple backends in general, and not this backend
 in particular? Hypothetically, even if this backend is abandoned for the
 above problems, as more backends get added in the future, the same
 problems will continue to apply to all of them.


 Crudely speaking this means doing away with all references to a
 SparkContext and any use of it. So it's not just a matter of reproducing
 the spark module but reducing the need for one. Making it so small that
 breakages in one or the other engines code will be infrequent and changes
 to neutral code will only rarely break an engine that the committer is
 unfamiliar with.

 I think things are already very close to this ideal situation you
 describe above. As a pipeline implementor we should just use
 DistributedContext, and not SparkContext. And we need an engine neutral way
 to get hold of a DistributedContext from within the math-scala module, like
 this pseudocode:

  import org.apache.mahout.math.drm._

  val dc = DistributedContextCreate(System.getenv("MAHOUT_BACKEND"),
                                    System.getenv("BACKEND_ID"), opts...)

 If environment variables are not set, DistributedContextCreate could
 default to Spark and local. But all of the pipeline code should ideally
 exist outside any engine specific module.



 I raised this red flag a long time ago but in the heat of other issues it
 got lost. I don't think this can be ignored anymore.

 The only missing piece I think is having a DistributedContextCreate() call
 such as above? I don't think things are in such a dire state really.. Am I
 missing something?


 I would propose that we should remain two separate projects with a mostly
 shared DSL until the maintainability issues are resolved. This seems way too
 early to merge.

 Call me an optimist, but I was hoping for more of a "let's work together now
 to make the DSL abstractions easier for future contributors." I will explore
 such a DistributedContextCreate() method in math-scala. That might also be
 the answer for test cases to remain in math-scala.

 Thanks


Re: TF-IDF vector persistence with normalization enabled

2014-06-03 Thread Gokhan Capan
That post implies that in order to have tf-idf vectors persisted, you need
those options set in the tf vector creation phase.

Or you can always run the driver directly and easily, preferably from
Mahout's command line, i.e. bin/mahout seq2sparse

Gokhan


On Tue, Jun 3, 2014 at 9:37 AM, David Noel david.i.n...@gmail.com wrote:

 I made an observation similar to what was pointed out in this mailing
 list post here:
 http://comments.gmane.org/gmane.comp.apache.mahout.user/17819; that
 TF-IDF vectors do not seem to persist when generating them with
 normalization enabled.

 According to Gokhan Capan:

 It seems to have tf-idf vectors later, you need to create tf vectors
 (DictionaryVectorizer.createTermFrequencyVectors) with logNormalize option
 set to false, and normPower option set to -1.0f.

 Is there some reason for this? It would seem useful if they persisted.
 Can someone explain the reasoning behind them not? I figure there's a
 perfectly good reason, I just can't seem to figure out what it is.



[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016372#comment-14016372
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Seems like the dependencies are correctly set. Are you certain that the cluster 
you're running Mahout against is a Hadoop 2 and MR2 cluster?

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016378#comment-14016378
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

We agree, conceptually, but this needs some further testing.

 add MR2 options to MAHOUT_OPTS in bin/mahout
 

 Key: MAHOUT-1565
 URL: https://issues.apache.org/jira/browse/MAHOUT-1565
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0, 0.9
Reporter: Nishkam Ravi
 Fix For: 1.0

 Attachments: MAHOUT-1565.patch


 MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
 those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016565#comment-14016565
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Brian,
This was actually well tested. But I'm going to build and test it again, probably 
tomorrow.
By the way, can you run
{{$ find . -name hadoop*.jar}}

after building Mahout, in the Mahout root directory?
Best

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14016998#comment-14016998
 ] 

Gokhan Capan commented on MAHOUT-1529:
--

Alright, I'm sold.

 Finalize abstraction of distributed logical plans from backend operations
 -

 Key: MAHOUT-1529
 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
 Project: Mahout
  Issue Type: Improvement
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 We have a few situations when algorithm-facing API has Spark dependencies 
 creeping in. 
 In particular, we know of the following cases:
 -(1) checkpoint() accepts Spark constant StorageLevel directly;-
 -(2) certain things in CheckpointedDRM;-
 -(3) drmParallelize etc. routines in the drm and sparkbindings package.-
 -(5) drmBroadcast returns a Spark-specific Broadcast object-
 (6) Stratosphere/Flink conceptual api changes.
 *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
 need new PR for remaining things once ready.
 *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan commented on MAHOUT-1529:
--

[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood based recommendation. This could be a new persistent storage 
engineered for locality preservation of kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we could want to add different types of distributed 
matrices with engine (or data structure) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (For example Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching cache it, if partitioned in the same way do this else do this, if one 
matrix is small broadcast it etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
execution in parallel behavior, a Caching trait with methods for 
caching/uncaching behavior, in the future a RandomAccess trait with methods for 
accessing rows and columns (and possibly cells) functionality. 

Then a concrete DRM (like) would be a Matrix with BatchOps and possibly 
CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, 
and so on. What do you think and if you and others are positive, how do you 
think that should be handled?
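
For illustration, here is a minimal Scala sketch of how such traits could be
layered (every name and signature below is made up for this comment, none of it
exists in the codebase):

{code}
import org.apache.mahout.math.Vector

// base abstraction: only the shape
trait Matrix {
  def nrows: Long
  def ncols: Int
}

// behaviors an engine (or data structure) author can mix in
trait BatchExecution { this: Matrix =>
  // partition-wise, parallel execution hook
  def mapPartitions[R](f: Matrix => R): Seq[R]
}

trait Caching { this: Matrix =>
  def cache(): this.type
  def uncache(): this.type
}

trait RandomAccess { this: Matrix =>
  def row(i: Long): Vector
  def column(j: Int): Vector
}

// e.g. a Spark-backed DRM would then be declared roughly as
// class SparkDrm(...) extends Matrix with BatchExecution with Caching
{code}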

 Finalize abstraction of distributed logical plans from backend operations
 -

 Key: MAHOUT-1529
 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
 Project: Mahout
  Issue Type: Improvement
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 We have a few situations when algorithm-facing API has Spark dependencies 
 creeping in. 
 In particular, we know of the following cases:
 -(1) checkpoint() accepts Spark constant StorageLevel directly;-
 -(2) certain things in CheckpointedDRM;-
 -(3) drmParallelize etc. routines in the drm and sparkbindings package.-
 -(5) drmBroadcast returns a Spark-specific Broadcast object-
 (6) Stratosphere/Flink conceptual api changes.
 *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
 need new PR for remaining things once ready.
 *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 2:55 PM:
--

[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood based recommendation. This could be a new persistent storage 
engineered for locality preservation of kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we could want to add different types of distributed 
matrices with engine (or data structure) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (For example Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching cache it, if partitioned in the same way do this else do this, if one 
matrix is small broadcast it etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
execution in parallel behavior, a Caching trait with methods for 
caching/uncaching behavior, in the future a RandomAccess trait with methods for 
accessing rows and columns (and possibly cells) functionality. 

Then a concrete DRM (like) would be a Matrix with BatchExecution and possibly 
Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and 
so on. What do you think and if you and others are positive, how do you think 
that should be handled?


was (Author: gokhancapan):
[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood based recommendation. This could be a new persistent storage 
engineered for locality preservation of kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we could want to add different types of distributed 
matrices with engine (or data structure) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (For example Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching cache it, if partitioned in the same way do this else do this, if one 
matrix is small broadcast it etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
execution in parallel behavior, a Caching trait with methods for 
caching/uncaching behavior, in the future a RandomAccess trait with methods for 
accessing rows and columns (and possibly cells) functionality. 

Then a concrete DRM (like) would be a Matrix with BatchOps and possibly 
CacheOps, a concrete RandomAccessMatrix would be a Matrix with RandomAccessOps, 
and so on. What do you think and if you and others are positive, how do you 
think that should be handled?

 Finalize abstraction of distributed logical plans from backend operations
 -

 Key: MAHOUT-1529
 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
 Project: Mahout
  Issue Type: Improvement
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 We have a few situations when algorithm-facing API has Spark dependencies 
 creeping in. 
 In particular, we know of the following cases:
 -(1) checkpoint() accepts Spark constant StorageLevel directly;-
 -(2) certain things in CheckpointedDRM;-
 -(3) drmParallelize etc. routines in the drm and sparkbindings package.-
 -(5) drmBroadcast returns a Spark-specific Broadcast object-
 (6) Stratosphere/Flink conceptual api changes.
 *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
 need new PR for remaining things once ready.
 *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014985#comment-14014985
 ] 

Gokhan Capan edited comment on MAHOUT-1529 at 6/1/14 3:03 PM:
--

[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for memory-based algorithms such 
as neighborhood based recommendation. This could be a new persistent storage 
engineered for locality preservation of kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we could want to add different types of distributed 
matrices with engine (or data structure) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (For example Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching cache it, if partitioned in the same way do this else do this, if one 
matrix is small broadcast it etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
execution in parallel behavior, a Caching trait with methods for 
caching/uncaching behavior, in the future a RandomAccess trait with methods for 
accessing rows and columns (and possibly cells) functionality. 

Then a concrete DRM (like) would be a Matrix with BatchExecution and possibly 
Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and 
so on. What do you think and if you and others are positive, how do you think 
that should be handled?


was (Author: gokhancapan):
[~dlyubimov], I imagine in the near future we will want to add a matrix 
implementation with fast row and column access for in-memory algorithms such as 
neighborhood based recommendation. This could be a new persistent storage 
engineered for locality preservation of kNN, the new Solr backend potentially 
cast to a Matrix, or something else. 

Anyway, my point is that we could want to add different types of distributed 
matrices with engine (or data structure) specific strengths in the future. I 
suggest turning each behavior (such as Caching) into an additional trait, which 
the distributed execution engine (or data structure) author can mix into her 
concrete implementation (For example Spark's matrix is one with Caching and 
Broadcasting). It might even help with easier logical planning (if it supports 
caching cache it, if partitioned in the same way do this else do this, if one 
matrix is small broadcast it etc.). 

So I suggest a base Matrix trait with nrows and ncols methods (as it 
currently is), a BatchExecution trait with methods for partitioning and 
execution in parallel behavior, a Caching trait with methods for 
caching/uncaching behavior, in the future a RandomAccess trait with methods for 
accessing rows and columns (and possibly cells) functionality. 

Then a concrete DRM (like) would be a Matrix with BatchExecution and possibly 
Caching, a concrete RandomAccessMatrix would be a Matrix with RandomAccess, and 
so on. What do you think and if you and others are positive, how do you think 
that should be handled?

 Finalize abstraction of distributed logical plans from backend operations
 -

 Key: MAHOUT-1529
 URL: https://issues.apache.org/jira/browse/MAHOUT-1529
 Project: Mahout
  Issue Type: Improvement
Reporter: Dmitriy Lyubimov
Assignee: Dmitriy Lyubimov
 Fix For: 1.0


 We have a few situations when algorithm-facing API has Spark dependencies 
 creeping in. 
 In particular, we know of the following cases:
 -(1) checkpoint() accepts Spark constant StorageLevel directly;-
 -(2) certain things in CheckpointedDRM;-
 -(3) drmParallelize etc. routines in the drm and sparkbindings package.-
 -(5) drmBroadcast returns a Spark-specific Broadcast object-
 (6) Stratosphere/Flink conceptual api changes.
 *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
 need new PR for remaining things once ready.
 *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-05-29 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012126#comment-14012126
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

I think there is no point in configuring output compression, number of 
reducers, etc. for Mahout.

 add MR2 options to MAHOUT_OPTS in bin/mahout
 

 Key: MAHOUT-1565
 URL: https://issues.apache.org/jira/browse/MAHOUT-1565
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0, 0.9
Reporter: Nishkam Ravi
 Attachments: MAHOUT-1565.patch


 MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
 those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1565) add MR2 options to MAHOUT_OPTS in bin/mahout

2014-05-29 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012140#comment-14012140
 ] 

Gokhan Capan commented on MAHOUT-1565:
--

Sorry, now I can read the patch properly. The MR1 versions of those 
configurations are already set in bin/mahout, and you're suggesting to add MR2 
versions of them, too, right?
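
For reference, a rough sketch of what adding the MR2 counterparts to MAHOUT_OPTS
would look like (the property names are the standard Hadoop 2 equivalents of the
MR1 ones; whether Mahout should set any of them is exactly the question):

{code}
MAHOUT_OPTS="$MAHOUT_OPTS -Dmapreduce.job.reduces=10 \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.map.output.compress=true"
{code}

(their MR1 counterparts being mapred.reduce.tasks, mapred.output.compress and
mapred.compress.map.output)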

I am personally not a fan of setting such configurations in Mahout, and I would 
remove them as well.

 add MR2 options to MAHOUT_OPTS in bin/mahout
 

 Key: MAHOUT-1565
 URL: https://issues.apache.org/jira/browse/MAHOUT-1565
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 1.0, 0.9
Reporter: Nishkam Ravi
 Attachments: MAHOUT-1565.patch


 MR2 options are missing in MAHOUT_OPTS in bin/mahout and bin/mahout.cmd. Add 
 those options.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Hadoop 2 support in a real release?

2014-05-23 Thread Gokhan Capan
My vote would be releasing mahout with hadoop1 and hadoop2 classifiers

Gokhan


On Fri, May 23, 2014 at 4:43 PM, Sebastian Schelter ssc.o...@googlemail.com
 wrote:

 Big +1
  On 23.05.2014 15:33, Ted Dunning ted.dunn...@gmail.com wrote:

  What do folks think about spinning out a new version of 0.9 that only
  changes which version of Hadoop the build uses?
 
  There have been quite a few questions lately on this topic.
 
  My suggestion would be that we use minor version numbering to maintain
 this
  and the normal 0.9 release simultaneously if we decide to do a bug fix
  release.
 
  Any thoughts?
 



[jira] [Assigned] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-22 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan reassigned MAHOUT-1534:


Assignee: Gokhan Capan

 Add documentation for using Mahout with Hadoop2 to the website
 --

 Key: MAHOUT-1534
 URL: https://issues.apache.org/jira/browse/MAHOUT-1534
 Project: Mahout
  Issue Type: Task
  Components: Documentation
Reporter: Sebastian Schelter
Assignee: Gokhan Capan
 Fix For: 1.0


 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
 We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-22 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005663#comment-14005663
 ] 

Gokhan Capan commented on MAHOUT-1534:
--

We might want to add the link to the Mahout News, but let's wait and see if the 
users could locate the page.

 Add documentation for using Mahout with Hadoop2 to the website
 --

 Key: MAHOUT-1534
 URL: https://issues.apache.org/jira/browse/MAHOUT-1534
 Project: Mahout
  Issue Type: Task
  Components: Documentation
Reporter: Sebastian Schelter
Assignee: Gokhan Capan
 Fix For: 1.0


 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
 We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-22 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan resolved MAHOUT-1534.
--

Resolution: Fixed

The instructions are now available on the BuildingMahout page: 
http://mahout.apache.org/developers/buildingmahout.html

 Add documentation for using Mahout with Hadoop2 to the website
 --

 Key: MAHOUT-1534
 URL: https://issues.apache.org/jira/browse/MAHOUT-1534
 Project: Mahout
  Issue Type: Task
  Components: Documentation
Reporter: Sebastian Schelter
Assignee: Gokhan Capan
 Fix For: 1.0


 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
 We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-05-22 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14005719#comment-14005719
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Please check http://mahout.apache.org/developers/buildingmahout.html for 
instructions to build mahout against hadoop-2

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Git Migration

2014-05-22 Thread Gokhan Capan
Works for me as well

Gokhan


On Thu, May 22, 2014 at 9:23 PM, Andrew Musselman 
andrew.mussel...@gmail.com wrote:

 Thanks; I just pushed successfully.


 On Thu, May 22, 2014 at 10:55 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  did you read Jake's email earlier at dev/infra discussion? he describes
 and
  makes references here.
 
  It is two-fold: first  we can push whatever commits to master of
  https://git-wip-us.apache.org/repos/asf?p=mahout.git
 
  However the other side of the coin is that significant commits should go
  thru pull requests directly to (if i understand it correctly)
 apache/mahout
  mirror on github. Such pull requests are managed thru commits to git-wp
 as
  well by specific messages (again, see references in Jake's email). My
  understanding is that github integration features are not yet enabled,
 only
  commits to master of git-wp-us.a.o are at this point.
 
  At this point I simply would like everyone to verify they can push
 commits
  to master branch of git-wp-us.a.o per instructions in INFRA- and
 report
  back there (I can push).
 
  I guess someone (perhaps me) will have to write the manual for working
 with
  github pull requests (mainly, merging them to git-wp-us.o.a and closing
  them).
 
 
  On Thu, May 22, 2014 at 10:47 AM, Andrew Musselman 
  andrew.mussel...@gmail.com wrote:
 
   What's the workflow to commit a change?  I'm totally in the dark about
   that.
  
  
   On Thu, May 22, 2014 at 10:14 AM, Dmitriy Lyubimov dlie...@gmail.com
   wrote:
  
Hi,
   
(1) git migration of the project is now complete. Any volunteers to
   verify
per INFRA-? If you do, please report back to the issue.
   
(2) Anybody knows what to do with jenkins now? i still don't have
  proper
privileges on it. thanks.
   
   
   
[1] https://issues.apache.org/jira/browse/INFRA-
   
  
 



Re: consensus statement?

2014-05-21 Thread Gokhan Capan
I want to express my opinions on the vision, too. I tried to capture those
words from various discussions on the dev list, and hope that most of them
support the common sense of excitement the new Mahout arouses.

To me, the fundamental benefit of the shift that Mahout is undergoing is a
better separation of the distributed execution engine, distributed data
structures, matrix computations, and algorithms layers, which will allow
the users/devs of Mahout with different roles to focus on the relevant parts
of the framework (a small code sketch follows the list below):

   1. A machine learning scientist, independent from the underlying
   distributed execution engine, can utilize the matrix language and the
   decompositions to implement new algorithms (which implies that the current
   distributed mahout algorithms are to be rewritten in the matrix language)
   2. A math-scala module contributor, for the benefit of higher level
   algorithms, can add new, or improve existing functions (the set of
   decompositions is an example) with optimization plans (such as if two
   matrices are partitioned in the same way, ...), where the concrete
   implementations of those optimizations are delegated to the distributed
   execution engine layer
   3. A distributed execution engine author can add machine learning
    capabilities to her platform with i) concrete Matrix and Matrix I/O
    implementation, ii) partitioning, checkpointing, broadcasting behaviors,
    iii) BLAS
   4. A Mahout user with access to a cluster operated by a
   Mahout-supporting distributed execution engine can run machine learning
   algorithms implemented on top of the matrix language
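
As the sketch mentioned above, here is a rough illustration of (1): an
algorithm fragment written only against the matrix DSL and the decompositions,
with no engine-specific imports (package paths and the in-core solve helper are
assumed from the current drm/scalabindings layout and may change as things
settle):

  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._
  import org.apache.mahout.math.scalabindings._
  import org.apache.mahout.math.scalabindings.RLikeOps._

  // ordinary least squares via the normal equations, engine-agnostic:
  // the optimizer and whichever backend is bound decide how to run it
  def ols(drmX: DrmLike[Int], drmY: DrmLike[Int]) = {
    val xtx = (drmX.t %*% drmX).collect          // small in-core matrix
    val xty = (drmX.t %*% drmY).collect(::, 0)   // in-core column vector
    solve(xtx, xty)                              // in-core solve, assumed helper
  }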

Best

Gokhan


On Tue, May 20, 2014 at 8:30 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 inline


 On Tue, May 20, 2014 at 12:42 AM, Sebastian Schelter s...@apache.org
 wrote:

 
 
  Let's take the text from our homepage as a starting point. What should we
  add/remove/modify?
 
  
  
  The Mahout community decided to move its codebase onto modern data
  processing systems that offer a richer programming model and more
 efficient
  execution than Hadoop MapReduce. Mahout will therefore reject new
 MapReduce
  algorithm implementations from now on. We will however keep our widely
 used
  MapReduce algorithms in the codebase and maintain them.
 
  We are building our future implementations on top of a

 Scala

  DSL for linear algebraic operations which has been developed over the
 last
  months. Programs written in this DSL are automatically optimized and
  executed in parallel for Apache Spark.

 More platforms to be added in the future.

 
  Furthermore, there is an experimental contribution underway which aims
  to integrate the h2o platform into Mahout.
  
  
 



[jira] [Commented] (MAHOUT-1534) Add documentation for using Mahout with Hadoop2 to the website

2014-05-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14004662#comment-14004662
 ] 

Gokhan Capan commented on MAHOUT-1534:
--

[~ssc] I added the directions to the BuildingMahout page. If you're happy with 
the staged version, I'll Publish Site

 Add documentation for using Mahout with Hadoop2 to the website
 --

 Key: MAHOUT-1534
 URL: https://issues.apache.org/jira/browse/MAHOUT-1534
 Project: Mahout
  Issue Type: Task
  Components: Documentation
Reporter: Sebastian Schelter
 Fix For: 1.0


 MAHOUT-1329 describes how to build the current trunk for usage with Hadoop 2. 
 We should have a page on the website describing this for our users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: VOTE: moving commits to git-wp.o.a github PR features.

2014-05-17 Thread Gokhan Capan
+1

Sent from my iPhone

 On May 16, 2014, at 21:38, Dmitriy Lyubimov dlie...@gmail.com wrote:

 Hi,

 I would like to initiate a procedural vote moving to git as our primary
 commit system, and using github PRs as described in Jake Farrel's email to
 @dev [1]

 [1]
 https://blogs.apache.org/infra/entry/improved_integration_between_apache_and

 If voting succeeds, i will file a ticket with infra to commence necessary
 changes and to move our project to git-wp as primary source for commits as
 well as add github integration features [1]. (I assume pure git commits
 will be required after that's done, with no svn commits allowed).

 The motivation is to engage GIT and github PR features as described, and
 avoid git mirror history messes like we've seen associated with authors.txt
 file fluctations.

 PMC and committers have binding votes, so please vote. Lazy consensus with
 minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
 for weekend (i.e. Tuesday afternoon PST) .

 here is my +1

 -d


[jira] [Commented] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2

2014-05-15 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996351#comment-13996351
 ] 

Gokhan Capan commented on MAHOUT-1550:
--

Paul,

Did you try build mahout using hadoop 2 profile first? The way to do it is:
mvn clean package -DskipTests=true -Dhadoop2.version=YOUR_HADOOP_VERSION

Let us know if this fails

 Naive Bayes training fails with Hadoop 2
 

 Key: MAHOUT-1550
 URL: https://issues.apache.org/jira/browse/MAHOUT-1550
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2
Reporter: Paul Marret
Priority: Minor
  Labels: bayesian, training
 Attachments: mahout-snapshot.patch, stacktrace.txt

   Original Estimate: 0h
  Remaining Estimate: 0h

 When using the trainnb option of the program, we get the following error:
 Exception in thread main java.lang.IncompatibleClassChangeError: Found 
 interface org.apache.hadoop.mapreduce.JobContext, but class was expected
 at 
 org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
 at 
 org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
 at 
 org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100)
 [...]
 It is possible to correct this by modifying the file 
 mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and 
 converting the instance job (line 174) to a Job object (it is a JobContext in 
 the current version).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1550) Naive Bayes training fails with Hadoop 2

2014-05-13 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13996351#comment-13996351
 ] 

Gokhan Capan edited comment on MAHOUT-1550 at 5/13/14 1:10 PM:
---

Paul,

Did you try building mahout using hadoop 2 profile first? The way to do it is:
mvn clean package -DskipTests=true -Dhadoop2.version=YOUR_HADOOP_VERSION

Let us know if this fails


was (Author: gokhancapan):
Paul,

Did you try build mahout using hadoop 2 profile first? The way to do it is:
mvn clean package -DskipTests=true -Dhadoop2.version=YOUR_HADOOP_VERSION

Let us know if this fails

 Naive Bayes training fails with Hadoop 2
 

 Key: MAHOUT-1550
 URL: https://issues.apache.org/jira/browse/MAHOUT-1550
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 1.0
 Environment: Ubuntu - Mahout 1.0-SNAPSHOT - Hadoop 2
Reporter: Paul Marret
Priority: Minor
  Labels: bayesian, training
 Attachments: mahout-snapshot.patch, stacktrace.txt

   Original Estimate: 0h
  Remaining Estimate: 0h

 When using the trainnb option of the program, we get the following error:
 Exception in thread main java.lang.IncompatibleClassChangeError: Found 
 interface org.apache.hadoop.mapreduce.JobContext, but class was expected
 at 
 org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
 at 
 org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
 at 
 org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:100)
 [...]
 It is possible to correct this by modifying the file 
 mrlegacy/src/main/java/org/apache/mahout/common/HadoopUtil.java and 
 converting the instance job (line 174) to a Job object (it is a JobContext in 
 the current version).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968148#comment-13968148
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Well I can add this, but considering the current status of the project, I think 
this is no longer in people's interest.
What do you say [~ssc], should we 'won't fix' it or commit?

 GSOC 2013: Improve Lucene support in Mahout
 ---

 Key: MAHOUT-1178
 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
Assignee: Gokhan Capan
  Labels: gsoc2013, mentor
 Fix For: 1.0

 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch


 [via Ted Dunning]
 It should be possible to view a Lucene index as a matrix.  This would
 require that we standardize on a way to convert documents to rows.  There
 are many choices, the discussion of which should be deferred to the actual
 work on the project, but there are a few obvious constraints:
 a) it should be possible to get the same result as dumping the term vectors
 for each document each to a line and converting that result using standard
 Mahout methods.
 b) numeric fields ought to work somehow.
 c) if there are multiple text fields that ought to work sensibly as well.
  Two options include dumping multiple matrices or to convert the fields
 into a single row of a single matrix.
 d) it should be possible to refer back from a row of the matrix to find the
 correct document.  THis might be because we remember the Lucene doc number
 or because a field is named as holding a unique id.
 e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968221#comment-13968221
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

I personally like the idea of integrating additional storage layers as matrix 
inputs, but not like the implementation I did here.
After agreeing on the new algorithm layers, we can later move to the 
additional input formats.

So my vote also is for Won't Fix

 GSOC 2013: Improve Lucene support in Mahout
 ---

 Key: MAHOUT-1178
 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
Assignee: Gokhan Capan
  Labels: gsoc2013, mentor
 Fix For: 1.0

 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch


 [via Ted Dunning]
 It should be possible to view a Lucene index as a matrix.  This would
 require that we standardize on a way to convert documents to rows.  There
 are many choices, the discussion of which should be deferred to the actual
 work on the project, but there are a few obvious constraints:
 a) it should be possible to get the same result as dumping the term vectors
 for each document each to a line and converting that result using standard
 Mahout methods.
 b) numeric fields ought to work somehow.
 c) if there are multiple text fields that ought to work sensibly as well.
  Two options include dumping multiple matrices or to convert the fields
 into a single row of a single matrix.
 d) it should be possible to refer back from a row of the matrix to find the
 correct document.  THis might be because we remember the Lucene doc number
 or because a field is named as holding a unique id.
 e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-04-14 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968254#comment-13968254
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

The thing is it just 'loads' a Lucene index in memory as a matrix. You 
construct a matrix with the Lucene index directory location and that's it. So 
it is not a fix for the incremental document management issue.

The alternative approach is querying the index when a row/column vector, or 
cell is required. I, however, am not sure if the SolrMatrix thing is fast 
enough for that.
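
To make the contrast concrete, the two options would look roughly like this
(class and method names below are hypothetical; they exist neither in the patch
nor in Mahout):

{code}
import org.apache.lucene.index.IndexReader
import org.apache.mahout.math.Vector

// (a) eager, as described above: read the whole index into memory once,
// e.g. something like
//   val m = LuceneIndexMatrix.load("/path/to/index", "text")   // hypothetical

// (b) lazy: keep the IndexReader open and answer requests by querying on demand
class QueryBackedMatrix(reader: IndexReader, field: String) {
  def viewRow(docId: Int): Vector = ???      // term vector of one document
  def viewColumn(termId: Int): Vector = ???  // postings of one term as a vector
}
{code}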

I haven't been available lately, and now I'm reading through the changes in and 
proposals for Mahout's future, and trying to set up my perspective for Mahout2. 
We probably can come up with a better way of document storage (still 
Lucene/Solr based). Let me leave this as is now, and then we can discuss the 
input formats further.

Is that OK for you?

 GSOC 2013: Improve Lucene support in Mahout
 ---

 Key: MAHOUT-1178
 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
Assignee: Gokhan Capan
  Labels: gsoc2013, mentor
 Fix For: 1.0

 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch


 [via Ted Dunning]
 It should be possible to view a Lucene index as a matrix.  This would
 require that we standardize on a way to convert documents to rows.  There
 are many choices, the discussion of which should be deferred to the actual
 work on the project, but there are a few obvious constraints:
 a) it should be possible to get the same result as dumping the term vectors
 for each document each to a line and converting that result using standard
 Mahout methods.
 b) numeric fields ought to work somehow.
 c) if there are multiple text fields that ought to work sensibly as well.
  Two options include dumping multiple matrices or to convert the fields
 into a single row of a single matrix.
 d) it should be possible to refer back from a row of the matrix to find the
 correct document.  THis might be because we remember the Lucene doc number
 or because a field is named as holding a unique id.
 e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-03-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13918159#comment-13918159
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Let me get the pieces together and submit a patch in a few days.

 GSOC 2013: Improve Lucene support in Mahout
 ---

 Key: MAHOUT-1178
 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
Assignee: Gokhan Capan
  Labels: gsoc2013, mentor
 Fix For: 1.0

 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch


 [via Ted Dunning]
 It should be possible to view a Lucene index as a matrix.  This would
 require that we standardize on a way to convert documents to rows.  There
 are many choices, the discussion of which should be deferred to the actual
 work on the project, but there are a few obvious constraints:
 a) it should be possible to get the same result as dumping the term vectors
 for each document each to a line and converting that result using standard
 Mahout methods.
 b) numeric fields ought to work somehow.
 c) if there are multiple text fields that ought to work sensibly as well.
  Two options include dumping multiple matrices or to convert the fields
 into a single row of a single matrix.
 d) it should be possible to refer back from a row of the matrix to find the
 correct document.  THis might be because we remember the Lucene doc number
 or because a field is named as holding a unique id.
 e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-27 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13914494#comment-13914494
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Sure I can.

Although my vote would be passing the version: considering the different 
distributions out there, people may want to build mahout against whatever 
hadoop2 distro they use (I am not very sure about my own argument, actually; it 
would be great to hear a counter-argument).

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3-additional.patch, 1329-3.patch, 
 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-25 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1329:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-25 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13911436#comment-13911436
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

I committed this to trunk

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907480#comment-13907480
 ] 

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:52 AM:
---

Sergey, I modified your patch and produced a new version. Looking into the 
dependency tree, it seems it builds against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m so that the unit tests would not fail because of an OOM)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop2.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't have 
access to a hadoop test environment currently, so could you guys test if this 
actually works (I'll do it tomorrow anyway)? 

Then we can commit it.


was (Author: gokhancapan):
Sergey, I modified your patch and produced a new version. Looking into the 
dependency tree, it seems it builds against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m so that the unit tests would not fail because of an OOM)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't have 
access to a hadoop test environment currently, so could you guys test if this 
actually works (I'll do it tomorrow anyway)? 

Then we can commit it.

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908126#comment-13908126
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Comment Edited] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908126#comment-13908126
 ] 

Gokhan Capan edited comment on MAHOUT-1329 at 2/21/14 9:59 AM:
---

Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster? [EDIT: Sorry, I missed that you 
mentioned you ran the examples, great then]



was (Author: gokhancapan):
Yeah, you're right, edit coming.

Did you manage to run jobs against the cluster?

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-21 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13908443#comment-13908443
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Good news that I tried that too, on a 2.2.0 cluster.
seqdir, seq2sparse, and kmeans worked without a problem.

I'm gonna wait till Monday to commit this, in case folks want to verify that it 
works.



 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907237#comment-13907237
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Hi Sergey, thank you for that, I am copying from MAHOUT-1354:

Gokhan: Looks like when the hadoop-2 profile is activated, this patch fails to 
apply the hadoop-2 related dependencies to the integration and examples modules, 
even though they both depend on core and core depends on hadoop-2. For 
me, moving hadoop dependencies to the root solved the problem, but I think we 
wouldn't want that since hadoop is not a common dependency for all modules of 
the project.

Ted: It is important to keep modules like mahout math free of the massive 
Hadoop dependency.

I think pushing dependencies to the root is not something that we desire, but 
let me look into this further.


 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Suneel Marthi
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1329:
-

Attachment: 1329-3.patch

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Suneel Marthi
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13907480#comment-13907480
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Sergey, I modified your patch and produced a new version. Looking into the 
dependency tree, it seems it builds against the correct hadoop version.

(This may seem irrelevant when looking at the patch, but I had to set argLine 
to -Xmx1024m so that the unit tests would not fail because of an OOM)

for hadoop version 1.2.1: mvn clean package
for hadoop version 2.2.0: mvn clean package -Dhadoop.version=2.2.0

I unit tested this for both versions and saw the tests pass, but I don't have 
access to a hadoop test environment currently, so could you guys test if this 
actually works (I'll do it tomorrow anyway)? 

Then we can commit it.

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Suneel Marthi
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (MAHOUT-1329) Mahout for hadoop 2

2014-02-20 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan reassigned MAHOUT-1329:


Assignee: Gokhan Capan  (was: Suneel Marthi)

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Gokhan Capan
  Labels: patch
 Fix For: 1.0

 Attachments: 1329-2.patch, 1329-3.patch, 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Mahout on Spark?

2014-02-19 Thread Gokhan Capan
I imagine Mahout offering an option to the users to select from
different execution engines (just like we currently do by giving M/R or
sequential options), and starting from Spark. I am not sure what changes are
needed in the codebase, though. Maybe following MLI (or alike) and
implementing some more stuff, such as common interfaces for iterating over
data (the M/R way and the Spark way).
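
Purely as a sketch of what such a common iteration interface could look like
(nothing like this exists; all names below are made up):

  // one pass-oriented trait that both execution styles can implement
  trait DataIterator[A] {
    def foreachPass(passes: Int)(update: A => Unit): Unit
  }

  // sequential / in-memory flavour
  class InMemoryIterator[A](data: Iterable[A]) extends DataIterator[A] {
    def foreachPass(passes: Int)(update: A => Unit): Unit =
      for (_ <- 1 to passes; a <- data) update(a)
  }

  // an M/R-backed flavour would turn each pass into a job,
  // a Spark-backed one would iterate over an RDD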

IMO, another effort might be porting pre-online machine learning (such as
transforming text into vectors based on the dictionary generated by
seq2sparse before), machine learning based on mini-batches, and streaming
summarization stuff in Mahout to Spark-Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.comwrote:

 PS I am moving along cost optimizer for spark-backed DRMs on some
 multiplicative pipelines that is capable of figuring different cost-based
 rewrites and R-Like DSL that mixes in-core and distributed matrix
 representations and blocks but it is painfully slow, i really only doing it
 like couple nights in a month. It does not look like i will be doing it on
 company time any time soon (and even if i did, the company doesn't seem to
 be inclined to contribute anything I do anything new on their time). It is
 all painfully slow, there's no direct funding for it anywhere with no
 string attached. That probably will be primary reason why Mahout would not
 be able to get much traction compared to university-based contributions.


 On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:

  Unfortunately methinks the prospects of something like Mahout/MLLib merge
  seem very unlikely due to vastly diverged approach to the basics of
 linear
  algebra (and other things). Just like one cannot grow single tree out of
  two trunks -- not easily, anyway.
 
  It is fairly easy to port (and subsequently beat) MLib at this point from
  collection of algorithms point of view. But IMO goal should be more
  MLI-like first, and port second. And be very careful with concepts.
  Something that i so far don't see happening with MLib. MLib seems to be
  old-style Mahout-like rush to become a collection of basic algorithms
  rather than coherent foundation. Admittedly, i havent looked very
 closely.
 
 
  On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org
 wrote:
 
  I'm also convinced that Spark is a superior platform for executing
  distributed ML algorithms. We've had a discussion about a change from
  Hadoop to another platform some time ago, but at that point in time it
 was
  not clear which of the upcoming dataflow processing systems (Spark,
  Hyracks, Stratosphere) would establish itself amongst the users. To me
 it
  seems pretty obvious that Spark made the race.
 
  I concur with Ted, it would be great to have the communities work
  together. I know that at least 4 mahout committers (including me) are
  already following Spark's mailinglist and actively participating in the
  discussions.
 
  What are the ideas how a fruitful cooperation look like?
 
  Best,
  Sebastian
 
  PS:
 
  I ported LLR-based cooccurrence analysis (aka item-based recommendation)
  to Spark some time ago, but I haven't had time to test my code on a
 large
  dataset yet. I'd be happy to see someone help with that.
 
 
 
 
 
 
  On 02/19/2014 08:04 AM, Nick Pentreath wrote:
 
  I know the Spark/Mllib devs can occasionally be quite set in ways of
  doing certain things, but we'd welcome as many Mahout devs as possible
 to
  work together.
 
 
  It may be too late, but perhaps a GSoC project to look at a port of
 some
  stuff like co occurrence recommender and streaming k-means?
 
 
 
 
  N
  --
  Sent from Mailbox for iPhone
 
  On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath 
  nick.pentre...@gmail.comwrote:
 
  My (admittedly heavily biased) view is Spark is a superior platform
  overall
  for ML. If the two communities can work together to leverage the
  strengths
  of Spark, and the large amount of good stuff in Mahout (as well as
 the
  fantastic depth of experience of Mahout devs) I think a lot can be
  achieved!
 
   It makes a lot of sense that Spark would be better than Hadoop for
 ML
  purposes given that Hadoop was intended to do web-crawl kinds of
 things
  and
  Spark was intentionally built to support machine learning.
  Given that Spark has been announced by a majority of the Hadoop-based
  distribution vendors, it makes sense that maybe Mahout should jump in.
  I really would prefer it if the two communities (MLib/MLI and Mahout)
  could
  work more closely together.  There is a lot of good to be had on both
  sides.
 
 
 
 



[jira] [Commented] (MAHOUT-1329) Mahout for hadoop 2

2014-02-19 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13906062#comment-13906062
 ] 

Gokhan Capan commented on MAHOUT-1329:
--

Is it OK to add hadoop dependencies to the project root, and to the math module 
(actually to all modules, even though they already depend on the core module)?

I remember that's what we wanted to avoid

 Mahout for hadoop 2
 ---

 Key: MAHOUT-1329
 URL: https://issues.apache.org/jira/browse/MAHOUT-1329
 Project: Mahout
  Issue Type: Task
  Components: build
Affects Versions: 0.9
Reporter: Sergey Svinarchuk
Assignee: Suneel Marthi
  Labels: patch
 Fix For: 1.0

 Attachments: 1329.patch


 Update mahout for work with hadoop 2.X, targeting this for Mahout 1.0.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: MAHOUT 0.9 Release - New URL

2014-01-24 Thread Gokhan Capan
Using CentOS 6.5 and hadoop 1.2.1, all passed.

+1 from me

Gokhan


On Thu, Jan 23, 2014 at 6:01 PM, Andrew Palumbo ap@outlook.com wrote:

 a),b),c),d) all passed on CentOS for me

  Date: Thu, 23 Jan 2014 13:43:06 +0200
  Subject: Re: MAHOUT 0.9 Release - New URL
  From: ssvinarc...@hortonworks.com
  To: dev@mahout.apache.org
 
  I did a), b), c), d) and all steps pass.
  +1
 
 
  On Thu, Jan 23, 2014 at 1:40 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
   +1 from me.
  
   On Jan 22, 2014, at 5:55 PM, Suneel Marthi suneel_mar...@yahoo.com
   wrote:
  
Fixed the issues that were reported this week and restored FP mining
   into the codebase.
   
Here's the URL for the final release in staging:-
   
  
 https://repository.apache.org/content/repositories/orgapachemahout-1003/org/apache/mahout/mahout-distribution/0.9/
   
The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc
   
   
a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please
 run
   through all the different options in each script.
   
Committers and PMC, need a minimum of 3 '+1' votes for the release
 to be
   finalized.
  
   
   Grant Ingersoll | @gsingers
   http://www.lucidworks.com
  
  
  
  
  
  
 
  --
  CONFIDENTIALITY NOTICE
  NOTICE: This message is intended for the use of the individual or entity
 to
  which it is addressed and may contain information that is confidential,
  privileged and exempt from disclosure under applicable law. If the reader
  of this message is not the intended recipient, you are hereby notified
 that
  any printing, copying, dissemination, distribution, disclosure or
  forwarding of this communication is strictly prohibited. If you have
  received this communication in error, please contact the sender
 immediately
  and delete it from your system. Thank You.




Re: Mahout 0.9 release

2013-12-20 Thread Gokhan Capan
+1 for 1.0.

This is more challenging than expected (the old hadoop 0.23 profile
support is misleading)

Sent from my iPhone

 On Dec 19, 2013, at 19:48, Andrew Musselman andrew.mussel...@gmail.com 
 wrote:

 +1


 On Thu, Dec 19, 2013 at 9:20 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 +1

 Sent from my iPhone

 On Dec 19, 2013, at 12:17 PM, Frank Scholten fr...@frankscholten.nl
 wrote:

 I am looking at M-1329 (Support for Hadoop 2.x) as we speak. This change
 requires quite some testing and I prefer to push this to 1.0. I am
 thinking
 of creating a unit test that starts miniclusters for each version and
 runs
 a job in them.




 On Thu, Dec 19, 2013 at 12:28 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

 There's M-1329 that covers this. Hopefully it should make it for 0.9

 Sent from my iPhone

 On Dec 18, 2013, at 6:20 PM, Isabel Drost-Fromm isa...@apache.org
 wrote:

 On Mon, 16 Dec 2013 23:16:36 +0200
 Gokhan Capan gkhn...@gmail.com wrote:

 M-1354 (Support for Hadoop 2.x) - Patch available.
 Gokhan, any updates on this.

 Nope, still couldn't make it work.


 Should we push that for 1.0 then (if this is shortly before completion
 and there's too much in 1.0 to push for a release early next year, I'd
 also be happy to have a smaller release between now and Berlin
 Buzzwords that includes the fix...).

 Isabel



Re: Mahout 0.9 release

2013-12-16 Thread Gokhan Capan
Gokhan


On Mon, Dec 16, 2013 at 11:08 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Its time to freeze trunk the this week, here's the status of JIRAs:-

 Suneel
 --
 M-1319 - Patch available, would appreciate if someone could review/test
 the patch before I commit to trunk.

 Pat
 -
 M-1288 Solr Recommender

 Pat, I see that you have the code in ur Github repo, could u create a
 patch that could be merged into Mahout trunk.

 Frank
 
 M-1364 (Upgrade to Lucene 4.6) - Patch available.
 Grant, do u have cycles to review this patch?


 Gokhan

 --

 M-1354 (Support for Hadoop 2.x) - Patch available.
 Gokhan, any updates on this.


Nope, still couldn't make it work.








 On Sunday, December 8, 2013 6:23 PM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:

 We need to freeze the trunk this coming week in preparation for 0.9
 release, below are the pending JIRAs:-

 Wiki (not a show stopper for 0.9)

 -
 M-1245, M-1304, M-1305, M-1307, M-1326


 Suneel
 ---
 M-1319 (i can work on this tomorrow)

 M-1265 (Multi Layer Perceptron) -


 Need to be merged into trunk, the code's available for review on
 ReviewBoard.
 It would help if another set of eyes reviewed the test cases (Isabel,
 Stevo.. ?)


 Pat

 
 M-1288 Solr Recommender
 (What's the status of this Pat, this needs to be in 0.9 Release.)

 Stevo
 ---
 M-1366 (this can be at time of 0.9 Release and has no impact on trunk)

 Frank
 
 M-1364 (Upgrade to Lucene 4.6) - Patch available.
   It would be nice to have this go in 0.9

 The patch worked for me Frank, I agree that this needs to be reviewed by
 someone who's more familiar with Lucene.

 Gokhan

 --

 M-1354 (Support for Hadoop 2.x) - Patch available.
 This is targeted for 1.0. The patch worked for me on Hadoop 1.2.1, it
 would be good if someone could try the patch on hadoop 2.x instance.


 Others
 --
 M-1371 - This was reported on @user and a patch was submitted. If we don't
 hear from the author within this week, this can be deferred to 1.0





 On Tuesday, December 3, 2013 8:13 PM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:

 JIRAs Update for 0.9 release:-

 Wiki - Isabel, Sebastian and other volunteers
 -
 M-1245, M-1304, M-1305, M-1307, M-1326

 Suneel
 ---
 M-1319
 M-1242 (Patch available to be committed to trunk)

 Pat
 ---
 M-1288 Solr Recommender

 Yexi, Suneel
 ---
 M-1265 - Multi Layer Perceptron

 Stevo, Isabel
 -
 M-1366

 Andrew
 --
 M-1030, M-1349

 Ted
 --
 M-1368 (Patch available to be committed to trunk)











 On Sunday, December 1, 2013 7:57 AM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:

 Open JIRAs for 0.9 release :-

 Wiki - Isabel, Sebastian and other volunteers
 -

 M-1245, M-1304, M-1305, M-1307, M-1326

 Suneel
 ---
 M-1319, M-1328

 Pat
 ---
 M-1288 Solr Recommender

 Sebastian, Peng
 
 M-1286

 Yexi, Suneel
 ---
 M-1265 - Multi Layer Perceptron
 Ted, do u have cycles to review this, the patch's up on Reviewboard.

 Stevo, Isabel
 -
 M-1366 - Please delete old releases from mirroring system
 M-1345 - Enable Randomized testing for all modules

 Andrew
 --
 M-1030

 Open Issues (any takers for these ???)
 
 M-1242
 M-1349






 On Friday, November 29, 2013 12:07 PM, Sebastian Schelter 
 ssc.o...@googlemail.com wrote:

 On 29.11.2013 17:59, Suneel Marthi wrote:
  Open JIRAs for 0.9:
 
  Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 -
 related to Wiki updates.
  Definitely appreciate more hands here to review/update the wiki
 
  M-1286 - Peng and
   Sebastian, no updates on this. Can this be included in 0.9?

 I will look into this over the weekend!


 
  M-1030 - Andrew Musselman
 
  M-1319, M-1328 -  Suneel
 
  M-1347 - Suneel, patch has been committed to trunk.
 
  M-1265 - I have been working with Yexi on this. Ted, would u have time
 to review this; the code's on Reviewboard.
 
  M-1288 - Solr Recommender, Pat Ferrel
 
  M-1345: Isabel, Frank. I think we are good on this patch. Isabel, could
 u commit this to trunk?
 
  M-1312: Stevo, could u look at this?
 
  M-1349: Any takers for this??
 
  Others: Spectral Kmeans clustering documentation (Shannon)
 
 
 
 
  On Thursday,
  November 28, 2013 10:38 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:
 
  Adding Mahout-1349 to the list of JIRAs .
 
 
 
 
 
  On Thursday, November 28, 2013 10:37 AM, Suneel Marthi 
 suneel_mar...@yahoo.com wrote:
 
  Update on Open JIRAs for 0.9:
 
  Mahout-1245, Mahout-1304, Mahout-1305, Mahout-1307, Mahout-1326 - all
 related to Wiki updates, please see Isabel's updates.
 
 
 M-1286 - Peng and
   Sebastian, we had
  talked about this during the last hangout. Can this be included in 0.9?
 
  M-1030- Andrew Musselman, 

[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-09 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842960#comment-13842960
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

It looks like when the hadoop-2 profile is activated, this patch fails to apply the 
hadoop-2 related dependencies to the integration and examples modules, even though 
they both depend on core and core depends on hadoop-2. For me, moving the hadoop 
dependencies to the root solved the problem, but I think we wouldn't want that, 
since hadoop is not a common dependency for all modules of the project. 

CC'ing [~frankscholten]

 Mahout Support for Hadoop 2 
 

 Key: MAHOUT-1354
 URL: https://issues.apache.org/jira/browse/MAHOUT-1354
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 1.0

 Attachments: MAHOUT-1354_initial.patch


 Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-09 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843226#comment-13843226
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Yeah, I agree

 Mahout Support for Hadoop 2 
 

 Key: MAHOUT-1354
 URL: https://issues.apache.org/jira/browse/MAHOUT-1354
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 1.0

 Attachments: MAHOUT-1354_initial.patch


 Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


Re: Welcome to Frank Scholten as new Mahout committer

2013-12-03 Thread Gokhan Capan
Congratulations, Frank!

Gokhan


On Tue, Dec 3, 2013 at 3:27 PM, Isabel Drost-Fromm isa...@apache.org wrote:


 Hi,

 this is to announce that the Project Management Committee (PMC) for Apache
 Mahout has asked Frank Scholten to become committer and we are pleased to
 announce that he has accepted.

 Being a committer enables easier contribution to the project since in
 addition
 to posting patches on JIRA it also gives write access to the code
 repository.
 That also means that now we have yet another person who can commit patches
 submitted by others to our repo *wink*

 Frank, you've been following the project for quite some time now -
 contributing
 valuable changes over and over again. I certainly look forward to working
 with you in the future. Welcome!


 Isabel





[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-03 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837933#comment-13837933
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Today I had some trouble with integration's transitive dependencies; let me 
dig further.

So this should still stay in the 1.0 queue.

 Mahout Support for Hadoop 2 
 

 Key: MAHOUT-1354
 URL: https://issues.apache.org/jira/browse/MAHOUT-1354
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 1.0


 Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-02 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836661#comment-13836661
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Do you think we should support hadoop-1 and hadoop-2 at the same time?

 Mahout Support for Hadoop 2 
 

 Key: MAHOUT-1354
 URL: https://issues.apache.org/jira/browse/MAHOUT-1354
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 1.0


 Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-02 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836953#comment-13836953
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Well, I tried something and want to share.

Based on:
In hadoop-2-stable, compatibility with hadoop-1 is preferred over compatibility 
with hadoop-2-alpha 
(http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html).
For example, the return type of ProgramDriver#driver(String) was void in hadoop-1 
(which we use in MahoutDriver), int in hadoop-2-alpha, and void again in 
hadoop-2-stable. It seems that if we select the right artifacts, there is nothing 
to worry about regarding compatibility. 

My conclusion was:
The current hadoop-0.20 and hadoop-0.23 profiles can be utilized: we can rename 
them to hadoop-1 and hadoop-2, respectively, then make hadoop-2 (stable) the 
default profile, and set the hadoop.version property to 2.2.0. We need to worry 
about some third-party dependencies though; for instance, hbase-client in 
mahout-integration depends on hadoop-1 (for that particular artifact, simply 
excluding hadoop-core did not break any tests, by the way).
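
As a hedged illustration of the ProgramDriver#driver point above (this is not part 
of the patch; it only shows one way calling code could stay compatible whichever 
return type the Hadoop version on the classpath declares, by invoking the method 
reflectively):

import java.lang.reflect.Method;
import org.apache.hadoop.util.ProgramDriver;

// Sketch only: call ProgramDriver#driver(String[]) through reflection, so the
// same source compiles whether the Hadoop version on the classpath declares the
// method as returning void or int. Error handling is trimmed for brevity.
public final class DriverShim {
  public static void drive(ProgramDriver driver, String[] args) throws Exception {
    Method m = ProgramDriver.class.getMethod("driver", String[].class);
    Object result = m.invoke(driver, (Object) args); // null when the return type is void
    if (result instanceof Integer && (Integer) result != 0) {
      System.exit((Integer) result);
    }
  }
}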

 Mahout Support for Hadoop 2 
 

 Key: MAHOUT-1354
 URL: https://issues.apache.org/jira/browse/MAHOUT-1354
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 1.0


 Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1354) Mahout Support for Hadoop 2

2013-12-02 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836965#comment-13836965
 ] 

Gokhan Capan commented on MAHOUT-1354:
--

Let me submit a patch first, probably tomorrow.
Best

 Mahout Support for Hadoop 2 
 

 Key: MAHOUT-1354
 URL: https://issues.apache.org/jira/browse/MAHOUT-1354
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Suneel Marthi
Assignee: Suneel Marthi
 Fix For: 1.0


 Mahout support for Hadoop , now that Hadoop 2 is official.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-12-01 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13836102#comment-13836102
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Let's resolve this issue as Won't Fix.

I think what we need to do is implement more sparse matrix (or similar) data 
structures for different access patterns, other than the current map-of-maps 
approach. The ideas would also apply to the current DataModel based on 2 FastByIDMaps.



 

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch, benchmark.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


Re: [jira] [Updated] (MAHOUT-1365) Weighted ALS-WR iterator for Spark

2013-11-26 Thread Gokhan Capan
I'll look into this too, possibly in two days

Sent from my iPhone

 On Nov 26, 2013, at 22:30, Dmitriy Lyubimov (JIRA) j...@apache.org wrote:


 [ 
 https://issues.apache.org/jira/browse/MAHOUT-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 Dmitriy Lyubimov updated MAHOUT-1365:
 -

Attachment: distributed-als-with-confidence.pdf

 Weighted ALS-WR iterator for Spark
 --

Key: MAHOUT-1365
URL: https://issues.apache.org/jira/browse/MAHOUT-1365
Project: Mahout
 Issue Type: Task
   Reporter: Dmitriy Lyubimov
   Assignee: Dmitriy Lyubimov
Fix For: Backlog

Attachments: distributed-als-with-confidence.pdf


 Given preference P and confidence C distributed sparse matrices, compute 
 ALS-WR solution for implicit feedback (Spark Bagel version).
 Following Hu-Koren-Volynsky method (stripping off any concrete methodology 
 to build C matrix), with parameterized test for convergence.
  The computational scheme follows the ALS-WR method (which should be 
  slightly more efficient for sparser inputs).
  The best performance will be achieved if non-sparse anomalies are prefiltered 
  out (such as an anomalously active user who doesn't represent a typical user 
  anyway).
 the work is going here 
 https://github.com/dlyubimov/mahout-commits/tree/dev-0.9.x-scala. I am 
 porting away our (A1) implementation so there are a few issues associated 
 with that.



 --
 This message was sent by Atlassian JIRA
 (v6.1#6144)


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-10-26 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806106#comment-13806106
 ] 

Gokhan Capan edited comment on MAHOUT-1286 at 10/26/13 2:13 PM:


Peng,

I am attaching a patch --not to be committed-- that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.


was (Author: gokhancapan):
Peng,

I am attaching a patch -not to be committed- that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: benchmark.patch, InMemoryDataModel.java, 
 InMemoryDataModelTest.java, Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-10-26 Thread Gokhan Capan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gokhan Capan updated MAHOUT-1286:
-

Attachment: benchmark.patch

Peng,

I am attaching a patch -not to be committed- that includes some benchmarking 
code in case you need one, and 2 in-memory data models as a baseline.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: benchmark.patch, InMemoryDataModel.java, 
 InMemoryDataModelTest.java, Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2013-10-19 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13799916#comment-13799916
 ] 

Gokhan Capan commented on MAHOUT-1178:
--

Hi [~smarthi], 

Although I'm not sure whether there is still interest, I have a Lucene matrix 
implementation (in-memory) and a Solr matrix implementation (which does not load 
the index into memory). I believe both can be committed after a couple of review rounds.



 GSOC 2013: Improve Lucene support in Mahout
 ---

 Key: MAHOUT-1178
 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
  Labels: gsoc2013, mentor
 Fix For: Backlog

 Attachments: MAHOUT-1178.patch, MAHOUT-1178-TEST.patch


 [via Ted Dunning]
 It should be possible to view a Lucene index as a matrix.  This would
 require that we standardize on a way to convert documents to rows.  There
 are many choices, the discussion of which should be deferred to the actual
 work on the project, but there are a few obvious constraints:
 a) it should be possible to get the same result as dumping the term vectors
 for each document each to a line and converting that result using standard
 Mahout methods.
 b) numeric fields ought to work somehow.
 c) if there are multiple text fields that ought to work sensibly as well.
  Two options include dumping multiple matrices or to convert the fields
 into a single row of a single matrix.
 d) it should be possible to refer back from a row of the matrix to find the
 correct document.  THis might be because we remember the Lucene doc number
 or because a field is named as holding a unique id.
 e) named vectors and matrices should be used if plausible.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


Re: Mahout's future

2013-10-16 Thread Gokhan Capan
I'll be traveling tomorrow, and would appreciate it if the videos will be
accessible later.

Best

Sent from my iPhone

On Oct 16, 2013, at 23:15, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Thanks Dmitriy. Let me check if its possible to setup automatic calendar 
 invites to PMC.


 I'll go ahead and send a hangout link for Thursday, Oct 16 from 6 - 7pm 
 (Eastern Time).


 The purpose of this hangout would be to talk about Mahout 0.9 release which 
 is tentatively being planned for Nov-Dec 2013.

 I'll send an email with what I see as being targeted for 0.9 and we can take 
 it from there.


 There's been a discussion thread about Mahout Future Roadmap (interpreting 
 this as post Mahout 0.9),  we can get to that if time permits else  we can 
 have another hangout next week to talk about it.

 Suneel






 On Wednesday, October 16, 2013 4:05 PM, Dmitriy Lyubimov dlie...@gmail.com 
 wrote:

 3 to 4

 On Oct 16, 2013 1:02 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Dmitriy, what time works for you on thursday?





 On Wednesday, October 16, 2013 3:47 PM, Dmitriy Lyubimov 
 dlie...@gmail.com wrote:

  Doesn't work for me. Friday is better, or Thursday earlier afternoon. I'd
  also appreciate automatic calendar invitations to the PMC if at all possible.

 D

 On Oct 14, 2013 10:21 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Will schedule a hangout for this Thursday - 7pm (Eastern Time)
 tentatively.

 I would like us to first discuss about Mahout 0.9 release, will send out
 an agenda once I schedule it.

 Regards,
 Suneel




 On Tuesday, October 15, 2013 12:24 AM, Saikat Kanjilal 
 sxk1...@hotmail.com wrote:

 Following up , Suneel/Grant are we still on for meeting this week on a
 google hangout, would love to neet this week.

 From: sxk1...@hotmail.com
 To: dev@mahout.apache.org
 Subject: RE: Mahout's future
 Date: Sun, 6 Oct 2013 07:00:50 -0700

  +1
  Can you send out a quick agenda (hopefully with my input incorporated)
  before the hangout?
  Regards
 Date: Sun, 6 Oct 2013 03:58:10 -0700
 From: suneel_mar...@yahoo.com
 Subject: Re: Mahout's future
 To: dev@mahout.apache.org

 Grant would be available the week of Oct 14 for a hangout
 (tentatively).
 We could go ahead and schedule one next week if there's (and seems
 very much like it) enough response.  I can go ahead and facilitate one.

 I will be 100% focused on Mahout from next week once I start at my
 new
 job from Monday.

 Regarding building something for Deep Learning, Yexi's patch for MLP
 (see M-1265) may be a good place to refactor/start thinking about the
 foundations.
  I guess Ted is alluding to building something like what's been described
 in the Google paper (see

 http://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf
 ).
 Correct?


 Suneel




 
   From: Ted Dunning ted.dunn...@gmail.com
 To: dev@mahout.apache.org dev@mahout.apache.org
 Cc: dev@mahout.apache.org dev@mahout.apache.org
 Sent: Sunday, October 6, 2013 2:10 AM
 Subject: Re: Mahout's future


 Saikat

 These are all good suggestions.  I would have a hard time suggesting
 a
 prioritization of them.

 Does anybody remember what grant said about having another hangout?

 Sent from my iPhone

 On Oct 6, 2013, at 7:15, Saikat Kanjilal sxk1...@hotmail.com
 wrote:

  I wanted to mention a few other things:
  1) It might be useful to take and embed a few already productionalized use cases
  into the integration tests in mahout, this will help additional users get on
  board faster
  2) Deep learning is really interesting, however I'd like to help research some
  common use cases first before tying this into mahout
  3) It'd be good to put some thought into documenting when you would choose what
  type of algorithm given a production machine learning recommendation system to
  build, this would give more visibility for users into choosing the right mixture
  of algorithms to build a production ready recommender, often what I've found is
  that a bulk of the time in building productionalized recommenders is spent
  cleaning and filtering noisy data
  4) I'd like to also explore how to tie in machine learning algorithms into real
  time systems built using twitter storm (http://storm-project.net/), it seems that
  industry more and more is wanting to do real time analytics on the fly, I'm
  curious what type of algorithms we'd need for this and back propagate these into
  mahout

  It'd be good to meet like-minded devs together locally (Seattle)
 or
 over gtalk/conference to talk through possibilities.
 Regards
 From: ted.dunn...@gmail.com
 Date: Sat, 5 Oct 2013 18:13:40 -0700
 Subject: Re: Mahout's future
 To: dev@mahout.apache.org

 On Sat, Oct 5, 2013 at 5:08 PM, Saikat Kanjilal 
 sxk1...@hotmail.com wrote:

 Does it make sense to have a quick meeting of interested
 developers over
 google chat/conference rather than email to discuss and assign
 folks to
 specifics?

 Thoughts?

 Great idea.

 I think that Grant may have been 

[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-05 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759021#comment-13759021
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

There was a thread on updating int indices and double values in matrices, 
but there are simply too many consequences of that update that we can't deal 
with right now. Even if it is not an exact Matrix structure, we can start with 
2d hash tables and proceed later. 

Let's start this. I tried to insert the Netflix ratings into: (i) a DataModel backed 
by 2 matrices; (ii) the one in this patch. The good news is that insert performance is 
good enough. I am going to try gets and iterations, too. Tomorrow I am starting 
the 2d hash table based on your implementation with a matrix-like interface, I 
am going to share a github link with you.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-05 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759021#comment-13759021
 ] 

Gokhan Capan edited comment on MAHOUT-1286 at 9/5/13 12:22 PM:
---

Even if it is not an exact Matrix structure, we can start with 2d hash tables 
and proceed later. 

Let's start this. I tried to insert Netflix ratings into: i- DataModel backed 
by 2 matrices. ii- The one in this patch. Good news is insert performance is 
good enough. I am going to try gets and iterations, too. Tomorrow I am starting 
the 2d hash table based on your implementation with a matrix-like interface, I 
am going to share a github link with you.

  was (Author: gokhancapan):
There was a thread on updating int indices and double values in 
matrices, but there are simply too many consequences of that update that we 
can't deal with right now. Even if it is not an exact Matrix structure, we can 
start with 2d hash tables and proceed later. 

Let's start this. I tried to insert Netflix ratings into: i- DataModel backed 
by 2 matrices. ii- The one in this patch. Good news is insert performance is 
good enough. I am going to try gets and iterations, too. Tomorrow I am starting 
the 2d hash table based on your implementation with a matrix-like interface, I 
am going to share a github link with you.
  
 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-04 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757801#comment-13757801
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Here is what I think:

1- We should implement a matrix that uses your 2d Hopscotch hash table as the 
underlying data structure (or the current open addressing hash table 
implementation that already exists in Mahout, depending on benchmarks)

2- We should handle concurrency issues that might be introduced by that matrix 
implementation

3- We can then replace the FastByIDMap(s) with that matrix, trust the 
underlying matrix for concurrent updates, and never create a PreferenceArray 
unless there is an iteration over users (or items)

What do you think?
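
A minimal sketch of the kind of structure meant in step 1: a single open-addressing 
table keyed directly by the (userID, itemID) pair. Resizing and the Hopscotch probing 
details are omitted, and all names here are illustrative, not Mahout classes.

// Illustrative sketch only: an open-addressing table keyed by the
// (userID, itemID) pair, storing ratings as primitive floats.
public final class LongPairFloatTable {

  private static final long EMPTY = Long.MIN_VALUE; // sentinel; assumes no such userID

  private final long[] userIds;
  private final long[] itemIds;
  private final float[] values;
  private int size;

  public LongPairFloatTable(int capacity) {
    int cap = Integer.highestOneBit(Math.max(capacity, 4) - 1) << 1; // power of two
    userIds = new long[cap];
    itemIds = new long[cap];
    values = new float[cap];
    java.util.Arrays.fill(userIds, EMPTY);
  }

  private int indexFor(long userId, long itemId) {
    long h = userId * 31L + itemId;
    h ^= (h >>> 20) ^ (h >>> 12);
    return (int) (h ^ (h >>> 7) ^ (h >>> 4)) & (userIds.length - 1);
  }

  public void set(long userId, long itemId, float value) {
    if (size * 2 >= userIds.length) {
      throw new IllegalStateException("resize omitted in this sketch");
    }
    int i = indexFor(userId, itemId);
    while (userIds[i] != EMPTY && !(userIds[i] == userId && itemIds[i] == itemId)) {
      i = (i + 1) & (userIds.length - 1); // linear probing
    }
    if (userIds[i] == EMPTY) {
      size++;
    }
    userIds[i] = userId;
    itemIds[i] = itemId;
    values[i] = value;
  }

  public float get(long userId, long itemId) {
    int i = indexFor(userId, itemId);
    while (userIds[i] != EMPTY) {
      if (userIds[i] == userId && itemIds[i] == itemId) {
        return values[i];
      }
      i = (i + 1) & (userIds.length - 1);
    }
    return Float.NaN; // absent
  }
}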

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-27 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13751049#comment-13751049
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Hi Peng, could you submit the diff files instead of .javas? That would be more 
convenient for me if it is possible.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-27 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13751053#comment-13751053
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

By the way, it seems the link to the paper is broken, if it is not just me.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: You are invited to Apache Mahout meet-up

2013-08-22 Thread Gokhan Capan
Have a great day!

On Aug 22, 2013, at 8:44 PM, Piero Giacomelli pgiac...@gmail.com wrote:

 Me too, so any online material could be very helpful
 On Aug 22, 2013, 19:31, Peng Cheng pc...@uowmail.edu.au wrote:
 
 Is the presentation going to be uploaded on Youtube or Slideshare? Sorry I
 cannot be there.
 
 On 13-08-22 08:46 AM, Yexi Jiang wrote:
 
 A great event. I wish I were in Bay area.
 
 
 2013/8/22 Shannon Quinn squ...@gatech.edu
 
 I'm only sorry I'm not in the Bay area. Sounds great!
 
 
 On 8/22/13 3:38 AM, Stevo Slavić wrote:
 
 Retweeted meetup invite. Have fun!
 
 Kind regards,
 Stevo Slavic.
 
 
 On Thu, Aug 22, 2013 at 8:34 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  Very cool.
 
 Would love to see folks turn out for this.
 
 
 On Wed, Aug 21, 2013 at 9:38 PM, Ellen Friedman
  b.ellen.fried...@gmail.com wrote:
 
  The Apache Mahout user group has been re-activated. If you are in the
 
 Bay
 Area in California, join us on Aug 27 (Redwood City).
 
 Sebastian Schelter will be the main speaker, talking about new
 directions
 with Mahout recommendation. Grant Ingersoll, Ted Dunning and I be
 there
 
 to
 
 do a short introduction for the meet-up and update on the 0.8 release.
 
 Here's the link to rsvp: http://bit.ly/16K32hg
 
 Hope you can come, and please spread the word.
 
 Ellen
 
 


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Gokhan Capan (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737267#comment-13737267
 ] 

Gokhan Capan commented on MAHOUT-1286:
--

Peng,

With a SparseRowMatrix, column access (getPreferencesForItem) is slow, but row 
access (getPreferencesFromUser) is pretty fast. I agree with all the other problems 
you mentioned. 

In Mahout's SVD-based recommenders and FactorizablePreferences, while computing 
top-N recommendations, I believe we compute an (activeUser, item) prediction for 
each item, and return the top N. So basically, an SVD-based recommender needs 
fast access to the rows of the matrix, but not the columns (it still needs to 
iterate over item ids, though). Column access is only needed in an item-based 
recommender, or if a CandidateItemsStrategy is used.
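
As a rough illustration of why row access alone is enough for that top-N 
computation (plain primitive arrays standing in for the factor matrices; the 
names below are illustrative, not Mahout internals):

// Sketch: top-N for one user, computed from factor rows held as primitive
// arrays; only the user's row and the item rows are touched, never a column.
static long[] topN(final double[] userRow, final double[][] itemFactors,
                   long[] itemIds, int n) {
  final double[] scores = new double[itemFactors.length];
  Integer[] order = new Integer[itemFactors.length];
  for (int i = 0; i < itemFactors.length; i++) {
    double dot = 0.0;
    for (int f = 0; f < userRow.length; f++) {
      dot += userRow[f] * itemFactors[i][f]; // row access only
    }
    scores[i] = dot;
    order[i] = i;
  }
  java.util.Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
  long[] top = new long[Math.min(n, order.length)];
  for (int k = 0; k < top.length; k++) {
    top[k] = itemIds[order[k]];
  }
  return top;
}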

In my tests for Netflix data, I saw a 3G heap, too. Let me compare this 
particular approach with the SparseRowMatrix backed one. I will investigate 
your approach further.

Ted, 

Additionally, I recently implemented a read-only SolrMatrix, which might be 
beneficial while implementing the SolrRecommender, if we want to use existing 
mahout library for similarities etc. I will open a new thread for that.

Best


 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Regarding Online Recommenders

2013-07-24 Thread Gokhan Capan
Ok, I tested the MatrixBackedDataModel, and the heap size is reduced to 7G
for the Netflix Data, still large.

The same history is encoded in 2 SparseRowMatrices, one row-indexed by
users and one by items.

It has serious concurrency issues at several places, though (sets and
removes need to be thread-safe).
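
A rough sketch of the shape of such a dual-indexed model and where the 
thread-safety issue bites (the types below are simplified stand-ins for 
illustration, not the actual SparseRowMatrix-backed implementation):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: the same ratings kept in two orientations; every write touches
// both views, which is exactly where sets and removes need synchronization.
final class DualIndexRatings {
  private final Map<Long, Map<Long, Float>> byUser = new ConcurrentHashMap<>();
  private final Map<Long, Map<Long, Float>> byItem = new ConcurrentHashMap<>();
  private final Object writeLock = new Object();

  void set(long userId, long itemId, float rating) {
    synchronized (writeLock) { // coarse lock keeps the two views consistent
      byUser.computeIfAbsent(userId, k -> new ConcurrentHashMap<>()).put(itemId, rating);
      byItem.computeIfAbsent(itemId, k -> new ConcurrentHashMap<>()).put(userId, rating);
    }
  }

  void remove(long userId, long itemId) {
    synchronized (writeLock) {
      Map<Long, Float> row = byUser.get(userId);
      if (row != null) { row.remove(itemId); }
      Map<Long, Float> col = byItem.get(itemId);
      if (col != null) { col.remove(userId); }
    }
  }

  Map<Long, Float> preferencesFromUser(long userId) { return byUser.get(userId); }
  Map<Long, Float> preferencesForItem(long itemId) { return byItem.get(itemId); }
}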

Best

Gokhan


On Sat, Jul 20, 2013 at 12:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

 Hi,

 Just one simple question: Is the 
  org.apache.mahout.math.BinarySearch.binarySearch()
 function an optimized version of Arrays.binarySearch()? If it is not, why
 implement it again?

 Yours Peng


 On 13-07-17 06:31 PM, Sebastian Schelter wrote:

 You are completely right, the simple interface would only be usable for
 readonly / batch-updatable recommenders. Online recommenders might need
 something different. I tried to widen the discussion here to discuss all
 kinds of API changes in the recommenders that would be necessary in the
 future.



 2013/7/17 Peng Cheng pc...@uowmail.edu.au

   One thing that suddenly comes to my mind is that, for a simple interface
  like FactorizablePreferences, maybe sequential READ in real time is possible,
  but sequential WRITE in O(1) time is utopia, because you need to flush out the
  old preference with the same user and item ID (in the worst case it could be an
  interpolation search); otherwise you are permitting a user to rate an item twice
  with different values. Considering how FileDataModel is supposed to work (new
  files flush old files), maybe using the simple interface has fewer advantages
  than we used to believe.


 On 13-07-17 04:58 PM, Sebastian Schelter wrote:

  Hi Peng,

 I never wanted to discard the old interface, I just wanted to split it
 up.
 I want to have a simple interface that only supports sequential access
 (and
  allows for very memory efficient implementations, e.g. by the use of
  primitive arrays). DataModel should *extend* this interface and provide
  sequential and random access (basically what it already does).

 Than a recommender such as SGD could state that it only needs sequential
 access to the preferences and you can either feed it a DataModel (so we
 dont break backwards compatibility) or a memory efficient sequential
 access thingy.

 Does that make sense for you?


 2013/7/17 Peng Cheng pc...@uowmail.edu.au

   I see, OK so we shouldn't use the old implementation. But I mean, the
 old

 interface doesn't have to be discarded. The discrepancy between your
 FactorizablePreferences and DataModel is that, your model supports
 getPreferences(), which returns all preferences as an iterator, and
 DataModel supports a few old functions that returns preferences for an
 individual user or item.

  My point is that it is not hard for each of them to implement what it lacks:
  the old DataModel can implement getPreferences() just by a loop in the abstract
  class, and your new FactorizablePreferences can implement those old functions by
  a binary search that takes O(log n) time, or an interpolation search that takes
  O(log log n) time on average. So does the online update. It will just be a matter
  of different speed and space, not a different interface standard; we can use the
  old unit tests, old examples, old everything. And we will be more flexible in
  writing ensemble recommenders.
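
  (A minimal sketch of that binary-search idea, with plain parallel arrays standing
  in for a sequential-access preference store; the class and method names below are
  illustrative only, not Mahout API:)

  // Sketch: preferences as parallel arrays sorted by (userId, itemId); a single
  // preference is then a binary search away, O(log n).
  final class SortedPreferences {
    private final long[] userIds;  // sorted primary key
    private final long[] itemIds;  // secondary key, sorted within each user
    private final float[] values;

    SortedPreferences(long[] userIds, long[] itemIds, float[] values) {
      this.userIds = userIds;
      this.itemIds = itemIds;
      this.values = values;
    }

    float get(long userId, long itemId) {
      int lo = 0;
      int hi = userIds.length - 1;
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        int cmp = Long.compare(userIds[mid], userId);
        if (cmp == 0) {
          cmp = Long.compare(itemIds[mid], itemId);
        }
        if (cmp < 0) {
          lo = mid + 1;
        } else if (cmp > 0) {
          hi = mid - 1;
        } else {
          return values[mid];
        }
      }
      return Float.NaN; // no such preference
    }
  }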

 Just a few thoughts, I'll have to validate the idea first before
 creating
 a new JIRA ticket.

 Yours Peng



 On 13-07-16 02:51 PM, Sebastian Schelter wrote:

   I completely agree, Netflix is less than one gigabye in a smart

 representation, 12x more memory is a nogo. The techniques used in
 FactorizablePreferences allow a much more memory efficient
 representation,
 tested on KDD Music dataset which is approx 2.5 times Netflix and fits
 into
 3GB with that approach.


 2013/7/16 Ted Dunning ted.dunn...@gmail.com

Netflix is a small dataset.  12G for that seems quite excessive.

  Note also that this is before you have done any work.

 Ideally, 100million observations should take  1GB.

 On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
 wrote:

The second idea is indeed splendid, we should separate
 time-complexity

  first and space-complexity first implementation. What I'm not quite
 sure,
 is that if we really need to create two interfaces instead of one.
 Personally, I think 12G heap space is not that high right? Most new

   laptop

   can already handle that (emphasis on laptop). And if we replace
 hash

 map
 (the culprit of high memory consumption) with list/linkedList, it
 would
 simply degrade time complexity for a linear search to O(n), not too
 bad
 either. The current DataModel is a result of careful thoughts and
 has
 underwent extensive test, it is easier to expand on top of it
 instead
 of
 subverting it.








Re: MongoDBDataModel additions

2013-07-22 Thread Gokhan Capan
Paul,

Actually we are now working on an OnlineRecommender, and we plan to support
new users and items. You can find the discussion in the Regarding Online
Recommenders thread on the dev list.

You may want to take a look at it.

Best,
Gokhan


On Mon, Jul 22, 2013 at 8:40 AM, Paul Scott pscott...@gmail.com wrote:

 On 19/07/2013 19:40, Gokhan Capan wrote:

 Hi Paul,

 I am sure Sebastian will provide further information, but there was a JIRA
 ticket that you may find relevant.
  https://issues.apache.org/jira/browse/MAHOUT-1050


 Thanks! OK, so the data model is immutable because of constant refreshing.
 Seems OK to me, although may be a bit heavy with many millions of users no?

 Anyway, I will leave it for now and look at other ways to help out this
 awesome project!

 Thanks for the reply and link

 -- Paul

 --
 http://paulscott.co.za/blog/




Re: MongoDBDataModel additions

2013-07-19 Thread Gokhan Capan
Hi Paul,

I am sure Sebastian will provide further information, but there was a JIRA
ticket that you may find relevant.
https://issues.apache.org/jira/browse/MAHOUT-1050

Best

Gokhan


On Fri, Jul 19, 2013 at 9:43 AM, Paul Scott pscott...@gmail.com wrote:

 Hi all,

 Let me do a quick introduction. I am Paul and I work at DStv Online in
 South Africa.

 I would normally lurk on a list a lot longer than this, but I do feel that
 I can contribute almost immediately. Please excuse me if I am at all out of
 bounds here...

 I have noticed that in the MongoDBDataModel in mahout-inegration that the
 methods:

 public void setPreference(long userID, long itemID, float value)

 and

 public void removePreference(long userID, long itemID)

 both throw UnsupportedOperationExceptions. Is this by design, or can I
 actually implement these methods and send through a patch?

 Also, obviously, I would need to open a Jira ticket. Do I need to sign up
 for that or what is the process there?

 As a second contribution, I would also like to start exploring/discussing
 a Neo4jDataModel for working with the Neo4j Graph database.

 Again, apologies if this has already been discussed, but I couldn't find
 any other references to this online.

 Many thanks!

 -- Paul
 http://paulscott.co.za/blog



Re: Regarding Online Recommenders

2013-07-18 Thread Gokhan Capan
 handled? Do

 you

 plan to require batch model refactorization for any update? Or perform

 some

 partial update by maybe just transforming new data into the LF space
 already in place then doing full refactorization every so often in
 batch
 mode?

 By 'anonymous users' I mean users with some history that is not yet
 incorporated in the LF model. This could be history from a new user
 asked
 to pick a few items to start the rec process, or an old user with some

 new

 action history not yet in the model. Are you going to allow for
 passing

 the

 entire history vector or userID+incremental new history to the

 recommender?

 I hope so.

 For what it's worth we did a comparison of Mahout Item based CF to
 Mahout
 ALS-WR CF on 2.5M users and 500K items with many M actions over 6
 months

 of

 data. The data was purchase data from a diverse ecom source with a
 large
 variety of products from electronics to clothes. We found Item based
 CF

 did

 far better than ALS. As we increased the number of latent factors the
 results got better but were never within 10% of item based (we used
 MAP

 as

 the offline metric). Not sure why but maybe it has to do with the

 diversity

 of the item types.

 I understand that a full item based online recommender has very
 different
 tradeoffs and anyway others may not have seen this disparity of
 results.
 Furthermore we don't have A/B test results yet to validate the offline
 metric.

 On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Peng,

 This is the reason I separated out the DataModel, and only put the

 learner

 stuff there. The learner I mentioned yesterday just stores the
  parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not
 care
 where preferences are stored.

 I, kind of, agree with the multi-level DataModel approach:
 One for iterating over all preferences, one for if one wants to
 deploy

 a

 recommender and perform a lot of top-N recommendation tasks.

 (Or one DataModel with a strategy that might reduce existing memory
 consumption, while still providing fast access, I am not sure. Let me

 try a

 matrix-backed DataModel approach)

 Gokhan


 On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org
 wrote:

  I completely agree, Netflix is less than one gigabye in a smart
 representation, 12x more memory is a nogo. The techniques used in
 FactorizablePreferences allow a much more memory efficient

 representation,

 tested on KDD Music dataset which is approx 2.5 times Netflix and
 fits

 into

 3GB with that approach.


 2013/7/16 Ted Dunning ted.dunn...@gmail.com

  Netflix is a small dataset.  12G for that seems quite excessive.

 Note also that this is before you have done any work.

 Ideally, 100million observations should take  1GB.

 On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au

 wrote:

 The second idea is indeed splendid, we should separate
 time-complexity
 first and space-complexity first implementation. What I'm not quite

 sure,

 is that if we really need to create two interfaces instead of one.
 Personally, I think 12G heap space is not that high right? Most new

 laptop

 can already handle that (emphasis on laptop). And if we replace
 hash

 map

 (the culprit of high memory consumption) with list/linkedList, it

 would

  simply degrade time complexity for a linear search to O(n), not too

 bad

  either. The current DataModel is a result of careful thoughts and has
 underwent extensive test, it is easier to expand on top of it
 instead

 of

 subverting it.











Re: Regarding Online Recommenders

2013-07-18 Thread Gokhan Capan
It is 2 SparseRowMatrices, Peng. But I don't want to comment on it before
actually trying it. This is essentially a first step for me to choose my
side on the DataModel implementation discussion:)

Gokhan

On Fri, Jul 19, 2013 at 2:25 AM, Peng Cheng pc...@uowmail.edu.au wrote:

 Wow, that's lightning fast.

 Is it a SparseMatrix or DenseMatrix?


 On 13-07-18 07:23 PM, Gokhan Capan wrote:

 I just started to implement a Matrix backed data model and pushed it, to
 check the performance and memory considerations.

 I believe I can try it on some data tomorrow.

 Best

 Gokhan


 On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng pc...@uowmail.edu.au
 wrote:

   I see, sorry I was too presumptuous. I only recently worked with and tested
  SVDRecommender, so I could not have known the efficiency of an item-based
  recommender. Maybe there is room for algorithmic optimization.

  The online recommender Gokhan is working on is also an SVDRecommender. An
  online user-based or item-based recommender based on a clustering technique
  would definitely be critical, but we need an expert to volunteer :)

 Perhaps Dr Dunning can have a few words? He announced the online
 clustering component.

 Yours Peng


 On 13-07-18 03:54 PM, Pat Ferrel wrote:

  No it was CPU bound not memory. I gave it something like 14G heap. It
 was
 running, just too slow to be of any real use. We switched to the hadoop
 version and stored precalculated recs in a db for every user.

 On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote:

  Strange, it's just a little bit larger than the libimseti dataset (17m
  ratings); did you encounter an OutOfMemory or GC-overhead exception?
  Allocating more heap space usually helps.

 Yours Peng

 On 13-07-18 02:27 PM, Pat Ferrel wrote:

  It was about 2.5M users and 500K items with 25M actions over 6 months
 of
 data.

 On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

  If I remember right, a highlight of the 0.8 release is an online clustering
  algorithm. I'm not sure if it can be used in an item-based recommender, but
  this is definitely something I would like to pursue. It's probably the only
  advantage a non-hadoop implementation can offer in the future.

 Many non-hadoop recommenders are pretty fast. But existing in-memory
 GenericDataModel and FileDataModel are largely implemented for
 sandboxes,
  IMHO they are the culprit of the scalability problem.

  May I ask about the scale of your dataset? How many ratings does it
 have?

 Yours Peng

 On 13-07-18 12:14 PM, Sebastian Schelter wrote:

  Well, with itembased the only problem is new items. New users can
 immediately be served by the model (although this is not well
 supported
 by
 the API in Mahout). For the majority of usecases I saw, it is
 perfectly
 fine to have a short delay until new items enter the recommender,
 usually
 this happens after a retraining in batch. You have to care for
 cold-start
 and collect some interactions anyway.


 2013/7/18 Pat Ferrel pat.fer...@gmail.com

   Yes, what Myrrix does is good.

 My last aside was a wish for an item-based online recommender not
 only
 factorized. Ted talks about using Solr for this, which we're
 experimenting
 with alongside Myrrix. I suspect Solr works but it does require a bit
 of
 tinkering and doesn't have quite the same set of options--no llr
 similarity
 for instance.

 On the same subject, I recently attended a workshop in Seattle for UAI 2013
 where Walmart reported similar results using a factorized recommender. They
 had to increase the number of factors past where it would perform well. Along
 the way they saw increasing performance when measuring precision offline. They
 eventually gave up on a factorized solution. This decision seems odd, but
 anyway… In the case of both Walmart's data set and ours, the data is quite
 diverse. The best idea is probably to create different recommenders for
 separate parts of the catalog, but if you create one model on all items our
 intuition is that item-based works better than factorized. Again the caveat:
 no A/B tests to support this yet.

 Doing an online item-based recommender would quickly run into scaling
 problems, no? We put together the simple Mahout in-memory version and it
 could not really handle more than a down-sampled few months of our data.
 Down-sampling lost us 20% of our precision scores, so we moved to the Hadoop
 version. Now we have use cases for an online recommender that handles
 anonymous new users, and that takes the story full circle.

 On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org
 wrote:

 Hi Pat

 I think we should provide simple support for recommending to anonymous
 users. We should have a method recommendToAnonymous() that takes a
 PreferenceArray as an argument. For item-based recommenders, it's
 straightforward to compute recommendations; for user-based you have to
 search through all users once; for latent factor models, you have to fold
 the user vector into the low-dimensional space.
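
 A minimal sketch of that fold-in for a latent factor model, assuming the
 item factor matrix is fixed and running a few SGD passes over the anonymous
 user's preferences; the class and parameter names are illustrative only,
 not part of the existing Mahout API:

 import org.apache.mahout.cf.taste.model.PreferenceArray;
 import org.apache.mahout.math.DenseVector;
 import org.apache.mahout.math.Matrix;
 import org.apache.mahout.math.Vector;

 // Sketch: fold an anonymous user's preferences into the latent space by
 // taking a few SGD steps against the fixed item factors.
 public final class AnonymousFoldIn {

   public static Vector foldIn(PreferenceArray prefs, Matrix itemFactors,
                               int numFeatures, int passes,
                               double learningRate, double lambda) {
     Vector user = new DenseVector(numFeatures); // start from zeros
     for (int pass = 0; pass < passes; pass++) {
       for (int i = 0; i < prefs.length(); i++) {
         int itemIndex = (int) prefs.getItemID(i); // assumes item IDs map to row indices
         double rating = prefs.getValue(i);
         Vector item = itemFactors.viewRow(itemIndex);
         double err = rating - user.dot(item);
         // adjust only the user factors; the item factors stay fixed
         user.assign(user.plus(item.times(err).minus(user.times(lambda)).times(learningRate)));
       }
     }
     return user; // rank an unseen item j by user.dot(itemFactors.viewRow(j))
   }
 }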

 I think Sean already added

Re: Regarding Online Recommenders

2013-07-17 Thread Gokhan Capan
Hi Pat, please see my response inline.

Best,
Gokhan


On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 May I ask how you plan to support model updates and 'anonymous' users?

 I assume the latent factor model is still calculated offline in batch
 mode, and then there are periodic updates? How are the updates handled?


If you are referring to the recommender under discussion here, no: updating
the model can be done with a single preference, using stochastic gradient
descent, by updating the particular user and item factors simultaneously.
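
A minimal sketch of that simultaneous update, assuming the factors are held
in two in-memory matrices; the class and field names are illustrative, not
the actual code in the branch:

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// Sketch: one observed preference adjusts the corresponding user and item
// factor rows at the same time.
final class SgdFactorizerSketch {

  private final Matrix userFactors;  // numUsers x numFeatures
  private final Matrix itemFactors;  // numItems x numFeatures
  private final double learningRate;
  private final double lambda;       // regularization

  SgdFactorizerSketch(Matrix userFactors, Matrix itemFactors,
                      double learningRate, double lambda) {
    this.userFactors = userFactors;
    this.itemFactors = itemFactors;
    this.learningRate = learningRate;
    this.lambda = lambda;
  }

  void train(int userIndex, int itemIndex, double rating) {
    Vector u = userFactors.viewRow(userIndex);
    Vector v = itemFactors.viewRow(itemIndex);
    double err = rating - u.dot(v);
    Vector uOld = u.clone(); // keep the old user factors for the item update
    u.assign(u.plus(v.times(err).minus(u.times(lambda)).times(learningRate)));
    v.assign(v.plus(uOld.times(err).minus(v.times(lambda)).times(learningRate)));
  }

  double predict(int userIndex, int itemIndex) {
    return userFactors.viewRow(userIndex).dot(itemFactors.viewRow(itemIndex));
  }
}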

Do you plan to require batch model refactorization for any update? Or
 perform some partial update by maybe just transforming new data into the LF
 space already in place then doing full refactorization every so often in
 batch mode?

 By 'anonymous users' I mean users with some history that is not yet
 incorporated in the LF model. This could be history from a new user asked
 to pick a few items to start the rec process, or an old user with some new
 action history not yet in the model. Are you going to allow for passing the
 entire history vector or userID+incremental new history to the recommender?
 I hope so.


 For what it's worth, we did a comparison of Mahout item-based CF to Mahout
 ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6
 months of data. The data was purchase data from a diverse ecom source with a
 large variety of products from electronics to clothes. We found item-based CF
 did far better than ALS. As we increased the number of latent factors the
 results got better but were never within 10% of item-based (we used MAP as
 the offline metric). Not sure why, but maybe it has to do with the diversity
 of the item types.


My first question: are those actions only positive, like the purchases
you mentioned?


 I understand that a full item-based online recommender has very different
 tradeoffs, and anyway others may not have seen this disparity in results.
 Furthermore, we don't have A/B test results yet to validate the offline
 metric.


I personally think an A/B test is the best way to evaluate a recommender,
and if you are able to share the results, I look forward to seeing them. I
believe that would be a great contribution for some future decisions.


 On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Peng,

 This is the reason I separated out the DataModel, and only put the learner
 stuff there. The learner I mentioned yesterday just stores the
 parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
 where preferences are stored.

I, kind of, agree with the multi-level DataModel approach:
one for iterating over all preferences, and one for when one wants to deploy a
recommender and perform a lot of top-N recommendation tasks.

 (Or one DataModel with a strategy that might reduce existing memory
 consumption, while still providing fast access, I am not sure. Let me try a
 matrix-backed DataModel approach)

 Gokhan


 On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org
 wrote:

   I completely agree, Netflix is less than one gigabyte in a smart
   representation; 12x more memory is a no-go. The techniques used in
   FactorizablePreferences allow a much more memory-efficient representation,
   tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits
   into 3GB with that approach.
 
 
  2013/7/16 Ted Dunning ted.dunn...@gmail.com
 
  Netflix is a small dataset.  12G for that seems quite excessive.
 
  Note also that this is before you have done any work.
 
   Ideally, 100 million observations should take < 1GB.
 
  On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
  wrote:
 
  The second idea is indeed splendid; we should separate the time-complexity-first
  and space-complexity-first implementations. What I'm not quite sure about
  is whether we really need to create two interfaces instead of one.
  Personally, I think 12G of heap space is not that high, right? Most new
  laptops can already handle that (emphasis on laptop). And if we replace the
  hash map (the culprit of high memory consumption) with a list/linked list,
  it would simply degrade time complexity to O(n) for a linear search, not too
  bad either. The current DataModel is the result of careful thought and has
  undergone extensive testing; it is easier to expand on top of it than to
  subvert it.
 
 




Re: Regarding Online Recommenders

2013-07-16 Thread Gokhan Capan
Peng,

This is the reason I separated out the DataModel, and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
where preferences are stored.

I, kind of, agree with the multi-level DataModel approach:
one for iterating over all preferences, and one for when one wants to deploy a
recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption, while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach)
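
A minimal sketch of what that matrix-backed storage could look like, assuming
fixed maximum dimensions and simple id-to-index maps; this only illustrates
the idea and is not the DataModel implementation under discussion:

import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.math.SparseRowMatrix;

// Sketch: user/item IDs are mapped to row/column indices and all preferences
// live in a single sparse matrix. A real DataModel would need growable
// dimensions and the full Taste interface.
final class MatrixBackedPreferences {

  private final Map<Long, Integer> userIndex = new HashMap<Long, Integer>();
  private final Map<Long, Integer> itemIndex = new HashMap<Long, Integer>();
  private final SparseRowMatrix preferences;

  MatrixBackedPreferences(int maxUsers, int maxItems) {
    this.preferences = new SparseRowMatrix(maxUsers, maxItems);
  }

  void setPreference(long userID, long itemID, float value) {
    preferences.setQuick(indexOf(userIndex, userID), indexOf(itemIndex, itemID), value);
  }

  Float getPreferenceValue(long userID, long itemID) {
    Integer row = userIndex.get(userID);
    Integer col = itemIndex.get(itemID);
    if (row == null || col == null) {
      return null;
    }
    double value = preferences.getQuick(row, col);
    return value == 0.0 ? null : (float) value; // treats 0 as "no preference"
  }

  private static int indexOf(Map<Long, Integer> index, long id) {
    Integer i = index.get(id);
    if (i == null) {
      i = index.size();
      index.put(id, i);
    }
    return i;
  }
}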

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org wrote:

 I completely agree, Netflix is less than one gigabyte in a smart
 representation; 12x more memory is a no-go. The techniques used in
 FactorizablePreferences allow a much more memory-efficient representation,
 tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits into
 3GB with that approach.


 2013/7/16 Ted Dunning ted.dunn...@gmail.com

  Netflix is a small dataset.  12G for that seems quite excessive.
 
  Note also that this is before you have done any work.
 
   Ideally, 100 million observations should take < 1GB.
 
  On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
 wrote:
 
    The second idea is indeed splendid; we should separate the time-complexity-first
    and space-complexity-first implementations. What I'm not quite sure about
    is whether we really need to create two interfaces instead of one.
    Personally, I think 12G of heap space is not that high, right? Most new
    laptops can already handle that (emphasis on laptop). And if we replace the
    hash map (the culprit of high memory consumption) with a list/linked list,
    it would simply degrade time complexity to O(n) for a linear search, not too
    bad either. The current DataModel is the result of careful thought and has
    undergone extensive testing; it is easier to expand on top of it than to
    subvert it.
 



Regarding Online Recommenders

2013-07-15 Thread Gokhan Capan
Based on the conversation in MAHOUT-1274, I put some code here:

https://github.com/gcapan/mahout/tree/onlinerec
I hope that would initiate a discussion on OnlineRecommender approaches.

I think the OnlineRecommender would require (similar to what Sebastian
commented there):

1- A DataModel that allows adding new users/items and performs fast
iteration
2- An online learning interface that allows updating the model with a single
feedback and making predictions based on the latest model

The code is a very early effort for the latter, and it contains a matrix
factorization-based implementation where training is done by SGD.

The model is stored in a DenseMatrix -- it should be replaced with a matrix
that allows adding new rows and doesn't allocate space for empty rows
(please search for DenseRowMatrix and BlockSparseMatrix in the dev-list,
and see MAHOUT-1193 for the relevant issue).

I didn't try that on a dataset yet.

The DataModel I imagine would follow the current API, where the underlying
preference storage is replaced with a matrix.

A Recommender would then use the DataModel and the OnlineLearner, where
Recommender#setPreference is delegated to DataModel#setPreference (like it
does now), and DataModel#setPreference triggers OnlineLearner#train.
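
A minimal sketch of the online learning interface and of that delegation
chain; the names and signatures are guesses at the intent, not the actual
code in the branch:

// Sketch of requirement 2: update the model with a single feedback and
// predict from the latest model.
interface OnlineLearner {
  void train(long userID, long itemID, float value);
  float estimatePreference(long userID, long itemID);
}

// The Recommender keeps its current contract and simply forwards:
// setPreference -> DataModel#setPreference -> OnlineLearner#train.
final class OnlineRecommenderSketch {

  private final OnlineLearner learner;

  OnlineRecommenderSketch(OnlineLearner learner) {
    this.learner = learner;
  }

  public void setPreference(long userID, long itemID, float value) {
    // the DataModel would also record the preference here, as it does today
    learner.train(userID, itemID, value);
  }

  public float estimatePreference(long userID, long itemID) {
    return learner.estimatePreference(userID, itemID);
  }
}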

Gokhan


Re: Welcome new committers Gokhan Capan and Stevo Slavic

2013-06-12 Thread Gokhan Capan
Hi,

Sorry I was on a vacation.
Congratulations, Stevo!

I think being a Mahout committer is a big deal, and I am really pleased
that I am one now.

I am a Researcher at Anadolu University, Turkey, and a Data Scientist at
Dilisim, a company specialized in IR, NLP, and Data Science solutions.

I hope I can contribute well to the committers' great efforts to empower users
to perform massive, real-world machine learning.

Thank you very much.

Best regards,
Gokhan


On Tue, Jun 11, 2013 at 12:39 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

 congratulations!


 On Mon, Jun 10, 2013 at 10:22 PM, Dan Filimon
  dangeorge.fili...@gmail.com wrote:

  Congratulations to the both of you! :)
  It's great to have you on board!
 
 
  On Tue, Jun 11, 2013 at 3:58 AM, Stevo Slavić ssla...@gmail.com wrote:
 
   Thanks Grant, Suneel and rest of the team,
  
    I'm a Java software developer and OSS enthusiast from Serbia with 7 years
    of professional experience in the IT industry.
    Together with the teams I've been part of, I have designed, built and
    successfully delivered multiple applications and websites in various
    business domains (online media, e-government, telecommunications,
    e-commerce). In both small and large enterprise-scale apps, open source
    technologies and the communities around them were, and remain, among the
    key components and ingredients for success.
  
    It's always a great pleasure for me to give back to OSS projects that I
    use, through submitting patches or just being a good community member.
    So far I've contributed to and been involved the most with the Spring
    Framework and other associated projects from the Spring portfolio.
  
    Back in April last year I rediscovered my passion and interest in machine
    learning, AI and computer science in general through Prof. Andrew Ng's
    Coursera machine learning MOOC (https://www.coursera.org/course/ml), which
    I successfully completed (http://bit.ly/sslavic-coursera-ml). Going from ML
    theory to practice, through the mist of Big Data hype, led me to the
    greatness of the Apache Mahout project.
  
    You all do me great honor by accepting me into the team, a team of
    exceptional individuals yet great team players, with such a positive and
    creative atmosphere.
    My contributions to the project so far have been rather limited, and in
    the near future they are likely to remain so, as I still have lots to
    learn first.
    At least in the beginning, more than anything else I expect that I'll be
    able to contribute to the project by making it even more approachable to a
    general audience of IT practitioners like myself, through actively
    promoting it, supporting users on the mailing list to the best of my
    ability, and working on the documentation. My level of commitment will
    surely increase with time.
  
   I thank you all once more for this wonderful opportunity, and wish us
 and
   the project lots of success!
  
   Kind regards,
   Stevo Slavic.
  
  
   On Tue, Jun 11, 2013 at 1:10 AM, Suneel Marthi 
 suneel_mar...@yahoo.com
   wrote:
  
Congrats Gokhan and Stevo!!
   
   
   
   

 From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org dev@mahout.apache.org
Sent: Monday, June 10, 2013 5:04 PM
Subject: Welcome new committers Gokhan Capan and Stevo Slavic
   
   
Please join me in congratulating Mahout's newest committers, Gokhan
  Capan
and Stevo Slavic, both of whom have been contributing to Mahout for
  some
time now.
   
Gokhan, Stevo, new committer tradition is to give a brief background
 on
yourself, so you have the floor!
   
Congrats,
Grant
   
  
 



HBase backed matrices

2013-05-07 Thread Gokhan Capan
Hi,

For taking large matrices as input and persisting large models (like factor
models), I created an HBase-backed version of the Mahout Matrix.

It allows random access to cells and rows as well as assignment, and
iteration over rows. viewRow returns a view, and lazily loads the actual data
only if a get is actually invoked.
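
A minimal sketch of the lazy-loading row view idea, with a placeholder loader
interface standing in for whatever actually issues the HBase Get; this is not
the contributed code:

import java.io.IOException;
import org.apache.mahout.math.Vector;

// Sketch: the view remembers only the row index and fetches the backing
// vector from HBase on the first real access.
final class LazyRowView {

  interface RowLoader {
    Vector loadRow(int row) throws IOException;
  }

  private final int row;
  private final RowLoader loader;
  private Vector delegate; // stays null until a get is actually invoked

  LazyRowView(int row, RowLoader loader) {
    this.row = row;
    this.loader = loader;
  }

  double get(int column) throws IOException {
    if (delegate == null) {
      delegate = loader.loadRow(row);
    }
    return delegate.get(column);
  }
}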

I plan to add a VectorInputFormat on top of it, too.

The code that we need to have for our algorithms is tested, but there are
still parts of it that are not.

I am going to speak about this at HBaseCon, and I wanted to let you know
that it can be contributed after some refactoring.

Is there any interest?

-- 
Gokhan


Re: HBase backed matrices

2013-05-07 Thread Gokhan Capan
2 options:

1- row index as the row key, column index as column identifier, and value
as value
2- row index and column index combined as the row key, and value in a
column called value

Row indices are kept in a member variable in memory, to make iteration fast.
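
A minimal sketch of the two key layouts using the standard HBase client and
Bytes utilities of that era; the column family and qualifier names are
placeholders, not the actual schema:

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the two layouts described above.
final class MatrixKeyLayouts {

  private static final byte[] FAMILY = Bytes.toBytes("m"); // placeholder family
  private static final byte[] VALUE_QUALIFIER = Bytes.toBytes("value");

  // Option 1: row index as the row key, column index as the qualifier.
  static Put putOption1(int row, int col, double value) {
    Put put = new Put(Bytes.toBytes(row));
    put.add(FAMILY, Bytes.toBytes(col), Bytes.toBytes(value));
    return put;
  }

  // Option 2: (row, column) combined as the row key, value in one column.
  static Put putOption2(int row, int col, double value) {
    Put put = new Put(Bytes.add(Bytes.toBytes(row), Bytes.toBytes(col)));
    put.add(FAMILY, VALUE_QUALIFIER, Bytes.toBytes(value));
    return put;
  }

  // Reading a single cell under option 1.
  static Get getOption1(int row, int col) {
    Get get = new Get(Bytes.toBytes(row));
    get.addColumn(FAMILY, Bytes.toBytes(col));
    return get;
  }
}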



On Wed, May 8, 2013 at 12:11 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 How did you store the matrix in HBase?


 On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan gkhn...@gmail.com wrote:

  Hi,
 
  For taking large matrices as input and persisting large models (like
 factor
  models), I created an HBase-backed version of Mahout matrix.
 
  It allows random access to cells and rows as well as assignment, and
  iteration over rows. viewRow returns a view, and lazy loads actual data
 if
  a get is actually invoked.
 
  I plan to add a VectorInputFormat on top of it, too.
 
  The code that we need to have for our algorithms is tested, but there are
  still parts of it that are not.
 
  I am going to speak about this at HBaseCon, and I wanted to let you know
  that it can be contributed after some refactoring.
 
  Is there any interest?
 
  --
  Gokhan
 




-- 
Gokhan


Re: HBase backed matrices

2013-05-07 Thread Gokhan Capan
Nope,

I simply thought that would make accessing and setting individual cells
more difficult.

Should I? Do you think it would perform better? I would also like to hear if
you have more design choices in mind.


On Wed, May 8, 2013 at 12:22 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Have you experimented with, for instance, row number as id, value as binary
 serialized vector?




 On Tue, May 7, 2013 at 2:16 PM, Gokhan Capan gkhn...@gmail.com wrote:

  2 options:
 
  1- row index as the row key, column index as column identifier, and value
  as value
  2- row index and column index combined as the row key, and value in a
  column called value
 
  Row indices are kept in a member variable in memory, to make iteration
  fast.
 
 
 
  On Wed, May 8, 2013 at 12:11 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   How did you store the matrix in HBase?
  
  
   On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan gkhn...@gmail.com
 wrote:
  
Hi,
   
For taking large matrices as input and persisting large models (like
   factor
models), I created an HBase-backed version of Mahout matrix.
   
It allows random access to cells and rows as well as assignment, and
iteration over rows. viewRow returns a view, and lazy loads actual
 data
   if
a get is actually invoked.
   
I plan to add a VectorInputFormat on top of it, too.
   
The code that we need to have for our algorithms is tested, but there
  are
still parts of it that are not.
   
I am going to speak about this at HBaseCon, and I wanted to let you
  know
that it can be contributed after some refactoring.
   
Is there any interest?
   
--
Gokhan
   
  
 
 
 
  --
  Gokhan
 




-- 
Gokhan


Re: HBase backed matrices

2013-05-07 Thread Gokhan Capan
So if the rows are small, a blob is probably better; and if they get larger I
can make blocks of blobs. I will experiment with this.
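
A minimal sketch of the blob option, serializing a whole row as a Mahout
VectorWritable so one cell (or one entry in a block of rows) holds the entire
row; writing the resulting bytes to HBase is left out:

import java.io.IOException;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch: encode/decode one matrix row as a byte[] blob.
final class RowBlobCodec {

  static byte[] encode(Vector row) throws IOException {
    DataOutputBuffer out = new DataOutputBuffer();
    new VectorWritable(row).write(out);
    byte[] bytes = new byte[out.getLength()]; // copy only the valid portion
    System.arraycopy(out.getData(), 0, bytes, 0, out.getLength());
    return bytes;
  }

  static Vector decode(byte[] bytes) throws IOException {
    DataInputBuffer in = new DataInputBuffer();
    in.reset(bytes, bytes.length);
    VectorWritable writable = new VectorWritable();
    writable.readFields(in);
    return writable.get();
  }
}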


On Wed, May 8, 2013 at 1:06 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 It really depends on your access patterns.

 Blob storage of rows will be much faster for scans and will take much less
 space.

 Column storage of values may or may not make things faster, but it is
 conceptually nicer to not have to update so much.  In practice, I am not
 convinced that you will notice the difference except for really big rows.

 Remember that you don't have to commit to a single choice. You could use a
 rolled-up representation most of the time and then break the rollups into
 regions as they get bigger.


 On Tue, May 7, 2013 at 2:32 PM, Gokhan Capan gkhn...@gmail.com wrote:

  Nope,
 
  I simply thought that would make accessing and setting individual cells
  more difficult.
 
  Should I? Do you think it would perform better? And I would want to hear
 if
  you have more design choices in your mind.
 
 
  On Wed, May 8, 2013 at 12:22 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   Have you experimented with, for instance, row number as id, value as
  binary
   serialized vector?
  
  
  
  
   On Tue, May 7, 2013 at 2:16 PM, Gokhan Capan gkhn...@gmail.com
 wrote:
  
2 options:
   
1- row index as the row key, column index as column identifier, and
  value
as value
2- row index and column index combined as the row key, and value in a
column called value
   
Row indices are kept in a member variable in memory, to make
 iteration
fast.
   
   
   
On Wed, May 8, 2013 at 12:11 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
   
 How did you store the matrix in HBase?


 On Tue, May 7, 2013 at 1:08 PM, Gokhan Capan gkhn...@gmail.com
   wrote:

  Hi,
 
  For taking large matrices as input and persisting large models
  (like
 factor
  models), I created an HBase-backed version of Mahout matrix.
 
  It allows random access to cells and rows as well as assignment,
  and
  iteration over rows. viewRow returns a view, and lazy loads
 actual
   data
 if
  a get is actually invoked.
 
  I plan to add a VectorInputFormat on top of it, too.
 
  The code that we need to have for our algorithms is tested, but
  there
are
  still parts of it that are not.
 
  I am going to speak about this at HBaseCon, and I wanted to let
 you
know
  that it can be contributed after some refactoring.
 
  Is there any interest?
 
  --
  Gokhan
 

   
   
   
--
Gokhan
   
  
 
 
 
  --
  Gokhan
 




-- 
Gokhan

