[jira] [Updated] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

2013-07-16 Thread Kun Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kun Yang updated MAHOUT-1273:
-

Attachment: PenalizedLinear.pdf

Draft

> Single Pass Algorithm for Penalized Linear Regression on MapReduce
> --
>
> Key: MAHOUT-1273
> URL: https://issues.apache.org/jira/browse/MAHOUT-1273
> Project: Mahout
>  Issue Type: New Feature
>Reporter: Kun Yang
> Attachments: PenalizedLinear.pdf
>
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Penalized linear regression methods such as the Lasso and Elastic Net are widely 
> used in machine learning, but there are no efficient, scalable implementations on 
> MapReduce.
> The published distributed algorithms for this problem are either iterative 
> (a poor fit for MapReduce; see Stephen Boyd's paper) or approximate 
> (insufficient when exact solutions are needed; see parallelized stochastic 
> gradient descent). Another disadvantage of these algorithms is that they cannot 
> do cross-validation in the training phase, so they require a user-specified 
> penalty parameter in advance. 
> My approach can train the model with cross-validation in a single pass. It is 
> based on some simple observations.
> I have implemented a primitive version of this algorithm at Alpine Data 
> Labs. Advanced features such as in-mapper combining are employed to reduce 
> network traffic in the shuffle phase.  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
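One standard way to make such training single-pass (a hedged sketch of the general technique, not necessarily the algorithm in the attached draft): mappers accumulate the sufficient statistics X'X and X'y over their split (this is where in-mapper combining pays off), a reducer sums the partials, and the small d-by-d penalized problem is solved locally; a whole grid of penalties for cross-validation can be solved from the same statistics. A plain-Java sketch without the Hadoop plumbing, using a hypothetical class name:

```java
/**
 * Hedged sketch of a single-pass scheme for penalized linear regression
 * (hypothetical class, not an existing Mahout job): mappers accumulate the
 * sufficient statistics X'X and X'y over their split, a reducer merges the
 * partials, and the small d x d penalized problem is solved locally.
 */
class OnePassPenalizedRegression {

  private final int d;
  final double[][] xtx;  // accumulated X'X, d x d
  final double[] xty;    // accumulated X'y, length d

  OnePassPenalizedRegression(int d) {
    this.d = d;
    this.xtx = new double[d][d];
    this.xty = new double[d];
  }

  /** Mapper side: fold in one row of the design matrix. */
  void accumulate(double[] x, double y) {
    for (int i = 0; i < d; i++) {
      xty[i] += x[i] * y;
      for (int j = 0; j < d; j++) {
        xtx[i][j] += x[i] * x[j];
      }
    }
  }

  /** Reducer side: the statistics are additive, so partials just sum up. */
  void merge(OnePassPenalizedRegression other) {
    for (int i = 0; i < d; i++) {
      xty[i] += other.xty[i];
      for (int j = 0; j < d; j++) {
        xtx[i][j] += other.xtx[i][j];
      }
    }
  }

  /**
   * Coordinate descent for the Lasso on the accumulated statistics,
   * minimizing 0.5 * ||y - Xb||^2 + lambda * ||b||_1. Any grid of
   * lambdas can be solved from the same single pass over the data.
   */
  double[] solveLasso(double lambda, int sweeps) {
    double[] b = new double[d];
    for (int it = 0; it < sweeps; it++) {
      for (int j = 0; j < d; j++) {
        double rho = xty[j];
        for (int k = 0; k < d; k++) {
          if (k != j) {
            rho -= xtx[j][k] * b[k];
          }
        }
        b[j] = softThreshold(rho, lambda) / xtx[j][j];
      }
    }
    return b;
  }

  private static double softThreshold(double z, double t) {
    return z > t ? z - t : (z < -t ? z + t : 0.0);
  }
}
```

Because X'X and X'y are additive over rows, each mapper can emit a single partial per split, which is exactly the in-mapper-combining pattern the description mentions.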


Re: [jira] [Created] (MAHOUT-1273) Single Pass Algorithm for Penalized Linear Regression on MapReduce

2013-07-16 Thread Michael Kun Yang
Hi all,

I noticed that in "kmeans" and "canopy", VectorWritable is used as the
mapper's input.

However, a dataframe format is more widely used for regression, e.g. .csv or
.txt with the first line giving the feature names (though sometimes there are
no feature names). VectorWritable is not applicable when feature names appear
in the first line.

Actually, .csv is the default data format used in Mahout SGD, so users
can select features by name. I plan to adopt this convention and also
extend it to allow users to specify interaction terms, as in R.

Any feedback?

Thanks
-Kun


On Wed, Jul 3, 2013 at 11:48 AM, Kun Yang (JIRA)  wrote:

> Kun Yang created MAHOUT-1273:
> 
>
>  Summary: Single Pass Algorithm for Penalized Linear
> Regression on MapReduce
>  Key: MAHOUT-1273
>  URL: https://issues.apache.org/jira/browse/MAHOUT-1273
>  Project: Mahout
>   Issue Type: New Feature
> Reporter: Kun Yang
>
>
> Penalized linear regression methods such as the Lasso and Elastic Net are widely
> used in machine learning, but there are no efficient, scalable implementations
> on MapReduce.
>
> The published distributed algorithms for this problem are either
> iterative (a poor fit for MapReduce; see Stephen Boyd's paper) or
> approximate (insufficient when exact solutions are needed; see parallelized
> stochastic gradient descent). Another disadvantage of these algorithms is that
> they cannot do cross-validation in the training phase, so they require a
> user-specified penalty parameter in advance.
>
> My approach can train the model with cross-validation in a single pass. It
> is based on some simple observations.
>
> I have implemented a primitive version of this algorithm at Alpine Data
> Labs. Advanced features such as in-mapper combining are employed to
> reduce network traffic in the shuffle phase.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
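For illustration, header-based feature selection with R-style interaction terms could look like the sketch below (hypothetical class name, not an existing Mahout API; assumes purely numeric columns):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch only (hypothetical class, not an existing Mahout API):
 * read feature names from a CSV header, select columns by name, and expand
 * an R-style interaction term "a:b" into the product of the two columns.
 * Assumes purely numeric columns.
 */
class NamedCsvEncoder {

  private final Map<String, Integer> columnIndex = new HashMap<>();

  NamedCsvEncoder(String headerLine) {
    String[] names = headerLine.split(",");
    for (int i = 0; i < names.length; i++) {
      columnIndex.put(names[i].trim(), i);
    }
  }

  /** Encode one data line into the requested features; "a:b" means a * b. */
  double[] encode(String line, List<String> features) {
    String[] cells = line.split(",");
    double[] v = new double[features.size()];
    for (int i = 0; i < features.size(); i++) {
      String f = features.get(i);
      if (f.contains(":")) {  // interaction term, as in R formulas
        String[] parts = f.split(":");
        v[i] = value(cells, parts[0]) * value(cells, parts[1]);
      } else {
        v[i] = value(cells, f);
      }
    }
    return v;
  }

  private double value(String[] cells, String name) {
    return Double.parseDouble(cells[columnIndex.get(name)].trim());
  }
}
```

The header-less case would just fall back to positional names (e.g. "c0", "c1", ...), which keeps the by-name selection path uniform.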


Re: Regarding Online Recommenders

2013-07-16 Thread Sebastian Schelter
Hi Abhishek,

Great to hear that you're willing to put some work into this! Have you
ever worked with Mahout's recommenders before? If not, then a good first
step would be to get familiar with them and code up a few examples.

Best,
Sebastian

On 17.07.2013 07:29, Abhishek Sharma wrote:
> Sorry to interrupt guys, but I just wanted to bring it to your notice that
> I am also interested in contributing to this idea. I am planning to
> participate in ASF-ICFOSS mentor-ship
> programme.
> (this is very similar to GSOC)
> 
> I do have strong concepts in machine learning (have done the ML course by
> Andrew NG on coursera) also, I am good in programming (have 2.5 yrs of work
> experience). I am not really sure of how can I approach this problem (but I
> do have a strong interest to work on this problem) hence would like to pair
> up on this. I am currently working as a research intern at Indian Institute
> of Science (IISc), Bangalore India and can put up 15-20 hrs per week.
> 
> Please let me know your thoughts if I can be a part of this.
> 
> Thanks & Regards,
> Abhishek Sharma
> http://www.linkedin.com/in/abhi21
> https://github.com/abhi21
> 
> 
> On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan  wrote:
> 
>> Peng,
>>
>> This is the reason I separated out the DataModel, and only put the learner
>> stuff there. The learner I mentioned yesterday just stores the
>> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
>> where preferences are stored.
>>
>> I, kind of, agree with the multi-level DataModel approach:
>> One for iterating over "all" preferences, one for if one wants to deploy a
>> recommender and perform a lot of top-N recommendation tasks.
>>
>> (Or one DataModel with a strategy that might reduce existing memory
>> consumption, while still providing fast access, I am not sure. Let me try a
>> matrix-backed DataModel approach)
>>
>> Gokhan
>>
>>
>> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter 
>> wrote:
>>
>>> I completely agree, Netflix is less than one gigabye in a smart
>>> representation, 12x more memory is a nogo. The techniques used in
>>> FactorizablePreferences allow a much more memory efficient
>> representation,
>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>> into
>>> 3GB with that approach.
>>>
>>>
>>> 2013/7/16 Ted Dunning 
>>>
 Netflix is a small dataset.  12G for that seems quite excessive.

 Note also that this is before you have done any work.

 Ideally, 100million observations should take << 1GB.

 On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng 
>>> wrote:

> The second idea is indeed splendid, we should separate
>> time-complexity
> first and space-complexity first implementation. What I'm not quite
>>> sure,
> is that if we really need to create two interfaces instead of one.
> Personally, I think 12G heap space is not that high right? Most new
 laptop
> can already handle that (emphasis on laptop). And if we replace hash
>>> map
> (the culprit of high memory consumption) with list/linkedList, it
>> would
> simply degrade time complexity for a linear search to O(n), not too
>> bad
> either. The current DataModel is a result of careful thoughts and has
> underwent extensive test, it is easier to expand on top of it instead
>>> of
> subverting it.

>>>
>>
> 
> 
> 
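For context on the 12GB-vs-1GB numbers in this thread, some back-of-the-envelope heap arithmetic (the per-object overheads are assumptions about a typical 64-bit JVM with compressed oops, not measurements of Mahout's actual classes):

```java
import java.util.Locale;

/**
 * Back-of-the-envelope heap arithmetic for the memory figures discussed
 * above. The per-object overheads are assumptions about a typical 64-bit
 * JVM with compressed oops, not measurements of Mahout's classes.
 */
class RatingMemoryEstimate {
  public static void main(String[] args) {
    long n = 100_000_000L;  // roughly Netflix-scale rating count

    // Parallel primitive arrays: int userID + int itemID + float rating.
    long compactBytes = n * (4 + 4 + 4);

    // One boxed preference object per rating (~16B header + two 8B fields)
    // plus roughly 48B of hash-map entry overhead; rough figures.
    long boxedBytes = n * (16 + 8 + 8 + 48);

    System.out.printf(Locale.US, "compact: %.1f GB, boxed: %.1f GB%n",
        compactBytes / 1e9, boxedBytes / 1e9);
    // prints: compact: 1.2 GB, boxed: 8.0 GB
  }
}
```

Sorting by user and delta-encoding IDs, or storing ratings as bytes, shrinks the compact layout further, consistent with the << 1GB target mentioned above; the boxed estimate shows how a hash-map-of-objects layout plausibly lands in the 8-12GB range.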



Re: Regarding Online Recommenders

2013-07-16 Thread Abhishek Sharma
Sorry to interrupt, guys, but I just wanted to bring to your notice that
I am also interested in contributing to this idea. I am planning to
participate in the ASF-ICFOSS mentorship programme
(which is very similar to GSoC).

I have a strong grounding in machine learning (I have done the ML course by
Andrew Ng on Coursera), and I am a good programmer (2.5 yrs of work
experience). I am not really sure how I can approach this problem (but I
do have a strong interest in working on it), hence I would like to pair
up on this. I am currently working as a research intern at the Indian Institute
of Science (IISc), Bangalore, India, and can put in 15-20 hrs per week.

Please let me know your thoughts on whether I can be a part of this.

Thanks & Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21


On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan  wrote:

> Peng,
>
> This is the reason I separated out the DataModel, and only put the learner
> stuff there. The learner I mentioned yesterday just stores the
> parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
> where preferences are stored.
>
> I, kind of, agree with the multi-level DataModel approach:
> One for iterating over "all" preferences, one for if one wants to deploy a
> recommender and perform a lot of top-N recommendation tasks.
>
> (Or one DataModel with a strategy that might reduce existing memory
> consumption, while still providing fast access, I am not sure. Let me try a
> matrix-backed DataModel approach)
>
> Gokhan
>
>
> On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter 
> wrote:
>
> > I completely agree, Netflix is less than one gigabye in a smart
> > representation, 12x more memory is a nogo. The techniques used in
> > FactorizablePreferences allow a much more memory efficient
> representation,
> > tested on KDD Music dataset which is approx 2.5 times Netflix and fits
> into
> > 3GB with that approach.
> >
> >
> > 2013/7/16 Ted Dunning 
> >
> > > Netflix is a small dataset.  12G for that seems quite excessive.
> > >
> > > Note also that this is before you have done any work.
> > >
> > > Ideally, 100million observations should take << 1GB.
> > >
> > > On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng 
> > wrote:
> > >
> > > > The second idea is indeed splendid, we should separate
> time-complexity
> > > > first and space-complexity first implementation. What I'm not quite
> > sure,
> > > > is that if we really need to create two interfaces instead of one.
> > > > Personally, I think 12G heap space is not that high right? Most new
> > > laptop
> > > > can already handle that (emphasis on laptop). And if we replace hash
> > map
> > > > (the culprit of high memory consumption) with list/linkedList, it
> would
> > > > simply degrade time complexity for a linear search to O(n), not too
> bad
> > > > either. The current DataModel is a result of careful thoughts and has
> > > > underwent extensive test, it is easier to expand on top of it instead
> > of
> > > > subverting it.
> > >
> >
>



-- 
--
Abhishek Sharma
ThoughtWorks


Build failed in Jenkins: mahout-nightly ยป Mahout Integration #1293

2013-07-16 Thread Apache Jenkins Server
See 


--
[INFO] 
[INFO] 
[INFO] Building Mahout Integration 0.9-SNAPSHOT
[INFO] 
[INFO] [INFO] Deleting 


[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ mahout-integration 
---
[INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
mahout-integration ---
[INFO] Copying 0 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
mahout-integration ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 131 source files to 

[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[WARNING] Note: 

 uses unchecked or unsafe operations.
[WARNING] Note: Recompile with -Xlint:unchecked for details.
[INFO] [INFO] Using 'UTF-8' encoding to copy filtered resources.

[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
mahout-integration ---
[INFO] Copying 10 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
mahout-integration ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 39 source files to 

[WARNING] Note: Some input files use or override a deprecated API.
[WARNING] Note: Recompile with -Xlint:deprecation for details.
[INFO] 
[INFO] --- maven-surefire-plugin:2.15:test (default-test) @ mahout-integration 
---
[INFO] Surefire report directory: 

[INFO] parallel='classes', perCoreThreadCount=false, threadCount=1, 
useUnlimitedThreads=false

---
 T E S T S
---

---
 T E S T S
---
Running 
org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarityTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.226 sec - in 
org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarityTest
Running org.apache.mahout.clustering.TestClusterEvaluator
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.02 sec - in 
org.apache.mahout.clustering.TestClusterEvaluator
Running org.apache.mahout.clustering.TestClusterDumper
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.003 sec - in 
org.apache.mahout.clustering.TestClusterDumper
Running org.apache.mahout.clustering.dirichlet.TestL1ModelClustering
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.919 sec - in 
org.apache.mahout.clustering.dirichlet.TestL1ModelClustering
Running org.apache.mahout.clustering.cdbw.TestCDbwEvaluator
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.096 sec - in 
org.apache.mahout.clustering.cdbw.TestCDbwEvaluator
Running org.apache.mahout.utils.TestConcatenateVectorsJob
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.456 sec - in 
org.apache.mahout.utils.TestConcatenateVectorsJob
Running org.apache.mahout.utils.vectors.lucene.LuceneIterableTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.571 sec - in 
org.apache.mahout.utils.vectors.lucene.LuceneIterableTest
Running org.apache.mahout.utils.vectors.lucene.DriverTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.92 sec - in 
org.apache.mahout.utils.vectors.lucene.DriverTest
Running org.apache.mahout.utils.vectors.lucene.CachedTermInfoTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.404 sec - in 
org.apache.mahout.utils.vectors.lucene.CachedTermInfoTest
Running org.apache.mahout.utils.vectors.csv.CSVVectorIteratorTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.226 sec - in 
org.apache.mahout.utils.vectors.csv.CSVVectorIteratorTest
Running org.apache.mahout.utils.vectors.VectorHelperTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.03 sec - in 
org.apache.mahout.utils.vectors.VectorHelperTest
Running org.apache.mahout.utils.vectors.io.VectorWriterTest
Tests run: 2

Build failed in Jenkins: mahout-nightly #1293

2013-07-16 Thread Apache Jenkins Server
See 

--
[...truncated 1692 lines...]
[INFO] 
[INFO] --- maven-jar-plugin:2.4:test-jar (default) @ mahout-core ---
[INFO] Building jar: 

[WARNING] Artifact org.apache.mahout:mahout-core:test-jar:tests:0.9-SNAPSHOT 
already attached to project, ignoring duplicate
[INFO] [INFO] Reading assembly descriptor: src/main/assembly/job.xml

[INFO] --- maven-assembly-plugin:2.4:single (job) @ mahout-core ---
[INFO] Building jar: 

[WARNING] Artifact org.apache.mahout:mahout-core:jar:job:0.9-SNAPSHOT already 
attached to project, ignoring duplicate
[INFO] 
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (attach-sources) @ mahout-core 
---
[WARNING] Artifact 
org.apache.mahout:mahout-core:java-source:sources:0.9-SNAPSHOT already attached 
to project, ignoring duplicate
[INFO] [INFO] Installing 

 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-SNAPSHOT.jar

[INFO] --- maven-install-plugin:2.4:install (default-install) @ mahout-core 
---[INFO] Installing 
 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-SNAPSHOT.pom

[INFO] Installing 

 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-SNAPSHOT-tests.jar
[INFO] Installing 

 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-SNAPSHOT-job.jar
[INFO] Installing 

 to 
/home/jenkins/jenkins-slave/maven-repositories/0/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-SNAPSHOT-sources.jar
[INFO] 
[INFO] Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/maven-metadata.xml
--- maven-deploy-plugin:2.5:deploy (default-deploy) @ mahout-core ---
Downloaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/maven-metadata.xml
 (2 KB at 5.2 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-20130716.232352-6.jar
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-20130716.232352-6.jar
 (1605 KB at 11376.0 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-20130716.232352-6.pom
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-20130716.232352-6.pom
 (7 KB at 142.8 KB/sec)
Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml
Downloaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml
 (382 B at 3.2 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/maven-metadata.xml
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/maven-metadata.xml
 (2 KB at 14.6 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/maven-metadata.xml
 (382 B at 3.4 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-20130716.232352-6-tests.jar
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/mahout-core-0.9-20130716.232352-6-tests.jar
 (2446 KB at 13078.8 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/maven-metadata.xml
Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT/maven-metadata.xml
 (2 KB at 24.9 KB/sec)
Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/mahout/mahout-core/0.9-SNAPSHOT

Build failed in Jenkins: Mahout-Quality #2147

2013-07-16 Thread Apache Jenkins Server
See 

--
[...truncated 197812 lines...]
Running org.apache.mahout.classifier.sgd.GradientMachineTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.169 sec - in 
org.apache.mahout.classifier.sgd.GradientMachineTest
Running org.apache.mahout.classifier.df.split.RegressionSplitTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.454 sec - in 
org.apache.mahout.classifier.df.split.RegressionSplitTest
Running org.apache.mahout.classifier.df.split.DefaultIgSplitTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.422 sec - in 
org.apache.mahout.classifier.df.split.DefaultIgSplitTest
Running org.apache.mahout.classifier.df.split.OptIgSplitTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.49 sec - in 
org.apache.mahout.classifier.df.split.OptIgSplitTest
Running org.apache.mahout.classifier.df.data.DatasetTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.425 sec - in 
org.apache.mahout.classifier.df.data.DatasetTest
Running org.apache.mahout.classifier.df.data.DescriptorUtilsTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.063 sec - in 
org.apache.mahout.classifier.df.data.DescriptorUtilsTest
Running org.apache.mahout.classifier.df.data.DataLoaderTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.862 sec - in 
org.apache.mahout.classifier.df.data.DataLoaderTest
Running org.apache.mahout.classifier.df.data.DataConverterTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.362 sec - in 
org.apache.mahout.classifier.df.data.DataConverterTest
Running org.apache.mahout.classifier.df.data.DataTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.783 sec - in 
org.apache.mahout.classifier.df.data.DataTest
Running org.apache.mahout.classifier.df.mapreduce.partial.Step1MapperTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.561 sec - in 
org.apache.mahout.classifier.df.mapreduce.partial.Step1MapperTest
Running org.apache.mahout.classifier.df.mapreduce.partial.TreeIDTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.274 sec - in 
org.apache.mahout.classifier.df.mapreduce.partial.TreeIDTest
Running org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.46 sec - in 
org.apache.mahout.classifier.df.mapreduce.partial.PartialBuilderTest
Running org.apache.mahout.classifier.df.mapreduce.inmem.InMemInputSplitTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec - in 
org.apache.mahout.classifier.df.mapreduce.inmem.InMemInputSplitTest
Running org.apache.mahout.classifier.df.mapreduce.inmem.InMemInputFormatTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.152 sec - in 
org.apache.mahout.classifier.df.mapreduce.inmem.InMemInputFormatTest
Running org.apache.mahout.classifier.df.builder.DecisionTreeBuilderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.091 sec - in 
org.apache.mahout.classifier.df.builder.DecisionTreeBuilderTest
Running org.apache.mahout.classifier.df.builder.DefaultTreeBuilderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.069 sec - in 
org.apache.mahout.classifier.df.builder.DefaultTreeBuilderTest
Running org.apache.mahout.classifier.df.builder.InfiniteRecursionTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.307 sec - in 
org.apache.mahout.classifier.df.builder.InfiniteRecursionTest
Running org.apache.mahout.classifier.df.DecisionForestTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.357 sec - in 
org.apache.mahout.classifier.df.DecisionForestTest
Running org.apache.mahout.classifier.df.node.NodeTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.037 sec - in 
org.apache.mahout.classifier.df.node.NodeTest
Running org.apache.mahout.classifier.df.tools.VisualizerTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.329 sec - in 
org.apache.mahout.classifier.df.tools.VisualizerTest
Running org.apache.mahout.classifier.RegressionResultAnalyzerTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.025 sec - in 
org.apache.mahout.classifier.RegressionResultAnalyzerTest
Running org.apache.mahout.classifier.ConfusionMatrixTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.045 sec - in 
org.apache.mahout.classifier.ConfusionMatrixTest
Running org.apache.mahout.classifier.naivebayes.NaiveBayesTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.248 sec - in 
org.apache.mahout.classifier.naivebayes.NaiveBayesTest
Running 
org.apache.mahout.classifier.naivebayes.ComplementaryNaiveBayesClassifierTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.061 sec - in 
org.apache.mahout.classifier.naivebayes.Complemen

Re: Regarding Online Recommenders

2013-07-16 Thread Gokhan Capan
Peng,

This is the reason I separated out the DataModel, and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
where preferences are stored.

I kind of agree with the multi-level DataModel approach:
one for iterating over "all" preferences, and one for when you want to deploy a
recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption, while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach)

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter  wrote:

> I completely agree, Netflix is less than one gigabye in a smart
> representation, 12x more memory is a nogo. The techniques used in
> FactorizablePreferences allow a much more memory efficient representation,
> tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
> 3GB with that approach.
>
>
> 2013/7/16 Ted Dunning 
>
> > Netflix is a small dataset.  12G for that seems quite excessive.
> >
> > Note also that this is before you have done any work.
> >
> > Ideally, 100million observations should take << 1GB.
> >
> > On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng 
> wrote:
> >
> > > The second idea is indeed splendid, we should separate time-complexity
> > > first and space-complexity first implementation. What I'm not quite
> sure,
> > > is that if we really need to create two interfaces instead of one.
> > > Personally, I think 12G heap space is not that high right? Most new
> > laptop
> > > can already handle that (emphasis on laptop). And if we replace hash
> map
> > > (the culprit of high memory consumption) with list/linkedList, it would
> > > simply degrade time complexity for a linear search to O(n), not too bad
> > > either. The current DataModel is a result of careful thoughts and has
> > > underwent extensive test, it is easier to expand on top of it instead
> of
> > > subverting it.
> >
>
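A minimal sketch of a learner along the lines Gokhan describes (illustrative names and a plain squared-loss SGD update, not the actual patch): it owns only the (noOfUsers+noOfItems)*noOfLatentFactors parameters and can be triggered once per preference update, independent of where the preferences are stored:

```java
import java.util.Random;

/**
 * Hedged sketch of the learner described above (illustrative names, plain
 * squared-loss SGD; not the actual implementation under discussion): it
 * stores only the latent factors, (numUsers + numItems) * numFactors
 * values, and does not care where preferences live.
 */
class OnlineFactorizationLearner {

  final double[][] userFactors;
  final double[][] itemFactors;
  final double learningRate = 0.01;
  final double regularization = 0.05;

  OnlineFactorizationLearner(int numUsers, int numItems, int numFactors) {
    Random rng = new Random(42);
    userFactors = randomMatrix(numUsers, numFactors, rng);
    itemFactors = randomMatrix(numItems, numFactors, rng);
  }

  private static double[][] randomMatrix(int rows, int cols, Random rng) {
    double[][] m = new double[rows][cols];
    for (double[] row : m) {
      for (int j = 0; j < cols; j++) {
        row[j] = 0.1 * rng.nextGaussian();  // small random initialization
      }
    }
    return m;
  }

  double predict(int user, int item) {
    double dot = 0;
    for (int f = 0; f < userFactors[user].length; f++) {
      dot += userFactors[user][f] * itemFactors[item][f];
    }
    return dot;
  }

  /** One online update per observed preference, e.g. from setPreference. */
  void train(int user, int item, double rating) {
    double err = rating - predict(user, item);
    double[] u = userFactors[user];
    double[] v = itemFactors[item];
    for (int f = 0; f < u.length; f++) {
      double uf = u[f];
      u[f] += learningRate * (err * v[f] - regularization * uf);
      v[f] += learningRate * (err * uf - regularization * v[f]);
    }
  }
}
```

Such a learner could be registered with an updatable DataModel so that each setPreference call feeds one train step, which is the delegation pattern discussed in this thread.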


[VOTE] Release Mahout 0.8

2013-07-16 Thread Grant Ingersoll
Applying a forcing function:

Please vote on releasing the 0.8 artifacts at 
https://repository.apache.org/content/repositories/orgapachemahout-113/org/apache/mahout/.
  

Release notes are at 
https://cwiki.apache.org/confluence/display/MAHOUT/Release+0.8

[] +1 Looks good
[] 0 - No opinion
[] -1 Don't release

Vote criteria from https://www.apache.org/dev/release.html

What are the ASF requirements on approving a release?
Votes on whether a package is ready to be released use majority approval -- 
i.e., at least three PMC members must vote affirmatively for release, and there 
must be more positive than negative votes. Releases may not be vetoed. Before 
voting +1 PMC members are required to download the signed source code package, 
compile it as provided, and test the resulting executable on their own 
platform, along with also verifying that the package meets the requirements of 
the ASF policy on releases.

Thanks,
Grant

Re: Regarding Online Recommenders

2013-07-16 Thread Sebastian Schelter
I completely agree, Netflix is less than one gigabyte in a smart
representation; 12x more memory is a no-go. The techniques used in
FactorizablePreferences allow a much more memory-efficient representation,
tested on the KDD Music dataset, which is approx. 2.5 times Netflix and fits into
3GB with that approach.


2013/7/16 Ted Dunning 

> Netflix is a small dataset.  12G for that seems quite excessive.
>
> Note also that this is before you have done any work.
>
> Ideally, 100million observations should take << 1GB.
>
> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng  wrote:
>
> > The second idea is indeed splendid, we should separate time-complexity
> > first and space-complexity first implementation. What I'm not quite sure,
> > is that if we really need to create two interfaces instead of one.
> > Personally, I think 12G heap space is not that high right? Most new
> laptop
> > can already handle that (emphasis on laptop). And if we replace hash map
> > (the culprit of high memory consumption) with list/linkedList, it would
> > simply degrade time complexity for a linear search to O(n), not too bad
> > either. The current DataModel is a result of careful thoughts and has
> > underwent extensive test, it is easier to expand on top of it instead of
> > subverting it.
>


Jenkins build is back to normal : Mahout-Examples-Cluster-Reuters-II #544

2013-07-16 Thread Apache Jenkins Server
See 



Re: Regarding Online Recommenders

2013-07-16 Thread Ted Dunning
Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100 million observations should take << 1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng  wrote:

> The second idea is indeed splendid, we should separate time-complexity
> first and space-complexity first implementation. What I'm not quite sure,
> is that if we really need to create two interfaces instead of one.
> Personally, I think 12G heap space is not that high right? Most new laptop
> can already handle that (emphasis on laptop). And if we replace hash map
> (the culprit of high memory consumption) with list/linkedList, it would
> simply degrade time complexity for a linear search to O(n), not too bad
> either. The current DataModel is a result of careful thoughts and has
> underwent extensive test, it is easier to expand on top of it instead of
> subverting it.


Re: Regarding Online Recommenders

2013-07-16 Thread Peng Cheng
Yeah, setPreference() and removePreference() shouldn't be there, but 
injecting the Recommender back into the DataModel is a strong dependency, 
which may intermingle components with different concerns. Maybe we can do 
something with the RefreshHelper class? E.g., push something into a swap field 
so the downstream of a refreshable chain can read it out. I have read 
Gokhan's UpdateAwareDataModel, and feel that it's probably too 
heavyweight for a model selector, as every time he changes the algorithm 
he has to re-register it.


The second idea is indeed splendid: we should separate time-complexity-first 
and space-complexity-first implementations. What I'm not quite 
sure about is whether we really need to create two interfaces instead of one. 
Personally, I think 12G of heap space is not that high, right? Most new 
laptops can already handle that (emphasis on laptop). And if we replace the 
hash map (the culprit of the high memory consumption) with a list/linked list, 
it would simply degrade the time complexity of a lookup to an O(n) linear 
search, not too bad either. The current DataModel is the result of careful 
thought and has undergone extensive testing; it is easier to expand on top 
of it than to subvert it.


All the best,
Yours Peng

On 13-07-16 01:05 AM, Sebastian Schelter wrote:

Hi Gokhan,

I like your proposals and I think this is an important discussion. Peng
is also interested in working on online recommenders, so we should try
to team up our efforts. I'd like to extend the discussion a little to
related API changes that I think are necessary.

What do you think about completely removing the setPreference() and
removePreference() methods from Recommender? I think they don't belong
there for two reasons: First, they duplicate functionality from
DataModel and second, a lot of recommenders are read-only/train-once and
cannot handle single preference updates anyway.

I think we should have a DataModel implementation that can be updated
and an online learning recommender should be able to register to be
notified with updates.

We should furthermore split up the DataModel interface into a hierarchy
of three parts:

First, a simple readonly interface that allows sequential access to the
data (similar to FactorizablePreferences). This allows us to create
memory efficient implementations. E.g. Cheng reported in MAHOUT-1272
that the current DataModel needs 12GB heap for the Netflix dataset (100M
ratings) which is unacceptable. I was able to fit the KDD Music dataset
(250M ratings) into 3GB with FactorizablePreferences.

The second interface would extend the readonly interface and should
resemble what DataModel is today: An easy-to-use in-memory
implementation that trades high memory consumption for convenient random
access.

And finally the third interface would extend the second and provide
tooling for online updates of the data.

What do you think of that? Does it sound reasonable?

--sebastian
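The three-level hierarchy sketched above could look like the following interfaces (names are illustrative, not committed Mahout API), with a toy map-backed implementation of the updatable level:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the three-level hierarchy proposed above (names illustrative,
 * not committed Mahout interfaces), plus a toy map-backed implementation
 * of the updatable level.
 */
interface SequentialPreferences {                  // level 1: lean, read-only
  int numPreferences();
}

interface RandomAccessPreferences extends SequentialPreferences {  // level 2
  Float getPreference(long userID, long itemID);   // convenient random access
}

interface UpdatablePreferences extends RandomAccessPreferences {   // level 3
  void setPreference(long userID, long itemID, float value);
  void addListener(PreferenceUpdateListener listener);
}

/** Online learners register here to be notified of single updates. */
interface PreferenceUpdateListener {
  void preferenceChanged(long userID, long itemID, float value);
}

class MemoryPreferences implements UpdatablePreferences {

  private final Map<Long, Float> prefs = new HashMap<>();
  private final List<PreferenceUpdateListener> listeners = new ArrayList<>();

  private static long key(long userID, long itemID) {
    return (userID << 32) | itemID;  // assumes IDs fit in 32 bits
  }

  @Override
  public int numPreferences() {
    return prefs.size();
  }

  @Override
  public Float getPreference(long userID, long itemID) {
    return prefs.get(key(userID, itemID));
  }

  @Override
  public void setPreference(long userID, long itemID, float value) {
    prefs.put(key(userID, itemID), value);
    for (PreferenceUpdateListener l : listeners) {
      l.preferenceChanged(userID, itemID, value);
    }
  }

  @Override
  public void addListener(PreferenceUpdateListener listener) {
    listeners.add(listener);
  }
}
```

A memory-lean level-1 implementation would back the same interface with parallel primitive arrays instead of a map; an online recommender would register a PreferenceUpdateListener and retrain incrementally on each notification.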



The DataModel I imagine would follow the current API, where underlying
preference storage is replaced with a matrix.

A Recommender would then use the DataModel and the OnlineLearner, where
Recommender#setPreference is delegated to DataModel#setPreference (like it
does now), and DataModel#setPreference triggers OnlineLearner#train.