Re: Codebase refactoring proposal

2015-02-03 Thread Dmitriy Lyubimov
I'd suggest considering this: remember all the talk about language-integrated Spark SQL being basically a DataFrame manipulation DSL? Now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD to DataFrame and make it a mainstream data structure. (my "told

Re: Codebase refactoring proposal

2015-02-03 Thread Pat Ferrel
Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in rdds. The dictionaries and counts could be either in-memory maps or rdds for use with joins? This would get rid of sequence files completely from the pipeline.
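
A rough illustration of the idea above, not Mahout's seq2sparse and not Spark code: the sketch below uses plain Python to go from raw text to sparse TF-IDF vectors while keeping the dictionary and document-frequency counts as in-memory maps. The function names (`tokenize`, `build_dictionary`, `tfidf_vectors`), the whitespace tokenizer, and the `tf * log(N / df)` weighting are all assumptions chosen for illustration; in a real pipeline the lists would be RDDs and the maps could be broadcast or joined.

```python
import math
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # A real pipeline would plug in a proper analyzer here.
    return text.lower().split()

def build_dictionary(docs):
    """Map each term to an integer index and count document frequencies."""
    dictionary = {}
    doc_freq = Counter()
    for tokens in docs:
        # Count each term once per document for the document frequency.
        for term in set(tokens):
            doc_freq[term] += 1
        # Assign indices in order of first appearance.
        for term in tokens:
            dictionary.setdefault(term, len(dictionary))
    return dictionary, doc_freq

def tfidf_vectors(docs, dictionary, doc_freq):
    """Return one sparse vector (dict of term index -> weight) per document."""
    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({
            dictionary[term]: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors

texts = ["spark makes rdds", "mahout runs on spark", "rdds replace sequence files"]
docs = [tokenize(t) for t in texts]
dictionary, doc_freq = build_dictionary(docs)
vectors = tfidf_vectors(docs, dictionary, doc_freq)
```

Nothing here touches sequence files: the text goes in as strings and comes out as sparse vectors, with the dictionary and counts held in ordinary maps along the way.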

Re: Codebase refactoring proposal

2015-02-03 Thread Andrew Palumbo
On 02/03/2015 12:44 PM, Andrew Palumbo wrote: On 02/03/2015 12:22 PM, Pat Ferrel wrote: Some issues WRT lower level Spark integration: 1) interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they ha

Re: Extending spark-itemsimilarity for calculation multiple cross-indicators

2015-02-03 Thread Pat Ferrel
BTW if you want to try it out quickly the CLI can be run for each pair. This recalculates A’A multiple times but requires less node memory and no code changes. Run it once for every A & B input where B is one of the secondary actions. On Feb 3, 2015, at 12:33 PM, Pat Ferrel wrote: Yes, full

Re: Extending spark-itemsimilarity for calculation multiple cross-indicators

2015-02-03 Thread Pat Ferrel
Yes, multiple cross-cooccurrence is fully supported by the API. Whether you write your own app/driver or use the shell, you can pass in as many inputs as you need. The driver CLI is already too complicated. Passing a script to the shell doesn’t require you to go through creating a pro

Re: Extending spark-itemsimilarity for calculation multiple cross-indicators

2015-02-03 Thread Dmitriy Lyubimov
PS: to run the Mahout shell, one can use MASTER= mahout/bin spark-shell. The syntax to load scripts is retained from the Scala shell. Ideally one also needs stuff like MAHOUT_OPTS=-Xmx5G, but as I mentioned it is broken right now; you can do a quick hack. On Tue, Feb 3, 2015 at 12:06 PM, Dmitriy Lyubimov wrot

Re: Extending spark-itemsimilarity for calculation multiple cross-indicators

2015-02-03 Thread Dmitriy Lyubimov
On Tue, Feb 3, 2015 at 11:57 AM, Олег Зотов wrote: > Hello. > I am developing a recommendation system and use Mahout on Spark (1.0 snapshot). In > the process I found that the spark-itemsimilarity driver does not allow > processing more than two action types. After reading the documentation, I > found t

Extending spark-itemsimilarity for calculation multiple cross-indicators

2015-02-03 Thread Олег Зотов
Hello. I am developing a recommendation system and use Mahout on Spark (1.0 snapshot). In the process I found that the spark-itemsimilarity driver does not allow processing more than two action types. After reading the documentation, I found that I should run it multiple times or use SimilarityAnalysis.c

Re: Codebase refactoring proposal

2015-02-03 Thread Andrew Palumbo
On 02/03/2015 12:22 PM, Pat Ferrel wrote: Some issues WRT lower level Spark integration: 1) interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance. 2) wider acceptance of Mahout D

Jenkins build is back to normal : Mahout-Quality #2949

2015-02-03 Thread Apache Jenkins Server
See

Re: Codebase refactoring proposal

2015-02-03 Thread Pat Ferrel
Some issues WRT lower level Spark integration: 1) interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance. 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to me whe

Re: Codebase refactoring proposal

2015-02-03 Thread Dmitriy Lyubimov
On Feb 3, 2015 7:20 AM, "Pat Ferrel" wrote: > > BTW what level of difficulty would making the DSL run on MLlib Vectors and RowMatrix be? Looking at using their hashing TF-IDF but it raises impedance mismatch between DRM and MLlib RowMatrix. This would further reduce artifact size by a bunch. Shor

Re: Codebase refactoring proposal

2015-02-03 Thread Dmitriy Lyubimov
But first I need to do massive fixes and improvements to the distributed optimizer itself. Still waiting on green light for that. On Feb 3, 2015 8:45 AM, "Dmitriy Lyubimov" wrote: > > On Feb 3, 2015 7:20 AM, "Pat Ferrel" wrote: > > > > BTW what level of difficulty would making the DSL run on MLl

Re: Codebase refactoring proposal

2015-02-03 Thread Andrew Palumbo
Pat, I don't know if this would be useful, but I was looking at porting our TF-IDF classes over from MRLegacy. They're pretty simple and basically just wrapper classes for a Lucene analyzer, but they require a Lucene dependency. I'm not sure if we want a Lucene dependency in math-scala. This was

Re: Codebase refactoring proposal

2015-02-03 Thread Pat Ferrel
BTW what level of difficulty would making the DSL run on MLlib Vectors and RowMatrix be? Looking at using their hashing TF-IDF but it raises impedance mismatch between DRM and MLlib RowMatrix. This would further reduce artifact size by a bunch. Also backing something like a DRM with DStreams. P

[jira] [Commented] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks

2015-02-03 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303257#comment-14303257 ] ASF GitHub Bot commented on MAHOUT-1626: Github user gcapan commented on the pull

[jira] [Commented] (MAHOUT-1626) Support for required quasi-algebraic operations and starting with aggregating rows/blocks

2015-02-03 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303214#comment-14303214 ] ASF GitHub Bot commented on MAHOUT-1626: Github user gcapan commented on a diff i