[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x

2014-08-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086989#comment-14086989
 ] 

ASF GitHub Bot commented on MAHOUT-1603:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/40#issuecomment-51276205
  
also, tests run much slower although the CPU remains unsaturated. Something 
about setting up and tearing down the local Spark context?
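
One common mitigation for that kind of slowdown (a sketch only; this is not 
Mahout's actual test harness, and `LocalContext` is a stand-in rather than a 
real Spark class) is to build the expensive context once, lazily, and share it 
across suites instead of paying setup/teardown per suite:

```scala
// Sketch of a shared-fixture pattern: the expensive context is created at
// most once and reused, instead of being rebuilt for every suite.
object SharedContext {
  var creations = 0                      // visible only so the demo can count

  // Stand-in for an expensive resource such as a local Spark context.
  final class LocalContext {
    creations += 1
    def stop(): Unit = ()
  }

  // Lazily created on first access, then shared by every caller.
  lazy val context: LocalContext = new LocalContext
}

object SharedContextDemo {
  def main(args: Array[String]): Unit = {
    // Two "suites" asking for a context get the same instance.
    val a = SharedContext.context
    val b = SharedContext.context
    println(s"same instance: ${a eq b}, creations: ${SharedContext.creations}")
  }
}
```

With ScalaTest this would typically hang off a shared trait or 
`BeforeAndAfterAll`; the point is simply that the context is built once.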


> Tweaks for Spark 1.0.x 
> ---
>
> Key: MAHOUT-1603
> URL: https://issues.apache.org/jira/browse/MAHOUT-1603
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> Tweaks necessary to run the current codebase on top of Spark 1.0.x



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x

2014-08-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086975#comment-14086975
 ] 

ASF GitHub Bot commented on MAHOUT-1603:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/40#issuecomment-51274951
  
itemsimilarity driver stuff is failing on this. 

ItemSimilarityDriverSuite:
113754 [ScalaTest-main-running-ItemSimilarityDriverSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtA$  - Applying slim A'A.
114171 [ScalaTest-main-running-ItemSimilarityDriverSuite] DEBUG 
org.apache.mahout.sparkbindings.blas.AtB$  - A and B for A'B are not 
identically partitioned, performing inner join.
- ItemSimilarityDriver, non-full-spec CSV *** FAILED ***
  Set(
    iphone   galaxy:1.7260924347106847,iphone:1.7260924347106847,ipad:0.6795961471815897,nexus:0.6795961471815897,
    surface  surface:4.498681156950466,
    nexus    iphone:1.7260924347106847,ipad:0.6795961471815897,surface:0.6795961471815897,nexus:0.6795961471815897,galaxy:1.7260924347106847,
    ipad     galaxy:1.7260924347106847,iphone:1.7260924347106847,ipad:0.6795961471815897,nexus:0.6795961471815897,
    galaxy   galaxy:1.7260924347106847,iphone:1.7260924347106847,ipad:0.6795961471815897,nexus:0.6795961471815897
  ) did not equal Set(
    nexus    nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,surface:0.6795961471815897,galaxy:1.7260924347106847,
    ipad     nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,galaxy:1.7260924347106847,
    surface  surface:4.498681156950466,
    iphone   nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,galaxy:1.7260924347106847,
    galaxy   nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,galaxy:1.7260924347106847
  ) (ItemSimilarityDriverSuite.scala:142)


the rest seems to pass
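
Note the two Sets in the failure contain the same rows and the same scores; 
only the serialization order of the item:score pairs within each row differs. 
A sketch of an order-insensitive comparison (the tab/comma layout and every 
name here are assumptions read off the log, not the suite's real helpers):

```scala
object SimOutput {
  // Parse "rowID<TAB>item:score,item:score,..." lines into nested maps so
  // two outputs can be compared regardless of pair ordering.
  def parse(lines: Set[String]): Map[String, Map[String, Double]] =
    lines.map { line =>
      val parts  = line.split("\t", 2)
      val scores = parts(1).split(",").map { tok =>
        val kv = tok.split(":")
        kv(0).trim -> kv(1).toDouble
      }.toMap
      parts(0).trim -> scores
    }.toMap
}
```

Asserting `parse(actual) == parse(expected)` would make the test insensitive 
to nondeterministic pair order.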


> Tweaks for Spark 1.0.x 
> ---
>
> Key: MAHOUT-1603
> URL: https://issues.apache.org/jira/browse/MAHOUT-1603
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> Tweaks necessary to run the current codebase on top of Spark 1.0.x





Re: standardizing minimal Matrix I/O capability

2014-08-05 Thread Pat Ferrel
Oh, and how about calling a single value from a matrix an "Element" as we do in 
Vector.Element? This only applies to naming the reader functions "readElements" 
or some derivative.

Sent from my iPhone

> On Aug 5, 2014, at 8:34 AM, Pat Ferrel  wrote:
> 
> The benefit of your read/write is that there are no dictionaries to take up 
> memory. This is an optimization that I haven’t done yet. The purpose of mine 
> was specifically to preserve external/non-Mahout IDs. So yours is more like 
> drm.writeDrm, which writes seqfiles (also sc.readDrm). 
> 
> The benefit of the stuff currently in mahout.drivers in the Spark module is 
> that even in a pipeline it will preserve external IDs or use Mahout 
> sequential Int keys as requested. The downside is that it requires a Schema, 
> though there are several default ones defined (in the PR) that would support 
> your exact use case. And it is not yet optimized for use without 
> dictionaries. 
> 
> How should we resolve the overlap? Pragmatically, if you were to merge your 
> code I could call it in the case where I don’t need dictionaries, solving my 
> optimization issue, but this will result in some duplicated code. Not sure if 
> this is a problem. Maybe if yours took a Schema, defaulting to the one that we 
> agree has the correct delimiters?
> 
> The stuff in drivers does not read a text drm yet. That will be part of 
> MAHOUT-1604
> 
> On Aug 4, 2014, at 8:32 AM, Pat Ferrel  wrote:
> 
> This is great. We should definitely talk. What I’ve done is a first cut at a 
> data prep pipeline. It takes DRMs or cells and creates an RDD-backed DRM, but 
> it also maintains dictionaries so external IDs can be preserved and 
> re-attached when written, after any math or algo is done. It also has driver 
> and option processing stuff.
> 
> No hard-coded “,”, you’d get that by using the default file schema but the 
> user can change it if they want. This is especially useful for using existing 
> files like log files as input, where appropriate. It’s also the beginnings of 
> writing to DBs, since the Schema class is pretty flexible; it can contain DB 
> connections and schema info. Was planning to put some in an example dir. I 
> need Mongo but have also done Cassandra in a previous life.
> 
> I like some of your nomenclature better and agree that cells and DRMs are the 
> primary data types to read. I am working on reading DRMs now for a Spark RSJ 
> (1541 is itemsimilarity). So I may use part of your code but add the schema to 
> it and use dictionaries to preserve application specific IDs. It’s tied to 
> RDD textFile so is parallel for input and output.
> 
> MAHOUT-1541 is already merged, maybe we can find a way to get this stuff 
> together. 
> 
> Thanks to Comcast I only have internet in Starbucks so be patient. 
> 
> On Aug 4, 2014, at 1:30 AM, Gokhan Capan  wrote:
> 
> Pat,
> 
> I was thinking of something like:
> https://github.com/gcapan/mahout/compare/cellin
> 
> It's just an example of where I believe new input formats should go (the
> example is to input a DRM from a text file of  lines).
> 
> Best
> 
> 
> Gokhan
> 
> 
>> On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel  wrote:
>> 
>> Some work on this is being done as part of MAHOUT-1568, which is currently
>> very early and in https://github.com/apache/mahout/pull/36
>> 
>> The idea there only covers text-delimited files and proposes a standard
>> DRM-ish format but supports a configurable schema. Default is:
>> 
>> rowIDitemID1:value1itemID2:value2…
>> 
>> The IDs can be mahout keys of any type since they are written as text or
>> they can be application specific IDs meaningful in a particular usage, like
>> a user ID hash, or SKU from a catalog, or URL.
>> 
>> As far as dataframe-ish requirements, it seems to me there are two
>> different things needed. The dataframe is needed while performing an
>> algorithm or calculation and is kept in distributed data structures. There
>> probably won’t be a lot of files kept around with the new engines. Any text
>> files can be used for pipelines in a pinch but generally would be for
>> import/export. Therefore MAHOUT-1568 concentrates on import/export not
>> dataframes, though it could use them when they are ready.
>> 
>> 
>> On Jul 30, 2014, at 7:53 AM, Gokhan Capan 
>> wrote:
>> 
>> I believe the next step should be standardizing minimal Matrix I/O
>> capability (i.e. a couple file formats other than [row_id, VectorWritable]
>> SequenceFiles) required for a distributed computation engine, and adding
>> data frame like structures that allow text columns.
> 
> 


Re: standardizing minimal Matrix I/O capability

2014-08-05 Thread Pat Ferrel
The benefit of your read/write is that there are no dictionaries to take up 
memory. This is an optimization that I haven’t done yet. The purpose of mine 
was specifically to preserve external/non-Mahout IDs. So yours is more like 
drm.writeDrm, which writes seqfiles (also sc.readDrm). 

The benefit of the stuff currently in mahout.drivers in the Spark module is 
that even in a pipeline it will preserve external IDs or use Mahout sequential 
Int keys as requested. The downside is that it requires a Schema, though there 
are several default ones defined (in the PR) that would support your exact use 
case. And it is not yet optimized for use without dictionaries. 
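
The dictionary idea can be sketched roughly like this (illustrative names 
only, not the classes in the PR): assign each external ID a sequential Int key 
on first sight, and keep the reverse mapping so the original IDs can be 
re-attached on write.

```scala
import scala.collection.mutable

// Sketch: external (application) IDs <-> sequential Mahout Int keys.
final class IdDictionary {
  private val toKey = mutable.LinkedHashMap.empty[String, Int]

  // Returns the existing key, or assigns the next sequential one.
  def keyOf(externalId: String): Int =
    toKey.getOrElseUpdate(externalId, toKey.size)

  // Reverse lookup for re-attaching external IDs when writing results.
  // Works because keys are assigned in insertion order, 0..n-1.
  def externalIdOf(key: Int): String = toKey.keysIterator.drop(key).next()
}
```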

How should we resolve the overlap? Pragmatically, if you were to merge your code 
I could call it in the case where I don’t need dictionaries, solving my 
optimization issue, but this will result in some duplicated code. Not sure if 
this is a problem. Maybe if yours took a Schema, defaulting to the one that we 
agree has the correct delimiters?

The stuff in drivers does not read a text drm yet. That will be part of 
MAHOUT-1604

On Aug 4, 2014, at 8:32 AM, Pat Ferrel  wrote:

This is great. We should definitely talk. What I’ve done is a first cut at a 
data prep pipeline. It takes DRMs or cells and creates an RDD-backed DRM, but it 
also maintains dictionaries so external IDs can be preserved and re-attached 
when written, after any math or algo is done. It also has driver and option 
processing stuff.

No hard-coded “,”, you’d get that by using the default file schema but the user 
can change it if they want. This is especially useful for using existing files 
like log files as input, where appropriate. It’s also the beginnings of writing 
to DBs, since the Schema class is pretty flexible; it can contain DB connections 
and schema info. Was planning to put some in an example dir. I need Mongo but 
have also done Cassandra in a previous life.

I like some of your nomenclature better and agree that cells and DRMs are the 
primary data types to read. I am working on reading DRMs now for a Spark RSJ 
(1541 is itemsimilarity). So I may use part of your code but add the schema to 
it and use dictionaries to preserve application specific IDs. It’s tied to RDD 
textFile so is parallel for input and output.

MAHOUT-1541 is already merged, maybe we can find a way to get this stuff 
together. 

Thanks to Comcast I only have internet in Starbucks so be patient. 

On Aug 4, 2014, at 1:30 AM, Gokhan Capan  wrote:

Pat,

I was thinking of something like:
https://github.com/gcapan/mahout/compare/cellin

It's just an example of where I believe new input formats should go (the
example is to input a DRM from a text file of  lines).

Best


Gokhan


On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel  wrote:

> Some work on this is being done as part of MAHOUT-1568, which is currently
> very early and in https://github.com/apache/mahout/pull/36
> 
> The idea there only covers text-delimited files and proposes a standard
> DRM-ish format but supports a configurable schema. Default is:
> 
> rowIDitemID1:value1itemID2:value2…
> 
> The IDs can be mahout keys of any type since they are written as text or
> they can be application specific IDs meaningful in a particular usage, like
> a user ID hash, or SKU from a catalog, or URL.
> 
> As far as dataframe-ish requirements, it seems to me there are two
> different things needed. The dataframe is needed while performing an
> algorithm or calculation and is kept in distributed data structures. There
> probably won’t be a lot of files kept around with the new engines. Any text
> files can be used for pipelines in a pinch but generally would be for
> import/export. Therefore MAHOUT-1568 concentrates on import/export not
> dataframes, though it could use them when they are ready.
> 
> 
> On Jul 30, 2014, at 7:53 AM, Gokhan Capan 
> wrote:
> 
> I believe the next step should be standardizing minimal Matrix I/O
> capability (i.e. a couple file formats other than [row_id, VectorWritable]
> SequenceFiles) required for a distributed computation engine, and adding
> data frame like structures that allow text columns.
> 
> 
> 




[jira] [Created] (MAHOUT-1604) Create a RowSimilarity for Spark

2014-08-05 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1604:
--

 Summary: Create a RowSimilarity for Spark
 Key: MAHOUT-1604
 URL: https://issues.apache.org/jira/browse/MAHOUT-1604
 Project: Mahout
  Issue Type: Bug
  Components: CLI
Affects Versions: 1.0
 Environment: Spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel


Using CooccurrenceAnalysis.cooccurrence, create a driver that reads a text DRM 
or two and produces LLR similarity/cross-similarity matrices.

This will produce the same results as ItemSimilarity but take a Drm as input 
instead of individual cells.

The first version will only support LLR; other similarity measures will need to 
be in separate JIRAs.
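
For reference, LLR here is Ted Dunning's G² statistic over a 2x2 cooccurrence 
contingency table. A sketch in the spirit of Mahout's LogLikelihood helper 
(this is not the driver code itself): k11 counts rows (e.g. users) containing 
both items, k12/k21 one but not the other, k22 neither.

```scala
// Log-likelihood ratio (G^2) for a 2x2 contingency table, after Dunning.
object Llr {
  private def xLogX(x: Long): Double =
    if (x == 0) 0.0 else x * math.log(x.toDouble)

  // Unnormalized entropy: N*H of the given cell counts.
  private def entropy(elements: Long*): Double =
    xLogX(elements.sum) - elements.map(xLogX).sum

  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    if (rowEntropy + columnEntropy < matrixEntropy) 0.0 // guard against rounding
    else 2.0 * (rowEntropy + columnEntropy - matrixEntropy)
  }
}
```

Independent cooccurrence (e.g. all four cells equal) scores near zero, while 
strong association scores high; larger counts at the same association strength 
score higher.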


