[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086989#comment-14086989 ]

ASF GitHub Bot commented on MAHOUT-1603:

GitHub user dlyubimov commented on the pull request:
https://github.com/apache/mahout/pull/40#issuecomment-51276205

Also, tests run much slower although the CPU remains unsaturated. Something about setting up and tearing down the local Spark context?

> Tweaks for Spark 1.0.x
> ----------------------
>
> Key: MAHOUT-1603
> URL: https://issues.apache.org/jira/browse/MAHOUT-1603
> Project: Mahout
> Issue Type: Task
> Affects Versions: 0.9
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
> Tweaks necessary to run the current codebase on top of Spark 1.0.x

--
This message was sent by Atlassian JIRA (v6.2#6252)
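The slowdown suspected here, building and tearing down a local Spark context once per suite, is commonly worked around by sharing one context across all suites. A minimal plain-Scala sketch of the pattern (the String stands in for a real SparkContext, and all names here are hypothetical, not Mahout's actual test code):

```scala
// Hypothetical sketch: share one expensive fixture (e.g. a local SparkContext)
// across suites instead of creating and destroying it per suite.
object SharedFixture {
  var creations = 0 // counts how many times the fixture was actually built

  // A lazy val is built once, on first access; every later suite reuses it.
  lazy val context: String = {
    creations += 1
    "local-context" // stand-in for e.g. new SparkContext("local", "tests")
  }
}

trait UsesSharedFixture {
  def context: String = SharedFixture.context
}

class SuiteA extends UsesSharedFixture
class SuiteB extends UsesSharedFixture
```

Because the fixture is created lazily in a shared object, the per-suite setup/teardown cost is paid only once for the whole run, which is the usual fix when CPUs sit idle during test bookkeeping.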
[jira] [Commented] (MAHOUT-1603) Tweaks for Spark 1.0.x
[ https://issues.apache.org/jira/browse/MAHOUT-1603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086975#comment-14086975 ]

ASF GitHub Bot commented on MAHOUT-1603:

GitHub user dlyubimov commented on the pull request:
https://github.com/apache/mahout/pull/40#issuecomment-51274951

The itemsimilarity driver stuff is failing on this.

ItemSimilarityDriverSuite:
113754 [ScalaTest-main-running-ItemSimilarityDriverSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtA$ - Applying slim A'A.
114171 [ScalaTest-main-running-ItemSimilarityDriverSuite] DEBUG org.apache.mahout.sparkbindings.blas.AtB$ - A and B for A'B are not identically partitioned, performing inner join.
- ItemSimilarityDriver, non-full-spec CSV *** FAILED ***
  Set(
    iphone  galaxy:1.7260924347106847,iphone:1.7260924347106847,ipad:0.6795961471815897,nexus:0.6795961471815897,
    surface  surface:4.498681156950466,
    nexus  iphone:1.7260924347106847,ipad:0.6795961471815897,surface:0.6795961471815897,nexus:0.6795961471815897,galaxy:1.7260924347106847,
    ipad  galaxy:1.7260924347106847,iphone:1.7260924347106847,ipad:0.6795961471815897,nexus:0.6795961471815897,
    galaxy  galaxy:1.7260924347106847,iphone:1.7260924347106847,ipad:0.6795961471815897,nexus:0.6795961471815897)
  did not equal
  Set(
    nexus  nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,surface:0.6795961471815897,galaxy:1.7260924347106847,
    ipad  nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,galaxy:1.7260924347106847,
    surface  surface:4.498681156950466,
    iphone  nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,galaxy:1.7260924347106847,
    galaxy  nexus:0.6795961471815897,iphone:1.7260924347106847,ipad:0.6795961471815897,galaxy:1.7260924347106847)
  (ItemSimilarityDriverSuite.scala:142)

The rest seems to pass.
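Notably, the actual and expected sets above appear to carry the same item/score pairs per row; only the order of the `other:score` elements within each line differs, which makes a string-level Set comparison fail. A hypothetical helper, assuming output lines of the form `item<tab>other1:score1,other2:score2,...` (the delimiters and the function name are assumptions, not the suite's actual API), would let the suite compare results order-insensitively:

```scala
// Hypothetical helper: parse one similarity-output line, assumed to look like
//   item<tab>other1:score1,other2:score2,...
// into (item, Map(other -> score)), so equality ignores element order.
def parseSimLine(line: String): (String, Map[String, Double]) = {
  val tab = line.indexOf('\t')
  val item = line.substring(0, tab)
  val scores = line.substring(tab + 1).split(",").map { pair =>
    val sep = pair.lastIndexOf(':') // split on the last ':' in each pair
    pair.substring(0, sep) -> pair.substring(sep + 1).toDouble
  }.toMap
  (item, scores)
}
```

Comparing `Map`s built this way would make the test robust to the element reordering that Spark's nondeterministic partitioning can introduce.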
Re: standardizing minimal Matrix I/O capability
Oh, and how about calling a single value from a matrix an "Element", as we do in Vector.Element? This only applies to naming the reader functions "readElements" or some derivative.

Sent from my iPhone

> On Aug 5, 2014, at 8:34 AM, Pat Ferrel wrote:
>
> The benefit of your read/write is that there are no dictionaries to take up memory. [...]
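The default text-DRM line format proposed in MAHOUT-1568 in this thread (a row ID followed by itemID:value pairs) can be sketched on the reader side as follows. The tab and colon delimiters are assumptions consistent with the discussion, and the function name is hypothetical, not the PR's actual API:

```scala
// Sketch of a reader for the proposed default text-DRM schema:
//   rowID<tab>itemID1:value1<tab>itemID2:value2 ...
// Delimiters are assumed; MAHOUT-1568's Schema would make them configurable.
def readDrmLine(line: String): (String, Map[String, Double]) = {
  val fields = line.split("\t")
  val cells = fields.tail.map { f =>
    // Split on the LAST ':' so application-specific IDs that themselves
    // contain colons (e.g. URLs) round-trip intact.
    val sep = f.lastIndexOf(':')
    f.substring(0, sep) -> f.substring(sep + 1).toDouble
  }.toMap
  (fields.head, cells)
}
```

Splitting on the last colon is the detail that lets the "user ID hash, SKU, or URL" keys mentioned above survive as text.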
Re: standardizing minimal Matrix I/O capability
The benefit of your read/write is that there are no dictionaries to take up memory. This is an optimization that I haven't done yet. The purpose of mine was specifically to preserve external/non-Mahout IDs. So yours is more like drm.writeDrm, which writes seqfiles (also sc.readDrm).

The benefit of the stuff currently in mahout.drivers in the Spark module is that even in a pipeline it will preserve external IDs or use Mahout sequential Int keys, as requested. The downside is that it requires a Schema, though there are several default ones defined (in the PR) that would support your exact use case. And it is not yet optimized for use without dictionaries.

How should we resolve the overlap? Pragmatically, if you were to merge your code I could call it in the case where I don't need dictionaries, solving my optimization issue, but this will result in some duplicated code. Not sure if this is a problem. Maybe if yours took a Schema, defaulted to the one we agree has the correct delimiters?

The stuff in drivers does not read a text DRM yet. That will be part of MAHOUT-1604.

On Aug 4, 2014, at 8:32 AM, Pat Ferrel wrote:

This is great. We should definitely talk. What I've done is a first cut at a data prep pipeline. It takes DRMs or cells and creates an RDD-backed DRM, but it also maintains dictionaries so external IDs can be preserved and re-attached when written, after any math or algo is done. It also has driver and option processing stuff.

No hard-coded ",": you'd get that by using the default file schema, but the user can change it if they want. This is especially useful for using existing files like log files as input, where appropriate. It's also the beginnings of writing to DBs, since the Schema class is pretty flexible: it can contain DB connections and schema info. Was planning to put some in an example dir. I need Mongo but have also done Cassandra in a previous life.

I like some of your nomenclature better and agree that cells and DRMs are the primary data types to read. I am working on reading DRMs now for a Spark RSJ (1541 is itemsimilarity), so I may use part of your code but add the schema to it and use dictionaries to preserve application-specific IDs. It's tied to RDD textFile so it is parallel for input and output.

MAHOUT-1541 is already merged; maybe we can find a way to get this stuff together.

Thanks to Comcast I only have internet in Starbucks, so be patient.

On Aug 4, 2014, at 1:30 AM, Gokhan Capan wrote:

Pat,

I was thinking of something like:
https://github.com/gcapan/mahout/compare/cellin

It's just an example of where I believe new input formats should go (the example is to input a DRM from a text file of lines).

Best

Gokhan

On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel wrote:

> Some work on this is being done as part of MAHOUT-1568, which is currently very early and in https://github.com/apache/mahout/pull/36
>
> The idea there only covers text-delimited files and proposes a standard DRM-ish format but supports a configurable schema. The default is:
>
> rowID<tab>itemID1:value1<tab>itemID2:value2…
>
> The IDs can be Mahout keys of any type, since they are written as text, or they can be application-specific IDs meaningful in a particular usage, like a user ID hash, an SKU from a catalog, or a URL.
>
> As far as dataframe-ish requirements, it seems to me there are two different things needed. The dataframe is needed while performing an algorithm or calculation and is kept in distributed data structures. There probably won't be a lot of files kept around with the new engines. Any text files can be used for pipelines in a pinch but generally would be for import/export. Therefore MAHOUT-1568 concentrates on import/export, not dataframes, though it could use them when they are ready.
>
> On Jul 30, 2014, at 7:53 AM, Gokhan Capan wrote:
>
> I believe the next step should be standardizing minimal Matrix I/O capability (i.e. a couple of file formats other than [row_id, VectorWritable] SequenceFiles) required for a distributed computation engine, and adding dataframe-like structures that allow text columns.
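The "Schema with configurable delimiters" idea discussed in this thread might look like the following hypothetical sketch: a small case class carrying the delimiters, defaulting to the text-DRM form, plus a writer any driver could reuse. The names and fields are illustrative, not the actual API in the PR:

```scala
// Hypothetical Schema sketch: delimiters are configurable, defaulting to the
// text-DRM form rowID<tab>itemID1:value1<tab>itemID2:value2
case class Schema(
  rowKeyDelim: String = "\t",   // between the row ID and the first element
  elementDelim: String = "\t",  // between itemID:value elements
  idValueDelim: String = ":")   // between an item ID and its strength

// Write one DRM row under a given (or the default) schema.
def writeDrmLine(row: String,
                 cells: Seq[(String, Double)],
                 schema: Schema = Schema()): String =
  row + schema.rowKeyDelim + cells
    .map { case (id, v) => id + schema.idValueDelim + v }
    .mkString(schema.elementDelim)
```

A caller wanting comma-separated elements, as for log-file-style output, would pass `Schema(elementDelim = ",")`; everyone else gets the agreed default delimiters for free.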
[jira] [Created] (MAHOUT-1604) Create a RowSimilarity for Spark
Pat Ferrel created MAHOUT-1604:
-------------------------------

Summary: Create a RowSimilarity for Spark
Key: MAHOUT-1604
URL: https://issues.apache.org/jira/browse/MAHOUT-1604
Project: Mahout
Issue Type: Bug
Components: CLI
Affects Versions: 1.0
Environment: Spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel

Using CooccurrenceAnalysis.cooccurrence, create a driver that reads a text DRM or two and produces LLR similarity/cross-similarity matrices. This will produce the same results as ItemSimilarity but take a DRM as input instead of individual cells. The first version will only support LLR; other similarity measures will need to be in separate JIRAs.
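For reference, the LLR measure this driver will support is Dunning's log-likelihood ratio over a 2x2 cooccurrence contingency table. A sketch along the lines of Mahout's LogLikelihood math (a simplified standalone version, not the production code):

```scala
// Log-likelihood ratio over a 2x2 contingency table, after Dunning / Mahout's
// LogLikelihood. k11 = both events cooccur, k12/k21 = one occurs without the
// other, k22 = neither occurs.
def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

def entropy(elems: Long*): Double =
  xLogX(elems.sum) - elems.map(xLogX).sum

def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
  val rowEntropy = entropy(k11 + k12, k21 + k22)
  val colEntropy = entropy(k11 + k21, k12 + k22)
  val matEntropy = entropy(k11, k12, k21, k22)
  // Guard against tiny negative results from floating-point rounding.
  if (rowEntropy + colEntropy < matEntropy) 0.0
  else 2.0 * (rowEntropy + colEntropy - matEntropy)
}
```

Independent events (e.g. a uniform table) score near zero, while strongly associated events score high, which is what makes LLR a good anomaly-style similarity for cooccurrence counts.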