Oh, and how about calling a single value from a matrix an "Element", as we do in Vector.Element? This only applies to naming the reader functions "readElements" or some derivative.
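[A purely hypothetical sketch of that naming, just to make the proposal concrete; the types and signature below are invented here and are not committed API:]

    // Hypothetical sketch of the proposed nomenclature. "Element" mirrors
    // Vector.Element: one (row, col, value) cell of a matrix.
    case class MatrixElement(row: Int, col: Int, value: Double)

    trait ElementReader {
      // a reader function named per the proposal above
      def readElements(path: String): Iterator[MatrixElement]
    }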
> On Aug 5, 2014, at 8:34 AM, Pat Ferrel <[email protected]> wrote:
>
> The benefit of your read/write is that there are no dictionaries to take up memory. This is an optimization that I haven’t done yet. The purpose of mine was specifically to preserve external/non-Mahout IDs. So yours is more like drm.writeDrm, which writes seqfiles (also sc.readDrm).
>
> The benefit of the stuff currently in mahout.drivers in the Spark module is that even in a pipeline it will preserve external IDs or use Mahout sequential Int keys as requested. The downside is that it requires a Schema, though there are several default ones defined (in the PR) that would support your exact use case. And it is not yet optimized for use without dictionaries.
>
> How should we resolve the overlap? Pragmatically, if you were to merge your code I could call it in the case where I don’t need dictionaries, solving my optimization issue, but this would result in some duplicated code. Not sure if that is a problem. Maybe yours could take a Schema, defaulting to the one that we agree has the correct delimiters?
>
> The stuff in drivers does not read a text DRM yet. That will be part of MAHOUT-1604.
>
> On Aug 4, 2014, at 8:32 AM, Pat Ferrel <[email protected]> wrote:
>
> This is great. We should definitely talk. What I’ve done is a first cut at a data prep pipeline. It takes DRMs or cells and creates an RDD-backed DRM, but it also maintains dictionaries so external IDs can be preserved and re-attached when written, after any math or algo is done. It also has driver and option processing stuff.
>
> There is no hard-coded ","; you’d get that by using the default file schema, but the user can change it if they want. This is especially useful for using existing files, like log files, as input, where appropriate. It’s also the beginnings of writing to DBs: since the Schema class is pretty flexible, it can contain DB connections and schema info. I was planning to put some in an example dir. I need Mongo but have also done Cassandra in a previous life.
>
> I like some of your nomenclature better and agree that cells and DRMs are the primary data types to read. I am working on reading DRMs now for a Spark RSJ (MAHOUT-1541 is itemsimilarity), so I may use part of your code but add the schema to it and use dictionaries to preserve application-specific IDs. It’s tied to RDD textFile, so it is parallel for input and output.
>
> MAHOUT-1541 is already merged; maybe we can find a way to get this stuff together.
>
> Thanks to Comcast I only have internet in Starbucks, so be patient.
>
> On Aug 4, 2014, at 1:30 AM, Gokhan Capan <[email protected]> wrote:
>
> Pat,
>
> I was thinking of something like: https://github.com/gcapan/mahout/compare/cellin
>
> It's just an example of where I believe new input formats should go (the example is to input a DRM from a text file of <row_id,col_id,value> lines).
>
> Best
>
> Gokhan
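[For concreteness, a minimal sketch of reading a DRM from <row_id,col_id,value> lines, in the spirit of the cellin example above. This is not the code in that branch; the comma delimiter, the readCells name, and the use of plain Spark with mahout-math vectors are all assumptions:]

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD implicits (groupByKey)
    import org.apache.spark.rdd.RDD
    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

    // Parse <row_id,col_id,value> lines into keyed sparse rows, the shape a
    // DRM wrapper (e.g. drmWrap in the Spark bindings) expects.
    def readCells(sc: SparkContext, path: String, ncol: Int): RDD[(Int, Vector)] =
      sc.textFile(path)
        .map { line =>
          val Array(row, col, value) = line.split(",")
          (row.toInt, (col.toInt, value.toDouble))
        }
        .groupByKey()                               // collect each row's cells
        .map { case (row, cells) =>
          val v: Vector = new RandomAccessSparseVector(ncol)
          cells.foreach { case (col, x) => v.setQuick(col, x) }
          row -> v
        }

[Note that the benefit Pat describes falls out of this shape: no dictionaries are built, so nothing beyond the row vectors themselves is held in memory.]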
>> On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel <[email protected]> wrote:
>>
>> Some work on this is being done as part of MAHOUT-1568, which is currently very early and in https://github.com/apache/mahout/pull/36
>>
>> The idea there only covers text-delimited files and proposes a standard DRM-ish format, but supports a configurable schema. The default is:
>>
>> rowID<tab>itemID1:value1<space>itemID2:value2…
>>
>> The IDs can be Mahout keys of any type, since they are written as text, or they can be application-specific IDs meaningful in a particular usage, like a user ID hash, or SKU from a catalog, or URL.
>>
>> As far as dataframe-ish requirements, it seems to me there are two different things needed. The dataframe is needed while performing an algorithm or calculation and is kept in distributed data structures. There probably won’t be a lot of files kept around with the new engines. Any text files can be used for pipelines in a pinch but generally would be for import/export. Therefore MAHOUT-1568 concentrates on import/export, not dataframes, though it could use them when they are ready.
>>
>> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <[email protected]> wrote:
>>
>> I believe the next step should be standardizing a minimal Matrix I/O capability (i.e. a couple of file formats other than [row_id, VectorWritable] SequenceFiles) required for a distributed computation engine, and adding data frame-like structures that allow text columns.
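[And for the MAHOUT-1568 default format quoted above (rowID<tab>itemID1:value1<space>itemID2:value2…), a sketch of parsing a single line. The function name is illustrative, and the dictionary step that would map the text IDs back to Mahout Int keys is left out:]

    // Parse one "rowID<tab>itemID1:value1<space>itemID2:value2..." line.
    // IDs stay as text, so application-specific keys (user hash, SKU, URL)
    // survive a round trip; a dictionary can later map them to Int keys
    // for the math. This sketch assumes IDs do not themselves contain ':'.
    def parseDrmLine(line: String): (String, Seq[(String, Double)]) = {
      val Array(rowId, rest) = line.split("\t", 2)  // rowID<tab>the rest
      val elements = rest.split(" ").toSeq.map { cell =>
        val Array(itemId, value) = cell.split(":")  // itemID:value pairs
        itemId -> value.toDouble
      }
      rowId -> elements
    }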
