Pat, I was thinking of something like: https://github.com/gcapan/mahout/compare/cellin
It's just an example of where I believe new input formats should go (the example reads a DRM from a text file of <row_id,col_id,value> lines).

Best,
Gokhan

On Thu, Jul 31, 2014 at 12:00 AM, Pat Ferrel <[email protected]> wrote:

> Some work on this is being done as part of MAHOUT-1568, which is currently
> very early and in https://github.com/apache/mahout/pull/36
>
> The idea there only covers text-delimited files and proposes a standard
> DRM-ish format but supports a configurable schema. Default is:
>
> rowID<tab>itemID1:value1<space>itemID2:value2…
>
> The IDs can be Mahout keys of any type, since they are written as text, or
> they can be application-specific IDs meaningful in a particular usage, like
> a user ID hash, a SKU from a catalog, or a URL.
>
> As far as dataframe-ish requirements, it seems to me there are two
> different things needed. The dataframe is needed while performing an
> algorithm or calculation and is kept in distributed data structures. There
> probably won’t be a lot of files kept around with the new engines. Any text
> files can be used for pipelines in a pinch but generally would be for
> import/export. Therefore MAHOUT-1568 concentrates on import/export, not
> dataframes, though it could use them when they are ready.
>
> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <[email protected]>
> wrote:
>
> I believe the next step should be standardizing minimal Matrix I/O
> capability (i.e. a couple of file formats other than [row_id, VectorWritable]
> SequenceFiles) required for a distributed computation engine, and adding
> data-frame-like structures that allow text columns.
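
To make the two text layouts mentioned in this thread concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not reflect the actual API in MAHOUT-1568 or in the linked branch; the function names (`read_triples`, `to_drm_text`, `parse_drm_line`) and the in-memory dict-of-dicts representation are assumptions made purely for illustration. It groups <row_id,col_id,value> lines by row and renders them in the default rowID<tab>itemID1:value1<space>itemID2:value2 layout, with the separators as parameters to mirror the "configurable schema" idea.

```python
from collections import defaultdict


def read_triples(lines, sep=","):
    """Group row_id,col_id,value text lines into {row_id: {col_id: value}}."""
    rows = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row_id, col_id, value = line.split(sep)
        rows[row_id][col_id] = float(value)
    return dict(rows)


def to_drm_text(rows, row_sep="\t", elem_sep=" ", kv_sep=":"):
    """Render each row as rowID<tab>itemID1:value1<space>itemID2:value2."""
    out = []
    for row_id, cols in rows.items():
        elements = elem_sep.join(
            f"{col_id}{kv_sep}{value}" for col_id, value in cols.items()
        )
        out.append(f"{row_id}{row_sep}{elements}")
    return out


def parse_drm_line(line, row_sep="\t", elem_sep=" ", kv_sep=":"):
    """Parse one line of the default layout back into (row_id, {col_id: value}).

    IDs are kept as text. The value is split off at the LAST kv_sep so that
    IDs which themselves contain ':' (for example URLs) survive intact.
    """
    row_id, _, rest = line.rstrip("\n").partition(row_sep)
    cols = {}
    for element in rest.split(elem_sep):
        if element:
            col_id, _, value = element.rpartition(kv_sep)
            cols[col_id] = float(value)
    return row_id, cols
```

For example, `to_drm_text(read_triples(["u1,iphone,1.0", "u1,ipad,2.0"]))` yields a single line with row ID `u1` followed by a tab and the two `item:value` elements. A real reader would also need to handle IDs that contain the element separator, which this sketch ignores.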
