Some work on this is being done as part of MAHOUT-1568, which is still at a very 
early stage; see https://github.com/apache/mahout/pull/36

The proposal there covers only text-delimited files and proposes a standard 
DRM-ish format, but it supports a configurable schema. The default is:

rowID<tab>itemID1:value1<space>itemID2:value2…

The IDs can be Mahout keys of any type, since they are written as text, or they 
can be application-specific IDs meaningful in a particular usage, like a user 
ID hash, a SKU from a catalog, or a URL.
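To make the default schema concrete, here is a minimal sketch of a parser for 
one row in that format. This is not the MAHOUT-1568 implementation, just an 
illustration of the layout; the function name and separator parameters are 
hypothetical, and the separators are assumed to match the default schema above 
(tab between row ID and elements, space between elements, colon between item ID 
and value).

```python
def parse_drm_line(line, field_sep="\t", element_sep=" ", value_sep=":"):
    """Parse one row of the default text-delimited DRM-ish schema:

        rowID<tab>itemID1:value1<space>itemID2:value2...

    Returns (row_id, vector) where vector maps itemID -> float value.
    """
    row_id, elements = line.rstrip("\n").split(field_sep, 1)
    vector = {}
    for element in elements.split(element_sep):
        # rsplit so item IDs that themselves contain ':' (e.g. URLs
        # like http://...) still parse; only the last colon splits
        # the ID from the numeric value.
        item_id, value = element.rsplit(value_sep, 1)
        vector[item_id] = float(value)
    return row_id, vector
```

Note the rsplit on the last colon: since the IDs may be application-specific 
strings such as URLs, the item ID itself can contain the value separator, so 
only the final occurrence should be treated as the delimiter.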

As far as dataframe-ish requirements, it seems to me there are two different 
things needed. The dataframe is needed while performing an algorithm or 
calculation and is kept in distributed data structures. There probably won't be 
a lot of files kept around with the new engines. Text files can be used for 
pipelines in a pinch but would generally be for import/export. Therefore 
MAHOUT-1568 concentrates on import/export, not dataframes, though it could use 
them when they are ready.


> On Jul 30, 2014, at 7:53 AM, Gokhan Capan <[email protected]> wrote:
> I believe the next step should be standardizing minimal Matrix I/O capability 
> (i.e. a couple file formats other than [row_id, VectorWritable] 
> SequenceFiles) required for a distributed computation engine, and adding data 
> frame like structures those allow text columns.
> 

Reply via email to