Using different data sources for input/output
---------------------------------------------

                 Key: MAHOUT-522
                 URL: https://issues.apache.org/jira/browse/MAHOUT-522
             Project: Mahout
          Issue Type: Improvement
          Components: Utils
            Reporter: Oleksandr Petrov


Hi,

Mahout currently feels bound to the file system. Most of the time the data I
work with does not live on the file system, and neither does the output, so I
am forced to export my datasets from a DB to the FS and then load the results
back into the DB afterwards.

Writing adapters for DBs and the like is most likely not very interesting for
the core developers, who are working on the algorithm implementations.

For instance, SequenceFilesFromDirectory is a simple way to take the files in
a directory and convert them all to Sequence Files. Some people would be
extremely grateful for an interface they could implement to write their data
from a DB straight to a Sequence File, without the file system as an
intermediate step. If anyone's interested, I can provide a patch.
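
To make the idea concrete, here is a minimal sketch of what such an interface
could look like. All names below (Document, DocumentSource, DocumentSink,
DocumentCopier) are hypothetical and are not part of Mahout. The sketch uses
only the JDK so it runs on its own; a real sink would presumably wrap a Hadoop
SequenceFile writer, and a real source would wrap a DB query.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical: one document as a key/text pair, the shape a sequence file stores. */
final class Document {
    final String key;
    final String text;
    Document(String key, String text) { this.key = key; this.text = text; }
}

/** Hypothetical source abstraction: could be a directory walker, a DB query, etc. */
interface DocumentSource extends Iterable<Document> {}

/** Hypothetical sink abstraction; a real one might wrap a Hadoop SequenceFile writer. */
interface DocumentSink {
    void write(String key, String text);
}

/** Stand-in for a DB-backed source, holding documents in memory. */
final class InMemorySource implements DocumentSource {
    private final List<Document> docs = new ArrayList<>();
    InMemorySource add(String key, String text) {
        docs.add(new Document(key, text));
        return this;
    }
    @Override public Iterator<Document> iterator() { return docs.iterator(); }
}

/** Stand-in for a sequence file, recording what was written in insertion order. */
final class MapSink implements DocumentSink {
    final Map<String, String> written = new LinkedHashMap<>();
    @Override public void write(String key, String text) { written.put(key, text); }
}

/** Copies every document from any source to any sink and returns the count. */
final class DocumentCopier {
    static int copy(DocumentSource source, DocumentSink sink) {
        int count = 0;
        for (Document d : source) {
            sink.write(d.key, d.text);
            count++;
        }
        return count;
    }
}
```

With this split, the conversion step only depends on the two interfaces, so a
DB-backed source could be plugged in without touching the file system.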

The second issue is related to the workflow itself. For instance, what if I
already have a Dictionary, TF-IDF, and TF in some particular format created by
other parts of my infrastructure? Again, I need to convert those to the Mahout
data structures. Couldn't other jobs accept more generic types (or interfaces,
for instance) when working with TF-IDF, TF and Dictionaries, without binding
them to the Hadoop FS?
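
As a rough illustration of the "more generic types" idea, the sketch below
hides the dictionary behind an interface so that a consumer does not care
where the term-to-index mapping is stored. TermDictionary, MapTermDictionary
and SparseVectors are hypothetical names, not Mahout classes, and the sparse
vector is modelled as a plain JDK map rather than a Mahout Vector.

```java
import java.util.Map;
import java.util.OptionalInt;
import java.util.TreeMap;

/** Hypothetical dictionary abstraction: term -> feature index, storage-agnostic. */
interface TermDictionary {
    OptionalInt indexOf(String term);
}

/** One possible backing store; a DB- or file-backed one would implement the same interface. */
final class MapTermDictionary implements TermDictionary {
    private final Map<String, Integer> indices;
    MapTermDictionary(Map<String, Integer> indices) { this.indices = indices; }
    @Override public OptionalInt indexOf(String term) {
        Integer index = indices.get(term);
        return index == null ? OptionalInt.empty() : OptionalInt.of(index);
    }
}

final class SparseVectors {
    /**
     * Turns per-term weights (TF or TF-IDF) into an index -> weight map,
     * skipping terms the dictionary does not know.
     */
    static Map<Integer, Double> encode(Map<String, Double> termWeights,
                                       TermDictionary dictionary) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Double> entry : termWeights.entrySet()) {
            OptionalInt index = dictionary.indexOf(entry.getKey());
            if (index.isPresent()) {
                vector.put(index.getAsInt(), entry.getValue());
            }
        }
        return vector;
    }
}
```

A job written against TermDictionary would then work with a dictionary
produced elsewhere in the infrastructure, as long as an adapter exists for it.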

I do realize that Mahout is part of the Lucene/Hadoop infrastructure, but it is
also an independent project, so it may benefit and gain wider adoption if it
can work with any format. I have an idea of how to implement this, and have
partially implemented it for our infrastructure needs, but I would really like
to hear some feedback from users and Hadoop developers on whether it is
suitable and whether anyone would benefit from it.

Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
