Using different data sources for input/output
---------------------------------------------
Key: MAHOUT-522
URL: https://issues.apache.org/jira/browse/MAHOUT-522
Project: Mahout
Issue Type: Improvement
Components: Utils
Reporter: Oleksandr Petrov
Hi,
Mahout currently feels bound to the file system. Most of the data I work with
does not live on the file system, and the output is not meant to end up there
either, so most of the time I'm forced to export my datasets from the DB to the
FS, and then load the results back into the DB afterwards.
Most likely, the core developers, who are working on the algorithm
implementations, are not very interested in writing adapters to DBs or anything
like that.
For instance, SequenceFilesFromDirectory is a simple way to take the files in a
directory and convert them all to Sequence Files. Some people would be
extremely grateful if there were an interface they could implement to write
their documents from a DB straight to a Sequence File, without going through
the file system as an intermediate step. If anyone's interested, I can provide
a patch.
The second issue is related to the workflow itself. For instance, what if I
already have a Dictionary, TF-IDF, and TF in some particular format that was
created by other parts of my infrastructure? Again, I need to convert those to
the Mahout data structures. Couldn't we allow the other jobs to accept more
generic types (or interfaces, for instance) when working with TF-IDF, TF and
Dictionaries, without binding them to the Hadoop FS?
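As an illustration of what "more generic types" could mean for the dictionary
case, here is a minimal sketch. The names TermDictionary, MapDictionary and
termFrequencies are hypothetical and not taken from Mahout; the point is only
that the counting code depends on an interface rather than on a file in HDFS.

```java
import java.util.HashMap;
import java.util.Map;

public class DictionarySketch {

    /** Hypothetical abstraction: maps a term to its integer index, or -1 if unknown. */
    interface TermDictionary {
        int indexOf(String term);
    }

    /** Hypothetical in-memory implementation; it could just as well be filled from a DB. */
    static final class MapDictionary implements TermDictionary {
        private final Map<String, Integer> index = new HashMap<>();

        /** Assigns the next free index to a term that has not been seen before. */
        MapDictionary add(String term) {
            index.putIfAbsent(term, index.size());
            return this;
        }

        @Override
        public int indexOf(String term) {
            return index.getOrDefault(term, -1);
        }
    }

    /** Builds a sparse term-frequency map (term index -> count), skipping unknown terms. */
    static Map<Integer, Integer> termFrequencies(String[] tokens, TermDictionary dictionary) {
        Map<Integer, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            int i = dictionary.indexOf(token);
            if (i >= 0) {
                tf.merge(i, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        TermDictionary dictionary = new MapDictionary().add("mahout").add("hadoop");
        String[] tokens = {"mahout", "hadoop", "mahout", "lucene"};
        System.out.println(termFrequencies(tokens, dictionary));
    }
}
```

A dictionary that already exists elsewhere in the infrastructure would then
only need a small adapter implementing the interface, instead of a conversion
step to and from the Hadoop FS.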
I do realize that Mahout is part of the Lucene/Hadoop infrastructure, but it's
also an independent project, so it may benefit and gain wider adoption if it
can work with any format. I have an idea of how to implement this, and have
partially implemented it for our infrastructure needs, but I really want to
hear some feedback from users and Hadoop developers on whether it's suitable
and whether anyone would benefit from it.
Thank you!