[
https://issues.apache.org/jira/browse/MAHOUT-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917608#action_12917608
]
Oleksandr Petrov commented on MAHOUT-522:
-----------------------------------------
@Drew
API is an awesome idea.
I'll probably polish my code related to sequence file creation and TF,
TF/IDF, and dictionary import. I realize that the existing file format is the
best fit for map/reduce usage. But still, as you mentioned, an API would speed
up the development process a lot.
There could be an input reader that provides a simple iterable interface
(getNext, and maybe getCount) for all four of the things mentioned above, plus
several sample adapters: a MongoDB one and an SQL one, for instance.
I agree that there's no such thing as a stable DB format for these things.
Everyone uses their own schema. So, moving from easy to difficult, we could:
a) allow people to use / reuse / provide their own readers
b) allow single configuration point (which is questionable, since people may
want to handle it all themselves)
c) implement a single writer interface that accepts any reader type and reads
the available items out of it
We have bits of that already covered, and it surely needs to become a bit more
generic to allow reuse. I'll be working on it throughout the next week or so.
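To make (a) and (c) concrete, here's a minimal sketch of what such a reader/writer pairing could look like. All names here (RecordReader, ListReader, CollectingWriter) are hypothetical illustrations, not existing Mahout API; a real writer would target a Hadoop SequenceFile rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical source-agnostic reader: getNext() and getCount(),
// as suggested above. A MongoDB or SQL adapter would implement the
// same interface over a cursor or ResultSet.
interface RecordReader<T> {
    T getNext();      // next record, or null when exhausted
    long getCount();  // total number of records, if known
}

// Sample adapter backed by an in-memory list, standing in for a
// real DB-backed adapter.
class ListReader<T> implements RecordReader<T> {
    private final Iterator<T> it;
    private final long count;

    ListReader(List<T> items) {
        this.it = items.iterator();
        this.count = items.size();
    }

    public T getNext() { return it.hasNext() ? it.next() : null; }

    public long getCount() { return count; }
}

// A single writer that accepts any reader and drains it. In the real
// thing this would wrap a SequenceFile.Writer instead of a list.
class CollectingWriter {
    final List<String> written = new ArrayList<>();

    void writeAll(RecordReader<String> reader) {
        for (String rec = reader.getNext(); rec != null; rec = reader.getNext()) {
            written.add(rec);
        }
    }
}
```

The point of the split is that people can plug in their own RecordReader implementation (point a) while the writer side stays a single, generic component (point c).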
> Using different data sources for input/output
> ---------------------------------------------
>
> Key: MAHOUT-522
> URL: https://issues.apache.org/jira/browse/MAHOUT-522
> Project: Mahout
> Issue Type: Improvement
> Components: Utils
> Reporter: Oleksandr Petrov
>
> Hi,
> Mahout is currently bound to the file system, at least in my experience.
> Most of the time the data structures I'm working with aren't located on the
> file system, and the output isn't bound to the FS either, so most of the
> time I'm forced to export my datasets from the DB to the FS and then load
> them back into the DB afterwards.
> Most likely, the core developers working on the algorithm implementations
> aren't interested in writing adapters to DBs or anything like that.
> For instance, SequenceFilesFromDirectory is a simple way to take your files
> from a directory and convert them all to sequence files. Some people would
> be extremely grateful if there were an interface they could implement to
> write their files from the DB straight to a sequence file, without the file
> system as an intermediary. If anyone's interested, I can provide a patch.
> The second issue is related to the workflow itself. For instance, what if I
> already have a dictionary, TF-IDF, and TF in some particular format created
> by other components in my infrastructure? Again, I need to convert those to
> the Mahout data structures. Can't we allow other jobs to accept more generic
> types (or interfaces, for instance) when working with TF-IDF, TF, and
> dictionaries, without binding them to the Hadoop FS?
> I realize that Mahout is part of the Lucene/Hadoop infrastructure, but it's
> also an independent project, so it may benefit and gain wider adoption if it
> can work with any format. I have an idea of how to implement this, and have
> partially implemented it for our own infrastructure needs, but I'd really
> like to hear some feedback from users and Hadoop developers on whether it's
> suitable and whether anyone would benefit from it.
> Thank you!
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.