[ https://issues.apache.org/jira/browse/MAHOUT-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917514#action_12917514 ]

Oleksandr Petrov commented on MAHOUT-522:
-----------------------------------------

Ted, that was exactly my point: I realize that Mahout is very Hadoop-oriented, 
since it's part of that ecosystem. That was part of why I chose Mahout. I tried 
Maui, which relies on Weka, and Weka 3 itself; both are great, but Mahout 
turned out to be better for my needs, partly because of its built-in map/reduce 
features. 

OK, I'll try to make the patches a bit more human-oriented, or I'll just 
provide a tutorial for people who want to do things the same way I have, to 
save them some investigation time.

Thanks for the response. 

> Using different data sources for input/output
> ---------------------------------------------
>
>                 Key: MAHOUT-522
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-522
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Oleksandr Petrov
>
> Hi,
> Mahout currently feels bound to the file system. Most of the time the data 
> structures I'm working with aren't located on the file system, and the output 
> isn't meant to stay on the FS either; most of the time I'm forced to export 
> my datasets from a DB to the FS, and then load the results back into the DB 
> afterwards.
> Most likely it isn't very interesting for the core developers, who are 
> working on the algorithm implementations, to start writing adapters to DBs or 
> anything like that.
> For instance, SequenceFilesFromDirectory is a simple way to take your files 
> from a directory and convert them all to SequenceFiles. Some people would be 
> extremely grateful for an interface they could implement to write their 
> documents from a DB straight into a SequenceFile, without the file system as 
> an intermediary (see the first sketch below). If anyone is interested, I can 
> provide a patch.
> The second issue is related to the workflow itself. For instance, what if I 
> already have the dictionary, TF-IDF, and TF vectors in some particular format 
> that was created by other parts of my infrastructure? Again, I need to 
> convert those to the Mahout data structures. Couldn't we allow the other jobs 
> to accept more generic types (or interfaces) when working with TF-IDF, TF, 
> and dictionaries, without binding them to the Hadoop FS (see the second 
> sketch below)? 
> I do realize that Mahout is part of the Lucene/Hadoop infrastructure, but 
> it's also an independent project, so it could benefit and gain wider adoption 
> if it allowed working with any format. I have an idea of how to implement 
> this, and have partially implemented it for our own infrastructure needs, but 
> I'd really like to hear some feedback from users and Hadoop developers on 
> whether it's suitable and whether anyone else might benefit from it.
> Thank you!
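
To make the first point concrete, here is a minimal sketch of what I mean by 
writing documents from a DB straight into a SequenceFile. The Hadoop classes 
(SequenceFile, Text, FileSystem) are real; the DocumentSource/Document types 
are hypothetical stand-ins for whatever the DB layer hands back, and the 
key/value convention just mirrors what SequenceFilesFromDirectory produces.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFilesFromDatabase {

      /** Hypothetical abstraction over whatever the DB hands back. */
      public interface DocumentSource extends Iterable<Document> {}

      /** Hypothetical DB record: an id and the document text. */
      public static class Document {
        public final String id;
        public final String content;
        public Document(String id, String content) {
          this.id = id;
          this.content = content;
        }
      }

      public static void write(DocumentSource source, Path output, Configuration conf)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
        try {
          Text key = new Text();
          Text value = new Text();
          for (Document doc : source) {
            key.set(doc.id);         // document id as the key
            value.set(doc.content);  // raw document text as the value
            writer.append(key, value);
          }
        } finally {
          writer.close();
        }
      }
    }

A JDBC- or ORM-backed DocumentSource would then feed the existing 
SequenceFile-to-vectors jobs directly, with no export to the local file system 
in between.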
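
And a sketch of the second point: the kind of interfaces the vectorization and 
clustering jobs could accept instead of hard-coded HDFS paths. The interface 
names (VectorSource, Dictionary) are made up for illustration; 
org.apache.mahout.math.Vector is the real Mahout type. An HDFS-backed 
implementation would reproduce the current behaviour, while a DB-backed one 
would skip the export/import round trip entirely.

    import java.util.Map;

    import org.apache.mahout.math.Vector;

    /** Source of TF or TF-IDF vectors keyed by document id; could be backed
        by a SequenceFile, a DB cursor, or anything else. */
    interface VectorSource extends Iterable<Map.Entry<String, Vector>> {
      int numDocs();   // how many vectors the source will yield
    }

    /** Term dictionary: term <-> index mapping, however it is stored. */
    interface Dictionary {
      int indexOf(String term);   // -1 if the term is unknown
      String termAt(int index);
      int size();
    }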

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
