[ 
https://issues.apache.org/jira/browse/MAHOUT-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920304#action_12920304
 ] 

Oleksandr Petrov commented on MAHOUT-522:
-----------------------------------------

Thank you Ted, I did. I assume you meant FeatureVectorEncoder; if not, I 
couldn't find it. 
Sorry for the delay in answering; I wanted to present a reasonably 
representative set of thoughts.

I've been investigating the Mahout code to find overall patterns. I may have 
found the idea I've 
been looking for, but it would still be nice to discuss it.

If we take a look at, for instance, SequenceFilesFromDirectory: that class 
contains 
ChunkedWriter (which is a great idea), but ChunkedWriter isn't used in 
DictionaryVectorizer 
(another possible place to use it). The overall idea is that ChunkedWriter 
could implement 
a more generic type (an interface or abstract class) that would simply be 
called Writer. 
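To make the idea concrete, here is a minimal sketch of what such a generic 
Writer abstraction could look like, with a chunk-rotating implementation in 
the spirit of ChunkedWriter. All names (PairWriter, ChunkingWriter) and the 
in-memory chunk representation are purely illustrative, not Mahout's actual 
API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generic writer interface that a filesystem-backed
// ChunkedWriter (or a DB-backed writer) could implement.
interface PairWriter<K, V> {
    void write(K key, V value);
    void close();
}

// Simple in-memory stand-in for a chunk-rotating writer: it starts a
// new "chunk" once the current one reaches maxChunkSize entries.
class ChunkingWriter<K, V> implements PairWriter<K, V> {
    private final int maxChunkSize;
    private final List<List<String>> chunks = new ArrayList<>();
    private List<String> current = new ArrayList<>();

    ChunkingWriter(int maxChunkSize) {
        this.maxChunkSize = maxChunkSize;
    }

    @Override
    public void write(K key, V value) {
        if (current.size() >= maxChunkSize) { // rotate to a fresh chunk
            chunks.add(current);
            current = new ArrayList<>();
        }
        current.add(key + "\t" + value);
    }

    @Override
    public void close() {
        if (!current.isEmpty()) {
            chunks.add(current);
        }
    }

    int chunkCount() {
        return chunks.size();
    }
}
```

DictionaryVectorizer could then be written against PairWriter instead of a 
concrete class, and the chunking behavior would come along for free.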

SequenceFilesFromDirectory also does some reading, and the same approach 
applies there. 
We could create an interface that does the reading and returns the current 
text, just 
as it's implemented right now: http://gist.github.com/620827. Introducing such 
an interface would let people write their 
own custom reader implementations.

Readers should definitely be generic: SequenceFilesFromDirectory needs to 
retrieve
Text/Text pairs, DocumentProcessor Text/StringTuple pairs, and so on.
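A quick sketch of the generic reader idea: one interface, parameterized by key 
and value types, so SequenceFilesFromDirectory could use a Text/Text reader and 
DocumentProcessor a Text/StringTuple reader. Plain String stands in for the 
Hadoop Writable types here so the example is self-contained; the names are 
illustrative only:

```java
import java.util.Iterator;
import java.util.Map;

// Hypothetical generic reader interface, parameterized by key/value
// types (e.g. Text/Text, Text/StringTuple in the Mahout case).
interface PairReader<K, V> {
    boolean hasNext();
    Map.Entry<K, V> next();
}

// An in-memory implementation reading from precomputed pairs,
// e.g. documents keyed by an identifier.
class InMemoryReader<K, V> implements PairReader<K, V> {
    private final Iterator<Map.Entry<K, V>> it;

    InMemoryReader(Map<K, V> source) {
        this.it = source.entrySet().iterator();
    }

    @Override
    public boolean hasNext() {
        return it.hasNext();
    }

    @Override
    public Map.Entry<K, V> next() {
        return it.next();
    }
}
```

A DB-backed or HDFS-backed implementation would satisfy the same interface, so 
the downstream jobs wouldn't care where the pairs came from.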

The same applies to DocumentProcessor.tokenizeDocuments. 
Although here we already have
processed text (the output of 
SequenceFilesFromDirectory), so the reader should read pairs
from the existing data source in a different manner. For instance, the file 
identifier (which is Text in the current 
case) and the StringTuple array might be stored in a relational DB or any other 
data source for that matter.
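As a sketch of reading such pairs back out of a non-filesystem store: below, 
each stored row is simulated as an "id&lt;TAB&gt;token,token,..." string, and the 
reader decodes it into a (document id, token list) pair. A real implementation 
might iterate a JDBC ResultSet instead; the class name and row encoding are 
assumptions for illustration:

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical reader that decodes (document id, token list) pairs
// from rows of an existing data source, e.g. a relational DB.
class RowBackedTupleReader {
    private final Iterator<String> rows;

    RowBackedTupleReader(List<String> storedRows) {
        this.rows = storedRows.iterator();
    }

    boolean hasNext() {
        return rows.hasNext();
    }

    // Decodes the next stored row ("id<TAB>tok1,tok2,...") into a pair.
    Map.Entry<String, List<String>> next() {
        String[] parts = rows.next().split("\t", 2);
        List<String> tokens = Arrays.asList(parts[1].split(","));
        return new AbstractMap.SimpleEntry<>(parts[0], tokens);
    }
}
```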

The same applies to DictionaryVectorizer.createTermFrequencyVectors. 
Here we have different frequency
vectors, each with its own format. 

Please let me know whether this makes sense, whether the contribution might be 
useful in any way, and whether 
anyone would like to help with it.

Thank you in advance

> Using different data sources for input/output
> ---------------------------------------------
>
>                 Key: MAHOUT-522
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-522
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>            Reporter: Oleksandr Petrov
>
> Hi,
> Mahout currently seems bound to the file system, at least in my experience. 
> Most of the data structures I work with aren't located on the file system, 
> and the output isn't bound to the FS either; most of the time I'm forced to 
> export my datasets from a DB to the FS, and then load them back into the DB 
> afterwards.
> Most likely the core developers, who are busy implementing the algorithms, 
> aren't very interested in writing adapters to DBs or anything like that.
> For instance, SequenceFilesFromDirectory is a simple way to get your files 
> from a directory and convert them all to Sequence Files. Some people would be 
> extremely grateful for an interface they could implement to send their files 
> from a DB straight to a Sequence File, without the file system as an 
> intermediary. If anyone's interested, I can provide a patch.
> The second issue relates to the workflow itself. For instance, what if 
> I already have a Dictionary, TF-IDF, and TF in some particular format that 
> was created by other parts of my infrastructure? Again, I need to convert 
> those to Mahout's data structures. Couldn't the other jobs accept 
> more generic types (or interfaces, for instance) when working with TF-IDF, TF 
> and Dictionaries, without binding them to the Hadoop FS? 
> I do realize that Mahout is part of the Lucene/Hadoop infrastructure, but it's 
> also an independent project, so it might benefit and gain wider adoption 
> if it could work with any format. I have an idea of how to implement 
> this, and have partially implemented it for our own infrastructure needs, but 
> I'd really like to hear feedback from users and Hadoop developers on whether 
> it's suitable and whether anyone might benefit from it.
> Thank you!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
