I am not sure I see the difficulty, but it is possible we are talking
about slightly different things.
Hadoop solves this kind of thing through pluggable strategies such as InputFormat.

Those strategies are parameterized (and perhaps also persisted) through
some form of declarative definition (to keep the Hadoop analogy, they
use the Configuration machinery for serializing that sort of thing --
though property-based definitions are probably quite underwhelming for
this case). Similarly, Lucene defines Analyzer preprocessing
strategies. Surely we could define analogous strategies that take rows
of un-standardized input and produce vectorized, standardized input as
a result.
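
Roughly what I have in mind (all the names and the config key below are
made up just to illustrate; only Configuration, ReflectionUtils and
Vector are real):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.Vector;

// Hypothetical pluggable vectorization strategy, chosen by class name
// and parameterized through the job Configuration, much the same way
// Hadoop resolves its InputFormat.
public interface RowVectorizer {
  /** Pull strategy parameters (column specs, scaling, etc.) from the config. */
  void configure(Configuration conf);

  /** Turn one raw input row into a standardized Vector. */
  Vector vectorize(String[] rawRow);
}

class VectorizerFactory {
  public static RowVectorizer create(Configuration conf) {
    // "mahout.vectorizer.class" is a made-up property name
    Class<? extends RowVectorizer> cls =
        conf.getClass("mahout.vectorizer.class", null, RowVectorizer.class);
    RowVectorizer v = ReflectionUtils.newInstance(cls, conf);
    v.configure(conf);
    return v;
  }
}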

A somewhat bigger question is what to use to represent pre-vectorized
inputs, since Vector obviously won't handle mixed data types,
especially qualitative (categorical) inputs.
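
For those rows, maybe something as simple as a typed attribute list
would do, just as a strawman (hypothetical class, nothing existing):

import java.util.ArrayList;
import java.util.List;

// Sketch of a pre-vectorized row: a list of typed attributes, numeric
// or qualitative, to be encoded into a Vector downstream.
public class RawRecord {
  public enum Kind { NUMERIC, CATEGORICAL, TEXT }

  public static final class Attribute {
    final String name;
    final Kind kind;
    final Object value;   // Double for NUMERIC, String otherwise
    Attribute(String name, Kind kind, Object value) {
      this.name = name;
      this.kind = kind;
      this.value = value;
    }
  }

  private final List<Attribute> attributes = new ArrayList<Attribute>();

  public void addNumeric(String name, double v) {
    attributes.add(new Attribute(name, Kind.NUMERIC, v));
  }

  public void addCategorical(String name, String level) {
    attributes.add(new Attribute(name, Kind.CATEGORICAL, level));
  }

  public List<Attribute> attributes() {
    return attributes;
  }
}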

But perhaps we already have some of this, I am not sure. I saw a fair
number of classes that adapt various formats (what was it? TSV?
ARFF?); perhaps we could strategize those as well.
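
I.e. the existing TSV/ARFF adapters could hide behind a single
row-source strategy, something like this (again, made-up names, just a
sketch):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public interface RowSource {
  /** Next raw row as fields, or null at end of input. */
  String[] nextRow() throws IOException;
}

// One possible implementation covering TSV-like inputs; an ARFF
// adapter would be another implementation of the same interface.
class DelimitedTextRowSource implements RowSource {
  private final BufferedReader in;
  private final String separator;

  DelimitedTextRowSource(Reader reader, String separator) {
    this.in = new BufferedReader(reader);
    this.separator = separator;
  }

  @Override
  public String[] nextRow() throws IOException {
    String line = in.readLine();
    // naive split; a real adapter would handle quoting/escaping
    return line == null ? null : line.split(separator, -1);
  }
}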

On Fri, Apr 22, 2011 at 9:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Yes.
>
> But how do we specify the input?  And how do we specify the encodings?
>
> This is what has always held me back in the past.  Should we just allow
> classes to be specified on the command line?
>
> On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
>> Maybe there's a space for an MR-based input conversion job indeed, as a command
>> line routine? I was kind of thinking about the same. Maybe even along with
>> standardization of the values. Some formal definition of inputs being fed
>> to it.
>>
>
