Hi,

I think we should restart this conversation.
Matthieu, do you think we can review the branch?
Or do you want to update it first?

Cheers,

--
Gianmarco

On 26 January 2015 at 16:20, Albert Bifet <[email protected]> wrote:

> Hi Matthieu,
>
> Thanks for your answers! I agree with using double values to store
> attribute information. I think we need to define how to maintain the
> mapping, as some learners need to know whether attributes are discrete
> or numeric, and how many values the discrete attributes have, in order
> to learn and make predictions.
>
> Cheers, Albert
>
> On Mon, Jan 26, 2015 at 7:33 PM, Matthieu Morel <[email protected]> wrote:
>
> > - discrete attributes are eventually mapped to double values, and
> > that's the appropriate input to instances, in my understanding. My
> > idea was to maintain the mapping in the feature extraction step, and
> > share it in some way with the processing topology.
> >
> > - regarding performance in sparse instances, I haven't done any sort
> > of benchmark yet. The implementation can be changed while keeping the
> > same API.
> > From what I see, on the one hand, in the current approach using an
> > index array, we have the extra constraints that 1/ this index array
> > must be sorted (adds building time), and 2/ we have to do a binary
> > search for the index value (log(n)).
> > On the other hand, there are some very efficient map implementations
> > that we could reuse. For example, CERN's colt package, actually
> > already imported in the mahout-collections ASF package.
> >
> > I hope this answers your questions,
> >
> > Matthieu
> >
> >
> > On Mon, Jan 26, 2015 at 7:30 AM, Albert Bifet <[email protected]>
> > wrote:
> > > Nice and simple API! Some things to comment:
> > >
> > > - how can we manage discrete attributes, for example attribute class:
> > > "+","-"?
> > >
> > > - In sparse instances, is the performance of a map similar to the
> > > performance of two arrays, one for indices and one for values?
> > >
> > > Albert
> > >
> > > On Sat, Jan 24, 2015 at 1:38 AM, Matthieu Morel <[email protected]> wrote:
> > >
> > >> I took a shot at drafting a simplified API for instances.
> > >> https://github.com/matthieumorel/samoa/tree/new-instances
> > >>
> > >> As pointed out in this thread, the current API is too exhaustive, too
> > >> tied to a specific implementation, and too tied to WEKA/MOA APIs.
> > >>
> > >> In addition, I feel the header/information does not belong to the
> > >> instance. This is something which is used when parsing arff files
> > >> where static information about the stream is available from the start.
> > >> But for a real streaming use case, we should not make such an assumption.
> > >> Whatever is known at the beginning should be loaded by the topology,
> > >> but not included in the instances. Arff files can still be loaded and
> > >> generate instances in the new format. Only the headers should be
> > >> parsed separately.
> > >>
> > >> This proposal is a draft and single label only. It should be easy to
> > >> add functionality suggested by Albert for multi labels.
> > >>
> > >> Feel free to comment!
> > >>
> > >> Matthieu
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Jan 21, 2015 at 2:31 AM, Albert Bifet <[email protected]>
> > >> wrote:
> > >> > 1/ Learners such as decision trees can deal with new instances that
> > >> > arrive with more label classes. New instances can arrive with new
> > >> > headers.
> > >> >
> > >> > 2/ To change class labels dynamically, we need to add a method
> > >> > "setValue(int, string)" to the Attribute class to add new attribute
> > >> > values dynamically.
> > >> >
> > >> > 3/ The current design is compatible with the methods in WEKA
> > >> > instances. It would be nice to have a fresher design. I will need
> > >> > some help to come up with a simplified and fresher design of the
> > >> > instances, as I'm a bit conditioned by the previous instance usage :)
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Albert
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Jan 21, 2015 at 2:33 AM, Olivier Van Laere
> > >> > <[email protected]> wrote:
> > >> >> Hey Matthieu,
> > >> >>
> > >> >>> On Jan 20, 2015, at 1:47 AM, Matthieu Morel <[email protected]> wrote:
> > >> >>>
> > >> >>> I'm confused. From what I see the number of classes is currently
> > >> >>> fixed in the instance header. As if it was static. I suppose you
> > >> >>> can work around that limitation with some hacks but I want to use
> > >> >>> a clean API for that.
> > >> >>>
> > >> >>> Or is there a recommended way I'm missing?
> > >> >>
> > >> >> Ah, I think I remember now what happened. When I encountered this
> > >> >> situation, the data was in, say, .arff format with a header stating
> > >> >> the number of class values, and the instance header was read from
> > >> >> that, while the instances were then read line by line.
> > >> >>
> > >> >> I worked around that by just storing the class labels seen in the
> > >> >> instances on the fly when building a model, and ignoring that field
> > >> >> of the instance header. Sorry for the confusion!
> > >> >>
> > >> >> Cheers,
> > >> >> Olivier
> > >> >>
> > >> >>
> > >>
> >
>
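To make the sparse-instance question in the thread concrete, here is a minimal Java sketch of the two storage options being compared: a sorted index array with binary search, and a plain map. It is only an illustration under assumptions. The class names `ArraySparseVector` and `MapSparseVector` are hypothetical and are not taken from SAMOA or from the new-instances branch, and `java.util.HashMap` stands in for the specialised primitive maps mentioned above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sparse vector backed by two parallel arrays (not a SAMOA class). */
final class ArraySparseVector {
    private final int[] indices;   // attribute indices, kept sorted ascending
    private final double[] values; // values[i] is the value of attribute indices[i]

    ArraySparseVector(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have equal length");
        }
        int n = indices.length;
        // Sorting is the extra building cost mentioned in the thread.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Integer.compare(indices[a], indices[b]));
        this.indices = new int[n];
        this.values = new double[n];
        for (int i = 0; i < n; i++) {
            this.indices[i] = indices[order[i]];
            this.values[i] = values[order[i]];
        }
    }

    /** Binary search over the sorted indices; absent attributes read as 0.0. */
    double get(int attributeIndex) {
        int pos = Arrays.binarySearch(indices, attributeIndex);
        return pos >= 0 ? values[pos] : 0.0;
    }
}

/** Hypothetical sparse vector backed by a hash map (not a SAMOA class). */
final class MapSparseVector {
    private final Map<Integer, Double> entries = new HashMap<>();

    void set(int attributeIndex, double value) {
        entries.put(attributeIndex, value);
    }

    /** Expected constant-time lookup; absent attributes read as 0.0. */
    double get(int attributeIndex) {
        return entries.getOrDefault(attributeIndex, 0.0);
    }
}
```

Both classes expose the same `get(int)` call, which matches the point that the implementation can change behind a stable API. Which one is faster in practice would still need the benchmark that has not been run yet, since the boxed `HashMap` here pays allocation costs that a primitive map would avoid.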

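The discrete-attribute point in the thread (labels such as "+" and "-" ending up as double values, with the mapping kept outside the instance) could look roughly like the sketch below. This is an assumption about one possible shape, not the design in the branch: `NominalMapping` and its methods are hypothetical names. It grows as new labels arrive, and it reports how many values have been seen so that a learner can ask for the attribute's cardinality.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical label-to-double mapping for one discrete attribute (not a SAMOA class). */
final class NominalMapping {
    private final Map<String, Integer> indexByLabel = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    /** Returns the double code for a label, assigning the next free code to unseen labels. */
    double encode(String label) {
        Integer index = indexByLabel.get(label);
        if (index == null) {
            index = labels.size();
            indexByLabel.put(label, index);
            labels.add(label);
        }
        return index.doubleValue();
    }

    /** Returns the label for a code produced by encode; throws if the code is unknown. */
    String decode(double code) {
        return labels.get((int) code);
    }

    /** Number of distinct labels seen so far. */
    int numValues() {
        return labels.size();
    }
}
```

The feature extraction step would call `encode` while building instances, and the topology would need some way to read the same mapping, which is the sharing question that is still open in the thread.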