Thanks Jayadeep,

I think the docs could go in the website docs.
Not sure about the test datasets. Maybe as attachments in the wiki if they
are not too big?

--
Gianmarco

On 30 November 2015 at 14:38, Jayadeep J <[email protected]> wrote:

> Hi Gianmarco,
>
> I have closed the PR.
>
> Please let me know where to put the instructions for using Avro, the input
> format document, and the test data sets.
>
> https://drive.google.com/file/d/0B844rHJZHzKMdk5oMHZWREdxMnM/view?usp=sharing
> https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
> https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing
>
> Thanks
> Jay
> https://github.com/jayadeepj
>
>
> On Thu, Nov 5, 2015 at 3:05 PM, Jayadeep J <[email protected]> wrote:
>
> > Hi Gianmarco,
> >
> > All the test instructions, test data & other details are updated on the
> > pull request
> >
> > Thanks
> > Jay
> > https://github.com/jayadeepj
> >
> > On Thu, Nov 5, 2015 at 12:50 PM, Gianmarco De Francisci Morales <
> > [email protected]> wrote:
> >
> >> Thanks Jay,
> >>
> >> I'll test it this weekend. Do you have some instructions and data I could
> >> use to try it out?
> >>
> >> --
> >> Gianmarco
> >>
> >> On 4 November 2015 at 16:47, Jayadeep J <[email protected]> wrote:
> >>
> >> > Hi Gianmarco,
> >> >
> >> > I have implemented this functionality as per the suggestions and have
> >> > raised a pull request.
> >> >
> >> > The implementation details are as below.
> >> >
> >> > 1) A new AvroFileStream, a subclass of the existing FileStream, that takes
> >> > the encoding format (json/binary) from the command line. It uses an
> >> > InputStream instead of the current io Reader so that it can handle binary
> >> > streams.
> >> > 2) A common Loader interface to make the parsing of streams generic rather
> >> > than ARFF-only.
> >> > 3) A new AvroLoader abstract class in samoa-instances that handles the
> >> > parsing of Avro Generic Records from the InputStream into SAMOA instances.
> >> > If even one attribute in the Avro schema has a null union (nullable
> >> > attribute), the record is converted into a SAMOA SparseInstance, otherwise
> >> > into a DenseInstance.
> >> > 4) Two subclasses of AvroLoader for binary and JSON parsing, i.e.
> >> > AvroJsonLoader and AvroBinaryLoader. Both set the meta-data and Avro
> >> > schema on initialization. They use separate decoders to read from the
> >> > stream.
> >> > 5) Appropriate changes in the poms, Instances.java and ARFFLoader to use
> >> > the new Loader interface.
> >> >
> >> > However, I have seen that the Travis build has failed. I couldn't tell
> >> > from the logs whether it is due to this code change.
> >> >
> >> > Thanks
> >> > Jay
> >> > https://github.com/jayadeepj
> >> >
> >> > On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
> >> > [email protected]> wrote:
> >> >
> >> > > Hi Jay,
> >> > >
> >> > > 1) I agree custom data types would be overkill.
> >> > > I was thinking of the second option you mentioned, distinguishing it
> >> > > inside the code.
> >> > > So the parser code would expect either all values to be optional, or all
> >> > > values to be required.
> >> > >
> >> > > I think the plan you have in mind is quite reasonable.
> >> > > I don't have other suggestions right now.
> >> > >
> >> > > Thanks,
> >> > >
> >> > > --
> >> > > Gianmarco
> >> > >
> >> > > On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
> >> > >
> >> > >> Hi Gianmarco,
> >> > >>
> >> > >> Thanks for your reply. Regarding the points you mentioned,
> >> > >>
> >> > >> 1) W.r.t. sparse and dense instances, I am trying to understand what you
> >> > >> meant by "prototypes". Did you mean creating custom Avro data types like
> >> > >> 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes, the
> >> > >> actual data stored in the file (JSON encoded) may become heavy. For
> >> > >> example, for the iris data-set, if we decide to use a 'SparseNumeric'
> >> > >> type for 'sepallength',
> >> > >>
> >> > >> {"name": "sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
> >> > >>
> >> > >> the data may look like this,
> >> > >>
> >> > >> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
> >> > >>
> >> > >> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
> >> > >>
> >> > >> Converting existing Avro data into 'SAMOA compatible Avro' may become
> >> > >> painful for users. Wouldn't it be easier if we just distinguish it inside
> >> > >> the code? Say, if at least one attribute in the metadata uses the generic
> >> > >> Avro optionality (e.g. ["null", "double"]), then we do
> >> > >> readInstanceSparse() in the Loader and map correspondingly. Or is there
> >> > >> some other complexity that I have not looked at?
> >> > >>
> >> > >> 2) Yes. Skipping the Date-type attributes will make it easier!
> >> > >>
> >> > >> Regarding the engineering aspects,
> >> > >>
> >> > >> We can have the Avro dependency in the deployable jar of SAMOA. In the
> >> > >> code, maybe
> >> > >>
> >> > >> 1) We could have an Avro equivalent of ArffFileStream.java and ArffLoader
> >> > >> 2) Maybe a different Reader altogether for handling the binary stream
> >> > >> 3) A user option to switch between JSON/binary encoding
> >> > >>
> >> > >> If there is a better way to do it, kindly advise.
> >> > >>
> >> > >> Thanks
> >> > >> Jay
> >> > >> https://github.com/jayadeepj
> >> > >>
> >> > >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
> >> > >> [email protected]> wrote:
> >> > >>
> >> > >>> Hi Jayadeep,
> >> > >>>
> >> > >>> I think it's pretty cool!
> >> > >>> If we get both Avro and Kafka support right, we can connect to almost
> >> > >>> anything.
> >> > >>>
> >> > >>> The document looks very comprehensive, you seem to have given a lot of
> >> > >>> thought to it.
> >> > >>> I am not extremely familiar with Avro myself, I've just used it a couple
> >> > >>> of times, but I'll try to provide some suggestions.
> >> > >>>
> >> > >>> - The general idea of where and how to store data and meta-data seems
> >> > >>> right.
> >> > >>> - In general, all attributes in a sparse instance are optional, and all
> >> > >>> attributes in a dense instance are required. Maybe we want to be more
> >> > >>> granular than this in the future, but it seems that Avro supports a
> >> > >>> superset of these settings. We may want to have some default
> >> > >>> "prototypes" in order to make mapping the current dense/sparse instances
> >> > >>> easy.
> >> > >>> - Right now we are not making use of Date-type attributes in SAMOA
> >> > >>> (there is no such thing in samoa-instances), so if it makes it easier we
> >> > >>> could skip supporting it. Ideally we could have algorithms that respect
> >> > >>> event-time as provided by timestamps in the instances (as opposed to
> >> > >>> processing the event whenever it arrives), however we are not there
> >> > >>> yet :)
> >> > >>>
> >> > >>> All the rest seems pretty straightforward.
> >> > >>>
> >> > >>> Moving to the more software-engineering oriented aspects, where would we
> >> > >>> have dependencies for Avro? And how should they be deployed? Would they
> >> > >>> simply go inside the deployable uber-jar of SAMOA?
> >> > >>>
> >> > >>> Thanks,
> >> > >>>
> >> > >>> --
> >> > >>> Gianmarco
> >> > >>>
> >> > >>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]> wrote:
> >> > >>>
> >> > >>>> Hi Gianmarco / All,
> >> > >>>>
> >> > >>>> I am working on an integration of SAMOA with Apache Avro. Basically, I
> >> > >>>> want data stored in Avro files to be usable as input to SAMOA.
> >> > >>>>
> >> > >>>> As I understand it, current SAMOA readers only support the ARFF format.
> >> > >>>> Do you think such a feature would be useful to SAMOA in general? Avro
> >> > >>>> allows two encodings for the data: binary and JSON. Hence Avro support
> >> > >>>> may also allow users with JSON data to use SAMOA.
> >> > >>>>
> >> > >>>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
> >> > >>>> Format document in Google Docs.
> >> > >>>>
> >> > >>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
> >> > >>>>
> >> > >>>> Would it be possible for you to have a look and provide your valuable
> >> > >>>> suggestions?
> >> > >>>>
> >> > >>>> Thanks
> >> > >>>> Jay
> >> > >>>> https://github.com/jayadeepj
> >> > >>>>
> >> > >>>
> >> > >>>
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Thanks
> >> > >> Jay
> >> > >>
> >> > >>
> >>
> >
>
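
The sparse/dense rule described in the thread (point 3 of the 4 November
message: a record becomes a sparse instance as soon as one attribute in the
Avro schema is a union containing "null") can be sketched roughly as below.
This is only an illustration under stated assumptions, not the AvroLoader code
from the pull request: it is written in Python with the standard json module
rather than the Avro Java library, and the schemas and the names is_nullable
and instance_kind are hypothetical.

```python
import json

# Hypothetical iris-style schemas, written only for this illustration.
DENSE_SCHEMA = json.dumps({
    "type": "record",
    "name": "Iris",
    "fields": [
        {"name": "sepallength", "type": "double"},
        {"name": "sepalwidth", "type": "double"},
        {"name": "class", "type": {"type": "enum", "name": "Class",
                                   "symbols": ["setosa", "versicolor", "virginica"]}},
    ],
})

SPARSE_SCHEMA = json.dumps({
    "type": "record",
    "name": "Iris",
    "fields": [
        {"name": "sepallength", "type": ["null", "double"]},  # nullable attribute
        {"name": "sepalwidth", "type": "double"},
    ],
})


def is_null_type(avro_type):
    """True if a single Avro type is the null type, in short or long form."""
    if avro_type == "null":
        return True
    return isinstance(avro_type, dict) and avro_type.get("type") == "null"


def is_nullable(field_type):
    """True if the field type is a union (a JSON array) that includes null."""
    return isinstance(field_type, list) and any(is_null_type(t) for t in field_type)


def instance_kind(schema_json):
    """Return "sparse" if at least one field is nullable, otherwise "dense"."""
    schema = json.loads(schema_json)
    fields = schema.get("fields", [])
    return "sparse" if any(is_nullable(f["type"]) for f in fields) else "dense"


if __name__ == "__main__":
    print(instance_kind(DENSE_SCHEMA))   # dense
    print(instance_kind(SPARSE_SCHEMA))  # sparse
```

The decision is made once per schema rather than once per record, which matches
the idea of distinguishing sparse from dense inside the loader instead of
introducing custom Avro types such as 'SparseNumeric'.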
