Hi Gianmarco,

I have created a PR with the documentation for the website docs.

The zipped test data sets are two files of 20 MB each, one for JSON and one
for Binary. If you can attach them to the wiki, that would be great. I don't
think I have access to create a wiki page. The download links are below:

https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing


https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing


Thanks
Jay



On Mon, Nov 30, 2015 at 9:09 PM, Gianmarco De Francisci Morales <
[email protected]> wrote:

> Thanks Jayadeep,
>
> I think the docs could go in the website docs.
> Not sure about the test datasets. Maybe as attachments in the wiki if they
> are not too big?
>
> --
> Gianmarco
>
> On 30 November 2015 at 14:38, Jayadeep J <[email protected]> wrote:
>
> > Hi Gianmarco,
> >
> > I have closed the PR
> >
> > Let me know where to put the instructions for using Avro, the Input Format
> > document, and the test data sets.
> >
> >
> > https://drive.google.com/file/d/0B844rHJZHzKMdk5oMHZWREdxMnM/view?usp=sharing
> >
> > https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
> >
> > https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing
> >
> > Thanks
> > Jay
> > https://github.com/jayadeepj
> >
> >
> > On Thu, Nov 5, 2015 at 3:05 PM, Jayadeep J <[email protected]> wrote:
> >
> > > Hi Gianmarco,
> > >
> > > All the test instructions, test data & other details are updated on the
> > > pull request
> > >
> > > Thanks
> > > Jay
> > > https://github.com/jayadeepj
> > >
> > > On Thu, Nov 5, 2015 at 12:50 PM, Gianmarco De Francisci Morales <
> > > [email protected]> wrote:
> > >
> > >> Thanks Jay,
> > >>
> > > >> I'll test it this weekend. Do you have some instructions and data I could
> > > >> use to try it out?
> > >>
> > >> --
> > >> Gianmarco
> > >>
> > >> On 4 November 2015 at 16:47, Jayadeep J <[email protected]> wrote:
> > >>
> > >> > Hi Gianmarco,
> > >> >
> > > >> > I have implemented this functionality as per the suggestions and have
> > > >> > raised a pull request.
> > >> >
> > >> > The implementation details are as below.
> > >> >
> > > >> > 1) A new AvroFileStream, a subclass of the existing FileStream, that
> > > >> > takes the encoding format (json/binary) from the command line. It uses
> > > >> > an InputStream instead of the current io Reader to handle binary streams.
> > > >> > 2) A common Loader interface to make the parsing of streams generic
> > > >> > rather than ARFF-only.
> > > >> > 3) A new AvroLoader abstract class in samoa-instances that handles the
> > > >> > parsing of the Avro Generic Records from the InputStream into SAMOA
> > > >> > instances. If even one attribute in the Avro schema has a null union
> > > >> > (a nullable attribute), the record is converted into a SAMOA Sparse
> > > >> > Instance, otherwise into a DenseInstance.
> > > >> > 4) Two subclasses of AvroLoader for Binary and JSON parsing, i.e.
> > > >> > AvroJsonLoader and AvroBinaryLoader. Both set the metadata and the Avro
> > > >> > schema on initialization. They use separate decoders to read from the
> > > >> > stream.
> > > >> > 5) Appropriate changes in the poms, Instances.java and ARFFLoader to use
> > > >> > the new Loader interface.
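
The structure in points 1, 2 and 4 above (a common Loader interface, an abstract base, and one subclass per encoding chosen from a command-line value) can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code from the pull request: the names LoaderSketch, AvroLoaderBase, JsonLoader, BinaryLoader and forEncoding are hypothetical, and the two decoders are simple stand-ins (newline-delimited text and length-prefixed bytes) rather than Avro's real JSON and binary decoders.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class LoaderSketch {

    /** Hypothetical common interface: one record per call, null at end of stream. */
    interface Loader {
        String readRecord();
    }

    /** Shared base: holds the byte stream; subclasses supply the decoding. */
    static abstract class AvroLoaderBase implements Loader {
        protected final InputStream in;

        AvroLoaderBase(InputStream in) {
            this.in = in;
        }

        /** Reads one byte, wrapping the checked IOException to keep the sketch short. */
        protected int readByte() {
            try {
                return in.read();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Stand-in for a JSON decoder (assumption): one newline-terminated text record per call. */
    static class JsonLoader extends AvroLoaderBase {
        JsonLoader(InputStream in) {
            super(in);
        }

        @Override
        public String readRecord() {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = readByte()) != -1 && b != '\n') {
                out.write(b);
            }
            if (b == -1 && out.size() == 0) {
                return null; // nothing left in the stream
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Stand-in for a binary decoder (assumption): one length byte, then that many payload bytes. */
    static class BinaryLoader extends AvroLoaderBase {
        BinaryLoader(InputStream in) {
            super(in);
        }

        @Override
        public String readRecord() {
            int length = readByte();
            if (length == -1) {
                return null; // nothing left in the stream
            }
            byte[] payload = new byte[length];
            for (int i = 0; i < length; i++) {
                int b = readByte();
                if (b == -1) {
                    throw new IllegalStateException("truncated record");
                }
                payload[i] = (byte) b;
            }
            return new String(payload, StandardCharsets.UTF_8);
        }
    }

    /** Maps the command-line value (json/binary, case-insensitive) to a loader. */
    static Loader forEncoding(String encoding, InputStream in) {
        switch (encoding.toLowerCase(Locale.ROOT)) {
            case "json":
                return new JsonLoader(in);
            case "binary":
                return new BinaryLoader(in);
            default:
                throw new IllegalArgumentException("Unknown encoding: " + encoding);
        }
    }

    public static void main(String[] args) {
        byte[] data = "{\"a\":1}\n{\"a\":2}\n".getBytes(StandardCharsets.UTF_8);
        Loader loader = forEncoding("json", new ByteArrayInputStream(data));
        for (String r = loader.readRecord(); r != null; r = loader.readRecord()) {
            System.out.println(r);
        }
    }
}
```

Working on an InputStream rather than a character Reader is what lets the same base class serve the binary case, since binary records are not valid text.
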
> > >> >
> > > >> > I have seen that the Travis build has failed, though. I couldn't tell
> > > >> > from the logs whether it is due to this code change.
> > >> >
> > >> > Thanks
> > >> > Jay
> > >> > https://github.com/jayadeepj
> > >> >
> > >> > On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
> > >> > [email protected]> wrote:
> > >> >
> > >> > > Hi Jay,
> > >> > >
> > >> > > 1) I agree custom data types would be overkill.
> > > >> > > I was thinking of the second option you mentioned, distinguishing it
> > > >> > > inside the code.
> > > >> > > So the parser code would expect either all values to be optional, or all
> > > >> > > values to be required.
> > >> > >
> > >> > > I think the plan you have in mind is quite reasonable.
> > >> > > I don't have other suggestions right now.
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > --
> > >> > > Gianmarco
> > >> > >
> > > >> > > On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
> > >> > >
> > >> > >> Hi Gianmarco,
> > >> > >>
> > >> > >> Thanks for your reply. Regarding the points you mentioned,
> > >> > >>
> > > >> > >> 1) W.r.t. Sparse & Dense instances, I am trying to understand what you
> > > >> > >> meant by "prototypes". Did you mean creating custom Avro data types like
> > > >> > >> 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.? If yes, the actual
> > > >> > >> data stored in the file (JSON encoded) may become heavy. For example, for
> > > >> > >> the iris data set, if we decide to use a 'SparseNumeric' type for
> > > >> > >> 'sepallength',
> > > >> > >>
> > > >> > >> {"name": "sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
> > > >> > >>
> > > >> > >> the data may look like this,
> > > >> > >>
> > > >> > >> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
> > > >> > >>
> > > >> > >> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
> > >> > >>
> > > >> > >> Converting existing Avro data into 'SAMOA compatible Avro' may become
> > > >> > >> painful for a user. Wouldn't it be easier if we just distinguish it inside
> > > >> > >> the code? Say, if at least one attribute in the metadata uses the generic
> > > >> > >> Avro optionality (e.g. ["null", "double"]), then we do readInstanceSparse()
> > > >> > >> in the Loader and map correspondingly. Or is there some other complexity
> > > >> > >> that I have not looked at?
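
The rule proposed above (one nullable attribute is enough to read the data as sparse) can be sketched in plain Java as follows. This is a minimal sketch under stated assumptions, not SAMOA or Avro code: the schema is modelled as a map from attribute name to its declared type list, standing in for a parsed Avro schema, and the names SparseOrDenseSketch, InstanceKind, isNullable and classify are hypothetical. With the Avro library itself, the equivalent check would presumably inspect each field's schema for a union that contains the null type.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SparseOrDenseSketch {

    enum InstanceKind { SPARSE, DENSE }

    /** True if the declared types form a union that includes "null", e.g. ["null", "double"]. */
    static boolean isNullable(List<String> declaredTypes) {
        return declaredTypes.size() > 1 && declaredTypes.contains("null");
    }

    /** One nullable attribute is enough to treat the whole data set as sparse. */
    static InstanceKind classify(Map<String, List<String>> schema) {
        for (List<String> types : schema.values()) {
            if (isNullable(types)) {
                return InstanceKind.SPARSE;
            }
        }
        return InstanceKind.DENSE;
    }

    public static void main(String[] args) {
        // Simplified stand-in for an iris schema; only "sepallength" is optional.
        Map<String, List<String>> iris = new LinkedHashMap<>();
        iris.put("sepallength", Arrays.asList("null", "double"));
        iris.put("sepalwidth", Arrays.asList("double"));
        iris.put("class", Arrays.asList("string"));
        System.out.println(classify(iris)); // prints SPARSE
    }
}
```

The appeal of this approach is that existing Avro files need no SAMOA-specific types; the decision is derived from the schema the user already has.
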
> > >> > >>
> > > >> > >> 2) Yes, skipping the Date-type attributes will make it easier!
> > > >> > >>
> > > >> > >> Regarding the engineering aspects,
> > > >> > >>
> > > >> > >> We can have the Avro dependency in the deployable jar of SAMOA. In the
> > > >> > >> code, maybe
> > > >> > >>
> > > >> > >> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
> > > >> > >> 2) Maybe a different Reader altogether for handling the binary stream
> > > >> > >> 3) A user option to switch between JSON/Binary encoding
> > > >> > >>
> > > >> > >> If there is a better way to do it, please advise.
> > >> > >>
> > >> > >> Thanks
> > >> > >> Jay
> > >> > >> https://github.com/jayadeepj
> > >> > >>
> > > >> > >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
> > > >> > >> [email protected]> wrote:
> > >> > >>
> > >> > >>> Hi Jayadeep,
> > >> > >>>
> > >> > >>> I think it's pretty cool!
> > > >> > >>> If we get both Avro and Kafka support right, we can connect to almost
> > > >> > >>> anything.
> > > >> > >>>
> > > >> > >>> The document looks very comprehensive, you seem to have given a lot of
> > > >> > >>> thought to it.
> > > >> > >>> I am not extremely familiar with Avro myself, I've just used it a couple
> > > >> > >>> of times, but I'll try to provide some suggestions.
> > >> > >>>
> > > >> > >>> - The general idea of where and how to store data and meta-data seems
> > > >> > >>> right.
> > > >> > >>> - In general, all attributes in a sparse instance are optional, and all
> > > >> > >>> attributes in a dense instance are required. Maybe we want to be more
> > > >> > >>> granular than this in the future, but it seems that Avro supports a
> > > >> > >>> superset of these settings. We may want to have some default "prototypes"
> > > >> > >>> in order to make mapping the current dense/sparse instances easy.
> > > >> > >>> - Right now we are not making use of Date-type attributes in SAMOA
> > > >> > >>> (there is no such thing in samoa-instances), so if it makes it easier we
> > > >> > >>> could skip supporting it. Ideally we could have algorithms that respect
> > > >> > >>> event-time as provided by timestamps in the instances (as opposed to
> > > >> > >>> processing the event whenever it arrives), however we are not there yet :)
> > >> > >>>
> > >> > >>> All the rest seems pretty straightforward.
> > >> > >>>
> > > >> > >>> Moving to the more software-engineering oriented aspects, where would we
> > > >> > >>> have dependencies for Avro? And how should they be deployed? Would they
> > > >> > >>> simply go inside the deployable uber-jar of SAMOA?
> > >> > >>>
> > >> > >>> Thanks,
> > >> > >>>
> > >> > >>> --
> > >> > >>> Gianmarco
> > >> > >>>
> > > >> > >>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]> wrote:
> > >> > >>>
> > >> > >>>> Hi Gianmarco / All,
> > >> > >>>>
> > > >> > >>>> I am working on an integration of SAMOA with Apache Avro. Basically I
> > > >> > >>>> want data stored in Avro files to be usable as input to SAMOA.
> > > >> > >>>>
> > > >> > >>>> As I understand, current SAMOA readers only support the ARFF format. Do
> > > >> > >>>> you think such a feature would be useful to SAMOA in general? Avro allows
> > > >> > >>>> two encodings for the data: Binary & JSON. Hence Avro support may also
> > > >> > >>>> allow users with JSON data to use SAMOA.
> > > >> > >>>>
> > > >> > >>>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
> > > >> > >>>> Format document in Google Docs.
> > >> > >>>>
> > >> > >>>>
> > >> > >>>>
> > >> >
> > >>
> >
> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
> > >> > >>>>
> > >> > >>>>
> > > >> > >>>> Would it be possible for you to have a look and provide your valuable
> > > >> > >>>> suggestions? Thanks
> > >> > >>>>
> > >> > >>>>
> > >> > >>>> Thanks
> > >> > >>>> Jay
> > >> > >>>> https://github.com/jayadeepj
> > >> > >>>>
> > >> > >>>
> > >> > >>>
> > >> > >>
> > >> > >>
> > >> > >> --
> > >> > >> Thanks
> > >> > >> Jay
> > >> > >>
> > >> > >>
> > >>
> > >
> >
>
