Hi Gianmarco,

I have closed the PR.

Let me know where I should put the instructions for using Avro, the input
format document, and the test data sets:

https://drive.google.com/file/d/0B844rHJZHzKMdk5oMHZWREdxMnM/view?usp=sharing


https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing


https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing


Thanks
Jay
https://github.com/jayadeepj


On Thu, Nov 5, 2015 at 3:05 PM, Jayadeep J <[email protected]> wrote:

> Hi Gianmarco,
>
> All the test instructions, test data & other details are updated on the
> pull request
>
> Thanks
> Jay
> https://github.com/jayadeepj
>
> On Thu, Nov 5, 2015 at 12:50 PM, Gianmarco De Francisci Morales <
> [email protected]> wrote:
>
>> Thanks Jay,
>>
>> I'll test it this weekend. Do you have some instructions and data I could
>> use to try it out?
>>
>> --
>> Gianmarco
>>
>> On 4 November 2015 at 16:47, Jayadeep J <[email protected]> wrote:
>>
>> > Hi Gianmarco,
>> >
>> > I have implemented this functionality as per the suggestions and have
>> > raised a pull request.
>> >
>> > The implementation details are as below.
>> >
>> > 1) A new AvroFileStream as a subclass of the existing FileStream that
>> > will take the encoding format (JSON/binary) from the command line. It
>> > will use an InputStream instead of the current io Reader to handle
>> > binary streams.
>> > 2) A common Loader interface to make the parsing of streams generic
>> > rather than ARFF-only
>> > 3) A new AvroLoader abstract class in samoa-instances that will handle
>> > the parsing of Avro GenericRecords from the InputStream into SAMOA
>> > instances. If even one attribute in the Avro schema has a null union
>> > (i.e. a nullable attribute), the record will be converted into a SAMOA
>> > SparseInstance; otherwise into a DenseInstance
>> > 4) Two subclasses of AvroLoader for binary & JSON parsing, i.e.
>> > AvroJsonLoader & AvroBinaryLoader. Both will set the metadata & Avro
>> > schema on initialization, and they will use separate decoders to read
>> > from the stream
>> > 5) Appropriate changes in the POMs, Instances.java & ARFFLoader to use
>> > the new Loader interface
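A rough sketch of how points 1-4 could fit together in Java. Everything here beyond the names Loader, AvroLoader, AvroJsonLoader & AvroBinaryLoader is an illustrative guess, not the actual PR code, and the real classes would of course build SAMOA instances rather than return strings:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class LoaderSketch {

    // Common Loader interface (point 2): stream parsing is generic,
    // not ARFF-only. The real interface would expose SAMOA instances.
    interface Loader {
        String description();
    }

    // Abstract Avro loader (point 3): shared InputStream handling.
    static abstract class AvroLoader implements Loader {
        protected final InputStream input;

        protected AvroLoader(InputStream input) {
            this.input = input;
            // The real class would also parse the Avro schema here and
            // derive the SAMOA instance header (metadata) from it.
        }
    }

    // Point 4: one subclass per encoding, each with its own decoder.
    static class AvroJsonLoader extends AvroLoader {
        AvroJsonLoader(InputStream in) { super(in); }
        public String description() { return "avro-json"; }
    }

    static class AvroBinaryLoader extends AvroLoader {
        AvroBinaryLoader(InputStream in) { super(in); }
        public String description() { return "avro-binary"; }
    }

    // Point 1: AvroFileStream would pick the loader based on the
    // encoding format given on the command line.
    static Loader forEncoding(String encoding, InputStream in) {
        return "json".equalsIgnoreCase(encoding)
                ? new AvroJsonLoader(in)
                : new AvroBinaryLoader(in);
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[0]);
        System.out.println(forEncoding("json", in).description());   // avro-json
        System.out.println(forEncoding("binary", in).description()); // avro-binary
    }
}
```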
>> >
>> > I have seen that the Travis build failed, though I couldn't tell from
>> > the logs whether it is due to this code change
>> >
>> > Thanks
>> > Jay
>> > https://github.com/jayadeepj
>> >
>> > On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales <
>> > [email protected]> wrote:
>> >
>> > > Hi Jay,
>> > >
>> > > 1) I agree custom data types would be overkill.
>> > > I was thinking of the second option you mentioned, distinguishing it
>> > > inside the code.
>> > > So the parser code would expect either all values to be optional, or
>> all
>> > > values to be required.
>> > >
>> > > I think the plan you have in mind is quite reasonable.
>> > > I don't have other suggestions right now.
>> > >
>> > > Thanks,
>> > >
>> > > --
>> > > Gianmarco
>> > >
>> > > On 21 October 2015 at 11:39, Jayadeep J <[email protected]> wrote:
>> > >
>> > >> Hi Gianmarco,
>> > >>
>> > >> Thanks for your reply. Regarding the points you mentioned,
>> > >>
>> > >> 1) W.r.t. sparse & dense instances, I am trying to understand what
>> > >> you meant by "prototypes". Did you mean creating custom Avro data
>> > >> types like 'SparseNumeric', 'SparseNominal', 'DenseInstance', etc.?
>> > >> If yes, the actual data stored in the file (JSON-encoded) may become
>> > >> heavy. For example, for the iris data set, if we decide to use a
>> > >> 'SparseNumeric' type for 'sepallength',
>> > >>
>> > >> {"name":
>> > >>
>> >
>> "sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
>> > >>
>> > >> the data may look like this,
>> > >>
>> > >>
>> >
>> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
>> > >>
>> > >>
>> >
>> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
>> > >>
>> > >> Converting existing Avro data into a 'SAMOA-compatible Avro' format
>> > >> may become painful for users. Wouldn't it be easier if we just
>> > >> distinguish it inside the code, say, if at least one attribute in
>> > >> the metadata uses the generic Avro optionality (e.g. ["null",
>> > >> "double"]), then we do readInstanceSparse() in the Loader and map
>> > >> correspondingly? Or is there some other complexity that I have not
>> > >> looked at?
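The rule being proposed here (at least one attribute with a ["null", ...] union means the stream maps to sparse instances) can be sketched in plain Java. The Map-based schema below is a simplified stand-in for the real Avro Schema API, so treat it as an illustration of the rule, not of the actual implementation:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SparseCheckSketch {

    // Simplified stand-in for an Avro record schema: attribute name ->
    // union of type names (a one-element list means a non-union type).
    static boolean useSparseInstances(Map<String, List<String>> schema) {
        // If even one attribute has a null union (nullable attribute),
        // map the stream to SparseInstance; otherwise to DenseInstance.
        return schema.values().stream()
                .anyMatch(union -> union.size() > 1 && union.contains("null"));
    }

    public static void main(String[] args) {
        Map<String, List<String>> iris = new LinkedHashMap<>();
        iris.put("sepallength", Arrays.asList("null", "double")); // nullable
        iris.put("sepalwidth", Arrays.asList("double"));
        iris.put("class", Arrays.asList("string"));
        System.out.println(useSparseInstances(iris)); // true -> sparse
    }
}
```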
>> > >>
>> > >> 2) Yes, skipping the Date-type attributes will make it easier!
>> > >>
>> > >> Regarding the engineering aspects,
>> > >>
>> > >> We can have the Avro dependency in the deployable jar of SAMOA. In
>> > >> the code, maybe:
>> > >>
>> > >> 1) We could have an Avro equivalent of ArffFileStream.java &
>> > >> ArffLoader
>> > >> 2) Maybe a different Reader altogether for handling binary streams
>> > >> 3) A user option to switch between JSON/binary encoding
>> > >>
>> > >> If there is a better way to do it, kindly advise.
>> > >>
>> > >> Thanks
>> > >> Jay
>> > >> https://github.com/jayadeepj
>> > >>
>> > >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales <
>> > >> [email protected]> wrote:
>> > >>
>> > >>> Hi Jayadeep,
>> > >>>
>> > >>> I think it's pretty cool!
>> > >>> If we get both Avro and Kafka support right, we can connect to
>> almost
>> > >>> anything.
>> > >>>
>> > >>> The document looks very comprehensive, you seem to have given a lot
>> of
>> > >>> thought to it.
>> > >>> I am not extremely familiar with Avro myself, I've just used it a
>> > couple
>> > >>> of times, but I'll try to provide some suggestions.
>> > >>>
>> > >>> - The general idea of where and how to store data and meta-data
>> seems
>> > >>> right.
>> > >>> - In general, all attributes in a sparse instance are optional, and
>> all
>> > >>> attributes in a dense instance are required. Maybe we want to be
>> more
>> > >>> granular than this in the future, but it seems that Avro supports a
>> > >>> superset of these settings. We may want to have some defaults
>> > "prototypes"
>> > >>> in order to make mapping the current dense/sparse instances easy.
>> > >>> - Right now we are not making use of Date-type attributes in SAMOA
>> > >>> (there is no such thing in samoa-instances), so if it makes it
>> easier
>> > we
>> > >>> could skip supporting it. Ideally we could have algorithms that
>> respect
>> > >>> event-time as provided by timestamps in the instances (as opposed to
>> > >>> processing the event whenever it arrives), however we are not there
>> > yet :)
>> > >>>
>> > >>> All the rest seems pretty straightforward.
>> > >>>
>> > >>> Moving to the more software-engineering oriented aspects, where
>> would
>> > we
>> > >>> have dependencies for Avro? And how should they be deployed? Would
>> they
>> > >>> simply go inside the deployable uber-jar of SAMOA?
>> > >>>
>> > >>> Thanks,
>> > >>>
>> > >>> --
>> > >>> Gianmarco
>> > >>>
>> > >>> On 19 October 2015 at 11:24, Jayadeep J <[email protected]>
>> wrote:
>> > >>>
>> > >>>> Hi Gianmarco / All,
>> > >>>>
>> > >>>> I am working on an integration of SAMOA with Apache Avro.
>> Basically I
>> > >>>> want to use data stored in Avro Files to be used as input to SAMOA.
>> > >>>>
>> > >>>> As I understand, the current SAMOA readers only support the ARFF
>> > >>>> format. Do you think such a feature would be useful to SAMOA in
>> > >>>> general? Avro allows two encodings for the data: binary & JSON.
>> > >>>> Hence Avro support may also allow users with JSON data to use
>> > >>>> SAMOA.
>> > >>>>
>> > >>>> Based on the input given by @gdfm to @ctippur, I have prepared an
>> > Input
>> > >>>> Format document in Google Docs.
>> > >>>>
>> > >>>>
>> > >>>>
>> >
>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
>> > >>>>
>> > >>>>
>> > >>>> Would it be possible for you to have a look and provide your
>> > >>>> valuable suggestions? Thanks
>> > >>>>
>> > >>>>
>> > >>>> Thanks
>> > >>>> Jay
>> > >>>> https://github.com/jayadeepj
>> > >>>>
>> > >>>
>> > >>>
>> > >>
>> > >>
>> > >> --
>> > >> Thanks
>> > >> Jay
>> > >>
>> > >>
>>
>
