Hi Gianmarco,

I have closed the PR.

Where should I put the instructions for using Avro, the Input Format document & the test data sets?

https://drive.google.com/file/d/0B844rHJZHzKMdk5oMHZWREdxMnM/view?usp=sharing
https://drive.google.com/file/d/0B844rHJZHzKMSFVwVVRPVjhCOTA/view?usp=sharing
https://drive.google.com/file/d/0B844rHJZHzKMSlRRaVA0TU0zRjQ/view?usp=sharing

Thanks
Jay
https://github.com/jayadeepj

On Thu, Nov 5, 2015 at 3:05 PM, Jayadeep J wrote:

> Hi Gianmarco,
>
> All the test instructions, test data & other details are updated on the
> pull request
>
> Thanks
> Jay
> https://github.com/jayadeepj
>
> On Thu, Nov 5, 2015 at 12:50 PM, Gianmarco De Francisci Morales wrote:
>
>> Thanks Jay,
>>
>> I'll test it this weekend. Do you have some instructions and data I could
>> use to try it out?
>>
>> --
>> Gianmarco
>>
>> On 4 November 2015 at 16:47, Jayadeep J wrote:
>>
>> > Hi Gianmarco,
>> >
>> > I have implemented this functionality as per the suggestions and have
>> > raised a pull request.
>> >
>> > The implementation details are as below.
>> >
>> > 1) A new AvroFileStream as a subclass of the existing FileStream that will
>> > take in the encoding format (json/binary) from the command line. It will use
>> > InputStream instead of the current io Reader to handle binary streams.
>> > 2) A common Loader interface to make the parsing of streams generic rather
>> > than only ARFF
>> > 3) A new AvroLoader abstract class in samoa-instances that will handle the
>> > parsing of the Avro Generic Records from InputStream into SAMOA instances.
>> > If even one attribute in the Avro schema has a null union (nullable
>> > attribute) then it will be converted into a SAMOA Sparse Instance else
>> > DenseInstance
>> > 4) Two sub-classes of AvroLoader for Binary & JSON parsing, i.e.
>> > AvroJsonLoader & AvroBinaryLoader. Both will set the meta-data & Avro
>> > schema on initialization.
>> > They will use separate decoders to read from the stream
>> > 5) Appropriate changes in poms, Instances.java & ArffLoader to use the new
>> > Loader interface
>> >
>> > However, I have seen that the Travis build has failed. I couldn't tell from
>> > the logs whether it is due to this code change.
>> >
>> > Thanks
>> > Jay
>> > https://github.com/jayadeepj
>> >
>> > On Mon, Oct 26, 2015 at 12:39 PM, Gianmarco De Francisci Morales wrote:
>> >
>> > > Hi Jay,
>> > >
>> > > 1) I agree custom data types would be overkill.
>> > > I was thinking of the second option you mentioned, distinguishing it
>> > > inside the code.
>> > > So the parser code would expect either all values to be optional, or all
>> > > values to be required.
>> > >
>> > > I think the plan you have in mind is quite reasonable.
>> > > I don't have other suggestions right now.
>> > >
>> > > Thanks,
>> > >
>> > > --
>> > > Gianmarco
>> > >
>> > > On 21 October 2015 at 11:39, Jayadeep J wrote:
>> > >
>> > >> Hi Gianmarco,
>> > >>
>> > >> Thanks for your reply. Regarding the points you mentioned,
>> > >>
>> > >> 1) W.r.t. Sparse & Dense instances, I am trying to understand what you
>> > >> meant by "prototypes". Did you mean creating custom Avro data types like
>> > >> 'SparseNumeric', 'SparseNominal', 'DenseInstance' etc.? If yes, the actual
>> > >> data stored in the file (JSON encoded) may become heavy.
>> > >> For example, for the iris data-set, if we decide to use a
>> > >> 'SparseNumeric' type for 'sepallength',
>> > >>
>> > >> {"name":"sepallength","type":["null",{"name":"SparseNumeric","type":"record","fields":[{"name":"field","type":["null","int","double","long"]}]}]},
>> > >>
>> > >> the data may look like this,
>> > >>
>> > >> {"sepallength":null,"sepalwidth":3.5,"petallength":1.4,"petalwidth":0.2,"class":"setosa"}
>> > >>
>> > >> {"sepallength":{"com.yahoo.labs.samoa.avro.iris.SparseNumeric":{"field":{"double":4.7}}},"sepalwidth":1.4,"petallength":4.9,"petalwidth":0.2,"class":"virginica"}
>> > >>
>> > >> Converting existing Avro data into 'SAMOA compatible Avro' may become
>> > >> painful for users. Wouldn't it be easier if we just distinguish it
>> > >> inside the code? Say, if at least one attribute in the metadata uses the
>> > >> generic Avro optionality (e.g. ["null", "double"]), then we do
>> > >> readInstanceSparse() in the Loader and map correspondingly. Or is
>> > >> there some other complexity that I have not looked at?
>> > >>
>> > >> 2) Yes. Skipping the Date-type attributes will make it easier!
>> > >>
>> > >> Regarding the engineering aspects,
>> > >>
>> > >> We can have the Avro dependency in the deployable jar of SAMOA. In the
>> > >> code, maybe
>> > >>
>> > >> 1) We could have an Avro equivalent of ArffFileStream.java & ArffLoader
>> > >> 2) Maybe a different Reader altogether for handling the binary stream
>> > >> 3) A user option to switch between JSON/Binary encoding
>> > >>
>> > >> If there is a better way to do it, kindly advise.
>> > >>
>> > >> Thanks
>> > >> Jay
>> > >> https://github.com/jayadeepj
>> > >>
>> > >> On Tue, Oct 20, 2015 at 12:57 PM, Gianmarco De Francisci Morales wrote:
>> > >>
>> > >>> Hi Jayadeep,
>> > >>>
>> > >>> I think it's pretty cool!
>> > >>> If we get both Avro and Kafka support right, we can connect to almost
>> > >>> anything.
>> > >>>
>> > >>> The document looks very comprehensive; you seem to have given a lot of
>> > >>> thought to it.
>> > >>> I am not extremely familiar with Avro myself, I've just used it a couple
>> > >>> of times, but I'll try to provide some suggestions.
>> > >>>
>> > >>> - The general idea of where and how to store data and meta-data seems
>> > >>> right.
>> > >>> - In general, all attributes in a sparse instance are optional, and all
>> > >>> attributes in a dense instance are required. Maybe we want to be more
>> > >>> granular than this in the future, but it seems that Avro supports a
>> > >>> superset of these settings. We may want to have some default
>> > >>> "prototypes" in order to make mapping the current dense/sparse
>> > >>> instances easy.
>> > >>> - Right now we are not making use of Date-type attributes in SAMOA
>> > >>> (there is no such thing in samoa-instances), so if it makes it easier we
>> > >>> could skip supporting it. Ideally we could have algorithms that respect
>> > >>> event-time as provided by timestamps in the instances (as opposed to
>> > >>> processing the event whenever it arrives), however we are not there
>> > >>> yet :)
>> > >>>
>> > >>> All the rest seems pretty straightforward.
>> > >>>
>> > >>> Moving to the more software-engineering oriented aspects, where would we
>> > >>> have dependencies for Avro? And how should they be deployed? Would they
>> > >>> simply go inside the deployable uber-jar of SAMOA?
>> > >>>
>> > >>> Thanks,
>> > >>>
>> > >>> --
>> > >>> Gianmarco
>> > >>>
>> > >>> On 19 October 2015 at 11:24, Jayadeep J wrote:
>> > >>>
>> > >>>> Hi Gianmarco / All,
>> > >>>>
>> > >>>> I am working on an integration of SAMOA with Apache Avro. Basically, I
>> > >>>> want to use data stored in Avro files as input to SAMOA.
>> > >>>>
>> > >>>> As I understand, current SAMOA readers only support the ARFF format. Do
>> > >>>> you think such a feature would be useful to SAMOA in general? Avro
>> > >>>> allows two encodings for the data: Binary & JSON. Hence Avro support may
>> > >>>> also allow users with JSON data to use SAMOA.
>> > >>>>
>> > >>>> Based on the input given by @gdfm to @ctippur, I have prepared an Input
>> > >>>> Format document in Google Docs.
>> > >>>>
>> > >>>> https://docs.google.com/document/d/1EiyuXOZFKk7MTs-gWaEJq5PVHYyiphhateTaDJMKuR8/edit?usp=sharing
>> > >>>>
>> > >>>> Would it be possible for you to have a look and provide your valuable
>> > >>>> suggestions?
>> > >>>>
>> > >>>> Thanks
>> > >>>> Jay
>> > >>>> https://github.com/jayadeepj
>> > >>>>
>> > >>>
>> > >>
>> > >> --
>> > >> Thanks
>> > >> Jay
>> > >>
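
For readers of this thread, the sparse/dense rule agreed above (treat the data as sparse as soon as at least one attribute in the Avro schema is a union that includes "null", otherwise dense) can be sketched roughly as follows. This is only an illustration in Python using the standard json module, not the code from the pull request. The function names `is_nullable` and `instance_kind`, the record and enum names, and the choice of which iris fields are nullable are all assumptions made for this example.

```python
import json

# Hypothetical iris-like Avro schema (an assumption for illustration):
# only "sepallength" uses the generic Avro optionality ["null", "double"].
IRIS_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Iris",
  "fields": [
    {"name": "sepallength", "type": ["null", "double"]},
    {"name": "sepalwidth",  "type": "double"},
    {"name": "petallength", "type": "double"},
    {"name": "petalwidth",  "type": "double"},
    {"name": "class", "type": {"type": "enum", "name": "Label",
                               "symbols": ["setosa", "versicolor", "virginica"]}}
  ]
}
""")


def is_nullable(field_type):
    """Return True if the Avro type is a union (a JSON list) with a "null" branch."""
    return isinstance(field_type, list) and "null" in field_type


def instance_kind(schema):
    """Return "sparse" if at least one field is nullable, otherwise "dense"."""
    if any(is_nullable(field["type"]) for field in schema["fields"]):
        return "sparse"
    return "dense"


if __name__ == "__main__":
    print(instance_kind(IRIS_SCHEMA))
```

With this approach an existing Avro file needs no SAMOA-specific types such as 'SparseNumeric'; the decision is derived from the schema alone, which is the option preferred in the discussion above.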
