Re: Avro vs Json
Moving this to the user@avro list. Please use the right lists for the best answers and the right people.

I'd pick Avro out of the two - it is very well designed for typed data and has a very good implementation of the serializer/deserializer, aside from the schema advantages. FWIW, Avro has a "tojson" CLI tool to dump the Avro binary format out as JSON structures, which would help if you seek readability and/or integration with apps/systems that already depend on JSON.

On Sun, Aug 12, 2012 at 10:41 PM, Mohit Anchlia wrote:
> We get data in JSON format. I was initially thinking of simply storing JSON
> in HDFS for processing. I see that Avro does a similar thing but
> most likely stores it in a more optimized format. I wanted to get users'
> opinions on which one is better.

-- 
Harsh J
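For a concrete sense of why Avro's binary format is more compact than JSON text: per the Avro specification's binary encoding, ints and longs are written as zigzag-encoded base-128 varints, and field names are never repeated in the data. A stdlib-only Python sketch of that encoding (the function names here are mine, not from any Avro library):

```python
def zigzag(n: int) -> int:
    # Map signed to unsigned so small-magnitude values stay small:
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    # Base-128 varint: 7 bits per byte, high bit set on all but the last byte.
    z = zigzag(n)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

# A long like 12345 costs 3 bytes here versus 5 characters as JSON text,
# and a record of such longs carries no repeated key names as JSON would.
```

Per the spec's own example, 64 encodes to the two bytes `80 01`; values between -64 and 63 fit in a single byte.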
Re: FLUME AVRO
Abhishek,

Moving this to the user@flume list, as it is Flume specific.

P.S. Please do not cross-post to multiple lists; it does not guarantee you a faster response, nor is mailing a *-dev list relevant to your question here. Help avoid additional inbox noise! :)

On Thu, Aug 9, 2012 at 10:43 PM, abhiTowson cal wrote:
> hi all,
>
> can log data be converted into avro, when data is sent from source to sink?
>
> Regards,
> Abhishek

-- 
Harsh J
Re: Avro
On Sat, Aug 4, 2012 at 11:43 PM, Nitin Kesarwani wrote:
> Mohit,
>
> You can use this patch to suit your need:
> https://issues.apache.org/jira/browse/PIG-2579
>
> New fields in the Avro schema descriptor file need to have a non-null default
> value. Hence, using the new schema file, you should be able to read older
> data as well. Try it out. It is very straightforward.
>
> Hope this helps!

Thanks! I am new to Avro. What's the best place to see some examples of how Avro deals with schema changes? I am trying to find some examples.

> On Sun, Aug 5, 2012 at 12:01 AM, Mohit Anchlia wrote:
> > I've heard that Avro provides a good way of dealing with changing schemas.
> > I am not sure how it could be done without keeping some kind of structure
> > along with the data. Are there any good examples and documentation that I
> > can look at?
>
> -N
Re: Avro
Mohit,

You can use this patch to suit your need:
https://issues.apache.org/jira/browse/PIG-2579

New fields in the Avro schema descriptor file need to have a non-null default value. Hence, using the new schema file, you should be able to read older data as well. Try it out. It is very straightforward.

Hope this helps!

On Sun, Aug 5, 2012 at 12:01 AM, Mohit Anchlia wrote:
> I've heard that Avro provides a good way of dealing with changing schemas.
> I am not sure how it could be done without keeping some kind of structure
> along with the data. Are there any good examples and documentation that I
> can look at?

-N
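Nitin's point about defaults is the heart of Avro schema resolution: when data written with an old schema is read with a new one, fields absent from the data are filled in from the reader schema's defaults. A stdlib-only sketch of just that rule, under the assumption that records are plain dicts (`resolve` is a toy, not the real Avro reader, which also handles type promotion, field reordering, unions, and aliases):

```python
import json

# Two versions of a record schema; the new "email" field carries a
# default so records written with the old schema can still be read.
old_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int"}]}
""")
new_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name",  "type": "string"},
            {"name": "age",   "type": "int"},
            {"name": "email", "type": ["null", "string"], "default": null}]}
""")

def resolve(record: dict, reader_schema: dict) -> dict:
    """Toy schema resolution for added fields: a field missing from the
    record takes the reader schema's default, or resolution fails."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return out

old_record = {"name": "mohit", "age": 30}   # written with old_schema
print(resolve(old_record, new_schema))
# -> {'name': 'mohit', 'age': 30, 'email': None}
```

This is why the patch requires a non-null-capable default on every added field: without one, old data cannot be resolved against the new schema at all.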
Re: Avro vs Protocol Buffer
We just open-sourced our protobuf support for Hive. We built it because in our line of work protobuf is very common, and it gave us the ability to log protobufs directly to files and then query them.

https://github.com/edwardcapriolo/hive-protobuf

I did not do any heavy benchmarking vs. Avro. However, I did a few things; sorry that I do not have exact numbers here. A compressed SequenceFile of Text versus a SequenceFile of protobufs is maybe 5-10 percent smaller, depending on the data. That is pretty good compression, so space-wise you are not hurting there. Speed-wise I have to do some more analysis. Our input format does reflection, so that will have its cost (although we tried to cache things where possible); protobuf has a DynamicMessage component which I need to explore to possibly avoid reflection. Also, you have to consider that protobufs do more (than TextInputFormat), like validate data, so if you are comparing raw speed you have to watch out for apples-to-oranges type stuff.

I never put our protobuf format head to head with the Avro format. Generally I hate those types of benchmarks, but I would be curious to know.

Overall, if you have no global (company-wide) serialization format, you have to look at what tools you have and what they support. E.g., Hive has Avro and protobuf, but maybe Pig only has one or the other. Are you using Sqoop, and can it output files in the format that you want? Are you using a language like Ruby, and what support do you have there? In my mind speed is important, but compatibility is more so. For example, even if reading Avro were two times slower than reading Thrift (which it is not), your jobs might be doing some very complex logic with a long shuffle, sort, and reduce phase. Then the performance of physically reading the file is not as important as it may seem.

On Thu, Jul 19, 2012 at 12:34 PM, Harsh J wrote:
> +1 to what Bruno's pointed you at. I personally like Avro for its data
> files (schemas stored in the file, and a good, splittable container for
> typed data records). I think speed for serde is on par with Thrift, if
> not faster today. Thrift offers no optimized data container format
> AFAIK.
>
> [...]
>
> -- 
> Harsh J
Re: Avro vs Protocol Buffer
+1 to what Bruno's pointed you at.

I personally like Avro for its data files (schemas stored in the file, and a good, splittable container for typed data records). I think speed for serde is on par with Thrift, if not faster today. Thrift offers no optimized data container format AFAIK.

On Thu, Jul 19, 2012 at 1:57 PM, Bruno Freudensprung wrote:
> Once new results are available, you might be interested in:
> https://github.com/eishay/jvm-serializers/wiki/
> https://github.com/eishay/jvm-serializers/wiki/Staging-Results
>
> My 2cts,
>
> Bruno.
>
> On 16/07/2012 22:49, Mike S wrote:
>> Strictly from a speed and performance perspective, is Avro as fast as
>> protocol buffers?

-- 
Harsh J
Re: Avro vs Protocol Buffer
Once new results are available, you might be interested in:
https://github.com/eishay/jvm-serializers/wiki/
https://github.com/eishay/jvm-serializers/wiki/Staging-Results

My 2cts,

Bruno.

On 16/07/2012 22:49, Mike S wrote:
> Strictly from a speed and performance perspective, is Avro as fast as
> protocol buffers?
Re: Avro, Hadoop0.20.2, Jackson Error
Hi,

I have moved to CDH3, which doesn't have this issue. Hope that helps anybody stuck with the same problem.

best,
Deepak

On Mon, Mar 26, 2012 at 11:19 PM, Scott Carey wrote:
> Does it still happen if you configure avro-tools to use the "nodeps"
> classifier?
>
> You have two hadoops, two jacksons, and even two avro:avro artifacts in
> your classpath if you use the avro bundle jar with the default classifier.
>
> The avro-tools jar is not intended for inclusion in a project, as it is a
> jar with its dependencies inside.
> https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-ProjectStructure
>
> [...]
Re: Avro, Hadoop0.20.2, Jackson Error
Does it still happen if you configure avro-tools to use

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-tools</artifactId>
  <version>1.6.3</version>
  <classifier>nodeps</classifier>
</dependency>

?

You have two hadoops, two jacksons, and even two avro:avro artifacts in your classpath if you use the avro bundle jar with the default classifier. The avro-tools jar is not intended for inclusion in a project, as it is a jar with its dependencies inside.
https://cwiki.apache.org/confluence/display/AVRO/Build+Documentation#BuildDocumentation-ProjectStructure

On 3/26/12 7:52 PM, "Deepak Nettem" wrote:
>When I include some Avro code in my Mapper, I get this error:
>
>Error:
>org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
>
>Particularly, just these two lines of code:
>
>InputStream in = getClass().getResourceAsStream("schema.avsc");
>Schema schema = Schema.parse(in);
>
>This code works perfectly when run as a standalone application outside of
>Hadoop. Why do I get this error, and what's the best way to get rid of it?
>
>I am using Hadoop 0.20.2, and writing code in the new API.
>
>I found that the Hadoop lib directory contains jackson-core-asl-1.0.1.jar
>and jackson-mapper-asl-1.0.1.jar. I removed these, but got this error:
>
>Exception in thread "main" java.lang.NoClassDefFoundError:
>org/codehaus/jackson/map/JsonMappingException
>
>I am using Maven as a build tool, and my pom.xml has this dependency:
>
><dependency>
>  <groupId>org.codehaus.jackson</groupId>
>  <artifactId>jackson-mapper-asl</artifactId>
>  <version>1.5.2</version>
>  <scope>compile</scope>
></dependency>
>
>I added the dependency:
>
><dependency>
>  <groupId>org.codehaus.jackson</groupId>
>  <artifactId>jackson-core-asl</artifactId>
>  <version>1.5.2</version>
>  <scope>compile</scope>
></dependency>
>
>But that still gives me this error:
>
>Error:
>org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
>
>I also tried replacing the earlier dependencies with these:
>
><dependency>
>  <groupId>org.apache.avro</groupId>
>  <artifactId>avro-tools</artifactId>
>  <version>1.6.3</version>
></dependency>
>
><dependency>
>  <groupId>org.apache.avro</groupId>
>  <artifactId>avro</artifactId>
>  <version>1.6.3</version>
></dependency>
>
><dependency>
>  <groupId>org.codehaus.jackson</groupId>
>  <artifactId>jackson-mapper-asl</artifactId>
>  <version>1.8.8</version>
>  <scope>compile</scope>
></dependency>
>
><dependency>
>  <groupId>org.codehaus.jackson</groupId>
>  <artifactId>jackson-core-asl</artifactId>
>  <version>1.8.8</version>
>  <scope>compile</scope>
></dependency>
>
>And this is my app dependency tree:
>
>[INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ AvroTest ---
>[INFO] org.avrotest:AvroTest:jar:1.0-SNAPSHOT
>[INFO] +- junit:junit:jar:3.8.1:test (scope not updated to compile)
>[INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
>[INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
>[INFO] +- net.sf.json-lib:json-lib:jar:jdk15:2.3:compile
>[INFO] |  +- commons-beanutils:commons-beanutils:jar:1.8.0:compile
>[INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
>[INFO] |  +- commons-lang:commons-lang:jar:2.4:compile
>[INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
>[INFO] |  \- net.sf.ezmorph:ezmorph:jar:1.0.6:compile
>[INFO] +- org.apache.avro:avro-tools:jar:1.6.3:compile
>[INFO] |  \- org.slf4j:slf4j-api:jar:1.6.4:compile
>[INFO] +- org.apache.avro:avro:jar:1.6.3:compile
>[INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
>[INFO] |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
>[INFO] \- org.apache.hadoop:hadoop-core:jar:0.20.2:compile
>[INFO]    +- commons-cli:commons-cli:jar:1.2:compile
>[INFO]    +- xmlenc:xmlenc:jar:0.52:compile
>[INFO]    +- commons-httpclient:commons-httpclient:jar:3.0.1:compile
>[INFO]    +- commons-codec:commons-codec:jar:1.3:compile
>[INFO]    +- commons-net:commons-net:jar:1.4.1:compile
>[INFO]    +- org.mortbay.jetty:jetty:jar:6.1.14:compile
>[INFO]    +- org.mortbay.jetty:jetty-util:jar:6.1.14:compile
>[INFO]    +- tomcat:jasper-runtime:jar:5.5.12:compile
>[INFO]    +- tomcat:jasper-compiler:jar:5.5.12:compile
>[INFO]    +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
>[INFO]    +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
>[INFO]    |  \- ant:ant:jar:1.6.5:compile
>[INFO]    +- commons-el:commons-el:jar:1.0:compile
>[INFO]    +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
>[INFO]    +- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
>[INFO]    +- net.sf.kosmosfs:kfs:jar:0.3:compile
>[INFO]    +- hsqldb:hsqldb:jar:1.8.0.10:compile
>[INFO]    +- oro:oro:jar:2.0.8:compile
>[INFO]    \- org.eclipse.jdt:core:jar:3.1.1:compile
>
>I still get the same error.
>
>Somebody please help me with this. I need to resolve this asap!!
>
>Best,
>Deepak
Re: Hadoop Serialization: Avro
Thanks, I will send the question to that list as well.

Best,
-Leo

Sent from my phone

On Nov 26, 2011, at 7:32 PM, Brock Noland wrote:
> Hi,
>
> Depending on the response you get here, you might also post the
> question separately on avro-user.
>
> [...]
Re: Hadoop Serialization: Avro
Hi,

Depending on the response you get here, you might also post the question separately on avro-user.

On Sat, Nov 26, 2011 at 1:46 PM, Leonardo Urbina wrote:
> Hey everyone,
>
> First time posting to the list. I'm currently writing a Hadoop job that
> will run daily and whose output will be part of the next day's input.
> Also, the output will potentially be read by other programs for later
> analysis.
>
> [...]
Hadoop Serialization: Avro
Hey everyone,

First time posting to the list. I'm currently writing a Hadoop job that will run daily and whose output will be part of the next day's input. Also, the output will potentially be read by other programs for later analysis.

Since my program's output is used as part of the next day's input, it would be nice if it were stored in some binary format that is easy to read the next time around. But this format also needs to be readable by outside programs, not necessarily written in Java. After searching for a while, it seems that Avro is what I want to be using. In any case, I have been looking around for a while and I can't seem to find a single example of how to use Avro within a Hadoop job.

It seems that in order to use Avro I need to change the io.serializations value; however, I don't know which value should be specified. Furthermore, I found that there are Avro{Input,Output}Format classes, but these use a series of other Avro classes which, as far as I understand, seem to need the use of classes such as AvroWrapper, AvroKey, AvroValue, and, as far as I can tell, Avro* (with * replaced by pretty much any Hadoop class name). It seems, however, that these are used so that the Avro format is used throughout the Hadoop process to pass objects around.

I just want to use Avro to save my output and read it again as input next time around. So far I have been using SequenceFile{Input,Output}Format and have implemented the Writable interface in the relevant classes, but this is not portable to other languages. Is there a way to use Avro without a substantial rewrite (using Avro* classes) of my Hadoop job?

Thanks in advance,

Best,
-Leo

-- 
Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
lurb...@mit.edu
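The portability Leo is after, binary output readable from other languages without the Java Writable classes, is exactly what Avro data files provide: the writer's schema is embedded in the file header, so any reader, in any language, decodes records using the schema found in the file. A stdlib-only toy that mimics that idea (this is NOT the real Avro object container format, which uses magic bytes, binary-encoded records, blocks, and sync markers; `write_container`/`read_container` are illustrative names):

```python
import io
import json
import struct

def write_container(buf, schema: dict, records: list) -> None:
    # Header: length-prefixed JSON schema, so the file is self-describing.
    header = json.dumps(schema).encode()
    buf.write(struct.pack(">I", len(header)))
    buf.write(header)
    # Body: length-prefixed records (JSON here; real Avro uses binary encoding).
    for rec in records:
        body = json.dumps(rec).encode()
        buf.write(struct.pack(">I", len(body)))
        buf.write(body)

def read_container(buf):
    # A reader needs nothing but the file itself: the schema travels with
    # the data, which is what makes the format cross-language.
    (n,) = struct.unpack(">I", buf.read(4))
    schema = json.loads(buf.read(n))
    records = []
    while (prefix := buf.read(4)):
        (n,) = struct.unpack(">I", prefix)
        records.append(json.loads(buf.read(n)))
    return schema, records

schema = {"type": "record", "name": "Event",
          "fields": [{"name": "ts", "type": "long"},
                     {"name": "msg", "type": "string"}]}
buf = io.BytesIO()
write_container(buf, schema, [{"ts": 1, "msg": "a"}, {"ts": 2, "msg": "b"}])
buf.seek(0)
s, recs = read_container(buf)
print(s["name"], len(recs))   # -> Event 2
```

In a real job the container format comes for free: the Avro{Input,Output}Format classes the thread mentions read and write these self-describing files, which is why they exist at all rather than reusing SequenceFile.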
Hadoop/CDH + Avro
Would anyone happen to be able to share a good reference for Avro integration with Hadoop? I can find plenty of material on using Avro by itself, but I have found little to no documentation on how to implement it as both the protocol and as custom key/value types.

Thanks,
Matt