Re: STORE USING AvroStorage - ignores Pig field names, only using their position
Pig tuples have field order. Swap the order of the fields in your avro schema and try again.

On Nov 16, 2013, at 6:19 PM, Ruslan Al-Fakikh metarus...@gmail.com wrote:

Hey guys, when I store with AvroStorage, the names of the Pig tuple fields are completely ignored; the field values are written to the result file only by their position. Here is a simplified test case:

%declare WORKDIR `pwd`
REGISTER ../../../../lib/external/avro-1.7.4.jar
REGISTER ../../../../lib/external/json-simple-1.1.jar
--this is built (manually with Maven) from the latest source:
--http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/
REGISTER ../piggybankBuiltFromSource.jar
REGISTER ../../../../lib/external/jackson-core-asl-1.8.8.jar
REGISTER ../../../../lib/external/jackson-mapper-asl-1.8.8.jar

--$ cat input.txt
--data_a data_b
--data_a data_b
inputs = LOAD 'input.txt' AS (a: chararray, b: chararray);
DESCRIBE inputs;
DUMP inputs;
--output:
--inputs: {a: chararray,b: chararray}
--(data_a,data_b)
--(data_a,data_b)

STORE inputs INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage('{ schema: { type : record, name : my_schema, namespace : com.my_namespace, fields : [ { name : b, type : string }, { name : nonsense_name, type : string } ] } }');

--output
--$ java -jar ../../../../lib/external/avro-tools-1.7.4.jar tojson output/part*
--{b:data_a,nonsense_name:data_b}
--{b:data_a,nonsense_name:data_b}

AvroStorage is built from the latest piggybank code. Using the 'debug: 5' AvroStorage parameter didn't help.

$ pig -version
Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21

Any help would be appreciated.

Thanks, Ruslan Al-Fakikh
Re: STORE USING AvroStorage - ignores Pig field names, only using their position
How can pig map from a to nonsense_name?

On Saturday, November 16, 2013, Ruslan Al-Fakikh wrote:

Thanks, Russell! Do you mean that this is the expected behavior? Shouldn't AvroStorage map the pig fields by their names (not their field order), matching them to the names in the avro schema?

Thanks, Ruslan Al-Fakikh

On Sun, Nov 17, 2013 at 6:53 AM, Russell Jurney russell.jur...@gmail.com wrote:

Pig tuples have field order. Swap the order of the fields in your avro schema and try again.

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: STORE USING AvroStorage - ignores Pig field names, only using their position
I think the expected behavior of AvroStorage is to use the tuple's fields in the order they exist in the tuple. So to fix your problem, swap the order of b/nonsense_name. Otherwise I can't see a way to map from b to nonsense_name at all. Pig can't know how to do that without referencing tuple field order.

On Sat, Nov 16, 2013 at 7:42 PM, Ruslan Al-Fakikh metarus...@gmail.com wrote:

including this last message to pig user list

On Sun, Nov 17, 2013 at 7:40 AM, Ruslan Al-Fakikh metarus...@gmail.com wrote:

Russell, actually this problem came from a situation where I had the same names in the pig relation schema and the avro schema, and it turned out that AvroStorage switches fields if the order is different. So my impression is that it should work this way: 1) names correspond - then AvroStorage uses them; 2) names do not correspond - then AvroStorage fails to store, or does some schema resolution as shown here: http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution

Thanks

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: Is Avro right for me?
What's more, there are examples and support for Kafka, but not so much for Flume.

On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann mar...@rapportive.com wrote:

I don't have experience with Flume, so I can't comment on that. At LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (http://kafka.apache.org/). Kafka is nice, it scales very well and gives a great deal of flexibility — logs can be consumed by any number of independent consumers, consumers can catch up on a backlog if they're disconnected for a while, and it comes with Hadoop import out of the box. (RabbitMQ is more designed for use cases where each message corresponds to a task that needs to be performed by a worker. IMHO Kafka is a better fit for logs, which are more stream-like.)

With any message broker, you'll need to somehow tag each message with the schema that was used to encode it. You could include the full schema with every message, but unless you have very large messages, that would be a huge overhead. Better to give each version of your schema a sequential version number, or hash the schema, and include the version number/hash in each message. You can then keep a repository of schemas for resolving those version numbers or hashes – simply in files that you distribute to all producers/consumers, or in a simple REST service like https://issues.apache.org/jira/browse/AVRO-1124

Hope that helps, Martin

On 26 May 2013 17:39, Mark static.void@gmail.com wrote:

Yes, our central server would be Hadoop. Exactly how would this work with flume? Would I write Avro to a file source which flume would then ship over to one of our collectors, or is there a better/native way? Would I have to include the schema in each event? FYI we would be doing this primarily from a rails application. Does anyone ever use Avro with a message bus like RabbitMQ?

On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

Yep. Avro would be great at that (provided your central consumer is Avro friendly, like a Hadoop system). Make sure that all of your schemas have default values defined for fields so that schema evolution will be easier in the future.

On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

We're thinking about generating logs and events with Avro and shipping them to a central collector service via Flume. Is this a valid use case?

-- Sean

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
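Martin's version-number/hash scheme can be sketched in a few lines against the Avro Java API. This is only an illustration, not LinkedIn's actual wire format: it assumes Avro 1.7+ (for SchemaNormalization) and the convention of prefixing each message body with a 64-bit schema fingerprint that consumers use to look the writer's schema up in a shared repository.

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class FingerprintedMessage {
    // Prefix each message with a 64-bit fingerprint of the writer's schema so
    // consumers can resolve it from a schema repository instead of shipping
    // the full schema in every message.
    public static byte[] encode(Schema schema, GenericRecord datum) throws Exception {
        long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(payload, null);
        new GenericDatumWriter<GenericRecord>(schema).write(datum, encoder);
        encoder.flush();
        return ByteBuffer.allocate(8 + payload.size())
                .putLong(fingerprint)
                .put(payload.toByteArray())
                .array();
    }
}

The resulting byte[] can be handed to whatever producer client is in use; a consumer reads the first 8 bytes, fetches the matching schema, and decodes the rest with a BinaryDecoder.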
Re: Is Avro/Trevni strictly read-only?
Aaron - is there a way to create a Kiji table from Pig? I'm in the habit of not specifying schemas with Voldemort and MongoDB, just storing a Pig relation and the schema is set in the store. If I can arrange that somehow, I'm all over Kiji. Panthera is a fork :/ On Wed, Jan 30, 2013 at 3:20 PM, Aaron Kimball akimbal...@gmail.com wrote: Hi ccleve, I'd definitely urge you to try out Kiji -- we who work on it think it's a pretty good fit for this specific use case. If you've got further questions about Kiji and how to use it, please send them to me, or ask the kiji user mailing list: http://www.kiji.org/getinvolved#Mailing_Lists cheers, - Aaron On Tue, Jan 29, 2013 at 3:24 PM, Doug Cutting cutt...@apache.org wrote: Avro and Trevni files do not support record update or delete. For large changing datasets you might use Kiji (http://www.kiji.org/) to store Avro data in HBase. Doug On Mon, Jan 28, 2013 at 12:00 PM, ccleve ccleve.t...@gmail.com wrote: I've gone through the documentation, but haven't been able to get a definite answer: is Avro, or specifically Trevni, only for read-only data? Is it possible to update or delete records? If records can be deleted, is there any code that will merge row sets to get rid of the unused space? -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: Is Avro/Trevni strictly read-only?
Intel's HBase Panthera has an Avro document store builtin - another option: https://github.com/intel-hadoop/hbase-0.94-panthera

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Sync() between records? How do we recover from a bad record, using DataFileReader?
We are trying to recover, report the bad record, and move on to the next record of an Avro file in PIG-3015 and PIG-3059. It seems that sync markers don't exist between individual records, however. How should we recover from a bad record using Avro's DataFileReader?

https://issues.apache.org/jira/browse/PIG-3015
https://issues.apache.org/jira/browse/PIG-3059

Russell Jurney http://datasyndrome.com
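For what it's worth, one rough way to do this with the Java DataFileReader is to catch the failure and seek to the first sync marker past the reader's current position. Since sync markers separate blocks rather than records, this throws away the remainder of the bad block (and possibly more). A minimal sketch, assuming generic records and a local file path on the command line:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SkippingReader {
    public static void main(String[] args) throws IOException {
        DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
                new File(args[0]), new GenericDatumReader<GenericRecord>());
        long skipped = 0;
        while (reader.hasNext()) {
            try {
                GenericRecord record = reader.next();
                // ... process record ...
            } catch (Exception e) {
                // A corrupt datum poisons the rest of its block: jump to the
                // first sync marker past the current position and keep going.
                skipped++;
                reader.sync(reader.tell());
            }
        }
        reader.close();
        System.err.println("Skipped " + skipped + " bad block(s)");
    }
}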
Re: Output from AVRO mapper
I don't mean to harp, but this is a few lines in Pig:

/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

/* Load Avros */
input = load 'my.avro' using AvroStorage();

/* Verify input */
describe input;
illustrate input;

/* Convert Avros to JSON */
store input into 'my.json' using com.twitter.elephantbird.pig.store.JsonStorage();
store input into 'my.json.lzo' using com.twitter.elephantbird.pig.store.LzoJsonStorage();

/* Convert simple Avros to TSV */
store input into 'my.tsv';

/* Convert Avros to SequenceFiles */
REGISTER '/path/to/elephant-bird.jar';
store input into 'my.seq' using com.twitter.elephantbird.pig.store.SequenceFileStorage(
  /* example: */
  '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
  '-c com.twitter.elephantbird.pig.util.TextConverter'
);

/* Convert Avros to Protobufs */
store input into 'input.protobuf' using com.twitter.elephantbird.examples.proto.pig.store.ProfileProtobufB64LinePigStorage();

/* Convert Avros to a Lucene Index */
store input into 'input.lucene' using LuceneIndexStorage('com.example.MyPigLuceneIndexOutputFormat');

There are also drivers for most NoSQLish databases...

Russell Jurney http://datasyndrome.com

On Dec 20, 2012, at 9:33 AM, Terry Healy the...@bnl.gov wrote:

I'm just getting started using AVRO within Map/Reduce and trying to convert some existing non-AVRO code to use AVRO input. So far the data that previously was stored in tab-delimited files has been converted to .avro successfully, as checked with avro-tools.

Where I'm getting hung up, extending beyond my book-based examples, is in attempting to read from AVRO (using generic records) where the mapper output is NOT in AVRO format. I can't seem to reconcile extending AvroMapper and NOT using AvroCollector. Here are snippets of code that show my non-AVRO M/R code and my [failing] attempts to make this change. If anyone can help me along it would be very much appreciated.

-Terry

Pre-Avro version: (works fine with .tsv input format)

public static class HdFlowMapper extends MapReduceBase
    implements Mapper<Text, HdFlowWritable, LongPair, HdFlowWritable> {

  @Override
  public void map(Text key, HdFlowWritable value,
      OutputCollector<LongPair, HdFlowWritable> output, Reporter reporter)
      throws IOException {
    ...
    // outKey = new LongPair(value.getSrcIp(), value.getFirst());
    HdFlowWritable outValue = value; // pass it all through
    output.collect(outKey, outValue);
  }

AVRO attempt:

conf.setOutputFormat(TextOutputFormat.class);
conf.setOutputKeyClass(LongPair.class);
conf.setOutputValueClass(AvroFlowWritable.class);

SCHEMA = new Schema.Parser().parse(NetflowSchema);
AvroJob.setInputSchema(conf, SCHEMA);
//AvroJob.setOutputSchema(conf, SCHEMA);
AvroJob.setMapperClass(conf, AvroFlowMapper.class);
AvroJob.setReducerClass(conf, AvroFlowReducer.class);

public static class AvroFlowMapper<K> extends AvroMapper<K, OutputCollector> {

  @Override
  // ** IDE: Method does not override or implement a method from a supertype
  public void map(K datum, OutputCollector<LongPair, AvroFlowWritable> collector,
      Reporter reporter) throws IOException {
    GenericRecord record = (GenericRecord) datum;
    afw = new AvroFlowWritable(record);
    // ...
    collector.collect(outKey, afw);
  }
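One way this mismatch is commonly avoided with the old mapred API is to skip AvroMapper entirely: AvroInputFormat hands a plain Mapper AvroWrapper<GenericRecord> keys (with NullWritable values), and the mapper can emit ordinary Writables. The sketch below is only an illustration, not verified against Terry's exact setup; "srcIp" is a hypothetical field name, and Text/LongWritable stand in for his LongPair/AvroFlowWritable output types.

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Plain old-API mapper: Avro container files in, ordinary Writables out.
public class AvroInTextOutMapper extends MapReduceBase
        implements Mapper<AvroWrapper<GenericRecord>, NullWritable, Text, LongWritable> {

    public void map(AvroWrapper<GenericRecord> key, NullWritable value,
            OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
        GenericRecord record = key.datum();
        // "srcIp" is a made-up field name for the example; substitute the real
        // fields, or emit LongPair/AvroFlowWritable instead of Text/LongWritable.
        output.collect(new Text(String.valueOf(record.get("srcIp"))), new LongWritable(1L));
    }

    // Job setup sketch:
    //   JobConf conf = ...;
    //   AvroJob.setInputSchema(conf, SCHEMA);        // reader's schema
    //   conf.setInputFormat(AvroInputFormat.class);  // keys: AvroWrapper<GenericRecord>, values: NullWritable
    //   conf.setMapperClass(AvroInTextOutMapper.class);
    //   conf.setMapOutputKeyClass(Text.class);
    //   conf.setMapOutputValueClass(LongWritable.class);
}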
Re: Converting arbitrary JSON to avro
Fwiw, I do this in web apps all the time via the python avro lib and json.dumps Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On Sep 18, 2012, at 12:38 PM, Doug Cutting cutt...@apache.org wrote: On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler mar...@braindump.ms wrote: Json.Writer is indeed what I had in mind and I have successfully managed to convert my existing JSON to avro using it. However using GenericDatumReader on this feels pretty unnatural, as I seem to be unable to access fields directly. It seems I have to access the value field on each record which returns a Map which uses Utf8 Objects as keys for the actual fields. Or am I doing something wrong here? Hmm. We could re-factor Json.SCHEMA so the union is the top-level element. That would get rid of the wrapper around every value. It's a more redundant way to write the schema, but the binary encoding is identical (since a record wrapper adds no bytes). It would hence require no changes to Json.Reader or Json.Writer. [ long, double, string, boolean, null, {type : array, items : { type : record, name : org.apache.avro.data.Json, fields : [ { name : value, type : [ long, double, string, boolean, null, {type : array, items : Json}, {type : map, values : Json} ] } ] } }, {type : map, values : Json} ] You can try this by placing this schema in share/schemas/org/apache/avro/data/Json.avsc and re-building the avro jar. Would such a change be useful to you? If so, please file an issue in Jira. Or we could even refactor this schema so that a Json object is the top-level structure: {type : map, values : [ long, double, string, boolean, null, {type : array, items : { type : record, name : org.apache.avro.data.Json, fields : [ { name : value, type : [ long, double, string, boolean, null, {type : array, items : Json}, {type : map, values : Json} ] } ] } }, {type : map, values : Json} ] } This would change the binary format but would not change the representation that GenericDatumReader would hand you from my first example above (since the generic representation unwraps unions). Using this schema would require changes to Json.Writer and Json.Reader. It would better conform to the definition of Json, which only permits objects as the top-level type. Concerning the more specific schema, you are of course completely right. Unfortunately more or less all the fields in the JSON data format are optional and many have substructures, so, at least in my understanding, I have to use unions of null and the actual type throughout the schema. I tried using JsonDecoder first (or rather the fromjson option of the avro tool, which, I think, uses JsonDecoder) but given the current JSON structures, this didn't work. So I'll probably have to look into implementing my own converter. However given the rather complex structure of the original JSON I'm wondering if trying to represent the data in avro is such a good idea in the first place. It would be interesting to see whether, with the appropriate schema, whether the dataset is smaller and faster to process as Avro than as Json. If you have 1000 fields in your data but the typical record only has one or two non-null, then an Avro record is perhaps not a good representation. An Avro map might be better, but if the values are similarly variable then Json might be competitive. Cheers, Doug
Re: Avro file size is too big
This thread looks useful. Are you flushing too often? http://apache-avro.679487.n3.nabble.com/avro-compression-using-snappy-and-deflate-td3870167.html

Russell Jurney http://datasyndrome.com

On Jul 4, 2012, at 6:33 AM, Ruslan Al-Fakikh metarus...@gmail.com wrote:

Hello, in my organization we are currently evaluating Avro as a format. Our concern is file size. I've done some comparisons on a piece of our data. Say we have sequence files, compressed. The payload (values) are just lines; as far as I know we use the line number as the key, and we use the default codec for compression inside the sequence files. The size is 1.6G, but when I put it into Avro with the deflate codec at level 9 it becomes 2.2G. This is interesting, because the values in the seq files are just strings, while Avro has a normal schema with primitive types, and those are kept binary. Shouldn't Avro be smaller? Also, I took another dataset which is 28G (gzip files, plain tab-delimited text, I don't know the deflate level), put it into Avro, and it became 38G.

Why is Avro so big? Am I missing some size optimization? Thanks in advance!
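On the writer side, block size matters a lot for deflate: each block is compressed independently, so syncing or flushing after every record produces many tiny blocks that deflate cannot compress well. A minimal sketch of the Java writer settings, assuming generic records and a hypothetical output file:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CompressedWriter {
    public static DataFileWriter<GenericRecord> open(Schema schema, File out) throws IOException {
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(9)); // compress each block with deflate level 9
        writer.setSyncInterval(1 << 20);               // ~1 MB blocks compress far better than tiny ones
        return writer.create(schema, out);
    }
}

Append with writer.append(record) and let the writer decide when to cut blocks; calling sync() or flush() per record shrinks the blocks again and inflates the file.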
Re: Hadoop 0.23, Avro Specific 1.6.3 and org.apache.avro.generic.GenericData$Record cannot be cast to
Consider Pig and AvroStorage. Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On May 13, 2012, at 4:49 AM, Jacob Metcalf jacob_metc...@hotmail.com wrote: I have just spent several frustrating hours on getting an example MR job using Avro working with Hadoop and after finally getting it working I thought I would share my findings with everyone. I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map and Reduce then attempted to deploy and run. I am setting up a development cluster with Hadoop 0.23 running pseudo-distributed under cygwin. I ran my job and it failed with: org.apache.avro.generic.GenericData$Record cannot be cast to net.jacobmetcalf.avro.Room Where Room is an Avro generated class. I found two problems. The first I have partly solved, the second one is more to do with Hadoop and is as yet unsolved: 1) Why when I am using Avro Specific does it end up going Generic? When deserializing SpecificDatumReader.java attempts to instantiate your target class through reflection. If it fails to create your class it defaults to a GenericData.Record. This Doug has explained here: http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3c4d2b6d56.2070...@apache.org%3E http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3c4d2b6d56.2070...@apache.org%3E But why it is doing it was a little harder to work out. Debugging I saw the SpecificDatumReader could not find my class in its classpath. However in my Job Runner I had done: job.setJarByClass(HouseAssemblyJob.class); // This should ensure the JAR is distributed around the cluster I expected with this Hadoop would distribute my Jar around the cluster. It may be doing the distribution but it definitely did not add it to the Reducers classpath. So to get round this I have now set HADOOP_CLASSPATH to the directory I am running from. This is not going to work in a real cluster where the Job Runner is on a different machine to where the Reducer so I am keen to figure out whether the problem is Hadoop 0.23, my environment variables or the fact I am running under Cygwin. 2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3 ? Whilst debugging I realised that Hadoop is shipping with Avro 1.5.3. I however want to use 1.6.3 (and 1.7 when it comes out) because of its support for immutability builders in the generated classes. I probably could just hack the old Avro lib out of my Hadoop distribution and drop the new one in. However I thought it would be cleaner to get Hadoop to distribute my jar to all datanodes and then manipulate my classpath to get the latest version of Avro to the top. So I have packaged Avro 1.6.3 into my job jar using Maven assembly and tried to do this in my JobRunner: job.setJarByClass( MyJob.class); // This should ensure the JAR is distributed around the cluster config.setBoolean( MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true ); // ensure my version of avro? But it continues to use 1.5.3. I suspect it is again to do with my HADOOP_CLASSPATH which has avro-1.5.3 in it: export HADOOP_CLASSPATH=$HADOOP_COMMON_HOME/share/hadoop/mapreduce/* If anyone has done this and has any ideas please let me know? Thanks Jacob
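For Jacob's second question, a commonly suggested route (sketched below, not verified against 0.23 under Cygwin) is to ship the newer Avro jar through the distributed cache and ask MapReduce to prefer user jars over the ones bundled with Hadoop. The HDFS path here is hypothetical and the jar has to be uploaded there first.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class JobRunner {
    public static Job configure(Configuration conf) throws Exception {
        // Prefer jars shipped with the job over the ones bundled with Hadoop.
        conf.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
        Job job = Job.getInstance(conf, "avro-mr");
        job.setJarByClass(JobRunner.class);
        // Hypothetical HDFS location; the newer avro jar must already be there.
        job.addFileToClassPath(new Path("/libs/avro-1.6.3.jar"));
        return job;
    }
}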
Re: AvroStorage/Avro Schema Question
: }, { name:address, type:[null,string], doc: } ] } ] } ], doc: } ] } On Tue, Apr 10, 2012 at 2:36 AM, Russell Jurney russell.jur...@gmail.com wrote: H unable to get this to work: { namespace: agile.data.avro, name: Email, type: record, fields: [ {name:message_id, type: [string, null]}, {name:froms,type: [{type:record, name:from, fields: [{type:array, items:string}, null]}, null]}, {name:tos,type: [{type:record, name:to, fields: [{type:array, items:string}, null]}, null]}, {name:ccs,type: [{type:record, name:cc, fields: [{type:array, items:string}, null]}, null]}, {name:bccs,type: [{type:record, name:bcc, fields: [{type:array, items:string}, null]}, null]}, {name:reply_tos,type: [{type:record, name:reply_to, fields: [{type:array, items:string}, null]}, null]}, {name:in_reply_to, type: [{type:array, items:string}, null]}, {name:subject, type: [string, null]}, {name:body, type: [string, null]}, {name:date, type: [string, null]} ] } On Tue, Apr 10, 2012 at 2:26 AM, Russell Jurney russell.jur...@gmail.com wrote: In thinking about it more... it seems that unfortunately, the only thing I can really do is to change the schema for all email address fields: {name:from,type: [{type:array, items:string}, null]}, to: {name:froms,type: [{type:record, name:from, fields: [{type:array, items:string}, null]}, null]}, That is, to pluralize everything and then individually name array elements. I will try running this through my stack. On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey scottca...@apache.org wrote: It appears as though the Avro to PigStorage schema translation names (in pig) all arrays ARRAY_ELEM. The nullable wrapper is 'visible' and the field name is not moved onto the bag name. About a year and a half ago I started https://issues.apache.org/jira/browse/AVRO-592 but before finishing it AvroStorage was written elsewhere. I don't recall exactly what I did with the schema translation there, but I recall the mapping from an Avro schema to pig tried to hide the nullable wrappers more. In Avro, arrays are unnamed types, so I see two things you could probably do without any code changes: * Add a line in the pig script to project / rename the fields to what you want (unfortunate and clumbsy, but I think it will work — I think you want from::PIG_WRAPPER::ARRAY_ELEM as from or FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from something like that. * Add a record wrapper to your schema (which may inject more messiness in the pig schema view): { namespace: agile.data.avro, name: Email, type: record, fields: [ {name:message_id, type: [string, null]}, {name:from,type: [{type:record, name:From, fields: [[{type:array, items:string},null]], null]}, … ] } But that is very awkward — requiring a named record for each field that is an unnamed type. Ideally PigStorage would treat any union of null and one other thing as a simple pig type with no wrapper, and project the name of a field or record into the name of the thing inside a bag. -Scott On 3/29/12 6:05 PM, Russell Jurney russell.jur...@gmail.com wrote: Is it possible to name string elements in the schema of an array? Specifically, below I want to name the email addresses in the from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by Pig's AvroStorage. I know I can probably fix this in Java in the Pig AvroStorage UDF, but I'm hoping I can also fix it more easily in the schema. 
Last time I read Avro's array docs in this context, my hit-points dropped by a third, so pardom me if I've not rtfm this time :) Complete description of what I'm doing follows: Avro schema for my emails: { namespace: agile.data.avro, name: Email, type: record, fields: [ {name:message_id, type: [string, null]}, {name:from,type: [{type:array, items:string}, null]}, {name:to,type: [{type:array, items:string}, null]}, {name:cc,type: [{type:array, items:string}, null]}, {name:bcc,type: [{type:array, items:string}, null]}, {name:reply_to, type: [{type:array, items:string}, null]}, {name:in_reply_to, type: [{type:array, items:string}, null]}, {name:subject, type: [string, null]}, {name:body, type: [string, null]}, {name:date, type: [string, null]} ] } Pig to publish my Avros: grunt emails = load '/me/tmp/emails' using AvroStorage(); grunt describe emails emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},to: {PIG_WRAPPER
Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem
Thanks Scott, looking at the raw data it seems to have been a truncated record due to UTF problems. Russell Jurney http://datasyndrome.com On Mar 23, 2012, at 7:59 PM, Scott Carey scottca...@apache.org wrote: It appears to be reading a union index and failing in there somehow. If it did not have any of the pig AvroStorage stuff in there I could tell you more. What does avro-tools.jar 's 'tojson' tool do? (java –jar avro-tools-1.6.3.jar tojson file | your_favorite_text_reader) What version of Avro is the java stack trace below? On 3/23/12 7:01 PM, Russell Jurney russell.jur...@gmail.com wrote: I have a problem record I've written in Avro that crashes anything which tries to read it :( Can anyone make sense of these errors? The exception in Pig/AvroStorage is this: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532) at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364) at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229) at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142) at org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138) at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233) at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220) at org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80) at org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:273) ... 
7 more When reading the record in Python: File /me/Collecting-Data/src/python/cat_avro, line 21, in module for record in df_reader: File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py, line 354, in next datum = self.datum_reader.read(self.datum_decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 445, in read return self.read_data(self.writers_schema, self.readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 490, in read_data return self.read_record(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 690, in read_record field_val = self.read_data(field.type, readers_field.type, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 488, in read_data return self.read_union(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 650, in read_union raise SchemaResolutionException(fail_msg, writers_schema, readers_schema) avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches When reading the record in Ruby: /Users/peyomp/.rvm/gems/ruby-1.8.7-p352/gems/avro-1.6.1/lib/avro/io.rb:298:in `read_data': Writer's schema and Reader's schema [string,null] do not match. (Avro::IO::SchemaMatchException) -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem
Ok, now I have a followup question... how does one recover from an exception writing an Avro? The incomplete record is being written, which is crashing the reader.

On Fri, Mar 23, 2012 at 8:01 PM, Russell Jurney russell.jur...@gmail.com wrote:

Thanks Scott, looking at the raw data it seems to have been a truncated record due to UTF problems.

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
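As a rough guard against this follow-up problem (a datum that fails mid-append leaving a torn record in the file), one option is to trial-encode each datum into a scratch buffer first and only append the ones that serialize cleanly. A sketch, assuming generic records; the class and method names are made up for the example.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class SafeAppender {
    private final GenericDatumWriter<GenericRecord> scratchWriter;
    private final DataFileWriter<GenericRecord> fileWriter;

    public SafeAppender(Schema schema, DataFileWriter<GenericRecord> fileWriter) {
        this.scratchWriter = new GenericDatumWriter<GenericRecord>(schema);
        this.fileWriter = fileWriter;
    }

    // Returns false (and writes nothing) if the record does not encode cleanly,
    // so a bad datum never leaves a half-written record in the container file.
    public boolean tryAppend(GenericRecord record) throws IOException {
        try {
            ByteArrayOutputStream scratch = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(scratch, null);
            scratchWriter.write(record, encoder);
            encoder.flush();
        } catch (Exception bad) { // e.g. a value that matches no union branch
            return false;
        }
        fileWriter.append(record);
        return true;
    }
}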
Re: HttpTransceiver and JSON-encoded Avro?
FWIW, there are avro libs for JavaScript and node on github. Russell Jurney http://datasyndrome.com On Feb 15, 2012, at 7:32 AM, Frank Grimes frankgrime...@gmail.com wrote: Hi All, Is there any way to send Avro data over HTTP encoded in JSON? We want to integrate with Node.js and JSON seems to be the best/simplest way to do so. Thanks, Frank Grimes
Re: Pig/Avro Question
Hmmm I applied it, but I still can't open files that don't end in .avro

On Fri, Feb 3, 2012 at 2:23 AM, Philipp philipp.p...@metrigo.de wrote:

This patch fixes this issue: https://issues.apache.org/jira/browse/PIG-2492

On 02/03/2012 07:22 AM, Russell Jurney wrote:

I have the same bug. I read the code... there is no obvious fix. Arg.

On Feb 2, 2012, at 10:07 PM, Something Something mailinglist...@gmail.com wrote:

In my Pig script I have something like this...

%default MY_SCHEMA '/user/xyz/my-schema.json';
%default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';
my_files = LOAD '$MY_FILES' USING $MY_AVRO;

What I have noticed is that when MY_FILES contains only one file, it works fine.

%default MY_FILES '/user/xyz/file1.avro'

But when I use a comma separated list it doesn't work. e.g.

%default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'

Basically, I get a message saying something like 'Schema cannot be found'. Is there a way to make it work with multiple files? Please let me know. Thanks.

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: Pig/Avro Question
btw - the weird thing is... I've read the code. There isn't a filter for .avro in there. Does Hadoop, or Avro itself (not that I can see it is involved) do so?

On Fri, Feb 3, 2012 at 10:55 AM, Russell Jurney russell.jur...@gmail.com wrote:

Hmmm I applied it, but I still can't open files that don't end in .avro

-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python
Correction: when I read the file in Python, I get the error below. It looks like a unicode problem? Can one tell Avro how to handle this? Traceback (most recent call last): File ./cat_avro, line 21, in module for record in df_reader: File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py, line 354, in next datum = self.datum_reader.read(self.datum_decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 445, in read return self.read_data(self.writers_schema, self.readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 490, in read_data return self.read_record(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 690, in read_record field_val = self.read_data(field.type, readers_field.type, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 488, in read_data return self.read_union(writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 654, in read_union return self.read_data(selected_writers_schema, readers_schema, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 458, in read_data return self.read_data(writers_schema, s, decoder) File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 468, in read_data return decoder.read_utf8() File /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py, line 233, in read_utf8 return unicode(self.read_bytes(), utf-8) UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 543: invalid start byte On Thu, Feb 2, 2012 at 2:06 PM, Russell Jurney russell.jur...@gmail.comwrote: I am writing Avro records in Ruby using the avro ruby gem in 1.8.7. I have problems with loading these files sometimes. As a result, I am unable to write large files that are readable. The exception I get is below. Anyone have an idea what this means? It looks like Avro is having trouble parsing the schema. The avro files parse in Ruby and Python, just not Pig. Are there more rigorous checks in Java? Pig Stack Trace --- ERROR 2998: Unhandled internal error. 
org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory; java.lang.NoSuchMethodError: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory; at org.apache.avro.Schema.clinit(Schema.java:82) at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.clinit(AvroStorageUtils.java:49) at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:163) at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144) at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:269) at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150) at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109) at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100) at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.newplan.logical.visitor.CastLineageSetter.init(CastLineageSetter.java:57) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1679) at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1610) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1582) at org.apache.pig.PigServer.registerQuery(PigServer.java:584) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:942) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69) at org.apache.pig.Main.run(Main.java:495) at org.apache.pig.Main.main(Main.java:111) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method
Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python
A little bit more searching shows this: http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/

On Thu, Feb 2, 2012 at 2:48 PM, Russell Jurney russell.jur...@gmail.com wrote:

The jars being used are:

REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar

On Thu, Feb 2, 2012 at 2:41 PM, James Baldassari jbaldass...@gmail.com wrote:

Hi Russell, I'm not sure about the Python error, but the Java error looks like a classpath problem, not a schema parsing issue. The NoSuchMethodError in the stack trace indicates that Avro was trying to invoke a method in the Jackson library that wasn't present at run-time. My guess is that your program (or Pig?) either has two incompatible versions of the Jackson library on its classpath, or maybe Avro's Jackson dependency has been excluded and a version that is incompatible with Avro is on the classpath. Which version of Avro is being used? Running 'mvn dependency:tree' in Avro trunk I see that it's depending on Jackson 1.8.6. Can you verify that only one version of Jackson is on the classpath and that it's the version that is required by whatever version of Avro is on the classpath?

-James
Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python
Further examination shows that the problematic emails I am encoding are formatted in ISO-8859-1, not UTF-8. That is why I am getting character problems. Looks like it is not an Avro problem after all. Thanks! :)

On Thu, Feb 2, 2012 at 2:49 PM, Russell Jurney russell.jur...@gmail.com wrote:

A little bit more searching shows this: http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
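Given that diagnosis, the fix lives on the producing side rather than in Avro: decode the raw bytes with the charset they were actually written in before putting the value into a record, and Avro will then re-encode the string as valid UTF-8. Russell's pipeline is Ruby, so the snippet below is just the idea expressed in Java for illustration.

import java.nio.charset.StandardCharsets;

public class BodyCleaner {
    // Avro string fields must be valid UTF-8. If the raw e-mail body is really
    // ISO-8859-1, decode it as such; writing the resulting String through Avro
    // then produces well-formed UTF-8 instead of invalid byte sequences.
    public static String decodeLatin1(byte[] rawBody) {
        return new String(rawBody, StandardCharsets.ISO_8859_1);
    }
}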
Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python
Spoken too soon... this happens no matter what avros I load now. I can't figure that anything has changed regarding jars, etc. Confused. I think this happens when Avro is parsing the schema? Pig Stack Trace --- ERROR 2998: Unhandled internal error. org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory; java.lang.NoSuchMethodError: org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory; at org.apache.avro.Schema.clinit(Schema.java:82) at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.clinit(AvroStorageUtils.java:49) at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:163) at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144) at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:269) at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150) at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109) at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100) at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218) at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50) at org.apache.pig.newplan.logical.visitor.CastLineageSetter.init(CastLineageSetter.java:57) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1679) at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1610) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1582) at org.apache.pig.PigServer.registerQuery(PigServer.java:584) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:942) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69) at org.apache.pig.Main.run(Main.java:495) at org.apache.pig.Main.main(Main.java:111) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) On Thu, Feb 2, 2012 at 2:53 PM, Russell Jurney russell.jur...@gmail.comwrote: Further examination shows that the problematic emails I am encoding are formatted in ISO-8859-1, not UTF-8. That is why I am getting character problems. Looks like it is not an Avro problem after all. Thanks! 
:) On Thu, Feb 2, 2012 at 2:49 PM, Russell Jurney russell.jur...@gmail.com wrote: A little bit more searching shows this: http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/ On Thu, Feb 2, 2012 at 2:48 PM, Russell Jurney russell.jur...@gmail.com wrote: The jars being used are:
REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar
On Thu, Feb 2, 2012 at 2:41 PM, James Baldassari jbaldass...@gmail.com wrote: Hi Russell, I'm not sure about the Python error, but the Java error looks like a classpath problem, not a schema parsing issue. The NoSuchMethodError in the stack trace indicates that Avro was trying to invoke a method in the Jackson library that wasn't present at run-time. My guess is that your program (or Pig?) either has two incompatible versions of the Jackson library on its classpath, or maybe Avro's Jackson dependency has been excluded and a version that is incompatible with Avro is on the classpath. Which version of Avro is being used? Running 'mvn dependency:tree' in Avro trunk I see that it depends on Jackson 1.8.6. Can you verify that only one version of Jackson is on the classpath and that it is the version required by whatever version of Avro is on the classpath? -James On Thu, Feb 2, 2012 at 5:21 PM, Russell Jurney russell.jur...@gmail.com wrote: Correction: when I read the file in Python, I get the error below. It looks like a unicode problem? Can one tell Avro how to handle this?
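As the message above notes, the underlying problem was ISO-8859-1 input rather than anything in Avro: Avro strings must be valid UTF-8, so bytes like 0xa0 have to be decoded from Latin-1 before being handed to the writer. A rough Python sketch of that normalization; the helper name, field names, and sample bytes are made up for illustration:

# Illustrative only: decode ISO-8859-1 (Latin-1) bytes into unicode so the Avro
# writer can encode them as UTF-8.
def to_unicode(raw_bytes):
    # Latin-1 assigns a character to every byte value, so this never raises.
    return raw_bytes.decode("iso-8859-1")

record = {"subject": to_unicode("caf\xe9"), "body": to_unicode("\xa0 leading non-breaking space")}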
Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python
Cleaned up my environment by unsetting HADOOP_HOME and removing some old Jackson jars from my CLASSPATH, and Pig's AvroStorage works again. Woot! On Thu, Feb 2, 2012 at 3:47 PM, Russell Jurney russell.jur...@gmail.com wrote: Spoken too soon... this happens no matter what avros I load now. I can't figure that anything has changed regarding jars, etc. Confused. I think this happens when Avro is parsing the schema?
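The fix above amounts to making sure only one Jackson version is visible to Pig. A quick, illustrative Python scan for Jackson jars; the Ivy directory is taken from the REGISTER lines earlier in the thread, and the rest of the paths simply come from whatever is on your CLASSPATH:

import os

# Illustrative: list every Jackson jar under a few candidate directories so
# conflicting versions (e.g. 1.7.x vs 1.8.x) stand out.
candidate_dirs = os.environ.get("CLASSPATH", "").split(":") + ["/me/pig/build/ivy/lib/Pig"]
for d in candidate_dirs:
    if not os.path.isdir(d):
        continue
    for root, _, files in os.walk(d):
        for f in files:
            if f.startswith("jackson-") and f.endswith(".jar"):
                print(os.path.join(root, f))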
AVRO-981 - Removed snappy as requirement
https://issues.apache.org/jira/browse/AVRO-981 I took Joe Crobak's advice and removed snappy as a dependency in the python client for avro. With the patch in AVRO-981 applied, Avro installs, builds and functions on Mac OS X. -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
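With the snappy requirement dropped, the Python writer can stick to the built-in codecs. A small sketch of writing a data file with the default "null" codec, so nothing from python-snappy is needed; the schema and output path here are made up for illustration:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Hypothetical schema and file name, just to show the codec choice.
SCHEMA = avro.schema.parse('{"type": "record", "name": "Email", "fields": [{"name": "body", "type": "string"}]}')

writer = DataFileWriter(open("emails.avro", "wb"), DatumWriter(), SCHEMA, codec="null")
writer.append({"body": u"hello"})
writer.close()
# codec="deflate" also works with only the standard library; "snappy" now needs
# python-snappy to be installed explicitly.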
Re: Python for Avro doesn't build on OS X 10.6.8. Stuck.
Avro 1.5.3 doesn't have this issue? Does it use python-snappy? Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On Dec 23, 2011, at 7:05 PM, Ken Krugler kkrugler_li...@transpac.com wrote: I installed brew, then ran 'brew install snappy', which worked. But 'sudo python setup.py install' still fails, though with a different problem (I was getting the same compilation errors as below, now I get a broken pipe error).
running install
running bdist_egg
running egg_info
writing requirements to avro.egg-info/requires.txt
writing avro.egg-info/PKG-INFO
writing top-level names to avro.egg-info/top_level.txt
writing dependency_links to avro.egg-info/dependency_links.txt
reading manifest file 'avro.egg-info/SOURCES.txt'
writing manifest file 'avro.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.6-universal/egg
running install_lib
running build_py
creating build/bdist.macosx-10.6-universal/egg
creating build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/__init__.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/datafile.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/io.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/ipc.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/protocol.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/schema.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/tool.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/txipc.py -> build/bdist.macosx-10.6-universal/egg/avro
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/datafile.py to datafile.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/io.py to io.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/ipc.py to ipc.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/protocol.py to protocol.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/schema.py to schema.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/tool.py to tool.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/txipc.py to txipc.pyc
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO
installing scripts to build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
copying build/scripts-2.6/avro -> build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
changing mode of build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts/avro to 755
copying avro.egg-info/PKG-INFO -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/SOURCES.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/dependency_links.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/requires.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/top_level.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist/avro-1.6.1-py2.6.egg' and adding 'build/bdist.macosx-10.6-universal/egg' to it
removing 'build/bdist.macosx-10.6-universal/egg' (and everything under it)
Processing avro-1.6.1-py2.6.egg
Removing /Library/Python/2.6/site-packages/avro-1.6.1-py2.6.egg
Copying avro-1.6.1-py2.6.egg to /Library/Python/2.6/site-packages
avro 1.6.1 is already the active version in easy-install.pth
Installing avro script to /usr/local/bin
Installed /Library/Python/2.6/site-packages/avro-1.6.1-py2.6.egg
Processing dependencies for avro==1.6.1
Searching for python-snappy
Reading http://pypi.python.org/simple/python-snappy/
Reading http://github.com/andrix/python-snappy
Best match: python-snappy 0.3.2
Downloading http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
Processing python-snappy-0.3.2.tar.gz
Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-jARda8/python-snappy-0.3.2/egg-dist-tmp-DDcMTW
cc1plus: warning: command line option -Wstrict-prototypes is valid for C/ObjC but not for C++
snappymodule.cc:41: warning: ‘_state’ defined but not used
/usr/libexec/gcc/powerpc-apple-darwin10/4.2.1/as: assembler (/usr/bin/../libexec/gcc/darwin/ppc/as or /usr/bin/../local/libexec/gcc/darwin/ppc/as) for architecture ppc not installed
Installed assemblers are:
/usr/bin/../libexec/gcc/darwin/x86_64/as for architecture x86_64
/usr/bin/../libexec/gcc/darwin/i386/as for architecture i386
cc1plus: warning: command line option -Wstrict-prototypes is valid for C/ObjC but not for C++
snappymodule.cc:41: warning: ‘_state’ defined but not used
snappymodule.cc:246: fatal error: error writing to -: Broken pipe
compilation terminated.
cc1plus: warning: command line option -Wstrict-prototypes is valid for C/ObjC but not for C++
snappymodule.cc:41
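The "for architecture ppc not installed" failure above happens because gcc-4.2 tries to build python-snappy as a universal binary that includes ppc, and that assembler is missing on newer Xcode installs. A common workaround at the time, not confirmed anywhere in this thread, was to restrict ARCHFLAGS before running the install; a rough sketch of driving that from Python, with the flag values as an assumption about this particular machine:

import os
import subprocess

# Restrict the build to the assemblers this machine actually has (no ppc).
env = dict(os.environ)
env["ARCHFLAGS"] = "-arch i386 -arch x86_64"
subprocess.check_call(["python", "setup.py", "install"], env=env)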
Python for Avro doesn't build on OS X 10.6.8. Stuck.
I am unable to build Avro for Python on OS X 10.6.8 because python-snappy fails to build. Is there a way around this? I wrote a post here http://datasyndrome.com/post/13707537045/booting-the-analytics-application-events-ruby about how to use Avro with Pig and Ruby. The code is here: https://github.com/rjurney/Booting-the-Analytics-Application I am trying to create Python instructions that parallel the Ruby ones. I've tried multiple install methods, and I am unable to build python-snappy, so the avro build fails. This is the error message:
Installed /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg
Processing dependencies for avro==-AVRO-VERSION-
Searching for python-snappy
Reading http://pypi.python.org/simple/python-snappy/
Reading http://github.com/andrix/python-snappy
Best match: python-snappy 0.3.2
Downloading http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
Processing python-snappy-0.3.2.tar.gz
Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-0KEFAv/python-snappy-0.3.2/egg-dist-tmp-1BPj3j
cc1plus: warning: command line option -Wstrict-prototypes is valid for C/ObjC but not for C++
snappymodule.cc:31:22: error: snappy-c.h: No such file or directory
snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*, PyObject*)’:
snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope
snappymodule.cc:62: error: expected `;' before ‘status’
snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared in this scope
snappymodule.cc:79: error: ‘status’ was not declared in this scope
snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this scope
snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*, PyObject*)’:
snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope
snappymodule.cc:107: error: expected `;' before ‘status’
snappymodule.cc:120: error: ‘status’ was not declared in this scope
snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared in this scope
snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this scope
snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc: In function ‘PyObject* snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:
snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope
snappymodule.cc:151: error: expected `;' before ‘status’
snappymodule.cc:156: error: ‘status’ was not declared in this scope
snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not declared in this scope
snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc: At global scope:
snappymodule.cc:41: warning: ‘_state’ defined but not used
error: Setup script exited with error: command '/usr/bin/gcc-4.2' failed with exit status 1
-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
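The "snappy-c.h: No such file or directory" error above means the native snappy library and its headers are not installed, which is exactly the fix suggested in the reply below. An illustrative pre-flight check from Python; find_library only locates the shared library, not the header, but it is usually enough to tell whether snappy is present at all:

import ctypes.util

# If even the shared library is missing, python-snappy has no chance of compiling.
if ctypes.util.find_library("snappy") is None:
    print("libsnappy not found - install snappy first (e.g. via Homebrew or MacPorts)")
else:
    print("libsnappy found; if the build still fails, the snappy-c.h header may be missing")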
Re: Python for Avro doesn't build on OS X 10.6.8. Stuck.
I installed the snappy macport, and I still can't get python-snappy to build. I'll try switching to brew. On Wed, Dec 14, 2011 at 5:46 PM, Joe Crobak joec...@gmail.com wrote: It appears that the python-snappy library assumes that the snappy library has already been installed. I saw similar errors, but once I installed snappy (I use homebrew, so 'brew install snappy') python-snappy installs fine. HTH, Joe On Wed, Dec 14, 2011 at 8:23 PM, Russell Jurney russell.jur...@gmail.com wrote: I am unable to build Avro for Python on OS X 10.6.8 because python-snappy fails to build. Is there a way around this?
-- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com