Re: STORE USING AvroStorage - ignores Pig field names, only using their position

2013-11-16 Thread Russell Jurney
Pig tuples have field order. Swap the order of the fields in your avro schema 
and try again.

 On Nov 16, 2013, at 6:19 PM, Ruslan Al-Fakikh metarus...@gmail.com wrote:
 
 Hey guys,
 
 When I store with AvroStorage, the names of the Pig tuple fields are completely 
 ignored. The field values are written to the result file only by their position.
 Here is a simplified test case:
 
 %declare WORKDIR `pwd`
 REGISTER ../../../../lib/external/avro-1.7.4.jar
 REGISTER ../../../../lib/external/json-simple-1.1.jar
 --this is built (manually with Maven) from the latest source:
 --http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/
 REGISTER ../piggybankBuiltFromSource.jar
 REGISTER ../../../../lib/external/jackson-core-asl-1.8.8.jar
 REGISTER ../../../../lib/external/jackson-mapper-asl-1.8.8.jar
 
 --$ cat input.txt 
 --data_a  data_b
 --data_a  data_b
 inputs = LOAD 'input.txt' AS (a: chararray, b: chararray);
 
 DESCRIBE inputs;
 DUMP inputs;
 
 --output:
 --inputs: {a: chararray,b: chararray}
 --(data_a,data_b)
 --(data_a,data_b)
 
 STORE inputs INTO 'output'
 USING org.apache.pig.piggybank.storage.avro.AvroStorage('{
 "schema":
 {
   "type" : "record",
   "name" : "my_schema",
   "namespace" : "com.my_namespace",
   "fields" : [
   {
 "name" : "b",
 "type" : "string"
   },
   {
 "name" : "nonsense_name",
 "type" : "string"
   }
   ]
 }
 }');
 
 --output
 --$ java -jar ../../../../lib/external/avro-tools-1.7.4.jar tojson 
 output/part*
 --{"b":"data_a","nonsense_name":"data_b"}
 --{"b":"data_a","nonsense_name":"data_b"}
 
 AvroStorage is built from the latest piggybank code.
 Using AvroStorage's 'debug : 5' parameter didn't help.
 
 $ pig -version
 Apache Pig version 0.11.0-cdh4.3.0 (rexported) 
 compiled May 27 2013, 20:48:21
 
 Any help would be appreciated.
 
 Thanks,
 Ruslan Al-Fakikh


Re: STORE USING AvroStorage - ignores Pig field names, only using their position

2013-11-16 Thread Russell Jurney
How can Pig map from a to nonsense_name?

On Saturday, November 16, 2013, Ruslan Al-Fakikh wrote:

 Thanks, Russell!

 Do you mean that this is the expected behavior? Shouldn't AvroStorage map
 the pig fields by their names (not their field order) matching them to the
 names in the avro schema?

 Thanks,
 Ruslan Al-Fakikh


 On Sun, Nov 17, 2013 at 6:53 AM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Pig tuples have field order. Swap the order of the fields in your avro
 schema and try again.





-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: STORE USING AvroStorage - ignores Pig field names, only using their position

2013-11-16 Thread Russell Jurney
I think the expected behavior of AvroStorage is to write the fields in the
order they exist in the tuple. So to fix your problem, swap the order of
b/nonsense_name.

Otherwise I can't see a way to map from b to nonsense_name at all. Pig
can't know how to do that without referencing tuple field order.
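To make the two behaviors concrete, here is a short plain-Python sketch of positional mapping versus the name-based mapping Ruslan expected. The names and helpers are illustrative only; this is not AvroStorage's actual code.

```python
# Illustrative sketch (plain Python, not AvroStorage internals) of the
# two mapping strategies discussed in this thread.

def map_by_position(values, avro_field_names):
    """What piggybank AvroStorage does: pair values with Avro fields
    purely by position in the tuple."""
    return dict(zip(avro_field_names, values))

def map_by_name(values, pig_field_names, avro_field_names):
    """What the poster expected: match Pig aliases to Avro field names."""
    by_name = dict(zip(pig_field_names, values))
    # An Avro field with no matching Pig alias has no value here;
    # a real implementation would need a default or raise an error.
    return {f: by_name[f] for f in avro_field_names}

tuple_values = ("data_a", "data_b")
pig_names = ("a", "b")
avro_names = ("b", "nonsense_name")

positional = map_by_position(tuple_values, avro_names)
# Positional mapping: Avro field 'b' silently receives the value of Pig
# field 'a'. Name-based mapping would raise KeyError here instead,
# because nothing in the Pig schema is called 'nonsense_name'.
```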

On Sat, Nov 16, 2013 at 7:42 PM, Ruslan Al-Fakikh metarus...@gmail.com wrote:

 including this last message to pig user list


 On Sun, Nov 17, 2013 at 7:40 AM, Ruslan Al-Fakikh metarus...@gmail.com wrote:

 Russell,

 Actually this problem came from the situation when I had the same names
 in pig relation schema and avro schema. And it turned out that AvroStorage
 switches fields if the order is different.
 So, my impression is that it should work this way:
 1) names correspond - then AvroStorage uses them
 2) names do not correspond - then AvroStorage fails to store or does some
 schema resolution as shown here:
 http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution

 Thanks


 On Sun, Nov 17, 2013 at 7:17 AM, Russell Jurney russell.jur...@gmail.com
  wrote:

 How can Pig map from a to nonsense_name?






-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Is Avro right for me?

2013-05-27 Thread Russell Jurney
What's more, there are examples and support for Kafka, but not so much for
Flume.


On Mon, May 27, 2013 at 6:25 AM, Martin Kleppmann mar...@rapportive.com wrote:

 I don't have experience with Flume, so I can't comment on that. At
 LinkedIn we ship logs around by sending Avro-encoded messages to Kafka (
 http://kafka.apache.org/). Kafka is nice, it scales very well and gives a
 great deal of flexibility — logs can be consumed by any number of
 independent consumers, consumers can catch up on a backlog if they're
 disconnected for a while, and it comes with Hadoop import out of the box.

 (RabbitMQ is designed more for use cases where each message corresponds to
 a task that needs to be performed by a worker. IMHO Kafka is a better fit
 for logs, which are more stream-like.)

 With any message broker, you'll need to somehow tag each message with the
 schema that was used to encode it. You could include the full schema with
 every message, but unless you have very large messages, that would be a
 huge overhead. Better to give each version of your schema a sequential
 version number, or hash the schema, and include the version number/hash in
 each message. You can then keep a repository of schemas for resolving those
 version numbers or hashes – simply in files that you distribute to all
 producers/consumers, or in a simple REST service like
 https://issues.apache.org/jira/browse/AVRO-1124
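Martin's hash-tagging scheme can be sketched in plain Python. Everything below is illustrative (the fingerprint choice, prefix layout, and in-memory registry are assumptions, not any particular system's wire format); a real setup would Avro-encode the payload and share the registry between producers and consumers.

```python
# Sketch: prefix each message with a 16-byte fingerprint of the writer's
# schema, and resolve the schema from a registry on the consumer side.
import hashlib
import json

def schema_fingerprint(schema_json: str) -> bytes:
    # Hash a canonical rendering (sorted keys, no whitespace) so that
    # formatting differences don't change the fingerprint.
    canonical = json.dumps(json.loads(schema_json), sort_keys=True,
                           separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).digest()  # 16 bytes

def tag(schema_json: str, payload: bytes) -> bytes:
    return schema_fingerprint(schema_json) + payload

def untag(message: bytes, registry: dict):
    fp, payload = message[:16], message[16:]
    return registry[fp], payload  # writer's schema plus raw payload

schema = '{"type": "record", "name": "LogLine", "fields": []}'
registry = {schema_fingerprint(schema): schema}
writer_schema, body = untag(tag(schema, b"payload-bytes"), registry)
```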

 Hope that helps,
 Martin


 On 26 May 2013 17:39, Mark static.void@gmail.com wrote:

 Yes our central server would be Hadoop.

 Exactly how would this work with Flume? Would I write Avro to a file
 source which Flume would then ship over to one of our collectors, or is
 there a better/native way? Would I have to include the schema in each
 event? FYI we would be doing this primarily from a Rails application.

 Does anyone ever use Avro with a message bus like RabbitMQ?

 On May 23, 2013, at 9:16 PM, Sean Busbey bus...@cloudera.com wrote:

 Yep. Avro would be great at that (provided your central consumer is Avro
 friendly, like a Hadoop system).  Make sure that all of your schemas have
 default values defined for fields so that schema evolution will be easier
 in the future.


 On Thu, May 23, 2013 at 4:29 PM, Mark static.void@gmail.com wrote:

 We're thinking about generating logs and events with Avro and shipping
 them to a central collector service via Flume. Is this a valid use case?




 --
 Sean






-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Is Avro/Trevni strictly read-only?

2013-01-30 Thread Russell Jurney
Aaron - is there a way to create a Kiji table from Pig? I'm in the habit of
not specifying schemas with Voldemort and MongoDB, just storing a Pig
relation and the schema is set in the store. If I can arrange that somehow,
I'm all over Kiji. Panthera is a fork :/


On Wed, Jan 30, 2013 at 3:20 PM, Aaron Kimball akimbal...@gmail.com wrote:

 Hi ccleve,

 I'd definitely urge you to try out Kiji -- we who work on it think it's a
 pretty good fit for this specific use case. If you've got further questions
 about Kiji and how to use it, please send them to me, or ask the kiji user
 mailing list: http://www.kiji.org/getinvolved#Mailing_Lists

 cheers,
 - Aaron


 On Tue, Jan 29, 2013 at 3:24 PM, Doug Cutting cutt...@apache.org wrote:

 Avro and Trevni files do not support record update or delete.

 For large changing datasets you might use Kiji (http://www.kiji.org/)
 to store Avro data in HBase.

 Doug

 On Mon, Jan 28, 2013 at 12:00 PM, ccleve ccleve.t...@gmail.com wrote:
  I've gone through the documentation, but haven't been able to get a
 definite
  answer: is Avro, or specifically Trevni, only for read-only data?
 
  Is it possible to update or delete records?
 
  If records can be deleted, is there any code that will merge row sets
 to get
  rid of the unused space?
 
 
 





-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Is Avro/Trevni strictly read-only?

2013-01-29 Thread Russell Jurney
Intel's HBase Panthera has an Avro document store built in - another option:
https://github.com/intel-hadoop/hbase-0.94-panthera


On Tue, Jan 29, 2013 at 3:24 PM, Doug Cutting cutt...@apache.org wrote:

 Avro and Trevni files do not support record update or delete.

 For large changing datasets you might use Kiji (http://www.kiji.org/)
 to store Avro data in HBase.

 Doug

 On Mon, Jan 28, 2013 at 12:00 PM, ccleve ccleve.t...@gmail.com wrote:
  I've gone through the documentation, but haven't been able to get a
 definite
  answer: is Avro, or specifically Trevni, only for read-only data?
 
  Is it possible to update or delete records?
 
  If records can be deleted, is there any code that will merge row sets to
 get
  rid of the unused space?
 
 
 




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Sync() between records? How do we recover from a bad record, using DataFileReader?

2013-01-06 Thread Russell Jurney
We are trying to recover, report bad record, and move to the next record of
an Avro file in PIG-3015 and PIG-3059. It seems that sync blocks don't
exist between files, however.

How should we recover from a bad record using Avro's DataFileReader?

https://issues.apache.org/jira/browse/PIG-3015
https://issues.apache.org/jira/browse/PIG-3059

Russell Jurney http://datasyndrome.com
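For context on what recovery could look like: an Avro data file records a 16-byte sync marker in its header and repeats it after every block, so skipping a bad record means scanning forward to the next marker and resuming at the following block (the Java DataFileReader exposes sync()/pastSync() for this). A rough stand-alone sketch of the scan, with an invented stand-in marker value:

```python
# Sketch of "skip to the next block" recovery over an in-memory buffer.
# The marker value here is a stand-in; a real reader takes it from the
# file header.

def next_block_offset(buf: bytes, pos: int, sync_marker: bytes) -> int:
    """Offset just past the next sync marker at or after pos, or
    len(buf) when no further marker exists."""
    i = buf.find(sync_marker, pos)
    return len(buf) if i == -1 else i + len(sync_marker)

SYNC = b"S" * 16  # stand-in for the header's 16 random bytes
data = b"block-one" + SYNC + b"block-two" + SYNC

# Suppose decoding failed at offset 3, inside the first block:
resume = next_block_offset(data, 3, SYNC)  # start of "block-two"
```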


Re: Output from AVRO mapper

2012-12-21 Thread Russell Jurney
I don't mean to harp, but this is a few lines in Pig:

/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

/* Load Avros */
input = load 'my.avro' using AvroStorage();

/* Verify input */
describe input;
illustrate input;

/* Convert Avros to JSON */
store input into 'my.json' using
com.twitter.elephantbird.pig.store.JsonStorage();
store input into 'my.json.lzo' using
com.twitter.elephantbird.pig.store.LzoJsonStorage();

/* Convert simple Avros to TSV */
store input into 'my.tsv';

/* Convert Avros to SequenceFiles */
REGISTER '/path/to/elephant-bird.jar';
store input into 'my.seq' using
com.twitter.elephantbird.pig.store.SequenceFileStorage(
/* example: */
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.TextConverter'
);

/* Convert Avros to Protobufs */
store input into 'input.protobuf' using
com.twitter.elephantbird.examples.proto.pig.store.ProfileProtobufB64LinePigStorage();

/* Convert Avros to a Lucene Index */
store input into 'input.lucene' using
LuceneIndexStorage('com.example.MyPigLuceneIndexOutputFormat');

There are also drivers for most NoSQLish databases...

Russell Jurney http://datasyndrome.com

On Dec 20, 2012, at 9:33 AM, Terry Healy the...@bnl.gov wrote:

I'm just getting started using AVRO within Map/Reduce and trying to
convert some existing non-AVRO code to use AVRO input. So far the data
that previously was stored in tab delimited files has been converted to
.avro successfully as checked with avro-tools.

Where I'm getting hung up extending beyond my book-based examples is in
attempting to read from AVRO (using generic records) where the mapper
output is NOT in AVRO format. I can't seem to reconcile extending
AvroMapper and NOT using AvroCollector.

Here are snippets of code that show my non-AVRO M/R code and my
[failing] attempts to make this change. If anyone can help me along it
would be very much appreciated.

-Terry

code
Pre-Avro version: (Works fine with .tsv input format)

   public static class HdFlowMapper extends MapReduceBase
   implements Mapper<Text, HdFlowWritable, LongPair,
HdFlowWritable> {


   @Override
   public void map(Text key, HdFlowWritable value,
   OutputCollector<LongPair, HdFlowWritable> output,
   Reporter reporter) throws IOException {

   ...//
   outKey = new LongPair(value.getSrcIp(), value.getFirst());

   HdFlowWritable outValue = value; // pass it all through
   output.collect(outKey, outValue);
   }



AVRO attempt:


   conf.setOutputFormat(TextOutputFormat.class);
   conf.setOutputKeyClass(LongPair.class);
   conf.setOutputValueClass(AvroFlowWritable.class);

   SCHEMA = new Schema.Parser().parse(NetflowSchema);
   AvroJob.setInputSchema(conf, SCHEMA);
   //AvroJob.setOutputSchema(conf, SCHEMA);
   AvroJob.setMapperClass(conf, AvroFlowMapper.class);
   AvroJob.setReducerClass(conf, AvroFlowReducer.class);



public static class AvroFlowMapper<K> extends AvroMapper<K,
OutputCollector> {


   @Override
   ** IDE: Method does not override or implement a method from a supertype

   public void map(K datum, OutputCollector<LongPair,
AvroFlowWritable> collector, Reporter reporter) throws IOException {


   GenericRecord record = (GenericRecord) datum;
   afw = new AvroFlowWritable(record);
   // ...
   collector.collect(outKey, afw);
}

/code


Re: Converting arbitrary JSON to avro

2012-09-18 Thread Russell Jurney
Fwiw, I do this in web apps all the time via the python avro lib and json.dumps

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Sep 18, 2012, at 12:38 PM, Doug Cutting cutt...@apache.org wrote:

 On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler mar...@braindump.ms 
 wrote:
 Json.Writer is indeed what I had in mind and I have successfully managed to 
 convert my existing JSON to avro using it.
 However using GenericDatumReader on this feels pretty unnatural, as I seem 
 to be unable to access fields directly. It seems I have to access the 
 value field on each record which returns a Map which uses Utf8 Objects as 
 keys for the actual fields. Or am I doing something wrong here?

 Hmm.  We could re-factor Json.SCHEMA so the union is the top-level
 element.  That would get rid of the wrapper around every value.  It's
 a more redundant way to write the schema, but the binary encoding is
 identical (since a record wrapper adds no bytes).  It would hence
 require no changes to Json.Reader or Json.Writer.

 [ "long",
  "double",
  "string",
  "boolean",
  "null",
  {"type" : "array",
   "items" : {
   "type" : "record",
   "name" : "org.apache.avro.data.Json",
   "fields" : [ {
   "name" : "value",
   "type" : [ "long", "double", "string", "boolean", "null",
  {"type" : "array", "items" : "Json"},
  {"type" : "map", "values" : "Json"}
]
   } ]
   }
  },
  {"type" : "map", "values" : "Json"}
 ]

 You can try this by placing this schema in
 share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
 jar.

 Would such a change be useful to you?  If so, please file an issue in Jira.

 Or we could even refactor this schema so that a Json object is the
 top-level structure:

 {"type" : "map",
 "values" : [ "long",
  "double",
  "string",
  "boolean",
  "null",
  {"type" : "array",
   "items" : {
   "type" : "record",
   "name" : "org.apache.avro.data.Json",
   "fields" : [ {
   "name" : "value",
   "type" : [ "long", "double", "string", "boolean", 
 "null",
  {"type" : "array", "items" : "Json"},
  {"type" : "map", "values" : "Json"}
]
   } ]
   }
  },
  {"type" : "map", "values" : "Json"}
]
 }

 This would change the binary format but would not change the
 representation that GenericDatumReader would hand you from my first
 example above (since the generic representation unwraps unions).
 Using this schema would require changes to Json.Writer and
 Json.Reader.  It would better conform to the definition of Json, which
 only permits objects as the top-level type.

 Concerning the more specific schema, you are of course completely right. 
 Unfortunately more or less all the fields in the JSON data format are 
 optional and many have substructures, so, at least in my understanding, I 
 have to use unions of null and the actual type throughout the schema. I 
 tried using JsonDecoder first (or rather the fromjson option of the avro 
 tool, which, I think, uses JsonDecoder) but given the current JSON 
 structures, this didn't work.

 So I'll probably have to look into implementing my own converter.  However 
 given the rather complex structure of the original JSON I'm wondering if 
 trying to represent the data in avro is such a good idea in the first place.

 It would be interesting to see whether, with the appropriate schema,
 the dataset is smaller and faster to process as Avro than as
 Json.  If you have 1000 fields in your data but the typical record
 only has one or two non-null, then an Avro record is perhaps not a
 good representation.  An Avro map might be better, but if the values
 are similarly variable then Json might be competitive.

 Cheers,

 Doug


Re: Avro file size is too big

2012-07-04 Thread Russell Jurney
This thread looks useful. Are you flushing too often?
http://apache-avro.679487.n3.nabble.com/avro-compression-using-snappy-and-deflate-td3870167.html

Russell Jurney http://datasyndrome.com
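The gist of that thread (each flush closes an Avro block, and the codec compresses every block independently, so frequent flushes forfeit cross-record redundancy) is easy to demonstrate with plain zlib standing in for the file's deflate codec:

```python
# Compare compressing 2000 similar records as one big block versus one
# tiny block per record (i.e., flushing after every record).
import zlib

records = [("row-%06d\tsome repeated payload text\n" % i).encode()
           for i in range(2000)]

one_block = len(zlib.compress(b"".join(records), 9))          # flush once
per_record = sum(len(zlib.compress(r, 9)) for r in records)   # flush always

# The per-record total comes out noticeably larger than the single block.
```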

On Jul 4, 2012, at 6:33 AM, Ruslan Al-Fakikh metarus...@gmail.com wrote:

 Hello,

 In my organization currently we are evaluating Avro as a format. Our
 concern is file size. I've done some comparisons of a piece of our
 data.
 Say we have sequence files, compressed. The payload (values) are just
 lines. As far as I know we use line numbers as keys and the
 default codec for compression inside sequence files. The size is 1.6G;
 when I put it into Avro with the deflate codec at deflate level 9 it
 becomes 2.2G.
 This is interesting, because the values in seq files are just strings,
 but Avro has a normal schema with primitive types, and those are kept
 binary. Shouldn't Avro be smaller?
 Also I took another dataset which is 28G (gzip files, plain
 tab-delimited text; I don't know the deflate level) and put it
 into Avro and it became 38G.
 Why is Avro so big? Am I missing some size optimization?

 Thanks in advance!


Re: Hadoop 0.23, Avro Specific 1.6.3 and org.apache.avro.generic.GenericData$Record cannot be cast to

2012-05-13 Thread Russell Jurney
Consider Pig and AvroStorage.

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On May 13, 2012, at 4:49 AM, Jacob Metcalf jacob_metc...@hotmail.com
wrote:


I have just spent several frustrating hours on getting an example MR job
using Avro working with Hadoop and after finally getting it working I
thought I would share my findings with everyone.

I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map
and Reduce then attempted to deploy and run. I am setting up a development
cluster with Hadoop 0.23 running pseudo-distributed under cygwin. I ran my
job and it failed with:

org.apache.avro.generic.GenericData$Record cannot be cast to
net.jacobmetcalf.avro.Room

Where Room is an Avro generated class. I found two problems. The first I
have partly solved, the second one is more to do with Hadoop and is as yet
unsolved:

1) Why when I am using Avro Specific does it end up going Generic?

When deserializing SpecificDatumReader.java attempts to instantiate your
target class through reflection. If it fails to create your class it
defaults to a GenericData.Record, as Doug explained here:
http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3c4d2b6d56.2070...@apache.org%3E

But why it is doing it was a little harder to work out. Debugging I saw
the SpecificDatumReader could not find my class in its classpath. However
in my Job Runner I had done:

job.setJarByClass(HouseAssemblyJob.class); // This should ensure the JAR is
distributed around the cluster

I expected with this Hadoop would distribute my Jar around the cluster. It
may be doing the distribution but it definitely did not add it to the
Reducers classpath. So to get round this I have now set HADOOP_CLASSPATH to
the directory I am running from. This is not going to work in a real
cluster where the Job Runner is on a different machine to where the Reducer
so I am keen to figure out whether the problem is Hadoop 0.23, my
environment variables or the fact I am running under Cygwin.


2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3 ?

Whilst debugging I realised that Hadoop is shipping with Avro 1.5.3. I
however want to use 1.6.3 (and 1.7 when it comes out) because of its
support for immutability and builders in the generated classes. I probably
could just hack the old Avro lib out of my Hadoop distribution and drop the
new one in. However I thought it would be cleaner to get Hadoop to
distribute my jar to all datanodes and then manipulate my classpath to get
the latest version of Avro to the top. So I have packaged Avro 1.6.3 into
my job jar using Maven assembly and tried to do this in my JobRunner:

job.setJarByClass( MyJob.class ); // This should ensure the JAR is
// distributed around the cluster
config.setBoolean( MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST,
true ); // ensure my version of avro?

But it continues to use 1.5.3. I suspect it is again to do with my
HADOOP_CLASSPATH which has avro-1.5.3 in it:

export
HADOOP_CLASSPATH=$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*

If anyone has done this and has any ideas please let me know?

Thanks

Jacob


Re: AvroStorage/Avro Schema Question

2012-04-17 Thread Russell Jurney
:
},
{
name:address,
type:[null,string],
doc:
}
]
}

]
}
],
doc:
}
]
}

On Tue, Apr 10, 2012 at 2:36 AM, Russell Jurney russell.jur...@gmail.com 
wrote:
H unable to get this to work:

{
"namespace": "agile.data.avro",
"name": "Email",
"type": "record",
"fields": [
{"name":"message_id", "type": ["string", "null"]},
{"name":"froms","type": [{"type":"record", "name":"from", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"tos","type": [{"type":"record", "name":"to", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"ccs","type": [{"type":"record", "name":"cc", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"bccs","type": [{"type":"record", "name":"bcc", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"reply_tos","type": [{"type":"record", "name":"reply_to", 
"fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, 
"null"]},
{"name":"subject", "type": ["string", "null"]},
{"name":"body", "type": ["string", "null"]},
{"name":"date", "type": ["string", "null"]}
]
}

On Tue, Apr 10, 2012 at 2:26 AM, Russell Jurney russell.jur...@gmail.com 
wrote:
In thinking about it more... it seems that unfortunately, the only thing I can 
really do is to change the schema for all email address fields:

{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
to:
{"name":"froms","type": [{"type":"record", "name":"from", "fields": 
[{"type":"array", "items":"string"}, "null"]}, "null"]},

That is, to pluralize everything and then individually name array elements. I 
will try running this through my stack.


On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey scottca...@apache.org wrote:
It appears as though the Avro to PigStorage schema translation names (in pig) 
all arrays ARRAY_ELEM.  The nullable wrapper is 'visible' and the field name is 
not moved onto the bag name.   

About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592

but before finishing it AvroStorage was written elsewhere.  I don't recall 
exactly what I did with the schema translation there, but I recall the mapping 
from an Avro schema to pig tried to hide the nullable wrappers more.


In Avro, arrays are unnamed types, so I see two things you could probably do 
without any code changes:

* Add a line in the pig script to project / rename the fields to what you want 
(unfortunate and clumsy, but I think it will work) — I think you want 
from::PIG_WRAPPER::ARRAY_ELEM as from, or 
FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from, something like that.
* Add a record wrapper to your schema (which may inject more messiness in the 
pig schema view):
{
"namespace": "agile.data.avro",
"name": "Email",
"type": "record",
"fields": [
{"name":"message_id", "type": ["string", "null"]},
{"name":"from","type": [{"type":"record", "name":"From", "fields": 
[[{"type":"array", "items":"string"},"null"]], "null"]},
   …
]
}

But that is very awkward — requiring a named record for each field that is an 
unnamed type.


Ideally PigStorage would treat any union of null and one other thing as a 
simple pig type with no wrapper, and project the name of a field or record into 
the name of the thing inside a bag.
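Scott's "ideal" rule in the last paragraph is simple to state precisely. A hypothetical sketch in plain Python, not the actual AvroStorage translation code:

```python
# Sketch of the union-unwrapping rule: treat a union of "null" and
# exactly one other schema as that other schema.

def unwrap_nullable(avro_type):
    """["null", X] or [X, "null"] -> X; anything else unchanged."""
    if isinstance(avro_type, list) and len(avro_type) == 2 and "null" in avro_type:
        other = [t for t in avro_type if t != "null"]
        return other[0]
    return avro_type

# The nullable wrapper disappears from the Pig-side view:
assert unwrap_nullable(["string", "null"]) == "string"
# A union with more than two branches is genuinely polymorphic and is
# left alone:
assert unwrap_nullable(["null", "string", "long"]) == ["null", "string", "long"]
```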


-Scott

On 3/29/12 6:05 PM, Russell Jurney russell.jur...@gmail.com wrote:

Is it possible to name string elements in the schema of an array?  
Specifically, below I want to name the email addresses in the 
from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by 
Pig's AvroStorage.  I know I can probably fix this in Java in the Pig 
AvroStorage UDF, but I'm hoping I can also fix it more easily in the schema.  
Last time I read Avro's array docs in this context, my hit-points dropped by a 
third, so pardon me if I've not rtfm'd this time :)

Complete description of what I'm doing follows:

Avro schema for my emails:

{
"namespace": "agile.data.avro",
"name": "Email",
"type": "record",
"fields": [
{"name":"message_id", "type": ["string", "null"]},
{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"reply_to", "type": [{"type":"array", "items":"string"}, 
"null"]},
{"name":"in_reply_to", "type": [{"type":"array", "items":"string"}, 
"null"]},
{"name":"subject", "type": ["string", "null"]},
{"name":"body", "type": ["string", "null"]},
{"name":"date", "type": ["string", "null"]}
]
}

Pig to publish my Avros:

grunt emails = load '/me/tmp/emails' using AvroStorage();
grunt describe emails

emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},to: 
{PIG_WRAPPER

Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem

2012-03-23 Thread Russell Jurney
Thanks Scott, looking at the raw data it seems to have been a truncated record 
due to UTF problems.

Russell Jurney http://datasyndrome.com
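As an aside, a "branch index 64" error is what a mis-aligned or truncated stream tends to produce: Avro encodes the union branch as a zigzag varint, so a reader that lands on arbitrary payload bytes decodes a meaningless index. A minimal illustrative decoder (not Avro's implementation):

```python
# Decode a zigzag-encoded varint, the way Avro encodes union branch
# indices (and all ints/longs).

def read_zigzag_varint(buf: bytes, pos: int = 0):
    shift, acc = 0, 0
    while True:
        b = buf[pos]
        acc |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1), pos  # undo the zigzag mapping

# A valid branch for a two-branch union decodes to 0 or 1:
index, _ = read_zigzag_varint(bytes([0x02]))
# Two stray payload bytes can decode to 64 instead:
bogus, _ = read_zigzag_varint(bytes([0x80, 0x01]))
```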

On Mar 23, 2012, at 7:59 PM, Scott Carey scottca...@apache.org wrote:

 
 It appears to be reading a union index and failing in there somehow.  If it 
 did not have any of the pig AvroStorage stuff in there I could tell you more.
 
 What does avro-tools.jar 's 'tojson' tool do?  (java -jar 
 avro-tools-1.6.3.jar tojson file | your_favorite_text_reader)  
 What version of Avro is the java stack trace below?
 
 
 On 3/23/12 7:01 PM, Russell Jurney russell.jur...@gmail.com wrote:
 
 I have a problem record I've written in Avro that crashes anything which 
 tries to read it :(
 
 Can anyone make sense of these errors?
 
 The exception in Pig/AvroStorage is this:
 
 java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64
   at 
 org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:275)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)
   at 
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
   at 
 org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
   at 
 org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
   at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
   at 
 org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
   at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
   at 
 org.apache.pig.piggybank.storage.avro.PigAvroDatumReader.readRecord(PigAvroDatumReader.java:67)
   at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
   at 
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
   at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
   at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
   at 
 org.apache.pig.piggybank.storage.avro.PigAvroRecordReader.getCurrentValue(PigAvroRecordReader.java:80)
   at 
 org.apache.pig.piggybank.storage.avro.AvroStorage.getNext(AvroStorage.java:273)
   ... 7 more
 
 When reading the record in Python:  
 
 File /me/Collecting-Data/src/python/cat_avro, line 21, in module
 for record in df_reader:
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py,
  line 354, in next
 datum = self.datum_reader.read(self.datum_decoder) 
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
  line 445, in read
 return self.read_data(self.writers_schema, self.readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
  line 490, in read_data
 return self.read_record(writers_schema, readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
  line 690, in read_record
 field_val = self.read_data(field.type, readers_field.type, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
  line 488, in read_data
 return self.read_union(writers_schema, readers_schema, decoder)
   File 
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
  line 650, in read_union
 raise SchemaResolutionException(fail_msg, writers_schema, readers_schema)
 avro.io.SchemaResolutionException: Can't access branch index 64 for union 
 with 2 branches
 
 When reading the record in Ruby:
 
 /Users/peyomp/.rvm/gems/ruby-1.8.7-p352/gems/avro-1.6.1/lib/avro/io.rb:298:in 
 `read_data': Writer's schema  and Reader's schema [string,null] do not 
 match. (Avro::IO::SchemaMatchException)
 
 -- 
 Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
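All three stack traces above fail at the same step: in Avro's binary format a union value is a zigzag-varint long selecting the branch, followed by the branch's encoding, so a truncated or misaligned stream can make a reader pull an impossible index such as 64 out of a two-branch union. A minimal stand-alone sketch of just that decoding step (my own illustration, not the Avro library's code):

```python
# Sketch: decode one zigzag-encoded varint long, the way Avro reads a union's
# branch index. Not the Avro library's code -- just the wire format it implies.

def read_zigzag_long(buf, pos=0):
    """Return (value, next_pos) for one zigzag varint starting at buf[pos]."""
    shift, accum = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        accum |= (b & 0x7F) << shift  # low 7 bits carry payload
        if not (b & 0x80):            # high bit clear: last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1), pos  # undo zigzag mapping

# A single 0x00 byte is a legal branch index (0) for a ["string","null"] union;
# the bytes 0x80 0x01 decode to 64 -- the bogus index Pig, Python, and Ruby
# all reported, which means the reader was no longer aligned on a union at all.
good, _ = read_zigzag_long(bytes([0x00]))
bad, _ = read_zigzag_long(bytes([0x80, 0x01]))
print(good, bad)
```

So the error is not three different bugs: once a record is truncated mid-value, every subsequent decoder interprets whatever byte comes next as a branch index.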


Re: Problem: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 64 / avro.io.SchemaResolutionException: Can't access branch index 64 for union with 2 branches / `read_data': Writer's schem

2012-03-23 Thread Russell Jurney
Ok, now I have a followup question...

how does one recover from an exception while writing an Avro record?  The
incomplete record still gets written, which then crashes the reader.
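One generic defence at write time (a library-agnostic sketch in plain Python; the serializer below is hypothetical) is to encode each record fully in memory and append to the file only when encoding succeeds, so a failed encode can never leave a truncated entry behind:

```python
import io

def append_records(path, records, serialize):
    """Append records to `path`, committing each only after it encoded cleanly.

    `serialize(record)` must return the record's complete byte encoding or
    raise. A record whose encoding raises is skipped, so a failure can never
    leave a partial record in the file -- the situation described above.
    """
    written, skipped = 0, 0
    with open(path, "ab") as out:
        for rec in records:
            buf = io.BytesIO()
            try:
                buf.write(serialize(rec))   # encode fully in memory first
            except Exception:
                skipped += 1                # bad record: nothing reached disk
                continue
            out.write(buf.getvalue())       # commit the complete encoding
            written += 1
    return written, skipped

# Hypothetical serializer: newline-delimited UTF-8 text; invalid bytes raise,
# like the truncated-UTF-8 record in this thread.
def ser(rec):
    return (rec.decode("utf-8") + "\n").encode("utf-8")
```

The same buffer-then-commit idea applies regardless of the client library: validate and encode first, touch the file second.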

On Fri, Mar 23, 2012 at 8:01 PM, Russell Jurney russell.jur...@gmail.com wrote:

 Thanks Scott, looking at the raw data it seems to have been a truncated
 record due to UTF problems.

 Russell Jurney http://datasyndrome.com




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: HttpTranceiver and JSON-encoded Avro?

2012-02-15 Thread Russell Jurney
FWIW, there are Avro libs for JavaScript and Node.js on GitHub.

Russell Jurney http://datasyndrome.com

On Feb 15, 2012, at 7:32 AM, Frank Grimes frankgrime...@gmail.com wrote:

 Hi All,
 
 Is there any way to send Avro data over HTTP encoded in JSON?
 We want to integrate with Node.js and JSON seems to be the best/simplest way 
 to do so.
 
 Thanks,
 
 Frank Grimes


Re: Pig/Avro Question

2012-02-03 Thread Russell Jurney
Hmmm, I applied it, but I still can't open files that don't end in .avro.

On Fri, Feb 3, 2012 at 2:23 AM, Philipp philipp.p...@metrigo.de wrote:

 This patch fixes this issue:

 https://issues.apache.org/jira/browse/PIG-2492



 On 02/03/2012 07:22 AM, Russell Jurney wrote:

 I have the same bug. I read the code... there is no obvious fix.  Arg.

 On Feb 2, 2012, at 10:07 PM, Something Something mailinglist...@gmail.com  wrote:

  In my Pig script I have something like this...

 %default MY_SCHEMA '/user/xyz/my-schema.json';

 %default MY_AVRO 'org.apache.pig.piggybank.storage.avro.AvroStorage(\'$MY_SCHEMA\')';

 my_files = LOAD '$MY_FILES' USING $MY_AVRO;



 What I have noticed is that when MY_FILES contains only one file, it
 works fine.

 %default MY_FILES '/user/xyz/file1.avro'


 But when I use a comma separated list it doesn't work. e.g.

 %default MY_FILES '/user/xyz/file1.avro, /user/xyz/file2.avro'

 Basically, I get a message saying something like 'Schema cannot be
 found'.

 Is there a way to make it work with multiple files?  Please let me know.
  Thanks.





-- 
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com


Re: Pig/Avro Question

2012-02-03 Thread Russell Jurney
btw, the weird thing is... I've read the code.  There isn't a filter for
.avro in there.  Does Hadoop, or Avro itself (not that I can see how it is
involved), do the filtering?





-- 
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com


Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python

2012-02-02 Thread Russell Jurney
Correction: when I read the file in Python, I get the error below.  It
looks like a Unicode problem?  Can one tell Avro how to handle this?

Traceback (most recent call last):
  File ./cat_avro, line 21, in module
for record in df_reader:
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/datafile.py,
line 354, in next
datum = self.datum_reader.read(self.datum_decoder)
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 445, in read
return self.read_data(self.writers_schema, self.readers_schema, decoder)
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 490, in read_data
return self.read_record(writers_schema, readers_schema, decoder)
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 690, in read_record
field_val = self.read_data(field.type, readers_field.type, decoder)
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 488, in read_data
return self.read_union(writers_schema, readers_schema, decoder)
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 654, in read_union
return self.read_data(selected_writers_schema, readers_schema, decoder)
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 458, in read_data
return self.read_data(writers_schema, s, decoder)
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 468, in read_data
return decoder.read_utf8()
  File
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg/avro/io.py,
line 233, in read_utf8
return unicode(self.read_bytes(), utf-8)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 543:
invalid start byte


On Thu, Feb 2, 2012 at 2:06 PM, Russell Jurney russell.jur...@gmail.com wrote:

 I am writing Avro records in Ruby using the avro ruby gem under Ruby 1.8.7.  I
 sometimes have problems loading these files back; as a result, I am unable
 to write large files that remain readable.

 The exception I get is below.  Anyone have an idea what this means?  It
 looks like Avro is having trouble parsing the schema.  The avro files parse
 in Ruby and Python, just not Pig.  Are there more rigorous checks in Java?

 Pig Stack Trace
 ---
 ERROR 2998: Unhandled internal error.
 org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

 java.lang.NoSuchMethodError:
 org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
 at org.apache.avro.Schema.<clinit>(Schema.java:82)
  at
 org.apache.pig.piggybank.storage.avro.AvroStorageUtils.<clinit>(AvroStorageUtils.java:49)
 at
 org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:163)
  at
 org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
 at
 org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:269)
  at
 org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150)
 at
 org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109)
  at
 org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
 at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218)
  at
 org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
 at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
  at
 org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
 at org.apache.pig.PigServer$Graph.compile(PigServer.java:1679)
  at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1610)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1582)
  at org.apache.pig.PigServer.registerQuery(PigServer.java:584)
 at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:942)
  at
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
 at
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
  at
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
  at org.apache.pig.Main.run(Main.java:495)
 at org.apache.pig.Main.main(Main.java:111)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method

Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python

2012-02-02 Thread Russell Jurney
A little bit more searching shows this:

http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/

On Thu, Feb 2, 2012 at 2:48 PM, Russell Jurney russell.jur...@gmail.com wrote:

 The jars being used are:

 REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
 REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
 REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
 REGISTER /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
 REGISTER /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar

 On Thu, Feb 2, 2012 at 2:41 PM, James Baldassari jbaldass...@gmail.com wrote:

 Hi Russell,

 I'm not sure about the Python error, but the Java error looks like a
 classpath problem, not a schema parsing issue.  The NoSuchMethodError in
 the stack trace indicates that Avro was trying to invoke a method in the
 Jackson library that wasn't present at run-time.  My guess is that your
 program (or Pig?) either has two incompatible versions of the Jackson
 library on its classpath or maybe Avro's Jackson dependency has been
 excluded and a version that is incompatible with Avro is on the classpath.

 Which version of Avro is being used?  Running 'mvn dependency:tree' in
 Avro trunk I see that it's depending on Jackson 1.8.6.  Can you verify that
 only one version of Jackson is on the classpath and that it's the version
 that is required by whatever version of Avro is on the classpath?

 -James
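James's diagnosis can be checked mechanically: scan every jar on the classpath for Jackson's JsonFactory class and see whether it turns up more than once. A sketch (the jar paths mirror the REGISTER lines in this thread and are assumptions about your layout):

```python
import glob
import zipfile

def jars_containing(class_entry, *jar_globs):
    """Return every jar matched by the globs that bundles `class_entry`."""
    hits = []
    for pattern in jar_globs:
        for jar in sorted(glob.glob(pattern)):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if class_entry in zf.namelist():
                        hits.append(jar)
            except (zipfile.BadZipFile, OSError):
                continue  # not a readable jar; skip it
    return hits

# Hypothetical locations, echoing the REGISTER lines above. More than one hit
# for JsonFactory means two Jacksons are competing on the classpath -- the
# setup that produces the NoSuchMethodError in this thread.
dupes = jars_containing(
    "org/codehaus/jackson/JsonFactory.class",
    "/me/pig/build/ivy/lib/Pig/*.jar",
    "/me/pig/contrib/piggybank/java/*.jar",
)
print(dupes)
```

Any jar listed twice-over is a candidate to remove or pin to the version Avro itself depends on.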




Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python

2012-02-02 Thread Russell Jurney
Further examination shows that the problematic emails I am encoding are
formatted in ISO-8859-1, not UTF-8.  That is why I am getting character
problems.  Looks like it is not an Avro problem after all.  Thanks!  :)
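For anyone hitting the same wall: the fix is to transcode before the value ever reaches the Avro writer, since an Avro "string" must be valid UTF-8. A small sketch, assuming ISO-8859-1 is the real source encoding of the offending input:

```python
def to_utf8_text(raw, source_encoding="iso-8859-1"):
    """Normalize arbitrary bytes to text that is safe for an Avro "string".

    Try strict UTF-8 first (already-clean data passes through untouched), then
    fall back to the declared source encoding -- here ISO-8859-1, the actual
    encoding of the problem emails in this thread.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(source_encoding, errors="replace")

# 0xa0 is a non-breaking space in ISO-8859-1 but an invalid UTF-8 start byte --
# the exact byte from the UnicodeDecodeError earlier in this thread.
print(to_utf8_text(b"caf\xe9 \xa0"))
```

Writers that apply this normalization up front never emit the mixed-encoding records that later crash every reader.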


Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python

2012-02-02 Thread Russell Jurney
Spoken too soon... this now happens no matter which Avros I load.  I can't
see that anything has changed regarding jars, etc.  Confused.

I think this happens when Avro is parsing the schema?

Pig Stack Trace
---
ERROR 2998: Unhandled internal error.
org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

java.lang.NoSuchMethodError:
org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
at org.apache.avro.Schema.<clinit>(Schema.java:82)
at
org.apache.pig.piggybank.storage.avro.AvroStorageUtils.<clinit>(AvroStorageUtils.java:49)
at
org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:163)
at
org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
at
org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:269)
at
org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150)
at
org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109)
at
org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218)
at
org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
at
org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
at org.apache.pig.PigServer$Graph.compile(PigServer.java:1679)
at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1610)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1582)
at org.apache.pig.PigServer.registerQuery(PigServer.java:584)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:942)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:495)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



Re: Problem with Pig AvroStorage, with Avros that work in Ruby and Python

2012-02-02 Thread Russell Jurney
Cleaned up my environment by unsetting HADOOP_HOME and removing some old
Jackson jars from my CLASSPATH, and Pig's AvroStorage works again.

Woot!

On Thu, Feb 2, 2012 at 3:47 PM, Russell Jurney russell.jur...@gmail.comwrote:

 Spoken too soon... this happens no matter what avros I load now.  I can't
 figure that anything has changed regarding jars, etc.  Confused.

 I think this happens when Avro is parsing the schema?

 Pig Stack Trace
 ---------------
 ERROR 2998: Unhandled internal error.
 org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;

 java.lang.NoSuchMethodError:
 org.codehaus.jackson.JsonFactory.enable(Lorg/codehaus/jackson/JsonParser$Feature;)Lorg/codehaus/jackson/JsonFactory;
 at org.apache.avro.Schema.<clinit>(Schema.java:82)
 at org.apache.pig.piggybank.storage.avro.AvroStorageUtils.<clinit>(AvroStorageUtils.java:49)
 at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:163)
 at org.apache.pig.piggybank.storage.avro.AvroStorage.getAvroSchema(AvroStorage.java:144)
 at org.apache.pig.piggybank.storage.avro.AvroStorage.getSchema(AvroStorage.java:269)
 at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150)
 at org.apache.pig.newplan.logical.relational.LOLoad.getSchema(LOLoad.java:109)
 at org.apache.pig.newplan.logical.visitor.LineageFindRelVisitor.visit(LineageFindRelVisitor.java:100)
 at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:218)
 at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
 at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
 at org.apache.pig.newplan.logical.visitor.CastLineageSetter.<init>(CastLineageSetter.java:57)
 at org.apache.pig.PigServer$Graph.compile(PigServer.java:1679)
 at org.apache.pig.PigServer$Graph.validateQuery(PigServer.java:1610)
 at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1582)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:584)
 at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:942)
 at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
 at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
 at org.apache.pig.Main.run(Main.java:495)
 at org.apache.pig.Main.main(Main.java:111)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 

 On Thu, Feb 2, 2012 at 2:53 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Further examination shows that the problematic emails I am encoding are
 formatted in ISO-8859-1, not UTF-8.  That is why I am getting character
 problems.  Looks like it is not an Avro problem after all.  Thanks!  :)
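 For the record, the fix on the Python side is to decode the raw bytes with
 their real source encoding before handing strings to Avro. A minimal sketch
 (the sample bytes here are hypothetical, not from the actual emails):

 ```python
 # ISO-8859-1 (Latin-1) bytes for "Café"; 0xe9 alone is not valid UTF-8.
 raw = b"Caf\xe9"

 # Decoding with the wrong codec fails...
 try:
     raw.decode("utf-8")
     decoded_as_utf8 = True
 except UnicodeDecodeError:
     decoded_as_utf8 = False  # ...which shows up as "character problems".

 # Decode with the actual source encoding, then re-encode as UTF-8.
 text = raw.decode("iso-8859-1")
 utf8 = text.encode("utf-8")
 ```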


 On Thu, Feb 2, 2012 at 2:49 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 A little bit more searching shows this:


 http://www.harshj.com/2010/04/25/writing-and-reading-avro-data-files-using-python/
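 The linked post walks through the Python read/write API. As a quick sanity
 check before digging into codec issues, an Avro object container file always
 starts with the 4-byte magic `Obj\x01` (per the Avro spec), which a few lines
 of stdlib Python can verify:

 ```python
 AVRO_MAGIC = b"Obj\x01"  # object container file magic per the Avro 1.x spec

 def is_avro_container(path):
     """Cheap check that a file is an Avro object container file."""
     with open(path, "rb") as f:
         return f.read(4) == AVRO_MAGIC
 ```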


 On Thu, Feb 2, 2012 at 2:48 PM, Russell Jurney russell.jur...@gmail.com
  wrote:

 The jars being used are:

 REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
 REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
 REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
 REGISTER /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
 REGISTER /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar

 On Thu, Feb 2, 2012 at 2:41 PM, James Baldassari jbaldass...@gmail.com
  wrote:

 Hi Russell,

 I'm not sure about the Python error, but the Java error looks like a
 classpath problem, not a schema parsing issue.  The NoSuchMethodError in
 the stack trace indicates that Avro was trying to invoke a method in the
 Jackson library that wasn't present at run-time.  My guess is that your
 program (or Pig?) either has two incompatible versions of the Jackson
 library on its classpath or maybe Avro's Jackson dependency has been
 excluded and a version that is incompatible with Avro is on the classpath.

 Which version of Avro is being used?  Running 'mvn dependency:tree' in
 Avro trunk I see that it's depending on Jackson 1.8.6.  Can you verify 
 that
 only one version of Jackson is on the classpath and that it's the version
 that is required by whatever version of Avro is on the classpath?
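 One stdlib-only way to check is to scan the REGISTERed jars for the Jackson
 classes and see whether more than one jar provides them. A sketch (the glob
 path is hypothetical; point it at your actual jar directories):

 ```python
 import zipfile

 def find_class(jar_paths, class_entry):
     """Return the jars that contain a given class entry.

     Useful for spotting duplicate copies of a library (e.g. two
     incompatible Jackson versions) on a classpath.
     """
     hits = []
     for jar in jar_paths:
         with zipfile.ZipFile(jar) as zf:
             if class_entry in zf.namelist():
                 hits.append(jar)
     return hits

 # Hypothetical usage against the jars REGISTERed earlier in the thread:
 # import glob
 # find_class(glob.glob("/me/pig/build/ivy/lib/Pig/*.jar"),
 #            "org/codehaus/jackson/JsonFactory.class")
 ```

 If the same class entry shows up in two jars, whichever loads first wins,
 which is exactly how a NoSuchMethodError like the one above arises.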

 -James



 On Thu, Feb 2, 2012 at 5:21 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 Correction: when I read the file in Python, I get the error below.
  It looks like a unicode problem?  Can one tell Avro how

AVRO-981 - Removed snappy as requirement

2012-01-23 Thread Russell Jurney
https://issues.apache.org/jira/browse/AVRO-981

I took Joe Crobak's advice and removed snappy as a dependency in the python
client for avro.  With the patch in AVRO-981 applied, Avro installs, builds
and functions on Mac OS X.

-- 
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com


Re: Python for Avro doesn't build on OS X 10.6.8. Stuck.

2011-12-23 Thread Russell Jurney
Avro 1.5.3 doesn't have this issue? Does it use python-snappy?

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com

On Dec 23, 2011, at 7:05 PM, Ken Krugler kkrugler_li...@transpac.com
wrote:

I installed brew, then ran 'brew install snappy', which worked.

But 'sudo python setup.py install' still fails, though with a different
problem (I was getting the same compilation errors as below, now I get a
broken pipe error).

running install
running bdist_egg
running egg_info
writing requirements to avro.egg-info/requires.txt
writing avro.egg-info/PKG-INFO
writing top-level names to avro.egg-info/top_level.txt
writing dependency_links to avro.egg-info/dependency_links.txt
reading manifest file 'avro.egg-info/SOURCES.txt'
writing manifest file 'avro.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.6-universal/egg
running install_lib
running build_py
creating build/bdist.macosx-10.6-universal/egg
creating build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/__init__.py ->
build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/datafile.py ->
build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/io.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/ipc.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/protocol.py ->
build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/schema.py ->
build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/tool.py -> build/bdist.macosx-10.6-universal/egg/avro
copying build/lib/avro/txipc.py ->
build/bdist.macosx-10.6-universal/egg/avro
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/__init__.py to
__init__.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/datafile.py to
datafile.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/io.py to io.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/ipc.py to ipc.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/protocol.py to
protocol.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/schema.py to
schema.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/tool.py to
tool.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/avro/txipc.py to
txipc.pyc
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO
installing scripts to build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
copying build/scripts-2.6/avro ->
build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
changing mode of
build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts/avro to 755
copying avro.egg-info/PKG-INFO ->
build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/SOURCES.txt ->
build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/dependency_links.txt ->
build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/requires.txt ->
build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying avro.egg-info/top_level.txt ->
build/bdist.macosx-10.6-universal/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist/avro-1.6.1-py2.6.egg' and adding
'build/bdist.macosx-10.6-universal/egg' to it
removing 'build/bdist.macosx-10.6-universal/egg' (and everything under it)
Processing avro-1.6.1-py2.6.egg
Removing /Library/Python/2.6/site-packages/avro-1.6.1-py2.6.egg
Copying avro-1.6.1-py2.6.egg to /Library/Python/2.6/site-packages
avro 1.6.1 is already the active version in easy-install.pth
Installing avro script to /usr/local/bin

Installed /Library/Python/2.6/site-packages/avro-1.6.1-py2.6.egg
Processing dependencies for avro==1.6.1
Searching for python-snappy
Reading http://pypi.python.org/simple/python-snappy/
Reading http://github.com/andrix/python-snappy
Best match: python-snappy 0.3.2
Downloading
http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
Processing python-snappy-0.3.2.tar.gz
Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
/tmp/easy_install-jARda8/python-snappy-0.3.2/egg-dist-tmp-DDcMTW
cc1plus: warning: command line option -Wstrict-prototypes is valid for
C/ObjC but not for C++
snappymodule.cc:41: warning: ‘_state’ defined but not used
/usr/libexec/gcc/powerpc-apple-darwin10/4.2.1/as: assembler
(/usr/bin/../libexec/gcc/darwin/ppc/as or
/usr/bin/../local/libexec/gcc/darwin/ppc/as) for architecture ppc not
installed
Installed assemblers are:
/usr/bin/../libexec/gcc/darwin/x86_64/as for architecture x86_64
/usr/bin/../libexec/gcc/darwin/i386/as for architecture i386
cc1plus: warning: command line option -Wstrict-prototypes is valid for
C/ObjC but not for C++
snappymodule.cc:41: warning: ‘_state’ defined but not used
snappymodule.cc:246: fatal error: error writing to -: Broken pipe
compilation terminated.
cc1plus: warning: command line option -Wstrict-prototypes is valid for
C/ObjC but not for C++
snappymodule.cc:41

Python for Avro doesn't build on OS X 10.6.8. Stuck.

2011-12-14 Thread Russell Jurney
I am unable to build Avro for Python on OS X 10.6.8 because python-snappy
fails to build.  Is there a way around this?

I wrote a post here
http://datasyndrome.com/post/13707537045/booting-the-analytics-application-events-ruby
about how to use Avro with Pig and Ruby.  The code is here:
https://github.com/rjurney/Booting-the-Analytics-Application  I am trying
to create Python instructions that parallel the Ruby ones.

I've tried multiple install methods, and I am unable to build
snappy-python, so the avro build fails.  This is the error message:

Installed
/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg
Processing dependencies for avro==-AVRO-VERSION-
Searching for python-snappy
Reading http://pypi.python.org/simple/python-snappy/
Reading http://github.com/andrix/python-snappy
Best match: python-snappy 0.3.2
Downloading
http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
Processing python-snappy-0.3.2.tar.gz
Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
/tmp/easy_install-0KEFAv/python-snappy-0.3.2/egg-dist-tmp-1BPj3j
cc1plus: warning: command line option -Wstrict-prototypes is valid for
C/ObjC but not for C++
snappymodule.cc:31:22: error: snappy-c.h: No such file or directory
snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*,
PyObject*)’:
snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope
snappymodule.cc:62: error: expected `;' before ‘status’
snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not declared
in this scope
snappymodule.cc:79: error: ‘status’ was not declared in this scope
snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this scope
snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*,
PyObject*)’:
snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope
snappymodule.cc:107: error: expected `;' before ‘status’
snappymodule.cc:120: error: ‘status’ was not declared in this scope
snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared
in this scope
snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this
scope
snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc: In function ‘PyObject*
snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:
snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope
snappymodule.cc:151: error: expected `;' before ‘status’
snappymodule.cc:156: error: ‘status’ was not declared in this scope
snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not
declared in this scope
snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope
snappymodule.cc: At global scope:
snappymodule.cc:41: warning: ‘_state’ defined but not used
error: Setup script exited with error: command '/usr/bin/gcc-4.2' failed
with exit status 1

-- 
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com


Re: Python for Avro doesn't build on OS X 10.6.8. Stuck.

2011-12-14 Thread Russell Jurney
I installed the snappy macport, and I still can't get snappy-python to
build.  I'll try switching to brew.

On Wed, Dec 14, 2011 at 5:46 PM, Joe Crobak joec...@gmail.com wrote:

 It appears that the python-snappy library assumes that the snappy library
 has already been installed.  I saw similar errors, but once I installed
 snappy (I use homebrew, so 'brew install snappy') python-snappy installs
 fine.

 HTH,
 Joe


 On Wed, Dec 14, 2011 at 8:23 PM, Russell Jurney 
 russell.jur...@gmail.com wrote:

 I am unable to build Avro for Python on OS X 10.6.8 because python-snappy
 fails to build.  Is there a way around this?

 I wrote a post here
 http://datasyndrome.com/post/13707537045/booting-the-analytics-application-events-ruby
 about how to use Avro with Pig and Ruby.  The code is here:
 https://github.com/rjurney/Booting-the-Analytics-Application  I am
 trying to create Python instructions that parallel the Ruby ones.

 I've tried multiple install methods, and I am unable to build
 snappy-python, so the avro build fails.  This is the error message:

 Installed
 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/avro-_AVRO_VERSION_-py2.6.egg
 Processing dependencies for avro==-AVRO-VERSION-
 Searching for python-snappy
 Reading http://pypi.python.org/simple/python-snappy/
 Reading http://github.com/andrix/python-snappy
 Best match: python-snappy 0.3.2
 Downloading
 http://pypi.python.org/packages/source/p/python-snappy/python-snappy-0.3.2.tar.gz#md5=94ec3eb54a780fac3b15a6c141af973f
 Processing python-snappy-0.3.2.tar.gz
 Running python-snappy-0.3.2/setup.py -q bdist_egg --dist-dir
 /tmp/easy_install-0KEFAv/python-snappy-0.3.2/egg-dist-tmp-1BPj3j
 cc1plus: warning: command line option -Wstrict-prototypes is valid for
 C/ObjC but not for C++
 snappymodule.cc:31:22: error: snappy-c.h: No such file or directory
 snappymodule.cc: In function ‘PyObject* snappy__compress(PyObject*,
 PyObject*)’:
 snappymodule.cc:62: error: ‘snappy_status’ was not declared in this scope
 snappymodule.cc:62: error: expected `;' before ‘status’
 snappymodule.cc:75: error: ‘snappy_max_compressed_length’ was not
 declared in this scope
 snappymodule.cc:79: error: ‘status’ was not declared in this scope
 snappymodule.cc:79: error: ‘snappy_compress’ was not declared in this
 scope
 snappymodule.cc:81: error: ‘SNAPPY_OK’ was not declared in this scope
 snappymodule.cc: In function ‘PyObject* snappy__uncompress(PyObject*,
 PyObject*)’:
 snappymodule.cc:107: error: ‘snappy_status’ was not declared in this scope
 snappymodule.cc:107: error: expected `;' before ‘status’
 snappymodule.cc:120: error: ‘status’ was not declared in this scope
 snappymodule.cc:120: error: ‘snappy_uncompressed_length’ was not declared
 in this scope
 snappymodule.cc:121: error: ‘SNAPPY_OK’ was not declared in this scope
 snappymodule.cc:128: error: ‘snappy_uncompress’ was not declared in this
 scope
 snappymodule.cc:129: error: ‘SNAPPY_OK’ was not declared in this scope
 snappymodule.cc: In function ‘PyObject*
 snappy__is_valid_compressed_buffer(PyObject*, PyObject*)’:
 snappymodule.cc:151: error: ‘snappy_status’ was not declared in this scope
 snappymodule.cc:151: error: expected `;' before ‘status’
 snappymodule.cc:156: error: ‘status’ was not declared in this scope
 snappymodule.cc:156: error: ‘snappy_validate_compressed_buffer’ was not
 declared in this scope
 snappymodule.cc:157: error: ‘SNAPPY_OK’ was not declared in this scope
 snappymodule.cc: At global scope:
 snappymodule.cc:41: warning: ‘_state’ defined but not used
 error: Setup script exited with error: command '/usr/bin/gcc-4.2' failed
 with exit status 1

 --
 Russell Jurney
 twitter.com/rjurney
 russell.jur...@gmail.com
 datasyndrome.com





-- 
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com