Re: deserializing nested protobufs

2012-04-03 Thread Benjamin Juhn
Looks like it's covered: public ProtobufBytesToTuple(TypeRef typeRef, ProtobufExtensionRegistry extensionRegistry) { ... } Thanks, Ben On Apr 3, 2012, at 4:41 PM, Raghu Angadi wrote: > Extensions are not supported yet. There is a patch pending: > https://github.com/kevinweil/elephant-bird/pull/143
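
For context, a minimal sketch of how the patched UDF might be wired into a script. The proto and registry class names are hypothetical, and the two-string-argument constructor form is an assumption extrapolated from the Java signature quoted above, not a confirmed API:

    -- Hypothetical class names; the string-argument constructor form is
    -- an assumption based on the Java signature above.
    DEFINE ProtoToTuple com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple(
        'com.example.proto.Event',
        'com.example.proto.EventExtensionRegistry');

    raw     = LOAD '/data/events' AS (body:bytearray);
    decoded = FOREACH raw GENERATE ProtoToTuple(body);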

Re: deserializing nested protobufs

2012-04-03 Thread Raghu Angadi
Extensions are not supported yet. There is a patch pending: https://github.com/kevinweil/elephant-bird/pull/143 Can you check if that covers your use case? On Tue, Apr 3, 2012 at 4:32 PM, Benjamin Juhn wrote: > Thanks Dmitriy. Doesn't look like that class supports extensions. Am I > missing something?

Re: deserializing nested protobufs

2012-04-03 Thread Benjamin Juhn
Thanks Dmitriy. Doesn't look like that class supports extensions. Am I missing something? - Ben On Mar 27, 2012, at 10:01 PM, Dmitriy Ryaboy wrote: > I think you want ProtobufBytesToTuple > (https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank…

Re: Is it possible to use Pig streaming (StreamToPig) in a way that handles multiple lines as a single input tuple?

2012-04-03 Thread Raghu Angadi
Why not pipe the multi-line XML from the executable through another script that understands it? On Wed, Mar 28, 2012 at 8:24 AM, Ahmed Sobhi wrote: > I'm streaming data in a pig script through an executable that returns an > xml fragment for each line of input I stream to it. That xml fragment > hap…
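
A sketch of that suggestion: have a wrapper run the real executable and join each multi-line XML fragment onto a single line before handing it back to Pig. The script name here is a hypothetical placeholder:

    -- 'flatten_xml.sh' is a hypothetical wrapper that runs the real
    -- executable and joins each XML fragment onto one output line.
    DEFINE xml_cmd `flatten_xml.sh` SHIP('flatten_xml.sh');

    lines = LOAD '/input/data.txt' AS (line:chararray);
    -- Each record of 'frags' is now one complete XML fragment per line.
    frags = STREAM lines THROUGH xml_cmd AS (xml:chararray);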

Re: Compressing output using block compression

2012-04-03 Thread Raghu Angadi
SequenceFileStorage in elephant-bird lets you load from and store to sequence files. If your input is text lines, you can store each line as the 'value'. You can experiment with different codecs. Depending on your use case, simple bzip2 files may not be a bad choice. On Tue, Apr 3, 2012 at 1:57 PM, Mohit…
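
A minimal sketch of that store path; the two-converter constructor follows elephant-bird's documented pattern, but the classes and paths below should be checked against the version in use:

    REGISTER 'elephant-bird.jar';  -- jar path/version elided

    lines = LOAD '/input/logs' AS (line:chararray);
    -- SequenceFileStorage expects (key, value) tuples; here the line is
    -- stored as both key and value using Text converters.
    pairs = FOREACH lines GENERATE line AS key, line AS value;
    STORE pairs INTO '/output/seq'
        USING com.twitter.elephantbird.pig.store.SequenceFileStorage(
            '-c com.twitter.elephantbird.pig.util.TextConverter',
            '-c com.twitter.elephantbird.pig.util.TextConverter');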

Re: Compressing output using block compression

2012-04-03 Thread Mohit Anchlia
Thanks for the examples. It appears that Snappy is not splittable and the suggested approach is to write to sequence files. I know how to load from sequence files, but in Pig I can't find a way to write to sequence files using Snappy compression. On Tue, Apr 3, 2012 at 1:30 PM, Prashant Kommireddi…
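
For reference, block-compressed output can be requested from the script itself via the Hadoop 1.x job properties; this sketch assumes the Snappy native libraries are installed on the cluster:

    -- Hadoop 1.x output-compression properties, set from the Pig script.
    -- Assumes the Snappy native libraries are installed on the cluster.
    SET mapred.output.compress 'true';
    SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.SnappyCodec';
    -- BLOCK compression is what keeps the resulting sequence files splittable.
    SET mapred.output.compression.type 'BLOCK';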

Re: Compressing output using block compression

2012-04-03 Thread Prashant Kommireddi
Does it mean Snappy is splittable? http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/ If so, then how can I use it in Pig? http://hadoopified.wordpress.com/2012/01/24/snappy-compression-with-pig/ On Tue, Apr 3, 2012 at 1:02 PM, Mohit Anchlia wrote: > I am currently using Snappy in sequence…

Re: Compressing output using block compression

2012-04-03 Thread Mohit Anchlia
I am currently using Snappy in sequence files. I wasn't aware Snappy uses block compression. Does that mean Snappy is splittable? If so, then how can I use it in Pig? Thanks again. On Tue, Apr 3, 2012 at 12:42 PM, Prashant Kommireddi wrote: > Most companies handling big data use LZO, a few have start…

Re: Compressing output using block compression

2012-04-03 Thread Prashant Kommireddi
Most companies handling big data use LZO, and a few have started exploring/using Snappy as well (which is not any easier to configure). These are the two splittable fast-compression algorithms. Note that Snappy is not space-efficient compared to gzip or other compression algorithms, but it is a lot faster (ideal for…
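
As a sketch of the LZO path with elephant-bird's LzoPigStorage (mentioned later in this thread); this assumes the hadoop-lzo native libraries are installed, and the output normally needs LZO indexing before a later job can split it:

    REGISTER 'elephant-bird.jar';  -- jar path/version elided

    data = LOAD '/input/data' AS (line:chararray);
    -- Writes LZO-compressed text. Requires the hadoop-lzo native
    -- libraries; files usually need to be indexed afterwards so that
    -- downstream jobs can split them.
    STORE data INTO '/output/lzo'
        USING com.twitter.elephantbird.pig.store.LzoPigStorage();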

Re: Compressing output using block compression

2012-04-03 Thread Mohit Anchlia
Thanks for your input. It looks like it's some work to configure LZO. What are the other alternatives? We read new sequence files and generate output continuously. What are my options? Should I split the output into small pieces and gzip them? How do people solve similar problems where there is cont…
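
One low-setup alternative that matches the bzip2 suggestion later in this thread: PigStorage compresses its output when the output path ends in .bz2 (or .gz), and bzip2 output remains splittable for downstream jobs, at the cost of slow (de)compression. The paths here are illustrative:

    -- PigStorage picks the codec from the output path's extension;
    -- .bz2 output stays splittable, though (de)compression is slow.
    data = LOAD '/input/current' AS (line:chararray);
    STORE data INTO '/output/batch-0001.bz2' USING PigStorage();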

Re: Improve Performance of Pig script

2012-04-03 Thread sonia gehlot
Actually, I don't expect lots of rows; there should be only one row in the output. I will try GROUP rather than DISTINCT. On Tue, Apr 3, 2012 at 12:09 PM, Jonathan Coveney wrote: > Whoops, hit enter. Just to see: how long does it take if you just store h? > > 2012/4/3 Jonathan Coveney

Re: Improve Performance of Pig script

2012-04-03 Thread Jonathan Coveney
Whoops, hit enter. Just to see: how long does it take if you just store h? 2012/4/3 Jonathan Coveney > Point 1: doing DUMP is dangerous, depending on how many rows you expect in > the relation; you're going to serialize every row in the output to your > console. > Point 2: the issue is that you're…

Re: Improve Performance of Pig script

2012-04-03 Thread Jonathan Coveney
Point 1: doing DUMP is dangerous, depending on how many rows you expect in the relation; you're going to serialize every row in the output to your console. Point 2: the issue is that you're doing a nested DISTINCT. This is done in memory, and for large data sets it can be quite slow. The scalable solut…
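
The scalable rewrite being alluded to replaces the in-memory nested DISTINCT with a top-level DISTINCT that runs as its own parallel job. A sketch with illustrative relation and field names:

    events = LOAD '/data/events' AS (day:chararray, user_id:chararray);

    -- Instead of a nested DISTINCT inside a FOREACH, which deduplicates
    -- each group in memory:
    --
    --   grouped = GROUP events BY day;
    --   counts  = FOREACH grouped {
    --               u = DISTINCT events.user_id;
    --               GENERATE group AS day, COUNT(u) AS uniques;
    --             }
    --
    -- ...project, deduplicate at the top level (parallel, not in-memory),
    -- then group and count:
    pairs  = FOREACH events GENERATE day, user_id;
    uniq   = DISTINCT pairs;
    by_day = GROUP uniq BY day;
    counts = FOREACH by_day GENERATE group AS day, COUNT(uniq) AS uniques;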

Re: Compressing output using block compression

2012-04-03 Thread Prashant Kommireddi
Yes, it is splittable. Bzip2 consumes a lot of CPU in decompression. With Hadoop jobs generally being IO-bound, bzip2 can sometimes become the performance bottleneck due to its slow decompression rate (the algorithm is unable to decompress at disk read rate). On Tue, Apr 3, 2012 at 11:…

Problem after loading Pig on Windows 7

2012-04-03 Thread yogesh edekar
Hi, I use the Windows 7 operating system. I have recently started working on Pig and Hadoop and have no previous experience with either. I have installed Cygwin so that I can make Hadoop and Pig work on my Windows system. I have untarred Pig 0.9.2 as well as Hadoop 1.0.0. I have set envi…

Re: Compressing output using block compression

2012-04-03 Thread Mohit Anchlia
Is bzip2 not advisable? I think it can split too and is supported out of the box. On Thu, Mar 29, 2012 at 8:08 PM, 帝归 wrote: > When I use LzoPigStorage, it will load all files under a directory. But I > want to compress every file under a directory and keep the file names > unchanged, just with a .l…

Re: Improve Performance of Pig script

2012-04-03 Thread sonia gehlot
Thanks, guys. This is the Pig script I am running. The dataset is also small for the filtered date, around 2 million rows, but I am aiming to write this script for a larger scope. Here, titles is an array of JSON objects stored as a string datatype, so I am using a Python UDF to split it into c…
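
Sonia's script itself is cut off above; as a minimal sketch of the Jython-UDF approach she describes, with a hypothetical file name, function name, and fields:

    -- Hypothetical stand-ins for the actual script: 'split_titles.py',
    -- its split_titles() function, and the field names.
    REGISTER 'split_titles.py' USING jython AS title_udf;

    rows = LOAD '/data/titles' AS (id:chararray, titles:chararray);
    -- 'titles' holds a JSON array serialized as a string; the UDF parses
    -- it and returns a bag of title tuples, flattened to one row each.
    split = FOREACH rows GENERATE id, FLATTEN(title_udf.split_titles(titles));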

Re: Trying to store a bag of tuples using AvroStorage.

2012-04-03 Thread Dan Young
Doooh... thank you for pointing that out... I thought I ran that through jsonlint... That seemed to fix it. Regards, Dano On Tue, Apr 3, 2012 at 12:11 PM, Bill Graham wrote: > In the schema approach the error is that your json is invalid. You're > missing a second '}' before the last ']'

Re: Trying to store a bag of tuples using AvroStorage.

2012-04-03 Thread Bill Graham
In the schema approach, the error is that your JSON is invalid: you're missing a second '}' before the last ']'. On Tue, Apr 3, 2012 at 10:32 AM, Dan Young wrote: > I just updated my pig from svn repo and now am using the latest from trunk: > > pig -i > Apache Pig version 0.11.0-SNAPSHOT (r1309…
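
For reference, a balanced sketch of the kind of schema argument at issue, storing a bag of tuples as an Avro array of records. The field names are illustrative, 'recs' stands in for the relation being stored, and the single-JSON-argument form should be checked against the piggybank AvroStorage version in use:

    -- Note the '}' closing the inner record before the ']' that closes
    -- its fields list; 'recs' is assumed to carry a bag of (id, name)
    -- tuples matching the schema below.
    STORE recs INTO '/output/avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage(
            '{"schema": {"type": "record", "name": "rec", "fields": [{"name": "items", "type": {"type": "array", "items": {"type": "record", "name": "item", "fields": [{"name": "id", "type": "int"}, {"name": "name", "type": "string"}]}}}]}}');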

Re: Trying to store a bag of tuples using AvroStorage.

2012-04-03 Thread Dan Young
I just updated my Pig from the svn repo and am now using the latest from trunk: pig -i Apache Pig version 0.11.0-SNAPSHOT (r1309051) compiled Apr 03 2012, 11:18:53 Here's the gist with stack traces, both with and without specifying a schema. I'm using piggybank from trunk. https://gist.github.com/22939…

Re: Trying to store a bag of tuples using AvroStorage.

2012-04-03 Thread Dan Young
Here's the version of Pig I'm using: pig -i Apache Pig version 0.11.0-SNAPSHOT (r1304979) compiled Mar 24 2012, 21:48:44 The version of Hadoop: *Version:* 1.0.0, r1214675 Regards, Dan On Tue, Apr 3, 2012 at 11:07 AM, Russell Jurney wrote: > This looks like a bug fixed in 0.10. Mind trying it?

Re: Trying to store a bag of tuples using AvroStorage.

2012-04-03 Thread Russell Jurney
This looks like a bug fixed in 0.10. Mind trying it? Russell Jurney http://datasyndrome.com On Apr 3, 2012, at 9:13 AM, Dan Young wrote: > Hello Stan, > > I'm back from Mexico now, and here's my gist with all the information. > > https://gist.github.com/2293226 > > Any insight into what I'm not doing correctly would be greatly appreciated.

Re: Trying to store a bag of tuples using AvroStorage.

2012-04-03 Thread Dan Young
Hello Stan, I'm back from Mexico now, and here's my gist with all the information. https://gist.github.com/2293226 Any insight into what I'm not doing correctly would be greatly appreciated. Regards, Dan On Mon, Mar 26, 2012 at 9:11 AM, Stan Rosenberg wrote: > Hi Dan, > > Could you attach yo…

Sync marker issue while reading Avro files written with Flume using Pig

2012-04-03 Thread Markus Resch
Hey everyone, we're facing a problem while reading Avro files written with Flume (using the Avro Java API 1.5.4) into a Hadoop cluster. The Avro data store complains about a missing sync marker. Investigating the problem shows that this is exactly right: the sync marker is missing. Thus we have a bloc…