Re: LOAD function vs. UDF eval

2012-05-29 Thread Raghu Angadi
I would still use a UDF; it is a lot more flexible. Passing a large number of ids to the loader is part of the problem. Your UDF would take a bag of ids and return bag{(session, events:bag{})}. You can pass the bag of ids in various ways: - load ids as a relation, group all to put all of them in

Re: Re: RCfile

2012-05-25 Thread Raghu Angadi
lustering performance. > > Best Regards > > Malone > > > 2012-05-25 > > > > yingnan.ma > > > > From: Raghu Angadi > Sent: 2012-05-25 02:26:58 > To: user > Cc: > Subject: Re: RCfile > > another option is > RCFilePigStorage.java< > https

Re: RCfile

2012-05-24 Thread Raghu Angadi
another option is RCFilePigStorage.java in Elephantbird. It is a drop-in replacement for default PigStorage and simple to use. details on IO problem you want to fix?

Re: Writing to rcfile

2012-05-23 Thread Raghu Angadi
Elephantbird has support for RCFile storage. The current version supports storing Thrift and Protobufs. You can try the prototype implementation RCFilePigStorage.java, it can be

Re: Problem loading sequence files with Elephant Bird

2012-05-21 Thread Raghu Angadi
'AS' is almost always dangerous. The loader already has a schema. Use a projection if you want to rename them. On Fri, May 18, 2012 at 4:07 PM, Chris Diehl wrote: > With a little bit of luck, we managed to find an answer. > > Turns out we needed to remove the cast from key and run the script in

Re: Registering *.jar

2012-05-18 Thread Raghu Angadi
if you still want to use 8.1, you can apply 0.8 patch from https://issues.apache.org/jira/browse/PIG-2142 On Wed, May 16, 2012 at 5:04 PM, Mohit Anchlia wrote: > .8.1 > > On Wed, May 16, 2012 at 4:58 PM, Prashant Kommireddi >wrote: > > > What version are you using? > > > > Sent from my iPhone >

Re: deserializing nested protobufs

2012-04-03 Thread Raghu Angadi
Extensions are not supported yet. There is a patch pending: https://github.com/kevinweil/elephant-bird/pull/143 Can you check if that covers your use case? On Tue, Apr 3, 2012 at 4:32 PM, Benjamin Juhn wrote: > Thanks Dmitriy. Doesn't look like that class supports extensions. Am I > missing s

Re: Is it possible to use Pig streaming (StreamToPig) in a way that handles multiple lines as a single input tuple?

2012-04-03 Thread Raghu Angadi
why not pipe multi-line xml from the executable through another script that understands it? On Wed, Mar 28, 2012 at 8:24 AM, Ahmed Sobhi wrote: > I'm streaming data in a pig script through an executable that returns an > xml fragment for each line of input I stream to it. That xml fragment > hap

Re: Compressing output using block compression

2012-04-03 Thread Raghu Angadi
SequenceFileStorage in elephant-bird lets you load and store to sequence files. If your input is text lines, you can store each line as 'value'. You can experiment with different codecs. depending on your use case, simple bzip2 files may not be a bad choice. On Tue, Apr 3, 2012 at 1:57 PM, Mohit

most high profile user

2012-03-21 Thread Raghu Angadi
unbelievable! https://twitter.com/#!/mcuban/status/182273293347328000 anyone has more scoop on this?

Re: pig and hbase integration = hanging jobs

2012-03-07 Thread Raghu Angadi
> More information at: http://hadoop1:50030/jobdetails.jsp?jobid=job_201203071602_0001 Did you check that? From that link you can also navigate to the output from the mapper task. Did you create the "info" column family in the table? On We

Re: Best practice for DB connection

2012-03-07 Thread Raghu Angadi
On Tue, Mar 6, 2012 at 5:02 PM, Mark Kerzner wrote: > Hi, > > I need to initialize the HBase connection, which I normally do in > configure() in the Mapper, and then my mapper uses it. How do I do it in > Pig? > > I am ready to define a UDF that will return a handle, but is it a best > practice? >

Re: HBaseStorage STORE method comparison

2012-03-07 Thread Raghu Angadi
The fastest might be to use local mode, and avoid even the first map-only job :) You are right, for 10 keys it does not really matter. Even doing 1000s of updates to the same row in #2 is still an in-memory update for HBase. The actual cost of HBase put() is probably slightly high for #2, but it is a n

Re: LZO support for Pig-0.9.1

2012-01-23 Thread Raghu Angadi
btw, for simple use cases, you can load/store lzo files using PigStorage(). While stor Only disadvantage is that files are not splittable. grunt> set output.compression.enabled true; grunt> set output.compression.codec com.hadoop.compression.lzo.LzopCodec; grunt> x = load 'x.txt' using PigStorage

Re: Multithreaded UDF

2011-11-09 Thread Raghu Angadi
Oh, this is much better than the custom loader hack I mentioned to batch up input tuples. On Wed, Nov 9, 2011 at 12:22 PM, Mridul Muralidharan wrote: > > A simple solution would be to tag each tuple with a random number (such > that each number has multiple url's associated with it - but not too larg
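
The batching idea quoted in this snippet can be sketched outside Pig. What follows is a minimal Python model, not Pig code: `tag_with_bucket`, `group_by_bucket`, `fetch_length`, `process_bucket`, and the bucket and worker counts are hypothetical names and values chosen for illustration, and the thread pool is only an assumption about how a UDF receiving a bag of urls might overlap its IO waits.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def tag_with_bucket(urls, num_buckets, seed=0):
    """Tag each url with a random bucket id so each bucket holds several urls."""
    rng = random.Random(seed)
    return [(rng.randrange(num_buckets), url) for url in urls]


def group_by_bucket(tagged):
    """Model of grouping on the random tag: bucket id -> list of urls."""
    groups = defaultdict(list)
    for bucket, url in tagged:
        groups[bucket].append(url)
    return dict(groups)


def fetch_length(url):
    """Hypothetical stand-in for the slow, IO-bound per-url work."""
    return len(url)


def process_bucket(urls, max_workers=4):
    """Handle one bucket's urls concurrently, as a UDF taking a bag could."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_length, urls))


urls = ["http://example.com/%d" % i for i in range(20)]
groups = group_by_bucket(tag_with_bucket(urls, num_buckets=4))
results = {bucket: process_bucket(batch) for bucket, batch in groups.items()}
```

Keeping the number of buckets small relative to the number of urls gives each call a batch large enough to keep the threads busy, which is the trade-off the quoted message hints at.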

Re: Multithreaded UDF

2011-11-09 Thread Raghu Angadi
Assuming 1-5 seconds is mainly waiting for IO, using multiple reducers or mappers might not be suitable since it just takes too many mapper and reducer slots. Couple of options: 1. use streaming : you have full control on how many you handle at a time. Might be tricky to pass url content. 2. a ha

Re: Question on custom store function

2011-11-08 Thread Raghu Angadi
at > > file:/user/bbuda/ids/_temporary/_attempt_local_0001_r_00_0/prefix-W/part-r-0 > final output stores at > > file:/user/bbuda/ids/_temporary/_attempt_local_0001_r_00_0/prefix-X/part-r-0 > final output stores at > > file:/user/bbuda/ids/_temporar

Re: Question on custom store function

2011-11-04 Thread Raghu Angadi
You need to set output path to '/Users/felix/Documents/pig/multi_store_output' in your setStoreLocation(). Alternately for clarity, you could modify your store udf to be more like: store load_log INTO '/Users/felix/Documents/pig/multi_store_output' using MyMultiStorage('ns_{0}/site_{1}', '2,1', '1,

Re: Sequence File Loader

2011-11-04 Thread Raghu Angadi
SequenceFileLoader in ElephantBird is very generic. Lets you load/store any writables. https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java For arbitrary Writable, you can use "GenericWritableConverter" and it returns raw writab

Re: Snappy Compression Json Data

2011-10-14 Thread Raghu Angadi
if 'STORE' worked, LOAD should work fine too. On Thu, Oct 13, 2011 at 6:29 PM, Cameron Gandevia wrote: > Hi > > I currently have a bunch of data in json format in hdfs. I would like to > use > pig to load it dedupe it and store it back using snappy compression. > > Currently I do something like t

Re: Unregistering jars

2011-10-06 Thread Raghu Angadi
what do you mean by 'redeploy a jar'? only place where the jar is mentioned is in 'register' statement, right? in that case, restarting pig should certainly work. Raghu. On Wed, Oct 5, 2011 at 6:00 PM, Eric Czech wrote: > Hi, I'm having trouble updating jar files containing udf's. In my testin

Re: outputSchema for UDF EvalFunc returning a DataBag

2011-10-05 Thread Raghu Angadi
"double" ... >"chararray" ... >"bytearray" ... >"int" ... >"long" ... >"float" ... >"double" ... >"chararray" ... > "bytearray" ... > > Two question

Re: outputSchema for UDF EvalFunc returning a DataBag

2011-10-03 Thread Raghu Angadi
> I was wondering if it would be practical to reuse whatever code the > front-end uses to parse schema descriptions from load statements in > scripts. Is this a silly idea? If it isn't silly, does anyone know > where I need to look for that code? > > > On 3 October 2011 22:56, R

Re: outputSchema for UDF EvalFunc returning a DataBag

2011-10-03 Thread Raghu Angadi
my understanding is that Pig 0.8 expects the first form and Pig 0.9 requires the second. Raghu. On Mon, Oct 3, 2011 at 8:27 AM, Andrew Clegg wrote: > Hi, > > When you have a UDF that returns a bag, and you're writing the > outputSchema method, do you have to explicitly include the mandatory > 'c

Re: protobuf without LZO in elephant-bird

2011-09-23 Thread Raghu Angadi
What format is the file in? You can use ProtobufBytesToTuple() udf to convert protobuf bytes to a PIG tuple. e.g.: define ToTuple com.twitter.elephantbird.pig.piggybank.ProtobufBytesToTuple('com.example.protobufs.Table.User'); users = foreach binary generate ToTuple(user_bytes); describe users

Re: JOIN fails with Index Out Of Bounds Error

2011-09-20 Thread Raghu Angadi
Your script or a simple script that shows the problem would help. On Tue, Sep 20, 2011 at 7:29 AM, Eli Finkelshteyn wrote: > Nope, just a simple inner join. > > > On 9/19/11 7:48 PM, Raghu Angadi wrote: > >> Do you have a FLATTEN() involved? FLATTEN(null) can cause IndexOut

Re: JOIN fails with Index Out Of Bounds Error

2011-09-19 Thread Raghu Angadi
Do you have a FLATTEN() involved? FLATTEN(null) can cause IndexOutOfBounds exception. ( if that is the case, see http://www.mail-archive.com/user@pig.apache.org/msg02275.html ) On Fri, Sep 16, 2011 at 8:04 AM, Eli Finkelshteyn wrote: > Hi, > When doing an inner join on a column where some values

Re: HBase get from within UDF vs. PIG FILTER

2011-08-19 Thread Raghu Angadi
A UDF could be faster for some of the accesses. We do use a lookup UDF in some of the scripts. Looking up 6% of the rows might be a bit high for some tables. Raghu. On Fri, Aug 19, 2011 at 9:16 AM, Norbert Burger wrote: > I have a need within a larger Pig script to pull just a few records from an > H

Re: Pig 0.9.0 has been released!

2011-07-31 Thread Raghu Angadi
Great to see major user-facing features. Thanks guys. Will we see some standard macros (e.g. rowcount()) similar to standard UDFs? Even rowcount may not be trivial for a casual user to do correctly. Should the rowcount() example in the blog use COUNT_STAR() rather than COUNT()? Raghu. On Fri, Jul

Re: Understand Schema after a Join

2011-07-29 Thread Raghu Angadi
Is implicit scalar conversion going to stay in PIG? My preference would be to make it explicit, like SCALAR(ACCT.year). On Fri, Jul 29, 2011 at 11:00 AM, Raghu Angadi wrote: > > GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days); > > should be GENERATE ACCT::year,

Re: Understand Schema after a Join

2011-07-29 Thread Raghu Angadi
> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days); should be GENERATE ACCT::year, ACCT::month ... etc. It is a common mistake to use '.' instead of '::'. I wish the error message were more user friendly. PIG supports 'scalars' and assumes your ACCT would be a single row

Re: Blocking issue with HBase 0.90.3 and PIG 0.8.1

2011-07-27 Thread Raghu Angadi
Vincent, is the behavior random or the same each time? Couple of things to narrow it down.. - attach the entire console output from PIG run when this happened. - only load start_sessions and end_sessions and store them.. - load the data from tables from previous step and run the same pig co

Re: conditional and multiple generate inside foreach?

2011-07-23 Thread Raghu Angadi
I see 3 independent questions : 1. How can we pass the entire row tuple to a UDF as 'B = FOREACH A GENERATE myudf(A)', without knowing the schema? I don't know if that is possible. It does feel like it should be possible. 2. How can I return an augmented Tuple? Your UDF can make a copy of the input

Re: PigStorage's handling of InputFormat and OutputFormat

2011-07-22 Thread Raghu Angadi
Thanks guys. Updated PIG-2187 with a new patch. On Fri, Jul 22, 2011 at 3:44 PM, Daniel Dai wrote: > Yes, I am talking about PigTextOutputFormat. > > On Fri, Jul 22, 2011 at 2:51 PM, Raghu Angadi wrote: > > > On Fri, Jul 22, 2011 at 1:29 PM, Daniel Dai > wrote: > >

Re: PigStorage's handling of InputFormat and OutputFormat

2011-07-22 Thread Raghu Angadi
: > "There are very few StoreFuncs that extend PigStorage" that we know of. We > don't know how our users are extending it for themselves. And PigStorage is > a public interface. Breaking it is a non-starter. > > Alan. > > On Jul 22, 2011, at 2:57 PM, Raghu Angad

Re: PigStorage's handling of InputFormat and OutputFormat

2011-07-22 Thread Raghu Angadi
thing by putting the logic in a static method of > PigTextOutputFormat and letting other users use it. Also, the cost of an > extra copy of the output is bad. We don't want to slow down storing data. > > Alan. > > On Jul 22, 2011, at 12:24 PM, Raghu Angadi wrote: > > > attached a

Re: PigStorage's handling of InputFormat and OutputFormat

2011-07-22 Thread Raghu Angadi
Did you mean PigTextOutputFormat? Raghu. > which many users will use as reference implementation for a > StoreFunc. > > Daniel > > On Fri, Jul 22, 2011 at 12:24 PM, Raghu Angadi wrote: > > > attached a patch to https://issues.apache.org/jira/browse/PIG-2187 > > >

Re: PigStorage's handling of InputFormat and OutputFormat

2011-07-22 Thread Raghu Angadi
ouncement > and > document it as incompatible change if we do so. > > Daniel > > On Thu, Jul 21, 2011 at 11:12 AM, Raghu Angadi wrote: > > > expectation from PigStorage.getInputFormat() is that it is a > > InputFormat, and PigStorage handles converting Text to > &

PigStorage's handling of InputFormat and OutputFormat

2011-07-21 Thread Raghu Angadi
expectation from PigStorage.getInputFormat() is that it is a InputFormat, and PigStorage handles converting Text to Tuple. This is very useful and easy for users to use some other input format. But the same is not true for PigStorage().getOutputFormat().. Here it expects OutputFormat. So the outp

Re: STORE INTO replacing contents?

2011-07-19 Thread Raghu Angadi
I noticed this too. I could not find any documentation regarding how the return code from a command like 'fs' or 'rmf' or some shell command is handled. Replace 'fs -rmr' with rmf for your case. Raghu. On Tue, Jul 19, 2011 at 12:27 AM, Chris Rosner wrote: > Greetings, > > I'm trying to upgrade from 0.7.0 to 0.

Re: Specifying alternate zookeeper locations for HBaseStorage

2011-07-18 Thread Raghu Angadi
did you try set hbase.zookeeper.quorum 'zkquorum2' in your script? It seems to work in my read test. Raghu. On Mon, Jul 18, 2011 at 10:36 AM, Matt Davies wrote: > Hello, > > Is there a way to specify the zookeeper quorum for a pig job writing out to > HBase using HBaseStorage? For instance, >

Re: space in param values in command line

2011-07-13 Thread Raghu Angadi
> before > '=' is strip out, feel free to open a Jira ticket. > > Daniel > > On Tue, Jul 12, 2011 at 1:33 PM, Raghu Angadi wrote: > > > are you using bin/pig? With bin/pig, it does not even go to substitution > > stage. bin/pig joins all the arguments into

Re: space in param values in command line

2011-07-12 Thread Raghu Angadi
looks like. Raghu. On Tue, Jul 12, 2011 at 11:00 AM, Daniel Dai wrote: > Seems space works for me. You can use -r to dry-run the parameter > substitution part to see what it results. > > Daniel > > On Mon, Jul 11, 2011 at 10:15 PM, Raghu Angadi wrote: > > > On Mon

Re: space in param values in command line

2011-07-11 Thread Raghu Angadi
ot work "x==1" : doesn't "x\=\=1" : does "x \=\= 1" : doesn't Raghu. > Daniel > > On Mon, Jul 11, 2011 at 2:09 PM, Raghu Angadi wrote: > > > I am not able to assign a value with spaces to param on command line. > > $ pig -p c

space in param values in command line

2011-07-11 Thread Raghu Angadi
I am not able to assign a value with spaces to param on command line. $ pig -p cond='x == 1' test.pig results in command line parser error. other attempts like 'pig -p cond='"x == 1"' test.pig' didn't help. Is there a work around? otherwise I will file a jira and look into a fix. thanks, Raghu.
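
A small Python sketch can help separate what the shell does from what happens afterwards. It only models POSIX shell tokenization with the standard `shlex` module and shows one plausible way a quoted value gets broken (a wrapper that joins the arguments with spaces and splits them again). Whether that is what actually happens inside the pig launcher is an assumption, not something this message confirms.

```python
import shlex

command = "pig -p cond='x == 1' test.pig"

# What a POSIX shell hands to the launcher: the quoted value is one argument.
argv = shlex.split(command)

# Hypothetical failure mode: a wrapper joins argv with spaces and splits again,
# which loses the quoting and breaks the value into three pieces.
rejoined = " ".join(argv).split()

# Re-quoting each argument before joining keeps the value intact.
safe = shlex.split(" ".join(shlex.quote(arg) for arg in argv))
```

Here `argv` keeps `cond=x == 1` as a single element, `rejoined` does not, and `safe` round-trips to the same list as `argv`.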

Re: UDF property passing

2011-07-09 Thread Raghu Angadi
he question correctly, but PIG does not transfer your object, so what you store in the member variables does not matter. Raghu. On Jul 8, 2011, at 3:21 PM, Raghu Angadi wrote: > > > yes. that is exactly how HBaseStorage uses context. > > > > On Fri, Jul 8, 2011 at 10:19 AM

Re: UDF property passing

2011-07-08 Thread Raghu Angadi
var location), but the > UDFContext docs suggests that one keep all state in the UDFContext under an > appropriate signature. > > > > See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for > another case where this has reared it's head in an improper implementation.

Re: UDF property passing

2011-07-06 Thread Raghu Angadi
On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna wrote: > > On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote: > > > I think this is the same problem we were having earlier: > > http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4 > > > > One workaround is to use defines to explicitly create different >

Re: SUM function problem

2011-06-29 Thread Raghu Angadi
try : aa = foreach ( group ee all ) generate SUM(mycount); read description and examples of 'GROUP' in PIG manual. Raghu. On Wed, Jun 29, 2011 at 8:36 AM, Marian Condurache < m.condura...@bigpoint.net> wrote: > Hi, > I have a really weird problem i am new to PIG so I don't really > understa
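
For readers new to Pig, the shape of this suggestion can be modelled in plain Python. This is only an illustration of the semantics as I understand them (grouping everything into one bag, then summing a field while skipping nulls). `group_all`, `sum_field`, and the sample rows are hypothetical and are not part of Pig.

```python
def group_all(rows):
    """Model of grouping a relation with ALL: one row whose bag holds every input row."""
    return [("all", list(rows))]


def sum_field(bag, field):
    """Model of SUM over one field of a bag; nulls (None) are skipped."""
    values = [row[field] for row in bag if row[field] is not None]
    return sum(values) if values else None


# Sample relation 'ee' with a single 'mycount' column (made-up data).
ee = [{"mycount": 3}, {"mycount": 5}, {"mycount": None}, {"mycount": 2}]
grouped = group_all(ee)
aa = [sum_field(bag, "mycount") for _key, bag in grouped]
```

The point of the model is that the aggregate runs once over a single bag, so the result is one row rather than one row per input record.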

UDFContext() for a UDF

2011-06-27 Thread Raghu Angadi
UDF I am trying to implement: - output schema is same as input schema. - if (input == null) return tuple matching the schema with NULL for each field, else return input. For this, the UDF needs to pass the number of fields in the input schema to the backend. LoadFunc and StoreFunc handle such
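
The logic described here is small enough to model in Python. In this sketch, `udf_properties` is a plain dict standing in for whatever front-end-to-back-end property store the UDF would use, and `on_frontend` and `on_backend` are hypothetical names. How an EvalFunc actually gets this value to the back end is the open question of the post, and the sketch does not answer it.

```python
# Stand-in for properties a UDF would save on the front end and read on the back end.
udf_properties = {}


def on_frontend(input_schema):
    """Record the number of fields in the input schema (stored as a string)."""
    udf_properties["num_fields"] = str(len(input_schema))


def on_backend(row):
    """Return the row unchanged, or a tuple of nulls matching the schema width."""
    num_fields = int(udf_properties["num_fields"])
    if row is None:
        return (None,) * num_fields
    return row


on_frontend(["b", "c"])
padded = on_backend(None)
unchanged = on_backend(("b1", "c1"))
```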

What is expected from FLATTEN(null tuple)?

2011-06-27 Thread Raghu Angadi
Looks like FLATTEN(tuple) results in a single null when the tuple is null, irrespective of the schema. As a result, the particular row ends up with fewer columns than expected. This can lead to various kinds of problems: runtime exceptions, incorrect values etc. E.g. A = load 'x.txt' as (a, t:(b,c), d:);
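
The column-count problem described in this message can be illustrated with a short Python model. It is not Pig code: `flatten_row` and the sample rows are hypothetical, and the `pad_nulls=True` branch is only one possible expectation (one null per schema field), not a statement of what Pig does or should do.

```python
def flatten_row(row, tuple_index, tuple_arity, pad_nulls):
    """Expand the tuple at `tuple_index` into the enclosing row.

    With pad_nulls=False a null tuple becomes a single null (the reported
    behaviour); with pad_nulls=True it becomes `tuple_arity` nulls.
    """
    before = row[:tuple_index]
    value = row[tuple_index]
    after = row[tuple_index + 1:]
    if value is None:
        expanded = (None,) * (tuple_arity if pad_nulls else 1)
    else:
        expanded = tuple(value)
    return before + expanded + after


# Rows shaped like (a, t:(b,c), d); the second row has a null tuple.
rows = [("a1", ("b1", "c1"), "d1"), ("a2", None, "d2")]
observed = [flatten_row(r, 1, 2, pad_nulls=False) for r in rows]
padded = [flatten_row(r, 1, 2, pad_nulls=True) for r in rows]
```

In `observed` the second row has three columns while the first has four, so a later reference to the fourth column would fail or read the wrong value. In `padded` both rows have four columns.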