RE: Passing a BAG to Pig UDF constructor?

2012-06-29 Thread Mridul Muralidharan
if it is small enough to ignore ! Regards, Mridul On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan mrid...@yahoo-inc.com wrote: -Original Message- From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Wednesday, June 27, 2012 3:12 AM To: user@pig.apache.org

RE: Passing a BAG to Pig UDF constructor?

2012-06-26 Thread Mridul Muralidharan
You could dump the data in a dfs file and pass the location of the file as param to your udf in define - so that it initializes itself using that data ... - Mridul -Original Message- From: Dexin Wang [mailto:wangde...@gmail.com] Sent: Tuesday, June 26, 2012 10:58 PM To:

Re: Multithreaded UDF

2011-11-09 Thread Mridul Muralidharan
A simple solution would be to tag each tuple with a random number (such that each number has multiple URLs associated with it, but not too many), and simply group on this field. In the reducer, you get a bag of URLs for each random number : at which point, you can
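A minimal sketch of the tagging idea above, written in plain Python rather than Pig so it can be run standalone. The function name `bucket_urls`, the `num_buckets` parameter, and the example URLs are assumptions made for illustration; they do not come from the thread.

```python
import random
from collections import defaultdict


def bucket_urls(urls, num_buckets, seed=None):
    """Tag each URL with a random bucket number, then group by that tag.

    num_buckets controls how many URLs land in each group on average:
    roughly len(urls) / num_buckets per bucket.
    """
    rng = random.Random(seed)  # seeded only to make the example repeatable
    buckets = defaultdict(list)
    for url in urls:
        tag = rng.randrange(num_buckets)  # the "random number" tag
        buckets[tag].append(url)          # equivalent of GROUP BY tag
    return dict(buckets)


# Hypothetical input: 20 URLs spread over 4 buckets.
urls = [f"http://example.com/page{i}" for i in range(20)]
groups = bucket_urls(urls, num_buckets=4, seed=0)

for tag, bag in sorted(groups.items()):
    # Each bag could now be handed to a worker that processes its URLs together.
    print(tag, len(bag))
```

Every URL ends up in exactly one bag, so the per-bag work can be done independently in each reducer.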

Re: jython not working in cluster mode

2011-06-06 Thread Mridul Muralidharan
You might want to raise a JIRA on this - both abs and rel paths should be supported ... Regards, Mridul On Friday 03 June 2011 11:15 PM, Daniel Eklund wrote: Shawn... excellent!.. thank you. it worked. interestingly, I remember having to use the absolute path in local mode daniel

Re: Tuple to lines conversion in Pig

2011-05-10 Thread Mridul Muralidharan
Easy option would be to write your own udf which can catch corner cases, etc .. But assuming your data strictly follows what you mentioned, something like this might help (illustrative only !) : pets = load 'pets.txt' USING PigStorage(';') AS (pet_id:chararray, pet_type:chararray,

Re: Tuple to lines conversion in Pig

2011-05-10 Thread Mridul Muralidharan
example of such functions in SVN/pig0.8 package? Best Regards Vincent On Tue, May 10, 2011 at 2:02 PM, Mridul Muralidharan mrid...@yahoo-inc.com wrote: Easy option would be to write your own udf which can catch corner cases, etc .. But assuming your data

Re: Looking up two fields in a relation with another relation

2011-04-22 Thread Mridul Muralidharan
. Daniel On 04/19/2011 12:28 AM, Mridul Muralidharan wrote: If I am not wrong, PIG-1705 talks about conflicting aliases in a join : interesting to see how that affects Jay Hacker's issue where there is no alias re-use from what I saw ... Regards, Mridul On Tuesday 19 April 2011 03:11 AM, Daniel

Re: Looking up two fields in a relation with another relation

2011-04-22 Thread Mridul Muralidharan
:53 AM, Mridul Muralidharan wrote: Alias vs relation difference. The bug is about alias issue, not relation iirc. Everything comes from limited number of relations which are loaded anyway :-) - Mridul On Friday 22 April 2011 06:40 AM, Jianyong Dai wrote: m is actually reused. z is joining two

Re: pig query on Cassandra

2011-04-21 Thread Mridul Muralidharan
In general (on hadoop based systems), if the input is not immutable - you can end up with issues during task re-execution, etc. This happens not just for cassandra but for hbase, others too - where you modify data in-place. Regards, Mridul On Thursday 21 April 2011 04:29 AM, Bing Wei

Re: pig query on Cassandra

2011-04-21 Thread Mridul Muralidharan
On Thursday 21 April 2011 06:41 PM, Jeremy Hanna wrote: On Apr 21, 2011, at 3:19 AM, Mridul Muralidharan wrote: In general (on hadoop based systems), if the input is not immutable - you can end up with issues during task re-execution, etc. This happens not just for cassandra but for hbase

Re: Benchmark Hadoop and Pig UDFs

2011-04-20 Thread Mridul Muralidharan
Not sure what the scope of the experiment is, but some useful comparisons could be against : a) job using only mapred api. b) hadoop streaming. c) pig streaming. It also depends on the actual script/job being run - if it is using combiners, multiple outputs, 'depth of pipeline', how many

Re: DUMP or STORE Depending on Parameter Input

2011-04-20 Thread Mridul Muralidharan
You could try using property file instead of cli param to pass the name/value ... Regards, Mridul On Tuesday 19 April 2011 05:29 AM, Andreas Paepcke wrote: I'm still struggling with parameter substitution. Below are six examples. Two work, the others don't. When they don't, I get this

Re: Looking up two fields in a relation with another relation

2011-04-19 Thread Mridul Muralidharan
If I am not wrong, PIG-1705 talks about conflicting aliases in a join : interesting to see how that affects Jay Hacker's issue where there is no alias re-use from what I saw ... Regards, Mridul On Tuesday 19 April 2011 03:11 AM, Daniel Dai wrote: I believe it is PIG-1705. Daniel On

Re: Filter on contents of other dataset

2011-04-14 Thread Mridul Muralidharan
The way you described it, it does look like an application of cross. How 'small' is small ? If it is pretty small, you can avoid the shuffle/reduce phase and directly stream huge through a udf which does a task local cross with 'small' (assuming it fits in memory). %define my_udf
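The task-local cross described above can be sketched in plain Python as follows. This only illustrates the idea (hold 'small' in memory, stream 'huge' past it); it is not the UDF from the message, and `local_cross`, the `keep` predicate, and the sample records are hypothetical names chosen for the example.

```python
def local_cross(huge_records, small_records, keep):
    """Stream huge_records once, crossing each record with an in-memory 'small'.

    No shuffle or reduce step is needed: each task only needs its own slice
    of 'huge' plus a full copy of 'small'.
    """
    small = list(small_records)  # loaded once per task; assumed to fit in memory
    for h in huge_records:       # 'huge' is never materialized
        for s in small:
            if keep(h, s):       # filter applied to each crossed pair
                yield (h, s)


# Hypothetical data: keep pairs where the small record occurs in the huge one.
huge = iter(["apple pie", "banana split", "cherry tart"])
small = ["apple", "cherry"]
matches = list(local_cross(huge, small, lambda h, s: s in h))
print(matches)
```

The cost is one pass over 'huge' with len(small) comparisons per record, which is only reasonable while 'small' stays small.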

Re: Merging lines in a log into a single bag

2011-04-08 Thread Mridul Muralidharan
You could group by first column ? Please refer to the pig manual for more on this. Regards, Mridul On Friday 08 April 2011 07:15 AM, Jonathan Holloway wrote: Hi all, I have the following: A {(3),(Log Message A)} A {(5),(Log Message B)} B{(8),(Log Message C)} B {(1),(Log

Re: Dereferencing columns of nested bags

2011-04-08 Thread Mridul Muralidharan
foreach/flatten invocations, you can get to the data you want (but it is not the same functionally since you lose record level aggregation that $1.$1.$0 (for ex) provides). Regards, Mridul Thanks, badri -Original Message- From: Mridul Muralidharan [mailto:mrid...@yahoo-inc.com

Re: store less files

2011-04-02 Thread Mridul Muralidharan
Using rand() as group key, in general, is a pretty bad idea in case of failures. - Mridul On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote: Don't order, that's expensive. Just group by rand(), specify parallelism on the group by, and store the result of foreach grouped generate
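One way to read the concern above: if a map task is re-executed after a failure, a random key can send the same record to a different reducer than the first attempt did, so outputs of the two attempts are not interchangeable. A possible alternative, sketched here in plain Python as an assumption rather than as the poster's recommendation, is a key derived deterministically from the record itself. The name `stable_bucket` is hypothetical.

```python
import zlib


def stable_bucket(record, num_buckets):
    """Derive the group key from the record content instead of rand().

    zlib.crc32 gives the same value on every run and every machine, so a
    re-executed task assigns each record to the same bucket as before.
    (Python's built-in hash() on str is salted per process, so it is
    deliberately not used here.)
    """
    return zlib.crc32(record.encode("utf-8")) % num_buckets


records = ["row-1", "row-2", "row-3", "row-4"]

# Simulate a first attempt and a retry of the same task.
first_attempt = [stable_bucket(r, 8) for r in records]
retry_attempt = [stable_bucket(r, 8) for r in records]
print(first_attempt == retry_attempt)
```

The trade-off is that the spread across buckets now depends on the data, so heavily duplicated records all land in the same bucket.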

Re: Schema

2011-03-09 Thread Mridul Muralidharan
In which case, can't you model that as a Bag ? I imagine something like Tuple with fields person:chararray, books_read:bag{ (name:chararray, isbn:chararray) }, etc ? Of course, it will work as a bag if the tuple contained within it has a fixed schema :-) (unless you repeat this process N

Re: [DISCUSSION] Pig.next

2011-03-04 Thread Mridul Muralidharan
IMO 1.0 for a product typically promises : 1) Reasonable stability of interfaces. Typically only major version changes break interface compatibility. While we are at 0.x, it seems to be considered 'okish' to violate this : but once you are at 1.0 and higher, breaking interface contracts will

Re: Writing filter function that takes constructor param?

2010-12-02 Thread Mridul Muralidharan
As of now, UDFs are limited to only Strings as constructor params. Regards, Mridul On Thursday 02 December 2010 02:18 PM, Sheeba George wrote: Hi Daniel I have a related question. My UDF has a constructor that takes 2 param. public TopUDF(int top, int type){ m_cnt = top;