if it is small
enough to ignore !
Regards,
Mridul
On Tue, Jun 26, 2012 at 2:50 PM, Mridul Muralidharan
mrid...@yahoo-inc.com wrote:
-Original Message-
From: Jonathan Coveney [mailto:jcove...@gmail.com]
Sent: Wednesday, June 27, 2012 3:12 AM
To: user@pig.apache.org
You could dump the data in a dfs file and pass the location of the file as
param to your udf in define - so that it initializes itself using that data ...
- Mridul
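A minimal sketch of the idea above, assuming the data is one entry per line: the file location arrives as a String constructor argument and the contents are loaded on first use. The class name LookupFilter and its methods are hypothetical, and the local-filesystem read is only a stand-in; on a cluster the file would presumably be opened through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Pig: initializes itself from a file whose
// location is passed in as a constructor parameter.
final class LookupFilter {
    private final String location;
    private Set<String> entries; // loaded lazily on first use

    LookupFilter(String location) {
        this.location = location;
    }

    // Returns true if the key appears as a line in the data file.
    boolean contains(String key) {
        if (entries == null) {
            entries = load();
        }
        return entries.contains(key);
    }

    // Assumption: a local file read stands in for the DFS read.
    private Set<String> load() {
        try {
            return new HashSet<>(
                    Files.readAllLines(Paths.get(location), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```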
-Original Message-
From: Dexin Wang [mailto:wangde...@gmail.com]
Sent: Tuesday, June 26, 2012 10:58 PM
To:
A simple solution would be to tag each tuple with a random number (such
that each number has multiple URLs associated with it - but not too
large a number of URLs), and simply group based on this field.
In the reducer, you get a bag of URLs for each random number : at which
point, you can
You might want to raise a JIRA on this - both abs and rel paths should
be supported ...
Regards,
Mridul
On Friday 03 June 2011 11:15 PM, Daniel Eklund wrote:
Shawn... excellent!.. thank you. it worked.
interestingly, I remember having to use the absolute path in local mode
daniel
Easy option would be to write your own udf which can catch corner cases,
etc ..
But assuming your data strictly follows what you mentioned, something
like this might help (illustrative only !) :
pets = load 'pets.txt' USING PigStorage(';') AS (pet_id:chararray,
pet_type:chararray,
example
of such functions in SVN/pig0.8 package?
Best Regards
Vincent
On Tue, May 10, 2011 at 2:02 PM, Mridul Muralidharan
mrid...@yahoo-inc.com wrote:
Easy option would be to write your own udf which can catch corner
cases, etc ..
But assuming your data
.
Daniel
On 04/19/2011 12:28 AM, Mridul Muralidharan wrote:
If I am not wrong, PIG-1705 talks about conflicting aliases in a join :
interesting to see how that affects Jay Hacker's issue where there is no
alias re-use from what I saw ...
Regards,
Mridul
On Tuesday 19 April 2011 03:11 AM, Daniel
:53 AM, Mridul Muralidharan wrote:
Alias vs relation difference.
The bug is about an alias issue, not a relation, IIRC.
Everything comes from limited number of relations which are loaded
anyway :-)
- Mridul
On Friday 22 April 2011 06:40 AM, Jianyong Dai wrote:
m is actually reused. z is joining two
In general (on hadoop based systems), if the input is not immutable -
you can end up with issues during task re-execution, etc.
This happens not just for cassandra but for hbase, others too - where
you modify data in-place.
Regards,
Mridul
On Thursday 21 April 2011 04:29 AM, Bing Wei
On Thursday 21 April 2011 06:41 PM, Jeremy Hanna wrote:
On Apr 21, 2011, at 3:19 AM, Mridul Muralidharan wrote:
In general (on hadoop based systems), if the input is not immutable - you can
end up with issues during task re-execution, etc.
This happens not just for cassandra but for hbase
Not sure what the scope of the experiment is, but some useful
comparisons could be against :
a) job using only mapred api.
b) hadoop streaming.
c) pig streaming.
It also depends on the actual script/job being run - if it is using
combiners, multiple outputs, 'depth of pipeline', how many
You could try using a property file instead of a CLI param to pass the
name/value ...
Regards,
Mridul
On Tuesday 19 April 2011 05:29 AM, Andreas Paepcke wrote:
I'm still struggling with parameter substitution.
Below are six examples. Two work, the others don't.
When they don't, I get this
If I am not wrong, PIG-1705 talks about conflicting aliases in a join :
interesting to see how that affects Jay Hacker's issue where there is no
alias re-use from what I saw ...
Regards,
Mridul
On Tuesday 19 April 2011 03:11 AM, Daniel Dai wrote:
I believe it is PIG-1705.
Daniel
On
The way you described it, it does look like an application of cross.
How 'small' is small ?
If it is pretty small, you can avoid the shuffle/reduce phase and
directly stream huge through a udf which does a task local cross with
'small' (assuming it fits in memory).
%define my_udf
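The task-local cross described above could look roughly like the following sketch. It is not Pig's UDF API: the class name LocalCross and the use of plain Strings for records are assumptions made to keep the example self-contained. The 'small' side is held in memory and each record streamed from the large side is paired with all of it, so no shuffle is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: holds the 'small' relation in memory and pairs each
// streamed record of the large relation with every small record.
final class LocalCross {
    private final List<String> small;

    LocalCross(List<String> small) {
        // Defensive copy; assumes 'small' fits in memory.
        this.small = new ArrayList<>(small);
    }

    // Called once per record of the large input; returns that record's
    // share of the cross product as (huge, small) pairs.
    List<String[]> cross(String hugeRecord) {
        List<String[]> out = new ArrayList<>(small.size());
        for (String s : small) {
            out.add(new String[] {hugeRecord, s});
        }
        return out;
    }

    public static void main(String[] args) {
        LocalCross udf = new LocalCross(Arrays.asList("a", "b"));
        for (String[] pair : udf.cross("row1")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```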
You could group by first column ?
Please refer to the pig manual for more on this.
Regards,
Mridul
On Friday 08 April 2011 07:15 AM, Jonathan Holloway wrote:
Hi all,
I have the following:
A {(3),(Log Message A)}
A {(5),(Log Message B)}
B{(8),(Log Message C)}
B {(1),(Log
foreach/flatten invocations, you can get
to the data you want (but it is not the same functionally since you
lose record level aggregation that $1.$1.$0 (for ex) provides).
Regards,
Mridul
Thanks,
badri
-Original Message-
From: Mridul Muralidharan [mailto:mrid...@yahoo-inc.com
Using rand() as group key, in general, is a pretty bad idea in case of
failures.
- Mridul
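One common alternative to a random key, sketched here as an assumption rather than as what the thread settled on, is to derive the bucket from the record itself. The concern with rand() is presumably that a re-executed task draws different numbers, so the same record can land in a different group on retry. The helper name StableBucket and the choice of CRC32 are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Pig: derives a group key from the record
// itself, so a re-executed task assigns every record to the same bucket.
final class StableBucket {
    private StableBucket() {}

    // Returns a bucket in [0, numBuckets) computed from the record's bytes.
    static int bucketFor(String record, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(record.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is non-negative.
        return (int) (crc.getValue() % numBuckets);
    }

    public static void main(String[] args) {
        String url = "http://example.com/page";
        // Same input, same bucket, however often the task is retried.
        System.out.println(bucketFor(url, 16) == bucketFor(url, 16));
    }
}
```

The trade-off is that identical records always share a bucket, so the spread is only as even as the data is varied.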
On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote:
Don't order, that's expensive.
Just group by rand(), specify parallelism on the group by, and store the
result of foreach grouped generate
In which case, can't you model that as a Bag ?
I imagine something like Tuple with fields person:chararray,
books_read:bag{ (name:chararray, isbn:chararray) }, etc ?
Of course, it will work as a bag if the tuple contained within it has a
fixed schema :-) (unless you repeat this process N
IMO 1.0 for a product typically promises :
1) Reasonable stability of interfaces.
Typically only major version changes break interface compatibility.
While we are at 0.x, it seems to be considered 'okish' to violate this :
but once you are at 1.0 and higher, breaking interface contracts will
As of now, UDFs are limited to only Strings as constructor params.
Regards,
Mridul
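Given that restriction, a usual workaround is to accept the values as Strings and parse them inside the constructor. The sketch below reuses the TopUDF and m_cnt names from this thread; m_type and the accessor methods are hypothetical, and a real UDF would additionally extend org.apache.pig.EvalFunc and implement exec(), which is left out so the example stays self-contained.

```java
// Sketch only: shows String constructor params being parsed into ints.
// A real Pig UDF would also extend org.apache.pig.EvalFunc.
final class TopUDF {
    private final int m_cnt;
    private final int m_type; // hypothetical field name

    // Both arguments arrive as Strings and are converted here.
    // Integer.parseInt throws NumberFormatException on bad input.
    TopUDF(String top, String type) {
        m_cnt = Integer.parseInt(top.trim());
        m_type = Integer.parseInt(type.trim());
    }

    int count() {
        return m_cnt;
    }

    int type() {
        return m_type;
    }

    public static void main(String[] args) {
        TopUDF udf = new TopUDF("10", "1");
        System.out.println(udf.count() + " " + udf.type());
    }
}
```

In a script, the values would then be passed as quoted strings in the DEFINE statement, for example DEFINE myTop TopUDF('10', '1'); where the alias myTop is made up for illustration.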
On Thursday 02 December 2010 02:18 PM, Sheeba George wrote:
Hi Daniel
I have a related question. My UDF has a constructor that takes 2 param.
public TopUDF(int top, int type){
m_cnt = top;