Hello Andrew
I have the same problem as you. If you find any solution, please share it with me
as well.
Thanks
byambajargal
On Fri, Mar 25, 2011 at 8:50 PM, Andrew Look wrote:
> Hello Pig-Users,
>
> I have been looking for a flexible way to specify the cluster hostnames of
> the remote mapreduce ins
You could either distribute the small file using the distributed cache - in
which case you can use the regular file API to load its contents - or use
the HDFS APIs directly to load it from each task ... usually the
distributed cache works better, but YMMV!
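As a sketch of the distributed-cache route (paths and the UDF name are hypothetical; the UDF would open the task-local symlink when first invoked):

```pig
-- hypothetical: ship 'smalldata' to every task via the distributed cache
set mapred.cache.files 'hdfs:///user/me/smalldata#smalldata';
set mapred.create.symlink 'yes';
huge = load 'mydata' as (hkey:chararray, hdata:chararray);
-- MYUDF reads the local symlink 'smalldata' once, then filters each record
filtered = filter huge by MYUDF(hdata);
```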
Regards,
Mridul
On Friday 15 April 2
Thanks Mridul,
(Although, 'small' might grow bigger.) For instance, let's say 'small' is
in-memory-small and stored in a local file.
When does my UDF load the data from the file? Earlier, I wrote a bag
loader that returns a bag of the small data (e.g., load 'smalldata' using
BagLoader() as (smallbag)). But then
The way you described it, it does look like an application of cross.
How 'small' is small ?
If it is pretty small, you can avoid the shuffle/reduce phase and
directly stream 'huge' through a UDF which does a task-local cross with
'small' (assuming it fits in memory).
%define my_udf MYUDF('sma
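Spelled out with Pig's DEFINE statement (names hypothetical; MYUDF is assumed to take the small file's location, load it into memory once per task, and return true iff F(hdata, skey) holds for some skey):

```pig
define my_udf MYUDF('smalldata');
huge = load 'mydata' as (hkey:chararray, hdata:chararray);
-- runs entirely map-side: no shuffle, no reduce
filtered = filter huge by my_udf(hdata);
```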
Hi,
What would be the best way to write this script?
I have two datasets - huge (hkey, hdata) and small (skey). I want to filter
the huge dataset, keeping the rows for which F(hdata, skey) is true.
Please advise.
For example,
huge = load 'mydata' as (key:chararray, value:chararray);
small = load 'smallda
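For what it's worth, a naive cross-based version (F stands in for the hypothetical filter UDF from the question) might look like:

```pig
huge = load 'mydata' as (hkey:chararray, hdata:chararray);
small = load 'smalldata' as (skey:chararray);
-- warning: this materializes |huge| x |small| rows through a shuffle
pairs = cross huge, small;
keep = filter pairs by F(huge::hdata, small::skey);
```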
java.io.IOException: Deserialization error: could not instantiate
'org.apache.pig.scripting.jython.JythonFunction' with arguments
'[/home/shawn/TESS/code/mypyudfs.py, isStopWord]'
at org.apache.pig.impl.util.ObjectSerializer.deserialize(ObjectSerializer.java:55)
at org.apache.pig.
This should work (see PIG-1653 for a related issue). Can you post the whole
stack trace?
Thanks
-- Richard
On 4/14/11 12:59 PM, "Xiaomeng Wan" wrote:
wondering whether this is a bug or designed this way. When I
register my python udf file like this:
Register 'a/b/mypyudfs.py' using jython as mypyudfs;
I got an error saying "could not initialize: a/b/mypyudfs.py". It looks
like the original file path is sent to the backend without any conversion.
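For reference, a minimal mypyudfs.py along the lines of the isStopWord function in the stack trace above might look like this (the stop-word list is my own assumption, and the fallback decorator exists only so the file can be tested outside Pig - Pig's Jython engine normally injects outputSchema for you):

```python
# mypyudfs.py - hypothetical sketch of a Jython UDF file
try:
    outputSchema  # injected by Pig's Jython script engine at registration time
except NameError:
    # no-op fallback so the module can also be imported outside Pig
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# assumed stop-word list, purely for illustration
STOP_WORDS = {'a', 'an', 'and', 'in', 'of', 'or', 'the'}

@outputSchema('is_stop:boolean')
def isStopWord(word):
    # treat null input as "not a stop word"
    if word is None:
        return False
    return word.lower() in STOP_WORDS
```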
Hi,
This is a follow-on from another question I asked a while ago. I'm
calculating percentiles, etc., for datasets, and I wondered how I would do
this with a histogram instead so that it's more efficient.
Does anybody have an example of this currently in the Pig source code, or
some advice on how to go about it
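One common approach (a plain-Python sketch of the general technique, not code from the Pig source) is to keep a count per bin, then walk the cumulative counts and interpolate linearly inside the bin that contains the target rank:

```python
def percentile_from_histogram(bin_edges, counts, q):
    """Approximate the q-th percentile (0-100) from a histogram.

    bin_edges: sorted list of n+1 edges; counts: the n bin counts.
    Assumes values are spread uniformly within each bin.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q / 100.0 * total  # rank we are looking for
    cum = 0
    for i, c in enumerate(counts):
        if cum + c >= target:
            lo, hi = bin_edges[i], bin_edges[i + 1]
            # fraction of this bin's mass below the target rank
            frac = (target - cum) / c if c else 0.0
            return lo + frac * (hi - lo)
        cum += c
    return float(bin_edges[-1])
```

The trade-off is the usual one: memory drops from O(n) values to O(bins) counts, at the cost of accuracy bounded by the bin width.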