Re: Embedded Pig: Can hadoop cluster hostnames only be specified from files in the classpath?

2011-04-14 Thread byambajav byambajargal
Hello Andrew, I have the same problem as you. If you find a solution, please let me know as well. Thanks, byambajargal On Fri, Mar 25, 2011 at 8:50 PM, Andrew Look wrote: > Hello Pig-Users, > > I have been looking for a flexible way to specify the cluster hostnames of > the remote mapreduce ins

Re: Filter on contents of other dataset

2011-04-14 Thread Mridul Muralidharan
You could either distribute the small file using the distributed cache, in which case you can use the direct file API to load content from the file, or use the HDFS APIs directly to load it from each task ... usually the distributed cache should work better, but YMMV! Regards, Mridul On Friday 15 April 2

Re: Filter on contents of other dataset

2011-04-14 Thread Aniket Mokashi
Thanks Mridul, (although small might grow bigger). For instance, let's take small to be in-memory-small and stored in a local file. When does my UDF load the data from the file? Earlier, I wrote a bag loader that returns a bag of small data (e.g. load 'smalldata' using BagLoader() as (smallbag)). But then

Re: Filter on contents of other dataset

2011-04-14 Thread Mridul Muralidharan
The way you described it, it does look like an application of cross. How 'small' is small? If it is pretty small, you can avoid the shuffle/reduce phase and directly stream huge through a UDF which does a task-local cross with 'small' (assuming it fits in memory). %define my_udf MYUDF('sma
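A minimal sketch of the task-local cross described above, written in plain Python rather than as a real Pig UDF. It assumes 'small' fits in memory and that a record is kept when F(hdata, skey) is true for at least one skey; the helper names (load_small_keys, make_filter, contains) and the substring predicate are hypothetical stand-ins, not anything from the thread.

```python
def load_small_keys(lines):
    """Read one key per line into an in-memory list, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


def make_filter(small_keys, predicate):
    """Build a per-record filter that checks hdata against every small key.

    Assumption: a record is kept if predicate(hdata, skey) holds for
    at least one skey. The keys are loaded once and reused for every record.
    """
    def keep(hdata):
        return any(predicate(hdata, skey) for skey in small_keys)
    return keep


def contains(hdata, skey):
    """Hypothetical stand-in for F(hdata, skey): a substring match."""
    return skey in hdata


# 'small' is loaded once per task; 'huge' is streamed record by record.
small_keys = load_small_keys(["foo\n", "bar\n", "\n"])
keep = make_filter(small_keys, contains)

huge = [("k1", "a foo b"), ("k2", "nothing"), ("k3", "bar")]
filtered = [(hkey, hdata) for hkey, hdata in huge if keep(hdata)]
```

The point of the design is that the small dataset is read a single time and held in memory, so each huge record costs only a scan over the small keys and no shuffle/reduce phase is needed.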

Filter on contents of other dataset

2011-04-14 Thread Aniket Mokashi
Hi, What would be the best way to write this script? I have two datasets - huge (hkey, hdata), small(skey). I want to filter all the data from huge dataset for which F(hdata, skey) is true. Please advise. For example, huge = load 'mydata' as (key:chararray, value:chararray); small = load 'smallda

Re: register python udf file

2011-04-14 Thread Xiaomeng Wan
java.io.IOException: Deserialization error: could not instantiate 'org.apache.pig.scripting.jython.JythonFunction' with arguments '[/home/shawn/TESS/code/mypyudfs.py, isStopWord]' at org.apache.pig.impl.util.ObjectSerializer.deserialize(ObjectSerializer.java:55) at org.apache.pig.

Re: register python udf file

2011-04-14 Thread Richard Ding
This should work (see PIG-1653 for a related issue). Can you post the whole stack trace? Thanks -- Richard On 4/14/11 12:59 PM, "Xiaomeng Wan" wrote: Wondering whether this is a bug or by design. When I register my python udf file like this: Register 'a/b/mypyudfs.py' using

register python udf file

2011-04-14 Thread Xiaomeng Wan
Wondering whether this is a bug or by design. When I register my python udf file like this: Register 'a/b/mypyudfs.py' using jython as mypyudfs; I got an error saying "could not initialize: a/b/mypyudfs.py". Looks like the original file path is sent to the backend without any conversio

Computing Histograms with UDF's

2011-04-14 Thread Jonathan Holloway
Hi, This is a follow-on from another question I asked a while ago. I'm calculating percentiles etc. for datasets and I wondered how I would do this with a histogram instead so it's more efficient. Does anybody have an example of this currently in the Pig source code or some advice on how to go a
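One possible way to approximate percentiles from a histogram, sketched in plain Python and not taken from the Pig source. It assumes the value range (lo, hi) is known in advance and uses fixed-width bins with linear interpolation inside a bin; all function names here are hypothetical. Because per-bin counts simply add, partial histograms can be built independently and merged, which is what makes the approach cheaper than sorting every value.

```python
def build_histogram(values, lo, hi, num_bins):
    """Count values into num_bins equal-width bins over [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        idx = int((v - lo) / width)
        # Clamp so out-of-range values and v == hi land in the edge bins.
        idx = min(max(idx, 0), num_bins - 1)
        counts[idx] += 1
    return counts


def merge_histograms(a, b):
    """Combine two histograms that share the same bin layout."""
    return [x + y for x, y in zip(a, b)]


def approx_percentile(counts, lo, hi, p):
    """Estimate the p-th percentile (0-100) by interpolating within a bin."""
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    width = (hi - lo) / len(counts)
    target = p / 100.0 * total
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= target:
            fraction = (target - cumulative) / c
            return lo + (i + fraction) * width
        cumulative += c
    return hi


# Two partial histograms (e.g. from separate tasks) merged into one.
part_a = build_histogram(range(0, 50), 0, 100, 10)
part_b = build_histogram(range(50, 100), 0, 100, 10)
merged = merge_histograms(part_a, part_b)
median = approx_percentile(merged, 0, 100, 50)
```

The estimate is only as precise as the bin width, so more bins trade memory for accuracy.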