I spent some time debugging this. The reason is:
sys.path on the TT for Jython is ['__classpath__', '__pyclasspath__/'],
and for the client it is ['', '/users/lib/Lib',
'/users/lib/jython_simplejson.jar/Lib', '__classpath__', '__pyclasspath__/'].
I am still figuring out why CLASSPATH (java.class.path
This looks like a bug to me. Jython cuts the jython.jar location out of the
classpath and appends Lib to it. But, in general, on the TT jython.jar is not
available; it is merged into job.jar by Pig. Hence, imports will always
fail.
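A sketch of a possible stopgap while this is sorted out, not an official fix: prepend the Lib directory to sys.path at the top of the registered UDF script, before any imports. The '/users/lib/Lib' path is illustrative, copied from the client-side sys.path above; substitute whatever directory actually holds your modules.

```python
import sys

# Workaround sketch (assumption: you know at register time which directory
# holds your Python modules; '/users/lib/Lib' is illustrative, taken from
# the client-side sys.path shown above).
EXTRA_PATHS = ['/users/lib/Lib']

for p in EXTRA_PATHS:
    if p not in sys.path:
        # Prepend so these modules win over anything baked into job.jar.
        sys.path.insert(0, p)

# ...imports of your own modules go below this point...
```

This only helps if the directory is actually visible on the task nodes (e.g. shipped via the distributed cache); it does not fix the underlying jython.jar/job.jar issue described above.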
~Aniket
On Mon, Mar 12, 2012 at 12:54 AM, Aniket Mokashi wrote:
Hi,
Can I write a UDF that overrides LOAD SimpleTextLoader without MapReduce? I am a
bit confused about the use of MapReduce, because I am not able to follow the
flow of SimpleTextLoader when the command is invoked.
Command: A = LOAD 'data' USING myudf.SimpleTextLoader();
I want to know the steps
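Not the authoritative Pig internals, but a rough sketch of the call sequence a load function goes through when that command runs. The method names mirror the real Pig LoadFunc API; the FakeRecordReader and the driver loop are illustrative stand-ins for what the MapReduce framework does on the back end.

```python
# Illustrative sketch (not Pig source): the rough order in which Pig and the
# MapReduce framework drive a loader when "A = LOAD 'data' USING ...;" runs.

class SimpleTextLoaderSketch:
    """Stand-in for a Pig LoadFunc; method names mirror the real API."""

    def setLocation(self, location):
        # 1. Pig tells the loader where the input data lives (front end
        #    for planning, and again on the back end before reading).
        self.location = location

    def prepareToRead(self, reader):
        # 2. Back end, once per map task: the framework hands the loader
        #    a RecordReader for its input split.
        self.reader = reader

    def getNext(self):
        # 3. Back end, called repeatedly: turn each input record into a
        #    tuple, or return None at end of split.
        line = self.reader.next_line()
        if line is None:
            return None
        return tuple(line.split('\t'))


class FakeRecordReader:
    """Illustrative stand-in for Hadoop's RecordReader."""
    def __init__(self, lines):
        self.lines = iter(lines)

    def next_line(self):
        return next(self.lines, None)


# Driver loop: roughly what each map task does with the loader.
loader = SimpleTextLoaderSketch()
loader.setLocation('data')
loader.prepareToRead(FakeRecordReader(['a\t1', 'b\t2']))
rows = []
while True:
    t = loader.getNext()
    if t is None:
        break
    rows.append(t)
```

The point of the sketch: the LOAD statement itself only builds a plan; the loader's read methods are invoked later, inside map tasks, which is why the flow is hard to see by stepping through the command.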
It is restricted to the pig types, yes. You could serialize it to a
DataByteArray and manually manage that, or you could just convert it to a
databag, or you could make it a hashmap with null values, or a tuple... but
yeah.
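For the Jython route, a minimal sketch of the "convert it to a databag" option: a bag is a collection of tuples, so wrap each set element in a single-field tuple. The function name and the sorted() call are my choices for illustration, not part of Pig.

```python
def set_to_bag(values):
    """Convert a Python set of strings into bag form: a list of 1-field tuples.

    A Jython UDF can't hand Pig a bare set, but a list of tuples maps onto
    a DataBag. sorted() just makes the output deterministic.
    """
    return [(v,) for v in sorted(values)]

# Usage sketch:
bag = set_to_bag({'b', 'a'})
```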
2012/3/12 Yang tedd...@gmail.com
I tried to return a Set&lt;String&gt;
No good public attempts to put ML kind of
stuff on top of Pig are known to me (well, almost none). There are some statistical
packages written at Yahoo, but AFAIK they don't do directly what you
need.
Pig is a somewhat excellent data-prep pipeline, but IMO is not as
excellent as something
thanks!
On Mon, Mar 12, 2012 at 11:36 AM, Jonathan Coveney jcove...@gmail.com wrote:
It is restricted to the pig types, yes. You could serialize it to a
DataByteArray and manually manage that, or you could just convert it to a
databag, or you could make it a hashmap with null values, or a
Yes, that's what I meant by "almost none". It would seem to me that with
pig-vector it is technically a bridge from the Pig schema to some (and at
the moment perhaps quite limited) Mahout functionality, rather than
something fundamentally leaning on Pig's own capability. It would seem
to me for that
Well that's not entirely true -- you can in fact train in parallel on
different segments of your dataset, thereby creating an ensemble. Pair
the outputs with a classifier UDF that knows how to take advantage of
that, and suddenly you have a massively parallel ETL engine that can
do ML as part of
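A toy sketch of that pattern, with all names hypothetical: "train" a trivial model independently on each data segment (what the parallel map tasks would do), then combine the per-segment models with a majority vote, the way a classifier UDF over the collected outputs might.

```python
# Toy sketch of the ensemble idea (all names hypothetical): one trivial model
# per data segment, combined by majority vote in a classifier-UDF-like step.

def train_threshold_model(segment):
    """'Train' on one segment: use the segment mean as a decision threshold."""
    return sum(segment) / len(segment)

def predict(threshold, x):
    """A single model's vote: 1 if x clears the threshold, else 0."""
    return 1 if x >= threshold else 0

def ensemble_vote(models, x):
    """The 'classifier UDF': majority vote over the per-segment models."""
    votes = sum(predict(m, x) for m in models)
    return 1 if votes * 2 > len(models) else 0

# In the real setup each segment would be handled by a separate map task.
segments = [[1.0, 2.0, 3.0], [2.0, 4.0], [0.0, 6.0]]
models = [train_threshold_model(s) for s in segments]
label = ensemble_vote(models, 5.0)
```

The models here are deliberately trivial; the structural point is that training is embarrassingly parallel across segments, and only the small per-segment models need to be gathered for the voting step.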