I'm looking at ways to make streaming Python UDFs faster. In PIG-2417 (
https://issues.apache.org/jira/browse/PIG-2417) we used Pig's existing
serialization code (with modifications to handle more complex serialization
cases), but we're finding that serialization/deserialization is eating
up a lot of time.

I was looking at some options for libraries that handle serialization, and
Apache Avro seemed like the best bet. I was thinking that I could create
an Avro schema for each UDF being called (based on the Pig schema, which
I already have) and then stream data through standard input to Python, where
I'd deserialize it. Avro seemed best because it doesn't require any
code generation and it would be easy to create the data schema for each
UDF.
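
To make the schema step concrete, here is a rough sketch of how a flat Pig
schema might be turned into an Avro record schema. The function name, the
"_input" suffix, and the (name, type) pair format are made up for
illustration, and nested types (tuple, bag, map) are left out. Each field is
a union with "null" on the assumption that any Pig value may be null.

```python
import json

# Assumed mapping from Pig primitive type names to Avro primitive types.
PIG_TO_AVRO = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "chararray": "string",
    "bytearray": "bytes",
    "boolean": "boolean",
}


def pig_fields_to_avro_schema(udf_name, fields):
    """Build an Avro record schema (as a dict) from (name, pig_type) pairs.

    Hypothetical helper: only flat schemas of primitive types are handled.
    """
    avro_fields = []
    for name, pig_type in fields:
        avro_type = PIG_TO_AVRO[pig_type]
        # Pig values can be null, so each field is a union with "null".
        avro_fields.append({"name": name, "type": ["null", avro_type]})
    return {
        "type": "record",
        "name": udf_name + "_input",
        "fields": avro_fields,
    }


# Example: a UDF taking (user: chararray, clicks: long).
schema = pig_fields_to_avro_schema(
    "my_udf", [("user", "chararray"), ("clicks", "long")]
)
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

The resulting JSON string is what would be handed to the Avro library on
both the Java and the Python side.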

The problem I'm finding is that Avro doesn't seem to support reading data
from a stream on the Python side (
https://issues.apache.org/jira/browse/AVRO-959). I'm going to follow up on
that on the Avro list, but I'm wondering if people have other suggestions
for how to do the serialization.
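
One possible workaround, independent of which serialization library ends up
doing the encoding, would be to frame each record with a length prefix so
the Python side can pull whole records off the stream and decode them one at
a time. This is only a sketch using the standard library: the 4-byte
big-endian prefix and the function names are assumptions, and the payloads
are plain bytes standing in for serialized records.

```python
import io
import struct


def write_record(stream, payload):
    """Write one record as a 4-byte big-endian length followed by the bytes."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield record payloads from a binary stream until a clean end of input."""
    while True:
        header = stream.read(4)
        if not header:
            return  # end of stream on a record boundary
        if len(header) < 4:
            raise EOFError("truncated length prefix")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated record")
        yield payload


# Round trip through an in-memory buffer standing in for the pipe.
buf = io.BytesIO()
for p in (b"first", b"", b"third record"):
    write_record(buf, p)
buf.seek(0)
records = list(read_records(buf))
print(records)
```

In the UDF process the stream would be standard input opened in binary mode
(sys.stdin.buffer on Python 3), and each payload would be passed to the
decoder for that UDF's schema.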

Thanks,

-- 

Jeremy Karn / Lead Developer
MORTAR DATA / 519 277 4391 / www.mortardata.com
