[ https://issues.apache.org/jira/browse/GIRAPH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nitay Joffe updated GIRAPH-717:
-------------------------------
Description:
This adds support for pure Jython jobs. Currently this runner is hooked up to
work with Hive. I'll make it more generic later.
Running a Jython job is simply:
HIVE_HOME=<x>
HADOOP_HOME=<y>
$HIVE_HOME/bin/hive --service jar <giraph-hive-jar> \
    org.apache.giraph.hive.jython.HiveJythonRunner [jython1.py] [jython2.py]
You can pass in any number of scripts. They will be parsed in order and sent to
all the workers using DistributedCache.
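As a rough illustration of the invocation above, here is a small Python sketch that assembles the same command line for any number of scripts. The helper name build_runner_command and the example paths are made up for illustration; only the "hive --service jar" form and the runner class name come from the description.

```python
import os

# Runner class name as given in the description.
RUNNER_CLASS = "org.apache.giraph.hive.jython.HiveJythonRunner"


def build_runner_command(hive_home, giraph_hive_jar, scripts):
    """Return the argv list for launching the runner (hypothetical helper)."""
    hive_bin = os.path.join(hive_home, "bin", "hive")
    # Scripts keep the order given, since they are parsed in order.
    return [hive_bin, "--service", "jar", giraph_hive_jar, RUNNER_CLASS] + list(scripts)


if __name__ == "__main__":
    # Placeholder paths, not real locations.
    argv = build_runner_command("/opt/hive", "giraph-hive.jar",
                                ["jython1.py", "jython2.py"])
    print(" ".join(argv))
```

The result is a plain argument list, so it could be handed to subprocess.call() with HADOOP_HOME set in the environment.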
There are examples and tests in the diff. Here is one example:
launcher: https://gist.github.com/nitay/a62e0a5d369a5e701fa3
worker: https://gist.github.com/nitay/7834fd2b059527e65a36
There are a few pieces to a Jython job; I'll go over each part here.
The launcher defines the graph types (those IVEMM writables) and sets up the
Hive vertex/edge inputs and output. Each graph type is one of the following:
1) A Java type. For example, the user can simply specify IntWritable.
2) A Jython type that implements Writable. In the example above the message
value implements Writable.
3) A pure Jython type. The Java code will wrap these objects in a Writable
wrapper that serializes Jython values using Pickle (jython IO framework).
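To make option 3 more concrete, here is a minimal sketch of the idea in plain Python: a wrapper that serializes an arbitrary value with pickle and exposes write/read methods in the spirit of a Writable. The class name PickledValue, the method names, and the length-prefixed layout are assumptions for illustration; the actual wrapper lives in the Java code and may work differently.

```python
import io
import pickle
import struct


class PickledValue(object):
    """Hypothetical stand-in for a Writable wrapper around a pure Python value."""

    def __init__(self, value=None):
        self.value = value

    def write(self, out):
        # Length-prefix the pickled bytes so the reader knows how much to consume.
        payload = pickle.dumps(self.value, protocol=2)
        out.write(struct.pack(">i", len(payload)))
        out.write(payload)

    def read_fields(self, stream):
        # Read the 4-byte big-endian length, then unpickle exactly that many bytes.
        (length,) = struct.unpack(">i", stream.read(4))
        self.value = pickle.loads(stream.read(length))


if __name__ == "__main__":
    buf = io.BytesIO()
    PickledValue({"rank": 0.15, "neighbors": [1, 2, 3]}).write(buf)
    buf.seek(0)
    restored = PickledValue()
    restored.read_fields(buf)
    print(restored.value)
```

The point of the sketch is only that the user's type needs no serialization code of its own; the wrapper handles it generically.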
For Hive usage: if your value type is a primitive, e.g. IntWritable or
LongWritable, you need not do anything. The Java code will automatically
read/write the Hive table specified and convert between Hive types and the
primitive Writable. The vertex_id type in the example works like this.
If your value is a custom Jython type, you must create classes which implement
JythonHiveReader/JythonHiveWriter (or JythonHiveIO which is both). These
objects read/write Jython types from Hive. There are wrappers in the Java code
which take the HiveIO data normally used in giraph-hive and turn it into Jython
types. This means, for example, that getMap() will return a Jython dictionary
instead of a Java Map.
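A custom value type with its own Hive reader/writer might look roughly like the sketch below. Everything here is hypothetical: FakeHiveRecord stands in for the wrapped record, and the method names read_from_hive and write_to_hive are placeholders, not the real JythonHiveReader/JythonHiveWriter signatures (see the diff for those). The only detail taken from the description is that getMap() hands back a plain dictionary; the column-index argument is an assumption.

```python
class FakeHiveRecord(object):
    """Stand-in for the wrapped Hive record (not the real giraph-hive class)."""

    def __init__(self, columns):
        self._columns = columns

    def getMap(self, index):
        # After wrapping, map columns arrive as plain dictionaries.
        return dict(self._columns[index])


class WeightedValue(object):
    """Example custom value type, made up for this sketch."""

    def __init__(self, weights=None):
        self.weights = weights or {}


class WeightedValueHiveIO(object):
    """Hypothetical reader/writer pair; method names are assumptions."""

    def read_from_hive(self, value, record, column):
        # No Java Map conversion needed: the record already yields a dict.
        value.weights = record.getMap(column)

    def write_to_hive(self, value, row, column):
        # 'row' is a plain dict here, standing in for a writable Hive row.
        row[column] = dict(value.weights)


if __name__ == "__main__":
    record = FakeHiveRecord({1: {"a": 0.5, "b": 1.5}})
    value = WeightedValue()
    WeightedValueHiveIO().read_from_hive(value, record, 1)
    print(value.weights)
```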
There is also a PageRankBenchmark (from previous diff) implemented in Jython.
Here's a run for comparison / sanity check:
PageRankBenchmark with 10 workers, 100M vertices, 10B edges, 10 compute threads
trunk:
https://gist.github.com/nitay/3170fa3b575d4d2e22a9
total time: 302466
with this diff:
https://gist.github.com/nitay/a52b6d1d64e50ab9829e
total time: 306517
in jython:
https://gist.github.com/nitay/3f2e758b2933c3521727
total time: 434730
So we see that existing things are essentially unaffected (about 1% slower; is
there something else I should test?) and that Jython has around 40% overhead
(434730 vs. 306517, roughly 42%).
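For reference, the overhead figures can be recomputed from the three totals above with a few lines of Python. The function name overhead_percent is just for this sketch; the numbers are the ones reported in the runs.

```python
def overhead_percent(baseline, measured):
    """Relative slowdown of 'measured' over 'baseline', in percent."""
    return 100.0 * (measured - baseline) / baseline


# Total times as reported above (units as printed by the benchmark).
TRUNK = 302466
WITH_DIFF = 306517
JYTHON = 434730

if __name__ == "__main__":
    print("diff vs trunk:   %.1f%%" % overhead_percent(TRUNK, WITH_DIFF))
    print("jython vs diff:  %.1f%%" % overhead_percent(WITH_DIFF, JYTHON))
    print("jython vs trunk: %.1f%%" % overhead_percent(TRUNK, JYTHON))
```

This gives about 1.3% for the diff against trunk, and about 41.8% or 43.7% for Jython depending on which Java run is taken as the baseline.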
ReviewBoard: https://reviews.apache.org/r/12543/ (Sorry it's a big one, hard to
split up :/)
was:
This adds support for pure Jython jobs. Currently this runner is hooked up to
work with Hive. I'll make it more generic later.
A Jython job is made up of two Jython scripts:
1) launcher: this script is used to configure the job. It is only interpreted
locally.
2) worker: this script is distributed to every worker and is used there.
Running a Jython job is simply:
HIVE_HOME=<x>
HADOOP_HOME=<y>
$HIVE_HOME/bin/hive --service jar <giraph-hive-jar> \
    org.apache.giraph.hive.jython.HiveJythonRunner jython \
    --launcher <launcher.py> --worker <worker.py>
> HiveJythonRunner with support for pure Jython value types.
> ----------------------------------------------------------
>
> Key: GIRAPH-717
> URL: https://issues.apache.org/jira/browse/GIRAPH-717
> Project: Giraph
> Issue Type: Bug
> Reporter: Nitay Joffe
> Assignee: Nitay Joffe