[ https://issues.apache.org/jira/browse/GIRAPH-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nitay Joffe updated GIRAPH-683:
-------------------------------

    Description:

Support for writing Computation code in Python. We add Jython bindings so that the Python computation code can communicate back with the Java Giraph classes.

To make this work I had to change a few parts of Giraph:

1) The Jython computation is not known until we read the script and create a Computation object for it at runtime. This has to be done on each worker separately after the job has launched. Because of this, there is no Computation class set at the beginning. I suspect other scripting languages will have a similar issue. To fix this I created a ComputationFactory interface which is responsible for creating the Computation, with a default that just grabs the class from the Configuration and creates it.

2) I created a GiraphTypes class to hold the I,V,E,M1,M2 classes. There was a lot of repetitive code around these things, so centralizing it all in one place made things a lot cleaner.

3) I added some more helpers like isDefaultValue() to our conf options.

To use Jython, all the user has to do is call Jython#init(...) somewhere in their initialization.

This patch contains our PageRank benchmark implementation in Jython. I added an option (--jython) which chooses whether to run the default or the Jython version.
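To illustrate change 1), here is a minimal sketch of the factory idea in plain Python. It is only a model of the pattern described above, not the code in the patch: every name in it (ClassNameFactory, ScriptFactory, the configuration keys, make_computation) is hypothetical and none of it is Giraph's or Jython's API. The point it shows is that the job only needs to know the factory up front, while the computation class itself may not exist until a script has been evaluated on the worker.

```python
# Hypothetical model of the ComputationFactory idea; not Giraph's real API.

class Computation:
    """Stand-in for a per-vertex computation."""
    def compute(self, vertex, messages):
        raise NotImplementedError


class ClassNameFactory:
    """Default behaviour: the computation class is already in the configuration."""
    def create(self, conf):
        return conf["computation.class"]()


class ScriptFactory:
    """Script behaviour: the class only exists after the script is evaluated."""
    def create(self, conf):
        namespace = {"Computation": Computation}
        # In the scenario described above this would happen on each worker,
        # after the job has launched.
        exec(conf["script.source"], namespace)
        return namespace[conf["script.class"]]()


class SumMessages(Computation):
    def compute(self, vertex, messages):
        return sum(messages)


# A user script that defines its computation class at runtime.
SCRIPT = """
class MaxMessages(Computation):
    def compute(self, vertex, messages):
        return max(messages)
"""


def make_computation(conf):
    # The caller never names the computation class, only the factory.
    return conf["factory"].create(conf)


compiled = make_computation({
    "factory": ClassNameFactory(),
    "computation.class": SumMessages,
})
scripted = make_computation({
    "factory": ScriptFactory(),
    "script.source": SCRIPT,
    "script.class": "MaxMessages",
})
```

Both objects are used the same way afterwards (compiled.compute(...) and scripted.compute(...)), which is why the rest of the framework does not need to care where the class came from.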
Here is the initial PageRankBenchmark comparison (4 workers, 10M vertices, 25 edges per vertex):

Java:
  Total (milliseconds)              104,388   0   104,388
  Superstep 3 (milliseconds)         16,750   0    16,750
  Setup (milliseconds)                2,895   0     2,895
  Shutdown (milliseconds)                50   0        50
  Superstep 0 (milliseconds)         15,838   0    15,838
  Superstep 4 (milliseconds)         19,088   0    19,088
  Input superstep (milliseconds)      8,700   0     8,700
  Superstep 5 (milliseconds)          3,550   0     3,550
  Superstep 2 (milliseconds)         17,905   0    17,905
  Superstep 1 (milliseconds)         19,608   0    19,608

Jython:
  Total (milliseconds)              244,965   0   244,965
  Superstep 3 (milliseconds)         43,405   0    43,405
  Setup (milliseconds)                3,735   0     3,735
  Shutdown (milliseconds)               117   0       117
  Superstep 0 (milliseconds)         36,962   0    36,962
  Superstep 4 (milliseconds)         46,088   0    46,088
  Input superstep (milliseconds)      8,551   0     8,551
  Superstep 5 (milliseconds)         22,040   0    22,040
  Superstep 2 (milliseconds)         42,329   0    42,329
  Superstep 1 (milliseconds)         41,737   0    41,737

Overhead of Jython vs Java = 2.5x.

However at scale things get better (200 workers, 1B vertices, 200 edges per vertex):

Java:
  Total (milliseconds)              1,702,429   0   1,702,429
  Superstep 3 (milliseconds)          316,844   0     316,844
  Setup (milliseconds)                 13,226   0      13,226
  Shutdown (milliseconds)                 113   0         113
  Superstep 0 (milliseconds)          300,950   0     300,950
  Superstep 4 (milliseconds)          318,627   0     318,627
  Input superstep (milliseconds)      114,673   0     114,673
  Superstep 5 (milliseconds)            7,898   0       7,898
  Superstep 2 (milliseconds)          312,152   0     312,152
  Superstep 1 (milliseconds)          317,942   0     317,942

Jython:
  Total (milliseconds)              2,123,228   0   2,123,228
  Superstep 3 (milliseconds)          406,422   0     406,422
  Setup (milliseconds)                  7,159   0       7,159
  Shutdown (milliseconds)                 131   0         131
  Superstep 0 (milliseconds)          347,732   0     347,732
  Superstep 4 (milliseconds)          405,696   0     405,696
  Input superstep (milliseconds)      112,645   0     112,645
  Superstep 5 (milliseconds)           46,687   0      46,687
  Superstep 2 (milliseconds)          410,349   0     410,349
  Superstep 1 (milliseconds)          386,404   0     386,404

That's a mere 25% overhead.
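For anyone who wants to re-derive the overhead figures, the short Python sketch below divides the Jython total by the Java total for each run. The numbers are copied from the Total rows above; the helper name slowdown and the dictionary layout are just illustrative choices. On these totals the small run comes out at roughly 2.35x (which the description rounds to 2.5x) and the large run at roughly 1.25x, i.e. about 25% overhead.

```python
# Total wall-clock times in milliseconds, copied from the "Total" rows above.
totals_ms = {
    "4 workers, 10M vertices, 25 edges/vertex": {
        "java": 104_388,
        "jython": 244_965,
    },
    "200 workers, 1B vertices, 200 edges/vertex": {
        "java": 1_702_429,
        "jython": 2_123_228,
    },
}


def slowdown(java_ms, jython_ms):
    """Return the Jython time as a multiple of the Java time."""
    return jython_ms / java_ms


ratios = {
    name: slowdown(t["java"], t["jython"]) for name, t in totals_ms.items()
}

for name, ratio in ratios.items():
    overhead_pct = (ratio - 1) * 100
    print(f"{name}: {ratio:.2f}x ({overhead_pct:.0f}% overhead)")
```

The same function can be applied to individual superstep rows to see where the extra time goes.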
Take a look at the reviewboard for latest patch: https://reviews.apache.org/r/11709/

> Jython for Computation
> ----------------------
>
>                 Key: GIRAPH-683
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-683
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe