Fwd: [jira] Commented: (PIG-1053) Consider moving to Hadoop for local mode

Alan Gates Mon, 26 Oct 2009 15:22:49 -0700

Forwarding this to pig-user, as many pig users may want to give feedback on this issue.


Alan.


Begin forwarded message:

From: "Alan Gates (JIRA)" <[email protected]>
Date: October 26, 2009 3:18:59 PM PDT
To: <[email protected]>
Subject: [jira] Commented: (PIG-1053) Consider moving to Hadoop for local mode
Reply-To: [email protected]
[ https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770237#action_12770237 ]
Alan Gates commented on PIG-1053:
---------------------------------
Currently Pig has its own backend implementation framework that it uses for executing Pig Latin scripts on a single box (as opposed to in a Hadoop cluster), referred to as local mode. Having a separate implementation has several drawbacks:
1) It does not offer the same functionality as Hadoop. A number of things do not work, such as counters, slicers, etc.
2) UDFs (both eval and load/store functions) are often forced to understand both contexts, and test whether they are working in local or Hadoop mode.
3) Additional code maintenance, as Pig is forced to maintain its own framework. Going forward, as Pig attempts to leverage more MapReduce functionality (see for example PIG-966), maintaining this separate mode is becoming a larger and larger effort.
4) It makes debugging harder for users and UDF writers, as the execution environment on a local box differs from that on the production cluster.
Pig's local mode has one very serious advantage over Hadoop in local mode. It is much faster, about 15 times faster. Hadoop is designed for large data sets and thus is not optimized to handle the start up and tear down involved in small data jobs.
For debugging of code, this performance factor should not be that big an issue. Where the performance becomes prohibitive is functionality like ILLUSTRATE. Taking 30 seconds to give a sample of data running through your script is excessive compared to 2 seconds.
So, which of these pain points is worse? Originally we felt the performance was more important. But as we see many user complaints about the above listed drawbacks and relatively few users using local mode in performance intensive ways, we are wondering if we made that choice correctly. Please give your feedback one way or another.
Consider moving to Hadoop for local mode
----------------------------------------

               Key: PIG-1053
               URL: https://issues.apache.org/jira/browse/PIG-1053
           Project: Pig
        Issue Type: Improvement
          Reporter: Alan Gates
We need to consider moving Pig to use Hadoop's local mode instead of its own.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
