Forwarding this to pig-user, as many Pig users may want to give feedback on this issue.

Alan.

Begin forwarded message:

From: "Alan Gates (JIRA)" <[email protected]>
Date: October 26, 2009 3:18:59 PM PDT
To: <[email protected]>
Subject: [jira] Commented: (PIG-1053) Consider moving to Hadoop for local mode
Reply-To: [email protected]


[ https://issues.apache.org/jira/browse/PIG-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770237#action_12770237 ]

Alan Gates commented on PIG-1053:
---------------------------------

Currently Pig has its own backend implementation framework that it uses for executing Pig Latin scripts on a single box (as opposed to in a Hadoop cluster), referred to as local mode. Having a separate implementation has several drawbacks:

1) It does not offer the same functionality as Hadoop. A number of things do not work, such as counters, slicers, etc.
2) UDFs (both eval and load/store functions) are often forced to understand both contexts, and test whether they are working in local or Hadoop mode.
3) Additional code maintenance, as Pig is forced to maintain its own framework. Going forward, as Pig attempts to leverage more Map Reduce functionality (see for example PIG-966), maintaining this separate mode is becoming a larger and larger effort.
4) It makes debugging harder for users and UDF writers, as the execution environment on a local box differs from that on the production cluster.

Pig's local mode has one very significant advantage over Hadoop's local mode: it is about 15 times faster. Hadoop is designed for large data sets and thus is not optimized for the startup and teardown overhead involved in small data jobs.

For debugging of code, this performance factor should not be that big an issue. Where the performance becomes prohibitive is functionality like ILLUSTRATE. Taking 30 seconds to give a sample of data running through your script is excessive compared to 2 seconds.

So, which of these pain points is worse? Originally we felt the performance was more important. But as we see many user complaints about the drawbacks listed above, and relatively few users using local mode in performance-intensive ways, we are wondering whether we made that choice correctly. Please give your feedback one way or the other.


Consider moving to Hadoop for local mode
----------------------------------------

               Key: PIG-1053
               URL: https://issues.apache.org/jira/browse/PIG-1053
           Project: Pig
        Issue Type: Improvement
          Reporter: Alan Gates

We need to consider moving Pig to use Hadoop's local mode instead of its own.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

