This paper seems very relevant to the proposal - "Compiled Query Execution Engine using JVM" http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.40
From the abstract - "Our experimental results on the TPC-H data set show that, despite both engines benefiting from JIT, the compiled engine runs on average about twice as fast as the interpreted one, and significantly faster than an in-memory" (I don't have access to the full paper though).

-Thejas

On 4/16/09 9:26 AM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> Your understanding of the proposal is correct. The goal would be to
> produce Java code rather than a pipeline configuration. But the
> reasoning is not so that users can then take that code and modify it
> themselves. There's nothing preventing them from doing it, but it has
> a couple of major drawbacks.
>
> 1) Code generators generally generate horrific-looking code, because
> they are going for speed and compactness, not human maintainability.
> Trying to work in that code would be very difficult.
>
> 2) If you start adding code to generated code, you can no longer use
> the original Pig Latin. You are from that point forward stuck in
> Java, since you can't backport your Java into the Pig Latin.
>
> The proposal is designed to test the performance of Pig based on
> generated Java (or for that matter any other language; it need not be
> Java). For the idea you suggest, the NATIVE keyword (proposed here:
> https://issues.apache.org/jira/browse/PIG-506)
> is a better solution.
>
> Alan.
>
> On Apr 16, 2009, at 12:54 AM, nitesh bhatia wrote:
>
>> Hi
>> Can you briefly explain what is required in the first project? After
>> reading the description, my impression is that currently, when we
>> execute commands in the Pig shell, Pig first converts them to
>> map-reduce jobs and then feeds them to Hadoop. In this project, are
>> we proposing that the execution plan made by Pig will first be
>> converted to a Java file for the map-reduce procedure and then fed
>> to Hadoop?
>>
>> If this is the case, then I am sure it will be a great help to users,
>> as this functionality can be used to write complicated map-reduce
>> jobs very easily. Initially the user can write the Pig scripts /
>> commands required for his job and get the map-reduce Java files.
>> Then he can edit the map-reduce files to extend the functionality
>> and add extra procedures that are not provided by Pig but can be
>> executed over Hadoop.
>>
>> --nitesh
>>
>> On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki <wikidi...@apache.org>
>> wrote:
>>
>>> Dear Wiki user,
>>>
>>> You have subscribed to a wiki page or wiki category on "Pig Wiki" for
>>> change notification.
>>>
>>> The following page has been changed by AlanGates:
>>> http://wiki.apache.org/pig/ProposedProjects
>>>
>>> New page:
>>> = Proposed Pig Projects =
>>> This page describes projects that we (the committers) would like to
>>> see added to Pig. The scale of these projects varies, but they are
>>> larger projects, usually on the scale of weeks or months. We have not
>>> yet filed [https://issues.apache.org/jira/browse/PIG JIRAs] for some
>>> of these because they are still in the vague idea stage. As they
>>> become more concrete,
>>> [https://issues.apache.org/jira/browse/PIG JIRAs] will be filed for
>>> them.
>>>
>>> We welcome contributors to take on one of these projects. If you
>>> would like to do so, please file a JIRA (if one does not already
>>> exist for the project) with a proposed solution. Pig's committers
>>> will work with you from there to help refine your solution. Once a
>>> solution is agreed upon, you can begin implementation.
>>>
>>> If you see a project here that you would like to see Pig implement
>>> but you are not in a position to implement the solution right now,
>>> feel free to vote for the project. Add your name to the list of
>>> supporters.
>>> This will help contributors looking for a project to select one that
>>> will benefit many users.
>>>
>>> If you would like to propose a project for Pig, feel free to add to
>>> this list. If it is a smaller project, or something you plan to begin
>>> work on immediately, filing a
>>> [https://issues.apache.org/jira/browse/PIG JIRA] is a better route.
>>>
>>> || Category || Project || JIRA || Proposed By || Votes For ||
>>> || Execution || Pig currently executes scripts by building a pipeline
>>> of pre-built operators and running data through those operators in
>>> map reduce jobs. We need to investigate instead having Pig generate
>>> Java code specific to a job, and then compiling that code and using
>>> it to run the map reduce jobs. || || Many conference attendees ||
>>> gates ||
>>> || Language || Currently only DISTINCT, ORDER BY, and FILTER are
>>> allowed inside FOREACH. All operators should be allowed in FOREACH.
>>> (Limit is being worked on
>>> [https://issues.apache.org/jira/browse/PIG-741 741]) || || gates ||
>>> ||
>>> || Optimization || Speed up comparison of tuples during shuffle for
>>> ORDER BY || [https://issues.apache.org/jira/browse/PIG-659 659] ||
>>> olgan || ||
>>> || Optimization || Order by should be changed to not use POPackage to
>>> put all of the tuples in a bag on the reduce side, as the bag is just
>>> immediately flattened. It can instead work like join does for the
>>> last input in the join. || || gates || ||
>>> || Optimization || Often in a Pig script that produces a chain of MR
>>> jobs, the map phases of the 2nd and subsequent jobs do very little.
>>> What little they do should be pushed into the preceding reduce and
>>> the map replaced by the identity mapper. Initial tests showed that
>>> the identity mapper was 50% faster than using a Pig mapper (because
>>> Pig uses the loader to parse out tuples even if the map itself is
>>> empty).
>>> || [https://issues.apache.org/jira/browse/PIG-480 480] || olgan ||
>>> gates ||
>>> || Optimization || Use hand crafted calls to do string to integer or
>>> float conversions. Initial tests showed these could be done about 8x
>>> faster than String.toIntger() and String.toFloat(). ||
>>> [https://issues.apache.org/jira/browse/PIG-482 482] || olgan ||
>>> gates ||
>>> || Optimization || Currently Pig always samples for an ORDER BY to
>>> determine how to partition, and then runs another job to do the sort.
>>> For small enough inputs, it should just sort with a single reducer.
>>> || [https://issues.apache.org/jira/browse/PIG-483 483] || olgan || ||
>>> || Optimization || In many cases data to be joined is already sorted
>>> and partitioned on the same key. Pig needs to be able to take
>>> advantage of this and do these joins in the map. The join could be
>>> done by sampling one input to determine the value of the join key at
>>> the beginning of every HDFS block. This would form an index. Then a
>>> second MR job can be run with the other input. Based on the key seen
>>> in the second input, the appropriate blocks of the first input can
>>> also be loaded into the map and the join done. || || gates || ||
>>> || Optimization || The combiner is not currently used if FILTER is in
>>> the FOREACH. In some cases it could still be used. ||
>>> [https://issues.apache.org/jira/browse/PIG-479 479] || olgan || ||
>>> || Optimization || Currently when types of data are declared, Pig
>>> inserts a FOREACH immediately after the LOAD that does the
>>> conversions. These conversions should be delayed until the field is
>>> actually used. || [https://issues.apache.org/jira/browse/PIG-410 410]
>>> || olgan || gates ||
>>> || Optimization || When an order by is not the only operation in a
>>> Pig script, it is done in two additional MR jobs.
>>> The first job samples using a sampling loader; the second does the
>>> sort. The sample is used to construct a partitioner that equally
>>> balances the data in the sort. The sampler needs to be changed to be
>>> an !EvalFunc instead of a loader. This way a split can be put in the
>>> preceding MR job, with the main data being written out and the other
>>> part flowing to the sampler func, which can then write out the
>>> sample. The final MR job can then be the sort. || || gates || ||
>>> || Optimization || When an order by is the only operation in a Pig
>>> script, it is currently done in 3 MR jobs. The first converts it to
>>> BinStorage format (because the sample loader reads that format), the
>>> second samples, and the third sorts. Once the changes mentioned above
>>> to make the sampler an !EvalFunc are done, it should be changed to be
>>> done in 2 MR jobs instead of 3. ||
>>> [https://issues.apache.org/jira/browse/PIG-460 460] || gates || ||
>>> || Optimization || The Pig optimizer should be used to determine when
>>> fields in a record are no longer needed and put in FOREACH statements
>>> to project out the unnecessary data as early as possible. ||
>>> [https://issues.apache.org/jira/browse/PIG-466 466] || olgan || ||
>>> || Optimization || The Pig optimizer needs to call fieldsToRead so
>>> that Load functions that can do column skipping do it. || || gates ||
>>> ||
>>> || Scalability || Pig's default join (symmetric hash) currently
>>> depends on being able to fit all of the values for a given join key
>>> for one of the inputs into memory. (It does try to spill to disk in
>>> the case where it cannot fit them all into memory. In practice this
>>> often fails, as it is not good at understanding when memory is low
>>> enough that it should spill. Even in the case where it does not fail,
>>> spilling to disk and rereading from disk is very slow.)
>>> Instances of keys with a large number of values could instead be
>>> broken up so that the row set fits in memory and then shipped to
>>> multiple reducers. A sampling pass would need to be done first to
>>> determine which keys to break up. || || chris olston || gates ||
>>>
>>
>> --
>> Nitesh Bhatia
>> Dhirubhai Ambani Institute of Information & Communication Technology
>> Gandhinagar
>> Gujarat
>>
>> "Life is never perfect. It just depends where you draw the line."
>>
>> visit:
>> http://www.awaaaz.com - connecting through music
>> http://www.volstreet.com - lets volunteer for better tomorrow
>> http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
>
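
For readers of this thread, the difference between a pipeline of pre-built operators and code generated for one specific job can be sketched roughly as below. This is only an illustration under stated assumptions: the class PipelineVsGenerated, the Operator interface, and the helper methods are hypothetical names made up for the example and are not Pig classes, and the "generated" method is written by hand to stand in for what a code generator might emit for the two-statement script shown in its comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names come from Pig.
final class PipelineVsGenerated {

    // Pipeline style: a generic operator that is applied to every record.
    interface Operator {
        // Returns the transformed record, or null if the record is dropped.
        Object[] apply(Object[] record);
    }

    // Generic FILTER operator: keeps a record only if the predicate holds.
    static Operator filter(Predicate<Object[]> predicate) {
        return record -> predicate.test(record) ? record : null;
    }

    // Generic projection operator: keeps only the listed column positions.
    static Operator project(int... columns) {
        return record -> {
            Object[] out = new Object[columns.length];
            for (int i = 0; i < columns.length; i++) {
                out[i] = record[columns[i]];
            }
            return out;
        };
    }

    // Runs every record through the chain of operators, one virtual call
    // per operator per record.
    static List<Object[]> runPipeline(List<Object[]> input, List<Operator> ops) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] record : input) {
            Object[] current = record;
            for (Operator op : ops) {
                current = op.apply(current);
                if (current == null) {
                    break; // record was filtered out
                }
            }
            if (current != null) {
                out.add(current);
            }
        }
        return out;
    }

    // Generated style: code written for exactly one script, assumed here to be
    //   B = FILTER A BY age > 30;
    //   C = FOREACH B GENERATE name;
    // with the assumed schema A: (name, age). No operator objects are involved.
    static List<String> runGenerated(List<Object[]> input) {
        List<String> out = new ArrayList<>();
        for (Object[] record : input) {
            if ((Integer) record[1] > 30) {
                out.add((String) record[0]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> input = new ArrayList<>();
        input.add(new Object[] {"ann", 35});
        input.add(new Object[] {"bob", 25});

        List<Operator> ops =
                Arrays.asList(filter(r -> (Integer) r[1] > 30), project(0));
        for (Object[] record : runPipeline(input, ops)) {
            System.out.println(Arrays.toString(record));
        }
        System.out.println(runGenerated(input));
    }
}
```

Both methods compute the same result. The second one also hints at why generated code is awkward to edit by hand: the script's logic is fused into a single loop with hard-coded column positions and casts.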