2009/10/31 Santhosh Srinivasan <s...@yahoo-inc.com>

> > Misc question: Do you anticipate that Pig will be compatible with
> > Hadoop 0.20?
>
> The Hadoop 0.20-compatible version, Pig 0.5.0, will be released
> shortly. The release got the required votes.
>
Thanks, I will watch out for that, and anticipate using 0.5 for my study.

> > Finally, am I correct to assume that Pig is not Turing complete? I
> > am not clear on this. SQL is not Turing complete, whereas Java is.
> > So does that make Hive or Pig, for example, Turing complete or not?
>
> Short answer: Hive and Pig are not Turing complete. Turing
> completeness is a property of a particular language, not of the
> language implementing the language in question. Since Hive is
> SQL(-like), it's not Turing complete. Until Pig supports loops and
> conditional statements, Pig will not be Turing complete.

OK, as I thought. Thanks. I assume therefore that, as Java is Turing
complete, I would be able to illustrate this difference with a certain
query design that requires Turing completeness to execute?

> Santhosh
>
> -----Original Message-----
> From: Rob Stewart [mailto:robstewar...@googlemail.com]
> Sent: Saturday, October 31, 2009 11:22 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: Follow Up Questions: PigMix, DataGenerator etc...
>
> Alan,
>
> Thanks for getting in touch. I appreciate your time, given that it's
> clear you're busy popping up in Pig discussion videos on Vimeo and
> YouTube just now. See my responses below.
>
> I intend to get a good feel for the data generation, and to see,
> first, how easily the various interfaces (Pig, JAQL, etc.) can plug
> into the same file structures, and second, how easily and fairly I
> would be able to port my queries from one interface to the next.
>
> Misc question: Do you anticipate that Pig will be compatible with
> Hadoop 0.20?
>
> Finally, am I correct to assume that Pig is not Turing complete? I am
> not clear on this. SQL is not Turing complete, whereas Java is. So
> does that make Hive or Pig, for example, Turing complete or not?
>
> Again, see my responses below, and thanks again.
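[Editorial note: the Turing-completeness contrast discussed above can be made concrete with a query such as graph reachability (transitive closure), which needs a loop that runs until a fixed point, exactly the control flow Pig Latin and SQL lack, so a driver in a Turing-complete host language must supply it. A minimal Python sketch, purely illustrative; the names are invented here, not part of Pig:]

```python
def transitive_closure(edges):
    """Repeatedly join the edge set with itself until no new reachable
    pairs appear -- an unbounded loop to a fixed point."""
    reachable = set(edges)
    while True:
        new_pairs = {(a, d)
                     for (a, b) in reachable
                     for (c, d) in reachable if b == c}
        if new_pairs <= reachable:  # fixed point: nothing new derived
            return reachable
        reachable |= new_pairs

# A chain 1 -> 2 -> 3 -> 4 also yields the derived pairs
# (1, 3), (2, 4), and (1, 4).
print(transitive_closure({(1, 2), (2, 3), (3, 4)}))
```

A single (loop-free) Pig Latin script can express one self-join, but not "join until nothing changes", which is why such queries would require an external driver around Pig.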
>
> Rob Stewart
>
>
> 2009/10/30 Alan Gates <ga...@yahoo-inc.com>
>
> > On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:
> >
> >> Hi there.
> >>
> >> As some of you may have read on this mailing list previously, I'm
> >> studying various interfaces with Hadoop, one of those being Pig.
> >>
> >> I have three further questions. I am now beginning to think about
> >> the design of my testing (query design, indicators of performance,
> >> etc.).
> >>
> >> 1:
> >> I have had a good look at the PigMix benchmark wiki page, and find
> >> it interesting that a few Pig queries now execute more quickly than
> >> the corresponding Java MapReduce implementation
> >> ( http://wiki.apache.org/pig/PigMix ). The following data
> >> processing functions in Pig outperform the Java equivalent:
> >> distinct aggregation
> >> anti join
> >> group by
> >> order by 1 field
> >> order by multiple fields
> >> distinct + join
> >> multi-store
> >>
> >> A few questions: Am I able to obtain the actual queries used for
> >> the PigMix benchmarking? And how about obtaining their Java Hadoop
> >> equivalents?
> >
> > https://issues.apache.org/jira/browse/PIG-200 has the code. The
> > original perf.patch has both the MR Java code and the Pig Latin
> > scripts. The data generator is also in this patch.
>
> I will check that code out, thanks.
>
> >> And how, technically, is this high performance achieved? I have
> >> read the paper "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia,
> >> Zheng Shao), and they were using a snapshot of Pig trunk from June
> >> 2009, showing Pig executing an aggregation query and a join query
> >> more slowly than Java Hadoop or Hive, but the aspect that interests
> >> me is the number of Map/Reduce tasks created by Pig and Hive. In
> >> what way does this number affect execution time performance?
> >> I have a feeling that Pig produces more Map/Reduce tasks than
> >> other interfaces, which may be beneficial where there is extremely
> >> skewed data. Am I wrong in thinking this, or is there another
> >> benefit to more Map/Reduce tasks? And how does Pig go about
> >> splitting a job into this number of tasks?
> >
> > Map and reduce parallelism are controlled differently in Hadoop.
> > Map parallelism is controlled by the InputSplit. The InputSplit
> > determines how many maps to start and which file blocks to assign
> > to which maps. In the case of PigMix, both the MR Java code and the
> > Pig code use some subclass of FileInputFormat, so the map
> > parallelism is the same in both tests. I do not know for sure, but
> > I believe Hive also uses FileInputFormat.
> >
> > Reduce parallelism is set explicitly as part of the job
> > configuration. In MapReduce this is done through the Java API. In
> > Pig it is done through the PARALLEL command. In PigMix, we set
> > parallelism the same for both (40, I believe, for this data size).
>
> I have a query about this procedure. It will warrant a simple answer,
> I assume, but I just need clarity on this. I am wondering how, for
> example, both the MR applications and the Pig programs will react if
> the number of map or reduce tasks is not specified. If, let's say, I
> were a programmer writing some Pig scripts where I do not know the
> skew of the data, my first execution of the Pig script would be done
> without any specification of the number of mappers or reducers. Is it
> not a more natural examination of Pig vs. MR apps where both Pig and
> the MR app have to decide these details for themselves? So my
> question is: why is it a fundamental requirement that the Pig script
> and the associated MR app be given figures for the initial Map/Reduce
> tasks?
>
> > In general, where Pig beats MR it is due to better algorithms.
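[Editorial note: the map-side arithmetic described above can be sketched as follows. This is a simplified model of FileInputFormat-style splitting; the 64 MB default block size is an assumption of this sketch, and the function name is invented:]

```python
import math

def num_map_tasks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """One map task per input split; by default a FileInputFormat-style
    splitter makes roughly one split per HDFS block (64 MB assumed)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

print(num_map_tasks(1024 ** 3))  # a 1 GiB input -> 16 map tasks
```

This is why map parallelism is identical for Pig and the MR code in PigMix: both read the same files through the same kind of InputFormat, so they get the same splits. Reduce parallelism, by contrast, is a number the job must supply.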
> > The MR code was written assuming a basic level of MR coding and
> > database knowledge. So, for example, in the order by queries, the
> > MR code achieves a total order by having a single reducer at the
> > end. Pig has a much more sophisticated system where it samples the
> > data, determines a distribution, and then uses multiple reducers
> > while maintaining a total order. So for large data sets Pig will
> > beat MR for these particular tests.
>
> Sounds very elegant, a really neat solution to skewed data. Is there
> some documentation of this process? I'd like to include that
> methodology in my report, and then display results like "skewed data
> vs. execution time", with trend lines for Pig, Hive, and MR apps. It
> would be nice to show that, as the skew of the data increases, Pig
> overtakes the corresponding MR app in execution performance.
>
> > It is, by definition, always possible to write MR Java code as fast
> > as Pig code, since Pig is implemented over MR. But that isn't what
> > PigMix is testing. PigMix aims to test how fast code is for what we
> > guess is a typical programmer choosing between MR and Pig. If you
> > want instead to test the best possible MR against the best possible
> > Pig Latin, the MR code in these tests should be rewritten.
>
> >> 2.
> >> This sort of leads onto my next question. I want to, in some way,
> >> be able to test the performance of Pig against the others when
> >> dealing with a dataset of extremely skewed data. I have a vague
> >> understanding of what this may mean, but I do need clarity on the
> >> definition of skewed data, and the effect it has on the
> >> performance of the Map/Reduce model.
>
> > In practice we see that almost all the data we deal with at Yahoo
> > is power-law distributed. Hence we built most of the PigMix data
> > that way as well. We used Zipf as a definition of skewed, as it
> > turned out to be a reasonably close match to our data.
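[Editorial note: the sample-then-range-partition idea behind Pig's order by can be sketched roughly like this. It is a toy model with invented names, not Pig's actual code: sample the keys, pick quantile cut points, and route each key to a reducer by range, so concatenating the reducers' sorted outputs in order gives a total order without a single final reducer:]

```python
import bisect
import random

def build_range_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 cut points from a sorted sample so each
    reducer receives a roughly equal share of the keys."""
    s = sorted(sample)
    return [s[(i * len(s)) // num_reducers] for i in range(1, num_reducers)]

def reducer_for(key, boundaries):
    """Route a key by range: all keys on reducer r sort before all keys
    on reducer r + 1, which is what preserves the total order."""
    return bisect.bisect_right(boundaries, key)

random.seed(0)
data = [random.randint(0, 999) for _ in range(10000)]
bounds = build_range_boundaries(random.sample(data, 1000), 4)
parts = [sum(1 for k in data if reducer_for(k, bounds) == r)
         for r in range(4)]
print(bounds, parts)  # four roughly equal partition sizes
```

On skewed data the sampling step is what matters: the cut points follow the observed distribution, so a hot region of the key space gets more reducers rather than overwhelming one.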
> > The interesting situation, for us anyway, is when the data skews
> > enough that it is no longer possible to process a single key in
> > memory on a single node. Once you have to spill parts of the data
> > to disk, performance suffers badly. For sufficiently large data
> > sets (10G or more) the Zipf distribution meets this criterion.
> >
> > Pig has quite a bit of processing dedicated to dealing with skew
> > issues gracefully, since that is one of the weak points of Map
> > Reduce. As mentioned above, order by samples the data and
> > distributes skewed keys across multiple reducers (since for a total
> > ordering there is no need to collect all keys onto one reducer).
> > Pig has a skewed join that also splits skewed keys across multiple
> > reducers. For group by (where it truly can't split keys) Pig uses
> > the combiner and the upcoming accumulator interface (see
> > https://issues.apache.org/jira/browse/PIG-979 ).
>
> >> 3.
> >> Another question relating to the number of Map/Reduce tasks
> >> generated by Pig. I am interested to see whether, despite the fact
> >> that Pig, Hive, JAQL, etc. all use the Hadoop fault tolerance
> >> mechanisms, the number of Map/Reduce tasks has an effect on
> >> Hadoop's ability to recover from, say, a DataNode that fails.
> >> I.e., if there are 100 map tasks spread across 10 DataNodes, and
> >> one DataNode fails, then approximately 10 map tasks will be
> >> redistributed over the remaining 9 DataNodes. If, however, there
> >> were 500 map tasks over the 10 DataNodes and one of them fails,
> >> then 50 map tasks will be reallocated to the remaining 9
> >> DataNodes. Am I to expect a difference in overall performance
> >> between these two scenarios?
>
> > In Pig's case (and I believe in Hive's and JAQL's) this is all
> > handled by Map Reduce. So I would direct questions on this to
> > mapreduce-u...@hadoop.apache.org
>
> >> 4.
> >> Which leads onto... the Pig DataGenerator.
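[Editorial note: the skewed-join idea mentioned above, spreading a hot key over several reducers while replicating the other side's matching rows to each of them, can be sketched as follows. This is a toy model with invented names, not Pig's implementation; in particular the hot-key table would really come from a sampling pass:]

```python
import random

NUM_REDUCERS = 4
HOT_KEYS = {"the": 3}  # hypothetical: hot key -> reducers to spread it over

def route_left(key, rng):
    """Big, skewed side: a hot key's rows are scattered over several
    reducers instead of all landing on one."""
    base = hash(key) % NUM_REDUCERS
    if key in HOT_KEYS:
        return [(base + rng.randrange(HOT_KEYS[key])) % NUM_REDUCERS]
    return [base]

def route_right(key):
    """Other side: a hot key's rows are replicated to every reducer
    that might hold a matching left row, so no join pairs are lost."""
    base = hash(key) % NUM_REDUCERS
    if key in HOT_KEYS:
        return [(base + i) % NUM_REDUCERS for i in range(HOT_KEYS[key])]
    return [base]

rng = random.Random(1)
print([route_left("the", rng)[0] for _ in range(6)], route_right("the"))
```

The invariant is that every left-side destination for a key is contained in that key's right-side destination set, so the join stays correct while the hot key's work is divided. Group by cannot use this trick, which is why the combiner/accumulator path exists instead.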
> >> Is this what I'm looking for to generate my data? I am probably
> >> looking to generate data that would typically take 30 minutes to
> >> process, and I have a cluster of 10 nodes available to me. I would
> >> happily use some existing tool to create my test data for
> >> analysis, and I just need to know whether the DataGenerator is the
> >> tool for me?
> >> ( http://wiki.apache.org/pig/DataGeneratorHadoop )
>
> > The tool we used to generate data for the original PigMix is
> > attached to the patch referenced above. The downfall of the
> > original tool is that it takes about 30 hours to generate the
> > amount of data needed for PigMix (which runs for about 5 minutes on
> > an older 10-node cluster, so you'd need significantly more) because
> > it's single threaded. The DataGenerator is an attempt to convert
> > that original tool to work in MR so it can go much faster. When
> > I've played with it, it does create skewed data, but it does not
> > create the very long tail of data that the original tool does. If
> > the very long tail is not important to you, then it may be a good
> > choice. It may also be possible to use the tool to create the long
> > tail and I just configured it incorrectly.
>
> >> 5.
> >> Finally... What is the best way to analyse the performance of each
> >> Pig job sent to the Hadoop cluster? Obviously, I can simply use
> >> the Unix time command, but are there any real-time performance
> >> analysis tools built into either Pig or Hadoop to monitor
> >> performance? This would typically include the number of map tasks
> >> and the number of reduce tasks over time, and the behaviour of the
> >> Hadoop cluster in the event of a DataNode failure, i.e., following
> >> a failure, the amount of network bandwidth used, the CPU/memory of
> >> the NameNode during the failure of a DataNode, etc.
>
> > I generally use Hadoop's GUI to monitor and find historic
> > information about my jobs.
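[Editorial note: for small-scale experiments with Zipf-skewed keys of the kind discussed above, a sketch like the following may help. It is illustrative only; the real PigMix generator is the tool attached to PIG-200, and the function name here is invented:]

```python
import random
from collections import Counter

def zipf_sample(n, vocab_size, s=1.0, seed=42):
    """Draw n keys from a Zipf(s) distribution over vocab_size distinct
    keys: rank r is drawn with probability proportional to 1 / r**s."""
    rng = random.Random(seed)
    weights = [1.0 / (r ** s) for r in range(1, vocab_size + 1)]
    return rng.choices(range(1, vocab_size + 1), weights=weights, k=n)

keys = zipf_sample(100_000, 10_000)
counts = Counter(keys)
# The hottest key takes roughly a tenth of all records, while
# thousands of rare keys form the long tail.
print(counts.most_common(3), len(counts))
```

Whether a generator reproduces the long tail matters for the experiments above: the hot head stresses per-key memory and skewed joins, while the tail inflates the number of distinct groups.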
> > The same information can be gleaned from the Hadoop logs. You might
> > take a look at the Chukwa project, which provides a parser for
> > Hadoop logs and presents it all in a graphical format.
>
> OK, so by this, do you mean that you use the web interface to view
> the MapReduce tracker, task trackers, and the HDFS NameNode? I've had
> a look at the Chukwa project, and I may be mistaken, but to me it
> looks like a bit of a beast to configure, and becomes more useful as
> you increase the number of nodes in the cluster. The cluster I have
> available to me is 10 nodes. I will have a good look at the Hadoop
> logs generated by each of the nodes to see if that would suffice.
>
> >> I know it's a lot, but I really appreciate any assistance, and
> >> fully intend to return my completed paper to the Pig and Hadoop
> >> community on its completion!!
> >>
> >> Many thanks,
> >>
> >> Rob Stewart
>
> > Alan.