2009/10/31 Santhosh Srinivasan <s...@yahoo-inc.com>

> > Misc question: Do you anticipate that Pig will be compatible with
> > Hadoop 0.20?
>
> The Hadoop 0.20-compatible version, Pig 0.5.0, will be released
> shortly. The release got the required votes.
>
Thanks, I will watch out for that, and anticipate using 0.5 for my study.

> > Finally, am I correct to assume that Pig is not Turing complete? I
> > am not clear on this. SQL is not Turing complete, whereas Java is.
> > So does that make Hive or Pig, for example, Turing complete or not?
>
> Short answer: Hive and Pig are not Turing complete. Turing
> completeness is a property of a particular language, not of the
> language implementing the language in question. Since Hive is
> SQL(-like), it's not Turing complete. Until Pig supports loops and
> conditional statements, Pig will not be Turing complete.

OK, as I thought. Thanks. I assume therefore that, as Java is Turing
complete, I would be able to illustrate this difference with a certain
query design that requires Turing completeness to execute?

> Santhosh
>
> -----Original Message-----
> From: Rob Stewart [mailto:robstewar...@googlemail.com]
> Sent: Saturday, October 31, 2009 11:22 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: Follow Up Questions: PigMix, DataGenerator etc...
>
> Alan,
>
> Thanks for getting in touch. I appreciate your time, given that it's
> clear you're busy popping up in Pig discussion videos on Vimeo and
> YouTube just now. See my responses below.
>
> I intend to get a good feel for the data generation, and to see,
> first, how easily the various interfaces (Pig, JAQL, etc.) can plug
> into the same file structures, and second, how easily and fairly I
> would be able to port my queries from one interface to the next.
>
> Misc question: Do you anticipate that Pig will be compatible with
> Hadoop 0.20?
>
> Finally, am I correct to assume that Pig is not Turing complete? I am
> not clear on this. SQL is not Turing complete, whereas Java is. So
> does that make Hive or Pig, for example, Turing complete or not?
>
> Again, see my responses below, and thanks again.
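[Editorial note: the Turing-completeness contrast discussed above can be made concrete with a query such as graph reachability (transitive closure), which needs a loop that runs until a fixed point, exactly the control flow Pig Latin and SQL lack, so a driver in a Turing-complete host language must supply it. A minimal Python sketch, purely illustrative; the names are invented here, not part of Pig:]

```python
def transitive_closure(edges):
    """Repeatedly join the edge set with itself until no new reachable
    pairs appear -- an unbounded loop to a fixed point."""
    reachable = set(edges)
    while True:
        new_pairs = {(a, d)
                     for (a, b) in reachable
                     for (c, d) in reachable if b == c}
        if new_pairs <= reachable:  # fixed point: nothing new derived
            return reachable
        reachable |= new_pairs

# A chain 1 -> 2 -> 3 -> 4 also yields the derived pairs
# (1, 3), (2, 4), and (1, 4).
print(transitive_closure({(1, 2), (2, 3), (3, 4)}))
```

A single (loop-free) Pig Latin script can express one self-join, but not "join until nothing changes", which is why such queries would require an external driver around Pig.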
>
> Rob Stewart
>
>
> 2009/10/30 Alan Gates <ga...@yahoo-inc.com>
>
> > On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:
> >
> >> Hi there.
> >>
> >> As some of you may have read on this mailing list previously, I'm
> >> studying various interfaces with Hadoop, one of those being Pig.
> >>
> >> I have three further questions. I am now beginning to think about
> >> the design of my testing (query design, indicators of performance,
> >> etc.).
> >>
> >> 1:
> >> I have had a good look at the PigMix benchmark wiki page, and find
> >> it interesting that a few Pig queries now execute more quickly than
> >> the corresponding Java MapReduce implementation
> >> ( http://wiki.apache.org/pig/PigMix ). The following data
> >> processing functions in Pig outperform the Java equivalent:
> >> distinct aggregation
> >> anti join
> >> group by
> >> order by 1 field
> >> order by multiple fields
> >> distinct + join
> >> multi-store
> >>
> >> A few questions: Am I able to obtain the actual queries used for
> >> the PigMix benchmarking? And how about obtaining their Java Hadoop
> >> equivalents?
> >
> > https://issues.apache.org/jira/browse/PIG-200 has the code. The
> > original perf.patch has both the MR Java code and the Pig Latin
> > scripts. The data generator is also in this patch.
>
> I will check that code out, thanks.
>
> >> And how, technically, is this high performance achieved? I have
> >> read the paper "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia,
> >> Zheng Shao), and they were using a snapshot of Pig trunk from June
> >> 2009, showing Pig executing an aggregation query and a join query
> >> more slowly than Java Hadoop or Hive, but the aspect that interests
> >> me is the number of Map/Reduce tasks created by Pig and Hive. In
> >> what way does this number affect execution time performance?
> >> I have a feeling that Pig produces more Map/Reduce tasks than
> >> other interfaces, which may be beneficial where there is extremely
> >> skewed data. Am I wrong in thinking this, or is there another
> >> benefit to more Map/Reduce tasks? And how does Pig go about
> >> splitting a job into this number of tasks?
> >
> > Map and reduce parallelism are controlled differently in Hadoop.
> > Map parallelism is controlled by the InputSplit. The InputSplit
> > determines how many maps to start and which file blocks to assign
> > to which maps. In the case of PigMix, both the MR Java code and the
> > Pig code use some subclass of FileInputFormat, so the map
> > parallelism is the same in both tests. I do not know for sure, but
> > I believe Hive also uses FileInputFormat.
> >
> > Reduce parallelism is set explicitly as part of the job
> > configuration. In MapReduce this is done through the Java API. In
> > Pig it is done through the PARALLEL command. In PigMix, we set
> > parallelism the same for both (40, I believe, for this data size).
>
> I have a query about this procedure. It will warrant a simple answer,
> I assume, but I just need clarity on this. I am wondering how, for
> example, both the MR applications and the Pig programs will react if
> the number of map or reduce tasks is not specified. If, let's say, I
> were a programmer writing some Pig scripts where I do not know the
> skew of the data, my first execution of the Pig script would be done
> without any specification of the number of mappers or reducers. Is it
> not a more natural examination of Pig vs. MR apps where both Pig and
> the MR app have to decide these details for themselves? So my
> question is: why is it a fundamental requirement that the Pig script
> and the associated MR app be given figures for the initial Map/Reduce
> tasks?
>
> > In general, where Pig beats MR it is due to better algorithms.
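[Editorial note: the map-side arithmetic described above can be sketched as follows. This is a simplified model of FileInputFormat-style splitting; the 64 MB default block size is an assumption of this sketch, and the function name is invented:]

```python
import math

def num_map_tasks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """One map task per input split; by default a FileInputFormat-style
    splitter makes roughly one split per HDFS block (64 MB assumed)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

print(num_map_tasks(1024 ** 3))  # a 1 GiB input -> 16 map tasks
```

This is why map parallelism is identical for Pig and the MR code in PigMix: both read the same files through the same kind of InputFormat, so they get the same splits. Reduce parallelism, by contrast, is a number the job must supply.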
> > The MR code was written assuming a basic level of MR coding and
> > database knowledge. So, for example, in the order by queries, the
> > MR code achieves a total order by having a single reducer at the
> > end. Pig has a much more sophisticated system where it samples the
> > data, determines a distribution, and then uses multiple reducers
> > while maintaining a total order. So for large data sets Pig will
> > beat MR for these particular tests.
>
> Sounds very elegant, a really neat solution to skewed data. Is there
> some documentation of this process? I'd like to include that
> methodology in my report, and then display results like "skewed data
> vs. execution time", with trend lines for Pig, Hive, and MR apps. It
> would be nice to show that, as the skew of the data increases, Pig
> overtakes the corresponding MR app in execution performance.
>
> > It is, by definition, always possible to write MR Java code as fast
> > as Pig code, since Pig is implemented over MR. But that isn't what
> > PigMix is testing. PigMix aims to test how fast code is for what we
> > guess is a typical programmer choosing between MR and Pig. If you
> > want instead to test the best possible MR against the best possible
> > Pig Latin, the MR code in these tests should be rewritten.
>
> >> 2.
> >> This sort of leads onto my next question. I want to, in some way,
> >> be able to test the performance of Pig against the others when
> >> dealing with a dataset of extremely skewed data. I have a vague
> >> understanding of what this may mean, but I do need clarity on the
> >> definition of skewed data, and the effect it has on the
> >> performance of the Map/Reduce model.
>
> > In practice we see that almost all the data we deal with at Yahoo
> > is power-law distributed. Hence we built most of the PigMix data
> > that way as well. We used Zipf as a definition of skewed, as it
> > turned out to be a reasonably close match to our data.
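[Editorial note: the sample-then-range-partition idea behind Pig's order by can be sketched roughly like this. It is a toy model with invented names, not Pig's actual code: sample the keys, pick quantile cut points, and route each key to a reducer by range, so concatenating the reducers' sorted outputs in order gives a total order without a single final reducer:]

```python
import bisect
import random

def build_range_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 cut points from a sorted sample so each
    reducer receives a roughly equal share of the keys."""
    s = sorted(sample)
    return [s[(i * len(s)) // num_reducers] for i in range(1, num_reducers)]

def reducer_for(key, boundaries):
    """Route a key by range: all keys on reducer r sort before all keys
    on reducer r + 1, which is what preserves the total order."""
    return bisect.bisect_right(boundaries, key)

random.seed(0)
data = [random.randint(0, 999) for _ in range(10000)]
bounds = build_range_boundaries(random.sample(data, 1000), 4)
parts = [sum(1 for k in data if reducer_for(k, bounds) == r)
         for r in range(4)]
print(bounds, parts)  # four roughly equal partition sizes
```

On skewed data the sampling step is what matters: the cut points follow the observed distribution, so a hot region of the key space gets more reducers rather than overwhelming one.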
> > The interesting situation, for us anyway, is when the data skews
> > enough that it is no longer possible to process a single key in
> > memory on a single node. Once you have to spill parts of the data
> > to disk, performance suffers badly. For sufficiently large data
> > sets (10G or more) the Zipf distribution meets this criterion.
> >
> > Pig has quite a bit of processing dedicated to dealing with skew
> > issues gracefully, since that is one of the weak points of Map
> > Reduce. As mentioned above, order by samples the data and
> > distributes skewed keys across multiple reducers (since for a total
> > ordering there is no need to collect all keys onto one reducer).
> > Pig has a skewed join that also splits skewed keys across multiple
> > reducers. For group by (where it truly can't split keys) Pig uses
> > the combiner and the upcoming accumulator interface (see
> > https://issues.apache.org/jira/browse/PIG-979 ).
>
> >> 3.
> >> Another question relating to the number of Map/Reduce tasks
> >> generated by Pig. I am interested to see whether, despite the fact
> >> that Pig, Hive, JAQL, etc. all use the Hadoop fault tolerance
> >> mechanisms, the number of Map/Reduce tasks has an effect on
> >> Hadoop's ability to recover from, say, a DataNode that fails.
> >> I.e., if there are 100 map tasks spread across 10 DataNodes, and
> >> one DataNode fails, then approximately 10 map tasks will be
> >> redistributed over the remaining 9 DataNodes. If, however, there
> >> were 500 map tasks over the 10 DataNodes and one of them fails,
> >> then 50 map tasks will be reallocated to the remaining 9
> >> DataNodes. Am I to expect a difference in overall performance
> >> between these two scenarios?
>
> > In Pig's case (and I believe in Hive's and JAQL's) this is all
> > handled by Map Reduce. So I would direct questions on this to
> > mapreduce-u...@hadoop.apache.org
>
> >> 4.
> >> Which leads onto... the Pig DataGenerator.
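[Editorial note: the skewed-join idea mentioned above, spreading a hot key over several reducers while replicating the other side's matching rows to each of them, can be sketched as follows. This is a toy model with invented names, not Pig's implementation; in particular the hot-key table would really come from a sampling pass:]

```python
import random

NUM_REDUCERS = 4
HOT_KEYS = {"the": 3}  # hypothetical: hot key -> reducers to spread it over

def route_left(key, rng):
    """Big, skewed side: a hot key's rows are scattered over several
    reducers instead of all landing on one."""
    base = hash(key) % NUM_REDUCERS
    if key in HOT_KEYS:
        return [(base + rng.randrange(HOT_KEYS[key])) % NUM_REDUCERS]
    return [base]

def route_right(key):
    """Other side: a hot key's rows are replicated to every reducer
    that might hold a matching left row, so no join pairs are lost."""
    base = hash(key) % NUM_REDUCERS
    if key in HOT_KEYS:
        return [(base + i) % NUM_REDUCERS for i in range(HOT_KEYS[key])]
    return [base]

rng = random.Random(1)
print([route_left("the", rng)[0] for _ in range(6)], route_right("the"))
```

The invariant is that every left-side destination for a key is contained in that key's right-side destination set, so the join stays correct while the hot key's work is divided. Group by cannot use this trick, which is why the combiner/accumulator path exists instead.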
> >> Is this what I'm looking for to generate my data? I am probably
> >> looking to generate data that would typically take 30 minutes to
> >> process, and I have a cluster of 10 nodes available to me. I would
> >> happily use some existing tool to create my test data for
> >> analysis, and I just need to know whether the DataGenerator is the
> >> tool for me?
> >> ( http://wiki.apache.org/pig/DataGeneratorHadoop )
>
> > The tool we used to generate data for the original PigMix is
> > attached to the patch referenced above. The downfall of the
> > original tool is that it takes about 30 hours to generate the
> > amount of data needed for PigMix (which runs for about 5 minutes on
> > an older 10-node cluster, so you'd need significantly more) because
> > it's single threaded. The DataGenerator is an attempt to convert
> > that original tool to work in MR so it can go much faster. When
> > I've played with it, it does create skewed data, but it does not
> > create the very long tail of data that the original tool does. If
> > the very long tail is not important to you, then it may be a good
> > choice. It may also be possible to use the tool to create the long
> > tail and I just configured it incorrectly.
>
> >> 5.
> >> Finally... What is the best way to analyse the performance of each
> >> Pig job sent to the Hadoop cluster? Obviously, I can simply use
> >> the Unix time command, but are there any real-time performance
> >> analysis tools built into either Pig or Hadoop to monitor
> >> performance? This would typically include the number of map tasks
> >> and the number of reduce tasks over time, and the behaviour of the
> >> Hadoop cluster in the event of a DataNode failure, i.e., following
> >> a failure, the amount of network bandwidth used, the CPU/memory of
> >> the NameNode during the failure of a DataNode, etc.
>
> > I generally use Hadoop's GUI to monitor and find historic
> > information about my jobs.
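[Editorial note: for small-scale experiments with Zipf-skewed keys of the kind discussed above, a sketch like the following may help. It is illustrative only; the real PigMix generator is the tool attached to PIG-200, and the function name here is invented:]

```python
import random
from collections import Counter

def zipf_sample(n, vocab_size, s=1.0, seed=42):
    """Draw n keys from a Zipf(s) distribution over vocab_size distinct
    keys: rank r is drawn with probability proportional to 1 / r**s."""
    rng = random.Random(seed)
    weights = [1.0 / (r ** s) for r in range(1, vocab_size + 1)]
    return rng.choices(range(1, vocab_size + 1), weights=weights, k=n)

keys = zipf_sample(100_000, 10_000)
counts = Counter(keys)
# The hottest key takes roughly a tenth of all records, while
# thousands of rare keys form the long tail.
print(counts.most_common(3), len(counts))
```

Whether a generator reproduces the long tail matters for the experiments above: the hot head stresses per-key memory and skewed joins, while the tail inflates the number of distinct groups.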
> > The same information can be gleaned from the Hadoop logs. You might
> > take a look at the Chukwa project, which provides a parser for
> > Hadoop logs and presents it all in a graphical format.
>
> OK, so by this, do you mean that you use the web interface to view
> the MapReduce tracker, task trackers, and the HDFS NameNode? I've had
> a look at the Chukwa project, and I may be mistaken, but to me it
> looks like a bit of a beast to configure, and becomes more useful as
> you increase the number of nodes in the cluster. The cluster I have
> available to me is 10 nodes. I will have a good look at the Hadoop
> logs generated by each of the nodes to see if that would suffice.
>
> >> I know it's a lot, but I really appreciate any assistance, and
> >> fully intend to return my completed paper to the Pig and Hadoop
> >> community on its completion!!
> >>
> >> Many thanks,
> >>
> >> Rob Stewart
>
> > Alan.