Alan,

Thanks for getting in touch; I appreciate your time, given that you're clearly
busy popping up in Pig discussion videos on Vimeo and YouTube at the moment.
Please see my responses below.

I intend to get a good feel for the data generation, and to see, first of
all, how easily the various interfaces (Pig, JAQL, etc.) can plug into the
same file structures, and secondly, how easily and fairly I would be able to
port my queries from one interface to the next.

Misc question: do you anticipate that Pig will be compatible with Hadoop
0.20?

Finally, am I correct to assume that Pig is not Turing complete? I am not
clear on this. SQL is not Turing complete, whereas Java is. So does that make
Hive or Pig, for example, Turing complete or not?

Once again, see my responses below, and thanks.


Rob Stewart



2009/10/30 Alan Gates <ga...@yahoo-inc.com>

>
> On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:
>
>  Hi there.
>>
>> As some of you may have read on this mailing list previously, I'm studying
>> various interfaces with Hadoop, one of those being Pig.
>>
>> I have three further questions. I am now beginning to think about the
>> design
>> of my testing (query design, indicators of performance etc...).
>>
>> 1:
>> I have had a good look at the PigMix benchmark Wiki page, and find it
>> interesting that a few Pig queries now execute more quickly than the
>> corresponding Java MapReduce implementation (
>> http://wiki.apache.org/pig/PigMix ). The following data processing
>> functions in Pig outperform the Java equivalent:
>> distinct aggregation
>> anti join
>> group by
>> order by 1 field
>> order by multiple fields
>> distinct + join
>> multi-store
>>
>>
>> A few questions: Am I able to obtain the actual queries used for the
>> PigMix
>> benchmarking? And how about obtaining their Java Hadoop equivalent?
>>
>
> https://issues.apache.org/jira/browse/PIG-200 has the code.  The original
> perf.patch has both the MR Java code and the Pig Latin scripts.  The data
> generator is also in this patch.
>
>
I will check that code out. Thanks.


>
>  And,
>> how, technically, is this high performance achieved? I have read the paper
>> "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia, Zheng Shao), and they
>> were using a snapshot of Pig trunk from June 2009, showing Pig executing
>> an
>> aggregation query and a join query more slowly than Java Hadoop or Hive,
>> but
>> the aspect that interests me is the number of Map/Reduce tasks created by
>> Pig and Hive. In what way does this number affect the execution time? I
>> have a feeling that Pig produces more Map/Reduce tasks than other
>> interfaces, which may be beneficial where there is extremely skewed data.
>> Am I wrong in thinking this, or is there another benefit to more
>> Map/Reduce tasks? And how does Pig go about splitting a job into this
>> number of tasks?
>>
>
> Map and reduce parallelism are controlled differently in Hadoop.  Map
> parallelism is controlled by the InputSplit, which determines how many maps
> to start and which file blocks to assign to which maps.  In the case of
> PigMix, both the MR Java code and the Pig code use some subclass of
> FileInputFormat, so the map parallelism is the same in both tests.  I do not
> know for sure, but I believe Hive also uses FileInputFormat.
>
> Reduce parallelism is set explicitly as part of the job configuration.  In
> MapReduce this is done through the Java API.  In Pig it is done through the
> PARALLEL command.  In PigMix, we set parallelism the same for both (40 I
> believe for this data size).
>

I have a question about this procedure. It probably warrants a simple answer,
but I just need clarity on it. I am wondering how, for example, both the MR
applications and the Pig programs will behave if the number of Map or Reduce
tasks is not specified at all. If, let's say, I were a programmer writing Pig
scripts without knowing the skew of the data, my first execution of the Pig
script would be done without any specification of #Mappers or #Reducers. Is
it not a more natural comparison of Pig vs MR apps when both Pig and the MR
app have to decide these details for themselves? So my question is: why is it
a fundamental requirement that the Pig script and the associated MR app be
given figures for the initial number of Map/Reduce tasks?
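
To make the "no specification" case concrete: my first-pass script would look
roughly like the sketch below, with no PARALLEL anywhere, so Pig and Hadoop
pick the defaults (the file, field names and types here are made up purely
for illustration):

    -- first pass: no explicit parallelism, so the framework chooses defaults
    users  = LOAD 'input/users.txt' AS (name:chararray, age:int);
    grpd   = GROUP users BY age;
    counts = FOREACH grpd GENERATE group AS age, COUNT(users) AS n;
    STORE counts INTO 'output/age_counts';

whereas, if I follow your explanation, the PigMix scripts pin the reduce side
explicitly (e.g. GROUP users BY age PARALLEL 40;) and the Java MR jobs do the
equivalent through the job configuration.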


>
> In general, the places where Pig beats MR are due to better algorithms.  The
> MR code was written assuming a basic level of MR coding and database
> knowledge.  So, for example, for the order by queries the MR code achieves a
> total order by having a single reducer at the end.  Pig has a much more
> sophisticated system where it samples the data, determines a distribution,
> and then uses multiple reducers while maintaining a total order.  So for
> large data sets Pig will beat MR for these particular tests.
>

That sounds very elegant, a really neat solution to skewed data. Is there some
documentation of this process? I'd like to include that methodology in my
report, and then display results like "data skew vs. execution time", with
trend lines for Pig, Hive and the MR apps. It would be nice to show that, as
the skew of the data increases, Pig overtakes the corresponding MR app in
execution performance.
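
For my own notes: I take it the case in point is just a plain ORDER BY with an
explicit reduce-side parallelism, something along these lines (my own sketch;
the file and field names are invented):

    -- Pig samples 'pages' first, builds a balanced range partition, and then
    -- uses all 40 reducers while still producing a total order
    pages  = LOAD 'input/pages.txt' AS (url:chararray, hits:long);
    sorted = ORDER pages BY hits DESC PARALLEL 40;
    STORE sorted INTO 'output/pages_by_hits';

as opposed to the hand-written MR code, which funnels everything through a
single reducer to get the total order.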


>
> It is, by definition, always possible to write MR Java code as fast as Pig
> code, since Pig is implemented over MR.  But that isn't what PigMix is
> testing.  PigMix aims to test how fast code is for what we guess is a
> typical programmer choosing between MR and Pig.  If you want instead to test
> the best possible MR against the best possible Pig Latin, the MR code in
> these tests should be rewritten.
>
>
>
>> 2.
>> This sort of leads onto my next question. I want to, in some way, be able
>> to
>> test the performance of Pig against others when dealing with a dataset of
>> extremely skewed data. I have a vague understanding of what this may mean,
>> but I do need clarity on the definition of skewed data, and the effect
>> this has on the performance of the Map/Reduce model.
>>
>
> In practice we see that almost all data we deal with at Yahoo is power law
> distributed.  Hence we built most of the PigMix data that way as well.  We
> used Zipf as a definition of skewed, as it turned out to be a reasonably
> close match to our data.
>
> The interesting situation, for us anyway, is when the data skews enough
> that it is no longer possible to process a single key in memory on a single
> node.  Once you have to spill parts of the data to disk, performance suffers
> badly.  For sufficiently large data sets (10G or more) a Zipf distribution
> meets this criterion.
>
> Pig has quite a bit of processing dedicated to dealing with skew issues
> gracefully, since that is one of the weak points of Map Reduce.  As
> mentioned above, order by samples the data and distributes skewed keys
> across multiple reducers (since for a total ordering there is no need to
> collect all keys onto one reducer).  Pig has a skewed join that also splits
> skewed keys across multiple reducers.  For group by (where it truly can't
> split keys) Pig uses the combiner and the upcoming accumulator interface
> (see https://issues.apache.org/jira/browse/PIG-979 ).
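
That is very useful to know. Just to check I understand how I would invoke
this from Pig Latin: I assume the skewed join is requested along these lines
(a sketch on my part; the relation and file names are made up, and I may have
the exact keyword wrong):

    users  = LOAD 'input/users.txt'  AS (name:chararray, region:chararray);
    clicks = LOAD 'input/clicks.txt' AS (name:chararray, url:chararray);
    -- 'skewed' asks Pig to sample the join keys and spread the heavy ones
    -- across several reducers instead of piling them onto one
    joined = JOIN clicks BY name, users BY name USING 'skewed' PARALLEL 40;
    STORE joined INTO 'output/user_clicks';

whereas the group by case presumably needs nothing special in the script,
since the combiner and accumulator work happens underneath.
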
>
>
>
>> 3.
>> Another question relating to the number of Map/Reduce tasks generated by
>> Pig. I am interested to see whether, despite the fact that Pig, Hive, JAQL
>> etc. all use the Hadoop fault tolerance technologies, the number of
>> Map/Reduce tasks has an effect on Hadoop's ability to recover from, say, a
>> DataNode that fails. I.e. if there are 100 Map tasks spread across 10
>> DataNodes and one DataNode fails, then approximately 10 Map tasks will be
>> redistributed over the remaining 9 DataNodes. If, however, there were 500
>> Map tasks over the 10 DataNodes and one of them fails, then 50 Map tasks
>> will be reallocated to the remaining 9 DataNodes. Am I to expect a
>> difference in overall performance between these two scenarios?
>>
>
> In Pig's case (and I believe in Hive's and JAQL's) this is all handled by
> Map Reduce.  So I would direct questions on this to
> mapreduce-u...@hadoop.apache.org
>
>
>
>> 4.
>> Which leads onto... the Pig DataGenerator. Is this what I'm looking for to
>> generate my data? I am probably looking to generate data that would
>> typically take 30 minutes to process, and I have a cluster of 10 nodes
>> available to me. I would happily use some existing tool to create my test
>> data for analysis, and I just need to know whether the DataGenerator is the
>> tool for me ( http://wiki.apache.org/pig/DataGeneratorHadoop ).
>>
>
> The tool we used to generate data for the original PigMix is attached to
> that patch referenced above.  The drawback of the original tool is that it
> takes about 30 hours to generate the amount of data needed for PigMix (which
> runs for about 5 minutes on an older 10 node cluster, so you'd need
> significantly more) because it's single threaded.  The DataGenerator is an
> attempt to convert that original tool to work in MR so it can go much
> faster.  When I've played with it, it does create skewed data, but it does
> not create the very long tail of data that the original tool does.  If the
> very long tail is not important to you then it may be a good choice.  It may
> also be possible to use the tool to create the long tail and I just
> configured it incorrectly.
>
>
>
>> 5.
>> Finally... What is the best way to analyse the performance of each Pig job
>> sent to the Hadoop cluster? Obviously, I can simply use the Unix time
>> command, but are there any real-time performance analysis tools built into
>> either Pig or Hadoop to monitor performance? This would typically include
>> the number of Map tasks and the number of Reduce tasks over time, and the
>> behaviour of the Hadoop cluster in the event of a DataNode failure, e.g.
>> following a failure, the amount of network bandwidth used and the
>> CPU/memory load on the NameNode.
>>
>
> I generally use Hadoop's GUI to monitor and find historic information about
> my jobs.  The same information can be gleaned from the Hadoop logs.  You
> might take a look at the Chukwa project, which provides a parser for Hadoop
> logs and presents it all in a graphical format.
>
>
OK, so by this, do you mean that you use the web interface to view the
MapReduce job tracker, the task trackers, and the HDFS NameNode? I've had a
look at the Chukwa project, and I may be mistaken, but to me it looks like a
bit of a beast to configure, and it becomes more useful as you increase the
number of nodes in the cluster. The cluster I have available to me is 10
nodes. I will have a good look at the Hadoop logs generated by each of the
nodes to see if that would suffice.


>
>
>>
>> I know it's a lot, but I really appreciate any assistance, and I fully
>> intend to return my completed paper to the Pig and Hadoop community on its
>> completion!!
>>
>>
>> Many thanks,
>>
>>
>> Rob Stewart
>>
>
> Alan.
>
>
