Bikas has covered the questions well. Just elaborating a little more on
Question 3.
(3) Why is it used for Hive and Pig? How is it better than Spark or MR?
MR has been the execution engine for Pig and Hive and was the only
choice available before Tez and Spark (DAG engines) came in. The query
plans for Pig and Hive are DAGs, and to run one on MR we had to cut the DAG
into pieces and run them as multiple MR jobs, tracking the dependencies
between the jobs and storing intermediate data in HDFS. With DAG
engines, a Pig script's query plan is easily translated into a single Tez or
Spark DAG and run as one job. The advantage Tez offers over Spark is
that Tez exposes the DAG layer. Spark does not: it is the Spark framework
that constructs the DAG based on the data flow defined. With Tez, Pig and
Hive get to construct the DAG themselves, do low-level customizations
and optimizations, control data routing, and make dynamic runtime
decisions using the knowledge they have of the plan and the data, which can
make the Tez DAG execute more efficiently.
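To make the DAG-cutting point concrete, here is a minimal sketch in Python. It is a toy model, not how Pig or Hive actually compile plans: the plan, the vertex names, and the rule that one MR job holds at most one shuffle are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan for a Pig-like script: load two inputs, join
# them, group the joined rows, then sort. Each vertex maps to the list
# of vertices it reads from.
PLAN = {
    "load_a": [],
    "load_b": [],
    "join": ["load_a", "load_b"],
    "group": ["join"],
    "order": ["group"],
}

# Assumption: these vertices need their input shuffled (a reduce phase).
SHUFFLE = {"join", "group", "order"}


def cut_into_mr_jobs(plan, shuffle):
    """Toy cut of a DAG into MR jobs, assuming one shuffle per job.

    Returns (job_of, hdfs_edges): the job number of every vertex, and the
    edges that cross a job boundary and so must be stored in HDFS.
    """
    depth = {}
    # Visit vertices so that inputs are seen before their consumers.
    for v in TopologicalSorter(plan).static_order():
        upstream = max((depth[u] for u in plan[v]), default=0)
        depth[v] = upstream + (1 if v in shuffle else 0)
    # Vertices before the first shuffle run in the map phase of job 1.
    job_of = {v: max(1, d) for v, d in depth.items()}
    hdfs_edges = sorted(
        (u, v) for v in plan for u in plan[v] if job_of[u] != job_of[v]
    )
    return job_of, hdfs_edges


job_of, hdfs_edges = cut_into_mr_jobs(PLAN, SHUFFLE)
print("MR jobs needed:", len(set(job_of.values())))
print("Edges materialized to HDFS between jobs:", hdfs_edges)
print("DAG engine: 1 job, all", len(PLAN), "vertices in a single DAG")
```

For this hypothetical plan the cut produces three MR jobs with two handoffs through HDFS (join to group, group to order), whereas a DAG engine would run the same five vertices as one job.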
Regards,
Rohini
On Wed, Jan 20, 2016 at 10:08 AM, Bikas Saha <[email protected]> wrote:
> Tez is designed as a set of libraries and APIs that will make it easier to
> write data processing applications on YARN. It provides no logical
> functionality by itself. Instead it provides infrastructure pieces that
> take care of YARN scheduling, YARN container allocation, YARN container
> launch and setup, and other aspects such as YARN reporting (for example ATS
> integration) and security. Think of Tez as providing the infrastructure to
> coordinate and orchestrate the application on YARN.
>
> MR was a logical application that provided Map-Reduce functional-style
> semantics with a Key-Value data model. Hive and Pig were record-oriented
> engines that provided higher level logical functionality but were built on
> MR and had to translate their complex logical plans into MR. By switching
> to Tez, these applications get the necessary cluster coordination libraries
> from Tez, so it's easier for them to integrate natively with YARN instead
> of translating to MR semantics.
>
> The DAG based model in Tez comes from the DAG API that Tez exposes to
> define the structure of the application that will execute on YARN. This
> only defines the physical layout of parts of the program that will get
> launched on YARN. What happens inside those launched programs is defined by
> the application - not Tez. Inside the launched programs, the application
> runs its own processing logic (e.g. joining or filtering data) and does some
> IO (say to local storage or HDFS). Tez provides some helper libraries for
> the IO, but the application is free to write its own. So Tez also makes
> the IO pluggable, which lets the application customize it.
>
> Effectively, Tez provides a pluggable coordination layer for scheduling
> applications on the cluster. With the recent extensions made to Tez under
> TEZ-2003, it may be possible to extend this functionality beyond YARN
> clusters to other clusters like Mesos.
>
> 1) Tez provides building blocks that can be used to write higher level
> engines like MR, Hive, Pig etc. Application scenarios are any applications
> whose final scheduling structure looks like a DAG of distributed tasks.
> 2) The problem it is solving is to provide libraries that can be used by
> higher level engines and other projects.
> 3) Hive and Pig use it because it only provides the cluster coordination
> and does not impose data semantics. So Hive and Pig can use their native
> data semantics (earlier they were translating to MR semantics). Similarly
> MR can be run using the Tez libraries and it works today. There was a
> prototype of Spark running on YARN using Tez libraries for YARN scheduling.
> All of these are higher level engines that provide data semantics and
> logical operations while Tez provides the scheduling infrastructure to run
> on YARN.
> 4) "Don’t solve problems that have already been solved" reiterates the case
> for common libraries. Pig, Hive, Cascading, etc. don’t have to write the same
> code to solve the same problems if they can use Tez libraries for common
> functionality.
>
> Hope that helps!
> Bikas
>
> -----Original Message-----
> From: LLBian [mailto:[email protected]]
> Sent: Wednesday, January 20, 2016 8:44 AM
> To: [email protected]
> Subject: What's the application scenario of Apache TEZ
>
>
> Hello, Tez experts:
> I know that Tez is used in DAG cases.
> Because it can avoid writing intermediate results to
> disk and can reuse containers, it is more efficient than MR at processing
> small amounts of data. So I think that Hive on Tez is better
> than Hive on MR for processing small amounts of data. Am I right?
> Well, now, my questions are:
> (1) Even though the main design themes are listed at https://tez.apache.org/ ,
> I am still not very clear about its application scenarios. If there are
> some real, major enterprise applications, so much the better.
> (2) I am still not very clear about what problem it is mainly used to solve.
> (3) Why is it used for Hive and Pig? How is it better than Spark or MR?
> (4) I looked at your official PPT and the paper “Apache Tez: A Unifying
> Framework for Modeling and Building Data Processing Applications”, but
> it is still not very clear to me.
> How should I understand this: "Don’t solve problems that have already been
> solved. Or else you will have to solve them again!"? Is there any real
> example?
>
> Apache Tez is a great product, and I hope to learn more about it.
>
> Any reply is much appreciated.
>
> Thank you & Best Regards.
>
> ---LLBian
>