We're wondering if there is something like Apache Hive LLAP for Pig:
https://cwiki.apache.org/confluence/display/Hive/LLAP
We submit scripts asynchronously throughout the day: never more than 20
at a time, and up to a thousand a day. Input file sizes vary from less
than a megabyte to a couple of terabytes.
1. Hadoop distribution is Hortonworks HDP 2.6.3.
2. Apache Pig 0.16 using Tez.
3. SQL database is Pivotal HAWQ 2.3.0.0. Data is sent to the database
for both inserts and joins using Pivotal HAWQ external tables (CSV
files). Data is retrieved from the database using external tables as well.
3.1. https://hdb.docs.pivotal.io/230/hawq/datamgmt/load/g-working-with-file-based-ext-tables.html
3.2. https://hdb.docs.pivotal.io/230/hawq/pxf/PXFExternalTableandAPIReference.html
4. All processing is done on HDFS and all intermediate files are
compressed with LZO.
We orchestrate everything using Python (not Jython):
1. A Python script detects new input files.
2. It prepares a Pig script according to rules parameterized in a SQL database.
3. It submits the Pig script via the pig command-line client (-exectype tez).
4. It uses the output files (CSV files generated by the Pig script in
step 3) for join operations on the SQL database.
5. It prepares another Pig script against the result (CSV file generated
by Pivotal HAWQ in step 4) of the join operation on the SQL database.
6. It submits that Pig script via the pig command-line client (-exectype tez).
7. Finally, it loads the table (CSV file generated by the script from
step 5) into the SQL database.
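For reference, steps 1-3 can be sketched roughly like this. The helper
names, the LOAD/STORE template, and the file layout are made up for
illustration; they are not our real rules engine:

```python
import subprocess
from pathlib import Path

def build_pig_command(script_path, exectype="tez"):
    # The pig CLI invocation used in steps 3 and 6.
    return ["pig", "-exectype", exectype, str(script_path)]

def render_pig_script(input_path, output_path):
    # Hypothetical stand-in for step 2: in reality the script is built
    # from rules parameterized in the SQL database.
    return (
        "raw = LOAD '%s' USING PigStorage(',');\n"
        "STORE raw INTO '%s' USING PigStorage(',');\n"
        % (input_path, output_path)
    )

def process(input_path, work_dir):
    script = Path(work_dir) / "job.pig"
    script.write_text(render_pig_script(input_path, input_path + ".out"))
    # Submit synchronously; a nonzero exit code means the job failed.
    subprocess.run(build_pig_command(script), check=True)
```

Each invocation of process() pays the full AM startup cost, which is
what the optimizations below try to avoid.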
We're considering some optimizations:
1. Share AM/Tez sessions across different scripts using something
similar to Hive LLAP: a continuously running YARN daemon that can share
resources across different Pig scripts. I haven't found anything
similar. Unfortunately, I have no idea where to begin if we were to
code this ourselves; it's just an out-there idea. Any
pointers/suggestions would be appreciated.
2. Write a Pig UDF to arbitrarily submit SQL statements to the database
so that we don't have to run two separate Pig scripts with two SQL
statements in between. It would be a single script as follows:
1st_pig_script_statements;
exec;
sql_udf_run;
exec;
2nd_script_statements;
exec;
sql_udf_run;
2.1. This would submit everything under a single AM, thus sharing
resources and reducing overall run time (less start/stop overhead per
script). Is the sql_udf_run idea feasible? Should I just bite the
bullet and use Jython instead, at least for the Pig scripts? Can I
just write a standard UDF and run it against a fake one-line input
file?
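On 2.1, a minimal sketch of what sql_udf_run could look like as a
Jython UDF. Everything here is an assumption, untested against HAWQ:
Pig injects the outputSchema decorator when a script is registered with
Jython, and the shim below only exists so the module also loads under
plain Python; the JDBC call is deliberately left as a comment.

```python
# Sketch of a Jython UDF for idea 2 (hypothetical, untested).
try:
    outputSchema  # provided by Pig's Jython runtime
except NameError:
    # Shim so the module also imports outside Pig.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

@outputSchema('status:chararray')
def sql_udf_run(stmt):
    # Under Jython you could reach HAWQ through JDBC, e.g.
    # java.sql.DriverManager.getConnection(...) (connection details are
    # assumptions). Returning the statement keeps this sketch
    # self-contained.
    return 'submitted: %s' % stmt
```

It would then be registered and applied to a one-line dummy relation,
e.g. REGISTER 'sqludf.py' USING jython AS sqludf; whether a side-effecting
UDF like this plays well with Tez retries is exactly the part I'm unsure
about.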
3. Set pig.auto.local.enabled to true to reduce some overhead on small
input files for faster processing. Unfortunately, I haven't seen much
gain here on 100-megabyte input files when testing with exectype
tez_local. Furthermore, the Pig script in tez_local mode wouldn't find
the input files; I had to prefix file paths with hdfs:///.
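For the tez_local path issue, we can qualify paths explicitly on the
orchestration side before rendering the script. Both helpers below are
sketches; the 100 MB cutoff is an arbitrary example, not a measured
threshold:

```python
def qualify_path(path, scheme="hdfs://"):
    # Prefix bare paths so they resolve the same way in tez and
    # tez_local mode; leave already-qualified URIs alone.
    if "://" in path:
        return path
    return scheme + path

def pick_exectype(input_size_bytes, local_threshold=100 * 1024 * 1024):
    # Route small inputs to local mode. The threshold is an arbitrary
    # example value for illustration.
    return "tez_local" if input_size_bytes < local_threshold else "tez"
```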
Any help is appreciated. We've been using Apache Pig for ETL purposes
for more than a year and we're very satisfied with its performance and
ease of use.
Best regards,
Mário Sérgio
On 22/01/2019 16:49, Rohini Palaniswamy wrote:
If you are using PigServer and submitting programmatically via the same
JVM, it should automatically reuse the application if the requested AM
resources are the same.
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java#L242-L245
On Fri, Jan 18, 2019 at 12:20 PM Diego Pereira <diego.ns.pere...@gmail.com>
wrote:
Hi!
We are developing an application that is looking for new files on a folder,
running a few Pig Scripts to prepare those files and, finally, loading them
into our database.
The problem is that, for small files, the time that Pig / Tez / YARN
takes to create a new application master and spawn new containers is far
longer than the actual processing time.
Since Tez sessions already allow a single Pig script to run multiple
DAGs against the same application master, is there a way to reuse that
application master and its containers across multiple Pig script
submissions?
Regards,
Diego