We're wondering if there is something like Apache Hive LLAP for Pig: https://cwiki.apache.org/confluence/display/Hive/LLAP

We submit scripts asynchronously throughout the day: never more than 20 at a time, up to a thousand a day. Input file sizes vary from less than a megabyte to a couple of terabytes.

1. Hadoop distribution is Hortonworks HDP 2.6.3

2. Apache Pig 0.16 using Tez.

3. SQL database is Pivotal HAWQ 2.3.0.0. Data is sent to the database for both inserts and joins using Pivotal HAWQ external tables (CSV files). Data is retrieved from the database using external tables as well.

   3.1. https://hdb.docs.pivotal.io/230/hawq/datamgmt/load/g-working-with-file-based-ext-tables.html

   3.2. https://hdb.docs.pivotal.io/230/hawq/pxf/PXFExternalTableandAPIReference.html

4. All processing is done on HDFS and all intermediate files are compressed with LZO.

We orchestrate everything using Python (not Jython):

1. A Python script detects new input files.

2. Prepares a Pig script according to rules parameterized in a SQL database.

3. Submits the Pig script via the pig command-line client (-exectype tez).

4. Uses the output files (CSV files generated by the Pig script in step 3) for join operations on the SQL database.

5. Prepares another Pig script against the result (CSV file generated by Pivotal HAWQ in step 4) of the join operation on the SQL database.

6. Submits the Pig script via the pig command-line client (-exectype tez).

7. Finally, loads the table (CSV file generated by the script from step 5) into the SQL database.
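The steps above can be sketched in Python. This is only an illustration, not our actual code: build_pig_command, make_script, and run_sql are hypothetical helper names, and the run parameter exists just so the flow can be exercised without a cluster.

```python
import subprocess

def build_pig_command(script_path, exectype="tez"):
    """Build the pig CLI invocation for one generated script (sketch)."""
    return ["pig", "-exectype", exectype, script_path]

def run_pipeline(input_file, make_script, run_sql, run=None):
    """Hypothetical end-to-end flow for one input file (steps 2-7).

    make_script(input_file, stage) -> path of a generated Pig script
    run_sql(statement)             -> submits a statement to the database
    run(cmd)                       -> executes a command (subprocess by default)
    """
    run = run or subprocess.check_call
    first = make_script(input_file, stage=1)      # step 2: generate script
    run(build_pig_command(first))                 # step 3: submit via pig CLI
    run_sql("-- join against the Pig output via external tables")  # step 4
    second = make_script(input_file, stage=2)     # step 5: generate script
    run(build_pig_command(second))                # step 6: submit via pig CLI
    run_sql("-- load the final CSV into the target table")         # step 7
```

Each arrow in the flow is a separate pig CLI invocation, which is exactly where the AM startup cost mentioned below comes from.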

We're considering some optimizations:

1. Share AMs/Tez sessions across different scripts using something similar to Hive LLAP: a continuously running YARN daemon that can share resources across different Pig scripts. I haven't found anything similar. Unfortunately, I have no idea where to begin if we were to code this ourselves; it's just an out-there idea. Any pointers/suggestions would be appreciated.

2. Write a Pig UDF that arbitrarily submits SQL statements to the database so that we don't have to run two separate Pig scripts with two SQL statements in between. It would be a single script as follows:

      1st_pig_script_statements;

      exec;

      sql_udf_run;

      exec;

      2nd_script_statements;

      exec;

      sql_udf_run;

   2.1. This would submit everything under a single AM, thus sharing
   resources and reducing overall run time (less script start/stop
   overhead). Is the SQL-submitting UDF idea feasible? Should I just
   bite the bullet and use Jython instead, at least for the Pig
   scripts? Can I just write a standard UDF and run it against a fake
   one-line input file?
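Another option along the same lines, instead of a Java/Jython UDF, would be Pig's STREAM operator with a small CPython script that fires the SQL statement once and then passes every record through untouched. This is only a sketch under assumptions: run_sql is a placeholder (in practice it would go through psql or a DB driver against HAWQ), and the script/relation names are illustrative.

```python
import sys

def run_sql(statement):
    # Placeholder: in practice this would submit the statement to the
    # database (e.g. via psql or a DB driver); here it just returns it.
    return statement

def pass_through(lines, write, statement, submit=run_sql):
    """Fire one SQL statement (if any), then echo every input line unchanged."""
    if statement:
        submit(statement)
    for line in lines:
        write(line)

if __name__ == "__main__":
    # Invoked by Pig as, e.g. (illustrative):
    #   joined = STREAM data THROUGH `sql_passthrough.py "ANALYZE t"`;
    pass_through(sys.stdin, sys.stdout.write, " ".join(sys.argv[1:]))
```

The trade-off is that the statement runs once per stream task rather than once per script, so this would only suit statements that are safe to repeat.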

3. Set pig.auto.local.enabled to true to reduce some overhead on small input files for faster processing. Unfortunately, I haven't seen much gain here on 100-megabyte input files when testing with exectype tez_local. Furthermore, the Pig script in tez_local mode wouldn't find the input files; I had to prefix file paths with hdfs:///.
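For reference, auto local mode is controlled by a pair of properties; if I'm reading the Pig docs right, the size threshold defaults to 100,000,000 bytes, so a 100 MB test file may sit right at the cutoff:

```properties
# pig.properties (sketch): enable auto local mode and set the input-size
# threshold below which jobs run locally (default believed to be 100000000).
pig.auto.local.enabled=true
pig.auto.local.input.maxbytes=100000000
```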

Any help is appreciated. We've been using Apache Pig for ETL purposes for more than a year and we're very satisfied with its performance and ease of use.

Best regards,
  Mário Sérgio

On 22/01/2019 16:49, Rohini Palaniswamy wrote:
If you are using PigServer and submitting programmatically via the same JVM, it
should automatically reuse the application if the requested AM resources
are the same.

https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java#L242-L245

On Fri, Jan 18, 2019 at 12:20 PM Diego Pereira <diego.ns.pere...@gmail.com>
wrote:

Hi!

We are developing an application that is looking for new files on a folder,
running a few Pig Scripts to prepare those files and, finally, loading them
into our database.

The problem is that, for small files, the time that Pig/Tez/YARN takes
to create a new application master and spawn new containers is far longer
than the processing time itself.

Since Tez sessions already allow a single Pig script to run multiple DAGs
against the same application master, is there a way to reuse that
application master and its containers across multiple Pig script
submissions?

Regards,

Diego
