We're wondering if there is something like Apache Hive LLAP for Pig:
https://cwiki.apache.org/confluence/display/Hive/LLAP
We submit scripts asynchronously throughout the day: never more than 20
at a time, and up to a thousand a day. Input file sizes vary from less
than a megabyte to a couple of terabytes.
1. Hadoop distribution is Hortonworks HDP 2.6.3.
2. Apache Pig 0.16 using Tez.
3. SQL database is Pivotal HAWQ 2.3.0.0. Data is sent to the database
for both inserts and joins using Pivotal HAWQ external tables (CSV
files). Data is retrieved from the database using external tables as well.
3.1. https://hdb.docs.pivotal.io/230/hawq/datamgmt/load/g-working-with-file-based-ext-tables.html
3.2. https://hdb.docs.pivotal.io/230/hawq/pxf/PXFExternalTableandAPIReference.html
4. All processing is done on HDFS and all intermediate files are
compressed with LZO.
We orchestrate everything using Python (not Jython):
1. A Python script detects new input files.
2. It prepares a Pig script according to rules parameterized in a SQL database.
3. It submits the Pig script via the pig command-line client (-exectype tez).
4. It uses the output files (CSV files generated by the Pig script in
step 3) for join operations on the SQL database.
5. It prepares another Pig script against the result (CSV file generated
by Pivotal HAWQ in step 4) of the join operation on the SQL database.
6. It submits that Pig script via the pig command-line client (-exectype tez).
7. Finally, it loads the table (CSV file generated by the script from
step 5) into the SQL database.
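For reference, steps 1-3 can be sketched roughly like this. The helper
names, the LOAD/STORE template, and the file layout are made up for
illustration; they are not our real rules engine:

```python
import subprocess
from pathlib import Path

def build_pig_command(script_path, exectype="tez"):
    # The pig CLI invocation used in steps 3 and 6.
    return ["pig", "-exectype", exectype, str(script_path)]

def render_pig_script(input_path, output_path):
    # Hypothetical stand-in for step 2: in reality the script is built
    # from rules parameterized in the SQL database.
    return (
        "raw = LOAD '%s' USING PigStorage(',');\n"
        "STORE raw INTO '%s' USING PigStorage(',');\n"
        % (input_path, output_path)
    )

def process(input_path, work_dir):
    script = Path(work_dir) / "job.pig"
    script.write_text(render_pig_script(input_path, input_path + ".out"))
    # Submit synchronously; a nonzero exit code means the job failed.
    subprocess.run(build_pig_command(script), check=True)
```

Each invocation of process() pays the full AM startup cost, which is
what the optimizations below try to avoid.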
We're considering some optimizations:
1. Share AM/Tez sessions across different scripts using something
similar to Hive LLAP: a continuously running YARN daemon that can share
resources across different Pig scripts. I haven't found anything
similar. Unfortunately, I have no idea where to begin if we were to
code this ourselves; it's just an out-there idea. Any
pointers/suggestions would be appreciated.
2. Write a Pig UDF to arbitrarily submit SQL statements to the database
so that we don't have to run two separate Pig scripts with two SQL
statements in between. It would be a single script as follows:
1st_pig_script_statements;
exec;
sql_udf_run;
exec;
2nd_script_statements;
exec;
sql_udf_run;
2.1. This would submit everything under a single AM, thus sharing
resources and reducing overall run time (less start/stop overhead per
script). Is the sql_udf_run idea feasible? Should I just bite the
bullet and use Jython instead, at least for the Pig scripts? Can I
just write a standard UDF and run it against a fake one-line input
file?
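On 2.1, a minimal sketch of what sql_udf_run could look like as a
Jython UDF. Everything here is an assumption, untested against HAWQ:
Pig injects the outputSchema decorator when a script is registered with
Jython, and the shim below only exists so the module also loads under
plain Python; the JDBC call is deliberately left as a comment.

```python
# Sketch of a Jython UDF for idea 2 (hypothetical, untested).
try:
    outputSchema  # provided by Pig's Jython runtime
except NameError:
    # Shim so the module also imports outside Pig.
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

@outputSchema('status:chararray')
def sql_udf_run(stmt):
    # Under Jython you could reach HAWQ through JDBC, e.g.
    # java.sql.DriverManager.getConnection(...) (connection details are
    # assumptions). Returning the statement keeps this sketch
    # self-contained.
    return 'submitted: %s' % stmt
```

It would then be registered and applied to a one-line dummy relation,
e.g. REGISTER 'sqludf.py' USING jython AS sqludf; whether a side-effecting
UDF like this plays well with Tez retries is exactly the part I'm unsure
about.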
3. Set pig.auto.local.enabled to true to reduce some overhead on small
input files for faster processing. Unfortunately, I haven't seen much
gain here on 100-megabyte input files when testing with exectype
tez_local. Furthermore, the Pig script in tez_local mode wouldn't find
the input files; I had to prefix file paths with hdfs:///.
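For the tez_local path issue, we can qualify paths explicitly on the
orchestration side before rendering the script. Both helpers below are
sketches; the 100 MB cutoff is an arbitrary example, not a measured
threshold:

```python
def qualify_path(path, scheme="hdfs://"):
    # Prefix bare paths so they resolve the same way in tez and
    # tez_local mode; leave already-qualified URIs alone.
    if "://" in path:
        return path
    return scheme + path

def pick_exectype(input_size_bytes, local_threshold=100 * 1024 * 1024):
    # Route small inputs to local mode. The threshold is an arbitrary
    # example value for illustration.
    return "tez_local" if input_size_bytes < local_threshold else "tez"
```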
Any help is appreciated. We've been using Apache Pig for ETL purposes
for more than a year and we're very satisfied with its performance and
ease of use.
Best regards,
Mário Sérgio
On 22/01/2019 16:49, Rohini Palaniswamy wrote:
If you are using PigServer and submitting programmatically via the same
JVM, it should automatically reuse the application if the requested AM
resources are the same.
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezSessionManager.java#L242-L245
On Fri, Jan 18, 2019 at 12:20 PM Diego Pereira <diego.ns.pere...@gmail.com>
wrote:
Hi!
We are developing an application that is looking for new files on a folder,
running a few Pig Scripts to prepare those files and, finally, loading them
into our database.
The problem is that, for small files, the time that Pig / Tez / YARN
takes to create a new application master and spawn new containers is far
longer than the actual processing time.
Since Tez sessions already allow a single Pig script to run multiple
DAGs against the same application master, is there a way to reuse that
application master and its containers across multiple Pig script
submissions?
Regards,
Diego