OK, good news. You have made some progress here :)
bzip2 works (it is splittable) because it is block-oriented, whereas gzip
is stream-oriented. I also noticed that you are creating a managed ORC
table. You can bucket and partition an ORC (Optimized Row Columnar)
table. An example below:
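A minimal sketch of such a definition in spark-sql, assuming a Spark 3.x `CREATE TABLE ... USING ORC` datasource table (the table and column names here are hypothetical, not from this thread):

```sql
-- Hypothetical example: test.sales_orc, customer_id, amount and
-- event_date are illustrative names, not from the original message.
CREATE TABLE test.sales_orc (
    customer_id BIGINT,
    amount      DECIMAL(10,2),
    event_date  DATE
)
USING ORC
PARTITIONED BY (event_date)                      -- prunes directories at read time
CLUSTERED BY (customer_id) INTO 64 BUCKETS;      -- pre-hashes rows on the join key
```

Partitioning lets Spark skip whole directories when a query filters on event_date, while bucketing pre-hashes rows on customer_id so joins and aggregations on that key can avoid a full shuffle.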
Hi Mich,
Thanks for the reply. I started running ANALYZE TABLE on the external
table, but progress was very slow: the stage had read only about 275MB in
10 minutes, which works out to roughly 5.5 hours just to analyze the
table. This might just be the reality of trying to process a 240m record
dataset.
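As a rough sanity check on that estimate (assuming the 275MB is measured against the 9.2GB compressed input size mentioned in the original message), the throughput arithmetic can be run in spark-sql itself:

```sql
-- 275MB in 10 minutes => 27.5 MB/min; 9.2GB ≈ 9421MB
SELECT ROUND(9.2 * 1024 / (275.0 / 10) / 60, 1) AS est_hours;
-- ≈ 5.7 hours, consistent with the ~5.5 hour estimate above
```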
OK, for now: have you analyzed statistics on the Hive external table?
spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;
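If it helps, Spark 3.x can also report the per-column statistics that ANALYZE collected; `col1` below is a placeholder for one of your actual column names:

```sql
-- Shows min/max, num_nulls and distinct_count for a single column,
-- once ANALYZE TABLE ... FOR ALL COLUMNS has run
DESC EXTENDED test.stg_t2 col1;
```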
Hive external tables get little optimization from Spark, so up-to-date statistics matter.
HTH
Mich Talebzadeh,
Solutions Architect/Engineering
Hello,
I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node
has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and
64GB of RAM.
I'm trying to process a large pipe-delimited file that has been compressed
with gzip (9.2GB zipped, ~58GB unzipped, ~241m