Hi Mich,

Thanks for the reply. I started running ANALYZE TABLE on the external table, but the progress was very slow. The stage had only read about 275 MB in 10 minutes. That equates to about 5.5 hours just to analyze the table.
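(For the arithmetic: 275 MB / 10 min ≈ 27.5 MB/min, and the gzipped file is 9.2 GB, so ~9,200 MB ÷ 27.5 MB/min ≈ 335 minutes, or roughly 5.5 hours.)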
This might just be the reality of trying to process a 240m-record file with 80+ columns, unless there's an obvious issue with my setup that someone sees. The solution is likely going to involve increasing parallelization. To that end, I extracted and recompressed the file with bzip2. Since bzip2 is splittable and gzip is not, Spark can process the bzip2 file in parallel. The same CTAS query only took about 45 minutes. This is still a bit slower than I had hoped, but the import from bzip2 fully utilized all available cores. So we can give the cluster more resources if we need the process to go faster. (A sketch of the bzip2 import is at the bottom of this message.)

Patrick

On Mon, Jun 26, 2023 at 12:52 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> OK, for now: have you analyzed statistics on the Hive external table?
>
> spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS;
> spark-sql (default)> DESC EXTENDED test.stg_t2;
>
> Hive external tables have little optimization.
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Mon, 26 Jun 2023 at 16:33, Patrick Tucci <patrick.tu...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master
>> node has 2 cores and 8GB of RAM. There is a single worker node with 8
>> cores and 64GB of RAM.
>>
>> I'm trying to process a large pipe-delimited file that has been
>> compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85
>> columns). I uploaded the gzipped file to HDFS and created an external
>> table using the attached script. I tried two simpler queries on the same
>> table, and they finished in ~5 and ~10 minutes respectively:
>>
>> SELECT COUNT(*) FROM ClaimsImport;
>> SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1;
>>
>> However, when I tried to create a table stored as ORC using this table
>> as the input, the query ran for almost 4 hours:
>>
>> CREATE TABLE Claims STORED AS ORC
>> AS
>> SELECT *
>> FROM ClaimsImport
>> --Exclude the header record
>> WHERE ClaimID <> 'ClaimID';
>>
>> [image: image.png]
>>
>> Why is there such a speed disparity between these different operations?
>> I understand that this job cannot be parallelized because the file is
>> compressed with gzip. I also understand that creating an ORC table from
>> the input will take more time than a simple COUNT(*). But it doesn't feel
>> like the CREATE TABLE operation should take more than 24x longer than a
>> simple SELECT COUNT(*) statement.
>>
>> Thanks for any help. Please let me know if I can provide any additional
>> information.
>>
>> Patrick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
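P.S. For anyone who wants to replicate the bzip2 approach, the import looks roughly like the sketch below. The table name, HDFS path, and column types are placeholders; the real ~85-column definition is in the script attached to my first message, and only ClaimID and ClaimLineID appear in this thread.

CREATE EXTERNAL TABLE ClaimsImportBz (
  ClaimID STRING,
  ClaimLineID STRING
  -- ...remaining ~83 columns, as in the original DDL
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/claims_bz2/';  -- directory containing the .bz2 file

-- Spark picks the codec from the .bz2 extension and, because bzip2 is
-- block-compressed, can split the file across parallel tasks.
CREATE TABLE Claims STORED AS ORC
AS
SELECT *
FROM ClaimsImportBz
--Exclude the header record
WHERE ClaimID <> 'ClaimID';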