Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM.
I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m records, ~85 columns). I uploaded the gzipped file to HDFS and created an external table using the attached script. I tried two simpler queries on the same table, and they finished in ~5 and ~10 minutes respectively: SELECT COUNT(*) FROM ClaimsImport; SELECT COUNT(*) FROM ClaimsImport WHERE ClaimLineID = 1; However, when I tried to create a table stored as ORC using this table as the input, the query ran for almost 4 hours: CREATE TABLE Claims STORED AS ORC AS SELECT * FROM ClaimsImport --Exclude the header record WHERE ClaimID <> 'ClaimID'; [image: image.png] Why is there such a speed disparity between these different operations? I understand that this job cannot be parallelized because the file is compressed with gzip. I also understand that creating an ORC table from the input will take more time than a simple COUNT(*). But it doesn't feel like the CREATE TABLE operation should take more than 24x longer than a simple SELECT COUNT(*) statement. Thanks for any help. Please let me know if I can provide any additional information. Patrick
Create Table.sql
Description: Binary data
--------------------------------------------------------------------- To unsubscribe e-mail: user-unsubscr...@spark.apache.org