[k8s] Fail to expose custom port on executor container specified in my executor pod template

2023-06-26 Thread James Yu
Hi Team, I have had no luck trying to expose port 5005 (for remote debugging purposes) on my executor container using the following pod template and Spark configuration s3a://mybucket/pod-template-executor-debug.yaml

[Spark-SQL] Dataframe write saveAsTable failed

2023-06-26 Thread Anil Dasari
Hi, we recently upgraded Spark from 2.4.x to 3.3.1, and managed table creation when writing a DataFrame with saveAsTable failed with the error below. Can not create the managed table(``) The associated location('hdfs:') already exists. At a high level, our code does the following before writing the dataframe as

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, good news. You have made some progress here :) bzip (bzip2) works (splittable) because it is block-oriented, whereas gzip is stream-oriented. I also noticed that you are creating a managed ORC file. You can bucket and partition an ORC (Optimized Row Columnar) file format. An example below:

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hi Mich, Thanks for the reply. I started running ANALYZE TABLE on the external table, but the progress was very slow. The stage had only read about 275MB in 10 minutes. That equates to about 5.5 hours just to analyze the table. This might just be the reality of trying to process a 240m record
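
The estimate above can be reproduced with a short calculation. This is only a sketch: it assumes the stage has to read the full 9.2GB compressed file mentioned in the original message of this thread, that the read rate stays constant, and that 1GB is 1000MB. The function name `estimated_hours` is made up for illustration.

```python
def estimated_hours(total_mb, observed_mb, observed_minutes):
    """Extrapolate total runtime from the throughput observed so far."""
    mb_per_minute = observed_mb / observed_minutes
    return total_mb / mb_per_minute / 60


# Assumption: 9.2GB compressed input (from the original post), decimal units.
total_mb = 9.2 * 1000

# 275MB read in 10 minutes, as reported above.
hours = estimated_hours(total_mb, observed_mb=275, observed_minutes=10)
print(f"{hours:.1f} hours")
```

Under these assumptions the result lands close to the quoted figure of about 5.5 hours.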

Unable to populate spark metrics using custom metrics API

2023-06-26 Thread Surya Soma
Hello, I am trying to publish custom metrics using Spark CustomMetric API as supported since spark 3.2 https://github.com/apache/spark/pull/31476, https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html I have created a custom metric implementing

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, for now, have you analyzed statistics in the Hive external table?
spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;
Hive external tables have little optimization. HTH Mich Talebzadeh, Solutions Architect/Engineering

Unsubscribe

2023-06-26 Thread Ghazi Naceur
Unsubscribe

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM. I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m

Re: [Spark streaming]: Microbatch id in logs

2023-06-26 Thread Mich Talebzadeh
In SSS writeStream. \ outputMode('append'). \ option("truncate", "false"). \ foreachBatch(SendToBigQuery). \ option('checkpointLocation', checkpoint_path). \ so this writeStream will call
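
For readers following the thread subject (micro-batch id in logs), here is a minimal sketch of how a `foreachBatch` callback can log the batch id. In PySpark the callback receives the micro-batch DataFrame and its id as two arguments. The function names `send_to_bigquery`, `batch_log_line` and `start_query`, the logger name, and the log format are hypothetical; the actual write to BigQuery is left out because the excerpt does not show it.

```python
import logging

logger = logging.getLogger("microbatch")  # hypothetical logger name


def batch_log_line(batch_id, row_count):
    """Build the log message for one micro-batch (illustrative format)."""
    return f"micro-batch id={batch_id} rows={row_count}"


def send_to_bigquery(batch_df, batch_id):
    """foreachBatch callback: Spark passes the batch DataFrame and its id."""
    row_count = batch_df.count()
    logger.info(batch_log_line(batch_id, row_count))
    # The real sink logic (writing batch_df to BigQuery) would go here.
    return row_count  # Spark ignores the return value; handy for testing.


def start_query(streaming_df, checkpoint_path):
    """Wire the callback into the stream, mirroring the options quoted above."""
    return (
        streaming_df.writeStream
        .outputMode("append")
        .option("truncate", "false")
        .foreachBatch(send_to_bigquery)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

`start_query` expects a streaming DataFrame from an existing Spark session; the callback itself only relies on `count()`, so it can be exercised with a stub object outside Spark.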