Hi Team,
I have had no luck trying to expose port 5005 (for remote debugging purposes) on
my executor containers using the following pod template and Spark configuration:
s3a://mybucket/pod-template-executor-debug.yaml
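For what it's worth, a minimal sketch of the two pieces involved, assuming the goal is a JDWP debug agent listening on 5005 in each executor; the template contents, container name, and options below are illustrative, not the poster's actual files:

pod-template-executor-debug.yaml (hypothetical contents):

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-kubernetes-executor   # illustrative container name
      ports:
        - containerPort: 5005
          name: jdwp
          protocol: TCP

and the matching submit-time configuration:

--conf spark.kubernetes.executor.podTemplateFile=s3a://mybucket/pod-template-executor-debug.yaml
--conf spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005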
Hi,
We recently upgraded Spark from 2.4.x to 3.3.1, and managed table creation
now fails when writing a DataFrame with saveAsTable, with the error below.
Can not create the managed table(``) The associated
location('hdfs:') already exists.
At a high level, our code does the following before writing the dataframe as
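For what it's worth, this error usually means the managed table's warehouse directory was left behind by an earlier run. A hedged PySpark sketch of one common workaround; the table name and HDFS path are hypothetical, and spark._jvm / spark._jsc are internal handles:

# Drop the stale table definition, then remove the leftover location
# that triggers "Can not create the managed table ... already exists".
spark.sql("DROP TABLE IF EXISTS mydb.events")

# Delete the orphaned directory via the Hadoop FileSystem API,
# reached here through PySpark's py4j gateway.
path = spark._jvm.org.apache.hadoop.fs.Path(
    "hdfs:///user/hive/warehouse/mydb.db/events")
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
if fs.exists(path):
    fs.delete(path, True)  # True = recursive

df.write.mode("overwrite").saveAsTable("mydb.events")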
OK, good news. You have made some progress here :)
bzip2 works (it is splittable) because it is block-oriented, whereas gzip
is stream-oriented. I also noticed that you are creating a managed ORC
(Optimized Row Columnar) table. You can bucket and partition an ORC
table. An example below:
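A minimal sketch in PySpark, assuming a DataFrame df with hypothetical event_date and customer_id columns:

# Write a managed ORC table that is partitioned by date and
# bucketed (and sorted) by customer id; note that bucketBy/sortBy
# only work together with saveAsTable.
df.write. \
    format("orc"). \
    partitionBy("event_date"). \
    bucketBy(64, "customer_id"). \
    sortBy("customer_id"). \
    mode("overwrite"). \
    saveAsTable("test.events_orc")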
Hi Mich,
Thanks for the reply. I started running ANALYZE TABLE on the external
table, but the progress was very slow: the stage had read only about 275MB
in 10 minutes, which works out to roughly 5.5 hours just to scan the 9.2GB
compressed file.
This might just be the reality of trying to process a 240m record
Hello,
I am trying to publish custom metrics using the Spark CustomMetric API,
supported since Spark 3.2 (https://github.com/apache/spark/pull/31476,
https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html).
I have created a custom metric implementing
OK, for now: have you analyzed statistics on the Hive external table?
spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;
Hive external tables get little optimization from Spark unless up-to-date
statistics are available.
HTH
Mich Talebzadeh,
Solutions Architect/Engineering
Hello,
I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node
has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and
64GB of RAM.
I'm trying to process a large pipe-delimited file that has been compressed
with gzip (9.2GB zipped, ~58GB unzipped, ~241m records).
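One thing worth noting up front: gzip is not splittable, so the initial scan of
this file runs as a single task on one core no matter how many cores the worker
has. A minimal PySpark sketch of the read, with a hypothetical path:

# sep="|" for the pipe-delimited layout; the gzipped input
# cannot be split, so Spark reads it with a single task.
df = spark.read.csv("hdfs:///data/big_file.txt.gz", sep="|", header=True)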
In SSS (Spark Structured Streaming):
writeStream. \
outputMode('append'). \
option("truncate", "false"). \
foreachBatch(SendToBigQuery). \
option('checkpointLocation', checkpoint_path). \
so this writeStream will call SendToBigQuery(df, batchId) for every micro-batch
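For context, a hedged sketch of what the foreachBatch callback might look like;
SendToBigQuery is the author's own function, and the connector options below are
illustrative rather than their actual code:

# foreachBatch passes each micro-batch to the callback as
# (DataFrame, batch id).
def SendToBigQuery(df, batchId):
    # Write the micro-batch with the spark-bigquery connector;
    # the table name is hypothetical, and indirect writes also
    # need a temporaryGcsBucket option.
    df.write. \
        format("bigquery"). \
        option("table", "mydataset.mytable"). \
        mode("append"). \
        save()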