Re: running pyspark on kubernetes - no space left on device

2022-09-01 Thread Matt Proetsch
Hi George,

You can try mounting a larger PersistentVolume for the work directory, as 
described here, instead of using the default local dir, which might have 
site-specific size constraints:

https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes
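
For example, something along these lines at submit time (just a sketch: the 
PVC name, mount path, image, and job path are placeholders to replace with 
your own). Volume names that start with "spark-local-dir-" are used by Spark 
for scratch/shuffle data, so the executors spill to the PVC instead of the 
node-local disk:

  # sketch: mount a pre-created PVC as executor scratch space
  spark-submit \
    --master k8s://https://<k8s-apiserver>:<port> \
    --deploy-mode cluster \
    --conf spark.kubernetes.container.image=<your-pyspark-image> \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=<your-pvc-name> \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data/spark-scratch \
    --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false \
    <path-to-your-job.py>

The matching spark.kubernetes.driver.volumes.* settings exist if the driver 
needs the extra scratch space as well.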

-Matt

> On Sep 1, 2022, at 09:16, Manoj GEORGE wrote:
> 
> Hi Team,
>  
> I am new to Spark, so please excuse my ignorance.
>  
> Currently we are trying to run PySpark on a Kubernetes cluster. The setup 
> works fine for some jobs, but when we process a large file (36 GB) we run 
> into "no space left on device" errors.
>  
> Based on what we found on the internet, we have mapped the local dir to a 
> persistent volume, but this still doesn't solve the issue.
>  
> I am not sure if it is still writing to the /tmp folder on the pod. Is there 
> some other setting that needs to be changed for this to work?
>  
> Thanks in advance.
>  
>  
>  
> Thanks,
> Manoj George
> Manager Database Architecture
> M: +1 3522786801
> manoj.geo...@amadeus.com
> www.amadeus.com
> 


Re: Spark 3 + Delta 0.7.0 Hive Metastore Integration Question

2020-12-19 Thread Matt Proetsch
Hi Jay,

Some things to check:

Do you have the following set in your Spark SQL config:

"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Is the JAR for the package delta-core_2.12:0.7.0 available on both your driver 
and executor classpaths?
(More info 
https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake)

Since you are using a non-default metastore version, have you set the config 
spark.sql.hive.metastore.version?
(More info 
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)

Finally, are you able to read/write Delta tables outside of Hive?
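
Putting the first three together, a launch roughly like this should give you 
a session that can read and write Delta tables (just a sketch; the metastore 
settings are placeholders for whatever your external metastore actually 
requires):

  # sketch: pull delta-core onto the driver/executor classpaths, enable the
  # Delta SQL extension and catalog, and match your metastore version
  spark-shell \
    --packages io.delta:delta-core_2.12:0.7.0 \
    --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
    --conf spark.sql.hive.metastore.version=<your-metastore-version> \
    --conf spark.sql.hive.metastore.jars=maven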

-Matt

> On Dec 19, 2020, at 13:03, Jay wrote:
> 
> Hi All -
> 
> I have currently set up a Spark 3.0.1 cluster with Delta version 0.7.0, which 
> is connected to an external Hive metastore.
> 
> I run the following set of commands:
> 
> val tableName = "tblname_2"
> spark.sql(s"CREATE TABLE $tableName(col1 INTEGER) USING delta 
> options(path='GCS_PATH')")
> 20/12/19 17:30:52 WARN org.apache.spark.sql.hive.HiveExternalCatalog: 
> Couldn't find corresponding Hive SerDe for data source provider delta. 
> Persisting data source table `default`.`tblname_2` into Hive metastore in 
> Spark SQL specific format, which is NOT compatible with Hive.
> 
> spark.sql(s"INSERT OVERWRITE $tableName VALUES 5, 6, 7, 8, 9")
> res51: org.apache.spark.sql.DataFrame = []  
> 
> spark.sql(s"SELECT * FROM $tableName").show()
> org.apache.spark.sql.AnalysisException: Table does not support reads: 
> default.tblname_2;   
> 
> I see a warning related to the Hive metastore integration, which essentially 
> says that this table cannot be queried via Hive or Presto. That is fine, but 
> when I try to read the data from the same Spark session I get an error. Can 
> someone suggest what the problem might be?