Re: Tuning Best Practices

2023-11-28 Thread Jack Goodson
Hi Bryant, the docs below are a good start on performance tuning: https://spark.apache.org/docs/latest/sql-performance-tuning.html Hope it helps! On Wed, Nov 29, 2023 at 9:32 AM Bryant Wright wrote: > Hi, I'm looking for a comprehensive list of Tuning Best Practices for > spark. > > I did a

Re: Inquiry about Processing Speed

2023-09-28 Thread Jack Goodson
Hi Haseeb, I think the user mailing list is what you're looking for. People are usually pretty active on here if you present a direct question about Apache Spark. I've linked below the community guidelines, which say which mailing lists are for what: https://spark.apache.org/community.html

Re: Change default timestamp offset on data load

2023-09-07 Thread Jack Goodson
operty which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Thu, 7 Sept 2023 at 01:42, Jack Goodson wrote: >

Re: Change default timestamp offset on data load

2023-09-06 Thread Jack Goodson
com/Mich_Talebzadeh > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no ca

Change default timestamp offset on data load

2023-09-05 Thread Jack Goodson
Hi, I've got a number of tables that I'm loading in from a SQL Server. The timestamp in SQL Server is stored like 2003-11-24T09:02:32. I get these as Parquet files in our raw storage location and pick them up in Databricks. When I load the data in Databricks, the dataframe/Spark assumes UTC or

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-02-15 Thread Jack Goodson
Hi, there is some good documentation here: https://docs.databricks.com/structured-streaming/query-recovery.html Under the “recovery after change in structured streaming query” heading, it gives good general guidelines on what can be changed in a “pause” of a stream. On Thu, 16 Feb 2023

Jira Account for Contributions

2023-02-09 Thread Jack Goodson
Hi, I'd like to start contributing to the Spark project. Do I need a Jira account at https://issues.apache.org/jira/projects/SPARK/summary before I'm able to do this? If so, could one please be created with this email address? Thank you

Re: Spark with GPU

2023-02-05 Thread Jack Goodson
As far as I understand, you will need a GPU for each worker node, or you will need to partition the GPU processing somehow to each node, which I think would defeat the purpose. In Databricks, for example, when you select GPU workers there is a GPU allocated to each worker. I assume this is the

Re: Splittable or not?

2022-09-19 Thread Jack Goodson
When reading in Gzip files, I’ve always read them into a data frame and then written them out to Parquet/Delta more or less in their raw form, and then used these files for my transformations, as the workloads are now parallelisable from these split files. When reading in Gzips, these will be read by