Re: test

2020-07-27 Thread Ashley Hoff
Yes, your emails are getting through. On Mon, Jul 27, 2020 at 6:31 PM Suat Toksöz wrote: > user@spark.apache.org -- Kustoms On Silver

Re: [Announcement] Cloud data lake conference with heavy focus on open source

2020-07-07 Thread Ashley Hoff
Interesting! You've piqued my interest. Will the sessions be available after the conference? (I'm in the wrong timezone to see this during daylight hours.) On Wed, Jul 8, 2020 at 2:40 AM ldazaa11 wrote: > Hello Sparkers, > > If you’re interested in how Spark is being applied in cloud data

Re: wot no toggle ?

2020-04-16 Thread Ashley Hoff
OK, we get it. You are not satisfied that Spark is easy for mere mortals to use. Please stop. Maybe you should look at Databricks? On Thu, Apr 16, 2020 at 3:43 PM jane thorpe wrote: > https://spark.apache.org/docs/3.0.0-preview/web-ui.html#storage-tab > > On the link in one of the

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Ashley Hoff
ite.partitionBy('_action').format('parquet').mode('overwrite').save('/path/to/output.parquet') > df_output = sql_context.read.parquet('/path/to/output.parquet')inserts_count > = df_output.where(col('_action') === 'Insert').count() > updates_count = df_output.where(col('_action') === 'Updat
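The truncated snippet above outlines the pattern under discussion: write the daily delta partitioned by an _action column, read the output back, and count the inserts and updates separately. What follows is a minimal PySpark sketch of that pattern, not the poster's actual code; the sample data, output path and session setup are illustrative, and the Scala-style === from the quoted mail becomes == in PySpark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("delta-capture-counts").getOrCreate()

# Illustrative delta capture: in the real job this DataFrame would come from
# the daily comparison logic; it is hard-coded here only so the sketch runs.
df = spark.createDataFrame(
    [(1, "Insert"), (2, "Insert"), (3, "Update")],
    ["id", "_action"],
)

# Write the delta partitioned by the action column (hypothetical output path).
df.write.partitionBy("_action").format("parquet") \
    .mode("overwrite").save("/tmp/output.parquet")

# Read the written output back and count inserts and updates separately.
df_output = spark.read.parquet("/tmp/output.parquet")
inserts_count = df_output.where(col("_action") == "Insert").count()
updates_count = df_output.where(col("_action") == "Update").count()
print(inserts_count, updates_count)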

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
I would do that. I think you may > also not need the .cache() statements and you might want to experiment with > reducing the number of spark.sql.shuffle.partitions too. > > Thanks > Dave > > On Thu, 13 Feb 2020, 04:09 Ashley Hoff, wrote: > >>
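Dave's two suggestions are to drop the intermediate .cache() calls and to lower spark.sql.shuffle.partitions, which defaults to 200 and is often far more shuffle partitions than a small standalone deployment needs. Below is a hedged sketch of how that setting can be changed; the value 6 is only an illustration chosen to match the six worker cores described in the original question further down, not a recommendation from the thread.

from pyspark.sql import SparkSession

# spark.sql.shuffle.partitions controls how many partitions Spark creates for
# shuffles (joins, aggregations). The default of 200 is often excessive on a
# single small worker; 6 here is an assumed value for illustration only.
spark = (
    SparkSession.builder
    .appName("delta-capture")
    .config("spark.sql.shuffle.partitions", "6")
    .getOrCreate()
)

# The same setting can also be adjusted on an existing session at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "6")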

Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Hi, I am currently working on an app using PySpark to produce an insert and update daily delta capture, which is output as Parquet. This is running on an 8-core, 32 GB Linux server in standalone mode (set to 6 worker cores with 2 GB of memory each) running Spark 2.4.3. This is being achieved by reading
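As a rough sketch only (the poster's exact configuration is not shown in the snippet), a PySpark session pointed at a standalone master and capped at 6 cores with 2 GB per executor could be set up as follows; the master URL and application name are assumptions added for illustration.

from pyspark.sql import SparkSession

# Hypothetical standalone-mode setup matching the description above: cap the
# application at 6 cores in total and give each executor 2 GB of memory.
# The master URL and app name are placeholders, not taken from the thread.
spark = (
    SparkSession.builder
    .appName("daily-delta-capture")
    .master("spark://localhost:7077")
    .config("spark.cores.max", "6")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)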