Re: Questions about count() performance with dataframes and parquet files

2020-02-18 Thread Nicolas PARIS
> either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) I wonder if Avro is a better candidate for this: since it is row oriented, it should be faster to write/read for such a task. I had never heard about checkpoint. Enrico Minack writes: > It is not about very large or small, it is
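
A minimal sketch of the two materialization options being discussed; the paths and session below are placeholders, and the Avro write assumes the spark-avro package is on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("materialize-example").getOrCreate()
    df = spark.range(1000)  # stand-in for the real intermediate DataFrame

    # Option A: write to a row-oriented format (requires the spark-avro package).
    df.write.format("avro").mode("overwrite").save("hdfs:///tmp/intermediate_avro")
    df_avro = spark.read.format("avro").load("hdfs:///tmp/intermediate_avro")

    # Option B: checkpoint, which persists the data to the checkpoint directory
    # and truncates the lineage of the returned DataFrame.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
    df_ckpt = df.checkpoint()  # eager by default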

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Enrico Minack
It is not about very large or small, it is about how large your cluster is w.r.t. your data. Caching is only useful if you have the respective memory available across your executors. Otherwise you could either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) or indeed have to do
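
A rough sketch of materializing an intermediate DataFrame to Parquet on HDFS instead of caching it; the path and DataFrame are placeholders, not the original job:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("materialize-parquet").getOrCreate()
    df = spark.range(1000)  # stand-in for the expensive intermediate DataFrame

    # Materialize on HDFS instead of caching in executor memory, then read it
    # back so later actions scan the stored files rather than recomputing the
    # full lineage.
    df.write.mode("overwrite").parquet("hdfs:///tmp/intermediate")
    df_materialized = spark.read.parquet("hdfs:///tmp/intermediate")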

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Nicolas PARIS
> .dropDuplicates() \ .cache() | > Since df_actions is cached, you can count inserts and updates quickly > with only that one join in df_actions: Hi Enrico. I am wondering if this is OK for very large tables? Is caching faster than recomputing both the insert and the update? Thanks. Enrico Minack
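
A hedged sketch of the cached-actions pattern being quoted; the table contents, column names and insert/update classification below are assumptions, not the original code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cached-counts").getOrCreate()

    # Toy stand-ins for the real source and target tables.
    df_source = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["key", "val"])
    df_existing = spark.createDataFrame([(1, "a"), (2, "x")], ["key", "val"])

    # Classify each source row as insert (no match) or update (changed value),
    # then cache the result once.
    df_actions = (df_source.alias("s")
                  .join(df_existing.alias("e"), on="key", how="left")
                  .withColumn("action",
                              F.when(F.col("e.val").isNull(), F.lit("I"))
                               .when(F.col("s.val") != F.col("e.val"), F.lit("U")))
                  .dropDuplicates()
                  .cache())

    # Both counts reuse the cached rows instead of re-running the join.
    inserts = df_actions.filter(F.col("action") == "I").count()
    updates = df_actions.filter(F.col("action") == "U").count()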

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Ashley Hoff
Hi, Thank you both for your suggestions! These have been eye-openers for me. Just to clarify, I need the counts for logging and auditing purposes; otherwise I would exclude the step. I should have also mentioned that while I am processing around 30 GB of raw data, the individual outputs are

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Enrico Minack
Ashley, I want to suggest a few optimizations. The problem might go away, but at least performance should improve. The freeze problems could have many causes; the Spark UI SQL pages and stage detail pages would be useful. You can send them privately if you wish. 1. the repartition(1)
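
As an illustration of the repartition(1) point, a small sketch comparing it with coalesce(1) for writing a single output file; the paths and DataFrame are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("single-file-output").getOrCreate()
    df = spark.range(1000)  # stand-in for the real output DataFrame

    # repartition(1) forces a full shuffle of every row into one partition.
    df.repartition(1).write.mode("overwrite").parquet("hdfs:///out/with_shuffle")

    # coalesce(1) merges existing partitions without a shuffle, which is often
    # cheaper for producing a single file, though it also limits parallelism of
    # the stage that feeds the write.
    df.coalesce(1).write.mode("overwrite").parquet("hdfs:///out/without_shuffle")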

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi Ashley, Apologies, I'm reading this on my phone as my work laptop doesn't let me access personal email. Are you actually doing anything with the counts (printing to a log, writing to a table)? If you're not doing anything with them, get rid of them and the caches entirely. If you do want to do
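
A short sketch of the two options being described, with placeholder names and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("counts-for-logging").getOrCreate()
    df_inserts = spark.range(100)  # stand-in for the real inserts DataFrame

    # (a) Nothing is done with the counts: just write, with no count() or cache().
    df_inserts.write.mode("overwrite").parquet("hdfs:///out/inserts")

    # (b) Counts are needed for logging: count the already-written output so the
    # pipeline itself carries no extra actions.
    insert_count = spark.read.parquet("hdfs:///out/inserts").count()
    print("inserts written: %d" % insert_count)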

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Thanks David, I did experiment with the .cache() method and have to admit I didn't see any marked improvement on the sample I was running, so yes, I am a bit apprehensive about including it (not even sure why I actually left it in). When you say "do the count as the final step", are you referring

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi Ashley, I'm not an expert, but I think this is because Spark uses lazy execution and doesn't actually perform any work until you do some kind of write, count or other action on the dataframe. If you remove the count steps it will work out a more efficient execution plan, reducing the number
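
A tiny illustration of the lazy-execution point, using toy data rather than the actual job:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("lazy-example").getOrCreate()

    df = spark.range(1000000).withColumn("double", col("id") * 2)  # nothing runs yet
    filtered = df.filter(col("double") > 10)                       # still only a plan
    n = filtered.count()  # an action: Spark now builds and executes the physical plan
    print(n)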

Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Hi, I am currently working on an app that uses PySpark to produce a daily insert-and-update delta capture, output as Parquet. This runs on an 8-core, 32 GB Linux server in standalone mode (set to 6 worker cores with 2 GB of memory each) on Spark 2.4.3. This is being achieved by reading
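
A hedged sketch of what a standalone-mode session roughly matching that description could look like; the master URL, configuration values and paths are assumptions, not the actual setup:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("daily-delta-capture")
             .master("spark://master-host:7077")
             .config("spark.cores.max", "6")
             .config("spark.executor.cores", "2")
             .config("spark.executor.memory", "2g")
             .getOrCreate())

    # Read the new daily extract and the previous snapshot as the inputs for
    # computing the insert/update delta that is later written out as Parquet.
    df_new = spark.read.parquet("hdfs:///data/raw/latest")
    df_prev = spark.read.parquet("hdfs:///data/current")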