Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Tomas Bartalos
When I try to do a Broadcast Hash Join on a bigger table (6 million rows), I get an incorrect result of 0 rows. val rightDF = spark.read.format("parquet").load("table-a") val leftDF = spark.read.format("parquet").load("table-b") //needed to activate dynamic pruning subquery .where('part_ts ===
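
A minimal sketch of the join described, assuming an active SparkSession `spark`; the truncated predicate is completed with a hypothetical timestamp value, and the join key is assumed to be the partition column:

    // Sketch only: the predicate value below is hypothetical, since the
    // original snippet is truncated.
    import org.apache.spark.sql.functions.broadcast
    import spark.implicits._

    val rightDF = spark.read.format("parquet").load("table-a")
    val leftDF = spark.read.format("parquet").load("table-b")
      // needed to activate the dynamic pruning subquery
      .where('part_ts === 20190401L) // hypothetical value

    // Broadcasting the filtered side should trigger dynamic partition
    // pruning on the partitioned table; the post reports 0 rows on a
    // ~6 million row table.
    val joined = rightDF.join(broadcast(leftDF), Seq("part_ts"))
    joined.count()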

Spark 3: creating schema for hive metastore hangs forever

2020-08-13 Thread Tomas Bartalos
Hello, I'm using spark-3.0.0-bin-hadoop3.2 with a custom Hive metastore DB (Postgres). I'm setting the "autoCreateAll" flag to true, so Hive creates its relational schema on first use. The problem is that there is a deadlock and the query hangs forever: *Tx1* (*holds lock on TBLS relation*,
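
A hedged sketch of the setup being described, using the standard JDO/DataNucleus property names and a hypothetical Postgres URL and credentials:

    // Spark 3 with an external Postgres-backed Hive metastore that
    // auto-creates its schema on first use. URL and credentials are
    // hypothetical.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("metastore-autocreate")
      .config("spark.sql.catalogImplementation", "hive")
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",
        "jdbc:postgresql://db-host:5432/metastore")
      .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
      .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")
      .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "secret")
      // lets DataNucleus create the relational schema (TBLS, DBS, ...) on first use
      .config("spark.hadoop.datanucleus.schema.autoCreateAll", "true")
      .enableHiveSupport()
      .getOrCreate()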

Re: Parquet read performance for different schemas

2019-09-20 Thread Tomas Bartalos
I forgot to mention an important part: I'm issuing the same query to both parquets, selecting only one column: df.select(sum('amount)) BR, Tomas On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos wrote: > Hello, > > I have 2 parquets (each containing 1 file): > > - parquet-wide - sc
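
In code, the comparison boils down to running the same single-column aggregate against both files (a sketch, assuming an active SparkSession):

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    val wide = spark.read.format("parquet").load("parquet-wide")
    val narrow = spark.read.format("parquet").load("parquet-narrow")

    // Same query against both; parquet is columnar, so only the `amount`
    // column pages should need to be read in either case.
    wide.select(sum('amount)).show()
    narrow.select(sum('amount)).show()

    // wide.select(sum('amount)).explain() // ReadSchema should list only `amount`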

Parquet read performance for different schemas

2019-09-19 Thread Tomas Bartalos
Hello, I have 2 parquets (each containing 1 file): - parquet-wide - schema has 25 top-level cols + 1 array - parquet-narrow - schema has 3 top-level cols Both files have the same data for the given columns. When I read from parquet-wide, Spark reports *read 52.6 KB*; from parquet-narrow, *only 2.6
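
A sketch of how such a pair of files could be produced for comparison, with hypothetical path and column names: the same rows written once with the full schema and once projected down to 3 columns:

    // Hypothetical names; the point is only that both files carry
    // identical data for the columns they share.
    val source = spark.read.format("parquet").load("source-data")
    source.write.parquet("parquet-wide") // 25 top-level cols + 1 array
    source.select("id", "event_ts", "amount").write.parquet("parquet-narrow") // 3 cols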

Partition pruning by IDs from another table

2019-07-12 Thread Tomas Bartalos
Hello, I have 2 parquet tables: stored - a table of 10M records; data - a table of 100K records. *This is fast:* val dataW = data.where("registration_ts in (20190516204l, 20190515143l, 20190510125l, 20190503151l)") dataW.count res44: Long = 42 //takes 3 seconds stored.join(broadcast(dataW),
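
The pattern being asked about, sketched end to end; the join completes the truncated line above, and paths and column names follow the post:

    import org.apache.spark.sql.functions.broadcast

    val stored = spark.read.format("parquet").load("stored") // ~10M records
    val data = spark.read.format("parquet").load("data")     // ~100K records

    // Fast: a literal IN list prunes partitions at planning time.
    val dataW = data.where("registration_ts in (20190516204l, 20190515143l, 20190510125l, 20190503151l)")
    dataW.count // 42, takes ~3 seconds

    // The question: getting the same pruning when the IDs come from
    // another table. Broadcasting the filtered side is one way to try it.
    val joined = stored.join(broadcast(dataW), Seq("registration_ts"))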

Re: Access to live data of cached dataFrame

2019-05-19 Thread Tomas Bartalos
> A cached DataFrame isn't supposed to change, by definition. You can re-read each time or consider setting up a streaming source on the table, which provides a result that updates as new data comes in. On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
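
A sketch of the streaming alternative suggested in the reply, assuming a Delta source as in the original question and a memory sink for illustration:

    import org.apache.spark.sql.functions.col

    // The aggregation result keeps updating as new data lands in /data.
    val counts = spark.readStream
      .format("delta")
      .load("/data")
      .groupBy(col("event_hour"))
      .count()

    counts.writeStream
      .outputMode("complete") // re-emit the full aggregate on every trigger
      .format("memory")       // illustration only; any sink works
      .queryName("live_counts")
      .start()

    // spark.sql("select * from live_counts") now reflects incoming data.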

Access to live data of cached dataFrame

2019-05-17 Thread Tomas Bartalos
Hello, I have a cached DataFrame: spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache I would like to access the "live" data for this data frame without deleting the cache (using unpersist()). Whatever I do, I always get the cached data on subsequent queries. Even
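
For reference, a sketch of the behaviour being described: Spark matches a query's analyzed plan against cached plans, so any identical read is served from the cache until the entry is dropped:

    import org.apache.spark.sql.functions.col

    val counts = spark.read.format("delta")
      .load("/data")
      .groupBy(col("event_hour"))
      .count()

    counts.cache()
    counts.count() // materializes the cache

    // Any DataFrame with the same plan (even one built from a fresh read
    // of "/data") is answered from the cache from here on.
    counts.unpersist() // only after this does an action re-read the source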

How to force Spark to honor parquet partitioning

2019-05-03 Thread Tomas Bartalos
Hello, I have partitioned parquet files based on the "event_hour" column. After reading the parquet files into Spark: spark.read.format("parquet").load("...") files from the same parquet partition are scattered across many Spark partitions. Example of mapping Spark partition -> parquet partition: Spark
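
One possible workaround (not necessarily what the thread settled on): repartition by the partition column after the read, so each Spark partition holds rows of a single `event_hour`, at the cost of one shuffle. Path is hypothetical:

    import org.apache.spark.sql.functions.col

    val df = spark.read.format("parquet").load("/path/to/table")
    // Shuffles once, but afterwards Spark partitions line up with the
    // on-disk event_hour partitioning.
    val aligned = df.repartition(col("event_hour"))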

PR process

2019-03-15 Thread Tomas Bartalos
Hello, I've contributed a PR: https://github.com/apache/spark/pull/23749/. I think it is an interesting feature that might be of use to a lot of folks from the Kafka community. Our company already uses this feature for real-time reporting based on Kafka events. I was trying to strictly follow the

Re: Structured streaming from Kafka by timestamp

2019-02-01 Thread Tomas Bartalos
>> ...s how it's integrated with Spark. What can be done from the Spark perspective is to look for an offset for a specific lowest timestamp and start the reading from there. >> BR, G >> On Thu, Jan 24, 2019 at 6:38 PM Tomas Bartalos wr
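
For later readers: Spark 3.x exposes this directly via the `startingOffsetsByTimestamp` option (a JSON mapping of topic -> partition -> millisecond timestamp); on Spark 2.4 the offsets would have to be resolved manually, e.g. with the Kafka consumer's `offsetsForTimes`, and passed as `startingOffsets`. A sketch with hypothetical broker, topic, and timestamps:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical
      .option("subscribe", "events")                    // hypothetical topic
      // each partition starts at the first offset whose timestamp is >= the given value
      .option("startingOffsetsByTimestamp",
        """{"events": {"0": 1548316800000, "1": 1548316800000}}""")
      .load()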

Reading compacted Kafka topic is slow

2019-01-24 Thread Tomas Bartalos
Hello Spark folks, I'm reading a compacted Kafka topic with Spark 2.4, using a direct stream - KafkaUtils.createDirectStream(...). I have configured the necessary options for a compacted stream, so it's processed with CompactedKafkaRDDIterator. It works well; however, in the case of many gaps in the topic, the
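
A sketch of the direct-stream setup described, including the switch that enables reading compacted topics (non-consecutive offsets), which is what routes reads through CompactedKafkaRDDIterator. Broker, group id, and topic name are hypothetical:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val conf = new SparkConf()
      .setAppName("compacted-topic-reader")
      // required for compacted topics, where offsets have gaps
      .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")

    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",               // hypothetical
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "compacted-reader",                   // hypothetical
      "auto.offset.reset" -> "earliest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("compacted-topic"), kafkaParams)
    )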