When I try to do a broadcast hash join on a bigger table (6 million rows), I
get an incorrect result of 0 rows.
val rightDF = spark.read.format("parquet").load("table-a")
val leftDF = spark.read.format("parquet").load("table-b")
// needed to activate the dynamic partition pruning subquery
.where('part_ts ===
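A minimal sketch of the setup described above, with the truncated filter and the rest of the join filled in as assumptions (the filter value and the join key are illustrative; only the paths and `part_ts` come from the message):

```scala
import org.apache.spark.sql.functions.{broadcast, col}

// Sketch of the described pattern; filter value and join key are assumed.
val rightDF = spark.read.format("parquet").load("table-a")
val leftDF = spark.read.format("parquet").load("table-b")
  // filtering on the partition column is what activates the
  // dynamic partition pruning subquery
  .where(col("part_ts") === 20190919L)

// mark the smaller side for broadcast so the planner
// picks a broadcast hash join
val joined = rightDF.join(broadcast(leftDF), Seq("part_ts"))
joined.count()
```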
Hello,
I'm using spark-3.0.0-bin-hadoop3.2 with a custom Hive metastore DB
(Postgres). I'm setting the "autoCreateAll" flag to true, so Hive creates
its relational schema on first use. The problem is that there is a
deadlock and the query hangs forever:
*Tx1* (*holds lock on TBLS relation*,
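The "autoCreateAll" flag above is the DataNucleus property `datanucleus.schema.autoCreateAll`. A hedged sketch of setting it from the Spark side (whether the metastore picks it up via the `spark.hadoop.` prefix depends on the deployment; setting it in hive-site.xml, or initializing the schema once with Hive's schematool so no concurrent schema creation happens, are the usual alternatives):

```scala
import org.apache.spark.sql.SparkSession

// Sketch, not a fix: passing the DataNucleus schema flag through the
// session config. "datanucleus.schema.autoCreateAll" is the real property
// name; everything else here is illustrative. Initializing the metastore
// schema once with "schematool -dbType postgres -initSchema" sidesteps
// the concurrent schema creation that can deadlock.
val spark = SparkSession.builder()
  .appName("metastore-autocreate")
  .enableHiveSupport()
  .config("spark.hadoop.datanucleus.schema.autoCreateAll", "true")
  .getOrCreate()
```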
I forgot to mention an important part: I'm issuing the same query against
both parquets, selecting only one column:
df.select(sum('amount))
BR,
Tomas
On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos wrote:
> Hello,
>
> I have 2 parquets (each containing 1 file):
>
>- parquet-wide - sc
Hello,
I have 2 parquets (each containing 1 file):
- parquet-wide - schema has 25 top-level cols + 1 array
- parquet-narrow - schema has 3 top-level cols
Both files have the same data for the given columns.
When I read from parquet-wide, Spark reports *read 52.6 KB*; from
parquet-narrow, *only 2.6
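The gap in bytes read is Parquet's column pruning at work: only the pages of the selected column are fetched, so the extra columns of the wide file cost little. A sketch of comparing the two (the `amount` column is taken from the follow-up message; paths are the ones given):

```scala
import org.apache.spark.sql.functions.{col, sum}

// Sketch: with a columnar format, a single-column aggregate should read
// only that column's pages, so both files report small input sizes.
val wide = spark.read.format("parquet").load("parquet-wide")
val narrow = spark.read.format("parquet").load("parquet-narrow")

// identical single-column query over both files
wide.select(sum(col("amount"))).show()
narrow.select(sum(col("amount"))).show()

// the "size of files read" metric appears in the SQL tab of the Spark UI
// after running an action on each query
```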
Hello,
I have 2 parquet tables:
stored - table of 10 M records
data - table of 100K records
*This is fast:*
val dataW = data.where("registration_ts in (20190516204l,
20190515143l, 20190510125l, 20190503151l)")
dataW.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(dataW),
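The join above is cut off; a hedged completion, assuming `registration_ts` (the filtered column) is also the join key and finishing with an action:

```scala
import org.apache.spark.sql.functions.broadcast

// Assumed completion of the truncated snippet: join the 10M-row table to
// the 42-row filtered subset, broadcasting the small side so Spark plans
// a broadcast hash join instead of a shuffle. The join key is an assumption.
val dataW = data.where(
  "registration_ts in (20190516204l, 20190515143l, 20190510125l, 20190503151l)")

val result = stored.join(broadcast(dataW), Seq("registration_ts"))
result.count() // the "fast" path described above
```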
e:
> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
>
> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos
> wrote:
Hello,
I have a cached dataframe:
spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
I would like to access the "live" data for this data frame without deleting
the cache (using unpersist()). Whatever I do, I always get the cached data
on subsequent queries. Even
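The quoted reply's first suggestion can be sketched as follows: a cached DataFrame is a frozen snapshot, so "live" results mean rebuilding the plan from the source (the delta path and grouping come from the message; the helper is illustrative):

```scala
import org.apache.spark.sql.functions.col

// Option 1: cached view -- frozen at cache time, by design.
val cached = spark.read.format("delta").load("/data")
  .groupBy(col("event_hour")).count()
cached.cache()

// Option 2: "live" view -- rebuild the DataFrame from the source, so
// each action re-reads the table and sees new data; the cached copy
// above stays intact, no unpersist() needed.
def live() = spark.read.format("delta").load("/data")
  .groupBy(col("event_hour")).count()

cached.show() // snapshot as of cache time
live().show() // current data
```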
Hello,
I have partitioned parquet files based on the "event_hour" column.
After reading the parquet files into Spark:
spark.read.format("parquet").load("...")
files from the same parquet partition are scattered across many Spark
partitions.
Example of mapping Spark partition -> parquet partition:
Spark
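This scattering is expected: after a load, Spark splits input by file size (governed by `spark.sql.files.maxPartitionBytes`), not by the Hive-style directory layout. If one Spark partition per `event_hour` is the goal, repartitioning on the column is a sketch of a fix, at the cost of a shuffle (the path is illustrative):

```scala
import org.apache.spark.sql.functions.col

// Sketch: re-group rows so each event_hour lands in one Spark partition.
// This shuffles the data; the column name comes from the message.
val df = spark.read.format("parquet").load("/events")
val byHour = df.repartition(col("event_hour"))
```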
Hello,
I've contributed a PR: https://github.com/apache/spark/pull/23749/. I think
it is an interesting feature that might be of use to a lot of folks in the
Kafka community. Our company already uses this feature for real-time
reporting based on Kafka events.
I was trying to strictly follow the
s how it's integrated with spark. What can
>> be done from spark perspective is to look for an offset for a specific
>> lowest timestamp and start the reading from there.
>>
>> BR,
>> G
>>
>>
>> On Thu, Jan 24, 2019 at 6:38 PM Tomas Bartalos
>> wr
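The approach in the quoted reply, finding the first offset at or after a given timestamp, maps onto the Kafka consumer's `offsetsForTimes` API. A sketch using the plain Java client from Scala (broker address, topic, partition, and timestamp are illustrative):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Sketch: ask Kafka for the first offset whose timestamp is >= the
// target, then seek there and start reading. All values are illustrative.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("group.id", "ts-lookup")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val tp = new TopicPartition("events", 0)
val targetTs = java.lang.Long.valueOf(1548349080000L) // lowest timestamp of interest

val offsets = consumer.offsetsForTimes(Collections.singletonMap(tp, targetTs))
Option(offsets.get(tp)).foreach { ot =>
  consumer.assign(Collections.singletonList(tp))
  consumer.seek(tp, ot.offset()) // start the read from here
}
```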
Hello Spark folks,
I'm reading a compacted Kafka topic with Spark 2.4, using a direct stream -
KafkaUtils.createDirectStream(...). I have configured the necessary options
for the compacted stream, so it's processed with CompactedKafkaRDDIterator.
It works well; however, in the case of many gaps in the topic, the
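For reference, a sketch of the setup described. The key option for compacted topics is `spark.streaming.kafka.allowNonConsecutiveOffsets=true`, which is what routes processing through `CompactedKafkaRDDIterator`; the other values are illustrative:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Sketch of a direct stream over a compacted topic. Assumes an existing
// StreamingContext `ssc` whose SparkConf has
// spark.streaming.kafka.allowNonConsecutiveOffsets=true, so gaps left by
// log compaction don't fail the batch. Broker/topic names are illustrative.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "compacted-reader",
  "auto.offset.reset" -> "earliest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("compacted-topic"), kafkaParams)
)
```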