Hey team,
I was reading through the Storage Partition Join SPIP
(https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl)
but it seems like it only supports buckets, not partitions. Is that true? And
if so, does anybody have an intuition for
In a Structured Streaming job, the listener that is supported is
StreamingQueryListener:

```java
spark.streams().addListener(
    new StreamingQueryListener() {
        // ...
    }
);
```
However, there is no straightforward way to use StreamingListener.
I have done it like this:

```java
StreamingContext streamingContext
```
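For reference, a hedged sketch of how a registration along those lines typically looks — `StreamingListener`, `addStreamingListener`, and the `onBatchCompleted` callback are real Spark streaming APIs, but the surrounding setup here is illustrative, not the author's actual code:

```java
// Assumes the usual Spark streaming imports:
// import org.apache.spark.streaming.StreamingContext;
// import org.apache.spark.streaming.scheduler.StreamingListener;
// import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted;

StreamingContext streamingContext = ...; // obtained from your application's setup

streamingContext.addStreamingListener(new StreamingListener() {
    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
        // React to batch completion, e.g. inspect batchCompleted.batchInfo()
        // for scheduling delay and processing time.
    }
});
```

The awkward part the author alludes to is that `StreamingListener` lives in the old DStream API and hangs off `StreamingContext`, whereas `StreamingQueryListener` registers on `spark.streams()`, so the two listener mechanisms are not interchangeable.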
To: Arwin Tio
Cc: user@spark.apache.org
Subject: Re: Spark Executor OOMs when writing Parquet
You also have disk spill, which is a performance hit.
Try multiplying the number of partitions by about 20x - 40x and see if you can
eliminate shuffle spill.
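As a back-of-the-envelope check of that advice (plain arithmetic, not a Spark API; the starting figure of 500 partitions is illustrative), the suggested 20x-40x range works out like this:

```java
public class PartitionEstimate {
    // Suggested shuffle-partition range per the rough 20x-40x heuristic above.
    // Returns {low, high} for a job currently using the given partition count.
    static int[] suggestedRange(int currentPartitions) {
        return new int[] { currentPartitions * 20, currentPartitions * 40 };
    }

    public static void main(String[] args) {
        int[] range = suggestedRange(500); // e.g. a job currently at 500 partitions
        System.out.println("Try between " + range[0] + " and " + range[1] + " partitions");
        // In Spark, the new value would be applied via df.repartition(n) or the
        // spark.sql.shuffle.partitions setting before the shuffle-heavy stage.
    }
}
```

The goal is to make each task's shuffle block small enough to fit in executor memory, so spill to disk stops occurring.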
On Fri, 17 Jan 2020, 10:37 pm Arwin Tio wrote:
Hi,
I am exploring the usage of Catalyst expression functions to avoid the
performance issues associated with UDFs.
One thing that I noticed is that there is a trait called CodegenFallback and
there are some Catalyst expressions in Spark that inherit from it [0].
My question is, is there a
Spark-Native catalyst/codegen functions
To: Arwin Tio
Cc: user@spark.apache.org
Look at
https://github.com/DataSystemsLab/GeoSpark/tree/master/sql/src/main/scala/org/apache/spark/sql/geosparksql
for an example.
Using custom function registration and functions residing inside Spark's private
Hi friends,
I am looking into converting some UDFs/UDAFs to Spark-Native functions to
leverage Catalyst and codegen.
Looking through some examples (for example:
https://github.com/apache/spark/pull/7214/files for Levenshtein) it seems like
we need to add these functions to the Spark framework
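For intuition about what such a Spark-native function computes, the `levenshtein` expression in that PR comes down to the classic dynamic-programming edit distance. A plain-Java version of that logic (independent of Catalyst and codegen, purely to illustrate the computation; the class and method names here are mine, not Spark's):

```java
public class EditDistance {
    // Two-row dynamic-programming Levenshtein distance:
    // after processing i chars of a, prev[j] holds the edit distance
    // between a's first i chars and b's first j chars.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // distance from empty prefix
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),      // deletion
                                   prev[j - 1] + cost);        // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

The point of making this a Catalyst expression rather than a UDF is that Spark can generate code that runs this loop directly on its internal string representation, avoiding the serialization boundary a UDF imposes.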
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
```java
dataframe.write()
.format("parquet")
.bucketBy(500, "bucketColumn1", "bucketColumn2")
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
```
The problem is that