Storage Partition Join only works for buckets?

2023-11-08 Thread Arwin Tio
Hey team, I was reading through the Storage Partition Join SPIP (https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl), but it seems like it only supports buckets, not partitions. Is that true? And if so, does anybody have an intuition for …
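For orientation, a hedged sketch of what the SPIP enables, assuming Spark 3.3+ and a DataSource V2 catalog (such as Iceberg) whose tables declare compatible partition transforms on the join key; the catalog and table names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spj-sketch").getOrCreate()

// Storage-partitioned joins hinge on the V2 bucketing flag (off by default).
spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")

// Both tables were created in the V2 catalog with the same transform on the
// join key, e.g. bucket(128, id). The partitioning the scans report
// (KeyGroupedPartitioning) is expressed over V2 transforms generally,
// not only bucket(...).
val orders    = spark.table("cat.db.orders")
val customers = spark.table("cat.db.customers")

// If both scans report compatible KeyGroupedPartitioning on `id`, the
// planner can elide the exchange a shuffle join would otherwise insert.
orders.join(customers, "id").explain()
```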

Using Streaming Listener in a Structured Streaming job

2020-12-28 Thread Arwin Tio
In a Structured Streaming job, the supported listener is StreamingQueryListener: spark.streams().addListener(new StreamingQueryListener() { ... }); However, there is no straightforward way to use a StreamingListener. I have done it like this: StreamingContext streamingContext …
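A sketch of both routes side by side (Scala; the batch interval and println bodies are arbitrary). Whether a DStream-era StreamingListener registered this way actually observes anything in a pure Structured Streaming job is the open question of the thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

val spark = SparkSession.builder().appName("listener-sketch").getOrCreate()

// The supported route: StreamingQueryListener, fed by every streaming query.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"rows this trigger: ${event.progress.numInputRows}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

// The workaround hinted at above: stand up a StreamingContext over the same
// SparkContext purely so addStreamingListener() has somewhere to live.
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit =
    println(s"batch finished at ${batch.batchInfo.batchTime}")
})
```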

Re: Spark Executor OOMs when writing Parquet

2020-01-17 Thread Arwin Tio
You also have disk spill, which is a performance hit. Try multiplying the number of partitions by about 20-40x and see if you can eliminate shuffle spill. …
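The suggestion, as a sketch (paths are hypothetical; 30x is just a point inside the suggested 20-40x range):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spill-sketch").getOrCreate()
val df = spark.read.parquet("s3://some-input/") // hypothetical input

// More, smaller partitions: each task's working set shrinks, so it fits in
// execution memory instead of spilling to disk (or OOMing outright).
df.repartition(df.rdd.getNumPartitions * 30)
  .write
  .mode("overwrite")
  .parquet("s3://some-output/") // hypothetical output

// If the spill happens on a shuffle boundary instead, raising the shuffle
// partition count has the same effect:
spark.conf.set("spark.sql.shuffle.partitions", "8000")
```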

In Catalyst expressions, when is it appropriate to use codegen

2019-08-29 Thread Arwin Tio
Hi, I am exploring the use of Catalyst expression functions to avoid the performance issues associated with UDFs. One thing I noticed is that there is a trait called CodegenFallback, and some Catalyst expressions in Spark inherit from it [0]. My question is: is there a …
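For reference, what the trait looks like in use: a minimal sketch against Spark 2.4-era APIs (newer releases also require withNewChildInternal; the Shout name is made up). Mixing in CodegenFallback means no Java is generated for the node itself; the generated plan calls back into eval() row by row, which is exactly the overhead the codegen path avoids:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// CodegenFallback supplies doGenCode() for us: the generated code simply
// invokes this expression's eval() on every row.
case class Shout(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = StringType
  override protected def nullSafeEval(input: Any): Any =
    UTF8String.fromString(input.asInstanceOf[UTF8String].toString.toUpperCase + "!")
}
```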

Re: Creating custom Spark-Native catalyst/codegen functions

2019-08-22 Thread Arwin Tio
Look at https://github.com/DataSystemsLab/GeoSpark/tree/master/sql/src/main/scala/org/apache/spark/sql/geosparksql for an example. Using custom function registration and functions residing inside Spark's private …
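The registration trick the reply is pointing at, sketched against the Spark 2.x-era FunctionRegistry API (3.x adds a source parameter). sessionState is private[sql], so this file must be compiled into a package under org.apache.spark.sql, which is what GeoSpark does; it reuses the hypothetical Shout expression sketched above:

```scala
package org.apache.spark.sql.myfuncs // grants access to private[sql] members

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Expression

object MyFunctionRegistrator {
  // Expose the expression to SQL: SELECT shout(name) FROM people.
  // Assumes Shout (sketched earlier) is on the classpath.
  def register(spark: SparkSession): Unit =
    spark.sessionState.functionRegistry.createOrReplaceTempFunction(
      "shout", (args: Seq[Expression]) => Shout(args.head))

  // And/or wrap it for the DataFrame API, as the functions object does
  // for the built-ins.
  def shout(col: Column): Column = new Column(Shout(col.expr))
}
```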

Creating custom Spark-Native catalyst/codegen functions

2019-08-21 Thread Arwin Tio
Hi friends, I am looking into converting some UDFs/UDAFs to Spark-Native functions to leverage Catalyst and codegen. Looking through some examples (for example, https://github.com/apache/spark/pull/7214/files for Levenshtein), it seems like we need to add these functions to the Spark framework …
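Modeled on the linked Levenshtein PR, the pattern looks roughly like this (Spark 2.4-era APIs; MyLevenshtein shadows the built-in and is illustrative only). The point of doGenCode/defineCodeGen is that the expression splices a Java snippet directly into the whole-stage-generated code instead of paying a virtual eval() call per row:

```scala
import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
import org.apache.spark.unsafe.types.UTF8String

case class MyLevenshtein(left: Expression, right: Expression)
    extends BinaryExpression with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
  override def dataType: DataType = IntegerType

  // Interpreted path, used when codegen is disabled or falls back.
  override protected def nullSafeEval(l: Any, r: Any): Any =
    l.asInstanceOf[UTF8String].levenshteinDistance(r.asInstanceOf[UTF8String])

  // Codegen path: emit the call directly into the generated Java source.
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, (a, b) => s"$a.levenshteinDistance($b)")
}
```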

Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Arwin Tio
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.

```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```

The problem is that …
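The snippet is cut off, but the title says it: each write task can open a file for every bucket, so the job can emit up to tasks × buckets files. A commonly suggested mitigation, sketched here with hypothetical paths, is to pre-shuffle on the bucket columns so each task holds a single bucket's rows:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("bucket-sketch").getOrCreate()
val dataframe = spark.read.parquet("s3://some-input/") // hypothetical input

dataframe
  // Hash-partition into 500 partitions on the same columns as the buckets,
  // so each task writes (roughly) one bucket file instead of up to 500.
  .repartition(500, col("bucketColumn1"), col("bucketColumn2"))
  .write
  .format("parquet")
  .bucketBy(500, "bucketColumn1", "bucketColumn2")
  .mode(SaveMode.Overwrite)
  .option("path", "s3://my-bucket")
  .saveAsTable("my_table")
```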