Optimizing LIMIT in DSv2

2020-03-30 Thread Andrew Melo
Hello, Executing "SELECT Muon_Pt FROM rootDF LIMIT 10", where "rootDF" is a temp view backed by a DSv2 reader, yields the attached plan [1]. It appears that the initial stage runs over every partition in rootDF, even though each partition has 200k rows (modulo the last partition, which holds the
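A minimal sketch of the setup being described (the "root" format name and file path are hypothetical placeholders for the DSv2 reader in question); calling explain() shows whether the LIMIT reaches the reader or is only applied after a full scan:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().appName("limit-pushdown-check").getOrCreate();
    Dataset<Row> rootDF = spark.read().format("root").load("events.root"); // hypothetical DSv2 source
    rootDF.createOrReplaceTempView("rootDF");
    // If the physical plan shows CollectLimit above a full scan of the source,
    // the limit is not being pushed down into the reader.
    spark.sql("SELECT Muon_Pt FROM rootDF LIMIT 10").explain();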

Re: Data Source - State (SPARK-28190)

2020-03-30 Thread Jungtaek Lim
Hi Bryan, Thanks for the interest! Unfortunately there's a lack of committer support for SPARK-28190 (I have been struggling with the lack of support for structured streaming contributions). I hope things will get better, but in the meantime, could you please try out my own project instead?

[Spark SQL]: How to deserialize a column of ArrayType to java.util.List

2020-03-30 Thread Dima Pavlyshyn
Hello Apache Spark Support Team, I am writing Spark in Java now. I use the Dataset API and I am facing an issue doing something like this: public Dataset<Tuple2<K, List<V>>> groupByKey(Dataset<Tuple2<K, V>> consumers, Class<K> kClass) { consumers.groupBy("_1").agg(collect_list(col("_2"))).printSchema(); return
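The original message is truncated, but the subject suggests the sticking point is turning the ArrayType column produced by collect_list back into a java.util.List. A minimal sketch of one way to do that (key/value types assumed to be String/Integer for illustration) relies on Row.getList, which already returns a java.util.List:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.collect_list;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> grouped = consumers
        .groupBy(col("_1"))
        .agg(collect_list(col("_2")).alias("values"));

    // Row.getList(...) deserializes an ArrayType column to a java.util.List directly,
    // sidestepping the need for a custom encoder for List.
    for (Row row : grouped.collectAsList()) {
        String key = row.getString(0);
        java.util.List<Integer> values = row.getList(1);
        // ... use key and values ...
    }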

Data Source - State (SPARK-28190)

2020-03-30 Thread Bryan Jeffrey
Hi, Jungtaek. We've been investigating the use of Spark Structured Streaming to replace our Spark Streaming operations. We have several cases where we're using mapWithState to maintain state across batches, often with high volumes of data. We took a look at the Structured Streaming stateful
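For context, the closest Structured Streaming analogue to mapWithState is (flat)mapGroupsWithState on a KeyValueGroupedDataset. A minimal sketch of a per-key running count (the Event class and its getKey() accessor are hypothetical stand-ins for the poster's data):

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.KeyValueGroupedDataset;
    import org.apache.spark.sql.streaming.GroupStateTimeout;

    KeyValueGroupedDataset<String, Event> grouped = events.groupByKey(
        (MapFunction<Event, String>) Event::getKey, Encoders.STRING());

    Dataset<Long> counts = grouped.mapGroupsWithState(
        (MapGroupsWithStateFunction<String, Event, Long, Long>) (key, values, state) -> {
            long count = state.exists() ? state.get() : 0L;  // state carried over from prior batches
            while (values.hasNext()) { values.next(); count++; }
            state.update(count);                             // persist state across triggers
            return count;
        },
        Encoders.LONG(),                // state encoder
        Encoders.LONG(),                // output encoder
        GroupStateTimeout.NoTimeout());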

Building Spark + hadoop docker for openshift

2020-03-30 Thread Antoine DUBOIS
Hello, I'm trying to build a spark+hadoop docker image compatible with Openshift. I've used the oshinko Spark build script here https://github.com/radanalyticsio/openshift-spark to build something with the Hadoop jars on the classpath to allow usage of S3 storage. However I'm now stuck on the spark
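Since the goal is S3 access, a minimal Java smoke test for such an image might look like the following, assuming hadoop-aws and the matching AWS SDK jars are on the classpath (the endpoint, credentials, and bucket path are placeholders):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("s3a-smoke-test")
        .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.org")  // placeholder endpoint
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")            // placeholder credential
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")            // placeholder credential
        .getOrCreate();

    // A successful read confirms the S3A filesystem and its jars are wired up correctly.
    spark.read().text("s3a://some-bucket/some-path").show(5);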