Re: Sort order in bucketing in a custom datasource

2019-04-16 Thread Jacek Laskowski
Hi, I don't think so. I can't think of an interface (trait) that would give that information to the Catalyst optimizer. Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
Thanks Ryan. To me the "test" for putting things in a maintenance release is really a trade-off between benefit and risk (along with some caveats, like the user-facing surface should not grow). The benefits here are fairly large (now it is possible to plug in partition-aware data sources) and the risk

Re: Spark 2.4.2

2019-04-16 Thread Ryan Blue
Spark has a lot of strange behaviors already that we don't fix in patch releases. And bugs aren't usually fixed with a configuration flag to turn on the fix. That said, I don't have a problem with this commit making it into a patch release. This is a small change and looks safe enough to me. I

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-16 Thread Bobby Evans
I am +1; I'd better be, since I am proposing the SPIP. Thanks, Bobby On Tue, Apr 16, 2019 at 10:38 AM Tom Graves wrote: > Hi everyone, > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for > extended Columnar Processing Support. The proposal is to extend the > support to

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
I would argue that it's confusing enough to a user for options from DataFrameWriter to be silently dropped when instantiating the data source to consider this a bug. They asked for partitioning to occur, and we are doing nothing (not even telling them we can't). I was certainly surprised by this
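A sketch of the call shape being discussed, under the assumption of a custom DataSource V2 implementation (the format name "com.example.v2source" and the column name are hypothetical): the user requests partitioning through DataFrameWriter, and before the SPARK-27453 fix the V2 source was never told about those columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("partition-example").getOrCreate()
val df = spark.range(100).withColumn("event_date", lit("2019-04-16"))

// The user asks for partitioning here; prior to the SPARK-27453 fix a
// DataSource V2 implementation was instantiated without ever seeing
// the partitionBy columns.
df.write
  .format("com.example.v2source")   // hypothetical custom data source
  .partitionBy("event_date")
  .save("/tmp/events")
```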

Re: Spark 2.4.2

2019-04-16 Thread Ryan Blue
Is this a bug fix? It looks like a new feature to me. On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust wrote: > Hello All, > > I know we just released Spark 2.4.1, but in light of fixing SPARK-27453 > I was wondering if it > might make sense

Spark 2.4.2

2019-04-16 Thread Michael Armbrust
Hello All, I know we just released Spark 2.4.1, but in light of fixing SPARK-27453 I was wondering if it might make sense to follow up quickly with 2.4.2. Without this fix it's very hard to build a data source that correctly handles partitioning

Re: Sort order in bucketing in a custom datasource

2019-04-16 Thread Russell Spitzer
Please join the DataSource V2 meetings; the next one is tomorrow, and we are discussing these very topics right now. DataSource V1 cannot provide this information, but any source that just generates RDDs can specify a partitioner. This is only useful, though, if you are only using RDDs, for
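A minimal sketch of the RDD-level option mentioned above (key names and partition count are made up): a pair RDD can carry an explicit Partitioner, which later RDD operations can exploit to avoid a shuffle. This information is not visible to the Catalyst optimizer once you move to DataFrames, which is the limitation being discussed.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-partitioner").getOrCreate()
val sc = spark.sparkContext

// A pair RDD with an explicit partitioner; co-partitioned RDDs can be
// joined or aggregated without an extra shuffle.
val byUser = sc
  .parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3)))
  .partitionBy(new HashPartitioner(8))

// reduceByKey reuses the existing partitioner, so no re-shuffle is needed.
val counts = byUser.reduceByKey(_ + _)
```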

Sort order in bucketing in a custom datasource

2019-04-16 Thread Long, Andrew
Hey Friends, Is it possible to specify the sort order or bucketing in a way that can be used by the optimizer in Spark? Cheers Andrew
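For reference, a minimal sketch of the bucketing and sort-order API Spark already exposes on DataFrameWriter (table and column names are made up). As the replies note, this applies to tables written via saveAsTable, not to arbitrary custom data sources, and there is no public interface for a custom source to declare its own sort order or bucketing to the Catalyst optimizer.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing-example").getOrCreate()
import spark.implicits._

val events = Seq((1L, "click"), (2L, "view"), (1L, "view")).toDF("user_id", "event")

// Bucketing and sort order are recorded in the table metadata; Spark can
// use them to avoid shuffles and sorts in later joins/aggregations on user_id.
events.write
  .bucketBy(16, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")
```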

[VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-16 Thread Tom Graves
Hi everyone, I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended Columnar Processing Support. The proposal is to extend the support to allow for more columnar processing. You can find the full proposal in the JIRA at: https://issues.apache.org/jira/browse/SPARK-27396.

Re: Is there value in publishing nightly snapshots?

2019-04-16 Thread Koert Kuipers
We have used it at times to detect breaking changes, since it allows us to run our internal unit tests against Spark snapshot binaries, but we can also build these snapshots in-house if you want to turn it off. On Tue, Apr 16, 2019 at 9:29 AM Sean Owen wrote: > I noticed recently ... > >
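A build.sbt sketch of the kind of setup described above (the -SNAPSHOT version string is an assumption): resolve the nightly snapshot artifacts from the Apache snapshots repository so internal unit tests can compile against the latest development build.

```scala
// build.sbt (sketch): pull nightly Spark -SNAPSHOT artifacts for testing.
// The version string below is an assumption; use whichever -SNAPSHOT is current.
resolvers += "Apache Snapshots" at "https://repository.apache.org/snapshots/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0-SNAPSHOT" % Test
```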

Is there value in publishing nightly snapshots?

2019-04-16 Thread Sean Owen
I noticed recently ... https://github.com/apache/spark-website/pull/194/files#diff-d95d573366135f01d4fbae2d64522500R466 ... that we stopped publishing nightly releases a long while ago. That's fine. What about turning off the job that builds -SNAPSHOTs of the artifacts each night? Does anyone

Re: pyspark.sql.functions ide friendly

2019-04-16 Thread 880f0464
Hi. That's a problem with Spark as such and in general can be addressed on an IDE-by-IDE basis - see for example https://stackoverflow.com/q/40163106 for some hints. ‐‐‐ Original Message ‐‐‐ On Tuesday, April 16, 2019 2:10 PM, educhana wrote: > Hi, >

pyspark.sql.functions ide friendly

2019-04-16 Thread educhana
Hi, Currently, using pyspark.sql.functions from an IDE like PyCharm causes linters to complain because the functions are declared at runtime. Would a PR fixing this be welcomed? Are there any problems or difficulties I'm unaware of?

Support for arrays parquet vectorized reader

2019-04-16 Thread Mick Davies
Hi, I'm working with a medical data model that uses arrays of simple types to represent things like the drug exposures and conditions that are associated with a patient. Using this model, patient data is co-located and is consequently processed by Spark more efficiently. The data is stored in
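A small sketch of the kind of model described (field names are made up): each patient row carries arrays of simple types so that a patient's data stays co-located, and the dataset is written to Parquet. In Spark 2.4 the vectorized Parquet reader only covers flat primitive columns, which is why support for arrays is the topic here.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("array-columns").getOrCreate()
import spark.implicits._

// Hypothetical patient records: arrays of simple types keep all of a
// patient's drug exposures and conditions co-located in a single row.
case class Patient(patientId: Long, drugExposures: Seq[String], conditions: Seq[String])

val patients = Seq(
  Patient(1L, Seq("aspirin", "metformin"), Seq("diabetes")),
  Patient(2L, Seq("ibuprofen"), Seq("hypertension", "arthritis"))
).toDS()

// Stored as Parquet; in Spark 2.4 scans of these array columns fall back
// to the non-vectorized reader path.
patients.write.mode("overwrite").parquet("/tmp/patients")
```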

Fwd: Uncaught Exception Handler in master

2019-04-16 Thread Alessandro Liparoti
Hi everyone, I have a Spark library where I would like to take some action when an uncaught exception happens (log it, increment an error metric, ...). I tried multiple times to use setUncaughtExceptionHandler on the current Thread, but this doesn't work. If I spawn another thread, this works fine.
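A minimal sketch of the kind of handler registration described (the logging and metric calls are placeholders): the report is that registering it on the current thread has no effect, while doing the same inside a newly spawned thread works.

```scala
// Sketch: register an uncaught exception handler on the current thread so
// the library can log the failure and bump an error metric.
Thread.currentThread().setUncaughtExceptionHandler(
  new Thread.UncaughtExceptionHandler {
    override def uncaughtException(t: Thread, e: Throwable): Unit = {
      System.err.println(s"Uncaught exception in ${t.getName}: $e")
      // errorCounter.increment()  // hypothetical metrics hook
    }
  }
)
```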