[DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-03 Thread Ryan Blue
Hi everyone. This is a follow-up to the "Identifiers with multi-catalog support" discussion thread. I've taken the proposal I posted to that thread and written it up as an official SPIP for how to identify tables and other catalog objects when working with multiple catalogs. The doc is available
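
For context, the proposal centers on multi-part identifiers whose first part names a catalog. A minimal sketch of what that could look like, assuming a catalog plugin named prod_catalog is configured (the catalog, namespace, and table names below are all hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: "prod_catalog", "sales", and "transactions" are made-up names
// standing in for a configured catalog, a namespace, and a table.
object MultiCatalogIdentifierDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-catalog-demo").getOrCreate()

    // SQL form: catalog.namespace.table
    val fromSql = spark.sql("SELECT * FROM prod_catalog.sales.transactions")

    // The same multi-part identifier through the table API
    val fromApi = spark.table("prod_catalog.sales.transactions")

    fromSql.show()
    fromApi.show()
    spark.stop()
  }
}
```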

GSoC 2019 : Contributing to Apache Spark

2019-02-03 Thread Vishal Gupta
Hi, I'm a Python developer (and data scientist), and I contributed to Debian[1][2] last year as part of Google Summer of Code[3]. Having used Lucene, Kafka, and Spark in the past, I wanted to work on at least one of them this summer. Since Spark, unlike the others, offers a Python API[4], I felt I could g

Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
If the goal is to split the output, then `DataFrameWriter.partitionBy` should do what you need, and no additional methods are required. If not, you can also check Silex's muxPartitions implementation (see https://stackoverflow.com/a/37956034), but its applications are rather limited, due to high res
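
For the `partitionBy` route, a minimal sketch, assuming the split condition can be expressed as a column (the column name, threshold, and output path below are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

object SplitByConditionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-by-condition").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 10.0), (2, 200.0), (3, 35.0)).toDF("id", "amount")

    // Materialize the condition as a column, then let the writer route each
    // row into a per-value directory in a single pass over the data.
    df.withColumn("bucket", when($"amount" > 100, "large").otherwise("small"))
      .write
      .partitionBy("bucket")
      .parquet("/tmp/split-output") // hypothetical output path

    spark.stop()
  }
}
```

This writes rows under /tmp/split-output/bucket=large/ and /tmp/split-output/bucket=small/ without computing the parent twice.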

Re: Feature request: split dataset based on condition

2019-02-03 Thread Sean Owen
I don't think Spark supports this model, where N inputs that depend on one parent are computed at the same time, in a single pass. You can of course cache the parent, filter it N times, and do the same amount of work. One problem is: where would the N inputs live? They'd have to be stored if not used immediately, and
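
A sketch of the cache-and-filter approach described above (the example data and predicates are made up): the parent is computed once, pinned in memory, and each split is an independent filter over the cached rows:

```scala
import org.apache.spark.sql.SparkSession

object CacheAndFilterDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-and-filter").getOrCreate()
    import spark.implicits._

    // Compute the parent once and keep it in memory.
    val parent = Seq(1, 5, 50, 500).toDF("value").cache()

    // Each split is a separate filter; its rows come from the cache
    // rather than being recomputed from the original source.
    val small = parent.filter($"value" < 10)
    val large = parent.filter($"value" >= 10)

    small.show()
    large.show()

    parent.unpersist()
    spark.stop()
  }
}
```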