Hi everyone,
This is a follow-up to the "Identifiers with multi-catalog support"
discussion thread. I've taken the proposal I posted there and written it
up as an official SPIP describing how to identify tables and other catalog
objects when working with multiple catalogs.
The doc is available

Hi,
I'm a Python developer (and data scientist), and I contributed to
Debian[1][2] last year as part of Google Summer of Code[3]. Having used
Lucene, Kafka, and Spark in the past, I wanted to work on at least one of
them this summer. Since Spark, unlike the others, offers a Python API[4], I
felt I could g

If the goal is to split the output, then `DataFrameWriter.partitionBy`
should do what you need; no additional methods are required. If not, you
can also check Silex's muxPartitions implementation (see
https://stackoverflow.com/a/37956034), but its applications are rather
limited due to high resource usage.
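For concreteness, here is a minimal sketch of the `partitionBy` route; the
column name and output path are assumptions made up for this example, not
part of the original question:

```scala
import org.apache.spark.sql.SparkSession

object PartitionByExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitionBy-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data; any DataFrame with a low-cardinality
    // column works the same way.
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("category", "value")

    // Writes one subdirectory per distinct `category` value, e.g.
    //   /tmp/out/category=a/..., /tmp/out/category=b/...
    // so the output is split on disk without any extra methods.
    df.write.partitionBy("category").parquet("/tmp/out")

    spark.stop()
  }
}
```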

I don't think Spark supports this model, where N inputs that depend on a
parent are all computed at once, in a single pass. You can of course cache
the parent and filter it N times, doing the same amount of work. One
problem is: where would the N inputs live? They'd have to be stored if not
used immediately, and
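A minimal sketch of that cache-and-filter workaround follows; the range
data, predicates, and output paths are illustrative assumptions. The parent
is materialized once by the cache, but each branch is still a separate job
that scans the cached data:

```scala
import org.apache.spark.sql.SparkSession

object CacheAndFilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-and-filter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The parent is computed once and kept in memory after the
    // first action touches it...
    val parent = spark.range(0, 1000000).toDF("value").cache()

    // ...then each downstream branch filters it independently.
    // Spark still runs one job per action below, so with N
    // branches the cached parent is scanned N times.
    val evens = parent.filter($"value" % 2 === 0)
    val odds  = parent.filter($"value" % 2 =!= 0)

    evens.write.mode("overwrite").parquet("/tmp/evens")
    odds.write.mode("overwrite").parquet("/tmp/odds")

    spark.stop()
  }
}
```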