Re: [DISCUSS][SPARK-30275] Discussion about whether to add a gitlab-ci.yml file

2020-01-23 Thread Jim Kleckner
I understand that "non-dev" persons could become confused and that some sort of signposting/warning makes sense. Certainly I consider my personal registry on gitlab.com as ephemeral and not intended for publication. We have our own private GitLab instance where I put artifacts that are derived and …

Re: Enabling push-based shuffle in Spark

2020-01-23 Thread mshen
Hi Wenchen, Glad to know that you like this idea. We also looked into making this pluggable in our early design phase. While the ShuffleManager API for pluggable shuffle systems does provide considerable room for customized Spark shuffle behavior, we feel that it is still not enough for this …

Re: [DISCUSS][SPARK-30275] Discussion about whether to add a gitlab-ci.yml file

2020-01-23 Thread Sean Owen
Yeah, the color on this is that 'snapshot' or 'nightly' builds are not quite _discouraged_ by the ASF, but they need to be something only devs are likely to find, and clearly signposted, because they aren't officially blessed releases. It gets into a gray area if the project is 'officially' hosting a way …

Re: [DISCUSS][SPARK-30275] Discussion about whether to add a gitlab-ci.yml file

2020-01-23 Thread Dongjoon Hyun
Hi, Jim. Thank you for the proposal. I understand the request. However, the following key benefit sounds like unofficial snapshot binary releases. > For example, this was used to build a version of Spark that included SPARK-28938, which has yet to be released and was necessary for spark-operator …

[DISCUSS][SPARK-30275] Discussion about whether to add a gitlab-ci.yml file

2020-01-23 Thread Jim Kleckner
This story [1] proposes adding a .gitlab-ci.yml file to make it easy to create artifacts and images for Spark. Using this mechanism, people can submit any subsequent version of Spark for building and image hosting with gitlab.com. There is a companion WIP branch [2] with a candidate and example …
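For readers unfamiliar with GitLab CI, a pipeline of this kind would look roughly like the sketch below. This is a hypothetical minimal example, not the contents of the WIP branch [2]; the job name, image tag, and artifact path are illustrative (though `./build/mvn` is the Maven wrapper that ships in the Spark repo):

```yaml
# Hypothetical minimal .gitlab-ci.yml for building Spark artifacts.
# Stage/job names and the artifact path below are illustrative only.
stages:
  - build

build-spark:
  stage: build
  image: maven:3.6-jdk-8          # assumed build image, not from the WIP branch
  script:
    - ./build/mvn -DskipTests package
  artifacts:
    paths:
      - assembly/target/
```

Pushing any branch or tag to a gitlab.com fork with such a file would trigger the build and retain the resulting artifacts, which is precisely the "unofficial snapshot" concern raised elsewhere in this thread.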

Re: [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API)

2020-01-23 Thread Wenchen Fan
I don't think we want to add a lot of flexibility to the PARTITION BY expressions. It's usually just columns or nested fields, or some common functions like year, month, etc. If you look at the parser, we create DS V2 Expressions directly. The partition-specific expressions are for …
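The limited vocabulary Wenchen describes (columns, nested fields, and a few common functions) can be seen in the DS V2 transform factory. A minimal sketch, assuming the Spark 3.x `org.apache.spark.sql.connector.expressions.Expressions` API (method names as of that API; verify against the version at hand):

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, Transform}

// Sketch: the kinds of partition transforms the parser produces
// for PARTITIONED BY clauses, built via the DS V2 factory.
val partitioning: Array[Transform] = Array(
  Expressions.identity("region"), // a plain column reference
  Expressions.years("ts"),        // year-granularity transform on a timestamp
  Expressions.bucket(16, "id")    // hash-bucket transform
)
```

The point of the thread is that this closed set is deliberate: arbitrary user expressions in PARTITION BY would be hard to reason about for sources implementing the Transform API.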

Re: Enabling push-based shuffle in Spark

2020-01-23 Thread Wenchen Fan
The name "push-based shuffle" is a little misleading. This seems like a better shuffle service that co-locates the shuffle blocks of one reducer at the map phase. I think this is a good idea. Is it possible to make it completely external via the shuffle plugin API? This looks like a good use case of …
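The plugin hook Wenchen alludes to is the `spark.shuffle.manager` configuration: Spark reflectively instantiates whatever `ShuffleManager` class it names. A rough, hypothetical skeleton follows; method signatures approximate the Spark 3.x `ShuffleManager` trait at the time of this thread, and since that trait is `private[spark]`, a real implementation would have to live under the `org.apache.spark` package:

```scala
package org.apache.spark.shuffle

import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}

// Hypothetical skeleton of a shuffle implementation that could
// co-locate one reducer's blocks at the map phase. Plugged in via:
//   --conf spark.shuffle.manager=org.apache.spark.shuffle.PushShuffleManager
class PushShuffleManager(conf: SparkConf) extends ShuffleManager {

  // Register a shuffle with this manager; the handle is passed back
  // to map and reduce tasks.
  override def registerShuffle[K, V, C](
      shuffleId: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle = ???

  // Writer used by a map task; this is where pushed/merged block
  // placement would be decided.
  override def getWriter[K, V](
      handle: ShuffleHandle,
      mapId: Long,
      context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] = ???

  // Reader used by a reduce task for its partition range.
  override def getReader[K, C](
      handle: ShuffleHandle,
      startPartition: Int,
      endPartition: Int,
      context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] = ???

  override def unregisterShuffle(shuffleId: Int): Boolean = ???
  override def shuffleBlockResolver: ShuffleBlockResolver = ???
  override def stop(): Unit = ()
}
```

The open question in the thread is whether push-based shuffle fits entirely behind this interface, or whether it needs hooks (e.g. in the external shuffle service) that the plugin API does not yet expose.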