Re: [build system] github fetches timing out

2021-03-17 Thread shane knapp ☠
it's been happening a lot again recently...  i'm investigating.

On Wed, Mar 10, 2021 at 10:23 AM Liang-Chi Hsieh  wrote:

> Thanks Shane for looking at it!
>
>
> shane knapp ☠ wrote
> > ...and just like that, overnight the builds started successfully git
> > fetching!
> >
> > --
> > Shane Knapp
> > Computer Guy / Voice of Reason
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
Thanks Nicholas for the pointer :-).

On Thu, 18 Mar 2021, 00:11 Nicholas Chammas wrote:

> On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon  wrote:
>
>>   I am currently thinking we will have to convert the Koalas tests to use
>> unittests to match with PySpark for now.
>>
> Keep in mind that pytest supports unittest-based tests out of the box, so
> you should be able to run pytest against the PySpark codebase without
> changing much about the tests.
>


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Nicholas Chammas
On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon  wrote:

>   I am currently thinking we will have to convert the Koalas tests to use
> unittests to match with PySpark for now.
>
Keep in mind that pytest supports unittest-based tests out of the box, so
you should be able to run pytest against the PySpark codebase without
changing much about the tests.
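
For illustration, a minimal sketch of such a test (the file name, class, and
assertions here are hypothetical): a plain unittest-style PySpark test that
pytest discovers and runs without modification.

    # test_rename.py -- a unittest-style test; pytest collects TestCase
    # subclasses automatically, so no pytest-specific changes are needed.
    import unittest

    from pyspark.sql import SparkSession


    class ColumnRenameTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def test_rename(self):
            df = self.spark.createDataFrame([(1, "a")], ["id", "value"])
            self.assertEqual(df.withColumnRenamed("value", "v").columns,
                             ["id", "v"])


    if __name__ == "__main__":
        unittest.main()

Running "pytest test_rename.py" should then execute the same test that
"python -m unittest test_rename" does.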


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
Yeah, that's a good point, Georg. I think we will port it as is first, and
then discuss the indexing system further.
We should probably either add a non-index mode or switch to a distributed
default index type that minimizes the side effects on the query plan.
We still have some months left. I will very likely raise another discussion
about it in a PR or on the dev mailing list after finishing the initial
porting.

On Wed, Mar 17, 2021 at 8:33 PM, Georg Heiler wrote:

> Would you plan to keep the existing indexing mechanism then?
>
> https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index
> For me, even when trying to use the distributed version, it always resulted
> in various window functions being chained, a query plan different from the
> default one, and slower job execution due to this overhead.
>
> Especially since some people here are thinking about making it the
> default/replacing the regular API, I would strongly suggest defaulting to an
> indexing mechanism that does not change the query plan.
>
> Best,
> Georg
>
> On Wed, Mar 17, 2021 at 12:13 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> > Just out of curiosity, does Koalas pretty much implement all of the
>> Pandas APIs now? If there are some that are yet to be implemented or others
>> that have differences, are these documented so users won't be caught
>> off-guard?
>>
>> It's roughly 75% done so far (in Series, DataFrame and Index).
>> Yeah, and it properly throws an exception that says an API is not
>> implemented yet (or is intentionally not implemented, e.g. Series.__iter__,
>> which would easily let users shoot themselves in the foot with, for
>> example, a for loop ...).
>>
>>
>> On Wed, Mar 17, 2021 at 2:17 PM, Bryan Cutler wrote:
>>
>>> +1, the proposal sounds good to me. Having a familiar API built-in will
>>> really help new users who might only have Pandas experience get into
>>> using Spark. It sounds like maintenance costs should be manageable, once the
>>> hurdle with setting up tests is done. Just out of curiosity, does Koalas
>>> pretty much implement all of the Pandas APIs now? If there are some that
>>> are yet to be implemented or others that have differences, are these
>>> documented so users won't be caught off-guard?
>>>
>>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo 
>>> wrote:
>>>
 Hi,

 Integrating Koalas with pyspark might help enable a richer integration
 between the two. Something that would be useful with a tighter
 integration is support for custom column array types. Currently, Spark
 takes dataframes, converts them to arrow buffers then transmits them
 over the socket to Python. On the other side, pyspark takes the arrow
 buffer and converts it to a Pandas dataframe. Unfortunately, the
 default Pandas representation of a list-type for a column causes it to
 turn what was contiguous value/offset arrays in Arrow into
 deserialized Python objects for each row. Obviously, this kills
 performance.
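
 To make that cost concrete, here is a small hypothetical sketch (assuming a
 Spark 3.x Series-to-Series pandas UDF): with a list-type column, the pandas
 Series handed to the UDF has object dtype, so each row arrives as a separate
 Python-level object rather than staying in Arrow's packed value/offset
 buffers.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([([1, 2, 3],), ([4, 5],)], ["vals"])

    @pandas_udf("long")
    def total(vals: pd.Series) -> pd.Series:
        # vals has object dtype: every row is materialized as its own
        # Python/NumPy object instead of contiguous Arrow buffers.
        return vals.apply(sum)

    df.select(total("vals")).show()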

 A PR to extend the pyspark API to elide the pandas conversion
 (https://github.com/apache/spark/pull/26783) was submitted and
 rejected, which is unfortunate, but perhaps this proposed integration
 would provide the hooks via Pandas' ExtensionArray interface to allow
 Spark to performantly interchange jagged/ragged lists to/from python
 UDFs.

 Cheers
 Andrew

 On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon 
 wrote:
 >
 > Thank you guys for all your feedback. I will start working on SPIP
 with Koalas team.
 > I would expect the SPIP can be sent late this week or early next week.
 >
 >
 > I inlined and answered the previously unanswered questions below:
 >
 > Is the community developing the pandas API layer for Spark interested
 in being part of Spark or do they prefer having their own release cycle?
 >
 > Yeah, the Koalas team used to have its own release cycle so it could
 develop and move quickly.
 > Now it has become pretty mature, having reached 1.7.0, and the team
 thinks it’s now
 > fine to have less frequent releases, and they are happy to work together
 with Spark by
 > contributing to it. The active contributors in the Koalas community
 will continue to
 > make contributions in Spark.
 >
 > How about test code? Does it fit into the PySpark test framework?
 >
 > Yes, this will be one of the places where some effort is needed.
 Koalas currently uses pytest
 > with various dependency version combinations (e.g., Python version,
 conda vs pip) whereas
 > PySpark uses plain unittest with fewer dependency version
 combinations.
 >
 > For pytest in Koalas <> unittests in PySpark:
 >
 >   I am currently thinking we will have to convert the Koalas tests to
 use unittests to match
 >   with PySpark for now.
 >   It is a feasible option for PySpark to 

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Georg Heiler
Would you plan to keep the existing indexing mechanism then?
https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index
For me, even when trying to use the distributed version, it always resulted
in various window functions being chained, a query plan different from the
default one, and slower job execution due to this overhead.

Especially since some people here are thinking about making it the
default/replacing the regular API, I would strongly suggest defaulting to an
indexing mechanism that does not change the query plan.

Best,
Georg
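
As a minimal sketch of the knob in question (option name taken from the
best-practices page linked above; behavior as described there), the default
index type can be switched and the resulting Spark plan inspected for the
extra windows/exchanges:

    import databricks.koalas as ks

    # The default "sequence" index assigns a globally sequential row number,
    # which forces a non-distributed, single-partition window computation.
    # "distributed-sequence" and "distributed" keep the work distributed at
    # the cost of weaker continuity/ordering guarantees for the index.
    ks.set_option("compute.default_index_type", "distributed")

    kdf = ks.DataFrame({"x": [1, 2, 3]})

    # Check whether index generation added extra exchanges/windows on top
    # of the plan Spark would otherwise produce.
    kdf.to_spark(index_col="index").explain()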

On Wed, Mar 17, 2021 at 12:13 PM, Hyukjin Kwon wrote:

> > Just out of curiosity, does Koalas pretty much implement all of the
> Pandas APIs now? If there are some that are yet to be implemented or others
> that have differences, are these documented so users won't be caught
> off-guard?
>
> It's roughly 75% done so far (in Series, DataFrame and Index).
> Yeah, and it properly throws an exception that says an API is not
> implemented yet (or is intentionally not implemented, e.g. Series.__iter__,
> which would easily let users shoot themselves in the foot with, for
> example, a for loop ...).
>
>
> On Wed, Mar 17, 2021 at 2:17 PM, Bryan Cutler wrote:
>
>> +1, the proposal sounds good to me. Having a familiar API built-in will
>> really help new users who might only have Pandas experience get into
>> using Spark. It sounds like maintenance costs should be manageable, once the
>> hurdle with setting up tests is done. Just out of curiosity, does Koalas
>> pretty much implement all of the Pandas APIs now? If there are some that
>> are yet to be implemented or others that have differences, are these
>> documented so users won't be caught off-guard?
>>
>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo 
>> wrote:
>>
>>> Hi,
>>>
>>> Integrating Koalas with pyspark might help enable a richer integration
>>> between the two. Something that would be useful with a tighter
>>> integration is support for custom column array types. Currently, Spark
>>> takes dataframes, converts them to arrow buffers then transmits them
>>> over the socket to Python. On the other side, pyspark takes the arrow
>>> buffer and converts it to a Pandas dataframe. Unfortunately, the
>>> default Pandas representation of a list-type for a column causes it to
>>> turn what was contiguous value/offset arrays in Arrow into
>>> deserialized Python objects for each row. Obviously, this kills
>>> performance.
>>>
>>> A PR to extend the pyspark API to elide the pandas conversion
>>> (https://github.com/apache/spark/pull/26783) was submitted and
>>> rejected, which is unfortunate, but perhaps this proposed integration
>>> would provide the hooks via Pandas' ExtensionArray interface to allow
>>> Spark to performantly interchange jagged/ragged lists to/from python
>>> UDFs.
>>>
>>> Cheers
>>> Andrew
>>>
>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon 
>>> wrote:
>>> >
>>> > Thank you guys for all your feedback. I will start working on SPIP
>>> with Koalas team.
>>> > I would expect the SPIP can be sent late this week or early next week.
>>> >
>>> >
>>> > I inlined and answered the previously unanswered questions below:
>>> >
>>> > Is the community developing the pandas API layer for Spark interested
>>> in being part of Spark or do they prefer having their own release cycle?
>>> >
>>> > Yeah, the Koalas team used to have its own release cycle so it could
>>> develop and move quickly.
>>> > Now it has become pretty mature, having reached 1.7.0, and the team
>>> thinks it’s now
>>> > fine to have less frequent releases, and they are happy to work together
>>> with Spark by
>>> > contributing to it. The active contributors in the Koalas community
>>> will continue to
>>> > make contributions in Spark.
>>> >
>>> > How about test code? Does it fit into the PySpark test framework?
>>> >
>>> > Yes, this will be one of the places where some effort is needed.
>>> Koalas currently uses pytest
>>> > with various dependency version combinations (e.g., Python version,
>>> conda vs pip) whereas
>>> > PySpark uses plain unittest with fewer dependency version
>>> combinations.
>>> >
>>> > For pytest in Koalas <> unittests in PySpark:
>>> >
>>> >   I am currently thinking we will have to convert the Koalas tests to
>>> use unittests to match
>>> >   with PySpark for now.
>>> >   It is a feasible option for PySpark to migrate to pytest too, but it
>>> will need extra effort to
>>> >   make it work with our own PySpark testing framework seamlessly.
>>> >   Koalas team (presumably and likely I) will take a look in any event.
>>> >
>>> > For the combinations of dependency versions:
>>> >
>>> >   Due to the lack of resources in GitHub Actions, I currently plan
>>> to just add the
>>> >   Koalas tests into the matrix PySpark is currently using.
>>> >
>>> > One question I have: what’s the initial goal of the proposal?
>>> > Is it to port all the pandas interfaces that Koalas has already
>>> 

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
> Just out of curiosity, does Koalas pretty much implement all of the
Pandas APIs now? If there are some that are yet to be implemented or others
that have differences, are these documented so users won't be caught
off-guard?

It's roughly 75% done so far (in Series, DataFrame and Index).
Yeah, and it properly throws an exception that says an API is not
implemented yet (or is intentionally not implemented, e.g. Series.__iter__,
which would easily let users shoot themselves in the foot with, for example,
a for loop ...).
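
As an illustration only (a sketch; the exact error class noted in the comment
is an assumption rather than something verified against the Koalas source):

    import databricks.koalas as ks

    kdf = ks.DataFrame({"a": [1, 2, 3]})

    # Unsupported or intentionally-disabled APIs raise an error that explains
    # why, instead of silently collecting the whole distributed dataset.
    try:
        for value in kdf["a"]:  # Series.__iter__ is intentionally disabled
            print(value)
    except Exception as e:      # e.g. Koalas' PandasNotImplementedError
        print(type(e).__name__, e)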


On Wed, Mar 17, 2021 at 2:17 PM, Bryan Cutler wrote:

> +1, the proposal sounds good to me. Having a familiar API built-in will
> really help new users who might only have Pandas experience get into
> using Spark. It sounds like maintenance costs should be manageable, once the
> hurdle with setting up tests is done. Just out of curiosity, does Koalas
> pretty much implement all of the Pandas APIs now? If there are some that
> are yet to be implemented or others that have differences, are these
> documented so users won't be caught off-guard?
>
> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo  wrote:
>
>> Hi,
>>
>> Integrating Koalas with pyspark might help enable a richer integration
>> between the two. Something that would be useful with a tighter
>> integration is support for custom column array types. Currently, Spark
>> takes dataframes, converts them to arrow buffers then transmits them
>> over the socket to Python. On the other side, pyspark takes the arrow
>> buffer and converts it to a Pandas dataframe. Unfortunately, the
>> default Pandas representation of a list-type for a column causes it to
>> turn what was contiguous value/offset arrays in Arrow into
>> deserialized Python objects for each row. Obviously, this kills
>> performance.
>>
>> A PR to extend the pyspark API to elide the pandas conversion
>> (https://github.com/apache/spark/pull/26783) was submitted and
>> rejected, which is unfortunate, but perhaps this proposed integration
>> would provide the hooks via Pandas' ExtensionArray interface to allow
>> Spark to performantly interchange jagged/ragged lists to/from python
>> UDFs.
>>
>> Cheers
>> Andrew
>>
>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon  wrote:
>> >
>> > Thank you guys for all your feedback. I will start working on SPIP with
>> Koalas team.
>> > I would expect the SPIP can be sent late this week or early next week.
>> >
>> >
>> > I inlined and answered the previously unanswered questions below:
>> >
>> > Is the community developing the pandas API layer for Spark interested
>> in being part of Spark or do they prefer having their own release cycle?
>> >
>> > Yeah, the Koalas team used to have its own release cycle so it could
>> develop and move quickly.
>> > Now it has become pretty mature, having reached 1.7.0, and the team
>> thinks it’s now
>> > fine to have less frequent releases, and they are happy to work together
>> with Spark by
>> > contributing to it. The active contributors in the Koalas community
>> will continue to
>> > make contributions in Spark.
>> >
>> > How about test code? Does it fit into the PySpark test framework?
>> >
>> > Yes, this will be one of the places where some effort is needed. Koalas
>> currently uses pytest
>> > with various dependency version combinations (e.g., Python version,
>> conda vs pip) whereas
>> > PySpark uses plain unittest with fewer dependency version
>> combinations.
>> >
>> > For pytest in Koalas <> unittests in PySpark:
>> >
>> >   I am currently thinking we will have to convert the Koalas tests to
>> use unittests to match
>> >   with PySpark for now.
>> >   It is a feasible option for PySpark to migrate to pytest too, but it
>> will need extra effort to
>> >   make it work with our own PySpark testing framework seamlessly.
>> >   Koalas team (presumably and likely I) will take a look in any event.
>> >
>> > For the combinations of dependency versions:
>> >
>> >   Due to the lack of resources in GitHub Actions, I currently plan
>> to just add the
>> >   Koalas tests into the matrix PySpark is currently using.
>> >
>> > One question I have: what’s the initial goal of the proposal?
>> > Is it to port all the pandas interfaces that Koalas has already
>> implemented?
>> > Or just a basic set of them?
>> >
>> > The goal of the proposal is to port all of the Koalas project into PySpark.
>> > For example,
>> >
>> > import koalas
>> >
>> > will be equivalent to
>> >
>> > # Names, etc. might change in the final proposal or during the review
>> > from pyspark.sql import pandas
>> >
>> > Koalas supports pandas APIs with a separate layer that covers the
>> differences between
>> > the DataFrame structures in pandas and PySpark, e.g., other types as column
>> names (labels),
>> > an index (something like a row number in DBMSs), and so on. So I think it
>> would make more sense
>> > to port the whole layer instead of a subset of the APIs.
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Mar 17, 2021 at 12:32 AM, Wenchen Fan wrote:
>> >>
>> >> +1, it's great to have Pandas support in