I don't think we should deprecate existing APIs.

Spark's own Python API is relatively stable and not difficult to support. It 
has a pretty large number of users and existing code. It is also pretty easy 
for data engineers to learn.

The pandas API is great for data science, but isn't that great for some other 
tasks. It's super wide. It is great for data scientists who have learned it, 
or for copy-pasting from Stack Overflow.

On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate them eventually in favor of Koalas (given that
> we don't remove existing APIs in general)?
> 
> 
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> > complaints such as "why does PySpark follow pascalCase?" or "PySpark
> > APIs are difficult to learn", and APIs are very difficult to change
> > in Spark (as I emphasized above).
> 
> 
> 
> 
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon < gurwls...@gmail.com > wrote:
> 
> 
> 
>> 
>> 
>> Firstly, my biggest reason is that I would like to promote this as built-in
>> support because it is simply important to have it, given its impact on a
>> large user group, and the needs are increasing
>> as the charts indicate. I usually think that features or add-ons stay as
>> third parties when they target a
>> smaller set of users, address a corner case of needs, etc. I think
>> this is similar to the data sources
>> we have added: Spark ported CSV and Avro because more and more people used
>> them, and it became important
>> to have them as built-in support.
>> 
>> 
>> 
>> Secondly, Koalas needs more help from the Spark, PySpark, Python and pandas
>> experts in the
>> bigger community. The Koalas team isn't expert in all of these areas, and
>> there are many missing corner
>> cases to fix; some require deep expertise in specific areas.
>> 
>> 
>> 
>> One example is type hints. Koalas uses type hints for schema
>> inference.
>> Because Python's type hinting cannot express this natively, Koalas added
>> its own (hacky) way (
>> https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas
>> ).
>> Fortunately, the approach Koalas implemented has now been partially
>> proposed into Python officially (PEP 646).
>> But Koalas could have done better by interacting with the Python
>> community more and actively
>> joining the design discussions, to reach the best outcome that
>> benefits both and more projects.
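>> 
>> As a minimal sketch of that pattern (based on the Koalas 1.x docs linked
>> above; treat the exact names as illustrative):
>> 
>>     import pandas as pd
>>     import databricks.koalas as ks
>> 
>>     kdf = ks.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
>> 
>>     # The return type hint declares the output schema up front, so Koalas
>>     # does not need to sample data to infer it. Standard typing cannot
>>     # express per-column types like this, hence the custom (hacky) syntax
>>     # that PEP 646 (variadic generics) would make expressible.
>>     def pandas_div(pdf: pd.DataFrame) -> ks.DataFrame[float, float]:
>>         return pdf[['B', 'C']] / pdf[['B', 'C']]
>> 
>>     kdf.groupby('A').apply(pandas_div)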
>> 
>> 
>> 
>> Thirdly, I would like to contribute to the growth of PySpark. The growth
>> of Koalas has been very fast given the
>> internal and external stats: the number of users has roughly doubled
>> almost every 4 to 6 months.
>> I think Koalas will give Spark good momentum to keep growing.
>> 
>> 
>> Fourthly, PySpark is still not Pythonic enough. For example, I hear
>> complaints such as "why does
>> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
>> APIs are very difficult to change
>> in Spark (as I emphasized above). This set of Koalas APIs will be able to
>> address these concerns
>> in PySpark.
>> 
>> Lastly, I really think PySpark needs native plotting features. As I
>> emphasized before with
>> elaboration, I do think this is an important feature missing in PySpark
>> that users need.
>> I do think Koalas completes what PySpark is currently missing.
>> 
>> 
>> 
>> 
>> 
>> On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen < sro...@gmail.com > wrote:
>> 
>> 
>>> I like Koalas a lot. Playing devil's advocate: why not just let it
>>> continue to live as an add-on? Usually the argument is that it'll be
>>> maintained better in Spark, but it's already well maintained. Conversely,
>>> it adds some overhead to maintaining Spark. On the upside, it makes it a
>>> little more discoverable. Are there more 'synergies'?
>>> 
>>> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon < gurwls...@gmail.com > wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> Hi all,
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I would like to start the discussion on supporting a pandas API layer on
>>>> Spark.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> If we have a general consensus on having it in PySpark, I will initiate
>>>> and drive an SPIP with a detailed explanation of the implementation’s
>>>> overview and structure.
>>>> 
>>>> 
>>>> 
>>>> I would appreciate it if I could know whether you guys support this or
>>>> not before starting the SPIP.
>>>> 
>>>> 
>>>> 
>>>> ----------------------------
>>>>  What do you want to propose?
>>>> ----------------------------
>>>> 
>>>> 
>>>> 
>>>> I have been working on the Koalas ( https://github.com/databricks/koalas )
>>>> project, which essentially provides pandas API support on Spark, and I
>>>> would like to propose embracing Koalas in PySpark.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> More specifically, I am thinking about adding a separate package to
>>>> PySpark for pandas APIs on PySpark. Therefore, it wouldn’t break anything
>>>> in the existing code. The overview would look as below:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> pyspark_dataframe.[... PySpark APIs ...]
>>>> pandas_dataframe.[... pandas APIs (local) ...]
>>>> 
>>>> # The package names will change in the final proposal and during review.
>>>> koalas_dataframe = koalas.from_pandas(pandas_dataframe)
>>>> koalas_dataframe = koalas.from_spark(pyspark_dataframe)
>>>> koalas_dataframe.[... pandas APIs on Spark ...]
>>>> 
>>>> pyspark_dataframe = koalas_dataframe.to_spark()
>>>> pandas_dataframe = koalas_dataframe.to_pandas()
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Koalas provides a pandas API layer on PySpark. It supports almost the same
>>>> API usage. Users can leverage their existing Spark cluster to scale their
>>>> pandas workloads. It works interchangeably with PySpark by exposing both
>>>> pandas and PySpark APIs to users.
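>>>> 
>>>> For instance, a minimal sketch of the interplay (based on the Koalas 1.x
>>>> API; the file path and column names are hypothetical):
>>>> 
>>>>     import databricks.koalas as ks
>>>> 
>>>>     kdf = ks.read_csv('/data/sales.csv')      # distributed read via Spark
>>>>     kdf['total'] = kdf['price'] * kdf['qty']  # pandas-style syntax
>>>>     kdf.groupby('region').sum()               # runs on the Spark cluster
>>>>     sdf = kdf.to_spark()                      # switch to PySpark APIs
>>>>     pdf = kdf.to_pandas()                     # collect to local pandas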
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> The project has grown separately for more than two years, and it has been
>>>> going successfully. With version 1.7.0, Koalas has greatly improved in
>>>> maturity and stability. Its usability has been proven by numerous users’
>>>> adoption and by reaching more than 75% API coverage of pandas’ Index,
>>>> Series and DataFrame.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> I strongly think this is the direction we should go for Apache Spark, and
>>>> it is a win-win strategy for the growth of both Apache Spark and pandas.
>>>> Please see the reasons below.
>>>> 
>>>> 
>>>> 
>>>> ------------------
>>>>  Why do we need it?
>>>> ------------------
>>>> 
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> Python has grown dramatically in the last few years and has become one of
>>>> the most popular languages; see also the Stack Overflow trend (
>>>> https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr )
>>>> for the Python, Java, R and Scala languages.
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> pandas has become almost the standard library for data science. Please
>>>> also see the Stack Overflow trend (
>>>> https://insights.stackoverflow.com/trends?tags=pandas%2Capache-spark%2Cpyspark )
>>>> for pandas, Apache Spark and PySpark.
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> PySpark is not Pythonic enough. At least, I myself hear a lot of
>>>> complaints. That initiated Project Zen (
>>>> https://issues.apache.org/jira/browse/SPARK-32082 ), and we have greatly
>>>> improved PySpark usability and made it more Pythonic.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Nevertheless, data scientists tend to prefer pandas libraries according to
>>>> the trends, but APIs are hard to change in PySpark: we would have to
>>>> redesign all the APIs and improve them from scratch, which is very
>>>> difficult.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> One straightforward and fast approach is to benchmark a successful case,
>>>> and pandas is exactly that, but it does not support distributed execution.
>>>> Once PySpark supports pandas-like APIs, it becomes a good option for
>>>> pandas users to scale their workloads easily. I do believe this is a
>>>> win-win strategy for the growth of both pandas and PySpark.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> In fact, there are already similar attempts such as Dask ( https://dask.org/ )
>>>> and Modin ( https://modin.readthedocs.io/en/latest/ ), besides Koalas (
>>>> https://github.com/databricks/koalas ). They are all growing fast and
>>>> successfully, and I find that people compare them to PySpark from time to
>>>> time; for example, see Beyond Pandas: Spark, Dask, Vaex and other big data
>>>> technologies battling head to head (
>>>> https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13
>>>> ).
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> * 
>>>> 
>>>> Many important features that are very common in data science are missing.
>>>> One of the most important is plotting and drawing charts. Almost every
>>>> data scientist plots and draws charts to understand their data quickly
>>>> and visually in their daily work, but this is missing in PySpark. Please
>>>> see one example in pandas:
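>>>> 
>>>> A minimal sketch of the kind of one-liner pandas users rely on (the data
>>>> here is random and purely illustrative; matplotlib is assumed as the
>>>> plotting backend):
>>>> 
>>>>     import numpy as np
>>>>     import pandas as pd
>>>> 
>>>>     # One call draws a line chart of all four columns.
>>>>     pdf = pd.DataFrame(np.random.rand(100, 4), columns=list('abcd'))
>>>>     pdf.plot()
>>>> 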
>>>> I do recommend taking a quick look at the blog posts and talks made about
>>>> pandas on Spark:
>>>> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html
>>>> They explain why we need this far better than I can here.
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>
