I have noticed a worthy discussion in the SPIP comments regarding the
definition of "stored procedure" in the context of Spark, and I believe it
is an important point to address.

To provide some historical context, Sybase
<https://www.referenceforbusiness.com/history2/49/Sybase-Inc.html>, a
relational database vendor (which later co-licensed their code to Microsoft
for SQL Server), introduced the concept of stored procedures while
positioning themselves as a client-server company. During this period, they
were in competition with Oracle, particularly in the realm of front-office
trading systems. The introduction of stored procedures, stored on the
server-side within the database, allowed Sybase to modularize frequently
used code. This move significantly reduced network overhead and latency.
Stored procedures were first introduced in the mid-1980s and proved to be a
profitable innovation. It is important to note that they had a robust
database to rely on during this process.

Now, as we contemplate the implementation of stored procedures in Spark, we
must think strategically about where these procedures will be stored and
how they will be reused. Some colleagues have suggested using HMS (Derby)
by default, but it is worth noting that an HMS backed by the default
embedded Derby database supports only a single connection at a time. If we
intend to leverage stored procedures extensively, should we consider
establishing a "native" storage solution? This approach not only aligns
with good architectural practices but also has the potential for broader
applications beyond Spark. While empowering users to choose their preferred
database for this purpose might sound appealing, it may not be the most
realistic or practical approach. This discussion highlights the importance
of clarifying terminology and establishing a solid foundation for this
feature.
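
As a concrete illustration of the storage question, here is a minimal
sketch (the hostname and application name are placeholders, and this is
not part of the SPIP) of pointing a PySpark session at a shared Hive
Metastore service backed by an external RDBMS, rather than the default
embedded Derby instance:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-catalog-demo")
    # Remote metastore service (placeholder host, standard thrift port)
    # instead of a local, single-connection metastore_db/ Derby directory.
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Whatever the catalog ends up holding -- tables today, perhaps stored
# procedures tomorrow -- would then live in a multi-user backend.
print(spark.conf.get("spark.sql.catalogImplementation"))  # "hive"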

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 31 Aug 2023 at 18:19, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> I concur with the viewpoint raised by @Sean Owen
>
> While this might introduce some challenges related to compatibility and
> environment issues, it is not fundamentally different from how users
> currently import and use common code in Python. The main difference is
> that this shared code would now be stored as stored procedures in the
> catalog of the user's choice, probably the Hive Metastore.
>
> HTH
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 31 Aug 2023 at 16:41, Sean Owen <sro...@gmail.com> wrote:
>
>> I think you're talking past Hyukjin here.
>>
>> I think the response is: none of that is managed by PySpark now, and this
>> proposal does not change that. Your current interpreter and environment are
>> used to execute the stored procedure, which is just Python code. It's on
>> you to bring an environment that runs the code correctly. This is just the
>> same as how running any python code works now.
>>
>> I think you have exactly the same problems with UDFs now, and that's all
>> a real problem, just not something Spark has ever tried to solve for you.
>> Think of this as exactly like: I have a bit of python code I import as a
>> function and share across many python workloads. Just, now that chunk is
>> stored as a 'stored procedure'.
>>
>> I agree this raises the same problem in new ways - now, you are storing
>> and sharing a chunk of code across many workloads. There is more potential
>> for compatibility and environment problems, as all of that is simply punted
>> to the end workloads. But, it's not different from importing common code
>> and the world doesn't fall apart.
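>>
>> As a rough illustration (the module and function names here are made up),
>> the status quo is simply:
>>
>> # shared_transforms.py -- an ordinary module that every workload imports;
>> # whatever environment runs the job must make it importable.
>> def cleanse(df):
>>     return df.dropna().dropDuplicates()
>>
>> # any_workload.py
>> from pyspark.sql import SparkSession
>> from shared_transforms import cleanse
>>
>> spark = SparkSession.builder.getOrCreate()
>> result = cleanse(spark.range(10).toDF("id"))
>>
>> Under the proposal, that same chunk of Python would instead come out of
>> the catalog as a stored procedure, but it would still run on whatever
>> interpreter and environment the calling workload brings.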
>>
>> On Wed, Aug 30, 2023 at 11:16 PM Alexander Shorin <kxe...@apache.org>
>> wrote:
>>
>>>
>>> Which Python version will run that stored procedure?
>>>>
>>>> All Python versions supported in PySpark
>>>>
>>>
>>> Where does the stored procedure define the exact Python version that
>>> will run the code? That was the question.
>>>
>>>
>>>> How to manage external dependencies?
>>>>
>>>> Existing way we have:
>>>> https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
>>>> In fact, this will use the external dependencies within your Python
>>>> interpreter so you can use all existing conda or venvs.
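>>>>
>>>> For example, roughly following that packaging guide (package names and
>>>> paths below are illustrative), a packed conda environment can be shipped
>>>> alongside the job so the executors run with the same dependencies:
>>>>
>>>> # Built once outside Spark, e.g.:
>>>> #   conda create -y -n pyspark_conda_env -c conda-forge pandas conda-pack
>>>> #   conda activate pyspark_conda_env
>>>> #   conda pack -f -o pyspark_conda_env.tar.gz
>>>> import os
>>>> from pyspark.sql import SparkSession
>>>>
>>>> # Executors unpack the archive under ./environment and use its interpreter.
>>>> os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
>>>>
>>>> spark = (
>>>>     SparkSession.builder
>>>>     .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
>>>>     .getOrCreate()
>>>> )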
>>>>
>>> The current proposal does not solve this issue at all (the stored code
>>> doesn't provide any manifest describing its dependencies or what is
>>> required to run it). So it feels better to stay with UDFs, since they are
>>> under our control and their behaviour is predictable. Did I miss something?
>>>
>>> How to test it via a common CI process?
>>>>
>>>> Existing way of PySpark unittests, see
>>>> https://github.com/apache/spark/tree/master/python/pyspark/tests
>>>>
>>> Sorry, but this wouldn't work, since a stored procedure requires some
>>> specific definition and this code will not be stored as regular Python
>>> code. Do you have any examples of how to test stored Python procedures as
>>> a unit, e.g. without Spark?
>>>
>>> How to manage versions and do upgrades? Migrations?
>>>>
>>>> This is a new feature, so no migration is needed. We will keep
>>>> compatibility according to the semver policy we follow.
>>>>
>>> The question was not about Spark, but about stored procedures themselves.
>>> Are there any guidelines that will not copy the flaws of other systems?
>>>
>>> The current Python UDF solution handles these problems in a good way
>>>> since it delegates them to the project level.
>>>>
>>>> The current UDF solution cannot handle stored procedures because UDFs
>>>> are on the worker side. This is on the driver side.
>>>>
>>> How so? Currently it works and we have never faced such an issue. Maybe
>>> you should have the same Python code on the driver side as well? But such
>>> a trivial idea doesn't require a new feature in Spark, since you already
>>> have to ship that code somehow.
>>>
>>> --
>>> ,,,^..^,,,
>>>
>>
