I agree with the proposed revamp of the PySpark API. It's a good move,
and to ensure smooth progress we could break it down into smaller,
more manageable steps.
------------------------------------------------------------------
From: Sem <[email protected]>
Sent: Friday, April 12, 2024 17:05
To: dev <[email protected]>
Subject: Re: A new PySpark API that will work with both Spark Classic
and Spark Connect
Exactly!
It is the topic I wanted to discuss at the last community meeting!
On Fri, 2024-04-12 at 08:23 +0000, Weibin Zeng wrote:
> Perhaps this is not directly relevant, but is the proposal you're
> mentioning the topic that was intended for discussion at the last
> community meeting?
>
> On 2024/04/11 18:58:56 Sem wrote:
> > Hello!
> >
> > The current PySpark implementation has one serious problem: it
> > won't work with the new Spark Connect because it relies on `py4j`
> > and an internal `_jvm` variable.
> >
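> > To illustrate, this is roughly the pattern the current wrappers
> > use (the Scala helper name below is just a placeholder, not the
> > real API):
> >
> > from pyspark.sql import SparkSession
> >
> > spark = SparkSession.builder.getOrCreate()
> >
> > # Classic wrapper style: call into the JVM through the py4j gateway.
> > # A Spark Connect session has no `_jvm` attribute, so this call path
> > # simply breaks there.
> > jvm_info = spark._jvm.SomeGraphArScalaHelper.loadGraphInfo(
> >     "/path/to/graph.yml", spark._jsparkSession
> > )
> >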
> > My suggestion is to rewrite the PySpark API from scratch in the
> > following way:
> >
> > 1. We will have pure Python GraphInfo, EdgeInfo and VertexInfo
> > 2. We will have pure PySpark utils (index generators)
> > 3. We will use the Spark Scala datasources for reading and writing
> >    the GAR format from PySpark (a rough sketch follows after this
> >    list)
> >
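> > For point 3, something like the sketch below is what I have in
> > mind (the datasource name and options are tentative, not a final
> > API):
> >
> > from pyspark.sql import SparkSession
> >
> > spark = SparkSession.builder.getOrCreate()
> >
> > # Read vertex chunks in GAR format through the Scala datasource;
> > # no py4j or `_jvm` access is involved, only the DataFrame API.
> > df = (
> >     spark.read.format("org.apache.graphar.datasources.GarDataSource")
> >     .option("fileType", "parquet")
> >     .load("/path/to/vertex/chunks")
> > )
> >
> > # Writing back in GAR format goes through the same DataFrame API.
> > (
> >     df.write.format("org.apache.graphar.datasources.GarDataSource")
> >     .option("fileType", "parquet")
> >     .save("/path/to/output")
> > )
> >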
> > It is a lot of work, but I'm committing to do it and to support it
> > in the future as a PMC member of the project. Decoupling PySpark
> > from Scala will also simplify Scala/Java development. Another good
> > point is that the actual logic in PySpark will mostly live in
> > Python code, which simplifies reading the source code and debugging
> > for everyone who wants to work with the library.
> >
> > A couple of additional dependencies will be introduced (a short
> > usage sketch follows after the list):
> > 1. Pydantic for working with the YAML models of the Info objects
> >    (MIT License, pure Python)
> > 2. PyYAML for the same reason (MIT License, pure Python/Cython)
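> >
> > Roughly, the Info models would be plain Pydantic classes loaded
> > from YAML; the field names below are only a guess, not the final
> > schema:
> >
> > from typing import List
> >
> > import yaml
> > from pydantic import BaseModel
> >
> > class GraphInfo(BaseModel):
> >     # Hypothetical subset of fields; the real model would mirror
> >     # the GAR graph.yml specification.
> >     name: str
> >     prefix: str
> >     vertices: List[str] = []
> >     edges: List[str] = []
> >
> > with open("graph.yml") as f:
> >     info = GraphInfo(**yaml.safe_load(f))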
> >
> > Overall it should be good for the project, because it will simplify
> > testing of both parts (Spark and PySpark).
> >
> > I see GraphAr PySpark used not so much in production ETL jobs but
> > in interactive development and ad-hoc analytics on graph data. And
> > typically such analytics happens in Databricks notebooks (which do
> > not provide access to `_jvm` on shared clusters) or in other tools
> > that rely on Spark Connect (like VSCode with Spark Connect). So for
> > that use case, support for Spark Connect may be more important than
> > for the Spark Scala part, which should be used for jobs rather than
> > interactive development.
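> >
> > For example, with Spark Connect the session is created against a
> > remote URL (the host below is a placeholder), and a pure-Python
> > GraphAr API would work with it unchanged because it never touches
> > `_jvm`:
> >
> > from pyspark.sql import SparkSession
> >
> > # Spark Connect session (PySpark 3.4+); the URL is a placeholder.
> > spark = SparkSession.builder.remote("sc://my-cluster:15002").getOrCreate()
> >
> > # Only DataFrame-level operations, so the same code runs on both
> > # Spark Classic and Spark Connect.
> > spark.read.parquet("/path/to/vertex/chunks").show()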
> >
> > Thoughts?
> >
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]