I looked into the Arrow build system some more. It is possible to get the 
Python classes generated by adding the `--python_out` flag (set to a directory 
created for it) to the `${ARROW_PROTOBUF_PROTOC}` command under 
`macro(build_substrait)` in `cpp/cmake_modules/ThirdpartyToolchain.cmake`. 
However, this makes them available only in the Arrow C++ build whereas for the 
current purpose they need to be available in the PyArrow build. The PyArrow 
build calls `cmake` on `python/CMakeLists.txt`, which AFAICS has access to 
`cpp/cmake_modules`. So, one solution could be to pull `macro(build_substrait)` 
into `python/CMakeLists.txt` and call it to generate the Python protobuf 
classes under `python/`, making them available for import by PyArrow code. This 
would probably be cleaner with some macro parameters to distinguish between C++ 
and Python generation.
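
To make the flag concrete, here is a rough sketch of the kind of `protoc` invocation I have in mind, driven from Python rather than CMake. It is only an illustration: the helper names and the directory arguments are hypothetical, and the real change would live in the CMake macro, not in a script like this.

```python
import subprocess
from pathlib import Path


def protoc_python_command(protoc, proto_root, out_dir):
    """Build a protoc command line that emits *_pb2.py files into out_dir.

    All three arguments are placeholders chosen for this sketch; in the
    Arrow build the protoc path would come from ${ARROW_PROTOBUF_PROTOC}.
    """
    proto_root = Path(proto_root)
    # Collect every .proto file below the root, in a stable order.
    proto_files = sorted(str(p) for p in proto_root.rglob("*.proto"))
    return [
        str(protoc),
        f"--proto_path={proto_root}",
        f"--python_out={out_dir}",
        *proto_files,
    ]


def generate_python_classes(protoc, proto_root, out_dir):
    """Run protoc; needs a protoc binary and at least one .proto file."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(protoc_python_command(protoc, proto_root, out_dir), check=True)
```

Calling `generate_python_classes()` with the unpacked Substrait `proto` directory as `proto_root` should leave the generated modules under `out_dir`, which would then need to be importable from PyArrow.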

Does this sound like a reasonable approach?


Yaron.

________________________________
From: Yaron Gvili <rt...@hotmail.com>
Sent: Saturday, July 2, 2022 8:55 AM
To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud 
<cpcl...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

I'm somewhat confused by this answer because I think resolving the issue I 
raised does not require any change outside PyArrow. I'll try to explain the 
issue differently.

First, let me describe the current situation with Substrait protobuf in Arrow 
C++. The Arrow C++ build system handles external projects in 
`cpp/cmake_modules/ThirdpartyToolchain.cmake`, and one of these external 
projects is "substrait". By default, the build system takes the source code for 
"substrait" from 
`https://github.com/substrait-io/substrait/archive/${ARROW_SUBSTRAIT_BUILD_VERSION}.tar.gz`, 
where `ARROW_SUBSTRAIT_BUILD_VERSION` is set in 
`cpp/thirdparty/versions.txt`. The source code is check-summed and unpacked in 
`substrait_ep-prefix` under the build directory and from this the protobuf C++ 
classes are generated in `*.pb.{h,cc}` files in `substrait_ep-generated` under 
the build directory. The build system makes a library using the `*.cc` files 
and makes the `*.h` files available for other C++ modules to use.

Setting up the above mechanism did not require any change in the 
`substrait-io/substrait` repo, nor any coordination with its authors. What I'm 
looking for is a similar build mechanism for PyArrow that builds Substrait 
protobuf Python classes and makes them available for use by other PyArrow 
modules. I believe this PyArrow build mechanism does not exist currently and 
that setting one up would not require any changes outside PyArrow. I'm asking 
(1) whether that's indeed the case, (2) whether others agree this mechanism is 
needed at least due to the problem I ran into that I previously described, and 
(3) for any thoughts about how to set up this mechanism assuming it is needed.

Weston, perhaps your thinking was that the Substrait protobuf Python classes 
need to be built by a repo in the substrait-io space and made available as a 
binary+headers package? This can be done but will require involving Substrait 
people and appears to be inconsistent with current patterns in the Arrow build 
system. Note that for my purposes here, the Substrait protobuf Python classes 
will be used for composing or interpreting a Substrait plan, not for 
transforming it by an optimizer, though a Python-based optimizer is a valid use 
case for them.


Yaron.
________________________________
From: Weston Pace <weston.p...@gmail.com>
Sent: Friday, July 1, 2022 12:42 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>; Phillip Cloud 
<cpcl...@gmail.com>
Subject: Re: accessing Substrait protobuf Python classes from PyArrow

Given that Acero does not do any planner / optimizer type tasks I'm
not sure you will find anything like this in arrow-cpp or pyarrow.
What you are describing I sometimes refer to as "plan slicing and
dicing".  I have wondered if we will someday need this in Acero but I
fear it is a slippery slope between "a little bit of plan
manipulation" and "a full blown planner" so I've shied away from it.
My first spot to look would be a substrait-python repository which
would belong here: https://github.com/substrait-io

However, it does not appear that such a repository exists.  If you're
willing to create one then a quick ask on the Substrait Slack instance
should be enough to get the repository created.  Perhaps there is some
genesis of this library in Ibis although I think Ibis would use its
own representation for slicing and dicing and only use Substrait for
serialization.

Once that repository is created pyarrow could probably import it but
unless this plan manipulation makes sense purely from a pyarrow
perspective I would prefer that the user application import
both pyarrow and substrait-python independently.

Perhaps @Phillip Cloud or someone from the Ibis space might have some
ideas on where this might be found.

-Weston

On Thu, Jun 30, 2022 at 10:06 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi,
>
> Is there support for accessing Substrait protobuf Python classes (such as 
> Plan) from PyArrow? If not, how should such support be added? For example, 
> should the PyArrow build system pull in the Substrait repo as an external 
> project and build its protobuf Python classes, in a manner similar to how 
> Arrow C++ does it?
>
> I'm pondering these questions after running into an issue with code I'm 
> writing under PyArrow that parses a Substrait plan represented as a 
> dictionary. The current (and kind of shaky) parsing operation in this code 
> uses json.dumps() on the dictionary, which results in a string that is passed 
> to a Cython API that handles it using Arrow C++ code that has access to 
> Substrait protobuf C++ classes. But when the Substrait plan contains a 
> bytes-type, json.dumps() no longer works and fails with "TypeError: Object of 
> type bytes is not JSON serializable". A fix for this, and a better way to 
> parse, is using google.protobuf.json_format.ParseDict() [1] on the 
> dictionary. However, this invocation requires a second argument, namely a 
> protobuf message instance to merge with. The class of this message (such as 
> Plan) is a Substrait protobuf Python class, hence the need to access such 
> classes from PyArrow.
>
> [1] 
> https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html
>
>
> Yaron.
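
The json.dumps() failure described in the quoted message can be reproduced with the standard library alone. The sketch below is not the ParseDict() route proposed there; it is a stdlib-only illustration that uses a `default` hook to base64-encode bytes, which is how the protobuf JSON mapping represents bytes fields. The plan fragment and its key names are made up, and whether the resulting string is accepted by the existing Cython API is an assumption I have not checked.

```python
import base64
import json


def _encode_bytes(obj):
    """json.dumps fallback: base64-encode bytes, reject anything else."""
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(bytes(obj)).decode("ascii")
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hypothetical plan fragment; the key names are invented for illustration.
plan = {"version": 1, "payload": b"\x00\x01\x02"}

try:
    json.dumps(plan)
    error = None
except TypeError as exc:
    # This is the failure mode from the original message.
    error = str(exc)

# With the hook, bytes values are emitted as base64 strings.
text = json.dumps(plan, default=_encode_bytes)
```

With access to the Substrait protobuf Python classes, ParseDict() would make this hook unnecessary, since it takes the dictionary directly.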
