Cool, thank you. This would solve the problem at hand. I agree it'd be good to kick off the PyArrow API discussion in parallel with the PR8023 review. Maybe you and Itamar could prep a Google Doc draft for the community to review and comment on.
Cheers, Gidon

On Fri, Sep 4, 2020 at 6:08 PM Roee Shlomo <roe...@gmail.com> wrote:
> Sounds good. In the suggestion above, the builders for
> FileEncryptionProperties/FileDecryptionProperties should not be exposed, so
> only key tools would create those. This is just one option of course.
>
> On 2020/09/03 20:44:26, Antoine Pitrou <anto...@python.org> wrote:
> >
> > It would be useful for outsiders to explain what those two API levels
> > are, and to what usage they correspond.
> > Is Parquet encryption used only with Spark? While Spark
> > interoperability is important, Parquet files are more ubiquitous than
> > that.
> >
> > Regards
> >
> > Antoine.
> >
> > On 03/09/2020 at 22:31, Gidon Gershinsky wrote:
> > > Why would the low-level API be exposed directly? This would break the
> > > interop between the two analytic ecosystems down the road.
> > > Again, let me suggest leveraging the high-level interface, based on the
> > > PropertiesDrivenCryptoFactory.
> > > It should address your technical requirements; if it doesn't, we can
> > > discuss the gaps.
> > > All questions are welcome.
> > >
> > > Cheers, Gidon
> > >
> > > On Thu, Sep 3, 2020 at 10:11 PM Roee Shlomo <roe...@gmail.com> wrote:
> > >
> > >> Hi Itamar,
> > >>
> > >> I implemented some Python wrappers for the low-level API and would be
> > >> happy to collaborate on that. The reason I didn't push this forward
> > >> yet is what Gidon mentioned. The API to expose to Python users needs
> > >> to be finalized first, and it must include the key tools API for
> > >> interop with Spark.
> > >>
> > >> Perhaps it would be good to kick off a discussion on what the pyarrow
> > >> API for PME should look like (in parallel to reviewing the arrow-cpp
> > >> implementation of key-tools, to ensure that wrapping it would be a
> > >> reasonable effort).
> > >>
> > >> One possible approach is to expose both the low-level API and key
> > >> tools separately.
> > >> A user would create and initialize a
> > >> PropertiesDrivenCryptoFactory and use it to create the
> > >> FileEncryptionProperties/FileDecryptionProperties to pass to the
> > >> lower-level API. In pandas this would translate to something like:
> > >> ```
> > >> factory = PropertiesDrivenCryptoFactory(...)
> > >> df.to_parquet(path, engine="pyarrow",
> > >>               encryption=factory.getFileEncryptionProperties(...))
> > >> df = pd.read_parquet(path, engine="pyarrow",
> > >>                      decryption=factory.getFileDecryptionProperties(...))
> > >> ```
> > >> This should also work with reading datasets, since decryption uses a
> > >> KeyRetriever, but I'm not sure what will need to be done once
> > >> datasets support write.
> > >>
> > >> On 2020/09/03 14:11:51, "Itamar Turner-Trauring" <ita...@pythonspeed.com> wrote:
> > >>> Hi,
> > >>>
> > >>> I'm looking into implementing this, and it seems like there are two
> > >>> parts: packaging, but also wrapping the APIs in Python. Is the
> > >>> latter item accurate? If so, any examples of similar existing
> > >>> wrapped APIs, or should I just come up with something on my own?
> > >>>
> > >>> Context:
> > >>> https://github.com/apache/arrow/pull/4826
> > >>> https://issues.apache.org/jira/browse/ARROW-8040
> > >>>
> > >>> Thanks,
> > >>>
> > >>> —Itamar
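[Editor's note: to make the proposed design concrete, here is a minimal, self-contained Python mock of the API shape discussed in the thread. It is NOT actual pyarrow code; the class and method names mirror the proposal (PropertiesDrivenCryptoFactory handing out encryption/decryption properties), and the underscore-prefixed property classes illustrate the suggestion that their builders not be exposed, so only the key-tools factory can create them. All constructor parameters here are hypothetical.]

```python
class _FileEncryptionProperties:
    """Opaque handle; users never construct this directly (hypothetical)."""
    def __init__(self, footer_key, column_keys):
        self.footer_key = footer_key      # key ID used to protect the footer
        self.column_keys = column_keys    # mapping of key ID -> column names

class _FileDecryptionProperties:
    """Opaque handle; key retrieval is delegated to the factory's KMS config."""
    def __init__(self, kms_connection):
        self.kms_connection = kms_connection

class PropertiesDrivenCryptoFactory:
    """Mock of the high-level key-tools entry point from the thread.

    The factory is the only public way to obtain file encryption or
    decryption properties, mirroring the 'builders not exposed' option.
    """
    def __init__(self, kms_connection):
        self.kms_connection = kms_connection

    def getFileEncryptionProperties(self, footer_key, column_keys=None):
        return _FileEncryptionProperties(footer_key, column_keys or {})

    def getFileDecryptionProperties(self):
        # No per-file key material needed: a key retriever resolves keys
        # via the KMS, which is what makes dataset reads work transparently.
        return _FileDecryptionProperties(self.kms_connection)

# Usage, matching the pandas sketch in the email:
factory = PropertiesDrivenCryptoFactory(kms_connection="kms://example")
enc = factory.getFileEncryptionProperties("footer_key_id",
                                          {"col_key_id": ["a", "b"]})
dec = factory.getFileDecryptionProperties()
print(enc.footer_key)       # footer_key_id
print(dec.kms_connection)   # kms://example
```

The design point this illustrates: because decryption properties carry only a KMS connection (keys are resolved at read time by a retriever), the same object can be passed when reading a whole dataset, as noted in the thread.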