Sounds good. In the suggestion above, the builders for FileEncryptionProperties/FileDecryptionProperties should not be exposed, so that only the key tools can create those objects. This is just one option, of course.
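To make that concrete, here is a rough sketch of what it could look like from pandas. All names here (CryptoFactory, KmsConnectionConfig, EncryptionConfiguration, DecryptionConfiguration, and the encryption=/decryption= keywords) are hypothetical placeholders extending the snippet quoted below, not a finalized pyarrow API:

```
import pandas as pd

# Hypothetical names for illustration only; the actual pyarrow bindings
# are still under discussion. The key-tools factory is the sole public
# entry point: the FileEncryptionProperties/FileDecryptionProperties
# builders stay internal, and users only see the factory and its configs.
factory = CryptoFactory(KmsConnectionConfig(kms_instance_url="..."))

df = pd.DataFrame({"ssn": ["123-45-6789"], "name": ["Jane Doe"]})

# Write: the factory creates FileEncryptionProperties internally from a
# declarative config (footer key id plus per-column key ids).
df.to_parquet(
    "data.parquet",
    engine="pyarrow",
    encryption=factory.file_encryption_properties(
        EncryptionConfiguration(
            footer_key="footer_key_id",
            column_keys={"col_key_id": ["ssn"]},
        )
    ),
)

# Read: decryption properties are likewise created by the factory, which
# resolves keys through its KeyRetriever.
df = pd.read_parquet(
    "data.parquet",
    engine="pyarrow",
    decryption=factory.file_decryption_properties(DecryptionConfiguration()),
)
```

The point is just that the factory would be the only way to obtain properties objects; the builder classes themselves would never appear in the public pyarrow namespace.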
On 2020/09/03 20:44:26, Antoine Pitrou <anto...@python.org> wrote:
>
> It would be useful to explain to outsiders what those two API levels
> are, and what usage they correspond to.
> Is Parquet encryption used only with Spark? While Spark
> interoperability is important, Parquet files are more ubiquitous than that.
>
> Regards
>
> Antoine.
>
>
> On 03/09/2020 at 22:31, Gidon Gershinsky wrote:
> > Why would the low-level API be exposed directly? This will break the
> > interop between the two analytic ecosystems down the road.
> > Again, let me suggest leveraging the high-level interface, based on the
> > PropertiesDrivenCryptoFactory.
> > It should address your technical requirements; if it doesn't, we can
> > discuss the gaps.
> > All questions are welcome.
> >
> > Cheers, Gidon
> >
> >
> > On Thu, Sep 3, 2020 at 10:11 PM Roee Shlomo <roe...@gmail.com> wrote:
> >
> >> Hi Itamar,
> >>
> >> I implemented some Python wrappers for the low-level API and would be
> >> happy to collaborate on that. The reason I haven't pushed this forward
> >> yet is what Gidon mentioned: the API to expose to Python users needs to
> >> be finalized first, and it must include the key tools API for interop
> >> with Spark.
> >>
> >> Perhaps it would be good to kick off a discussion on what the pyarrow
> >> API for PME should look like (in parallel to reviewing the arrow-cpp
> >> implementation of key-tools, to ensure that wrapping it would be a
> >> reasonable effort).
> >>
> >> One possible approach is to expose both the low-level API and key tools
> >> separately. A user would create and initialize a
> >> PropertiesDrivenCryptoFactory and use it to create the
> >> FileEncryptionProperties/FileDecryptionProperties to pass to the
> >> lower-level API. In pandas this would translate to something like:
> >> ```
> >> factory = PropertiesDrivenCryptoFactory(...)
> >> df.to_parquet(path, engine="pyarrow",
> >>               encryption=factory.getFileEncryptionProperties(...))
> >> df = pd.read_parquet(path, engine="pyarrow",
> >>                      decryption=factory.getFileDecryptionProperties(...))
> >> ```
> >> This should also work when reading datasets, since decryption uses a
> >> KeyRetriever, but I'm not sure what will need to be done once datasets
> >> support writes.
> >>
> >> On 2020/09/03 14:11:51, "Itamar Turner-Trauring" <ita...@pythonspeed.com>
> >> wrote:
> >>> Hi,
> >>>
> >>> I'm looking into implementing this, and it seems like there are two
> >>> parts: packaging, but also wrapping the APIs in Python. Is the latter
> >>> item accurate? If so, any examples of similar existing wrapped APIs,
> >>> or should I just come up with something on my own?
> >>>
> >>> Context:
> >>> https://github.com/apache/arrow/pull/4826
> >>> https://issues.apache.org/jira/browse/ARROW-8040
> >>>
> >>> Thanks,
> >>>
> >>> —Itamar