tritzman commented on issue #41246:
URL: https://github.com/apache/arrow/issues/41246#issuecomment-2062856650
In my application code, when I call `write_dataset`, I pass a `file_visitor`
that collects metadata as the Parquet files are created. Looking at the
metadata of each `pyarrow.dataset.WrittenFile`, I find `path_in_schema`, which
shows that lists are stored in Parquet under the name
`<column_name>.list.element`. Adding that suffix to the column name listed
under `col_b_key_name` (see `column_keys` below) makes everything work,
including the assert that compares the input table with the output table.
(ATM I'm not sure how to confirm that all of the data is actually encrypted.)
```
column_keys={
    col_a_key_name: ["a"],
    col_b_key_name: ["b.list.element"],
}
```
Similarly, my application data includes structs. There I found a
`path_in_schema` entry for each field of the struct. I believe this would
require a key declaration for each struct field (e.g. `<column_name>.field_1`,
`<column_name>.field_2`, `<column_name>.field_3`, etc.).
I have not looked into nested structs-of-lists or lists-of-structs to see
how those are represented in Parquet.
It seems reasonable to have the developer list the column names to encrypt.
But for non-primitive types, I'm not sure how they would know the modified
column name used in the file.
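One possible workaround, sketched here as an assumption rather than anything pyarrow provides: let the developer name only top-level columns, then expand them to the leaf paths reported by `path_in_schema`. The helper `build_column_keys` and the key names `key_a` and `key_b` are hypothetical, and the leaf paths are hard-coded to match what I saw in my files.

```python
from collections import defaultdict


def build_column_keys(leaf_paths, key_for_column):
    """Group Parquet leaf paths under the key name of their top-level column.

    leaf_paths: `path_in_schema` strings, e.g. "b.list.element".
    key_for_column: top-level column name -> key name (hypothetical mapping).
    Assumes top-level column names do not themselves contain a dot.
    """
    column_keys = defaultdict(list)
    for path in leaf_paths:
        top_level = path.split(".", 1)[0]
        key_name = key_for_column.get(top_level)
        # Columns without a key are left out, i.e. not encrypted.
        if key_name is not None and path not in column_keys[key_name]:
            column_keys[key_name].append(path)
    return dict(column_keys)


# Example input: leaf paths as they might be collected from file metadata.
leaf_paths = ["a", "b.list.element", "c.field_1", "c.field_2", "d"]
column_keys = build_column_keys(
    leaf_paths, {"a": "key_a", "b": "key_b", "c": "key_b"}
)
print(column_keys)
```

The resulting dictionary has the same shape as the `column_keys` argument shown earlier, so it could be passed to the encryption configuration in its place. The catch is that the leaf paths have to be known before the encrypted write, for example from an unencrypted sample file.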
In my application code, when writing encrypted Parquet, Python silently
crashes in the previously mentioned file visitor. The application just exits
with no messages or exceptions. This happens when calling
`.metadata.to_dict()` on a `pyarrow.dataset.WrittenFile`. By setting a
breakpoint and experimenting in the debugger, I found the same symptom when
calling `metadata.row_group(0).to_dict()`. I won't be collecting and
writing the `_metadata` or `_common_metadata` files when encrypting the data,
so this code is normally disabled. But I figured it was worth noting the crash.