tritzman commented on issue #41246:
URL: https://github.com/apache/arrow/issues/41246#issuecomment-2062856650

   In my application code, when I call `write_dataset`, I have a `file_visitor` 
that collects metadata as Parquet files are created. Looking at the metadata of 
`pyarrow.dataset.WrittenFile`, I find `path_in_schema`, which shows that 
lists are stored in Parquet under the name `<column_name>.list.element`. Adding 
that suffix to the column name listed under `col_b_key_name` (see `column_keys` 
below) makes everything work, including the assert that compares the input 
table with the output table. (ATM I'm not sure how to confirm that all data is 
actually encrypted.)
   
   `
   column_keys={
      col_a_key_name: ["a"],
      col_b_key_name: ["b.list.element"],
   }
   `
   
   Similarly, my application data includes structs. There I found a 
`path_in_schema` entry for each field of the struct. I believe this would 
require a key declaration for each struct field (e.g. `<column_name>.field_1`, 
`<column_name>.field_2`, `<column_name>.field_3`, etc.). 
   
   I have not looked into nested structs-of-lists or lists-of-structs to see 
how those are represented in Parquet. 
   
   It seems reasonable to have the developer list the column names to encrypt. 
But for non-primitive types, I'm not sure how they would know the modified 
column name used in the file.
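
One possible workaround, sketched below, is to let the developer list only top-level column names and expand them to the leaf paths observed in `path_in_schema`. The function `expand_column_keys` is hypothetical and not part of pyarrow; the leaf paths here are hard-coded from the examples in this comment, and the prefix match assumes top-level column names do not themselves contain dots.

```python
def expand_column_keys(keys_by_column, leaf_paths):
    """Replace each top-level column name with every Parquet leaf path
    that belongs to it (hypothetical helper, not a pyarrow API)."""
    expanded = {}
    for key_name, columns in keys_by_column.items():
        matches = []
        for column in columns:
            # A leaf belongs to a column if it is the column itself
            # (primitive type) or starts with "<column>." (nested type).
            found = [
                path
                for path in leaf_paths
                if path == column or path.startswith(column + ".")
            ]
            if not found:
                raise KeyError(f"no Parquet leaf column found for {column!r}")
            matches.extend(found)
        expanded[key_name] = matches
    return expanded


# Leaf paths as they might be collected from `path_in_schema`.
leaf_paths = ["a", "b.list.element", "c.field_1", "c.field_2"]

column_keys = expand_column_keys(
    {"col_a_key_name": ["a"], "col_b_key_name": ["b", "c"]},
    leaf_paths,
)
print(column_keys)
```

The catch is that the leaf paths have to come from somewhere, for example from a file written earlier with the same schema, so this only moves the problem rather than solving it inside the library.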
   
   In my application code, when writing encrypted Parquet, Python silently 
crashes in the previously mentioned file visitor. The application just exits 
with no messages or exceptions. This happens when calling 
`.metadata.to_dict()` on a `pyarrow.dataset.WrittenFile`. By setting a 
breakpoint and experimenting in the debugger, I found the same symptom when 
calling `metadata.row_group(0).to_dict()`. I won't be collecting and 
writing the `_metadata` or `_common_metadata` files when encrypting the data, so 
this code is normally disabled. But I figured it was worth noting the crash.
   
   

