[ https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17034294#comment-17034294 ]
Gabor Szadovszky commented on PARQUET-1792:
-------------------------------------------

If you are talking about one file at a time, you might be right that it is 10x faster than doing it with a query engine. But the tool runs on one node, while the query engine uses several at the same time, so I am not sure about the 10x performance.

It makes sense to me to implement pruning at the library level, because it can be done efficiently there (there is no need to unpack/decode the pages or the entire column chunks). Masking the values, on the other hand, requires reading the actual values and generating the hashes. You also need to generate the related statistics. Therefore, I am not sure that this masking feature is properly suited for parquet-mr.

> Add 'mask' command to parquet-tools/parquet-cli
> -----------------------------------------------
>
>                 Key: PARQUET-1792
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1792
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being
> pruned (PARQUET-1791). We need a tool to replace the raw data columns with
> masked values. The masked value could be a hash, null, a redaction, etc.
> Unchanged columns should be moved as a whole, as the 'merge' and 'prune'
> commands in parquet-tools do.
>
> Implementing this feature in the file format is 10X faster than doing it by
> rewriting the table data in the query engine.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)