[PR] Allow writing BYTE_ARRAY with converted type NONE [arrow]

via GitHub Fri, 15 Nov 2024 04:18:19 -0800


pulkomandy opened a new pull request, #44739:
URL: https://github.com/apache/arrow/pull/44739


   This allows storing binary data of arbitrary length in a Parquet file 
without having to wrongly declare it as UTF-8.
   
   Fixes the writer part of #42971
   
   The reader part has already been fixed in 
4d825497cb04c9e1c288000a7a8f75786cc487ff and this uses a similar 
implementation, but with a stricter set of "exceptions" (only byte arrays with 
NONE type are allowed).
   
   
   
   ### Rationale for this change
   
   Hello,
   
   We are trying to store binary data (in our case, a dump of captured CAN 
messages) in a Parquet file. The data has a variable length (from 0 to 8 bytes) 
and is not a UTF-8 string (or a text string at all). For this, physical type 
BYTE_ARRAY and logical type NONE seem appropriate.
   
   Unfortunately, the Parquet writer will not let us do that. It accepts either 
fixed-length data with converted type NONE, or variable-length data with 
converted type UTF-8. This change relaxes the type check on byte arrays to 
allow use of the NONE converted type.
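
As a rough illustration of the relaxation described above (not the actual Arrow code), the check can be modeled with stand-in enums. `PhysicalType`, `ConvertedType`, `StrictCheck` and `RelaxedCheck` are hypothetical names used only for this sketch; the real writer works with `parquet::Type` and `parquet::ConvertedType`.

```cpp
// Hypothetical stand-ins for parquet::Type and parquet::ConvertedType.
// Only the values needed for this illustration are listed.
enum class PhysicalType { BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };
enum class ConvertedType { NONE, UTF8 };

// Models the behaviour described above: fixed-length data may use NONE,
// variable-length data must be declared as UTF8.
constexpr bool StrictCheck(PhysicalType p, ConvertedType c) {
  return (p == PhysicalType::FIXED_LEN_BYTE_ARRAY && c == ConvertedType::NONE) ||
         (p == PhysicalType::BYTE_ARRAY && c == ConvertedType::UTF8);
}

// Models the proposed relaxation: the only extra combination accepted is
// a variable-length BYTE_ARRAY with converted type NONE.
constexpr bool RelaxedCheck(PhysicalType p, ConvertedType c) {
  return StrictCheck(p, c) ||
         (p == PhysicalType::BYTE_ARRAY && c == ConvertedType::NONE);
}

// The combination needed for raw CAN payloads is rejected by the strict
// model and accepted by the relaxed one.
static_assert(!StrictCheck(PhysicalType::BYTE_ARRAY, ConvertedType::NONE), "");
static_assert(RelaxedCheck(PhysicalType::BYTE_ARRAY, ConvertedType::NONE), "");
```

The relaxed model deliberately adds a single exception, matching the "stricter set of exceptions" mentioned for the writer side.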
   
   ### What changes are included in this PR?
   
   Allow the Parquet stream writer to store data in a BYTE_ARRAY with NONE 
logical type. The changes are based on similar changes made earlier to the 
stream reader.
   
   ### Are these changes tested?
   
   I'm not sure if this is the right way to fix this problem. I'm happy to add 
tests if needed after the general idea has been validated.
   
   In particular, the NONE type does not imply ASCII text free of NUL bytes, 
so the `operator<<(const char* v)` method may need to be excluded from this 
change (and keep accepting only UTF-8). What do you think? In that case, what 
would be the way of implementing this without making slightly different 
versions of CheckColumn for each case?
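
To make the concern concrete, here is a minimal standalone sketch (plain C++, no Arrow dependency) of what happens to a binary payload when it is treated as a NUL-terminated string. `kPayload`, `LengthAsCString` and `LengthAsSizedString` are hypothetical names invented for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical 4-byte CAN payload with a zero byte in second position.
constexpr char kPayload[] = {'\x12', '\x00', '\x34', '\x56'};

// Length seen by code that receives only a `const char*` and has to rely
// on NUL termination to find the end of the data.
inline std::size_t LengthAsCString(const char* v) { return std::strlen(v); }

// Length seen by code that receives the size explicitly, so embedded
// zero bytes are preserved.
inline std::size_t LengthAsSizedString(const char* data, std::size_t size) {
  return std::string(data, size).size();
}
```

With this payload, `LengthAsCString(kPayload)` yields 1 while `LengthAsSizedString(kPayload, sizeof(kPayload))` yields 4, so a `const char*` entry point would silently truncate binary data stored in a NONE column.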
   
   ### Are there any user-facing changes?
   
   The Parquet stream writer allows using BYTE_ARRAY with NONE converted type 
for storage of arbitrary binary data.


