pulkomandy opened a new pull request, #44739:
URL: https://github.com/apache/arrow/pull/44739
This allows storing binary data of arbitrary length in a Parquet file
without having to wrongly declare it as UTF-8.
Fixes the writer part of #42971
The reader part has already been fixed in
4d825497cb04c9e1c288000a7a8f75786cc487ff and this uses a similar
implementation, but with a stricter set of "exceptions" (only byte arrays with
NONE type are allowed).
### Rationale for this change
Hello,
We are trying to store binary data (in our case, dumps of captured CAN
messages) in a Parquet file. The data has a variable length (from 0 to 8 bytes)
and is not a UTF-8 string (or a text string at all). For this, physical type
BYTE_ARRAY and logical type NONE seem appropriate.
Unfortunately, the Parquet stream writer will not let us do that. It accepts
either fixed-length data with converted type NONE, or variable-length data with
converted type UTF-8. This change relaxes the type check on byte arrays to
allow use of the NONE converted type.
### What changes are included in this PR?
Allow the Parquet stream writer to store data in a BYTE_ARRAY with NONE
logical type. The changes are based on similar changes made earlier to the
stream reader.
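
To illustrate the idea, here is a minimal, self-contained sketch of the relaxed check. It is not the actual Arrow code: `PhysicalType`, `ConvertedType`, `ColumnType`, `StrictCheck` and `RelaxedCheck` are hypothetical names made up for this example, and the real writer reports mismatches differently. It only shows which column types a variable-length value would be accepted by, before and after the change.

```cpp
#include <cassert>

// Hypothetical stand-ins for the Parquet physical and converted types;
// only the values needed for this illustration are modelled.
enum class PhysicalType { kByteArray, kFixedLenByteArray };
enum class ConvertedType { kNone, kUtf8 };

struct ColumnType {
  PhysicalType physical;
  ConvertedType converted;
};

// Strict rule: a variable-length value is accepted only by a
// BYTE_ARRAY column declared as UTF-8.
bool StrictCheck(const ColumnType& column) {
  return column.physical == PhysicalType::kByteArray &&
         column.converted == ConvertedType::kUtf8;
}

// Relaxed rule: additionally accept a BYTE_ARRAY column whose converted
// type is NONE. No other exception is made.
bool RelaxedCheck(const ColumnType& column) {
  if (StrictCheck(column)) return true;
  return column.physical == PhysicalType::kByteArray &&
         column.converted == ConvertedType::kNone;
}

// Small self-check showing the intended behaviour of the sketch.
void RunExample() {
  const ColumnType raw_bytes{PhysicalType::kByteArray, ConvertedType::kNone};
  assert(!StrictCheck(raw_bytes));  // rejected before the change
  assert(RelaxedCheck(raw_bytes));  // accepted after the change
}
```

In this sketch, a FIXED_LEN_BYTE_ARRAY column is still rejected for variable-length values, which matches the "only byte arrays with NONE type are allowed" restriction described above.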
### Are these changes tested?
I'm not sure whether this is the right way to fix this problem. I'm happy to
add tests once the general idea has been validated.
In particular, the NONE type does not imply ASCII text (free of NUL bytes), so
the `operator<<(const char* v)` overload may need to be excluded from this
change and keep accepting only UTF-8. What do you think? In that case, how
could this be implemented without writing a slightly different version of
CheckColumn for each case?
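
The concern with `const char*` can be shown without any Parquet code. The sketch below uses only the C++ standard library; `LengthSeenThroughCString` and `MakePayload` are hypothetical helper names for this example. A NUL-terminated pointer carries no length, so any length derived from it stops at the first zero byte, while a length-carrying buffer such as `std::string` keeps the whole payload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length an API can recover when it only receives a `const char*`:
// it has to scan for the first NUL byte.
std::size_t LengthSeenThroughCString(const char* data) {
  return std::strlen(data);
}

// Build a length-carrying buffer from raw bytes; embedded zero bytes
// are preserved because the length is passed explicitly.
std::string MakePayload(const unsigned char* bytes, std::size_t length) {
  return std::string(reinterpret_cast<const char*>(bytes), length);
}

// Small self-check using a made-up 4-byte frame with an embedded zero.
void RunTruncationExample() {
  const unsigned char frame[] = {0x12, 0x00, 0x34, 0x56};
  const std::string payload = MakePayload(frame, sizeof(frame));
  assert(payload.size() == 4);                             // full payload
  assert(LengthSeenThroughCString(payload.c_str()) == 1);  // cut at the NUL
}
```

This suggests that binary columns are only safe with overloads that receive an explicit length, which is why restricting the `const char*` overload to UTF-8 columns may be worth considering.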
### Are there any user-facing changes?
The Parquet stream writer allows using BYTE_ARRAY with NONE converted type for
storage of arbitrary binary data.