I agree with Ofir: UTF8 is inherently variable length.
I think I phrased the question incorrectly. For the purposes of cleaning
up the use of 'binary' in the spec, does the spec as currently written
allow for FLBA with UTF8 encoding? It looks like as far as parquet-java
and parquet-cpp are concerned the answer is "no".
Whether the spec should be changed to allow for fixed-length strings is
a different topic.
Regards,
Ed
On 6/18/24 8:00 AM, Ofir Manor wrote:
At least in SQL, char(n) is a fixed-length string, but it means fixed number of
characters. Since strings are typically UTF8, it is still a variable number of
bytes...
So, I don't see how a string column can be stored in FLBA, even if it has a
fixed number of characters (except in less common cases, such as an 8-bit
encoding like a specific ASCII character set).
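A minimal Python sketch of the point above (not from the original message; the sample strings and the helper name utf8_len are chosen purely for illustration): four strings with the same character count need different byte counts once encoded as UTF-8.

```python
# Illustrative only: each sample has exactly 4 characters (code points),
# but the UTF-8 encoded length differs from one sample to the next.
samples = [
    "abcd",             # ASCII only
    "a\u00f1cd",        # contains U+00F1 (2 bytes in UTF-8)
    "a\u20accd",        # contains U+20AC (3 bytes in UTF-8)
    "a\U0001F600cd",    # contains U+1F600 (4 bytes in UTF-8)
]


def utf8_len(s: str) -> int:
    """Return the number of bytes needed to store s as UTF-8."""
    return len(s.encode("utf-8"))


for s in samples:
    print(f"chars={len(s)} bytes={utf8_len(s)}")
```

A char(4) column holding these values would therefore not fit a single fixed byte width unless every value were padded to the worst case.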
Just my two cents,
Ofir
________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Tuesday, June 18, 2024 5:20 PM
To: dev@parquet.apache.org <dev@parquet.apache.org>
Subject: [External] Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with
STRING?
I have the same feeling and that's why I've asked in the mentioned PR.
It seems FLBA is just a special case of BYTE_ARRAY.
On Tue, Jun 18, 2024 at 10:16 PM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:
I don't see why it shouldn't be supported. FLBA and String are orthogonal
features. The first optimizes encoding by not storing lengths and the
latter says the binary is valid UTF8.
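The "not storing lengths" part can be sketched in Python, assuming the PLAIN encoding as described in the Parquet format documentation: each BYTE_ARRAY value carries a 4-byte little-endian length prefix, while FIXED_LEN_BYTE_ARRAY values are written back to back. The function names below are hypothetical and are not part of any Parquet library.

```python
import struct


def plain_encode_byte_array(values):
    """PLAIN-style BYTE_ARRAY: 4-byte little-endian length, then the bytes."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)


def plain_encode_flba(values, type_length):
    """PLAIN-style FIXED_LEN_BYTE_ARRAY: raw bytes only, no length prefix."""
    out = bytearray()
    for v in values:
        if len(v) != type_length:
            raise ValueError(
                f"value is {len(v)} bytes, expected exactly {type_length}"
            )
        out += v
    return bytes(out)


# Three ASCII strings; each happens to be exactly 4 bytes in UTF-8.
values = [s.encode("utf-8") for s in ("ab12", "cd34", "ef56")]

print(len(plain_encode_byte_array(values)))  # length prefixes included
print(len(plain_encode_flba(values, 4)))     # values only
```

In this sketch the three values take 24 bytes as BYTE_ARRAY and 12 bytes as FLBA. Whether the bytes are valid UTF-8 is a separate check from how they are laid out, and a value whose UTF-8 form is not exactly type_length bytes is rejected by the fixed-length writer.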
On Tue, Jun 18, 2024 at 8:35 AM Gang Wu <ust...@gmail.com> wrote:
FYI, neither parquet-cpp [1] nor parquet-java [2] allows FLBA to be
annotated with STRING.
[1]
https://github.com/apache/arrow/blob/eec6f17c8879b469dc3370dad4a7f68f11705a6b/cpp/src/parquet/types.cc#L829-L842
[2]
https://github.com/apache/parquet-java/blob/fbe13d89ae4193be12c164d4bb5342c5eba3963f/parquet-column/src/main/java/org/apache/parquet/schema/Types.java#L443-L447
Best,
Gang
On Tue, Jun 18, 2024 at 11:53 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:
My instinct says "No", but others may have a different
interpretation.
This is also my instinct. I think we should check validation in
parquet-java and parquet-cpp to see if they are in agreement on the
matter and then make a decision from there. It doesn't seem too onerous
to support FLBA as a String though, if necessary?
Cheers,
Micah
On Mon, Jun 17, 2024 at 12:15 PM Ed Seidl <etse...@live.com> wrote:
Hi all,
While discussing PARQUET-2485 a question was raised about the STRING
annotation [1]. The current wording in the specification is "|STRING|
may only be used to annotate the binary primitive type"; PARQUET-2485
would change that to "|STRING| may only be used to annotate the
|BYTE_ARRAY| primitive type". The question is, can FIXED_LEN_BYTE_ARRAY
also be annotated with STRING? My instinct says "No", but others may
have a different interpretation.
Are there any strong opinions in the community? Are there any
implementations that allow fixed length strings?
Thanks,
Ed
[1]
https://github.com/apache/parquet-format/pull/251#discussion_r1635669939