I agree with Ofir: UTF8 is inherently variable length.

I think I phrased the question incorrectly. For the purposes of cleaning up the use of 'binary' in the spec, does the spec as currently written allow for FLBA with UTF8 encoding?  It looks like as far as parquet-java and parquet-cpp are concerned the answer is "no".

Whether the spec should be changed to allow for fixed-length strings is a different topic.

Regards,
Ed

On 6/18/24 8:00 AM, Ofir Manor wrote:
At least in SQL, char(n) is a fixed-length string, but that means a fixed number of 
characters. Since strings are typically UTF8, it is still a variable number of 
bytes...
So, I don't see how a string column can be stored in FLBA, even if it has a 
fixed number of characters (except in less common cases, such as an 8-bit 
encoding like a specific ASCII character set).
Just my two cents,
    Ofir
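
To make the characters-versus-bytes point concrete, here is a minimal Python sketch. The sample strings are illustrative assumptions, not taken from any Parquet file: each has exactly 4 characters, as a char(4) value would, yet the UTF-8 byte lengths differ.

```python
# Three illustrative strings, each exactly 4 characters long
# (hypothetical sample values, as a SQL char(4) column might hold).
values = [
    "abcd",                      # ASCII only: 1 byte per character
    "na\u00efv",                 # contains U+00EF, which takes 2 bytes in UTF-8
    "\u20ac\u20ac\u20ac\u20ac",  # four euro signs, 3 bytes each in UTF-8
]

for s in values:
    encoded = s.encode("utf-8")
    print(f"{len(s)} characters -> {len(encoded)} bytes")

# Same character count, different byte counts.
byte_lengths = [len(s.encode("utf-8")) for s in values]
```

A single fixed byte width only works for these values if the shorter ones are padded, or if the data is restricted to a single-byte character set.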


________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Tuesday, June 18, 2024 5:20 PM
To: dev@parquet.apache.org <dev@parquet.apache.org>
Subject: [External] Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated with 
STRING?

I have the same feeling and that's why I've asked in the mentioned PR.
It seems FLBA is just a special case of BYTE_ARRAY.

On Tue, Jun 18, 2024 at 10:16 PM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

I don't see why it shouldn't be supported. FLBA and String are orthogonal
features. The first optimizes encoding by not storing lengths and the
latter says the binary is valid UTF8.
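
A rough Python sketch of that distinction, assuming PLAIN encoding (BYTE_ARRAY values carry a 4-byte little-endian length prefix, FIXED_LEN_BYTE_ARRAY values are written back to back). The helper names are hypothetical and are not part of parquet-java or parquet-cpp.

```python
import struct

def plain_encode_byte_array(values):
    """Hypothetical helper: 4-byte little-endian length, then the bytes."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v))
        out += v
    return bytes(out)

def plain_encode_flba(values, type_length):
    """Hypothetical helper: raw bytes only; every value must be type_length bytes."""
    out = bytearray()
    for v in values:
        if len(v) != type_length:
            raise ValueError(f"expected {type_length} bytes, got {len(v)}")
        out += v
    return bytes(out)

def is_valid_utf8(v):
    """The UTF-8 check is independent of how the bytes are laid out."""
    try:
        v.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

values = [b"ab12", b"cd34", b"ef56"]        # illustrative 4-byte values
var_len = plain_encode_byte_array(values)   # 3 * (4 + 4) = 24 bytes
fixed = plain_encode_flba(values, 4)        # 3 * 4 = 12 bytes
all_utf8 = all(is_valid_utf8(v) for v in values)
```

The sketch only shows that the two properties can be checked separately; it says nothing about what the spec or either implementation permits.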

On Tue, Jun 18, 2024 at 8:35 AM Gang Wu <ust...@gmail.com> wrote:

FYI, neither parquet-cpp [1] nor parquet-java [2] allows FLBA to be annotated with STRING.

[1] https://github.com/apache/arrow/blob/eec6f17c8879b469dc3370dad4a7f68f11705a6b/cpp/src/parquet/types.cc#L829-L842
[2] https://github.com/apache/parquet-java/blob/fbe13d89ae4193be12c164d4bb5342c5eba3963f/parquet-column/src/main/java/org/apache/parquet/schema/Types.java#L443-L447

Best,
Gang

On Tue, Jun 18, 2024 at 11:53 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

My instinct says "No", but others may have a different
interpretation.

This is also my instinct. I think we should check the validation in
parquet-java and parquet-cpp to see whether they agree on the matter,
and then make a decision from there.  It doesn't seem too onerous to
support FLBA as a String if necessary, though?

Cheers,
Micah

On Mon, Jun 17, 2024 at 12:15 PM Ed Seidl <etse...@live.com> wrote:

Hi all,
While discussing PARQUET-2485 a question was raised about the STRING
annotation [1]. The current wording in the specification is "|STRING|
may only be used to annotate the binary primitive type"; PARQUET-2485
would change that to "|STRING| may only be used to annotate the
|BYTE_ARRAY| primitive type". The question is, can FIXED_LEN_BYTE_ARRAY
also be annotated with STRING? My instinct says "No", but others may
have a different interpretation.

Are there any strong opinions in the community? Are there any
implementations that allow fixed length strings?

Thanks,
Ed

[1] https://github.com/apache/parquet-format/pull/251#discussion_r1635669939
