> I don't see how a string column can be stored in FLBA

An encoder can choose to store strings as an FLBA sequence, padding shorter
strings with NUL bytes. If all strings are of similar size and the maximum
size is small, this can actually be the shortest encoding possible, since no
per-value length is stored. Examples off the top of my head are HTTP methods
and stock exchange tickers. They have to be UTF8 (or a subset thereof) and
they are short.
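
To make the padding idea concrete, here is a rough sketch in Python. It is
not taken from any Parquet implementation; the helper names encode_flba and
decode_flba are hypothetical, and it assumes the strings themselves never
contain NUL, so trailing padding can be stripped unambiguously.

```python
def encode_flba(values, width=None):
    """Pad UTF-8 encoded strings with NUL bytes to one fixed byte width.

    Hypothetical helper for illustration only, not a Parquet API.
    """
    encoded = [v.encode("utf-8") for v in values]
    if width is None:
        # Width is counted in bytes, not characters; use at least 1.
        width = max([1] + [len(b) for b in encoded])
    if width < 1:
        raise ValueError("width must be at least 1")
    for b in encoded:
        if len(b) > width:
            raise ValueError(f"{b!r} does not fit in {width} bytes")
        if b"\x00" in b:
            # An embedded NUL would be indistinguishable from padding.
            raise ValueError("values must not contain NUL")
    return width, b"".join(b.ljust(width, b"\x00") for b in encoded)


def decode_flba(width, data):
    """Split the buffer into fixed-width slots and strip the NUL padding."""
    if width < 1 or len(data) % width != 0:
        raise ValueError("data length must be a multiple of width")
    return [
        data[i:i + width].rstrip(b"\x00").decode("utf-8")
        for i in range(0, len(data), width)
    ]


methods = ["GET", "POST", "PUT", "DELETE", "HEAD"]
width, data = encode_flba(methods)

# Fixed-width layout: 5 values * 6 bytes = 30 bytes.
flba_size = len(data)
# PLAIN BYTE_ARRAY stores a 4-byte length before each value: 20 + 20 = 40 bytes.
byte_array_size = sum(4 + len(m.encode("utf-8")) for m in methods)
```

This only compares raw value bytes for the two physical types; dictionary
encoding or compression would change the numbers.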


On Tue, Jun 18, 2024 at 5:00 PM Ofir Manor <oma...@speedata.io> wrote:

> At least in SQL, char(n) is a fixed-length string, but it means fixed
> number of characters. Since strings are typically UTF8, it is still a
> variable number of bytes...
> So, I don't see how a string column can be stored in FLBA, even if it has
> a fixed number of characters (except in less common cases such as an 8-bit
> encoding like a specific ASCII character set)
> Just my two cents,
>    Ofir
>
>
> ________________________________
> From: Gang Wu <ust...@gmail.com>
> Sent: Tuesday, June 18, 2024 5:20 PM
> To: dev@parquet.apache.org <dev@parquet.apache.org>
> Subject: [External] Re: [DISCUSS] Can FIXED_LEN_BYTE_ARRAY be annotated
> with STRING?
>
> I have the same feeling and that's why I've asked in the mentioned PR.
> It seems FLBA is just a special case of BYTE_ARRAY.
>
> On Tue, Jun 18, 2024 at 10:16 PM Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.invalid> wrote:
>
> > I don't see why it shouldn't be supported. FLBA and String are orthogonal
> > features. The first optimizes encoding by not storing lengths and the
> > latter says the binary is valid UTF8.
> >
> > On Tue, Jun 18, 2024 at 8:35 AM Gang Wu <ust...@gmail.com> wrote:
> >
> > > FYI, both parquet-cpp [1] and parquet-java [2] do not allow FLBA.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/arrow/blob/eec6f17c8879b469dc3370dad4a7f68f11705a6b/cpp/src/parquet/types.cc#L829-L842
> > > [2]
> > >
> > >
> >
> https://github.com/apache/parquet-java/blob/fbe13d89ae4193be12c164d4bb5342c5eba3963f/parquet-column/src/main/java/org/apache/parquet/schema/Types.java#L443-L447
> > >
> > > Best,
> > > Gang
> > >
> > > On Tue, Jun 18, 2024 at 11:53 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > > wrote:
> > >
> > > > >
> > > > > My instinct says "No", but others may have a different
> > interpretation.
> > > >
> > > >
> > > > This is also my instinct, I think we should check validation in
> > > > Parquet-java and parquet-cpp to see if they are in agreement on the
> > > matter
> > > > and then make a decision from there.  It doesn't seem too onerous to
> > > > support FLBA as a String though if necessary?
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > > On Mon, Jun 17, 2024 at 12:15 PM Ed Seidl <etse...@live.com> wrote:
> > > >
> > > > > Hi all,
> > > > > While discussing PARQUET-2485 a question was raised about the
> STRING
> > > > > annotation [1]. The current wording in the specification is
> "|STRING|
> > > > > may only be used to annotate the binary primitive type";
> PARQUET-2485
> > > > > would change that to "|STRING| may only be used to annotate the
> > > > > |BYTE_ARRAY| primitive type". The question is, can
> > FIXED_LEN_BYTE_ARRAY
> > > > > also be annotated with STRING? My instinct says "No", but others
> may
> > > > > have a different interpretation.
> > > > >
> > > > > Are there any strong opinions in the community? Are there any
> > > > > implementations that allow fixed length strings?
> > > > >
> > > > > Thanks,
> > > > > Ed
> > > > >
> > > > > [1]
> > > > >
> > >
> https://github.com/apache/parquet-format/pull/251#discussion_r1635669939
> > > > >
> > > >
> > >
> >
>
