Re: Array::GetValue ?

2022-11-30 Thread John Muehlhausen
std::pair FlatArray::GetValueBytes(int64_t index)

.second being -1 in every case other than a boolean ... in that case it
indicates the bit of importance in the single-byte string_view?

Just an idea.

On Wed, Nov 30, 2022 at 11:46 PM Micah Kornfield 
wrote:

> Yeah, another base class doesn't seem great.  One option would be to use a
> union type like variant, but that seems also not particularly great?
>
>
>
> On Thu, Nov 17, 2022 at 1:15 AM Antoine Pitrou  wrote:
>
> >
> > Uh, you're right.  We may want another base class, not sure how it
> > should be named though (also, we may want to be careful with multiple
> > inheritance?).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 17/11/2022 à 06:15, Micah Kornfield a écrit :
> > >>
> > >> std::string_view FlatArray::GetValueBytes(int64_t index)
> > >
> > >
> > > I think this would be problematic for Boolean?
> > >
> > > On Tue, Nov 15, 2022 at 11:01 AM John Muehlhausen  wrote:
> > >
> > >> If that covers primitive and binary(string) types, that would work for
> > me.
> > >>
> > >> On Tue, Nov 15, 2022 at 13:50 Antoine Pitrou 
> > wrote:
> > >>
> > >>>
> > >>> Then perhaps we can define a method:
> > >>>
> > >>> std::string_view FlatArray::GetValueBytes(int64_t index)
> > >>>
> > >>> ?
> > >>>
> > >>>
> > >>> Le 15/11/2022 à 19:39, John Muehlhausen a écrit :
> >  I had a use-case where untyped access to bytes would have been
> > >>> sufficient,
> >  vs branching depending on array type.  This is what brought the idea
> > to
> >  mind.
> > 
> >  On Tue, Nov 15, 2022 at 02:34 Jin Shang 
> > >> wrote:
> > 
> > > Hi John,
> > >
> > > In addition to Micah’s reply, does the member method Value(int64_t
> > > i)[1][2][3] satisfy your need? It is defined for all array types
> with
> > >> a
> > > primitive value representation, i.e. all primitive arrays and
> binary
> > >>> arrays.
> > >
> > > [1]
> > >
> > >>>
> > >>
> >
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L50
> > > <
> > >
> > >>>
> > >>
> >
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L50
> > >>
> > > [2]
> > >
> > >>>
> > >>
> >
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L109
> > > <
> > >
> > >>>
> > >>
> >
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L109
> > >>
> > > [3]
> > >
> > >>>
> > >>
> >
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_binary.h#L87
> > > <
> > >
> > >>>
> > >>
> >
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_binary.h#L87
> > >>
> > >
> > >
> > >> 2022年11月15日 13:06,Micah Kornfield  写道:
> > >>
> > >> Hi John,
> > >>
> > >> There are a couple of edge cases that need to be discussed to move
> > >> the
> > >> function to the base array class (which IIUC is this proposal):
> > >> 1. boolean
> > >> 2. struct
> > >> 3.  lists/LargeList
> > >> 4.  DictionaryArray
> > >>
> > >> FlatArray [1] seems like a better place for this method if there
> is
> > >> consensus on adding it.
> > >>
> > >> Cheers,
> > >> Micah
> > >>
> > >> [1]
> > >>
> > >
> > >>>
> > >>
> >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L219
> > >>
> > >> On Mon, Nov 14, 2022 at 11:46 AM John Muehlhausen 
> > >> wrote:
> > >>
> > >>> There exists:
> > >>> const uint8_t* BaseBinaryArray::GetValue(int64_t i, offset_type*
> > >>> out_length) const
> > >>>
> > >>> What about adding:
> > >>> const uint8_t* Array::GetValue(int64_t i, offset_type*
> out_length)
> > >>> const
> > >>>
> > >>> This would allow GetValue to get the untyped bytes/length of any
> > >>> value?
> > >>> E.g. out_length would be set to sizeof(T) for arrays of primitive
> > >> type
> > > T?
> > >>>
> > >>> For FixedSizeBinaryArray the existing GetValue would still be a
> > >> valid
> > >>> overload.
> > >>>
> > >>> -John
> > >>>
> > >
> > >
> > 
> > >>>
> > >>
> > >
> >
>


Re: Array::GetValue ?

2022-11-30 Thread Micah Kornfield
Yeah, another base class doesn't seem great.  One option would be to use a
union type like variant, but that seems also not particularly great?



On Thu, Nov 17, 2022 at 1:15 AM Antoine Pitrou  wrote:

>
> Uh, you're right.  We may want another base class, not sure how it
> should be named though (also, we may want to be careful with multiple
> inheritance?).
>
> Regards
>
> Antoine.
>
>
> Le 17/11/2022 à 06:15, Micah Kornfield a écrit :
> >>
> >> std::string_view FlatArray::GetValueBytes(int64_t index)
> >
> >
> > I think this would be problematic for Boolean?
> >
> > On Tue, Nov 15, 2022 at 11:01 AM John Muehlhausen  wrote:
> >
> >> If that covers primitive and binary(string) types, that would work for
> me.
> >>
> >> On Tue, Nov 15, 2022 at 13:50 Antoine Pitrou 
> wrote:
> >>
> >>>
> >>> Then perhaps we can define a method:
> >>>
> >>> std::string_view FlatArray::GetValueBytes(int64_t index)
> >>>
> >>> ?
> >>>
> >>>
> >>> Le 15/11/2022 à 19:39, John Muehlhausen a écrit :
>  I had a use-case where untyped access to bytes would have been
> >>> sufficient,
>  vs branching depending on array type.  This is what brought the idea
> to
>  mind.
> 
>  On Tue, Nov 15, 2022 at 02:34 Jin Shang 
> >> wrote:
> 
> > Hi John,
> >
> > In addition to Micah’s reply, does the member method Value(int64_t
> > i)[1][2][3] satisfy your need? It is defined for all array types with
> >> a
> > primitive value representation, i.e. all primitive arrays and binary
> >>> arrays.
> >
> > [1]
> >
> >>>
> >>
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L50
> > <
> >
> >>>
> >>
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L50
> >>
> > [2]
> >
> >>>
> >>
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L109
> > <
> >
> >>>
> >>
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L109
> >>
> > [3]
> >
> >>>
> >>
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_binary.h#L87
> > <
> >
> >>>
> >>
> https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_binary.h#L87
> >>
> >
> >
> >> 2022年11月15日 13:06,Micah Kornfield  写道:
> >>
> >> Hi John,
> >>
> >> There are a couple of edge cases that need to be discussed to move
> >> the
> >> function to the base array class (which IIUC is this proposal):
> >> 1. boolean
> >> 2. struct
> >> 3.  lists/LargeList
> >> 4.  DictionaryArray
> >>
> >> FlatArray [1] seems like a better place for this method if there is
> >> consensus on adding it.
> >>
> >> Cheers,
> >> Micah
> >>
> >> [1]
> >>
> >
> >>>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L219
> >>
> >> On Mon, Nov 14, 2022 at 11:46 AM John Muehlhausen 
> >> wrote:
> >>
> >>> There exists:
> >>> const uint8_t* BaseBinaryArray::GetValue(int64_t i, offset_type*
> >>> out_length) const
> >>>
> >>> What about adding:
> >>> const uint8_t* Array::GetValue(int64_t i, offset_type* out_length)
> >>> const
> >>>
> >>> This would allow GetValue to get the untyped bytes/length of any
> >>> value?
> >>> E.g. out_length would be set to sizeof(T) for arrays of primitive
> >> type
> > T?
> >>>
> >>> For FixedSizeBinaryArray the existing GetValue would still be a
> >> valid
> >>> overload.
> >>>
> >>> -John
> >>>
> >
> >
> 
> >>>
> >>
> >
>


Re: [DISCUSS] JSON Canonical Extension Type

2022-11-30 Thread Micah Kornfield
>
> Can a logical extension be based on another logical extension?

Potentially but this is mostly an implementation details, each type should
have their own specification IMO.

HOCON support might be nice..

I'm not sure if this is common enough to warrant a canonical type within
Arrow but you are welcome to propose something if you would like.

Cheers,
Micah

On Mon, Nov 28, 2022 at 11:55 AM Lee, David 
wrote:

> Can a logical extension be based on another logical extension?
>
> HOCON support might be nice..
>
> -Original Message-
> From: Micah Kornfield 
> Sent: Monday, November 28, 2022 11:50 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] JSON Canonical Extension Type
>
> External Email: Use caution with links and attachments
>
>
> This seems like a reasonable definition to me.  Since there hasn't been
> much feedback, I think maybe following through an implementation + this
> description in a PR would be the next steps.  If there isn't further
> feedback on this, once the PR is up we can have try to vote (which might
> bring up some more feedback, but hopefully wouldn't cause too much
> implementation churn).
>
> Thanks,
> Micah
>
> On Thu, Nov 17, 2022 at 3:58 PM Pradeep Gollakota
>  wrote:
>
> > Hi folks!
> >
> > I put together this specification for canonicalizing the JSON type in
> > Arrow.
> >
> > ## Introduction
> > JSON is a widely used text based data interchange format. There are
> > many use cases where a user has a column whose contents are a JSON
> > encoded string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical
> > Type][2] are two such examples.
> >
> > The JSON specification is defined in [RFC-8259][3]. However, many of
> > the most popular parsers support non standard extensions. Examples of
> > non standard extensions to JSON include comments, unquoted keys,
> > trailing commas, etc.
> >
> > ## Extension Specification
> > * The name of the extension is `arrow.json`
> > * The storage type of the extension is `utf8`
> > * The extension type has no parameters
> > * The metadata MUST be either empty or a valid JSON object
> > - There is no canonical metadata
> > - Implementations MAY include implementation-specific metadata by
> > using a namespaced key. For example `{"google.bigquery": {"my":
> > "metadata"}}`
> > * Implementations...
> > - MUST produce valid UTF-8 encoded text
> > - SHOULD produce valid standard JSON
> > - MAY produce valid non-standard JSON
> > - MUST support parsing standard JSON
> > - MAY support parsing non standard JSON
> > - SHOULD pass through contents that they do not understand
> >
> > ## Forward compatibility
> > In the future we might allow this logical type to annotate a byte
> > storage type with a different text encoding.  Implementations
> > consuming JSON logical types should verify this.
> >
> > [1]:
> >
> >
> https://urldefense.com/v3/__https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types*json_type__;Iw!!KSjYCgUGsB4!YhB-EpSLu8HTacaUsWvTVqF0kYh81UlVwNFBAc4-f95F7bGtdGuyWN_JObBkRSee-jTU20_MmGe2WUH8UMqTxPY$
> > [2]:
> >
> https://urldefense.com/v3/__https://github.com/apache/parquet-format/blob/master/LogicalTypes.md*json__;Iw!!KSjYCgUGsB4!YhB-EpSLu8HTacaUsWvTVqF0kYh81UlVwNFBAc4-f95F7bGtdGuyWN_JObBkRSee-jTU20_MmGe2WUH8RFfD8NY$
> > [3]:
> >
> https://urldefense.com/v3/__https://datatracker.ietf.org/doc/html/rfc8259__;!!KSjYCgUGsB4!YhB-EpSLu8HTacaUsWvTVqF0kYh81UlVwNFBAc4-f95F7bGtdGuyWN_JObBkRSee-jTU20_MmGe2WUH8MGoes7Q$
> >
>
>
> This message may contain information that is confidential or privileged.
> If you are not the intended recipient, please advise the sender immediately
> and delete this message. See
> http://www.blackrock.com/corporate/compliance/email-disclaimers for
> further information.  Please refer to
> http://www.blackrock.com/corporate/compliance/privacy-policy for more
> information about BlackRock’s Privacy Policy.
>
>
> For a list of BlackRock's office addresses worldwide, see
> http://www.blackrock.com/corporate/about-us/contacts-locations.
>
> © 2022 BlackRock, Inc. All rights reserved.
>


Re: DISCUSS: [FlightSQL] Catalog support

2022-11-30 Thread Gavin Ray
Just to chime in on this, one thing I'm curious about is whether there
will be support for user-defined catalog/schema hierarchy depth?

This comment that James made does seem reasonable to me
> scheme://:/path-1/path-2/.../path-n

Trino/Presto does a similar thing (jdbc:trino://localhost:8080/tpch/sf1)

At Hasura, what we do is have an alias "FullyQualifiedName" which is
just "Array"
and the identifier to some element in a data source is always fully-qualified:

https://github.com/hasura/graphql-engine/tree/master/dc-agents#schema

["postgres_1", "db1", "schema2", "my_table", "col_a"]
["mongo", "db1",  "collection_a", "field_a"]
["csv_adapter", "myfile.csv", "col_x"]

On Wed, Nov 30, 2022 at 6:31 PM James Duong
 wrote:
>
> Our current convention of sending connection properties as headers with
> every request has the benefit of making statefulness optional, but has the
> drawback of sending redundant, unused properties on requests after the
> first, which increases the payload size unnecessarily.
>
> I'd suggest we define session management features explicitly in Flight
> (while being optional). The suggestion is to make this part of Flight as an
> optional feature, rather than Flight SQL due to its applicability outside
> of just database access.
>
> Creating a session:
> - The Flight client supplies a New-Session header which has key-value pairs
> for initial session options. This header can be applied to any RPC call,
> but logically should be the first one the client makes.
> - The server should send a Set-Cookie header back containing some
> server-side representation of the session that the client can use in
> subsequent requests.
> - The path specified in the URI is sent as a "Catalog" session option.
>
> Modifying session options:
> - A separate RPC call that takes in a Stream representing
> each session option that is being modified and returns a stream of statuses
> to indicate if the setting change was accepted.
> - This RPC call is only valid when the Cookie header is used.
> - It is up to the server to define if a failed session property change is
> fatal or if other properties can continue to be set.
>
> Closing a session:
> - A separate RPC call that tells the server to drop the session specified
> by the Cookie header.
>
> Notes:
> A Flight SQL client would check if session management RPCs are supported
> through a new GetSqlInfo property. A Flight client doesn't have a way to do
> this generically, but there could be an application-specific RPC or header
> that reports this metadata.
>
> The O/JDBC and ADBC drivers would need to be updated to programmatically
> check for session management RPCs. If unsupported, then use the old
> behavior of sending all properties as headers with each request. If
> supported, make use of the New-Session header and drop the session when
> closing the client-side connection.
>
> It's a bit asymmetric that creating a new session is done by applying a
> header, but closing a session is an RPC call. This was so that session
> creation doesn't introduce another round trip before the first real data
> request. If there's a way to batch RPC calls it might be better to make
> session creation an RPC call.
>
> On Tue, Nov 22, 2022 at 3:16 PM David Li  wrote:
>
> > It sounds reasonable - then there are three points:
> >
> > - A standard URI scheme for Flight SQL that can be used by multiple client
> > APIs (JDBC, ADBC, etc.)
> > - A standard scheme for session data (likely header/cookie-based)
> > - A mapping from URI parameters and fields to session data
> >
> >
> >
> > On Tue, Nov 22, 2022, at 17:45, James Duong wrote:
> > > Just following up on this and if there are any thoughts.
> > >
> > > The purpose would be to standardize how we specify access to some named
> > > logical grouping of data. This would make it easy to model catalog/schema
> > > semantics in Flight SQL.
> > >
> > > Having this be part of the connection URI makes it similar to specifying
> > a
> > > resource in an HTTP URL (ie an endpoint) which should make it easy for
> > end
> > > users to work with and modify.
> > >
> > > On Fri, Nov 18, 2022 at 3:17 PM James Duong 
> > wrote:
> > >
> > >> As for surfacing catalogs itself, perhaps we allow the URI take in a
> > path
> > >> and treat that as a way of specifying a multi-level resource that which
> > the
> > >> FlightClient is connecting to:
> > >>
> > >> eg a connection URI of the form:
> > >> scheme://:/path-1/path-2/.../path-n
> > >>
> > >> The FlightClient could send this path as either a header or a session
> > >> property (with a neutral name like 'resource-path'). Flight SQL
> > Producers
> > >> could interpret this as a catalog or schema.
> > >> eg
> > >> grpc://:/catalog/schema
> > >>
> > >> On Fri, Nov 11, 2022 at 2:07 AM James Henderson  wrote:
> > >>
> > >>> Sounds good to me.
> > >>>
> > >>> > Are you interested in writing up a (sketch of a) proposal?
> > >>>
> > >>> Yep, can do - I'm OoO over the next couple of weeks so might be a 

Re: DISCUSS: [FlightSQL] Catalog support

2022-11-30 Thread James Duong
Our current convention of sending connection properties as headers with
every request has the benefit of making statefulness optional, but has the
drawback of sending redundant, unused properties on requests after the
first, which increases the payload size unnecessarily.

I'd suggest we define session management features explicitly in Flight
(while being optional). The suggestion is to make this part of Flight as an
optional feature, rather than Flight SQL due to its applicability outside
of just database access.

Creating a session:
- The Flight client supplies a New-Session header which has key-value pairs
for initial session options. This header can be applied to any RPC call,
but logically should be the first one the client makes.
- The server should send a Set-Cookie header back containing some
server-side representation of the session that the client can use in
subsequent requests.
- The path specified in the URI is sent as a "Catalog" session option.

Modifying session options:
- A separate RPC call that takes in a Stream representing
each session option that is being modified and returns a stream of statuses
to indicate if the setting change was accepted.
- This RPC call is only valid when the Cookie header is used.
- It is up to the server to define if a failed session property change is
fatal or if other properties can continue to be set.

Closing a session:
- A separate RPC call that tells the server to drop the session specified
by the Cookie header.

Notes:
A Flight SQL client would check if session management RPCs are supported
through a new GetSqlInfo property. A Flight client doesn't have a way to do
this generically, but there could be an application-specific RPC or header
that reports this metadata.

The O/JDBC and ADBC drivers would need to be updated to programmatically
check for session management RPCs. If unsupported, then use the old
behavior of sending all properties as headers with each request. If
supported, make use of the New-Session header and drop the session when
closing the client-side connection.

It's a bit asymmetric that creating a new session is done by applying a
header, but closing a session is an RPC call. This was so that session
creation doesn't introduce another round trip before the first real data
request. If there's a way to batch RPC calls it might be better to make
session creation an RPC call.

On Tue, Nov 22, 2022 at 3:16 PM David Li  wrote:

> It sounds reasonable - then there are three points:
>
> - A standard URI scheme for Flight SQL that can be used by multiple client
> APIs (JDBC, ADBC, etc.)
> - A standard scheme for session data (likely header/cookie-based)
> - A mapping from URI parameters and fields to session data
>
>
>
> On Tue, Nov 22, 2022, at 17:45, James Duong wrote:
> > Just following up on this and if there are any thoughts.
> >
> > The purpose would be to standardize how we specify access to some named
> > logical grouping of data. This would make it easy to model catalog/schema
> > semantics in Flight SQL.
> >
> > Having this be part of the connection URI makes it similar to specifying
> a
> > resource in an HTTP URL (ie an endpoint) which should make it easy for
> end
> > users to work with and modify.
> >
> > On Fri, Nov 18, 2022 at 3:17 PM James Duong 
> wrote:
> >
> >> As for surfacing catalogs itself, perhaps we allow the URI take in a
> path
> >> and treat that as a way of specifying a multi-level resource that which
> the
> >> FlightClient is connecting to:
> >>
> >> eg a connection URI of the form:
> >> scheme://:/path-1/path-2/.../path-n
> >>
> >> The FlightClient could send this path as either a header or a session
> >> property (with a neutral name like 'resource-path'). Flight SQL
> Producers
> >> could interpret this as a catalog or schema.
> >> eg
> >> grpc://:/catalog/schema
> >>
> >> On Fri, Nov 11, 2022 at 2:07 AM James Henderson  wrote:
> >>
> >>> Sounds good to me.
> >>>
> >>> > Are you interested in writing up a (sketch of a) proposal?
> >>>
> >>> Yep, can do - I'm OoO over the next couple of weeks so might be a bit
> >>> intermittent.
> >>>
> >>> On Thu, 10 Nov 2022 at 15:28, David Li  wrote:
> >>>
> >>> > Hey James H.,
> >>> >
> >>> > That would make sense to me. So it sounds like we'd want
> >>> >
> >>> > - Formal specification of using cookies/headers to mark a 'session'
> (I
> >>> > guess this will be a little inconsistent with transactions, though)
> >>> > - Adding RPCs to query session values
> >>> > - Adding RPCs to set session values
> >>> > - Listing standard values and types
> >>> >
> >>> > Some things may require more consideration, e.g. transaction
> isolation
> >>> > might be better off as part of the transaction RPCs than an ambient
> >>> > property. Are you interested in writing up a (sketch of a) proposal?
> >>> >
> >>> > -David
> >>> >
> >>> > On Thu, Nov 10, 2022, at 10:09, James Henderson wrote:
> >>> > > Similarly, we're also currently considering how best to implement
> >>> some of
> >>> > > 

Current state of using GitHub issues for Arrow

2022-11-30 Thread Joris Van den Bossche
Hi all,

There is a separate vote thread about already using GitHub issues for all
new issues (and in practice users also are already doing that, since new
JIRA signup has been disabled). And so to prepare for this, there has been
some ongoing work to update our workflows to handle GitHub issues (both
issues and PRs closing such issues).

An overview of the current state of using GitHub issues:

   - The current "lang: xx" labels have been renamed to "Component: xx" (to
   match the components on JIRA to ease the migration of JIRA issues to
   github). The github workflow to automatically add those labels to PRs
   (based on which files they touch) has been updated to use those labels. We
   can still reconsider the exact naming scheme later (after the migration).
   - People can *open GitHub* *issues* for reports. In the form (
   https://github.com/apache/arrow/issues/new/choose), you can currently
   choose between bug or feature request, and this choice will automatically
   add a label to the resulting issue ("Type: bug", "Type: enhancement").
  - Within the form, you can also choose a component (matching the
  components from JIRA). However, note that this does NOT yet automatically
  add a label (https://issues.apache.org/jira/browse/ARROW-18377 tracks
  this). So for committers that open or triage issues, please add the
  matching component labels manually, for now (until this is automated).
  - People currently can't assign an issue to themselves, as one can do
  on JIRA (unless you are a committer). That's a limitation of GitHub, and
  the idea is to workaround that with some comment workflow:
https://github.com/apache/arrow/issues/14784

  - People can *open* *PRs* that reference a GitHub issue. Instead of
   starting the title with "ARROW-XXX: ", you use "GH-XXX: ".
  - The bot checking the title has been updated to handle this
  correctly, and will in addition also assign the linked issue to the user
  opening the PR, and will post a message in case the issue has no
component.
  - The merge script has been updated to be able to merge such PRs
  (once https://github.com/apache/arrow/pull/14750 is merged). This
  will then close the linked github issue, and will also set the milestone
  (if not yet set).
   - For now we still require that a PR references a GH or JIRA issue
  (except for MINOR PRs), as was required when we only used JIRAs
for issues.
  This is something we could discuss in the future in a separate
thread if we
  want to relax this and allow direct PRs (which then will also require
  further updates to the workflows / merge script).

This is not yet documented in our contributing guidelines. But this thread
is a start to at least ensure the community is up to date with those
changes.

For sure the exact workflows will still be further refined while starting
to use this. And if there are things missing or unclear in the current
practices around how to handle GitHub issues or any other feedback or
ideas, this thread is yours!

Best,
Joris


Re: Dictionary Key For Null Slot

2022-11-30 Thread Raphael Taylor-Davies
For those following along, I've proposed a workaround that loosens the 
restriction on non-nullable children here 
. In particular 
non-nullable children are now allowed to contain nulls, so long as they 
don't introduce any new nulls not already found on their parent.


I think this preserves the semantic notions of nullability, whilst 
preserving the ability to correctly interpret, validate and construct 
child arrays in the absence of their parents.


Please do let me know if you foresee any issue with this approach, or 
have any insights to share on how other implementations are handling this.


Kind Regards,

Raphael Taylor-Davies

On 29/11/2022 20:47, Jacob Quinn wrote:

I was just looking into a related issue last night where it seems pandas
complains if there are _any_ nulls in the dictionary and we were
considering not allowing nulls in the dictionary values at all. But it's a
little tangled up at the moment because we've already allowed it. Ref:
https://github.com/apache/arrow-julia/issues/360

-Jacob

On Tue, Nov 29, 2022 at 8:06 AM Raphael Taylor-Davies
  wrote:


Hi All,

I am not sure if it is intentional, but a common property of all arrow
layouts is that the value at a given index is defined, even if for a
null it may contain an arbitrary value. This is true everywhere except
for the dictionary layout, where the key in the null slot may contain an
arbitrary value, and consequently the value of the index is undefined.

This has been a repeated nuisance in the Rust implementation, but so far
I've managed to find workarounds for most issues, however, I'm unsure
how to handle StructArrays containing non-nullable, dictionary-encoded
children. As the children are non-nullable, they cannot contain a null
mask, but without a null mask the child dictionary array is ill-formed.
I'm not really sure how best to handle this?

One option might be to require that all dictionary keys, even those for
null slots, are a valid index into the child values array. As the child
values array can itself contain nulls, this is always possible.

My questions are therefore:

* How are other implementations handling this case?

* Is requiring all dictionary keys to be a valid index into the child
values acceptable? We already do something similar for offsets

* What is the motivation for dictionaries having two levels of
nullability, both in the keys and values. UnionArray by contrast only
encodes nullability in its children

Any help would be much appreciated

Kind Regards,

Raphael Taylor-Davies