Re: [VOTE] Allow Decimal32 and Decimal64 bitwidths in Arrow Format

2024-09-04 Thread Curt Hagenlocher
+1 (non-binding)

I can upload a PR for the corresponding C# change within the next day.

On Wed, Sep 4, 2024 at 2:21 PM Matt Topol  wrote:

> Based on various discussions among the ecosystem and to continue expanding
> the zero-copy interoperability for Arrow to be used with different
> libraries and databases (such as libcudf, ClickHouse, etc) I would like to
> propose that we extend the allowable bit-widths for Arrow Decimal types to
> allow 32-bit and 64-bit decimals.
>
> The Arrow Spec currently defines the Decimal type as a parameterized type,
> parameterized by the bit-width, and then just specifies that the only
> allowed bitwidths are 128 and 256. Thus, rather than adding an entirely new
> type we could simply expand what is allowed for the bitwidth field which
> makes the format side of this a very small change.
>
> I've uploaded a PR for adding support for this to C++ [1] and will be
> uploading a PR for a corresponding Go change within the next day and will
> respond to this thread with the link.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - Update the Arrow Spec to allow for 32-bit and 64-bit bitwidths for
> Arrow Decimal types
> [ ] +0
> [ ] -1 - Do not update the Arrow Spec to allow for 32-bit and 64-bit
> bitwidths for Arrow Decimal types because
>
> Thanks everyone!
>
> --Matt
>
> [1]: https://github.com/apache/arrow/pull/43957
>
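For context, the format-side change is small because Decimal is already parameterized by bit width; only the set of accepted widths grows. A rough PyArrow sketch of how the new widths might surface once the C++ change in [1] and its bindings land (the decimal32/decimal64 factory names are an assumption here, mirroring the existing ones):

```python
import pyarrow as pa

# Widths allowed today: 16 and 32 bytes per value.
d128 = pa.decimal128(18, 4)
d256 = pa.decimal256(40, 4)

# Widths this vote would additionally allow (hypothetical factories,
# assuming the bindings follow the existing naming convention):
# d32 = pa.decimal32(9, 2)    # 4 bytes per value, precision <= 9
# d64 = pa.decimal64(18, 4)   # 8 bytes per value, precision <= 18
```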


Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Curt Hagenlocher
This seems to straddle that line, in that you can also view this as a way
to represent semi-structured data in a manner that allows for more
efficient querying and computation by breaking out some of its components
into a more structured form.

(I also happen to want a canonical Arrow representation for variant data,
as this type occurs in many databases but doesn't have a great
representation today in ADBC results. That's why I filed [Format] Consider
adding an official variant type to Arrow · Issue #42069 · apache/arrow
(github.com). Of course,
there's no specific reason why a canonical Arrow representation for
variants must align with Spark and/or Iceberg.)

-Curt

On Thu, Aug 22, 2024 at 2:01 AM Antoine Pitrou  wrote:

>
> Ah, thanks. I've tried to find a rationale and ended up on
> https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34 . Is it
> a good description of what you're after?
>
> If so, then I don't think Arrow is a good match. This seems mostly to be
> a marshalling format for semi-structured data (like Avro?). Arrow data
> types are meant to be in a representation ideal for querying and
> computation, rather than transport and storage.
>
> This could be developed separately and then be represented in Arrow
> using an extension type (perhaps a canonical one as in
> https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html).
>
> What do other Arrow developers think?
>
> Regards
>
> Antoine.
>
>
> On 22/08/2024 at 10:45, Gang Wu wrote:
> > Sorry for the inconvenience.
> >
> > This is the permalink for the discussion:
> > https://lists.apache.org/thread/hopkr2f0ftoywwt9zo3jxb7n0ob5s5bw
> >
> > On Thu, Aug 22, 2024 at 3:51 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> Hi Gang,
> >>
> >> Sorry, but can you give a pointer to the start of this discussion thread
> >> in a readable format (for example a mailing-list archive)? It appears
> >> that dev@arrow wasn't cc'ed from the start and that can make it
> >> difficult to understand what this is about.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 22/08/2024 at 08:32, Gang Wu wrote:
> >>> It seems that we have reached a consensus to some extent that there
> >>> should be a new home for the variant spec. The pending question
> >>> is whether Parquet or Arrow is a better choice. As a committer from
> >> Arrow,
> >>> Parquet and ORC communities, I am neutral to choose any and happy to
> >>> help with the movement once a decision has been made.
> >>>
> >>> Should we start a vote to move forward?
> >>>
> >>> Best,
> >>> Gang
> >>>
> >>> On Sat, Aug 17, 2024 at 8:34 AM Micah Kornfield  >
> >>> wrote:
> >>>
> >
> > That being said, I think the most important consideration for now is
>  where
> > are the current maintainers / contributors to the variant type. If
> most
>  of
> > them are already PMC members / committers on a project, it becomes a
> >> bit
> > easier. Otherwise if there isn't much overlap with a project's
> existing
> > governance, I worry there could be a bit of friction. How many active
> > contributors are there from Iceberg? And how about from Arrow?
> 
> 
>  I think this is the key question. What are the requirements around
>  governance?  I've seen some tangential messaging here but I'm not
> clear
> >> on
>  what everyone expects.
> 
>  I think for a lot of the other concerns my view is that the exact
> >> project
>  does not really matter (and choosing a project with mature cross
> >> language
>  testing infrastructure or committing to building it is critical). IIUC
> >> we
>  are talking about following artifacts:
> 
>  1.  A stand alone specification document (this can be hosted anyplace)
>  2.  A set of language bindings with minimal dependencies can be
> consumed
>  downstream (again, as long as dependencies are managed carefully any
>  project can host these)
>  3.  Potential integration where appropriate into file format libraries
> >> to
>  support shredding (but as of now this is being bypassed by using
>  conventions anyways).  My impression is that at least for Parquet
> there
> >> has
>  been a proliferation of vectorized readers across different projects,
> so
>  I'm not clear how much standardization in parquet-java could help
> here.
> 
>  To respond to some other questions:
> 
>  Arrow is not used as Spark's in-memory model, nor Trino and others so
> >> those
> > existing relationships aren't there. I also worry that differences in
> > approaches would make it difficult later on.
> 
> 
>  While Arrow is not in the core memory model, for Spark I believe it is
>  still used for IPC for things like Java<->Python. Trino also consumes
> >> Arrow
>  libraries today to support things like Snowflake/Bigquery federation.
> >> But I
>  think this is minor because as mentioned above I think the f

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Curt Hagenlocher
>  would it be easier to request statistics at a higher level of
abstraction?

What if there were a "single table provider" level of abstraction between
ADBC and ArrowArrayStream as a C API; something that can report statistics
and apply simple predicates?

On Thu, May 23, 2024 at 5:57 AM Dewey Dunnington
 wrote:

> Thank you for the background! I understand that these statistics are
> important for query planning; however, I am not sure that I follow why
> we are constrained to the ArrowSchema to represent them. The examples
> given seem to be going through Python... would it be easier to request
> statistics at a higher level of abstraction? There would already need
> to be a separate mechanism to request an ArrowArrayStream with
> statistics (unless the PyCapsule `requested_schema` argument would
> suffice).
>
> > ADBC may be a bit larger to use only for transmitting
> > statistics. ADBC has statistics related APIs but it has more
> > other APIs.
>
> Some examples of producers given in the linked threads (Delta Lake,
> Arrow Dataset) are well-suited to being wrapped by an ADBC driver. One
> can implement an ADBC driver without defining all the methods (where
> the producer could call AdbcConnectionGetStatistics(), although
> AdbcStatementGetStatistics() might be more relevant here and doesn't
> exist). One example listed (using an Arrow Table as a source) seems a
> bit light to wrap in an ADBC driver; however, it would not take much
> code to do so, and the overhead of getting the reader via ADBC is
> something like 100 microseconds (tested via the ADBC R package's
> "monkey driver" which wraps an existing stream as a statement). In any
> case, the bulk of the code is building the statistics array.
>
> > How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
>
> Whatever format for statistics is decided on, I imagine it should be
> exactly the same as the ADBC standard? (Perhaps pushing changes
> upstream if needed?).
>
> On Thu, May 23, 2024 at 3:21 AM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice
> >
> > It seems that we should use the approach because all
> > feedback said so. How about the following schema for the
> > statistics ArrowArray? It's based on ADBC.
> >
> > | Field Name   | Field Type| Comments |
> > |--|---|  |
> > | column_name  | utf8  | (1)  |
> > | statistic_key| utf8 not null | (2)  |
> > | statistic_value  | VALUE_SCHEMA not null |  |
> > | statistic_is_approximate | bool not null | (3)  |
> >
> > 1. If null, then the statistic applies to the entire table.
> >It's for "row_count".
> > 2. We'll provide pre-defined keys such as "max", "min",
> >"byte_width" and "distinct_count" but users can also use
> >application specific keys.
> > 3. If true, then the value is approximate or best-effort.
> >
> > VALUE_SCHEMA is a dense union with members:
> >
> > | Field Name | Field Type |
> > |||
> > | int64  | int64  |
> > | uint64 | uint64 |
> > | float64| float64|
> > | binary | binary |
> >
> > If a column is an int32 column, it uses int64 for
> > "max"/"min". We don't provide all types here. Users should
> > use a compatible type (int64 for an int32 column) instead.
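A minimal PyArrow sketch of the schema described above (declaration only, to make the proposal concrete; nothing beyond the listed fields is assumed):

```python
import pyarrow as pa

# Dense union carrying the statistic value in one of a few compatible types.
value_type = pa.dense_union([
    pa.field("int64", pa.int64()),
    pa.field("uint64", pa.uint64()),
    pa.field("float64", pa.float64()),
    pa.field("binary", pa.binary()),
])

# One row per (column, statistic) pair; a null column_name means the
# statistic applies to the whole table (e.g. "row_count").
statistics_schema = pa.schema([
    pa.field("column_name", pa.utf8()),
    pa.field("statistic_key", pa.utf8(), nullable=False),
    pa.field("statistic_value", value_type, nullable=False),
    pa.field("statistic_is_approximate", pa.bool_(), nullable=False),
])
```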
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS] Statistics through the C data interface" on Wed, 22 May
> 2024 17:04:57 +0200,
> >   Antoine Pitrou  wrote:
> >
> > >
> > > Hi Kou,
> > >
> > > I agree with Dewey that this is overstretching the capabilities of the
> > > C Data Interface. In particular, stuffing a pointer as metadata value
> > > and decreeing it immortal doesn't sound like a good design decision.
> > >
> > > Why not simply pass the statistics ArrowArray separately in your
> > > producer API of choice (Dewey mentioned ADBC but it is of course just
> > > a possible API among others)?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 22/05/2024 at 04:37, Sutou Kouhei wrote:
> > >> Hi,
> > >> We're discussing how to provide statistics through the C
> > >> data interface at:
> > >> https://github.com/apache/arrow/issues/38837
> > >> If you're interested in this feature, could you share your
> > >> comments?
> > >> Motivation:
> > >> We can interchange Apache Arrow data by the C data interface
> > >> in the same process. For example, we can pass Apache Arrow
> > >> data read by Apache Arrow C++ (provider) to DuckDB
> > >> (consumer) through the C data interface.
> > >> A provider may know Apache Arrow data statistics. For
> > >> example, a provider can know statistics when it reads Apache
> > >> Parquet data because Apache Parquet may provide statistics.
> > >> But a consumer can't know statistics that are known by a
> > >> producer. Because there

Re: [DISCUSS][C#][GLib] Formalize use of the GLib libraries for native library bindings

2024-05-09 Thread Curt Hagenlocher
As a ParquetSharp user, I think this is a great idea. I agree with Kou that
the best user experience includes distributing the native parts, but I
believe these can be separated out into individual "runtime" NuGet packages
and referenced in a way that only the required ones are downloaded --
see NuGet Gallery | runtime.native.System.Security.Cryptography.OpenSsl 4.3.3
for an example of this.

On Tue, May 7, 2024 at 1:58 PM Adam Reeve  wrote:

> Hi Kou, thanks for your insight
>
> > If we have many development resources for the C# bindings,
> > it may be better that we implement the C++ bindings directly
> > like PyArrow does. If we don't, it may be better that we
> > use Arrow GLib to combine development resources with
> > GLib/Ruby developers like me.
>
> I think it's fair to say there isn't a lot of developer time dedicated
> to the C# library and bindings, but I can see there being demand for
> bindings to the full dataset and compute APIs at least, so from that
> perspective it sounds like using the GLib libraries would make sense.
>
> > We may want to publish a NuGet package that includes Arrow
> > GLib libraries like ParquetSharp includes
> > ParquetSharpNative.* that are linked to Arrow/Parquet C++
> > statically.
>
> Good point, that would definitely help simplify things for end users.
>
> > We may want to create a C# library in addition to the auto-generated
> > code based on GObject Introspection. It's the approach used by Ruby.
> > The auto-generated code may be difficult to use from C#.
>
> Right, yes this is similar to what I meant by not publicly exposing
> the GLib.GObject based classes, although we could do something closer
> to this where we make the GObject classes public but in a separate
> namespace, and provide a cleaner API built on top of the generated
> code but allow users to access the lower level GObject API if needed.
>
> > > I was worried about whether it's possible to use GObject to implement
> > > bindings for some of the more complex parts of the Dataset API, like
> > > providing a .NET implementation of a KmsClientFactory, which would be
> > > required for reading encrypted Parquet data.
> >
> > We can use GObject for the case as you did. I can open a PR
> > for it or I can review your implementation. (If you open a
> > PR of your work.)
>
> The code I have is more like a prototype of a simplified version of
> the KMS API, so it's not useful as is, but I'll look into expanding
> this to implement the full API and make a PR.
>
> Cheers,
> Adam
>
> On Tue, 7 May 2024 at 20:11, Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > I'm the author of Arrow GLib.
> >
> > I agree with Pros/Cons you summarized.
> >
> > If we have many development resources for the C# bindings,
> > it may be better that we implement the C++ bindings directly
> > like PyArrow does. If we don't, it may be better that we
> > use Arrow GLib to combine development resources with
> > GLib/Ruby developers like me.
> >
> > If we don't have many development resources for the C#
> > bindings but we don't need many bindings, it may be better
> > that we implement the C++ bindings directly.
> >
> > > * There's no need to distribute a native binary with NuGet packages,
> > > and NuGet packages aren't bloated by builds for architectures that
> > > aren't used
> >
> > > * Users need to separately install the Arrow GLib libraries in order
> > > to use some Arrow NuGet packages, and this might complicate build and
> > > deployment processes compared to just adding a NuGet package reference
> > > to a project
> >
> > We may want to publish a NuGet package that includes Arrow
> > GLib libraries like ParquetSharp includes
> > ParquetSharpNative.* that are linked to Arrow/Parquet C++
> > statically.
> >
> >
> > We may want to create a C# library in addition to the auto-generated
> > code based on GObject Introspection. It's the approach used by Ruby.
> > The auto-generated code may be difficult to use from C#.
> >
> > For example, both of the following Ruby examples read a table:
> >
> > # With a Ruby library
> > table = Arrow::Table.load("data.arrow")
> >
> > # Without a Ruby library (Use only auto generated API)
> > input = Arrow::MemoryMappedInputStream.new("data.arrow")
> > reader = Arrow::RecordBatchFileReader.new(input)
> > table = reader.read_all
> >
> >
> > > I was worried about whether it's possible to use GObject to implement
> > > bindings for some of the more complex parts of the Dataset API, like
> > > providing a .NET implementation of a KmsClientFactory, which would be
> > > required for reading encrypted Parquet data.
> >
> > We can use GObject for the case as you did. I can open a PR
> > for it or I can review your implementation. (If you open a
> > PR of your work.)
> >
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "[DISCUSS][C#][GLib] Formalize use of the GLib libraries for native
> library bindings

Re: ADBC - OS-level driver manager

2024-04-01 Thread Curt Hagenlocher
The advantage to system-wide registration of drivers (however that's
accomplished) is of course that it allows driver authors to provide a
single installer or set of instructions for the driver to be installed
without regard for different usage scenarios. So if Tableau and Excel can
both use ODBC drivers, then I (as a hypothetical author of a niche driver)
don't have to solve N installation problems for N possible use cases. And
my spouse (as a non-developer finance user) can just run one installer and
know that the data source will be available in multiple tools. Or at least
that's the principle.

For a real-world example, compare the instructions for installing ODBC
drivers into Tableau (
https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases.htm
) with those for installing JDBC drivers (
https://help.tableau.com/current/pro/desktop/en-us/examples_otherdatabases_jdbc.htm
). The JDBC instructions include copying or installing files to a specific
directory which possibly needs to be created. The ODBC instructions ...
don't.

With what I'm most immediately invested in -- database drivers for
Microsoft Power BI -- part of the problem actually ends up being that many
drivers are closed source and/or not freely redistributable. So for someone
to use Power BI with Oracle, they either need a way to install Oracle
drivers onto their machine in a standard way which lets us find them or we
need to go through a painful and sometimes expensive "biz dev" effort to
get the right to redistribute those drivers and install them ourselves.

I am of course aware that there can also be significant downsides to such
system-wide registration.

-Curt

On Wed, Mar 20, 2024 at 7:23 AM Antoine Pitrou  wrote:

>
> Also, with ADBC driver implementations currently in flux (none of them
> has reached the "stable" status in
> https://arrow.apache.org/adbc/main/driver/status.html), it might be a
> disservice to users to implicitly fetch drivers from potentially
> outdated DLLs on the current system.
>
> Regards
>
> Antoine.
>
>
> On 20/03/2024 at 15:08, Matt Topol wrote:
> >> it seems like the current driver manager work has been largely targeting
> > an app-specific implementation.
> >
> > Yup, that was the intention. So far discussions of ADBC having a
> > system-wide driver registration paradigm like ODBC have mostly been to
> > discuss how much we dislike that paradigm and would prefer ADBC to stay
> > with the app-specific approach that we currently have. :)
> >
> > As of yet, no one has requested such a paradigm so the discussions
> haven't
> > gotten revived.
> >
> > On Wed, Mar 20, 2024 at 9:22 AM David Coe
> > wrote:
> >
> >> ODBC has different OS-level driver managers available on their
> respective
> >> systems. It seems like the current driver manager<
> >> https://arrow.apache.org/adbc/main/cpp/driver_manager.html> work has
> been
> >> largely targeting an app-specific implementation. Have there been any
> >> discussions of ADBC having a similar system-wide driver registration
> >> paradigm like ODBC does?
> >>
> >
>


Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread Curt Hagenlocher
Congrats Joel!

On Mon, Apr 1, 2024 at 10:36 AM Wes McKinney  wrote:

> Congrats!
>
> On Mon, Apr 1, 2024 at 11:01 AM Andrew Lamb  wrote:
>
> > Congratulations Joel.
> >
> > On Mon, Apr 1, 2024 at 11:53 AM Raúl Cumplido 
> > wrote:
> >
> > > Congratulations and welcome Joel!
> > >
> > >
> > > On Mon, Apr 1, 2024, 17:18, Kevin Gurney wrote:
> > >
> > > > Congratulations, Joel!
> > > >
> > > > 
> > > > From: Jason Z 
> > > > Sent: Monday, April 1, 2024 11:13 AM
> > > > To: dev@arrow.apache.org 
> > > > Subject: Re: [ANNOUNCE] New Committer Joel Lubinitsky
> > > >
> > > > Congrats Joel!
> > > >
> > > >
> > > > Thanks,
> > > > Jiashen
> > > >
> > > >
> > > > On Mon, Apr 1, 2024 at 8:10 AM Ian Cook  wrote:
> > > >
> > > > > Congratulations Joel!
> > > > >
> > > > > On Mon, Apr 1, 2024 at 11:08 AM wish maple  >
> > > > wrote:
> > > > >
> > > > > > Congrats Joel!
> > > > > >
> > > > > > Best,
> > > > > > Xuwei Fu
> > > > > >
> > > > > > > Matt Topol wrote on Mon, Apr 1, 2024 at 22:59:
> > > > > >
> > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Joel
> > > > Lubinitsky
> > > > > > has
> > > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > > Welcome,
> > > > > > and
> > > > > > > thank you for your contributions!
> > > > > > >
> > > > > > > --Matt
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: ADBC: xdbc_data_type and xdbc_sql_data_type

2024-01-11 Thread Curt Hagenlocher
Interestingly, the description of sql_data_type in FlightSql.proto includes
"The value of the SQL DATA TYPE which has the same values as data_type
value."



On Thu, Jan 11, 2024 at 10:06 AM David Li  wrote:

> Those values are inherited from Flight SQL [1] which effectively borrowed
> types from JDBC/ODBC.
>
> xdbc_sql_data_type [2] is defined by an enum [3]. This is the database's
> type in its SQL dialect, not the Arrow type. Arrow types are always
> represented in Arrow schemas. (This field is a little contradictory to
> JDBC, which specifies sql_data_type is unused/reserved.)
>
> xdbc_data_type [4] is ill-defined I think. James Duong, do you have a
> clarification about Dremio's original intent here? In JDBC this is a
> java.sql.Types value but it is not explained in Flight SQL. In fact it
> seems the proto interchanged the definitions of the two fields, since the
> enum above is java.sql.Types.
>
>
> [1]:
> https://github.com/apache/arrow-adbc/blob/6b73e529ced2f057aa463e7599c6e1227104b025/adbc.h#L1520-L1522
> [2]:
> https://github.com/apache/arrow/blob/2b4a70320232647f730b19d2fea5746c3baec752/format/FlightSql.proto#L1098-L1102
> [3]:
> https://github.com/apache/arrow/blob/2b4a70320232647f730b19d2fea5746c3baec752/format/FlightSql.proto#L944-L973
> [4]:
> https://github.com/apache/arrow/blob/2b4a70320232647f730b19d2fea5746c3baec752/format/FlightSql.proto#L1067
>
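For concreteness, the enum in [3] mirrors java.sql.Types, so xdbc_sql_data_type values are plain JDBC type codes; a few of them, listed purely for illustration:

```python
# A handful of java.sql.Types codes reused by Flight SQL's XdbcDataType enum.
JAVA_SQL_TYPES = {
    4: "INTEGER",
    -5: "BIGINT",
    12: "VARCHAR",
    91: "DATE",
    93: "TIMESTAMP",
}
```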
> On Thu, Jan 11, 2024, at 12:37, David Coe wrote:
> > I recently raised csharp/src/Apache.Arrow/Types/ArrowType: There are
> > different type IDs for values after 21, including Decimal128 and
> > Decimal256, than for Python * Issue #39568 * apache/arrow
> > (github.com) because I
> > have a downstream system that is interpreting the
> > XDBC_DATA_TYPE<
> https://github.com/apache/arrow-adbc/blob/6b73e529ced2f057aa463e7599c6e1227104b025/adbc.h#L1501>
>
> > as the ArrowTypeId and those are different values in different
> > languages.
> >
> > For ADBC, what is the intended distinction between xdbc_data_type and
> > xdbc_sql_data_type? Is the xdbc_data_type intended to mimic the C types
> > in ODBC? Or is there a different interpretation? And if there are docs
> > I don't seem to be finding, please refer me to those.
> >
> > Thanks,
> >
> >   *   David
>


Java, dictionary ids and schema equality

2023-12-09 Thread Curt Hagenlocher
I've (mostly) fixed the C# implementation of dictionary IPC but I'm getting
a failing integration test. The Java checks are explicitly validating that
the dictionary IDs in the schema match the values it expects. None of the
other implementations seem to do that, though they're obviously passing and
so they're assigning dictionary IDs consistently with what the Java
implementation expects.

This seems to be because the C# implementation starts numbering
dictionaries with 1 while Java seems to expect them to start with 0. (I
have not yet validated this theory.)

But more broadly, I'm curious -- is the Java implementation being overly
pedantic here or is there an explicit expectation that the dictionary
number serialized into Flatbuffer format for files will follow a specific
ordering?

Thanks,
-Curt


Re: CIDR 2024

2023-12-06 Thread Curt Hagenlocher
Yes, sorry, thank you!

On Wed, Dec 6, 2023 at 12:33 AM Antoine Pitrou  wrote:

>
> For the sake of clarity, it seems this is talking about the Conference
> on Innovative Data Systems Research:
> https://www.cidrdb.org/cidr2024/
>
> Regards
>
> Antoine.
>
>
> On 06/12/2023 at 01:15, Wes McKinney wrote:
> > I will also be there.
> >
> > On Mon, Dec 4, 2023 at 12:58 PM Tony Wang  wrote:
> >
> >> I am
> >>
> >> Get Outlook for Android<https://aka.ms/AAb9ysg>
> >> 
> >> From: Curt Hagenlocher 
> >> Sent: Monday, December 4, 2023 12:53:00 PM
> >> To: dev@arrow.apache.org 
> >> Subject: CIDR 2024
> >>
> >> Who's going to CIDR in January?
> >>
> >> (And who else is shocked that it's already going to be 2024...?)
> >>
> >> -Curt
> >>
> >
>


CIDR 2024

2023-12-04 Thread Curt Hagenlocher
Who's going to CIDR in January?

(And who else is shocked that it's already going to be 2024...?)

-Curt


Re: decimal64

2023-11-09 Thread Curt Hagenlocher
It certainly could be. Would float16 be done as a canonical extension type
if it were proposed today?

On Thu, Nov 9, 2023 at 9:36 AM David Li  wrote:

> cuDF has decimal32/decimal64 [1].
>
> Would a canonical extension type [2] be appropriate here? I think that's
> come up as a solution before.
>
> [1]: https://docs.rapids.ai/api/cudf/stable/user_guide/data-types/
> [2]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
>
> On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote:
> > Or they could trivially use a int64 column for that, since the scale is
> > fixed anyway, and you're probably not going to multiply money values
> > together.
> >
> >
> > On 09/11/2023 at 17:54, Curt Hagenlocher wrote:
> >> If Arrow had a decimal64 type, someone could choose to use that for a
> >> PostgreSQL money column knowing that there are edge cases where they may
> >> get an undesired result.
> >>
> >> On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou 
> wrote:
> >>
> >>>
> >>>> On 09/11/2023 at 17:23, Curt Hagenlocher wrote:
> >>>> Or more succinctly,
> >>>> "111,111,111,111,111." will fit into a decimal64; would you
> prevent
> >>> it
> >>>> from being stored in one so that you can describe the column as
> >>>> "decimal(18, 4)"?
> >>>
> >>> That's what we do for other decimal types, see PyArrow below:
> >>> ```
> >>>   >>> pa.array([111_111_111_111_111_1111]).cast(pa.decimal128(18, 0))
> >>> Traceback (most recent call last):
> >>> [...]
> >>> ArrowInvalid: Precision is not great enough for the result. It should
> be
> >>> at least 19
> >>> ```
> >>>
> >>>
> >>
>


Re: decimal64

2023-11-09 Thread Curt Hagenlocher
If Arrow had a decimal64 type, someone could choose to use that for a
PostgreSQL money column knowing that there are edge cases where they may
get an undesired result.

On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou  wrote:

>
> On 09/11/2023 at 17:23, Curt Hagenlocher wrote:
> > Or more succinctly,
> > "111,111,111,111,111." will fit into a decimal64; would you prevent
> it
> > from being stored in one so that you can describe the column as
> > "decimal(18, 4)"?
>
> That's what we do for other decimal types, see PyArrow below:
> ```
>  >>> pa.array([111_111_111_111_111_1111]).cast(pa.decimal128(18, 0))
> Traceback (most recent call last):
>[...]
> ArrowInvalid: Precision is not great enough for the result. It should be
> at least 19
> ```
>
>


Re: decimal64

2023-11-09 Thread Curt Hagenlocher
(But yes, I think this would be more commonly described as having a
precision of 18 and my writeup was probably influenced by looking at the
SQL Server and Postgres descriptions of the "money" type, both of which
allow the full range of the underlying 64-bit value to be used.)

On Thu, Nov 9, 2023 at 8:23 AM Curt Hagenlocher 
wrote:

> Obviously the limits don't match up exactly, so it depends on whether you
> want to express the decimal bounds in a way that captures all the possible
> underlying values or in a way that allows all the decimal values to be
> represented as the underlying type. Or more succinctly,
> "111,111,111,111,111." will fit into a decimal64; would you prevent it
> from being stored in one so that you can describe the column as
> "decimal(18, 4)"? In any event, all scaled integer representations of
> decimal values have the same question and it's more-or-less orthogonal to
> having a decimal64 type.
>
> On Thu, Nov 9, 2023 at 8:17 AM Raphael Taylor-Davies
>  wrote:
>
>> Perhaps my maths is incorrect, but a decimal64 would have a maximum
>> precision of 18, not 19? log(9223372036854775807) = 18.9?
>>
>> On 09/11/2023 16:01, Curt Hagenlocher wrote:
>> > Recently, someone opened an issue on GitHub ([C++] Decimal64/32
>> support? ·
>> > Issue #38622 · apache/arrow (github.com)
>> > <https://github.com/apache/arrow/issues/38622>) asking for support for
>> > narrower decimal types. They were advised to start a thread on the
>> mailing
>> > list, and as they haven't done so yet I will start.
>> >
>> > It's fairly common to store currency in databases in a type that's
>> > compatible with decimal64. Both PostgreSQL and Microsoft SQL Server have
>> > "money" data types; for Postgres, this is a decimal(19, 2) and for SQL
>> > Server it's decimal(19, 4). Microsoft Analysis Services also uses
>> > decimal(19, 4) as one of its core data types. If you search the internet
>> > for suggestions on the database type to use for money, the vast majority
>> > recommend a decimal type with a precision <= 19. Currency is something
>> > stored very frequently as data, and it makes sense to have a type that's
>> > optimized for this purpose. I submit that it's a far more common type
>> than
>> > float16, and that even if it's not as hip as the AI scenarios which
>> > popularized float16, the ultimate goal of those scenarios is, after
>> all, to
>> > make more "money".
>> >
>> > decimal64 is considerably easier to work with on modern CPUs and in
>> common
>> > programming languages than decimal128, and requires half the amount of
>> > storage space. And while adding new types to Arrow obviously needs to be
>> > done very sparingly, it's harder to imagine a new type for which support
>> > would be easier to implement than this one.
>> >
>> > I think decimal32 is much harder to justify. MS SQL Server has a
>> > "smallmoney" (decimal(10, 4)), but I suspect it's not that heavily used.
>> > Maybe others have more feedback on this one.
>> >
>> >
>> > -Curt
>> >
>>
>


Re: decimal64

2023-11-09 Thread Curt Hagenlocher
Obviously the limits don't match up exactly, so it depends on whether you
want to express the decimal bounds in a way that captures all the possible
underlying values or in a way that allows all the decimal values to be
represented as the underlying type. Or more succinctly,
"111,111,111,111,111." will fit into a decimal64; would you prevent it
from being stored in one so that you can describe the column as
"decimal(18, 4)"? In any event, all scaled integer representations of
decimal values have the same question and it's more-or-less orthogonal to
having a decimal64 type.
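A quick check of why the limit straddles 18 and 19 digits (plain Python arithmetic, for illustration):

```python
>>> 2**63 - 1                 # largest value an int64 can hold
9223372036854775807
>>> len(str(2**63 - 1))       # 19 digits, but not every 19-digit value fits
19
>>> 10**19 - 1 <= 2**63 - 1   # so 18 is the largest precision covering all values
False
```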

On Thu, Nov 9, 2023 at 8:17 AM Raphael Taylor-Davies
 wrote:

> Perhaps my maths is incorrect, but a decimal64 would have a maximum
> precision of 18, not 19? log(9223372036854775807) = 18.9?
>
> On 09/11/2023 16:01, Curt Hagenlocher wrote:
> > Recently, someone opened an issue on GitHub ([C++] Decimal64/32 support?
> ·
> > Issue #38622 · apache/arrow (github.com)
> > <https://github.com/apache/arrow/issues/38622>) asking for support for
> > narrower decimal types. They were advised to start a thread on the
> mailing
> > list, and as they haven't done so yet I will start.
> >
> > It's fairly common to store currency in databases in a type that's
> > compatible with decimal64. Both PostgreSQL and Microsoft SQL Server have
> > "money" data types; for Postgres, this is a decimal(19, 2) and for SQL
> > Server it's decimal(19, 4). Microsoft Analysis Services also uses
> > decimal(19, 4) as one of its core data types. If you search the internet
> > for suggestions on the database type to use for money, the vast majority
> > recommend a decimal type with a precision <= 19. Currency is something
> > stored very frequently as data, and it makes sense to have a type that's
> > optimized for this purpose. I submit that it's a far more common type
> than
> > float16, and that even if it's not as hip as the AI scenarios which
> > popularized float16, the ultimate goal of those scenarios is, after all,
> to
> > make more "money".
> >
> > decimal64 is considerably easier to work with on modern CPUs and in
> common
> > programming languages than decimal128, and requires half the amount of
> > storage space. And while adding new types to Arrow obviously needs to be
> > done very sparingly, it's harder to imagine a new type for which support
> > would be easier to implement than this one.
> >
> > I think decimal32 is much harder to justify. MS SQL Server has a
> > "smallmoney" (decimal(10, 4)), but I suspect it's not that heavily used.
> > Maybe others have more feedback on this one.
> >
> >
> > -Curt
> >
>


decimal64

2023-11-09 Thread Curt Hagenlocher
Recently, someone opened an issue on GitHub ([C++] Decimal64/32 support? ·
Issue #38622 · apache/arrow (github.com)
<https://github.com/apache/arrow/issues/38622>) asking for support for
narrower decimal types. They were advised to start a thread on the mailing
list, and as they haven't done so yet I will start.

It's fairly common to store currency in databases in a type that's
compatible with decimal64. Both PostgreSQL and Microsoft SQL Server have
"money" data types; for Postgres, this is a decimal(19, 2) and for SQL
Server it's decimal(19, 4). Microsoft Analysis Services also uses
decimal(19, 4) as one of its core data types. If you search the internet
for suggestions on the database type to use for money, the vast majority
recommend a decimal type with a precision <= 19. Currency is something
stored very frequently as data, and it makes sense to have a type that's
optimized for this purpose. I submit that it's a far more common type than
float16, and that even if it's not as hip as the AI scenarios which
popularized float16, the ultimate goal of those scenarios is, after all, to
make more "money".

decimal64 is considerably easier to work with on modern CPUs and in common
programming languages than decimal128, and requires half the amount of
storage space. And while adding new types to Arrow obviously needs to be
done very sparingly, it's harder to imagine a new type for which support
would be easier to implement than this one.
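To make the storage argument concrete, here is what a money column looks like with today's narrowest decimal (a PyArrow sketch; the decimal64 type itself is still hypothetical at this point):

```python
from decimal import Decimal
import pyarrow as pa

# Today a decimal(19, 4) "money" column has to use decimal128:
# 16 bytes per value even though every value fits easily in 64 bits.
prices = pa.array([Decimal("19.9900"), Decimal("0.0500")],
                  type=pa.decimal128(19, 4))

# A decimal64 type would hold the same logical values in an int64-backed
# buffer (8 bytes per value), e.g. 19.9900 stored as the scaled integer 199900.
```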

I think decimal32 is much harder to justify. MS SQL Server has a
"smallmoney" (decimal(10, 4)), but I suspect it's not that heavily used.
Maybe others have more feedback on this one.


-Curt


Language-specific discussion (with C# example)

2023-10-17 Thread Curt Hagenlocher
I'm curious what other (sub-) communities do about implementation-specific
considerations that aren't directly tied to the Arrow standard. I don't see
much of that kind of discussion on the dev list; does that mean these
happen largely in the context of specific pull requests -- or perhaps not
at all?

My specific motivation for asking is that there are three similar feature
requests for C#: 23892, 37359 and 35199. Looking at these, I was
thinking that the best general solution would be to have the scalar arrays
in C# implement IReadOnlyList<T> and ICollection<T>. The former is a
strictly-better superset of IEnumerable<T> which also allows indexing by
position, while the latter is an unfortunate concession to working well
with "LINQ" (pre-.NET 9). Implementing ICollection<T> would allow LINQ's
"ToList" to just work, and work efficiently.

But it feels weird to just submit a PR for this kind of implementation
decision without more feedback from users or potential users, and at the
same time it doesn't feel significant enough to e.g. write it up in a
document to submit for review. I could (and will) open a new issue for this
on GitHub, but it doesn't look like anyone proactively looks at new issues
to find things to comment on.

So what do others do?

Thanks,
-Curt


Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-16 Thread Curt Hagenlocher
Thanks, all!

On Mon, Oct 16, 2023 at 9:19 AM Dane Pitkin 
wrote:

> Congrats Curt!
>
> On Mon, Oct 16, 2023 at 12:00 PM Kevin Gurney
> 
> wrote:
>
> > Congratulations, Curt!
> > 
> > From: Weston Pace 
> > Sent: Sunday, October 15, 2023 5:32 PM
> > To: dev@arrow.apache.org 
> > Subject: Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher
> >
> > Congratulations!
> >
> > On Sun, Oct 15, 2023, 8:51 AM Gang Wu  wrote:
> >
> > > Congrats!
> > >
> > > On Sun, Oct 15, 2023 at 10:49 PM David Li  wrote:
> > >
> > > > Congrats & welcome Curt!
> > > >
> > > > On Sun, Oct 15, 2023, at 09:03, wish maple wrote:
> > > > > Congratulations!
> > > > >
> > > > > Raúl Cumplido wrote on Sun, Oct 15, 2023 at 20:48:
> > > > >
> > > > >> Congratulations and welcome!
> > > > >>
> > > > >> On Sun, Oct 15, 2023, 13:57, Ian Cook wrote:
> > > > >>
> > > > >> > Congratulations Curt!
> > > > >> >
> > > > >> > On Sun, Oct 15, 2023 at 05:32 Andrew Lamb  >
> > > > wrote:
> > > > >> >
> > > > >> > > On behalf of the Arrow PMC, I'm happy to announce that Curt
> > > > Hagenlocher
> > > > >> > > has accepted an invitation to become a committer on Apache
> > > > >> > > Arrow. Welcome, and thank you for your contributions!
> > > > >> > >
> > > > >> > > Andrew
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>


Re: Are interval components unsigned?

2023-10-13 Thread Curt Hagenlocher
Thanks, that was very helpful.

Incidentally (and FWIW) the information associated with SQL Server in that
thread and document is incorrect. MS SQL Server doesn't have an interval
type, and the linked documentation is part of the ODBC specification.

-Curt

On Fri, Oct 13, 2023 at 4:56 PM Micah Kornfield 
wrote:

> My understanding is that the intent in Arrow is that intervals are signed
> ([1] has the discussion on the types).  IIUC, this aligns with most SQL
> type systems.  I don't have context on why Parquet (and I think Avro) chose to
> make them unsigned.
>
> Also, note that because of this there is no canonical way of mapping Arrow
> Intervals directly to parquet intervals.  In the past there have been some
> proposals to add a new logical type to parquet but nobody has followed
> through on them.
>
> Thanks,
> Micah
>
> [1] https://lists.apache.org/thread/pqs6qnjvw1gxfkxz02bntvvyqxvw34mm
>
> On Fri, Oct 13, 2023 at 4:40 PM Curt Hagenlocher 
> wrote:
>
> > The Parquet specification clearly states that the components of a Parquet
> > interval are unsigned integers. I couldn't find an equivalent statement
> for
> > Arrow, and the C++ implementation has these as signed. Is it correct to
> > assume that the intent for Arrow intervals is that they should be
> > non-negative?
> >
> > Thanks,
> > -Curt
> >
>


Are interval components unsigned?

2023-10-13 Thread Curt Hagenlocher
The Parquet specification clearly states that the components of a Parquet
interval are unsigned integers. I couldn't find an equivalent statement for
Arrow, and the C++ implementation has these as signed. Is it correct to
assume that the intent for Arrow intervals is that they should be
non-negative?

Thanks,
-Curt


Re: [VOTE] Release Apache Arrow 13.0.0 - RC0

2023-07-21 Thread Curt Hagenlocher
Apparently, we never managed to test the C API support in the C# library on
downlevel versions of .NET until now, and they don't actually work :(. This
gap is in part because the tests don't get run on anything but .NET 7.0.
I've submitted a PR to address these issues, though the testing for .NET
4.7.2 is disabled because it causes a hang on shutdown when xUnit tries to
unload the AppDomain.

GH-36812: [C#] Fix C API support to work with .NET desktop framework by
CurtHagenlocher · Pull Request #36813 · apache/arrow (github.com)


I'd like to consider this a blocking bug.

On Fri, Jul 21, 2023 at 2:49 AM Raúl Cumplido  wrote:

> Hi,
>
> As discussed during the community calls I have also triggered the
> benchmark tests on the Pull Request for RC 0 [1].
>
> I am trying to get the conbench comparison between the 13.0.0 RC0 and
> 12.0.1 RC1 (latest release) by having a chat with the conbench
> maintainers. I'll share as soon as I have it.
>
> I wanted to share the Verification email as soon as possible so we can
> start running the verification process.
>
> Thanks,
> Raúl
>
> [1] https://github.com/apache/arrow/pull/36775#issuecomment-1645088676
>
> On Fri, Jul 21, 2023 at 11:45, Raúl Cumplido wrote:
> >
> > Hi,
> >
> > I would like to propose the following release candidate (RC0) of Apache
> > Arrow version 13.0.0. This is a release consisting of 428
> > resolved GitHub issues[1].
> >
> > This release candidate is based on commit:
> > ac2d207611ce25c91fb9fc90d5eaff2933609660 [2]
> >
> > The source release rc0 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> > The changelog is located at [12].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [13] for how to validate a release
> candidate.
> >
> > See also a verification result on GitHub pull request [14].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 13.0.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 13.0.0 because...
> >
> > [1]:
> https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A13.0.0+is%3Aclosed
> > [2]:
> https://github.com/apache/arrow/tree/ac2d207611ce25c91fb9fc90d5eaff2933609660
> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-13.0.0-rc0
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/13.0.0-rc0
> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/13.0.0-rc0
> > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/13.0.0-rc0
> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [12]:
> https://github.com/apache/arrow/blob/ac2d207611ce25c91fb9fc90d5eaff2933609660/CHANGELOG.md
> > [13]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > [14]: https://github.com/apache/arrow/pull/36775
>


Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-17 Thread Curt Hagenlocher
At some point, you just have to trust that a user is doing
semantically-meaningful operations. After all, they could also choose to
subtract a temperature from an elevation, or add feet to meters. It's
important to define the precise semantics of an operation, including the
assumptions it makes about the input(s). After that, it's up to a user to
ensure proper use.

On Thu, Jun 17, 2021 at 1:39 PM Weston Pace  wrote:

> If a system does not store a local datetime using the UTC-normalized
> representation and they put it in an Arrow timestamp column without
> timezone, then how should an Arrow compute function extract a field?
>
> For a concrete example, let's assume I have the number 172800000 in a
> timestamp(ms) column with no time zone and the user has asked to
> extract the day of month.  I use 172800000 because it is in the
> parquet docs example[1].  I thought I could assume that the source
> system had normalized this value to UTC and so I could run something
> like `datetime.fromtimestamp(172800).day` and find out that it is 2.
>
> Perhaps, more concretely:
>
> There are many ways that one could store a datetime into a single
> number.  The parquet docs mention two different ways but they are
> really the same thing, figure out the epoch timestamp for that
> datetime in the UTC timezone (the instant at which a wall clock in UTC
> would show the desired wall clock time).  With this method the
> datetime (1970, 1, 2, 14, 0) is stored as 0x0A4CB800
> (172800000, assuming ms). So let's invent a third way.  I could use
> the first 16 bits for the year, the next 8 bits for the month, the
> next 8 bits for the day of month, the next 8 bits for the hour, the
> next 8 bits for the minute, and the remaining bits for the seconds.
> Using this method I would store (1970, 1, 2, 14, 0) as
> 0x07B201020E000000.
>
> If I understand your argument correctly it is that Arrow is not going
> to govern how these other systems encode a local datetime into an 8
> byte value and so both of those are valid representations of (1970, 1,
> 2, 14, 0).  As a result, there would be no possible way to write a
> uniform kernel for field extraction that would work in Arrow.
>
> Am I understanding you correctly?  Or have I misinterpreted things again
> as I've already done that several times on this thread alone :)
>
> [1]
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
>
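To restate the ambiguity in runnable form (plain Python, using the example value above; purely illustrative):

```python
from datetime import datetime, timezone

millis = 172800000  # the value from the parquet docs example

# Treating the stored value as a UTC-normalized wall-clock time:
datetime.fromtimestamp(millis / 1000, tz=timezone.utc).day  # -> 3

# Treating it as the consumer's local wall-clock time (fromtimestamp's
# default) gives a different day on machines west of UTC:
datetime.fromtimestamp(millis / 1000).day  # -> 2 or 3, zone-dependent
```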
> On Thu, Jun 17, 2021 at 8:59 AM Wes McKinney  wrote:
> >
> > To take a step back to focus on some concrete issues
> >
> > Parquet has two timestamp types: with (UTC-normalized)/without time
> > zone (non-UTC-normalized)
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L268
> >
> > The SQL standard (e.g. PostgresSQL) has two timestamp types:
> > with/without time zone — in some SQL implementations each slot can
> > have a different time zone
> > https://www.postgresql.org/docs/9.1/datatype-datetime.html
> > WITHOUT TIME ZONE: "timestamp without time zone value should be taken
> > or given as timezone local time"
> >
> > Spark / Databricks discusses how Spark handles this
> >
> https://docs.databricks.com/spark/latest/dataframes-datasets/dates-timestamps.html#ansi-sql-and-spark-sql-timestamps
> > * WITHOUT TIME ZONE: "These timestamps are not bound to any time zone,
> > and are wall clock timestamps." — not UTC-normalized
> > * WITH TIME ZONE: "does not affect the physical point in time that the
> > timestamp represents, as that is fully represented by the UTC time
> > instant given by the other timestamp components"
> >
> > pandas as discussed has non-UTC-normalized WITHOUT TIME ZONE "naive"
> > timestamps and UTC-normalized WITH TIME ZONE.
> >
> > If we were to change Arrow's "WITHOUT TIMEZONE" semantics to be
> > interpreted as UTC-normalized, that would force all of these other
> > systems (and more) to serialize their data to be UTC-normalized (i.e.
> > calling the equivalent of pandas's tz_localize function) when they
> > convert to Arrow. This seems very harmful to me, and will make data
> > from these systems not accurately representable in Arrow and unable to
> > be round-tripped.
> >
> > Perhaps we can make a spreadsheet and look comprehensively at how many
> > use cases would be disenfranchised by requiring UTC normalization
> > always.
> >
> > On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper  wrote:
> > >
> > > On Tue, Jun 15, 2021 at 1:19 PM Weston Pace 
> wrote:
> > >
> > > > Arrow's "Timestamp with Timezone" can have fields extracted
> > > > from it.
> > > >
> > >
> > > Sure, one *can* extract fields from timestamp+tz. But I don't feel
> > > timestamp+tz is *designed* for extracting fields:
> > >
> > >- Extracting fields from int64+tz is inefficient, because it
> bundles two
> > >steps: 1) convert to datetime struct; and 2) return one field from
> the
> > >datetime struct. (If I want to extract Year, Month, Day, is that
> three