Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-24 Thread Keith Kraus
> I believe several array implementations (e.g., numpy, R) are able to
broadcast/recycle a length-1 array. Run-end-encoding is also an option that
would make that broadcast explicit without expanding the scalar.
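As a sketch of the quoted run-end-encoding idea, here is plain Python standing in for Arrow's run-end-encoded layout (which stores run ends and values as child arrays): a scalar broadcast to logical length n is simply a single run.

```python
# A run-end-encoded array stores parallel run_ends and values arrays;
# logical element i takes the value of the first run whose end exceeds i.
def decode_ree(run_ends, values):
    out, start = [], 0
    for end, value in zip(run_ends, values):
        out.extend([value] * (end - start))
        start = end
    return out

# A "scalar" broadcast to logical length 5 is one run: the broadcast is
# explicit, yet the encoded form never materializes five copies.
assert decode_ree([5], [42]) == [42] * 5

# Ordinary run-length-compressed data, for comparison:
assert decode_ree([2, 5], ["a", "b"]) == ["a", "a", "b", "b", "b"]
```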

Some libraries behave this way, e.g. Polars, but others like Pandas and
cuDF only broadcast up dimensions. That is, scalars can be broadcast across
columns or dataframes, and columns can be broadcast across dataframes, but
length-1 columns do not broadcast across columns: adding, say, a length-5
column and a length-1 column isn't valid, while adding a length-5 column
and a scalar is. This model also differentiates between operations that
are guaranteed to return a scalar, e.g. a reduction like `sum()`, and
operations that can return a length-1 column depending on the data, e.g.
`unique()`.
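A quick NumPy sketch of the two distinctions above (NumPy recycles length-1 arrays; pandas and cuDF, as described, do not treat a length-1 column as a scalar):

```python
import numpy as np

col = np.array([1, 2, 3, 4, 5])
one = np.array([10])  # a length-1 array

# NumPy recycles the length-1 array across the longer one:
assert (col + one).tolist() == [11, 12, 13, 14, 15]

# A reduction such as sum() is guaranteed to produce a scalar
# (a zero-dimensional result):
total = col.sum()
assert total.ndim == 0

# unique() may happen to produce a single value, but it is still a
# length-1 one-dimensional array, not a scalar:
u = np.unique(np.array([7, 7, 7]))
assert u.ndim == 1 and len(u) == 1
```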

> For UDFs: UDFs are a system-specific interface. Presumably, that
interface can encode whether an Arrow array is meant to represent a column
or scalar (or record batch or ...). Again, because Arrow doesn't define
scalars (for now...) or UDFs, the UDF interface needs to layer its own
semantics on top of Arrow.
>
> In other words, I don't think the C Data Interface was meant to be
something where you can expect to _only_ pass the ArrowDeviceArray around
and have it encode all the semantics for a particular system, right? The
UDF example is something where the engine would pass an ArrowDeviceArray
plus additional context.

There's a growing trend of execution engines (DuckDB, PySpark, DataFusion,
etc.) supporting Arrow-in/Arrow-out UDFs. Many of them offer different
options for passing in RecordBatches vs. Arrays, and they currently rely on
the Arrow library containers to differentiate the two.

Additionally, libcudf has some generic functions that currently use Arrow
C++ containers (
https://docs.rapids.ai/api/cudf/stable/libcudf_docs/api_docs/interop_arrow/)
to differentiate between RecordBatches, Arrays, and Scalars, and which
could be moved to the C Data Interfaces. Polars has a similar API (
https://docs.pola.rs/py-polars/html/reference/api/polars.from_arrow.html)
that currently uses PyArrow containers, and one could imagine other
DataFrame libraries doing the same.

Ultimately, there's a desire to be able to move Arrow data between
different libraries, applications, frameworks, etc., and since Arrow
implementations like C++, Rust, and Go each have containers for
RecordBatches, Arrays, and Scalars, things have been built around and
differentiated by those concepts. Maybe trying to differentiate this
information at runtime isn't the correct path, but I believe there's a
demonstrated desire to be able to differentiate things in a
library-agnostic way.
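As a minimal illustration of why runtime differentiation is hard today: through the C Data Interface, a record batch and a struct column both export a struct-typed schema (format string "+s" per the columnar spec), so the schema alone cannot tell a consumer which was intended. Plain dicts stand in for the ArrowSchema structs here.

```python
# Both export as format "+s" with one child per field/column; nothing in
# the schema itself records whether the producer meant a record batch or
# a struct-typed column.
record_batch = {"format": "+s", "name": "", "children": ["a", "b"]}
struct_column = {"format": "+s", "name": "col", "children": ["a", "b"]}

assert record_batch["format"] == struct_column["format"] == "+s"
assert record_batch["children"] == struct_column["children"]
```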

On Tue, Apr 23, 2024 at 8:37 PM David Li  wrote:

> For scalars: Arrow doesn't define scalars. They're an implementation
> concept. (They may be a *useful* one, but if we want to define them more
> generally, that's a separate discussion.)
>
> For UDFs: UDFs are a system-specific interface. Presumably, that interface
> can encode whether an Arrow array is meant to represent a column or scalar
> (or record batch or ...). Again, because Arrow doesn't define scalars (for
> now...) or UDFs, the UDF interface needs to layer its own semantics on top
> of Arrow.
>
> In other words, I don't think the C Data Interface was meant to be
> something where you can expect to _only_ pass the ArrowDeviceArray around
> and have it encode all the semantics for a particular system, right? The
> UDF example is something where the engine would pass an ArrowDeviceArray
> plus additional context.
>
> > since we can't determine which a given ArrowArray is on its own. In the
> > libcudf situation, it came up with what happens if you pass a non-struct
> > column to the from_arrow_device method which returns a cudf::table?
> Should
> > it error, or should it create a table with a single column?
>
> Presumably it should just error? I can see this being ambiguous if there
> were an API that dynamically returned either a table or a column based on
> the input shape (where before it would be less ambiguous since you'd
> explicitly pass pa.RecordBatch or pa.Array, and now it would be ambiguous
> since you only pass ArrowDeviceArray). But it doesn't sound like that's the
> case?
>
> On Tue, Apr 23, 2024, at 11:15, Weston Pace wrote:
> > I tend to agree with Dewey.  Using run-end-encoding to represent a scalar
> > is clever and would keep the c data interface more compact.  Also, a
> struct
> > array is a superset of a record batch (assuming the metadata is kept in
> the
> > schema).  Consumers should always be able to deserialize into a struct
> > array and then downcast to a record batch if that is what they want to do
> > (raising an error if there happen to be nulls).
> >
> >> Depending on the function in question, it could be valid to pass a
> struct
> >> column vs a record batch with different results.
> >
> > Are there any concrete examples where this is the 

Re: [VOTE] C++: switch to C++17

2022-08-24 Thread Keith Kraus
+1 (non-binding)

On Wed, Aug 24, 2022 at 12:12 PM David Li  wrote:

> +1 (binding)
>
> On Wed, Aug 24, 2022, at 12:06, Ivan Ogasawara wrote:
> > +1 (non-binding)
> >
> > On Wed, Aug 24, 2022 at 12:00 PM Sasha Krassovsky <
> krassovskysa...@gmail.com>
> > wrote:
> >
> >> +1 (non-binding)
> >>
> >> > On 24 Aug 2022, at 08:53, Jacob Wujciak wrote:
> >> > + 1 (non-binding)
> >> >
> >> > Benjamin Kietzman wrote on Wed, 24 Aug 2022, 17:43:
> >> >
> >> >> +1 (binding)
> >> >>
> >> >> On Wed, Aug 24, 2022, 11:32 Antoine Pitrou 
> wrote:
> >> >>
> >> >>> Hello,
> >> >>> I would like to propose that the Arrow C++ implementation switch to
> >> >>> C++17 as its baseline supported version (currently C++11).
> >> >>> The rationale and subsequent discussion can be read in the archives
> >> here:
> >> >>> https://lists.apache.org/thread/9g14n3odhj6kzsgjxr6k6d3q73hg2njr
> >> >>> The exact steps and timeline for switching can be decided later on,
> but
> >> >>> this proposal implies that it could happen soon, possibly next week
> :-)
> >> >>> ... or, more realistically, in the next Arrow C++ release, 10.0.0.
> >> >>> The vote will be open for at least 72 hours.
> >> >>> [ ] +1 Switch to C++17 in the impending future
> >> >>> [ ] +0
> >> >>> [ ] -1 Do not switch to C++17 because...
> >> >>> Regards
> >> >>> Antoine.
> >>
>


Re: cmake FindPackage fails in Windows

2022-08-19 Thread Keith Kraus
The package you're using is from the `defaults` channel as opposed to
conda-forge, which means the community doesn't control the recipe.

My only other thought is that if you're using a system `cmake` instead of
a conda-supplied `cmake`, it won't properly search the conda prefix for
packages and will instead search the system prefix.
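As a rough sketch of what `find_package(Arrow)` does in config mode, here is the prefix search in Python (simplified: real CMake consults many more locations and name casings, and `lib/cmake/Arrow` is just one common install layout):

```python
import os

# Config-mode find_package(Arrow) walks each search prefix (a conda-
# supplied cmake adds the active conda prefix; a system cmake does not)
# looking for ArrowConfig.cmake or arrow-config.cmake.
def find_arrow_config(prefixes):
    for prefix in prefixes:
        for name in ("ArrowConfig.cmake", "arrow-config.cmake"):
            candidate = os.path.join(prefix, "lib", "cmake", "Arrow", name)
            if os.path.exists(candidate):
                return candidate
    return None  # -> "Could not find a package configuration file" error
```

If the conda prefix is never searched, the function returns None, which corresponds to the CMake error shown below.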

-Keith

On Fri, Aug 19, 2022 at 3:26 PM Niranda Perera 
wrote:

> This is what I got as the output.
>
> (temp) PS C:\temp\build> cmake ..
> -- Building for: Visual Studio 17 2022
> -- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19043.
> -- The CXX compiler identification is MSVC 19.31.31107.0
> -- Detecting CXX compiler ABI info
> -- Detecting CXX compiler ABI info - done
> -- Check for working CXX compiler: C:/Program Files (x86)/Microsoft Visual
> Studio/2022/BuildTools/VC/Tools/MSVC/14.31.31103/bin/Hostx64/x64/cl.exe -
> skipped
> -- Detecting CXX compile features
> -- Detecting CXX compile features - done
> CMake Error at CMakeLists.txt:5 (find_package):
>   By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project has
>   asked CMake to find a package configuration file provided by "Arrow", but
>   CMake did not find one.
>
>   Could not find a package configuration file provided by "Arrow" with any
> of
>   the following names:
>
> ArrowConfig.cmake
> arrow-config.cmake
>
>   Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
>   "Arrow_DIR" to a directory containing one of the above files.  If "Arrow"
>   provides a separate development package or SDK, be sure it has been
>   installed.
>
>
> -- Configuring incomplete, errors occurred!
> See also "C:/temp/build/CMakeFiles/CMakeOutput.log".
> (temp) PS C:\temp\build> conda list | findstr "arrow"
> arrow-cpp 8.0.0   py310h38b8b19_0
> pyarrow   8.0.0   py310h26aae1b_0
>
> On Fri, Aug 19, 2022 at 3:21 PM Keith Kraus 
> wrote:
>
> > All recent patches and updates have only been made to 8.0.1, but if you
> > share the package version you're using I was planning to poke around and
> > see if it turns out there's a broken package. That being said, I don't
> see
> > any changes in the commit history for the conda-forge package that should
> > have changed CMake handling between 8.0.0 and 8.0.1.
> >
> > -Keith
> >
> > On Fri, Aug 19, 2022 at 1:31 PM Niranda Perera  >
> > wrote:
> >
> > > Hi Keith,
> > > Interestingly it was working with 8.0.1. So, I am guessing 8.0.0
> Windows
> > > artifacts have been overridden by the point release?
> > >
> > > On Fri, Aug 19, 2022 at 12:00 PM Keith Kraus 
> > > wrote:
> > >
> > > > Hey Niranda,
> > > >
> > > > Could you share exactly which 8.0.0 package you have installed? The
> > > output
> > > > of `conda list arrow-cpp` should show it.
> > > >
> > > > On Thu, Aug 18, 2022 at 9:37 PM Niranda Perera <
> > niranda.per...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > This issue is not there in v9.0.0 as well.
> > > > >
> > > > > On Thu, Aug 18, 2022 at 9:34 PM Niranda Perera <
> > > niranda.per...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have Arrow v8.0.0 installed in my conda environment. Cmake
> > > (v3.24.0)
> > > > is
> > > > > > unable to find Arrow in Windows OS. I have no trouble running it
> in
> > > > Linux
> > > > > > though. We have been previously using arrow v5.0.0 and haven't
> had
> > > any
> > > > > > trouble (I have verified it in a new conda env).
> > > > > >
> > > > > > CMakeLists.txt
> > > > > >
> > > > > > cmake_minimum_required(VERSION 3.22)
> > > > > > project(temp VERSION 1.0 LANGUAGES CXX)
> > > > > >
> > > > > > find_package(Arrow REQUIRED)
> > > > > > message("Arrow_Found ${ARROW_FOUND}")
> > > > > > message("Arrow_VERSION ${ARROW_VERSION}")
> > > > > > message("Arrow_FULL_SO_VERION ${ARROW_FULL_SO_VERSION}")
> > > > > >
> > > > > >
> > > > > > Error in windows:
> > > > > >
> > > > > > CMake Error at 

Re: cmake FindPackage fails in Windows

2022-08-19 Thread Keith Kraus
All recent patches and updates have only been made to 8.0.1, but if you
share the package version you're using I was planning to poke around and
see if it turns out there's a broken package. That being said, I don't see
any changes in the commit history for the conda-forge package that should
have changed CMake handling between 8.0.0 and 8.0.1.

-Keith

On Fri, Aug 19, 2022 at 1:31 PM Niranda Perera 
wrote:

> Hi Keith,
> Interestingly it was working with 8.0.1. So, I am guessing 8.0.0 Windows
> artifacts have been overridden by the point release?
>
> On Fri, Aug 19, 2022 at 12:00 PM Keith Kraus 
> wrote:
>
> > Hey Niranda,
> >
> > Could you share exactly which 8.0.0 package you have installed? The
> output
> > of `conda list arrow-cpp` should show it.
> >
> > On Thu, Aug 18, 2022 at 9:37 PM Niranda Perera  >
> > wrote:
> >
> > > This issue is not there in v9.0.0 as well.
> > >
> > > On Thu, Aug 18, 2022 at 9:34 PM Niranda Perera <
> niranda.per...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have Arrow v8.0.0 installed in my conda environment. Cmake
> (v3.24.0)
> > is
> > > > unable to find Arrow in Windows OS. I have no trouble running it in
> > Linux
> > > > though. We have been previously using arrow v5.0.0 and haven't had
> any
> > > > trouble (I have verified it in a new conda env).
> > > >
> > > > CMakeLists.txt
> > > >
> > > > cmake_minimum_required(VERSION 3.22)
> > > > project(temp VERSION 1.0 LANGUAGES CXX)
> > > >
> > > > find_package(Arrow REQUIRED)
> > > > message("Arrow_Found ${ARROW_FOUND}")
> > > > message("Arrow_VERSION ${ARROW_VERSION}")
> > > > message("Arrow_FULL_SO_VERION ${ARROW_FULL_SO_VERSION}")
> > > >
> > > >
> > > > Error in windows:
> > > >
> > > > CMake Error at CMakeLists.txt:5 (find_package):
> > > >   By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this
> project
> > > has
> > > >   asked CMake to find a package configuration file provided by
> "Arrow",
> > > but
> > > >   CMake did not find one.
> > > >
> > > >   Could not find a package configuration file provided by "Arrow"
> with
> > > any
> > > > of
> > > >   the following names:
> > > >
> > > > ArrowConfig.cmake
> > > > arrow-config.cmake
> > > >
> > > >   Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
> > > >   "Arrow_DIR" to a directory containing one of the above files.  If
> > > "Arrow"
> > > >   provides a separate development package or SDK, be sure it has been
> > > >   installed.
> > > >
> > > > What could be the issue here?
> > > >
> > > > Best
> > > > --
> > > > Niranda Perera
> > > > https://niranda.dev/
> > > > @n1r44 <https://twitter.com/N1R44>
> > > >
> > > >
> > >
> > > --
> > > Niranda Perera
> > > https://niranda.dev/
> > > @n1r44 <https://twitter.com/N1R44>
> > >
> >
>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>


Re: cmake FindPackage fails in Windows

2022-08-19 Thread Keith Kraus
Hey Niranda,

Could you share exactly which 8.0.0 package you have installed? The output
of `conda list arrow-cpp` should show it.

On Thu, Aug 18, 2022 at 9:37 PM Niranda Perera 
wrote:

> This issue is not there in v9.0.0 as well.
>
> On Thu, Aug 18, 2022 at 9:34 PM Niranda Perera 
> wrote:
>
> > Hi all,
> >
> > I have Arrow v8.0.0 installed in my conda environment. Cmake (v3.24.0) is
> > unable to find Arrow in Windows OS. I have no trouble running it in Linux
> > though. We have been previously using arrow v5.0.0 and haven't had any
> > trouble (I have verified it in a new conda env).
> >
> > CMakeLists.txt
> >
> > cmake_minimum_required(VERSION 3.22)
> > project(temp VERSION 1.0 LANGUAGES CXX)
> >
> > find_package(Arrow REQUIRED)
> > message("Arrow_Found ${ARROW_FOUND}")
> > message("Arrow_VERSION ${ARROW_VERSION}")
> > message("Arrow_FULL_SO_VERION ${ARROW_FULL_SO_VERSION}")
> >
> >
> > Error in windows:
> >
> > CMake Error at CMakeLists.txt:5 (find_package):
> >   By not providing "FindArrow.cmake" in CMAKE_MODULE_PATH this project
> has
> >   asked CMake to find a package configuration file provided by "Arrow",
> but
> >   CMake did not find one.
> >
> >   Could not find a package configuration file provided by "Arrow" with
> any
> > of
> >   the following names:
> >
> > ArrowConfig.cmake
> > arrow-config.cmake
> >
> >   Add the installation prefix of "Arrow" to CMAKE_PREFIX_PATH or set
> >   "Arrow_DIR" to a directory containing one of the above files.  If
> "Arrow"
> >   provides a separate development package or SDK, be sure it has been
> >   installed.
> >
> > What could be the issue here?
> >
> > Best
> > --
> > Niranda Perera
> > https://niranda.dev/
> > @n1r44 
> >
> >
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 
>


Re: DISCUSS: [C++] Switch to C++17

2022-08-17 Thread Keith Kraus
+1 (non-binding)

From having previously run a large C++ project that migrated from C++11 to
C++17, there was a huge quality-of-life improvement for developers, and it
made attracting new developers much easier.

One potential pitfall: C++17 wasn't supported by NVIDIA's compilers until
CUDA Toolkit 11.0, which was released in March 2020. Given there are now
userspace options to use newer CUDA Toolkit versions, I don't think this
should hold back the upgrade.

-Keith

On Wed, Aug 17, 2022 at 11:55 AM Jacob Wujciak
 wrote:

> + 1
> After Antoine implemented a work around for an issue with optional::value
> [1][2], we were able to successfully build cpp standalone (cpp_build.sh)
> and the R package [3] on the 10.13 runner that is setup to match the CRAN
> builder. (The run has an R test failure but that is most likely unrelated)
>
> [1]:
> https://stackoverflow.com/questions/44217316/how-do-i-use-stdoptional-in-c
> [2]:
>
> https://github.com/pitrou/arrow/commit/004feb2431a869575d14056fd8f5467db0efa757
> [3]:
>
> https://github.com/ursacomputing/crossbow/runs/7882199240?check_suite_focus=true#step:8:12065
>
> On Wed, Aug 17, 2022 at 5:10 PM Antoine Pitrou  wrote:
>
> >
> > Le 17/08/2022 à 16:52, Weston Pace a écrit :
> > > Sorry for a "one more thing email" but I had one more thought
> > > regarding R 3.6 support for Windows.  I think those users should
> > > continue to be able to use Arrow 10.0.0.
> >
> > Any particular reason why this should be 10.0 and not 9.0 for example?
> > (is due to an incoming feature of note?)
> >
> >
>


Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-03 Thread Keith Kraus
Libcudf / cuDF have supported 32-bit and 64-bit decimals for a few releases
now (as well as 128-bit decimals in the past couple of releases), and
they've generally been received positively by the community. Being able
to round-trip them through Arrow would definitely be nice as well!
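The precisions involved follow directly from the width of the underlying two's-complement integer (a quick sketch, not Arrow code):

```python
import math

# Number of full decimal digits representable in a k-bit signed integer:
# floor(log10(2**(k-1) - 1)).
def max_decimal_precision(bits):
    return int(math.log10(2 ** (bits - 1) - 1))

assert max_decimal_precision(32) == 9    # 32-bit decimal
assert max_decimal_precision(64) == 18   # 64-bit decimal
assert max_decimal_precision(128) == 38  # decimal128
assert max_decimal_precision(256) == 76  # decimal256
```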

On Thu, Mar 3, 2022 at 12:06 PM Micah Kornfield 
wrote:

> I think this makes sense to add these.  Typically when adding new types,
> we've waited  on the official vote until there are two reference
> implementations demonstrating compatibility.
>
> On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou  wrote:
>
> >
> > Hello,
> >
> > Currently, the Arrow format specification restricts the bitwidth of
> > decimal numbers to either 128 or 256 bits.
> >
> > However, there is interest in allowing other bitwidths, at least 32 and
> > 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal
> > datatype would allow for precisions of up to 18 digits (respectively 9
> > digits), which are sufficient for some applications which are mainly
> > looking for exact computations rather than sheer precision. Obviously,
> > smaller datatypes are cheaper to store in memory and cheaper to run
> > computations on.
> >
> > For example, the Spark documentation mentions that some decimal types
> > may fit in a Java int (32 bits) or long (64 bits):
> >
> >
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DecimalType.html
> >
> > ... and a draft PR had even been filed for initial support in the C++
> > implementation (https://github.com/apache/arrow/pull/8578).
> >
> > I am therefore proposing that we relax the wording in the Arrow format
> > specification to also allow 32- and 64-bit decimal types.
> >
> > This is a preliminary discussion to gather opinions and potential
> > counter-arguments against this proposal. If no strong counter-argument
> > emerges, we will probably run a vote in a week or two.
> >
> > Best regards
> >
> > Antoine.
> >
>


Re: [ANNOUNCE] New Arrow PMC chair: Kouhei Sutou

2022-01-25 Thread Keith Kraus
Congrats Kou! Thanks for all of your work!

On Tue, Jan 25, 2022 at 8:04 PM Weston Pace  wrote:

> Congratulations Kou!
>
> On Tue, Jan 25, 2022 at 8:22 AM Neal Richardson
>  wrote:
> >
> > Congratulations!
> >
> > Neal
> >
> > On Tue, Jan 25, 2022 at 12:53 PM Benson Muite <
> benson_mu...@emailplus.org>
> > wrote:
> >
> > > Congratulations Kou!
> > > On 1/25/22 8:44 PM, Vibhatha Abeykoon wrote:
> > > > Congrats Kou!
> > > >
> > > >
> > > > On Tue, Jan 25, 2022 at 11:13 PM Ian Joiner 
> > > wrote:
> > > >
> > > >> Congrats Kou!
> > > >>
> > > >> On Tuesday, January 25, 2022, Wes McKinney 
> wrote:
> > > >>
> > > >>> I am pleased to announce that we have a new PMC chair and VP as per
> > > >>> our newly started tradition of rotating the chair once a year. I
> have
> > > >>> resigned and Kouhei was duly elected by the PMC and approved
> > > >>> unanimously by the board. Please join me in congratulating Kou!
> > > >>>
> > > >>> Thanks,
> > > >>> Wes
> > > >>>
> > > >>
> > > >
> > >
> > >
>


Re: Preparing for version 7.0.0 release

2022-01-13 Thread Keith Kraus
I responded on the JIRA ticket, but this is just the runtime library, which
is backwards compatible and doesn't force an upgrade of the actual compiler
in any way. I have built and run libcudf and cudf in my environment, which
has gcc/g++ 9.4.0 with libstdcxx 11.2.0, without issue.

On Thu, Jan 13, 2022 at 4:15 PM Prem Sagar Gali 
wrote:

> Hi Arrow Devs,
>
> Since the preparations for 7.0.0 release is underway I wanted to bring up
> a conda issue that is affecting rapids suite of libraries that rely on GCC
> 9.x stack:
>
> Issue described here: https://issues.apache.org/jira/browse/ARROW-15330
>
> Regards,
> Prem
>
> From: Alessandro Molina 
> Date: Thursday, January 13, 2022 at 3:37 AM
> To: dev@arrow.apache.org 
> Subject: Re: Preparing for version 7.0.0 release
> External email: Use caution opening links or attachments
>
>
> The skeleton for the Release blog post has been created at
> https://github.com/apache/arrow-site/pull/178/files
>
> If anyone wants to prepare the part related to the environment/bindings
> they work on it would greatly help. I'll do my best to make sure R, C++,
> Python and Java parts are filled.
>
> On Tue, Jan 4, 2022 at 3:27 PM Alessandro Molina <
> alessan...@ursacomputing.com> wrote:
>
> > Quick note that all "Unassigned" issues that were not already started
> have
> > been moved to 8.0.0.
> > End of next week I'll do another pass and move all "Improvements/New
> > Features" that are not yet started to 8.0.0
> >
> > On Tue, Jan 4, 2022 at 10:02 AM Antoine Pitrou 
> wrote:
> >
> >>
> >> Le 03/01/2022 à 15:44, Alessandro Molina a écrit :
> >> > The plan seems to be to cut a release the 2nd or 3rd week of January,
> a
> >> new
> >> > confluence page was made to track progress of the release (
> >> >
> https://cwiki.apache.org/confluence/display/ARROW/Arrow+7.0.0+Release
> >> ).
> >> >
> >> > It would greatly help in the process of preparing for the release if
> you
> >> > could review tickets that are assigned to you in the "TODO Backlog"
> and
> >> > move those that you think you will not be able to close in ~1 week to
> >> > "Version 8.0.0" in Jira, so that we can start preparing release
> >> > announcements etc with a good estimate of what's actually going to end
> >> up
> >> > in the release.
> >>
> >> Note there's also the cpp-7.0.0 version on the Parquet JIRA:
> >>
> https://issues.apache.org/jira/projects/PARQUET/versions/12350844
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>


Re: [ANNOUNCE] New Arrow PMC member: Joris Van den Bossche

2021-11-17 Thread Keith Kraus
Congrats Joris!

On Wed, Nov 17, 2021 at 6:35 PM Weston Pace  wrote:

> Congratulations!  I continue to be grateful for Joris' many contributions
> and advice.  This is great news.
>
> On Wed, Nov 17, 2021, 12:56 PM Wes McKinney  wrote:
>
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Joris Van den Bossche to become a PMC member and we are pleased to
> > announce that Joris has accepted.
> >
> > Congratulations and welcome!
> >
>


Re: Arrow in HPC

2021-10-26 Thread Keith Kraus
Outside of just HPC, integrating UCX would potentially allow taking
advantage of its shared-memory backend, which would be interesting from a
performance perspective in the single-node, multi-process case in many
situations.

Not sure it's worth the UCX dependency in the long run, but it would allow
us to experiment with a lot of different transport backends.

On Tue, Oct 26, 2021 at 10:10 PM Yibo Cai  wrote:

>
> On 10/26/21 10:02 PM, David Li wrote:
> > Hi Yibo,
> >
> > Just curious, has there been more thought on this from your/the HPC side?
>
> Yes. I will investigate the possible approach. Maybe build a quick (and
> dirty) POC test at first.
>
> >
> > I also realized we never asked, what is motivating Flight in this space
> in the first place? Presumably broader Arrow support in general?
>
> No special reason. Will be great if comes up with something useful, or
> an interesting experiment otherwise.
>
> >
> > -David
> >
> > On Fri, Sep 10, 2021, at 12:27, Micah Kornfield wrote:
> >>>
> >>> I would support doing the work necessary to get UCX (or really any
> other
> >>> transport) supported, even if it is a lot of work. (I'm hoping this
> clears
> >>> the path to supporting a Flight-to-browser transport as well; a few
> >>> projects seem to have rolled their own approaches but I think Flight
> itself
> >>> should really handle this, too.)
> >>
> >>
> >> Another possible technical approach is investigating to see if coming up
> >> with a  custom gRPC "channel" implementation for new transports .
> >> Searching around it seems like there were some defunct PRs trying to
> >> enable UCX as one, I didn't look closely enough at why they might have
> >> failed.
> >>
> >> On Thu, Sep 9, 2021 at 11:07 AM David Li  wrote:
> >>
> >>> I would support doing the work necessary to get UCX (or really any
> other
> >>> transport) supported, even if it is a lot of work. (I'm hoping this
> clears
> >>> the path to supporting a Flight-to-browser transport as well; a few
> >>> projects seem to have rolled their own approaches but I think Flight
> itself
> >>> should really handle this, too.)
> >>>
> >>>  From what I understand, you could tunnel gRPC over UCX as Keith
> mentions,
> >>> or directly use UCX, which is what it sounds like you are thinking
> about.
> >>> One idea we had previously was to stick to gRPC for 'control plane'
> >>> methods, and support alternate protocols only for 'data plane' methods
> like
> >>> DoGet - this might be more manageable, depending on what you have in
> mind.
> >>>
> >>> In general - there's quite a bit of work here, so it would help to
> >>> separate the work into phases, and share some more detailed
> >>> design/implementation plans, to make review more manageable. (I
> realize of
> >>> course this is just a general interest check right now.) Just splitting
> >>> gRPC/Flight is going to take a decent amount of work, and (from what
> little
> >>> I understand) using UCX means choosing from various communication
> methods
> >>> it offers and writing a decent amount of scaffolding code, so it would
> be
> >>> good to establish what exactly a 'UCX' transport means. (For instance,
> >>> presumably there's no need to stick to the Protobuf-based wire format,
> but
> >>> what format would we use?)
> >>>
> >>> It would also be good to expand the benchmarks, to validate the
> >>> performance we get from UCX and have a way to compare it against gRPC.
> >>> Anecdotally I've found gRPC isn't quite able to saturate a connection
> so it
> >>> would be interesting to see what other transports can do.
> >>>
> >>> Jed - how would you see MPI and Flight interacting? As another
> >>> transport/alternative to UCX? I admit I'm not familiar with the HPC
> space.
> >>>
> >>> About transferring commands with data: Flight already has an
> app_metadata
> >>> field in various places to allow things like this, it may be
> interesting to
> >>> combine with the ComputeIR proposal on this mailing list, and
> hopefully you
> >>> & your colleagues can take a look there as well.
> >>>
> >>> -David
> >>>
> >>> On Thu, Sep 9, 2021, at 11:24, Jed Brown wrote:
>  Yibo Cai  writes:
> 
> > HPC infrastructure normally leverages RDMA for fast data transfer
> >>> among
> > storage nodes and compute nodes. Computation tasks are dispatched to
> > compute nodes with best fit resources.
> >
> > Concretely, we are investigating porting UCX as Flight transport
> >>> layer.
> > UCX is a communication framework for modern networks. [1]
> > Besides HPC usage, many projects (spark, dask, blazingsql, etc) also
> > adopt UCX to accelerate network transmission. [2][3]
> 
>  I'm interested in this topic and think it's important that even if the
> >>> focus is direct to UCX, that there be some thought into MPI
> >>> interoperability and support for scalable collectives. MPI considers
> UCX to
> >>> be an implementation detail, but the two main implementations (MPICH
> and
> >>> Open MPI) support it 

Re: [C++] Decimal arithmetic edge cases

2021-09-30 Thread Keith Kraus
For another point of reference, here's microsoft's docs for SQL server on
resulting precision and scale for different operators including its
overflow rules:
https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15

-Keith

On Thu, Sep 30, 2021 at 9:42 AM David Li  wrote:

> Hello all,
>
> While looking at decimal arithmetic kernels in ARROW-13130, the question
> of what to do about overflow came up.
>
> Currently, our rules are based on Redshift [1], except we raise an error
> if we exceed the maximum precision (Redshift's docs implies it saturates
> instead). Hence, we can always add/subtract/etc. without checking for
> overflow, but we can't do things like add two decimal256(76, 0) since
> there's no more precision available.
>
> If we were to support this last case, what would people expect the
> unchecked arithmetic kernels to do on overflow? For integers, we wrap
> around, but this doesn't really make sense for decimals; we could also
> return nulls, or just always raise an error (this seems the most reasonable
> to me). Any thoughts?
>
> For reference, for an unchecked add, currently we have:
> "1" (decimal256(75, 0)) + "1" (decimal256(75, 0)) = "2" (decimal256(76, 0))
> "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = error (not enough
> precision)
> "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error (not enough
> precision)
> "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error
> (not enough precision)
>
> Arguably these last three cases should be:
> "1" (decimal128(38, 0)) + "1" (decimal128(38, 0)) = "2" (decimal256(39,
> 0)) (promote to decimal256)
> "1" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = "2" (decimal256(76,
> 0)) (saturate at max precision)
> "99...9 (76 digits)" (decimal256(76, 0)) + "1" (decimal256(76, 0)) = error
> (overflow)
>
> On a related note, you could also argue that we shouldn't increase the
> precision like this, though DBs other than Redshift also do this. Playing
> with DuckDB a bit, though, it doesn't match Redshift: addition/subtraction
> increase precision by 1 like Redshift does, but division results in a
> float, and multiplication only adds the input precisions together, while
> Redshift adds 1 to the sum of the precisions. (That is, decimal128(3, 0) *
> decimal128(3, 0) is decimal128(7, 0) in Redshift/Arrow but decimal128(6, 0)
> in DuckDB.)
>
> [1]:
> https://docs.aws.amazon.com/redshift/latest/dg/r_numeric_computations201.html
>
> -David


Re: [Python] manylinux2014 and _GLIBCXX_USE_CXX11_ABI setting

2021-09-10 Thread Keith Kraus
Apologies, libstdc++ is what I meant (too used to arguing about glibc
compatibility). Devtoolset forcibly disables the `_GLIBCXX_USE_CXX11_ABI`
flag because the mixed linkage model it uses can't support it:
https://bugzilla.redhat.com/show_bug.cgi?id=1546704. RHEL / CentOS 8 does
support the new ABI.
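Concretely, the flag changes which `std::string` type, and hence which mangled symbols, a translation unit uses; the mangled names below are illustrative for a function `void f(std::string)` compiled under each setting:

```python
# With _GLIBCXX_USE_CXX11_ABI=0, std::string is the old copy-on-write
# basic_string (mangled via the "Ss" abbreviation); with the flag set to
# 1 it is std::__cxx11::basic_string, so every symbol taking or returning
# one gains a "__cxx11" component. Objects built with different settings
# therefore fail to link against each other.
old_abi = "_Z1fSs"
new_abi = "_Z1fNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"

assert "__cxx11" in new_abi and "__cxx11" not in old_abi
```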

-Keith

On Fri, Sep 10, 2021 at 11:13 AM Antoine Pitrou  wrote:

>
> Le 10/09/2021 à 17:05, Keith Kraus a écrit :
> > For what it's worth, setting it to 1 as opposed to 0 will make the
> package
> > incompatible with CentOS / RHEL 7 as the glibc they ship does not support
> > the new ABI.
>
> It is not about the glibc, it's about the stdlibc++.
>
>
> >
> > -Keith
> >
> > On Fri, Sep 10, 2021, 4:53 AM Philipp Moritz  wrote:
> >
> >> Ah ok, that makes sense! I'm also not even sure if
> >> _GLIBCXX_USE_CXX11_ABI=0 was ever mandated on manylinux1, it might
> >> just be a community convention.
> >>
> >> I posted
> >>
> >>
> https://discuss.python.org/t/how-to-set-glibcxx-use-cxx11-abi-for-manylinux2014-and-manylinux2010-wheels/10551
> >> ,
> >> we can shift the discussion there.
> >>
> >> On Fri, Sep 10, 2021 at 1:45 AM Antoine Pitrou 
> wrote:
> >>
> >>>
> >>> Le 10/09/2021 à 10:05, Philipp Moritz a écrit :
> >>>> Thanks for your answer Antoine!
> >>>>
> >>>> Considering your first comment, there is a section in
> >>>> https://www.python.org/dev/peps/pep-0571 under "Backwards
> >> compatibility
> >>>> with manylinux1 wheels" that states
> >>>> "manylinux1 wheels are considered manylinux2010 wheels" and the same
> >>> remark
> >>>> in https://www.python.org/dev/peps/pep-0599/ for manylinux2014 about
> >>>> compatibility with both manylinux2010 and manylinux1.
> >>>
> >>> As far as I understand, this sentence is talking about system
> >>> compatibility: if you can use manylinux2010 wheels on a system, you can
> >>> also use manylinux1 wheels. That doesn't necessarily mean a manylinux1
> >>> wheel will nicely interoperate with a manylinux2010 wheel that would
> >>> expose the same symbols.
> >>>
> >>> It seems wheel-to-wheel interoperability is a grey area of the
> manylinux
> >>> specs.  To their credit, though, the issues with C++ symbol / ABI
> >>> conflicts are pretty abstruse and almost impossible to predict.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>
> >
>


Re: [Python] manylinux2014 and _GLIBCXX_USE_CXX11_ABI setting

2021-09-10 Thread Keith Kraus
For what it's worth, setting it to 1 as opposed to 0 will make the package
incompatible with CentOS / RHEL 7 as the glibc they ship does not support
the new ABI.

-Keith

On Fri, Sep 10, 2021, 4:53 AM Philipp Moritz  wrote:

> Ah ok, that makes sense! I'm also not even sure if
> _GLIBCXX_USE_CXX11_ABI=0 was ever mandated on manylinux1, it might
> just be a community convention.
>
> I posted
>
> https://discuss.python.org/t/how-to-set-glibcxx-use-cxx11-abi-for-manylinux2014-and-manylinux2010-wheels/10551
> ,
> we can shift the discussion there.
>
> On Fri, Sep 10, 2021 at 1:45 AM Antoine Pitrou  wrote:
>
> >
> > Le 10/09/2021 à 10:05, Philipp Moritz a écrit :
> > > Thanks for your answer Antoine!
> > >
> > > Considering your first comment, there is a section in
> > > https://www.python.org/dev/peps/pep-0571 under "Backwards
> compatibility
> > > with manylinux1 wheels" that states
> > > "manylinux1 wheels are considered manylinux2010 wheels" and the same
> > remark
> > > in https://www.python.org/dev/peps/pep-0599/ for manylinux2014 about
> > > compatibility with both manylinux2010 and manylinux1.
> >
> > As far as I understand, this sentence is talking about system
> > compatibility: if you can use manylinux2010 wheels on a system, you can
> > also use manylinux1 wheels. That doesn't necessarily mean a manylinux1
> > wheel will nicely interoperate with a manylinux2010 wheel that would
> > expose the same symbols.
> >
> > It seems wheel-to-wheel interoperability is a grey area of the manylinux
> > specs.  To their credit, though, the issues with C++ symbol / ABI
> > conflicts are pretty abstruse and almost impossible to predict.
> >
> > Regards
> >
> > Antoine.
> >
>


Re: [ANNOUNCE] New Arrow committer: Nic Crane

2021-09-09 Thread Keith Kraus
Congrats Nic!

On Thu, Sep 9, 2021 at 11:47 AM Neal Richardson 
wrote:

> On behalf of the Apache Arrow PMC, I'm happy to announce that Nic Crane
> has accepted an invitation to become a committer on Apache Arrow.
>
> Welcome and thank you for your contributions!
>
> Neal
>


Re: Arrow in HPC

2021-09-09 Thread Keith Kraus
There's nothing stopping us from transmitting HTTP/2 or another binary
protocol over UCX. You can think of UCX as a transport layer abstraction
library which allows transparently taking advantage of things like RDMA
over InfiniBand / RoCE, inter-process shared memory, TCP sockets, etc.

The other thing is, if we think about something like handling GPU memory
from Arrow CUDA in Flight, it would be nice to be able to take advantage
of the GPU RDMA features of libraries like UCX and libfabric without
having to implement something like an HTTP/2 parser on the GPU.

-Keith

On Thu, Sep 9, 2021 at 7:48 AM Antoine Pitrou  wrote:

>
> Le 09/09/2021 à 12:34, Yibo Cai a écrit :
> > Hi,
> >
> > We have some rough ideas of applying Flight in HPC (High Performance
> > Computation). Would like to hear comments.
> >
> > HPC infrastructure normally leverages RDMA for fast data transfer among
> > storage nodes and compute nodes. Computation tasks are dispatched to
> > compute nodes with best fit resources.
> >
> > Concretely, we are investigating porting UCX as Flight transport layer.
> > UCX is a communication framework for modern networks. [1]
> > Besides HPC usage, many projects (spark, dask, blazingsql, etc) also
> > adopt UCX to accelerate network transmission. [2][3]
> >
> > I see a recent discussion about decoupling Flight from gRPC. Looks like
> > this is also what we should do first to adapt UCX to Flight.
>
> Is it actually necessary? Or is it possible to transmit HTTP/2 over UCX?
>
> (I'm not fond of gRPC, but decoupling Flight *and* building support for
> another transport layer will be quite a bit of work)
>
> Regards
>
> Antoine.
>


Re: [DISCUSS][Python] Public Cython API

2021-08-25 Thread Keith Kraus
If I remember correctly, the reason cuDF interacts with the Cython code for
IPC is that, in the past, the existing IPC machinery in Arrow didn't work
correctly with GPU memory. If that has been fixed, I think there's a case to
remove this code entirely from cuDF and instruct users to use the higher
level PyArrow IPC APIs instead.

-Keith

On Wed, Aug 25, 2021 at 11:12 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> One example of consumer of our Cython API is cudf (
> https://github.com/rapidsai/cudf).
> I am not very familiar with the package itself, but browsing its code, I
> see that they do for example cimport RecordBatchReader (
>
> https://github.com/rapidsai/cudf/blob/f6d31fa95d9b8d8658301438d0f9ba22a1c131aa/python/cudf/cudf/_lib/gpuarrow.pyx#L20
> ),
> a case that would be impacted by
> https://github.com/apache/arrow/pull/10162
>
> Another question: do we regard `from pyarrow.includes.* import xx` as
> public?
>
> On Fri, 20 Aug 2021 at 12:25, Alessandro Molina <
> alessan...@ursacomputing.com> wrote:
>
> > ...
> >
> > Personally, even at risk of breaking third parties code, I think it would
> > be wise to aim for the minimum exposed surface. I'd consider Cython
> mostly
> > an implementation detail and promote usage of libarrow from C/C++
> directly
> > if you need to work on high performance Python extensions.
> >
>
> Personally I am not sure if such recommendation is necessarily needed or
> useful (re-using our Cython code makes it easier to develop python
> extensions
> that interact with (py)arrow).
>


Re: [VOTE][Format] Clarify allowed value range for the Time types

2021-08-20 Thread Keith Kraus
+1 (non-binding)

On Fri, Aug 20, 2021 at 9:49 AM Rok Mihevc  wrote:

> +1 (non-binding)
>
> On Fri, Aug 20, 2021 at 3:46 PM Jorge Cardoso Leitão
>  wrote:
> >
> > +1
> >
> > On Fri, Aug 20, 2021 at 2:43 PM David Li  wrote:
> >
> > > +1
> > >
> > > On Thu, Aug 19, 2021, at 18:33, Weston Pace wrote:
> > > > +1
> > > >
> > > > On Thu, Aug 19, 2021 at 9:18 AM Wes McKinney 
> > > wrote:
> > > > >
> > > > > +1
> > > > >
> > > > > On Thu, Aug 19, 2021 at 6:20 PM Antoine Pitrou  >
> > > wrote:
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I would like to propose clarifying the allowed value range for
> the
> > > Time
> > > > > > types.  Specifically, I would propose that:
> > > > > >
> > > > > > 1) allowed values fall between 0 (included) and 86400 seconds
> > > > > > (excluded), adjusted for the time unit;
> > > > > >
> > > > > > 2) leap seconds cannot be represented (see above: 86400 is
> outside of
> > > > > > the range of allowed values).
> > > > > >
> > > > > > The vote will be open for at least 72 hours.
> > > > > >
> > > > > > [ ] +1 Accept the proposed clarification
> > > > > > [ ] +0
> > > > > > [ ] -1 Do not accept the proposed clarification because...
> > > > > >
> > > > > > My vote is +1.
> > > > > >
> > > > > > If this proposal is accepted, I will submit a PR to enhance
> > > Schema.fbs
> > > > > > with additional comments.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > >
>
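The proposed constraint, adjusted for the time unit, amounts to a check like this sketch:

```python
# Sketch of the proposed validity check for Time values: allowed values
# fall in [0, 86400) seconds, scaled by the unit, so a leap second
# (86400) is not representable.
UNITS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

def is_valid_time(value, unit):
    limit = 86400 * UNITS_PER_SECOND[unit]
    return 0 <= value < limit

print(is_valid_time(86399, "s"))              # True
print(is_valid_time(86400, "s"))              # False (leap second excluded)
print(is_valid_time(86400 * 10**9 - 1, "ns")) # True
```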


Re: [VOTE][Format] Add in a new interval type can combines Month, Days and Nanoseconds

2021-08-17 Thread Keith Kraus
+1 (non-binding)

On Tue, Aug 17, 2021 at 7:34 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> +1
>
> On Tue, Aug 17, 2021 at 8:50 PM Micah Kornfield 
> wrote:
>
> > Hello,
> > As discussed previously [1], I'd like to call a vote to add a new
> interval
> > type which is a triple of Month, Days, and nanoseconds.  The formal
> > definition is defined in a PR [2] along with Java and C++ implementations
> > that have been verified with integration tests.
> >
> > The PR has gone through one round of code review comments and all have
> been
> > addressed (more are welcome).
> >
> > If this vote passes I will follow-up with the following work items:
> > - Add issues to track implementation in other languages
> > - Extend the C-data interface to support the new type.
> > - Add conversion from C++ to Python/Pandas
> >
> > The vote will remain open for at least 72 hours.
> >
> > Please record your vote:
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Accept the new type and merge the PR after all comments have been
> > addressed
> > [ ] +0
> > [ ] -1 Do not accept the new type because ...
> >
> > My vote is +1.
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
> https://lists.apache.org/thread.html/rd919c4ed8ad2f2827a2d4f665d8da99e545ba92ef992b2e557831751%40%3Cdev.arrow.apache.org%3E
> > [2] https://github.com/apache/arrow/pull/10177
> >
>


Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Keith Kraus
> Personally, I do not care about the speed of IR processing right now.
> Any non-trivial (and probably trivial too) computation done
> by an IR consumer will dwarf the cost of IR processing. Of course,
> we shouldn't prematurely pessimize either, but there's no reason
> to spend time worrying about IR processing performance in my opinion
(yet).

In other processing engines I've fairly commonly seen situations where
the time to build the compute graph becomes non-negligible, or even more
expensive than doing the computation itself. I've even seen situations
where attempts were made to iteratively build a graph while executing in
order to try to overlap the cost of building the graph with the compute
execution.

There's been a huge amount of effort put into optimizing critical kernel
components like the hash table implementation in order to make Arrow the
most performant analytical library possible. Architecting and designing the
IR implementation without performance in mind from the beginning could
potentially put us into a difficult situation later that we'd have to
invest considerably more effort to work our way out of.

On Fri, Aug 13, 2021 at 2:30 PM Weston Pace  wrote:

> I believe you would need a JSON compatible version of the type system
> (including binary values) because you'd need to at least encode
> literals.  However, I don't think that creating a human readable
> encoding of the Arrow type system is a bad thing in and of itself.  We
> have tickets and get questions occasionally asking for a JSON format.
> This could at least be a step in that direction.  I don't think you'd
> need to add support for arrays/batches/tables.  Note, the C++
> implementation has a JSON format that is used for testing purposes
> (though I do not believe it is comprehensive).
>
> I think we could add two (potentially conflicting) requirements
>  * Low barrier to entry for consumers
>  * Low barrier to entry for producers
>
> JSON/YAML seem to lower the barrier to entry for producers.  Some
> producers may not even be working with Arrow data (e.g. could one go
> from SQL-literal -> JSON-literal skipping an intermediate
> Arrow-literal step?).  I think we've also dismissed Antoine's earlier
> point which I found the most compelling.  Handling flatbuffers adds
> one more step that people have to integrate into their build systems.
>
> Flatbuffers on the other hand lowers the barrier to entry for
> consumers.  A consumer is likely already going to have flatbuffers
> support built in so that they can read/write IPC files.  If we adopt
> JSON then the consumer will have to add support for a new file format
> (or at least part of one).
>
> On Fri, Aug 13, 2021 at 6:46 AM Jacob Quinn 
> wrote:
> >
> > >
> > > I just thought of one other requirement: the format needs to support
> > > arbitrary byte sequences.
> > >
> > Can you clarify why this is needed? Is it that custom_metadata maps
> should
> > allow byte sequences as values?
> >
> > On Fri, Aug 13, 2021 at 10:00 AM Phillip Cloud 
> wrote:
> >
> > > On Fri, Aug 13, 2021 at 11:43 AM Antoine Pitrou 
> > > wrote:
> > >
> > > >
> > > > Le 13/08/2021 à 17:35, Phillip Cloud a écrit :
> > > > >
> > > > >> I.e. make the ability to read and write by humans be more
> important
> > > than
> > > > >> speed of validation.
> > > > >
> > > > > I think I differ on whether the IR should be easy to read and
> write by
> > > > > humans.
> > > > > IR is going to be predominantly read and written by machines,
> though of
> > > > > course
> > > > > we will need a way to inspect it for debugging.
> > > >
> > > > But the code executed by machines is written by humans.  I think
> that's
> > > > mostly where the contention resides: is it easy to code, in any given
> > > > language, the routines required to produce or consume the IR?
> > > >
> > >
> > > Definitely not for flatbuffers, since flatbuffers is IMO annoying to
> use in
> > > any language except C++,
> > > and it's borderline annoying there too. Protobuf is similar (less
> annoying
> > > in Rust,
> > > but still annoying in Python and C++ IMO), though I think any binary
> format
> > > is going to be
> > > less human-friendly, by construction.
> > >
> > > If we were to use something like JSON or msgpack, can someone sketch
> out
> > > the interaction
> > > between the IR and the rest of arrow's type system?
> > >
> > > Would we need a JSON-encoded-arrow-type -> in-memory representation
> for an
> > > Arrow type in a given language?
> > >
> > > I just thought of one other requirement: the format needs to support
> > > arbitrary byte sequences. JSON
> > > doesn't support untransformed byte sequences, though it's not uncommon
> to
> > > base64-encode a byte sequence.
> > > IMO that adds an unnecessary layer of complexity, which is another
> tradeoff
> > > to consider.
> > >
>


Re: [Question] what is the purpose of the typeids in the UnionArray?

2021-08-13 Thread Keith Kraus
How would using the typeid directly work with arbitrary Extension types?

-Keith

On Fri, Aug 13, 2021 at 12:49 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> In the UnionArray, there is a level of indirection between types (buffer of
> i8s) -> typeId (i8) -> field. For example, the generated_union part of our
> integration tests has the data:
>
> types: [5, 5, 5, 5, 7, 7, 7, 7, 5, 5, 7] (len = 11)
> typeids: [5, 7]
> fields: [int32, utf8]
>
> My understanding is that, to get the field of item 4, we read types[4] (7),
> look for the index of it in typeids (1), and take the field of index 1
> (utf8), and then read the value (4 or other depending on sparseness).
>
> Does someone know the rationale for the intermediate typeid? I.e. couldn't
> the types contain the index of the field directly [0, 0, 0, 0, 1, 1, 1, 1,
> 0, 0,1] (replace 5 by 0, 7 by 1, and not use typeids)?
>
> Best,
> Jorge
>


Re: [DISCUSS] Dropping support for Visual Studio 2015

2021-08-09 Thread Keith Kraus
+1 as well. Are there any build platforms that we're currently supporting
that still use VS2015?

Conda-forge did its migration ~1.5 years ago:
https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/501.

-Keith

On Mon, Aug 9, 2021 at 12:01 PM Antoine Pitrou  wrote:

>
> +1 for requiring a more recent MSVC version.
>
> Regards
>
> Antoine.
>
>
> Le 09/08/2021 à 17:38, Benjamin Kietzman a écrit :
> > MSVC 19.0 is buggy enough that I for one have spent multiple days
> > reworking code that is fine for all other compilers we test against.
> > Most recently in the context of
> https://github.com/apache/arrow/pull/10793
> > (ARROW-13482) I found that for some types T,
> > `std::is_convertible::value` will be false. This necessitated the
> > following
> > (very hacky) workaround:
> >
> >
> https://github.com/apache/arrow/pull/10793/commits/c44be29686af6fab2132097aa3cbd430d6ac71fe
> >
> >  (Side note: if anybody has a better solution than that specific
> hack,
> >   please don't hesitate to comment on the PR.)
> >
> > Would it be allowable for us to drop support for this compiler? IIUC
> > Microsoft is no longer accepting feedback/bug reports for VS2017, let
> > alone VS2015. Are there any users who depend on libarrow building
> > with that compiler?
> >
>


JIRA Access

2021-07-29 Thread Keith Kraus
Hello,

Could someone give me access to assign myself to JIRA issues?

Would like to assign myself to
https://issues.apache.org/jira/browse/ARROW-13500.

Thanks!
Keith


[jira] [Created] (ARROW-6043) Array equals returns incorrectly if NaNs are in arrays

2019-07-25 Thread Keith Kraus (JIRA)
Keith Kraus created ARROW-6043:
--

 Summary: Array equals returns incorrectly if NaNs are in arrays
 Key: ARROW-6043
 URL: https://issues.apache.org/jira/browse/ARROW-6043
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.14.1
Reporter: Keith Kraus


{code:python}
import numpy as np
import pyarrow as pa

data = [0, 1, np.nan, None, 4]

arr1 = pa.array(data)
arr2 = pa.array(data)

pa.Array.equals(arr1, arr2)
{code}

Unsure if this is expected behavior, but in Arrow 0.12.1 this returned `True` 
as compared to `False` in 0.14.1.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Error building cuDF on new Arrow with std::variant backport

2019-07-23 Thread Keith Kraus
Just following up for anyone watching this thread: this turned out to be an 
NVCC bug that we've reported to the relevant team internally. We moved the 
`ipc.cu` file to `ipc.cpp` and it works as expected with gcc. Thanks everyone!

-Keith

On 7/22/19, 12:52 PM, "Keith Kraus"  wrote:

We're working on that now, will report back once we have something more 
concrete to act on. Thanks!

-Keith

On 7/22/19, 12:51 PM, "Antoine Pitrou"  wrote:


Hi Keith,

Can you try to further reduce your reproducer until you find
the offending construct?

Regards

Antoine.


Le 22/07/2019 à 18:46, Keith Kraus a écrit :
> I temporarily removed the csr related code that has the namespace 
clash and confirmed that the same compilation warnings and errors still occur.
> 
> On 7/20/19, 1:03 AM, "Micah Kornfield"  wrote:
> 
> The namespace collision is a definite possibility, especially if 
you are
> using g++ which seems to be less smart about inferring types vs 
methods
> than clang is.
> 
> On Fri, Jul 19, 2019 at 9:28 PM Paul Taylor 

> wrote:
> 
> > Hi Micah,
> >
> > We were able to build Arrow standalone with both c++ 11 and 14, 
but cuDF
> > needs c++ 14.
> >
> > I found this line[1] in one of our cuda files after sending and 
realized
> > we may have a collision/polluted namespace. Does that sound 
like a
> > possibility?
> >
> > Thanks,
> > Paul
> >
> > 1.
> > 
https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30
> >
> > On 7/19/19 8:41 PM, Micah Kornfield wrote:
> >
> > Hi Paul,
> > This actually looks like it might be a problem with arrow-4800. 
  Did the
> > build of arrow use c++14 or c++11?
> >
> > Thanks,
> > Micah
> >
> > On Friday, July 19, 2019, Paul Taylor 
 wrote:
> >
> >> We're updating cuDF to Arrow 0.14 but encountering errors 
building that
> >> look related to PR #4259 
<https://github.com/apache/arrow/pull/4259>. We
> >> can build Arrow itself, but we can't build cuDF when we 
include Arrow
> >> headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.
> >>
> >> Has anyone seen these before or know of a fix?
> >>
> >> Thanks,
> >>
> >> Paul
> >>
> >> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> 
/cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>>
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In 
member function
> >>> 'void arrow::Result::AssignVariant(mpark::variant >>> const char*>&&)':
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: 
error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant();
> >>> ^
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: 
error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant();
> >>

Re: Error building cuDF on new Arrow with std::variant backport

2019-07-22 Thread Keith Kraus
We're working on that now, will report back once we have something more 
concrete to act on. Thanks!

-Keith

On 7/22/19, 12:51 PM, "Antoine Pitrou"  wrote:


Hi Keith,

Can you try to further reduce your reproducer until you find
the offending construct?

Regards

Antoine.


Le 22/07/2019 à 18:46, Keith Kraus a écrit :
> I temporarily removed the csr related code that has the namespace clash 
and confirmed that the same compilation warnings and errors still occur.
> 
> On 7/20/19, 1:03 AM, "Micah Kornfield"  wrote:
> 
> The namespace collision is a definite possibility, especially if you 
are
> using g++ which seems to be less smart about inferring types vs 
methods
> than clang is.
> 
> On Fri, Jul 19, 2019 at 9:28 PM Paul Taylor 
> wrote:
> 
> > Hi Micah,
> >
> > We were able to build Arrow standalone with both c++ 11 and 14, but 
cuDF
> > needs c++ 14.
> >
> > I found this line[1] in one of our cuda files after sending and 
realized
> > we may have a collision/polluted namespace. Does that sound like a
> > possibility?
> >
> > Thanks,
> > Paul
> >
> > 1.
> > 
https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30
> >
> > On 7/19/19 8:41 PM, Micah Kornfield wrote:
> >
> > Hi Paul,
> > This actually looks like it might be a problem with arrow-4800.   
Did the
> > build of arrow use c++14 or c++11?
> >
> > Thanks,
> > Micah
> >
> > On Friday, July 19, 2019, Paul Taylor  
wrote:
> >
> >> We're updating cuDF to Arrow 0.14 but encountering errors building 
that
> >> look related to PR #4259 
<https://github.com/apache/arrow/pull/4259>. We
> >> can build Arrow itself, but we can't build cuDF when we include 
Arrow
> >> headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.
> >>
> >> Has anyone seen these before or know of a fix?
> >>
> >> Thanks,
> >>
> >> Paul
> >>
> >> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
> >>> warning: attribute does not apply to any entity
> >>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
> >>> warning: attribute does not apply to any entity
> >>>
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member 
function
> >>> 'void arrow::Result::AssignVariant(mpark::variant >>> const char*>&&)':
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: 
error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant();
> >>> ^
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: 
error:
> >>> expected primary-expression before ',' token
> >>>  variant_.~variant();
> >>> ^
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: 
error:
> >>> expected primary-expression before 'const'
> >>>  variant_.~variant();
> >>>   ^
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: 
error:
> >>> expected ')' before 'const'
> >>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member 
function
> >>> 'void arrow::Result::

Re: Error building cuDF on new Arrow with std::variant backport

2019-07-22 Thread Keith Kraus
I temporarily removed the csr related code that has the namespace clash and 
confirmed that the same compilation warnings and errors still occur.

On 7/20/19, 1:03 AM, "Micah Kornfield"  wrote:

The namespace collision is a definite possibility, especially if you are
using g++ which seems to be less smart about inferring types vs methods
than clang is.

On Fri, Jul 19, 2019 at 9:28 PM Paul Taylor 
wrote:

> Hi Micah,
>
> We were able to build Arrow standalone with both c++ 11 and 14, but cuDF
> needs c++ 14.
>
> I found this line[1] in one of our cuda files after sending and realized
> we may have a collision/polluted namespace. Does that sound like a
> possibility?
>
> Thanks,
> Paul
>
> 1.
> 
https://github.com/rapidsai/cudf/blob/branch-0.9/cpp/src/io/convert/csr/cudf_to_csr.cu#L30
>
> On 7/19/19 8:41 PM, Micah Kornfield wrote:
>
> Hi Paul,
> This actually looks like it might be a problem with arrow-4800.   Did the
> build of arrow use c++14 or c++11?
>
> Thanks,
> Micah
>
> On Friday, July 19, 2019, Paul Taylor  wrote:
>
>> We're updating cuDF to Arrow 0.14 but encountering errors building that
>> look related to PR #4259 . We
>> can build Arrow itself, but we can't build cuDF when we include Arrow
>> headers. Using C++ 14 and have tried gcc/g++ 5, 7, and clang.
>>
>> Has anyone seen these before or know of a fix?
>>
>> Thanks,
>>
>> Paul
>>
>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(195):
>>> warning: attribute does not apply to any entity
>>> /cudf/cpp/build/arrow/install/include/arrow/io/interfaces.h(196):
>>> warning: attribute does not apply to any entity
>>>
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
>>> 'void arrow::Result::AssignVariant(mpark::variant>> const char*>&&)':
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:24: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant();
>>> ^
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:32: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant();
>>> ^
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
>>> expected primary-expression before 'const'
>>>  variant_.~variant();
>>>   ^
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:292:34: error:
>>> expected ')' before 'const'
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h: In member function
>>> 'void arrow::Result::AssignVariant(const mpark::variant>> arrow::Status, const char*>&)':
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:24: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant();
>>> ^
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:32: error:
>>> expected primary-expression before ',' token
>>>  variant_.~variant();
>>> ^
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
>>> expected primary-expression before 'const'
>>>  variant_.~variant();
>>>   ^
>>> /cudf/cpp/build/arrow/install/include/arrow/result.h:305:34: error:
>>> expected ')' before 'const'
>>>
>>
>>
>





[jira] [Created] (ARROW-5008) ORC Reader Core Dumps in PyArrow if `/etc/localtime` does not exist

2019-03-25 Thread Keith Kraus (JIRA)
Keith Kraus created ARROW-5008:
--

 Summary: ORC Reader Core Dumps in PyArrow if `/etc/localtime` does 
not exist
 Key: ARROW-5008
 URL: https://issues.apache.org/jira/browse/ARROW-5008
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.12.1, 0.12.0
Reporter: Keith Kraus


In docker containers it's common for `/etc/localtime` to not exist, and if it 
doesn't exist it causes a file-not-found error that is not handled in PyArrow. 
A workaround is to install `tzdata` into the container (at least for Ubuntu), 
but I wanted to report this upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4766) Casting empty boolean array causes segfault

2019-03-04 Thread Keith Kraus (JIRA)
Keith Kraus created ARROW-4766:
--

 Summary: Casting empty boolean array causes segfault
 Key: ARROW-4766
 URL: https://issues.apache.org/jira/browse/ARROW-4766
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Keith Kraus


Reproducer:

{code:python}
import pyarrow as pa

test = pa.array([], type=pa.bool_())
test2 = test.cast(pa.int8())
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4324) [Python] Array dtype inference incorrect when created from list of mixed numpy scalars

2019-01-22 Thread Keith Kraus (JIRA)
Keith Kraus created ARROW-4324:
--

 Summary: [Python] Array dtype inference incorrect when created 
from list of mixed numpy scalars
 Key: ARROW-4324
 URL: https://issues.apache.org/jira/browse/ARROW-4324
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Keith Kraus


Minimal reproducer:
{code:python}
import pyarrow as pa
import numpy as np

test_list = [np.dtype('int32').type(10), np.dtype('float32').type(0.5)]
test_array = pa.array(test_list)

# Expected
# test_array
# 
# [
#   10,
#   0.5
# ]

# Got
# test_array
# 
# [
#   10,
#   0
# ]
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)