Re: Improving PR workload management for Arrow maintainers

2021-06-30 Thread Antoine Pitrou
Le 30/06/2021 à 10:04, Wes McKinney a écrit : I guess my concern with this is how to quickly separate out "PRs I am keeping an eye on". If there are 100 active PRs and only 20 of them are ones you've interacted with, how do you know which ones need your attention? GitHub does have the "reviewe

Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Antoine Pitrou
Le 29/06/2021 à 17:58, Niranda Perera a écrit : So, FWIU, in vector selection, the output array would always have a non-null validity buffer, isn't it? Why? On Tue, Jun 29, 2021 at 11:54 AM Antoine Pitrou wrote: Le 29/06/2021 à 17:49, Niranda Perera a écrit : Hi all, I'

Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Antoine Pitrou
Le 29/06/2021 à 15:25, Wes McKinney a écrit : On Tue, Jun 29, 2021 at 3:10 PM Andrew Lamb wrote: The thing that would make me more efficient reviewing PRs is figuring out which one of the open reviews are ready for additional feedback. Yes, I think this would be the single most significant

Re: [C++] Reducing branching in compute/kernels/vector_selection.cc

2021-06-29 Thread Antoine Pitrou
Le 29/06/2021 à 17:49, Niranda Perera a écrit : Hi all, I'm looking into this now, and I feel like there's a potential memory corruption at the very end of the out_data_ array. algo: bool advance = BitUtil::GetBit(filter_data_, filter_offset_ + in_position); BitUtil::SetBitTo(out_is_valid_,
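As a rough illustration of the bit-level operations quoted above, here is a minimal Python sketch of reading and writing single bits in a bit-packed buffer, assuming least-significant-bit-first ordering as used by Arrow bitmaps. The helpers `get_bit` and `set_bit_to` and the sample data are hypothetical stand-ins, not the Arrow C++ `BitUtil` functions.

```python
def get_bit(bits: bytearray, i: int) -> bool:
    """Return bit i of an LSB-first packed bitmap."""
    return ((bits[i >> 3] >> (i & 7)) & 1) == 1


def set_bit_to(bits: bytearray, i: int, value: bool) -> None:
    """Set bit i to value; raises IndexError if i is past the buffer end."""
    mask = 1 << (i & 7)
    if value:
        bits[i >> 3] |= mask
    else:
        bits[i >> 3] &= ~mask & 0xFF


# Hypothetical filter over 10 slots; the output bitmap also holds 10 bits (2 bytes).
filter_bits = bytearray([0b10110101, 0b00000010])
out_is_valid = bytearray(2)
out_position = 0
for in_position in range(10):
    if get_bit(filter_bits, in_position):
        # Only selected slots advance the output position.
        set_bit_to(out_is_valid, out_position, True)
        out_position += 1
```

In this sketch an out-of-range write raises `IndexError`; an unchecked write in C++ would instead silently touch memory past the end of the buffer.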

Re: [VOTE] Donation of rust arrow2 and parquet2

2021-06-28 Thread Antoine Pitrou
+1 as well (binding) Le 28/06/2021 à 17:28, Ben Kietzman a écrit : +1 (binding) On Mon, Jun 28, 2021 at 5:35 AM Wes McKinney wrote: +1 (binding) On Mon, Jun 28, 2021 at 11:08 AM Daniël Heres wrote: +1 (non binding) Great work Jorge! On Mon, Jun 28, 2021, 10:26 Weston Steimel wrote:

Re: [VOTE] Clarify meaning of timestamp without time zone to equal the concept of "LocalDateTime"

2021-06-25 Thread Antoine Pitrou
Le 24/06/2021 à 21:16, Weston Pace a écrit : The discussion in [1] led to the following proposal which I would like to submit for a vote. --- Arrow allows a timestamp column to omit the time zone property. This has caused confusion because some people have interpreted a timestamp without a ti

Re: [STRAW POLL] (How) should Arrow define storage for "Instant"s

2021-06-24 Thread Antoine Pitrou
Option C. Le 24/06/2021 à 21:24, Weston Pace a écrit : This proposal states that Arrow should define how to encode an Instant into Arrow data. There are several ways this could happen, some which change schema.fbs and some which do not. --- For sample arguments (currently grouped as "for c

Re: [ANNOUNCE] Official media types (MIME types) for Apache Arrow formats

2021-06-24 Thread Antoine Pitrou
Can we document them in the format docs and/or in the FAQ? On Thu, 24 Jun 2021 10:47:34 +0900 (JST) Sutou Kouhei wrote: > Hi, > > The official media types (MIME types) for Apache Arrow > formats are registered to IANA: > > * > https://www.iana.org/assignments/media-types/application/vnd.a

Re: [Python] Drop Python 3.6 and Numpy 1.16 support?

2021-06-24 Thread Antoine Pitrou
, Jun 23, 2021 at 10:36 AM Wes McKinney wrote: This seems reasonable to me. On Wed, Jun 23, 2021 at 11:39 AM Antoine Pitrou wrote: Hello, In https://issues.apache.org/jira/browse/ARROW-12706 it was proposed to drop support for the aforementioned Python and Numpy versions. The rationale is

[Python] Drop Python 3.6 and Numpy 1.16 support?

2021-06-23 Thread Antoine Pitrou
Hello, In https://issues.apache.org/jira/browse/ARROW-12706 it was proposed to drop support for the aforementioned Python and Numpy versions. The rationale is that they have ceased to be supported by Numpy, which is a mandatory dependency of PyArrow. Besides, Pandas (an optional dependenc

Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-23 Thread Antoine Pitrou
On Wed, 23 Jun 2021 07:37:09 -0500 Wes McKinney wrote: > On Wed, Jun 23, 2021 at 3:03 AM Antoine Pitrou wrote: > > > > On Tue, 22 Jun 2021 19:04:49 -0500 > > Wes McKinney wrote: > > > Some on this list might be interested in a new paper out of CMU/MIT > >

Re: [PAPER] Selection vectors and bitmaps for filter results

2021-06-23 Thread Antoine Pitrou
On Tue, 22 Jun 2021 19:04:49 -0500 Wes McKinney wrote: > Some on this list might be interested in a new paper out of CMU/MIT > about the use of selection vectors and bitmaps for handling the > intermediate results of filters: > > https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf > > The resea

Re: [Format] Bounded numbers?

2021-06-22 Thread Antoine Pitrou
On Mon, 21 Jun 2021 23:50:29 -0400 Ying Zhou wrote: > Hi, > > In data people use there are often bounded numbers, mostly integers with > clear and fixed upper and lower bounds but also decimals and floats as well > e.g. test scores, numerous codes in older databases, max temperature of a > cit

Re: Complex Number support in Arrow

2021-06-21 Thread Antoine Pitrou
this need to be implemented? Not necessarily (*). But before thinking about implementation, this proposal must be accepted into the format. Yes, this is a type that has been proposed in the past and I think handles a lot of types not yet in Arrow but have been requested (e.g. IP

Re: [C++] Apache Arrow C++ Variadic Kernels Design

2021-06-18 Thread Antoine Pitrou
Aren't SELECT DISTINCT and COUNT DISTINCT just condensed variants of a GROUP BY query? Do they need to be exposed as standalone kernels? Le 18/06/2021 à 00:58, Ian Cook a écrit : Arrow developers, A couple of recent PRs have added new variadic scalar kernels to the Arrow C++ library (ARROW
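The relationship suggested above can be sketched in plain Python (a toy illustration, not Arrow's compute API): grouping by a column and keeping only the group keys yields the distinct values, and counting the groups yields the distinct count. The helper `group_by` and the sample column are assumptions made for the example.

```python
from collections import defaultdict


def group_by(keys):
    """Map each key to the list of row indices where it occurs."""
    groups = defaultdict(list)
    for row, key in enumerate(keys):
        groups[key].append(row)
    return dict(groups)


column = ["a", "b", "a", "c", "b", "a"]
groups = group_by(column)

distinct_values = list(groups)   # SELECT DISTINCT: just the group keys
count_distinct = len(groups)     # COUNT DISTINCT: the number of groups
```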

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-15 Thread Antoine Pitrou
Le 15/06/2021 à 19:18, Weston Pace a écrit : Thanks for the excellent summary everyone. I agree with these summaries that have been pointed out. It seems like things are moving towards consensus. Hmm, are we so sure? I don't think I've seen widespread agreement about how the spec should be

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-15 Thread Antoine Pitrou
Le 15/06/2021 à 16:53, Adam Hooper a écrit : - *"Datetime"* lets you extract fields, parse strings, format to string. You can't sort (because clocks sometimes go backwards). You can't convert between timestamps and future datetimes (because timezones change). Not true if the timez

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-15 Thread Antoine Pitrou
Le 15/06/2021 à 12:57, Joris Van den Bossche a écrit : A general observation: it might be useful to get back to the message of Julian Hyde in the previous email thread about this 2 weeks ago (https://lists.apache.org/thread.html/r5a89aa20b1cb812dc01a3817a5bfb365971577986d586dcc7ee21e72%40%3Cdev

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-15 Thread Antoine Pitrou
Le 15/06/2021 à 09:31, Joris Van den Bossche a écrit : (but I also don't fully understand your point here, as your "they would get the correct histogram" seems to imply a positive statement for tz-naive timestamps, while your email starts with a +1 on Antoine's proposal which, as far as I un

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Antoine Pitrou
Le 14/06/2021 à 18:47, Wes McKinney a écrit : On Mon, Jun 14, 2021 at 11:33 AM Antoine Pitrou wrote: Le 14/06/2021 à 18:28, Wes McKinney a écrit : Hi Antoine — when there is no time zone specified, I do not think it is appropriate to consider the data to refer to a specific moment in time

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Antoine Pitrou
Le 14/06/2021 à 18:28, Wes McKinney a écrit : Hi Antoine — when there is no time zone specified, I do not think it is appropriate to consider the data to refer to a specific moment in time without applying an explicit time zone localization. Well, how can that be done? The timezone informatio

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Antoine Pitrou
s purely up to the program, just like it is up to the program whether a particular number represents metres, miles, or mass. Naive objects are easy to understand and to work with, at the cost of ignoring some aspects of reality.""" Le 14/06/2021 à 17:57, Antoine Pitrou a écrit :
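The passage quoted above comes from the Python `datetime` documentation. A minimal standard-library sketch of what it means in practice: the same naive wall-clock value maps to different instants depending on which zone the program chooses to attach. The date and the fixed +09:00 offset are arbitrary examples.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2021, 6, 14, 12, 0, 0)    # no tzinfo: "time zone naive"

utc = timezone.utc
plus_nine = timezone(timedelta(hours=9))   # fixed +09:00 offset, for illustration

# Attaching a zone is an interpretation made by the program, not by the data.
as_utc = naive.replace(tzinfo=utc)
as_plus_nine = naive.replace(tzinfo=plus_nine)

# Same wall-clock fields, but the two instants are nine hours apart.
difference = as_utc - as_plus_nine
```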

[Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Antoine Pitrou
Hello, In ARROW-13033, there was a disagreement as to how the specification about timezone-less timestamps should be interpreted. Here is the wording in the Schema specification: /// * If the time zone is null or equal to an empty string, the data is "time /// zone naive" and shall b

Re: Complex Number support in Arrow

2021-06-14 Thread Antoine Pitrou
Le 14/06/2021 à 10:54, Simon Perkins a écrit : > The reason why I am being nit-picky here is I think that having a first class type indicates that it should eventually be supported by all reference implementations. A "well known" extension type I think offers fewer guarantees which makes it

Re: C++ Segmentation Fault RecordBatchReader::ReadNext in CentOS only

2021-06-11 Thread Antoine Pitrou
Le 11/06/2021 à 20:10, Wes McKinney a écrit : So this particular toolchain mix seems to be broken, does everything work if you compile Arrow, the plugin, and the core database with devtoolset-3? I think the weak link is Arrow C++ compiled with a non-devtoolset compiler toolchain. This "toolch

Re: Long title on github page

2021-06-10 Thread Antoine Pitrou
Sounds good enough to me. Le 10/06/2021 à 23:35, Wes McKinney a écrit : I hate to reopen this can of worms again, but here is my effort to synthesize feedback: "Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing." On Thu, Jun 10, 2021 at 12:37

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

2021-06-10 Thread Antoine Pitrou
On Thu, 10 Jun 2021 17:33:23 +0200 Joris Van den Bossche wrote: > > We just merged a PR to add some kernels to extract fields from timestamps > (year, month, day, hour, etc -> ARROW-11759). But once you start with > kernels for timestamp data, you qu

Re: Complex Number support in Arrow

2021-06-10 Thread Antoine Pitrou
Le 10/06/2021 à 09:20, Simon Perkins a écrit : Ah so Arrow Structs are represented as a Struct of Arrays (SoA) vs an Array of Structs (AoS)? If you are not familiar with the Arrow format, I would suggest you start by reading https://arrow.apache.org/docs/format/Columnar.html (see "Struct
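To make the distinction in the question concrete, here is a small Python sketch of the two layouts for values with `real` and `imag` fields. It only illustrates the general idea of interleaved records versus one contiguous array per field; it is not the Arrow memory layout itself, which the linked format document specifies.

```python
from array import array

values = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]   # (real, imag) pairs

# Array of Structs: fields interleaved record by record.
aos = array("d")
for real, imag in values:
    aos.extend([real, imag])

# Struct of Arrays: one contiguous array per field.
soa = {
    "real": array("d", (real for real, _ in values)),
    "imag": array("d", (imag for _, imag in values)),
}

# Reading field "imag" of element 1 under each layout.
imag_from_aos = aos[2 * 1 + 1]
imag_from_soa = soa["imag"][1]
```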

Re: Complex Number support in Arrow

2021-06-09 Thread Antoine Pitrou
On Wed, 9 Jun 2021 15:34:41 -0700 Micah Kornfield wrote: > Hi Antoine, > In regards to conceptual simplicity, I might have misinterpreted when you > wrote: > > Since complex numbers are quite common in some domains, and since they > > are conceptually simple, > > > It seemed like a justificat

Re: Complex Number support in Arrow

2021-06-09 Thread Antoine Pitrou
Le 10/06/2021 à 00:05, Micah Kornfield a écrit : While dedicated types are not strictly required, compute functions would be much easier to add for a first-class dedicated complex datatype rather than for an extension type. It seems like maybe this is an area to focus on? I'm not sure conce

Re: Complex Number support in Arrow

2021-06-09 Thread Antoine Pitrou
Le 09/06/2021 à 17:52, Micah Kornfield a écrit : Adding a new first-class type in Arrow requires working integration tests between C++ and Java libraries (once the idea is informally agreed upon) and then a final vote for approval. We haven't formalized extension types but I imagine a similar

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Antoine Pitrou
Le 09/06/2021 à 19:25, Eduardo Ponce a écrit : Measurable metrics: * code size (source and binary) - measured in bytes [...] Qualitative metrics: * code structure/maintainability - how would it improve development? * code readability - ease of understanding details for new/current contribut

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Antoine Pitrou
On Tue, 8 Jun 2021 17:37:30 -0500 Jonathan Keane wrote: > I've been digging a bit to try and put numbers on those users that Neal > mentions. Specifically, we know that requiring C++17 will mean that R > users on Windows using versions of R before 4.0.0 will not be able to > compile/install arrow.

Re: [C++][Discuss] Switch to C++17

2021-06-09 Thread Antoine Pitrou
On Tue, 8 Jun 2021 14:39:27 -0700 Neal Richardson wrote: > I'm guessing there hasn't been opposition on this thread because the users > that this might affect aren't following this mailing list. > > I'd be interested to see which other major C++ projects out there have > bumped their requirement

Re: Moving automated nightly build e-mails to a separate mailing list

2021-06-09 Thread Antoine Pitrou
Hello, bui...@arrow.apache.org now also has a GMane mirror at gmane.comp.apache.arrow.builds. Regards Antoine. On Sun, 23 May 2021 08:13:37 -0700 Wes McKinney wrote: > hi folks, > > In an effort to increase the signal-to-noise ratio on dev@, I suggest > that we move the [NIGHTLY] e-mails

Re: [C++][Discuss] Switch to C++17

2021-06-08 Thread Antoine Pitrou
eally concerned by this, it would be better to speak up quickly, as otherwise we may decide to move forward with the change. Best regards Antoine. On Thu, 27 May 2021 10:03:03 +0200 Antoine Pitrou wrote: > Hello, > > It seems the only two platforms that constrained us to C++11 will not b

Re: [C++] [DISCUSS] Moving towards a consistent enum naming scheme

2021-06-04 Thread Antoine Pitrou
field wrote: I would prefer kCamelCaps to be inline with the style guide (unless we are too far down a different path). On Fri, Jun 4, 2021 at 12:37 PM Antoine Pitrou wrote: Le 04/06/2021 à 21:34, Weston Pace a écrit : The C++ code base currently has a mix of ALL_CAPS (e.g. arrow::ValueDescr::

Re: [C++] [DISCUSS] Moving towards a consistent enum naming scheme

2021-06-04 Thread Antoine Pitrou
Le 04/06/2021 à 21:34, Weston Pace a écrit : The C++ code base currently has a mix of ALL_CAPS (e.g. arrow::ValueDescr::Shape, seems to be favored in arrow::compute::), CapWords (e.g. arrow::StatusCode), and kCapWords (e.g. arrow::DecimalStatus, not common in arrow:: but used in gandiva:: and t

Re: C++ Migrate from Arrow 0.16.0

2021-06-03 Thread Antoine Pitrou
all a breaking change just prior to 1.0). On Wed, Jun 2, 2021 at 1:00 PM Antoine Pitrou wrote: Le 02/06/2021 à 21:57, Rares Vernica a écrit : Thanks for the pointers! The migration is going well. We have been using Arrow 0.16.0 RecordBatchStreamWriter < https://github.com/Paradigm4/bri

Re: [Format] Timestamp timezone semantics?

2021-06-03 Thread Antoine Pitrou
Le 02/06/2021 à 22:56, Micah Kornfield a écrit : Any SQL interface to Arrow should follow the SQL standard. So, for instance, if a column has TIMESTAMP type, it should behave as a date-time without a time-zone. At least in bigquery we do the following mapping: SQL TIMESTAMP -> Arrow Timesta

Re: C++ Migrate from Arrow 0.16.0

2021-06-02 Thread Antoine Pitrou
Le 02/06/2021 à 21:57, Rares Vernica a écrit : Thanks for the pointers! The migration is going well. We have been using Arrow 0.16.0 RecordBatchStreamWriter with & without CompressedOutputStream and wrote the resultin

Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Antoine Pitrou
Le 02/06/2021 à 14:58, Joris Van den Bossche a écrit : On Wed, 2 Jun 2021 at 13:56, Antoine Pitrou wrote: Hello, For the first time I notice this piece of information about the timestamp type: /// * If the time zone is set to a valid value, values can be displayed as

[Format] Timestamp timezone semantics?

2021-06-02 Thread Antoine Pitrou
Hello, For the first time I notice this piece of information about the timestamp type: /// * If the time zone is set to a valid value, values can be displayed as /// "localized" to that time zone, even though the underlying 64-bit /// integers are identical to the same data stored
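The quoted specification text says that setting a time zone changes how values are displayed, not the stored integers. A small standard-library sketch of that idea, assuming a timestamp stored as seconds since the UNIX epoch and an arbitrary fixed +02:00 offset:

```python
from datetime import datetime, timedelta, timezone

stored = 1622635200   # seconds since the UNIX epoch (2021-06-02 12:00:00 UTC)

utc_view = datetime.fromtimestamp(stored, tz=timezone.utc)
plus_two_view = datetime.fromtimestamp(stored, tz=timezone(timedelta(hours=2)))

# The display strings differ between the two zones...
utc_text = utc_view.isoformat()
plus_two_text = plus_two_view.isoformat()

# ...but both views refer to the same instant and the same stored integer.
same_instant = utc_view == plus_two_view
round_trip = int(plus_two_view.timestamp())
```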

Re: Ordering of encodings?

2021-06-01 Thread Antoine Pitrou
Le 01/06/2021 à 20:46, Micah Kornfield a écrit : I couldn't find anything in the specification on this, but is there any constraint on how disallowed ordering of encoded pages in a column for a row group. I think in practice most types try to dictionary encode first and then fallback to anothe

Re: [C++][Discuss] Switch to C++14

2021-05-27 Thread Antoine Pitrou
going to c++14. While we're at it: which platforms prevent us from using c++17? On Thu, May 27, 2021, 04:03 Antoine Pitrou wrote: Hello, It seems the only two platforms that constrained us to C++11 will not be supported anymore (those platforms are RTools 3.5 for R packages, and many

[C++][Discuss] Switch to C++14

2021-05-27 Thread Antoine Pitrou
Hello, It seems the only two platforms that constrained us to C++11 will not be supported anymore (those platforms are RTools 3.5 for R packages, and manylinux1 for Python packages). It would be beneficial to bump our C++ requirement to C++14. There is an issue open listing benefits: htt

Re: C++ Compression in RecordBatchStreamWriter

2021-05-20 Thread Antoine Pitrou
If you use a CompressedOutputStream, then you get a compressed (e.g. gzip) file. If you want to use the Arrow IPC buffer compression, you need to specify it in IpcWriteOptions. Regards Antoine. Le 20/05/2021 à 18:45, Rares Vernica a écrit : Hello, Just a clarifying question, when a Compr

Re: [Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-19 Thread Antoine Pitrou
Le 19/05/2021 à 07:37, Weston Pace a écrit : I spoke a while ago about working on a multithreaded stress test suite. I have put together some very early details[1]. I would appreciate any feedback. I would recommend writing such tests in Python, such as is already done for the CSV reader.

Re: Language silos and transpilers

2021-05-18 Thread Antoine Pitrou
Le 19/05/2021 à 03:28, Arun Sharma a écrit : On Tue, May 18, 2021 at 5:37 PM Wes McKinney wrote: You just sent this same e-mail 24 hours ago. I think the problems we are solving are different. We are addressing language siloing at the data level and the shared-computing-libraries level. I a

Re: String reverse kernel

2021-05-17 Thread Antoine Pitrou
ng to these multi-emoji glyphs as "emoji ZWJ sequences," and linking to https://unicode.org/emoji/charts/emoji-zwj-sequences.html Ian On Mon, May 17, 2021 at 11:21 AM Antoine Pitrou wrote: Le 17/05/2021 à 17:17, David Li a écrit : A little clarification on my point: it's not tha

Re: String reverse kernel

2021-05-17 Thread Antoine Pitrou
Le 17/05/2021 à 17:17, David Li a écrit : A little clarification on my point: it's not that a single codepoint gets encoded with more than four bytes, it's that a grapheme cluster/human-delimited 'character' might be multiple codepoints, so reversing the individual codepoints may produce an une
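The point about grapheme clusters can be reproduced with the Python standard library: reversing a string codepoint by codepoint moves a combining mark onto a different base character. The sample string is an arbitrary example.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
text = "e\u0301x"

# Reversing by codepoint yields "x", the combining accent, then "e",
# so the accent now attaches to "x" instead of "e".
reversed_codepoints = text[::-1]

# Inspect the codepoint order after the naive reversal.
names = [unicodedata.name(ch) for ch in reversed_codepoints]
```

A reversal that respects what a reader sees as one character would have to keep each base character together with its combining marks, which needs grapheme cluster segmentation rather than codepoint iteration.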

Re: String reverse kernel

2021-05-17 Thread Antoine Pitrou
Le 17/05/2021 à 16:28, Niranda Perera a écrit : Hi all, This is RE: [1] & [2] String reverse kernel. Even though it is a seemingly trivial exercise, I would like to clarify a few things. In the current PR [1], there are 2 reverse kernels, ASCII and UTF8. I'd like to get some feedback for the

Re: [C++][DISCUSS] Implementing interpreted (non-compiled) tests for compute functions

2021-05-15 Thread Antoine Pitrou
mming language such as Python; it doesn’t matter much.) For example, assertThatCall(“foo(1, 2)”, returns(“3”)) might actually call foo with arguments 1 and 2, or it might generate a C++ or Rust test that does the same. Julian On May 14, 2021, at 8:45 AM, Antoine Pitrou wrote: L

Re: [C++][DISCUSS] Implementing interpreted (non-compiled) tests for compute functions

2021-05-14 Thread Antoine Pitrou
Le 14/05/2021 à 15:30, Wes McKinney a écrit : hi folks, As we build more functions (kernels) in the project, I note that the amount of hand-coded C++ code relating to testing function correctness is growing significantly. Many of these tests are quite simple and could be expressed in a text fo

Re: Extending arrow::compute::internal::StringTransform class

2021-05-13 Thread Antoine Pitrou
Hi Niranda, Le 13/05/2021 à 21:00, Niranda Perera a écrit : I am writing a String Reverse kernel [1]. I am extending the arrow::compute::internal::StringTransform class [2] for my impl. StringTransform is expecting the Derived class to implement the following Transform method. bool Transfor

Re: [DISCUSS/QUESTION][C++] Persisting "field id" (or other metadata) through transformation?

2021-05-12 Thread Antoine Pitrou
Le 12/05/2021 à 21:19, Weston Pace a écrit : The parquet format has a "field id" concept (unique integer identifier for a column) that gets promoted in the C++ implementation to a key/value pair in the field's metadata. I don't think anything says the "field id" should be unique. It's just a

Re: [C++] Deciding between "compute function" and "utility function"

2021-05-11 Thread Antoine Pitrou
Le 11/05/2021 à 22:10, Weston Pace a écrit : How does one decide between "utility function" and "compute function"? For example, https://issues.apache.org/jira/browse/ARROW-12739 is very similar to StructArray::Make which is implemented as a static function. However, 12739 would require poo

Re: [ANNOUNCE] New Arrow PMC member: Benjamin Kietzman

2021-05-06 Thread Antoine Pitrou
Congratulations Ben :-) Le 06/05/2021 à 21:02, Rok Mihevc a écrit : Congrats! On Thu, May 6, 2021 at 10:49 AM Krisztián Szűcs wrote: Congrats Ben! On Thu, May 6, 2021 at 9:20 AM Joris Van den Bossche wrote: Congrats! On Thu, 6 May 2021 at 07:03, Weston Pace wrote: Congratulations

Re: [DISCUSS][C++] Refactoring of Expression simplification passes

2021-05-05 Thread Antoine Pitrou
On Wed, 5 May 2021 13:23:36 -0400 Benjamin Kietzman wrote: > Currently, Expressions (used to specify dataset filters and projections) > are simplified by direct rewriting: a filter such as `alpha == 2 and beta > 3` > on a partition where we are guaranteed that `beta == 5` will be rewritten > to
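A toy sketch of the kind of rewriting described above, in plain Python: a conjunction of comparisons is simplified against a guarantee that pins some fields to known values. The tuple representation and the `simplify` helper are hypothetical and far simpler than Arrow's `Expression` machinery.

```python
import operator

OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


def simplify(conjunction, guarantee):
    """Drop terms proven true by the guarantee; return False if one is proven false.

    conjunction: list of (field, op, literal) terms joined by AND.
    guarantee: dict mapping a field name to its known value on this partition.
    """
    remaining = []
    for field, op, literal in conjunction:
        if field in guarantee:
            if not OPS[op](guarantee[field], literal):
                return False   # the whole filter can never match here
            continue           # the term is always true here; drop it
        remaining.append((field, op, literal))
    return remaining


filter_terms = [("alpha", "==", 2), ("beta", ">", 3)]
simplified = simplify(filter_terms, {"beta": 5})
never_matches = simplify(filter_terms, {"beta": 1})
```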

Re: [VOTE] Register media types (MIME types) for Apache Arrow formats to IANA

2021-05-04 Thread Antoine Pitrou
+1 from me. Thank you for doing this! Regards Antoine. Le 04/05/2021 à 13:41, Weston Pace a écrit : Per ARROW-7396 I would like to propose an application to the IANA to register media types for the Arrow IPC formats (both file and streaming). The proposed application is available as [1].

Re: [C++] Adopting a library for (distributed) tracing

2021-05-01 Thread Antoine Pitrou
Hi David, I'm in favor of adopting a tracing library. My main question is: does integrating OpenTracing complicate our build procedure? Is it header-only as long as you use the no-op tracer? Or do you have to build it and link with it nonetheless? The opentracing-cpp documentation seems

Re: [DISCUSS] New Types (Schema.fbs vs Extension Types)

2021-04-30 Thread Antoine Pitrou
I concur with both what Wes and Micah said. As for temporal types, they have wide-spread use and their semantics require dedicated treatment for arithmetic and conversion, so it's helpful to define dedicated types for them, as opposed to mere annotations. Regards Antoine. Le 30/04/2021 à

Re: Independent releases and format version

2021-04-29 Thread Antoine Pitrou
Le 29/04/2021 à 02:26, Weston Pace a écrit : There is also a potential format change coming up (new interval type). Ok, so more accurately, it is not a format change, it's a format addition ;-) This sounds pedantic but a format change would potentially break compatibility (for example if

Re: Independent releases and format version

2021-04-29 Thread Antoine Pitrou
Le 29/04/2021 à 02:26, Weston Pace a écrit : We now have independent releases. There has been some discussion (not sure if it was formalized) around aligning major release versions across the languages. There is also a potential format change coming up (new interval type). I think this bring

Re: [C++][Python] Parquet INT96 overflow for arrow timestamps

2021-04-27 Thread Antoine Pitrou
Hi Karik, I answered in the JIRA itself. Feel free to ask any more questions! Regards Antoine. Le 27/04/2021 à 16:28, Karik Isichei a écrit : Hi there, I previously raised an issue regarding arrow timestamp values overflowing when reading parquet type INT96 ( https://issues.apache.org/ji

Re: nullptr for mutable data in pyarrow table from pandas

2021-04-24 Thread Antoine Pitrou
chance to get this into 4.0.0 this would be a nice one but I suspect the next RC is already under way (it need not block since this bug has been present a long time) On Wed, Apr 21, 2021 at 3:31 AM Antoine Pitrou wrote: It sounds like a bug if is_mutable_ is true but mutable_data_ is nullpt

Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-22 Thread Antoine Pitrou
I tried to verify the source release on Ubuntu 20.04 with ARROW_GANDIVA=0 TEST_JAVA=0 TEST_INTEGRATION=0 TEST_CSHARP=0. It succeeded until the Ruby bindings: + bundle exec ruby test/run-test.rb Traceback (most recent call last): 9: from test/run-test.rb:48:in `' 8: from test/ru

Re: [C++] Indeterminate poor performance of random number generator

2021-04-22 Thread Antoine Pitrou
Le 22/04/2021 à 03:38, Yibo Cai a écrit : Both using same libstdc++. But std::bernoulli_distribution is inlined, so they are indeed different for clang and gcc. https://godbolt.org/z/aT84x5Yec Looks a pure compiler thing. It looks like clang generates calls to logl() and __divtf3() (soft-fl

Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Antoine Pitrou
Le 21/04/2021 à 11:41, Yibo Cai a écrit : On 4/21/21 5:17 PM, Antoine Pitrou wrote: Le 21/04/2021 à 11:14, Yibo Cai a écrit : When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" co

Re: [C++] Indeterminate poor performance of random number generator

2021-04-21 Thread Antoine Pitrou
Le 21/04/2021 à 11:14, Yibo Cai a écrit : When running benchmarks on Arm64 servers, I find some benchmarks are extremely slow when built with clang. E.g., "ModeKernelNarrow/1048576/1" costs 90s to finish. I find almost all the time is spent in generating random bits (prepare test data)[1]

Re: nullptr for mutable data in pyarrow table from pandas

2021-04-21 Thread Antoine Pitrou
It sounds like a bug if is_mutable_ is true but mutable_data_ is nullptr. Regards Antoine. Le 21/04/2021 à 03:17, Weston Pace a écrit : If it comes from pandas (and is eligible for zero-copy) then the buffer implementation will be `NumPyBuffer`. Printing one in GDB yields... ``` $12 = {_v

Re: [Gandiva] Replacing the LRU cache in gandiva

2021-04-20 Thread Antoine Pitrou
Hi Projjal, The main issue here is to compute the cost accurately (is it computation runtime? memory footprint? can you measure the computation time accurately, regardless of system noise - e.g. other threads and processes?). Intuitively, if the LRU cache shows too many misses, a simple mea

Re: [ANNOUNCE] Copying Rust components to new repositories

2021-04-18 Thread Antoine Pitrou
On Sun, 18 Apr 2021 18:27:27 +0200 Jorge Cardoso Leitão wrote: > Yes, we are not touching apache/arrow. > > Does anyone know how to request permissions on github repos? We can't even > see "Settings" atm (we can push to master). A JIRA on INFRA? "Settings" are only available by INFRA AFAIU :-(

Re: [ANNOUNCE] Copying Rust components to new repositories

2021-04-18 Thread Antoine Pitrou
g about the arrow-rs repository here. If however the suggestion is to do it on the main Arrow repository, then I'm entirely opposed to it. Regards Antoine. On Sun, Apr 18, 2021 at 5:09 PM Antoine Pitrou wrote: Le 18/04/2021 à 16:36, Andy Grove a écrit : Hi Wes, We started looking

Re: [ANNOUNCE] Copying Rust components to new repositories

2021-04-18 Thread Antoine Pitrou
Le 18/04/2021 à 16:36, Andy Grove a écrit : Hi Wes, We started looking at the documentation for git filter-branch and it recommends not to use it. It states that "git-filter-branch is riddled with gotchas resulting in various ways to easily corrupt repos or end up with a mess worse than what y

Re: [RUST] parquet2 experiment

2021-04-16 Thread Antoine Pitrou
On Fri, 16 Apr 2021 18:21:50 +0200 Jorge Cardoso Leitão wrote: > > - Integration: it is integration-tested against parquet generated by > pyarrow==3, and round trip tests for the write. Note you can also find files generated by other implementations here: https://github.com/apache/parquet-testin

Re: [VOTE] Move Rust components to new repos and process

2021-04-15 Thread Antoine Pitrou
+0. Regards Antoine. Le 15/04/2021 à 02:04, Andy Grove a écrit : This vote is to determine if the Arrow PMC is in favor of the Rust community moving the Rust implementation of Apache Arrow as well as the related projects (such as Parquet, DataFusion, Ballista, etc) out of the monorepo and i

Re: CI feedback time

2021-04-15 Thread Antoine Pitrou
Le 15/04/2021 à 03:13, Kazuaki Ishizaki a écrit : As we know this is a common issue among Apache projects. While the projects do not have the final solution, Apache Spark project has a mechanism [1][2] to run a test in own local (forked) repository. Can we alleviate the problem a little bit?

Re: CI feedback time

2021-04-14 Thread Antoine Pitrou
Hi Krisztian, Thanks for bringing this up. This is definitely becoming a high-priority topic for Arrow development. I don't believe there is much opportunity for reducing the number of builds or their runtime. We simply have a lot of development going on, and the number of different CI j

Re: 4.0 release preparation

2021-04-11 Thread Antoine Pitrou
Le 10/04/2021 à 23:06, Weston Pace a écrit : Nightly build triage (based on nightly builds from 4/9): Failed Tasks: - conda-linux-gcc-py36-aarch64: ARROW-12324 (conda builds timing out, conda slow) - conda-linux-gcc-py37-aarch64: ARROW-12324 (conda builds timing out, conda slow) - conda-

Re: Rust sync meeting

2021-04-08 Thread Antoine Pitrou
On Thu, 8 Apr 2021 00:26:57 -0700 Julian Hyde wrote: > Antoine, > > I need to correct your assertion > > > we develop on the side every day when we submit PRs from forks; > > it's just a matter of how much complexity is being submitted at once > > Intuitively, there seems to be a continuum be

Re: Rust sync meeting

2021-04-07 Thread Antoine Pitrou
Hi Jorge, I don't think you have done anything inappropriate here. I'm not able to give qualified advice on the arrow2 and parquet2 projects (after all, we develop on the side every day when we submit PRs from forks; it's just a matter of how much complexity is being submitted at once).

Re: [NIGHTLY] Arrow Build Report for Job nightly-2021-04-06-0

2021-04-07 Thread Antoine Pitrou
On Tue, 6 Apr 2021 11:34:20 -0700 Neal Richardson wrote: > I just checked crossbow's logs and the jpype build has been failing since > October 26 (i.e., we're coming up on 6 months of solid failure). I don't > know the use case for why we have it, but if we are going to ignore a > failing build fo

Re: Status of Arrow Julia implementation?

2021-04-03 Thread Antoine Pitrou
Hi Jacob, Le 02/04/2021 à 22:03, Jacob Quinn a écrit : We realize one of the main implications will probably be dropping Julia from the list of "official implementations". We're encouraged by the many users who have already started using the Julia implementation and will strive to maintain a

Re: Arrow sync call March 31 at 12:00 US/Eastern, 16:00 UTC

2021-03-31 Thread Antoine Pitrou
I'm fine with Zoom. But doesn't it need a host as well? Le 31/03/2021 à 18:09, Wes McKinney a écrit : The Google Meet link is on dremio.com, so there must not be someone from the org to let people in. What do folks think about moving to Zoom for future meetings (which shouldn't have this pro

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-03-31 Thread Antoine Pitrou
with simplicity. In the past there has been some reference to people wanting to store very large timestamps (fall out of Nanoseconds max representable value) but we've concluded that this wasn't something that we wanted to really support. On Wed, Mar 31, 2021 at 4:49 AM Antoine Pi
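For context on the nanosecond limit mentioned in the snippet above, here is a minimal sketch of where the representable range ends. It assumes timestamps are stored as signed 64-bit counts of nanoseconds since the Unix epoch; the function names are hypothetical and are not part of any Arrow API.

```python
from datetime import datetime, timedelta

# Assumption: timestamps are signed 64-bit nanosecond counts since 1970-01-01.
INT64_MAX = 2**63 - 1


def max_nanosecond_timestamp() -> datetime:
    """Latest instant representable under the assumption above.

    Python's datetime only has microsecond resolution, so the last
    three (nanosecond) digits are truncated.
    """
    epoch = datetime(1970, 1, 1)
    return epoch + timedelta(microseconds=INT64_MAX // 1000)


def span_in_years() -> float:
    """Approximate length of the positive half of the range, in Julian years."""
    seconds = INT64_MAX / 1e9
    return seconds / (365.25 * 24 * 3600)


if __name__ == "__main__":
    print(max_nanosecond_timestamp())
    print(round(span_in_years(), 2))
```

Under that assumption the range ends in the year 2262, roughly 292 years after the epoch, which is why timestamps further out do not fit at nanosecond resolution.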

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-03-31 Thread Antoine Pitrou
I would favour the following characteristics: - support for nanoseconds (especially as other Arrow temporal types support it) - easy to handle (which excludes the ZetaSQL representation IMHO) OTOH I don't really understand the point of supporting "the most reasonable ranges for Year, Month

[JIRA] Archived "CI" component

2021-03-30 Thread Antoine Pitrou
Hello, I've archived the "CI" component on JIRA since it was redundant with another component named "Continuous Integration". Regards Antoine.

Bintray deprecation

2021-03-29 Thread Antoine Pitrou
Hello, The Apache INFRA team issued a statement about Bintray deprecation, which can be read here (text pasted below): https://mail-archives.apache.org/mod_mbox/www-builds/202103.mbox/%3CCAN0Gg1dSbHnzO%2BQYsq4qAOy94a2Mhwz57JHqx7vjUyj1qt%2BwdA%40mail.gmail.com%3E """ We have secured a replace

Re: Wrap Unwrap Scalars in Cython API

2021-03-23 Thread Antoine Pitrou
Hi Vibhatha, The APIs exist and are declared (in Cython) as: cdef public object pyarrow_wrap_scalar(const shared_ptr[CScalar]& sp_scalar) cdef public shared_ptr[CScalar] pyarrow_unwrap_scalar(object scalar) However, it appears that we forgot to document them. Regards Antoine. Le 23/03/20

Re: [Java] Source control of generated flatbuffers code

2021-03-22 Thread Antoine Pitrou
Le 22/03/2021 à 20:17, bobtins a écrit : TL;DR: The Java implementation doesn't have generated flatbuffers code under source control, and the code generation depends on an unofficially-maintained Maven artifact. Other language implementations do check in the generated code; would it make sense

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-22 Thread Antoine Pitrou
Le 22/03/2021 à 15:29, Benjamin Wilhelm a écrit : Also, I would like to resume the discussion about the Frame format vs the Block format. There were 3 points for the Frame format by Antoine: - it allows streaming compression and decompression (meaning you can avoid loading a huge compressed b

Re: [DISCUSS] Improving Contributor Guidelines

2021-03-21 Thread Antoine Pitrou
when an issue is "NOT ASSIGNED!", it asks if you want to assign to the reporter, and if the reporter is not a "contributor", add them to that role first. That's not the only time it comes up (and if we start having more JIRAs created from GitHub issues/PR by a b

Re: [DISCUSS] Improving Contributor Guidelines

2021-03-20 Thread Antoine Pitrou
at role first. That's not the only time it comes up (and if we start having more JIRAs created from GitHub issues/PR by a bot user, it won't be as helpful), but it would be a start. Neal On Fri, Mar 5, 2021 at 3:08 AM Antoine Pitrou wrote: Le 05/03/2021 à 06:15, Micah Kornfie

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Antoine Pitrou
to read the file, appending all the deltas to yield one set of dictionaries for reassembly. The downside is that the “partial dictionaries” that existed at the time that the file was written are not recoverable, but that seems like an acceptable compromise. On Fri, Mar 19, 2021 at 10:34 AM Antoine
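The idea in the snippet above, appending every delta to end up with a single dictionary, can be sketched in plain Python. This is only an illustration of the concept under simplifying assumptions (dictionaries as lists of strings, columns as lists of integer indices); `merge_dictionary_deltas` and `decode` are hypothetical helpers, not Arrow or pyarrow functions.

```python
from typing import List, Sequence


def merge_dictionary_deltas(
    initial: Sequence[str], deltas: Sequence[Sequence[str]]
) -> List[str]:
    """Append each delta, in order, to the initial dictionary.

    Existing entries keep their positions, so indices written against
    an earlier, shorter dictionary remain valid against the merged one.
    """
    merged = list(initial)
    for delta in deltas:
        merged.extend(delta)
    return merged


def decode(indices: Sequence[int], dictionary: Sequence[str]) -> List[str]:
    """Resolve dictionary-encoded indices to their values."""
    return [dictionary[i] for i in indices]


if __name__ == "__main__":
    merged = merge_dictionary_deltas(["a", "b"], [["c"], ["d", "e"]])
    print(merged)
    print(decode([0, 2, 4], merged))
```

Because deltas only append, a batch written when the dictionary held two entries decodes identically against the merged five-entry dictionary. What is lost is the intermediate state, which matches the trade-off described in the message.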

Re: [ALL] Integration tests for dense and sparse tensor

2021-03-19 Thread Antoine Pitrou
t. If there aren't any Jira issues about adding integration tests, it would make sense to go ahead and open some and clarify the scope of what you would like to see get developed. On Tue, Mar 16, 2021 at 3:25 AM Antoine Pitrou wrote: Hi Fernando, Currently there are no explicit plans t

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-19 Thread Antoine Pitrou
If we want this format to be common to different execution engines then it seems like it should represent logical expressions indeed (which may be implemented by different physical operators, depending on the execution engine). But I'm no expert in the matter. Regards Antoine. Le 18/03/

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Antoine Pitrou
Le 19/03/2021 à 13:37, Wes McKinney a écrit : I am also under the impression that the file format is supposed to support deltas, but not replacements. Is this not implemented in C++? Definitely not. Also I was not aware that the file format was supposed to support deltas. Regards Antoine

Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Antoine Pitrou
It's a bit more configurable, but basically yes. See the IPC write options: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L73 Regards Antoine. Le 18/03/2021 à 16:37, Jacob Quinn a écrit : Ah, interesting. So to make sure I understand correctly, the C++ write imple

Re: [JIRA Permissions] Assigning myself to ARROW-11901

2021-03-18 Thread Antoine Pitrou
Hi Benjamin, This should be done. Regards Antoine. Le 18/03/2021 à 10:58, Benjamin Wilhelm a écrit : Hi all, I would like to contribute to Arrow by working on the performance issues with the newly introduced LZ4 compression in Java (JIRA: https://issues.apache.org/jira/browse/ARROW-11901)
