Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread wish maple
Some configs, like use_threads, default to true in Python but false in C++. Maybe we can fill in all configs explicitly with the same values. Best, Xuwei Fu J N wrote on Thu, Jun 13, 2024 at 13:32: > Hello, > We all know that there inherent overhead in Python, and we wanted to > compare the performance of reading

Re: [VOTE] Release Apache Arrow 16.1.0 - RC1

2024-05-10 Thread wish maple
Ah, only PMC members can cast binding votes. Please regard my vote as non-binding. Best, Xuwei Fu wish maple wrote on Fri, May 10, 2024 at 10:39: > +1 (binding) > > TEST_DEFAULT=0 TEST_CPP=1 ./verify-release-candidate.sh 16.1.0 1 > Release candidate 16.1.0 works well on my M1 MacOS > > Best, > Xuwei Fu >

Re: [VOTE] Release Apache Arrow 16.1.0 - RC1

2024-05-09 Thread wish maple
+1 (binding) TEST_DEFAULT=0 TEST_CPP=1 ./verify-release-candidate.sh 16.1.0 1 Release candidate 16.1.0 works well on my M1 macOS. Best, Xuwei Fu David Li wrote on Fri, May 10, 2024 at 09:30: > +1 (binding) > > Tested sources with Conda on Debian 12/x86_64 (binaries failed due to > download flakiness) > > On

Re: [ANNOUNCE] New Arrow committer: Dane Pitkin

2024-05-07 Thread wish maple
Congrats! Best, Xuwei Fu Joris Van den Bossche wrote on Tue, May 7, 2024 at 21:53: > On behalf of the Arrow PMC, I'm happy to announce that Dane Pitkin has > accepted an invitation to become a committer on Apache Arrow. Welcome, > and thank you for your contributions! > > Joris >

Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore

2024-04-11 Thread wish maple
Congrats! Best, Xuwei Fu Kevin Gurney wrote on Thu, Apr 11, 2024 at 23:22: > Congratulations, Sarah!! Well deserved! > > From: Jacob Wujciak > Sent: Thursday, April 11, 2024 11:14 AM > To: dev@arrow.apache.org > Subject: Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore >

Parquet: Legacy timestamp "adjustToUtc" conversion change in arrow 16.0

2024-04-10 Thread wish maple
Issue [1] mentions a syntax change in Arrow's Parquet support. In general, when reading a Parquet file with legacy timestamps that was not written by Arrow, isAdjustedToUTC is ignored during the read, and filtering such a file does not work. When casting from a

Re: [VOTE] Bulk ingestion support for Flight SQL (vote #2)

2024-04-06 Thread wish maple
+1 (non-binding) Best, Xuwei Fu David Li wrote on Fri, Apr 5, 2024 at 16:38: > Hello, > > Joel Lubinitsky has proposed adding bulk ingestion support to Arrow Flight > SQL [1]. This provides a path for uploading an Arrow dataset to a Flight > SQL server to create or

Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread wish maple
Congrats Joel! Best, Xuwei Fu Matt Topol wrote on Mon, Apr 1, 2024 at 22:59: > On behalf of the Arrow PMC, I'm happy to announce that Joel Lubinitsky has > accepted an invitation to become a committer on Apache Arrow. Welcome, and > thank you for your contributions! > > --Matt >

Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-17 Thread wish maple
Congrats! Best, Xuwei Fu Nic Crane wrote on Mon, Mar 18, 2024 at 10:24: > On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has > accepted an invitation to become a committer on Apache Arrow. Welcome, and > thank you for your contributions! > > Nic >

Re: [C++][Parquet] Add support for writing bloom filter to Parquet file

2024-03-16 Thread wish maple
I was working on this previously [1], but I have forgotten the context. I'll move this forward now. [1] https://github.com/apache/arrow/pull/37400 Best regards, Xuwei Fu Andrei Lazăr wrote on Sun, Mar 17, 2024 at 03:14: > Hi, > > I would like proposing extending the C++ library to add support for writing >

Re: [VOTE] Release Apache Arrow 15.0.1 - RC0

2024-03-05 Thread wish maple
+1 Verified C++ and Python on M1 macOS. Best, Xuwei Fu Raúl Cumplido wrote on Mon, Mar 4, 2024 at 17:05: > Hi, > > I would like to propose the following release candidate (RC0) of Apache > Arrow version 15.0.1. This is a release consisting of 37 > resolved GitHub issues[1]. > > This release candidate is

[DISCUSS] Proposal: Efficient filtering in parquet-cpp

2023-12-29 Thread wish maple
Hi, all. We're proposing page filtering in the parquet-cpp implementation [1]. Currently, parquet-cpp and Arrow only support RowGroup/ColumnChunk-level pruning. Now we can support filtering with the Parquet PageIndex [2]. The interface can also be used to help implement the Iceberg positional delete
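To make the difference between RowGroup-level and page-level pruning concrete, here is a minimal sketch in plain Python. It does not use the parquet-cpp interface from the proposal; `PageStats` and `pages_matching` are hypothetical names, and the min/max values stand in for what a Parquet ColumnIndex would record per page.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical stand-in for one page's min/max statistics and row range."""
    first_row: int
    num_rows: int
    min_value: int
    max_value: int


def pages_matching(pages, lower, upper):
    """Return (first_row, num_rows) for pages that may hold values in [lower, upper]."""
    selected = []
    for page in pages:
        # A page is skipped only when its [min, max] range cannot overlap the predicate.
        if page.max_value < lower or page.min_value > upper:
            continue
        selected.append((page.first_row, page.num_rows))
    return selected


pages = [
    PageStats(first_row=0, num_rows=100, min_value=1, max_value=50),
    PageStats(first_row=100, num_rows=100, min_value=51, max_value=120),
    PageStats(first_row=200, num_rows=100, min_value=121, max_value=300),
]
row_ranges = pages_matching(pages, lower=60, upper=130)
print(row_ranges)
```

With only a single min/max for the whole column chunk (1 to 300 here), nothing could be pruned for this predicate; the per-page statistics allow the first 100 rows to be skipped.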

Re: [VOTE] Release Apache Arrow 14.0.2 - RC3

2023-12-14 Thread wish maple
+1 (binding) Verified C++ and Python on my M1 macOS. Best, Xuwei Fu Jean-Baptiste Onofré wrote on Fri, Dec 15, 2023 at 00:19: > +1 (non binding) > > I checked: > - hash and signature are OK > - build is OK as soon as submodule are added (see the discussion on > another thread) > - LICENSE and NOTICE look

Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread wish maple
Congrats Felipe!!! Best, Xuwei Fu Benjamin Kietzman wrote on Thu, Dec 7, 2023 at 23:42: > On behalf of the Arrow PMC, I'm happy to announce that Felipe Oliveira > Carvalho > has accepted an invitation to become a committer on Apache > Arrow. Welcome, and thank you for your contributions! > > Ben Kietzman >

Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread wish maple
Congrats Andy! Best, Xuwei Fu Andrew Lamb wrote on Mon, Nov 27, 2023 at 20:47: > I am pleased to announce that the Arrow Project has a new PMC chair and VP > as per our tradition of rotating the chair once a year. I have resigned and > Andy Grove was duly elected by the PMC and approved unanimously by the

Re: C++: Code that read parquet into Arrow Arrays?

2023-11-17 Thread wish maple
Hi, The Parquet reading code is divided into an Arrow part and a Parquet part. 1. The lowest layer of the Parquet part is the Parquet decoder, in [1]. Floating-point values may use the PLAIN, RLE_DICTIONARY or BYTE_STREAM_SPLIT encoding. 2. parquet::ColumnReader sits on top of the decoder; each row-group might have one or
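As an illustration of what the lowest decoder layer does for one of the encodings named above, the following is a small pure-Python sketch of BYTE_STREAM_SPLIT for 32-bit floats. It is not the parquet-cpp decoder; the function names are made up for this example, and it assumes little-endian values with stream k holding byte k of every value, as the Parquet format describes.

```python
import struct

FLOAT_WIDTH = 4  # bytes per float32 value


def byte_stream_split_encode(values):
    """Scatter byte k of every little-endian float32 into stream k, then concatenate the streams."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[k::FLOAT_WIDTH] for k in range(FLOAT_WIDTH))


def byte_stream_split_decode(data):
    """Gather byte i of each stream back into value i and unpack as float32."""
    count = len(data) // FLOAT_WIDTH
    streams = [data[k * count:(k + 1) * count] for k in range(FLOAT_WIDTH)]
    raw = bytes(streams[k][i] for i in range(count) for k in range(FLOAT_WIDTH))
    return list(struct.unpack(f"<{count}f", raw))


values = [1.0, 2.5, -3.25]
encoded = byte_stream_split_encode(values)
decoded = byte_stream_split_decode(encoded)
print(encoded.hex(), decoded)
```

The encoding does not shrink the data by itself; grouping the similar bytes of neighbouring values is meant to help the general-purpose compressor that runs afterwards.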

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread wish maple
Congrats Raul! Best, Xuwei Fu Andrew Lamb wrote on Tue, Nov 14, 2023 at 03:28: > The Project Management Committee (PMC) for Apache Arrow has invited > Raúl Cumplido to become a PMC member and we are pleased to announce > that Raúl Cumplido has accepted. > > Please join me in congratulating them. > >

Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread wish maple
Thanks kou and everyone in the Arrow community! I've learned a lot while learning about and contributing to Arrow and Parquet. Thanks for everyone's help. I hope we can bring more fancy features in the future! Best, Xuwei Fu Sutou Kouhei wrote on Mon, Oct 23, 2023 at 12:48: > On behalf of the Arrow PMC, I'm

Re: Apache Arrow file format

2023-10-22 Thread wish maple
> > > > to encode and decode, and instead relies on index structures and > > > > > statistics to accelerate access. > > > > > > > > > > Both are therefore perfectly viable options depending on your > > > particular > > > > > u

Re: Apache Arrow file format

2023-10-17 Thread wish maple
The Arrow IPC file format is great; it focuses on in-memory representation and direct computation. Basically, it supports compression and dictionary encoding, and a file can be deserialized zero-copy into the in-memory Arrow format. Parquet provides some strong functionality, like Statistics, which could help

Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-15 Thread wish maple
Congratulations! Raúl Cumplido wrote on Sun, Oct 15, 2023 at 20:48: > Congratulations and welcome! > > On Sun, Oct 15, 2023, 13:57, Ian Cook wrote: > > > Congratulations Curt! > > > > On Sun, Oct 15, 2023 at 05:32 Andrew Lamb wrote: > > > > > On behalf of the Arrow PMC, I'm happy to announce that Curt

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread wish maple
+1 LGTM, thanks! Ian Cook wrote on Sat, Sep 30, 2023 at 00:49: > +1 (non-binding) > > Thanks very much Felipe for your persistence and your commitment to > addressing the numerous questions and comments that have been raised > since the beginning of the discussion on this in April. > > On Fri, Sep 29, 2023

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
By the way, you can try a memory profiler like [1] or [2]. It would help to find out how the memory is used. Best, Xuwei Fu [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling [2] https://google.github.io/tcmalloc/gperftools.html Felipe Oliveira Carvalho

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
rmation (perhaps > metadata) per file scanned? > > On Wed, Sep 6, 2023 at 12:10 PM wish maple wrote: > > > I've met lots of Parquet Dataset issues. The main problem is that > currently > > we have 2 sets or API > > and they have different scan-options. And sometimes different

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
I've run into lots of Parquet Dataset issues. The main problem is that we currently have two sets of APIs with different scan options, and sometimes different interfaces such as `to_batches()` enable different scan options. I think [2] is similar to your problem. 1-4 are some issues

Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-21 Thread wish maple
+1 (non-binding) It would help a lot when processing UTF-8 related data! Xuwei Andrew Lamb wrote on Tue, Aug 22, 2023 at 00:11: > +1 > > This is a great example of collaboration > > On Sat, Aug 19, 2023 at 4:10 PM Chao Sun wrote: > > > +1 (non-binding)! > > > > On Fri, Aug 18, 2023 at 12:59 PM Felipe

RE: C++: State of parquet 2.x / nanosecond support

2023-07-14 Thread wish maple
Hi Li, Parquet 2.6 has been supported for a long time, and recently Parquet 2.6 became the default version of the Parquet writer in Parquet C++ and Python [1] [2]. So I think you can just use it! However, I don't know whether nanoarrow supports it. Best, Xuwei Fu [1]

Question about TypeHolder in arrow

2023-07-04 Thread wish maple
Hi, While looking into the code of Arrow compute, I found that it uses `TypeHolder` [1], and an expression might call `GetTypes` to get the input or output types. The documentation for `TypeHolder` says that it's a container for dynamically created `shared_ptr`. However, my view is: 1. It's widely used,

Question about nested columnar validity

2023-06-29 Thread wish maple
ity = true`, their offset might point to an invalid position. Am I right? On 2023/06/29 12:10:52 Antoine Pitrou wrote: > > On 29/06/2023 at 13:42, wish maple wrote: > > Thanks all! > > So, in general: > > 1. For our Binary Like [1] format, and List formats [2], i

RE: Question about nested columnar validity

2023-06-29 Thread wish maple
/c6frlr9gcxy8qdhbmv8cn3rdjbrqxb1v [4] https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps Thanks, Xuwei Fu On 2023/06/28 15:03:11 wish maple wrote: > Hi, > > By looking at the arrow standard, when it comes to nested structure, like > StructArray[1] or FixedListArray[2], when parent

Question about nested columnar validity

2023-06-28 Thread wish maple
Hi, Looking at the Arrow standard, when it comes to nested structures like StructArray [1] or FixedSizeListArray [2], when the parent is not valid, the corresponding child is left "undefined". If it's a BinaryArray and its parent is not valid, would a validity member point to an undefined address?
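The relationship between a parent's and a child's validity can be sketched in a few lines of plain Python. This is only an illustration under the Arrow columnar format's rules (LSB-first validity bitmaps, and a null struct slot hiding whatever the child stores there); the helper names are hypothetical and no Arrow library is used.

```python
def is_valid(bitmap: bytes, index: int) -> bool:
    """Validity bitmaps are LSB-first: slot i is bit (i % 8) of byte (i // 8)."""
    return bool((bitmap[index // 8] >> (index % 8)) & 1)


def logical_child_validity(parent_bitmap: bytes, child_bitmap: bytes, length: int):
    """A child slot is observable only if both the parent slot and the child slot are valid."""
    return [
        is_valid(parent_bitmap, i) and is_valid(child_bitmap, i)
        for i in range(length)
    ]


parent = bytes([0b00001011])  # slots 0, 1, 3 valid; slot 2 is a null struct
child = bytes([0b00001101])   # slots 0, 2, 3 valid; slot 1 is a null child value
print(logical_child_validity(parent, child, 4))
```

Slot 2 shows the case asked about: the child's own bitmap marks it valid, but because the parent is null, a reader should not rely on whatever the child buffers contain at that position.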

RE: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread wish maple
On 2023/06/15 16:24:44 Joris Van den Bossche wrote: > Hi all, > > Bringing up https://github.com/apache/arrow/issues/35746 to the > mailing list: this issue proposes to bump the default Parquet version > we use for writing to Parquet files in the C++ library (and in the > various bindings

RE: [DISCUSS] Interest in a 12.0.1 patch?

2023-05-18 Thread wish maple
I have two Parquet-related bug fixes and I wonder if we can release them in 12.0.1. 1. https://github.com/apache/arrow/pull/35428 2. https://github.com/apache/arrow/pull/35520 The bug fixed by patch 1 can make BYTE_STREAM_SPLIT data unreadable if the previous Parquet page is larger than the incoming one. Patch

RE: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread wish maple
I think the ArrayVector can have the following benefits: 1. Converting a Batch in Velox or another system to an Arrow array could be much more lightweight. 2. Modifying, filtering and copying an array or string could be much more lightweight. Velox can make a Vector mutable; it seems that an Arrow array cannot. Seems it

RE: [DISCUSS][C++][Parquet] Expose the API to customize the compression parameter

2023-04-23 Thread wish maple
On 2023/04/23 09:38:02 "Yang, Yang10" wrote: > Hi, > > As discussed in this issue: https://github.com/apache/arrow/issues/35287, currently Arrow only supports one parameter: compression_level to be customized. We would like to make more compression parameters (such as window_bits) customizable
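To show what a second codec parameter such as window_bits looks like next to compression_level, here is a sketch using Python's standard `zlib` module. It is not the Arrow codec API under discussion; the `compress` and `decompress` wrappers are hypothetical helpers, and only `zlib.compressobj` and `zlib.decompress` are real library calls.

```python
import zlib


def compress(data: bytes, level: int = 6, window_bits: int = 15) -> bytes:
    """Compress with an explicit level and window size (2**window_bits bytes of history)."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, window_bits)
    return compressor.compress(data) + compressor.flush()


def decompress(data: bytes, window_bits: int = 15) -> bytes:
    """Decompress; the reader's window must be at least as large as the writer's."""
    return zlib.decompress(data, window_bits)


payload = b"arrow parquet " * 1000
small_window = compress(payload, level=6, window_bits=9)
large_window = compress(payload, level=6, window_bits=15)
print(len(payload), len(small_window), len(large_window))
```

The two parameters are independent knobs: the level trades CPU time for ratio, while the window size bounds how far back matches can be found and how much memory the codec needs.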