Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread wish maple
Some configs, like use_threads, default to true in Python but false in C++.

Maybe we can fill in all configs explicitly with the same values.

Best,
Xuwei Fu

J N  于2024年6月13日周四 13:32写道:

> Hello,
> We all know that there inherent overhead in Python, and we wanted to
> compare the performance of reading data using C++ Arrow against PyArrow for
> high throughput systems. Since I couldn't find any benchmarks online for
> this comparison, I decided to create my own. These programs read a Parquet
> file into arrow::Table in both C++ and Python, and are single threaded.
>
> C++ Arrow benchmark -
> https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
> Pyarrow benchmark -
> https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
>
> PS: I am new to Arrow, so some things might be inefficient in both.
>
> They read a zstd-compressed Parquet file of around 300 MB.
> The results were very different from what we expected.
> *Pyarrow*
> Total time: 5.347517251968384 seconds
>
> *C++ Arrow*
> Total time: 5.86806 seconds
>
> For smaller files (0.5 MB), however, C++ Arrow was better.
>
> *Pyarrow*
> gzip
> Total time: 0.013672113418579102 seconds
>
> *C++ Arrow*
> Total time: 0.00501744 seconds
> (C++ Arrow ~10x faster)
>
> So I have a question for the Arrow experts: is this expected in the Arrow
> world, or is there some error in my benchmark?
>
> Thank you!
>
>
> --
> Warm Regards,
>
> Jay Narale
>


Re: [VOTE] Release Apache Arrow 16.1.0 - RC1

2024-05-10 Thread wish maple
Ah, only PMC members can cast binding votes.

Please regard my vote as non-binding.

Best,
Xuwei Fu

wish maple  于2024年5月10日周五 10:39写道:

> +1 (binding)
>
> TEST_DEFAULT=0 TEST_CPP=1 ./verify-release-candidate.sh 16.1.0 1
> Release candidate 16.1.0 works well on my M1 macOS
>
> Best,
> Xuwei Fu
>
> David Li  于2024年5月10日周五 09:30写道:
>
>> +1 (binding)
>>
>> Tested sources with Conda on Debian 12/x86_64 (binaries failed due to
>> download flakiness)
>>
>> On Fri, May 10, 2024, at 07:02, Rok Mihevc wrote:
>> > +1 (non-binding)
>> >
>> > Ran:
>> > TEST_DEFAULT=0 TEST_SOURCE=1 ./verify-release-candidate.sh 16.1.0 1
>> > On Ubuntu 22.04.1 x86_64
>> >
>> > Thanks for the hard work Raul!
>> >
>> > Rok
>> >
>> > On Thu, May 9, 2024 at 6:51 PM Bryce Mecum 
>> wrote:
>> >
>> >> +1 (non-binding)
>> >>
>> >> I ran TEST_DEFAULT=0 TEST_CPP=1
>> >> ./dev/release/verify-release-candidate.sh 16.1.0 1 on aarch64 macOS
>> >> 14.4.1 with Homebrew. I did run into one failing test which I've filed
>> >> as [1].
>> >>
>> >> [1] https://github.com/apache/arrow/issues/41605
>> >>
>> >> On Thu, May 9, 2024 at 5:05 AM Raúl Cumplido 
>> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I would like to propose the following release candidate (RC1) of
>> Apache
>> >> > Arrow version 16.1.0. This is a release consisting of 35
>> >> > resolved GitHub issues[1].
>> >> >
>> >> > This release candidate is based on commit:
>> >> > 7dd1d34074af176d9e861a360e135ae57b21cf96 [2]
>> >> >
>> >> > The source release rc1 is hosted at [3].
>> >> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
>> >> > The changelog is located at [12].
>> >> >
>> >> > Please download, verify checksums and signatures, run the unit tests,
>> >> > and vote on the release. See [13] for how to validate a release
>> >> candidate.
>> >> >
>> >> > See also a verification result on GitHub pull request [14].
>> >> >
>> >> > The vote will be open for at least 72 hours.
>> >> >
>> >> > [ ] +1 Release this as Apache Arrow 16.1.0
>> >> > [ ] +0
>> >> > [ ] -1 Do not release this as Apache Arrow 16.1.0 because...
>> >> >
>> >> > [1]:
>> >>
>> https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A16.1.0+is%3Aclosed
>> >> > [2]:
>> >>
>> https://github.com/apache/arrow/tree/7dd1d34074af176d9e861a360e135ae57b21cf96
>> >> > [3]:
>> >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-16.1.0-rc1
>> >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
>> >> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
>> >> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
>> >> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
>> >> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/16.1.0-rc1
>> >> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/16.1.0-rc1
>> >> > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/16.1.0-rc1
>> >> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
>> >> > [12]:
>> >>
>> https://github.com/apache/arrow/blob/7dd1d34074af176d9e861a360e135ae57b21cf96/CHANGELOG.md
>> >> > [13]:
>> https://arrow.apache.org/docs/developers/release_verification.html
>> >> > [14]: https://github.com/apache/arrow/pull/41600
>> >>
>>
>


Re: [VOTE] Release Apache Arrow 16.1.0 - RC1

2024-05-09 Thread wish maple
+1 (binding)

TEST_DEFAULT=0 TEST_CPP=1 ./verify-release-candidate.sh 16.1.0 1
Release candidate 16.1.0 works well on my M1 macOS

Best,
Xuwei Fu

David Li  于2024年5月10日周五 09:30写道:

> +1 (binding)
>
> Tested sources with Conda on Debian 12/x86_64 (binaries failed due to
> download flakiness)
>
> On Fri, May 10, 2024, at 07:02, Rok Mihevc wrote:
> > +1 (non-binding)
> >
> > Ran:
> > TEST_DEFAULT=0 TEST_SOURCE=1 ./verify-release-candidate.sh 16.1.0 1
> > On Ubuntu 22.04.1 x86_64
> >
> > Thanks for the hard work Raul!
> >
> > Rok
> >
> > On Thu, May 9, 2024 at 6:51 PM Bryce Mecum  wrote:
> >
> >> +1 (non-binding)
> >>
> >> I ran TEST_DEFAULT=0 TEST_CPP=1
> >> ./dev/release/verify-release-candidate.sh 16.1.0 1 on aarch64 macOS
> >> 14.4.1 with Homebrew. I did run into one failing test which I've filed
> >> as [1].
> >>
> >> [1] https://github.com/apache/arrow/issues/41605
> >>
> >> On Thu, May 9, 2024 at 5:05 AM Raúl Cumplido  wrote:
> >> >
> >> > Hi,
> >> >
> >> > I would like to propose the following release candidate (RC1) of
> Apache
> >> > Arrow version 16.1.0. This is a release consisting of 35
> >> > resolved GitHub issues[1].
> >> >
> >> > This release candidate is based on commit:
> >> > 7dd1d34074af176d9e861a360e135ae57b21cf96 [2]
> >> >
> >> > The source release rc1 is hosted at [3].
> >> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> >> > The changelog is located at [12].
> >> >
> >> > Please download, verify checksums and signatures, run the unit tests,
> >> > and vote on the release. See [13] for how to validate a release
> >> candidate.
> >> >
> >> > See also a verification result on GitHub pull request [14].
> >> >
> >> > The vote will be open for at least 72 hours.
> >> >
> >> > [ ] +1 Release this as Apache Arrow 16.1.0
> >> > [ ] +0
> >> > [ ] -1 Do not release this as Apache Arrow 16.1.0 because...
> >> >
> >> > [1]:
> >>
> https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A16.1.0+is%3Aclosed
> >> > [2]:
> >>
> https://github.com/apache/arrow/tree/7dd1d34074af176d9e861a360e135ae57b21cf96
> >> > [3]:
> >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-16.1.0-rc1
> >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> >> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> >> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> >> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> >> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/16.1.0-rc1
> >> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/16.1.0-rc1
> >> > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/16.1.0-rc1
> >> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> >> > [12]:
> >>
> https://github.com/apache/arrow/blob/7dd1d34074af176d9e861a360e135ae57b21cf96/CHANGELOG.md
> >> > [13]:
> https://arrow.apache.org/docs/developers/release_verification.html
> >> > [14]: https://github.com/apache/arrow/pull/41600
> >>
>


Re: [ANNOUNCE] New Arrow committer: Dane Pitkin

2024-05-07 Thread wish maple
Congrats!

Best,
Xuwei Fu

Joris Van den Bossche  于2024年5月7日周二 21:53写道:

> On behalf of the Arrow PMC, I'm happy to announce that Dane Pitkin has
> accepted an invitation to become a committer on Apache Arrow. Welcome,
> and thank you for your contributions!
>
> Joris
>


Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore

2024-04-11 Thread wish maple
Congrats!

Best,
Xuwei Fu

Kevin Gurney  于2024年4月11日周四 23:22写道:

> Congratulations, Sarah!! Well deserved!
> 
> From: Jacob Wujciak 
> Sent: Thursday, April 11, 2024 11:14 AM
> To: dev@arrow.apache.org 
> Subject: Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore
>
> Congratulations and welcome!
>
> Am Do., 11. Apr. 2024 um 17:11 Uhr schrieb Raúl Cumplido <
> rau...@apache.org
> >:
>
> > Congratulations Sarah!
> >
> > El jue, 11 abr 2024 a las 13:13, Sutou Kouhei ()
> > escribió:
> > >
> > > Hi,
> > >
> > > On behalf of the Arrow PMC, I'm happy to announce that Sarah
> > > Gilmore has accepted an invitation to become a committer on
> > > Apache Arrow. Welcome, and thank you for your contributions!
> > >
> > > Thanks,
> > > --
> > > kou
> >
>


Parquet: Legacy timestamp "adjustToUtc" conversion change in arrow 16.0

2024-04-10 Thread wish maple
The issue [1] describes a semantic change in Arrow's Parquet handling. In
general, when reading from a Parquet file with a legacy timestamp not written
by Arrow, isAdjustedToUTC would previously be ignored during the read, and
filtering on such a file would not work correctly.


When casting from a "deprecated" Parquet 1.0 ConvertedType, a timestamp
should be forced to adjustedToUtc.

Regarding the Parquet standard: Parquet has a ConvertedType for legacy
timestamps, and the legacy timestamp *does not* carry an adjustedToUtc flag.
So, for forward compatibility, when reading it we need to regard it as
adjustedToUtc (a UTC timestamp). See [2][3].

However, as mentioned in [4], Arrow ignores "adjustedToUtc" for legacy files,
so the Arrow Parquet readers in C++ and Go didn't follow the standard before
16.0. Fixing this is a breaking change. I wonder whether this is OK, or
whether we should revert this change in C++ and Go back to the previous
implementation?

Best,
Xuwei Fu

[1] https://github.com/apache/arrow/issues/39489
[2]
https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L480-L485
[3]
https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md?plain=1#L308
[4] https://github.com/apache/arrow/pull/39491#issuecomment-1884465635


Re: [VOTE] Bulk ingestion support for Flight SQL (vote #2)

2024-04-06 Thread wish maple
+1 (non-binding)

Best,
Xuwei Fu

David Li  于2024年4月5日周五 16:38写道:

> Hello,
>
> Joel Lubinitsky has proposed adding bulk ingestion support to Arrow Flight
> SQL [1]. This provides a path for uploading an Arrow dataset to a Flight
> SQL server to create or append to a table, without having to know the
> specifics of the SQL or Substrait support on the server. The functionality
> mimics similar functionality in ADBC. This is the second attempt at a vote
> [3].
>
> Joel has provided reference implementations of this for C++ and Go at [2],
> along with an integration test.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Accept this proposal
> [ ] +0
> [ ] -1 Do not accept this proposal because...
>
> [1]: https://lists.apache.org/thread/mo98rsh20047xljrbfymrks8f2ngn49z
> [2]: https://github.com/apache/arrow/pull/38256
> [3]: https://lists.apache.org/thread/c8n3t0452807wm1ol1hvj41rs1vso3tp
>
> Thanks,
> David


Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread wish maple
Congrats Joel!

Best,
Xuwei Fu

Matt Topol  于2024年4月1日周一 22:59写道:

> On behalf of the Arrow PMC, I'm happy to announce that Joel Lubinitsky has
> accepted an invitation to become a committer on Apache Arrow. Welcome, and
> thank you for your contributions!
>
> --Matt
>


Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-17 Thread wish maple
Congrats!

Best,
Xuwei Fu

Nic Crane  于2024年3月18日周一 10:24写道:

> On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has
> accepted an invitation to become a committer on Apache Arrow. Welcome, and
> thank you for your contributions!
>
> Nic
>


Re: [C++][Parquet] Add support for writing bloom filter to Parquet file

2024-03-16 Thread wish maple
I was working on this previously [1] but forgot the context for it. Now I'll
move this forward.

[1] https://github.com/apache/arrow/pull/37400

Best regards,
Xuwei Fu

Andrei Lazăr  于2024年3月17日周日 03:14写道:

> Hi,
>
> I would like to propose extending the C++ library to add support for writing
> bloom filters to a Parquet file.
>
> Could I please get some thoughts on this?
>
> I have raised this issue on GitHub to keep track of this proposal:
> https://github.com/apache/arrow/issues/40548.
>
> Thank you,
> Andrei
>
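
For readers unfamiliar with the data structure, here is a toy sketch of the bloom-filter idea. The real Parquet bloom filter uses split blocks and xxHash per the parquet-format spec; this class is purely illustrative:

```python
import hashlib

class ToyBloomFilter:
    """Toy bloom filter: k hashed bit positions per value, no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single big int for simplicity

    def _positions(self, value):
        # Derive k positions from seeded SHA-256 (real Parquet uses xxHash).
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def insert(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits & (1 << pos) for pos in self._positions(value))

bf = ToyBloomFilter()
bf.insert("spark")
print(bf.might_contain("spark"))   # True
print(bf.might_contain("absent"))  # False with high probability
```

A writer would persist one such filter per column chunk so readers can skip row groups that definitely do not contain a queried value.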


Re: [VOTE] Release Apache Arrow 15.0.1 - RC0

2024-03-05 Thread wish maple
+1
Verified C++ and Python on M1 macOS

Best,
Xuwei Fu

Raúl Cumplido  于2024年3月4日周一 17:05写道:

> Hi,
>
> I would like to propose the following release candidate (RC0) of Apache
> Arrow version 15.0.1. This is a release consisting of 37
> resolved GitHub issues[1].
>
> This release candidate is based on commit:
> 5ce6ff434c1e7daaa2d7f134349f3ce4c22683da [2]
>
> The source release rc0 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> The changelog is located at [12].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [13] for how to validate a release candidate.
>
> See also a verification result on GitHub pull request [14].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow 15.0.1
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 15.0.1 because...
>
> [1]:
> https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A15.0.1+is%3Aclosed
> [2]:
> https://github.com/apache/arrow/tree/5ce6ff434c1e7daaa2d7f134349f3ce4c22683da
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-15.0.1-rc0
> [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/15.0.1-rc0
> [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/15.0.1-rc0
> [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/15.0.1-rc0
> [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> [12]:
> https://github.com/apache/arrow/blob/5ce6ff434c1e7daaa2d7f134349f3ce4c22683da/CHANGELOG.md
> [13]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> [14]: https://github.com/apache/arrow/pull/40211
>


[DISCUSS] Proposal: Efficient filtering in parquet-cpp

2023-12-29 Thread wish maple
Hi, all.

We're proposing page filtering in the parquet-cpp implementation [1]. Currently,
parquet-cpp and Arrow only support RowGroup/ColumnChunk-level pruning; with the
Parquet PageIndex [2] we can now support page-level filtering. The interface can
also be used to help implement the Iceberg positional-delete format.

Suggestions/observations from discussion on that draft included:
- A RowRanges API in parquet
- Support for reading RowRanges in PageReader, RecordReader and the parquet decoder
- Support for passing RowRanges to FileReader in parquet

Sincerely, Xuwei Fu

[1]
https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM/edit
[2] https://github.com/apache/parquet-format/blob/master/PageIndex.md
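
A toy sketch of the RowRanges idea (names and shapes are illustrative only, not the proposed parquet-cpp interface): each column predicate yields sorted half-open row intervals from the page index, and the reader decodes only their intersection:

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) half-open row ranges."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Row ranges surviving two different column predicates:
print(intersect([(0, 100), (200, 300)], [(50, 250)]))  # [(50, 100), (200, 250)]
```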


Re: [VOTE] Release Apache Arrow 14.0.2 - RC3

2023-12-14 Thread wish maple
+1 (binding)

Verified C++ and Python on my M1 macOS

Best,
Xuwei Fu

Jean-Baptiste Onofré  于2023年12月15日周五 00:19写道:

> +1 (non binding)
>
> I checked:
> - hash and signature are OK
> - build is OK as soon as submodule are added (see the discussion on
> another thread)
> - LICENSE and NOTICE look good (maybe worth updating copyright date)
> - I checked RAT, and some files in the exclude should actually contain
> ASF header. I will propose a PR to improve this, not a blocker for
> release though.
>
> Thanks !
> Regards
> JB
>
> On Wed, Dec 13, 2023 at 10:32 PM Raúl Cumplido  wrote:
> >
> > Hi,
> >
> > I would like to propose the following release candidate (RC3) of Apache
> > Arrow version 14.0.2. This is a release consisting of 30
> > resolved GitHub issues[1].
> >
> > This release candidate is based on commit:
> > 740889f413af9b1ae1d81eb1e5a4a9fb4ce9cf97 [2]
> >
> > The source release rc3 is hosted at [3].
> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> > The changelog is located at [12].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. See [13] for how to validate a release
> candidate.
> >
> > See also a verification result on GitHub pull request [14].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow 14.0.2
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow 14.0.2 because...
> >
> > [1]:
> https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A14.0.2+is%3Aclosed
> > [2]:
> https://github.com/apache/arrow/tree/740889f413af9b1ae1d81eb1e5a4a9fb4ce9cf97
> > [3]:
> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-14.0.2-rc3
> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/14.0.2-rc3
> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/14.0.2-rc3
> > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/14.0.2-rc3
> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > [12]:
> https://github.com/apache/arrow/blob/740889f413af9b1ae1d81eb1e5a4a9fb4ce9cf97/CHANGELOG.md
> > [13]:
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > [14]: https://github.com/apache/arrow/pull/39193
>


Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread wish maple
Congrats Felipe!!!

Best,
Xuwei Fu

Benjamin Kietzman  于2023年12月7日周四 23:42写道:

> On behalf of the Arrow PMC, I'm happy to announce that Felipe Oliveira
> Carvalho
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> Ben Kietzman
>


Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread wish maple
Congrats Andy!

Best,
Xuwei Fu

Andrew Lamb  于2023年11月27日周一 20:47写道:

> I am pleased to announce that the Arrow Project has a new PMC chair and VP
> as per our tradition of rotating the chair once a year. I have resigned and
> Andy Grove was duly elected by the PMC and approved unanimously by the
> board.
>
> Please join me in congratulating Andy Grove!
>
> Thanks,
> Andrew
>


Re: C++: Code that read parquet into Arrow Arrays?

2023-11-17 Thread wish maple
Hi,

The Parquet read path is divided into an arrow part and a parquet part.

1. The lowest layer of the parquet part is the Parquet decoder, in [1].
Floating-point data might use PLAIN, RLE_DICTIONARY or BYTE_STREAM_SPLIT
encoding.
2. parquet::ColumnReader is applied above the decoder. Each row group might
have one or two encodings per column (if dictionary encoding is chosen and
falls back to plain, there are two encodings in a RowGroup for that column).
This is in [2].

Other modules are mentioned by Bryce.

Best,
Xuwei Fu

[1] https://github.com/apache/arrow/blob/main/cpp/src/parquet/encoding.cc
[2]
https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc

Li Jin  于2023年11月18日周六 05:27写道:

> Hi,
>
> I am recently investigating a null/nan issue with Parquet and Arrow and
> wonder if someone can give me a pointer to the code that decodes Parquet
> row group into Arrow float/double arrays?
>
> Thanks,
> Li
>


Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread wish maple
Congrats Raul!

Best,
Xuwei Fu

Andrew Lamb  于2023年11月14日周二 03:28写道:

> The Project Management Committee (PMC) for Apache Arrow has invited
> Raúl Cumplido  to become a PMC member and we are pleased to announce
> that  Raúl Cumplido has accepted.
>
> Please join me in congratulating them.
>
> Andrew
>


Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread wish maple
Thanks kou and every nice person in the Arrow community!

I've learned a lot while learning about and contributing to Arrow and
Parquet. Thanks for everyone's help.
I hope we can bring more fancy features in the future!

Best,
Xuwei Fu

Sutou Kouhei  于2023年10月23日周一 12:48写道:

> On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> --
> kou
>


Re: Apache Arrow file format

2023-10-22 Thread wish maple
To further what others have already mentioned, the IPC file format
> is
> > > > > primarily optimised for IPC use-cases, that is exchanging the
> entire
> > > > > contents between processes. It is relatively inexpensive to encode
> > and
> > > > > decode, and supports all arrow datatypes, making it ideal for
> things
> > > > > like spill-to-disk processing, distributed shuffles, etc...
> > > > >
> > > > > Parquet by comparison is a storage format, optimised for space
> > > > > efficiency and selective querying, with [1] containing an overview
> of
> > > > > the various techniques the format affords. It is comparatively
> > > expensive
> > > > > to encode and decode, and instead relies on index structures and
> > > > > statistics to accelerate access.
> > > > >
> > > > > Both are therefore perfectly viable options depending on your
> > > particular
> > > > > use-case.
> > > > >
> > > > > [1]:
> > > > >
> > > > >
> > >
> >
> https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
> > > > >
> > > > > On 18/10/2023 13:59, Dewey Dunnington wrote:
> > > > > > Plenty of opinions here already, but I happen to think that IPC
> > > > > > streams and/or Arrow File/Feather are wildly underutilized. For
> the
> > > > > > use-case where you're mostly just going to read an entire file
> > into R
> > > > > > or Python it's a bit faster (and far superior to a CSV or
> pickling
> > or
> > > > > > .rds files in R).
> > > > > >
> > > > > >> you're going to read all the columns for a record batch in the
> > > file, no
> > > > > matter what
> > > > > > The metadata for each every column in every record batch has to
> be
> > > > > > read, but there's nothing inherent about the format that prevents
> > > > > > selectively loading into memory only the required buffers. (I
> don't
> > > > > > know off the top of my head if any reader implementation actually
> > > does
> > > > > > this).
> > > > > >
> > > > > > On Wed, Oct 18, 2023 at 12:02 AM wish maple <
> > maplewish...@gmail.com>
> > > > > wrote:
> > > > > >> Arrow IPC file is great, it focuses on in-memory representation
> > and
> > > > > direct
> > > > > >> computation.
> > > > > >> Basically, it can support compression and dictionary encoding,
> and
> > > can
> > > > > >> zero-copy
> > > > > >> deserialize the file to memory Arrow format.
> > > > > >>
> > > > > >> Parquet provides some strong functionality, like Statistics,
> which
> > > could
> > > > > >> help pruning
> > > > > >> unnecessary data during scanning and avoid cpu and io cust. And
> it
> > > has
> > > > > high
> > > > > >> efficient
> > > > > >> encoding, which could make the Parquet file smaller than the
> Arrow
> > > IPC
> > > > > file
> > > > > >> under the same
> > > > > >> data. However, currently some arrow data type cannot be convert
> to
> > > > > >> correspond Parquet type
> > > > > >> in the current arrow-cpp implementation. You can goto the arrow
> > > > > document to
> > > > > >> take a look.
> > > > > >>
> > > > > >> Adam Lippai  于2023年10月18日周三 10:50写道:
> > > > > >>
> > > > > >>> Also there is
> > > > > >>> https://github.com/lancedb/lance between the two formats.
> > > Depending
> > > > > on the
> > > > > >>> use case it can be a great choice.
> > > > > >>>
> > > > > >>> Best regards
> > > > > >>> Adam Lippai
> > > > > >>>
> > > > > >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol <
> zotthewiz...@gmail.com
> > >
> > > > > wrote:
> > > > > >>>
> > > > > >>>> One benefit of the feather format (i.e. Arrow IPC file format)
> > is
> > > the
> > > > > >>>> ability to mmap the file to easily handle reading sections of
> a
> > > larger
> > > > > >>> than
> > > > > >>>> memory file of data. Since, as Felipe mentioned, the format is
> > > > > focused on
> > > > > >>>> in-memory representation, you can easily and simply mmap the
> > file
> > > and
> > > > > use
> > > > > >>>> the raw bytes directly. For a large file that you only want to
> > > read
> > > > > >>>> sections of, this can be beneficial for IO and memory usage.
> > > > > >>>>
> > > > > >>>> Unfortunately, you are correct that it doesn't allow for easy
> > > column
> > > > > >>>> projecting (you're going to read all the columns for a record
> > > batch in
> > > > > >>> the
> > > > > >>>> file, no matter what). So it's going to be a trade off based
> on
> > > your
> > > > > >>> needs
> > > > > >>>> as to whether it makes sense, or if you should use a file
> format
> > > like
> > > > > >>>> Parquet instead.
> > > > > >>>>
> > > > > >>>> -Matt
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> > > > > >>>> felipe...@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> It’s not the best since the format is really focused on in-
> > > memory
> > > > > >>>>> representation and direct computation, but you can do it:
> > > > > >>>>>
> > > > > >>>>> https://arrow.apache.org/docs/python/feather.html
> > > > > >>>>>
> > > > > >>>>> —
> > > > > >>>>> Felipe
> > > > > >>>>>
> > > > > >>>>> On Tue, 17 Oct 2023 at 23:26 Nara <
> > > narayanan.arunacha...@gmail.com>
> > > > > >>>> wrote:
> > > > > >>>>>> Hi,
> > > > > >>>>>>
> > > > > >>>>>> Is it a good idea to use Apache Arrow as a file format?
> Looks
> > > like
> > > > > >>>>>> projecting columns isn't available by default.
> > > > > >>>>>>
> > > > > >>>>>> One of the benefits of Parquet file format is column
> > projection,
> > > > > >>> where
> > > > > >>>>> the
> > > > > >>>>>> IO is limited to just the columns projected.
> > > > > >>>>>>
> > > > > >>>>>> Regards ,
> > > > > >>>>>> Nara
> > > > > >>>>>>
> > > > >
> > >
> >
> >
>


Re: Apache Arrow file format

2023-10-17 Thread wish maple
The Arrow IPC file format is great: it focuses on the in-memory representation
and direct computation. It supports compression and dictionary encoding, and an
uncompressed file can be zero-copy deserialized into the in-memory Arrow format.

Parquet provides some strong functionality, like statistics, which can help
prune unnecessary data during scanning and avoid CPU and IO cost. It also has
highly efficient encodings, which can make a Parquet file smaller than the Arrow
IPC file for the same data. However, some Arrow data types currently cannot be
converted to a corresponding Parquet type in the arrow-cpp implementation; you
can check the Arrow documentation for details.

Adam Lippai  于2023年10月18日周三 10:50写道:

> Also there is
> https://github.com/lancedb/lance between the two formats. Depending on the
> use case it can be a great choice.
>
> Best regards
> Adam Lippai
>
> On Tue, Oct 17, 2023 at 22:44 Matt Topol  wrote:
>
> > One benefit of the feather format (i.e. Arrow IPC file format) is the
> > ability to mmap the file to easily handle reading sections of a larger
> than
> > memory file of data. Since, as Felipe mentioned, the format is focused on
> > in-memory representation, you can easily and simply mmap the file and use
> > the raw bytes directly. For a large file that you only want to read
> > sections of, this can be beneficial for IO and memory usage.
> >
> > Unfortunately, you are correct that it doesn't allow for easy column
> > projecting (you're going to read all the columns for a record batch in
> the
> > file, no matter what). So it's going to be a trade off based on your
> needs
> > as to whether it makes sense, or if you should use a file format like
> > Parquet instead.
> >
> > -Matt
> >
> >
> > On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> > felipe...@gmail.com>
> > wrote:
> >
> > > It’s not the best since the format is really focused on in- memory
> > > representation and direct computation, but you can do it:
> > >
> > > https://arrow.apache.org/docs/python/feather.html
> > >
> > > —
> > > Felipe
> > >
> > > On Tue, 17 Oct 2023 at 23:26 Nara 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > Is it a good idea to use Apache Arrow as a file format? Looks like
> > > > projecting columns isn't available by default.
> > > >
> > > > One of the benefits of Parquet file format is column projection,
> where
> > > the
> > > > IO is limited to just the columns projected.
> > > >
> > > > Regards ,
> > > > Nara
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-15 Thread wish maple
Congratulations!

Raúl Cumplido  于2023年10月15日周日 20:48写道:

> Congratulations and welcome!
>
> El dom, 15 oct 2023, 13:57, Ian Cook  escribió:
>
> > Congratulations Curt!
> >
> > On Sun, Oct 15, 2023 at 05:32 Andrew Lamb  wrote:
> >
> > > On behalf of the Arrow PMC, I'm happy to announce that Curt Hagenlocher
> > > has accepted an invitation to become a committer on Apache
> > > Arrow. Welcome, and thank you for your contributions!
> > >
> > > Andrew
> > >
> >
>


Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread wish maple
+1

LGTM, thanks!

Ian Cook  于2023年9月30日周六 00:49写道:

> +1 (non-binding)
>
> Thanks very much Felipe for your persistence and your commitment to
> addressing the numerous questions and comments that have been raised
> since the beginning of the discussion on this in April.
>
> On Fri, Sep 29, 2023 at 12:34 PM Benjamin Kietzman 
> wrote:
> >
> > +1
> >
> > On Fri, Sep 29, 2023 at 10:51 AM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> > > Yes, ListView is an implementation of Velox's ArrayVector [1] ("vector
> of
> > > arrays"). In Arrow we would naturally refer to them as "array of
> lists",
> > > but `ListArray` is taken by the existing offset-only list formats.
> > > Following the pattern adopted by other types in Arrow that use offsets
> and
> > > sizes, we adopt the suffix -View to differentiate list-views from
> lists.
> > >
> > > Velox doesn't offer the 64-bit variation, but since Arrow has both
> List and
> > > LargeList, it was natural to pair them with ListView and LargeListView.
> > >
> > > [2] is a link to the point of a talk by Mark Raasveldt where he
> describes
> > > the DuckDB list representation. Early in the talk, one of the slides
> [3]
> > > mentions how these formats were "co-designed together with Velox team".
> > >
> > > --
> > > Felipe
> > >
> > > [1]
> > >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > [2] https://youtu.be/bZOvAKGkzpQ?si=wgSwew3Ck8utteOI=1569
> > > [3] https://15721.courses.cs.cmu.edu/spring2023/slides/22-duckdb.pdf
> > >
> > > On Fri, Sep 29, 2023 at 9:32 AM Raphael Taylor-Davies
> > >  wrote:
> > >
> > > > Hi Felipe,
> > > >
> > > > Can I confirm that DuckDB and Velox use the same encoding for these
> > > > types, and so we aren't going to run into similar issues as [1]?
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael Taylor-Davies
> > > >
> > > > [1]:
> https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
> > > >
> > > > On 29/09/2023 13:09, Felipe Oliveira Carvalho wrote:
> > > > > Hello,
> > > > >
> > > > > I'd like to propose adding ListView and LargeListView arrays to the
> > > Arrow
> > > > > format.
> > > > > Previous discussion in [1][2], columnar format description and
> > > > flatbuffers
> > > > > changes in [3].
> > > > >
> > > > > There are implementations available in both C++ [4] and Go [5]. I'm
> > > > working
> > > > > on the integration tests which I will push to one of the PR
> branches
> > > > before
> > > > > they are merged. I've made a graph illustrating how this addition
> > > > affects,
> > > > > in a backwards compatible way, the type predicates and inheritance
> > > chain
> > > > on
> > > > > the C++ implementation. [6]
> > > > >
> > > > > The vote will be open for at least 72 hours not counting the
> weekend.
> > > > >
> > > > > [ ] +1 add the proposed ListView and LargeListView types to the
> Apache
> > > > > Arrow format
> > > > > [ ] -1 do not add the proposed ListView and LargeListView types to
> the
> > > > > Apache Arrow format
> > > > > because...
> > > > >
> > > > > Sincerely,
> > > > > Felipe
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb
> > > > > [2]
> https://lists.apache.org/thread/dcwdzhz15fftoyj6xp89ool9vdk3rh19
> > > > > [3] https://github.com/apache/arrow/pull/37877
> > > > > [4] https://github.com/apache/arrow/pull/35345
> > > > > [5] https://github.com/apache/arrow/pull/37468
> > > > > [6]
> https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db
> > > > >
> > > >
> > >
>


Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
By the way, you can try to use a memory profiler like [1] or [2].
It would help to find out how the memory is being used.

Best,
Xuwei Fu

[1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
[2] https://google.github.io/tcmalloc/gperftools.html
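As an aside, a cheap way to watch RSS from inside the process while scanning is the stdlib `resource` module; below is a minimal, illustrative sketch, assuming Linux (where `ru_maxrss` is reported in KiB) and using a toy allocation as a stand-in for the per-file allocations of a scan:

```python
import resource

def max_rss_bytes():
    # ru_maxrss is KiB on Linux (bytes on macOS); this sketch assumes Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

before = max_rss_bytes()
blob = bytearray(8 * 1024 * 1024)  # stand-in for per-file allocations in a scan
after = max_rss_bytes()
print(f"touched {len(blob) >> 20} MiB; max rss is now {after >> 20} MiB")
```

Sampling this alongside the memory pool's `bytes_allocated()` between files is how one would reproduce the (a) vs (b) comparison described below.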


Felipe Oliveira Carvalho  wrote on Thu, Sep 7, 2023 at 00:28:

> > (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
> increasing during the scan (looks linear to the number of files scanned).
>
> I wouldn't take this to mean a memory leak but the memory allocator not
> paging out virtual memory that has been allocated throughout the scan.
> Could you run your workload under a memory profiler?
>
> (3) Scan the same dataset twice in the same process doesn't increase the
> max rss.
>
> Another sign this isn't a leak, just the allocator reaching a level of
> memory commitment that it doesn't feel like undoing.
>
> --
> Felipe
>
> On Wed, Sep 6, 2023 at 12:56 PM Li Jin  wrote:
>
> > Hello,
> >
> > I have been testing "What is the max rss needed to scan through ~100G of
> > data in a parquet stored in gcs using Arrow C++".
> >
> > The current answer is about ~6G of memory which seems a bit high so I
> > looked into it. What I observed during the process led me to think that
> > there are some potential cache/memory issues in the dataset/parquet cpp
> > code.
> >
> > Main observation:
> > (1) As I am scanning through the dataset, I printed out (a) memory
> > allocated by the memory pool from ScanOptions (b) process rss. I found
> that
> > while (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
> > increasing during the scan (looks linear to the number of files scanned).
> > (2) I tested ScanNode in Arrow as well as an in-house library that
> > implements its own "S3Dataset" similar to Arrow dataset, both showing
> > similar rss usage. (Which led me to think the issue is more likely to be
> in
> > the parquet cpp code instead of dataset code).
> > (3) Scan the same dataset twice in the same process doesn't increase the
> > max rss.
> >
> > I plan to look into the parquet cpp/dataset code but I wonder if someone
> > has some clues what the issue might be or where to look at?
> >
> > Thanks,
> > Li
> >
>


Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
1. In Dataset, there might be `fragment_readahead` or other readahead options.
2. In Parquet, if pre-buffering is enabled, it will pre-buffer some columns (see
`FileReaderImpl::GetRecordBatchReader`).
3. In Parquet, if non-buffered reads are enabled, reading a column reads the
whole ColumnChunk. Otherwise, it reads in a buffered fashion, sized by the
buffer size.

I may have forgotten some places; you can try to check those.

Best
Xuwei Fu

Li Jin  wrote on Thu, Sep 7, 2023 at 00:16:

> Thanks both for the quick response! I wonder if there is some code in
> parquet cpp  that might be keeping some cached information (perhaps
> metadata) per file scanned?
>
> On Wed, Sep 6, 2023 at 12:10 PM wish maple  wrote:
>
> > I've met lots of Parquet Dataset issues. The main problem is that currently
> > we have two sets of APIs, and they have different scan options. And
> > sometimes different interfaces like `to_batches()` and others enable
> > different scan options.
> >
> > I think [2] is similar to your problem. [1]-[4] are some issues I met before.
> >
> > As for the code, you may take a look at :
> > 1. ParquetFileFormat and Dataset related.
> > 2. FileSystem and CacheRange. Parquet might use this to handle pre-buffer
> > 3. How Parquet RowReader handle IO
> >
> > [1] https://github.com/apache/arrow/issues/36765
> > [2] https://github.com/apache/arrow/issues/37139
> > [3] https://github.com/apache/arrow/issues/36587
> > [4] https://github.com/apache/arrow/issues/37136
> >
Li Jin  wrote on Wed, Sep 6, 2023 at 23:56:
> >
> > > Hello,
> > >
> > > I have been testing "What is the max rss needed to scan through ~100G
> of
> > > data in a parquet stored in gcs using Arrow C++".
> > >
> > > The current answer is about ~6G of memory which seems a bit high so I
> > > looked into it. What I observed during the process led me to think that
> > > there are some potential cache/memory issues in the dataset/parquet cpp
> > > code.
> > >
> > > Main observation:
> > > (1) As I am scanning through the dataset, I printed out (a) memory
> > > allocated by the memory pool from ScanOptions (b) process rss. I found
> > that
> > > while (a) stays pretty stable throughout the scan (stays < 1G), (b)
> keeps
> > > increasing during the scan (looks linear to the number of files
> scanned).
> > > (2) I tested ScanNode in Arrow as well as an in-house library that
> > > implements its own "S3Dataset" similar to Arrow dataset, both showing
> > > similar rss usage. (Which led me to think the issue is more likely to
> be
> > in
> > > the parquet cpp code instead of dataset code).
> > > (3) Scan the same dataset twice in the same process doesn't increase
> the
> > > max rss.
> > >
> > > I plan to look into the parquet cpp/dataset code but I wonder if
> someone
> > > has some clues what the issue might be or where to look at?
> > >
> > > Thanks,
> > > Li
> > >
> >
>


Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
I've met lots of Parquet Dataset issues. The main problem is that currently
we have two sets of APIs, and they have different scan options. And sometimes
different interfaces like `to_batches()` and others enable different scan
options.

I think [2] is similar to your problem. [1]-[4] are some issues I met before.

As for the code, you may take a look at :
1. ParquetFileFormat and Dataset related.
2. FileSystem and CacheRange. Parquet might use this to handle pre-buffer
3. How Parquet RowReader handle IO

[1] https://github.com/apache/arrow/issues/36765
[2] https://github.com/apache/arrow/issues/37139
[3] https://github.com/apache/arrow/issues/36587
[4] https://github.com/apache/arrow/issues/37136

Li Jin  wrote on Wed, Sep 6, 2023 at 23:56:

> Hello,
>
> I have been testing "What is the max rss needed to scan through ~100G of
> data in a parquet stored in gcs using Arrow C++".
>
> The current answer is about ~6G of memory which seems a bit high so I
> looked into it. What I observed during the process led me to think that
> there are some potential cache/memory issues in the dataset/parquet cpp
> code.
>
> Main observation:
> (1) As I am scanning through the dataset, I printed out (a) memory
> allocated by the memory pool from ScanOptions (b) process rss. I found that
> while (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
> increasing during the scan (looks linear to the number of files scanned).
> (2) I tested ScanNode in Arrow as well as an in-house library that
> implements its own "S3Dataset" similar to Arrow dataset, both showing
> similar rss usage. (Which led me to think the issue is more likely to be in
> the parquet cpp code instead of dataset code).
> (3) Scan the same dataset twice in the same process doesn't increase the
> max rss.
>
> I plan to look into the parquet cpp/dataset code but I wonder if someone
> has some clues what the issue might be or where to look at?
>
> Thanks,
> Li
>


Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-21 Thread wish maple
+1 (non-binding)

It would help a lot when processing UTF-8 related data!

Xuwei

Andrew Lamb  wrote on Tue, Aug 22, 2023 at 00:11:

> +1
>
> This is a great example of collaboration
>
> On Sat, Aug 19, 2023 at 4:10 PM Chao Sun  wrote:
>
> > +1 (non-binding)!
> >
> > On Fri, Aug 18, 2023 at 12:59 PM Felipe Oliveira Carvalho <
> > felipe...@gmail.com> wrote:
> >
> > > +1 (non-binding)
> > >
> > > —
> > > Felipe
> > >
> > > On Fri, 18 Aug 2023 at 18:48 Jacob Wujciak-Jens
> > >  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > On Fri, Aug 18, 2023 at 6:04 PM L. C. Hsieh 
> wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > On Fri, Aug 18, 2023 at 5:53 AM Neal Richardson
> > > > >  wrote:
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > Thanks all for the thoughtful discussions here.
> > > > > >
> > > > > > Neal
> > > > > >
> > > > > > On Fri, Aug 18, 2023 at 4:14 AM Raphael Taylor-Davies
> > > > > >  wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > Despite my earlier misgivings, I think this will be a valuable
> > > > addition
> > > > > > > to the specification.
> > > > > > >
> > > > > > > To clarify I've interpreted this as a vote on both Utf8View and
> > > > > > > BinaryView as in the linked PR.
> > > > > > >
> > > > > > > On 28/06/2023 20:34, Benjamin Kietzman wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I'd like to propose adding Utf8View arrays to the arrow
> format.
> > > > > > > > Previous discussion in [1], columnar format description in
> [2],
> > > > > > > > flatbuffers changes in [3].
> > > > > > > >
> > > > > > > > There are implementations available in both C++[4] and Go[5]
> > > which
> > > > > > > > exercise the new type over IPC. Utf8View format
> demonstrates[6]
> > > > > > > > significant performance benefits over Utf8 in common tasks.
> > > > > > > >
> > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > >
> > > > > > > > [ ] +1 add the proposed Utf8View type to the Apache Arrow
> > format
> > > > > > > > [ ] -1 do not add the proposed Utf8View type to the Apache
> > Arrow
> > > > > format
> > > > > > > > because...
> > > > > > > >
> > > > > > > > Sincerely,
> > > > > > > > Ben Kietzman
> > > > > > > >
> > > > > > > > [1]
> > > > https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
> > > > > > > > [2]
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/docs/source/format/Columnar.rst#variable-size-binary-view-layout
> > > > > > > > [3]
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/arrow/pull/35628/files#diff-0623d567d0260222d5501b4e169141b5070eabc2ec09c3482da453a3346c5bf3
> > > > > > > > [4] https://github.com/apache/arrow/pull/35628
> > > > > > > > [5] https://github.com/apache/arrow/pull/35769
> > > > > > > > [6]
> > > > > https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>
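For readers skimming the thread, the 16-byte view struct from the columnar format description [2] can be sketched with Python's stdlib `struct` module. This is an illustrative sketch only (function names are made up): strings of 12 bytes or fewer are inlined next to the length, while longer ones carry a 4-byte prefix plus a buffer index and an offset into that buffer.

```python
import struct

def make_view(s: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack one 16-byte Utf8View view struct (little-endian)."""
    n = len(s)
    if n <= 12:
        # short string: int32 length + data inlined, zero-padded to 16 bytes
        return struct.pack("<i12s", n, s)
    # long string: int32 length, 4-byte prefix, buffer index, offset
    return struct.pack("<i4sii", n, s[:4], buffer_index, offset)

short = make_view(b"hi")
long_ = make_view(b"a string longer than twelve bytes", 0, 128)
print(len(short), len(long_))  # both views are exactly 16 bytes
```

The inlined prefix is what enables the fast comparisons and cheap "take" operations discussed in [6].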


RE: C++: State of parquet 2.x / nanosecond support

2023-07-14 Thread wish maple
Hi Li,
Parquet 2.6 has been supported for a long time, and recently, in Parquet C++
and Python, 2.6 was made the default Parquet writer version [1][2].
So I think you can just use it! However, I don't know whether nanoarrow
supports it.

Best,
Xuwei Fu

[1] https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
[2] https://github.com/apache/arrow/pull/36137

On 2023/07/14 13:25:22 Li Jin wrote:
> Hi,
>
> Recently I found myself in the need of nanosecond granularity timestamp.
> IIUC this is something supported in the newer version of parquet (2.6
> perhaps)? I wonder what is the state of that in Arrow and parquet cpp?
>
> Thanks,
> Li
>


Question about TypeHolder in arrow

2023-07-04 Thread wish maple
Hi,

By looking into the code of arrow compute, I found that it uses
`TypeHolder` [1], and expressions might call `GetTypes` to get the input or
output types. The documentation for `TypeHolder` says that it's a container
for a dynamically created `shared_ptr`. However, my view is:

1. It's widely used, and might store non-owned types (e.g. [2])
2. There are type factories for many data types, like [3], so
non-nested types tend to be used as singletons

I wonder when `TypeHolder` would hold owned types, when it would hold
non-owned types, and whether primitive types like int8 and int16, or types
like string, share the same underlying pointer?

Thanks,
Xuwei Fu


[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/type.h#L215
[2]
https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/expression_internal.h#L47
[3] https://github.com/apache/arrow/blob/main/cpp/src/arrow/type.cc#L2510


Question about nested columnar validity

2023-06-29 Thread wish maple
Sorry for being misleading. A "valid" offset means that:
1. For the binary-like [1] and list [2] formats, even if the parent has
`validity = false`, their offsets should be well-defined.
2. For StringView and ArrayView, if the parent has `validity = false` but
they have `validity = true`, their offsets might point to an invalid
position.

Am I right?

On 2023/06/29 12:10:52 Antoine Pitrou wrote:
>
> On 29/06/2023 13:42, wish maple wrote:
> > Thanks all!
> > So, in general:
> > 1. For our Binary Like [1] format, and List formats [2], if the parent
is
> >  not valid, the offset should still be valid
>
> What do you call a "valid" offset?
>


RE: Question about nested columnar validity

2023-06-29 Thread wish maple
Thanks all!
So, in general:
1. For the binary-like [1] and list [2] formats, if the parent is not valid,
the offsets should still be valid
2. For the StringView and ListView [3] types Arrow is currently working on,
if the parent is not valid, the child might have valid content

Am I right? And can I add corresponding words to the validity part of the
document [4]?

[1]
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout
[2]
https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
[3] https://lists.apache.org/thread/c6frlr9gcxy8qdhbmv8cn3rdjbrqxb1v
[4] https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps
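To make point 1 concrete, here is a stdlib-only sketch of a list-of-int array where a null slot still carries well-formed offsets (hypothetical data, plain lists standing in for Arrow buffers, not the Arrow API):

```python
# List<int> array per the variable-size list layout [2]: values + offsets + validity.
values = [1, 2, 3, 4, 5]
offsets = [0, 2, 2, 5]          # monotonic and well-defined even across the null slot
validity = [True, False, True]  # slot 1 is null

def get(i):
    if not validity[i]:
        return None
    return values[offsets[i]:offsets[i + 1]]

print([get(i) for i in range(3)])  # [[1, 2], None, [3, 4, 5]]
```

Because `offsets[1] == offsets[2]`, the null slot contributes no values but the offsets buffer stays monotonic, which is what "the offsets should still be valid" means here.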

Thanks,
Xuwei Fu

On 2023/06/28 15:03:11 wish maple wrote:
> Hi,
>
> By looking at the Arrow standard, when it comes to nested structures, like
> StructArray [1] or FixedSizeListArray [2], when the parent is not valid, the
> corresponding child is left "undefined".
>
> If it's a BinaryArray, when its parent is not valid, would a validity
> member point to an undefined address?
>
> And if it's a ListArray [3], when its parent is not valid, should its offset
> and size be valid?
>
> Thanks,
> Xuwei Fu
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#struct-layout
> [2]
> https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout
> [3]
>
https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
>


Question about nested columnar validity

2023-06-28 Thread wish maple
Hi,

By looking at the Arrow standard, when it comes to nested structures, like
StructArray [1] or FixedSizeListArray [2], when the parent is not valid, the
corresponding child is left "undefined".

If it's a BinaryArray, when its parent is not valid, would a validity
member point to an undefined address?

And if it's a ListArray [3], when its parent is not valid, should its offset
and size be valid?

Thanks,
Xuwei Fu

[1] https://arrow.apache.org/docs/format/Columnar.html#struct-layout
[2]
https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout
[3]
https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout


RE: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread wish maple
On 2023/06/15 16:24:44 Joris Van den Bossche wrote:
> Hi all,
>
> Bringing up https://github.com/apache/arrow/issues/35746 to the
> mailing list: this issue proposes to bump the default Parquet version
> we use for writing to Parquet files in the C++ library (and in the
> various bindings including pyarrow and R arrow) from the current
> default of "2.4" to "2.6".
>
> In practice, the only change is that the writer will, by default,
> write the Timestamp LogicalType with NANOS unit
> (
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
)
> if your data uses timestamp("ns") (currently, such data gets coerced
> to microsecond resolution when writing to Parquet).
>
> In theory this could cause compatibility issues if the files you are
> writing need to be read by other Parquet implementations which don't
> yet support nanoseconds. But the Parquet format 2.6 was released in
> Sept 2018, and parquet-mr added support for it in 2018 as well.
>
> Unless there is pushback on this, we are currently planning to make
> this change for the upcoming Arrow 13.0.0 release.
>
> Best,
> Joris
>

In our current codebase, users can switch among all of these formats:
1. Parquet 1.0
2. Parquet 2.0 (deprecated; similar to Parquet 2.6, meaning it could support
all kinds of 2.0 features)
3. Parquet 2.4 (released in October 2017, enables the UINT32 logical type)
4. Parquet 2.6 (released in September 2018, enables NANOS)

I think switching to 2.6 with nanos might break some legacy readers, but
currently most readers support reading NANOS, so I'm +1 on this proposal.
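A quick stdlib-only illustration of what the old coercion to microseconds loses (made-up timestamp value, pure integer arithmetic):

```python
# A nanosecond timestamp with sub-microsecond detail.
ts_ns = 1_686_841_200_123_456_789

# Under the old 2.4 default the writer coerces to microseconds,
# truncating the last three decimal digits.
ts_us = ts_ns // 1_000
roundtrip = ts_us * 1_000

print(ts_ns - roundtrip)  # prints 789: the nanosecond remainder is gone
```

With writer version 2.6 and the Timestamp(NANOS) logical type, the value round-trips exactly.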

Best wishes,
Xuwei Fu


RE: [DISCUSS] Interest in a 12.0.1 patch?

2023-05-18 Thread wish maple
I have two Parquet-related bug fixes and I wonder if we can release them in
12.0.1:
1. https://github.com/apache/arrow/pull/35428
2. https://github.com/apache/arrow/pull/35520

Without patch 1, BYTE_STREAM_SPLIT data cannot be read if the previous
Parquet page is larger than the incoming one.
Without patch 2, a segfault might occur when closing a row group meets an
exception.
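For context on the first fix: BYTE_STREAM_SPLIT scatters the k-th byte of every value into the k-th stream, which tends to make floating-point data more compressible. A stdlib-only sketch of the encoding and its inverse (illustrative, not the Arrow C++ implementation):

```python
import struct

def byte_stream_split(values, width=4):
    """Encode a list of float32 values: stream k holds byte k of each value."""
    raw = b"".join(struct.pack("<f", v) for v in values)
    return b"".join(bytes(raw[i * width + k] for i in range(len(values)))
                    for k in range(width))

def byte_stream_join(data, width=4):
    """Decode: gather byte k of value i from position k*n + i."""
    n = len(data) // width
    raw = bytes(data[k * n + i] for i in range(n) for k in range(width))
    return [struct.unpack_from("<f", raw, i * width)[0] for i in range(n)]

vals = [1.0, 2.5, -3.25]
assert byte_stream_join(byte_stream_split(vals)) == vals
```

The decoder has to know where each stream starts, which is why a mismatch in expected page sizes (the bug patch 1 addresses) can corrupt the read.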

Best,

Xuwei Fu

On 2023/05/18 17:04:12 Weston Pace wrote:
> Regrettably, 12.0.0 had a significant performance regression (I'll take the
> blame for not thinking through all the use cases), most easily exposed when
> writing datasets from pandas / numpy data, which is being addressed in
> [1].  I believe this to be a fairly common use case and it may warrant a
> 12.0.1 patch.  Are there other issues that would need a patch?  Do we feel
> this issue is significant enough to justify the work?
>
> [1] https://github.com/apache/arrow/pull/35565
>


RE: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread wish maple
I think the ArrayVector format has the following benefits:
1. Converting a batch in Velox or another system to an Arrow array could be
much more lightweight.
2. Modifying, filtering and copying arrays or strings could be much more
lightweight.

Velox can make a Vector mutable; it seems an Arrow array cannot. But that
seems to make little difference here.

On 2023/04/25 22:00:08 Felipe Oliveira Carvalho wrote:
> Hi folks,
>
> I would like to start a public discussion on the inclusion of a new array
> format to Arrow — array-view array. The name is also up for debate.
>
> This format is inspired by Velox's ArrayVector format [1]. Logically, this
> array represents an array of arrays. Each element is an array-view (offset
> and size pair) that points to a range within a nested "values" array
> (called "elements" in Velox docs). The nested array can be of any type,
> which makes this format very flexible and powerful.
>
> [image: ../_images/array-vector.png]
> 
>
> I'm currently working on a C++ implementation and plan to work on a Go
> implementation to fulfill the two-implementations requirement for format
> changes.
>
> The draft design:
>
> - 3 buffers: [validity_bitmap, int32 offsets buffer, int32 sizes buffer]
> - 1 child array: "values" as an array of the type parameter
>
> validity_bitmap is used to differentiate between empty array views
> (sizes[i] == 0) and NULL array views (validity_bitmap[i] == 0).
>
> When the validity_bitmap[i] is 0, both sizes and offsets are undefined (as
> usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is recommended
> if setting a value is not an issue to the system producing the arrays.
>
> offsets buffer is not required to be ordered and views don't have to be
> disjoint.
>
> [1]
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
>
> Thanks,
> Felipe O. Carvalho
>
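The draft design above can be sketched with plain Python lists (illustrative only: made-up data mirroring the [validity_bitmap, offsets, sizes] buffers plus the "values" child, not an actual implementation):

```python
values = [10, 20, 30, 40]
validity = [True, True, False, True]
offsets = [0, 2, 0, 1]   # offsets need not be ordered
sizes   = [2, 2, 0, 3]   # and views may overlap

def get(i):
    if not validity[i]:
        return None          # null view: offsets[i] / sizes[i] are undefined
    return values[offsets[i]:offsets[i] + sizes[i]]

print([get(i) for i in range(4)])  # [[10, 20], [30, 40], None, [20, 30, 40]]
```

Slots 0 and 3 share the value 20, showing why filters and takes can be zero-copy under this layout.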


RE: [DISCUSS][C++][Parquet] Expose the API to customize the compression parameter

2023-04-23 Thread wish maple
On 2023/04/23 09:38:02 "Yang, Yang10" wrote:
> Hi,
>
> As discussed in this issue: https://github.com/apache/arrow/issues/35287,
currently Arrow only supports one parameter: compression_level to be
customized. We would like to make more compression parameters (such as
window_bits) customizable when creating the Codec, given the variety of
usage scenarios. As suggested by @kou, we may introduce a new options class
such as Codec::Options to make the structure clear and easy to extend. But
it may take some effort as this is more like a code structure refactor.
Passing a parameter directly is another approach, easy to implement but may
be hard to extend. So we would like some further discussion here. If you
have any suggestion or comments, please share them on above issue or here.
Thanks!
>
> Best,
> Yang
>

Most systems, including Arrow, do not have compression flags other than the
compression level. But I think we can provide a Codec::Options, like
RocksDB [1]. With such options, we would be able to configure lz4 or zstd
dictionaries more flexibly. So I think it's OK.

Best regards,
Xuwei Fu

[1]
https://github.com/facebook/rocksdb/blob/main/include/rocksdb/advanced_options.h#L86
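As a reference point for what such a Codec::Options could expose, the window size is already a user-visible knob in zlib; a small sketch using Python's stdlib `zlib` (standing in here for the codecs Arrow wraps; `wbits=-9` means raw deflate with a 512-byte window instead of the default 32 KiB):

```python
import zlib

data = b"hello compression options " * 100

# Compress with a non-default window_bits value.
c = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-9)
out = c.compress(data) + c.flush()

# The decompressor must be configured with a matching window size.
d = zlib.decompressobj(wbits=-9)
restored = d.decompress(out) + d.flush()

assert restored == data
print(len(data), "->", len(out))
```

The asymmetry (reader and writer must agree on the parameter) is exactly why a structured options class is easier to extend safely than loose extra arguments.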