Re: [VOTE] Remove compute from Arrow JS

2021-11-02 Thread Micah Kornfield
I'd suggest maybe two things:
1.  If possible, add deprecation warnings for the next release, and delete
the code in the release after (we don't have a formal policy, but it would
be good to give users a heads-up before outright deletion).
2.  If 1 isn't an option, then please add something to the 6.0.0 release
notes indicating the removal.

Cheers,
Micah

On Tue, Nov 2, 2021 at 1:54 PM Dominik Moritz  wrote:

> +1 from me as well.
>
> That brings us to three +1 votes and no -1 or +0.
>
> Thank you, all. We will remove the compute code in the next Arrow version.
>
> On Nov 2, 2021 at 16:12:02, Paul Taylor  wrote:
>
> > +1 from me as well
> >
> > On Oct 27, 2021, at 6:58 PM, Brian Hulette  wrote:
> >
> > 
> > +1
> >
> > I don't think there's much reason to keep the compute code around when
> > there's a more performant, easier-to-use alternative. I think the only
> > unique feature of the Arrow compute code was the ability to optimize
> > queries on dictionary-encoded columns, but Jeff added this to Arquero
> > almost a year ago now [1].
> >
> > Brian
> >
> > [1] https://github.com/uwdata/arquero/issues/86
> >
> > On Wed, Oct 27, 2021 at 4:46 PM Dominik Moritz 
> > wrote:
> >
> >> Dear Arrow community,
> >>
> >> We are proposing to remove the compute code from Arrow JS. Right now, the
> >> compute code is encapsulated in a DataFrame class that extends Table. The
> >> DataFrame implements a few functions such as filtering and counting with
> >> expressions. However, the predicate code is not very efficient (it’s
> >> interpreted) and most people only use Arrow to read data but don’t need
> >> compute. There are also more complete alternatives for doing compute on
> >> Arrow data structures such as Arquero (https://github.com/uwdata/arquero).
> >> By removing the compute code, we can focus on the IPC reading/writing and
> >> primitive types.
> >>
> >> The vote will be open for at least 72 hours.
> >>
> >> [ ] +1 Remove compute from Arrow JS
> >> [ ] +0
> >> [ ] -1 Do not remove compute because…
> >>
> >> Thank you,
> >> Dominik
> >>
> >
>


Re: Arrow in HPC

2021-11-02 Thread Jed Brown
"David Li"  writes:

> Thanks for the clarification Yibo, looking forward to the results. Even if it 
> is a very hacky PoC it will be interesting to see how it affects performance, 
> though as Keith points out there are benefits in general to UCX (or similar 
> library), and we can work out the implementation plan from there.
>
> To Benson's point - the work done to get UCX supported would pave the way to 
> supporting other backends as well. I'm personally not familiar with UCX, MPI, 
> etc. so is MPI here more about playing well with established practices or 
> does it also offer potential hardware support/performance improvements like 
> UCX would?

There are two main implementations of MPI, MPICH and Open MPI, both of which 
are permissively licensed open source community projects. Both have direct 
support for UCX and unless your needs are very specific, the overhead of going 
through MPI is likely to be negligible. Both also have proprietary derivatives, 
such as Cray MPI (MPICH derivative) and Spectrum MPI (Open MPI derivative), 
which may have optimizations for proprietary networks. Both MPICH and Open MPI 
can be built without UCX, and this is often easier (UCX 'master' is more 
volatile in my experience).

The vast majority of distributed memory scientific applications use MPI or 
higher level libraries, rather than writing directly to UCX (which provides 
less coverage of HPC networks). I think MPI compatibility is important.

From way up-thread (sorry):

>> > Jed - how would you see MPI and Flight interacting? As another
>> > transport/alternative to UCX? I admit I'm not familiar with the HPC
>> > space.

MPI has collective operations like MPI_Allreduce (perform a reduction and give 
every process the result; these run in log(P) or better time with small 
constants -- 15 microseconds is typical for a cheap reduction operation on a 
million processes). MPI supports user-defined operations for reductions and 
prefix-scan operations. If we defined MPI_Ops for Arrow types, we could compute 
summary statistics and other algorithmic building blocks fast at arbitrary 
scale.
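
To make the MPI_Op idea concrete, here is a minimal sketch of such a
collective -- an element-wise MPI_Allreduce over one primitive column's
local values. It assumes the third-party `mpi` (rsmpi) crate and plain
vectors as stand-ins; an actual integration would go further and register
user-defined MPI_Ops that understand Arrow buffers and validity bitmaps.

// Hypothetical sketch, not an Arrow API: sum one column across all ranks.
use mpi::collective::SystemOperation;
use mpi::traits::*;

fn main() {
    let universe = mpi::initialize().unwrap();
    let world = universe.world();

    // Stand-in for the local partition of an Int64 column on this rank.
    let local: Vec<i64> = vec![world.rank() as i64 + 1; 4];
    let mut global = vec![0i64; 4];

    // MPI_Allreduce: every rank receives the element-wise sums, in the
    // log(P)-or-better time described above.
    world.all_reduce_into(&local[..], &mut global[..], SystemOperation::sum());

    println!("rank {} sees sums {:?}", world.rank(), global);
}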

The collective execution model might not be everyone's bag, but MPI_Op can also 
be used in one-sided operations (MPI_Accumulate and MPI_Fetch_and_op) and 
dropping into collective mode has big advantages for certain algorithms in 
computational statistics/machine learning.


Re: [VOTE] Remove compute from Arrow JS

2021-11-02 Thread Dominik Moritz
+1 from me as well.

That brings us to three +1 votes and no -1 or +0.

Thank you, all. We will remove the compute code in the next Arrow version.

On Nov 2, 2021 at 16:12:02, Paul Taylor  wrote:

> +1 from me as well
>
> On Oct 27, 2021, at 6:58 PM, Brian Hulette  wrote:
>
> 
> +1
>
> I don't think there's much reason to keep the compute code around when
> there's a more performant, easier-to-use alternative. I think the only
> unique feature of the Arrow compute code was the ability to optimize
> queries on dictionary-encoded columns, but Jeff added this to Arquero
> almost a year ago now [1].
>
> Brian
>
> [1] https://github.com/uwdata/arquero/issues/86
>
> On Wed, Oct 27, 2021 at 4:46 PM Dominik Moritz 
> wrote:
>
>> Dear Arrow community,
>>
>> We are proposing to remove the compute code from Arrow JS. Right now, the
>> compute code is encapsulated in a DataFrame class that extends Table. The
>> DataFrame implements a few functions such as filtering and counting with
>> expressions. However, the predicate code is not very efficient (it’s
>> interpreted) and most people only use Arrow to read data but don’t need
>> compute. There are also more complete alternatives for doing compute on
>> Arrow data structures such as Arquero (https://github.com/uwdata/arquero).
>> By removing the compute code, we can focus on the IPC reading/writing and
>> primitive types.
>>
>> The vote will be open for at least 72 hours.
>>
>> [ ] +1 Remove compute from Arrow JS
>> [ ] +0
>> [ ] -1 Do not remove compute because…
>>
>> Thank you,
>> Dominik
>>
>


Re: [VOTE] Remove compute from Arrow JS

2021-11-02 Thread Paul Taylor
+1 from me as well

> On Oct 27, 2021, at 6:58 PM, Brian Hulette  wrote:
> 
> 
> +1
> 
> I don't think there's much reason to keep the compute code around when
> there's a more performant, easier-to-use alternative. I think the only unique
> feature of the Arrow compute code was the ability to optimize queries on
> dictionary-encoded columns, but Jeff added this to Arquero almost a year ago
> now [1].
> 
> Brian
> 
> [1] https://github.com/uwdata/arquero/issues/86
> 
>> On Wed, Oct 27, 2021 at 4:46 PM Dominik Moritz  wrote:
>> Dear Arrow community,
>> 
>> We are proposing to remove the compute code from Arrow JS. Right now, the 
>> compute code is encapsulated in a DataFrame class that extends Table. The 
>> DataFrame implements a few functions such as filtering and counting with 
>> expressions. However, the predicate code is not very efficient (it’s 
>> interpreted) and most people only use Arrow to read data but don’t need 
>> compute. There are also more complete alternatives for doing compute on 
>> Arrow data structures such as Arquero (https://github.com/uwdata/arquero). 
>> By removing the compute code, we can focus on the IPC reading/writing and 
>> primitive types.
>> 
>> The vote will be open for at least 72 hours.
>> 
>> [ ] +1 Remove compute from Arrow JS
>> [ ] +0
>> [ ] -1 Do not remove compute because…
>> 
>> Thank you,
>> Dominik


RE: [VOTE][RESULT] Release Apache Arrow 6.0.0 - RC3

2021-11-02 Thread Matthew Topol
The patch to the release-6.0.0 branch to fix the Go issue has been merged and 
is ready for a 6.0.1 release once everything else is.

--Matt

-Original Message-
From: Ian Cook  
Sent: Friday, October 29, 2021 11:48 AM
To: dev@arrow.apache.org
Subject: Re: [VOTE][RESULT] Release Apache Arrow 6.0.0 - RC3

I am working on the vcpkg port update today.

Ian

On Fri, Oct 29, 2021 at 11:31 AM Neal Richardson  
wrote:
>
> R package has been accepted by CRAN, though we will have to patch and 
> resubmit due to a sanitizer error (
> https://issues.apache.org/jira/browse/ARROW-14514 for the failure,
> https://issues.apache.org/jira/browse/ARROW-14515 for the missing CI--we
> test UBSAN with gcc but apparently CRAN also does UBSAN with clang, which
> is where this came up).
>
> Neal
>
> 1. [done] bump version numbers
> 2. [done] upload source
> 3. [done] upload binaries
> 4. [done] update website
> 5. [depends-on-brew] upload ruby gems
> 6. [done] upload js packages
> 8. [done] upload C# packages
> 10. [in-pr] update conda recipes
> 11. [done] upload wheels/sdist to pypi
> 12. [ ] update homebrew packages
> 13. [done] update maven artifacts
> 14. [done] update msys2
> 15. [done*] update R packages
> 16. [Ian] update vcpkg port
> 17. [done] update tags for Go modules
> 18. [done] update docs
> 19. [done] announced to mailing lists
>
>
> > On Wed, Oct 27, 2021 at 8:55 AM Sutou Kouhei  wrote:
> > >
> > > 1. [in-pr] bump version numbers
> > > 2. [done] upload source
> > > 3. [done] upload binaries
> > > 4. [in-pr] update website
> > > 5. [depends-on-brew] upload ruby gems
> > > 6. [done] upload js packages
> > > 8. [done] upload C# packages
> > > 10. [ ] update conda recipes
> > > 11. [done] upload wheels/sdist to pypi
> > > 12. [ ] update homebrew packages
> > > 13. [done] update maven artifacts
> > > 14. [done] update msys2
> > > 15. [Neal] update R packages
> > > 16. [Ian] update vcpkg port
> > > 17. [done] update tags for Go modules
> > > 18. [ ] update docs
> > >
> > > In 
> > >   "Re: [VOTE][RESULT] Release Apache Arrow 6.0.0 - RC3" on Tue, 26 
> > > Oct
> > 2021 17:09:24 +0200,
> > >   Krisztián Szűcs  wrote:
> > >
> > > > The current status of the post release tasks:
> > > >
> > > > 1. [in-pr] bump version numbers
> > > > 2. [done] upload source
> > > > 3. [done] upload binaries
> > > > 4. [in-pr] update website
> > > > 5. [depends-on-brew] upload ruby gems
> > > > 6. [done] upload js packages
> > > > 8. [done] upload C# packages
> > > > 10. [ ] update conda recipes
> > > > 11. [done] upload wheels/sdist to pypi
> > > > 12. [ ] update homebrew packages
> > > > 13. [done] update maven artifacts
> > > > 14. [ ] update msys2
> > > > 15. [Neal] update R packages
> > > > 16. [Ian] update vcpkg port
> > > > 17. [done] update tags for Go modules
> > > > 18. [ ] update docs
> > > >
> > > > On Tue, Oct 26, 2021 at 2:33 PM Krisztián Szűcs 
> > > >  wrote:
> > > >>
> > > >> Resending with RESULT subject line.
> > > >>
> > > >> The VOTE carries with 3 binding +1 and 2 non-binding +1 votes.
> > > >>
> > > >> I'm starting the post release tasks and will keep you posted 
> > > >> about the current status.
> > > >>
> > > >> Thanks everyone!
> > > >>
> > > >> >
> > > >> > On Tue, Oct 26, 2021 at 1:56 PM Benson Muite <
> > benson_mu...@emailplus.org> wrote:
> > > >> > >
> > > >> > > Ok. Thanks for the feedback.
> > > >> > >
> > > >> > > Javascript may have problems when using nohup
> > > >> > >
> > > >> > > so directly running
> > > >> > >
> > > >> > > env "TEST_DEFAULT=0" env "TEST_JS=1"  bash 
> > > >> > > dev/release/verify-release-candidate.sh source 6.0.0 3
> > > >> > >
> > > >> > > seems to work, but
> > > >> > >
> > > >> > > nohup env "TEST_DEFAULT=0" env "TEST_JS=1"  bash 
> > > >> > > dev/release/verify-release-candidate.sh source 6.0.0 3 > 
> > > >> > > log.out &
> > > >> > >
> > > >> > > may not work [1].
> > > >> > >
> > > >> > > [1]
> > > >> > >
> > https://stackoverflow.com/questions/16604176/error-ebadf-bad-file-descriptor-when-running-node-using-nohup-of-forever
> > aka6NVJR7z2pDkY85c$
> > > >> > >
> > > >> > > On 10/26/21 2:32 PM, Krisztián Szűcs wrote:
> > > >> > > > Thanks Benson for verifying!
> > > >> > > >
> > > >> > > > Created a jira to track the deprecation warnings [1] and seems like
> > > >> > > > you've already created a PR for the javascript issue [2].
> > > >> > > > Luckily, these issues are not blockers.
> > > >> > > >
> > > >> > > > [1]: https://issues.apache.org/jira/browse/ARROW-14468
> > > >> > > > [2]: https://github.com/apache/arrow/commit/b4bc846fcdf189ae0443b8445c3ef69fc4131764

Re: Synergies with Apache Avro?

2021-11-02 Thread Micah Kornfield
>
> Wrt row iterations and native rows: my understanding is that even
> though most Avro APIs present themselves as iterators of rows, internally
> they read a whole compressed serialized block into memory, decompress it,
> and then deserialize item by item into a row ("read block -> decompress
> block -> decode item by item into rows -> read next block"). Avro is based
> on batches of rows (blocks) that are compressed individually (similar to
> parquet pages, but all column chunks are serialized in a single page within
> a row group).


I haven't looked at it for a while, but my recollection, at least in Java,
is that each step outlined is a streaming process rather than a batch
process (i.e., decompress some bytes, then decode them lazily as "Next Row"
is called).

My hypothesis (we can bench this) is that if the user wants to perform any
> compute over the data, it is advantageous to load the block to arrow
> (decompressed block -> RecordBatch), benefiting from arrow's analytics
> performance instead, as opposed to using a native row-based format where we
> can't leverage SIMD/cache hits/must allocate and deallocate on every item.
> As usual, there are use-cases where this does not hold - I am thinking in
> terms of traditional ETL / CPU intensive stuff.


Do you have a target system in mind?  As I said, for columnar/Arrow-native
query engines this obviously sounds like a win, but for row-oriented
processing engines the transposition costs are going to eat into any gains.
There is also non-zero engineering effort to implement the necessary
filter/selection push-down APIs that most of them provide.  That being
said, I'd love to see real world ETL pipeline benchmarks :)


On Tue, Nov 2, 2021 at 4:39 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thank you all for all your comments.
>
> To the first comments: thanks a lot for your suggestions. I tried with
> mimalloc and there is indeed a -25% improvement for avro-rs. =)
>
> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
>> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
>> of the complexity of parsing avro is the schema evolution rules, I haven't
>> looked at whether the canonical implementations do any optimization for the
>> happy case when reader and writer schema are the same.
>>
>
> The graph was for a single column of a constant string of 3 bytes ("foo")
> each divided into (avro) blocks of 4000 rows each (default block size of
> 16kb). I also tried random strings of 3 bytes and 7 bytes, as well as an
> integer column, and compressed blocks (deflate): with equal speedups.
> Generic benchmarks like these are obviously favorable cases. I agree that schema evolution
> adds extra CPU time, and that this is the happy case; I have not
> benchmarked those yet.
>
> With respect to being a single column, I agree. The second bench that you
> saw is still a single column (of integers): I wanted to check whether the
> cost was the allocation of the strings, or the elements of the rows (the
> speedup is equivalent).
>
> However, I pushed a new bench where we are reading 6 columns [string,
> bool, int, string, string, string|null], speedup is 5x for mz-avro and 4x
> for avro-rs on my machine @ 2^20 rows (pushed latest code to main [1]).
> [image: avro_read_mixed.png]
>
> > Wrt row iterations and native rows: my understanding is that even
> though most Avro APIs present themselves as iterators of rows, internally
> they read a whole compressed serialized block into memory, decompress it,
> and then deserialize item by item into a row ("read block -> decompress
> block -> decode item by item into rows -> read next block"). Avro is based
> on batches of rows (blocks) that are compressed individually (similar to
> parquet pages, but all column chunks are serialized in a single page within
> a row group).
>
> In this context, my thinking of Arrow vs Vec is that once loaded
> in memory, a block behaves like a serialized blob that we can deserialize
> to any in-memory format according to some rules.
>
> My hypothesis (we can bench this) is that if the user wants to perform any
> compute over the data, it is advantageous to load the block to arrow
> (decompressed block -> RecordBatch), benefiting from arrow's analytics
> performance instead, as opposed to using a native row-based format where we
> can't leverage SIMD/cache hits/must allocate and deallocate on every item.
> As usual, there are use-cases where this does not hold - I am thinking in
> terms of traditional ETL / CPU intensive stuff.
>
> My surprise is that even without the compute in mind, deserializing blocks
> to arrow is faster than I anticipated, and I wanted to check if someone went
> through this exercise before trying more exotic benches.
>
> Best,
> Jorge
>
> [1] https://github.com/dataEngineeringLabs/arrow2-benches
>
>
> On Mon, Nov 1, 2021 at 3:37 AM Micah Kornfield 
> wrote:
>
>> Hi Jorge,
>>
>> > The results are a bit surprising: reading 2^20 rows of 

RE: Re: [VOTE][RESULT] Release Apache Arrow 6.0.0 - RC3

2021-11-02 Thread Daijiro Fukuda

I have created the Homebrew packages PR (task 12).

daipom


1. [done] bump version numbers
2. [done] upload source
3. [done] upload binaries
4. [done] update website
5. [depends-on-brew] upload ruby gems
6. [done] upload js packages
8. [done] upload C# packages
10. [in-pr] update conda recipes
11. [done] upload wheels/sdist to pypi
12. [in-pr] update homebrew packages
13. [done] update maven artifacts
14. [done] update msys2
15. [done] update R packages
16. [Ian] update vcpkg port
17. [done] update tags for Go modules
18. [done] update docs
19. [done] announced to mailing lists


On 2021/10/29 15:31:01 Neal Richardson wrote:
> R package has been accepted by CRAN, though we will have to patch and
> resubmit due to a sanitizer error (
> https://issues.apache.org/jira/browse/ARROW-14514 for the failure,
> https://issues.apache.org/jira/browse/ARROW-14515 for the missing CI--we
> test UBSAN with gcc but apparently CRAN also does UBSAN with clang, which
> is where this came up).
>
> Neal
>
> 1. [done] bump version numbers
> 2. [done] upload source
> 3. [done] upload binaries
> 4. [done] update website
> 5. [depends-on-brew] upload ruby gems
> 6. [done] upload js packages
> 8. [done] upload C# packages
> 10. [in-pr] update conda recipes
> 11. [done] upload wheels/sdist to pypi
> 12. [ ] update homebrew packages
> 13. [done] update maven artifacts
> 14. [done] update msys2
> 15. [done*] update R packages
> 16. [Ian] update vcpkg port
> 17. [done] update tags for Go modules
> 18. [done] update docs
> 19. [done] announced to mailing lists
>
>
> > On Wed, Oct 27, 2021 at 8:55 AM Sutou Kouhei  wrote:

> > >
> > > 1. [in-pr] bump version numbers
> > > 2. [done] upload source
> > > 3. [done] upload binaries
> > > 4. [in-pr] update website
> > > 5. [depends-on-brew] upload ruby gems
> > > 6. [done] upload js packages
> > > 8. [done] upload C# packages
> > > 10. [ ] update conda recipes
> > > 11. [done] upload wheels/sdist to pypi
> > > 12. [ ] update homebrew packages
> > > 13. [done] update maven artifacts
> > > 14. [done] update msys2
> > > 15. [Neal] update R packages
> > > 16. [Ian] update vcpkg port
> > > 17. [done] update tags for Go modules
> > > 18. [ ] update docs
> > >
> > > In 
> > > "Re: [VOTE][RESULT] Release Apache Arrow 6.0.0 - RC3" on Tue, 26 Oct
> > 2021 17:09:24 +0200,
> > > Krisztián Szűcs  wrote:
> > >
> > > > The current status of the post release tasks:
> > > >
> > > > 1. [in-pr] bump version numbers
> > > > 2. [done] upload source
> > > > 3. [done] upload binaries
> > > > 4. [in-pr] update website
> > > > 5. [depends-on-brew] upload ruby gems
> > > > 6. [done] upload js packages
> > > > 8. [done] upload C# packages
> > > > 10. [ ] update conda recipes
> > > > 11. [done] upload wheels/sdist to pypi
> > > > 12. [ ] update homebrew packages
> > > > 13. [done] update maven artifacts
> > > > 14. [ ] update msys2
> > > > 15. [Neal] update R packages
> > > > 16. [Ian] update vcpkg port
> > > > 17. [done] update tags for Go modules
> > > > 18. [ ] update docs
> > > >
> > > > On Tue, Oct 26, 2021 at 2:33 PM Krisztián Szűcs
> > > >  wrote:
> > > >>
> > > >> Resending with RESULT subject line.
> > > >>
> > > >> The VOTE carries with 3 binding +1 and 2 non-binding +1 votes.
> > > >>
> > > >> I'm starting the post release tasks and will keep you posted about
> > > >> the current status.
> > > >>
> > > >> Thanks everyone!
> > > >>
> > > >> >
> > > >> > On Tue, Oct 26, 2021 at 1:56 PM Benson Muite <
> > benson_mu...@emailplus.org> wrote:
> > > >> > >
> > > >> > > Ok. Thanks for the feedback.
> > > >> > >
> > > >> > > Javascript may have problems when using nohup
> > > >> > >
> > > >> > > so directly running
> > > >> > >
> > > >> > > env "TEST_DEFAULT=0" env "TEST_JS=1" bash
> > > >> > > dev/release/verify-release-candidate.sh source 6.0.0 3
> > > >> > >
> > > >> > > seems to work, but
> > > >> > >
> > > >> > > nohup env "TEST_DEFAULT=0" env "TEST_JS=1" bash
> > > >> > > dev/release/verify-release-candidate.sh source 6.0.0 3 > log.out &
> > > >> > >
> > > >> > > may not work [1].
> > > >> > >
> > > >> > > [1]
> > > >> > >
> > > >> > > https://stackoverflow.com/questions/16604176/error-ebadf-bad-file-descriptor-when-running-node-using-nohup-of-forever
> > > >> > >
> > > >> > > On 10/26/21 2:32 PM, Krisztián Szűcs wrote:
> > > >> > > > Thanks Benson for verifying!
> > > >> > > >
> > > >> > > > Created a jira to track the deprecation warnings [1] and seems like
> > > >> > > > you've already created a PR for the javascript issue [2].
> > > >> > > > Luckily, these issues are not blockers.
> > > >> > > >
> > > >> > > > [1]: https://issues.apache.org/jira/browse/ARROW-14468
> > > >> > > > [2]: https://github.com/apache/arrow/commit/b4bc846fcdf189ae0443b8445c3ef69fc4131764

> > > >> > > >
> > > >> > > >
> > > >> > > > On Sat, Oct 23, 2021 at 1:59 AM Benson Muite <
> > benson_mu...@emailplus.org> wrote:
> > > >> > > >>
> > > >> > > >> on Ubuntu 20.04 x86
> > > >> > > >>
> > > >> > > >> Checked sources (C++, Python, 

Re: Who monitors the Github Actions Cron jobs?

2021-11-02 Thread Krisztián Szűcs
Agree, we should move them to crossbow.

On Mon, Nov 1, 2021 at 10:01 PM Sutou Kouhei  wrote:
>
> +1
>
> In <20211101160628.6f5de9d0@fsol>
>   "Who monitors the Github Actions Cron jobs?" on Mon, 1 Nov 2021 16:06:28 
> +0100,
>   Antoine Pitrou  wrote:
>
> >
> > Hello,
> >
> > It appears the C++ Cron CI builds have been failing for a while, and
> > nobody has noticed:
> > https://github.com/apache/arrow/actions/workflows/cpp_cron.yml
> >
> > Should we promote these to crossbow builds instead?
> >
> > Regards
> >
> > Antoine.
> >
> >


Re: Synergies with Apache Avro?

2021-11-02 Thread Jorge Cardoso Leitão
Thank you all for all your comments.

To the first comments: thanks a lot for your suggestions. I tried with
mimalloc and there is indeed a -25% improvement for avro-rs. =)
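
For reference, the allocator swap itself is tiny. A minimal sketch, assuming
the `mimalloc` crate is added as a dependency -- no other code changes are
needed:

// Route every heap allocation (Box/Vec/String) through mimalloc.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Allocation-heavy code, such as row-by-row Avro decoding, picks up
    // the new allocator automatically.
    let rows: Vec<String> = (0..1_000_000).map(|i| i.to_string()).collect();
    println!("decoded {} rows", rows.len());
}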

> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
> of the complexity of parsing avro is the schema evolution rules, I haven't
> looked at whether the canonical implementations do any optimization for the
> happy case when reader and writer schema are the same.
>

The graph was for a single column of a constant string of 3 bytes ("foo")
each divided into (avro) blocks of 4000 rows each (default block size of
16kb). I also tried random strings of 3 bytes and 7 bytes, as well as an
integer column, and compressed blocks (deflate): with equal speedups.
Generic benchmarks like these are obviously favorable cases. I agree that schema evolution
adds extra CPU time, and that this is the happy case; I have not
benchmarked those yet.

With respect to being a single column, I agree. The second bench that you
saw is still a single column (of integers): I wanted to check whether the
cost was the allocation of the strings, or the elements of the rows (the
speedup is equivalent).

However, I pushed a new bench where we are reading 6 columns [string, bool,
int, string, string, string|null], speedup is 5x for mz-avro and 4x for
avro-rs on my machine @ 2^20 rows (pushed latest code to main [1]).
[image: avro_read_mixed.png]

Wrt row iterations and native rows: my understanding is that even though
most Avro APIs present themselves as iterators of rows, internally they
read a whole compressed serialized block into memory, decompress it, and
then deserialize item by item into a row ("read block -> decompress block
-> decode item by item into rows -> read next block"). Avro is based on
batches of rows (blocks) that are compressed individually (similar to
parquet pages, but all column chunks are serialized in a single page within
a row group).

In this context, my thinking of Arrow vs Vec is that once loaded in
memory, a block behaves like a serialized blob that we can deserialize to
any in-memory format according to some rules.

My hypothesis (we can bench this) is that if the user wants to perform any
compute over the data, it is advantageous to load the block to arrow
(decompressed block -> RecordBatch), benefiting from arrow's analytics
performance instead, as opposed to using a native row-based format where we
can't leverage SIMD/cache hits/must allocate and deallocate on every item.
As usual, there are use-cases where this does not hold - I am thinking in
terms of traditional ETL / CPU intensive stuff.
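
To make the hypothesis concrete, here is a hand-rolled sketch of the two
decode targets (the types below are illustrative stand-ins, not arrow2's
API). Decoding a block into columnar buffers amortizes allocations across
the whole block, while the row target allocates per row:

#[allow(dead_code)]
struct Row { id: i64, name: String }

// Row-oriented target: one heap allocation per row for `name`.
fn decode_rows(block: &[(i64, &str)]) -> Vec<Row> {
    block.iter().map(|&(id, name)| Row { id, name: name.to_string() }).collect()
}

// Column-oriented (Arrow-style) target: values packed into one contiguous
// buffer plus offsets, enabling SIMD-friendly scans and few allocations.
struct StringColumn { offsets: Vec<i32>, values: Vec<u8> }

fn decode_columns(block: &[(i64, &str)]) -> (Vec<i64>, StringColumn) {
    let mut ids = Vec::with_capacity(block.len());
    let mut col = StringColumn { offsets: vec![0], values: Vec::new() };
    for &(id, name) in block {
        ids.push(id);
        col.values.extend_from_slice(name.as_bytes());
        col.offsets.push(col.values.len() as i32);
    }
    (ids, col)
}

fn main() {
    // A tiny decompressed "block", as handed over by the Avro block reader.
    let block = [(1i64, "foo"), (2, "bar"), (3, "baz")];
    let rows = decode_rows(&block);
    let (ids, col) = decode_columns(&block);
    println!("{} rows; {} ids; {} value bytes", rows.len(), ids.len(), col.values.len());
}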

My surprise is that even without the compute in mind, deserializing blocks
to arrow is faster than I anticipated, and I wanted to check if someone went
through this exercise before trying more exotic benches.

Best,
Jorge

[1] https://github.com/dataEngineeringLabs/arrow2-benches


On Mon, Nov 1, 2021 at 3:37 AM Micah Kornfield 
wrote:

> Hi Jorge,
>
> > The results are a bit surprising: reading 2^20 rows of 3 byte strings is
> > ~6x faster than the official Avro Rust implementation and ~20x faster vs
> > "fastavro"
>
>
> This sentence is a little bit hard to parse.  Is a row of 3 strings or a
> row of 1 string consisting of 3 bytes?  Was the example hard-coded?  A lot
> of the complexity of parsing avro is the schema evolution rules, I haven't
> looked at whether the canonical implementations do any optimization for the
> happy case when reader and writer schema are the same.
>
> There is a "Java Avro -> Arrow" implementation checked but it is somewhat
> broken today (I filed an issue on this a while ago) that delegates parsing
> the t/from the Avro java library.  I also think there might be faster
> implementations that aren't the canonical implementations (I seem to recall
> a JIT version for java for example and fastavro is another).  For both Java
> and Python I'd imagine there would be some decent speed improvements simply
> by avoiding the "boxing" task of moving language primitive types to native
> memory.
>
> I was planning (and still might get to it sometime in 2022) to have a C++
> parser for Avro.  Wes cross-posted this to the Avro mailing list when I
> thought I had time to work on it a couple of years ago and I don't recall
> any response to it.  The Rust avro library I believe was also just recently
> adopted/donated into the Apache Avro project.
>
> Avro seems to be pretty common so having the ability to convert to and from
> it is I think is generally valuable.
>
> Cheers,
> Micah
>
>
> On Sun, Oct 31, 2021 at 12:26 PM Daniël Heres 
> wrote:
>
> > Rust makes it easy to swap the global allocator for e.g. mimalloc or
> > snmalloc, even without the library supporting a custom allocator. In
> > my experience this indeed helps with allocation heavy code (I have seen
> > changes of up to 30%).
> >
> > Best regards,
> >
> > Daniël
> >
> >
> > On Sun, Oct 

Re: [VOTE][RUST] Release Apache Arrow Rust 6.1.0 RC1

2021-11-02 Thread Andrew Lamb
A defect in the changelog [1] for 6.1.0 RC1 has been discovered.

Since this release has not yet been approved / released, are there any
opinions on creating a new release candidate or releasing the RC1 candidate?

Thanks,
Andrew

[1] https://github.com/apache/arrow-rs/pull/905

On Sun, Oct 31, 2021 at 2:53 PM Andy Grove  wrote:

> +1 (binding).
>
> Verified on Ubuntu 20.04.3 LTS.
>
> On Fri, Oct 29, 2021 at 8:07 AM Andrew Lamb  wrote:
>
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 6.1.0.
> >
> > This release candidate is based on commit:
> > 03d4f4cd3f4cdf39d97a1028471bd9cfbed9 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > NOTE that some versions of the release verification script had a bug [5]
> > which was recently fixed, so please ensure you are using the latest
> > version from master.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
> > [1]: https://github.com/apache/arrow-rs/tree/03d4f4cd3f4cdf39d97a1028471bd9cfbed9
> > [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-6.1.0-rc1
> > [3]: https://github.com/apache/arrow-rs/blob/03d4f4cd3f4cdf39d97a1028471bd9cfbed9/CHANGELOG.md
> > [4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
> > [5]: https://github.com/apache/arrow-rs/pull/882
> > -
> >
>