[DISCUSS][Python] combine_chunks and copies

2023-10-18 Thread Spencer Nelson
pyarrow.ChunkedArray.combine_chunks is a method which is documented as
"Flatten this ChunkedArray into a single non-chunked array."

Incidentally, it happens to *always* copy the underlying chunk data - even
if the ChunkedArray is composed of just a single contiguous chunk which
could be returned directly. That has a major performance impact on my
particular application, which calls `combine_chunks` on all ChunkedArrays
to compact them. When there is one chunk, this copy is unnecessary, but my
application spends about 5% to 15% of its total runtime just on these
copies!

A workaround is trivial to implement, but this seems like an unnecessary
footgun. However, the point has been raised that the incidental copy that
combine_chunks does may actually be part of its API, since users might
depend on that copy. This was brought up in a PR [0] and an issue [1].

My discussion topic: is this side-effect a part of the combine_chunks API?
If it is, I think it should be documented as such, opening the space for a
new method which avoids the unnecessary copy. If not, I think we should
improve its performance.
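
For reference, the workaround could look roughly like the sketch below. It
relies only on the ChunkedArray members `num_chunks`, `chunk(i)` and
`combine_chunks()`; the helper name `compact` is hypothetical. Note that
returning the single chunk as-is keeps any offset or slice on that chunk, so
this skips exactly the copy that some users may be depending on.

```python
def compact(chunked):
    """Return one contiguous array for a pyarrow.ChunkedArray-like object.

    Hypothetical helper: avoids the copy when there is exactly one chunk.
    """
    if chunked.num_chunks == 1:
        # Single chunk: hand back the existing array without copying.
        return chunked.chunk(0)
    # Zero or several chunks: fall back to the copying concatenation.
    return chunked.combine_chunks()
```

With pyarrow installed, `compact(pa.chunked_array([[1, 2, 3]]))` would be
expected to return the lone chunk itself rather than a fresh copy.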

---

[0]: "Optimize combine_chunks when there is only one chunk"
https://github.com/apache/arrow/pull/37319
[1]: "Concatenating a single array is a compaction utility"
https://github.com/apache/arrow/issues/37878


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Jonathan Keane
+1

-Jon


On Wed, Oct 18, 2023 at 2:26 PM Felipe Oliveira Carvalho <
felipe...@gmail.com> wrote:

> +1
>
> On Wed, Oct 18, 2023 at 2:49 PM Dewey Dunnington wrote:
>
> > +1!
> >
> > On Wed, Oct 18, 2023 at 2:14 PM Matt Topol wrote:
> > >
> > > +1
> > >
> > > > On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou wrote:
> > >
> > > > +1
> > > >
> > > > > On 18/10/2023 at 19:02, Benjamin Kietzman wrote:
> > > > > Hello all,
> > > > >
> > > > > I propose "vu" and "vz" as format strings for the Utf8View and
> > > > > BinaryView types in the Arrow C data interface [1].
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 - I'm in favor of these new C data format strings
> > > > > [ ] +0
> > > > > [ ] -1 - I'm against adding these new format strings because
> > > > >
> > > > > Ben Kietzman
> > > > >
> > > > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > > > >
> > > >
> >
>


Re: [VOTE][RUST] Release Apache Arrow Rust 48.0.0 RC2

2023-10-18 Thread Andrew Lamb
+1 (binding) -- thank you Raphael

Verified on x86 Mac

Hint for anyone else verifying, this is RC*2* (RC1 hit an issue[1])

Andrew

[1]: https://github.com/apache/arrow-rs/pull/4950

On Wed, Oct 18, 2023 at 12:39 PM L. C. Hsieh  wrote:

> +1 (binding)
>
> Verified on M1 Mac.
>
> Thanks Raphael.
>
> On Wed, Oct 18, 2023 at 6:59 AM Raphael Taylor-Davies
>  wrote:
> >
> > Hi,
> >
> > I would like to propose a release of Apache Arrow Rust Implementation,
> > version 48.0.0 *RC2*.
> >
> > Please note that there were issues with the first release candidate that
> > required cutting a second.
> >
> > This release candidate is based on commit:
> > 51ac6fec8755147cd6b1dfe7d76bfdcfacad0463 [1]
> >
> > The proposed release tarball and signatures are hosted at [2].
> >
> > The changelog is located at [3].
> >
> > Please download, verify checksums and signatures, run the unit tests,
> > and vote on the release. There is a script [4] that automates some of
> > the verification.
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 Release this as Apache Arrow Rust
> > [ ] +0
> > [ ] -1 Do not release this as Apache Arrow Rust  because...
> >
> > [1]: https://github.com/apache/arrow-rs/tree/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463
> > [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-48.0.0-rc2
> > [3]: https://github.com/apache/arrow-rs/blob/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463/CHANGELOG.md
> > [4]: https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
>


Re: Apache Arrow file format

2023-10-18 Thread Antoine Pitrou



The fact that they describe Arrow and Feather as distinct formats 
(they're not!) with different characteristics is a bit of a bummer.



On 18/10/2023 at 22:20, Andrew Lamb wrote:
> If you are looking for a more formal discussion and empirical analysis of
> the differences, I suggest reading "A Deep Dive into Common Open Formats
> for Analytical DBMSs" [1], a VLDB 2023 paper (best paper runner-up!) that
> compares and contrasts Arrow, Parquet, ORC and Feather file formats.
>
> [1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
>
> On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies wrote:
>
>> To further what others have already mentioned, the IPC file format is
>> primarily optimised for IPC use-cases, that is exchanging the entire
>> contents between processes. It is relatively inexpensive to encode and
>> decode, and supports all arrow datatypes, making it ideal for things
>> like spill-to-disk processing, distributed shuffles, etc...
>>
>> Parquet by comparison is a storage format, optimised for space
>> efficiency and selective querying, with [1] containing an overview of
>> the various techniques the format affords. It is comparatively expensive
>> to encode and decode, and instead relies on index structures and
>> statistics to accelerate access.
>>
>> Both are therefore perfectly viable options depending on your particular
>> use-case.
>>
>> [1]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>>
>> On 18/10/2023 13:59, Dewey Dunnington wrote:
>>> Plenty of opinions here already, but I happen to think that IPC
>>> streams and/or Arrow File/Feather are wildly underutilized. For the
>>> use-case where you're mostly just going to read an entire file into R
>>> or Python it's a bit faster (and far superior to a CSV or pickling or
>>> .rds files in R).
>>>
>>>> you're going to read all the columns for a record batch in the file, no
>>>> matter what
>>>
>>> The metadata for each and every column in every record batch has to be
>>> read, but there's nothing inherent about the format that prevents
>>> selectively loading into memory only the required buffers. (I don't
>>> know off the top of my head if any reader implementation actually does
>>> this).
>>>
>>> On Wed, Oct 18, 2023 at 12:02 AM wish maple wrote:
>>>> Arrow IPC file is great, it focuses on in-memory representation and
>>>> direct computation.
>>>> Basically, it can support compression and dictionary encoding, and can
>>>> zero-copy deserialize the file to the in-memory Arrow format.
>>>>
>>>> Parquet provides some strong functionality, like Statistics, which could
>>>> help prune unnecessary data during scanning and avoid CPU and IO cost.
>>>> And it has highly efficient encoding, which could make the Parquet file
>>>> smaller than the Arrow IPC file for the same data. However, currently
>>>> some Arrow data types cannot be converted to the corresponding Parquet
>>>> types in the current arrow-cpp implementation. You can go to the Arrow
>>>> documentation to take a look.
>>>>
>>>> Adam Lippai wrote on Wed, Oct 18, 2023 at 10:50:
>>>>
>>>>> Also there is
>>>>> https://github.com/lancedb/lance between the two formats. Depending on
>>>>> the use case it can be a great choice.
>>>>>
>>>>> Best regards
>>>>> Adam Lippai
>>>>>
>>>>> On Tue, Oct 17, 2023 at 22:44 Matt Topol wrote:
>>>>>
>>>>>> One benefit of the feather format (i.e. Arrow IPC file format) is the
>>>>>> ability to mmap the file to easily handle reading sections of a
>>>>>> larger-than-memory file of data. Since, as Felipe mentioned, the
>>>>>> format is focused on in-memory representation, you can easily and
>>>>>> simply mmap the file and use the raw bytes directly. For a large file
>>>>>> that you only want to read sections of, this can be beneficial for IO
>>>>>> and memory usage.
>>>>>>
>>>>>> Unfortunately, you are correct that it doesn't allow for easy column
>>>>>> projecting (you're going to read all the columns for a record batch in
>>>>>> the file, no matter what). So it's going to be a trade off based on
>>>>>> your needs as to whether it makes sense, or if you should use a file
>>>>>> format like Parquet instead.
>>>>>>
>>>>>> -Matt
>>>>>>
>>>>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
>>>>>> felipe...@gmail.com> wrote:
>>>>>>
>>>>>>> It’s not the best since the format is really focused on in-memory
>>>>>>> representation and direct computation, but you can do it:
>>>>>>>
>>>>>>> https://arrow.apache.org/docs/python/feather.html
>>>>>>>
>>>>>>> --
>>>>>>> Felipe
>>>>>>>
>>>>>>> On Tue, 17 Oct 2023 at 23:26 Nara wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is it a good idea to use Apache Arrow as a file format? Looks like
>>>>>>>> projecting columns isn't available by default.
>>>>>>>>
>>>>>>>> One of the benefits of Parquet file format is column projection,
>>>>>>>> where the IO is limited to just the columns projected.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Nara


Re: Apache Arrow file format

2023-10-18 Thread Andrew Lamb
If you are looking for a more formal discussion and empirical analysis of
the differences, I suggest reading "A Deep Dive into Common Open Formats
for Analytical DBMSs" [1], a VLDB 2023 paper (best paper runner-up!) that
compares and contrasts the Arrow, Parquet, ORC and Feather file formats.

[1] https://www.vldb.org/pvldb/vol16/p3044-liu.pdf

On Wed, Oct 18, 2023 at 10:10 AM Raphael Taylor-Davies
 wrote:

> To further what others have already mentioned, the IPC file format is
> primarily optimised for IPC use-cases, that is exchanging the entire
> contents between processes. It is relatively inexpensive to encode and
> decode, and supports all arrow datatypes, making it ideal for things
> like spill-to-disk processing, distributed shuffles, etc...
>
> Parquet by comparison is a storage format, optimised for space
> efficiency and selective querying, with [1] containing an overview of
> the various techniques the format affords. It is comparatively expensive
> to encode and decode, and instead relies on index structures and
> statistics to accelerate access.
>
> Both are therefore perfectly viable options depending on your particular
> use-case.
>
> [1]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
>
> On 18/10/2023 13:59, Dewey Dunnington wrote:
> > Plenty of opinions here already, but I happen to think that IPC
> > streams and/or Arrow File/Feather are wildly underutilized. For the
> > use-case where you're mostly just going to read an entire file into R
> > or Python it's a bit faster (and far superior to a CSV or pickling or
> > .rds files in R).
> >
> >> you're going to read all the columns for a record batch in the file, no
> >> matter what
> >
> > The metadata for each and every column in every record batch has to be
> > read, but there's nothing inherent about the format that prevents
> > selectively loading into memory only the required buffers. (I don't
> > know off the top of my head if any reader implementation actually does
> > this).
> >
> > On Wed, Oct 18, 2023 at 12:02 AM wish maple wrote:
> >> Arrow IPC file is great, it focuses on in-memory representation and
> >> direct computation.
> >> Basically, it can support compression and dictionary encoding, and can
> >> zero-copy deserialize the file to the in-memory Arrow format.
> >>
> >> Parquet provides some strong functionality, like Statistics, which could
> >> help prune unnecessary data during scanning and avoid CPU and IO cost.
> >> And it has highly efficient encoding, which could make the Parquet file
> >> smaller than the Arrow IPC file for the same data. However, currently
> >> some Arrow data types cannot be converted to the corresponding Parquet
> >> types in the current arrow-cpp implementation. You can go to the Arrow
> >> documentation to take a look.
> >>
> >> Adam Lippai wrote on Wed, Oct 18, 2023 at 10:50:
> >>
> >>> Also there is
> >>> https://github.com/lancedb/lance between the two formats. Depending on
> >>> the use case it can be a great choice.
> >>>
> >>> Best regards
> >>> Adam Lippai
> >>>
> >>> On Tue, Oct 17, 2023 at 22:44 Matt Topol wrote:
> >>>
> >>>> One benefit of the feather format (i.e. Arrow IPC file format) is the
> >>>> ability to mmap the file to easily handle reading sections of a
> >>>> larger-than-memory file of data. Since, as Felipe mentioned, the format
> >>>> is focused on in-memory representation, you can easily and simply mmap
> >>>> the file and use the raw bytes directly. For a large file that you only
> >>>> want to read sections of, this can be beneficial for IO and memory
> >>>> usage.
> >>>>
> >>>> Unfortunately, you are correct that it doesn't allow for easy column
> >>>> projecting (you're going to read all the columns for a record batch in
> >>>> the file, no matter what). So it's going to be a trade off based on
> >>>> your needs as to whether it makes sense, or if you should use a file
> >>>> format like Parquet instead.
> >>>>
> >>>> -Matt
> >>>>
> >>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> >>>> felipe...@gmail.com> wrote:
> >>>>
> >>>>> It’s not the best since the format is really focused on in-memory
> >>>>> representation and direct computation, but you can do it:
> >>>>>
> >>>>> https://arrow.apache.org/docs/python/feather.html
> >>>>>
> >>>>> --
> >>>>> Felipe
> >>>>>
> >>>>> On Tue, 17 Oct 2023 at 23:26 Nara wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Is it a good idea to use Apache Arrow as a file format? Looks like
> >>>>>> projecting columns isn't available by default.
> >>>>>>
> >>>>>> One of the benefits of Parquet file format is column projection,
> >>>>>> where the IO is limited to just the columns projected.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Nara
> >>
>


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Felipe Oliveira Carvalho
+1

On Wed, Oct 18, 2023 at 2:49 PM Dewey Dunnington
 wrote:

> +1!
>
> On Wed, Oct 18, 2023 at 2:14 PM Matt Topol  wrote:
> >
> > +1
> >
> > On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou wrote:
> >
> > > +1
> > >
> > > On 18/10/2023 at 19:02, Benjamin Kietzman wrote:
> > > > Hello all,
> > > >
> > > > I propose "vu" and "vz" as format strings for the Utf8View and
> > > > BinaryView types in the Arrow C data interface [1].
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > [ ] +1 - I'm in favor of these new C data format strings
> > > > [ ] +0
> > > > [ ] -1 - I'm against adding these new format strings because
> > > >
> > > > Ben Kietzman
> > > >
> > > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > > >
> > >
>


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Dewey Dunnington
+1!

On Wed, Oct 18, 2023 at 2:14 PM Matt Topol  wrote:
>
> +1
>
> On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou  wrote:
>
> > +1
> >
> > On 18/10/2023 at 19:02, Benjamin Kietzman wrote:
> > > Hello all,
> > >
> > > I propose "vu" and "vz" as format strings for the Utf8View and
> > > BinaryView types in the Arrow C data interface [1].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 - I'm in favor of these new C data format strings
> > > [ ] +0
> > > [ ] -1 - I'm against adding these new format strings because
> > >
> > > Ben Kietzman
> > >
> > > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> > >
> >


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Matt Topol
+1

On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou  wrote:

> +1
>
> On 18/10/2023 at 19:02, Benjamin Kietzman wrote:
> > Hello all,
> >
> > I propose "vu" and "vz" as format strings for the Utf8View and
> > BinaryView types in the Arrow C data interface [1].
> >
> > The vote will be open for at least 72 hours.
> >
> > [ ] +1 - I'm in favor of these new C data format strings
> > [ ] +0
> > [ ] -1 - I'm against adding these new format strings because
> >
> > Ben Kietzman
> >
> > [1] https://arrow.apache.org/docs/format/CDataInterface.html
> >
>


Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Antoine Pitrou

+1

On 18/10/2023 at 19:02, Benjamin Kietzman wrote:
> Hello all,
>
> I propose "vu" and "vz" as format strings for the Utf8View and
> BinaryView types in the Arrow C data interface [1].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of these new C data format strings
> [ ] +0
> [ ] -1 - I'm against adding these new format strings because
>
> Ben Kietzman
>
> [1] https://arrow.apache.org/docs/format/CDataInterface.html



[VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Benjamin Kietzman
Hello all,

I propose "vu" and "vz" as format strings for the Utf8View and
BinaryView types in the Arrow C data interface [1].

The vote will be open for at least 72 hours.

[ ] +1 - I'm in favor of these new C data format strings
[ ] +0
[ ] -1 - I'm against adding these new format strings because

Ben Kietzman

[1] https://arrow.apache.org/docs/format/CDataInterface.html
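
To show how the proposed strings would sit next to the existing binary and
string ones, here is a small lookup sketch. The "z", "Z", "u" and "U" entries
are the format strings already listed in the C data interface document [1];
the "vz" and "vu" entries are only what this vote proposes. The function name
`describe_format` and the readable labels are illustrative assumptions, not
part of any Arrow API.

```python
# Format strings already defined by the C data interface for binary-like types.
EXISTING = {
    "z": "binary",
    "Z": "large binary",
    "u": "utf8 string",
    "U": "large utf8 string",
}

# Format strings proposed in this vote (hypothetical until the vote passes).
PROPOSED = {
    "vz": "binary view",
    "vu": "utf8 view",
}


def describe_format(fmt):
    """Return a readable label for a binary-like format string, or None."""
    table = {**EXISTING, **PROPOSED}
    return table.get(fmt)
```

A consumer that does not recognise a format string gets `None` here and can
decide for itself whether to reject the schema.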


Re: [VOTE][RUST] Release Apache Arrow Rust 48.0.0 RC2

2023-10-18 Thread L. C. Hsieh
+1 (binding)

Verified on M1 Mac.

Thanks Raphael.

On Wed, Oct 18, 2023 at 6:59 AM Raphael Taylor-Davies
 wrote:
>
> Hi,
>
> I would like to propose a release of Apache Arrow Rust Implementation,
> version 48.0.0 *RC2*.
>
> Please note that there were issues with the first release candidate that
> required cutting a second.
>
> This release candidate is based on commit:
> 51ac6fec8755147cd6b1dfe7d76bfdcfacad0463 [1]
>
> The proposed release tarball and signatures are hosted at [2].
>
> The changelog is located at [3].
>
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. There is a script [4] that automates some of
> the verification.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Release this as Apache Arrow Rust
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow Rust  because...
>
> [1]:
> https://github.com/apache/arrow-rs/tree/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463
> [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-48.0.0-rc2
> [3]:
> https://github.com/apache/arrow-rs/blob/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463/CHANGELOG.md
> [4]:
> https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh


Re: Apache Arrow file format

2023-10-18 Thread Raphael Taylor-Davies
To further what others have already mentioned, the IPC file format is 
primarily optimised for IPC use-cases, that is exchanging the entire 
contents between processes. It is relatively inexpensive to encode and 
decode, and supports all arrow datatypes, making it ideal for things 
like spill-to-disk processing, distributed shuffles, etc...


Parquet by comparison is a storage format, optimised for space 
efficiency and selective querying, with [1] containing an overview of 
the various techniques the format affords. It is comparatively expensive 
to encode and decode, and instead relies on index structures and 
statistics to accelerate access.


Both are therefore perfectly viable options depending on your particular 
use-case.


[1]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/


On 18/10/2023 13:59, Dewey Dunnington wrote:
> Plenty of opinions here already, but I happen to think that IPC
> streams and/or Arrow File/Feather are wildly underutilized. For the
> use-case where you're mostly just going to read an entire file into R
> or Python it's a bit faster (and far superior to a CSV or pickling or
> .rds files in R).
>
>> you're going to read all the columns for a record batch in the file, no
>> matter what
>
> The metadata for each and every column in every record batch has to be
> read, but there's nothing inherent about the format that prevents
> selectively loading into memory only the required buffers. (I don't
> know off the top of my head if any reader implementation actually does
> this).
>
> On Wed, Oct 18, 2023 at 12:02 AM wish maple wrote:
>> Arrow IPC file is great, it focuses on in-memory representation and direct
>> computation.
>> Basically, it can support compression and dictionary encoding, and can
>> zero-copy deserialize the file to the in-memory Arrow format.
>>
>> Parquet provides some strong functionality, like Statistics, which could
>> help prune unnecessary data during scanning and avoid CPU and IO cost.
>> And it has highly efficient encoding, which could make the Parquet file
>> smaller than the Arrow IPC file for the same data. However, currently some
>> Arrow data types cannot be converted to the corresponding Parquet types
>> in the current arrow-cpp implementation. You can go to the Arrow
>> documentation to take a look.
>>
>> Adam Lippai wrote on Wed, Oct 18, 2023 at 10:50:
>>
>>> Also there is
>>> https://github.com/lancedb/lance between the two formats. Depending on
>>> the use case it can be a great choice.
>>>
>>> Best regards
>>> Adam Lippai
>>>
>>> On Tue, Oct 17, 2023 at 22:44 Matt Topol wrote:
>>>
>>>> One benefit of the feather format (i.e. Arrow IPC file format) is the
>>>> ability to mmap the file to easily handle reading sections of a
>>>> larger-than-memory file of data. Since, as Felipe mentioned, the format
>>>> is focused on in-memory representation, you can easily and simply mmap
>>>> the file and use the raw bytes directly. For a large file that you only
>>>> want to read sections of, this can be beneficial for IO and memory usage.
>>>>
>>>> Unfortunately, you are correct that it doesn't allow for easy column
>>>> projecting (you're going to read all the columns for a record batch in
>>>> the file, no matter what). So it's going to be a trade off based on your
>>>> needs as to whether it makes sense, or if you should use a file format
>>>> like Parquet instead.
>>>>
>>>> -Matt
>>>>
>>>> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
>>>> felipe...@gmail.com> wrote:
>>>>
>>>>> It’s not the best since the format is really focused on in-memory
>>>>> representation and direct computation, but you can do it:
>>>>>
>>>>> https://arrow.apache.org/docs/python/feather.html
>>>>>
>>>>> --
>>>>> Felipe
>>>>>
>>>>> On Tue, 17 Oct 2023 at 23:26 Nara wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is it a good idea to use Apache Arrow as a file format? Looks like
>>>>>> projecting columns isn't available by default.
>>>>>>
>>>>>> One of the benefits of Parquet file format is column projection, where
>>>>>> the IO is limited to just the columns projected.
>>>>>>
>>>>>> Regards,
>>>>>> Nara



[VOTE][RUST] Release Apache Arrow Rust 48.0.0 RC2

2023-10-18 Thread Raphael Taylor-Davies

Hi,

I would like to propose a release of Apache Arrow Rust Implementation, 
version 48.0.0 *RC2*.


Please note that there were issues with the first release candidate that 
required cutting a second.


This release candidate is based on commit: 
51ac6fec8755147cd6b1dfe7d76bfdcfacad0463 [1]


The proposed release tarball and signatures are hosted at [2].

The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. There is a script [4] that automates some of
the verification.

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow Rust
[ ] +0
[ ] -1 Do not release this as Apache Arrow Rust  because...

[1]: 
https://github.com/apache/arrow-rs/tree/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463

[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-rs-48.0.0-rc2
[3]: 
https://github.com/apache/arrow-rs/blob/51ac6fec8755147cd6b1dfe7d76bfdcfacad0463/CHANGELOG.md
[4]: 
https://github.com/apache/arrow-rs/blob/master/dev/release/verify-release-candidate.sh
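
For anyone checking the checksum by hand rather than through the script [4],
a minimal sketch of that one step might look like this. It assumes the
checksum file holds the hex digest as its first whitespace-separated token
(the layout `sha256sum` writes); the function names are hypothetical, and
signature verification is not covered here.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file incrementally so a large tarball need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def checksum_matches(tarball, checksum_file, algorithm="sha256"):
    """Compare the tarball digest with the first token of the checksum file."""
    expected = Path(checksum_file).read_text().split()[0].lower()
    return file_digest(tarball, algorithm) == expected
```

Passing `algorithm="sha512"` would check a SHA-512 file the same way.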


Re: Help regarding setting up the r package in arrow apache

2023-10-18 Thread Jonathan Keane
For development of the R package with docker containers, the link [1] that
Nic sent in this same thread is the place to go. In addition to that
docker-focused one, there are a handful of others that might prove useful
to you in getting your development environment set up [2].

If you run into any issues, feel free to post here, but it's helpful to do
so with debugging mode on (i.e. set the env var ARROW_DEV to true) and to
provide the exact commands you sent along with the output you're seeing so
we can help diagnose what's going wrong.

[1] – https://arrow.apache.org/docs/r/articles/developers/docker.html
[2] – https://arrow.apache.org/docs/r/articles/index.html#developer-guides

-Jon


On Wed, Oct 18, 2023 at 2:48 AM Divyansh Khatri 
wrote:

> I am trying to contribute to the Arrow project, so I am trying to set up
> the project locally.
>
> On Tue, 17 Oct 2023 at 05:14, Bryce Mecum  wrote:
>
> > That error makes it look like you're running `docker compose up` from
> > the root of the Arrow source tree which is likely not what you want.
> > Are you trying to use the Arrow R package in a Docker container or are
> > you trying to contribute to it by developing inside of a Docker
> > container? Nic's link [1] is a good starting point.
> >
> > [1] https://arrow.apache.org/docs/r/articles/developers/docker.html
> >
> > On Mon, Oct 16, 2023 at 4:31 AM Divyansh Khatri
> >  wrote:
> > >
> > > > Hi, so I am basically using the docker command 'docker compose up -d'
> > > > with the docker-compose.yml, but I am encountering this error (Error
> > > > response from daemon: manifest for amd64/maven:3.5.4-eclipse-temurin-8
> > > > not found: manifest unknown: manifest unknown), so I am not sure how
> > > > to proceed from here.
> > >
> > > On Mon, 16 Oct 2023 at 14:17, Nic Crane  wrote:
> > >
> > > > Hi Divyansh,
> > > >
> > > > There are instructions for creating a R package dev setup here:
> > > > https://arrow.apache.org/docs/r/articles/developers/setup.html
> > > >
> > > > > If you can explain a bit more about what you've tried so far and
> > > > > what's not working, we may be able to advise.
> > > > >
> > > > > Best wishes,
> > > > >
> > > > > Nic
> > > > >
> > > > > On Mon, 16 Oct 2023 at 06:02, Divyansh Khatri <
> > > > > divyanshkhatri...@gmail.com> wrote:
> > > > >
> > > > > > I am having problems setting up the R package of Apache Arrow
> > > > > > using Docker. Can you give me a step-by-step process for setting
> > > > > > up the R package in VS Code on my system using Docker?
> > > > >
> > > >
> >
>


Re: Apache Arrow file format

2023-10-18 Thread Dewey Dunnington
Plenty of opinions here already, but I happen to think that IPC
streams and/or Arrow File/Feather are wildly underutilized. For the
use-case where you're mostly just going to read an entire file into R
or Python it's a bit faster (and far superior to a CSV or pickling or
.rds files in R).

> you're going to read all the columns for a record batch in the file, no 
> matter what

The metadata for each and every column in every record batch has to be
read, but there's nothing inherent about the format that prevents
selectively loading into memory only the required buffers. (I don't
know off the top of my head if any reader implementation actually does
this).
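
To make that point concrete, here is a toy sketch. It is NOT the Arrow IPC
layout: the file format, `write_toy_file` and `read_column` are made up for
illustration. It only shows the general idea that once a footer records where
each column's buffer lives, a reader can mmap the file, read the metadata for
all columns, and copy out just the bytes of the one column it needs.

```python
import json
import mmap
import struct
from array import array


def write_toy_file(path, columns):
    """Write int64 columns back to back, then a JSON footer of [offset, length]."""
    index, pos = {}, 0
    with open(path, "wb") as f:
        for name, values in columns.items():
            buf = array("q", values).tobytes()
            f.write(buf)
            index[name] = [pos, len(buf)]
            pos += len(buf)
        footer = json.dumps(index).encode()
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))  # footer length goes last


def read_column(path, name):
    """Read the footer for every column, but copy only one column's bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = mm.size()
            (footer_len,) = struct.unpack("<I", mm[size - 4:size])
            index = json.loads(mm[size - 4 - footer_len:size - 4])
            offset, length = index[name]
            # Only this slice of the mapping is copied into process memory.
            return array("q", mm[offset:offset + length]).tolist()
```

Whether a given Arrow reader actually does this is a separate question, as
noted above; the sketch only illustrates that a metadata-plus-buffers layout
does not rule it out.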

On Wed, Oct 18, 2023 at 12:02 AM wish maple  wrote:
>
> Arrow IPC file is great, it focuses on in-memory representation and direct
> computation.
> Basically, it can support compression and dictionary encoding, and can
> zero-copy deserialize the file to the in-memory Arrow format.
>
> Parquet provides some strong functionality, like Statistics, which could
> help prune unnecessary data during scanning and avoid CPU and IO cost.
> And it has highly efficient encoding, which could make the Parquet file
> smaller than the Arrow IPC file for the same data. However, currently some
> Arrow data types cannot be converted to the corresponding Parquet types
> in the current arrow-cpp implementation. You can go to the Arrow
> documentation to take a look.
>
> Adam Lippai wrote on Wed, Oct 18, 2023 at 10:50:
>
> > Also there is
> > https://github.com/lancedb/lance between the two formats. Depending on the
> > use case it can be a great choice.
> >
> > Best regards
> > Adam Lippai
> >
> > On Tue, Oct 17, 2023 at 22:44 Matt Topol  wrote:
> >
> > > One benefit of the feather format (i.e. Arrow IPC file format) is the
> > > ability to mmap the file to easily handle reading sections of a
> > > larger-than-memory file of data. Since, as Felipe mentioned, the format
> > > is focused on in-memory representation, you can easily and simply mmap
> > > the file and use the raw bytes directly. For a large file that you only
> > > want to read sections of, this can be beneficial for IO and memory usage.
> > >
> > > Unfortunately, you are correct that it doesn't allow for easy column
> > > projecting (you're going to read all the columns for a record batch in
> > > the file, no matter what). So it's going to be a trade off based on your
> > > needs as to whether it makes sense, or if you should use a file format
> > > like Parquet instead.
> > >
> > > -Matt
> > >
> > >
> > > On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> > > felipe...@gmail.com>
> > > wrote:
> > >
> > > > It’s not the best since the format is really focused on in-memory
> > > > representation and direct computation, but you can do it:
> > > >
> > > > https://arrow.apache.org/docs/python/feather.html
> > > >
> > > > —
> > > > Felipe
> > > >
> > > > On Tue, 17 Oct 2023 at 23:26 Nara wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Is it a good idea to use Apache Arrow as a file format? Looks like
> > > > > projecting columns isn't available by default.
> > > > >
> > > > > One of the benefits of Parquet file format is column projection, where
> > > > > the IO is limited to just the columns projected.
> > > > >
> > > > > Regards,
> > > > > Nara
> > > > >
> > > >
> > >
> >


Re: Help regarding setting up the r package in arrow apache

2023-10-18 Thread Divyansh Khatri
I am trying to contribute to the Arrow project, so I am trying to set up
the project locally.

On Tue, 17 Oct 2023 at 05:14, Bryce Mecum  wrote:

> That error makes it look like you're running `docker compose up` from
> the root of the Arrow source tree which is likely not what you want.
> Are you trying to use the Arrow R package in a Docker container or are
> you trying to contribute to it by developing inside of a Docker
> container? Nic's link [1] is a good starting point.
>
> [1] https://arrow.apache.org/docs/r/articles/developers/docker.html
>
> On Mon, Oct 16, 2023 at 4:31 AM Divyansh Khatri
>  wrote:
> >
> > > Hi, so I am basically using the docker command 'docker compose up -d'
> > > with the docker-compose.yml, but I am encountering this error (Error
> > > response from daemon: manifest for amd64/maven:3.5.4-eclipse-temurin-8
> > > not found: manifest unknown: manifest unknown), so I am not sure how
> > > to proceed from here.
> >
> > On Mon, 16 Oct 2023 at 14:17, Nic Crane  wrote:
> >
> > > Hi Divyansh,
> > >
> > > There are instructions for creating a R package dev setup here:
> > > https://arrow.apache.org/docs/r/articles/developers/setup.html
> > >
> > > > If you can explain a bit more about what you've tried so far and
> > > > what's not working, we may be able to advise.
> > > >
> > > > Best wishes,
> > > >
> > > > Nic
> > > >
> > > > On Mon, 16 Oct 2023 at 06:02, Divyansh Khatri <
> > > > divyanshkhatri...@gmail.com> wrote:
> > > >
> > > > > I am having problems setting up the R package of Apache Arrow
> > > > > using Docker. Can you give me a step-by-step process for setting
> > > > > up the R package in VS Code on my system using Docker?
> > > >
> > >
>


Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-18 Thread Alenka Frim
Congrats and welcome Curt!

On Tue, Oct 17, 2023 at 3:06 PM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Welcome to the team, Curt!
>
> On Mon, 16 Oct 2023 at 23:17, Curt Hagenlocher wrote:
> >
> > Thanks, all!
> >
> > On Mon, Oct 16, 2023 at 9:19 AM Dane Pitkin wrote:
> >
> > > Congrats Curt!
> > >
> > > On Mon, Oct 16, 2023 at 12:00 PM Kevin Gurney wrote:
> > >
> > > > Congratulations, Curt!
> > > > >
> > > > > From: Weston Pace
> > > > > Sent: Sunday, October 15, 2023 5:32 PM
> > > > > To: dev@arrow.apache.org
> > > > > Subject: Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher
> > > > >
> > > > > Congratulations!
> > > > >
> > > > > On Sun, Oct 15, 2023, 8:51 AM Gang Wu wrote:
> > > > >
> > > > > > Congrats!
> > > > > >
> > > > > > On Sun, Oct 15, 2023 at 10:49 PM David Li wrote:
> > > > > >
> > > > > > > Congrats & welcome Curt!
> > > > > > >
> > > > > > > On Sun, Oct 15, 2023, at 09:03, wish maple wrote:
> > > > > > > > Congratulations!
> > > > > > > >
> > > > > > > > Raúl Cumplido wrote on Sun, Oct 15, 2023 at 20:48:
> > > > > > > >
> > > > > > > >> Congratulations and welcome!
> > > > > > > >>
> > > > > > > >> On Sun, 15 Oct 2023, 13:57, Ian Cook wrote:
> > > > > > > >>
> > > > > > > >> > Congratulations Curt!
> > > > > > > >> >
> > > > > > > >> > On Sun, Oct 15, 2023 at 05:32 Andrew Lamb <al...@influxdata.com>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > On behalf of the Arrow PMC, I'm happy to announce that Curt
> > > > > > > >> > > Hagenlocher has accepted an invitation to become a committer
> > > > > > > >> > > on Apache Arrow. Welcome, and thank you for your contributions!
> > > > > > >> > >
> > > > > > >> > > Andrew
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
>