Re: [DISCUSS] Deprecate UCX transport for Arrow Flight in favor of Dissociated IPC Protocol

2024-07-17 Thread Adam Lippai
Hi Raul,

Finishing an experiment is good; it can enable further exploration in the
future (as long as the community doesn't see it as baggage to carry forever).

Do you have any conclusions, or a summary of what was learned?

I might be wrong, but my understanding was that the initial goal was
replacing the TCP+TLS+HTTP/2+gRPC stack. The Dissociated IPC Protocol handles
the IPC format. Is there anything low-level in the works that focuses on the
network/transport layer for HPC users and data centers? Or did we learn that
gRPC is good enough and is not the bottleneck most of the time?

Best regards,
Adam Lippai



On Wed, Jul 17, 2024 at 12:30 Raúl Cumplido  wrote:

> Hi,
>
> I've followed up with a PR to remove UCX transport for flight [1].
>
> Thanks,
> Raúl
>
> [1] https://github.com/apache/arrow/pull/43297
>
> On Wed, 19 Jun 2024 at 11:29, Raúl Cumplido  wrote:
> >
> > Hi,
> >
> > I would like to discuss deprecation of the UCX transport for Arrow
> > Flight (ARROW_WITH_UCX).
> >
> > From conversations I've had with Matt Topol and David Li:
> > - This was implemented as an experimental PoC in order to run some
> > benchmarks with flight over UCX [1]
> > - We should encourage usage of the Dissociated IPC Protocol instead of
> > that implementation [2]
> >
> > Some upstream systems are building flight with UCX and we should
> > probably not encourage its use.
> >
> > Are there any thoughts about it?
> >
> > Kind regards,
> > Raúl
> > [1] https://github.com/apache/arrow/pull/12442
> > [2] https://arrow.apache.org/docs/dev/format/DissociatedIPC.html
>


Re: [VOTE] Release Apache Arrow 17.0.0 - RC2

2024-07-15 Thread Adam Lippai
Do I read it correctly that there is a TPC-H regression in the R benchmark?

Best regards,
Adam Lippai

On Mon, Jul 15, 2024 at 06:05 Fokko Driesprong  wrote:

> Thanks to everyone who contributed to the new release!
>
> +1 (non-binding)
>
> I've tested against PyIceberg
> <https://github.com/apache/iceberg-python/pull/929>.
>
> Kind regards,
> Fokko
>
> On Mon, 15 Jul 2024 at 10:29, David Li wrote:
>
> > +1 (binding)
> >
> > Tested on Debian 12/x86_64
> >
> > On Mon, Jul 15, 2024, at 15:31, Gang Wu wrote:
> > > +1 (non-binding)
> > >
> > > Verified C++ on my M1 Mac by running:
> > > - TEST_DEFAULT=0 TEST_CPP=1 ./verify-release-candidate.sh 17.0.0 2
> > >
> > > BTW, I ran into this issue as well:
> > > https://github.com/apache/arrow/issues/43167
> > >
> > > Best,
> > > Gang
> > >
> > > On Mon, Jul 15, 2024 at 1:39 PM Jean-Baptiste Onofré 
> > > wrote:
> > >
> > >> +1 (non binding)
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On Fri, Jul 12, 2024 at 11:56 AM Raúl Cumplido 
> > wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I would like to propose the following release candidate (RC2) of
> > Apache
> > >> > Arrow version 17.0.0. This is a release consisting of 321
> > >> > resolved GitHub issues[1].
> > >> >
> > >> > This release candidate is based on commit:
> > >> > 6a2e19a852b367c72d7b12da4d104456491ed8b7 [2]
> > >> >
> > >> > The source release rc2 is hosted at [3].
> > >> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
> > >> > The changelog is located at [12].
> > >> >
> > >> > Please download, verify checksums and signatures, run the unit
> tests,
> > >> > and vote on the release. See [13] for how to validate a release
> > >> candidate.
> > >> >
> > >> > See also a verification result on GitHub pull request [14].
> > >> >
> > >> > The vote will be open for at least 72 hours.
> > >> >
> > >> > [ ] +1 Release this as Apache Arrow 17.0.0
> > >> > [ ] +0
> > >> > [ ] -1 Do not release this as Apache Arrow 17.0.0 because...
> > >> >
> > >> > [1]:
> > >>
> >
> https://github.com/apache/arrow/issues?q=is%3Aissue+milestone%3A17.0.0+is%3Aclosed
> > >> > [2]:
> > >>
> >
> https://github.com/apache/arrow/tree/6a2e19a852b367c72d7b12da4d104456491ed8b7
> > >> > [3]:
> > >> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-17.0.0-rc2
> > >> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
> > >> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
> > >> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
> > >> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
> > >> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/17.0.0-rc2
> > >> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/17.0.0-rc2
> > >> > [10]:
> https://apache.jfrog.io/artifactory/arrow/python-rc/17.0.0-rc2
> > >> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
> > >> > [12]:
> > >>
> >
> https://github.com/apache/arrow/blob/6a2e19a852b367c72d7b12da4d104456491ed8b7/CHANGELOG.md
> > >> > [13]:
> > https://arrow.apache.org/docs/developers/release_verification.html
> > >> > [14]: https://github.com/apache/arrow/pull/43220
> > >>
> >
>


Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Adam Lippai
It's not strictly statistics, but would this also cover constraints and
indexes? For example table, record batch, and column primary keys, unique
keys, sort keys, bloom filters, HNSW indexes, and shape (ndarray for keys
xyz).

I'm not sure which backends (DB, Parquet, Lance) expose which of these
natively, but it might be worth considering for a minute.

Best regards,
Adam Lippai

On Sun, Jun 9, 2024 at 17:36 Sutou Kouhei  wrote:

> Hi,
>
> In 
>   "Re: [DISCUSS] Statistics through the C data interface" on Sun, 9 Jun
> 2024 22:11:54 +0200,
>   Antoine Pitrou  wrote:
>
> >>>> Fields:
> >>>> | Name   | Type          | Comments |
> >>>> |--------|---------------|----------|
> >>>> | column | utf8          | (2)      |
> >>>> | key    | utf8 not null | (3)      |
> >>>
> >>> 1. Should the key be something like `dictionary(int32, utf8)` to make
> >>> the representation more efficient where there are many columns?
> >> Dictionary is more efficient. But we need to standardize not
> >> only key but also ID -> key mapping.
> >
> > I don't get why we would need to standardize ID -> key mapping. The
> > key names would be significant, the dictionary mapping is just for
> > efficiency.
>
> Ah, space efficiency was only discussed here, right? I
> thought that computational efficiency is also discussed
> here. If we standardize ID -> key mapping, consumers don't
> need to compare key names.
>
> Example: We want to find "distinct_count" statistics.
>
> If we standardize ID -> key mapping (1 -> "distinct_count"),
> consumers can find "distinct_count" statistics by finding ID
> 1 entry.
>
> If we don't standardize ID -> key mapping, consumers need to
> compare key name to find "distinct_count" statistics.
>
>
> Anyway, this (string comparison) will not be a large
> overhead because (1) statistics data will not be large data
> and (2) consumers can cache ID -> key mapping to avoid
> duplicated string comparisons. So standardizing ID -> key
> mapping isn't required.
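
Purely as an illustration of the lookup described above (not part of the
proposal itself), a consumer could resolve the dictionary index for
"distinct_count" once and then compare integer indices instead of strings.
The statistics key names and the dictionary(int32, utf8) encoding are
assumptions taken from this thread:

import pyarrow as pa

# Hypothetical statistics "key" column, dictionary-encoded as discussed.
keys = pa.array(
    ["distinct_count", "null_count", "distinct_count"]
).dictionary_encode()

# Resolve the dictionary index for "distinct_count" once, cache it, and
# reuse it: this is the consumer-side ID -> key mapping mentioned above.
wanted_index = keys.dictionary.to_pylist().index("distinct_count")

# Subsequent lookups only compare integer indices, not strings.
matches = [
    row for row, idx in enumerate(keys.indices.to_pylist())
    if idx == wanted_index
]
print(matches)  # rows that carry the "distinct_count" statistic
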
>
>
> Thanks,
> --
> kou
>


Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
Spark supports writing v2, but defaults to v1:
hadoopConfiguration.set("parquet.writer.version", "v2")
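
For example, a minimal PySpark sketch of one way to apply this setting (the
spark.hadoop.* prefix route and the output path are assumptions, not the
only option):

from pyspark.sql import SparkSession

# The "spark.hadoop." prefix copies the setting into the job's Hadoop
# configuration, equivalent to the hadoopConfiguration.set() call above.
spark = (
    SparkSession.builder
    .appName("parquet-v2-write")
    .config("spark.hadoop.parquet.writer.version", "v2")
    .getOrCreate()
)

df = spark.range(1_000)  # toy DataFrame
df.write.mode("overwrite").parquet("/tmp/parquet_v2_example")  # hypothetical path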

Best regards,
Adam Lippai


On Wed, Apr 24, 2024 at 11:40 Prem Sahoo  wrote:

> They do support Reading of Parquet V2 , but writing is not supported by
> Spark for V2.
>
> On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai  wrote:
>
> > Hi Wes,
> >
> > As far as I remember hive, spark, impala, duckdb or even proprietary
> > systems like hyper, Vertica all support reading data page v2 now. The
> most
> > recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall
> > the support seems much better than a year or two ago.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Wed, Apr 24, 2024 at 10:51 Wes McKinney  wrote:
> >
> > > I think there is confusion about the Parquet "V2" (including the V2
> data
> > > pages, and other details) and the 2.x.y releases of the format library
> > > artifact. They aren't the same unfortunately. I don't think the V2
> > metadata
> > > structures (the data pages in particular, and new column encoding) is
> > > widely adopted / readable.
> > >
> > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace 
> > wrote:
> > >
> > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it
> is
> > > not
> > > > > official . They are advising not to use Parquet V2 for writing
> > (though
> > > > code
> > > > > is available ) .*
> > > >
> > > > This would be news to me.  Parquet releases are listed (by the
> parquet
> > > > community) at [1]
> > > >
> > > > The vote to release parquet 2.10 is here: [2]
> > > >
> > > > Neither of these links mention anything about this being an
> > experimental,
> > > > unofficial, or non-finalized release.
> > > >
> > > > I understand your concern.  I believe your quotes are coming from
> your
> > > > discussion on the parquet mailing list here [3].  This communication
> is
> > > > unfortunate and confusing to me as well.
> > > >
> > > > [1] https://parquet.apache.org/blog/
> > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > >
> > > >
> > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo 
> > wrote:
> > > >
> > > > > Hello Jacob,
> > > > > Thanks for the information, and my apologies for the weird format
> of
> > my
> > > > > email.
> > > > >
> > > > > This is the email from the Parquet community. May I know why
> pyarrow
> > is
> > > > > using Parquet V2 which is not official yet ?
> > > > >
> > > > > My question is from Parquet community V2 is not final yet so it is
> > not
> > > > > official yet.
> > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge.
> > > Parquet
> > > > V2
> > > > > as a standard isn't finalized just yet. Meaning there is no formal,
> > > > > *finalized* "contract" that specifies what it means to write data
> in
> > > the
> > > > V2
> > > > > version. The discussions/conversations about what the final V2
> > standard
> > > > may
> > > > > be are still in progress and are evolving.
> > > > >
> > > > > That being said, because V2 code does exist (though unfinalized),
> > there
> > > > are
> > > > > clients / tools that are writing data in the un-finalized V2
> format,
> > as
> > > > > seems to be the case with Dremio.
> > > > >
> > > > > Now, as that comment you quoted said, you can have Spark write V2
> > > files,
> > > > > but it's worth being mindful about the fact that V2 is a moving
> > target
> > > > and
> > > > > can (and likely will) change. You can overwrite
> > parquet.writer.version
> > > to
> > > > > specify your desired version, but it can be dangerous to produce
> data
> > > in
> > > > a
> > > > > moving-target format. For example, let's say you write a bunch of
> > data
> > > in
> > > > > Parquet V2, and then the community decides to make a breaking
> change
> > &g

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
Hi Wes,

As far as I remember, Hive, Spark, Impala, DuckDB, and even proprietary
systems like Hyper and Vertica all support reading data page v2 now. The
most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but
overall the support seems much better than a year or two ago.
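
As a concrete reference point, here is a minimal PyArrow sketch for
producing V2 data pages and the BYTE_STREAM_SPLIT encoding (write_table
parameter names as I recall them; exact availability depends on the pyarrow
version):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array([1.0, 2.5, 3.25], type=pa.float64())})

# data_page_version selects V1 vs V2 data pages; use_byte_stream_split
# applies the BYTE_STREAM_SPLIT encoding to the listed columns.
pq.write_table(
    table,
    "floats_v2.parquet",  # hypothetical output path
    data_page_version="2.0",
    use_byte_stream_split=["x"],
)

# Readers that do not understand V2 data pages yet may fail on this file,
# which is exactly the compatibility concern in this thread.
print(pq.read_table("floats_v2.parquet"))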

Best regards,
Adam Lippai

On Wed, Apr 24, 2024 at 10:51 Wes McKinney  wrote:

> I think there is confusion about the Parquet "V2" (including the V2 data
> pages, and other details) and the 2.x.y releases of the format library
> artifact. They aren't the same unfortunately. I don't think the V2 metadata
> structures (the data pages in particular, and new column encoding) is
> widely adopted / readable.
>
> On Wed, Apr 24, 2024 at 9:32 AM Weston Pace  wrote:
>
> > > *As per Apache Parquet Community Parquet V2 is not final yet so it is
> not
> > > official . They are advising not to use Parquet V2 for writing (though
> > code
> > > is available ) .*
> >
> > This would be news to me.  Parquet releases are listed (by the parquet
> > community) at [1]
> >
> > The vote to release parquet 2.10 is here: [2]
> >
> > Neither of these links mention anything about this being an experimental,
> > unofficial, or non-finalized release.
> >
> > I understand your concern.  I believe your quotes are coming from your
> > discussion on the parquet mailing list here [3].  This communication is
> > unfortunate and confusing to me as well.
> >
> > [1] https://parquet.apache.org/blog/
> > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> >
> >
> > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo  wrote:
> >
> > > Hello Jacob,
> > > Thanks for the information, and my apologies for the weird format of my
> > > email.
> > >
> > > This is the email from the Parquet community. May I know why pyarrow is
> > > using Parquet V2 which is not official yet ?
> > >
> > > My question is from Parquet community V2 is not final yet so it is not
> > > official yet.
> > > "Hi Prem - Maybe I can help clarify to the best of my knowledge.
> Parquet
> > V2
> > > as a standard isn't finalized just yet. Meaning there is no formal,
> > > *finalized* "contract" that specifies what it means to write data in
> the
> > V2
> > > version. The discussions/conversations about what the final V2 standard
> > may
> > > be are still in progress and are evolving.
> > >
> > > That being said, because V2 code does exist (though unfinalized), there
> > are
> > > clients / tools that are writing data in the un-finalized V2 format, as
> > > seems to be the case with Dremio.
> > >
> > > Now, as that comment you quoted said, you can have Spark write V2
> files,
> > > but it's worth being mindful about the fact that V2 is a moving target
> > and
> > > can (and likely will) change. You can overwrite parquet.writer.version
> to
> > > specify your desired version, but it can be dangerous to produce data
> in
> > a
> > > moving-target format. For example, let's say you write a bunch of data
> in
> > > Parquet V2, and then the community decides to make a breaking change
> > (which
> > > is completely fine / allowed since V2 isn't finalized). You are now
> left
> > > having to deal with a potentially large and complicated file format
> > update.
> > > That's why it's not recommended to write files in parquet v2 just yet."
> > >
> > >
> > > *As per Apache Parquet Community Parquet V2 is not final yet so it is
> not
> > > official . They are advising not to use Parquet V2 for writing (though
> > code
> > > is available ) .*
> > >
> > >
> > > *As per above Spark hasn't started using Parquet V2 for writing *.
> > >
> > > May I know how an unstable /unofficial  version is being used in
> pyarrow
> > ?
> > >
> > >
> > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > First off, please try to clean up formating of emails to be legible
> > when
> > > > forwarding/quoting previous messages multiple times, especially when
> > most
> > > > of the quotes do not contain any useful information. It makes it much
> > > > easier to parse the message and thus quicker to answer.
> >

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Adam Lippai
As an outsider I suspect the only reason for these "common beliefs" is that
Spark simply doesn't support some of the breaking features (e.g. the
nanosecond data type). Maybe closing the very few gaps would resolve the
issue for good.
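
As an illustration of closing such a gap from the producer side, a minimal
PyArrow sketch that downcasts nanosecond timestamps when writing files for
older Spark readers (parameter names from pyarrow.parquet.write_table; the
compatibility claim itself is my assumption):

import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# A nanosecond-precision timestamp column, which some Spark versions
# cannot read back from Parquet.
table = pa.table({
    "ts": pa.array(
        [datetime.datetime(2024, 4, 24, 12, 0)], type=pa.timestamp("ns")
    ),
})

# Downcast to microseconds and pin an older logical-type version so the
# file stays readable by consumers that lag behind the format.
pq.write_table(
    table,
    "spark_compatible.parquet",  # hypothetical output path
    coerce_timestamps="us",
    allow_truncated_timestamps=True,
    version="2.4",
)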

Best regards,
Adam Lippai

On Wed, Apr 24, 2024 at 10:32 Weston Pace  wrote:

> > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > official . They are advising not to use Parquet V2 for writing (though
> code
> > is available ) .*
>
> This would be news to me.  Parquet releases are listed (by the parquet
> community) at [1]
>
> The vote to release parquet 2.10 is here: [2]
>
> Neither of these links mention anything about this being an experimental,
> unofficial, or non-finalized release.
>
> I understand your concern.  I believe your quotes are coming from your
> discussion on the parquet mailing list here [3].  This communication is
> unfortunate and confusing to me as well.
>
> [1] https://parquet.apache.org/blog/
> [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
>
>
> On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo  wrote:
>
> > Hello Jacob,
> > Thanks for the information, and my apologies for the weird format of my
> > email.
> >
> > This is the email from the Parquet community. May I know why pyarrow is
> > using Parquet V2 which is not official yet ?
> >
> > My question is from Parquet community V2 is not final yet so it is not
> > official yet.
> > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet
> V2
> > as a standard isn't finalized just yet. Meaning there is no formal,
> > *finalized* "contract" that specifies what it means to write data in the
> V2
> > version. The discussions/conversations about what the final V2 standard
> may
> > be are still in progress and are evolving.
> >
> > That being said, because V2 code does exist (though unfinalized), there
> are
> > clients / tools that are writing data in the un-finalized V2 format, as
> > seems to be the case with Dremio.
> >
> > Now, as that comment you quoted said, you can have Spark write V2 files,
> > but it's worth being mindful about the fact that V2 is a moving target
> and
> > can (and likely will) change. You can overwrite parquet.writer.version to
> > specify your desired version, but it can be dangerous to produce data in
> a
> > moving-target format. For example, let's say you write a bunch of data in
> > Parquet V2, and then the community decides to make a breaking change
> (which
> > is completely fine / allowed since V2 isn't finalized). You are now left
> > having to deal with a potentially large and complicated file format
> update.
> > That's why it's not recommended to write files in parquet v2 just yet."
> >
> >
> > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > official . They are advising not to use Parquet V2 for writing (though
> code
> > is available ) .*
> >
> >
> > *As per above Spark hasn't started using Parquet V2 for writing *.
> >
> > May I know how an unstable /unofficial  version is being used in pyarrow
> ?
> >
> >
> > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak 
> > wrote:
> >
> > > Hello,
> > >
> > > First off, please try to clean up formating of emails to be legible
> when
> > > forwarding/quoting previous messages multiple times, especially when
> most
> > > of the quotes do not contain any useful information. It makes it much
> > > easier to parse the message and thus quicker to answer.
> > >
> > > The short answer is that we switched to 2.4 and more recently to 2.6 as
> > > the default to enable the usage of features these versions provide. As
> > you
> > > have correctly quoted from the docs you can still write 1.0 if you want
> > to
> > > ensure compatibility with systems that can not process the 'newer'
> > versions
> > > yet (2.6 was released in 2018!).
> > >
> > > You can find the long form discussions about these changes here:
> > > https://issues.apache.org/jira/browse/ARROW-12203
> > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > >
> > > Best
> > > Jacob
> > >
> > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > Hello Team,
> > > > Could you please share your thoughts about below questions?
> > > > Se

Re: [INFO] Arrow 16.0.0 feature freeze - 8th April

2024-03-14 Thread Adam Lippai
Pandas and NumPy will have major releases in the next month or so. Tracking
each other's timelines might help avoid unexpected breaks.

Best regards,
Adam Lippai

On Thu, Mar 14, 2024 at 11:00 Raúl Cumplido  wrote:

> Hi,
>
> In preparation for the next major Arrow release (16.0.0) I am planning
> on setting the 8th of April, second Monday of April, as the feature
> freeze date.
>
> If there are any issues that should block the release please remember
> to add the `Priority: blocker` label in the issue on GitHub. There are
> currently two issues identified as blockers [1].
>
> If the date doesn't work for someone, please let me know by answering
> on this email thread.
>
> Thanks,
> Raúl
>
> [1] https://github.com/apache/arrow/labels/Priority%3A%20Blocker
>


Re: [DISC][Java]: Migrate Arrow Java to JPMS Java Platform Module System

2023-12-05 Thread Adam Lippai
I believe Spark 4.0 was mentioned before. It’ll require Java 17 and will be
released in a few months (June?).

Best regards,
Adam Lippai

On Tue, Dec 5, 2023 at 12:05 David Li  wrote:

> Thanks James for delving into this mess.
>
> It looks like this change is unavoidable if we want to modularize? I think
> this is OK. Will the CLI argument change as we continue modularizing, or is
> this the only change that will be needed?
>
> On Mon, Dec 4, 2023, at 20:07, James Duong wrote:
> > Hello,
> >
> > I did some work to separate the below PR into smaller PRs.
> >
> >
> >   *   Updating the versions of dependencies and maven plugins is done
> > and merged into master.
> >   *   I separated out the work modularizing arrow-vector,
> > arrow-memory-core/unsafe, and arrow-memory-netty.
> >
> > Modularizing arrow-memory-core requires a smaller change to user
> > command-line arguments. Instead of:
> > --add-opens=java.base/java.nio=ALL-UNNAMED
> >
> > The user needs to add:
> > --add-opens=java.base/java.nio=org.apache.arrow.memory.core,ALL-UNNAMED
> >
> > I initially tried to modularize arrow-vector separately from
> > arrow-memory-core but found that any meaningful operation in
> > arrow-vector would trigger an illegal access in memory-core if it
> > wasn’t modularized.
> >
> > I was able to run the tests for arrow-compression and arrow-tools
> > successfully after modularizing memory-core, memory-unsafe-, and
> > arrow-vector. Note that I had more success by making memory-core and
> > memory-unsafe automatic modules.
> >
> > I think we should make a decision here on if we want to bite the bullet
> > and introduce a breaking user-facing change around command-line
> > options. The other option is to wait for JDK 21 to modularize. That’s
> > farther down the line and requires refactoring much of the memory
> > module code and implementing a module using the foreign memory
> > interface.
> >
> > From: James Duong 
> > Date: Tuesday, November 28, 2023 at 6:48 PM
> > To: dev@arrow.apache.org 
> > Subject: Re: [DISC][Java]: Migrate Arrow Java to JPMS Java Platform
> > Module System
> > Hi,
> >
> > I’ve made some major progress on this work in this PR:
> > https://github.com/apache/arrow/pull/38876
> >
> >
> >   *   The maven plugin for compiling module-info.java files using JDK 8
> > is working correctly.
> >   *   arrow-format, arrow-memory-core, arrow-memory-netty,
> > arrow-memory-unsafe, and arrow-vector have been modularized
> > successfully.
> >  *   Tests pass locally for all of these modules.
> >  *   They fail in CI. This is likely from me not updating a profile
> > somewhere.
> >
> > Similar to David’s PR from below, arrow-memory and modules needed to be
> > refactored fairly significantly and split into two modules: a
> > public-facing JPMS module and a separate module which adds to Netty’s
> > packages (memory-netty-buffer-patch). What’s more problematic is that
> > because we are using named modules now, users need to add more
> > arguments to their Java command line to use arrow. If one were to use
> > arrow-memory-netty they would need to add the following:
> >
> > --add-opens java.base/jdk.internal.misc=io.netty.common
> >
> --patch-module=io.netty.buffer=${project.basedir}/../memory-netty-buffer-patch/target/arrow-memory-netty-buffer-patch-${project.version}.jar
>
> >
> --add-opens=java.base/java.nio=org.apache.arrow.memory.core,io.netty.common,ALL-UNNAMED
> >
> > Depending on where the memory-netty-buffer-patch JAR is located, and
> > what version, the command the user needs to supply changes, so this
> > seems like it’d be really inconvenient.
> >
> > Do we want to proceed with modularizing existing memory modules? Both
> > netty and unsafe? Or wait until the new memory module from Java 21 is
> > available?
> >
> > The module-info.java files are written fairly naively. I haven’t
> > inspected thoroughly to determine what packages users will need.
> >
> > We can continue modularizing more components in a separate PR. Ideally
> > all the user breakage (class movement, new command-line argument
> > requirements) happens within one major Arrow version.
> >
> > From: James Duong 
> > Date: Tuesday, November 21, 2023 at 1:16 PM
> > To: dev@arrow.apache.org 
> > Subject: Re: [DISC][Java]: Migrate Arrow Java to JPMS Java Platform
> > Module System
> > I’m following up on this topic.
> >
> > David has a PR from last year that’s done much of the

Re: Apache Arrow file format

2023-10-17 Thread Adam Lippai
There is also
https://github.com/lancedb/lance, which sits between the two formats.
Depending on the use case it can be a great choice.

Best regards
Adam Lippai

On Tue, Oct 17, 2023 at 22:44 Matt Topol  wrote:

> One benefit of the feather format (i.e. Arrow IPC file format) is the
> ability to mmap the file to easily handle reading sections of a larger than
> memory file of data. Since, as Felipe mentioned, the format is focused on
> in-memory representation, you can easily and simply mmap the file and use
> the raw bytes directly. For a large file that you only want to read
> sections of, this can be beneficial for IO and memory usage.
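
A minimal PyArrow sketch of the mmap-based access described above (the file
name is made up; the rest is the documented IPC file API):

import pyarrow as pa
import pyarrow.ipc as ipc

# Write a small Arrow IPC ("feather") file so the example is self-contained.
table = pa.table({"a": list(range(10_000))})
with ipc.new_file(pa.OSFile("data.arrow", "wb"), table.schema) as writer:
    writer.write_table(table)

# Memory-map the file: record batches are materialized lazily from the
# mapping, so only the pages that are touched get pulled into memory.
source = pa.memory_map("data.arrow", "r")
reader = ipc.open_file(source)
batch = reader.get_batch(0)  # one section of a potentially huge file
print(reader.num_record_batches, batch.num_rows)
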
>
> Unfortunately, you are correct that it doesn't allow for easy column
> projecting (you're going to read all the columns for a record batch in the
> file, no matter what). So it's going to be a trade off based on your needs
> as to whether it makes sense, or if you should use a file format like
> Parquet instead.
>
> -Matt
>
>
> On Tue, Oct 17, 2023, 10:31 PM Felipe Oliveira Carvalho <
> felipe...@gmail.com>
> wrote:
>
> > It’s not the best since the format is really focused on in- memory
> > representation and direct computation, but you can do it:
> >
> > https://arrow.apache.org/docs/python/feather.html
> >
> > —
> > Felipe
> >
> > On Tue, 17 Oct 2023 at 23:26 Nara 
> wrote:
> >
> > > Hi,
> > >
> > > Is it a good idea to use Apache Arrow as a file format? Looks like
> > > projecting columns isn't available by default.
> > >
> > > One of the benefits of Parquet file format is column projection, where
> > the
> > > IO is limited to just the columns projected.
> > >
> > > Regards ,
> > > Nara
> > >
> >
>


Re: [Discuss] Do we need a release verification script?

2023-08-22 Thread Adam Lippai
Compiled code usually means binaries you can't derive in a deterministic,
verifiable way from the source code *shipped next to it*. So in this case
any developer should be able to reproduce the Flatbuffers output from the
release package alone.

"Caches", multi-stage compilation, etc. should be OK.

Best regards,
Adam Lippai

On Tue, Aug 22, 2023 at 10:40 Antoine Pitrou  wrote:

>
> If the main impetus for the verification script is to comply with ASF
> requirements, probably the script can be made much simpler, such as just
> verify the GPG signatures are valid? Or perhaps this can be achieved
> without a script at all.
>
> The irony is that, however complex, our verification script doesn't seem
> to check the actual ASF requirements on artifacts.
>
> For example, we don't check that """a source release SHOULD not contain
> compiled code""" (also, what does "compiled code" mean? does generated
> code, e.g. by the Flatbuffers compiler, apply?)
>
> Checking that the release """MUST be sufficient for a user to build and
> test the release provided they have access to the appropriate platform
> and tools""" is ill-defined and potentially tautologic, because the
> "appropriate platform and tools" is too imprecise and contextual (can
> the "appropriate platform and tools" contain a bunch of proprietary
> software that gets linked with the binaries? Well, it can, otherwise you
> can't build on Windows).
>
> Regards
>
> Antoine.
>
>
>
> > On 22/08/2023 at 12:31, Raúl Cumplido wrote:
> > Hi,
> >
> > I do agree that currently verifying the release locally provides
> > little benefit for the effort we have to put in but I thought this was
> > required as per Apache policy:
> > https://www.apache.org/legal/release-policy.html#release-approval
> >
> > Copying the important bit:
> > """
> > Before casting +1 binding votes, individuals are REQUIRED to download
> > all signed source code packages onto their own hardware, verify that
> > they meet all requirements of ASF policy on releases as described
> > below, validate all cryptographic signatures, compile as provided, and
> > test the result on their own platform.
> > """
> >
> > I also think we should try and challenge those.
> >
> > In the past we have identified some minor issues on the local
> > verification but I don't recall any of them being blockers for the
> > release.
> >
> > Thanks,
> > Raúl
> >
> > On Tue, 22 Aug 2023 at 11:46, Andrew Lamb  wrote:
> >>
> >> The Rust arrow implementation (arrow-rs) and DataFusion also use release
> >> verification scripts, mostly inherited from when they were split from
> the
> >> mono repo. They have found issues from time to time, for us, but those
> >> issues are often not platform related and have not been release
> blockers.
> >>
> >> Thankfully for Rust, the verification scripts don't need much
> maintenance
> >> so we just continue the ceremony. However, I certainly don't think we
> would
> >> lose much/any test coverage if we stopped their use.
> >>
> >> Andrew
> >>
> >> On Tue, Aug 22, 2023 at 4:54 AM Antoine Pitrou 
> wrote:
> >>
> >>>
> >>> Hello,
> >>>
> >>> Abiding by the Apache Software Foundation's guidelines, every Arrow
> >>> release is voted on and requires at least 3 "binding" votes to be
> approved.
> >>>
> >>> Also, every Arrow release vote is accompanied by a little ceremonial
> >>> where contributors and core developers run a release verification
> script
> >>> on their machine, wait for long minutes (sometimes an hour) and report
> >>> the results.
> >>>
> >>> This ceremonial has gone on for years, and it has not really been
> >>> questioned. Yet, it's not obvious to me what it is achieving exactly.
> >>> I've been here since 2018, but I don't really understand what the
> >>> verification script is testing for, or, more importantly, *why* it is
> >>> testing for what it is testing. I'm probably not the only one?
> >>>
> >>> I would like to bring the following points:
> >>>
> >>> * platform compatibility is (supposed to be) exercised on Continuous
> >>> Integration; there is no understandable reason why it should be
> >>> ceremoniously tested on eac

Re: Apache Arrow | Graph Algorithms & Data Structures

2023-06-30 Thread Adam Lippai
Hi Bechir,

GraphBLAS is mainly an interface: a list of functions.

This is similar to how BLAS is implemented by OpenBLAS and MKL.

Gabor Szárnyas has a great collection on how they are implemented and how
they perform:
https://github.com/GraphBLAS/GraphBLAS-Pointers

Best regards,
Adam Lippai

On Fri, Jun 30, 2023 at 08:04 Bechir Ben Daadouch 
wrote:

> Hey Adam,
>
> Thank you very much for taking the time to respond and for the suggestions
> :)
>
> I am aiming to assess the viability of Apache Arrow for graph algorithms
> and data structures, rather than addressing a particular issue. I am
> interested to determine the extent to which I can utilize Apache Arrow for
> this purpose and what potential challenges may arise.
>
> and since I have already encountered some roadblocks early on in my
> evaluation of Apache Arrow for graph algorithms and data structures, I am
> very interested in hearing your point of view on this topic.
>
> Thanks,
> Bechir
>
> On Fri, Jun 30, 2023, 13:38 Adam Lippai  wrote:
>
> > Hi,
> >
> > I’d recommend integrating with GraphBlas/suitesparse which are sparse
> > matrix multiplication based algorithms for common graph problems. There
> > might be an overlap in the data structures used, eg creating and a new
> > arrow type like the recent tensor type addition, or simply creating zero
> > copy or memcopy based efficient converters would be something worth
> > exploring.
> >
> > Best regards,
> > Adam alippai
> >
> > On Fri, Jun 30, 2023 at 06:57 Bechir Ben Daadouch <
> bechirche...@gmail.com>
> > wrote:
> >
> > > Thank you for taking the time to answer :)
> > >
> > > I don't have a fix Use-Case, but I am trying yo build a POC and
> evaluate
> > > whether Apache Arrow could be adequate in the context of graphs. But I
> > > found out very quickly that I won't be able to do all the necessary
> > > algorithm steps using only Apache Arrow without resorting to other
> > > libraries.
> > >
> > > On Fri, Jun 30, 2023, 07:36 Benson Muite 
> > > wrote:
> > >
> > > > On 6/30/23 04:21, Bechir Ben Daadouch wrote:
> > > > > Dear Apache Arrow Dev Community,
> > > > >
> > > > > My name is Bechir, I am currently working on a project that
> involves
> > > > > implementing graph algorithms in Apache Arrow.
> > > > >
> > > > > The initial plan was to construct a node structure and a subsequent
> > > graph
> > > > > that would encompass all the nodes. However, I quickly realized
> that
> > > due
> > > > to
> > > > > Apache Arrow's columnar format, this approach was not feasible.
> > > > >
> > > > > I tried a couple of things, including the implementation of the
> > > > > shortest-path algorithm. However, I rapidly discovered that
> > > manipulating
> > > > > arrow objects, particularly when applying graph algorithms, proved
> > more
> > > > > complex than anticipated and it became very clear that I would need
> > to
> > > > > resort to some data structures outside of what arrow offers (i.e.:
> > > Heapq
> > > > > wouldn't be possible using arrow).
> > > > >
> > > > > I also gave a shot at doing it similar to a certain SQL method
> (see:
> > > > > https://ibb.co/0rPGB42 ), but ran into some roadblocks there too
> > and I
> > > > > ended up having to resort to using Pandas for some transformations.
> > > > >
> > > > > My next course of action is to experiment with compressed sparse
> > rows,
> > > > > hoping to execute Matrix Multiplication using this method. But
> > > honestly,
> > > > > with what I know right now, I remain skeptical about the
> feasibility
> > > > > of it. However,
> > > > > before committing to this approach, I would greatly appreciate your
> > > > opinion
> > > > > based on your experience with Apache Arrow.
> > > > >
> > > > > Thank you very much for your time.
> > > > >
> > > > > Looking forward to potentially discussing this further.
> > > > >
> > > > > Many thanks,
> > > > > Bechir
> > > > >
> > > > Arrow may not be the best choice for most graph algorithms as they
> > > > typically require random memory accesses that will be difficult to
> > > > coalesce into forms that allow for vectorization. If your data will
> fit
> > > > in memory of a single node, you might consider:
> > > > https://github.com/DrTimothyAldenDavis/GraphBLAS
> > > > https://pypi.org/project/python-graphblas/
> > > > https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl
> > > >
> > >
> >
>


Re: Apache Arrow | Graph Algorithms & Data Structures

2023-06-30 Thread Adam Lippai
Hi,

I'd recommend integrating with GraphBLAS/SuiteSparse, which provide
sparse-matrix-multiplication-based algorithms for common graph problems.
There might be an overlap in the data structures used; e.g. creating a new
Arrow type like the recent tensor type addition, or simply creating
zero-copy or memcpy-based efficient converters, would be something worth
exploring.
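
As a rough sketch of the converter idea (illustrative only: it goes through
SciPy's CSR type rather than GraphBLAS itself, and the column names are made
up):

import numpy as np
import pyarrow as pa
from scipy.sparse import csr_matrix

# Hypothetical edge list stored as Arrow columns.
edges = pa.table({
    "src": pa.array([0, 0, 1, 2], type=pa.int32()),
    "dst": pa.array([1, 2, 2, 0], type=pa.int32()),
})
n = 3  # number of vertices

# Arrow -> NumPy is cheap for primitive, non-null columns; building the CSR
# structure is where the real conversion cost lives.
src = edges["src"].to_numpy()
dst = edges["dst"].to_numpy()
adj = csr_matrix((np.ones(len(src), dtype=np.int8), (src, dst)), shape=(n, n))

# One step of a BFS-style frontier expansion expressed as a sparse
# matrix-vector product, the pattern GraphBLAS generalizes.
frontier = np.zeros(n, dtype=np.int8)
frontier[0] = 1
print(adj.T @ frontier)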

Best regards,
Adam Lippai

On Fri, Jun 30, 2023 at 06:57 Bechir Ben Daadouch 
wrote:

> Thank you for taking the time to answer :)
>
> I don't have a fixed use case, but I am trying to build a POC and evaluate
> whether Apache Arrow could be adequate in the context of graphs. But I
> found out very quickly that I won't be able to do all the necessary
> algorithm steps using only Apache Arrow without resorting to other
> libraries.
>
> On Fri, Jun 30, 2023, 07:36 Benson Muite 
> wrote:
>
> > On 6/30/23 04:21, Bechir Ben Daadouch wrote:
> > > Dear Apache Arrow Dev Community,
> > >
> > > My name is Bechir, I am currently working on a project that involves
> > > implementing graph algorithms in Apache Arrow.
> > >
> > > The initial plan was to construct a node structure and a subsequent
> graph
> > > that would encompass all the nodes. However, I quickly realized that
> due
> > to
> > > Apache Arrow's columnar format, this approach was not feasible.
> > >
> > > I tried a couple of things, including the implementation of the
> > > shortest-path algorithm. However, I rapidly discovered that
> manipulating
> > > arrow objects, particularly when applying graph algorithms, proved more
> > > complex than anticipated and it became very clear that I would need to
> > > resort to some data structures outside of what arrow offers (i.e.:
> Heapq
> > > wouldn't be possible using arrow).
> > >
> > > I also gave a shot at doing it similar to a certain SQL method (see:
> > > https://ibb.co/0rPGB42 ), but ran into some roadblocks there too and I
> > > ended up having to resort to using Pandas for some transformations.
> > >
> > > My next course of action is to experiment with compressed sparse rows,
> > > hoping to execute Matrix Multiplication using this method. But
> honestly,
> > > with what I know right now, I remain skeptical about the feasibility
> > > of it. However,
> > > before committing to this approach, I would greatly appreciate your
> > opinion
> > > based on your experience with Apache Arrow.
> > >
> > > Thank you very much for your time.
> > >
> > > Looking forward to potentially discussing this further.
> > >
> > > Many thanks,
> > > Bechir
> > >
> > Arrow may not be the best choice for most graph algorithms as they
> > typically require random memory accesses that will be difficult to
> > coalesce into forms that allow for vectorization. If your data will fit
> > in memory of a single node, you might consider:
> > https://github.com/DrTimothyAldenDavis/GraphBLAS
> > https://pypi.org/project/python-graphblas/
> > https://github.com/JuliaSparse/SuiteSparseGraphBLAS.jl
> >
>


Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-13 Thread Adam Lippai
Hi Alenka,

We didn't discuss or benchmark the alternative formats. My understanding is
that, ideally, it should perform similarly to a primitive double Arrow
column. Currently the Parquet (de)serialization takes 3x longer than desired
for the new tensor type. That sounds like more than "chasing the last 20% of
performance".

The conversation can be continued separately; the most pressing questions
or issues are:
1. We might want to specify that a tensor has to consist of fixed-size,
non-null and non-nested items to avoid confusion. This is a big constraint,
but it makes it easier to have consistent assumptions and to optimize e.g.
the Parquet storage later. Alternatively we can define a DoubleTensor later
(or just accept that the behavior varies a lot depending on the stored data;
even int and double tensor matmul is ridiculously confusing anyway).
2. Adding fixed-byte-array-based storage for FixedSizeList with primitives
for the Arrow<->Parquet conversion is desired to improve performance. It was
still slower than storing doubles, but much better than storing the list. We
lose Parquet features such as delta encoding, list item statistics or bloom
filters (these might already be missing for lists, I didn't check yet).
3. The pandas numpy array is good news. I will confirm whether the memory in
the column is contiguous and operations can be vectorized, or whether it is
more similar to object storage with individual pointers (see the sketch
after this list).
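
A minimal sketch of what item 3 would check, using pyarrow's
fixed_shape_tensor extension type as proposed here (API names assumed from
the illustrative implementation; verify against the released version):

import numpy as np
import pyarrow as pa

# A batch of four 2x3 float64 tensors backed by one contiguous buffer.
ndarray = np.arange(24, dtype=np.float64).reshape(4, 2, 3)

tensor_type = pa.fixed_shape_tensor(pa.float64(), (2, 3))
tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(ndarray)

# The storage is a FixedSizeList over a flat float64 child, so the values
# live in one contiguous buffer rather than behind per-row pointers.
print(tensor_array.type == tensor_type)
print(tensor_array.storage.type)

# Round trip back to numpy; ideally this is zero-copy and vectorizable.
round_tripped = tensor_array.to_numpy_ndarray()
assert np.array_equal(round_tripped, ndarray)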

I don't think the above are blocking issues. I've raised this here only
because I remember how annoying the timestamp and timezone conversions were
(not round-tripping with pandas, Parquet storage changes).

P.S. I have almost zero experience with DNNs, but some reference on how our
layout compares to NCHW, or what typical batch sizes are, could be
interesting in the docs:
https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html
I guess it's all doable with the proposed extension.

Best regards,
Adam Lippai


On Mon, Mar 13, 2023 at 4:15 AM Alenka Frim 
wrote:

> Hi Adam,
>
> you are referring to the issue you raised on the Arrow repo [1] that turned
> into a good discussion about FixedSizeList and the current conversion
> to Parquet.
>
> Please correct me if I am wrong, but the outcome of the discussion was that
> the
> conversion is still pretty fast (much faster than commonly used
> serialization formats for
> tensors) though not as fast compared to other primitives in Apache Arrow.
>
> My opinion is that the discussion on this topic can be opened up
> separately in
> connection to optimising conversion between FixedSizeList as an Arrow
> format
> to Parquet, if there is still a need to do so.
>
> For this canonical extension type I would say it is an implementation
> detail
> and you mention a way to handle that with Parquet in the issue mentioned
> [2].
>
> I do not think there should be any issues in the conversion to Pandas.
> The conversion to numpy is not expensive and I would think the conversion
> to pandas should be the same. See PyArrow illustrative implementation [3].
>
> [1]: https://github.com/apache/arrow/issues/34510
> [2]: https://github.com/apache/arrow/issues/34510#issuecomment-1464463384
> [3]:
>
> https://github.com/apache/arrow/pull/33948/files#diff-efc1a41cdf04b6ec96d822dbec1f1993e0bbd17050b1b5f1275c8e3443a38828
>
> All well,
> Alenka
>
> On Fri, Mar 10, 2023 at 11:32 PM Adam Lippai  wrote:
>
> > Since the specification explicitly mentions FixedSizeList, but the
> current
> > conversion to/from parquet is expensive compared to doubles and other
> > primitives (the nested type needs repetition and definition levels)
> should
> > we discuss what’s the recommendation when integrating with other
> non-arrow
> > systems or is that an implementation detail only? (Pandas, parquet)
> >
> > Best regards,
> > Adam Lippai
> >
> > On Wed, Mar 8, 2023 at 1:13 AM Alenka Frim  > .invalid>
> > wrote:
> >
> > > >
> > > > Just one comment, though: since we also define a separate "Tensor"
> IPC
> > > > structure in Arrow, maybe we should state the relationship somewhere
> in
> > > the
> > > > documentation? (Even if the answer is "no relationship".)
> > > >
> > >
> > > Agree David, thanks for bringing it up.
> > >
> > > I will add the information about "no relationship" to the Tensor IPC
> > > structure into the spec and will also keep in mind to add it to the
> > > documentation that follows the implementations.
> > >
> >
>


Re: [VOTE][Format] Fixed shape tensor Canonical Extension Type

2023-03-10 Thread Adam Lippai
The specification explicitly mentions FixedSizeList, but the current
conversion to/from Parquet is expensive compared to doubles and other
primitives (the nested type needs repetition and definition levels). Should
we discuss what the recommendation is when integrating with other non-Arrow
systems (pandas, Parquet), or is that an implementation detail only?

Best regards,
Adam Lippai

On Wed, Mar 8, 2023 at 1:13 AM Alenka Frim 
wrote:

> >
> > Just one comment, though: since we also define a separate "Tensor" IPC
> > structure in Arrow, maybe we should state the relationship somewhere in
> the
> > documentation? (Even if the answer is "no relationship".)
> >
>
> Agree David, thanks for bringing it up.
>
> I will add the information about "no relationship" to the Tensor IPC
> structure into the spec and will also keep in mind to add it to the
> documentation that follows the implementations.
>


Re: [DISCUSS] Flight RPC/Flight SQL/ADBC enhancements

2023-02-17 Thread Adam Lippai
One more thing to consider regarding transactions is potential support for
distributed transactions.
This could be interesting for parallel data fetching and data insertion
(e.g. an ETL job with many workers).

Best regards,
Adam Lippai


On Fri, Feb 17, 2023 at 4:46 PM Matthew Topol 
wrote:

> Looking at the info, there's already some SqlInfo values that correspond to
> indicating support for transaction isolation types, so we just need to add
> the ability to request the isolation desired when Beginning a transaction,
> I'll aim to suggest an edit to the document later today or tomorrow.
>
> On Fri, Feb 17, 2023 at 2:02 PM David Li  wrote:
>
> > I think it'd make sense. Do you want to propose a section in the doc?
> >
> > I think it'd make the most sense to just define enums/flags in Flight SQL
> > instead of arbitrary options. (A corresponding set of SqlInfo values
> could
> > indicate support for each of them.)
> >
> > On Thu, Feb 16, 2023, at 16:16, Matthew Topol wrote:
> > > While implementing Transaction handling for ADBC via Flight SQL's
> > > transaction primitives, another potential enhancement would be to
> expand
> > > the BeginTransaction request to include a spot for "options" such as
> > > IsolationLevel or marking a transaction as ReadOnly.
> > >
> > > Anyone have thoughts on this?
> > >
> > > On Wed, Feb 15, 2023 at 10:19 AM David Li  wrote:
> > >
> > >> The ADBC and Flight SQL proposals have been updated for
> > >> Micah/Taeyun/Will's comments.
> > >>
> > >> On Wed, Feb 15, 2023, at 09:17, David Li wrote:
> > >> > Hi Taeyun,
> > >> >
> > >> > Thanks for the detailed feedback!
> > >> >
> > >> > - I will clarify that PollFlightInfo should return as quickly as
> > >> > possible on the first call, and that updates in progress value are
> > also
> > >> > OK (though the server shouldn't spam updates). (I wanted to avoid
> > >> > streaming calls as it does not work as well with browser-based gRPC
> > >> > clients.)
> > >> > - I will clarify cancel_descriptor to note that it is optional.
> > >> > - I wanted to avoid adding several new RPC methods, but if there is
> > >> > rough agreement that these would be generally useful, I will add
> them
> > >> > and deprecate the Flight SQL message [3]. (We could also possibly
> > >> > define 'standard' DoAction Protobuf messages, but I worry about
> > >> > implementation [1]. I may prototype this first, since then we could
> > >> > avoid having redundant paths in Flight RPC/Flight SQL.) If we do
> this,
> > >> > I think we do not need cancel_descriptor. (It can work like
> > >> > CancelQuery.)
> > >> > - I meant that CancelQuery should work with a partial FlightInfo
> from
> > a
> > >> > PollFlightInfo response. However this doesn't work if there's no
> > >> > endpoints in the response! I will add app_metadata fields to
> > >> > FlightInfo/FlightEndpoint. I think this can also be useful for
> > >> > applications that need to add their own semantics to these messages
> > >> > anyways, since Ticket is not meant to be parsed by the client. (You
> > >> > could stuff the info into the schema, but that also doesn't work if
> > the
> > >> > schema is not yet known.)
> > >> >
> > >> > As for the partial DoGet: I think this is interesting and we can
> > >> > discuss. Google BigQuery Storage supports this use case [2]. As you
> > >> > note, if you are using this to request only a few rows, you may not
> > >> > benefit much from Arrow.
> > >> >
> > >> > [1]: The C++ Protobuf library makes it difficult to define and share
> > >> > messages across multiple shared libraries. On Windows, protoc does
> not
> > >> > properly insert dllimport/dllexport macros (despite claiming to),
> and
> > >> > on Unixes Protobuf interacts oddly with our linker script/symbol
> > >> > hiding. This would be a lot of work, but I wonder if we could use an
> > >> > implementation like upb/nanopb that does not rely on global state
> for
> > >> > Arrow. This would also hopefully ease conflicts with projects that
> > want
> > >> > to use their own Protobuf definitions - as with Substrait. The main
> > >&

Re: Predicate Pushdown/Arrow-rs Usage Question

2023-01-10 Thread Adam Lippai
Row-group-level predicate pushdown should be supported in both C++ and
Rust. What's the use case / query you want to speed up?

The page index and bloom filters are brand new and low-level in arrow-rs,
but there is support for them. AFAIK C++ doesn't have full standard coverage
for either.
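
For reference, a minimal sketch of row-group-level pushdown through the C++
Datasets layer as exposed by PyArrow (the file name, column names and
threshold are made up; the GLib bindings may expose less of this today):

import pyarrow.dataset as ds

# The scanner pushes the filter down so that row groups whose statistics
# cannot match are skipped before any column data is read.
dataset = ds.dataset("events.parquet", format="parquet")  # hypothetical file
scanner = dataset.scanner(
    columns=["user_id", "ts"],
    filter=ds.field("ts") >= 1_700_000_000,
)

for batch in scanner.to_batches():
    print(batch.num_rows)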

Best regards,
Adam Lippai

On Tue, Jan 10, 2023 at 9:35 PM SHI BEI  wrote:

> Hi arrow community,
>
>
>
>
> I'm new to the Arrow project and am trying to use Arrow and Parquet in a
> C/C++ project. To improve query performance, I plan to take advantage of
> Parquet row-group-level and page-level statistics when querying data, but
> the GLib/C++ SDK lacks an implementation of Parquet predicate pushdown. I
> have noticed that some work is in progress to support Parquet predicate
> pushdown, but it will take some time. So I want to know whether it's
> possible to use arrow-rs instead, and whether anyone has experience with a
> similar scenario. Any help will be appreciated!
>
>
>
> SHI BEI
> shibei...@foxmail.com


Re: Remote datasets

2022-04-12 Thread Adam Lippai
Hi David,

This is a perfect answer. I was looking for the Fragment concept, and the
issues you linked make it easy to follow.
I understand this is a really hard field with a ton of work: getting
chunking, prefetch and backpressure right, plus adding filter predicate and
other computation pushdown, is an infinitely complex task.

Thank you for making this clear. You are making such great progress that
it's hard to keep up even with the big picture :D

Best regards,
Adam Lippai

On Tue, Apr 12, 2022 at 4:46 PM David Li  wrote:

> TL;DR yes, if and when all is said and done.
>
> Breaking this down…
>
> Substrait isn't really relevant here. It's a way to serialize a query in a
> way that's agnostic to whatever's actually generating or executing the
> query.
>
> But if you have a Substrait plan, that can get converted by the Arrow C++
> Query Engine into its internal "ExecPlan" for execution, which is what's
> actually implementing the joins, aggregations, etc. This engine operates in
> a streaming fashion, so your application can take the data you get out and
> use it with a Flight service/client, yes.
>
> The query engine pulls input from the Arrow Datasets library. (Though
> while I speak of them separately, really, they are intertwined.) Datasets
> is also a streaming interface to read Arrow data from various underlying
> datasources, implementing things like projection pushdown and partitioning
> where possible. This is agnostic to whether the data is local or remote,
> i.e. there's no explicit concept of "remote dataset". It's all datasets
> whether it's in memory, on local disk, or across the network.
>
> So if/when a Flight datasource ("Fragment") is implemented for Arrow
> Datasets, this will be consumed in a streaming fashion, by a query engine
> which itself is streaming, which can be fed into a streaming interface like
> Flight. There's a good amount of work to do to ensure this all works well
> together (e.g. ensuring backpressure gets reflected across all these
> layers), but what you are asking for is in principle doable, if not quite
> yet implemented.
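
To make this concrete, a rough sketch of the streaming pattern being asked
about, written with pyarrow's Dataset and Flight APIs (the S3 path, Flight
endpoint and filter are all made up, and the Flight-backed Dataset Fragment
itself does not exist yet):

import pyarrow.dataset as ds
import pyarrow.flight as flight

# Stream batches out of a (local or S3-backed) dataset with a pushed-down
# filter and forward each batch to a Flight service, without materializing
# the whole result in memory.
dataset = ds.dataset("s3://bucket/table/", format="parquet")  # hypothetical
scanner = dataset.scanner(filter=ds.field("country") == "DE")

client = flight.FlightClient("grpc://analytics-host:8815")  # hypothetical
descriptor = flight.FlightDescriptor.for_path("filtered_table")

writer, _ = client.do_put(descriptor, scanner.projected_schema)
for batch in scanner.to_batches():
    writer.write_batch(batch)
writer.close()
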
>
> -David
>
> On Tue, Apr 12, 2022, at 16:21, Adam Lippai wrote:
> > Hi James,
> >
> > Your answer helps, yes.
> > My question is whether I will be able to join two datasets (producing a
> new
> > dataset) in a streaming way or do I have to fetch the whole response and
> > keep it in memory?
> > So if my local node has memory constraints, will it be able to stream
> data
> > from an Apache Flight datasource and stream it back to a different Apache
> > Flight target?
> > If the answer is yes, is it because there will be a Remote Dataset
> concept
> > or will it use "distributed computing" using Substrait?
> >
> > Best regards,
> > Adam Lippai
> >
> > On Tue, Apr 12, 2022 at 4:14 PM James Duong  .invalid>
> > wrote:
> >
> >> Hi Adam,
> >>
> >> Arrow Flight can be used to provide an RPC framework that returns
> datasets
> >> (sent over the wire as arrow buffers) and exposes them from a
> FlightClient
> >> as Arrow RecordBatches without serialization. Is this what you mean by
> >> remote datasets?
> >> Arrow Flight SQL is an application layer built on top of Arrow Flight
> that
> >> standardizes remote execution of SQL queries, getting catalog
> information,
> >> getting SQL capabilities, and other access-related concepts. Arrow
> Flight
> >> SQL is intended to provide a universal user-facing front end for
> existing
> >> SQL-capable database engines.
> >>
> >> Neither are really intended for computation, just remote access.
> >>
> >> On Tue, Apr 12, 2022 at 12:51 PM Adam Lippai  wrote:
> >>
> >> > Hi,
> >> >
> >> > I saw really nice features like groupby and join developed recently.
> >> > I like how Dataset is supported for joins and how streamed processing
> is
> >> > gaining momentum in Arrow.
> >> >
> >> > Does Apache Arrow have the concept of remote datasets eg using Arrow
> >> > Flight? Or will this happen directly using S3 and other protocols
> only? I
> >> > know some work has started in Substrait, but that might be a whole new
> >> > level of integration, hence my question focusing on data first.
> >> >
> >> > I was trying to browse the JIRA issues, but the future picture wasn't
> >> clear
> >> > based on that
> >> >
> >> > Best regards,
> >> > Adam Lippai
> >> >
> >>
> >>
> >> --
> >>
> >> *James Duong*
> >> Lead Software Developer
> >> Bit Quill Technologies Inc.
> >> Direct: +1.604.562.6082 | jam...@bitquilltech.com
> >> https://www.bitquilltech.com
> >>
> >> This email message is for the sole use of the intended recipient(s) and
> may
> >> contain confidential and privileged information.  Any unauthorized
> review,
> >> use, disclosure, or distribution is prohibited.  If you are not the
> >> intended recipient, please contact the sender by reply email and destroy
> >> all copies of the original message.  Thank you.
> >>
>


Re: Remote datasets

2022-04-12 Thread Adam Lippai
Hi James,

Your answer helps, yes.
My question is whether I will be able to join two datasets (producing a new
dataset) in a streaming way, or whether I have to fetch the whole response
and keep it in memory.
So if my local node has memory constraints, will it be able to stream data
from an Arrow Flight datasource and stream it back to a different Arrow
Flight target?
If the answer is yes, is it because there will be a Remote Dataset concept,
or will it use "distributed computing" via Substrait?

Best regards,
Adam Lippai

On Tue, Apr 12, 2022 at 4:14 PM James Duong 
wrote:

> Hi Adam,
>
> Arrow Flight can be used to provide an RPC framework that returns datasets
> (sent over the wire as arrow buffers) and exposes them from a FlightClient
> as Arrow RecordBatches without serialization. Is this what you mean by
> remote datasets?
> Arrow Flight SQL is an application layer built on top of Arrow Flight that
> standardizes remote execution of SQL queries, getting catalog information,
> getting SQL capabilities, and other access-related concepts. Arrow Flight
> SQL is intended to provide a universal user-facing front end for existing
> SQL-capable database engines.
>
> Neither are really intended for computation, just remote access.
>
> On Tue, Apr 12, 2022 at 12:51 PM Adam Lippai  wrote:
>
> > Hi,
> >
> > I saw really nice features like groupby and join developed recently.
> > I like how Dataset is supported for joins and how streamed processing is
> > gaining momentum in Arrow.
> >
> > Does Apache Arrow have the concept of remote datasets eg using Arrow
> > Flight? Or will this happen directly using S3 and other protocols only? I
> > know some work has started in Substrait, but that might be a whole new
> > level of integration, hence my question focusing on data first.
> >
> > I was trying to browse the JIRA issues, but the future picture wasn't
> clear
> > based on that
> >
> > Best regards,
> > Adam Lippai
> >
>
>
> --
>
> *James Duong*
> Lead Software Developer
> Bit Quill Technologies Inc.
> Direct: +1.604.562.6082 | jam...@bitquilltech.com
> https://www.bitquilltech.com
>
> This email message is for the sole use of the intended recipient(s) and may
> contain confidential and privileged information.  Any unauthorized review,
> use, disclosure, or distribution is prohibited.  If you are not the
> intended recipient, please contact the sender by reply email and destroy
> all copies of the original message.  Thank you.
>


Remote datasets

2022-04-12 Thread Adam Lippai
Hi,

I saw really nice features like groupby and join developed recently.
I like how Dataset is supported for joins and how streamed processing is
gaining momentum in Arrow.

Does Apache Arrow have the concept of remote datasets eg using Arrow
Flight? Or will this happen directly using S3 and other protocols only? I
know some work has started in Substrait, but that might be a whole new
level of integration, hence my question focusing on data first.

I was trying to browse the JIRA issues, but the future picture wasn't clear
based on that.

Best regards,
Adam Lippai


Re: Flight/FlightSQL Optimization for Small Results?

2022-02-28 Thread Adam Lippai
I saw the same. A small, stateless query ability would be nice (connection
open, initialization, and query in one message; the result set in the
response in one message).

On Mon, Feb 28, 2022, 13:12 Micah Kornfield  wrote:

> I'm rereviewing the Flight SQL interfaces, and I'm not sure if I'm missing
> it but is there any optimization for small results?  My concern is that the
> overhead of the RPCs for the DoGet after executing the query could add
> non-trivial latency for smaller results.
>
> Has anybody else thought about this/investigated it?  Am I understanding
> this correctly?
>
> Thanks,
> Micah
>


Re: [DISCUSS] [RUST] More Frequent arrow-rs release schedule

2022-01-02 Thread Adam Lippai
While I don't know how much legacy API support downstream projects would
need (without a good case, Andrew's suggestion should be the "default", and
keeping minor / patch versions is the extra workflow), there is one nit I
can add:
According to the Rust recommendations, every package used as a dependency
for other packages should have a 1.0+ version and use semantic versioning
(exactly what Andrew proposed):
https://semver.org/#how-do-i-know-when-to-release-100
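
To make the caret-requirement behaviour concrete, here is a tiny illustrative
snippet (it assumes the semver crate; the version numbers are made up):

    use semver::{Version, VersionReq};

    fn main() {
        // `arrow = "5.0"` in Cargo.toml means the caret requirement ^5.0:
        // any 5.x.y is considered compatible, 6.0.0 is not, so breaking
        // changes have to come with a major version bump.
        let req = VersionReq::parse("^5.0").unwrap();
        assert!(req.matches(&Version::parse("5.3.1").unwrap()));
        assert!(!req.matches(&Version::parse("6.0.0").unwrap()));
    }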

The proposed behavior could lead to a toxic and messy arrow-rs package;
however, if someone feels this is happening, we should maintain smaller
packages (arrow-rs broken into multiple crates) rather than changing whether
we follow semver or how we bend the guidance.

Best regards,
Adam Lippai


On Sun, Jan 2, 2022 at 2:52 PM Micah Kornfield 
wrote:

> Hi Andrew,
> Happy new year. I'm not familiar with Rust but it seems like a lot of the
> intention here is to really treat Arrow-RS as a pre-1.0 package (we should
> maybe be doing the same for other libraries as well but that is a separate
> issue).  Is it possible to deprecate the existing arrow-rs and create a new
> 'arrow-rs2' package and keep it at a pre-1.0 until the community feels that
> it has reached 1.0.
>
> I'm sure this will cause some pain and confusion but it might be better in
> the long run?
>
> -Micah
>
> On Sat, Jan 1, 2022 at 4:55 AM Andrew Lamb  wrote:
>
> > Happy New Year,
> >
> > I would like to discuss more frequent major arrow-rs releases and have
> > written my thoughts on a ticket [1] and would love any additional
> community
> > feedback. Among other things this would result in versions that no longer
> > align with the C/C++ implementation.
> >
> > Thanks,
> > Andrew
> >
> > [1]: https://github.com/apache/arrow-rs/issues/1120
> >
>


Re: [VOTE] Release Apache Arrow JS 6.0.2

2021-11-23 Thread Adam Lippai
I think the voting makes sense, but not because it's similar to the Rust
release process. A big difference is that the Rust release is a minor
release with a lot of code and features added, not a patch release with
essential bug fixes only.

Best regards,
Adam Lippai

On Tue, Nov 23, 2021, 19:10 Benson Muite  wrote:

> https://issues.apache.org/jira/browse/ARROW-14801
>
> Rust has its own repository and does frequent point releases:
> https://github.com/apache/arrow-rs/tree/master/dev/release
>
> however, even point releases require 3 PMC binding +1 votes and API
> breaking changes can only take place on major releases.
>
> Many of the tests for releases can be automated, possibly relieving some
> of the PMC burden in the current process.  Judgement on code quality and
> software license is still required though[1]. Similarly, releases need
> to be signed.
>
>
> [1] https://infra.apache.org/release-publishing.html
>
> On 11/23/21 7:33 PM, Dominik Moritz wrote:
> >   I tested Node v14.18.1 and tests pass. I think we can go ahead and
> make a
> > release.
> >
> > @Benson, could you help me update the script to work off of branches. I
> > don’t know what the expected process for release verification is. I’d be
> > happy to adopt another process.
> >
> > On Nov 20, 2021 at 09:57:53, Dominik Moritz  wrote:
> >
> >> Thanks for catching that.
> >>
> >> Jest is used for running the tests and jest supports node 14.15. Could
> we
> >> switch to node 14.15 instead of 14.0 for this test?
> >>
> >> On Nov 20, 2021 at 05:37:00, Benson Muite 
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> Tested this on AlmaLinux 8. Following steps:
> >>>
> >>>  export NVM_DIR="`pwd`/.nvm"
> >>>  mkdir -p $NVM_DIR
> >>>  curl -o-
> >>> https://raw.githubusercontent.com/nvm-sh/nvm/v0.35.3/install.sh | \
> >>>PROFILE=/dev/null bash
> >>>  [ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
> >>>
> >>>  nvm install --lts
> >>>  npm install -g yarn
> >>>  git clone https://github.com/apache/arrow
> >>>  cd arrow
> >>>  git checkout release-6.0.2-js
> >>>  cd js
> >>>  yarn --frozen-lockfile
> >>>  yarn run-s clean:all lint build
> >>>  yarn test
> >>>
> >>> Tests pass.
> >>>
> >>> yarn 1.22.17
> >>> npm 8.1.0
> >>> node 16.13.0
> >>>
> >>> Tests also pass on
> >>> node 17.0.0
> >>>
> >>>
> >>> Node 14 is supported until 2023, however if one tries to use Node 14,
> >>> one gets the error:
> >>>
> >>> jest@27.0.6: The engine "node" is incompatible with this module.
> >>> Expected version "^10.13.0 || ^12.13.0 || ^14.15.0 || >=15.0.0". Got
> >>> "14.0.0"
> >>> error Found incompatible module.
> >>>
> >>>
> >>> The current release verification script could be update to support
> >>> testing directly from a branch if this will be the point release
> process
> >>> in future.
> >>>
> >>> On 11/20/21 12:25 AM, Dominik Moritz wrote:
> >>>
> >>> Hi,
> >>>
> >>>
> >>> I would like to propose a patch release for Arrow JS. The release is
> >>> forked
> >>>
> >>> off of maint-6.0.x and available at
> >>>
> >>> https://github.com/apache/arrow/tree/release-6.0.2-js.
> >>>
> >>>
> >>> The release contains two fixes for the js bundle:
> >>>
> >>> ARROW-14773: [JS] Fix sourcemap paths
> >>>
> >>> <https://github.com/apache/arrow/pull/11741>
> >>>
> >>> ARROW-14774: [JS] Correct package exports
> >>>
> >>> <https://github.com/apache/arrow/pull/11742>
> >>>
> >>>
> >>> [ ] +1 Release this as Apache Arrow JS 6.0.2
> >>>
> >>> [ ] +0
> >>>
> >>> [ ] -1 Do not release this as Apache Arrow JS 6.0.2 because...
> >>>
> >>>
> >>> Thank you,
> >>>
> >>> Dominik
> >>>
> >>>
> >>>
> >>>
> >
>
>


Re: Synergies with Apache Avro?

2021-10-31 Thread Adam Lippai
Hi Jorge,

Just an idea: Do the Avro libs support different allocators? Maybe using a
different one (e.g. mimalloc) would yield more similar results by working
around the fragmentation you described.

This wouldn't change the fact that they are relatively slow; however, it
could allow a better apples-to-apples comparison and thus better CPU
profiling and understanding of the nuances.
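
For illustration, a minimal sketch of the kind of change I mean, assuming the
mimalloc crate is pulled in as a drop-in global allocator (the reader code
itself stays untouched):

    // Cargo.toml (assumed): mimalloc = "0.1"
    use mimalloc::MiMalloc;

    // Route every heap allocation in the benchmark binary, including the
    // many small per-string allocations in the Avro readers, through
    // mimalloc instead of the system allocator.
    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;

    fn main() {
        // run the same read benchmark as before; only the allocator changed
    }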

Best regards,
Adam Lippai


On Sun, Oct 31, 2021, 17:42 Jorge Cardoso Leitão 
wrote:

> Hi,
>
> I am reporting back a conclusion that I recently arrived at when adding
> support for reading Avro to Arrow.
>
> Avro is a storage format that does not have an associated in-memory
> format. In Rust, the official implementation deserializes an enum, in
> Python to a vector of Object, and I suspect in Java to an equivalent vector
> of object. The important aspect is that all of them use fragmented memory
> regions (as opposed to what we do with e.g. one uint8 buffer for
> StringArray).
>
> I benchmarked reading to arrow vs reading via the official Avro
> implementations. The results are a bit surprising: reading 2^20 rows of 3
> byte strings is ~6x faster than the official Avro Rust implementation and
> ~20x faster vs "fastavro", a C implementation with bindings for Python (pip
> install fastavro), all with a difference slope (see graph below or numbers
> and used code here [1]).
> [image: avro_read.png]
>
> I found this a bit surprising because we need to read row by row and
> perform a transpose of the data (from rows to columns) which is usually
> expensive. Furthermore, reading strings can't be that much optimized after
> all.
>
> To investigate the root cause, I drilled down to the flamegraphs for both
> the official avro rust implementation and the arrow2 implementation: the
> majority of the time in the Avro implementation is spent allocating
> individual strings (to build the [str] - equivalents); the majority of the
> time in arrow2 is equally divided between zigzag decoding (to get the
> length of the item), reallocs, and utf8 validation.
>
> My hypothesis is that the difference in performance is unrelated to a
> particular implementation of arrow or avro, but to a general concept of
> reading to [str] vs arrow. Specifically, the item by item allocation
> strategy is far worse than what we do in Arrow with a single region which
> we reallocate from time to time with exponential growth. In some
> architectures we even benefit from the __memmove_avx_unaligned_erms
> instruction that makes it even cheaper to reallocate.
>
> Has anyone else performed such benchmarks or played with Avro -> Arrow and
> found supporting / opposing findings to this hypothesis?
>
> If this hypothesis holds (e.g. with a similar result against the Java
> implementation of Avro), it imo puts arrow as a strong candidate for the
> default format of Avro implementations to deserialize into when using it
> in-memory, which could benefit both projects?
>
> Best,
> Jorge
>
> [1] https://github.com/DataEngineeringLabs/arrow2-benches
>
>
>


Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-06 Thread Adam Lippai
Hi,

Thanks for the detailed answer.

In contrast to my previous email, my opinionated part:

Generally I like the idea of smaller crates; it helps with a lot of stuff
(different targets, build time), but those benefits can be achieved by
feature gates too.
The upside of separate crates would be out-of-sync crate releases.
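
As a rough sketch of what I mean by feature gates (the feature names below
are hypothetical, not the actual arrow-rs layout):

    // lib.rs of a single `arrow` crate: optional parts sit behind Cargo
    // features instead of separate crates, so users can trim build time
    // and targets with `default-features = false` plus an explicit list.
    #[cfg(feature = "compute")]
    pub mod compute {
        // compute kernels would live here; compiled only with `--features compute`
        pub fn add(a: i64, b: i64) -> i64 { a + b }
    }

    #[cfg(feature = "io")]
    pub mod io {
        // IO readers/writers would live here; compiled only with `--features io`
    }

    pub mod array {
        // the core array API is always compiled
    }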

Maintenance is important; historically speaking, I've seen it solved for
open source by private companies offering it as a paid service.
You are right that currently only 3 months of support is provided for free,
but personally I don't see that as an issue.
There are professional libraries and software with close to 100% market
share in their field which support only the last one or two versions
(Chrome, OS-es, compilers).
I find it hard to imagine we'd want to do *better* than that (it sounds like
an illusion), but I'd like to be wrong on this one :)
Professionally speaking, when picking projects, having Apache (or other)
governance and community was more important for the businesses I worked
with than the release schedule or API stability / versioning.


Based on the above, and given that there are about a dozen active Rust arrow
contributors, any promise of reliable maintenance over years would be a
lie in my eyes.
DataFusion, Polars, odbc2parquet and others had issues with the changes
being too slow, not too fast.

I'm a big advocate of middle grounds, and I still believe that your efforts
and ideal setup are compatible with arrow-rs: nobody would stop you from
creating a 5.23.0 release next to the 6.1.0 if you wanted to backport
anything, and nobody would stop you from cutting an out-of-schedule 6.2 or
even 7.0 release if it's to ensure security. The frequent Apache release
process - which we were afraid of - was smooth so far, with surprisingly
nice support from members of different languages / implementations.

Also, I believe that any plan you'd have for turning arrow2 into arrow-rs 6.0
would be more than welcome in a public vote, along with the technical
changes you propose (eg. cutting a separate arrow-io crate).


At least 6 key members showed their excitement for your changes in this
thread and even more on Slack/GitHub ;)

Best regards,
Adam Lippai

On Fri, Aug 6, 2021 at 10:07 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Thanks for your input.
>
> Every time there is a new major release, all new development shifts towards
> that new API and users of previous APIs are left behind. It is not just a
> matter of SemVer and size of version numbers, there is a whole development
> shift to be on top of the new API.
>
> I disagree that a software that has a major release every 3 months and no
> maintenance window over previous versions is stable. I alluded to the Tokio
> example because Tokio 1.0 recently became the runtime of rust-based AWS
> lambda functions [1]; this commitment is only possible by enforcing API
> stability and maintenance beyond a 3 month period (at least 3 years in
> their case).
>
> Also, imo the current major version number is not meaningless: divided by
> the software age, it constitutes the historical release pattern and is
> usually a good predictor of the pattern used in future releases.
>
> The evidence is that we haven't been able to support any version for any
> period of time; recently, Andrew has been doing amazing work at supporting
> the latest version for a period of 3 months. I.e. an application that
> depends on `arrow = ^5.0` has a support window of 3 months. Given that we
> have not backported any security fixes to previous versions, it is
> reasonable to assume that security patches are also applied within a 3
> month period only.
>
> As contributor of arrow2, I would rather not have arrow2 under Apache Arrow
> than having to release it under its current versioning and scheduling (this
> is similar to some of Julia's concerns). As a contributor to the Apache
> Arrow, I currently cannot guarantee a maintenance window over arrow-rs for
> any period of time because it is unsafe by design and I do not have the
> motivation to fix it. As both, I am confident that the core arrow2 will
> soon reach a point where we can live with and develop on top of it for at
> least a year. This is not true to the whole API surface, though: there are
> APIs that we will need to change more often until stability can be
> promised.
>
> So, I am requesting that we tie the discussion of arrow2 to how it will be
> released.
>
> Could a middle ground be somewhere along the lines of splitting the crate
> in smaller crates that are versioned independently. I.e. continue to
> release `arrow` under the same versioning and cadence, and create 3 new
> crates, arrow-core, arrow-compute, and arrow-io (see also [2]) that would
> have their own versioning at 0.X until stability is achieved, bas

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-08-05 Thread Adam Lippai
Not taking sides, just two technical notes below.

Semver.org clearly defines (
https://semver.org/#how-do-i-know-when-to-release-100) the versions >1.0.0:
* If it's used in production, it's 1.0.0.
* If it provides an API others depend on, then it's 1.0.0.
* If you intend to keep backward compatibility, it's 1.0.0.
Tl;dr: 1.0.0 represents a version from which point we guarantee that
non-production releases are marked (alpha, beta, rc) and that breaking (API),
backwards-incompatible changes result in a major version bump. This we
already do, 4x per year.

The second fact is that arrow2 uses the arrow name, but it doesn't have
Apache governance. It's not released from GitHub.com/apache, there are no
formal releases, and there are no votes. This is not correct or fair usage of
the brand (on the same level as DataFuse, or db-benchmark calling a custom
R implementation arrow), even if it's "unofficial". My understanding is that
arrow2 can be an unofficial implementation with a different name or an
arrow-rs experiment with the intention to merge the code, but not both.

I think both issues could be solved and I really value and like the arrow2
work so far. That's the right way. I hope we'll see it in prod either way
as soon as it's ready.

Best regards,
Adam Lippai

On Wed, Aug 4, 2021, 08:25 QP Hou  wrote:

> Just my two cents.
>
> I think we all have the same goal here, which is to accelerate the
> transitioning of arrow to arrow2 as the official arrow rust
> implementation.
>
> In my opinion, the biggest gain we can get from merging two projects
> into one repo is to have some kind of a policy to enforce that every
> new feature/test added to the current arrow implementation also  needs
> to be added to the arrow2 implementation. This way, we can make sure
> the gap between arrow and arrow2 is closing on every iteration.
> Without this, I tend to agree with Jorge that merging two repos would
> add more overhead to his work and slow him down.
>
> For those who want to contribute to arrow2 to accelerate the
> transition, I don't think they would have problem sending PRs to the
> arrow2 repo. For those who are not interested in contributing to
> arrow2, merging the arrow2 code base into the current arrow-rs repo
> won't incentivize them to contribute. Merging arrow2 into current
> arrow-rs repo could help with discovery. But I think this can be
> achieved by adding a big note in the current arrow-rs README to
> encourage contributions to the arrow2 repo as well.
>
> At the end of the day, Jorge is currently the sole active contributor
> to the arrow2 implementation, so I think he would have the most say on
> what's the most productive way to push arrow2 forward. The only
> concern I have with regards to merging arrow2 into arrow-rs right now
> is Jorge spent all the efforts to do the merge, then it turned out
> that he is still the only active contributor to arrow2 within
> arrow-rs, but with more overhead that he has to deal with.
>
> As for maintaining semantic versioning for arrow2, Andy had a good
> point that we could still release arrow2 with its own versioning even
> if we merge it into the arrow-rs repo. So I don't think we should
> worry/focus too much about versioning in our discussion. Velocity to
> close the gap between arrow-rs and arrow2 is the most important thing.
>
> Lastly, I do agree with Andrew that it would be good to only maintain
> a single arrow crate in crates.io in the long run. As he mentioned,
> when the current arrow2 code base becomes stable, we could still
> release it under the arrow namespace in crates.io with a major version
> bump. The absolute value in the major version doesn't really matter as
> long as we stick to the convention that breaking change will result in
> a major version bump.
>
> Thanks,
> QP
>
>
>
> On Tue, Aug 3, 2021 at 5:31 PM paddy horan  wrote:
> >
> > Hi Jorge,
> >
> > I see value in consolidating development in a single repo and releasing
> under the existing arrow crate.  Regarding versioning, I think once we
> follow semantic versioning we are fine.  I don't think it's worth migrating
> to a different repo and crate to comply with the de-facto standard you
> mention.
> >
> > Just one person's opinion though,
> > Paddy
> >
> >
> > -Original Message-
> > From: Jorge Cardoso Leitão 
> > Sent: Tuesday, August 3, 2021 5:23 PM
> > To: dev@arrow.apache.org
> > Subject: Re: [Discuss] [Rust] Arrow2/parquet2 going foward
> >
> > Hi Paddy,
> >
> > > What do you think about moving Arrow2 into the main Arrow repo where
> > > it
> > is only enabled via an "experime

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-18 Thread Adam Lippai
Hi Simon,

There are several Arrow implementations in parallel:
https://arrow.apache.org/docs/status.html
The Python and R versions are based on Arrow C++; the others are
completely separate projects.
Arrow-rs and Arrow2 refer to the Rust implementation; Arrow C++ is
not going to be replaced.

The thread is about making a really big rewrite of the Arrow-rs
implementation, now called arrow2.
This will be a more idiomatic and safer Rust implementation, incorporating
the experience collected during the Arrow-rs development.

As the Arrow Rust community intends to keep a single Rust implementation,
it's quite a challenge to release such API breaks while allowing the
downstream projects to keep up with the development and provide feedback.

Tl;dr: Arrow2 replaces the Arrow-rs Rust implementation; it has nothing to
do with the C++ implementation (other than that all of them share the same
concepts and implement the same standards and formats).

Best regards,
Adam Lippai

On Sun, Jul 18, 2021 at 7:01 PM Simon Perkins 
wrote:

> I'm interested to hear what the relation between arrow2, arrow-rs and the
> main github apache/arrow is. Is the intention to replace the C++ codebase
> with a rust implementation?
>
> The reason I'm asking is that I'm adding complex number support in the C++
> codebase. It may instead be a better idea to do this in the Rust
> implementation if it is indeed replacing the C++ implementation.
>
>
>
> On Sat, Jul 17, 2021 at 1:59 PM Andrew Lamb  wrote:
>
> > What if we released "beta" [1] versions of arrow on cargo at whatever
> pace
> > was necessary? That way dependent crates could opt in to bleeding edge
> > functionality / APIs.
> >
> > There is tension between full technical freedom to change APIs and the
> > needs of downstream projects for a more stable API.
> >
> > Whatever its technical faults may be, projects that rely on arrow (such
> as
> > anything based on DataFusion, like my own) need to be supported as they
> > have made the bet on Rust Arrow. I don't think we can abandon maintenance
> > on the existing codebase until we have a successor ready.
> >
> > Andrew
> >
> > p.s. I personally very much like Adam's suggestion for "Arrow 6.0 in Oct
> > 2021 be based on arrow2" but that is predicated on wanting to have arrow2
> > widely used by downstreams at that point.
> >
> > [1]
> >
> >
> https://stackoverflow.com/questions/46373028/how-to-release-a-beta-version-of-a-crate-for-limited-public-testing
> >
> >
> > On Sat, Jul 17, 2021 at 5:56 AM Adam Lippai  wrote:
> >
> > > 5.0 is being released right now, which means from timing perspective
> this
> > > is the worst moment for arrow2, indeed. You'd need to wait the full 3
> > > months. On the other hand does releasing a 6.0 beta based on arrow2 on
> > Aug
> > > 1st, rc on Sept 1st and releasing the stable on Oct 1st sound like a
> bad
> > > plan?
> > >
> > > I don't think a 6.0-beta release would be confusing and dedicating most
> > of
> > > the 5.0->6.0 cycle to this change doesn't sound excessive.
> > >
> > > I think this approach wouldn't result in extra work (backporting the
> > > important changes to 5.1,5.2 release). It only shows the magnitude of
> > this
> > > change, the work would be done by you anyways, this would just make it
> > > clear this is a huge effort.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Sat, Jul 17, 2021, 11:31 Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Arrow2 and parquet2 have passed the IP clearance vote and are ready
> to
> > be
> > > > merged to apache/* repos.
> > > >
> > > > My plan is to merge them and PR to both of them to the latest updates
> > on
> > > my
> > > > own repo, so that I can temporarily (and hopefully permanently)
> archive
> > > the
> > > > versions of my account and move development to apache/*.
> > > >
> > > > Most of the work happening in arrow-rs is backward compatible or
> simple
> > > to
> > > > deprecate. However, this situation is different in arrow2 and
> > parquet2. A
> > > > release cadence of a major every 3 months is prohibitive at the pace
> > > that I
> > > > am plowing through.
> > > >
> > > > The core API (types, alloc, buffer, bitmap, array, mutable arr

Re: [Discuss] [Rust] Arrow2/parquet2 going foward

2021-07-17 Thread Adam Lippai
5.0 is being released right now, which means that from a timing perspective
this is the worst moment for arrow2, indeed. You'd need to wait the full 3
months. On the other hand, does releasing a 6.0 beta based on arrow2 on Aug
1st, an RC on Sept 1st, and the stable release on Oct 1st sound like a bad
plan?

I don't think a 6.0-beta release would be confusing, and dedicating most of
the 5.0->6.0 cycle to this change doesn't sound excessive.

I think this approach wouldn't result in extra work (backporting the
important changes to 5.1, 5.2 releases). It only shows the magnitude of this
change; the work would be done by you anyway, this would just make it
clear this is a huge effort.

Best regards,
Adam Lippai

On Sat, Jul 17, 2021, 11:31 Jorge Cardoso Leitão 
wrote:

> Hi,
>
> Arrow2 and parquet2 have passed the IP clearance vote and are ready to be
> merged to apache/* repos.
>
> My plan is to merge them and PR to both of them to the latest updates on my
> own repo, so that I can temporarily (and hopefully permanently) archive the
> versions of my account and move development to apache/*.
>
> Most of the work happening in arrow-rs is backward compatible or simple to
> deprecate. However, this situation is different in arrow2 and parquet2. A
> release cadence of a major every 3 months is prohibitive at the pace that I
> am plowing through.
>
> The core API (types, alloc, buffer, bitmap, array, mutable array) is imo
> stable and not prone to change much, but the non-core API (namely IO and
> compute) is prone to change. Examples:
>
> * Add Scalar API to allow dynamic casting over the aggregate kernels and
> parquet statistics
> * move compute/ from the arrow crate into a separate crate
> * move io/ from the arrow crate into a separate crate
> * add option to select encoding based on DataType and field name when
> writing to parquet
>
> (I will create issues for them in the experimental repos for proper
> visibility and discussion).
>
> This situation is usually addressed via the 0.X model in semver 2 (in
> Python fastAPI <https://fastapi.tiangolo.com/> is a predominant example
> that uses it, and almost all in Rust also uses it). However, there are a
> couple of blockers in this context:
>
> 1. We do not allow releases of experimental repos to avoid confusion over
> which is *the* official package.
> 2. arrow-rs is at version 5, and some dependencies like IOx/Influx seem to
> prefer a slower release cadence of breaking changes.
>
> On the other hand, other parts of the community do not care about this
> aspect. Polars for example, the fastest DataFrame in H2O benchmarks,
> currently maintains an arrow2 branch that is faster and safer than master
> [1], and will be releasing the Python binaries from the arrow2 branch. We
> would like to release the Rust API also based on arrow2, which requires it
> to be in Cargo.
>
> The best “hack” that I can come up with given the constraints above is to
> release arrow2 and parquet2 in cargo.io from my personal account so that
> dependents can release to cargo while still making it obvious that they are
> not the official release. However, this is obviously not ideal.
>
> Any suggestions?
>
> [1] https://github.com/pola-rs/polars/pull/922
>
> Best,
> Jorge
>


Re: Improving PR workload management for Arrow maintainers

2021-07-01 Thread Adam Lippai
Not sure if it's applicable, but GitHub is improving:
https://github.blog/changelog/2021-06-23-whats-new-with-github-issues/

That spreadsheet-like issue tracking looks concise.

Best regards,
Adam Lippai

On Wed, Jun 30, 2021, 10:28 Antoine Pitrou  wrote:

>
> Le 30/06/2021 à 10:04, Wes McKinney a écrit :
> >
> > I guess my concern with this is how to quickly separate out "PRs I am
> > keeping an eye on". If there are 100 active PRs and only 20 of them
> > are ones you've interacted with, how do you know which ones need your
> > attention? GitHub does have the "reviewed-by" filter which could be
> > good enough
>
> There's also the "involves" filter that can also select PRs you have
> commented on without giving a formal review.
>
> However, those filters don't let you know which PRs are pending review
> if you haven't already commented on them.
>
> Regards
>
> Antoine.
>


Re: Long title on github page

2021-06-10 Thread Adam Lippai
+1

On Thu, Jun 10, 2021, 23:38 Antoine Pitrou  wrote:

>
> Sound good enough to me.
>
>
> Le 10/06/2021 à 23:35, Wes McKinney a écrit :
> > I hate to reopen this can of worms again, but here is my effort to
> > synthesize feedback:
> >
> > "Apache Arrow is a multi-language toolbox for accelerated data
> > interchange and in-memory processing."
> >
> > On Thu, Jun 10, 2021 at 12:37 PM Dominik Moritz 
> wrote:
> >>
> >> I thought there were some good suggestions in this thread. @Wes, did you
> >> find a description you liked?
> >>
> >> On May 18, 2021 at 06:24:47, Adam Hooper  wrote:
> >>
> >>> Poll question: why did you choose Arrow?
> >>>
> >>> Personally: I researched Arrow because it's a spec for IPC. (My
> requirement
> >>> was: "wrap computations in a separate process.") I chose Arrow for its
> >>> community and ecosystem -- in other words, because my peers chose it.
> >>>
> >>> I happen to use the compute kernel and Parquet capabilities every day;
> but
> >>> they did not sway me at all. I would choose Arrow if it were nothing
> but
> >>> this spec and this community. (I chose HTML, after all.)
> >>>
> >>> I see the *code* as one enormous proof that the *spec* is good, and as
> a
> >>> collection of examples and best practices.
> >>>
> >>> ... so a great pitch to me would be: "Apache Arrow is a data format and
> >>> toolbox for efficient in-memory processing."
> >>>
> >>> Enjoy life,
> >>> Adam
> >>>
> >>> On Tue, May 18, 2021 at 2:38 AM Aldrin 
> wrote:
> >>>
> >>> "Apache Arrow is a data processing library that also provides a
> uniform,
> >>>
> >>> efficient interface for data systems."
> >>>
> >>>
> >>> This probably still isn't quite right, I imagine the bit about "for
> data
> >>>
> >>> systems" needs some addition (maybe "for transport between data
> systems")?
> >>>
> >>>
> >>> My primary motivators:
> >>>
> >>>
> >>> - "A data processing library":
> >>>
> >>>- Arrow provides many language bindings, but ultimately they're
> all
> >>>
> >>>part of the same "library ecosystem", which I think is fine to
> >>>
> >>> capture in
> >>>
> >>>"library"
> >>>
> >>>- A main goal of arrow is for processing to be fast, whatever
> that
> >>>
> >>>processing may be
> >>>
> >>>- "uniform, efficient interface for data systems":
> >>>
> >>>- Arrow, provides (or tries to) a cohesive ("uniform")
> interface for
> >>>
> >>>data processing (although it has several APIs to do this)
> >>>
> >>>- Also, IMO, a motivation for arrow was a format and library to
> >>>
> >>>facilitate processing, but that provided functions and
> >>>
> >>> interfaces to easily
> >>>
> >>>translate into optimized data formats used by disparate data
> systems
> >>>
> >>>(cassandra, hadoop, etc.).
> >>>
> >>>- Arrow tries to be transparently zero-copy, which is part of
> the
> >>>
> >>>interface for efficiency
> >>>
> >>> - Arrow certainly has a data format, but that format is the crux
> of the
> >>>
> >>> interface (IMO). However, it also makes using other formats easy
> (via
> >>>
> >>> filesystem API and parquet reader/writers, etc.). So, focusing on
> the
> >>>
> >>> data
> >>>
> >>> format seems unnecessary in such a terse description.
> >>>
> >>>
> >>>
> >>> Aldrin Montana
> >>>
> >>> Computer Science PhD Student
> >>>
> >>> UC Santa Cruz
> >>>
> >>>
> >>>
> >>> On Mon, May 17, 2021 at 5:07 PM Weston Pace 
> wrote:
> >>>
> >>>
>  I'd avoid the word "structured" as it is somewhat ill-defined.
> >>>
> 
> >>>
>  On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
> >>>
>   wrote:
> >>>
> >
> >>>
> > more marketed:
> >>>
> > How about: "Apache Arrow is a format and language-agnostic library
> >>>
>  focused
> >>>
> > on efficient sharing and processing of structured data."
> >>>
> >
> >>>
> > On Mon, May 17, 2021 at 6:25 PM Micah Kornfield <
> emkornfi...@gmail.com
> >>>
> 
> >>>
> > wrote:
> >>>
> >
> >>>
> >> How about: "Apache Arrow is a collection of specifications, cross
> >>>
>  language
> >>>
> >> libraries and applications focused on efficient sharing and
> >>>
> >>> processing
> >>>
>  of
> >>>
> >> structured data."
> >>>
> >>
> >>>
> >> On Mon, May 17, 2021 at 3:06 PM Wes McKinney 
> >>>
>  wrote:
> >>>
> >>
> >>>
> >>> On Mon, May 17, 2021 at 4:58 PM Weston Pace  >>>
> 
> >>>
> >> wrote:
> >>>
> 
> >>>
> > “Apache Arrow is a format and compute kernel for in-memory
> >>>
> >>> data”
> >>>
> 
> >>>
>  I like this but no one ever knows what "in-memory" means (or they
> >>>
>  just
> >>>
>  think 'data is always in memory').  How about...
> >>>
> 
> >>>
>  "Apache Arrow is a format and compute kernel for zero-copy
> >>>
>  processing
> >>>
>  and sharing of data."
> >>>
> 
> >>>
>  or...
> >>>
> 
> >>>
> >

Re: Long title on github page

2021-05-17 Thread Adam Lippai
Hi,

I'm 100% behind Wes.
Being not just a file format but also providing compute and libraries is one
of Arrow's best selling points.
It shouldn't be reduced to "a file format and its utils", as the ecosystem
is at least that important.
This is something we have to emphasize constantly.

Best regards,
Adam Lippai

On Mon, May 17, 2021 at 8:49 PM Wes McKinney  wrote:

> I think less is better in the description, but unfortunately the
> association of Arrow as being "just a data format" has been actively
> harmful in some ways to community growth. We have a data format, yes,
> but we are also creating a computational platform to go hand-in-hand
> with the data format to make it easier to build fast applications that
> use the data format. So the description needs to capture both of these
> ideas.
>
> On Mon, May 17, 2021 at 12:15 PM Julian Hyde 
> wrote:
> >
> > I think that the “cross-language development platform for” is noise.
> (I’m sure that JPEG developers think that JPEG is a “cross-language
> development platform” too. But it isn’t. It is an image format.)
> >
> > "Apache Arrow is data format for efficient in-memory processing.”
> >
> > I’ll note that In marketing speak, we are developing a high-concept
> pitch [1] here. Every company needs a name, a brand, a high-concept pitch,
> and 3- or 4-sentence description. But every Apache project needs these too.
> It’s worth spending the time on the description, also, and then use them in
> all the places that we describe Arrow.
> >
> > Julian
> >
> > [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> >
> >
> >
> > > On May 17, 2021, at 7:38 AM, Eduardo Ponce 
> wrote:
> > >
> > > I agree with Nate's and Brian's suggestions, but would like to add
> that we
> > > can make it a one-liner for more conciseness and consistency with other
> > > Apache projects.
> > > Apologies if it seems I am going around the suggestions loop again.
> > >
> > > "Apache Arrow is a cross-language development platform enabling
> efficient
> > > in-memory data processing and transport."
> > >
> > >
> > >
> > >
> > > On Mon, May 17, 2021 at 10:11 AM Brian Hulette 
> wrote:
> > >
> > >> Thank you for bringing this up Dominik. I sampled some of the
> descriptions
> > >> for other Apache projects I frequent, the ones with a meaningful
> > >> description have a single sentence:
> > >>
> > >> github.com/apache/spark - Apache Spark - A unified analytics engine
> for
> > >> large-scale data processing
> > >> github.com/apache/beam - Apache Beam is a unified programming model
> for
> > >> Batch and Streaming
> > >> github.com/apache/avro - Apache Avro is a data serialization system
> > >>
> > >> Several others (Flink, Hadoop, ...) just have  "[Mirror of] Apache
> "
> > >> as the description.
> > >>
> > >> +1 for Nate's suggestion "Apache Arrow is a cross-language development
> > >> platform for in-memory data. It enables systems to process and
> transport
> > >> data more efficiently."
> > >>
> > >> On Mon, May 17, 2021 at 5:23 AM Wes McKinney 
> wrote:
> > >>
> > >>> It's probably best for description to limit mentions of specific
> > >>> features. There are some high level features mentioned in the
> > >>> description now ("computational libraries and zero-copy streaming
> > >>> messaging and interprocess communication"), but now in 2021 since the
> > >>> project has grown so much, it could leave people with a limited view
> > >>> of what they might find here.
> > >>>
> > >>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> > >>>  wrote:
> > >>>>
> > >>>> How about
> > >>>> 'Apache Arrow is a cross-language development platform for in-memory
> > >>> data.
> > >>>> It enables systems to process and transport data efficiently,
> > >> providing a
> > >>>> simple and fast library for partitioning of large tables'?
> > >>>>
> > >>>> Sorry the delay, long election day
> > >>>>
> > >>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <
> > >>> natebauernfe...@deephaven.io>
> > >>>> wrote:
> > >>&

Re: [VOTE] [RUST] New release process for arrow-rs

2021-05-11 Thread Adam Lippai
+1 (non-binding)

Best regards,
Adam Lippai

On Wed, May 12, 2021 at 12:16 AM Andrew Lamb  wrote:

> Per previous discussions, I would like to propose a new release process for
> arrow-rs, releasing officially to crates.io every 2 weeks in addition to
> the quarterly release of the other releases.
>
> The proposal is available as [1] , based on previous discussions [2][3] in
> the mailing list and comments on the draft document [4].
>
> Please vote in the following manner. The vote will be open for at least 72
> hours.
>
> [ ] +1 Implement the release process described in the proposal
> [ ] +0
> [ ] -1 Do not implement the process because...
>
> Thank you for your patience and participation,
> Andrew
>
>
> [1]
>
> https://docs.google.com/document/d/1tMQ67iu8XyGGZuj--h9WQYB9inCk6c2sL_4xMTwENGc/edit?ts=60961758
>
> [2]
>
> https://lists.apache.org/thread.html/r6b9baf59e3cd1a91905b5f802057026dfa627f00507638b605a3ff1b%40%3Cdev.arrow.apache.org%3E
>
> [3]
>
> https://lists.apache.org/thread.html/r832296a5bdf8eb363ef1ed7012b8e2dde3fa6894e7fa925a66c6e791%40%3Cdev.arrow.apache.org%3E
>
>
> [4]
>
> https://docs.google.com/document/d/1QTGah5dkRG0Z6Gny_QCHmqMg7L2HmcbEpRISsfNEhSA/edit?ts=60904ac1
>


Re: [RUST] Proposal for more frequent Rust Arrow release process

2021-05-02 Thread Adam Lippai
Hi,

My two cents: at my previous workplace (Tresorit) we created releases every
week, and that process was even heavier, including cross-platform
verification and approval, an 8-12 hour dedicated manual QA process, and
discussions with the marketing and support teams.
Open source is certainly different, as we had 2-8 full-time devs per
platform and 2 dedicated QA engineers to ensure this pace (including the
actual development); however, when we had struggles it was because:

   1. People were on vacation and the flow was disrupted
   2. We had big, cross-platform changes (spanning over 2 weeks) which we
   didn't merge back in time; it hurt us in 50%+ of the cases when we tried to
   "do it in a separate feature branch", which might or might not be a lot

The overhead wasn't big; it was almost only the communication with the
manual QA and creating human-readable release notes. Whether a release
contained a lot of work or only a little, bugfixes, features, or
refactoring, didn't really matter for the overhead or failure rate.
A few years later Google devops started promoting DORA:
https://cloud.google.com/blog/products/devops-sre/another-way-to-gauge-your-devops-performance-according-to-dora
and their experience is the same: more than one release per month is needed
to keep it stable, with a lower failure rate, and to be able to act within a
day if anything goes wrong.
My team was web and JavaScript focused (but we provided APIs) and we
automated most of the deployment process (well, not the approval and manual
QA, of course); however, I saw this working pretty well for years, and if
anything, increasing the pace made the process more deterministic.

Strictness of the language and the compiler might render this approach
impractical (eg. TypeScript and TS-based libs struggle more than JS), but I
think that, management- and QA-wise, this proposal eventually improves the
process and we'll get a better product *while not working more.*

I hope this little anecdotal story reassures some of you that this is not
an over-ambitious change and it enables everyone to continue creating great
things.

Best regards,
Adam Lippai

On Sun, May 2, 2021 at 1:00 PM Andrew Lamb  wrote:

> Micah and Julian, thank you both for your thoughts.
>
> I largely agree with Micah;  While the Apache  process may be heavy weight
> in certain aspects, I think we can achieve Rust releases every 2 week
> within that framework.
>
> As I doubt very much we'll get it perfect on the first try, I was
> envisioning that we start with 2 week releases, see how it goes for a
> while, and then adjust as needed.
>
> Andrew
>
>
> On Sat, May 1, 2021 at 11:20 PM Micah Kornfield 
> wrote:
>
> > >
> > > Apache releases are quite heavyweight, so
> > > work best at a frequency of 2 - 3 months, whereas (IIUC) Rust
> > > developers expect a more lightweight release on a weekly cadence.
> >
> >
> > Thanks, I think clarifications on requirements are helpful.  IIUC is is
> > actually a 2 week cadence which is still fast but seems doable with
> > dedicated community members (and some investment in tooling).
> >
> > What makes the releases heavy weight?  It seems like the process is
> > slightly more tedious than necessarily onerous.  Generating a signed
> > tarball, seems like it should take ~5 minutes or less with the proper
> > tooling?  Verification is more heavy weight but again with the proper
> > tooling and a good system for testing out more changes, it does not seem
> > like it should take too much developer time if no issues arise.  There
> are
> > 3 active contributors to Rust on the PMC, so if they are willing to sign
> up
> > for doing the work of verification and voting on this cadence, what would
> > the other requirements around the process be?
> >
> > Best,
> > Micah
> >
> > On Sat, May 1, 2021 at 3:05 PM Julian Hyde  wrote:
> >
> > > The main tension is not in the proposal but the requirements. It's a
> > > classic impedance mismatch. Apache releases are quite heavyweight, so
> > > work best at a frequency of 2 - 3 months, whereas (IIUC) Rust
> > > developers expect a more lightweight release on a weekly cadence. I
> > > was trying to find other projects that had had the same problem, and
> > > solved it somehow. And also raise awareness within Apache that the
> > > release process is problematic for some communities in 2021.
> > >
> > > To correct a couple of misconceptions:
> > > * In Apache, the signed source artifacts (tarball) are literally the
> > > release. Not a git hash, not a set of binary artifacts. That is what
> > > people need to vote on.
> > > * The release vote does not have to last 72 hours. It can be a shorter
> > > period, if the community agrees.
> > >
> > > Julian
> > >
> > >
> > > On Sat, May 1, 2021 at 1:31 PM Micah Kornfield 
> > > wrote:
> > > >
> > > > Hi Julian,
> > > > I didn't read this proposal as being in tension with apache releases.
> > It
> > > > sounds like the intention is to hold a vote every two weeks to
> verify a
> > > > rele

Re: Including JS patch in 4.0.1 if released

2021-04-28 Thread Adam Lippai
Really good question; it should throw upon Builder() creation.
The change in the PR fixes an exception when importing the Arrow lib.
I'll test it later today or tomorrow.

Best regards,
Adam Lippai


On Wed, Apr 28, 2021 at 6:38 PM Dominik Moritz  wrote:

> Thank you for the pull request.
>
> I’m curious regarding, CSP (which restricts the use of eval), did you try
> that Arrow 3 works? I’m wondering since Arrow already uses a Function
> constructor in
>
> https://github.com/apache/arrow/blob/0d11014ee8e6ce408ddbbdfb788d901dd6c6374f/js/src/builder/valid.ts#L66
> .
>
> On Apr 28, 2021 at 00:55:27, Adam Lippai  wrote:
>
> > Hi,
> >
> > I'd want to propose including https://github.com/apache/arrow/pull/10181
> > in
> > 4.0.1 release, if it happens.
> > While the issue is barely a bugfix, it's still a minor regression since
> > 3.0.0. It happens only in special circumstances, eg. using Rollup bundler
> > or restricting eval() for security reasons. It doesn't warrant a new
> > release on it's own as it affects a marginal fraction of the consumers
> > only.
> >
> > The change shouldn't affect the API or the required NodeJS versions.
> >
> > Best regards,
> > Adam Lippai
> >
>


Including JS patch in 4.0.1 if released

2021-04-28 Thread Adam Lippai
Hi,

I'd like to propose including https://github.com/apache/arrow/pull/10181 in
the 4.0.1 release, if it happens.
While the issue is barely a bugfix, it's still a minor regression since
3.0.0. It happens only in special circumstances, eg. using the Rollup bundler
or restricting eval() for security reasons. It doesn't warrant a new
release on its own, as it affects only a marginal fraction of the
consumers.

The change shouldn't affect the API or the required NodeJS versions.

Best regards,
Adam Lippai


Re: Rust sync meeting

2021-04-08 Thread Adam Lippai
Didn't this happen in the thread "[Rust] [Discuss] proposal to redesign
Arrow crate to resolve safety violations" on 7th February before any commit
in arrow2 (resulting in zero discussion or any objection)?

Best regards,
Adam Lippai

On Thu, Apr 8, 2021 at 4:41 PM Wes McKinney  wrote:

> Hi Ben,
>
> I’m not suggesting adding any extra development process to slow things down
> on an experimental project like this. My principle objection is to the lack
> of discussion on the public record.
>
> A short DISCUSS email from Jorge explaining the project and soliciting
> feedback from the community would have been enough. It would have also been
> an opportunity to set up auxiliary git repositories or branches to
> accommodate the work on ASF channels (to enable collaboration inside the
> community) rather than occurring as a “lone wolf” project outside the
> community.
>
> On Thu, Apr 8, 2021 at 9:03 AM Benjamin Blodgett <
> benjaminblodg...@gmail.com>
> wrote:
>
> > Wes,
> >
> > I think we understand your administrative prerogative.  I think in both
> > Julia and Rust cases, you just have engineers that want to go fast and do
> > very needed, deep changes for security and performance.  I think, and
> this
> > is just my wild guess, that to go through the Apahce process (because
> it's
> > slow and administered by part timers) for such exploratory work would be
> > detrimental, because it would slow said communities down.
> >
> > I think we should let them do their 'exploratory development branches',
> and
> > commit/merge back whatever they want to/feel they have got to a material
> > contribution/end point.  I think you did something similar with Pandas2
> and
> > feather/arrow.
> >
> > Either way, we should be praising and encouraging Jorge to go fast and
> see
> > it as a development branch, and figure out how to remerge as we see the
> > value of the new features.
> >
> > I think at this point, they are going and we should just encourage them
> and
> > realize there is enough support in the Rust and Julia community to deal
> > with the merge down the road post rewrite.  I do see Jorges Arrow2 is
> > substantially better in many vectors and solves a lot of problems that
> have
> > been plaguing Arrow.  Also we need a fundamental rewrite of APIs to make
> a
> > rust version of Zero copy work, which could mean we have to have a
> slightly
> > different arrow format in how it interacts with memory.
> >
> > " I do think that I was vocal enough. At some point
> > the interactions here started to affect my wellbeing and I thus decided
> to
> > scale down by efforts."
> >
> > On Thu, Apr 8, 2021 at 6:03 AM Wes McKinney  wrote:
> >
> > > On Thu, Apr 8, 2021 at 7:49 AM Wes McKinney 
> wrote:
> > > >
> > > > With both what has occurred with the Julia project and what may
> > > > possibly be occurring (what appears to me to be occurring) with these
> > > > Rust overhaul projects, is that the communities expectations with
> > > > regards to Openness aren't being followed.
> > > >
> > > > If a change is significant and will affect other developers in the
> > > > community, such as a large refactor or a large PR that will interfere
> > > > with development in some portion of the codebase (example, I wrote a
> > > > large and disruptive C++ PR last may that affected all of our compute
> > > > kernels), then it needs to be discussed in a public place as soon as
> > > > you are aware of it. The Openness tenet is about an obligation that
> > > > individuals have to communicate with their fellow collaborators.
> > >
> > > For the record, the reason I'm making a fuss about this is that a
> > > contributor proposed a contribution moratorium in the Apache project
> > > as the result of work happening outside the community
> > >
> > > "would like to raise the idea of temporarily suspending major PRs
> against
> > > Rust
> > > Arrow/DataFusion until the work to incorporate the two big changes for
> > > Rust/DataFusion:
> > >
> > > 1. Jorge's major refactor/rewrite of the core Rust Arrow code.
> > > ...
> > > "
> > >
> > > I'm not trying to make things difficult for you all, but this looks
> > > like the kind of thing we would like to avoid.
> > >
> > > > Jorge has said that "that [the community] is unwilling to change some
> > > > of it

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Adam Lippai
Hi Weston,

Objective note:
I'm just a user, but I want to add that so far the Arrow releases have been
of pretty good quality, which means you are making good calls.

Personal opinion:
There were several annoying bugs where one would have to change a
parameter between parquet V1/V2 or threaded / non-threaded, but nothing
exceptional yet.
If you feel the work is ready and you are concerned about its unusual
size, then I'd say go with the merge; my experience is that size is not
worrying on its own, so there is no need for extra caution.
If you feel it's under-tested or under-reviewed compared to the previous
code, then it's a different topic; it should be as good as the current
*average*.
You can make it the default behavior: if the bugs are critical, everybody
can stay on 3.x instead of 4.0 until 4.0.1 arrives, or use workarounds (eg.
disabling threading).
Version 4.0 is not a critical bugfix release one would need to upgrade to
instantly.
You wouldn't deprive them of the opportunity to lower the risks or resolve
bugs in production.

Final thought:
While they are good enough, the releases in this field  - like pandas, dask
or airflow - can't be compared to how you deliver the new major versions,
so be proud and optimistic :)

Best regards,
Adam Lippai

On Wed, Apr 7, 2021 at 3:53 PM David Li  wrote:

> Hey Weston,
>
> First, thanks for all your work in getting these changes so far. I think
> it's also been a valuable experience in working with async code, and
> hopefully the problems we've run into so far will help inform further work,
> including with the query engine.
>
> If you're not comfortable merging, then we shouldn't try to rush it
> through, regardless of reviewer availability. If we're going to split it
> up, I would prefer the 'ScannerV2' approach, as while it'll clutter the API
> for a bit, at least gives us a fallback if we observe threading issues 'in
> the wild'.
>
> As for rxcpp - I took a quick look and while it doesn't seem abandoned, at
> least the maintainers are no longer very active. While the library looks
> mature, there are (for example) issues filed for compatibility with future
> C++ versions sitting unreviewed, and the blurb about it targeting only the
> latest C++ compilers might not work so well for us.
>
> I think it may be useful to explicitly align the async generator utilities
> with their rxcpp or ReactiveX equivalents so that there's some common
> concepts we can refer back to, especially if we further expand the
> utilities. While not many of us may be familiar with rxcpp already, at
> least we'd have a reference for how our utilities are supposed to work.
>
> Using the framework for query execution is an interesting point - doing so
> might feel like wasted work but again, hopefully we could apply the lessons
> here towards the query framework. (For instance, integrating any
> debug/tracing utilities we may have wanted, as with detecting 'abandoned'
> futures in ARROW-12207.)
>
> -David
>
> On 2021/04/07 17:18:30, Weston Pace  wrote:
> > I have been working the last few months on ARROW-7001 [0] which
> > enables nested parallelism by converting the dataset scanning to
> > asynchronous (previously announced here[1] and discussed here[2]).  In
> > addition to enabling nested parallelism this also allows for parallel
> > readahead which gives significant benefits on parallel filesystems
> > (i.e. S3).
> >
> > The good news: The PR is passing CI (it looks like there is one
> > failure in python that may or may not be related.  The RTools 3.5
> > failure is expected at the moment).  David Li has done some great work
> > investigating the performance benefits.  There are substantial
> > benefits on S3 for both IPC [3] and Parquet [4] across a variety of
> > parameters.  With additional work[5] the CSV reader could also reap
> > these benefits.
> >
> > The bad news: The change is significant.  Previously Micah has
> > expressed some concern about the viralness of Futures.  Also, when I
> > was requesting a pull review, Antoine expressed some concern about the
> > substantial amount of tricky code we are going to have to maintain in
> > src/arrow/util/async_generator.h.  The change is also innately
> > error-prone as it deals with threading.  Last week's trouble with the
> > R build was a direct result of some of the precursor work for this
> > feature.
> >
> > **For the above reasons I am sadly recommending that this feature not
> > target 4.0 as I had previously hoped.**
> >
> > For discussion:
> >
> > 1) Merging
> >
> > The change is significant (+1400/-730 atm).  I have done my bes

Re: [Rust] Contributing to Apache Arrow

2021-03-03 Thread Adam Lippai
Thank you Antoine

Best regards,
Adam Lippai

On Wed, Mar 3, 2021 at 12:28 PM Antoine Pitrou  wrote:

>
> I think one needs to have the "Contributor" role to be assigned issues.
> I've added to both of you (Ivan and Adam), so you should be able to
> auto-assign now.
>
> Regards
>
> Antoine.
>
>
> On Wed, 3 Mar 2021 10:46:23 +0100
> Adam Lippai  wrote:
> > I can't do it either (when logged in), it's not allowed.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Wed, Mar 3, 2021 at 9:53 AM Yibo Cai  wrote:
> >
> > > Hi Ivan,
> > >
> > > I guess you didn't log in Jira? Otherwise you will see "Assign to me"
> link
> > > at the right pane.
> > > You can click "Log In" at the upper right corner, maybe "Sign up" an
> > > account if you don’t have.
> > >
> > > Yibo
> > >
> > >
> > > -Original Message-
> > > From: Ivan Vankov 
> > > Sent: Wednesday, March 3, 2021 16:41
> > > To: dev@arrow.apache.org
> > > Subject: [Rust] Contributing to Apache Arrow
> > >
> > > Hello,
> > > I decided to try contributing to Apache arrow. Since I'm completely
> new to
> > > this project I've chosen a beginner friendly task ARROW-10903, but I
> cannot
> > > assign it to myself. So, could someone please help with that?
> > >
> >
>
>
>
>


Re: [Rust] Contributing to Apache Arrow

2021-03-03 Thread Adam Lippai
I can't do it either (when logged in), it's not allowed.

Best regards,
Adam Lippai

On Wed, Mar 3, 2021 at 9:53 AM Yibo Cai  wrote:

> Hi Ivan,
>
> I guess you didn't log in Jira? Otherwise you will see "Assign to me" link
> at the right pane.
> You can click "Log In" at the upper right corner, maybe "Sign up" an
> account if you don’t have.
>
> Yibo
>
>
> -Original Message-
> From: Ivan Vankov 
> Sent: Wednesday, March 3, 2021 16:41
> To: dev@arrow.apache.org
> Subject: [Rust] Contributing to Apache Arrow
>
> Hello,
> I decided to try contributing to Apache arrow. Since I'm completely new to
> this project I've chosen a beginner friendly task ARROW-10903, but I cannot
> assign it to myself. So, could someone please help with that?
>


Re: Requirements on JIRA usage in Apache Arrow

2021-03-02 Thread Adam Lippai
Antoine,

you are right that I'm directly challenging the statement that any issue
tracker, forum or chat is as good as GitHub.
I'm not speaking about tools and efficiency. In those terms you are
absolutely right and most of the solutions are clearly superior to GitHub.

I was talking about reach, community size and ease of onboarding.
I don't think I need to bring examples of how GitHub is a magnitude ahead
of others, being The Ecosystem for OSS development.
I don't like this trend, I'd be happy to see the ecosystem to be more
distributed on GitLab and Bitbucket, but that's not the current status and
not a trend today.

New people have to learn to interact with the Arrow community now. I
don't doubt their ability to learn it, but the point is that they have to
learn and get involved with Arrow-specific tools.
Most people are less focused on Arrow; they use dozens or hundreds
of projects, so we are asking them to move from their usual workflow
(GitHub) to a specialized one.
They are users first and active members or developers second, and they will
always be in the majority. We may or may not want to please that future group.

Let me know what you think, and whether you agree that people are more
familiar with GitHub than with other tools without putting in extra effort.
I was trying to draw attention to the people/social/community aspect, not
the ease of use or the right level of automation.
I don't think the current setup is any harder than others, but it is
different and an outlier.

As this is only one (minor) aspect of the question, I don't think I need to
convince you that it is important or that I am right.
I felt that we are a little bit in an echo chamber; that's why I
brought up this controversial dimension.
I didn't want to exaggerate, and I don't think I did when I used the words
"huge difference" and "magnitude".
From a user's perspective, discussions sometimes happen on GitHub, but never
on GitLab, Bitbucket or Jira.
I might live in my own bubble, but I haven't yet seen a popular Jira tracker
where discussions are lively and diverse.

Choosing GitHub would likely shift the focus (towards users and the
ecosystem, away from development) and temporarily (measured in months or
years) put more work on the existing core members.

P.S. I have had a positive experience here with you and the Arrow community,
and I'm grateful for all the answers and help I've received. The mails above
are not criticism, not even a little bit.

Best regards,
Adam Lippai

On Tue, Mar 2, 2021 at 12:35 PM Antoine Pitrou  wrote:

> On Tue, 2 Mar 2021 11:10:23 +0100
> Adam Lippai  wrote:
> >
> > All the (multiple) mailing lists, stack overflow and JIRA are definitely
> > barriers for new contributors.
>
> I'm not sure what Stack Overflow has to do with this?  Interaction with
> Stack Overflow isn't required to contribute to Arrow.
>
> (also, I don't really understand the concern with SO, at least where
> user-friendliness is concerned)
>
> > They require familiarity (people born after 2000 are not familiar with
> > mailing lists or JIRA, but they are with GitHub) and setup (filters,
> > notifications).
>
> Well, I'm not very impressed by this argument.  "People born after 2000"
> aren't cognitively different, and they should be able to adapt to the
> same tools as other people.  Everyone was unfamiliar with mailing-lists
> and issue trackers at some point, and very diverse people learned to be
> familiar with them.
>
> I'm also concerned by the laziness that seems implied by the "Github or
> nothing" mentality.  Experienced developers need to master a variety of
> tools over their career.  Learning a second issue tracker is a very
> mild effort to require of them.
>
> > Keeping everything (discussions, issues, PRs) in one place has huge added
> > value, but not for the core members and people working in this
> environment
> > for years.
>
> It does have added value, but I disagree that it's "huge". There are
> integrations in place between Github and the Apache JIRA that are
> perhaps not to the level of the integrations within Github itself, but
> still convenient.
>
> We can discuss opening more communication spaces.  But they will need
> core developer attention (since mailing-lists are not going to vanish),
> which will increase the required effort to keep up.
>
> > I understand if we stick with JIRA, but I'm 100% sure there are people
> not
> > asking questions, not raising issues, not giving feedback and not
> > contributing because of the mailing lists and JIRA already.
> > They wouldn't have the best ROI, but we can acknowledge there is a room
> for
> > improvement.
>
> Sure.  But I doubt that framing the topic as "it's Github that we need"
> is going to lead to productive discussion.
>
> Regards
>
> Antoine.
>
>
>


Re: Requirements on JIRA usage in Apache Arrow

2021-03-02 Thread Adam Lippai
I've seen multiple Apache projects using GitHub for issue tracking, but they
are not really positive examples.
Often they don't use the milestones and labels available; I'd be sad if
we ended up with that style.

GitHub on its own is usually good enough if used correctly and helped along
with bots.
There are multiple OSS projects doing decent work and administration using
GitHub only.

All the (multiple) mailing lists, stack overflow and JIRA are definitely
barriers for new contributors.
They require familiarity (people born after 2000 are not familiar with
mailing lists or JIRA, but they are with GitHub) and setup (filters,
notifications).
I admit this helps reduce the noise, keeps the focus on serious participants
only, and makes sure things are done the "right way", but I wanted to note
this different aspect too.

Keeping everything (discussions, issues, PRs) in one place has huge added
value, though not for the core members and people who have been working in
this environment for years.
While I don't necessarily think it is worth switching now, converging on a
single platform improves the reach and diversifies/grows the Arrow
community in the long run.

I understand if we stick with JIRA, but I'm 100% sure there are already
people not asking questions, not raising issues, not giving feedback and not
contributing because of the mailing lists and JIRA.
They wouldn't have the best ROI, but we can acknowledge there is room for
improvement.

Best regards,
Adam Lippai


On Tue, Mar 2, 2021 at 10:26 AM Antoine Pitrou  wrote:

>
> Hi Jorge,
>
> On Tue, 2 Mar 2021 08:55:03 +0100
> Jorge Cardoso Leitão  wrote:
> > Hi,
> >
> > FWIW, the amount of bureaucracy that goes into JIRA is a major
> contributing
> > factor for the reduction of my time commitment to this project by 80%+.
>
> Can you expand a bit on this?  In particular, which aspects of using
> JIRA feel bureaucratic?  Is it the requirement to create a new issue
> for each PR?  Or is it other concerns (such as the UI for entering or
> searching issues)?
>
> I can't say I like JIRA myself, but at least it provides the
> classification and navigation features that I would expect from an
> issue tracker.  The Github issue tracker AFAIK is rudimentary and not
> really practical when a project has accumulated many issues (but they
> may have changed this recently).
>
> > The major challenge is that most discussions happen where PRs are created
> > and seen, which is on github, but JIRA and mailing list is used for other
> > types of decisions. In this model, how do we preserve curated information
> > about the decision process while at the same time leverage both JIRA and
> > github's capabilities?
>
> In my experience, discussion on JIRA is about the issue itself (for
> example diagnosing a bug or discussing a feature), then discussion on
> the PR is about the implementation.  JIRA discussions are generally
> readable by users (and indeed, users often participate) while PR
> discussions are really for developers of the project.
>
> > OTOH, asking contributors to create a jira account
> > and committers to add the person as contributor, as well as the email
> spam
> > and the merge process is a large barrier.
>
> FWIW, I've set up a mail filter that sends all "work logged" automated
> mail to the trashbin.  I agree it's unfortunate that developers have to
> do that.  I also have other qualms with the Apache JIRA configuration,
> such as the fact that "labels" (keywords) are shared between all
> projects, so there is essentially a million of them with no effort at
> taxonomy.
>
> > IMO the foundation could be clearer wrt to what does it mean with
> > information being preserved and available (e.g. on apache servers?) and
> if
> > yes, follow it through by hosting all their projects on their own github
> /
> > gitlab / whatever, where issues and PRs are on the same platform, and
> offer
> > SSO for contributors as a way to prove identity across the system. But
> that
> > is also a complex operation with a lot of unknowns...
>
> From what I see of the ASF's velocity, I wouldn't expect such a large
> breakthrough in the short future.
>
> (this is not trying to badmouth the ASF, just a pragmatic evaluation)
>
> Regards
>
> Antoine.
>
>
>


Re: Arrow 3.0 release

2021-01-13 Thread Adam Lippai
There has been so much Rust development recently (321 issues out of the
total 650, almost 4 issues per day, including weekends and holidays) that
there is a big chance some cool and great PRs will be delayed to 4.0.0.
This is not "just okay", but something we should celebrate.
I'm simply stunned by the amount of work you have done since 2.0.0.
If anything, the release cadence could be improved, but that's already on
the horizon.

I'm just a simple user, but thanks to everybody in the Arrow community. I'm
learning a lot just by reading about the developments in different fields
(the DataFusion optimizer, Python API improvements, kernel optimizations,
Parquet integrations).

Best regards,
Adam Lippai

On Wed, Jan 13, 2021 at 3:30 PM Neville Dipale 
wrote:

> Good day,
>
> I was hoping to complete the parquet list writer PR in time, but even
> though I've been burning the midnight oil addressing the remaining
> issues in the PR, I won't make it in time :(
>
> I've removed the Rust blocker from the milestone, so there should be
> nothing on my side blocking us.
>
> Thanks everyone for the hard work.
>
> Neville
>
> On Wed, 13 Jan 2021 at 03:29, Neal Richardson  >
> wrote:
>
> > Last call for 3.0.
> > https://cwiki.apache.org/confluence/display/ARROW/Arrow+3.0.0+Release is
> > closing out, and hopefully we'll be releasable by the end of tomorrow. If
> > you're trying to get something merged in time for the release, now is the
> > time!
> >
> > Thanks all for your hard work under challenging circumstances to get us
> to
> > this point.
> >
> > Neal
> >
> > On Mon, Jan 11, 2021 at 11:48 AM Antoine Pitrou 
> > wrote:
> >
> > >
> > > I think it would be nice to get
> > > https://github.com/apache/arrow/pull/9164 in because it changes the
> > > current behaviour and is also stricter with its inputs.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 11/01/2021 à 20:46, Neal Richardson a écrit :
> > > > Hi all,
> > > > We seem to be getting closer on resolving our CI and packaging
> > > challenges,
> > > > despite several additional, unexpected setbacks last week. I think we
> > > > should be in releasable condition in the next day or two. If you have
> > any
> > > > other outstanding issues you're trying to get in the 3.0 release, now
> > is
> > > > the time.
> > > >
> > > > We need a volunteer from the PMC to be release manager for 3.0.0.
> > > Krisztián
> > > > has done the last several releases, and I was wondering if there was
> > > anyone
> > > > else out there who wanted to take a turn and give him a break this
> > time.
> > > >
> > > > Neal
> > > >
> > > > On Wed, Jan 6, 2021 at 9:03 PM Sutou Kouhei 
> > wrote:
> > > >
> > > >> Hi Neal,
> > > >>
> > > >> Thanks!
> > > >>
> > > >>> ARROW-11155: [C++][Packaging] Move gandiva crossbow jobs off of
> > > Travis-CI
> > > >>
> > > >> I closed this because it has been done by
> > > >> https://issues.apache.org/jira/browse/ARROW-11015 .
> > > >>
> > > >>
> > > >> Thanks,
> > > >> --
> > > >> kou
> > > >>
> > > >> In  > gcx6bskna6udz...@mail.gmail.com>
> > > >>   "Re: Arrow 3.0 release" on Wed, 6 Jan 2021 11:39:02 -0800,
> > > >>   Neal Richardson  wrote:
> > > >>
> > > >>> I made some JIRAs for these issues:
> > > >>>
> > > >>> ARROW-11152: [CI][C++] Fix Homebrew numpy installation on macOS
> > builds
> > > >>> ARROW-11153: [C++][Packaging] Move debian/ubuntu/centos packaging
> off
> > > of
> > > >>> Travis-CI (assigned to Kou)
> > > >>> ARROW-11154: [CI][C++] Move homebrew crossbow tests off of
> Travis-CI
> > > >>> ARROW-11155: [C++][Packaging] Move gandiva crossbow jobs off of
> > > Travis-CI
> > > >>>
> > > >>> On Wed, Jan 6, 2021 at 5:50 AM Andrew Wieteska <
> > > >> andrew.r.wiete...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi Kou,
> > > >>>>
> > > >>>> For sure! I'll work on this.
> > > >>>>
> > > >

Re: Serializing nested pandas dataframes

2020-10-31 Thread Adam Lippai
Hi,

This sounds really promising. I'm curious how JS handles StructArrays, but
in theory it should work.
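
For future reference, here is a minimal sketch of how I understand the
suggestion (pyarrow-only, assuming a recent release; the column names and
data below are made up):

import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=6),
    "category": ["a", "b", "a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Sort so that each group's rows are contiguous, then record group boundaries.
df = df.sort_values("category", kind="stable")
sizes = df.groupby("category", sort=True).size()
offsets = np.concatenate([[0], np.cumsum(sizes.values)]).astype("int32")

# One struct per row (a StructArray is interchangeable with a RecordBatch),
# then one list item per group.
rows = pa.StructArray.from_arrays(
    [pa.array(df["date"]), pa.array(df["category"]), pa.array(df["value"])],
    names=["date", "category", "value"],
)
grouped = pa.ListArray.from_arrays(pa.array(offsets, type=pa.int32()), rows)

# A two-column batch with one row per group, ready to ship over IPC to JS.
batch = pa.RecordBatch.from_arrays(
    [pa.array(sizes.index), grouped], names=["category", "rows"]
)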

Best regards,
Adam Lippai

On Fri, Oct 30, 2020 at 3:07 PM Benjamin Kietzman 
wrote:

> Hi Adam,
>
> Arrow does not support nesting tables inside other tables. However, a
> record batch
> is interchangeable with a struct array so you could achieve something
> similar
> by converting from a RecordBatch with columns `...c` to a StructArray with
> child
> arrays `...c`. In C++ we have /RecordBatch::{To,From}StructArray/ for this
> purpose.
> Only from_struct_array is exposed in python but to_struct_array would be a
> simple
> change to make.
>
> Grouping could then be emulated by sorting the StructArray and wrapping it
> in a
> ListArray so that each list item contains the rows of a group. (This is
> similar to
> Impala's interpretation of list and map columns as persistent
> joins/groupings
>
> https://docs.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_queries
> )
>
> Would that be sufficient for your use case?
>
> On Thu, Oct 29, 2020 at 5:19 PM Adam Lippai  wrote:
>
> > This is what I want to extend for multiple tables:
> >
> >
> https://issues.apache.org/jira/browse/ARROW-10045?focusedCommentId=17207790&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17207790
> > I would need to come up with custom binary wrapper for multiple
> serialized
> > pyarrow tables and since Arrow supports hierarchical data to some level,
> I
> > was looking for built-in support of nested tables.
> > I understand this might not be available on API level.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Thu, Oct 29, 2020 at 10:14 PM Adam Lippai  wrote:
> >
> > > If I have a DataFrame with columns Date, Category, Value and group by
> > > Category I'll have multiple DataFrames with Date, Value columns.
> > > The result of the groupby is DataFrameGroupBy, which can't be
> serialized.
> > > This is why I tried to assemble a nested DataFrame instead (like the
> one
> > in
> > > the SO link previously), but that doesn't work either.
> > >
> > > As Apache Arrow JS doesn't support groupby (processing the original DF
> on
> > > the client-side), I was thinking of pushing the groupby operation to
> the
> > > server side (pyarrow), doing the groupby in pandas before serializing
> and
> > > sending it to the client.
> > > I was wondering whether this (nested arrow tables) is a supported
> feature
> > > or not (by calling chained table.toArray() or similar solution)
> > > Currently I process it in pure JS, it's not that ugly, but not really
> > > idiomatic either. The lack of Categorial data type and processing it
> row
> > by
> > > row certainly has it's perf. price.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Thu, Oct 29, 2020 at 9:39 PM Joris Van den Bossche <
> > > jorisvandenboss...@gmail.com> wrote:
> > >
> > >> Can you give a more specific example of what kind of hierarchical data
> > >> you want to serialize? (eg the output of a groupby operation in pandas
> > >> typically is still a dataframe that can be converted to pyarrow and
> > >> serialized).
> > >>
> > >> In general, for hierarchical data we have the nested data types (eg
> > >> struct type when you nest "multiple columns in a single column").
> > >>
> > >> Joris
> > >>
> > >>
> > >> On Thu, 29 Oct 2020 at 15:29, Adam Lippai  wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > is there a way to serialize (IPC) hierarchical tabular data (eg.
> > output
> > >> of
> > >> > pandas groupby) in python?
> > >> > I've tried to call pa.ipc.serialize_pandas() on this example, but it
> > >> throws
> > >> > error:
> > >> >
> > https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes
> > >> >
> > >> > Best regards,
> > >> > Adam Lippai
> > >>
> > >
> >
>


Re: Serializing nested pandas dataframes

2020-10-29 Thread Adam Lippai
This is what I want to extend for multiple tables:
https://issues.apache.org/jira/browse/ARROW-10045?focusedCommentId=17207790&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17207790
I would need to come up with custom binary wrapper for multiple serialized
pyarrow tables and since Arrow supports hierarchical data to some level, I
was looking for built-in support of nested tables.
I understand this might not be available on API level.

Best regards,
Adam Lippai

On Thu, Oct 29, 2020 at 10:14 PM Adam Lippai  wrote:

> If I have a DataFrame with columns Date, Category, Value and group by
> Category I'll have multiple DataFrames with Date, Value columns.
> The result of the groupby is DataFrameGroupBy, which can't be serialized.
> This is why I tried to assemble a nested DataFrame instead (like the one in
> the SO link previously), but that doesn't work either.
>
> As Apache Arrow JS doesn't support groupby (processing the original DF on
> the client-side), I was thinking of pushing the groupby operation to the
> server side (pyarrow), doing the groupby in pandas before serializing and
> sending it to the client.
> I was wondering whether this (nested arrow tables) is a supported feature
> or not (by calling chained table.toArray() or similar solution)
> Currently I process it in pure JS, it's not that ugly, but not really
> idiomatic either. The lack of a Categorical data type and processing it row
> by row certainly has its performance price.
>
> Best regards,
> Adam Lippai
>
> On Thu, Oct 29, 2020 at 9:39 PM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
>> Can you give a more specific example of what kind of hierarchical data
>> you want to serialize? (eg the output of a groupby operation in pandas
>> typically is still a dataframe that can be converted to pyarrow and
>> serialized).
>>
>> In general, for hierarchical data we have the nested data types (eg
>> struct type when you nest "multiple columns in a single column").
>>
>> Joris
>>
>>
>> On Thu, 29 Oct 2020 at 15:29, Adam Lippai  wrote:
>> >
>> > Hi,
>> >
>> > is there a way to serialize (IPC) hierarchical tabular data (eg. output
>> of
>> > pandas groupby) in python?
>> > I've tried to call pa.ipc.serialize_pandas() on this example, but it
>> throws
>> > error:
>> > https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes
>> >
>> > Best regards,
>> > Adam Lippai
>>
>


Re: Serializing nested pandas dataframes

2020-10-29 Thread Adam Lippai
If I have a DataFrame with columns Date, Category, Value and group by
Category I'll have multiple DataFrames with Date, Value columns.
The result of the groupby is DataFrameGroupBy, which can't be serialized.
This is why I tried to assemble a nested DataFrame instead (like the one in
the SO link previously), but that doesn't work either.

As Apache Arrow JS doesn't support groupby (processing the original DF on
the client side), I was thinking of pushing the groupby operation to the
server side (pyarrow), doing the groupby in pandas before serializing and
sending it to the client.
I was wondering whether this (nested Arrow tables) is a supported feature
or not (by calling chained table.toArray() or a similar solution).
Currently I process it in pure JS; it's not that ugly, but not really
idiomatic either. The lack of a Categorical data type and processing it row
by row certainly has its performance price.

Best regards,
Adam Lippai

On Thu, Oct 29, 2020 at 9:39 PM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Can you give a more specific example of what kind of hierarchical data
> you want to serialize? (eg the output of a groupby operation in pandas
> typically is still a dataframe that can be converted to pyarrow and
> serialized).
>
> In general, for hierarchical data we have the nested data types (eg
> struct type when you nest "multiple columns in a single column").
>
> Joris
>
>
> On Thu, 29 Oct 2020 at 15:29, Adam Lippai  wrote:
> >
> > Hi,
> >
> > is there a way to serialize (IPC) hierarchical tabular data (eg. output
> of
> > pandas groupby) in python?
> > I've tried to call pa.ipc.serialize_pandas() on this example, but it
> throws
> > error:
> > https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes
> >
> > Best regards,
> > Adam Lippai
>


Serializing nested pandas dataframes

2020-10-29 Thread Adam Lippai
Hi,

is there a way to serialize (IPC) hierarchical tabular data (eg. output of
pandas groupby) in python?
I've tried to call pa.ipc.serialize_pandas() on this example, but it throws
error:
https://stackoverflow.com/questions/51505504/pandas-nesting-dataframes

Best regards,
Adam Lippai


Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Adam Lippai
Hi Neville,

yes, my concern with common row-based DB APIs is that I use
Arrow/Parquet for OLAP too.
What https://turbodbc.readthedocs.io/en/latest/ (Python) or
https://github.com/pacman82/odbc-api#state (Rust) do is read
large blocks of data instead of processing rows one by one, but indeed,
ODBC and the PostgreSQL wire protocol are still row-based.

ClickHouse is an interesting example, as it directly supports Arrow and
Parquet *server-side* (I haven't tried it yet, just read about it in the docs).
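
For illustration, a minimal sketch of the block-wise style I mean, using
turbodbc's Arrow support (assuming turbodbc was built with that option; the
DSN and query are placeholders):

import turbodbc

connection = turbodbc.connect(dsn="my_dsn")
cursor = connection.cursor()
cursor.execute("SELECT date, category, value FROM measurements")

# The whole result set is materialized column-wise into a pyarrow.Table;
# turbodbc fills large per-column buffers instead of walking rows one by one.
table = cursor.fetchallarrow()
print(table.num_rows, table.schema)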

Best regards,
Adam Lippai

On Sun, Sep 27, 2020 at 11:24 PM Neville Dipale 
wrote:

> Thanks for the feedback
>
> My interest is mainly in the narrow usecase of reading and writing batch
> data,
> so I wouldn't want to deal with producing and consuming rows per se.
> Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the
> row-based or OLTP case,
> and I'm considering something more suitable for the OLAP case.
>
> @Wes I'll have a read through the Python DB API, I've also been looking at
> JDBC
> as well as how Apache Spark manages to get such good performance from JDBC.
>
> I haven't been an ODBC fan, but mainly because of historic struggles with
> getting it to work
> on Linux envs where I don't have system control. WIth that said, we could
> still support ODBC.
>
> @Jorge, I have an implementation at rust-dataframe (
> https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres)
> which uses rust-postgres. I however don't use the row-based API as that
> comes at
> a serialization cost (going from bytes > Rust types > Arrow).
> I instead use the
> Postgres binary format (
>
> https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204
> ).
> That postgres module would be the starting point of such separate crate.
>
> For Postgres <> Arrow type conversions, I leverage 2 methods:
>
> 1. When reading a table, we I get schema from the *information_schema*
> system
> table
> 2. When reading a query, I issue the query with a 1-row limit, and convert
> the row's schema to an Arrow schema
>
> @Adam I think async and pooling would be attainable yes, if an underlying
> SQL crate
> uses R2D2 for pooling, an API that supports that could be provided.
>
> In summary, I'm thinking along the lines of:
>
> * A reader that takes connection parameters & a query or table
> * The reader can handle partitioning if need be (similar to how Spark does
> it)
> * The reader returns a Schema, and can be iterated on to return data in
> batches
>
> * A writer that takes connection parameters and a table
> * The writer writes batches to a table, and is able to write batches in
> parallel
>
> In the case of a hypothetical interfacing with column databases like
> Clickhouse,
> we would be able to levarage materialising arrows from columns, instead of
> the
> potential column-wise conversions that can be performed from row-based
> APIs.
>
> Neville
>
>
> On Sun, 27 Sep 2020 at 22:08, Adam Lippai  wrote:
>
> > One more universal approach is to use ODBC, this is a recent Rust
> > conversation (with example) on the topic:
> > https://github.com/Koka/odbc-rs/issues/140
> >
> > Honestly I find the Python DB API too simple, all it provides is a
> > row-by-row API. I miss four things:
> >
> >- Batched or bulk processing both for data loading and dumping.
> >- Async support (python has asyncio and async web frameworks, but no
> >async DB spec). SQLAlchemy async support is coming soon and there is
> >https://github.com/encode/databases
> >- Connection pooling (it's common to use TLS, connection reuse would
> be
> >nice as TLS 1.3 is not here yet)
> >- Failover / load balancing support (this is connected to the
> previous)
> >
> > Best regards,
> > Adam Lippai
> >
> > On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > That would be awesome! I agree with this, and would be really useful,
> as
> > it
> > > would leverage all the goodies that RDMS have wrt to transitions, etc.
> > >
> > > I would probably go for having database-specifics outside of the arrow
> > > project, so that they can be used by other folks beyond arrow, and keep
> > the
> > > arrow-specifics (i.e. conversion from the format from the specific
> > > databases to arrow) as part of the arrow crate. Ideally as Wes wrote,
> > with
> > > some standard to be easier to handle different DBs.
> > >
> > > I think that there are two layers

Re: [Rust] Arrow SQL Adapters/Connectors

2020-09-27 Thread Adam Lippai
One more universal approach is to use ODBC, this is a recent Rust
conversation (with example) on the topic:
https://github.com/Koka/odbc-rs/issues/140

Honestly, I find the Python DB API too simple; all it provides is a
row-by-row API. I miss four things:

   - Batched or bulk processing both for data loading and dumping.
   - Async support (python has asyncio and async web frameworks, but no
   async DB spec). SQLAlchemy async support is coming soon and there is
   https://github.com/encode/databases
   - Connection pooling (it's common to use TLS, connection reuse would be
   nice as TLS 1.3 is not here yet)
   - Failover / load balancing support (this is connected to the previous)

Best regards,
Adam Lippai

On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> That would be awesome! I agree with this, and would be really useful, as it
> would leverage all the goodies that RDMS have wrt to transitions, etc.
>
> I would probably go for having database-specifics outside of the arrow
> project, so that they can be used by other folks beyond arrow, and keep the
> arrow-specifics (i.e. conversion from the format from the specific
> databases to arrow) as part of the arrow crate. Ideally as Wes wrote, with
> some standard to be easier to handle different DBs.
>
> I think that there are two layers: one is how to connect to a database, the
> other is how to serialize/deserialize. AFAIK PEP 249 covers both layers, as
> it standardizes things like `connect` and `tpc_begin`, as well as how
> things should be serialized to Python objects (e.g. dates should be
> datetime.date). This split is done by postgres for Rust
> <https://github.com/sfackler/rust-postgres>, as it offers 5 crates:
> * postges-async
> * postges-sync (a blocking wrapper of postgres-async)
> * postges-types (to convert to native rust  < IMO this one is what we
> want to offer in Arrow)
> * postges-TLS
> * postges-openssl
>
> `postges-sync` implements Iterator (`client.query`), and postges-async
> implements Stream.
>
> One idea is to have a generic iterator/stream adapter, that yields
> RecordBatches. The implementation of this trait by different providers
> would give support to be used in Arrow and DataFusion.
>
> Besides postgres, one idea is to pick the top from this list
> <https://db-engines.com/en/ranking>:
>
> * Oracle
> * MySQL
> * MsSQL
>
> Another idea is to start by by supporting SQLite, which is a good
> development env to work with relational databases.
>
> Best,
> Jorge
>
>
>
>
>
> On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale 
> wrote:
>
> > Hi Arrow developers
> >
> > I would like to gauge the appetite for an Arrow SQL connector that:
> >
> > * Reads and writes Arrow data to and from SQL databases
> > * Reads tables and queries into record batches, and writes batches to
> > tables (either append or overwrite)
> > * Leverages binary SQL formats where available (e.g. PostgreSQL format is
> > relatively easy and well-documented)
> > * Provides a batch interface that abstracts away the different database
> > semantics, and exposes a RecordBatchReader (
> >
> https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html
> > ),
> > and perhaps a RecordBatchWriter
> > * Resides in the Rust repo as either an arrow::sql module (like
> arrow::csv,
> > arrow::json, arrow::ipc) or alternatively is a separate crate in the
> > workspace  (*arrow-sql*?)
> >
> > I would be able to contribute a Postgres reader/writer as a start.
> > I could make this a separate crate, but to drive adoption I would prefer
> > this living in Arrow, also it can remain updated (sometimes we reorganise
> > modules and end up breaking dependencies).
> >
> > Also, being developed next to DataFusion could allow DF to support SQL
> > databases, as this would be yet another datasource.
> >
> > Some questions:
> > * Should such library support async, sync or both IO methods?
> > * Other than postgres, what other databases would be interesting? Here
> I'm
> > hoping that once we've established a suitable API, it could be easier to
> > natively support more database types.
> >
> > Potential concerns:
> >
> > * Sparse database support
> > It's a lot of effort to write database connectors, especially if starting
> > from scratch (unlike with say JDBC). What if we end up supporting 1 or 2
> > database servers?
> > Perhaps in that case we could keep the module without publishing it to
> > crates.io until we're happy with database support, or even its usage.
> >
> > * Dependency bl

Re: Arrow Flight + Go, Arrow for Realtime

2020-08-12 Thread Adam Lippai
Arrow is mainly about batching data and leveraging all the opportunities
this gives.
This means you either have to buffer the data yourself and flush it when a
reasonably sized batch is complete, or play with preallocating Arrow
structures.
This was discussed recently; you might be interested in the thread:
https://www.mail-archive.com/dev@arrow.apache.org/msg19862.html
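
To illustrate the buffer-and-flush pattern (shown with pyarrow just for
brevity; the schema, batch size and row source below are made up):

import pyarrow as pa

schema = pa.schema([("ts", pa.timestamp("ms")), ("value", pa.float64())])
BATCH_SIZE = 1024  # flush threshold, chosen arbitrarily

def stream(sink, rows):
    """Buffer incoming (ts, value) tuples and flush them as record batches."""
    writer = pa.ipc.new_stream(sink, schema)
    ts_buf, val_buf = [], []

    def flush():
        batch = pa.record_batch(
            [pa.array(ts_buf, type=schema[0].type),
             pa.array(val_buf, type=schema[1].type)],
            schema=schema,
        )
        writer.write_batch(batch)
        ts_buf.clear()
        val_buf.clear()

    for ts, value in rows:        # rows arrive one by one (e.g. from sensors)
        ts_buf.append(ts)
        val_buf.append(value)
        if len(ts_buf) >= BATCH_SIZE:
            flush()
    if ts_buf:                    # flush the partial tail batch
        flush()
    writer.close()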

Note: I'm not an Arrow developer; I'm just following the "streaming"
features of the Arrow lib, as I'm interested in having a "rolling window" API
(like a fixed-size FIFO queue).

Best regards,
Adam Lippai

On Wed, Aug 12, 2020 at 11:29 AM  wrote:

> I'm looking at using Arrow for a realtime IoT project which includes use
> cases both on server, and also for transferring /using in a Browser via
> WASM,  and have a few  questions.
>
>
>
> Language in use is Go.
>
>
>
> Is anyone working on implementing   Arrow-Flight in Go ?  (According to
> the feature matrix,  nothing ready yet, so wanted to check.
>
>
>
> Has anyone tried using Apache Arrow in  Go WASM  (Webassembly) ?   if so,
> any issues ?
>
>
>
> Any pointers/documentation  on using/extending Arrow for realtime streaming
> cases.   (Specifically where a DataFrame is requested, but then it needs to
> 'grow' as new data arrives, often at high speed).
>
> Not language specific, just trying to understand the right pattern for
> using
> Arrow for this,  and couldn't' find much in the docs.
>
>
>
> Regards
>
>
>
> Mark.
>
>


Re: [Discuss] [Rust] Looking to add Wasm32 compile target for rust library

2020-07-14 Thread Adam Lippai
"I don't know much about either, but I'm curious why you would expect this
to be the case?"
Looks like this is not true, it was just my perception reading the
different articles.
They are practically the same for a "hello world" if compiled carefully. So
this is really up to a real world comparison / benchmark.
I stated before that some features are easy to switch off in Rust, but I'm
not sure whether we depend on libc or not (target wasm32-unknown-unknown
doesn't support it, loading emscripten was heavy in the past)

Best regards,
Adam Lippai


On Tue, Jul 14, 2020 at 6:36 PM Micah Kornfield 
wrote:

> Hi Adam,
>
>> This sounds really interesting, how about adding the wasm build (C++) to
>> the releases?
>
> I think this just needs someone to volunteer to do it and maintain it (at
> a minimum if it doesn't already exist we need CI for it).  We would also
> need to figure out details of publishing and integrating it into the
> release process.
>
> I've done a lot of asm.js work (different from wasm) in the past, but my
>> assumption would be that using Rust instead of C++ as source for wasm
>> should result in smaller wasm binaries.
>
> I don't know much about either, but I'm curious why you would expect this
> to be the case?
>
> On Tue, Jul 14, 2020 at 8:07 AM Adam Lippai  wrote:
>
>> This sounds really interesting, how about adding the wasm build (C++) to
>> the releases?
>> I've done a lot of asm.js work (different from wasm) in the past, but my
>> assumption would be that using Rust instead of C++ as source for wasm
>> should result in smaller wasm binaries.
>> Rust Arrow doesn't really use exotic solutions, eg. simd or tokio
>> dependency can be turned off.
>>
>> Having DataFusion + some performant data access in browsers or even in
>> node.js would be useful.
>> Not needing to build fancy HTTP/GraphQL API over the Rust/C++ impl. but
>> moving the data processing code to the client is viable for "small"
>> workloads.
>> Ofc if JS Arrow lands Flight support this may become less of an issue,
>> but AFAIK it's gRPC based which would need setting up a gRPC reverse proxy
>> for C++/Rust Arrow.
>> Overall both the code-duplication and feature fragmentation would
>> decrease by using a single source (like you don't have a full Python impl.
>> for obvious reasons)
>>
>> Best regards,
>> Adam Lippai
>>
>> On Tue, Jul 14, 2020 at 4:27 PM Micah Kornfield 
>> wrote:
>>
>>> Fwiw, I believe at least the core c++ library already can be compiled to
>>> wasm. I  think perspective does this [1]
>>>
>>>
>>>  I'm curious What are you hoping to achieve with embedded wasm  in spark?
>>>
>>> Thanks,
>>> Micah
>>>
>>> [1] https://perspective.finos.org/
>>>
>>> On Tuesday, July 14, 2020, Brian Hulette  wrote:
>>>
>>> > That sounds great! I'd like to have some support for using the rust
>>> and/or
>>> > C++ libraries in the browser via wasm as well.
>>> > As long as the community is ok with your overall approach "to add
>>> compiler
>>> > conditionals around any I/O features and libc dependent features of
>>> these
>>> > two libraries," I think it may be best to start with a PR and discuss
>>> > specifics from there.
>>> >
>>> > Do any rust contributors have objections to this?
>>> >
>>> > Brian
>>> >
>>> > On Mon, Jul 13, 2020 at 9:42 PM RJ Atwal  wrote:
>>> >
>>> > >  Hi all,
>>> > >
>>> > > Looking for guidance on how to submit a design and PR to add WASM32
>>> > support
>>> > > to apache arrow's rust libraries.
>>> > >
>>> > > I am looking to use the arrow library to pass data in arrow format
>>> > between
>>> > > the host spark environment and UDFs defined in WASM .
>>> > >
>>> > > I created the following JIRA ticket to capture the work
>>> > > https://issues.apache.org/jira/browse/ARROW-9453
>>> > >
>>> > > Thanks,
>>> > > RJ
>>> > >
>>> >
>>>
>>


Re: [Discuss] [Rust] Looking to add Wasm32 compile target for rust library

2020-07-14 Thread Adam Lippai
This sounds really interesting; how about adding the wasm build (C++) to
the releases?
I've done a lot of asm.js work (different from wasm) in the past, but my
assumption would be that using Rust instead of C++ as the source for wasm
should result in smaller wasm binaries.
Rust Arrow doesn't really use exotic solutions, e.g. the simd or tokio
dependencies can be turned off.

Having DataFusion + some performant data access in browsers or even in
Node.js would be useful.
Not needing to build a fancy HTTP/GraphQL API over the Rust/C++
implementation, but moving the data processing code to the client, is viable
for "small" workloads.
Of course if JS Arrow lands Flight support this may become less of an issue,
but AFAIK it's gRPC-based, which would require setting up a gRPC reverse
proxy for C++/Rust Arrow.
Overall, both the code duplication and the feature fragmentation would
decrease by using a single source (just as you don't have a full Python
implementation, for obvious reasons).

Best regards,
Adam Lippai

On Tue, Jul 14, 2020 at 4:27 PM Micah Kornfield 
wrote:

> Fwiw, I believe at least the core c++ library already can be compiled to
> wasm. I  think perspective does this [1]
>
>
>  I'm curious What are you hoping to achieve with embedded wasm  in spark?
>
> Thanks,
> Micah
>
> [1] https://perspective.finos.org/
>
> On Tuesday, July 14, 2020, Brian Hulette  wrote:
>
> > That sounds great! I'd like to have some support for using the rust
> and/or
> > C++ libraries in the browser via wasm as well.
> > As long as the community is ok with your overall approach "to add
> compiler
> > conditionals around any I/O features and libc dependent features of these
> > two libraries," I think it may be best to start with a PR and discuss
> > specifics from there.
> >
> > Do any rust contributors have objections to this?
> >
> > Brian
> >
> > On Mon, Jul 13, 2020 at 9:42 PM RJ Atwal  wrote:
> >
> > >  Hi all,
> > >
> > > Looking for guidance on how to submit a design and PR to add WASM32
> > support
> > > to apache arrow's rust libraries.
> > >
> > > I am looking to use the arrow library to pass data in arrow format
> > between
> > > the host spark environment and UDFs defined in WASM .
> > >
> > > I created the following JIRA ticket to capture the work
> > > https://issues.apache.org/jira/browse/ARROW-9453
> > >
> > > Thanks,
> > > RJ
> > >
> >
>


Re: Helping new contributors get started [was Re: Renaming master branch, removing blacklist/whitelist]

2020-06-20 Thread Adam Lippai
Undoubtedly, you always answer, and that is amazing. Right now all the help
flows core/pro -> beginner, but average <-> average or average -> beginner
cooperation would be nice too. I understand it's not the time to introduce it
yet; we don't have the critical mass. I didn't think of SO before, but
indeed, it serves this purpose and is a good forum for this.

Thanks for the detailed answer.

Best regards,
Adam Lippai


On Sat, Jun 20, 2020, 22:38 Wes McKinney  wrote:

> On Sat, Jun 20, 2020 at 3:19 PM Adam Lippai  wrote:
> >
> > I've seen better and worse examples before.
> > I was an active, beginner Drupal developer ~12 years ago. The Drupal
> > project community was very strong, particularly in Hungary where I live.
> > International and local IRC channels, international and local
> > forums+events, highly customized issue tracker and superb documentation.
> It
> > was more mature and bigger that time. On the other hand when I tried to
> > give back to Angular or React... Well... You are already ahead of them.
> > React eventually recognized the problem and they try to solve it, but a
> > large company's bureaucracy doesn't help that.
> >
> > My experience with Arrow is aligned with my expectations of a project of
> > this age or size (and in a few fields you are awesome!). Andy Grove,
> > xhochy, wesm, Joris were welcoming and responsive on Jira, Twitter and
> this
> > mailing list too. Ofc nobody worked for free on my ideas and I can't
> > develop C++ or Rust alone (yet). What I can do now is tracking the
> > development, the PRs (I've added a few more or less valuable, but not so
> > unique comments) and I'm subscribed to a few Jira issues.
> >
> > At this point I could use a gitter/IRC/slack channel for discussions -
> with
> > peers instead of core devs - and using mailing list + JIRA doesn't help
> > either. They are simply cumbersome, hard to navigate/search, focus is
> lost
> > when somebody is not sure what's interesting. A simpler issue tracker (eg
> > GitHub issues) and a super simple forum instead of mailing list would
> lower
> > the barriers. I don't think this is a priority as this setup certainly
> > serves your current workflows.
>
> On this I will say: we used to have a Slack channel but it didn't work
> well. Only a few core developers ever looked at it and because of the
> general "Slackification" of open source a lot of people would join the
> Slack channel looking for help and be unable to get it. People also
> reported bugs in Slack and we would learn about them weeks after the
> fact, or never. I think if we added a new official communications
> channel for the project right now it would likely suffer the same
> fate. If we had 10x as many core developers then there might be enough
> core devs who are comfortable with the additional modality that it
> might make sense. We still have lots of people reporting bugs on Stack
> Overflow and very few core developers regularly look at the SO
> questions.
>
> By contrast, we nearly unfailingly respond to people on the mailing
> list and JIRA. So if people are looking for help they can certainly
> get it there.
>
> > Keep up the good work, you are amazing! I can't wait a more complete
> > DataFusion, group by and join for pyarrow and other dozen exciting
> > opportunities and features.
> >
> > tl;dr you are great, not behind, local communities/meetups are a good
> > opportunity (but covid...), I find Jira + mailing list hard to use
> > (mentally, as not core dev)
> >
> > Best regards,
> > Adam Lippai
> >
> >
> >
> > On Sat, Jun 20, 2020, 21:23 Wes McKinney  wrote:
> >
> > > On Sat, Jun 20, 2020 at 1:52 PM Neal Richardson
> > >  wrote:
> > > >
> > > > Hi Suvayu,
> > > > Thanks for your feedback. I'm sorry to hear that you feel that you
> > > haven't
> > > > had the best experiences trying to contribute to the project. For
> what
> > > it's
> > > > worth, I believe that raising concerns like this _is_ itself a
> valuable
> > > > contribution. So even if you haven't gotten to the point of having a
> pull
> > > > request merged, I don't think it's accurate to say that you've been
> > > trying
> > > > unsuccessfully to contribute--you're contributing right now.
> > > >
> > > > As it turns out, just the other day I opened a JIRA issue about
> improving
> > > > the contributor guide (
> https://issues.apache.

Re: Helping new contributors get started [was Re: Renaming master branch, removing blacklist/whitelist]

2020-06-20 Thread Adam Lippai
I've seen better and worse examples before.
I was an active, beginner Drupal developer ~12 years ago. The Drupal
project community was very strong, particularly in Hungary where I live:
international and local IRC channels, international and local
forums and events, a highly customized issue tracker and superb
documentation. It was more mature and bigger at the time. On the other hand,
when I tried to give back to Angular or React... well... you are already
ahead of them. React eventually recognized the problem and they are trying
to solve it, but a large company's bureaucracy doesn't help.

My experience with Arrow is aligned with my expectations of a project of
this age and size (and in a few fields you are awesome!). Andy Grove,
xhochy, wesm and Joris were welcoming and responsive on Jira, Twitter and
this mailing list too. Of course nobody worked for free on my ideas, and I
can't develop C++ or Rust alone (yet). What I can do now is track the
development and the PRs (I've added a few more or less valuable, but not
particularly unique comments), and I'm subscribed to a few Jira issues.

At this point I could use a Gitter/IRC/Slack channel for discussions - with
peers instead of core devs - and using the mailing list + JIRA doesn't help
either. They are simply cumbersome and hard to navigate/search, and focus is
lost when somebody is not sure what's interesting. A simpler issue tracker
(e.g. GitHub issues) and a super simple forum instead of a mailing list would
lower the barriers. I don't think this is a priority, as this setup certainly
serves your current workflows.

Keep up the good work, you are amazing! I can't wait a more complete
DataFusion, group by and join for pyarrow and other dozen exciting
opportunities and features.

tl;dr you are great, not behind, local communities/meetups are a good
opportunity (but covid...), I find Jira + mailing list hard to use
(mentally, as not core dev)

Best regards,
Adam Lippai



On Sat, Jun 20, 2020, 21:23 Wes McKinney  wrote:

> On Sat, Jun 20, 2020 at 1:52 PM Neal Richardson
>  wrote:
> >
> > Hi Suvayu,
> > Thanks for your feedback. I'm sorry to hear that you feel that you
> haven't
> > had the best experiences trying to contribute to the project. For what
> it's
> > worth, I believe that raising concerns like this _is_ itself a valuable
> > contribution. So even if you haven't gotten to the point of having a pull
> > request merged, I don't think it's accurate to say that you've been
> trying
> > unsuccessfully to contribute--you're contributing right now.
> >
> > As it turns out, just the other day I opened a JIRA issue about improving
> > the contributor guide (https://issues.apache.org/jira/browse/ARROW-9189
> ),
> > and I'll be taking that up next week as part of our 1.0 website
> overhaul. I
> > agree that we can do a better job in helping new contributors
> participate,
> > and that many of those forms of contribution need not require lots of
> time
> > from Arrow core developers. Wes's point about the limited bandwidth to
> > provide mentorship is valid; that said, I've seen many successful cases
> of
> > first-time contributors getting the support they need. While there's
> > certainly room for improvement, I'm optimistic that we're on the right
> > track.
>
> Yes — to be clear, the core developers in my experience (myself
> included) are spending a lot of time responding to questions on JIRA,
> clarifying issues with issue reporters, and offering advice about how
> to proceed. Additionally, we spend a lot of time reviewing code and
> helping people get their patches ready to be merged. There's no way we
> would have 500+ contributors if we were not doing these things.
>
> As far as getting the help that's needed from core developers, the
> thing that helps someone like me the most is to have the "request" be
> as specific and direct as possible. In any given day I might look at
> 50-100 different issues and so if it's not clear what I need to do I
> will often move on to the next thing. Example direct requests:
>
> * Do you think $PROPOSED_APPROACH is the right one?
> * In which file(s) should I be looking to make changes?
> * Is there anything related in the codebase I can look at to learn?
>
> I'm sure we can put this advice in our contributor guide.
>
> If you ask these questions and do not get an answer, it is OK to ask again.
>
> I see six JIRA issues from Suvayu in the project
>
> * https://issues.apache.org/jira/browse/ARROW-1956
> * https://issues.apache.org/jira/browse/ARROW-3806
> * https://issues.apache.org/jira/browse/ARROW-4930
> * https://issues.apache.org/jira/browse/ARROW-3792
> * https://issues.apache.org

Re: Pandas string type

2020-06-18 Thread Adam Lippai
Thanks for the detailed answer.
It's indeed 5-10% faster with the correct arguments you provided, but the
performance is still far from that of the categorical-type-based solution.
I'll track the linked pandas issue. I'm not a C++ dev, but I'll be happy to
test, benchmark or add docs.
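
For future readers, a minimal sketch of the two options as I understand them
(types_mapper vs. dictionary encoding); the column name and sizes below are
made up:

import pandas as pd
import pyarrow as pa

table = pa.table({"a": pa.array(["a", "b"] * 10_000)})

# Option 1: map Arrow string columns to pandas' StringDtype.
# Note that types_mapper must return a dtype *instance*, not the class,
# and the keyword is currently only honoured by Table.to_pandas.
df_string = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)

# Option 2: dictionary-encode first, so to_pandas produces a pandas Categorical.
df_cat = pa.table({"a": table["a"].dictionary_encode()}).to_pandas()

print(df_string.dtypes)  # a    string
print(df_cat.dtypes)     # a    category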

Best regards,
Adam Lippai

On Thu, Jun 18, 2020 at 10:08 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Adam,
>
> On Wed, 17 Jun 2020 at 13:07, Adam Lippai  wrote:
>
> > Hi,
> >
> > I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/
> > where
> > Wes writes
> >
> > > "string or binary data would come with additional overhead while pandas
> > > continues to use Python objects in its memory representation"
> >
> >
> > Pandas 1.0 introduced StringDType which I thought could help with the
> issue
> > (I didn't check the internals, I assume they still use Python objects,
> just
> > not Numpy, but I had nothing to lose).
> >
> > My issue is that if I create an PyArrow array with a = pa.array(["a",
> > "b"]*1) and call .to_pandas() the dtype of the dataframe is
> > still "object". I tried to add a types_mapper function (docs is not
> really
> > helpful so I've simply created def mapper(t): return pd.StringDtype) but
> it
> > didn't work.
> >
>
> Two caveats here: 1) the function needs to return an *instance* and not a
> class (so `return pd.StringDtype()`), and 2) this keyword only works for
> Table.to_pandas right now (this is certainly something that should either
> be fixed or either be clarified in the docs).
>
> So taking your example array, and putting it in a Table, and then
> converting to pandas, the types_mapper keyword works:
>
> >>> table = pa.table({'a': a})
> >>> df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)
> >>> df.dtypes
> astring
> dtype: object
>
> Now, the pandas string dtype is currently still using Python objects to
> store the strings (so similarly as using an object dtype). There are plans
> to store the strings more efficiently (eg using arrow's string array memory
> layout), see https://github.com/pandas-dev/pandas/issues/8640/.
>
> But so right now, if you have many repeated strings, I would still go for
> the category/dictionary type, as that will be a lot more efficient for
> further processing in pandas.
>
>
>
> >
> > Is this a future feature? Would it help anything? For now I'm happy to
> use
> > category/dictionary data, as the column is low cardinality and it makes
> it
> > 5x faster, but I was hoping for a simpler solution. I don't know the
> > internals but if "a" and "b" are immutable strings it shouldn't
> > really differ from using Category type (even if it's creating python
> > objects for them, as it can be done with 2 immutable objects). Converting
> > compressed parquet -> pyarrow is fast (less than 10 seconds), it's
> pyarrow
> > -> pandas which is slow, running for 7 minutes (so I think pyarrow
> already
> > has a nice implementation)
> >
>
> There is a `deduplicate_objects` keyword in to_pandas exactly for this (to
> avoid creating multiple Python objects for identical strings).
> However, as indicated above, and depending on what your further processing
> steps are in pandas, using a categorical/dictionary type might still be the
> better option.
>
> Joris
>
>
> >
> > Best regards,
> > Adam Lippai
> >
>


Pandas string type

2020-06-17 Thread Adam Lippai
Hi,

I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/ where
Wes writes

> "string or binary data would come with additional overhead while pandas
> continues to use Python objects in its memory representation"


Pandas 1.0 introduced StringDType which I thought could help with the issue
(I didn't check the internals, I assume they still use Python objects, just
not Numpy, but I had nothing to lose).

My issue is that if I create a PyArrow array with a = pa.array(["a",
"b"]*1) and call .to_pandas(), the dtype of the dataframe is
still "object". I tried to add a types_mapper function (the docs are not
really helpful, so I simply created def mapper(t): return pd.StringDtype),
but it didn't work.

Is this a future feature? Would it help anything? For now I'm happy to use
category/dictionary data, as the column is low cardinality and it makes it
5x faster, but I was hoping for a simpler solution. I don't know the
internals, but if "a" and "b" are immutable strings it shouldn't
really differ from using the Category type (even if it's creating Python
objects for them, as it can be done with 2 immutable objects). Converting
compressed Parquet -> pyarrow is fast (less than 10 seconds); it's pyarrow
-> pandas which is slow, running for 7 minutes (so I think pyarrow already
has a nice implementation).

Best regards,
Adam Lippai


[jira] [Created] (ARROW-6774) Reading parquet file is slow

2019-10-02 Thread Adam Lippai (Jira)
Adam Lippai created ARROW-6774:
--

 Summary: Reading parquet file is slow
 Key: ARROW-6774
 URL: https://issues.apache.org/jira/browse/ARROW-6774
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.15.0
Reporter: Adam Lippai


Using the example at [https://github.com/apache/arrow/tree/master/rust/parquet] 
is slow.

The snippet 
{code:rust}
use std::fs::File;
use std::time::Instant;
use parquet::file::reader::{FileReader, SerializedFileReader};

// The path is a placeholder for the ~160MB file mentioned below.
let file = File::open("data.parquet").unwrap();
let reader = SerializedFileReader::new(file).unwrap();
let mut iter = reader.get_row_iter(None).unwrap();
let start = Instant::now();
while let Some(_record) = iter.next() {}
let duration = start.elapsed();
println!("{:?}", duration);
{code}
This runs for 17 seconds on a ~160MB Parquet file.

If there is a more efficient way to load a Parquet file, it would be nice to
add it to the README.

P.S.: My goal is to construct an ndarray from it; I'd be happy to get any tips.





[jira] [Created] (ARROW-6712) [Rust] [Parquet] Reading parquet file into an ndarray

2019-09-26 Thread Adam Lippai (Jira)
Adam Lippai created ARROW-6712:
--

 Summary: [Rust] [Parquet] Reading parquet file into an ndarray 
 Key: ARROW-6712
 URL: https://issues.apache.org/jira/browse/ARROW-6712
 Project: Apache Arrow
  Issue Type: Wish
  Components: Rust
Reporter: Adam Lippai


What's the best way to read a .parquet file into a Rust ndarray structure?

Can it be efficient with the current API? I assume row iteration is not the
best idea :)

I can imagine that even parallel column loading would be possible.





[jira] [Created] (ARROW-6702) [Rust] [DataFusion] Incorrect partition read

2019-09-26 Thread Adam Lippai (Jira)
Adam Lippai created ARROW-6702:
--

 Summary: [Rust] [DataFusion] Incorrect partition read
 Key: ARROW-6702
 URL: https://issues.apache.org/jira/browse/ARROW-6702
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Affects Versions: 0.15.0
Reporter: Adam Lippai


Reading a directory structure of duplicated alltypes_plain.parquet files
returns 8 rows instead of 16 (e.g. as read by the pandas Parquet reader).





[jira] [Created] (ARROW-6579) [Python] Parallel pyarrow.parquet.write_to_dataset

2019-09-17 Thread Adam Lippai (Jira)
Adam Lippai created ARROW-6579:
--

 Summary: [Python] Parallel pyarrow.parquet.write_to_dataset
 Key: ARROW-6579
 URL: https://issues.apache.org/jira/browse/ARROW-6579
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.14.1
Reporter: Adam Lippai


pyarrow.parquet.write_to_dataset() is currently single-threaded and converts
the table from/to pandas. We should push the dataset writing down to C++
(dropping the pandas usage) so it is easier to write the partitioned dataset
using multiple threads.
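
For reference, a minimal sketch of the call being discussed (the path and
column names are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"year": [2019, 2019, 2020], "value": [1.0, 2.0, 3.0]})

# Each distinct `year` value becomes a subdirectory; today the split and the
# per-partition writes happen sequentially and go through pandas internally.
pq.write_to_dataset(table, root_path="dataset_root", partition_cols=["year"])
{code}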


