Poll question: why did you choose Arrow?
Personally: I researched Arrow because it's a spec for IPC. (My requirement
was: "wrap computations in a separate process.") I chose Arrow for its
community and ecosystem -- in other words, because my peers chose it.
I happen to use the compute kernel and Parquet capabilities every day; but
they did not sway me at all. I would choose Arrow if it were nothing but
this spec and this community. (I chose HTML, after all.)
I see the *code* as one enormous proof that the *spec* is good, and as a
collection of examples and best practices.
... so a great pitch to me would be: "Apache Arrow is a data format and
toolbox for efficient in-memory processing."
Enjoy life,
Adam
On Tue, May 18, 2021 at 2:38 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:
"Apache Arrow is a data processing library that also provides a uniform,
efficient interface for data systems."
This probably still isn't quite right; I imagine the bit about "for data
systems" needs some addition (maybe "for transport between data systems")?
My primary motivators:
- "A data processing library":
  - Arrow provides many language bindings, but ultimately they're all
    part of the same "library ecosystem", which I think is fine to
    capture in "library"
  - A main goal of Arrow is for processing to be fast, whatever that
    processing may be
- "uniform, efficient interface for data systems":
  - Arrow provides (or tries to) a cohesive ("uniform") interface for
    data processing (although it has several APIs to do this)
  - Also, IMO, a motivation for Arrow was a format and library that
    facilitate processing while also providing functions and interfaces
    to easily translate into the optimized data formats used by
    disparate data systems (Cassandra, Hadoop, etc.)
  - Arrow tries to be transparently zero-copy, which is part of the
    interface for efficiency
- Arrow certainly has a data format, but that format is the crux of the
  interface (IMO). However, it also makes using other formats easy (via
  the filesystem API and Parquet readers/writers, etc.). So, focusing on
  the data format seems unnecessary in such a terse description.
Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
On Mon, May 17, 2021 at 5:07 PM Weston Pace <weston.p...@gmail.com> wrote:
I'd avoid the word "structured" as it is somewhat ill-defined.
On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
<mauri...@ursacomputing.com> wrote:
more marketed:
How about: "Apache Arrow is a format and language-agnostic library
focused on efficient sharing and processing of structured data."
On Mon, May 17, 2021 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
How about: "Apache Arrow is a collection of specifications, cross
language libraries and applications focused on efficient sharing and
processing of structured data."
On Mon, May 17, 2021 at 3:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
On Mon, May 17, 2021 at 4:58 PM Weston Pace <weston.p...@gmail.com> wrote:
“Apache Arrow is a format and compute kernel for in-memory data”
I like this but no one ever knows what "in-memory" means (or they just
think 'data is always in memory'). How about...
"Apache Arrow is a format and compute kernel for zero-copy processing
and sharing of data."
or...
"Apache Arrow is a format and compute kernel for processing and
sharing data without serialization overhead."
A few issues with this:
* Multiple PL aspect unclear (is it a single piece of software, or
multiple pieces of software?)
* Development platform aspect unclear
I see that some people don't like the word "platform". Some people
come to this project and want to find an end-to-end application,
rather than a developer toolkit that they can use to build
applications. Perhaps we should be more explicit and use
"computational development toolkit" instead of "platform".
Although marshalling[1] would probably be a more precise word it is
not as well known.
[1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)
On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
<mauri...@ursacomputing.com> wrote:
a few ideas
github.com/apache/arrow - Apache Arrow is an efficient library for
big data processing and sharing
github.com/apache/arrow - Apache Arrow is a computational tool for
processing, storing and sharing large datasets
github.com/apache/arrow - Apache Arrow is a fast and simple library
for big data analytics
*github.com/apache/arrow - Apache Arrow is a powerful workhorse for
analytic operations on modern hardware*
On Mon, May 17, 2021 at 3:13 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
Alright, well, whatever it is, it must fit into one breath. If the
high-concept pitch is successful, people will stick around for the
full pitch.
Words such as “platform” and “enable” are noise. You say “platform”,
they start to say “what exactly do you mean by platform”, the elevator
doors open, and they’re gone.
“Apache Arrow is a format and compute kernel for in-memory data”
On May 17, 2021, at 12:03 PM, Eduardo Ponce <edponc...@gmail.com> wrote:
One more suggestion for the bucket:
"Apache Arrow is a computational platform for efficient in-memory
data representation and processing."
On Mon, May 17, 2021 at 2:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
I think less is better in the description, but unfortunately the
association of Arrow as being "just a data format" has been actively
harmful in some ways to community growth. We have a data format, yes,
but we are also creating a computational platform to go hand-in-hand
with the data format to make it easier to build fast applications
that use the data format. So the description needs to capture both of
these ideas.
On Mon, May 17, 2021 at 12:15 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
I think that the “cross-language development platform for” is noise.
(I’m sure that JPEG developers think that JPEG is a “cross-language
development platform” too. But it isn’t. It is an image format.)
“Apache Arrow is a data format for efficient in-memory processing.”
I’ll note that in marketing speak, we are developing a high-concept
pitch [1] here. Every company needs a name, a brand, a high-concept
pitch, and a 3- or 4-sentence description. But every Apache project
needs these too.
It’s also worth spending the time on the description, and then using
them in all the places that we describe Arrow.
Julian
[1] https://www.growthink.com/content/whats-your-high-concept-pitch
On May 17, 2021, at 7:38 AM, Eduardo Ponce <edponc...@gmail.com> wrote:
I agree with Nate's and Brian's suggestions, but would like to add
that we can make it a one-liner for more conciseness and consistency
with other Apache projects.
Apologies if it seems I am going around the suggestions loop again.
"Apache Arrow is a cross-language development platform enabling
efficient in-memory data processing and transport."
On Mon, May 17, 2021 at 10:11 AM Brian Hulette <bhule...@apache.org> wrote:
Thank you for bringing this up, Dominik. I sampled some of the
descriptions for other Apache projects I frequent; the ones with a
meaningful description have a single sentence:
github.com/apache/spark - Apache Spark - A unified analytics engine
for large-scale data processing
github.com/apache/beam - Apache Beam is a unified programming model
for Batch and Streaming
github.com/apache/avro - Apache Avro is a data serialization system
Several others (Flink, Hadoop, ...) just have "[Mirror of] Apache
<name>" as the description.
+1 for Nate's suggestion "Apache Arrow is a cross-language development
platform for in-memory data. It enables systems to process and
transport data more efficiently."
On Mon, May 17, 2021 at 5:23 AM Wes McKinney <wesmck...@gmail.com> wrote:
It's probably best for the description to limit mentions of specific
features. There are some high-level features mentioned in the
description now ("computational libraries and zero-copy streaming
messaging and interprocess communication"), but in 2021, since the
project has grown so much, it could leave people with a limited view
of what they might find here.
On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
<mauri...@ursacomputing.com> wrote:
How about
'Apache Arrow is a cross-language development platform for in-memory
data. It enables systems to process and transport data efficiently,
providing a simple and fast library for partitioning of large tables'?
Sorry for the delay, long election day
On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <natebauernfe...@deephaven.io> wrote:
Suggestion: faster -> more efficiently
"Apache Arrow is a cross-language development platform for in-memory
data. It enables systems to process and transport data more
efficiently."
On Sun, May 16, 2021 at 11:35 AM Wes McKinney <wesmck...@gmail.com> wrote:
Here's what's there now:
"Apache Arrow is a cross-language development platform for in-memory
data. It specifies a standardized language-independent columnar memory
format for flat and hierarchical data, organized for efficient
analytic operations on modern hardware. It also provides computational
libraries and zero-copy streaming messaging and interprocess
communication…"
How about something shorter like
"Apache Arrow is a cross-language development platform for in-memory
data. It enables systems to process and transport data faster."
Suggestions / refinements from others welcome
On Sat, May 15, 2021 at 9:12 PM Dominik Moritz <domor...@cmu.edu> wrote:
Super minor issue but could someone make the description on GitHub
shorter?
GitHub puts the description into the title of the page and makes it
hard to find it in URL autocomplete.
--
Adam Hooper
+1-514-882-9694
http://adamhooper.com