Sounds good enough to me.


On 10 June 2021 at 23:35, Wes McKinney wrote:
I hate to reopen this can of worms again, but here is my effort to
synthesize feedback:

"Apache Arrow is a multi-language toolbox for accelerated data
interchange and in-memory processing."

On Thu, Jun 10, 2021 at 12:37 PM Dominik Moritz <domor...@apache.org> wrote:

I thought there were some good suggestions in this thread. @Wes, did you
find a description you liked?

On May 18, 2021 at 06:24:47, Adam Hooper <a...@adamhooper.com> wrote:

Poll question: why did you choose Arrow?

Personally: I researched Arrow because it's a spec for IPC. (My requirement
was: "wrap computations in a separate process.") I chose Arrow for its
community and ecosystem -- in other words, because my peers chose it.

I happen to use the compute kernel and Parquet capabilities every day; but
they did not sway me at all. I would choose Arrow if it were nothing but
this spec and this community. (I chose HTML, after all.)

I see the *code* as one enormous proof that the *spec* is good, and as a
collection of examples and best practices.

... so a great pitch to me would be: "Apache Arrow is a data format and
toolbox for efficient in-memory processing."

Enjoy life,
Adam

On Tue, May 18, 2021 at 2:38 AM Aldrin <akmon...@ucsc.edu.invalid> wrote:

"Apache Arrow is a data processing library that also provides a uniform,
efficient interface for data systems."

This probably still isn't quite right, I imagine the bit about "for data
systems" needs some addition (maybe "for transport between data systems")?

My primary motivators:

    - "A data processing library":
       - Arrow provides many language bindings, but ultimately they're all
         part of the same "library ecosystem", which I think is fine to
         capture in "library"
       - A main goal of Arrow is for processing to be fast, whatever that
         processing may be
    - "uniform, efficient interface for data systems":
       - Arrow provides (or tries to) a cohesive ("uniform") interface for
         data processing (although it has several APIs to do this)
       - Also, IMO, a motivation for Arrow was a format and library to
         facilitate processing, but that provided functions and interfaces
         to easily translate into optimized data formats used by disparate
         data systems (Cassandra, Hadoop, etc.).
       - Arrow tries to be transparently zero-copy, which is part of the
         interface for efficiency
    - Arrow certainly has a data format, but that format is the crux of the
      interface (IMO). However, it also makes using other formats easy (via
      filesystem API and Parquet reader/writers, etc.). So, focusing on the
      data format seems unnecessary in such a terse description.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Mon, May 17, 2021 at 5:07 PM Weston Pace <weston.p...@gmail.com> wrote:

I'd avoid the word "structured" as it is somewhat ill-defined.

On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
<mauri...@ursacomputing.com> wrote:

more marketed:
How about: "Apache Arrow is a format and language-agnostic library
focused on efficient sharing and processing of structured data."

On Mon, May 17, 2021 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

How about: "Apache Arrow is a collection of specifications, cross
language libraries and applications focused on efficient sharing and
processing of structured data."

On Mon, May 17, 2021 at 3:06 PM Wes McKinney <wesmck...@gmail.com> wrote:

On Mon, May 17, 2021 at 4:58 PM Weston Pace <weston.p...@gmail.com> wrote:

“Apache Arrow is a format and compute kernel for in-memory data”

I like this but no one ever knows what "in-memory" means (or they just
think 'data is always in memory'). How about...

"Apache Arrow is a format and compute kernel for zero-copy processing
and sharing of data."

or...

"Apache Arrow is a format and compute kernel for processing and
sharing data without serialization overhead."

A few issues with this:

* Multiple PL aspect unclear (is it a single piece of software, or
  multiple pieces of software?)
* Development platform aspect unclear

I see that some people don't like the word "platform". Some people
come to this project and want to find an end-to-end application,
rather than a developer toolkit that they can use to build
applications. Perhaps we should be more explicit and use
"computational development toolkit" instead of "platform".

Although marshalling[1] would probably be a more precise word it is
not as well known.

[1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)

On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
<mauri...@ursacomputing.com> wrote:

a few ideas

github.com/apache/arrow - Apache Arrow is an efficient library for
big data processing and sharing

github.com/apache/arrow - Apache Arrow is a computational tool for
processing, storing and sharing large datasets

github.com/apache/arrow - Apache Arrow is a fast and simple library
for big data analytics

*github.com/apache/arrow <http://github.com/apache/arrow> - Apache
Arrow is a powerful workhorse for analytic operations on modern
hardware*

On Mon, May 17, 2021 at 3:13 PM Julian Hyde <jhyde.apa...@gmail.com>
wrote:

Alright, well, whatever it is, it must fit into one breath. If the
high-concept pitch is successful, people will stick around for the
full pitch.

Words such as “platform” and “enable” are noise. You say “platform”,
they start to say “what exactly do you mean by platform”, the elevator
doors open, and they’re gone.

“Apache Arrow is a format and compute kernel for in-memory data”

On May 17, 2021, at 12:03 PM, Eduardo Ponce <edponc...@gmail.com> wrote:

One more suggestion for the bucket:
"Apache Arrow is a computational platform for efficient in-memory data
representation and processing."

On Mon, May 17, 2021 at 2:49 PM Wes McKinney <wesmck...@gmail.com> wrote:

I think less is better in the description, but unfortunately the
association of Arrow as being "just a data format" has been actively
harmful in some ways to community growth. We have a data format, yes,
but we are also creating a computational platform to go hand-in-hand
with the data format to make it easier to build fast applications that
use the data format. So the description needs to capture both of these
ideas.

On Mon, May 17, 2021 at 12:15 PM Julian Hyde <jhyde.apa...@gmail.com>
wrote:

I think that the “cross-language development platform for” is noise.
(I’m sure that JPEG developers think that JPEG is a “cross-language
development platform” too. But it isn’t. It is an image format.)

"Apache Arrow is a data format for efficient in-memory processing.”

I’ll note that in marketing speak, we are developing a high-concept
pitch [1] here. Every company needs a name, a brand, a high-concept
pitch, and a 3- or 4-sentence description. But every Apache project
needs these too.
It’s worth spending the time on the description, also, and then using
them in all the places that we describe Arrow.

Julian

[1] https://www.growthink.com/content/whats-your-high-concept-pitch

On May 17, 2021, at 7:38 AM, Eduardo Ponce <edponc...@gmail.com> wrote:

I agree with Nate's and Brian's suggestions, but would like to add
that we can make it a one-liner for more conciseness and consistency
with other Apache projects.
Apologies if it seems I am going around the suggestions loop again.

"Apache Arrow is a cross-language development platform enabling
efficient in-memory data processing and transport."

On Mon, May 17, 2021 at 10:11 AM Brian Hulette <bhule...@apache.org>
wrote:

Thank you for bringing this up Dominik. I sampled some of the
descriptions for other Apache projects I frequent; the ones with a
meaningful description have a single sentence:

github.com/apache/spark - Apache Spark - A unified analytics engine
for large-scale data processing
github.com/apache/beam - Apache Beam is a unified programming model
for Batch and Streaming
github.com/apache/avro - Apache Avro is a data serialization system

Several others (Flink, Hadoop, ...) just have "[Mirror of] Apache
<name>" as the description.

+1 for Nate's suggestion "Apache Arrow is a cross-language development
platform for in-memory data. It enables systems to process and
transport data more efficiently."

On Mon, May 17, 2021 at 5:23 AM Wes McKinney <wesmck...@gmail.com> wrote:

It's probably best for the description to limit mentions of specific
features. There are some high level features mentioned in the
description now ("computational libraries and zero-copy streaming
messaging and interprocess communication"), but now in 2021 since the
project has grown so much, it could leave people with a limited view
of what they might find here.

On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
<mauri...@ursacomputing.com> wrote:

How about
'Apache Arrow is a cross-language development platform for in-memory
data. It enables systems to process and transport data efficiently,
providing a simple and fast library for partitioning of large tables'?

Sorry for the delay, long election day

On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind
<natebauernfe...@deephaven.io> wrote:

Suggestion: faster -> more efficiently

"Apache Arrow is a cross-language development platform for in-memory
data. It enables systems to process and transport data more
efficiently."

On Sun, May 16, 2021 at 11:35 AM Wes McKinney <wesmck...@gmail.com>
wrote:

Here's what's there now:

"Apache Arrow is a cross-language development platform for in-memory
data. It specifies a standardized language-independent columnar memory
format for flat and hierarchical data, organized for efficient
analytic operations on modern hardware. It also provides computational
libraries and zero-copy streaming messaging and interprocess
communication…"

How about something shorter like

"Apache Arrow is a cross-language development platform for in-memory
data. It enables systems to process and transport data faster."

Suggestions / refinements from others welcome

On Sat, May 15, 2021 at 9:12 PM Dominik Moritz <domor...@cmu.edu> wrote:

Super minor issue but could someone make the description on GitHub
shorter?

GitHub puts the description into the title of the page and makes it
hard to find it in URL autocomplete.


--
Adam Hooper
+1-514-882-9694
http://adamhooper.com
