Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Julian Hyde Sun, 22 Oct 2017 20:05:20 -0700

It's best if a project's (or company's) marketing has several tiers:
an "elevator pitch" of 2-3 sentences, a "high concept pitch" which is
a phrase, e.g. "book rooms with locals, rather than hotels", and
an expanded description.


I think the question of whether this replaces Avro is best handled in an FAQ.

On Sun, Oct 22, 2017 at 5:35 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>> But my concern is that I saw some time ago some people questioning "Is Arrow
>> a replacement for Avro?" (also Flatbuffers seems to be something we get
>> often compared to). For at least these two cases, I see that we want
>> to achieve different goals. We want to work with them together to
>> build a better data analytics ecosystem but at least from my
>> perspective, we don't want to replace all existing serialization
>> formats.
>
> Indeed, the most common problem I have experienced is that people who
> do not build data processing engines professionally sometimes get
> confused about the distinction between in-memory formats and
> serialization formats (Parquet, Avro, Protocol Buffers, etc.). The
> vast majority of developers rarely get this "close to the metal" and
> mainly think about storage formats and data access layers in terms of
> their high level semantics like "tables" and "records".
>
> The distinction between Arrow and zero-copy serialization formats like
> Flatbuffers and Cap'n Proto is another thing that I often find myself
> explaining. I don't think there's any way we can resolve these
> confusions in ~100 words.
>
> I would like for us to write some blog posts helping people mentally
> classify the technologies since it would help people understand both
> how Arrow is different as well as how it is a complementary / not
> mutually exclusive technology. I find that programmers are sometimes
> prone to dichotomous / binary thinking (which leads to the inclination
> to cast one technology as "the same as" another) and it's rare that a
> new, category-defining technology like this comes along. People even
> hear the "columnar" buzzword and then ask "wait, so is this replacing
> Parquet?".
>
> The audience for the Arrow project is the developers of data
> processing engines. We need to precisely message that developers who
> work with complex in-memory data sets (especially using shared memory
> and memory-mappable devices like GPUs and NVM), even if they are not
> always columnar / structured, are welcome and indeed desired members
> of our community. As an example, our collaboration with the Ray
> project has been a success (and bodes well for use in more machine
> learning applications) because we can compose our zero-copy structured
> data representation with general buffer memory management to create
> richer, memory-efficient data access interfaces.
>
> I'll spend a little time tweaking the blurb a bit based on Julian's
> edits and post for more feedback.
>
> - Wes
>
> On Sun, Oct 22, 2017 at 8:01 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> I clearly understand that all four layers are important to Arrow (and we
>> should mention them, maybe graphically, on the Arrow landing page). But
>> my concern is that I saw some time ago some people questioning "Is Arrow
>> a replacement for Avro?" (also Flatbuffers seems to be something we get
>> often compared to). For at least these two cases, I see that we want to
>> achieve different goals. We want to work with them together to build a
>> better data analytics ecosystem but at least from my perspective, we
>> don't want to replace all existing serialization formats. One of the
>> main points that should show people that there is a boundary in Arrow's
>> scope is the "in-memory" objective but I still would like to keep
>> "columnar" somewhere in the description. It might be slightly
>> de-emphasized but it is still there as one of the focal points. From my
>> perspective, three of the four layers are still very much focused on
>> columnar memory.
>>
>> Uwe
>>
>> On Sun, Oct 22, 2017, at 01:46 PM, Wes McKinney wrote:
>>> > Still, I would like to see "columnar" used in the first sentence as this 
>>> > is the main focus of the project.
>>>
>>> It's interesting, slightly de-emphasizing the role of the columnar
>>> format is actually one of my objectives of the revisions. It does not
>>> mean that the columnar specification is not a critical component of
>>> the project: it absolutely is, and is one of the centerpieces of the project.
>>>
>>> But the scope of Arrow has already become larger than that -- as time
>>> goes on the project's center of gravity concerns general management of
>>> in-memory analytical datasets. These may not be structured (and
>>> columnar) 100% of the time -- for example, you could use Arrow to
>>> write a collection of simple buffers (without any additional type
>>> metadata) to shared memory, then read them back with zero copy. This
>>> requires maintaining a general "memory management system" that is
>>> necessary for everything else, and the columnar format is built on top
>>> of this. It's pretty complex to be able to manage zero-copy memory
>>> references for arbitrarily complex
>>>
>>> I see the C++ library in 4 distinct layers, for example:
>>>
>>> * General zero-copy memory management: Plasma, arrow::Buffer,
>>> arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
>>> io::BufferReader, etc.)
>>> * Columnar memory format / data structures / in-memory metadata :
>>> arrow::DataType / Array
>>> * Structured data IPC: arrays, record batches, and any other new
>>> message types (e.g. tensors)
>>> * Columnar in-memory analytics: what we are just beginning to
>>> implement in arrow/compute
>>>
>>> I think to express to the open source community that in-memory data
>>> problems that are not columnar are of no interest to the Arrow
>>> community would be needlessly closing off collaboration opportunities.
>>> It's important that a larger audience is able to consume Arrow's
>>> memory management layer and IPC tools (e.g. they can easily be used
>>> for deep learning / ML applications) and use them to create more kinds
>>> of applications architected around the mantra of zero-copy. With new
>>> architectures designed to leverage non-volatile memory on the horizon,
>>> this grows more important with each passing day.
>>>
>>> - Wes
>>>
>>> On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>> > Thank you Wes and Julian for taking the approach to improve the elevator
>>> > pitch. I really like the improvements. Still, I would like to see
>>> > "columnar" used in the first sentence as this is the main focus of the
>>> > project.
>>> >
>>> > Uwe
>>> >
>>> > On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
>>> >> Thanks Julian, I like the changes.
>>> >>
>>> >> For the last part I agree listing languages is good; we would do well
>>> >> to include JavaScript and Ruby in that list. Hopefully the list will
>>> >> keep growing longer!
>>> >>
>>> >> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
>>> >> > Your proposed version is definitely an improvement.
>>> >> >
>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>> >> >> structured data access and analytics. It specifies a standardized
>>> >> >> language-independent columnar memory format for flat and hierarchical
>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>> >> >> communication. It also provides computational libraries for efficient
>>> >> >> in-memory analytics on modern hardware."
>>> >> >
>>> >> > I propose a few tweaks:
>>> >> >
>>> >> > Simplify sentence 1 to
>>> >> >
>>> >> >   Apache Arrow is a cross-language development platform for in-memory
>>> >> >   data.
>>> >> >
>>> >> > This is easier to parse, captures the gist, and the other parts are
>>> >> > covered in later sentences.
>>> >> >
>>> >> > To me, the cache-efficient format is more fundamentally important than
>>> >> > streaming and IPC (you can build the latter). Therefore I’d change
>>> >> > sentence 2 to
>>> >> >
>>> >> >   It specifies a standardized language-independent columnar memory
>>> >> >   format for flat and hierarchical data, organized for efficient
>>> >> >   analytic operations on modern hardware.
>>> >> >
>>> >> > Which leaves sentence 3 as
>>> >> >
>>> >> >   It also provides computational libraries for zero-copy streaming
>>> >> >   messaging and interprocess communication.
>>> >> >
>>> >> > And add sentence 4,
>>> >> >
>>> >> >   Languages supported include C and C++, Java, and Python.
>>> >> >
>>> >> > Julian
>>> >> >
>>> >> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> >> >>
>>> >> >> I believe we would benefit from modified language to describe the
>>> >> >> nature and scope of the Arrow project.
>>> >> >>
>>> >> >> Currently, our GitHub project description (and what we use in release
>>> >> >> announcements) states:
>>> >> >>
>>> >> >> "Apache Arrow is a columnar in-memory analytics layer designed to
>>> >> >> accelerate big data. It houses a set of canonical in-memory
>>> >> >> representations of flat and hierarchical data along with multiple
>>> >> >> language-bindings for structure manipulation. It also provides IPC and
>>> >> >> common algorithm implementations."
>>> >> >>
>>> >> >> I think this could be perhaps restated in the following way:
>>> >> >>
>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>> >> >> structured data access and analytics. It specifies a standardized
>>> >> >> language-independent columnar memory format for flat and hierarchical
>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>> >> >> communication. It also provides computational libraries for efficient
>>> >> >> in-memory analytics on modern hardware."
>>> >> >>
>>> >> >> It is true that we have been mostly focused on hardening the details
>>> >> >> of the Arrow format and related issues around messaging and IPC, which
>>> >> >> are necessary for everything else we may contemplate building in the
>>> >> >> future. Since I plan to be building a library of computational tools
>>> >> >> in C++ for the native code community (Python, Ruby, R, etc.), I think
>>> >> >> it would be a good idea to clearly state that building general purpose
>>> >> >> analytics implementations (i.e. the sorts of things you find in "data
>>> >> >> frame libraries" like pandas) is part of the mission of the project.
>>> >> >>
>>> >> >> Feedback on the above, and on how we could do a better job
>>> >> >> representing our past, present, and future community goals, would
>>> >> >> be appreciated.
>>> >> >>
>>> >> >> Thanks
>>> >> >> Wes
>>> >> >
