It's best if a project's (or company's) marketing has several tiers: an "elevator pitch" of 2-3 sentences, a "high concept pitch" that is a single phrase (e.g. "book rooms with locals, rather than hotels"), and an expanded description.
I think the question of whether this replaces Avro is best handled in an FAQ.

On Sun, Oct 22, 2017 at 5:35 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>> But my concern is that I saw some time ago some people questioning "Is Arrow
>> a replacement for Avro?" (also Flatbuffers seems to be something we get
>> often compared to). For at least these two cases, I see that we want
>> to achieve different goals. We want to work with them together to
>> build a better data analytics ecosystem but at least from my
>> perspective, we don't want to replace all existing serialization
>> formats.
>
> Indeed, the most common problem I have experienced is that people who
> do not build data processing engines professionally sometimes get
> confused about the distinction between in-memory formats and
> serialization formats (Parquet, Avro, Protocol Buffers, etc.). The
> vast majority of developers rarely get this "close to the metal" and
> mainly think about storage formats and data access layers in terms of
> their high level semantics like "tables" and "records".
>
> The distinction between Arrow and zero-copy serialization formats like
> Flatbuffers and Cap'n Proto is another thing that I often find myself
> explaining. I don't think there's any way we can resolve these
> confusions in ~100 words.
>
> I would like for us to write some blog posts helping people mentally
> classify the technologies since it would help people understand both
> how Arrow is different as well as how it is a complementary / not
> mutually exclusive technology. I find that programmers are sometimes
> prone to dichotomous / binary thinking (which leads to the inclination
> to cast one technology as "the same as" another) and it's rare that a
> new, category-defining technology like this comes along. People even
> hear the "columnar" buzzword and then ask "wait, so is this replacing
> Parquet?".
>
> The audience for the Arrow project are the developers of data
> processing engines.
> We need to precisely message that developers who
> work with complex in-memory data sets (especially using shared memory
> and memory-mappable devices like GPUs and NVM), even if they are not
> always columnar / structured, are welcome and indeed desired members
> of our community. As an example, our collaboration with the Ray
> project has been a success (and bodes well for use in more machine
> learning applications) because we can compose our zero-copy structured
> data representation with general buffer memory management to create
> richer, memory-efficient data access interfaces.
>
> I'll spend a little time tweaking the blurb a bit based on Julian's
> edits and post for more feedback.
>
> - Wes
>
> On Sun, Oct 22, 2017 at 8:01 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> I clearly understand that all four layers are important to Arrow (and we
>> should mention them, maybe graphically) on the Arrow landing page. But
>> my concern is that I saw some time ago some people questioning "Is Arrow
>> a replacement for Avro?" (also Flatbuffers seems to be something we get
>> often compared to). For at least these two cases, I see that we want to
>> achieve different goals. We want to work with them together to build a
>> better data analytics ecosystem but at least from my perspective, we
>> don't want to replace all existing serialization formats. One of the
>> main points that people should show that there is a boundary in Arrow's
>> scope is the "in-memory" objective but I still would like to keep the
>> "columnar" somewhere in the description. It might be slightly
>> de-emphasized but it is still there as one of the focal point. From my
>> perspective, 3 of the four layers are still very much focused on
>> columnar memory.
>>
>> Uwe
>>
>> On Sun, Oct 22, 2017, at 01:46 PM, Wes McKinney wrote:
>>> > Still, I would like to see "columnar" used in the first sentence as this
>>> > is the main focus of the project.
>>>
>>> It's interesting, slightly de-emphasizing the role of the columnar
>>> format is actually one of my objectives of the revisions. It does not
>>> mean that the columnar specification is not a critical component of
>>> the project: it absolutely is and one of centerpieces of the project.
>>>
>>> But the scope of Arrow has already become larger than that -- as time
>>> goes on the project's center of gravity concerns general management of
>>> in-memory analytical datasets. These may not be structured (and
>>> columnar) 100% of the time -- for example, you could use Arrow to
>>> write a collection of simple buffers (without any additional type
>>> metadata) to shared memory, then read them back with zero copy. This
>>> requires maintaining a general "memory management system" that is
>>> necessary for everything else, and the columnar format is built on top
>>> of this. It's pretty complex to be able to manage zero-copy memory
>>> references for arbitrarily complex
>>>
>>> I see the C++ library in 4 distinct layers, for example:
>>>
>>> * General zero-copy memory management: Plasma, arrow::Buffer,
>>> arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
>>> io::BufferReader, etc.)
>>> * Columnar memory format / data structures / in-memory metadata :
>>> arrow::DataType / Array
>>> * Structured data IPC: arrays, record batches, and any other new
>>> message types (e.g. tensors)
>>> * Columnar in-memory analytics: what we are just beginning to
>>> implement in arrow/compute
>>>
>>> I think to express to the open source community that in-memory data
>>> problems that are not columnar are of no interest to the Arrow
>>> community would be needlessly closing off collaboration opportunities.
>>> It's important that a larger audience is able to consume Arrow's
>>> memory management layer and IPC tools (e.g.
>>> they can easily be used
>>> for deep learning / ML applications) and use them to create more kinds
>>> of applications architected around the mantra of zero-copy. With new
>>> architectures designed to leverage non-volatile memory on the horizon,
>>> this grows more important with each passing day.
>>>
>>> - Wes
>>>
>>> On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>> > Thank you Wes and Julian for taking the approach to improve the elevator
>>> > pitch. I really like the improvements. Still, I would like to see
>>> > "columnar" used in the first sentence as this is the main focus of the
>>> > project.
>>> >
>>> > Uwe
>>> >
>>> > On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
>>> >> Thanks Julian, I like the changes.
>>> >>
>>> >> For the last part I agree listing languages is good; we would do well
>>> >> to include JavaScript and Ruby in that list. Hopefully the list will
>>> >> keep growing longer!
>>> >>
>>> >> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <jh...@apache.org> wrote:
>>> >> > Your proposed version is definitely an improvement.
>>> >> >
>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>> >> >> structured data access and analytics. It specifies a standardized
>>> >> >> language-independent columnar memory format for flat and hierarchical
>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>> >> >> communication. It also provides computational libraries for efficient
>>> >> >> in-memory analytics on modern hardware."
>>> >> >
>>> >> > I propose a few tweaks:
>>> >> >
>>> >> > Simplify sentence 1 to
>>> >> >
>>> >> > Apache Arrow is a cross-language development platform for in-memory
>>> >> > data.
>>> >> >
>>> >> > This is easier to parse, captures the gist, and the other parts are covered
>>> >> > in later sentences.
>>> >> >
>>> >> > To me, the cache-efficient format is more fundamental important than
>>> >> > streaming and IPC (you can build the latter). Therefore I’d change
>>> >> > sentence 2 to
>>> >> >
>>> >> > It specifies a standardized language-independent columnar memory
>>> >> > format for flat and hierarchical data, organized for efficient analytic
>>> >> > operations on modern hardware.
>>> >> >
>>> >> > Which leaves sentence 3 as
>>> >> >
>>> >> > It also provides computational libraries for zero-copy streaming
>>> >> > messaging and interprocess communication.
>>> >> >
>>> >> > And add sentence 4,
>>> >> >
>>> >> > Languages supported include C and C++, Java, and Python.
>>> >> >
>>> >> > Julian
>>> >> >
>>> >> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> >> >>
>>> >> >> I believe we would benefit from modified language to describe the
>>> >> >> nature and scope of the Arrow project.
>>> >> >>
>>> >> >> Currently, our GitHub project description (and what we use in release
>>> >> >> announcements) states:
>>> >> >>
>>> >> >> "Apache Arrow is a columnar in-memory analytics layer designed to
>>> >> >> accelerate big data. It houses a set of canonical in-memory
>>> >> >> representations of flat and hierarchical data along with multiple
>>> >> >> language-bindings for structure manipulation. It also provides IPC and
>>> >> >> common algorithm implementations."
>>> >> >>
>>> >> >> I think this could be perhaps restated in the following way:
>>> >> >>
>>> >> >> "Apache Arrow is a cross-language development platform for in-memory
>>> >> >> structured data access and analytics. It specifies a standardized
>>> >> >> language-independent columnar memory format for flat and hierarchical
>>> >> >> data, with support for zero-copy streaming messaging and interprocess
>>> >> >> communication. It also provides computational libraries for efficient
>>> >> >> in-memory analytics on modern hardware."
>>> >> >>
>>> >> >> It is true that we have been mostly focused on hardening the details
>>> >> >> of the Arrow format and related issues around messaging and IPC, which
>>> >> >> are necessary for everything else we may contemplate building in the
>>> >> >> future. Since I plan to be building a library of computational tools
>>> >> >> in C++ for the native code community (Python, Ruby, R, etc.), I think
>>> >> >> it would be a good idea to clearly state that building general purpose
>>> >> >> analytics implementations (i.e. the sorts of things you find in "data
>>> >> >> frame libraries" like pandas) is part of the mission of the project.
>>> >> >>
>>> >> >> Feedback on the above would be appreciated how we could do a better
>>> >> >> job representing our past, present, and future community goals.
>>> >> >>
>>> >> >> Thanks
>>> >> >> Wes
>>> >> >