Re: [DISCUSS] Updating Arrow's "elevator pitch" on web properties

Uwe L. Korn Sun, 22 Oct 2017 05:01:56 -0700

I clearly understand that all four layers are important to Arrow (and we
should mention them, maybe graphically, on the Arrow landing page). But
my concern is that some time ago I saw people asking "Is Arrow
a replacement for Avro?" (Flatbuffers also seems to be something we
often get compared to). For at least these two cases, I see that we want to
achieve different goals. We want to work together with them to build a
better data analytics ecosystem but, at least from my perspective, we
don't want to replace all existing serialization formats. One of the
main points that should show people that there is a boundary in Arrow's
scope is the "in-memory" objective, but I would still like to keep
"columnar" somewhere in the description. It might be slightly
de-emphasized, but it is still there as one of the focal points. From my
perspective, three of the four layers are still very much focused on
columnar memory.


Uwe

On Sun, Oct 22, 2017, at 01:46 PM, Wes McKinney wrote:
> > Still, I would like to see "columnar" used in the first sentence as this is 
> > the main focus of the project.
> 
> It's interesting: slightly de-emphasizing the role of the columnar
> format is actually one of my objectives for the revisions. It does not
> mean that the columnar specification is not a critical component of
> the project: it absolutely is, and it is one of the centerpieces.
> 
> But the scope of Arrow has already become larger than that -- as time
> goes on the project's center of gravity concerns general management of
> in-memory analytical datasets. These may not be structured (and
> columnar) 100% of the time -- for example, you could use Arrow to
> write a collection of simple buffers (without any additional type
> metadata) to shared memory, then read them back with zero copy. This
> requires maintaining a general "memory management system" that is
> necessary for everything else, and the columnar format is built on top
> of this. It's pretty complex to be able to manage zero-copy memory
> references for arbitrarily complex
> 
> I see the C++ library in 4 distinct layers, for example:
> 
> * General zero-copy memory management: Plasma, arrow::Buffer,
> arrow::MemoryPool, the contents of arrow::io (e.g. zero-copy
> io::BufferReader, etc.)
> * Columnar memory format / data structures / in-memory metadata :
> arrow::DataType / Array
> * Structured data IPC: arrays, record batches, and any other new
> message types (e.g. tensors)
> * Columnar in-memory analytics: what we are just beginning to
> implement in arrow/compute
> 
> I think that telling the open source community that in-memory data
> problems that are not columnar are of no interest to the Arrow
> community would needlessly close off collaboration opportunities.
> It's important that a larger audience is able to consume Arrow's
> memory management layer and IPC tools (e.g. they can easily be used
> for deep learning / ML applications) and use them to create more kinds
> of applications architected around the mantra of zero-copy. With new
> architectures designed to leverage non-volatile memory on the horizon,
> this grows more important with each passing day.
> 
> - Wes
> 
> On Sun, Oct 22, 2017 at 7:32 AM, Uwe L. Korn <[email protected]> wrote:
> > Thank you Wes and Julian for taking the approach to improve the elevator
> > pitch. I really like the improvements. Still, I would like to see
> > "columnar" used in the first sentence as this is the main focus of the
> > project.
> >
> > Uwe
> >
> > On Sat, Oct 21, 2017, at 10:32 PM, Wes McKinney wrote:
> >> Thanks Julian, I like the changes.
> >>
> >> For the last part I agree listing languages is good; we would do well
> >> to include JavaScript and Ruby in that list. Hopefully the list will
> >> keep growing longer!
> >>
> >> On Sat, Oct 21, 2017 at 4:20 PM, Julian Hyde <[email protected]> wrote:
> >> > Your proposed version is definitely an improvement.
> >> >
> >> >> "Apache Arrow is a cross-language development platform for in-memory
> >> >> structured data access and analytics. It specifies a standardized
> >> >> language-independent columnar memory format for flat and hierarchical
> >> >> data, with support for zero-copy streaming messaging and interprocess
> >> >> communication. It also provides computational libraries for efficient
> >> >> in-memory analytics on modern hardware."
> >> >
> >> > I propose a few tweaks:
> >> >
> >> > Simplify sentence 1 to
> >> >
> >> >   Apache Arrow is a cross-language development platform for in-memory
> >> >   data.
> >> >
> >> > This is easier to parse, captures the gist, and the other parts are
> >> > covered in later sentences.
> >> >
> >> > To me, the cache-efficient format is more fundamentally important than
> >> > streaming and IPC (you can build the latter). Therefore I’d change
> >> > sentence 2 to
> >> >
> >> >   It specifies a standardized language-independent columnar memory
> >> >   format for flat and hierarchical data, organized for efficient analytic
> >> >   operations on modern hardware.
> >> >
> >> > Which leaves sentence 3 as
> >> >
> >> >   It also provides computational libraries for zero-copy streaming
> >> >   messaging and interprocess communication.
> >> >
> >> > And add sentence 4,
> >> >
> >> >   Languages supported include C and C++, Java, and Python.
> >> >
> >> > Julian
> >> >
> >> >> On Oct 21, 2017, at 10:58 AM, Wes McKinney <[email protected]> wrote:
> >> >>
> >> >> I believe we would benefit from modified language to describe the
> >> >> nature and scope of the Arrow project.
> >> >>
> >> >> Currently, our GitHub project description (and what we use in release
> >> >> announcements) states:
> >> >>
> >> >> "Apache Arrow is a columnar in-memory analytics layer designed to
> >> >> accelerate big data. It houses a set of canonical in-memory
> >> >> representations of flat and hierarchical data along with multiple
> >> >> language-bindings for structure manipulation. It also provides IPC and
> >> >> common algorithm implementations."
> >> >>
> >> >> I think this could be perhaps restated in the following way:
> >> >>
> >> >> "Apache Arrow is a cross-language development platform for in-memory
> >> >> structured data access and analytics. It specifies a standardized
> >> >> language-independent columnar memory format for flat and hierarchical
> >> >> data, with support for zero-copy streaming messaging and interprocess
> >> >> communication. It also provides computational libraries for efficient
> >> >> in-memory analytics on modern hardware."
> >> >>
> >> >> It is true that we have been mostly focused on hardening the details
> >> >> of the Arrow format and related issues around messaging and IPC, which
> >> >> are necessary for everything else we may contemplate building in the
> >> >> future. Since I plan to be building a library of computational tools
> >> >> in C++ for the native code community (Python, Ruby, R, etc.), I think
> >> >> it would be a good idea to clearly state that building general purpose
> >> >> analytics implementations (i.e. the sorts of things you find in "data
> >> >> frame libraries" like pandas) is part of the mission of the project.
> >> >>
> >> >> Feedback on the above, and on how we could do a better job
> >> >> representing our past, present, and future community goals, would
> >> >> be appreciated.
> >> >>
> >> >> Thanks
> >> >> Wes
> >> >
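
For readers unfamiliar with the "zero-copy" idea discussed in the thread (writing a collection of simple buffers to shared memory, then reading them back without copying), here is a minimal sketch using only the Python standard library. It does not use Arrow or Plasma; the helper names `write_buffers` and `read_buffers` and the (offset, length) layout are assumptions made purely for illustration, and an anonymous `mmap` region stands in for real shared memory.

```python
import mmap
import struct


def write_buffers(region, buffers):
    """Pack raw buffers back to back; return their (offset, length) pairs."""
    layout = []
    offset = 0
    for buf in buffers:
        region[offset:offset + len(buf)] = buf
        layout.append((offset, len(buf)))
        offset += len(buf)
    return layout


def read_buffers(region, layout):
    """Return memoryview slices that reference the region without copying."""
    view = memoryview(region)
    return [view[off:off + n] for off, n in layout]


# Two untyped payloads: some text bytes and three little-endian int32 values.
payloads = [b"abc", struct.pack("<3i", 1, 2, 3)]

region = mmap.mmap(-1, 4096)  # anonymous mapping as a stand-in for shared memory
layout = write_buffers(region, payloads)
views = read_buffers(region, layout)

# A write to the region is visible through the existing view, which shows
# that the view references the same memory rather than holding a copy.
region[0:1] = b"X"
first = bytes(views[0])
ints = struct.unpack("<3i", views[1])
```

The readers only hold references into the mapped region, so the cost of "reading" does not grow with the size of the data. The type information (here, the `"<3i"` format) travels separately from the bytes, which loosely mirrors the thread's distinction between a general memory management layer and the columnar format built on top of it.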
