Andrew Lamb has a good overview talk, see:
http://andrew.nerdnetworks.org/
we need to put this in written form so it is easier to digest and
adoption of the format can grow.
On 12/21/21 8:56 AM, Tao Wang wrote:
Just a followup question.
I looked through the python cookbook. So basically, Arrow provides some
API to manipulate data in its own format.
And I get the idea that columnar benefits data analytics performance.
But it is not strictly necessary for applications to build analytic
functions based on Arrow data format, right? For example,
Spark may have its own spark-native format rather than the Arrow format.
Yes, that is correct.
Best,
Tao
On Mon, Dec 20, 2021 at 5:22 PM Tao Wang <[email protected]
<mailto:[email protected]>> wrote:
Thanks for the references and explanation!
Best,
Tao
On Mon, Dec 20, 2021 at 2:26 PM Sasha Krassovsky
<[email protected] <mailto:[email protected]>> wrote:
Hi Tao,
You’re right that there is a cost to convert Arrow’s columnar
format to a row format, but generally data analytics are done on
columnar arrays for performance reasons. Any modern data
analytics engine will use some form of “structure of arrays”
(i.e. one array per column, rather than one struct per row).
One example of this could be for grouped aggregation, where you
group by a key column (let’s call it A) and sum another column
(call it B). This can be broken up into two steps: group ID
mapping and summation. The first stage touches only column A, as
you find which row indices have the same values and assign them
to groups. The summation step then touches only column B. As you
can see, each part of the analysis touches only one column at a
time. Thus if we were to use a row-oriented format we would be
loading data into cache that we weren’t actually using,
effectively reducing our cache size by half! By using a columnar
format, we minimize cache misses by loading more useful data at
a time.
I hope this example helps illustrate why a columnar format is
better for analytical workloads, which is what Arrow is targeting.
Sasha
> 19 дек. 2021 г., в 22:02, Tao Wang <[email protected]
<mailto:[email protected]>> написал(а):
>
> Hi,
>
> I looked through Arrow's docs about its formats and APIs.
>
> But I am still somewhat confused about typical usecases of Arrow.
>
> As in my understanding, the goal of Arrow is to eliminate the
(de)serialization costs among different data analytic systems,
since it has the common format.
>
> But, it still needs some data conversion between Arrow format
and language native format, right? For example, you have to
convert Arrow columnar-based format to C++ row-based format. Or
is there any usecase to directly conduct data analysis on
Arrow's format?
>
> Best,
> Tao