Re: A general question about Arrow

Sasha Krassovsky Sun, 19 Dec 2021 22:27:13 -0800

Hi Tao,
You’re right that converting Arrow’s columnar format to a row format has a 
cost, but data analytics is generally done directly on columnar arrays for 
performance reasons, so that conversion is often unnecessary. Most modern 
data analytics engines use some form of “structure of arrays” (i.e. one array 
per column, rather than one struct per row).
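
To make the two layouts concrete, here is a minimal sketch in plain Python. 
It does not use the Arrow libraries; the sample data and the names rows, 
columns, total_b and rows_again are made up for illustration, and the 
standard-library array module stands in for a contiguous column buffer.

```python
from array import array

# Hypothetical sample data: three rows with two integer fields, a and b.
# "Array of structs": one tuple per row.
rows = [(1, 10), (2, 20), (3, 30)]

# "Structure of arrays": one contiguous array per column.
columns = {
    "a": array("q", (row[0] for row in rows)),
    "b": array("q", (row[1] for row in rows)),
}

# A column-wise computation reads a single array and ignores the others.
total_b = sum(columns["b"])
print(total_b)  # 60

# Going back to rows is the conversion cost raised in the question:
# every value has to be copied into a new per-row object.
rows_again = list(zip(columns["a"], columns["b"]))
```

The point of the sketch is only that the sum never looks at column a, and 
that the row form has to be rebuilt explicitly if some consumer needs it.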


One example is grouped aggregation, where you group by a key column (let’s 
call it A) and sum another column (call it B). This can be broken up into two 
steps: group ID mapping and summation. The group ID mapping step touches only 
column A: you find which row indices have the same value and assign them to 
the same group. The summation step then touches only column B (plus the group 
IDs computed in the first step). As you can see, each step touches only one 
input column at a time. If we used a row-oriented format instead, every cache 
line would also carry values from the column we weren’t using. With two 
columns of equal width, that effectively cuts our usable cache in half! By 
using a columnar format, we minimize cache misses by loading more useful data 
at a time.
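
The two steps above can be sketched as follows. This is plain Python under 
stated assumptions, not Arrow's actual implementation: the function names 
assign_group_ids and sum_by_group and the sample columns are hypothetical, 
and a dict stands in for whatever hash table a real engine would use.

```python
from array import array


def assign_group_ids(keys):
    """Step 1 (group ID mapping): reads only the key column."""
    ids_by_key = {}           # key value -> dense group ID, in first-seen order
    group_ids = array("q")    # one group ID per input row
    for key in keys:
        # A new key gets the next unused ID; a known key reuses its ID.
        group_ids.append(ids_by_key.setdefault(key, len(ids_by_key)))
    return group_ids, list(ids_by_key)


def sum_by_group(group_ids, values, num_groups):
    """Step 2 (summation): reads only the value column and the group IDs."""
    sums = [0] * num_groups
    for gid, value in zip(group_ids, values):
        sums[gid] += value
    return sums


# Hypothetical sample columns A (keys) and B (values to sum).
col_a = array("q", [1, 2, 1, 3, 2])
col_b = array("q", [10, 20, 30, 40, 50])

group_ids, distinct_keys = assign_group_ids(col_a)
sums = sum_by_group(group_ids, col_b, len(distinct_keys))
result = dict(zip(distinct_keys, sums))
print(result)  # {1: 40, 2: 70, 3: 40}
```

Note that assign_group_ids never sees col_b and sum_by_group never sees 
col_a, which is the access pattern a columnar layout is meant to serve.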

I hope this example helps illustrate why a columnar format is better for 
analytical workloads, which is what Arrow is targeting. 

Sasha 

> On Dec 19, 2021, at 22:02, Tao Wang <[email protected]> wrote:
> 
> Hi,
> 
> I looked through Arrow's docs about its formats and APIs.
> 
> But I am still somewhat confused about typical use cases of Arrow.
> 
> As I understand it, the goal of Arrow is to eliminate the 
> (de)serialization costs among different data analytics systems, since it 
> provides a common format.
> 
> But it still needs some data conversion between the Arrow format and a 
> language's native format, right? For example, you have to convert Arrow's 
> columnar format to a C++ row-based format. Or is there any use case for 
> conducting data analysis directly on Arrow's format?
> 
> Best,
> Tao
