Just a followup question. I looked through the python cookbook. So basically, Arrow provides some API to manipulate data in its own format.
And I get the idea that columnar benefits data analytics performance. But it is not strictly necessary for applications to build analytic functions based on Arrow data format, right? For example, Spark may have its own spark-native format rather than the Arrow format. Best, Tao On Mon, Dec 20, 2021 at 5:22 PM Tao Wang <[email protected]> wrote: > Thanks for the references and explanation! > > Best, > Tao > > On Mon, Dec 20, 2021 at 2:26 PM Sasha Krassovsky < > [email protected]> wrote: > >> Hi Tao, >> You’re right that there is a cost to convert Arrow’s columnar format to a >> row format, but generally data analytics are done on columnar arrays for >> performance reasons. Any modern data analytics engine will use some form of >> “structure of arrays” (i.e. one array per column, rather than one struct >> per row). >> >> One example of this could be for grouped aggregation, where you group by >> a key column (let’s call it A) and sum another column (call it B). This can >> be broken up into two steps: group ID mapping and summation. The first >> stage touches only column A, as you find which row indices have the same >> values and assign them to groups. The summation step then touches only >> column B. As you can see, each part of the analysis touches only one column >> at a time. Thus if we were to use a row-oriented format we would be loading >> data into cache that we weren’t actually using, effectively reducing our >> cache size by half! By using a columnar format, we minimize cache misses by >> loading more useful data at a time. >> >> I hope this example helps illustrate why a columnar format is better for >> analytical workloads, which is what Arrow is targeting. >> >> Sasha >> >> > 19 дек. 2021 г., в 22:02, Tao Wang <[email protected]> написал(а): >> > >> > Hi, >> > >> > I looked through Arrow's docs about its formats and APIs. >> > >> > But I am still somewhat confused about typical usecases of Arrow. >> > >> > As in my understanding, the goal of Arrow is to eliminate the >> (de)serialization costs among different data analytic systems, since it has >> the common format. >> > >> > But, it still needs some data conversion between Arrow format and >> language native format, right? For example, you have to convert Arrow >> columnar-based format to C++ row-based format. Or is there any usecase to >> directly conduct data analysis on Arrow's format? >> > >> > Best, >> > Tao >> >
