Re: A general question about Arrow

Benson Muite Mon, 20 Dec 2021 22:10:41 -0800

Andrew Lamb has a good overview talk, see:
http://andrew.nerdnetworks.org/

we need to put this in written form so it is easier to digest andadoption of the format can grow.


On 12/21/21 8:56 AM, Tao Wang wrote:

Just a followup question.
I looked through the python cookbook. So basically, Arrow provides someAPI to manipulate data in its own format.
And I get the idea that columnar benefits data analytics performance.
But it is not strictly necessary for applications to build analyticfunctions based on Arrow data format, right? For example,
Spark may have its own spark-native format rather than the Arrow format.

Yes, that is correct.


Best,
Tao

On Mon, Dec 20, 2021 at 5:22 PM Tao Wang <[email protected]<mailto:[email protected]>> wrote:


    Thanks for the references and explanation!

    Best,
    Tao

    On Mon, Dec 20, 2021 at 2:26 PM Sasha Krassovsky
    <[email protected] <mailto:[email protected]>> wrote:

        Hi Tao,
        You’re right that there is a cost to convert Arrow’s columnar
        format to a row format, but generally data analytics are done on
        columnar arrays for performance reasons. Any modern data
        analytics engine will use some form of “structure of arrays”
        (i.e. one array per column, rather than one struct per row).

        One example of this could be for grouped aggregation, where you
        group by a key column (let’s call it A) and sum another column
        (call it B). This can be broken up into two steps: group ID
        mapping and summation. The first stage touches only column A, as
        you find which row indices have the same values and assign them
        to groups. The summation step then touches only column B. As you
        can see, each part of the analysis touches only one column at a
        time. Thus if we were to use a row-oriented format we would be
        loading data into cache that we weren’t actually using,
        effectively reducing our cache size by half! By using a columnar
        format, we minimize cache misses by loading more useful data at
        a time.

        I hope this example helps illustrate why a columnar format is
        better for analytical workloads, which is what Arrow is targeting.

        Sasha

         > 19 дек. 2021 г., в 22:02, Tao Wang <[email protected]
        <mailto:[email protected]>> написал(а):
         >
         > Hi,
         >
         > I looked through Arrow's docs about its formats and APIs.
         >
         > But I am still somewhat confused about typical usecases of Arrow.
         >
         > As in my understanding, the goal of Arrow is to eliminate the
        (de)serialization costs among different data analytic systems,
        since it has the common format.
         >
         > But, it still needs some data conversion between Arrow format
        and language native format, right? For example, you have to
        convert Arrow columnar-based format to C++ row-based format. Or
        is there any usecase to directly conduct data analysis on
        Arrow's format?
         >
         > Best,
         > Tao

Re: A general question about Arrow

Reply via email to