[ https://issues.apache.org/jira/browse/ARROW-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Molina updated ARROW-12666:
--------------------------------------
    Fix Version/s: 8.0.0
                       (was: 7.0.0)

> [Python] Array construction from numpy array is unclear about zero copy behaviour
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-12666
>                 URL: https://issues.apache.org/jira/browse/ARROW-12666
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 4.0.0
>            Reporter: Alessandro Molina
>            Assignee: Alessandro Molina
>            Priority: Major
>             Fix For: 8.0.0
>
>
> When building an Arrow array from a numpy array, it is very confusing from the
> user's point of view that the result is not always a new array.
> Under the hood, Arrow sometimes reuses the numpy memory if no casting is needed
> {code:python}
> import numpy as np
> import pyarrow as pa
>
> npa = np.array([1, 2, 3]*3)
> arrow_array = pa.array(npa, type=pa.int64())
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 10, 3, 1, 10, 3, 1, 10, 3]
> {code}
> and sometimes doesn't if a cast is involved
> {code:python}
> npa = np.array([1, 2, 3]*3)
> arrow_array = pa.array(npa, type=pa.int32())
> npa[npa == 2] = 10
> print(arrow_array.to_pylist())
> # Prints: [1, 2, 3, 1, 2, 3, 1, 2, 3]
> {code}
> For non-primitive types, instead, it always copies
> {code:python}
> npa = np.array(["a", "b", "c"]*3)
> arrow_array = pa.array(npa, type=pa.string())
> npa[npa == "b"] = "X"
> print(arrow_array.to_pylist())
> # Prints: ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']
> # Different from numpy array that was modified
> {code}
> This behaviour requires the user to pay close attention and to understand
> what's going on under the hood, which makes pyarrow hard to use.
> A {{copy=True/False}} argument should be added to {{pa.array}}, and the default
> value should probably be {{copy=True}} so that by default you can always create
> an Arrow array out of a numpy one ({{copy=False}} would probably have to
> throw an exception in the cases where zero copy can't be guaranteed, such as
> when building from a Python list).