[ 
https://issues.apache.org/jira/browse/ARROW-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624887#comment-17624887
 ] 

Kouhei Sutou commented on ARROW-18161:
--------------------------------------

Thanks.
{{Arrow::Table.new(**table)}} is the cause of this problem.
To avoid a memory copy, we need to keep the {{Arrow::Buffer}} for each 
{{arrow_frame[key].value}} alive while the {{Arrow::Table.new(**table)}} 
object is alive.
{{Arrow::Table.load(Arrow::Buffer.new)}} keeps a reference to the given 
{{Arrow::Buffer}} but {{Arrow::Table.new(**table)}} doesn't, so the 
{{Arrow::Buffer}} objects are GC-ed.

How about the following for now?

{code:ruby}
def _get_arrow_frame_from_proto_arrow_frame(arrow_frame)
  columns = {}
  buffers = []
  arrow_frame.keys.each do |key|
    buffer = Arrow::Buffer.new(arrow_frame[key].value)
    buffers << buffer
    tmp = Arrow::Table.load(buffer)
    col_name = create_friendly_name_for_key(key)
    columns[col_name] = tmp[0].data
  end
  table = Arrow::Table.new(**columns)
  table.instance_variable_set(:@buffers, buffers)
  table
end
{code}

We can avoid the {{instance_variable_set}} in the future by referencing the 
associated buffer from all related objects, such as each 
{{Arrow::ChunkedArray}} in the {{Arrow::Table}}.
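
Until then, another way to keep the buffers reachable without 
{{instance_variable_set}} is to wrap the table in a delegator that holds 
them. The following is only a rough sketch: {{BufferRetainingTable}} and 
{{retained_buffers}} are hypothetical names, not part of Red Arrow, and the 
wrapper uses only Ruby's standard {{delegate}} library, so it works with any 
table-like object.

```ruby
require "delegate"

# Hypothetical helper (not part of Red Arrow): forwards every method call to
# the wrapped table and holds strong references to the buffers backing it.
class BufferRetainingTable < SimpleDelegator
  def initialize(table, buffers)
    super(table)
    # While this wrapper is reachable, the buffers stay reachable too,
    # so the GC cannot collect them.
    @buffers = buffers.dup.freeze
  end

  # Exposed for inspection only.
  def retained_buffers
    @buffers
  end
end
```

With this, the method above could end with 
{{BufferRetainingTable.new(table, buffers)}} instead of setting {{@buffers}} 
on the table. The trade-off is that the result is a wrapper rather than an 
{{Arrow::Table}} itself, so code that checks the class or needs the raw 
object has to call {{__getobj__}}, and the buffers are only protected while 
the wrapper (not just the inner table) is referenced.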

> Reading error table causes mutations
> ------------------------------------
>
>                 Key: ARROW-18161
>                 URL: https://issues.apache.org/jira/browse/ARROW-18161
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Ruby
>    Affects Versions: 9.0.0
>         Environment: Ruby 3.1.2
>            Reporter: Noah Horton
>            Assignee: Kouhei Sutou
>            Priority: Major
>
> Given an Arrow::Table with several columns "X"
>  
> {code:ruby}
> # Rails console outputs
> 3.1.2 :107 > x.schema
>  => 
> #<Arrow::Schema:0x7ff2fbc426d8 ptr=0x55851587bc20 actual_values: int64
> dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :108 > x.schema
>  => 
> #<Arrow::Schema:0x7ff2fbbcda68 ptr=0x55851a541020 actual_values: int64
> dates: date32[day]                             
> expected_values: double>                       
> 3.1.2 :109 >  {code}
> Note that the object and pointer have both changed values.
> But the far bigger issue is that repeated reads from it will cause different 
> results:
> {code:ruby}
> 3.1.2 :097 > x[1][0]
>  => Sun, 22 Aug 2021 
> 3.1.2 :098 > x[1][1]
>  => nil 
> 3.1.2 :099 > x[1][0]
>  => nil {code}
> I have a lot of issues like this - when I have done these types of read 
> operations, I get the original table with the data in the columns all 
> shuffled around or deleted. 
> I do ingest the data slightly oddly in the first place as it comes in over 
> GRPC and I am using Arrow::Buffer to read it from the GRPC and then passing 
> that into Arrow::Table.load. But I would not expect that once it was in 
> Arrow::Table that I could do anything to permute it unintentionally.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
