[ 
https://issues.apache.org/jira/browse/ARROW-9336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152202#comment-17152202
 ] 

Steven Willis commented on ARROW-9336:
--------------------------------------

It looks like the Python library has a {{pyarrow.json.read_json(data)}} function, which 
creates a table properly even when a record is missing an entry in a struct. I 
didn't see an equivalent method for Ruby, but having one that behaved the same 
as this Python function would solve my issue.
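
Until an equivalent exists for Ruby, one possible workaround is to normalize the records so that every struct carries every key from the schema, with explicit {{nil}} for the missing ones, before building the batch. The sketch below is plain Ruby and does not call Arrow at all. {{fill_missing_keys}} is a hypothetical helper name, not part of red-arrow, and it assumes the schema is given in the same hash form as in the issue description. Whether explicit {{nil}} entries actually avoid the malformed table in 0.17.1 is an assumption that has not been verified here.

```ruby
# Hypothetical helper (not part of red-arrow): returns a copy of `record`
# in which every field listed in `fields` is present, using nil for any
# missing entry. Recurses into fields whose type is "struct".
def fill_missing_keys(fields, record)
  record ||= {}
  fields.each_with_object({}) do |field, filled|
    name = field[:name]
    value = record[name]
    # Only recurse when the struct itself is present; a missing struct stays nil.
    if field[:type] == "struct" && value
      value = fill_missing_keys(field[:fields], value)
    end
    filled[name] = value
  end
end

# Same schema description and records as in the issue.
schema = [
  {name: "a", type: "string"},
  {name: "b", type: "struct", fields: [
    {name: "c", type: "string"},
    {name: "d", type: "string"},
  ]},
]

records = [
  {"a" => "a", "b" => {"c" => "c"}},
  {"b" => {"c" => "c"}},
  {"b" => {"d" => "d"}},
]

normalized = records.map { |record| fill_missing_keys(schema, record) }
normalized.each { |record| p record }
```

The {{normalized}} array could then be passed to {{::Arrow::RecordBatch.new(arrow_schema, normalized)}} in place of the raw records. This only sidesteps the problem on the caller's side; it does not change how the builder treats missing struct entries.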

> Creating RecordBatch with structs missing keys results in a malformed table
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-9336
>                 URL: https://issues.apache.org/jira/browse/ARROW-9336
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Ruby
>    Affects Versions: 0.17.1
>            Reporter: Steven Willis
>            Priority: Major
>
> Using {{::Arrow::RecordBatch.new(schema, data)}} (which uses the 
> {{RecordBatchBuilder}}) appears to handle a record that is missing an entry 
> for a top-level column, but it doesn't handle a record that is missing an 
> entry within a struct column. For example, I'd expect the following code to 
> print out {{true}} for each {{puts}}, but two of them print {{false}}:
> {code:ruby}
> require 'parquet'
> require 'arrow'
> schema = [
>   {name: "a", type: "string"},
>   {name: "b", type: "struct", fields: [
>      {name: "c", type: "string"},
>      {name: "d", type: "string"},
>    ]
>   },
> ]
> arrow_schema = ::Arrow::Schema.new(schema)
> record_batch = ::Arrow::RecordBatch.new(
>   arrow_schema,
>   [
>     {"a" => "a", "b" => {"c" => "c",           }},
>     {            "b" => {"c" => "c",           }},
>     {            "b" => {            "d" => "d"}},
>   ]
> )
> table = record_batch.to_table
> puts(table['a'][0] == 'a')
> puts(table['a'][1].nil?)
> puts(table['a'][2].nil?)
> puts(table['b'][0].key?('c'))
> puts(table['b'][0]['c'] == 'c')
> puts(table['b'][0].key?('d'))
> puts(table['b'][0]['d'].nil?) # False ?
> puts(!table['b'][0].key?('e'))
> puts(table['b'][1].key?('c'))
> puts(table['b'][1]['c'] == 'c')
> puts(table['b'][1].key?('d'))
> puts(table['b'][1]['d'].nil?)
> puts(!table['b'][1].key?('e'))
> puts(table['b'][2].key?('c'))
> puts(table['b'][2]['c'].nil?)
> puts(table['b'][2].key?('d'))
> puts(table['b'][2]['d'] == 'd') # False ?
> puts(!table['b'][2].key?('e'))
> {code}
> I'd expect {{puts(table)}} to print this representation:
> {noformat}
>       a       b
> 0     a       {"c"=>"c", "d"=>nil}
> 1             {"c"=>"c", "d"=>nil}
> 2             {"c"=>nil, "d"=>"d"}
> {noformat}
> But it prints this instead:
> {noformat}
>       a       b
> 0     a       {"c"=>"c", "d"=>"d"}
> 1             {"c"=>"c", "d"=>nil}
> 2             {"c"=>nil, "d"=>nil}
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
