timsaucer opened a new issue, #715:
URL: https://github.com/apache/datafusion-python/issues/715

   **Describe the bug**
   When a column is a struct of structs and you index into the innermost 
level, a null at the first level of the struct produces an unexpected result. 
In the dataframe below, `outer_1` is a struct; when it is null and we access 
one of its inner members, we would expect to get a null as well. Instead, the 
last row of the final output below shows `0`.
   
   I exported this dataframe to Parquet and tested it on the Rust side, where 
the problem does not occur, so I think the cause is in this repo.
   
   **To Reproduce**
   ```
   import pyarrow as pa
   from datafusion import SessionContext, col
   
   ctx = SessionContext()
   
   batch = pa.RecordBatch.from_arrays(
       [pa.array([
           {"outer_1": {"inner_1": 1, "inner_2": 2}},
           {"outer_1": {"inner_1": 1, "inner_2": None}},
           {"outer_1": None},
       ])],
       names=["a"],
   )
   
   df = ctx.create_dataframe([[batch]])
   
   df.write_parquet("/dbfs/tmp/tsaucer/struct_of_struct.parquet")
   
   df.select(col("a")).show()
   
   df.select(col("a")["outer_1"]).show()
   
   df.select(col("a")["outer_1"]["inner_2"]).show()
   ```
   
   Produces:
   
   ```
   DataFrame()
   +-------------------------------------+
   | a                                   |
   +-------------------------------------+
   | {outer_1: {inner_1: 1, inner_2: 2}} |
   | {outer_1: {inner_1: 1, inner_2: }}  |
   | {outer_1: }                         |
   +-------------------------------------+
   DataFrame()
   +----------------------------------------------+
   | cc251bd408f114ca2a4354b6976d91339.a[outer_1] |
   +----------------------------------------------+
   | {inner_1: 1, inner_2: 2}                     |
   | {inner_1: 1, inner_2: }                      |
   |                                              |
   +----------------------------------------------+
   DataFrame()
   +-------------------------------------------------------+
   | cc251bd408f114ca2a4354b6976d91339.a[outer_1][inner_2] |
   +-------------------------------------------------------+
   | 2                                                     |
   |                                                       |
   | 0                                                     |
   +-------------------------------------------------------+
   ```
   
   **Expected behavior**
   
   Accessing a subfield of a null entry should also return null.
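
To make the expected semantics concrete, here is a minimal pure-Python sketch of null-propagating field access over the same three rows. The helper `get_nested` is hypothetical and exists only for illustration; it is not part of DataFusion or PyArrow, and it says nothing about how either library implements struct indexing.

```python
def get_nested(row, *keys):
    """Follow keys into nested dicts, returning None as soon as any level is None."""
    current = row
    for key in keys:
        if current is None:
            # A null parent means every field below it is null as well.
            return None
        current = current[key]
    return current


# The same rows as in the reproduction above.
rows = [
    {"outer_1": {"inner_1": 1, "inner_2": 2}},
    {"outer_1": {"inner_1": 1, "inner_2": None}},
    {"outer_1": None},
]

expected = [get_nested(row, "outer_1", "inner_2") for row in rows]
print(expected)  # [2, None, None]
```

Under these semantics the third row of `a[outer_1][inner_2]` would be null rather than `0`.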
   
