Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]

2024-05-26 Thread via GitHub


timsaucer commented on issue #715:
URL: 
https://github.com/apache/datafusion-python/issues/715#issuecomment-2132233093

   Closing in favor of https://github.com/apache/arrow/issues/41833
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]

2024-05-26 Thread via GitHub


timsaucer closed issue #715: Selecting struct field within field produces 
unexpected results
URL: https://github.com/apache/datafusion-python/issues/715





Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]

2024-05-26 Thread via GitHub


timsaucer commented on issue #715:
URL: 
https://github.com/apache/datafusion-python/issues/715#issuecomment-2132211293

   In my gist above, I went back and inserted values into the subfields 
`inner_1` and `inner_2` even though `outer` was null, and I *am* able to 
reproduce the problem above, so I definitely think this is not a 
datafusion-python problem.





Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]

2024-05-26 Thread via GitHub


timsaucer commented on issue #715:
URL: 
https://github.com/apache/datafusion-python/issues/715#issuecomment-2132210685

   I think I know what's going on.
   
   Even if `outer` is null, we still have data within `inner_1` and `inner_2`. 
When pyarrow creates the record batch, it sets these to the default value 
rather than null even though the outer struct is null. Then on the datafusion 
side we index into these and get those default values.
   
   I *think* the right place to resolve this is in pyarrow, which should set 
the inner values to null whenever the outer value is null. But maybe there are 
additional validity checks we should have. I'm going to think a little more 
about this issue before moving it to the most appropriate repo.
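
   To illustrate the mechanism described above, here is a minimal pure-Python 
sketch of a struct column that carries its own validity bitmap. It is a toy 
model, not pyarrow's or DataFusion's implementation: `Int64Column`, 
`StructColumn`, `raw_child`, and `flattened_child` are hypothetical names, and 
the valid default `0` stored under the null parent is an assumption taken from 
the output in the original report.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Int64Column:
    """Toy integer column: a values buffer plus a validity bitmap."""
    values: List[int]
    validity: List[bool]

    def to_list(self) -> List[Optional[int]]:
        return [v if ok else None for v, ok in zip(self.values, self.validity)]


@dataclass
class StructColumn:
    """Toy struct column: child columns plus the struct's own validity bitmap."""
    children: Dict[str, Int64Column]
    validity: List[bool]

    def raw_child(self, name: str) -> Int64Column:
        # Ignores the struct's validity, so slots under a null parent leak through.
        return self.children[name]

    def flattened_child(self, name: str) -> Int64Column:
        # AND the parent's validity into the child's, so a null parent
        # always yields a null child value.
        child = self.children[name]
        merged = [p and c for p, c in zip(self.validity, child.validity)]
        return Int64Column(child.values, merged)


# `outer_1` for the three rows in the report; row 3 is a null struct.
# Assumption: the child slots under the null parent hold a valid default 0.
outer_1 = StructColumn(
    children={
        "inner_1": Int64Column([1, 1, 0], [True, True, True]),
        "inner_2": Int64Column([2, 0, 0], [True, False, True]),
    },
    validity=[True, True, False],
)

print(outer_1.raw_child("inner_2").to_list())        # [2, None, 0]
print(outer_1.flattened_child("inner_2").to_list())  # [2, None, None]
```

   Reading the child without consulting the parent gives the `0` seen in the 
report; combining the two bitmaps gives the expected null.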





Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]

2024-05-26 Thread via GitHub


timsaucer commented on issue #715:
URL: 
https://github.com/apache/datafusion-python/issues/715#issuecomment-2132207121

   Further testing on the Rust side makes me think it is something about how 
the record batch is created in pyarrow. I created the same dataframe using 
StructBuilder in the gist below and cannot reproduce the problem.
   
   https://gist.github.com/timsaucer/7527c0851b379d4e9c466d8972d49a01





Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]

2024-05-25 Thread via GitHub


timsaucer commented on issue #715:
URL: 
https://github.com/apache/datafusion-python/issues/715#issuecomment-2131229316

   My statement above about testing on the Rust side is likely incorrect. I ran 
the same test as above, but loading the dataframe from a parquet file instead 
of creating it in memory, and the expected behavior is reproduced.
   
   If you append these lines to the bottom of the minimal example
   
   ```
   df.write_parquet("save_out.parquet")
   
   df_reread = ctx.read_parquet("save_out.parquet")
   
   df_reread.show()
   df_reread.select(col("a")["outer_1"]["inner_2"]).show()
   ```
   
   You get the expected result
   ```
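   DataFrame()
   +-------------------------------------+
   | a                                   |
   +-------------------------------------+
   | {outer_1: {inner_1: 1, inner_2: 2}} |
   | {outer_1: {inner_1: 1, inner_2: }}  |
   | {outer_1: }                         |
   +-------------------------------------+
   DataFrame()
   +-----------------------------+
   | ?table?.a[outer_1][inner_2] |
   +-----------------------------+
   | 2                           |
   |                             |
   |                             |
   +-----------------------------+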
   ```
   
   It also shows the original table is reproduced. I'll continue digging, but I 
am no longer convinced this is a Python binding issue.





[I] Selecting struct field within field produces unexpected results [datafusion-python]

2024-05-24 Thread via GitHub


timsaucer opened a new issue, #715:
URL: https://github.com/apache/datafusion-python/issues/715

   **Describe the bug**
   When you have a column that is a struct of structs and you attempt to index 
into the lowest level, you get an unexpected result if there is a null at the 
first level of the struct. In the dataframe below I have an `outer_1` struct; 
if it is null and we try to access an inner member, we would expect to also get 
a null.
   
   I have exported this dataframe to parquet and tested on the Rust side and 
the problem does not exist there, so I think it is something in this repo.
   
   **To Reproduce**
   ```
   import pyarrow as pa
   from datafusion import SessionContext, col
   
   ctx = SessionContext()
   
   # Row 3 has a null `outer_1`, so its inner fields should read as null too.
   batch = pa.RecordBatch.from_arrays(
       [pa.array([
           {"outer_1": {"inner_1": 1, "inner_2": 2}},
           {"outer_1": {"inner_1": 1, "inner_2": None}},
           {"outer_1": None},
       ])],
       names=["a"],
   )
   
   df = ctx.create_dataframe([[batch]])
   
   df.write_parquet("/dbfs/tmp/tsaucer/struct_of_struct.parquet")
   
   df.select(col("a")).show()
   
   df.select(col("a")["outer_1"]).show()
   
   df.select(col("a")["outer_1"]["inner_2"]).show()
   ```
   
   Produces:
   
   ```
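   DataFrame()
   +-------------------------------------+
   | a                                   |
   +-------------------------------------+
   | {outer_1: {inner_1: 1, inner_2: 2}} |
   | {outer_1: {inner_1: 1, inner_2: }}  |
   | {outer_1: }                         |
   +-------------------------------------+
   DataFrame()
   +----------------------------------------------+
   | cc251bd408f114ca2a4354b6976d91339.a[outer_1] |
   +----------------------------------------------+
   | {inner_1: 1, inner_2: 2}                     |
   | {inner_1: 1, inner_2: }                      |
   |                                              |
   +----------------------------------------------+
   DataFrame()
   +-------------------------------------------------------+
   | cc251bd408f114ca2a4354b6976d91339.a[outer_1][inner_2] |
   +-------------------------------------------------------+
   | 2                                                     |
   |                                                       |
   | 0                                                     |
   +-------------------------------------------------------+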
   ```
   
   **Expected behavior**
   
   Accessing a subfield of a null entry should also return null.
   

