Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer commented on issue #715: URL: https://github.com/apache/datafusion-python/issues/715#issuecomment-2132233093 Closing in favor of https://github.com/apache/arrow/issues/41833 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer closed issue #715: Selecting struct field within field produces unexpected results URL: https://github.com/apache/datafusion-python/issues/715 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer commented on issue #715: URL: https://github.com/apache/datafusion-python/issues/715#issuecomment-2132211293 In my gist above, I went back an inserted values into the subfields `inner_1` and `inner_2` even though `outer` was null and I *am* able to reproduce the problem above, so I definitely think this is not a datafusion-python problem. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer commented on issue #715: URL: https://github.com/apache/datafusion-python/issues/715#issuecomment-2132210685 I think I know what's going on. Even if `outer` is null, we still have data within `inner_1` and `inner_2`. When pyarrow creates the record batch, it sets these to the default value rather than null even though the outer struct is null. Then on the datafusion side we index into these and get those default values. I *think* the right place to resolve this is in pyarrow setting null when all outer values are null. But maybe there is additional validity checks we should have. I'm going to think a little more about this issue before moving it to the most appropriate repo. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer commented on issue #715: URL: https://github.com/apache/datafusion-python/issues/715#issuecomment-2132207121 Further testing on the rust side makes me think it is something about how the batch record is created in pyarrow. I created the same dataframe using StructBuilder in the below gist and cannot reproduce the problem. https://gist.github.com/timsaucer/7527c0851b379d4e9c466d8972d49a01 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer commented on issue #715: URL: https://github.com/apache/datafusion-python/issues/715#issuecomment-2131229316 My statement above about testing on rust side is likely incorrect. I ran the same test above but loading the dataframe from a parquet file instead of creating in memory and the expected behavior is reproduced. If you amend these lines to the bottom of the minimal example ``` df.write_parquet("save_out.parquet") df_reread = ctx.read_parquet("save_out.parquet") df_reread.show() df_reread.select(col("a")["outer_1"]["inner_2"]).show() ``` You get the expected result ``` DataFrame() +-+ | a | +-+ | {outer_1: {inner_1: 1, inner_2: 2}} | | {outer_1: {inner_1: 1, inner_2: }} | | {outer_1: } | +-+ DataFrame() +-+ | ?table?.a[outer_1][inner_2] | +-+ | 2 | | | | | +-+ ``` It also shows the original table is reproduced. I'll continue digging but I no longer am convinced this is a python binding issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer opened a new issue, #715: URL: https://github.com/apache/datafusion-python/issues/715 **Describe the bug** When you have a column that is a struct of struct and you attempt to index into the lowest level, if there is a null at the first level of the struct you get an unexpected result. In the dataframe below I have an `outer_1` stuct that if it is null and we try to access an inner member, we would expect to also get a null. I have exported this dataframe to parquet and tested on the rust side and the problem does not exist there, so I think it is something in this repo. **To Reproduce** ``` ctx = SessionContext() batch = pa.RecordBatch.from_arrays( [pa.array([ {"outer_1": {"inner_1": 1, "inner_2": 2}}, {"outer_1": {"inner_1": 1, "inner_2": None}}, {"outer_1": None}, ])], names=["a"], ) df = ctx.create_dataframe([[batch]]) df.write_parquet("/dbfs/tmp/tsaucer/struct_of_struct.parquet") df.select(col("a")).show() df.select(col("a")["outer_1"]).show() df.select(col("a")["outer_1"]["inner_2"]).show() ``` Produces: ``` 03:20 PM (<1s) ctx = SessionContext() batch = pa.RecordBatch.from_arrays( [pa.array([ {"outer_1": {"inner_1": 1, "inner_2": 2}}, {"outer_1": {"inner_1": 1, "inner_2": None}}, {"outer_1": None}, ])], names=["a"], ) df = ctx.create_dataframe([[batch]]) df.write_parquet("/dbfs/tmp/tsaucer/struct_of_struct.parquet") df.select(col("a")).show() df.select(col("a")["outer_1"]).show() df.select(col("a")["outer_1"]["inner_2"]).show() DataFrame() +-+ | a | +-+ | {outer_1: {inner_1: 1, inner_2: 2}} | | {outer_1: {inner_1: 1, inner_2: }} | | {outer_1: } | +-+ DataFrame() +--+ | cc251bd408f114ca2a4354b6976d91339.a[outer_1] | +--+ | {inner_1: 1, inner_2: 2} | | {inner_1: 1, inner_2: } | | | +--+ DataFrame() +---+ | cc251bd408f114ca2a4354b6976d91339.a[outer_1][inner_2] | +---+ | 2 | | | | 0 | +---+ ``` **Expected behavior** Accessing a subfield of a null entry should also return null. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org