[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427218#comment-17427218 ] Joris Van den Bossche commented on ARROW-14196: --- bq. 2. Make it possible to select columns by eliding the list name components. Currently, I think the C++ API only deals with column indices? (at least for the Python bindings, the translation of column names to field indices happens in Python) For Python that should be relatively straightforward to implement. Opened ARROW-14286 for this. bq. If so, I'd have to support both naming conventions because both would exist in the wild. [~jpivarski] yes, but that's already the case right now as well. Parquet files written by (py)arrow will use a different name for the list element compared to parquet files written by other tools (that's actually what we are trying to harmonize). So if you select a subfield of a list field by name, you already need to take into account potentially different names at the moment. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 6.0.0 > > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426167#comment-17426167 ] Jim Pivarski commented on ARROW-14196: -- If it's changing a default, I can just set the option explicitly, right? If it's changing how columns are named in the file (currently "fieldA.list.fieldB.list", etc.), then that would require adjustment on my side, but it would be an adjustment that depends on how the file was written, not the pyarrow version, right? If so, I'd have to support both naming conventions because both would exist in the wild. If there is a change that Awkward Array has to adjust to, then now may be a good time to do it because we're going to be rewriting the "from_parquet" function soon, as part of integrating with Dask https://github.com/ContinuumIO/dask-awkward adopting fsspec, and using pyarrow's Dataset API (to replace our manual implementation). If there's something that we can set to get the new behavior, we can turn that on now while developing the new version and be ready for the change. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425938#comment-17425938 ] Micah Kornfield commented on ARROW-14196: - Also CC [~jpivarski] on thoughts on how this might impact awkward arrays and ways to mitigate. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425937#comment-17425937 ] Micah Kornfield commented on ARROW-14196: - I think similar to other default changes we have discussed, a very public advertisement of the intended change and consequences would be good. At some point we should just bite the bullet. I think with the first feature (take a compliant nested type and rename it on read) it would give users a temporary fallback. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425933#comment-17425933 ] Micah Kornfield commented on ARROW-14196: - Yeah, I think it is really only the name change that I'm concerned about. https://issues.apache.org/jira/browse/ARROW-13151 has another example where people where trying to reference things by path that was broken for other reasons. A few item that don't really solve all cases but would make things better or at least adaptable to the long term: 1. Add an option that translates "compliant nest type" name back to the arrow's naming schema. 2. Make it possible to select columns by eliding the list name components. Another question that is dataset specific, is if one file was written with compliant nested types and one was not, and both where read in the same dataset are the results sane (schema's get unified?) > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423486#comment-17423486 ] Truc Lam Nguyen commented on ARROW-14196: - [~judah.rand] sorry that I don't really understand your comment, could you please explain a little bit more? thanks > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423328#comment-17423328 ] Joris Van den Bossche commented on ARROW-14196: --- > I'm also surprised that a non-existing column name wouldn't return an error > instead of selecting nothing? With the new datasets API, it actually raises an error if a column name is not found. With the legacy implementation (or the plain {{ParquetFile}} interface), it ignores those. (and I was using use_legacy_dataset=True because with Datasets we don't yet support selecting nested fields ..) > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423323#comment-17423323 ] Antoine Pitrou commented on ARROW-14196: Is it really common to read only a list's child column without the list column itself? Otherwise we can live with the minor compatibility break, IMHO. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423324#comment-17423324 ] Antoine Pitrou commented on ARROW-14196: I'm also surprised that a non-existing column name wouldn't return an error instead of selecting nothing? > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423310#comment-17423310 ] Joris Van den Bossche commented on ARROW-14196: --- I wrote this with the latest pyarrow (master): {code:python} table= pa.table({'a': [[1, 2], [3, 4, 5]]}) pq.write_table(table, "test_nested_noncompliant.parquet") pq.write_table(table, "test_nested_compliant.parquet", use_compliant_nested_type=True) {code} In the latest pyarrow they both read fine, but so have different names (which can impact eg nested field refs): {code:python} >>> pq.read_table("test_nested_noncompliant.parquet") pyarrow.Table a: list child 0, item: int64 >>> pq.read_table("test_nested_compliant.parquet") pyarrow.Table a: list child 0, element: int64 {code} So eg doing {{pq.read_table("test_nested_noncompliant.parquet", columns=["a.list.item"], use_legacy_dataset=True)}} for works the noncompliant file, but doesn't select anything with the compliant file. Those files also read fine (and result in the same difference in list field names) with older versions of Arrow (tested down to Arrow 1.0). > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423307#comment-17423307 ] Antoine Pitrou commented on ARROW-14196: One possible investigation would be to write two different files, one with the option turned on, the other with the option turned off, and find out whether they are readable by 1) current version of Arrow 2) past versions of Arrow. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423302#comment-17423302 ] Joris Van den Bossche commented on ARROW-14196: --- Some relevant quotes from the ARROW-11497: > [Micah] I think the main reason it isn't enabled by default is it breaks > round trips for arrow data. This could potentially be fixed on the reader > side as well. >> [Antoine] Perhaps we could convert the field name at the Arrow<->Parquet >> boundary. > [Micah] This should be possible but it potentially needs another flag. I > think in the short term plumbing the additional flag through to python makes > sense and we can figure out a longer term solution if this becomes a larger > problem. >> [Antoine] It should simply be the default (and obviously right) behaviour. >> Am I missing something? > [Micah] Backwards compatibility? It might be possible to make some > inferences (haven't thought about it deeply). But I think if we were reading > a conforming java produced parquet file then we would get different column > names if we transformed on the border (maybe there can be some rules around > Arrow metadata being present). I think we can make the default to be > conforming behavior, but we should give users some level of control to > preserve the old behavior. --- I am not super familiar, but so the simplest option is to just switch the default of the {{compliant_nested_types}} option in ArrowWriterProperties. What would be the (possible backwards incompatible) consequences of that? We would start writing a different Parquet file (but actually following the spec). But I suppose that also when reading such a file, you would then get a different name for the sub-lists (which can impact selecting a sublist with a nested field reference?) To avoid having a breaking change on the read path, we could by default also convert the names at the Parquet->Arrow boundary (like the {{compliant_nested_types}} option already does on the Arrow->Parquet boundary). However, doing that can _also_ break code for people currently already reading compliant parquet files .. > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer
[ https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423299#comment-17423299 ] Judah commented on ARROW-14196: --- [~trucnguyenlam] Is this something that could be flipped now? https://github.com/pandas-dev/pandas/pull/43690#pullrequestreview-760364590 > [C++][Parquet] Default to compliant nested types in Parquet writer > -- > > Key: ARROW-14196 > URL: https://issues.apache.org/jira/browse/ARROW-14196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Parquet >Reporter: Joris Van den Bossche >Priority: Major > > In C++ there is already an option to get the "compliant_nested_types" (to > have the list columns follow the Parquet specification), and ARROW-11497 > exposed this option in Python. > This is still set to False by default, but in the source it says "TODO: At > some point we should flip this.", and in ARROW-11497 there was also some > discussion about what it would take to change the default. > cc [~emkornfield] [~apitrou] -- This message was sent by Atlassian Jira (v8.3.4#803005)