[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-11 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17427218#comment-17427218
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

bq. 2.  Make it possible to select columns by eliding the list name components.

Currently, I think the C++ API only deals with column indices? (at least for 
the Python bindings, the translation of column names to field indices happens 
in Python) For Python that should be relatively straightforward to implement. 
Opened ARROW-14286 for this. 

bq. If so, I'd have to support both naming conventions because both would exist 
in the wild.

[~jpivarski] yes, but that's already the case right now as well. Parquet files 
written by (py)arrow will use a different name for the list element compared to 
parquet files written by other tools (that's actually what we are trying to 
harmonize). So if you select a subfield of a list field by name, you already 
need to take into account potentially different names at the moment. 

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 6.0.0
>
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-08 Thread Jim Pivarski (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426167#comment-17426167
 ] 

Jim Pivarski commented on ARROW-14196:
--

If it's changing a default, I can just set the option explicitly, right?

If it's changing how columns are named in the file (currently 
"fieldA.list.fieldB.list", etc.), then that would require adjustment on my 
side, but it would be an adjustment that depends on how the file was written, 
not the pyarrow version, right? If so, I'd have to support both naming 
conventions because both would exist in the wild.

If there is a change that Awkward Array has to adjust to, then now may be a 
good time to do it because we're going to be rewriting the "from_parquet" 
function soon, as part of integrating with Dask

https://github.com/ContinuumIO/dask-awkward

adopting fsspec, and using pyarrow's Dataset API (to replace our manual 
implementation). If there's something that we can set to get the new behavior, 
we can turn that on now while developing the new version and be ready for the 
change.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425938#comment-17425938
 ] 

Micah Kornfield commented on ARROW-14196:
-

Also CC [~jpivarski] on thoughts on how this might impact awkward arrays and 
ways to mitigate.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425937#comment-17425937
 ] 

Micah Kornfield commented on ARROW-14196:
-

I think similar to other default changes we have discussed, a very public 
advertisement of the intended change and consequences would be good.  At some 
point we should just bite the bullet.  I think with the first feature (take a 
compliant nested type and rename it on read) it would give users a temporary 
fallback.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425933#comment-17425933
 ] 

Micah Kornfield commented on ARROW-14196:
-

Yeah, I think it is really only the name change that I'm concerned about.  
https://issues.apache.org/jira/browse/ARROW-13151 has another example where 
people where trying to reference things by path that was broken for other 
reasons.

 

A few item that don't really solve all cases but would make things better or at 
least adaptable to the long term:

1.  Add an option that translates "compliant nest type" name back to the 
arrow's naming schema.

2.  Make it possible to select columns by eliding the list name components.

 

Another question that is dataset specific, is if one file was written with 
compliant nested types and one was not, and both where read in the same dataset 
are the results sane (schema's get unified?)

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-02 Thread Truc Lam Nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423486#comment-17423486
 ] 

Truc Lam Nguyen commented on ARROW-14196:
-

[~judah.rand] sorry that I don't really understand your comment, could you 
please explain a little bit more? thanks

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423328#comment-17423328
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

> I'm also surprised that a non-existing column name wouldn't return an error 
> instead of selecting nothing?

With the new datasets API, it actually raises an error if a column name is not 
found. With the legacy implementation (or the plain {{ParquetFile}} interface), 
it ignores those. 
(and I was using use_legacy_dataset=True because with Datasets we don't yet 
support selecting nested fields ..)

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423323#comment-17423323
 ] 

Antoine Pitrou commented on ARROW-14196:


Is it really common to read only a list's child column without the list column 
itself? Otherwise we can live with the minor compatibility break, IMHO.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423324#comment-17423324
 ] 

Antoine Pitrou commented on ARROW-14196:


I'm also surprised that a non-existing column name wouldn't return an error 
instead of selecting nothing?

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423310#comment-17423310
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

I wrote this with the latest pyarrow (master):

{code:python}
table= pa.table({'a': [[1, 2], [3, 4, 5]]})
pq.write_table(table, "test_nested_noncompliant.parquet")
pq.write_table(table, "test_nested_compliant.parquet", 
use_compliant_nested_type=True)
{code}

In the latest pyarrow they both read fine, but so have different names (which 
can impact eg nested field refs):

{code:python}
>>> pq.read_table("test_nested_noncompliant.parquet")
pyarrow.Table
a: list
  child 0, item: int64

>>> pq.read_table("test_nested_compliant.parquet")
pyarrow.Table
a: list
  child 0, element: int64
{code}

So eg doing {{pq.read_table("test_nested_noncompliant.parquet", 
columns=["a.list.item"], use_legacy_dataset=True)}} for works the noncompliant 
file, but doesn't select anything with the compliant file.

Those files also read fine (and result in the same difference in list field 
names) with older versions of Arrow (tested down to Arrow 1.0). 

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423307#comment-17423307
 ] 

Antoine Pitrou commented on ARROW-14196:


One possible investigation would be to write two different files, one with the 
option turned on, the other with the option turned off, and find out whether 
they are readable by 1) current version of Arrow 2) past versions of Arrow.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423302#comment-17423302
 ] 

Joris Van den Bossche commented on ARROW-14196:
---

Some relevant quotes from the ARROW-11497:

> [Micah] I think the main reason it isn't enabled by default is it breaks 
> round trips for arrow data.  This could potentially be fixed on the reader 
> side as well.  

>> [Antoine] Perhaps we could convert the field name at the Arrow<->Parquet 
>> boundary.
> [Micah] This should be possible but it potentially needs another flag.  I 
> think in the short term plumbing the additional flag through to python makes 
> sense and we can figure out a longer term solution if this becomes a larger 
> problem.

>> [Antoine] It should simply be the default (and obviously right) behaviour. 
>> Am I missing something?
> [Micah] Backwards compatibility?  It might be possible to make some 
> inferences (haven't thought about it deeply).  But I think if we were reading 
> a conforming java produced parquet file then we would get different column 
> names if we transformed on the border (maybe there can be some rules around 
> Arrow metadata being present).  I think we can make the default to be 
> conforming behavior, but we should give users some level of control to 
> preserve the old behavior.

---

I am not super familiar, but so the simplest option is to just switch the 
default of the {{compliant_nested_types}} option in ArrowWriterProperties. What 
would be the (possible backwards incompatible) consequences of that?  
We would start writing a different Parquet file (but actually following the 
spec). But I suppose that also when reading such a file, you would then get a 
different name for the sub-lists (which can impact selecting a sublist with a 
nested field reference?) 
To avoid having a breaking change on the read path, we could by default also 
convert the names at the Parquet->Arrow boundary (like the 
{{compliant_nested_types}}  option already does on the Arrow->Parquet 
boundary). However, doing that can _also_ break code for people currently 
already reading compliant parquet files ..

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-01 Thread Judah (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423299#comment-17423299
 ] 

Judah commented on ARROW-14196:
---

[~trucnguyenlam] Is this something that could be flipped now?

https://github.com/pandas-dev/pandas/pull/43690#pullrequestreview-760364590

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)