[jira] [Commented] (ARROW-17410) [JS] Archery JS Build Fails
[ https://issues.apache.org/jira/browse/ARROW-17410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579832#comment-17579832 ] Raphael Taylor-Davies commented on ARROW-17410: --- I did some work to simplify the CI so that it can perhaps be more easily reproduced - [https://github.com/apache/arrow-rs/pull/2453/files] Fortunately it appears to be deterministic, but I'm not exactly sure what is causing it... > [JS] Archery JS Build Fails > --- > > Key: ARROW-17410 > URL: https://issues.apache.org/jira/browse/ARROW-17410 > Project: Apache Arrow > Issue Type: Bug > Components: Integration, JavaScript >Reporter: Raphael Taylor-Davies >Priority: Minor > > We are seeing CI failures running the JS integration tests - > [https://github.com/apache/arrow-rs/runs/7824734614?check_suite_focus=true] > In particular > > {code:java} > [07:33:01] Error: gulp-google-closure-compiler: > java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got > 0xb1e0eb5b) > at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410) > at java.util.zip.ZipInputStream.read(ZipInputStream.java:199) > at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143) > at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121) > at > com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500) > at > com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084) > at > com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187) > at > com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551) > at > com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246) > Error writing to stdin of the compiler. write EPIPE {code} > > This appears to be an issue with zlib v1.2.12 > [https://github.com/madler/zlib/issues/613] according to the corresponding > issue on google-closure-compiler - > https://github.com/google/closure-compiler-npm/issues/234 > I'm not sure what the solution here is, but thought I would flag it > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17410) [JS] Archery JS Build Fails
[ https://issues.apache.org/jira/browse/ARROW-17410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579773#comment-17579773 ] Raphael Taylor-Davies commented on ARROW-17410: --- Thank you for looking into this. Perhaps the conda environment is providing a newer zlib than the system version, which the JS build only finds when run within the integration context? > [JS] Archery JS Build Fails > --- > > Key: ARROW-17410 > URL: https://issues.apache.org/jira/browse/ARROW-17410 > Project: Apache Arrow > Issue Type: Bug > Components: Integration, JavaScript >Reporter: Raphael Taylor-Davies >Priority: Minor > > We are seeing CI failures running the JS integration tests - > [https://github.com/apache/arrow-rs/runs/7824734614?check_suite_focus=true] > In particular > > {code:java} > [07:33:01] Error: gulp-google-closure-compiler: > java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got > 0xb1e0eb5b) > at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410) > at java.util.zip.ZipInputStream.read(ZipInputStream.java:199) > at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143) > at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121) > at > com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500) > at > com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084) > at > com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187) > at > com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551) > at > com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246) > Error writing to stdin of the compiler. write EPIPE {code} > > This appears to be an issue with zlib v1.2.12 > [https://github.com/madler/zlib/issues/613] according to the corresponding > issue on google-closure-compiler - > https://github.com/google/closure-compiler-npm/issues/234 > I'm not sure what the solution here is, but thought I would flag it > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17410) Archery JS Build Fails
Raphael Taylor-Davies created ARROW-17410: - Summary: Archery JS Build Fails Key: ARROW-17410 URL: https://issues.apache.org/jira/browse/ARROW-17410 Project: Apache Arrow Issue Type: Bug Reporter: Raphael Taylor-Davies We are seeing CI failures running the JS integration tests - [https://github.com/apache/arrow-rs/runs/7824734614?check_suite_focus=true] In particular {code:java} [07:33:01] Error: gulp-google-closure-compiler: java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got 0xb1e0eb5b) at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410) at java.util.zip.ZipInputStream.read(ZipInputStream.java:199) at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143) at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121) at com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500) at com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084) at com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187) at com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551) at com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246) Error writing to stdin of the compiler. write EPIPE {code} This appears to be an issue with zlib v1.2.12 [https://github.com/madler/zlib/issues/613] according to the corresponding issue on google-closure-compiler - https://github.com/google/closure-compiler-npm/issues/234 I'm not sure what the solution here is, but thought I would flag it -- This message was sent by Atlassian Jira (v8.20.10#820010)
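The stack trace above fails inside ZipInputStream's CRC validation while the closure compiler unpacks its bundled externs, which points at a corrupted or incompatibly-compressed jar rather than at the JS sources themselves. A minimal sketch, assuming a hypothetical jar path, of reproducing the same per-entry CRC check with Python's standard zipfile module:
{code:python}
#!/usr/bin/env python
# Hypothetical check, not part of the CI: read back a closure-compiler jar
# (the path is a placeholder) and force CRC validation of every entry, the
# same check ZipInputStream performs in the stack trace above.
import sys
import zipfile

def check_archive(path: str) -> bool:
    with zipfile.ZipFile(path) as zf:
        # testzip() decompresses every member and returns the name of the
        # first entry whose stored CRC does not match, or None if all pass.
        bad = zf.testzip()
    if bad is not None:
        print(f"CRC mismatch in entry: {bad}")
        return False
    print("all entries OK")
    return True

if __name__ == "__main__":
    sys.exit(0 if check_archive(sys.argv[1]) else 1)
{code}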
[jira] [Comment Edited] (ARROW-16978) [C#] Intermittent Archery Failures
[ https://issues.apache.org/jira/browse/ARROW-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562728#comment-17562728 ] Raphael Taylor-Davies edited comment on ARROW-16978 at 7/5/22 3:57 PM: ---
From the root of an arrow checkout run
```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```
The failure concerns C# producing and C# consuming, so I'm not sure how important the rust-specific part actually is

was (Author: tustvold):
From the root of an arrow checkout run
```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```
> [C#] Intermittent Archery Failures > -- > > Key: ARROW-16978 > URL: https://issues.apache.org/jira/browse/ARROW-16978 > Project: Apache Arrow > Issue Type: Bug >Reporter: Raphael Taylor-Davies >Priority: Minor > > We are seeing intermittent archery failures in arrow-rs - > [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true] > {code:java} > FAILED TEST: datetime C# producing, C# consuming > 1 failures > File "/arrow/dev/archery/archery/integration/runner.py", line 246, in > _run_ipc_test_case > run_binaries(producer, consumer, test_case) > File "/arrow/dev/archery/archery/integration/runner.py", line 100, in > run_gold > return self._run_gold(gold_dir, consumer, test_case) > File "/arrow/dev/archery/archery/integration/runner.py", line 322, in > _run_gold > consumer.stream_to_file(consumer_stream_path, consumer_file_path) > File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in > stream_to_file > self.run_shell_command(cmd) > File "/arrow/dev/archery/archery/integration/tester.py", line 49, in > run_shell_command > subprocess.check_call(cmd, shell=True) > File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in > check_call > raise CalledProcessError(retcode, cmd) > subprocess.CalledProcessError: Command > '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest > --mode stream-to-file -a > /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < > /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream' > returned non-zero exit status 1. {code} > It is possible that this is something to do with how we are running the > archery tests, but I am at a loss as to how to debug this issue and would > appreciate some input. > I think it started around when > [this|https://github.com/apache/arrow/pull/13279] was merged > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-16978) [C#] Intermittent Archery Failures
[ https://issues.apache.org/jira/browse/ARROW-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562728#comment-17562728 ] Raphael Taylor-Davies edited comment on ARROW-16978 at 7/5/22 3:57 PM: ---
From the root of an arrow checkout run
```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```
The failure concerns C# producing and C# consuming, so I'm not sure how important the rust-specific part actually is, as the failing test appears to only be using C#

was (Author: tustvold):
From the root of an arrow checkout run
```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```
The failure concerns C# producing and C# consuming, so I'm not sure how important the rust-specific part actually is
> [C#] Intermittent Archery Failures > -- > > Key: ARROW-16978 > URL: https://issues.apache.org/jira/browse/ARROW-16978 > Project: Apache Arrow > Issue Type: Bug >Reporter: Raphael Taylor-Davies >Priority: Minor > > We are seeing intermittent archery failures in arrow-rs - > [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true] > {code:java} > FAILED TEST: datetime C# producing, C# consuming > 1 failures > File "/arrow/dev/archery/archery/integration/runner.py", line 246, in > _run_ipc_test_case > run_binaries(producer, consumer, test_case) > File "/arrow/dev/archery/archery/integration/runner.py", line 100, in > run_gold > return self._run_gold(gold_dir, consumer, test_case) > File "/arrow/dev/archery/archery/integration/runner.py", line 322, in > _run_gold > consumer.stream_to_file(consumer_stream_path, consumer_file_path) > File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in > stream_to_file > self.run_shell_command(cmd) > File "/arrow/dev/archery/archery/integration/tester.py", line 49, in > run_shell_command > subprocess.check_call(cmd, shell=True) > File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in > check_call > raise CalledProcessError(retcode, cmd) > subprocess.CalledProcessError: Command > '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest > --mode stream-to-file -a > /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < > /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream' > returned non-zero exit status 1. {code} > It is possible that this is something to do with how we are running the > archery tests, but I am at a loss as to how to debug this issue and would > appreciate some input. > I think it started around when > [this|https://github.com/apache/arrow/pull/13279] was merged > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16978) [C#] Intermittent Archery Failures
[ https://issues.apache.org/jira/browse/ARROW-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562728#comment-17562728 ] Raphael Taylor-Davies commented on ARROW-16978: ---
From the root of an arrow checkout run
```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```
> [C#] Intermittent Archery Failures > -- > > Key: ARROW-16978 > URL: https://issues.apache.org/jira/browse/ARROW-16978 > Project: Apache Arrow > Issue Type: Bug >Reporter: Raphael Taylor-Davies >Priority: Minor > > We are seeing intermittent archery failures in arrow-rs - > [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true] > {code:java} > FAILED TEST: datetime C# producing, C# consuming > 1 failures > File "/arrow/dev/archery/archery/integration/runner.py", line 246, in > _run_ipc_test_case > run_binaries(producer, consumer, test_case) > File "/arrow/dev/archery/archery/integration/runner.py", line 100, in > run_gold > return self._run_gold(gold_dir, consumer, test_case) > File "/arrow/dev/archery/archery/integration/runner.py", line 322, in > _run_gold > consumer.stream_to_file(consumer_stream_path, consumer_file_path) > File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in > stream_to_file > self.run_shell_command(cmd) > File "/arrow/dev/archery/archery/integration/tester.py", line 49, in > run_shell_command > subprocess.check_call(cmd, shell=True) > File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in > check_call > raise CalledProcessError(retcode, cmd) > subprocess.CalledProcessError: Command > '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest > --mode stream-to-file -a > /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < > /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream' > returned non-zero exit status 1. {code} > It is possible that this is something to do with how we are running the > archery tests, but I am at a loss as to how to debug this issue and would > appreciate some input. > I think it started around when > [this|https://github.com/apache/arrow/pull/13279] was merged > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16978) [C#] Intermittent Archery Failures
Raphael Taylor-Davies created ARROW-16978: - Summary: [C#] Intermittent Archery Failures Key: ARROW-16978 URL: https://issues.apache.org/jira/browse/ARROW-16978 Project: Apache Arrow Issue Type: Bug Reporter: Raphael Taylor-Davies We are seeing intermittent archery failures in arrow-rs - [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true] {code:java} FAILED TEST: datetime C# producing, C# consuming 1 failures File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case run_binaries(producer, consumer, test_case) File "/arrow/dev/archery/archery/integration/runner.py", line 100, in run_gold return self._run_gold(gold_dir, consumer, test_case) File "/arrow/dev/archery/archery/integration/runner.py", line 322, in _run_gold consumer.stream_to_file(consumer_stream_path, consumer_file_path) File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file self.run_shell_command(cmd) File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command subprocess.check_call(cmd, shell=True) File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream' returned non-zero exit status 1. {code} It is possible that this is something to do with how we are running the archery tests, but I am at a loss as to how to debug this issue and would appreciate some input. I think it started around when [this|https://github.com/apache/arrow/pull/13279] was merged -- This message was sent by Atlassian Jira (v8.20.10#820010)
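The traceback shows the tester invoking the C# integration binary via subprocess.check_call(cmd, shell=True), which surfaces only the exit status, so an intermittent failure leaves little to inspect. A hedged sketch of a variant that also captures the child's output; the behavior around run_shell_command here is an assumption based on the traceback, not archery's actual code:
{code:python}
import subprocess

# Sketch only: archery's tester.py (per the traceback above) runs the
# integration binaries with subprocess.check_call(cmd, shell=True), which
# reports just the exit status. This hypothetical variant captures the
# child's stdout/stderr so an intermittent failure leaves evidence behind.
def run_shell_command(cmd: str) -> None:
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # Print the child's output before raising, so the CI log shows
        # why e.g. Apache.Arrow.IntegrationTest exited non-zero.
        print(result.stdout)
        print(result.stderr)
        raise subprocess.CalledProcessError(result.returncode, cmd)
{code}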
[jira] [Resolved] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raphael Taylor-Davies resolved ARROW-16184. --- Resolution: Not A Bug This was a misunderstanding of the relationship between the embedded arrow schema and the parquet schema > [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet > - > > Key: ARROW-16184 > URL: https://issues.apache.org/jira/browse/ARROW-16184 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Raphael Taylor-Davies >Priority: Minor > > As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the > following code results in the schema changing when reading/writing a parquet > file. > {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} > This was closed as a limitation of the parquet 1.x format for representing > nanosecond timestamps. This is fine, however, the arrow schema embedded > within the parquet metadata still lists the data as being a nanosecond array. > This causes issues depending on which schema the reader opts to "trust". > This was discovered as part of the investigation into a bug report on the > arrow-rs parquet implementation - > [https://github.com/apache/arrow-rs/issues/1459] > Specifically the metadata written is > {code:java} > Schema { > endianness: Little, > fields: Some( > [ > Field { > name: Some( > "created", > ), > nullable: true, > type_type: Timestamp, > type_: Timestamp { > unit: NANOSECOND, > timezone: Some( > "UTC", > ), > }, > dictionary: None, > children: Some( > [], > ), > custom_metadata: None, > }, > ], > ), > custom_metadata: Some( > [ > KeyValue { > key: Some( > "pandas", > ), > value: Some( > "{\"index_columns\": [], \"column_indexes\": [], > \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", > \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", > \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": > \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}", > ), > }, > ], > ), > features: None, > } {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
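Parquet format version 1.x cannot represent nanosecond timestamps, so pyarrow coerces the column to microseconds on write while the embedded arrow schema retains the nanosecond unit. A minimal sketch of one way to keep the two in agreement, assuming a pyarrow recent enough to accept version='2.6' (which adds a nanosecond logical type); this is illustrative, not the resolution adopted in the issue:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'created': pd.to_datetime(['2018-04-04T10:14:14Z'])})
table = pa.Table.from_pandas(df, preserve_index=False)

# Parquet format version 2.6 has a nanosecond timestamp logical type, so
# the physical data can match the nanosecond unit recorded in the embedded
# arrow schema instead of being coerced to microseconds.
pq.write_table(table, 'foo.parquet', version='2.6')

table2 = pq.read_table('foo.parquet')
print(table2.schema[0])  # timestamp[ns, tz=UTC] survives the round-trip
{code}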
[jira] [Commented] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522834#comment-17522834 ] Raphael Taylor-Davies commented on ARROW-16184: --- Do you know if this convention is documented anywhere? This would be a breaking change to the arrow-rs implementation, so it would be good to have something authoritative to reference as justification. That being said, it seems odd to me that the less expressive schema would be treated as the authoritative one - if you can't trust the arrow schema, what is the point in embedding it? > [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet > - > > Key: ARROW-16184 > URL: https://issues.apache.org/jira/browse/ARROW-16184 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Raphael Taylor-Davies >Priority: Minor > > As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the > following code results in the schema changing when reading/writing a parquet > file. > {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} > This was closed as a limitation of the parquet 1.x format for representing > nanosecond timestamps. This is fine, however, the arrow schema embedded > within the parquet metadata still lists the data as being a nanosecond array. > This causes issues depending on which schema the reader opts to "trust". > This was discovered as part of the investigation into a bug report on the > arrow-rs parquet implementation - > [https://github.com/apache/arrow-rs/issues/1459] > Specifically the metadata written is > {code:java} > Schema { > endianness: Little, > fields: Some( > [ > Field { > name: Some( > "created", > ), > nullable: true, > type_type: Timestamp, > type_: Timestamp { > unit: NANOSECOND, > timezone: Some( > "UTC", > ), > }, > dictionary: None, > children: Some( > [], > ), > custom_metadata: None, > }, > ], > ), > custom_metadata: Some( > [ > KeyValue { > key: Some( > "pandas", > ), > value: Some( > "{\"index_columns\": [], \"column_indexes\": [], > \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", > \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", > \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": > \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}", > ), > }, > ], > ), > features: None, > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raphael Taylor-Davies updated ARROW-16184: -- Description: As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the following code results in the schema changing when reading/writing a parquet file. {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} This was closed as a limitation of the parquet 1.x format for representing nanosecond timestamps. This is fine, however, the arrow schema embedded within the parquet metadata still lists the data as being a nanosecond array. This causes issues depending on which schema the reader opts to "trust". This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - [https://github.com/apache/arrow-rs/issues/1459] Specifically the metadata written is {code:java} Schema { endianness: Little, fields: Some( [ Field { name: Some( "created", ), nullable: true, type_type: Timestamp, type_: Timestamp { unit: NANOSECOND, timezone: Some( "UTC", ), }, dictionary: None, children: Some( [], ), custom_metadata: None, }, ], ), custom_metadata: Some( [ KeyValue { key: Some( "pandas", ), value: Some( "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}", ), }, ], ), features: None, } {code} was: As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the following code results in the schema changing when reading/writing a parquet file. {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} This was closed as a limitation of the parquet 1.x format for representing nanosecond timestamps. This is fine, however, the arrow schema embedded within the parquet metadata still lists the data as being a nanosecond array. This causes issues depending on which schema the reader opts to "trust". 
This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - [https://github.com/apache/arrow-rs/issues/1459] > [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet > - > > Key: ARROW-16184 > URL: https://issues.apache.org/jira/browse/ARROW-16184 > Project: Apache Arrow > Issue Type: Bug >Reporter: Raphael Taylor-Davies >Priority: Minor > > As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the > following code results in the schema changing when reading/writing a parquet > file. > {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parque
[jira] [Updated] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raphael Taylor-Davies updated ARROW-16184: -- Component/s: (was: Python) Description: As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the following code results in the schema changing when reading/writing a parquet file. {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} This was closed as a limitation of the parquet 1.x format for representing nanosecond timestamps. This is fine, however, the arrow schema embedded within the parquet metadata still lists the data as being a nanosecond array. This causes issues depending on which schema the reader opts to "trust". This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - [https://github.com/apache/arrow-rs/issues/1459] was: As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the following code results in the schema changing when reading/writing a parquet file. {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} This was closed as a limitation of the parquet 1.x format for representing nanosecond timestamps. This is fine, however, the arrow schema embedded within the parquet metadata still lists the data as being a nanosecond array. This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - [https://github.com/apache/arrow-rs/issues/1459] > [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet > - > > Key: ARROW-16184 > URL: https://issues.apache.org/jira/browse/ARROW-16184 > Project: Apache Arrow > Issue Type: Bug >Reporter: Raphael Taylor-Davies >Priority: Minor > > As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the > following code results in the schema changing when reading/writing a parquet > file. 
> {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} > This was closed as a limitation of the parquet 1.x format for representing > nanosecond timestamps. This is fine, however, the arrow schema embedded > within the parquet metadata still lists the data as being a nanosecond array. > This causes issues depending on which schema the reader opts to "trust". > This was discovered as part of the investigation into a bug report on the > arrow-rs parquet implementation - > [https://github.com/apache/arrow-rs/issues/1459] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raphael Taylor-Davies updated ARROW-16184: -- Description: As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the following code results in the schema changing when reading/writing a parquet file. {code:python} #!/usr/bin/env python import pyarrow as pa import pyarrow.parquet as pq import pandas as pd # create DataFrame with a datetime column df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created']) # create Arrow table from DataFrame table = pa.Table.from_pandas(df, preserve_index=False) # write the table as a parquet file, then read it back again pq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet') print(table.schema[0]) # pyarrow.Field (nanosecond units) print(table2.schema[0]) # pyarrow.Field (microsecond units) {code} This was closed as a limitation of the parquet 1.x format for representing nanosecond timestamps. This is fine, however, the arrow schema embedded within the parquet metadata still lists the data as being a nanosecond array. This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - [https://github.com/apache/arrow-rs/issues/1459] was: As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the following code results in the schema changing when reading/writing a parquet file. #!/usr/bin/env pythonimport pyarrow as paimport pyarrow.parquet as pqimport pandas as pd# create DataFrame with a datetime columndf = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) df['created'] = pd.to_datetime(df['created'])# create Arrow table from DataFrametable = pa.Table.from_pandas(df, preserve_index=False)# write the table as a parquet file, then read it back againpq.write_table(table, 'foo.parquet') table2 = pq.read_table('foo.parquet')print(table.schema[0]) # pyarrow.Field (nanosecond units)print(table2.schema[0]) # pyarrow.Field (microsecond units) This was closed as a limitation of the parquet 1.x format for representing nanosecond timestamps. This is fine, however, the arrow schema embedded within the parquet metadata still lists the data as being a nanosecond array. This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - https://github.com/apache/arrow-rs/issues/1459 > [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet > - > > Key: ARROW-16184 > URL: https://issues.apache.org/jira/browse/ARROW-16184 > Project: Apache Arrow > Issue Type: Bug >Reporter: Raphael Taylor-Davies >Priority: Minor > > As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the > following code results in the schema changing when reading/writing a parquet > file. 
> {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} > This was closed as a limitation of the parquet 1.x format for representing > nanosecond timestamps. This is fine, however, the arrow schema embedded > within the parquet metadata still lists the data as being a nanosecond array. > > This was discovered as part of the investigation into a bug report on the > arrow-rs parquet implementation - > [https://github.com/apache/arrow-rs/issues/1459] > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
Raphael Taylor-Davies created ARROW-16184: - Summary: [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet Key: ARROW-16184 URL: https://issues.apache.org/jira/browse/ARROW-16184 Project: Apache Arrow Issue Type: Bug Reporter: Raphael Taylor-Davies As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the following code results in the schema changing when reading/writing a parquet file.
{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')
print(table.schema[0])   # pyarrow.Field (nanosecond units)
print(table2.schema[0])  # pyarrow.Field (microsecond units)
{code}
This was closed as a limitation of the parquet 1.x format for representing nanosecond timestamps. This is fine, however, the arrow schema embedded within the parquet metadata still lists the data as being a nanosecond array. This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - https://github.com/apache/arrow-rs/issues/1459 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-12504) [Rust] Buffer::from_slice_ref incorrect capacity
Raphael Taylor-Davies created ARROW-12504: - Summary: [Rust] Buffer::from_slice_ref incorrect capacity Key: ARROW-12504 URL: https://issues.apache.org/jira/browse/ARROW-12504 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Buffer::from_slice_ref sets the capacity without taking into account the size of the slice elements -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-12493) [Rust] Support DictionaryArray in CSV and JSON formatters
[ https://issues.apache.org/jira/browse/ARROW-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raphael Taylor-Davies updated ARROW-12493: -- Summary: [Rust] Support DictionaryArray in CSV and JSON formatters (was: Support DictionaryArray in CSV and JSON formatters) > [Rust] Support DictionaryArray in CSV and JSON formatters > - > > Key: ARROW-12493 > URL: https://issues.apache.org/jira/browse/ARROW-12493 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Raphael Taylor-Davies >Assignee: Raphael Taylor-Davies >Priority: Minor > > Currently the CSV and JSON formatters do not support DictionaryArray -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12493) Support DictionaryArray in CSV and JSON formatters
Raphael Taylor-Davies created ARROW-12493: - Summary: Support DictionaryArray in CSV and JSON formatters Key: ARROW-12493 URL: https://issues.apache.org/jira/browse/ARROW-12493 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Currently the CSV and JSON formatters do not support DictionaryArray -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12426) [Rust] Concatenating dictionaries ignores values
Raphael Taylor-Davies created ARROW-12426: - Summary: [Rust] Concatenating dictionaries ignores values Key: ARROW-12426 URL: https://issues.apache.org/jira/browse/ARROW-12426 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies Concatenating dictionaries ignores the values array; at best this leads to incorrect data, but often it leads to keys with indexes beyond the bounds of the values array -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays
Raphael Taylor-Davies created ARROW-12425: - Summary: [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays Key: ARROW-12425 URL: https://issues.apache.org/jira/browse/ARROW-12425 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12400) [Rust] Re-enable transform module tests
Raphael Taylor-Davies created ARROW-12400: - Summary: [Rust] Re-enable transform module tests Key: ARROW-12400 URL: https://issues.apache.org/jira/browse/ARROW-12400 Project: Apache Arrow Issue Type: Improvement Reporter: Raphael Taylor-Davies Assignee: Raphael Taylor-Davies The tests in the root of the array/transform module are currently commented out; this appears to have been done as part of moving from Arc<ArrayData> to ArrayData. -- This message was sent by Atlassian Jira (v8.3.4#803005)