[jira] [Commented] (ARROW-17410) [JS] Archery JS Build Fails

2022-08-15 Thread Raphael Taylor-Davies (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579832#comment-17579832
 ] 

Raphael Taylor-Davies commented on ARROW-17410:
---

I did some work to simplify the CI so that the failure can perhaps be more easily 
reproduced - [https://github.com/apache/arrow-rs/pull/2453/files]

 

Fortunately it appears to be deterministic, but I'm not exactly sure what is 
causing it...

> [JS] Archery JS Build Fails
> ---
>
> Key: ARROW-17410
> URL: https://issues.apache.org/jira/browse/ARROW-17410
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Integration, JavaScript
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> We are seeing CI failures running the JS integration tests - 
> [https://github.com/apache/arrow-rs/runs/7824734614?check_suite_focus=true]
> In particular
>  
> {code:java}
> [07:33:01] Error: gulp-google-closure-compiler: 
> java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got 
> 0xb1e0eb5b)
>   at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410)
>   at java.util.zip.ZipInputStream.read(ZipInputStream.java:199)
>   at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143)
>   at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121)
>   at 
> com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500)
>   at 
> com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084)
>   at 
> com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187)
>   at 
> com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551)
>   at 
> com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246)
> Error writing to stdin of the compiler. write EPIPE {code}
>  
> This appears to be an issue with zlib v1.2.12 
> [https://github.com/madler/zlib/issues/613] according to the corresponding 
> issue on google-closure-compiler - 
> https://github.com/google/closure-compiler-npm/issues/234
> I'm not sure what the solution is here, but thought I would flag it
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17410) [JS] Archery JS Build Fails

2022-08-15 Thread Raphael Taylor-Davies (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579773#comment-17579773
 ] 

Raphael Taylor-Davies commented on ARROW-17410:
---

Thank you for looking into this. Perhaps the conda environment is providing a 
newer zlib than the system version, which the JS build only finds when run 
within the integration context?
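
One way to test this hypothesis - a minimal sketch, assuming a Linux container 
where conda's lib directory shadows the system one on the loader path - is to 
ask a process in that environment which zlib it resolves (the JVM may link 
zlib differently, so this is only indicative):

{code:python}
import ctypes

# Load whichever libz the dynamic loader resolves first (conda's, if its
# lib directory shadows the system one) and print its version string.
libz = ctypes.CDLL("libz.so.1")
libz.zlibVersion.restype = ctypes.c_char_p
print(libz.zlibVersion().decode())  # e.g. "1.2.12"
{code}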

> [JS] Archery JS Build Fails
> ---
>
> Key: ARROW-17410
> URL: https://issues.apache.org/jira/browse/ARROW-17410
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Integration, JavaScript
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> We are seeing CI failures running the JS integration tests - 
> [https://github.com/apache/arrow-rs/runs/7824734614?check_suite_focus=true]
> In particular
>  
> {code:java}
> [07:33:01] Error: gulp-google-closure-compiler: 
> java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got 
> 0xb1e0eb5b)
>   at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410)
>   at java.util.zip.ZipInputStream.read(ZipInputStream.java:199)
>   at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143)
>   at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121)
>   at 
> com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500)
>   at 
> com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084)
>   at 
> com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187)
>   at 
> com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551)
>   at 
> com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246)
> Error writing to stdin of the compiler. write EPIPE {code}
>  
> This appears to be an issue with zlib v1.2.12 
> [https://github.com/madler/zlib/issues/613] according to the corresponding 
> issue on google-closure-compiler - 
> https://github.com/google/closure-compiler-npm/issues/234
> I'm not sure what the solution is here, but thought I would flag it
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17410) Archery JS Build Fails

2022-08-15 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-17410:
-

 Summary: Archery JS Build Fails
 Key: ARROW-17410
 URL: https://issues.apache.org/jira/browse/ARROW-17410
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Raphael Taylor-Davies


We are seeing CI failures running the JS integration tests - 
[https://github.com/apache/arrow-rs/runs/7824734614?check_suite_focus=true]

In particular

 
{code:java}
[07:33:01] Error: gulp-google-closure-compiler: java.util.zip.ZipException: 
invalid entry CRC (expected 0x4e1f14a4 but got 0xb1e0eb5b)
at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410)
at java.util.zip.ZipInputStream.read(ZipInputStream.java:199)
at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143)
at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121)
at 
com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500)
at 
com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084)
at 
com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187)
at 
com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551)
at 
com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246)
Error writing to stdin of the compiler. write EPIPE {code}
 

This appears to be an issue with zlib v1.2.12 
[https://github.com/madler/zlib/issues/613] according to the corresponding 
issue on google-closure-compiler - 
https://github.com/google/closure-compiler-npm/issues/234

I'm not sure what the solution is here, but thought I would flag it
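
As a first check that this is the decompression path misbehaving rather than a 
genuinely corrupt archive, something like the following sketch (the jar path is 
hypothetical) recomputes every entry's CRC the same way ZipInputStream does:

{code:python}
import zipfile

# zipfile re-reads each entry and verifies its stored CRC, mirroring the
# check that fails inside getBuiltinExterns. testzip() returns the name of
# the first entry whose CRC does not match, or None if the archive is clean.
with zipfile.ZipFile("closure-compiler.jar") as jar:
    print("first corrupt entry:", jar.testzip())
{code}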

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-16978) [C#] Intermittent Archery Failures

2022-07-05 Thread Raphael Taylor-Davies (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562728#comment-17562728
 ] 

Raphael Taylor-Davies edited comment on ARROW-16978 at 7/5/22 3:57 PM:
---

From the root of an arrow checkout, run

```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```

The failure concerns C# producing and C# consuming, so I'm not sure how 
important the rust-specific part actually is


was (Author: tustvold):
From the root of an arrow checkout, run

```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```

> [C#] Intermittent Archery Failures
> --
>
> Key: ARROW-16978
> URL: https://issues.apache.org/jira/browse/ARROW-16978
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> We are seeing intermittent archery failures in arrow-rs - 
> [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true]
> {code:java}
> FAILED TEST: datetime C# producing,  C# consuming
> 1 failures
>   File "/arrow/dev/archery/archery/integration/runner.py", line 246, in 
> _run_ipc_test_case
> run_binaries(producer, consumer, test_case)
>   File "/arrow/dev/archery/archery/integration/runner.py", line 100, in 
> run_gold
> return self._run_gold(gold_dir, consumer, test_case)
>   File "/arrow/dev/archery/archery/integration/runner.py", line 322, in 
> _run_gold
> consumer.stream_to_file(consumer_stream_path, consumer_file_path)
>   File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in 
> stream_to_file
> self.run_shell_command(cmd)
>   File "/arrow/dev/archery/archery/integration/tester.py", line 49, in 
> run_shell_command
> subprocess.check_call(cmd, shell=True)
>   File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in 
> check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command 
> '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest
>  --mode stream-to-file -a 
> /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < 
> /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream'
>  returned non-zero exit status 1. {code}
> It is possible that this is something to do with how we are running the 
> archery tests, but I am at a loss as to how to debug this issue and would 
> appreciate some input.
> I think it started around when 
> [this|https://github.com/apache/arrow/pull/13279] was merged
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-16978) [C#] Intermittent Archery Failures

2022-07-05 Thread Raphael Taylor-Davies (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562728#comment-17562728
 ] 

Raphael Taylor-Davies edited comment on ARROW-16978 at 7/5/22 3:57 PM:
---

From the root of an arrow checkout, run

```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```

The failure concerns C# producing and C# consuming, so I'm not sure how 
important the rust-specific part actually is, as the failing test appears to 
only be using C#.


was (Author: tustvold):
From the root of an arrow checkout, run

```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```

The failure concerns C# producing and C# consuming, so I'm not sure how 
important the rust-specific part actually is

> [C#] Intermittent Archery Failures
> --
>
> Key: ARROW-16978
> URL: https://issues.apache.org/jira/browse/ARROW-16978
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> We are seeing intermittent archery failures in arrow-rs - 
> [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true]
> {code:java}
> FAILED TEST: datetime C# producing,  C# consuming
> 1 failures
>   File "/arrow/dev/archery/archery/integration/runner.py", line 246, in 
> _run_ipc_test_case
> run_binaries(producer, consumer, test_case)
>   File "/arrow/dev/archery/archery/integration/runner.py", line 100, in 
> run_gold
> return self._run_gold(gold_dir, consumer, test_case)
>   File "/arrow/dev/archery/archery/integration/runner.py", line 322, in 
> _run_gold
> consumer.stream_to_file(consumer_stream_path, consumer_file_path)
>   File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in 
> stream_to_file
> self.run_shell_command(cmd)
>   File "/arrow/dev/archery/archery/integration/tester.py", line 49, in 
> run_shell_command
> subprocess.check_call(cmd, shell=True)
>   File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in 
> check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command 
> '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest
>  --mode stream-to-file -a 
> /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < 
> /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream'
>  returned non-zero exit status 1. {code}
> It is possible that this is something to do with how we are running the 
> archery tests, but I am at a loss as to how to debug this issue and would 
> appreciate some input.
> I think it started around when 
> [this|https://github.com/apache/arrow/pull/13279] was merged
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16978) [C#] Intermittent Archery Failures

2022-07-05 Thread Raphael Taylor-Davies (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562728#comment-17562728
 ] 

Raphael Taylor-Davies commented on ARROW-16978:
---

From the root of an arrow checkout, run

```
git clone https://github.com/apache/arrow-rs.git rust
archery docker run -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration
```

> [C#] Intermittent Archery Failures
> --
>
> Key: ARROW-16978
> URL: https://issues.apache.org/jira/browse/ARROW-16978
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> We are seeing intermittent archery failures in arrow-rs - 
> [here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true]
> {code:java}
> FAILED TEST: datetime C# producing,  C# consuming
> 1 failures
>   File "/arrow/dev/archery/archery/integration/runner.py", line 246, in 
> _run_ipc_test_case
> run_binaries(producer, consumer, test_case)
>   File "/arrow/dev/archery/archery/integration/runner.py", line 100, in 
> run_gold
> return self._run_gold(gold_dir, consumer, test_case)
>   File "/arrow/dev/archery/archery/integration/runner.py", line 322, in 
> _run_gold
> consumer.stream_to_file(consumer_stream_path, consumer_file_path)
>   File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in 
> stream_to_file
> self.run_shell_command(cmd)
>   File "/arrow/dev/archery/archery/integration/tester.py", line 49, in 
> run_shell_command
> subprocess.check_call(cmd, shell=True)
>   File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in 
> check_call
> raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command 
> '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest
>  --mode stream-to-file -a 
> /tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < 
> /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream'
>  returned non-zero exit status 1. {code}
> It is possible that this is something to do with how we are running the 
> archery tests, but I am at a loss as to how to debug this issue and would 
> appreciate some input.
> I think it started around when 
> [this|https://github.com/apache/arrow/pull/13279] was merged
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16978) [C#] Intermittent Archery Failures

2022-07-05 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-16978:
-

 Summary: [C#] Intermittent Archery Failures
 Key: ARROW-16978
 URL: https://issues.apache.org/jira/browse/ARROW-16978
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Raphael Taylor-Davies


We are seeing intermittent archery failures in arrow-rs - 
[here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true]
{code:java}
FAILED TEST: datetime C# producing,  C# consuming
1 failures
  File "/arrow/dev/archery/archery/integration/runner.py", line 246, in 
_run_ipc_test_case
run_binaries(producer, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 100, in run_gold
return self._run_gold(gold_dir, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 322, in 
_run_gold
consumer.stream_to_file(consumer_stream_path, consumer_file_path)
  File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in 
stream_to_file
self.run_shell_command(cmd)
  File "/arrow/dev/archery/archery/integration/tester.py", line 49, in 
run_shell_command
subprocess.check_call(cmd, shell=True)
  File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in 
check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 
'/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest
 --mode stream-to-file -a 
/tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < 
/arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream'
 returned non-zero exit status 1. {code}
It is possible that this is something to do with how we are running the archery 
tests, but I am at a loss as to how to debug this issue and would appreciate 
some input.
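
One way to get more signal than the CalledProcessError above is to re-run the 
failing command by hand inside the conda-integration container and capture the 
C# binary's stderr directly - a sketch, with the output path made up:

{code:python}
import subprocess

# Re-run the failing stream-to-file conversion and capture its output,
# rather than letting it interleave with the rest of the archery log.
# The -a output path here is illustrative; the other paths are copied
# from the failure message above.
cmd = (
    "/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest"
    " --mode stream-to-file -a /tmp/datetime.arrow"
    " < /arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream"
)
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.returncode)
print(result.stderr)
{code}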

I think it started around when 
[this|https://github.com/apache/arrow/pull/13279] was merged

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

2022-06-01 Thread Raphael Taylor-Davies (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raphael Taylor-Davies resolved ARROW-16184.
---
Resolution: Not A Bug

This was a misunderstanding of the relationship between the embedded arrow 
schema and the parquet schema.
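
For anyone hitting the same surprise: the writer coerces nanoseconds to 
microseconds only for the older parquet format versions, so - a sketch, 
assuming a pyarrow recent enough to accept version='2.6' - the units can be 
made to round-trip:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'created': pd.to_datetime(['2018-04-04T10:14:14Z'])})
table = pa.Table.from_pandas(df, preserve_index=False)

# Parquet format 2.6 added a nanosecond timestamp logical type, so the
# parquet schema can match the embedded arrow schema instead of the data
# being coerced to microseconds on write.
pq.write_table(table, 'foo.parquet', version='2.6')
print(pq.read_table('foo.parquet').schema[0])  # stays timestamp[ns, tz=UTC]
{code}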

> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
> following code results in the schema changing when reading/writing a parquet 
> file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field (nanosecond 
> units)
> print(table2.schema[0]) # pyarrow.Field (microsecond 
> units)
> {code}
> This was closed as a limitation of the parquet 1.x format for representing 
> nanosecond timestamps. This is fine; however, the arrow schema embedded 
> within the parquet metadata still lists the data as being a nanosecond array. 
> This causes issues depending on which schema the reader opts to "trust".
> This was discovered as part of the investigation into a bug report on the 
> arrow-rs parquet implementation - 
> [https://github.com/apache/arrow-rs/issues/1459]
> Specifically the metadata written is
> {code:java}
> Schema {
>     endianness: Little,
>     fields: Some(
>         [
>             Field {
>                 name: Some(
>                     "created",
>                 ),
>                 nullable: true,
>                 type_type: Timestamp,
>                 type_: Timestamp {
>                     unit: NANOSECOND,
>                     timezone: Some(
>                         "UTC",
>                     ),
>                 },
>                 dictionary: None,
>                 children: Some(
>                     [],
>                 ),
>                 custom_metadata: None,
>             },
>         ],
>     ),
>     custom_metadata: Some(
>         [
>             KeyValue {
>                 key: Some(
>                     "pandas",
>                 ),
>                 value: Some(
>                     "{\"index_columns\": [], \"column_indexes\": [], 
> \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", 
> \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", 
> \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": 
> \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
>                 ),
>             },
>         ],
>     ),
>     features: None,
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

2022-04-15 Thread Raphael Taylor-Davies (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522834#comment-17522834
 ] 

Raphael Taylor-Davies commented on ARROW-16184:
---

Do you know if this convention is documented anywhere? This would be a breaking 
change to the arrow-rs implementation, so it would be good to have something 
authoritative to reference as justification. That being said, it seems odd to me 
that the less expressive schema would be treated as the authoritative one - if 
you can't trust the arrow schema, what is the point in embedding it?

> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
> following code results in the schema changing when reading/writing a parquet 
> file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field (nanosecond 
> units)
> print(table2.schema[0]) # pyarrow.Field (microsecond 
> units)
> {code}
> This was closed as a limitation of the parquet 1.x format for representing 
> nanosecond timestamps. This is fine; however, the arrow schema embedded 
> within the parquet metadata still lists the data as being a nanosecond array. 
> This causes issues depending on which schema the reader opts to "trust".
> This was discovered as part of the investigation into a bug report on the 
> arrow-rs parquet implementation - 
> [https://github.com/apache/arrow-rs/issues/1459]
> Specifically the metadata written is
> {code:java}
> Schema {
>     endianness: Little,
>     fields: Some(
>         [
>             Field {
>                 name: Some(
>                     "created",
>                 ),
>                 nullable: true,
>                 type_type: Timestamp,
>                 type_: Timestamp {
>                     unit: NANOSECOND,
>                     timezone: Some(
>                         "UTC",
>                     ),
>                 },
>                 dictionary: None,
>                 children: Some(
>                     [],
>                 ),
>                 custom_metadata: None,
>             },
>         ],
>     ),
>     custom_metadata: Some(
>         [
>             KeyValue {
>                 key: Some(
>                     "pandas",
>                 ),
>                 value: Some(
>                     "{\"index_columns\": [], \"column_indexes\": [], 
> \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", 
> \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", 
> \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": 
> \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
>                 ),
>             },
>         ],
>     ),
>     features: None,
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

2022-04-13 Thread Raphael Taylor-Davies (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raphael Taylor-Davies updated ARROW-16184:
--
Description: 
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
following code results in the schema changing when reading/writing a parquet 
file.
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field (nanosecond 
units)
print(table2.schema[0]) # pyarrow.Field (microsecond 
units)
{code}
This was closed as a limitation of the parquet 1.x format for representing 
nanosecond timestamps. This is fine; however, the arrow schema embedded within 
the parquet metadata still lists the data as being a nanosecond array. This 
causes issues depending on which schema the reader opts to "trust".

This was discovered as part of the investigation into a bug report on the 
arrow-rs parquet implementation - 
[https://github.com/apache/arrow-rs/issues/1459]

Specifically the metadata written is
{code:java}
Schema {
    endianness: Little,
    fields: Some(
        [
            Field {
                name: Some(
                    "created",
                ),
                nullable: true,
                type_type: Timestamp,
                type_: Timestamp {
                    unit: NANOSECOND,
                    timezone: Some(
                        "UTC",
                    ),
                },
                dictionary: None,
                children: Some(
                    [],
                ),
                custom_metadata: None,
            },
        ],
    ),
    custom_metadata: Some(
        [
            KeyValue {
                key: Some(
                    "pandas",
                ),
                value: Some(
                    "{\"index_columns\": [], \"column_indexes\": [], 
\"columns\": [{\"name\": \"created\", \"field_name\": \"created\", 
\"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", 
\"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": 
\"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
                ),
            },
        ],
    ),
    features: None,
} {code}

  was:
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
following code results in the schema changing when reading/writing a parquet 
file.
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field (nanosecond 
units)
print(table2.schema[0]) # pyarrow.Field (microsecond 
units)
{code}
This was closed as a limitation of the parquet 1.x format for representing 
nanosecond timestamps. This is fine; however, the arrow schema embedded within 
the parquet metadata still lists the data as being a nanosecond array. This 
causes issues depending on which schema the reader opts to "trust".

This was discovered as part of the investigation into a bug report on the 
arrow-rs parquet implementation - 
[https://github.com/apache/arrow-rs/issues/1459]


> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
> following code results in the schema changing when reading/writing a parquet 
> file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parque

[jira] [Updated] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

2022-04-13 Thread Raphael Taylor-Davies (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raphael Taylor-Davies updated ARROW-16184:
--
Component/s: (was: Python)
Description: 
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
following code results in the schema changing when reading/writing a parquet 
file.
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field (nanosecond 
units)
print(table2.schema[0]) # pyarrow.Field (microsecond 
units)
{code}
This was closed as a limitation of the parquet 1.x format for representing 
nanosecond timestamps. This is fine; however, the arrow schema embedded within 
the parquet metadata still lists the data as being a nanosecond array. This 
causes issues depending on which schema the reader opts to "trust".

This was discovered as part of the investigation into a bug report on the 
arrow-rs parquet implementation - 
[https://github.com/apache/arrow-rs/issues/1459]

  was:
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
following code results in the schema changing when reading/writing a parquet 
file.

{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field (nanosecond 
units)
print(table2.schema[0]) # pyarrow.Field (microsecond 
units)
{code}


This was closed as a limitation of the parquet 1.x format for representing 
nanosecond timestamps. This is fine; however, the arrow schema embedded within 
the parquet metadata still lists the data as being a nanosecond array.

 

This was discovered as part of the investigation into a bug report on the 
arrow-rs parquet implementation - 
[https://github.com/apache/arrow-rs/issues/1459]

 

 


> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
> following code results in the schema changing when reading/writing a parquet 
> file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field (nanosecond 
> units)
> print(table2.schema[0]) # pyarrow.Field (microsecond 
> units)
> {code}
> This was closed as a limitation of the parquet 1.x format for representing 
> nanosecond timestamps. This is fine; however, the arrow schema embedded 
> within the parquet metadata still lists the data as being a nanosecond array. 
> This causes issues depending on which schema the reader opts to "trust".
> This was discovered as part of the investigation into a bug report on the 
> arrow-rs parquet implementation - 
> [https://github.com/apache/arrow-rs/issues/1459]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

2022-04-13 Thread Raphael Taylor-Davies (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raphael Taylor-Davies updated ARROW-16184:
--
Description: 
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
following code results in the schema changing when reading/writing a parquet 
file.

{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field (nanosecond 
units)
print(table2.schema[0]) # pyarrow.Field (microsecond 
units)
{code}


This was closed as a limitation of the parquet 1.x format for representing 
nanosecond timestamps. This is fine; however, the arrow schema embedded within 
the parquet metadata still lists the data as being a nanosecond array.

 

This was discovered as part of the investigation into a bug report on the 
arrow-rs parquet implementation - 
[https://github.com/apache/arrow-rs/issues/1459]

 

 

  was:
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
following code results in the schema changing when reading/writing a parquet 
file.
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field (nanosecond units)
print(table2.schema[0]) # pyarrow.Field (microsecond units)
 

This was closed as a limitation of the parquet 1.x format for representing 
nanosecond timestamps. This is fine; however, the arrow schema embedded within 
the parquet metadata still lists the data as being a nanosecond array.

 

This was discovered as part of the investigation into a bug report on the 
arrow-rs parquet implementation - https://github.com/apache/arrow-rs/issues/1459

 

 


> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Raphael Taylor-Davies
>Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
> following code results in the schema changing when reading/writing a parquet 
> file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field (nanosecond 
> units)
> print(table2.schema[0]) # pyarrow.Field (microsecond 
> units)
> {code}
> This was closed as a limitation of the parquet 1.x format for representing 
> nanosecond timestamps. This is fine; however, the arrow schema embedded 
> within the parquet metadata still lists the data as being a nanosecond array.
>  
> This was discovered as part of the investigation into a bug report on the 
> arrow-rs parquet implementation - 
> [https://github.com/apache/arrow-rs/issues/1459]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

2022-04-13 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-16184:
-

 Summary: [Python] Incorrect Timestamp Unit in Embedded Arrow 
Schema Within Parquet
 Key: ARROW-16184
 URL: https://issues.apache.org/jira/browse/ARROW-16184
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Raphael Taylor-Davies


As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
following code results in the schema changing when reading/writing a parquet 
file.
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field (nanosecond units)
print(table2.schema[0]) # pyarrow.Field (microsecond units)
 

This was closed as a limitation of the parquet 1.x format for representing 
nanosecond timestamps. This is fine; however, the arrow schema embedded within 
the parquet metadata still lists the data as being a nanosecond array.

 

This was discovered as part of the investigation into a bug report on the 
arrow-rs parquet implementation - https://github.com/apache/arrow-rs/issues/1459

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-12504) [Rust] Buffer::from_slice_ref incorrect capacity

2021-04-22 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-12504:
-

 Summary: [Rust] Buffer::from_slice_ref incorrect capacity
 Key: ARROW-12504
 URL: https://issues.apache.org/jira/browse/ARROW-12504
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Raphael Taylor-Davies
Assignee: Raphael Taylor-Davies


Buffer::from_slice_ref sets the capacity without taking into account the size 
of the slice elements
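
For illustration - a library-free sketch of the arithmetic, not the actual Rust 
code - the capacity of a buffer built from a typed slice is the element count 
times the element size, which is the factor the current code omits:

{code:python}
import struct

# A &[u32] of 3 elements occupies 12 bytes; reporting the capacity as the
# element count (3) understates it by size_of::<u32>().
elements = [1, 2, 3]
data = struct.pack("<3I", *elements)
print(len(elements), "elements ->", len(data), "bytes")
{code}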



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12493) [Rust] Support DictionaryArray in CSV and JSON formatters

2021-04-21 Thread Raphael Taylor-Davies (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raphael Taylor-Davies updated ARROW-12493:
--
Summary: [Rust] Support DictionaryArray in CSV and JSON formatters  (was: 
Support DictionaryArray in CSV and JSON formatters)

> [Rust] Support DictionaryArray in CSV and JSON formatters
> -
>
> Key: ARROW-12493
> URL: https://issues.apache.org/jira/browse/ARROW-12493
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Raphael Taylor-Davies
>Assignee: Raphael Taylor-Davies
>Priority: Minor
>
> Currently the CSV and JSON formatters do not support dictionary arrays



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12493) Support DictionaryArray in CSV and JSON formatters

2021-04-21 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-12493:
-

 Summary: Support DictionaryArray in CSV and JSON formatters
 Key: ARROW-12493
 URL: https://issues.apache.org/jira/browse/ARROW-12493
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Raphael Taylor-Davies
Assignee: Raphael Taylor-Davies


Currently the CSV and JSON formatters do not support dictionary arrays



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12426) [Rust] Concatenating dictionaries ignores values

2021-04-16 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-12426:
-

 Summary: [Rust] Concatenating dictionaries ignores values
 Key: ARROW-12426
 URL: https://issues.apache.org/jira/browse/ARROW-12426
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Raphael Taylor-Davies
Assignee: Raphael Taylor-Davies


Concatenating dictionaries ignores the values array; at best this leads to 
incorrect data, but more often to keys with indexes beyond the bounds of the 
values array.
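
To spell out what a correct implementation has to do - a library-free sketch 
with a hypothetical helper name, not the actual arrow-rs code - concatenation 
must merge the values arrays and offset each input's keys into the merged 
values:

{code:python}
# Concatenate dictionary-encoded arrays, given as (values, keys) pairs where
# each key indexes its own values list. Copying the keys alone would leave
# indexes pointing past the end of the output's values.
def concat_dictionaries(arrays):
    merged_values, merged_keys = [], []
    for values, keys in arrays:
        offset = len(merged_values)  # remap keys into the merged values
        merged_values.extend(values)
        merged_keys.extend(offset + k for k in keys)
    return merged_values, merged_keys

print(concat_dictionaries([(["a", "b"], [0, 1, 1]), (["b", "c"], [1, 0])]))
# (['a', 'b', 'b', 'c'], [0, 1, 1, 3, 2])
{code}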



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12425) [Rust] new_null_array doesn't allocate keys buffer for dictionary arrays

2021-04-16 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-12425:
-

 Summary: [Rust] new_null_array doesn't allocate keys buffer for 
dictionary arrays
 Key: ARROW-12425
 URL: https://issues.apache.org/jira/browse/ARROW-12425
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Raphael Taylor-Davies
Assignee: Raphael Taylor-Davies






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12400) [Rust] Re-enable transform module tests

2021-04-15 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-12400:
-

 Summary: [Rust] Re-enable transform module tests
 Key: ARROW-12400
 URL: https://issues.apache.org/jira/browse/ARROW-12400
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Raphael Taylor-Davies
Assignee: Raphael Taylor-Davies


The tests in the root of the array/transform module are currently commented 
out; this appears to have been done as part of moving from Arc<ArrayData> to 
ArrayData.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)