[jira] [Created] (ARROW-15144) [Java] Unable to read IPC file in master

2021-12-16 Thread Jira
Jorge Leitão created ARROW-15144:


 Summary: [Java] Unable to read IPC file in master
 Key: ARROW-15144
 URL: https://issues.apache.org/jira/browse/ARROW-15144
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Jorge Leitão
 Attachments: generated_primitive.arrow

I think that PR https://github.com/apache/arrow/pull/11709 may have caused a 
regression in reading IPC files.

Attached is an Arrow file that can't be read by the Java implementation but can be read by all other implementations. Its contents correspond exactly to the generated_primitive.json.gz used in the integration tests.

Background:
The integration CI pipeline in Rust's arrow2 started failing after the PR 
mentioned above. The logs show that all but the Java implementation are able to 
consume the attached file (and more generally the files created by arrow2's 
implementation). The PR broke almost all tests, suggesting that it is not 
something specific to the file but a broader issue.

Log: 
https://pipelines.actions.githubusercontent.com/RJ1isxNgLS0jQX3HKOGkLQjJSEMqOm4RfxnyKHS4o90jAsObvY/_apis/pipelines/1/runs/14655/signedlogcontent/2?urlExpires=2021-12-17T05%3A35%3A25.6055769Z&urlSigningMethod=HMACV1&urlSignature=Nx7nRNdrcUCbtvOnnXAYGDEuSEJUiDT%2BU2jNcqqp%2FEs%3D

The logs also suggest that the Java implementation may be leaking memory when 
such an event happens.

{code:java}
2021-12-16T05:38:37.6833847Z 05:38:37.622 [main] ERROR org.apache.arrow.tools.Integration - Incompatible files
2021-12-16T05:38:37.6835533Z java.lang.IllegalArgumentException: Different values in column:
2021-12-16T05:38:37.6836731Z f11: Timestamp(SECOND, UTC) at index 0: null != -62135596800
2021-12-16T05:38:37.6838188Z     at org.apache.arrow.vector.util.Validator.compareFieldVectors(Validator.java:133)
2021-12-16T05:38:37.6840563Z     at org.apache.arrow.vector.util.Validator.compareVectorSchemaRoot(Validator.java:107)
2021-12-16T05:38:37.6842476Z     at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:209)
2021-12-16T05:38:37.6843841Z     at org.apache.arrow.tools.Integration.run(Integration.java:119)
2021-12-16T05:38:37.6845214Z     at org.apache.arrow.tools.Integration.main(Integration.java:70)
2021-12-16T05:38:37.6846597Z Suppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (894)
2021-12-16T05:38:37.6847623Z Allocator(ROOT) 0/894/442402/2147483647 (res/actual/peak/limit)
2021-12-16T05:38:37.6848029Z 
2021-12-16T05:38:37.6848996Z     at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437)
2021-12-16T05:38:37.6851316Z     at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29)
2021-12-16T05:38:37.6882832Z     at org.apache.arrow.tools.Integration$Command$3.$closeResource(Integration.java:228)
2021-12-16T05:38:37.6884294Z     at org.apache.arrow.tools.Integration$Command$3.execute(Integration.java:228)
2021-12-16T05:38:37.6885249Z ... 2 common frames omitted
{code}

I can't rule out the possibility that this is an issue in arrow2 itself and an unspecified issue in its implementation - I am raising it here because all other implementations can read the files.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15143) Remove incorrect comment on API of Transform for StringBinaryTransformExecBase

2021-12-16 Thread Eduardo Ponce (Jira)
Eduardo Ponce created ARROW-15143:
-

 Summary: Remove incorrect comment on API of Transform for 
StringBinaryTransformExecBase
 Key: ARROW-15143
 URL: https://issues.apache.org/jira/browse/ARROW-15143
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Eduardo Ponce
Assignee: Eduardo Ponce
 Fix For: 7.0.0


The [comment describing the API of the Transform method for classes that inherit from StringBinaryTransformExecBase|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_string.cc#L626] incorrectly states that there is a last parameter for returning an error status. Instead, the Transform method indicates an error by returning a negative value.





[jira] [Created] (ARROW-15142) Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow

2021-12-16 Thread Karthik (Jira)
Karthik created ARROW-15142:
---

 Summary: Cannot mix struct and non-struct, non-null values error 
when saving nested types with PyArrow 
 Key: ARROW-15142
 URL: https://issues.apache.org/jira/browse/ARROW-15142
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 6.0.1
Reporter: Karthik


When trying to save a Pandas dataframe with a nested type (a list within a list, or a list within a dict) using the pyarrow engine, the following error is encountered:

{color:#e75c58}ArrowInvalid{color}: ('cannot mix list and non-list, non-null values', 'Conversion failed for column A with type object')

 

Repro:
{code:python}
import pandas as pd
x = pd.DataFrame({"A": [[24, 27, [1, 1]]]})
x.to_parquet('/tmp/a.pqt', engine="pyarrow")
{code}
A bit of googling suggests that this is a known Arrow shortcoming. However, this is a commonly encountered data structure, and 'fastparquet' handles it seamlessly. Is there a proposed timeline/plan for fixing this?





[jira] [Created] (ARROW-15141) Fatal error condition occurred in aws_thread_launch

2021-12-16 Thread F. H. (Jira)
F. H. created ARROW-15141:
-

 Summary: Fatal error condition occurred in aws_thread_launch
 Key: ARROW-15141
 URL: https://issues.apache.org/jira/browse/ARROW-15141
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 6.0.1, 6.0.0
 Environment: - `uname -a`:
Linux datalab2 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 
x86_64 x86_64 x86_64 GNU/Linux
- `mamba list | grep -i "pyarrow\|tensorflow\|^python"`
pyarrow               6.0.0    py39hff6fa39_1_cpu      conda-forge
python                3.9.7    hb7a2778_3_cpython      conda-forge
python-dateutil       2.8.2    pyhd8ed1ab_0            conda-forge
python-flatbuffers    1.12     pyhd8ed1ab_1            conda-forge
python-irodsclient    1.0.0    pyhd8ed1ab_0            conda-forge
python-rocksdb        0.7.0    py39h7fcd5f3_4          conda-forge
python_abi            3.9      2_cp39                  conda-forge
tensorflow            2.6.2    cuda112py39h9333c2f_0   conda-forge
tensorflow-base       2.6.2    cuda112py39h7de589b_0   conda-forge
tensorflow-estimator  2.6.2    cuda112py39h9333c2f_0   conda-forge
tensorflow-gpu        2.6.2    cuda112py39h0bbbad9_0   conda-forge



Reporter: F. H.


Hi, I am randomly getting the following error when first running inference with a Tensorflow model and then writing the result to a `.parquet` file:
```

Fatal error condition occurred in 
/home/conda/feedstock_root/build_artifacts/aws-c-io_1633633131324/work/source/event_loop.c:72:
 aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, 
el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application

Stack trace:

/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_backtrace_print+0x59)
 [0x7ffb14235f19]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_fatal_assert+0x48)
 [0x7ffb14227098]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0x10a43)
 [0x7ffb1406ea43]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d)
 [0x7ffb14237fad]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././././libaws-c-io.so.1.0.0(+0xe35a)
 [0x7ffb1406c35a]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-c-common.so.1(aws_ref_count_release+0x1d)
 [0x7ffb14237fad]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../../././libaws-crt-cpp.so(_ZN3Aws3Crt2Io15ClientBootstrapD1Ev+0x3a)
 [0x7ffb142a2f5a]
/home//miniconda3/envs/spliceai_env/lib/python3.9/site-packages/pyarrow/../../.././libaws-cpp-sdk-core.so(+0x5f570)
 [0x7ffb147fd570]
/lib/x86_64-linux-gnu/libc.so.6(+0x49a27) [0x7ffb17f7da27]
/lib/x86_64-linux-gnu/libc.so.6(on_exit+0) [0x7ffb17f7dbe0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa) [0x7ffb17f5b0ba]
/home//miniconda3/envs/spliceai_env/bin/python3.9(+0x20aa51) 
[0x562576609a51]
/bin/bash: line 1: 2341494 Aborted                 (core dumped)

```

 

My colleague ran into the same issue on CentOS 8 while running the same job and environment on SLURM, so I suspect it is some interaction between tensorflow and pyarrow.

Also, I found a GitHub issue where multiple people are running into the same problem:
[https://github.com/huggingface/datasets/issues/3310]

 

It is very important to my lab that this bug gets resolved, as we cannot work with Parquet any more. Unfortunately, we do not have the knowledge to fix it ourselves.





[jira] [Created] (ARROW-15140) [CI] move to v2 of checkouts for GHA

2021-12-16 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15140:
--

 Summary: [CI] move to v2 of checkouts for GHA
 Key: ARROW-15140
 URL: https://issues.apache.org/jira/browse/ARROW-15140
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Jonathan Keane
Assignee: Jonathan Keane


https://github.com/apache/arrow/pull/11974#discussion_r770942784





[jira] [Created] (ARROW-15139) [Python] write_dataset's file_write_options are too confusing and/or undocumented

2021-12-16 Thread Weston Pace (Jira)
Weston Pace created ARROW-15139:
---

 Summary: [Python] write_dataset's file_write_options are too 
confusing and/or undocumented
 Key: ARROW-15139
 URL: https://issues.apache.org/jira/browse/ARROW-15139
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Weston Pace


ParquetFileWriteOptions / IpcFileWriteOptions / CsvFileWriteOptions are 
completely undocumented.

Also, creating them via the format is a little peculiar / confusing and I'm not 
sure why it is required.

It should be clear to the user what options they can set and how to set them.





[jira] [Created] (ARROW-15138) [C++] Make ExecPlan::ToString give some additional information

2021-12-16 Thread Jira
Percy Camilo Triveño Aucahuasi created ARROW-15138:
--

 Summary: [C++] Make ExecPlan::ToString give some additional 
information
 Key: ARROW-15138
 URL: https://issues.apache.org/jira/browse/ARROW-15138
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Percy Camilo Triveño Aucahuasi


Currently we have the task https://issues.apache.org/jira/browse/ARROW-14233; however, its scope is very open-ended. This ticket will address some improvements to ExecPlan::ToString without additional sophistication and without extending its feature scope.

This means we won't need to research GraphViz (or other backends) here. We will just need these features for ExecPlan::ToString:
 * Print an index/ID for each node in the physical tree
 * Make sure all nodes have names and correct indices/IDs
 * Apply indentation when printing (this may be an optional setting in the API)





[jira] [Created] (ARROW-15137) [Dev] Update archery crossbow latest-prefix to work with nightly dates

2021-12-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15137:
-

 Summary: [Dev] Update archery crossbow latest-prefix to work with 
nightly dates
 Key: ARROW-15137
 URL: https://issues.apache.org/jira/browse/ARROW-15137
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche








[jira] [Created] (ARROW-15136) [C++] Make S3FS tests faster

2021-12-16 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15136:
--

 Summary: [C++] Make S3FS tests faster
 Key: ARROW-15136
 URL: https://issues.apache.org/jira/browse/ARROW-15136
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


One thing that would make development more pleasant would be to speed up the S3FS tests. Currently, each test waits for the launch of a new Minio child process. Instead, we could launch several of them in advance (listening on different ports and targeting different temporary directories), so that most tests don't bear the launch cost.
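The idea could look roughly like the following Python sketch (the actual change would live in the C++ test fixture; the `minio server` invocation here is an assumption, not taken from the Arrow test code):

```python
import socket
import subprocess

def free_port():
    # Ask the OS for an unused TCP port.
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def launch_minio_pool(data_dirs):
    # Start one Minio server per temporary directory, each on its own
    # port, ahead of time. Tests then take a ready (process, port) pair
    # instead of paying the startup cost individually.
    pool = []
    for d in data_dirs:
        port = free_port()
        proc = subprocess.Popen(
            ["minio", "server", "--address", f"127.0.0.1:{port}", d]
        )
        pool.append((proc, port))
    return pool
```

One caveat with pre-picking ports this way is a small race window between choosing a port and Minio actually binding it.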





[jira] [Created] (ARROW-15135) [C++][R][Python] Support reading from Apache Iceberg tables

2021-12-16 Thread Will Jones (Jira)
Will Jones created ARROW-15135:
--

 Summary: [C++][R][Python] Support reading from Apache Iceberg 
tables
 Key: ARROW-15135
 URL: https://issues.apache.org/jira/browse/ARROW-15135
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Will Jones


This is an umbrella issue for supporting the [Apache Iceberg table format|https://iceberg.apache.org/].





[jira] [Created] (ARROW-15134) [GLib] Add GArrow Month, Day, Nano IntervalDataType

2021-12-16 Thread Keisuke Okada (Jira)
Keisuke Okada created ARROW-15134:
-

 Summary: [GLib] Add GArrow Month, Day, Nano IntervalDataType
 Key: ARROW-15134
 URL: https://issues.apache.org/jira/browse/ARROW-15134
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: GLib
Reporter: Keisuke Okada








[jira] [Created] (ARROW-15133) [CI] Removing util_checkout.sh and util_cleanup.sh scripts

2021-12-16 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-15133:
---

 Summary: [CI] Removing util_checkout.sh and util_cleanup.sh scripts
 Key: ARROW-15133
 URL: https://issues.apache.org/jira/browse/ARROW-15133
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs
 Fix For: 7.0.0


- ci/scripts/util_checkout.sh was used to check out submodules because actions/checkout@v2 had removed support for that, but support has since been restored.
- ci/scripts/util_cleanup.sh was used to free up disk space on GitHub Actions runners because, at the time, they were limited to 7GB. A recent run shows the Linux runners now have 32GB of free space, so we can try disabling the cleanup step, sparing almost a minute of build time.





[jira] [Created] (ARROW-15132) Remove wrong use of typename in greedy_dual_size_cache.h

2021-12-16 Thread Projjal Chanda (Jira)
Projjal Chanda created ARROW-15132:
--

 Summary: Remove wrong use of typename in greedy_dual_size_cache.h
 Key: ARROW-15132
 URL: https://issues.apache.org/jira/browse/ARROW-15132
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Projjal Chanda








[jira] [Created] (ARROW-15131) [Python] Coerce value_set argument to array in "is_in" kernel

2021-12-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15131:
-

 Summary: [Python] Coerce value_set argument to array in "is_in" 
kernel
 Key: ARROW-15131
 URL: https://issues.apache.org/jira/browse/ARROW-15131
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Small example I ran into:

{code:python}
>>> arr = pa.array(['a', 'b', 'c', 'd'])
>>> pc.is_in(arr, ['a', 'c'])
...
TypeError: "['a', 'c']" is not a valid value set
{code}

That's not a super friendly error message (it was not directly clear what is not "valid" about this). Passing {{pa.array(['a', 'c'])}} explicitly works, but I expected the kernel to try this conversion automatically (as we also convert the first array argument to an array).





[jira] [Created] (ARROW-15130) [Doc] Annotation Glossary

2021-12-16 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-15130:
---

 Summary: [Doc] Annotation Glossary
 Key: ARROW-15130
 URL: https://issues.apache.org/jira/browse/ARROW-15130
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Alenka Frim


Add an Annotation Glossary for the Apache Arrow project.





[jira] [Created] (ARROW-15129) [Doc][C++] Building and Memory allocators

2021-12-16 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-15129:
---

 Summary: [Doc][C++] Building and Memory allocators
 Key: ARROW-15129
 URL: https://issues.apache.org/jira/browse/ARROW-15129
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Alenka Frim


Add a section about memory allocation options in the build process of the C++ library. The content could fit under the section [Build Dependency Management|https://arrow.apache.org/docs/developers/cpp/building.html#build-dependency-management].





[jira] [Created] (ARROW-15128) [C++] segfault when writing CSV from RecordBatchReader

2021-12-16 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15128:


 Summary: [C++] segfault when writing CSV from RecordBatchReader
 Key: ARROW-15128
 URL: https://issues.apache.org/jira/browse/ARROW-15128
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


I'm currently trying to implement functionality in R so that we can open a 
dataset and then write to a CSV file, but I'm getting a segfault when I run my 
tests:

 
{code:r}
tbl <- tibble::tibble(
  dbl = c(1:8, NA, 10) + .1,
  lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  false = logical(10),
  chr = letters[c(1:5, NA, 7:10)]
)

make_temp_dir <- function() {
  path <- tempfile()
  dir.create(path)
  normalizePath(path, winslash = "/")
}

data_dir <- make_temp_dir()
write_dataset(tbl, data_dir, partitioning = "lgl")
data_in <- open_dataset(data_dir)

csv_file <- tempfile()
tbl_out <- write_csv_arrow(data_in, csv_file)
{code}
 
{code:java}
Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x7fffee51fdd7 in __gnu_cxx::__exchange_and_add (__mem=0xe9, __val=-1)
    at /usr/include/c++/9/ext/atomicity.h:49
49        { return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
{code}


