[jira] [Created] (ARROW-16277) [Python] No builds for macOS arm64.

2022-04-21 Thread A. Coady (Jira)
A. Coady created ARROW-16277:


 Summary: [Python] No builds for macOS arm64.
 Key: ARROW-16277
 URL: https://issues.apache.org/jira/browse/ARROW-16277
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Affects Versions: 8.0.0
 Environment: macOS
Reporter: A. Coady


Nightly builds no longer include a build for macOS on arm64. The last one to 
do so was 8.0.0.dev312.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16276) [R] Release News

2022-04-21 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16276:
--

 Summary: [R] Release News
 Key: ARROW-16276
 URL: https://issues.apache.org/jira/browse/ARROW-16276
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jonathan Keane
Assignee: Will Jones


I typically use a command like:

{code}
git log fcab481..HEAD --grep=".*\[R\].*" --format="%s"
{code}

This finds all the commits tagged {{[R]}} since commit fcab481. I found 
commit fcab481 by going to the 7.0.0 release branch and finding the last 
commit that is in the master branch as well as in the 7.0.0 release (see the 
sketch below).
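
That branch-point lookup can also be scripted with {{git merge-base}}; a 
minimal sketch, assuming the branches are named {{master}} and 
{{release-7.0.0}} (adjust to the actual branch names):

{code}
# Last commit shared by master and the release branch.
BASE=$(git merge-base master release-7.0.0)
# [R] commits added to master since that point.
git log "${BASE}..master" --grep="\[R\]" --format="%s"
{code}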





--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16275) [C++] Add support for pushdown projection of nested references

2022-04-21 Thread Weston Pace (Jira)
Weston Pace created ARROW-16275:
---

 Summary: [C++] Add support for pushdown projection of nested 
references
 Key: ARROW-16275
 URL: https://issues.apache.org/jira/browse/ARROW-16275
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


Now that we support nested field references, we should support pushdown 
predicates based on them.  For example:

{noformat}
dataset.to_table(filter=ds.field('values', 'one') > 200)
{noformat}

{{file_parquet.cc}} uses parquet statistics to decide which row groups to 
include when scanning a parquet fragment.  At the moment it skips any non-leaf 
columns; that will need to change.

Second, even if we were able to detect and produce a guarantee based on nested 
references, it's not clear the simplification logic would be able to detect 
this and simplify appropriately, so there may be changes needed there too.
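
For reference, a minimal Python sketch of the use case (the file name is 
hypothetical, and the nested-reference filter syntax is the one proposed 
above, so it may not run on current releases):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({
    "values": pa.array(
        [{"one": 100}, {"one": 300}],
        type=pa.struct([("one", pa.int64())]),
    )
})
pq.write_table(table, "nested.parquet")  # hypothetical path

dataset = ds.dataset("nested.parquet")
# With pushdown, parquet statistics on values.one could prune row groups.
print(dataset.to_table(filter=ds.field("values", "one") > 200))
{code}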



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16274) [C++] Substrait consumer should be feature-aware

2022-04-21 Thread Weston Pace (Jira)
Weston Pace created ARROW-16274:
---

 Summary: [C++] Substrait consumer should be feature-aware
 Key: ARROW-16274
 URL: https://issues.apache.org/jira/browse/ARROW-16274
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


The Substrait consumer should be aware of which features it was built with and 
gracefully reject plans that request unsupported features.

For example, a Substrait plan could specify the Parquet format while Arrow was 
built without Parquet support.  In that case the consumer should still compile, 
but reject all Parquet plans.

Today we simply force on all features that Substrait could possibly require.
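
As a sketch of the requested pattern (Python for illustration only; the actual 
consumer is C++, and {{plan_requires_parquet}} is a hypothetical stand-in for 
inspecting the plan's declared features):

{code:python}
def parquet_supported() -> bool:
    try:
        import pyarrow.parquet  # noqa: F401  # absent if Arrow was built without Parquet
        return True
    except ImportError:
        return False


def consume_plan(plan, plan_requires_parquet: bool):
    # Reject gracefully instead of failing at build time or crashing.
    if plan_requires_parquet and not parquet_supported():
        raise NotImplementedError(
            "Plan requires the Parquet format, but this Arrow build "
            "does not include Parquet support"
        )
    # ... translate and execute the plan ...
{code}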



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16273) [C++] Valgrind error in arrow-compute-scalar-test

2022-04-21 Thread Weston Pace (Jira)
Weston Pace created ARROW-16273:
---

 Summary: [C++] Valgrind error in arrow-compute-scalar-test
 Key: ARROW-16273
 URL: https://issues.apache.org/jira/browse/ARROW-16273
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Weston Pace


Currently valgrind is failing earlier, on tpch-node-test and 
hash-join-node-test.  Once we fix those tests, it seems the next error is this:

{noformat}
[ RUN      ] TestStringKernels/0.Strptime
==9928== Conditional jump or move depends on uninitialised value(s)
==9928==    at 0x411AEA2: arrow::TestInitialized(arrow::ArrayData const&) (gtest_util.cc:682)
==9928==    by 0xAE1C79: arrow::compute::(anonymous namespace)::ValidateOutput(arrow::ArrayData const&) (test_util.cc:287)
==9928==    by 0xAE23FC: arrow::compute::ValidateOutput(arrow::Datum const&) (test_util.cc:320)
==9928==    by 0xAE4946: arrow::compute::CheckScalarNonRecursive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::Datum const&, arrow::compute::FunctionOptions const*) (test_util.cc:80)
==9928==    by 0xAE63A4: arrow::compute::CheckScalar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::Datum, arrow::compute::FunctionOptions const*) (test_util.cc:108)
==9928==    by 0xAE7E28: arrow::compute::CheckScalarUnary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, arrow::Datum, arrow::Datum, arrow::compute::FunctionOptions const*) (test_util.cc:254)
==9928==    by 0xAE80D3: arrow::compute::CheckScalarUnary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<arrow::DataType>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<arrow::DataType>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, arrow::compute::FunctionOptions const*) (test_util.cc:260)
==9928==    by 0x9F783F: arrow::compute::BaseTestStringKernels<arrow::StringType>::CheckUnary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<arrow::DataType>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, arrow::compute::FunctionOptions const*) (scalar_string_test.cc:56)
==9928==    by 0xA2A62D: arrow::compute::TestStringKernels_Strptime_Test<arrow::StringType>::TestBody() (scalar_string_test.cc:1855)
==9928==    by 0x64974DC: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==9928==    by 0x648E90C: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==9928==    by 0x6469CDC: testing::Test::Run() (gtest.cc:2682)
==9928==    by 0x646A6FE: testing::TestInfo::Run() (gtest.cc:2861)
==9928==    by 0x646B0BD: testing::TestSuite::Run() (gtest.cc:3015)
==9928==    by 0x647B1DB: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5855)
==9928==    by 0x6498497: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2607)
==9928==    by 0x648FAF9: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2643)
==9928==    by 0x64796A8: testing::UnitTest::Run() (gtest.cc:5438)
==9928==    by 0x4204918: RUN_ALL_TESTS() (gtest.h:2490)
==9928==    by 0x420495B: main (gtest_main.cc:52)
==9928== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:_ZN5arrow15TestInitializedERKNS_9ArrayDataE
   fun:_ZN5arrow7compute12_GLOBAL__N_114ValidateOutputERKNS_9ArrayDataE
   fun:_ZN5arrow7compute14ValidateOutputERKNS_5DatumE
   fun:_ZN5arrow7compute23CheckScalarNonRecursiveERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaISA_EERKSA_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute11CheckScalarENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS_5DatumESaIS8_EES8_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute16CheckScalarUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_5DatumES7_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute16CheckScalarUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt10shared_ptrINS_8DataTypeEES6_S9_S6_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute21BaseTestStringKernelsINS_10StringTypeEE10CheckUnaryENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St10shared_ptrINS_8DataTypeEES9_PKNS0_15FunctionOptionsE
   fun:_ZN5arrow7compute31TestStringKernels_Strptime_TestINS_10StringTypeEE8TestBodyEv
   fun:_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
   fun:_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc
   fun:_ZN7testing4Test3RunEv
   fun:_ZN7testing8TestInfo3RunEv
   fun:_ZN7testing9TestSuite3RunEv
   fun:_ZN7testing8internal12UnitTestImpl11RunAllTestsEv
   fun:_ZN7testing8internal38HandleSehExceptionsInMethodIfSupportedINS0_12UnitTestImplEbEET0_PT_MS4_FS3_vEPKc
   fun:_ZN

[jira] [Created] (ARROW-16272) Poor read performance of S3FileSystem.open_input_file when used with `pd.read_csv`

2022-04-21 Thread Sahil Gupta (Jira)
Sahil Gupta created ARROW-16272:
---

 Summary: Poor read performance of S3FileSystem.open_input_file 
when used with `pd.read_csv`
 Key: ARROW-16272
 URL: https://issues.apache.org/jira/browse/ARROW-16272
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 7.0.0, 5.0.0, 4.0.1
 Environment: MacOS 12.1
MacBook Pro
Intel x86
Reporter: Sahil Gupta


`pyarrow.fs.S3FileSystem.open_input_file` and 
`pyarrow.fs.S3FileSystem.open_input_stream` perform very poorly when used with 
Pandas' `read_csv`.

 

```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = S3FileSystem(
        anonymous=True,
        region="us-east-2",
        endpoint_override=None,
        proxy_options=None,
    )

    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    # fhandler = fs.open_input_stream(
    #     "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    # )
    fhandler = fs.open_input_file(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```

Output:
```shell
Running...
Time to create fs:  0.0003612041473388672
Time to create fhandler:  0.22461509704589844
read time: 105.76488208770752
total time: 105.99135684967041
```
This is with `pandas==1.4.2`.

We get similar performance with `fs.open_input_stream` as well (commented out 
in the code above):
```shell
Running...
Time to create fs:  0.0002570152282714844
Time to create fhandler:  0.18540692329406738
read time: 186.8419930934906
total time: 187.03169012069702
```
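
A possible mitigation to try (an assumption, not verified on this dataset): 
`open_input_stream` accepts a `buffer_size` argument, so pandas' many small 
reads can be served from a local read buffer instead of individual S3 requests:

```python
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-east-2")
# Buffer reads so each small read_csv request doesn't become an S3 GET.
fhandler = fs.open_input_stream(
    "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    buffer_size=4 * 1024 * 1024,  # 4 MiB read buffer
)
```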

When running it with just pandas (which uses `s3fs` under the hood), it's much 
faster:
```python
import pandas as pd
import time


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    year_2016_df = pd.read_csv(
        "s3://bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
read time: 1.1012001037597656
total time: 1.101264238357544
```

Surprisingly, when we use `fsspec`'s `ArrowFSWrapper`, it matches s3fs 
performance:
```python
import pandas as pd
import time
from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper


def load_parking_tickets():
    print("Running...")
    t0 = time.time()
    fs = ArrowFSWrapper(
        S3FileSystem(
            anonymous=True,
            region="us-east-2",
            endpoint_override=None,
            proxy_options=None,
        )
    )

    print("Time to create fs: ", time.time() - t0)
    t0 = time.time()
    fhandler = fs._open(
        "bodo-example-data/nyc-parking-tickets/Parking_Violations_Issued_-_Fiscal_Year_2016.csv",
    )
    print("Time to create fhandler: ", time.time() - t0)
    t0 = time.time()
    year_2016_df = pd.read_csv(
        fhandler,
        nrows=100,
    )
    print("read time:", time.time() - t0)
    return year_2016_df


t0 = time.time()
load_parking_tickets()
print("total time:", time.time() - t0)
```
Output:
```shell
Running...
Time to create fs:  0.0002467632293701172
Time to create fhandler:  0.1858382225036621
read time: 0.13701486587524414
total time: 0.3232450485229492
```

 

Packages:

```
pyarrow : 7.0.0
pandas  : 1.4.2
numpy   : 1.20.3
```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16271) [C++] Implement full chunked array support for replace_with_mask

2022-04-21 Thread David Li (Jira)
David Li created ARROW-16271:


 Summary: [C++] Implement full chunked array support for 
replace_with_mask
 Key: ARROW-16271
 URL: https://issues.apache.org/jira/browse/ARROW-16271
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


ARROW-15928 enables this function to accept chunked arrays for the input array, 
but not for the mask or replacements arrays. More work is needed to implement 
those cases (which currently just return an error).

We should also consider how to make this work at least somewhat reusable for 
similar kernels (e.g. replace_with_indices); see the sketch below.
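
A minimal sketch of the current asymmetry, assuming the behavior described 
above (chunked values accepted, chunked mask rejected); the data values are 
illustrative only:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

values = pa.chunked_array([[1, 2], [3, 4]])
mask = pa.array([True, False, True, False])
replacements = pa.array([10, 30])

# Chunked input array: supported after ARROW-15928.
print(pc.replace_with_mask(values, mask, replacements))

# Chunked mask (or replacements): currently returns an error.
chunked_mask = pa.chunked_array([[True, False], [True, False]])
pc.replace_with_mask(values, chunked_mask, replacements)
{code}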



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16270) [C++][Python][FileSystem] Make directory paths returned uniform

2022-04-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16270:
---

 Summary: [C++][Python][FileSystem] Make directory paths returned 
uniform
 Key: ARROW-16270
 URL: https://issues.apache.org/jira/browse/ARROW-16270
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Micah Kornfield


Depending on whether paths are selected with or without recursion, the 
returned directory paths either include or omit a trailing slash (see the code 
linked below).  It would be nice to provide consistent output here, but it 
isn't clear if the breaking change is worthwhile.

 [1] https://github.com/apache/arrow/blob/3eaa7dd0e8b3dabc5438203331f05e3e6c011e37/python/pyarrow/tests/test_fs.py#L688

 [2] https://github.com/apache/arrow/blob/3eaa7dd0e8b3dabc5438203331f05e3e6c011e37/cpp/src/arrow/filesystem/test_util.cc#L767
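
A small Python sketch of how one might observe the difference (illustrative 
only; the exact paths returned depend on the filesystem implementation and 
Arrow version):

{code:python}
import os
import tempfile

from pyarrow.fs import FileSelector, LocalFileSystem

base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "a", "b"))

fs = LocalFileSystem()
for recursive in (False, True):
    infos = fs.get_file_info(FileSelector(base, recursive=recursive))
    print("recursive =", recursive, [info.path for info in infos])
{code}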



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16269) [R][Python] Roundtrip ChunkedArray with ExtensionType drops type

2022-04-21 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16269:


 Summary: [R][Python] Roundtrip ChunkedArray with ExtensionType 
drops type
 Key: ARROW-16269
 URL: https://issues.apache.org/jira/browse/ARROW-16269
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python, R
Reporter: Dewey Dunnington


After ARROW-15168 we will use ExtensionType in more cases to handle R vector 
types that we don't natively implement a conversion for; however, roundtripping 
a Table through pyarrow results in a Table in a slightly inconsistent state, 
where the type of the ChunkedArray doesn't line up with the type in the schema:

{code:R}
# remotes::install_github("apache/arrow/r")
library(arrow, warn.conflicts = FALSE)
pa <- reticulate::import("pyarrow", convert = FALSE)

table <- arrow_table(
  ext_col = chunked_array(vctrs_extension_array(1:10))
)
table$ext_col$type
#> VctrsExtensionType
#> integer(0)
table$schema$ext_col$type
#> VctrsExtensionType
#> integer(0)

table_py <- pa$Table$from_arrays(table$columns, schema = table$schema)
table_py$column("ext_col")$type
#> int32
table_py$schema$field("ext_col")$type
#> int32

cols <- reticulate::py_to_r(table_py$columns)
names(cols) <- reticulate::py_to_r(table_py$column_names)
table2 <- Table$create(!!! cols, schema = table$schema)
table2$ext_col$type
#> Int32
#> int32
table2$schema$ext_col$type
#> VctrsExtensionType
#> integer(0)
{code}

The workaround in ARROW-15168 is to go through RecordBatchReader, which is 
probably fine but in some cases might result in ChunkedArray columns getting 
re-chunked to the intersection of all the chunks. This doesn't copy any data, 
but isn't ideal (we should be able to roundtrip column-wise and avoid any 
re-chunking).

{code:R}
# remotes::install_github("apache/arrow/r#12817")
library(arrow, warn.conflicts = FALSE)

table <- arrow_table(
  c1 = chunked_array(1:2, 3:4, 5:6), 
  c2 = chunked_array(1:6)
)

table$c1
#> ChunkedArray
#> [
#>   [
#> 1,
#> 2
#>   ],
#>   [
#> 3,
#> 4
#>   ],
#>   [
#> 5,
#> 6
#>   ]
#> ]
table$c2
#> ChunkedArray
#> [
#>   [
#> 1,
#> 2,
#> 3,
#> 4,
#> 5,
#> 6
#>   ]
#> ]

rbr <- as_record_batch_reader(table)
table2 <- rbr$read_table()

table2$c1
#> ChunkedArray
#> [
#>   [
#> 1,
#> 2
#>   ],
#>   [
#> 3,
#> 4
#>   ],
#>   [
#> 5,
#> 6
#>   ]
#> ]
table2$c2
#> ChunkedArray
#> [
#>   [
#> 1,
#> 2
#>   ],
#>   [
#> 3,
#> 4
#>   ],
#>   [
#> 5,
#> 6
#>   ]
#> ]
{code}




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16268) [R] Reorganize deprecated functions

2022-04-21 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16268:


 Summary: [R] Reorganize deprecated functions
 Key: ARROW-16268
 URL: https://issues.apache.org/jira/browse/ARROW-16268
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dewey Dunnington


In R/deprecated.R we have a few functions that have been deprecated for some 
time. In ARROW-15168 we deprecated {{type()}} in favour of 
{{infer_type()}}, and there are a few deprecated methods in dataset.R as well. 
We should consider removing some of the older deprecations; for the newer 
ones, we should probably add a comment about when they were deprecated so that 
we can better schedule their removal (or adopt tidyverse lifecycle terminology: 
https://lifecycle.r-lib.org/articles/stages.html ).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16267) [Java] Support Java 18

2022-04-21 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16267:
-

 Summary: [Java] Support Java 18
 Key: ARROW-16267
 URL: https://issues.apache.org/jira/browse/ARROW-16267
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Alessandro Molina
Assignee: David Dali Susanibar Arce






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16266) [R] Add StructArray$create()

2022-04-21 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16266:


 Summary: [R] Add StructArray$create()
 Key: ARROW-16266
 URL: https://issues.apache.org/jira/browse/ARROW-16266
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dewey Dunnington


In ARROW-13371 we implemented the {{make_struct}} compute function bound to 
{{data.frame()}} / {{tibble()}} in dplyr evaluation; however, we didn't 
actually implement {{StructArray$create()}}. In ARROW-15168, it turns out that 
we need this to support {{StructArray}} creation from data.frames whose 
columns aren't all convertible using the internal C++ conversion. The hack used 
in that PR is below (but we should clearly implement the C++ function instead 
of using the hack):

{code:R}
library(arrow, warn.conflicts = FALSE)

struct_array <- function(...) {
  batch <- record_batch(...)
  array_ptr <- arrow:::allocate_arrow_array()
  schema_ptr <- arrow:::allocate_arrow_schema()
  batch$export_to_c(array_ptr, schema_ptr)
  Array$import_from_c(array_ptr, schema_ptr)
}

struct_array(a = 1, b = "two")
#> StructArray
#> <struct<a: double, b: string>>
#> -- is_valid: all not null
#> -- child 0 type: double
#>   [
#> 1
#>   ]
#> -- child 1 type: string
#>   [
#> "two"
#>   ]
{code}




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16265) [C++][Docs] Document ZSTD_MSVC_STATIC_LIB_SUFFIX handling on Windows

2022-04-21 Thread Jacob Wujciak-Jens (Jira)
Jacob Wujciak-Jens created ARROW-16265:
--

 Summary: [C++][Docs] Document ZSTD_MSVC_STATIC_LIB_SUFFIX handling on 
Windows
 Key: ARROW-16265
 URL: https://issues.apache.org/jira/browse/ARROW-16265
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Jacob Wujciak-Jens
 Fix For: 9.0.0


Add a note 
[here|https://arrow.apache.org/docs/developers/cpp/windows.html#windows-dependency-resolution-issues]
 that {{ZSTD_MSVC_STATIC_LIB_SUFFIX}} will be automatically set to {{_static}} 
if MSVC is used (see [PR #7388|https://github.com/apache/arrow/pull/7388]), 
and that the option needs to be passed as {{-DZSTD_MSVC_STATIC_LIB_SUFFIX=}}. 
Maybe a log message when the suffix is set would also make sense.

I ran into this issue because the current VCPKG version deviates from the 
zstd_static.lib naming scheme in a recent commit.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16264) [C++][CI] Valgrind timeout in arrow-compute-hash-join-node-test

2022-04-21 Thread Weston Pace (Jira)
Weston Pace created ARROW-16264:
---

 Summary: [C++][CI] Valgrind timeout in 
arrow-compute-hash-join-node-test
 Key: ARROW-16264
 URL: https://issues.apache.org/jira/browse/ARROW-16264
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


This started showing up once we fixed the valgrind errors in the tpch node 
test.

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=23628&view=results



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16263) [Doc] Document backpressure for the C++ streaming exec plan

2022-04-21 Thread Weston Pace (Jira)
Weston Pace created ARROW-16263:
---

 Summary: [Doc] Document backpressure for the C++ streaming exec 
plan
 Key: ARROW-16263
 URL: https://issues.apache.org/jira/browse/ARROW-16263
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


This is described somewhat in https://github.com/apache/arrow/pull/12228

We should update our user guide for datasets to help users understand how much 
RAM they should expect to use and what parameters (if any) are available for 
tuning.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16262) [CI] Kartothek nightly integration build is failing because of Parquet statistics date change

2022-04-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16262:
-

 Summary: [CI] Kartothek nightly integration build is failing 
because of Parquet statistics date change
 Key: ARROW-16262
 URL: https://issues.apache.org/jira/browse/ARROW-16262
 Project: Apache Arrow
  Issue Type: Test
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche


Caused by ARROW-7350, see discussion at 
https://github.com/apache/arrow/pull/12902#issuecomment-1102750381

Upstream issue at https://github.com/JDASoftwareGroup/kartothek/issues/515

In the short term, we should also fix our nightly builds (either by temporarily 
disabling them altogether or, ideally, by skipping just the failing tests).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16261) [CI][C++] HDFS Test failures

2022-04-21 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-16261:
--

 Summary: [CI][C++] HDFS Test failures
 Key: ARROW-16261
 URL: https://issues.apache.org/jira/browse/ARROW-16261
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou
 Fix For: 8.0.0


The HDFS DeleteDirContents tests are failing, possibly following ARROW-16159.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16260) [C++] Add backpressure to aggregate node

2022-04-21 Thread Weston Pace (Jira)
Weston Pace created ARROW-16260:
---

 Summary: [C++] Add backpressure to aggregate node
 Key: ARROW-16260
 URL: https://issues.apache.org/jira/browse/ARROW-16260
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


There are two possible cases I can think of where we might need backpressure 
handling in the aggregate node, though neither is a concern until we have 
spillover:

* Once we have spillover we may want to pause the input while we are busy 
spilling to disk
* Once we have spillover we may want to pause reading from the spillover cache 
if the sink is busy



--
This message was sent by Atlassian Jira
(v8.20.7#820007)