[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408992#comment-17408992
 ] 

David Li commented on ARROW-13803:
--

There is an off-by-one error in 
[BitUtil::SetBitmap|https://github.com/apache/arrow/blob/8c70a5f5178c5b74cc181dc8bdd4b03ba14f36d9/cpp/src/arrow/util/bit_util.cc#L112-L115].
In this case, offset started as 0 and length started as 65536. By this point 
in the function, offset is 65536 and length is 0. data is a pointer to an 
8192-byte buffer, so the code indexes {{data[8192]}} (65536 / 8), which is one 
byte past the end of the buffer. We then crash because the memory in that 
region is not mapped on this platform. (I'm surprised valgrind/ASan/etc. don't 
catch the access on x64.)
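
To make the arithmetic above concrete, here is a minimal sketch of a bit-range setter that never reads or writes a trailing byte once the remaining length has reached 0. It is not Arrow's actual implementation or the eventual fix: {{SetBitRange}} and {{DemoFullBuffer}} are hypothetical names, and the least-significant-bit-first bit order is an assumption made for the illustration. The function returns one past the highest byte index it accessed, so the case from this comment (offset 0, length 65536, 8192-byte buffer) can be checked directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not Arrow's code): sets `length` bits starting at bit
// `offset`, LSB-first within each byte. Returns one past the highest byte index
// accessed, or 0 if no byte was accessed.
int64_t SetBitRange(uint8_t* data, int64_t offset, int64_t length) {
  assert(offset >= 0 && length >= 0);
  int64_t touched_end = 0;

  // 1. Leading partial byte: set bits up to the next byte boundary.
  const int64_t lead_bit = offset % 8;
  if (length > 0 && lead_bit != 0) {
    const int64_t n = std::min<int64_t>(8 - lead_bit, length);
    const uint8_t mask = static_cast<uint8_t>(((1u << n) - 1u) << lead_bit);
    data[offset / 8] |= mask;
    touched_end = offset / 8 + 1;
    offset += n;
    length -= n;
  }

  // 2. Whole bytes in the middle.
  const int64_t whole_bytes = length / 8;
  if (whole_bytes > 0) {
    std::memset(data + offset / 8, 0xFF, static_cast<size_t>(whole_bytes));
    touched_end = offset / 8 + whole_bytes;
    offset += whole_bytes * 8;
    length -= whole_bytes * 8;
  }

  // 3. Trailing partial byte: only touched if bits actually remain. Skipping
  //    this guard is what would lead to an access at data[offset / 8] with
  //    length == 0, i.e. data[8192] in the case described above.
  if (length > 0) {
    const uint8_t mask = static_cast<uint8_t>((1u << length) - 1u);
    data[offset / 8] |= mask;
    touched_end = offset / 8 + 1;
  }
  return touched_end;
}

// Reproduces the numbers from the comment: 65536 bits over an 8192-byte buffer.
bool DemoFullBuffer() {
  std::vector<uint8_t> buf(8192, 0);
  return SetBitRange(buf.data(), 0, 65536) == 8192;  // never reaches index 8192
}
```

With these inputs the leading and trailing steps are both skipped and only the {{memset}} over bytes 0 through 8191 runs, so the highest index accessed is 8191.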

> [C++] Segfault on filtering taxi dataset
> 
>
> Key: ARROW-13803
> URL: https://issues.apache.org/jira/browse/ARROW-13803
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Environment: macOS 11.2.1, MacBook Pro (13-inch, M1, 2020)
> Reporter: Neal Richardson
> Priority: Major
> Labels: query-engine
> Fix For: 6.0.0
>
>
> Found this while testing ARROW-13740. Using the nyc-taxi dataset:
> {code}
> ds %>%
>   filter(total_amount > 0, passenger_count > 0) %>%
>   summarise(n = n()) %>%
>   collect()
> {code}
> {code}
>  *** caught segfault ***
> address 0x161784000, cause 'invalid permissions'
> Traceback:
>  1: .Call(`_arrow_ExecPlan_run`, plan, final_node, sort_options)
> ...
> {code}
> lldb shows 
> {code}
> * thread #11, stop reason = EXC_BAD_ACCESS (code=1, address=0x1631a8000)
>     frame #0: 0x00013a79d9cc libarrow.600.dylib`arrow::BitUtil::SetBitmap(unsigned char*, long long, long long) + 296
> libarrow.600.dylib`arrow::BitUtil::SetBitmap:
> ->  0x13a79d9cc <+296>: ldrb   w10, [x8]
>     0x13a79d9d0 <+300>: cmp    w9, #0x8  ; =0x8
>     0x13a79d9d4 <+304>: cset   w11, lo
>     0x13a79d9d8 <+308>: and    w9, w9, #0x7
> Target 0: (R) stopped.
> (lldb)
> {code}
> Interestingly, I can evaluate each of those filter expressions on its own just 
> fine; it only seems to crash if both are provided. I can also count over the 
> data grouped by both:
> {code}
> ds %>% 
>   group_by(total_amount > 0, passenger_count > 0) %>% 
>   summarize(n=n()) %>% 
>   collect()
> # A tibble: 4 × 3
>   `total_amount > 0` `passenger_count > 0`          n
> 1 FALSE              FALSE                        805
> 2 FALSE              TRUE                      368680
> 3 TRUE               FALSE                    5810556
> 4 TRUE               TRUE                  1541561340
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408929#comment-17408929
 ] 

David Li commented on ARROW-13803:
--

It still doesn't reproduce on Linux/x64 or macOS/x64, unfortunately, so it does 
seem ARM-specific.



[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408900#comment-17408900
 ] 

David Li commented on ARROW-13803:
--

OK! I can reproduce it; it turns out a release build was essential (should've 
thought of that earlier…)



[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408881#comment-17408881
 ] 

David Li commented on ARROW-13803:
--

Thanks, I'll give it a try with bundled dependencies.

I'm testing using the entire dataset already as well.



[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408877#comment-17408877
 ] 

Neal Richardson commented on ARROW-13803:
-

Which years did you test? It's possible there's a data issue in some file 
that's not being handled correctly; I know there are quirks.

I am testing with the Ursa bucket data.

No conda here. The dependency source is AUTO, and I haven't installed much on 
the system, so it's basically bundling everything except lz4 and zlib AFAICT. 
My cmake invocation is:

{code}
cmake \
  -GNinja \
  -DARROW_COMPUTE=ON \
  -DARROW_CSV=ON \
  -DARROW_DATASET=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_JEMALLOC=ON \
  -DARROW_JSON=ON \
  -DARROW_PARQUET=ON \
  -DCMAKE_BUILD_TYPE=release \
  -DARROW_INSTALL_NAME_RPATH=OFF \
  -DARROW_S3=ON \
  -DARROW_MIMALLOC=OFF \
  -DARROW_WITH_BROTLI=ON \
  -DARROW_WITH_BZ2=ON \
  -DARROW_WITH_LZ4=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_WITH_ZLIB=ON \
  -DARROW_WITH_ZSTD=ON \
  -DARROW_EXTRA_ERROR_CONTEXT=ON \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DARROW_BUILD_TESTS=OFF \
  -DARROW_WITH_UTF8PROC=ON \
  ..
{code}

No special compilation flags; cmake reports

{code}
-- CMAKE_C_FLAGS:  -Qunused-arguments -O3 -DNDEBUG  -Wall -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a
-- CMAKE_CXX_FLAGS:   -Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG  -Wall -Wno-unknown-warning-option -Wno-pass-failed -stdlib=libc++ -march=armv8-a
{code}



[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-02 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408844#comment-17408844
 ] 

David Li commented on ARROW-13803:
--

Trying again on an M1 Mac, I still don't get the crash. Just to check a few 
things, then:
 * Is the source of the NYC Taxi dataset you're using also the Parquet files in 
the Ursa bucket?
 * What flags are you using to build Arrow and the R library?
 * Are you using Conda or Homebrew or some other source for dependencies? 
(Though I couldn't get Conda to work on the M1 Mac.)



[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-09-01 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408350#comment-17408350
 ] 

David Li commented on ARROW-13803:
--

I tried again on an x86_64 Mac and didn't get the error there either, though I 
tested only a couple of years of the NYC Taxi dataset.



[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-08-31 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407357#comment-17407357
 ] 

David Li commented on ARROW-13803:
--

Hmm, I built the branch on a Linux/x64 machine and ran the first query using a 
copy of the NYC Taxi dataset from the Ursa Labs bucket, and it did not crash:
{noformat}
> ds <- open_dataset("/home/lidavidm/Documents/taxi", partitioning = c("year", "month"))
> ds %>% filter(total_amount > 0, passenger_count > 0) %>% summarise(n=n()) %>% collect()
# A tibble: 1 × 1
           n
1 1541561340
{noformat}
so this might be something OS-specific or otherwise not as easily reproducible 
:/



[jira] [Commented] (ARROW-13803) [C++] Segfault on filtering taxi dataset

2021-08-31 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407321#comment-17407321
 ] 

Neal Richardson commented on ARROW-13803:
-

Some more investigation: the {{kleene_and}} does seem to be involved, and it 
seems to crash if you {{&}} together two filter expressions where at least 
one of them is on a float type. I made several queries with {{&}}-ed filters on 
integer and string types and they did not crash.

{code}
> ds %>% filter(passenger_count > 0 & passenger_count < 4) %>% summarize(n = n()) %>% collect()
# A tibble: 1 × 1
           n
1 1373176060

> ds %>% filter(payment_type == "CAS" & payment_type != "CRD") %>% summarize(n=n()) %>% collect()
# A tibble: 1 × 1
         n
1 26876825

> ds %>% filter(total_amount > 0 & total_amount < 4) %>% summarize(n = n()) %>% collect()

 *** caught bus error ***
address 0x139448000, cause 'invalid alignment'

> ds %>% filter(trip_distance > 0 & trip_distance < 4) %>% summarize(n=n()) %>% collect()

 *** caught segfault ***
address 0x120e3c000, cause 'invalid permissions'

{code}
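
For readers unfamiliar with the Kleene AND mentioned above, the sketch below shows its three-valued semantics on single values. It only illustrates the logic and is not Arrow's kernel, which operates on validity and data bitmaps; the {{Tri}} enum and both function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical three-valued boolean: kNull stands for a missing (null) value.
enum class Tri { kFalse, kTrue, kNull };

// Kleene AND: a definite false decides the result even if the other side is
// null; otherwise any null makes the result null.
Tri KleeneAnd(Tri a, Tri b) {
  if (a == Tri::kFalse || b == Tri::kFalse) return Tri::kFalse;
  if (a == Tri::kNull || b == Tri::kNull) return Tri::kNull;
  return Tri::kTrue;
}

// Spot-checks the rows of the truth table that differ from ordinary AND.
void CheckKleeneTruthTable() {
  assert(KleeneAnd(Tri::kFalse, Tri::kNull) == Tri::kFalse);
  assert(KleeneAnd(Tri::kNull, Tri::kFalse) == Tri::kFalse);
  assert(KleeneAnd(Tri::kTrue, Tri::kNull) == Tri::kNull);
  assert(KleeneAnd(Tri::kNull, Tri::kNull) == Tri::kNull);
}
```

Because the result's validity depends on the values of both inputs and not only on their validity, a kernel of this kind has to compute an output validity bitmap as well as the output values.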
