[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256237#comment-17256237
 ] 

Weston Pace commented on ARROW-11067:
-

I'll look into it a bit more tomorrow, but at a quick glance it seems to behave 
correctly in Python:
{code:java}
>>> import pyarrow.csv
>>> table = pyarrow.csv.read_csv('/home/pace/Downloads/demo_data.csv')
>>> strs = table.column('json_string').to_pylist()
>>> [len(s) for s in strs]
[38660, 37627, 34127, 45107, 127507, 59150, 34426, 41492, 39564, 106966, 880, 
7882, 216, 734, 407, 3, 383, 217, 887, 357]

{code}
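
For comparison, a rough R counterpart of the same check (a sketch, assuming the same demo_data.csv in the working directory); per the report below, arrow 2.0 in R returns NA for about half of these strings:

{code:java}
# Sketch of the equivalent check in R (demo_data.csv path assumed).
library(arrow)

df <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
nchar(df$json_string)        # lengths of the strings that were read (NA where the read failed)
mean(is.na(df$json_string))  # ~0.5 on arrow 2.0, per the original report
{code}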

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256218#comment-17256218
 ] 

John Sheffield edited comment on ARROW-11067 at 12/30/20, 1:35 AM:
---

(Sorry for the fragmented report here, but figured out a way to really isolate 
the issue.)

 

The string read failures are deterministic and predictable, and the content of 
the strings doesn't seem to matter; only the length does. Success and failure 
alternate at every integer multiple of *32 * 1024 characters* (call the 
multiple N):
 * For N in [0,1), i.e. string lengths between 0 and 32767 characters, all 
reads succeed.
 * For N in [1,2), i.e. string lengths between 32768 and 65535, all reads fail.
 * The same pattern repeats until we hit LongString limits: if 
floor(nchar / (32 * 1024)) is even (including 0), the read succeeds; if it is 
odd, the read fails.

Code:

 

 
{code:java}
library(tidyverse)
library(arrow)

generate_string <- function(n){
  paste0(sample(c(LETTERS, letters), size = n, replace = TRUE), collapse = "")
}

sample_breaks <- (1:60L * 16L * 1024L)
sample_lengths <- sample_breaks - 1
set.seed(1234)

test_strings <- purrr::map_chr(sample_lengths, generate_string)

readr::write_csv(data.frame(str = test_strings, strlen = sample_lengths),
 "arrow_sample_data.csv")

arrow::read_csv_arrow("arrow_sample_data.csv") %>%
  dplyr::mutate(failed_case = ifelse(is.na(str), "failed", "succeeded")) %>%
  dplyr::select(-str) %>%
  ggplot(data = ., aes(x = (strlen / (32 * 1024)), y = failed_case)) +
  geom_point(aes(color = ifelse(floor(strlen / (32 * 1024)) %% 2 == 0, "even", 
"odd")), size = 3) +
  scale_x_continuous(breaks = seq(0, 30)) +
  labs(x = "string length / (32 * 1024) : integer multiple of 32kb",
   y = "string read success/failure",
   color = "even/odd multiple of 32kb")
{code}
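
For readers skimming the thread, the rule in the bullets above boils down to a one-line predicate (a sketch; predict_failure is just an illustrative name, not anything in arrow):

{code:java}
# Observed rule: reads fail when floor(nchar / (32 * 1024)) is odd, i.e. the
# string length falls in an odd 32 KiB "stripe".
predict_failure <- function(len) floor(len / (32 * 1024)) %% 2 == 1

predict_failure(c(32767, 32768, 65535, 65536))
# FALSE  TRUE  TRUE FALSE
{code}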
 

!arrow_explanation.png!

 

 

 


was (Author: jms):
(Sorry for the fragmented report here, but figured out a way to really isolate 
the issue.)

 

The string read failures are deterministic and predictable, and the content of 
the strings doesn't seem to matter; only the length does. Success and failure 
alternate at every integer multiple of *32 * 1024 characters* (call the 
multiple N):
 * For N in [0,1), i.e. string lengths between 0 and 32767 characters, all 
reads succeed.
 * For N in [1,2), i.e. string lengths between 32768 and 65535, all reads fail.
 * The same pattern repeats until we hit LongString limits: if 
floor(nchar / (32 * 1024)) is even (including 0), the read succeeds; if it is 
odd, the read fails.

Code:

 

 
{code:java}
library(tidyverse)
library(arrow)

generate_string <- function(n){
  paste0(sample(c(LETTERS, letters), size = n, replace = TRUE), collapse = "")
}

sample_breaks <- (1:60L * 16L * 1024L)
sample_lengths <- sample_breaks - 1
set.seed(1234)

test_strings <- purrr::map_chr(sample_lengths, generate_string)

readr::write_csv(data.frame(str = test_strings, strlen = sample_lengths),
 "arrow_sample_data.csv")

arrow::read_csv_arrow("arrow_sample_data.csv") %>%
  dplyr::mutate(failed_case = ifelse(is.na(str), "failed", "succeeded")) %>%
  dplyr::select(-str) %>%
  ggplot(data = ., aes(x = (strlen / (32 * 1024)), y = failed_case)) +
  geom_point(aes(color = ifelse(floor(strlen / (32 * 1024)) %% 2 == 0, "even", 
"odd")), size = 3) +
  scale_x_continuous(breaks = seq(0, 30)) +
  labs(x = "string length / (32 * 1024) : integer multiple of 32kb",
   y = "string read success/failure",
   color = "even/odd multiple of 32kb")
{code}
 

!arrow_explanation.png!

 

 

 

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}

[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256218#comment-17256218
 ] 

John Sheffield commented on ARROW-11067:


(Sorry for the fragmented report here, but figured out a way to really isolate 
the issue.)

 

The string read failures are deterministic and predictable, and the content of 
the strings doesn't seem to matter; only the length does. Success and failure 
alternate at every integer multiple of *32 * 1024 characters* (call the 
multiple N):
 * For N in [0,1), i.e. string lengths between 0 and 32767 characters, all 
reads succeed.
 * For N in [1,2), i.e. string lengths between 32768 and 65535, all reads fail.
 * The same pattern repeats until we hit LongString limits: if 
floor(nchar / (32 * 1024)) is even (including 0), the read succeeds; if it is 
odd, the read fails.

Code:

 

 
{code:java}
library(tidyverse)
library(arrow)

generate_string <- function(n){
  paste0(sample(c(LETTERS, letters), size = n, replace = TRUE), collapse = "")
}

sample_breaks <- (1:60L * 16L * 1024L)
sample_lengths <- sample_breaks - 1
set.seed(1234)

test_strings <- purrr::map_chr(sample_lengths, generate_string)

readr::write_csv(data.frame(str = test_strings, strlen = sample_lengths),
 "arrow_sample_data.csv")

arrow::read_csv_arrow("arrow_sample_data.csv") %>%
  dplyr::mutate(failed_case = ifelse(is.na(str), "failed", "succeeded")) %>%
  dplyr::select(-str) %>%
  ggplot(data = ., aes(x = (strlen / (32 * 1024)), y = failed_case)) +
  geom_point(aes(color = ifelse(floor(strlen / (32 * 1024)) %% 2 == 0, "even", 
"odd")), size = 3) +
  scale_x_continuous(breaks = seq(0, 30)) +
  labs(x = "string length / (32 * 1024) : integer multiple of 32kb",
   y = "string read success/failure",
   color = "even/odd multiple of 32kb")
{code}
 

!arrow_explanation.png!

 

 

 

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_explanation.png

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_explanation.png, arrow_failure_cases.csv, 
> arrow_failure_cases.csv, arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2020-12-29 Thread Bradley Miro (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256212#comment-17256212
 ] 

Bradley Miro commented on ARROW-9019:
-

Hello! I'm on the GCP Dataproc team and was wondering if there's been any 
progress or workarounds for this? I am attempting to support a Horovod + 
Dataproc integration but this keeps popping up as a blocker to finishing the 
integration. Any help would be appreciated :) 

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256208#comment-17256208
 ] 

Neal Richardson commented on ARROW-11067:
-

That's really helpful, thanks for sharing. [~apitrou] or [~westonpace], can you 
take a look at this?

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrow_failure_cases.csv, 
> arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrowbug1.png

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrow_failure_cases.csv, 
> arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256207#comment-17256207
 ] 

John Sheffield commented on ARROW-11067:


I pulled a few strings over a much larger dataset and came to something useful. 
There is an extremely definite 'striping' of success/failure patterns beginning 
at nchar of 32,767 (where failures start); then the failures stop and all cases 
succeed between 65,685 and 98,832 chars; and then we switch back to failures. 
The graph below captures it all.   

(Unfortunately, can't share the full dataset this came from for confidentiality 
reasons, but I'm betting that I can recreate the effect on something simulated. 
I also attached the distribution of character counts by success/failure – this 
is the CSV behind the plot, dropping cases below 30k characters which 100% 
succeeded.)

[^arrow_failure_cases.csv]

 

!arrowbug1.png!

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrow_failure_cases.csv, 
> arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_failure_cases.csv

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrow_failure_cases.csv, 
> arrowbug1.png, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Comment: was deleted

(was: I pulled a few strings over a much larger dataset and came to something 
useful. There is an extremely definite 'striping' of success/failure patterns 
beginning at nchar of 32,767 (where failures start); then the failures stop and 
all cases succeed between 65,685 and 98,832 chars; and then we switch back to 
failures. The graph below captures it all.   

(Unfortunately, can't share the full dataset this came from for confidentiality 
reasons, but I'm betting that I can recreate the effect on something simulated. 
I also attached the distribution of character counts by success/failure – this 
is the CSV behind the plot, dropping cases below 30k characters which 100% 
succeeded.)

[^arrow_failure_cases.csv]

 

!arrowbug1.png!)

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256206#comment-17256206
 ] 

John Sheffield edited comment on ARROW-11067 at 12/29/20, 11:38 PM:


I pulled a few strings over a much larger dataset and came to something useful. 
There is an extremely definite 'striping' of success/failure patterns beginning 
at nchar of 32,767 (where failures start); then the failures stop and all cases 
succeed between 65,685 and 98,832 chars; and then we switch back to failures. 
The graph below captures it all.   

(Unfortunately, can't share the full dataset this came from for confidentiality 
reasons, but I'm betting that I can recreate the effect on something simulated. 
I also attached the distribution of character counts by success/failure – this 
is the CSV behind the plot, dropping cases below 30k characters which 100% 
succeeded.)

[^arrow_failure_cases.csv]

 

!arrowbug1.png!


was (Author: jms):
I pulled a few strings over a much larger dataset and came to something useful. 
There is an extremely definite 'striping' of success/failure patterns beginning 
at nchar of 32,767 (where failures start); then the failures stop and all cases 
succeed between 65,685 and 98,832 chars; and then we switch back to failures. 
The graph below captures it all.   

(Unfortunately, can't share the full dataset this came from for confidentiality 
reasons, but I'm betting that I can recreate the effect on something simulated. 
I also attached the distribution of character counts by success/failure – this 
is the CSV behind the plot, dropping cases below 30k characters which 
100%[^arrow_failure_cases.csv] succeeded.)

!arrowbug1.png!

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256206#comment-17256206
 ] 

John Sheffield commented on ARROW-11067:


I pulled a few strings over a much larger dataset and came to something useful. 
There is an extremely definite 'striping' of success/failure patterns beginning 
at nchar of 32,767 (where failures start); then the failures stop and all cases 
succeed between 65,685 and 98,832 chars; and then we switch back to failures. 
The graph below captures it all.   

(Unfortunately, can't share the full dataset this came from for confidentiality 
reasons, but I'm betting that I can recreate the effect on something simulated. 
I also attached the distribution of character counts by success/failure – this 
is the CSV behind the plot, dropping cases below 30k characters, which 100% 
succeeded.)

[^arrow_failure_cases.csv]

!arrowbug1.png!

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrow_failure_cases.csv

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrow_failure_cases.csv, arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sheffield updated ARROW-11067:
---
Attachment: arrowbug1.png

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: arrowbug1.png, demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7288) [C++][R] read_parquet() freezes on Windows with Japanese locale

2020-12-29 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256202#comment-17256202
 ] 

Antoine Pitrou commented on ARROW-7288:
---

[~jonkeane] Is this something you could try to diagnose?

> [C++][R] read_parquet() freezes on Windows with Japanese locale
> ---
>
> Key: ARROW-7288
> URL: https://issues.apache.org/jira/browse/ARROW-7288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.15.1
> Environment: R 3.6.1 on Windows 10
>Reporter: Hiroaki Yutani
>Priority: Critical
>  Labels: parquet
> Fix For: 4.0.0
>
>
> The following example on read_parquet()'s doc freezes (seems to wait for the 
> result forever) on my Windows.
> df <- read_parquet(system.file("v0.7.1.parquet", package="arrow"))
> The CRAN checks are all fine, which means the example is successfully 
> executed on the CRAN Windows. So, I have no idea why it doesn't work on my 
> local.
> [https://cran.r-project.org/web/checks/check_results_arrow.html]
> Here's my session info in case it helps:
> {code:java}
> > sessioninfo::session_info()
> - Session info 
> -
>  setting  value
>  version  R version 3.6.1 (2019-07-05)
>  os   Windows 10 x64
>  system   x86_64, mingw32
>  ui   RStudio
>  language en
>  collate  Japanese_Japan.932
>  ctypeJapanese_Japan.932
>  tz   Asia/Tokyo
>  date 2019-12-01
> - Packages 
> -
>  package * version  date   lib source
>  arrow   * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.1)
>  assertthat0.2.12019-03-21 [1] CRAN (R 3.6.0)
>  bit   1.1-14   2018-05-29 [1] CRAN (R 3.6.0)
>  bit64 0.9-72017-05-08 [1] CRAN (R 3.6.0)
>  cli   1.1.02019-03-19 [1] CRAN (R 3.6.0)
>  crayon1.3.42017-09-16 [1] CRAN (R 3.6.0)
>  fs1.3.12019-05-06 [1] CRAN (R 3.6.0)
>  glue  1.3.12019-03-12 [1] CRAN (R 3.6.0)
>  magrittr  1.5  2014-11-22 [1] CRAN (R 3.6.0)
>  purrr 0.3.32019-10-18 [1] CRAN (R 3.6.1)
>  R62.4.12019-11-12 [1] CRAN (R 3.6.1)
>  Rcpp  1.0.32019-11-08 [1] CRAN (R 3.6.1)
>  reprex0.3.02019-05-16 [1] CRAN (R 3.6.0)
>  rlang 0.4.22019-11-23 [1] CRAN (R 3.6.1)
>  rstudioapi0.10 2019-03-19 [1] CRAN (R 3.6.0)
>  sessioninfo   1.1.12018-11-05 [1] CRAN (R 3.6.0)
>  tidyselect0.2.52018-10-11 [1] CRAN (R 3.6.0)
>  withr 2.1.22018-03-15 [1] CRAN (R 3.6.0)
> [1] C:/Users/hiroaki-yutani/Documents/R/win-library/3.6
> [2] C:/Program Files/R/R-3.6.1/library
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8470) [Python][R] Expose incremental write API for Feather files

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8470:
---
Fix Version/s: (was: 3.0.0)
   4.0.0

> [Python][R] Expose incremental write API for Feather files
> --
>
> Key: ARROW-8470
> URL: https://issues.apache.org/jira/browse/ARROW-8470
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 4.0.0
>
>
> This is already available for writing IPC files, so this would mostly be an 
> interface to that with the addition of logic to handle conversions from 
> Python or R data frames and splitting the inputs based on the configured 
> Feather chunksize



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9187) [R] Add bindings for arithmetic kernels

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9187:
--

Assignee: Neal Richardson  (was: Jonathan Keane)

> [R] Add bindings for arithmetic kernels
> ---
>
> Key: ARROW-9187
> URL: https://issues.apache.org/jira/browse/ARROW-9187
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9187) [R] Add bindings for arithmetic kernels

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9187:
--

Assignee: Jonathan Keane  (was: Neal Richardson)

> [R] Add bindings for arithmetic kernels
> ---
>
> Key: ARROW-9187
> URL: https://issues.apache.org/jira/browse/ARROW-9187
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9856) [R] Add bindings for string compute functions

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-9856:
--

Assignee: Jonathan Keane

> [R] Add bindings for string compute functions
> -
>
> Key: ARROW-9856
> URL: https://issues.apache.org/jira/browse/ARROW-9856
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Jonathan Keane
>Priority: Major
> Fix For: 3.0.0
>
>
> See https://arrow.apache.org/docs/cpp/compute.html#string-predicates and 
> below. Since R's base string functions, as well as stringr/stringi, aren't 
> generics that we can define methods for, this will probably make most sense 
> within the context of a dplyr expression where we have more control over the 
> evaluation.
> This will require enabling utf8proc in the builds; there's already an 
> rtools-package for it.
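
A rough sketch of the design point above (the table and column names are invented, and the commented-out second form is the proposed dplyr-level binding, not something arrow 2.0 already supports):

{code:java}
# Illustration only: what works today vs. what ARROW-9856 proposes.
library(arrow)
library(dplyr)
library(stringr)

tbl <- arrow::Table$create(data.frame(name = c("foo_1", "bar_2"),
                                      stringsAsFactors = FALSE))

# Works today: convert the Arrow Table to a data.frame first, then use stringr in R.
as.data.frame(tbl) %>% filter(str_detect(name, "^foo"))

# What this ticket proposes: push the string predicate into the dplyr
# expression so it can be mapped onto Arrow's own string kernels.
# tbl %>% filter(str_detect(name, "^foo")) %>% collect()
{code}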



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7288) [C++][R] read_parquet() freezes on Windows with Japanese locale

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7288:
---
Fix Version/s: (was: 3.0.0)
   4.0.0

> [C++][R] read_parquet() freezes on Windows with Japanese locale
> ---
>
> Key: ARROW-7288
> URL: https://issues.apache.org/jira/browse/ARROW-7288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.15.1
> Environment: R 3.6.1 on Windows 10
>Reporter: Hiroaki Yutani
>Priority: Critical
>  Labels: parquet
> Fix For: 4.0.0
>
>
> The following example on read_parquet()'s doc freezes (seems to wait for the 
> result forever) on my Windows.
> df <- read_parquet(system.file("v0.7.1.parquet", package="arrow"))
> The CRAN checks are all fine, which means the example is successfully 
> executed on the CRAN Windows. So, I have no idea why it doesn't work on my 
> local.
> [https://cran.r-project.org/web/checks/check_results_arrow.html]
> Here's my session info in case it helps:
> {code:java}
> > sessioninfo::session_info()
> - Session info 
> -
>  setting  value
>  version  R version 3.6.1 (2019-07-05)
>  os   Windows 10 x64
>  system   x86_64, mingw32
>  ui   RStudio
>  language en
>  collate  Japanese_Japan.932
>  ctypeJapanese_Japan.932
>  tz   Asia/Tokyo
>  date 2019-12-01
> - Packages 
> -
>  package * version  date   lib source
>  arrow   * 0.15.1.1 2019-11-05 [1] CRAN (R 3.6.1)
>  assertthat0.2.12019-03-21 [1] CRAN (R 3.6.0)
>  bit   1.1-14   2018-05-29 [1] CRAN (R 3.6.0)
>  bit64 0.9-72017-05-08 [1] CRAN (R 3.6.0)
>  cli   1.1.02019-03-19 [1] CRAN (R 3.6.0)
>  crayon1.3.42017-09-16 [1] CRAN (R 3.6.0)
>  fs1.3.12019-05-06 [1] CRAN (R 3.6.0)
>  glue  1.3.12019-03-12 [1] CRAN (R 3.6.0)
>  magrittr  1.5  2014-11-22 [1] CRAN (R 3.6.0)
>  purrr 0.3.32019-10-18 [1] CRAN (R 3.6.1)
>  R62.4.12019-11-12 [1] CRAN (R 3.6.1)
>  Rcpp  1.0.32019-11-08 [1] CRAN (R 3.6.1)
>  reprex0.3.02019-05-16 [1] CRAN (R 3.6.0)
>  rlang 0.4.22019-11-23 [1] CRAN (R 3.6.1)
>  rstudioapi0.10 2019-03-19 [1] CRAN (R 3.6.0)
>  sessioninfo   1.1.12018-11-05 [1] CRAN (R 3.6.0)
>  tidyselect0.2.52018-10-11 [1] CRAN (R 3.6.0)
>  withr 2.1.22018-03-15 [1] CRAN (R 3.6.0)
> [1] C:/Users/hiroaki-yutani/Documents/R/win-library/3.6
> [2] C:/Program Files/R/R-3.6.1/library
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6582) [R] Arrow to R fails with embedded nuls in strings

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6582:
--

Assignee: Neal Richardson

> [R] Arrow to R fails with embedded nuls in strings
> --
>
> Key: ARROW-6582
> URL: https://issues.apache.org/jira/browse/ARROW-6582
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.14.1
> Environment: Windows 10
> R 3.4.4
>Reporter: John Cassil
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Apologies if this issue isn't categorized or documented appropriately.  
> Please be gentle! :)
> As a heavy R user that normally interacts with parquet files using SparklyR, 
> I have recently decided to try to use arrow::read_parquet() on a few parquet 
> files that were on my local machine rather than in hadoop.  I was not able to 
> proceed after several attempts due to embedded nuls.  For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>   embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT  TORQUE 
> ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with csvs using 
> data.table::fread(), but readr::read_delim() seems to handle them gracefully 
> with just a warning after proceeding.
> Apologies that I do not have a handy reprex. I don't know if I can even 
> recreate a parquet file with embedded nuls using arrow if it won't let me 
> read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9147) [C++][Dataset] Support null -> other type promotion in Dataset scanning

2020-12-29 Thread Gabriel Bassett (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256193#comment-17256193
 ] 

Gabriel Bassett commented on ARROW-9147:


I received the following error with arrow 2.0.0 (R):

{noformat}
Error in dataset___Scanner__ToTable(self): Type error: fields had matching 
names but differing types. From: : bool To: : null
{noformat}

Should this be fixed? Is it possible the order in which the null is encountered 
is handled differently?

> [C++][Dataset] Support null -> other type promotion in Dataset scanning
> ---
>
> Key: ARROW-9147
> URL: https://issues.apache.org/jira/browse/ARROW-9147
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Regarding schema evolution / normalization, we support inserting nulls for a 
> missing column, changing nullability, or normalizing column order, but we do 
> not yet seem to support promotion of the null type to any other type.
> Small python example:
> {code}
> In [11]: df = pd.DataFrame({"col": np.array([None, None, None, None], 
> dtype='object')})
> ...: df.to_parquet("test_filter_schema.parquet", engine="pyarrow")
> ...:
> ...: import pyarrow.dataset as ds
> ...: dataset = ds.dataset("test_filter_schema.parquet", format="parquet", 
> schema=pa.schema([("col", pa.int64())]))
> ...: dataset.to_table()
> ...
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Dataset.to_table()
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset.Scanner.to_table()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: fields had matching names but differing types. From: col: 
> null To: col: int64
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11063) [Rust] Validate null counts when building arrays

2020-12-29 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11063.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9041
[https://github.com/apache/arrow/pull/9041]

> [Rust] Validate null counts when building arrays
> 
>
> Key: ARROW-11063
> URL: https://issues.apache.org/jira/browse/ARROW-11063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ArrayDataBuilder allows the user to specify a null count, alternatively 
> calculating it if it is not set.
> The problem is that the user-specified null count is never validated against 
> the actual count of the buffer.
> I suggest removing the ability to specify a null-count, and instead always 
> calculating it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11063) [Rust] Validate null counts when building arrays

2020-12-29 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb updated ARROW-11063:

Component/s: Rust

> [Rust] Validate null counts when building arrays
> 
>
> Key: ARROW-11063
> URL: https://issues.apache.org/jira/browse/ARROW-11063
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ArrayDataBuilder allows the user to specify a null count, alternatively 
> calculating it if it is not set.
> The problem is that the user-specified null count is never validated against 
> the actual count of the buffer.
> I suggest removing the ability to specify a null-count, and instead always 
> calculating it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11063) [Rust] Validate null counts when building arrays

2020-12-29 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb reassigned ARROW-11063:
---

Assignee: Neville Dipale

> [Rust] Validate null counts when building arrays
> 
>
> Key: ARROW-11063
> URL: https://issues.apache.org/jira/browse/ARROW-11063
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ArrayDataBuilder allows the user to specify a null count, alternatively 
> calculating it if it is not set.
> The problem is that the user-specified null count is never validated against 
> the actual count of the buffer.
> I suggest removing the ability to specify a null-count, and instead always 
> calculating it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2020-12-29 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11065:
-
Description: 
My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was told 
I have to install ARROW C++ first. I downloaded the ARROW 2.0.0 tarball and 
tried to install its "cpp" component according to the instructions. However, I 
got the following error after {{cd release}} when running {{cmake ..}}:

 

{noformat}
Login=root: Line=602 > cmake ..
-- Building using CMake version: 3.16.0
-- Arrow version: 2.0.0 (full: '2.0.0')
-- Arrow SO version: 200 (full: 200.0.0)
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
-- infer not found
-- Found cpplint executable at 
/software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
-- System processor: powerpc
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
  SSE4.2 required but compiler doesn't support it.
Call Stack (most recent call first):
  CMakeLists.txt:437 (include)

-- Configuring incomplete, errors occurred!
See also 
"/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
See also 
"/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
{noformat}

Attached are the two CMake output/error log files. Sutou Kouhei suggested that I 
submit an issue here. Can someone please help me fix the issue? What do I need to 
do about the SSE4.2 requirement?

Thanks.

 

  was:
My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was told 
I have to install ARROW C++ first.  I downloaded ARROW 2.0.0 {color:#24292e}tar 
ball and tried to install its "cpp" component according to the instruction.  
However, I got the following error after {{cd release}} to run {{cmake ..}}: 
{color}

 

{color:#24292e}{{Login=root: Line=602 > cmake ..
-- Building using CMake version: 3.16.0
-- Arrow version: 2.0.0 (full: '2.0.0')
-- Arrow SO version: 200 (full: 200.0.0)
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
-- infer not found
-- Found cpplint executable at 
/software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
-- System processor: powerpc
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
  SSE4.2 required but compiler doesn't support it.
Call Stack (most recent call first):
  CMakeLists.txt:437 (include)

-- Configuring incomplete, errors occurred!
See also 
"/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
See also 
"/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".}}{color}

{color:#24292e}Attached are 2 CMake output/error files.  Sutou Kouhei suggested 
me to submit an issue here.  Can someone please help me to fix the issue?  What 
do I have to do with required SSE4.2?{color}

{color:#24292e}Thanks.{color}

 


> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was 
> told I have to install ARROW C++ first.  I downloaded ARROW 2.0.0 
> {color:#24292e}tar ball and tried to install its "cpp" component according to 
> the instruction.  However, I got the following error after {{cd release}} to 
> run {{cmake ..}}: {color}
>  
> {noformat}
> Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".
> {noformat}
> Attached are 2 CMake output/error files.  Sutou Kouhei suggested me to submit 
> an issue here.  Can someone please help me to fix the issue?  What do I have 
> to do with required SSE4.2?
> Thanks.

[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2020-12-29 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11065:
-
Labels:   (was: build)

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was 
> told I have to install ARROW C++ first.  I downloaded ARROW 2.0.0 
> {color:#24292e}tar ball and tried to install its "cpp" component according to 
> the instruction.  However, I got the following error after {{cd release}} to 
> run {{cmake ..}}: {color}
>  
> {color:#24292e}{{Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".}}{color}
> {color:#24292e}Attached are 2 CMake output/error files.  Sutou Kouhei 
> suggested me to submit an issue here.  Can someone please help me to fix the 
> issue?  What do I have to do with required SSE4.2?{color}
> {color:#24292e}Thanks.{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2020-12-29 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11065:
-
Issue Type: New Feature  (was: Bug)

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was 
> told I have to install ARROW C++ first.  I downloaded ARROW 2.0.0 
> {color:#24292e}tar ball and tried to install its "cpp" component according to 
> the instruction.  However, I got the following error after {{cd release}} to 
> run {{cmake ..}}: {color}
>  
> {color:#24292e}{{Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".}}{color}
> {color:#24292e}Attached are 2 CMake output/error files.  Sutou Kouhei 
> suggested me to submit an issue here.  Can someone please help me to fix the 
> issue?  What do I have to do with required SSE4.2?{color}
> {color:#24292e}Thanks.{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2020-12-29 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11065:
-
Fix Version/s: (was: 2.0.0)

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
>  Labels: build
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was 
> told I have to install ARROW C++ first.  I downloaded ARROW 2.0.0 
> {color:#24292e}tar ball and tried to install its "cpp" component according to 
> the instruction.  However, I got the following error after {{cd release}} to 
> run {{cmake ..}}: {color}
>  
> {color:#24292e}{{Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".}}{color}
> {color:#24292e}Attached are 2 CMake output/error files.  Sutou Kouhei 
> suggested me to submit an issue here.  Can someone please help me to fix the 
> issue?  What do I have to do with required SSE4.2?{color}
> {color:#24292e}Thanks.{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2020-12-29 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11065:
-
Flags:   (was: Important)

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
>  Labels: build
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was 
> told I have to install ARROW C++ first.  I downloaded ARROW 2.0.0 
> {color:#24292e}tar ball and tried to install its "cpp" component according to 
> the instruction.  However, I got the following error after {{cd release}} to 
> run {{cmake ..}}: {color}
>  
> {color:#24292e}{{Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".}}{color}
> {color:#24292e}Attached are 2 CMake output/error files.  Sutou Kouhei 
> suggested me to submit an issue here.  Can someone please help me to fix the 
> issue?  What do I have to do with required SSE4.2?{color}
> {color:#24292e}Thanks.{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11071) [R][CI] Use processx to set up minio and flight servers in tests

2020-12-29 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11071:
---

 Summary: [R][CI] Use processx to set up minio and flight servers 
in tests
 Key: ARROW-11071
 URL: https://issues.apache.org/jira/browse/ARROW-11071
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 3.0.0


Rather than rely on the minio and flight servers being set up outside of the tests, 
start them from the tests with processx. processx is already a transitive test 
dependency (testthat uses it), so there's no reason for us not to.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11030) [Rust] [DataFusion] MutableArrayData slow with many batches

2020-12-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove reassigned ARROW-11030:
--

Assignee: (was: Andy Grove)

> [Rust] [DataFusion] MutableArrayData slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11030) [Rust] [DataFusion] HashJoinExec slow with many batches

2020-12-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11030:
---
Summary: [Rust] [DataFusion] HashJoinExec slow with many batches  (was: 
[Rust] [DataFusion] MutableArrayData slow with many batches)

> [Rust] [DataFusion] HashJoinExec slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11030) [Rust] [DataFusion] MutableArrayData slow with many batches

2020-12-29 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256172#comment-17256172
 ] 

Andy Grove commented on ARROW-11030:


Thanks [~jorgecarleitao] and [~Dandandan]  for the information. It looks like I 
may have misunderstood the issue a bit. I am going to unassign this for now and 
change the title back to being specific to hash join.

> [Rust] [DataFusion] MutableArrayData slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11068) [Rust] [DataFusion] Wrap more operators in CoalesceBatchExec

2020-12-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11068:
---
Description: 
Once [https://github.com/apache/arrow/pull/9043] is merged, we should extend 
this to wrap HashJoinExec and HashAggregateExec as well since they can both 
produce small batches.

Rather than hard-code a list of operators that need to be wrapped, we should 
find a more generic mechanism so that plans can declare if their input and/or 
output batches should be coalesced (similar to how we handle partitioning) and 
this would allow custom operators outside of DataFusion to benefit from this 
optimization.

  was:
Once [https://github.com/apache/arrow/pull/9043] is merged, we should extend 
this to wrap join output as well.

Rather than hard-code a list of operators that need to be wrapped, we should 
find a more generic mechanism so that plans can declare if their input and/or 
output batches should be coalesced (similar to how we handle partitioning) and 
this would allow custom operators outside of DataFusion to benefit from this 
optimization.


> [Rust] [DataFusion] Wrap more operators in CoalesceBatchExec
> 
>
> Key: ARROW-11068
> URL: https://issues.apache.org/jira/browse/ARROW-11068
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Once [https://github.com/apache/arrow/pull/9043] is merged, we should extend 
> this to wrap HashJoinExec and HashAggregateExec as well since they can both 
> produce small batches.
> Rather than hard-code a list of operators that need to be wrapped, we should 
> find a more generic mechanism so that plans can declare if their input and/or 
> output batches should be coalesced (similar to how we handle partitioning) 
> and this would allow custom operators outside of DataFusion to benefit from 
> this optimization.
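As a rough sketch of the "let plans declare it" idea (stand-in types, not the actual 
DataFusion ExecutionPlan API): each plan node reports whether its output should be 
coalesced, and an optimizer pass wraps it accordingly instead of consulting a 
hard-coded list of operators.

{code:java}
// Hedged sketch only; ExecPlan, FilterExec and ScanExec are stand-ins.
trait ExecPlan {
    fn name(&self) -> &str;
    // Nodes that tend to emit small batches (filters, joins, aggregates)
    // would override this to return true.
    fn output_needs_coalescing(&self) -> bool {
        false
    }
}

struct FilterExec;
impl ExecPlan for FilterExec {
    fn name(&self) -> &str { "FilterExec" }
    fn output_needs_coalescing(&self) -> bool { true }
}

struct ScanExec;
impl ExecPlan for ScanExec {
    fn name(&self) -> &str { "ScanExec" }
}

// The optimizer pass consults the declaration instead of a hard-coded list.
fn maybe_wrap(plan: &dyn ExecPlan) -> String {
    if plan.output_needs_coalescing() {
        format!("CoalesceBatches({})", plan.name())
    } else {
        plan.name().to_string()
    }
}

fn main() {
    assert_eq!(maybe_wrap(&FilterExec), "CoalesceBatches(FilterExec)");
    assert_eq!(maybe_wrap(&ScanExec), "ScanExec");
}
{code}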



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11068) [Rust] [DataFusion] Wrap more operators in CoalesceBatchExec

2020-12-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11068:
---
Summary: [Rust] [DataFusion] Wrap more operators in CoalesceBatchExec  
(was: [Rust] [DataFusion] Wrap HashJoinExec in CoalesceBatchExec)

> [Rust] [DataFusion] Wrap more operators in CoalesceBatchExec
> 
>
> Key: ARROW-11068
> URL: https://issues.apache.org/jira/browse/ARROW-11068
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Once [https://github.com/apache/arrow/pull/9043] is merged, we should extend 
> this to wrap join output as well.
> Rather than hard-code a list of operators that need to be wrapped, we should 
> find a more generic mechanism so that plans can declare if their input and/or 
> output batches should be coalesced (similar to how we handle partitioning) 
> and this would allow custom operators outside of DataFusion to benefit from 
> this optimization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11070) [C++] [R] Implement exponentiation compute kernel

2020-12-29 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-11070:
--

 Summary: [C++] [R] Implement exponentiation compute kernel
 Key: ARROW-11070
 URL: https://issues.apache.org/jira/browse/ARROW-11070
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, R
Reporter: Jonathan Keane


We already have addition, subtraction, multiplication, and division compute kernels; 
exponentiation (power) is still missing.
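For illustration, a minimal Rust sketch of what an element-wise exponentiation 
kernel would compute, with null propagation matching the existing arithmetic 
kernels. This is a stand-in, not the Arrow C++ kernel API.

{code:java}
// Hedged sketch of element-wise power with null propagation.
fn power(base: &[Option<f64>], exp: &[Option<f64>]) -> Vec<Option<f64>> {
    base.iter()
        .zip(exp.iter())
        .map(|(b, e)| match (b, e) {
            (Some(b), Some(e)) => Some(b.powf(*e)),
            // Nulls propagate, as in the existing arithmetic kernels.
            _ => None,
        })
        .collect()
}

fn main() {
    let result = power(
        &[Some(2.0), Some(3.0), None],
        &[Some(10.0), Some(2.0), Some(1.0)],
    );
    println!("{:?}", result); // roughly [Some(1024.0), Some(9.0), None]
}
{code}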



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256149#comment-17256149
 ] 

Neal Richardson commented on ARROW-11067:
-

Thanks for the detailed summary. Since, as you say, the values are already missing 
in the Arrow Table, it sounds like an issue in the C++ library. We'll take a look.

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11069) Parquet writer incorrect data being written when data type is dictionary

2020-12-29 Thread Palash Goel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Palash Goel updated ARROW-11069:

 Attachment: image-2020-12-30-01-19-20-491.png
 image-2020-12-30-01-19-42-739.png
Description: 
When writing a dictionary-typed column using pyarrow, incorrect results start 
appearing after index 1024. first_write.parquet was created by reading the 
original file and then writing it again. I don't see any obvious pattern in the 
shuffled rows.
 !image-2020-12-30-01-19-42-739.png! Original records
 !image-2020-12-30-01-19-20-491.png! Written records

  was:
When writing a dict column using pyarrow. 

 

This incorrect results start appearing after index 1024. first_write.parquet 
was created after reading and then writing it again.


> Parquet writer incorrect data being written when data type is dictionary
> 
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
>  !image-2020-12-30-01-19-42-739.png! Original records
>  !image-2020-12-30-01-19-20-491.png! Written records



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11069) Parquet writer incorrect data being written when data type is dictionary

2020-12-29 Thread Palash Goel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Palash Goel updated ARROW-11069:

Attachment: image-2020-12-30-01-20-45-183.png

> Parquet writer incorrect data being written when data type is dictionary
> 
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
> These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
>  !image-2020-12-30-01-19-42-739.png! Original records
>  !image-2020-12-30-01-19-20-491.png! Written records



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11069) Parquet writer incorrect data being written when data type is dictionary

2020-12-29 Thread Palash Goel (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Palash Goel updated ARROW-11069:

Description: 
When writing a dictionary-typed column using pyarrow, incorrect results start 
appearing after index 1024. first_write.parquet was created by reading the 
original file and then writing it again. I don't see any obvious pattern in the 
shuffled rows.

!image-2020-12-30-01-20-45-183.png!
 Original records
 !image-2020-12-30-01-19-20-491.png!

Written records

  was:
When writing a dict column using pyarrow. 

 
This incorrect results start appearing after index 1024. first_write.parquet 
was created after reading and then writing it again. I don't see any obvious 
pattern in the shuffled rows.
 !image-2020-12-30-01-19-42-739.png! Original records
 !image-2020-12-30-01-19-20-491.png! Written records


> Parquet writer incorrect data being written when data type is dictionary
> 
>
> Key: ARROW-11069
> URL: https://issues.apache.org/jira/browse/ARROW-11069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: pandas v1.0.4
>Reporter: Palash Goel
>Priority: Major
> Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png, 
> image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png, 
> original.parquet
>
>
> When writing a dict column using pyarrow. 
>  
>  These incorrect results start appearing after index 1024. first_write.parquet 
> was created after reading and then writing it again. I don't see any obvious 
> pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
>  Original records
>  !image-2020-12-30-01-19-20-491.png!
> Written records



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11069) Parquet writer incorrect data being written when data type is dictionary

2020-12-29 Thread Palash Goel (Jira)
Palash Goel created ARROW-11069:
---

 Summary: Parquet writer incorrect data being written when data 
type is dictionary
 Key: ARROW-11069
 URL: https://issues.apache.org/jira/browse/ARROW-11069
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
 Environment: pandas v1.0.4
Reporter: Palash Goel
 Attachments: first_write.parquet, original.parquet

When writing a dictionary-typed column using pyarrow, incorrect results start 
appearing after index 1024. first_write.parquet was created by reading the 
original file and then writing it again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11030) [Rust] [DataFusion] MutableArrayData slow with many batches

2020-12-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256141#comment-17256141
 ] 

Jorge Leitão commented on ARROW-11030:
--

[~andygrove], the MutableArrayData is basically:
 # bound to a bunch of existing Arrays (of the same type), so that it can "copy 
slices of slots" from any of those arrays
 # It uses the arrays' DataType to extend the buffers and child_data according 
to the spec

The parameter "capacity" (in number of slots) is used to reserve all buffers 
that we can reserve upfront (e.g. primitive types and offsets). In the case of 
a concat, we can compute that parameter exactly, since it is the sum of all 
arrays lens: 
https://github.com/apache/arrow/blob/f4ccceb2536bb6ed5bf584843c55c6628ad23494/rust/arrow/src/compute/kernels/concat.rs#L57

The `new` function allocates closures (one per array without child data) that 
are optimized to extend buffers accordingly.

> [Rust] [DataFusion] MutableArrayData slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11030) [Rust] [DataFusion] MutableArrayData slow with many batches

2020-12-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256138#comment-17256138
 ] 

Daniël Heres commented on ARROW-11030:
--

But the same is applicable to MutableArrayData: for the left side of the join 
it generates n batches, which are iterated n times in MutableArrayData::new().

> [Rust] [DataFusion] MutableArrayData slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11058) [Rust] [DataFusion] Implement "coalesce batches" operator

2020-12-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256132#comment-17256132
 ] 

Jorge Leitão commented on ARROW-11058:
--

Thank you so much for your explanation, [~andygrove] . I agree with that.

Maybe this is too obvious and I am just not knowledgeable here: what is the 
problem with P = B? I.e. what do we gain from having both a batch size and 
number of parts, instead of having just one batch per part?

I am asking this because it seems to me that we have a fragmentation problem: 
we start with a bunch of contiguous blocks of memory, and as we operate on 
them, we fragment / filter them into smaller and smaller parts that, at some 
point, become slow to operate on individually (and we defragment via coalesces 
to bring them back together). Just like in an OS.

With (P,B), we need to deal with fragmentation both at the partition level and 
batch level: we need to worry about having a partition that is balanced (in 
number of rows per part), and also have each part balanced (in number of rows 
per batch on each part).

Wouldn't it be simpler if P=B, where we only need to worry about fragmentation 
of parts (and coalesce parts)? I suspect that that would be too simple, i.e. I 
am missing the benefit of the extra degree of freedom (P,B) vs (P=B).

 

 

> [Rust] [DataFusion] Implement "coalesce batches" operator
> -
>
> Key: ARROW-11058
> URL: https://issues.apache.org/jira/browse/ARROW-11058
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When we have a FilterExec in the plan, it can produce lots of small batches 
> and we therefore lose efficiency of vectorized operations.
> We should implement a new CoalesceBatchExec and wrap every FilterExec with 
> one of these so that small batches can be recombined into larger batches to 
> improve the efficiency of upstream operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11030) [Rust] [DataFusion] MutableArrayData slow with many batches

2020-12-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256115#comment-17256115
 ] 

Daniël Heres commented on ARROW-11030:
--

It's not directly related to MutableArrayData::new(), but it still gives the 
hash join a part that takes exponential time.

> [Rust] [DataFusion] MutableArrayData slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11030) [Rust] [DataFusion] MutableArrayData slow with many batches

2020-12-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256109#comment-17256109
 ] 

Daniël Heres commented on ARROW-11030:
--

One comment I put in a PR, which I think describes part of the problem:

I think part of a further speedup could come from building the left/build-side 
Vec<&ArrayData> only once, instead of once for each right batch in 
build_batch_from_indices. Currently, when the batch size is made smaller, the 
build-side Vec is built more times and also contains more (smaller) batches 
itself, which could explain (part of) the big/exponential slowdown on smaller 
batches.
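A toy Rust illustration of the suggested hoisting (stand-in types, not the actual 
hash join code): build the view over the build-side batches once and reuse it for 
every probe-side batch, instead of rebuilding it per batch.

{code:java}
// Hedged sketch; `Vec<Vec<u8>>` stands in for the build-side arrays and
// `views` for the Vec<&ArrayData> mentioned above.
fn probe_all(build_side: &[Vec<u8>], probe_batches: usize) -> usize {
    // Hoisted out of the probe loop: the view is constructed once.
    let views: Vec<&[u8]> = build_side.iter().map(|a| a.as_slice()).collect();
    let mut touched = 0;
    for _probe in 0..probe_batches {
        // Reuse `views` here rather than recollecting it for every batch.
        touched += views.len();
    }
    touched
}

fn main() {
    let build_side = vec![vec![0u8; 16]; 1_000];
    assert_eq!(probe_all(&build_side, 10), 10_000);
}
{code}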

> [Rust] [DataFusion] MutableArrayData slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11030) [Rust] [DataFusion] MutableArrayData slow with many batches

2020-12-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11030:
---
Summary: [Rust] [DataFusion] MutableArrayData slow with many batches  (was: 
[Rust] [DataFusion] Poor join performance with smaller batches)

> [Rust] [DataFusion] MutableArrayData slow with many batches
> ---
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11030) [Rust] [DataFusion] Poor join performance with smaller batches

2020-12-29 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256105#comment-17256105
 ] 

Andy Grove commented on ARROW-11030:


I have a theory on what might be happening here but I am struggling to really 
understand this.

It looks like we create a buffer and for each input array, we extend this 
buffer. Each time we extend it, the buffer is larger so the cost of extending 
it again gets higher each time?

Is there a way we can compute upfront how much to extend it by and do one 
extend operation?
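For illustration only, a toy Rust sketch of the growth pattern being described: 
extending a buffer once per input without reserving, versus computing the total 
size upfront and allocating once. Plain Rust, not the MutableArrayData code.

{code:java}
// Hedged sketch of "compute upfront how much to extend it by".
fn grow_incrementally(chunks: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    for c in chunks {
        // May have to reallocate and copy several times as `out` grows.
        out.extend_from_slice(c);
    }
    out
}

fn grow_with_reserve(chunks: &[Vec<u8>]) -> Vec<u8> {
    // Compute the total size upfront and allocate once.
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut out = Vec::with_capacity(total);
    for c in chunks {
        out.extend_from_slice(c);
    }
    out
}

fn main() {
    let chunks = vec![vec![0u8; 4096]; 100];
    assert_eq!(grow_incrementally(&chunks), grow_with_reserve(&chunks));
}
{code}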

> [Rust] [DataFusion] Poor join performance with smaller batches
> --
>
> Key: ARROW-11030
> URL: https://issues.apache.org/jira/browse/ARROW-11030
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 3.0.0
>
>
> Performance of joins slows down dramatically with smaller batches.
> The issue is related to slow performance of MutableArrayData::new() when 
> passed a high number of batches. This happens when passing in all of the 
> batches from the build side of the join and this happens once per build-side 
> join key for each probe-side batch.
> It seems to get exponentially slower as the number of arrays increases even 
> though the number of rows is the same.
> I modified hash_join.rs to have this debug code:
> {code:java}
> let start = Instant::now();
> let row_count: usize = arrays.iter().map(|arr| arr.len()).sum();
> let num_arrays = arrays.len();
> let mut mutable = MutableArrayData::new(arrays, true, capacity);
> if num_arrays > 0 {
> debug!("MutableArrayData::new() with {} arrays containing {} rows took {} 
> ms", num_arrays, row_count, start.elapsed().as_millis());
> } {code}
> Batch size 131072:
> {code:java}
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms
> MutableArrayData::new() with 4584 arrays containing 3115341 rows took 1 ms 
> {code}
> Batch size 16384:
> {code:java}
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 19 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 16 ms
> MutableArrayData::new() with 36624 arrays containing 3115341 rows took 17 ms 
> {code}
> Batch size 4096:
> {code:java}
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 89 ms
> MutableArrayData::new() with 146496 arrays containing 3115341 rows took 88 ms 
> {code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11068) [Rust] [DataFusion] Wrap HashJoinExec in CoalesceBatchExec

2020-12-29 Thread Andy Grove (Jira)
Andy Grove created ARROW-11068:
--

 Summary: [Rust] [DataFusion] Wrap HashJoinExec in CoalesceBatchExec
 Key: ARROW-11068
 URL: https://issues.apache.org/jira/browse/ARROW-11068
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.0


Once [https://github.com/apache/arrow/pull/9043] is merged, we should extend 
this to wrap join output as well.

Rather than hard-code a list of operators that need to be wrapped, we should 
find a more generic mechanism so that plans can declare if their input and/or 
output batches should be coalesced (similar to how we handle partitioning) and 
this would allow custom operators outside of DataFusion to benefit from this 
optimization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11058) [Rust] [DataFusion] Implement "coalesce batches" operator

2020-12-29 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256096#comment-17256096
 ] 

Andy Grove commented on ARROW-11058:


[~jorgecarleitao]

I think that the PR [https://github.com/apache/arrow/pull/9043] will help 
explain this, but Apache Spark actually does do something very similar. Spark 
has partitions which are the unit of parallelism (either on threads or 
executors) and each partition is an iterator[T].

Spark supports row-based (Iterator[Row]) and column-based (Iterator[ColumnarBatch]) 
operators out of the box, although most of the built-in operators are row-based. 
Spark will insert transitions as required to convert between row-based and 
column-based operators.

Because filters can produce empty batches or batches with a single or small 
number of rows, we lose some efficiency both with SIMD and also just due to 
per-batch overheads in particular kernels (as we have seen with 
MutableArrayData). 

Small batches can also be inefficient when writing out to Parquet because we 
lose the benefits of compression to some degree, so this is another use case 
where we would want to coalesce them.

Coalescing batches is especially important for GPU if we ever add support for 
that because the cost of an operation on GPU is the same (once data is loaded) 
regardless of how many items it is operating on, so it is beneficial to operate 
on as much data in parallel as possible.
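To make the coalescing idea concrete, a small self-contained Rust sketch: buffer 
incoming small batches until a target row count is reached, then emit one combined 
batch. `Batch` and `concat` are stand-ins for RecordBatch and the real concatenation 
kernel, not the actual CoalesceBatchExec implementation.

{code:java}
// Hedged sketch of the coalesce-batches loop.
#[derive(Debug)]
struct Batch {
    rows: usize,
}

fn concat(batches: &[Batch]) -> Batch {
    Batch { rows: batches.iter().map(|b| b.rows).sum() }
}

fn coalesce(input: impl IntoIterator<Item = Batch>, target_rows: usize) -> Vec<Batch> {
    let mut out = Vec::new();
    let mut pending: Vec<Batch> = Vec::new();
    let mut pending_rows = 0;
    for batch in input {
        pending_rows += batch.rows;
        pending.push(batch);
        if pending_rows >= target_rows {
            out.push(concat(&pending));
            pending.clear();
            pending_rows = 0;
        }
    }
    if !pending.is_empty() {
        out.push(concat(&pending)); // flush the tail
    }
    out
}

fn main() {
    // Many tiny filter outputs become a handful of large batches.
    let tiny = (0..1000).map(|_| Batch { rows: 7 });
    let coalesced = coalesce(tiny, 4096);
    assert_eq!(coalesced.iter().map(|b| b.rows).sum::<usize>(), 7000);
    println!("coalesced into {} batches", coalesced.len());
}
{code}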

> [Rust] [DataFusion] Implement "coalesce batches" operator
> -
>
> Key: ARROW-11058
> URL: https://issues.apache.org/jira/browse/ARROW-11058
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When we have a FilterExec in the plan, it can produce lots of small batches 
> and we therefore lose efficiency of vectorized operations.
> We should implement a new CoalesceBatchExec and wrap every FilterExec with 
> one of these so that small batches can be recombined into larger batches to 
> improve the efficiency of upstream operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11058) [Rust] [DataFusion] Implement "coalesce batches" operator

2020-12-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11058:
---
Labels: pull-request-available  (was: )

> [Rust] [DataFusion] Implement "coalesce batches" operator
> -
>
> Key: ARROW-11058
> URL: https://issues.apache.org/jira/browse/ARROW-11058
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we have a FilterExec in the plan, it can produce lots of small batches 
> and we therefore lose efficiency of vectorized operations.
> We should implement a new CoalesceBatchExec and wrap every FilterExec with 
> one of these so that small batches can be recombined into larger batches to 
> improve the efficiency of upstream operators.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10578) [C++] Comparison kernels crashing for string array with null string scalar

2020-12-29 Thread Kirill Lykov (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256068#comment-17256068
 ] 

Kirill Lykov commented on ARROW-10578:
--

The problem is still reproducible. It happens only for type == string.

I don't see C++ tests for this use case: 
[https://github.com/apache/arrow/blob/52d615dc2cd64fbdbc10f2aeeb3b43ad5e879f3b/cpp/src/arrow/compute/kernels/scalar_compare_test.cc#L537]

Let me know if I'm looking in the wrong place.
I will try to add a unit test for this particular case.

I also think it makes sense to add a test in pyarrow, something similar to 
[https://github.com/apache/arrow/blob/64f9b3fbe9ef4c718449a735435b53ab992ca852/python/pyarrow/tests/test_compute.py#L769]


The problem is that the scalar is invalid (`datum->is_valid == false`): see 
[https://github.com/apache/arrow/blob/ca685a0c08bb41f43a80e5605e4cc8f9efb77cca/cpp/src/arrow/compute/kernels/codegen_internal.h#L713]. 
But we dereference val at codegen_internal.h:275 and try to create a string_view 
from data_, which has address 0x10.

To fix the bug, I guess some additional checks should be added to 
https://github.com/apache/arrow/blame/ca685a0c08bb41f43a80e5605e4cc8f9efb77cca/cpp/src/arrow/compute/kernels/codegen_internal.h#L273
Something like: if the scalar is invalid, return a default string_view.




 

> [C++] Comparison kernels crashing for string array with null string scalar
> --
>
> Key: ARROW-10578
> URL: https://issues.apache.org/jira/browse/ARROW-10578
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Comparing a string array with a string scalar works:
> {code}
> In [1]: import pyarrow.compute as pc
> In [2]: pc.equal(pa.array(["a", None, "b"]), pa.scalar("a", type="string")
> Out[2]: 
> 
> [
>   true,
>   null,
>   false
> ]
> {code}
> but if the scalar is a null (from the proper string type), it crashes:
> {code}
> In [4]: pc.equal(pa.array(["a", None, "b"]), pa.scalar(None, type="string"))
> Segmentation fault (core dumped)
> {code}
> (and not even debug messages ..)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11067:

Fix Version/s: 3.0.0

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10995) [Rust] [DataFusion] Improve parallelism when reading Parquet files

2020-12-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10995.

Resolution: Fixed

Issue resolved by pull request 9029
[https://github.com/apache/arrow/pull/9029]

> [Rust] [DataFusion] Improve parallelism when reading Parquet files
> --
>
> Key: ARROW-10995
> URL: https://issues.apache.org/jira/browse/ARROW-10995
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently the unit of parallelism is the number of Parquet files being read.
> For example, if we run a query against a Parquet table that consists of 8 
> partitions, then we will attempt to run 8 async tasks in parallel; if there is 
> a single Parquet file, we will only try to run 1 async task, so this does not 
> scale well. Also, if there are hundreds or thousands of Parquet files, then we 
> will try to process them all concurrently, which also doesn't scale well.
> These are the options for improving this situation:
>  
>  # Use Parquet row groups as the unit of partitioning and divide the number 
> of row groups by the desired level of concurrency (defaulting to number of 
> cores)
>  # Keep file as the unit of partitions and add a RepartitionExec into the 
> plan if there are fewer partitions (files) than cores and in the case where 
> there are more files than cores, split the files up into lists so that each 
> partition is a list of files rather than a single file. Each partition task 
> will process one file at a time.
>  
>  
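
For option 2 above, the file-to-partition assignment itself is simple. Here is a 
hedged sketch (a hypothetical helper, not the DataFusion API) that spreads the 
input files round-robin over the desired number of partitions, so each partition 
task processes its list of files one at a time:

{code}
// Hypothetical helper: split M files into at most `target_partitions` lists.
fn partition_files(files: Vec<String>, target_partitions: usize) -> Vec<Vec<String>> {
    let mut partitions: Vec<Vec<String>> = vec![Vec::new(); target_partitions];
    for (i, file) in files.into_iter().enumerate() {
        partitions[i % target_partitions].push(file); // round-robin assignment
    }
    // Drop empty partitions when there are fewer files than partitions.
    partitions.retain(|p| !p.is_empty());
    partitions
}

fn main() {
    let files: Vec<String> = (0..10).map(|i| format!("part-{}.parquet", i)).collect();
    let parts = partition_files(files, 4);
    assert_eq!(parts.len(), 4);
    assert_eq!(parts[0].len(), 3); // files 0, 4, 8
}
{code}

The RepartitionExec mentioned in option 2 would still be needed for the opposite 
case, where there are fewer files than cores.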



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11067) [R] read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11067:

Summary: [R] read_csv_arrow silently fails to read some strings and returns 
nulls  (was: read_csv_arrow silently fails to read some strings and returns 
nulls)

> [R] read_csv_arrow silently fails to read some strings and returns nulls
> 
>
> Key: ARROW-11067
> URL: https://issues.apache.org/jira/browse/ARROW-11067
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: John Sheffield
>Priority: Major
> Attachments: demo_data.csv
>
>
> A sample file is attached, showing 10 rows each of strings with consistent 
> failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
> strings are in the column `json_string` – if relevant, they are geojsons with 
> min nchar of 33,229 and max nchar of 202,515.
> When I read this sample file with other R CSV readers (readr and data.table 
> shown), the files are imported correctly and there are no NAs in the 
> json_string column.
> When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
> end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
> this might not be limited to the R interface, but I can't help debug much 
> further upstream.
>  
>  
> {code:java}
> aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
> aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
> bbb <- data.table::fread("demo_data.csv")
> ccc <- readr::read_csv("demo_data.csv")
> mean(is.na(aaa1$json_string)) # 0.5
> mean(is.na(aaa2$column(1))) # Scalar 0.5
> mean(is.na(bbb$json_string)) # 0
> mean(is.na(ccc$json_string)) # 0{code}
>  
>  
>  * arrow 2.0 (latest CRAN)
>  * readr 1.4.0
>  * data.table 1.13.2
>  * R version 4.0.1 (2020-06-06)
>  * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11067) read_csv_arrow silently fails to read some strings and returns nulls

2020-12-29 Thread John Sheffield (Jira)
John Sheffield created ARROW-11067:
--

 Summary: read_csv_arrow silently fails to read some strings and 
returns nulls
 Key: ARROW-11067
 URL: https://issues.apache.org/jira/browse/ARROW-11067
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: John Sheffield
 Attachments: demo_data.csv

A sample file is attached, showing 10 rows each of strings with consistent 
failures (false_na = TRUE) and consistent successes (false_na = FALSE). The 
strings are in the column `json_string` – if relevant, they are geojsons with 
min nchar of 33,229 and max nchar of 202,515.

When I read this sample file with other R CSV readers (readr and data.table 
shown), the files are imported correctly and there are no NAs in the 
json_string column.

When I read with arrow::read_csv_arrow, 50% of the sample json_string column 
end up as NAs. as_data_frame TRUE or FALSE does not change the behavior, so 
this might not be limited to the R interface, but I can't help debug much 
further upstream.

 

 
{code:java}
aaa1 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = TRUE)
aaa2 <- arrow::read_csv_arrow("demo_data.csv", as_data_frame = FALSE)
bbb <- data.table::fread("demo_data.csv")
ccc <- readr::read_csv("demo_data.csv")
mean(is.na(aaa1$json_string)) # 0.5
mean(is.na(aaa2$column(1))) # Scalar 0.5
mean(is.na(bbb$json_string)) # 0
mean(is.na(ccc$json_string)) # 0{code}
 

 
 * arrow 2.0 (latest CRAN)
 * readr 1.4.0
 * data.table 1.13.2
 * R version 4.0.1 (2020-06-06)
 * MacOS Catalina 10.15.7 / x86_64-apple-darwin17.0

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11062) [Java] When writing to flight stream, Spark's mapPartitions is not working

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11062:

Summary: [Java] When writing to flight stream, Spark's mapPartitions is not 
working  (was: When writing to flight stream, Spark's mapPartitions is not 
working)

> [Java] When writing to flight stream, Spark's mapPartitions is not working
> --
>
> Key: ARROW-11062
> URL: https://issues.apache.org/jira/browse/ARROW-11062
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Ravi Shankar
>Priority: Major
> Fix For: 2.0.0
>
>
> Hi,
> I have the following method:
>  
> val outRDD = myRdd.mapPartitions { it =>
> val  l = Location.forGrpcInsecure("10.0.0.113", 12233);
> val allocator = it.allocator.newChildAllocator("SparkFlightConnector", 0, 
> Long.MaxValue)
>        val client = FlightClient.builder(allocator, l).build();
>        val desc = FlightDescriptor.path("wonderful")
>        val stream = client.startPut(desc,it.root, new AsyncPutListener)
>        it.foreach { root =>
>         // doPut on the populated VectorSchemaRoot
>         stream.putNext()
>       }
>       stream.completed()
>       // Need to call this, or exceptions from the server get swallowed
>       stream.getResult
>  
>  //   println(it.next().contentToTSVString())
>       client.close()
>  
>       Iterator.empty
>       }.count
>  
> Following is the error:
>  
> Caused by: java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;CLjava/lang/Object;)V
>  at io.grpc.Metadata$Key.validateName(Metadata.java:742)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:750)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:668)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:959)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:954)
>  at io.grpc.Metadata$Key.of(Metadata.java:705)
>  at io.grpc.Metadata$Key.of(Metadata.java:701)
>  at io.grpc.internal.GrpcUtil.<clinit>(GrpcUtil.java:80)
>  
> When I googled, it seems to be some guava-related jar issue. I ran a maven 
> dependency tree and did not find anything wrong. Please help.
>  
> Best,
> Ravion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11065) [C++] Installation failed on AIX7.2

2020-12-29 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11065:

Summary: [C++] Installation failed on AIX7.2  (was: Installation of Apache 
Arrow C++ failed on AIX7.2)

> [C++] Installation failed on AIX7.2
> ---
>
> Key: ARROW-11065
> URL: https://issues.apache.org/jira/browse/ARROW-11065
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 2.0.0
> Environment: AIX7.2
>Reporter: Xiaobo Zhang
>Priority: Major
>  Labels: build
> Fix For: 2.0.0
>
> Attachments: CMakeError.log, CMakeOutput.log
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was 
> told I have to install ARROW C++ first.  I downloaded the ARROW 2.0.0 
> tarball and tried to install its "cpp" component according to 
> the instructions.  However, I got the following error after {{cd release}} to 
> run {{cmake ..}}:
>  
> {{Login=root: Line=602 > cmake ..
> -- Building using CMake version: 3.16.0
> -- Arrow version: 2.0.0 (full: '2.0.0')
> -- Arrow SO version: 200 (full: 200.0.0)
> -- clang-tidy not found
> -- clang-format not found
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
> -- infer not found
> -- Found cpplint executable at 
> /software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
> -- System processor: powerpc
> -- Arrow build warning level: PRODUCTION
> CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
>   SSE4.2 required but compiler doesn't support it.
> Call Stack (most recent call first):
>   CMakeLists.txt:437 (include)
> -- Configuring incomplete, errors occurred!
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
> See also 
> "/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".}}{color}
> {color:#24292e}Attached are 2 CMake output/error files.  Sutou Kouhei 
> suggested me to submit an issue here.  Can someone please help me to fix the 
> issue?  What do I have to do with required SSE4.2?{color}
> {color:#24292e}Thanks.{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11064) [Rust][DataFusion] Speed up hash join on smaller batches

2020-12-29 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-11064.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 9042
[https://github.com/apache/arrow/pull/9042]

> [Rust][DataFusion] Speed up hash join on smaller batches
> 
>
> Key: ARROW-11064
> URL: https://issues.apache.org/jira/browse/ARROW-11064
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11066) [Java] Is there a bug in flight AddWritableBuffer

2020-12-29 Thread Kangping Huang (Jira)
Kangping Huang created ARROW-11066:
--

 Summary: [Java] Is there a bug in flight AddWritableBuffer
 Key: ARROW-11066
 URL: https://issues.apache.org/jira/browse/ARROW-11066
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Java
Affects Versions: 1.0.0
Reporter: Kangping Huang


[https://github.com/apache/arrow/blob/9bab12f03ac486bb8270f031b83f0a0411766b3e/java/flight/flight-core/src/main/java/org/apache/arrow/flight/grpc/AddWritableBuffer.java#L94]

buf.readBytes(stream, buf.readableBytes());

Is this line redundant? In my perf.svg, it copies the data from buf to the 
OutputStream, which means zero-copy is not achieved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11065) Installation of Apache Arrow C++ failed on AIX7.2

2020-12-29 Thread Xiaobo Zhang (Jira)
Xiaobo Zhang created ARROW-11065:


 Summary: Installation of Apache Arrow C++ failed on AIX7.2
 Key: ARROW-11065
 URL: https://issues.apache.org/jira/browse/ARROW-11065
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 2.0.0
 Environment: AIX7.2
Reporter: Xiaobo Zhang
 Fix For: 2.0.0
 Attachments: CMakeError.log, CMakeOutput.log

My installation of pyarrow on AIX7.2 failed due to missing ARROW and I was told 
I have to install ARROW C++ first.  I downloaded the ARROW 2.0.0 tarball and 
tried to install its "cpp" component according to the instructions.  However, 
I got the following error after {{cd release}} to run {{cmake ..}}:

{{Login=root: Line=602 > cmake ..
-- Building using CMake version: 3.16.0
-- Arrow version: 2.0.0 (full: '2.0.0')
-- Arrow SO version: 200 (full: 200.0.0)
-- clang-tidy not found
-- clang-format not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
-- infer not found
-- Found cpplint executable at 
/software/thirdparty/apache-arrow-2.0.0/cpp/build-support/cpplint.py
-- System processor: powerpc
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:365 (message):
  SSE4.2 required but compiler doesn't support it.
Call Stack (most recent call first):
  CMakeLists.txt:437 (include)

-- Configuring incomplete, errors occurred!
See also 
"/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeOutput.log".
See also 
"/software/thirdparty/apache-arrow-2.0.0/cpp/release/CMakeFiles/CMakeError.log".}}{color}

{color:#24292e}Attached are 2 CMake output/error files.  Sutou Kouhei suggested 
me to submit an issue here.  Can someone please help me to fix the issue?  What 
do I have to do with required SSE4.2?{color}

{color:#24292e}Thanks.{color}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11062) When writing to flight stream, Spark's mapPartitions is not working

2020-12-29 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256009#comment-17256009
 ] 

David Li commented on ARROW-11062:
--

Hi Ravi, can you provide the output of {{mvn dependency:tree}} for your project 
to start with?

> When writing to flight stream, Spark's mapPartitions is not working
> ---
>
> Key: ARROW-11062
> URL: https://issues.apache.org/jira/browse/ARROW-11062
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Ravi Shankar
>Priority: Major
> Fix For: 2.0.0
>
>
> Hi,
> I have the following method:
>  
> val outRDD = myRdd.mapPartitions { it =>
> val  l = Location.forGrpcInsecure("10.0.0.113", 12233);
> val allocator = it.allocator.newChildAllocator("SparkFlightConnector", 0, 
> Long.MaxValue)
>        val client = FlightClient.builder(allocator, l).build();
>        val desc = FlightDescriptor.path("wonderful")
>        val stream = client.startPut(desc,it.root, new AsyncPutListener)
>        it.foreach { root =>
>         // doPut on the populated VectorSchemaRoot
>         stream.putNext()
>       }
>       stream.completed()
>       // Need to call this, or exceptions from the server get swallowed
>       stream.getResult
>  
>  //   println(it.next().contentToTSVString())
>       client.close()
>  
>       Iterator.empty
>       }.count
>  
> Following is the error:
>  
> Caused by: java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;CLjava/lang/Object;)V
>  at io.grpc.Metadata$Key.validateName(Metadata.java:742)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:750)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:668)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:959)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:954)
>  at io.grpc.Metadata$Key.of(Metadata.java:705)
>  at io.grpc.Metadata$Key.of(Metadata.java:701)
>  at io.grpc.internal.GrpcUtil.<clinit>(GrpcUtil.java:80)
>  
> When I googled, it seems to be some guava-related jar issue. I ran a maven 
> dependency tree and did not find anything wrong. Please help.
>  
> Best,
> Ravion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11062) When writing to flight stream, Spark's mapPartitions is not working

2020-12-29 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256009#comment-17256009
 ] 

David Li edited comment on ARROW-11062 at 12/29/20, 2:10 PM:
-

Hi Ravi, can you provide the output of {{mvn dependency:tree}} for your project 
to start with? And your pom.xml as well.


was (Author: lidavidm):
Hi Ravi, can you provide the output of {{mvn dependency:tree}} for your project 
to start with?

> When writing to flight stream, Spark's mapPartitions is not working
> ---
>
> Key: ARROW-11062
> URL: https://issues.apache.org/jira/browse/ARROW-11062
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0
>Reporter: Ravi Shankar
>Priority: Major
> Fix For: 2.0.0
>
>
> Hi,
> I have the following method:
>  
> val outRDD = myRdd.mapPartitions { it =>
> val  l = Location.forGrpcInsecure("10.0.0.113", 12233);
> val allocator = it.allocator.newChildAllocator("SparkFlightConnector", 0, 
> Long.MaxValue)
>        val client = FlightClient.builder(allocator, l).build();
>        val desc = FlightDescriptor.path("wonderful")
>        val stream = client.startPut(desc,it.root, new AsyncPutListener)
>        it.foreach { root =>
>         // doPut on the populated VectorSchemaRoot
>         stream.putNext()
>       }
>       stream.completed()
>       // Need to call this, or exceptions from the server get swallowed
>       stream.getResult
>  
>  //   println(it.next().contentToTSVString())
>       client.close()
>  
>       Iterator.empty
>       }.count
>  
> Following is the error:
>  
> Caused by: java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;CLjava/lang/Object;)V
>  at io.grpc.Metadata$Key.validateName(Metadata.java:742)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:750)
>  at io.grpc.Metadata$Key.<init>(Metadata.java:668)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:959)
>  at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:954)
>  at io.grpc.Metadata$Key.of(Metadata.java:705)
>  at io.grpc.Metadata$Key.of(Metadata.java:701)
>  at io.grpc.internal.GrpcUtil.<clinit>(GrpcUtil.java:80)
>  
> When I googled, it seems to be some guava-related jar issue. I ran a maven 
> dependency tree and did not find anything wrong. Please help.
>  
> Best,
> Ravion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11064) [Rust][DataFusion] Speed up hash join on smaller batches

2020-12-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11064:
---
Labels: pull-request-available  (was: )

> [Rust][DataFusion] Speed up hash join on smaller batches
> 
>
> Key: ARROW-11064
> URL: https://issues.apache.org/jira/browse/ARROW-11064
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11064) [Rust][DataFusion] Speed up hash join on smaller batches

2020-12-29 Thread Jira
Daniël Heres created ARROW-11064:


 Summary: [Rust][DataFusion] Speed up hash join on smaller batches
 Key: ARROW-11064
 URL: https://issues.apache.org/jira/browse/ARROW-11064
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Daniël Heres
Assignee: Daniël Heres






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10957) Expanding pyarrow buffer size more than 2GB for pandas_udf functions

2020-12-29 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255968#comment-17255968
 ] 

Liya Fan commented on ARROW-10957:
--

Sure. Please find the test case here: 
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestRoundTrip.java#L254

> Expanding pyarrow buffer size more than 2GB for pandas_udf functions
> 
>
> Key: ARROW-10957
> URL: https://issues.apache.org/jira/browse/ARROW-10957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java, Python
>Affects Versions: 2.0.0
> Environment: Spark: 2.4.4
> Python:
> cycler (0.10.0)
> glmnet-py (0.1.0b2)
> joblib (1.0.0)
> kiwisolver (1.3.1)
> lightgbm (3.1.1)
> matplotlib (3.0.3)
> numpy (1.19.4)
> pandas (1.1.5)
> pip (9.0.3)
> pyarrow (2.0.0)
> pyparsing (2.4.7)
> python-dateutil (2.8.1)
> pytz (2020.4)
> scikit-learn (0.23.2)
> scipy (1.5.4)
> setuptools (51.0.0)
> six (1.15.0)
> sklearn (0.0)
> threadpoolctl (2.1.0)
> venv-pack (0.2.0)
> wheel (0.36.2)
>Reporter: Dmitry Kravchuk
>Priority: Major
>  Labels: features, patch, performance
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> There is a 2GB limit on the data that can be passed to any pandas_udf function, 
> and the aim of this issue is to expand this limit. It's a very small buffer 
> size if we use pyspark and our goal is fitting machine learning models.
> Steps to reproduce: use the following spark-submit command to execute the 
> Python function below.
> {code:java}
> %sh
> cd /home/zeppelin/code && \
> export PYSPARK_DRIVER_PYTHON=/home/zeppelin/envs/env3/bin/python && \
> export PYSPARK_PYTHON=./env3/bin/python && \
> export ARROW_PRE_0_15_IPC_FORMAT=1 && \
> spark-submit \
> --master yarn \
> --deploy-mode client \
> --num-executors 5 \
> --executor-cores 5 \
> --driver-memory 8G \
> --executor-memory 8G \
> --conf spark.executor.memoryOverhead=4G \
> --conf spark.driver.memoryOverhead=4G \
> --archives /home/zeppelin/env3.tar.gz#env3 \
> --jars "/opt/deltalake/delta-core_2.11-0.5.0.jar" \
> --py-files jobs.zip,"/opt/deltalake/delta-core_2.11-0.5.0.jar" main.py \
> --job temp
> {code}
>  
> {code:java|title=Bar.Python|borderStyle=solid}
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> def analyze(spark):
> pdf1 = pd.DataFrame(
> [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
> columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
> [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
> columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
> import os
> os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
> return df
> df4 = df3 \
> .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
> .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
> .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
> .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
> .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
> .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
> .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
> .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
> .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
> .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))
> print(df4.printSchema())
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> {code}
> If you need more details please let me know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9934) [Rust] Shape and stride check in tensor

2020-12-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9934:
-

Assignee: Fernando Herrera

> [Rust] Shape and stride check in tensor
> ---
>
> Key: ARROW-9934
> URL: https://issues.apache.org/jira/browse/ARROW-9934
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Fernando Herrera
>Assignee: Fernando Herrera
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When creating a tensor there is no check for the supplied shape and stride. 
> There should be a check before creating the tensor object.
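
A hedged sketch of what such a check could look like, independent of the 
arrow-rs Tensor API (strides are counted in elements here; the real 
implementation may use bytes):

{code}
// Hypothetical validation: the shape must fit in the buffer, strides must have
// the same rank as the shape, and the largest reachable index must stay in bounds.
fn validate_tensor(data_len: usize, shape: &[usize], strides: Option<&[usize]>) -> Result<(), String> {
    let needed: usize = shape.iter().product();
    if needed > data_len {
        return Err(format!("shape needs {} elements, buffer has {}", needed, data_len));
    }
    if let Some(strides) = strides {
        if strides.len() != shape.len() {
            return Err("shape and strides must have the same rank".to_string());
        }
        // Largest flat index reachable with this shape/stride combination.
        let max_index: usize = shape
            .iter()
            .zip(strides)
            .map(|(&dim, &stride)| stride * dim.saturating_sub(1))
            .sum();
        if needed > 0 && max_index >= data_len {
            return Err("strides reach outside the buffer".to_string());
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_tensor(6, &[2, 3], None).is_ok());
    assert!(validate_tensor(6, &[2, 3], Some(&[3, 1])).is_ok()); // row-major
    assert!(validate_tensor(6, &[2, 3], Some(&[4, 1])).is_err()); // out of bounds
}
{code}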



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9908) [Rust] Support temporal data types in JSON reader

2020-12-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9908:
-

Assignee: Christoph Schulze

> [Rust] Support temporal data types in JSON reader
> -
>
> Key: ARROW-9908
> URL: https://issues.apache.org/jira/browse/ARROW-9908
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Christoph Schulze
>Assignee: Christoph Schulze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the JSON reader does not support any temporal data types. Columns 
> with *numerical* data should be interpretable as a temporal type when defined 
> accordingly in the schema. Currently this throws an error with a 
> misleading message ("struct types are not yet supported").
> Related issue:
> https://issues.apache.org/jira/browse/ARROW-4803 focuses on parsing temporal 
> data based on string inputs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10095) [Rust] [Parquet] Update for IPC changes

2020-12-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-10095:
--

Assignee: Carol Nichols

> [Rust] [Parquet] Update for IPC changes
> ---
>
> Key: ARROW-10095
> URL: https://issues.apache.org/jira/browse/ARROW-10095
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 1.0.1
>Reporter: Neville Dipale
>Assignee: Carol Nichols
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The IPC changes made to comply with MetadataVersion 4 broke the rust-parquet 
> writer branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11061) [Rust] Validate array properties against schema

2020-12-29 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255926#comment-17255926
 ] 

Neville Dipale commented on ARROW-11061:


[~andygrove] [~alamb] [~jorgecarleitao] I don't know if you've encountered the 
issues above, but they're making my work very difficult on the parquet writer. 
There are equivalent checks in the C++ implementation, but I haven't looked at 
them in detail yet.

> [Rust] Validate array properties against schema
> ---
>
> Key: ARROW-11061
> URL: https://issues.apache.org/jira/browse/ARROW-11061
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>
> We have a problem when it comes to nested arrays, where one could create a 
> list array whose child field can't be null, but 
> the list itself can have null slots.
> This creates a lot of work when working with such nested arrays, because we 
> have to create work-arounds to account for this, and take unnecessarily 
> slower paths.
> I propose that we prevent this problem at the source, by:
>  * checking that a batch can't be created with arrays that have incompatible 
> null contracts
>  * preventing list and struct children from being non-null if any descendant 
> of such children are null (might be less of an issue for structs)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11063) [Rust] Validate null counts when building arrays

2020-12-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11063:
---
Labels: pull-request-available  (was: )

> [Rust] Validate null counts when building arrays
> 
>
> Key: ARROW-11063
> URL: https://issues.apache.org/jira/browse/ARROW-11063
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ArrayDataBuilder allows the user to specify a null count, or calculates it 
> if it is not set.
> The problem is that the user-specified null count is never validated against 
> the actual null count of the buffer.
> I suggest removing the ability to specify a null count, and instead always 
> calculating it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11063) [Rust] Validate null counts when building arrays

2020-12-29 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11063:
--

 Summary: [Rust] Validate null counts when building arrays
 Key: ARROW-11063
 URL: https://issues.apache.org/jira/browse/ARROW-11063
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


ArrayDataBuilder allows the user to specify a null count, or calculates it if 
it is not set.

The problem is that the user-specified null count is never validated against 
the actual null count of the buffer.

I suggest removing the ability to specify a null count, and instead always 
calculating it.
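
If the option to pass a null count is kept, a validation step could compare the 
declared count against the validity bitmap. A minimal sketch, independent of 
the arrow-rs ArrayData API (assumes LSB bit-packed validity, where an unset bit 
means null):

{code}
// Hypothetical check: recount nulls from the validity bitmap and compare with
// the caller-supplied value.
fn count_nulls(validity: &[u8], len: usize) -> usize {
    (0..len)
        .filter(|&i| validity[i / 8] & (1u8 << (i % 8)) == 0) // unset bit => null slot
        .count()
}

fn validate_null_count(validity: &[u8], len: usize, declared: usize) -> Result<(), String> {
    let actual = count_nulls(validity, len);
    if actual == declared {
        Ok(())
    } else {
        Err(format!("declared null count {} but bitmap has {} nulls", declared, actual))
    }
}

fn main() {
    // 0b0000_0101: slots 0 and 2 are valid, slots 1 and 3 are null (len = 4).
    let validity = [0b0000_0101u8];
    assert!(validate_null_count(&validity, 4, 2).is_ok());
    assert!(validate_null_count(&validity, 4, 1).is_err());
}
{code}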



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11061) [Rust] Validate array properties against schema

2020-12-29 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-11061:
---
Component/s: Rust

> [Rust] Validate array properties against schema
> ---
>
> Key: ARROW-11061
> URL: https://issues.apache.org/jira/browse/ARROW-11061
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Priority: Major
>
> We have a problem when it comes to nested arrays, where one could create a 
> list array whose child field can't be null, but 
> the list itself can have null slots.
> This creates a lot of work when working with such nested arrays, because we 
> have to create work-arounds to account for this, and take unnecessarily 
> slower paths.
> I propose that we prevent this problem at the source, by:
>  * checking that a batch can't be created with arrays that have incompatible 
> null contracts
>  * preventing list and struct children from being non-null if any descendant 
> of such children are null (might be less of an issue for structs)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11062) When writing to flight stream, Spark's mapPartitions is not working

2020-12-29 Thread Ravi Shankar (Jira)
Ravi Shankar created ARROW-11062:


 Summary: When writing to flight stream, Spark's mapPartitions is 
not working
 Key: ARROW-11062
 URL: https://issues.apache.org/jira/browse/ARROW-11062
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 2.0.0
Reporter: Ravi Shankar
 Fix For: 2.0.0


Hi,

I have the following method:

 

val outRDD = myRdd.mapPartitions { it =>
  val l = Location.forGrpcInsecure("10.0.0.113", 12233);
  val allocator = it.allocator.newChildAllocator("SparkFlightConnector", 0, Long.MaxValue)
  val client = FlightClient.builder(allocator, l).build();
  val desc = FlightDescriptor.path("wonderful")
  val stream = client.startPut(desc, it.root, new AsyncPutListener)
  it.foreach { root =>
    // doPut on the populated VectorSchemaRoot
    stream.putNext()
  }
  stream.completed()
  // Need to call this, or exceptions from the server get swallowed
  stream.getResult
  //   println(it.next().contentToTSVString())
  client.close()
  Iterator.empty
}.count

 

Following is the error:

 

Caused by: java.lang.NoSuchMethodError: 
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;CLjava/lang/Object;)V
 at io.grpc.Metadata$Key.validateName(Metadata.java:742)
 at io.grpc.Metadata$Key.<init>(Metadata.java:750)
 at io.grpc.Metadata$Key.<init>(Metadata.java:668)
 at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:959)
 at io.grpc.Metadata$AsciiKey.<init>(Metadata.java:954)
 at io.grpc.Metadata$Key.of(Metadata.java:705)
 at io.grpc.Metadata$Key.of(Metadata.java:701)
 at io.grpc.internal.GrpcUtil.<clinit>(GrpcUtil.java:80)

 

When I googled, it seems to be some guava-related jar issue. I ran a maven 
dependency tree and did not find anything wrong. Please help.

 

Best,

Ravion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10925) [Rust] Validate temporal data that has restrictions

2020-12-29 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255893#comment-17255893
 ] 

Neville Dipale commented on ARROW-10925:


[~andygrove] [~jorgecarleitao] [~alamb] I don't know if you've come across 
issues like the one on this JIRA, but it's been inconvenient for me while 
working on the parquet writer.

> [Rust] Validate temporal data that has restrictions
> ---
>
> Key: ARROW-10925
> URL: https://issues.apache.org/jira/browse/ARROW-10925
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Neville Dipale
>Priority: Major
>
> Some temporal data types have restrictions (e.g. date64 values should be a 
> multiple of 86,400,000 ms, i.e. whole days). We should validate them when 
> creating the arrays.
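
A hedged sketch of that validation for Date64 (assuming the intended constraint 
is whole days, i.e. multiples of 86,400,000 milliseconds; the function name is 
hypothetical, not the arrow-rs API):

{code}
// Hypothetical check run before building a Date64 array.
const MS_PER_DAY: i64 = 86_400_000;

fn validate_date64(values: &[i64]) -> Result<(), String> {
    for (index, value) in values.iter().enumerate() {
        if value % MS_PER_DAY != 0 {
            return Err(format!(
                "value {} at index {} is not a multiple of {} ms",
                value, index, MS_PER_DAY
            ));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_date64(&[0, MS_PER_DAY, 3 * MS_PER_DAY]).is_ok());
    assert!(validate_date64(&[1_234]).is_err());
}
{code}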



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11061) [Rust] Validate array properties against schema

2020-12-29 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11061:
--

 Summary: [Rust] Validate array properties against schema
 Key: ARROW-11061
 URL: https://issues.apache.org/jira/browse/ARROW-11061
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


We have a problem when it comes to nested arrays, where one could create a 
list array whose child field can't be null, but the 
list itself can have null slots.

This creates a lot of work when working with such nested arrays, because we 
have to create work-arounds to account for this, and take unnecessarily slower 
paths.

I propose that we prevent this problem at the source, by:
 * checking that a batch can't be created with arrays that have incompatible 
null contracts
 * preventing list and struct children from being non-null if any descendant of 
such children are null (might be less of an issue for structs)
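
As a hedged illustration of the first check (generic stand-in types, not the 
arrow-rs RecordBatch API): when a batch is assembled, any column whose schema 
field is declared non-nullable but whose array contains nulls would be rejected 
up front.

{code}
// Hypothetical pre-flight check when assembling a batch: schema nullability
// must agree with the actual null counts of the arrays.
struct Field {
    name: String,
    nullable: bool,
}

struct Column {
    null_count: usize,
}

fn validate_batch(fields: &[Field], columns: &[Column]) -> Result<(), String> {
    for (field, column) in fields.iter().zip(columns) {
        if !field.nullable && column.null_count > 0 {
            return Err(format!(
                "column '{}' is declared non-nullable but contains {} null(s)",
                field.name, column.null_count
            ));
        }
    }
    Ok(())
}

fn main() {
    let fields = vec![Field { name: "item".to_string(), nullable: false }];
    let columns = vec![Column { null_count: 3 }];
    assert!(validate_batch(&fields, &columns).is_err());
}
{code}

The same idea extends recursively to list and struct children, which is where 
the incompatible null contracts described above show up.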



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10957) Expanding pyarrow buffer size more than 2GB for pandas_udf functions

2020-12-29 Thread Dmitry Kravchuk (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255871#comment-17255871
 ] 

Dmitry Kravchuk commented on ARROW-10957:
-

[~fan_li_ya] Where should I look for a Java code example of 
TestRoundTrip#testMetadata? Can you share a link?

> Expanding pyarrow buffer size more than 2GB for pandas_udf functions
> 
>
> Key: ARROW-10957
> URL: https://issues.apache.org/jira/browse/ARROW-10957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Java, Python
>Affects Versions: 2.0.0
> Environment: Spark: 2.4.4
> Python:
> cycler (0.10.0)
> glmnet-py (0.1.0b2)
> joblib (1.0.0)
> kiwisolver (1.3.1)
> lightgbm (3.1.1)
> matplotlib (3.0.3)
> numpy (1.19.4)
> pandas (1.1.5)
> pip (9.0.3)
> pyarrow (2.0.0)
> pyparsing (2.4.7)
> python-dateutil (2.8.1)
> pytz (2020.4)
> scikit-learn (0.23.2)
> scipy (1.5.4)
> setuptools (51.0.0)
> six (1.15.0)
> sklearn (0.0)
> threadpoolctl (2.1.0)
> venv-pack (0.2.0)
> wheel (0.36.2)
>Reporter: Dmitry Kravchuk
>Priority: Major
>  Labels: features, patch, performance
> Fix For: 2.0.1
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> There is a 2GB limit on the data that can be passed to any pandas_udf function, 
> and the aim of this issue is to expand this limit. It's a very small buffer 
> size if we use pyspark and our goal is fitting machine learning models.
> Steps to reproduce: use the following spark-submit command to execute the 
> Python function below.
> {code:java}
> %sh
> cd /home/zeppelin/code && \
> export PYSPARK_DRIVER_PYTHON=/home/zeppelin/envs/env3/bin/python && \
> export PYSPARK_PYTHON=./env3/bin/python && \
> export ARROW_PRE_0_15_IPC_FORMAT=1 && \
> spark-submit \
> --master yarn \
> --deploy-mode client \
> --num-executors 5 \
> --executor-cores 5 \
> --driver-memory 8G \
> --executor-memory 8G \
> --conf spark.executor.memoryOverhead=4G \
> --conf spark.driver.memoryOverhead=4G \
> --archives /home/zeppelin/env3.tar.gz#env3 \
> --jars "/opt/deltalake/delta-core_2.11-0.5.0.jar" \
> --py-files jobs.zip,"/opt/deltalake/delta-core_2.11-0.5.0.jar" main.py \
> --job temp
> {code}
>  
> {code:java|title=Bar.Python|borderStyle=solid}
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> def analyze(spark):
> pdf1 = pd.DataFrame(
> [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
> columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
> [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
> columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
> import os
> os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
> return df
> df4 = df3 \
> .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
> .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
> .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
> .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
> .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
> .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
> .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
> .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
> .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
> .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))
> print(df4.printSchema())
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> {code}
> If you need more details please let me know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)