[jira] [Updated] (ARROW-15072) [R] Error: This build of the arrow package does not support Datasets
[ https://issues.apache.org/jira/browse/ARROW-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hu geme updated ARROW-15072:
----------------------------
    Summary: [R] Error: This build of the arrow package does not support Datasets  (was: Error: This build of the arrow package does not support Datasets)

> [R] Error: This build of the arrow package does not support Datasets
> --------------------------------------------------------------------
>
>                 Key: ARROW-15072
>                 URL: https://issues.apache.org/jira/browse/ARROW-15072
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, R
>    Affects Versions: 6.0.1
>         Environment: x86_64-pc-linux-gnu (64-bit) via rocker/docker rocker/r-base:4.1.2
>            Reporter: hu geme
>            Priority: Minor
>             Fix For: 6.0.1
>
> Hello,
> I would like to report a possible issue (or I did not grasp the documentation, in which case I apologize in advance).
> I'm trying to use R with arrow on Docker:
> {code:java}
> FROM rocker/r-base:4.1.2
> RUN apt update && \
>     apt install -y -V ca-certificates lsb-release wget && \
>     wget "https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb" && \
>     apt-get -y --no-install-recommends install ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
> RUN apt update && \
>     apt install -y -V -f libarrow-dev libarrow-dataset-dev libarrow-glib-dev libarrow-flight-dev libparquet-dev libparquet-glib-dev
> RUN install2.r --error arrow
> {code}
> That's the output of sessionInfo() from the container running R:
> {code:java}
> sessionInfo()
> R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Debian GNU/Linux 11 (bullseye)
>
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.18.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8
>  [9] LC_ADDRESS=en_US.UTF-8     LC_TELEPHONE=en_US.UTF-8
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] arrow_6.0.1 DBI_1.1.1
>
> loaded via a namespace (and not attached):
>  [1] tidyselect_1.1.1   bit_4.0.4          compiler_4.1.2     magrittr_2.0.1
>  [5] assertthat_0.2.1   R6_2.5.1           tools_4.1.2        glue_1.5.1
>  [9] bit64_4.0.5        vctrs_0.3.8        RJDBC_0.2-8        rlang_0.4.12
> [13] rJava_1.0-5        AWR.Athena_2.0.7-0 purrr_0.3.4
> {code}
> As far as I understand, all requirements to use Datasets are fulfilled:
> R version 4.1.2
> Platform: x86_64-pc-linux-gnu (64-bit)
> arrow_6.0.1
> {code:java}
> > .Machine$sizeof.pointer < 8
> [1] FALSE
> > getRversion() < "4.0.0"
> [1] FALSE
> > tolower(Sys.info()[["sysname"]]) == "windows"
> [1] FALSE
> {code}
> Nevertheless, I get
> Error: This build of the arrow package does not support Datasets
> in return when calling
> {code:java}
> arrow::open_dataset(sources = path)
> {code}
> Appreciate any help!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Resolved] (ARROW-15038) [Packaging][CentOS] Drop support for CentOS 8
[ https://issues.apache.org/jira/browse/ARROW-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou resolved ARROW-15038.
----------------------------------
    Fix Version/s: 7.0.0
       Resolution: Fixed

Issue resolved by pull request 11909
[https://github.com/apache/arrow/pull/11909]

> [Packaging][CentOS] Drop support for CentOS 8
> ---------------------------------------------
>
>                 Key: ARROW-15038
>                 URL: https://issues.apache.org/jira/browse/ARROW-15038
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Packaging
>            Reporter: Kouhei Sutou
>            Assignee: Kouhei Sutou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Because it will reach EOL in December 2021.
[jira] [Created] (ARROW-15074) [C++] Support multiple frames in LZ4?
Jorge Leitão created ARROW-15074:
---------------------------------

             Summary: [C++] Support multiple frames in LZ4?
                 Key: ARROW-15074
                 URL: https://issues.apache.org/jira/browse/ARROW-15074
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Jorge Leitão
         Attachments: b.arrow

When reading an Arrow file whose buffers are LZ4-compressed with multiple frames, we get

{code:java}
OSError: Lz4 compressed input contains more than one frame
{code}

Attached is an example of such a file, which can be opened with

{code:java}
import pyarrow as pa
import pyarrow.ipc

with pa.ipc.open_file("b.arrow") as reader:
    print(reader.get_batch(0))
{code}

and fails with the error above.

The LZ4 frame format supports multiple frames, and the Arrow spec does not state that a buffer may contain only a single frame.
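For context on what "more than one frame" means here: every LZ4 frame begins with a fixed little-endian magic word, so concatenating two compressed frames yields a buffer with that magic word appearing twice. The following is an illustrative sketch only (the helper is hypothetical, not Arrow code), and deliberately naive:

```python
import struct

# LZ4 frame format: every frame starts with the little-endian magic
# word 0x184D2204 (per the LZ4 frame format specification).
LZ4_FRAME_MAGIC = struct.pack("<I", 0x184D2204)

def count_frame_magics(buf: bytes) -> int:
    # Naive illustration: count occurrences of the magic word. A real
    # parser must walk frame headers and block lengths instead, since
    # these four bytes could also occur inside compressed payload data.
    return buf.count(LZ4_FRAME_MAGIC)

# A buffer that concatenates two (fake, truncated) frames, as a
# multi-frame LZ4 writer might produce.
fake_two_frames = LZ4_FRAME_MAGIC + b"\x00" * 7 + LZ4_FRAME_MAGIC + b"\x00" * 7
print(count_frame_magics(fake_two_frames))  # 2
```

A decompressor that stops after the first frame, or that rejects trailing data (as the error above suggests Arrow's does), would only ever consume the first of the two.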
[jira] [Comment Edited] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457700#comment-17457700 ]

Micah Kornfield edited comment on ARROW-15073 at 12/11/21, 6:07 PM:
--------------------------------------------------------------------

This is expected: LZ4 has always had compatibility issues, which is why LZ4_RAW was introduced in https://issues.apache.org/jira/browse/PARQUET-1998. I'm trying to find the Jira for the Java implementation of LZ4_RAW but failing; to my knowledge it hasn't been done yet. [https://github.com/apache/parquet-format/pull/168/files] has a description.

ZSTD is news to me, but I haven't been tracking it carefully. I thought it was tested at some point, but we might need to have a conversation about that one as well.

was (Author: emkornfield):
This is expected: LZ4 has always had compatibility issues, which is why LZ4_RAW was introduced in https://issues.apache.org/jira/browse/PARQUET-1998. I'm trying to find the Jira for the Java implementation of LZ4_RAW but failing; to my knowledge it hasn't been done yet.

ZSTD is news to me, but I haven't been tracking it carefully. I thought it was tested at some point, but we might need to have a conversation about that one as well.
> [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15073
>                 URL: https://issues.apache.org/jira/browse/ARROW-15073
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The following snippet shows the issue:
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
>
> path = "bla.parquet"
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018
>
> pyarrow reads the file correctly.
[jira] [Commented] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457700#comment-17457700 ]

Micah Kornfield commented on ARROW-15073:
-----------------------------------------

This is expected: LZ4 has always had compatibility issues, which is why LZ4_RAW was introduced in https://issues.apache.org/jira/browse/PARQUET-1998. I'm trying to find the Jira for the Java implementation of LZ4_RAW but failing; to my knowledge it hasn't been done yet.

ZSTD is news to me, but I haven't been tracking it carefully. I thought it was tested at some point, but we might need to have a conversation about that one as well.
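One plausible reading of the "Required field 'codec' was not present" failure, at least for the LZ4 case: parquet-format's Thrift CompressionCodec enum gained values over time (LZ4_RAW = 7 was the one added by PARQUET-1998), and a Thrift deserializer that encounters an enum value it does not know may drop the field, so a field that is required appears absent. The enum values below are from parquet.thrift; the toy decoder is hypothetical, illustrating the failure mode rather than Spark's actual code:

```python
from enum import IntEnum

class CompressionCodec(IntEnum):
    # Values as defined in parquet-format's parquet.thrift.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZO = 3
    BROTLI = 4
    LZ4 = 5       # older Hadoop-style LZ4 framing with known interop issues
    ZSTD = 6
    LZ4_RAW = 7   # added by PARQUET-1998 to replace LZ4

def decode_codec(value):
    # Toy model of lenient Thrift enum handling: an unknown numeric
    # value cannot be mapped, so the field effectively becomes null,
    # and a later "required field" check blows up.
    try:
        return CompressionCodec(value)
    except ValueError:
        return None
```

Under this reading, a reader built against an older enum (say, one ending at ZSTD = 6) would turn a file written with codec 7 into `codec:null`, matching the stack trace above.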
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Component/s: C++
                 Parquet
[jira] [Updated] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Summary: [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark  (was: [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark)
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Description: 
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018

pyarrow reads the file correctly.
was:
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018

pyarrow reads the file correctly.
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Description: 
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018

pyarrow reads the file correctly.
was:
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with a failure in the Thrift protocol:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018
[jira] [Commented] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457693#comment-17457693 ]

Jorge Leitão commented on ARROW-15073:
--------------------------------------

cc [~emkornfield] [~apitrou]
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Summary: [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark  (was: [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark)
[jira] [Created] (ARROW-15073) [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark
Jorge Leitão created ARROW-15073:
---------------------------------

             Summary: [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark
                 Key: ARROW-15073
                 URL: https://issues.apache.org/jira/browse/ARROW-15073
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Jorge Leitão

The following snippet shows the issue:

{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}

This fails with a failure in the Thrift protocol:

{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}

Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018
[jira] [Updated] (ARROW-15056) [C++] Speed up GcsFileSystem tests
[ https://issues.apache.org/jira/browse/ARROW-15056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-15056:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Speed up GcsFileSystem tests
> ----------------------------------
>
>                 Key: ARROW-15056
>                 URL: https://issues.apache.org/jira/browse/ARROW-15056
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Carlos O'Ryan
>            Assignee: Carlos O'Ryan
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The tests spend too much time starting and shutting down the testbench. A single testbench should be enough for all the tests. This may require helpers so that each test uses a different bucket, or something similar.
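The refactor described above (one shared testbench, per-test buckets) can be sketched with the classic shared-fixture pattern. This is an illustrative Python sketch, not the actual C++ test suite: `FakeTestbench` is a hypothetical stand-in for the expensive-to-start GCS testbench process.

```python
import unittest
import uuid

class FakeTestbench:
    # Hypothetical stand-in for an emulator process that is costly to
    # start and stop; here it only tracks which buckets exist.
    def __init__(self):
        self.buckets = set()

    def create_bucket(self, name):
        self.buckets.add(name)

class GcsStyleTests(unittest.TestCase):
    # Pay the testbench startup cost once for the whole class instead
    # of once per test method.
    @classmethod
    def setUpClass(cls):
        cls.testbench = FakeTestbench()

    def setUp(self):
        # Each test isolates itself with a uniquely named bucket in
        # the shared testbench.
        self.bucket = f"test-{uuid.uuid4().hex}"
        self.testbench.create_bucket(self.bucket)

    def test_bucket_exists(self):
        self.assertIn(self.bucket, self.testbench.buckets)

    def test_buckets_do_not_collide(self):
        self.assertTrue(self.bucket.startswith("test-"))
```

The trade-off is the usual one for shared fixtures: faster suites in exchange for tests that must not assume a pristine testbench, hence the random bucket names.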
[jira] [Updated] (ARROW-15072) Error: This build of the arrow package does not support Datasets
[ https://issues.apache.org/jira/browse/ARROW-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hu geme updated ARROW-15072: Description:

Hello,

I would like to report a possible issue (or I did not grasp the documentation, in which case I apologize in advance). I'm trying to use R with arrow on Docker:

{code:java}
FROM rocker/r-base:4.1.2

RUN apt update && \
    apt install -y -V ca-certificates lsb-release wget && \
    wget "https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb" && \
    apt-get -y --no-install-recommends install \
        ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb

RUN apt update && \
    apt install -y -V -f \
        libarrow-dev \
        libarrow-dataset-dev \
        libarrow-glib-dev \
        libarrow-flight-dev \
        libparquet-dev \
        libparquet-glib-dev

RUN install2.r --error \
    arrow
{code}

This is the output of sessionInfo() from R running in the container:

{code:java}
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8
 [9] LC_ADDRESS=en_US.UTF-8     LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_6.0.1 DBI_1.1.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1   bit_4.0.4          compiler_4.1.2     magrittr_2.0.1
 [5] assertthat_0.2.1   R6_2.5.1           tools_4.1.2        glue_1.5.1
 [9] bit64_4.0.5        vctrs_0.3.8        RJDBC_0.2-8        rlang_0.4.12
[13] rJava_1.0-5        AWR.Athena_2.0.7-0 purrr_0.3.4
{code}

As far as I understand, all requirements to use Datasets are fulfilled (R version 4.1.2, platform x86_64-pc-linux-gnu (64-bit), arrow_6.0.1):

{code:java}
> .Machine$sizeof.pointer < 8
[1] FALSE
> getRversion() < "4.0.0"
[1] FALSE
> tolower(Sys.info()[["sysname"]]) == "windows"
[1] FALSE
{code}

Nevertheless, I get

Error: This build of the arrow package does not support Datasets

when calling

{code:java}
arrow::open_dataset(sources = path)
{code}

Appreciate any help!
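As a side note for anyone triaging this: beyond the OS/pointer-size/R-version checks quoted above, the arrow R package can report directly which optional features a given build was compiled with. As far as I know (arrow 6.x API; worth double-checking against the package reference), these helpers exist:

{code:java}
library(arrow)

# Should return TRUE when this build of the package includes Dataset
# support; open_dataset() raises the error above when it is FALSE
arrow_with_dataset()

# Summarizes package/C++ library versions and compiled-in capabilities
arrow_info()
{code}

If arrow_with_dataset() returns FALSE, the package was compiled without the dataset component of the C++ library, regardless of the platform checks passing.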
[jira] [Updated] (ARROW-15072) Error: This build of the arrow package does not support Datasets
[ https://issues.apache.org/jira/browse/ARROW-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hu geme updated ARROW-15072: Priority: Minor (was: Trivial)

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15072) Error: This build of the arrow package does not support Datasets
hu geme created ARROW-15072:
----------------------------
Summary: Error: This build of the arrow package does not support Datasets
Key: ARROW-15072
URL: https://issues.apache.org/jira/browse/ARROW-15072
Project: Apache Arrow
Issue Type: Bug
Components: Parquet, R
Affects Versions: 6.0.1
Environment: x86_64-pc-linux-gnu (64-bit) via Docker (rocker/r-base:4.1.2)
Reporter: hu geme
Fix For: 6.0.1

-- This message was sent by Atlassian Jira (v8.20.1#820001)
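One thing that may be worth trying (based on the arrow R package's installation documentation, which describes the ARROW_USE_PKG_CONFIG environment variable; I have not verified that it fixes this particular image): force the package's configure script to locate the system Arrow libraries via pkg-config when the package is built from source, e.g. by changing the last Dockerfile step above to:

{code:java}
# Hypothetical tweak to the Dockerfile above: build the R package with
# pkg-config discovery of the already-installed system libarrow enabled
RUN ARROW_USE_PKG_CONFIG=true install2.r --error \
    arrow
{code}

If the configure script silently fails to find libarrow-dataset, the package falls back to a build without Dataset support, which would match the error reported here.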