[jira] [Updated] (ARROW-15072) [R] Error: This build of the arrow package does not support Datasets
[ https://issues.apache.org/jira/browse/ARROW-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hu geme updated ARROW-15072:
----------------------------
    Summary: [R] Error: This build of the arrow package does not support Datasets  (was: Error: This build of the arrow package does not support Datasets)

> [R] Error: This build of the arrow package does not support Datasets
> --------------------------------------------------------------------
>
>                 Key: ARROW-15072
>                 URL: https://issues.apache.org/jira/browse/ARROW-15072
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, R
>    Affects Versions: 6.0.1
>         Environment: x86_64-pc-linux-gnu (64-bit) via rocker/docker rocker/r-base:4.1.2
>            Reporter: hu geme
>            Priority: Minor
>             Fix For: 6.0.1
>
> Hello,
> I would like to report a possible issue (or I did not grasp the documentation, in which case I apologize in advance).
> I'm trying to use R with arrow on Docker:
> {code:java}
> FROM rocker/r-base:4.1.2
> RUN apt update && \
>     apt install -y -V ca-certificates lsb-release wget && \
>     wget "https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb" && \
>     apt-get -y --no-install-recommends install ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
> RUN apt update && \
>     apt install -y -V -f libarrow-dev libarrow-dataset-dev libarrow-glib-dev libarrow-flight-dev libparquet-dev libparquet-glib-dev
> RUN install2.r --error arrow
> {code}
> That's the output of sessionInfo() from the container running R:
> {code:java}
> sessionInfo()
> R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Debian GNU/Linux 11 (bullseye)
>
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.18.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8
>  [9] LC_ADDRESS=en_US.UTF-8     LC_TELEPHONE=en_US.UTF-8
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] arrow_6.0.1 DBI_1.1.1
>
> loaded via a namespace (and not attached):
>  [1] tidyselect_1.1.1   bit_4.0.4          compiler_4.1.2     magrittr_2.0.1
>  [5] assertthat_0.2.1   R6_2.5.1           tools_4.1.2        glue_1.5.1
>  [9] bit64_4.0.5        vctrs_0.3.8        RJDBC_0.2-8        rlang_0.4.12
> [13] rJava_1.0-5        AWR.Athena_2.0.7-0 purrr_0.3.4
> {code}
> As far as I understand, all requirements to use Datasets are fulfilled:
> R version 4.1.2
> Platform: x86_64-pc-linux-gnu (64-bit)
> arrow_6.0.1
> {code:java}
> > .Machine$sizeof.pointer < 8
> [1] FALSE
> > getRversion() < "4.0.0"
> [1] FALSE
> > tolower(Sys.info()[["sysname"]]) == "windows"
> [1] FALSE
> {code}
> Nevertheless, I get
> Error: This build of the arrow package does not support Datasets
> in return when calling
> {code:java}
> arrow::open_dataset(sources = path)
> {code}
> Appreciate any help!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Resolved] (ARROW-15038) [Packaging][CentOS] Drop support for CentOS 8
[ https://issues.apache.org/jira/browse/ARROW-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou resolved ARROW-15038.
----------------------------------
    Fix Version/s: 7.0.0
       Resolution: Fixed

Issue resolved by pull request 11909
[https://github.com/apache/arrow/pull/11909]

> [Packaging][CentOS] Drop support for CentOS 8
> ---------------------------------------------
>
>                 Key: ARROW-15038
>                 URL: https://issues.apache.org/jira/browse/ARROW-15038
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Packaging
>            Reporter: Kouhei Sutou
>            Assignee: Kouhei Sutou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Because it will reach EOL in December 2021.
[jira] [Created] (ARROW-15074) [C++] Support multiple frames in LZ4?
Jorge Leitão created ARROW-15074:
---------------------------------

             Summary: [C++] Support multiple frames in LZ4?
                 Key: ARROW-15074
                 URL: https://issues.apache.org/jira/browse/ARROW-15074
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Jorge Leitão
         Attachments: b.arrow

When reading an Arrow file whose buffers are LZ4-compressed with multiple frames, we get

{code:java}
OSError: Lz4 compressed input contains more than one frame
{code}

Attached is an example of such a file, which can be opened with

{code:java}
import pyarrow as pa
import pyarrow.ipc

with pa.ipc.open_file("b.arrow") as reader:
    print(reader.get_batch(0))
{code}

and fails with the error above.

The LZ4 frame format supports multiple frames, and the Arrow spec does not state that a buffer may contain only a single frame.
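For context on what "more than one frame" means here: every LZ4 frame begins with a fixed little-endian magic word, so concatenating two compressed frames yields a buffer with that magic word appearing twice. The following is an illustrative sketch only (the helper is hypothetical, not Arrow code), and deliberately naive:

```python
import struct

# LZ4 frame format: every frame starts with the little-endian magic
# word 0x184D2204 (per the LZ4 frame format specification).
LZ4_FRAME_MAGIC = struct.pack("<I", 0x184D2204)

def count_frame_magics(buf: bytes) -> int:
    # Naive illustration: count occurrences of the magic word. A real
    # parser must walk frame headers and block lengths instead, since
    # these four bytes could also occur inside compressed payload data.
    return buf.count(LZ4_FRAME_MAGIC)

# A buffer that concatenates two (fake, truncated) frames, as a
# multi-frame LZ4 writer might produce.
fake_two_frames = LZ4_FRAME_MAGIC + b"\x00" * 7 + LZ4_FRAME_MAGIC + b"\x00" * 7
print(count_frame_magics(fake_two_frames))  # 2
```

A decompressor that stops after the first frame, or that rejects trailing data (as the error above suggests Arrow's does), would only ever consume the first of the two.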
[jira] [Comment Edited] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457700#comment-17457700 ]

Micah Kornfield edited comment on ARROW-15073 at 12/11/21, 6:07 PM:
--------------------------------------------------------------------

This is expected: LZ4 has always had compatibility issues, which is why LZ4_RAW was introduced in https://issues.apache.org/jira/browse/PARQUET-1998. I'm trying to find the Jira for the Java implementation of LZ4_RAW but failing; to my knowledge it hasn't been done yet. [https://github.com/apache/parquet-format/pull/168/files] has a description.

ZSTD is news to me, but I haven't been tracking it carefully. I thought it was tested at some point, but we might need to have a conversation about that one as well.

was (Author: emkornfield):
This is expected: LZ4 has always had compatibility issues, which is why LZ4_RAW was introduced in https://issues.apache.org/jira/browse/PARQUET-1998. I'm trying to find the Jira for the Java implementation of LZ4_RAW but failing; to my knowledge it hasn't been done yet.

ZSTD is news to me, but I haven't been tracking it carefully. I thought it was tested at some point, but we might need to have a conversation about that one as well.
> [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15073
>                 URL: https://issues.apache.org/jira/browse/ARROW-15073
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>
> The following snippet shows the issue:
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
>
> path = "bla.parquet"
> t = pa.table(
>     [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
>     schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
>     t,
>     path,
>     use_dictionary=False,
>     compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018
>
> pyarrow reads the file correctly.
[jira] [Commented] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457700#comment-17457700 ]

Micah Kornfield commented on ARROW-15073:
-----------------------------------------

This is expected: LZ4 has always had compatibility issues, which is why LZ4_RAW was introduced in https://issues.apache.org/jira/browse/PARQUET-1998. I'm trying to find the Jira for the Java implementation of LZ4_RAW but failing; to my knowledge it hasn't been done yet.

ZSTD is news to me, but I haven't been tracking it carefully. I thought it was tested at some point, but we might need to have a conversation about that one as well.
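One plausible reading of the "Required field 'codec' was not present" failure, at least for the LZ4 case: parquet-format's Thrift CompressionCodec enum gained values over time (LZ4_RAW = 7 was the one added by PARQUET-1998), and a Thrift deserializer that encounters an enum value it does not know may drop the field, so a field that is required appears absent. The enum values below are from parquet.thrift; the toy decoder is hypothetical, illustrating the failure mode rather than Spark's actual code:

```python
from enum import IntEnum

class CompressionCodec(IntEnum):
    # Values as defined in parquet-format's parquet.thrift.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZO = 3
    BROTLI = 4
    LZ4 = 5       # older Hadoop-style LZ4 framing with known interop issues
    ZSTD = 6
    LZ4_RAW = 7   # added by PARQUET-1998 to replace LZ4

def decode_codec(value):
    # Toy model of lenient Thrift enum handling: an unknown numeric
    # value cannot be mapped, so the field effectively becomes null,
    # and a later "required field" check blows up.
    try:
        return CompressionCodec(value)
    except ValueError:
        return None
```

Under this reading, a reader built against an older enum (say, one ending at ZSTD = 6) would turn a file written with codec 7 into `codec:null`, matching the stack trace above.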
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Component/s: C++
                 Parquet
[jira] [Updated] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Summary: [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark  (was: [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark)
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Description: 
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018

pyarrow reads the file correctly.
was:
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018

pyarrow reads the file correctly.
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Description: 
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018

pyarrow reads the file correctly.
was:
The following snippet shows the issue:
{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}
This fails with a failure in the Thrift protocol:
{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}
Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018
[jira] [Commented] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457693#comment-17457693 ]

Jorge Leitão commented on ARROW-15073:
--------------------------------------

cc [~emkornfield] [~apitrou]
[jira] [Updated] (ARROW-15073) [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark
[ https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-15073:
---------------------------------
    Summary: [C++][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark  (was: [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark)
[jira] [Created] (ARROW-15073) [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark
Jorge Leitão created ARROW-15073:
---------------------------------

             Summary: [C++][Python] LZ4-compressed parquet files are unreadable by (py)spark
                 Key: ARROW-15073
                 URL: https://issues.apache.org/jira/browse/ARROW-15073
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Jorge Leitão

The following snippet shows the issue:

{code:java}
import pyarrow as pa  # pyarrow==6.0.1
import pyarrow.parquet
import pyspark.sql  # pyspark==3.1.2

path = "bla.parquet"
t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=False)]),
)
pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet(path).select("int64").collect()
{code}

This fails with a failure in the Thrift protocol:

{code:java}
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)])
{code}

Found while debugging the root cause of https://github.com/pola-rs/polars/issues/2018
[jira] [Updated] (ARROW-15056) [C++] Speed up GcsFileSystem tests
[ https://issues.apache.org/jira/browse/ARROW-15056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-15056:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Speed up GcsFileSystem tests
> ----------------------------------
>
>                 Key: ARROW-15056
>                 URL: https://issues.apache.org/jira/browse/ARROW-15056
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Carlos O'Ryan
>            Assignee: Carlos O'Ryan
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The tests spend too much time starting and shutting down the testbench. A single testbench should be enough for all the tests. This may require helpers so that each test uses a different bucket, or something similar.
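The refactor described above (one shared testbench, per-test buckets) can be sketched with the classic shared-fixture pattern. This is an illustrative Python sketch, not the actual C++ test suite: `FakeTestbench` is a hypothetical stand-in for the expensive-to-start GCS testbench process.

```python
import unittest
import uuid

class FakeTestbench:
    # Hypothetical stand-in for an emulator process that is costly to
    # start and stop; here it only tracks which buckets exist.
    def __init__(self):
        self.buckets = set()

    def create_bucket(self, name):
        self.buckets.add(name)

class GcsStyleTests(unittest.TestCase):
    # Pay the testbench startup cost once for the whole class instead
    # of once per test method.
    @classmethod
    def setUpClass(cls):
        cls.testbench = FakeTestbench()

    def setUp(self):
        # Each test isolates itself with a uniquely named bucket in
        # the shared testbench.
        self.bucket = f"test-{uuid.uuid4().hex}"
        self.testbench.create_bucket(self.bucket)

    def test_bucket_exists(self):
        self.assertIn(self.bucket, self.testbench.buckets)

    def test_buckets_do_not_collide(self):
        self.assertTrue(self.bucket.startswith("test-"))
```

The trade-off is the usual one for shared fixtures: faster suites in exchange for tests that must not assume a pristine testbench, hence the random bucket names.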
[jira] [Updated] (ARROW-15072) Error: This build of the arrow package does not support Datasets
[ https://issues.apache.org/jira/browse/ARROW-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hu geme updated ARROW-15072: Description:

Hello,

I would like to report a possible issue (or I did not grasp the documentation, in which case I apologize in advance). I'm trying to use R with arrow on Docker:

{code:java}
FROM rocker/r-base:4.1.2

RUN apt update && \
    apt install -y -V ca-certificates lsb-release wget && \
    wget "https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb" && \
    apt-get -y --no-install-recommends install \
        ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb

RUN apt update && \
    apt install -y -V -f \
        libarrow-dev \
        libarrow-dataset-dev \
        libarrow-glib-dev \
        libarrow-flight-dev \
        libparquet-dev \
        libparquet-glib-dev

RUN install2.r --error \
    arrow
{code}

This is the output of sessionInfo() from R running in the container:

{code:java}
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=en_US.UTF-8
 [9] LC_ADDRESS=en_US.UTF-8     LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_6.0.1 DBI_1.1.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1   bit_4.0.4          compiler_4.1.2     magrittr_2.0.1
 [5] assertthat_0.2.1   R6_2.5.1           tools_4.1.2        glue_1.5.1
 [9] bit64_4.0.5        vctrs_0.3.8        RJDBC_0.2-8        rlang_0.4.12
[13] rJava_1.0-5        AWR.Athena_2.0.7-0 purrr_0.3.4
{code}

As far as I understand, all requirements to use Datasets are fulfilled (R version 4.1.2, platform x86_64-pc-linux-gnu (64-bit), arrow_6.0.1):

{code:java}
> .Machine$sizeof.pointer < 8
[1] FALSE
> getRversion() < "4.0.0"
[1] FALSE
> tolower(Sys.info()[["sysname"]]) == "windows"
[1] FALSE
{code}

Nevertheless, I get

Error: This build of the arrow package does not support Datasets

when calling

{code:java}
arrow::open_dataset(sources = path)
{code}

Appreciate any help!
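As a side note for anyone triaging this: beyond the OS/pointer-size/R-version checks quoted above, the arrow R package can report directly which optional features a given build was compiled with. As far as I know (arrow 6.x API; worth double-checking against the package reference), these helpers exist:

{code:java}
library(arrow)

# Should return TRUE when this build of the package includes Dataset
# support; open_dataset() raises the error above when it is FALSE
arrow_with_dataset()

# Summarizes package/C++ library versions and compiled-in capabilities
arrow_info()
{code}

If arrow_with_dataset() returns FALSE, the package was compiled without the dataset component of the C++ library, regardless of the platform checks passing.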
[jira] [Updated] (ARROW-15072) Error: This build of the arrow package does not support Datasets
[ https://issues.apache.org/jira/browse/ARROW-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hu geme updated ARROW-15072: Priority: Minor (was: Trivial)

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15072) Error: This build of the arrow package does not support Datasets
hu geme created ARROW-15072:
----------------------------
Summary: Error: This build of the arrow package does not support Datasets
Key: ARROW-15072
URL: https://issues.apache.org/jira/browse/ARROW-15072
Project: Apache Arrow
Issue Type: Bug
Components: Parquet, R
Affects Versions: 6.0.1
Environment: x86_64-pc-linux-gnu (64-bit) via Docker (rocker/r-base:4.1.2)
Reporter: hu geme
Fix For: 6.0.1

-- This message was sent by Atlassian Jira (v8.20.1#820001)
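One thing that may be worth trying (based on the arrow R package's installation documentation, which describes the ARROW_USE_PKG_CONFIG environment variable; I have not verified that it fixes this particular image): force the package's configure script to locate the system Arrow libraries via pkg-config when the package is built from source, e.g. by changing the last Dockerfile step above to:

{code:java}
# Hypothetical tweak to the Dockerfile above: build the R package with
# pkg-config discovery of the already-installed system libarrow enabled
RUN ARROW_USE_PKG_CONFIG=true install2.r --error \
    arrow
{code}

If the configure script silently fails to find libarrow-dataset, the package falls back to a build without Dataset support, which would match the error reported here.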