[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808235#comment-17808235 ] Никита Соколов commented on SPARK-46247:

No, there is no trailing dot at the end of the filenames – it comes from the exception message. The file is considered invalid because of the -5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet suffix: BucketingUtils fails to extract the bucket id when that suffix is present (see the sketch after the quoted description below). [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala#L34C31-L34C31]

Is this enough? If not, I will come back with the whole stack trace a bit later. Should I use the s3a:// prefix in the path option or in some configuration?

> Invalid bucket file error when reading from bucketed table created with
> PathOutputCommitProtocol
> -------------------------------------------------------------------------
>
>                 Key: SPARK-46247
>                 URL: https://issues.apache.org/jira/browse/SPARK-46247
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Никита Соколов
>            Priority: Major
>
> I am trying to create an external partitioned bucketed table using this code:
> {code:java}
> spark.read.parquet("s3://faucct/input")
>   .repartition(128, col("product_id"))
>   .write.partitionBy("features_date").bucketBy(128, "product_id")
>   .option("path", "s3://faucct/tmp/output")
>   .option("compression", "uncompressed")
>   .saveAsTable("tmp.output"){code}
> At first the write took more time than expected because it had to rename a lot of files at the end, which requires copying in S3. To avoid that, I used the configuration from the documentation – [https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#committing-work-into-cloud-storage-safely-and-fast]:
> {code:java}
> spark.hadoop.fs.s3a.committer.name directory
> spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
> spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code}
> The output is properly partitioned: every features_date partition has exactly 128 files named like [part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet|https://s3.console.aws.amazon.com/s3/object/joom-analytics-recom?region=eu-central-1&prefix=recom/dataset/best/best-to-cart-rt/user-product-v4/to_cart-faucct/fnw/ipw/msv2/2023-09-15/14d/tmp_3/features_date%3D2023-09-01/part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet].
> Then I try to join this table with another one, for example with itself:
> {code:java}
> spark.table("tmp.output").repartition(128, $"product_id")
>   .join(spark.table("tmp.output").repartition(128, $"product_id"), Seq("product_id"))
>   .count(){code}
> Because of the configuration I get the following errors:
> {code:java}
> org.apache.spark.SparkException: [INVALID_BUCKET_FILE] Invalid bucket file: s3://faucct/tmp/output/features_date=2023-09-01/part-0-43293810-d0e9-4eee-9be8-e9e50a3e10fd_0-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet.
>     at org.apache.spark.sql.errors.QueryExecutionErrors$.invalidBucketFile(QueryExecutionErrors.scala:2731)
>     at org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createBucketedReadRDD$5(DataSourceScanExec.scala:636)
> {code}
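To make the failure mode concrete, here is a minimal sketch of the bucket-id parsing in BucketingUtils. The regex is transcribed from the linked line of BucketingUtils.scala on master; the wrapper object and main method are illustrative scaffolding, not Spark code:

{code:java}
// Minimal sketch of the bucket-id parsing in
// org.apache.spark.sql.execution.datasources.BucketingUtils.
object BucketIdParsingSketch {
  // A bucketed file name must end with "_<digits>" followed only by an
  // optional ".extension"; anything else after the digits breaks the match.
  private val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  def getBucketId(fileName: String): Option[Int] = fileName match {
    case bucketedFileName(bucketId) => Some(bucketId.toInt)
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    // Plain Spark output name: bucket id 117 is extracted.
    println(getBucketId(
      "part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117.c000.parquet"))
    // prints Some(117)

    // Name written with the unique-filenames suffix: the extra
    // "-5eb66a54-..." after the bucket id prevents a match.
    println(getBucketId(
      "part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117" +
        "-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet"))
    // prints None
  }
}{code}

A None result for a file under a bucketed table is what FileSourceScanExec surfaces as INVALID_BUCKET_FILE, matching the stack trace in the report.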
[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808227#comment-17808227 ] Steve Loughran commented on SPARK-46247:

Why is the file invalid? Any more stack trace?
# try using s3a:// as the prefix all the way through
# is there really a "." at the end of the filenames?

The directory committer was Netflix's design for incremental update of an existing table, where a partition could be deleted before new data was committed. Unless you want to do this, use the magic or (second best) staging committer; a sketch of the magic committer configuration follows below.
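For reference, a hedged sketch of what switching to the magic committer looks like, mirroring the shape of the configuration block quoted in the issue (verify the option names against the Hadoop version in use; fs.s3a.committer.magic.enabled only needs to be set explicitly on older Hadoop 3.x releases):

{code:java}
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code}

Whether the magic committer also avoids the bucket-id suffix problem reported here is not confirmed in this thread and should be tested.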
[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol
[ https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793207#comment-17793207 ] Никита Соколов commented on SPARK-46247:

This can be bypassed with the following configuration during writes:
{code:java}
fs.s3a.committer.staging.unique-filenames: false {code}
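For completeness, a sketch of applying that workaround from Spark: the spark.hadoop. prefix copies the option into the Hadoop configuration of the writing job, and the option is specific to the staging committers (the directory committer is one of them). The app name and session-builder shape are illustrative; only the committer options come from the thread:

{code:java}
// Sketch: disable the staging committer's unique-filename suffix so the
// written files keep the part-XXXXX-<uuid>_NNNNN.c000.parquet shape that
// BucketingUtils can parse.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-write-workaround") // illustrative name
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.hadoop.fs.s3a.committer.staging.unique-filenames", "false")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
{code}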