[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol

2024-01-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808235#comment-17808235
 ] 

Никита Соколов commented on SPARK-46247:


No, there is no trailing dot at the end of the filenames; the dot comes from 
the exception message. The file is considered invalid because of the 
-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet suffix: BucketingUtils fails 
to extract the bucket id when that suffix is present.
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BucketingUtils.scala#L34C31-L34C31]
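For reference, a minimal standalone repro of that parsing logic (the regex is 
copied from the linked BucketingUtils; the surrounding object and main method 
are mine):
{code:java}
object BucketIdRepro {
  // From BucketingUtils: "anything, '_', digits, optional '.extension', end".
  private val bucketedFileName = """.*_(\d+)(?:\..*)?$""".r

  def getBucketId(fileName: String): Option[Int] = fileName match {
    case bucketedFileName(bucketId) => Some(bucketId.toInt)
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    // Name as written by the default committer: prints Some(117).
    println(getBucketId(
      "part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117.c000.parquet"))

    // Name with the unique-filenames suffix: the "-5eb66a54-..." after the
    // bucket id means the optional "(?:\..*)?$" tail cannot match, so this
    // prints None, which the scan reports as INVALID_BUCKET_FILE.
    println(getBucketId(
      "part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117" +
        "-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet"))
  }
}
{code}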

Is this enough? If not, I will come back with the full stack trace a bit 
later.

Should I use the s3a:// prefix in the path option or set it via some 
configuration?

 

> Invalid bucket file error when reading from bucketed table created with 
> PathOutputCommitProtocol
> 
>
> Key: SPARK-46247
> URL: https://issues.apache.org/jira/browse/SPARK-46247
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Никита Соколов
>Priority: Major
>
> I am trying to create an external partitioned bucketed table using this code:
> {code:java}
> spark.read.parquet("s3://faucct/input")
>   .repartition(128, col("product_id"))
>   .write.partitionBy("features_date").bucketBy(128, "product_id")
>   .option("path", "s3://faucct/tmp/output")
>   .option("compression", "uncompressed")
>   .saveAsTable("tmp.output"){code}
> At first it took more time than expected, because at the end it had to 
> rename a lot of files, and renaming in S3 requires copying. So I applied the 
> configuration from the documentation – 
> [https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#committing-work-into-cloud-storage-safely-and-fast]:
> {code:java}
> spark.hadoop.fs.s3a.committer.name directory
> spark.sql.sources.commitProtocolClass 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
> spark.sql.parquet.output.committer.class 
> org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code}
> The output is properly partitioned: every features_date partition has 
> exactly 128 files, named like 
> [part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet|https://s3.console.aws.amazon.com/s3/object/joom-analytics-recom?region=eu-central-1&prefix=recom/dataset/best/best-to-cart-rt/user-product-v4/to_cart-faucct/fnw/ipw/msv2/2023-09-15/14d/tmp_3/features_date%3D2023-09-01/part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet].
> Then I try to join this table with another one, for example with itself:
> {code:java}
> spark.table("tmp.output").repartition(128, $"product_id")
>   .join(spark.table("tmp.output").repartition(128, $"product_id"), 
> Seq("product_id")).count(){code}
> With this configuration I get the following error:
> {code:java}
> org.apache.spark.SparkException: [INVALID_BUCKET_FILE] Invalid bucket file: 
> s3://faucct/tmp/output/features_date=2023-09-01/part-0-43293810-d0e9-4eee-9be8-e9e50a3e10fd_0-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.invalidBucketFile(QueryExecutionErrors.scala:2731)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createBucketedReadRDD$5(DataSourceScanExec.scala:636)
>  {code}
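> For context, the grouping that raises this error looks roughly like the 
> following (paraphrased from FileSourceScanExec.createBucketedReadRDD in 
> DataSourceScanExec.scala; not the exact source):
> {code:java}
> // Each data file is grouped by the bucket id parsed from its file name;
> // a name that does not parse is reported as INVALID_BUCKET_FILE.
> val filesGroupedToBuckets = selectedPartitions.flatMap(_.files).groupBy { f =>
>   BucketingUtils
>     .getBucketId(f.getPath.getName)
>     .getOrElse(throw QueryExecutionErrors.invalidBucketFile(f.getPath.toString))
> }
> {code}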





[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol

2024-01-18 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808227#comment-17808227
 ] 

Steve Loughran commented on SPARK-46247:


Why is the file invalid? Is there any more of the stack trace?

# Try using s3a:// as the prefix all the way through.
# Is there really a "." at the end of the filenames?

The directory committer was Netflix's design for incremental updates of an 
existing table, where a partition could be deleted before new data was 
committed.

Unless you want that behavior, use the magic committer or (second best) the 
staging committer.
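For reference, a magic-committer setup might look like this (property names as 
in the Hadoop S3A committer documentation; fs.s3a.committer.magic.enabled is 
only required on older Hadoop releases, so verify against your version):
{code:java}
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
{code}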





[jira] [Commented] (SPARK-46247) Invalid bucket file error when reading from bucketed table created with PathOutputCommitProtocol

2023-12-05 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793207#comment-17793207
 ] 

Никита Соколов commented on SPARK-46247:


This can be bypassed with the following configuration during writes:
{code:java}
fs.s3a.committer.staging.unique-filenames: false {code}
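In Spark this S3A option is passed with the spark.hadoop. prefix, e.g. (a 
minimal sketch; the app name is illustrative):
{code:java}
import org.apache.spark.sql.SparkSession

// Sketch: disable the staging committer's unique-filenames feature so written
// files keep the "..._<bucketId>.c000.parquet" shape that
// BucketingUtils.getBucketId can parse.
val spark = SparkSession.builder()
  .appName("bucketed-write")
  .config("spark.hadoop.fs.s3a.committer.staging.unique-filenames", "false")
  .getOrCreate()
{code}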
