[ https://issues.apache.org/jira/browse/SPARK-46247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808227#comment-17808227 ]

Steve Loughran commented on SPARK-46247:
----------------------------------------

Why is the file invalid? Is there any more of the stack trace?

# Try using s3a:// as the prefix all the way through (see the sketch after this list).
# Is there really a "." at the end of the filenames?
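
For example, the write from the description could be retried with s3a:// URIs everywhere. A minimal sketch that only swaps the URL scheme in the reporter's own snippet; the bucket, paths and table name are taken from the description and not verified here:
{code:java}
import org.apache.spark.sql.functions.col

spark.read.parquet("s3a://faucct/input")              // read through the S3A connector
  .repartition(128, col("product_id"))
  .write.partitionBy("features_date").bucketBy(128, "product_id")
  .option("path", "s3a://faucct/tmp/output")           // write destination also via s3a://
  .option("compression", "uncompressed")
  .saveAsTable("tmp.output")
{code}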

The directory committer was Netflix's design for incremental updates of an 
existing table, where a partition could be deleted before new data was 
committed.

Unless you want that behaviour, use the magic committer or, second best, the staging committer; a configuration sketch follows below.
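
A minimal sketch of what switching to the magic committer could look like, assuming Spark 3.5 with the spark-hadoop-cloud module on the classpath; it keeps the same commit-protocol classes as in the description, and the explicit magic-enabled flag should be redundant on recent Hadoop releases where the magic committer is enabled by default:
{code:java}
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
{code}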


> Invalid bucket file error when reading from bucketed table created with 
> PathOutputCommitProtocol
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-46247
>                 URL: https://issues.apache.org/jira/browse/SPARK-46247
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Никита Соколов
>            Priority: Major
>
> I am trying to create an external partitioned bucketed table using this code:
> {code:java}
> spark.read.parquet("s3://faucct/input")
>   .repartition(128, col("product_id"))
>   .write.partitionBy("features_date").bucketBy(128, "product_id")
>   .option("path", "s3://faucct/tmp/output")
>   .option("compression", "uncompressed")
>   .saveAsTable("tmp.output"){code}
> At first it took more time than expected because it had to rename a lot of 
> files at the end, which requires copying in S3. So I used the committer 
> configuration from the documentation – 
> [https://spark.apache.org/docs/3.0.0-preview/cloud-integration.html#committing-work-into-cloud-storage-safely-and-fast]:
> {code:java}
> spark.hadoop.fs.s3a.committer.name directory
> spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
> spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter {code}
> It is properly partitioned: every features_date partition has exactly 128 files 
> named like 
> [part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet|https://s3.console.aws.amazon.com/s3/object/joom-analytics-recom?region=eu-central-1&prefix=recom/dataset/best/best-to-cart-rt/user-product-v4/to_cart-faucct/fnw/ipw/msv2/2023-09-15/14d/tmp_3/features_date%3D2023-09-01/part-00117-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00117-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet].
> Then I am trying to join this table with another one, for example like this:
> {code:java}
> spark.table("tmp.output").repartition(128, $"product_id")
>   .join(spark.table("tmp.output").repartition(128, $"product_id"), 
> Seq("product_id")).count(){code}
> Because of this configuration I get the following error:
> {code:java}
> org.apache.spark.SparkException: [INVALID_BUCKET_FILE] Invalid bucket file: 
> s3://faucct/tmp/output/features_date=2023-09-01/part-00000-43293810-d0e9-4eee-9be8-e9e50a3e10fd_00000-5eb66a54-2fbb-4775-8f3b-3040b2966a71.c000.parquet.
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.invalidBucketFile(QueryExecutionErrors.scala:2731)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.$anonfun$createBucketedReadRDD$5(DataSourceScanExec.scala:636)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
