Hi,

I'm running Spark 3 on Kubernetes and using the S3A staging committer (the 
directory committer) to write data to an S3 bucket. The same setup works fine 
with Spark 2.4.5, but with Spark 3 the final data (written in Parquet format) 
is not visible in the S3 bucket, and a subsequent read of that Parquet data 
fails because the path is empty and contains no data.
Because the S3A staging committer requires a shared file system (such as NFS 
or HDFS) for staging data, I have set up a shared PVC for the driver and all 
executors (i.e., spark.hadoop.fs.s3a.committer.staging.tmp.path points to a 
shared PVC with ReadWriteMany access).
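
For readers who cannot open the attachment, the committer-related part of such 
a setup typically looks like the sketch below. This is not a copy of the 
attached file: the staging path is a hypothetical PVC mount, and the two 
binding classes are the ones named in Spark's cloud-integration documentation, 
assumed here rather than taken from my configuration.

```python
# Illustrative settings only; the attached spark-default.conf is authoritative.
# "file:///mnt/shared-staging" is a hypothetical mount point for the shared PVC.
STAGING_COMMITTER_CONF = {
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.hadoop.fs.s3a.committer.staging.tmp.path": "file:///mnt/shared-staging",
    # Binding classes as described in Spark's cloud-integration docs (assumed).
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_spark_defaults(conf):
    """Render a dict of settings as 'key value' lines, one per setting."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


print(to_spark_defaults(STAGING_COMMITTER_CONF))
```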

In the S3 bucket I can see only the _SUCCESS file, with no data files.

bash-4.2# s3cmd ls --no-ssl --host=${AWS_ENDPOINT} --host-bucket= s3://rookbucket/shiva/ --recursive | grep people.parquet
2021-02-22 11:55      4074   s3://rookbucket/shiva/people.parquet/_SUCCESS
bash-4.2#
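
The same check can be scripted. Below is a minimal sketch that takes the text 
output of the recursive listing and returns only the data objects under a 
prefix, skipping marker files such as _SUCCESS. The function name is made up 
for illustration, and the parsing assumes the four-column layout shown above 
and keys without spaces.

```python
def data_objects(listing, prefix):
    """Return s3:// URLs under `prefix` that are not marker files."""
    found = []
    for line in listing.splitlines():
        parts = line.split()
        # Expected columns, as in the listing above: date, time, size, URL.
        if len(parts) < 4 or not parts[-1].startswith("s3://"):
            continue
        url = parts[-1]
        if not url.startswith(prefix):
            continue
        name = url.rsplit("/", 1)[-1]
        # Names starting with "_" or "." (for example _SUCCESS) carry no data.
        if name and not name.startswith(("_", ".")):
            found.append(url)
    return found


LISTING = (
    "2021-02-22 11:55      4074   "
    "s3://rookbucket/shiva/people.parquet/_SUCCESS\n"
)
print(data_objects(LISTING, "s3://rookbucket/shiva/people.parquet/"))
```

An empty result for the output path means the job left no part files there, 
which matches what I see with Spark 3.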

The _SUCCESS file is in JSON format, with the following content:

==============================
{
  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "timestamp" : 1613994948681,
  "date" : "Mon Feb 22 11:55:48 UTC 2021",
  "hostname" : "spark-thrift-hdfs",
  "committer" : "directory",
  "description" : "Task committer attempt_20210222115547_0000_m_000000_0",
  "metrics" : {
    "stream_write_block_uploads" : 0,
    "files_created" : 5,
    "S3guard_metadatastore_put_path_latencyNumOps" : 0,
    "stream_write_block_uploads_aborted" : 0,
    "committer_commits_reverted" : 0,
    "op_open" : 2,
    "stream_closed" : 12,
    "committer_magic_files_created" : 0,
    "object_copy_requests" : 0,
    "s3guard_metadatastore_initialization" : 0,
    "S3guard_metadatastore_put_path_latency90thPercentileLatency" : 0,
    "stream_write_block_uploads_committed" : 0,
    "S3guard_metadatastore_throttle_rate75thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_throttle_rate90thPercentileFrequency (Hz)" : 0,
    "committer_bytes_committed" : 0,
    "op_create" : 5,
    "stream_read_fully_operations" : 0,
    "committer_commits_completed" : 0,
    "object_put_requests_active" : 0,
    "s3guard_metadatastore_retry" : 0,
    "stream_write_block_uploads_active" : 0,
    "stream_opened" : 12,
    "S3guard_metadatastore_throttle_rate95thPercentileFrequency (Hz)" : 0,
    "op_create_non_recursive" : 0,
    "object_continue_list_requests" : 0,
    "committer_jobs_completed" : 5,
    "S3guard_metadatastore_put_path_latency50thPercentileLatency" : 0,
    "stream_close_operations" : 12,
    "stream_read_operations" : 378,
    "object_delete_requests" : 4,
    "fake_directories_deleted" : 8,
    "stream_aborted" : 0,
    "op_rename" : 0,
    "object_multipart_aborted" : 0,
    "committer_commits_created" : 0,
    "op_get_file_status" : 26,
    "s3guard_metadatastore_put_path_request" : 9,
    "committer_commits_failed" : 0,
    "stream_bytes_read_in_close" : 0,
    "op_glob_status" : 1,
    "stream_read_exceptions" : 0,
    "op_exists" : 5,
    "stream_read_version_mismatches" : 0,
    "S3guard_metadatastore_throttle_rate50thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_put_path_latency95thPercentileLatency" : 0,
    "stream_write_block_uploads_pending" : 4,
    "directories_created" : 0,
    "S3guard_metadatastore_throttle_rateNumEvents" : 0,
    "S3guard_metadatastore_put_path_latency99thPercentileLatency" : 0,
    "stream_bytes_backwards_on_seek" : 0,
    "stream_bytes_read" : 2997558,
    "stream_write_total_data" : 16282,
    "committer_jobs_failed" : 0,
    "stream_read_operations_incomplete" : 29,
    "files_copied_bytes" : 0,
    "op_delete" : 8,
    "object_put_bytes_pending" : 0,
    "stream_write_block_uploads_data_pending" : 0,
    "op_list_located_status" : 0,
    "object_list_requests" : 19,
    "stream_forward_seek_operations" : 0,
    "committer_tasks_completed" : 0,
    "committer_commits_aborted" : 0,
    "object_metadata_requests" : 45,
    "object_put_requests_completed" : 4,
    "stream_seek_operations" : 0,
    "op_list_status" : 0,
    "store_io_throttled" : 0,
    "stream_write_failures" : 0,
    "op_get_file_checksum" : 0,
    "files_copied" : 0,
    "ignored_errors" : 8,
    "committer_bytes_uploaded" : 0,
    "committer_tasks_failed" : 0,
    "stream_bytes_skipped_on_seek" : 0,
    "op_list_files" : 0,
    "files_deleted" : 0,
    "stream_bytes_discarded_in_abort" : 0,
    "op_mkdirs" : 1,
    "op_copy_from_local_file" : 0,
    "op_is_directory" : 1,
    "s3guard_metadatastore_throttled" : 0,
    "S3guard_metadatastore_put_path_latency75thPercentileLatency" : 0,
    "stream_write_total_time" : 0,
    "stream_backward_seek_operations" : 0,
    "object_put_requests" : 4,
    "object_put_bytes" : 16282,
    "directories_deleted" : 0,
    "op_is_file" : 2,
    "S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0
  },
  "diagnostics" : {
    "fs.s3a.metadatastore.impl" : "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore",
    "fs.s3a.committer.magic.enabled" : "false",
    "fs.s3a.metadatastore.authoritative" : "false"
  },
  "filenames" : [ ]
}

===============================
With the same S3 bucket, if I run the job with Spark 2.4.5, it writes the data 
to s3://rookbucket/shiva/people.parquet/ and the _SUCCESS file looks similar to 
the one above, except that the "filenames" key contains the list of part files 
(the Parquet data files). With Spark 3 it is an empty list, as shown above.
There is no exception or error during the write operation, but the read fails 
to get the schema because the Parquet output is empty.
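
To compare the manifests from the two Spark versions quickly, the relevant 
fields can be pulled out with a short script. This is a minimal sketch: the 
function names are made up, the sample is a trimmed copy of the Spark 3 
manifest above, and treating "no filenames and no completed commits" as a sign 
of uncommitted output is my own heuristic, not an official check.

```python
import json


def summarize_success(manifest_text):
    """Extract the fields of a _SUCCESS manifest that relate to committed data."""
    manifest = json.loads(manifest_text)
    metrics = manifest.get("metrics", {})
    return {
        "committer": manifest.get("committer"),
        "files_listed": len(manifest.get("filenames", [])),
        "uploads_pending": metrics.get("stream_write_block_uploads_pending", 0),
        "uploads_committed": metrics.get("stream_write_block_uploads_committed", 0),
        "commits_completed": metrics.get("committer_commits_completed", 0),
        "bytes_committed": metrics.get("committer_bytes_committed", 0),
    }


def looks_uncommitted(summary):
    """Heuristic (an assumption): nothing listed and no commits completed."""
    return summary["files_listed"] == 0 and summary["commits_completed"] == 0


# Trimmed copy of the Spark 3 manifest shown above.
SAMPLE = """
{
  "committer" : "directory",
  "metrics" : {
    "stream_write_block_uploads_pending" : 4,
    "stream_write_block_uploads_committed" : 0,
    "committer_commits_completed" : 0,
    "committer_bytes_committed" : 0
  },
  "filenames" : [ ]
}
"""

summary = summarize_success(SAMPLE)
print(summary)
print("looks uncommitted:", looks_uncommitted(summary))
```

For the Spark 3 manifest this reports four pending block uploads and zero 
completed commits, alongside the empty "filenames" list.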

I'm not sure what is causing the issue. I have attached the Spark configuration 
used to submit the job (spark-default.conf).

I'm using Ceph as the underlying storage for the S3 buckets, and if I use the 
rados command to check the data, I can see the Parquet data in objects whose 
names indicate a multipart upload, found with a command like the one below (but 
not under the final S3 output path):

bash-4.2# rados ls -p rook-ceph-store.rgw.buckets.data | grep "part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5"

Thanks and Regards,
Abhishek

Attachment: spark-default.conf
