Shruti Singhania created SPARK-52336: ----------------------------------------
Summary: Set a default for fs.gs.application.name.suffix when GCS is used and not user-defined
Key: SPARK-52336
URL: https://issues.apache.org/jira/browse/SPARK-52336
Project: Spark
Issue Type: Task
Components: Spark Core
Affects Versions: 3.5.6, 4.0.1
Reporter: Shruti Singhania

*1. Current Behavior:*

Apache Spark does not currently set a default value for the GCS Hadoop connector configuration {{fs.gs.application.name.suffix}}. Users who want to leverage this GCS connector feature for better traceability of Spark applications in GCS logs and metrics must set it explicitly, either in Hadoop configuration files ({{core-site.xml}}), via {{spark-submit --conf}}, or programmatically in their Spark application.

*2. Problem / Motivation:*

The {{fs.gs.application.name.suffix}} property is useful for identifying which application is performing GCS operations, especially in environments where multiple Spark applications (or other Hadoop applications) interact with GCS concurrently. Without a default set by Spark when GCS is used:
* Many users may be unaware of this beneficial GCS connector feature.
* GCS logs and metrics are harder to correlate with specific Spark applications, increasing debugging time and operational overhead.
* It adds an extra configuration step for users who would benefit from this tagging.

Setting a sensible default when GCS is detected would improve the experience for Spark users on GCS, providing better traceability with no extra configuration effort in the common case.

*3. Proposed Change:*

We propose that Spark automatically set a default value for {{fs.gs.application.name.suffix}} if:
# The application is interacting with Google Cloud Storage (i.e., paths with the {{gs://}} scheme are used).
# The user has *not* already provided a value for {{fs.gs.application.name.suffix}} in their Hadoop configuration or Spark configuration. User-defined values should always take precedence.

*4. Implementation Considerations (Open for Discussion):*
* *Detection of GCS usage:* Spark would need to detect when a {{FileSystem}} for the {{gs://}} scheme is being initialized or used. This could be done in {{HadoopFSUtils}} or a similar place where the Hadoop {{Configuration}} object is prepared for file system interactions.
* *Precedence:* The logic must ensure that this default is applied only if {{fs.gs.application.name.suffix}} (and potentially {{fs.gs.application.name}}, if the connector appends the suffix to it) is not already present in the Hadoop {{Configuration}} being used.

*5. Benefits:*
* *Improved traceability:* Easier to identify Spark application interactions in GCS request logs and metrics provided by the GCS connector.
* *Enhanced debugging:* Simplifies pinpointing GCS operations related to specific Spark jobs.
* *Better user experience:* Provides a useful GCS integration feature by default, reducing boilerplate configuration for users.
* *Consistency:* Encourages a good practice for applications interacting with GCS.

*Impact:* This change is expected to be low-impact and beneficial. It adds a configuration property that the GCS connector already understands.
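For reference, this is roughly what users must do by hand today. The application name and jar shown are illustrative examples, not Spark defaults; {{spark.hadoop.*}} properties are copied by Spark into the Hadoop {{Configuration}}:

```shell
# Manually tagging GCS operations today ("my-etl-job" is an example value):
spark-submit \
  --conf "spark.hadoop.fs.gs.application.name.suffix= (spark:my-etl-job)" \
  --class com.example.MyJob \
  my-job.jar
```

The proposal would make this tagging automatic for the common case, while a value supplied this way would still win.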
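The detection and precedence rules from section 4 could be sketched as below. The class name, the suffix format, and the use of a plain {{Map}} as a stand-in for Hadoop's {{Configuration}} are all illustrative assumptions for discussion, not existing Spark code:

```java
import java.util.Map;

// Sketch (hypothetical): apply a default fs.gs.application.name.suffix
// only when the user has not already configured one. Real code would
// call Configuration.get/set instead of operating on a Map.
public class GcsDefaultSuffix {
    // Config key understood by the GCS connector (from the issue text).
    static final String SUFFIX_KEY = "fs.gs.application.name.suffix";

    // A path targets GCS when it carries the gs:// scheme.
    static boolean isGcsPath(String path) {
        return path.startsWith("gs://");
    }

    // Derive a default suffix from the Spark application name, but only
    // when the key is absent: user-defined values always take precedence.
    static void applyDefaultSuffix(Map<String, String> conf, String appName) {
        conf.putIfAbsent(SUFFIX_KEY, " (spark:" + appName + ")");
    }
}
```

The same check-before-set shape would apply wherever the Hadoop {{Configuration}} is prepared (e.g. {{HadoopFSUtils}}), guarded by the {{gs://}} scheme detection.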