Shruti Singhania created SPARK-52336:
----------------------------------------

             Summary: Set a default for fs.gs.application.name.suffix when GCS 
is used and not user-defined
                 Key: SPARK-52336
                 URL: https://issues.apache.org/jira/browse/SPARK-52336
             Project: Spark
          Issue Type: Task
          Components: Spark Core
    Affects Versions: 3.5.6, 4.0.1
            Reporter: Shruti Singhania


*1. Current Behavior:*

Apache Spark does not currently set a default value for the GCS Hadoop 
connector configuration {{fs.gs.application.name.suffix}}. Users who want to 
leverage this connector feature for better traceability of Spark applications 
in GCS logs and metrics must set it explicitly in Hadoop configuration files 
({{core-site.xml}}), via {{spark-submit --conf}}, or programmatically in their 
Spark application.
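
For illustration, this is roughly what a user has to do today; the suffix 
value {{-my-etl-job}} is only a placeholder:

{code:scala}
import org.apache.spark.sql.SparkSession

// Programmatically, on the Hadoop configuration the session uses
// ("-my-etl-job" is just a placeholder suffix):
val spark = SparkSession.builder().appName("my-etl-job").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("fs.gs.application.name.suffix", "-my-etl-job")

// Or equivalently at submit time; Spark copies spark.hadoop.* keys into the
// Hadoop Configuration:
//   spark-submit --conf spark.hadoop.fs.gs.application.name.suffix=-my-etl-job ...
{code}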

*2. Problem / Motivation:*

The {{fs.gs.application.name.suffix}} property is very useful for identifying 
which application is performing GCS operations, especially in environments 
where multiple Spark applications (or other Hadoop applications) interact with 
GCS concurrently.

Without a default set by Spark when GCS is used:
 * Many users might be unaware of this beneficial GCS connector feature.
 * GCS logs and metrics are harder to correlate with specific Spark 
applications, increasing debugging time and operational overhead.
 * It introduces an extra configuration step for users who would benefit from 
this tagging.

Setting a sensible default when GCS is detected would improve the experience 
for Spark users on GCS, providing better traceability with no extra 
configuration effort for the common case.

*3. Proposed Change:*

We propose that Spark should automatically set a default value for 
{{fs.gs.application.name.suffix}} if:
 # The application is interacting with Google Cloud Storage (i.e., paths with 
the {{gs://}} scheme are used).
 # The user has *not* already provided a value for 
{{fs.gs.application.name.suffix}} in their Hadoop configuration or Spark 
configuration. User-defined values should always take precedence, as sketched 
below.
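
A rough sketch of the intended precedence follows; the helper name and the 
default value format are assumptions, not decided behavior:

{code:scala}
import org.apache.hadoop.conf.Configuration

// Hypothetical helper: only fills in a default when the key is absent, so
// values set via core-site.xml, spark.hadoop.*, or application code always
// win. The "-spark-<appId>" format is an assumption for illustration.
def maybeSetGcsAppNameSuffix(hadoopConf: Configuration, appId: String): Unit = {
  if (hadoopConf.get("fs.gs.application.name.suffix") == null) {
    hadoopConf.set("fs.gs.application.name.suffix", s"-spark-$appId")
  }
}
{code}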

*4. Implementation Considerations (Open for Discussion):*
 * *Detection of GCS Usage:* Spark would need to detect when a {{FileSystem}} 
for the {{gs://}} scheme is being initialized or used. This might be done in 
{{HadoopFSUtils}} or a similar place where the Hadoop {{Configuration}} object 
is prepared for file system interactions (see the sketch after this list).
 * *Precedence:* The logic must ensure that this default is only applied if 
{{fs.gs.application.name.suffix}} (and potentially {{fs.gs.application.name}} 
if the suffix is intended to be appended to it by the connector) is not already 
present in the Hadoop {{Configuration}} being used.
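
A minimal sketch of the detection side, assuming the check runs wherever the 
Hadoop {{Configuration}} is prepared for a given path (the exact hook point, 
e.g. {{HadoopFSUtils}}, is open for discussion):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical hook: apply the default only for the gs:// scheme, reusing the
// helper sketched above. User-set values win because the helper never
// overwrites an existing key.
def prepareGcsDefaults(path: Path, hadoopConf: Configuration, appId: String): Unit = {
  if (path.toUri.getScheme == "gs") {
    maybeSetGcsAppNameSuffix(hadoopConf, appId)
  }
}
{code}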

*5. Benefits:*
 * *Improved Traceability:* Easier to identify Spark application interactions 
in GCS request logs and metrics provided by the GCS connector.
 * *Enhanced Debugging:* Simplifies pinpointing GCS operations related to 
specific Spark jobs.
 * *Better User Experience:* Provides a useful GCS integration feature by 
default, reducing boilerplate configuration for users.
 * *Consistency:* Encourages a good practice for applications interacting with 
GCS.

 

*Impact:* This change is expected to be low-impact and beneficial. It adds a 
configuration property that the GCS connector already understands.


