Adrian Ionescu created SPARK-21669:
--------------------------------------

             Summary: Internal API for collecting metrics/stats during FileFormatWriter jobs
                 Key: SPARK-21669
                 URL: https://issues.apache.org/jira/browse/SPARK-21669
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Adrian Ionescu


It would be useful to have infrastructure in place for collecting custom 
metrics or statistics on the fly, as data is being written to disk.

This was inspired by the work in SPARK-20703, which added simple metrics 
collection for data write operations, such as {{numFiles}}, {{numPartitions}} 
and {{numRows}}. Those metrics are first collected on the executors and then 
sent to the driver, which aggregates them and posts them as updates to the 
{{SQLMetrics}} subsystem.
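
To make the executor-to-driver flow concrete, here is a minimal sketch in plain 
Scala with no Spark dependency. All names in it ({{TaskWriteStats}}, 
{{TaskWriteTracker}}, {{WriteStatsAggregation}}) are hypothetical and chosen 
for illustration; they are not the classes added in SPARK-20703. Tracking 
partitions as a set of names is also an assumption of this sketch, made so that 
a partition written by two tasks is counted once.

```scala
// Hypothetical names throughout; this does not mirror Spark's internal classes.
final case class TaskWriteStats(numFiles: Int, partitions: Set[String], numRows: Long) {
  def numPartitions: Int = partitions.size
}

// Executor side: one tracker per write task, updated as output is produced.
final class TaskWriteTracker {
  private var files = 0
  private var partitions = Set.empty[String]
  private var rows = 0L

  def newFile(): Unit = files += 1
  def newPartition(name: String): Unit = partitions += name
  def newRow(): Unit = rows += 1L

  // Immutable summary that the task would send back to the driver.
  def summary: TaskWriteStats = TaskWriteStats(files, partitions, rows)
}

// Driver side: combine the summaries reported by all tasks into one total.
object WriteStatsAggregation {
  val empty: TaskWriteStats = TaskWriteStats(0, Set.empty, 0L)

  def aggregate(perTask: Seq[TaskWriteStats]): TaskWriteStats =
    perTask.foldLeft(empty) { (acc, s) =>
      TaskWriteStats(
        acc.numFiles + s.numFiles,
        acc.partitions ++ s.partitions,
        acc.numRows + s.numRows)
    }
}
```

In this sketch the driver would then publish the aggregated totals as metric 
updates; that last step is omitted because it depends on Spark internals.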

The above can be generalized into a pluggable interface, which could later 
serve other purposes, e.g. automatic maintenance of cost-based optimizer (CBO) 
statistics during "INSERT INTO <table> SELECT ..." operations. Users would then 
no longer need to run "ANALYZE TABLE <table> COMPUTE STATISTICS" afterwards, 
which avoids an extra full-table scan.
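
One possible shape for such a pluggable interface is sketched below in plain 
Scala. This is an assumption about how the idea could look, not a proposal for 
the final API: the traits {{TaskStatsCollector}} and {{JobStatsCollector}}, the 
{{ColumnStats}} type and the {{MinMaxCollector}} example are all hypothetical. 
The example plug-in gathers a row count and min/max for a single numeric 
column, the kind of statistic a CBO might otherwise obtain via a separate scan.

```scala
// Hypothetical interface, for illustration only.
// Task-level hook: lives on an executor and observes one task's output.
trait TaskStatsCollector[Row, S] {
  def newFile(path: String): Unit
  def newRow(row: Row): Unit
  def result(): S
}

// Job-level hook: creates task collectors and consumes their results on the driver.
trait JobStatsCollector[Row, S] {
  def newTaskCollector(): TaskStatsCollector[Row, S]
  def processStats(stats: Seq[S]): Unit
}

final case class ColumnStats(rowCount: Long, min: Option[Long], max: Option[Long]) {
  private def combine(a: Option[Long], b: Option[Long])(f: (Long, Long) => Long): Option[Long] =
    (a, b) match {
      case (Some(x), Some(y)) => Some(f(x, y))
      case _                  => a.orElse(b)
    }

  def merge(other: ColumnStats): ColumnStats = ColumnStats(
    rowCount + other.rowCount,
    combine(min, other.min)((x, y) => math.min(x, y)),
    combine(max, other.max)((x, y) => math.max(x, y)))
}

object ColumnStats {
  val empty: ColumnStats = ColumnStats(0L, None, None)
}

// Example plug-in: each "row" is simply the Long value of the tracked column.
final class MinMaxCollector extends JobStatsCollector[Long, ColumnStats] {
  // Stand-in for writing the statistics to a catalog.
  var published: Option[ColumnStats] = None

  def newTaskCollector(): TaskStatsCollector[Long, ColumnStats] =
    new TaskStatsCollector[Long, ColumnStats] {
      private var stats = ColumnStats.empty
      def newFile(path: String): Unit = ()
      def newRow(row: Long): Unit = {
        stats = stats.merge(ColumnStats(1L, Some(row), Some(row)))
      }
      def result(): ColumnStats = stats
    }

  def processStats(stats: Seq[ColumnStats]): Unit = {
    published = Some(stats.foldLeft(ColumnStats.empty)(_ merge _))
  }
}
```

Splitting the interface into a task-level and a job-level part keeps the 
per-row work on the executors and leaves only a cheap merge for the driver.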



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

