[ 
https://issues.apache.org/jira/browse/SPARK-25187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589060#comment-16589060
 ] 

Ryan Blue edited comment on SPARK-25187 at 8/22/18 3:58 PM:
------------------------------------------------------------

The need for {{newScanConfigBuilder}} to take key-value options doesn't require 
a change to the life-cycle of {{ReadSupport}} instances. Some options relate to 
scan configuration rather than to source configuration. If data sources are 
free to reuse {{ReadSupport}} instances, then scan options must be passed 
separately to configure each scan.

HBase provides a good example of the difference. HBase table options would 
include where the data lives, like the HBase host to connect to. HBase scan 
options would include the MVCC timestamp to request for a scan. An HBase 
{{ReadSupport}} can be reused, which means that the MVCC timestamp used should 
be the one passed to the scan, not one passed when creating the 
{{ReadSupport}}.
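
The HBase case can be sketched in a few lines of Java. This is only an 
illustration, not Spark's actual DataSourceV2 API: the class 
{{HBaseReadSupportSketch}}, the method {{newScanConfig}}, and the option keys 
{{hbase.host}} and {{mvcc.timestamp}} are hypothetical names. It shows one 
reusable instance holding the source options while each scan receives its own 
options.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not Spark's actual ReadSupport interface. */
final class HBaseReadSupportSketch {
    // Source/table options: fixed when the instance is created.
    private final Map<String, String> sourceOptions;

    HBaseReadSupportSketch(Map<String, String> sourceOptions) {
        this.sourceOptions = new HashMap<>(sourceOptions);
    }

    /** Immutable result of configuring one scan. */
    static final class ScanConfig {
        final String host;
        final long mvccTimestamp;

        ScanConfig(String host, long mvccTimestamp) {
            this.host = host;
            this.mvccTimestamp = mvccTimestamp;
        }
    }

    // Scan options arrive per call, so one instance can serve many scans.
    ScanConfig newScanConfig(Map<String, String> scanOptions) {
        String host = sourceOptions.get("hbase.host");
        // The timestamp comes only from the scan options, never from sourceOptions.
        // "-1" is an assumed placeholder meaning "no timestamp requested".
        long ts = Long.parseLong(scanOptions.getOrDefault("mvcc.timestamp", "-1"));
        return new ScanConfig(host, ts);
    }
}
```

Because the timestamp is not stored on the instance, two scans created from 
the same instance can request different MVCC timestamps without interfering 
with each other.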

I understand that this is a little confusing because right now both sets of 
options are mixed together when using 
{{spark.read.format("fmt").option(...).load()}}. The only way to set both types 
of options is to pass them to the {{DataFrameReader}}. That makes it appear 
that there is only one set of options for a source. But consider sources that 
are stored in the session catalog. Those sources are stored with their 
source/table configuration, the {{OPTIONS}} passed in when creating the table. 
When reading these tables, we can also pass options to the {{DataFrameReader}}, 
and those options need to be passed when creating a scan of those sources.
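
A minimal Java sketch of the catalog case, under stated assumptions: 
{{CatalogReadSketch}}, {{StoredTable}}, {{Scan}}, and {{newScan}} are 
hypothetical names, not Spark classes. The table keeps the options it was 
created with, and the options given at read time go only to the scan.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; these are not Spark classes. */
final class CatalogReadSketch {
    /** A table as stored in a catalog, with the OPTIONS given at creation. */
    static final class StoredTable {
        final Map<String, String> tableOptions;

        StoredTable(Map<String, String> tableOptions) {
            this.tableOptions = Collections.unmodifiableMap(new HashMap<>(tableOptions));
        }
    }

    /** One scan: sees the stored table options plus its own read-time options. */
    static final class Scan {
        final Map<String, String> tableOptions;
        final Map<String, String> scanOptions;

        Scan(StoredTable table, Map<String, String> readerOptions) {
            this.tableOptions = table.tableOptions;
            this.scanOptions = Collections.unmodifiableMap(new HashMap<>(readerOptions));
        }
    }

    // Reader options are handed to the scan; the stored table is not modified.
    static Scan newScan(StoredTable table, Map<String, String> readerOptions) {
        return new Scan(table, readerOptions);
    }
}
```

Keeping the two maps separate means the same stored table can back many 
scans, each configured with different read-time options.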


> Revisit the life cycle of ReadSupport instances.
> ------------------------------------------------
>
>                 Key: SPARK-25187
>                 URL: https://issues.apache.org/jira/browse/SPARK-25187
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> Currently the life cycle is bound to the batch/stream query. This fits 
> streaming very well but may not be ideal for batch sources. We can also 
> consider letting {{ReadSupport.newScanConfigBuilder}} take 
> {{DataSourceOptions}} as a parameter, if we decide to change the life cycle.


