[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562058#comment-16562058 ]
Ryan Blue commented on SPARK-24882:
-----------------------------------

[~cloud_fan], when you say that "ReadSupport is created via reflection and should be very light-weight", that raises a red flag for me. As I've mentioned in the discussions on table catalog support, I think we should instantiate the catalog through reflection, not the data source. Starting with the data source is backward and only works now because we have just one global catalog. In my catalog support PR, the catalog is what gets instantiated via reflection. In that model, the initialization you need would happen in the Table instance's life-cycle, and the catalog would return tables that implement ReadSupport. So, I think we should consider the impact of multiple catalog support.

I think that if we go with the catalog proposal, it would make this change cleaner because we would be able to remove the extra DataSourceReader level. This would also end up working like a lot of other Table and Scan interfaces. Iceberg, for example, has a Table on which you call newScan to get a configurable TableScan. That's very similar to how this API would end up: Table implements ReadSupport, and ReadSupport exposes newScanConfig. Getting a Reader that doesn't actually read data just doesn't make sense.

For the builder discussion: does the current proposal work for the {{Filter(Limit(Filter(Scan)))}} case? I don't think it does, because implementations expect predicates to be pushed just once. I think these cases should probably be handled by a more generic push-down that passes parts of the plan tree. If that's the case, then the builder is simpler and immutable. I'd rather go with the builder for now to set the expectation that ScanConfig is immutable.

We *could* add an intermediate class, like {{ConfigurableScanConfig}}, that exposes the {{SupportsXYZ}} pushdown interfaces. Then, we could add a method to {{ConfigurableScanConfig}} to get a final, immutable {{ScanConfig}}.
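To make the shape of this concrete, here is a minimal sketch of that design. All names here (Table, ReadSupport, ScanConfigBuilder, ScanConfig, pushFilters) are illustrative, and filters are simplified to plain strings; this is not the actual Spark DSv2 API, just the pattern being described: the table exposes a mutable builder, pushdown happens exactly once on the builder, and the result is an immutable ScanConfig.

```java
import java.util.Collections;
import java.util.List;

// The catalog (instantiated via reflection) would return Table instances.
interface Table {
    String name();
}

// A table that supports reads hands out a builder for a scan configuration.
interface ReadSupport {
    ScanConfigBuilder newScanConfig();
}

// Mutable builder: pushdown (here, just filters) happens exactly once.
interface ScanConfigBuilder {
    ScanConfigBuilder pushFilters(List<String> filters); // simplified filter type
    ScanConfig build();
}

// Immutable result: the same state is seen by every later call, e.g.
// estimateStatistics and planInputPartitions.
interface ScanConfig {
    List<String> pushedFilters();
}

// Minimal in-memory implementation to illustrate the flow.
class ExampleTable implements Table, ReadSupport {
    public String name() { return "example"; }

    public ScanConfigBuilder newScanConfig() {
        return new ScanConfigBuilder() {
            private List<String> filters = Collections.emptyList();

            public ScanConfigBuilder pushFilters(List<String> fs) {
                this.filters = fs;
                return this;
            }

            public ScanConfig build() {
                // Take an immutable snapshot; the ScanConfig cannot change
                // after this point.
                final List<String> pushed = List.copyOf(this.filters);
                return () -> pushed;
            }
        };
    }
}

class ScanSketch {
    public static void main(String[] args) {
        ScanConfig config = new ExampleTable().newScanConfig()
                .pushFilters(List.of("id > 10"))
                .build();
        System.out.println(config.pushedFilters());
    }
}
```

The point of the sketch is the life-cycle: the builder is the only mutable stage, so once {{build()}} returns, nothing can alter what was pushed down.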
But that's really just a builder with a harder-to-implement interface. If we do find pushdown operations that are incompatible with a builder (and not, like your example, incompatible with both a builder and the current proposal), then we can always add a new way to build or configure a {{ScanConfig}} later. The important thing is that the ScanConfig is immutable, to provide the guarantee you want in this API: that the ScanConfig won't change between calls to {{estimateStatistics}} and {{planInputPartitions}}.

> separate responsibilities of the data source v2 read API
> --------------------------------------------------------
>
>                 Key: SPARK-24882
>                 URL: https://issues.apache.org/jira/browse/SPARK-24882
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
> Data source V2 has been out for a while; see the SPIP
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
> We have already migrated most of the built-in streaming data sources to the
> V2 API, and the file source migration is in progress. During the migration,
> we found several problems and want to address them before we stabilize the V2
> API.
> To solve these problems, we need to separate responsibilities in the data
> source v2 read API. For details, see the attached Google doc:
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)