Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21948#discussion_r207294283

    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java ---
    @@ -50,4 +50,15 @@
        * this ID will always be 0.
        */
       DataWriter<T> createDataWriter(int partitionId, long taskId, long epochId);
    +
    +  /**
    +   * When true, Spark will reuse the same data object instance when sending data to the data
    +   * writer, for better performance. Data writers should handle reused data objects carefully,
    +   * e.g. not buffer them in a list. By default this returns false for safety; data sources can
    +   * override it if their data writers immediately write each data object somewhere else, such
    +   * as a memory buffer or disk.
    +   */
    +  default boolean reuseDataObject() {
    --- End diff --

I don't think this should be added in this commit. This change is meant to move to `InternalRow` and should not alter the API. I'm fine with documenting the behavior, but writers are responsible for making defensive copies when necessary. This default is going to make sources slower, and I don't think it is necessary for implementations other than tests that buffer data in memory.
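The defensive-copy concern behind this review can be illustrated with a minimal sketch. This is plain Java with no Spark dependency; `MutableRow` is a hypothetical stand-in for a reused row object like Spark's `UnsafeRow`, not part of any real API. When a producer reuses one mutable instance, a writer that buffers references ends up with every entry aliasing the same object, while a writer that copies keeps each buffered value intact:

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for a reused mutable row object.
    static final class MutableRow {
        int value;
        MutableRow copy() {
            MutableRow r = new MutableRow();
            r.value = value;
            return r;
        }
    }

    public static void main(String[] args) {
        MutableRow shared = new MutableRow();     // the single reused instance
        List<MutableRow> naive = new ArrayList<>();
        List<MutableRow> safe = new ArrayList<>();

        for (int i = 0; i < 3; i++) {
            shared.value = i;        // producer overwrites the same object each time
            naive.add(shared);       // unsafe under reuse: every entry aliases one row
            safe.add(shared.copy()); // defensive copy keeps each buffered row independent
        }

        // The naive buffer sees only the last value; the safe buffer kept 0, 1, 2.
        System.out.println(naive.get(0).value + " " + safe.get(0).value); // prints "2 0"
    }
}
```

This is the trade-off the review is weighing: a `false` default forces Spark to copy for writers that do not buffer at all, whereas leaving copying to the writer only costs the (few) implementations that actually hold rows in memory.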