[GitHub] spark pull request #20824: [SPARK-23683][SQL][FOLLOW-UP] FileCommitProtocol....

steveloughran Thu, 15 Mar 2018 08:41:44 -0700

Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20824#discussion_r174829859
  
    --- Diff: 
core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
    @@ -145,15 +146,23 @@ object FileCommitProtocol {
           jobId: String,
           outputPath: String,
           dynamicPartitionOverwrite: Boolean = false): FileCommitProtocol = {
    +
    +    logDebug(s"Creating committer $className; job $jobId; 
output=$outputPath;" +
    +      s" dynamic=$dynamicPartitionOverwrite")
         val clazz = 
Utils.classForName(className).asInstanceOf[Class[FileCommitProtocol]]
         // First try the constructor with arguments (jobId: String, 
outputPath: String,
         // dynamicPartitionOverwrite: Boolean).
         // If that doesn't exist, try the one with (jobId: string, outputPath: 
String).
         try {
           val ctor = clazz.getDeclaredConstructor(classOf[String], 
classOf[String], classOf[Boolean])
    +      logDebug("Using (String, String, Boolean) constructor")
           ctor.newInstance(jobId, outputPath, 
dynamicPartitionOverwrite.asInstanceOf[java.lang.Boolean])
         } catch {
           case _: NoSuchMethodException =>
    +        logDebug("Falling back to (String, String) constructor")
    +        require(!dynamicPartitionOverwrite,
    +          "Dynamic Partition Overwrite is enabled but" +
    +            s" the committer ${className} does not have the appropriate 
constructor")
    --- End diff --
    
    Problem is that the dynamic partition logic in 
`InsertIntoHadoopFsRelationCommand` assumes that rename() is a fast reliable 
operation you can do with any implementation of the FileCommitProtocol,  sets 
itself up for it when enabled, then instantiates the inner committer, and 
carries on with the dynamic partitioning, irrespective of whether or not. 
rename() doesn't always work like that, breaking the rest of the algorithm.
    
    If the committer doesn't have that 3-arg constructor, you can't be 
confident that you can do that. To silently log and continue is to run the risk 
that the underlying committers commit algorithm isn't compatible with the 
algorithm.
    
    A fail-fast ensures that when the outcome is going to be unknown, you aren' 
t left trying to work out what's happened.
    
    Regarding your example, yes, it's in trouble. Problem is: how to 
differentiate that from subclasses which don't know anything at all about the 
new feature. you can't even look for an interface on the newly created object 
if the base class implements it; you are left with some dynamic probe of the 
instance.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20824: [SPARK-23683][SQL][FOLLOW-UP] FileCommitProtocol....

Reply via email to