Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21305#discussion_r201399506
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
    @@ -240,21 +238,27 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
     
         val cls = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
         if (classOf[DataSourceV2].isAssignableFrom(cls)) {
    -      val ds = cls.newInstance()
    -      ds match {
    +      val source = cls.newInstance().asInstanceOf[DataSourceV2]
    +      source match {
             case ws: WriteSupport =>
    -          val options = new DataSourceOptions((extraOptions ++
    -            DataSourceV2Utils.extractSessionConfigs(
    -              ds = ds.asInstanceOf[DataSourceV2],
    -              conf = df.sparkSession.sessionState.conf)).asJava)
    -          // Using a timestamp and a random UUID to distinguish different writing jobs. This is good
    -          // enough as there won't be tons of writing jobs created at the same second.
    -          val jobId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
    -            .format(new Date()) + "-" + UUID.randomUUID()
    -          val writer = ws.createWriter(jobId, df.logicalPlan.schema, mode, options)
    -          if (writer.isPresent) {
    +          val options = extraOptions ++
    +              DataSourceV2Utils.extractSessionConfigs(source, df.sparkSession.sessionState.conf)
    +
    +          val relation = DataSourceV2Relation.create(source, options.toMap)
    +          if (mode == SaveMode.Append) {
                 runCommand(df.sparkSession, "save") {
    -              WriteToDataSourceV2(writer.get(), df.logicalPlan)
    +              AppendData.byName(relation, df.logicalPlan)
    +            }
    +
    +          } else {
    +            val writer = ws.createWriter(
    +              UUID.randomUUID.toString, df.logicalPlan.output.toStructType, mode,
    --- End diff --
    
    I see no good reason to over-complicate the unique string passed in. Here's a quote from Wikipedia on the chance of a collision (via [this SO answer](https://stackoverflow.com/questions/24876188/how-big-is-the-chance-to-get-a-java-uuid-randomuuid-collision)):
    
    ```
    Only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. Or, to put it another way, the probability of one duplicate would be about 50% if every person on earth owned 600 million UUIDs.
    ```
    
    Adding a timestamp to a UUID to avoid collisions is unnecessary.
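
    The Wikipedia figure is just the birthday bound. A rough sketch of the arithmetic (the object and method names here are illustrative, not from the PR), assuming a version-4 UUID carries 122 random bits:

    ```scala
    // Birthday-bound estimate for random (version-4) UUID collisions.
    // With n UUIDs drawn from a space of 2^122 values, the probability of
    // at least one collision is approximately 1 - exp(-n^2 / 2^123).
    object UuidCollisionOdds {
      def collisionProbability(n: Double): Double = {
        val space = math.pow(2, 122) // number of distinct random UUIDs
        1.0 - math.exp(-(n * n) / (2.0 * space))
      }

      def main(args: Array[String]): Unit = {
        // One billion UUIDs per second for 100 years:
        val n = 1e9 * 60 * 60 * 24 * 365.25 * 100
        println(f"UUIDs generated: $n%.3e")
        println(f"approx. collision probability: ${collisionProbability(n)}%.2f")
      }
    }
    ```

    The result lands in the neighborhood of one half, consistent with the quote, which is why a timestamp prefix adds nothing for uniqueness.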
    
    For the other use, why would a user go to the temp directory of some node's file system -- which may not even be used by a given source -- instead of going to the logs? And what if the user wants some piece of information other than the starting timestamp (which is in a format that has to be converted anyway)?
    
    In short, I don't agree with the argument that passing the old format is helpful. It is just a carry-over from fabricating fake Hadoop job IDs (which is why it was called `jobId` and started with `job_`). It's debatable whether the write UUID itself is even useful, given that there is no requirement to use it anywhere.
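
    For reference, a minimal side-by-side of the two id styles under discussion, mirroring the removed and added lines in the diff (the object and method names are illustrative only):

    ```scala
    import java.text.SimpleDateFormat
    import java.util.{Date, Locale, UUID}

    object WriteIdFormats {
      // Old style from the removed lines: a fake Hadoop-like job id with a
      // timestamp prefix, e.g. "20180710123456-<uuid>".
      def oldStyleJobId(): String =
        new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date()) +
          "-" + UUID.randomUUID()

      // Proposed style from the added lines: the random UUID alone.
      def writeId(): String = UUID.randomUUID.toString

      def main(args: Array[String]): Unit = {
        println(oldStyleJobId())
        println(writeId())
      }
    }
    ```

    Both strings are unique for all practical purposes; the only difference is the 15-character timestamp prefix that the comment above argues carries no real value.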

