Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15072#discussion_r82730295

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---

```diff
@@ -53,7 +53,15 @@ import org.apache.spark.util.Utils
 private[sql] object Dataset {
   def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
-    new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
+    val encoder = implicitly[Encoder[T]]
+    if (encoder.clsTag.runtimeClass == classOf[Row]) {
+      // We should use the encoder generated from the executed plan rather than the existing
+      // encoder for DataFrame because the types of columns can be varied due to widening types.
+      // See SPARK-17123. This is a bit hacky. Maybe we should find a better way to do this.
+      ofRows(sparkSession, logicalPlan).asInstanceOf[Dataset[T]]
+    } else {
+      new Dataset(sparkSession, logicalPlan, encoder)
+    }
```

--- End diff ---

Ah, here is the code I ran:

```scala
val dates = Seq(
  (new Date(0), BigDecimal.valueOf(1), new Timestamp(2)),
  (new Date(3), BigDecimal.valueOf(4), new Timestamp(5))
).toDF("date", "decimal", "timestamp")

val widenTypedRows = Seq(
  (new Timestamp(2), 10.5D, "string")
).toDF("timestamp", "double", "string")

dates.except(widenTypedRows).collect()
```

and the error message:

```java
23:10:05.331 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 107: No applicable constructor/method found for actual parameters "long"; candidates are: "public static java.sql.Date org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private Object[] values;
/* 010 */   private org.apache.spark.sql.types.StructType schema;
/* 011 */
/* 012 */   public SpecificSafeProjection(Object[] references) {
/* 013 */     this.references = references;
/* 014 */     mutableRow = (InternalRow) references[references.length - 1];
/* 015 */
/* 016 */     this.schema = (org.apache.spark.sql.types.StructType) references[0];
/* 017 */
/* 018 */   }
/* 019 */
/* 020 */
/* 021 */
/* 022 */   public java.lang.Object apply(java.lang.Object _i) {
/* 023 */     InternalRow i = (InternalRow) _i;
/* 024 */
/* 025 */     values = new Object[3];
/* 026 */
/* 027 */     boolean isNull2 = i.isNullAt(0);
/* 028 */     long value2 = isNull2 ? -1L : (i.getLong(0));
/* 029 */     boolean isNull1 = isNull2;
/* 030 */     final java.sql.Date value1 = isNull1 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
/* 031 */     isNull1 = value1 == null;
/* 032 */     if (isNull1) {
/* 033 */       values[0] = null;
/* 034 */     } else {
/* 035 */       values[0] = value1;
/* 036 */     }
/* 037 */
/* 038 */     boolean isNull4 = i.isNullAt(1);
/* 039 */     double value4 = isNull4 ? -1.0 : (i.getDouble(1));
/* 040 */
/* 041 */     boolean isNull3 = isNull4;
/* 042 */     java.math.BigDecimal value3 = null;
/* 043 */     if (!isNull3) {
/* 044 */
/* 045 */       Object funcResult = null;
/* 046 */       funcResult = value4.toJavaBigDecimal();
/* 047 */       if (funcResult == null) {
/* 048 */         isNull3 = true;
/* 049 */       } else {
/* 050 */         value3 = (java.math.BigDecimal) funcResult;
/* 051 */       }
/* 052 */
/* 053 */     }
/* 054 */     isNull3 = value3 == null;
/* 055 */     if (isNull3) {
/* 056 */       values[1] = null;
/* 057 */     } else {
/* 058 */       values[1] = value3;
/* 059 */     }
/* 060 */
/* 061 */     boolean isNull6 = i.isNullAt(2);
/* 062 */     UTF8String value6 = isNull6 ? null : (i.getUTF8String(2));
/* 063 */     boolean isNull5 = isNull6;
/* 064 */     final java.sql.Timestamp value5 = isNull5 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaTimestamp(value6);
/* 065 */     isNull5 = value5 == null;
/* 066 */     if (isNull5) {
/* 067 */       values[2] = null;
/* 068 */     } else {
/* 069 */       values[2] = value5;
/* 070 */     }
/* 071 */
/* 072 */     final org.apache.spark.sql.Row value = new org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, schema);
/* 073 */     if (false) {
/* 074 */       mutableRow.setNullAt(0);
/* 075 */     } else {
/* 076 */
/* 077 */       mutableRow.update(0, value);
/* 078 */     }
/* 079 */
/* 080 */     return mutableRow;
/* 081 */   }
/* 082 */ }
```

```
/* 028 */     long value2 = isNull2 ? -1L : (i.getLong(0));
/* 029 */     boolean isNull1 = isNull2;
/* 030 */     final java.sql.Date value1 = isNull1 ? null : org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2);
```

Here, `toJavaDate` takes an `Int` (`DateType` is stored internally as days since the epoch), but it is given a `long` because the column was widened to `Timestamp` (stored internally as microseconds in a `long`). Apparently the schemas need to be widened before the two sides can be compared. I will look into this more deeply, but do you have any idea about it?
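
For reference, here is a minimal sketch of why those parameter types clash. It only assumes the internal representations (`DateType` as an `Int` of days, `TimestampType` as a `Long` of microseconds), which the compile error above already hints at:

```scala
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// DateType is stored internally as an Int of days since the epoch.
val d: java.sql.Date = DateTimeUtils.toJavaDate(0) // 1970-01-01

// TimestampType is stored internally as a Long of microseconds since the epoch.
val t: java.sql.Timestamp = DateTimeUtils.toJavaTimestamp(0L)

// The generated projection above reads the widened (timestamp) column with
// i.getLong(0), then passes that Long to toJavaDate(Int) -- which is exactly
// the "No applicable constructor/method found for actual parameters long"
// compile error in the log.
```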
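
And a quick way to see the mismatch without triggering codegen; this is just a sketch, assuming `queryExecution.analyzed` is a reasonable inspection point:

```scala
val result = dates.except(widenTypedRows)

// The analyzed plan's output schema, widened by set-operation type coercion
// (judging from the generated code: date -> timestamp, decimal -> double,
// timestamp -> string):
println(result.queryExecution.analyzed.schema.simpleString)

// The original schema of `dates`, which the stale RowEncoder was built from:
println(dates.schema.simpleString)
```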