[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

icexelloss Wed, 04 Oct 2017 14:25:23 -0700

Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18732#discussion_r142796899
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
    @@ -519,3 +519,18 @@ case class CoGroup(
         outputObjAttr: Attribute,
         left: LogicalPlan,
         right: LogicalPlan) extends BinaryNode with ObjectProducer
    +
    +case class FlatMapGroupsInPandas(
    +    groupingAttributes: Seq[Attribute],
    +    functionExpr: Expression,
    +    output: Seq[Attribute],
    +    child: LogicalPlan) extends UnaryNode {
    +  /**
    +   * This is needed because output attributes is considered `reference` when
    +   * passed through the constructor.
    +   *
    +   * Without this, catalyst will complain that output attributes are missing
    +   * from the input.
    +   */
    +  override val producedAttributes = AttributeSet(output)
    --- End diff --
    
    This is one of the tricky bits.
    
    It's because of this code:
    
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L135
    
    Because `productIterator` returns all of the case class's constructor fields, including `output`, the `references` of the tree node will include all output attributes, and Catalyst will complain about missing input:
    
    ```
    def missingInput: AttributeSet = references -- inputSet -- producedAttributes
    ```
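    
    To make the set arithmetic concrete, here is a minimal sketch that models attribute sets as plain `Set[String]`. `ToyNode` and the column names (`id`, `v`, `v_norm`) are hypothetical stand-ins, not Catalyst classes; only the shape of `missingInput` is taken from the definition above.
    
    ```scala
    // Toy model only: ToyNode is a hypothetical stand-in for a Catalyst plan node,
    // and attributes are modelled as plain strings rather than AttributeSet.
    object MissingInputSketch {
      type Attrs = Set[String]

      final case class ToyNode(
          references: Attrs,                      // attributes found in the node's fields
          inputSet: Attrs,                        // attributes provided by the child
          producedAttributes: Attrs = Set.empty) { // attributes the node itself creates
        // Same shape as the QueryPlan definition quoted above.
        def missingInput: Attrs = references -- inputSet -- producedAttributes
      }

      def main(args: Array[String]): Unit = {
        val childOutput  = Set("id", "v")       // assumed child columns
        val pandasOutput = Set("id", "v_norm")  // assumed output of the pandas function

        // `output` is a constructor field, so its attributes end up in
        // `references` next to the ones the node really reads from its child.
        val refs = Set("id", "v") ++ pandasOutput

        val withoutOverride = ToyNode(refs, childOutput)
        val withOverride    = ToyNode(refs, childOutput, producedAttributes = pandasOutput)

        println(withoutOverride.missingInput) // contains only "v_norm"
        println(withOverride.missingInput)    // empty
      }
    }
    ```
    
    In this toy version, `v_norm` is reported as missing until it is listed in `producedAttributes`, which is the effect the override in the diff is meant to have.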
    
    I don't think my solution here is great, but I don't know the best way to deal with this. If someone with deeper Catalyst knowledge can suggest a better approach, I am happy to get rid of this bit.


