Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22364

@maropu I have run the following benchmark:

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

test("AttributeSet -- benchmark") {
  val attrSetA = AttributeSet((1 to 100).map { i => AttributeReference(s"c$i", IntegerType)() })
  // Seq[Attribute] operands with full, no, and partial overlap with attrSetA.
  val attrSetB = attrSetA.take(80).toSeq
  val attrSetC = (1 to 100).map { i => AttributeReference(s"c2_$i", IntegerType)() }
  val attrSetD = (attrSetA.take(50) ++ attrSetC.take(50)).toSeq
  val attrSetE = attrSetC.take(50) ++ attrSetA.take(50)
  val n_iter = 1000000

  val t0 = System.nanoTime()
  // `--` consumes the Seq operands directly.
  (1 to n_iter) foreach { _ =>
    val r1 = attrSetA -- attrSetB
    val r2 = attrSetA -- attrSetC
    val r3 = attrSetA -- attrSetD
    val r4 = attrSetA -- attrSetE
  }
  val t1 = System.nanoTime()
  // `subsetOf` takes an AttributeSet, so each call first builds one from the Seq.
  (1 to n_iter) foreach { _ =>
    val r1 = attrSetA subsetOf AttributeSet(attrSetB)
    val r2 = attrSetA subsetOf AttributeSet(attrSetC)
    val r3 = attrSetA subsetOf AttributeSet(attrSetD)
    val r4 = attrSetA subsetOf AttributeSet(attrSetE)
  }
  val t2 = System.nanoTime()

  val totalTime1 = t1 - t0
  val totalTime2 = t2 - t1
  // System.nanoTime() measures nanoseconds, so these averages are in ns per iteration.
  println(s"Average time for --: ${totalTime1 / n_iter} ns")
  println(s"Average time for subsetOf: ${totalTime2 / n_iter} ns")
}
```

And the output is:

```
Average time for --: 25065 ns
Average time for subsetOf: 108638 ns
```

So for the case you mentioned, using `subsetOf` would instead introduce a performance regression. I have also run all the tests in StarJoinCostBasedReorderSuite 1000 times, and the performance regression was confirmed:

```
Running StarJoinCostBasedReorderSuite's tests 1000 times takes w/o change: 68877186927 ns
Running StarJoinCostBasedReorderSuite's tests 1000 times takes with change: 70689955856 ns
```

The point is that there we have a `Seq[Attribute]`, rather than an `AttributeSet`, as the parameter: `--` can consume the `Seq` directly, while `subsetOf` forces a new `AttributeSet` to be built on every call. Hope this is clear, let me know otherwise. Thanks.
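To see the cost asymmetry in isolation, here is a minimal self-contained sketch using plain Scala collections rather than Spark's `AttributeSet` (the object name `SetDiffVsSubsetOf` and the sizes are made up for illustration). It shows the same shape of the problem: `--` accepts a `Seq` operand as-is, while `subsetOf` requires a `Set` argument, so the `Seq` must first be materialized into one.

```scala
// Illustrative sketch only (plain Scala collections, not Spark's AttributeSet);
// the names and sizes here are hypothetical.
object SetDiffVsSubsetOf {
  def main(args: Array[String]): Unit = {
    val base: Set[Int] = (1 to 100).toSet
    val other: Seq[Int] = 1 to 80 // a Seq, mirroring the Seq[Attribute] parameter

    // `--` accepts any collection, so the Seq is consumed without conversion.
    val viaDiff = (base -- other).isEmpty

    // `subsetOf` needs a Set argument, so the Seq must be converted first;
    // building that Set is the extra per-call cost.
    val viaSubset = base.subsetOf(other.toSet)

    // Both express "base is contained in other" and agree on the result.
    println(s"viaDiff=$viaDiff, viaSubset=$viaSubset") // both false here
  }
}
```

In the Spark benchmark above, that conversion is the `AttributeSet(attrSetB)` call inside the second loop, which is exactly the allocation the `--` path avoids.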