date:20180718

[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21739: [SPARK-22187][SS] Update unsaferow format for saved stat...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21739
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93255/
Test PASSed.


---


[GitHub] spark issue #21739: [SPARK-22187][SS] Update unsaferow format for saved stat...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21739
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93258/
Test PASSed.


---


[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-07-18 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21451
  
**[Test build #93255 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93255/testReport)**
 for PR 21451 at commit 
[`335e26d`](https://github.com/apache/spark/commit/335e26d168dc99e7317175da8732ff691ff512f2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class UploadBlockStream extends BlockTransferMessage `


---


[GitHub] spark issue #21739: [SPARK-22187][SS] Update unsaferow format for saved stat...

2018-07-18 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21739
  
**[Test build #93258 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93258/testReport)**
 for PR 21739 at commit 
[`c262e87`](https://github.com/apache/spark/commit/c262e87afe8736febcb546827f0af22da14a02d9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21774
  
Since Spark doesn't have a persistent UDF API like Hive's, I think this 
is the best we can do for now. In the future we should migrate this to a UDF API so 
that we can register it with a name and use it in SQL.


---


[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203607394
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala
 ---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.avro
+
+import org.apache.avro.Schema
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.{AvroDataToCatalyst, CatalystDataToAvro, RandomDataGenerator}
+import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
+import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, GenericInternalRow, Literal}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class AvroCatalystDataConversionSuite extends SparkFunSuite with ExpressionEvalHelper {
+
+  private def roundTripTest(data: Literal): Unit = {
+    val avroType = SchemaConverters.toAvroType(data.dataType, data.nullable)
+    checkResult(data, avroType, data.eval())
+  }
+
+  private def checkResult(data: Literal, avroType: Schema, expected: Any): Unit = {
+    checkEvaluation(
+      AvroDataToCatalyst(CatalystDataToAvro(data), new SerializableSchema(avroType)),
+      prepareExpectedResult(expected))
+  }
+
+  private def assertFail(data: Literal, avroType: Schema): Unit = {
+    intercept[java.io.EOFException] {
+      AvroDataToCatalyst(CatalystDataToAvro(data), new SerializableSchema(avroType)).eval()
+    }
+  }
+
+  private val testingTypes = Seq(
+    BooleanType,
+    ByteType,
+    ShortType,
+    IntegerType,
+    LongType,
+    FloatType,
+    DoubleType,
+    DecimalType(8, 0),   // 32 bits decimal without fraction
+    DecimalType(8, 4),   // 32 bits decimal
+    DecimalType(16, 0),  // 64 bits decimal without fraction
+    DecimalType(16, 11), // 64 bits decimal
+    DecimalType(38, 0),
+    DecimalType(38, 38),
+    StringType,
+    BinaryType)
+
+  protected def prepareExpectedResult(expected: Any): Any = expected match {
+    // Spark decimal is converted to avro string=
+    case d: Decimal => UTF8String.fromString(d.toString)
+    // Spark byte and short both map to avro int
+    case b: Byte => b.toInt
+    case s: Short => s.toInt
+    case row: GenericInternalRow => InternalRow.fromSeq(row.values.map(prepareExpectedResult))
+    case array: GenericArrayData => new GenericArrayData(array.array.map(prepareExpectedResult))
+    case map: MapData =>
+      val keys = new GenericArrayData(
+        map.keyArray().asInstanceOf[GenericArrayData].array.map(prepareExpectedResult))
+      val values = new GenericArrayData(
+        map.valueArray().asInstanceOf[GenericArrayData].array.map(prepareExpectedResult))
+      new ArrayBasedMapData(keys, values)
+    case other => other
+  }
+
+  testingTypes.foreach { dt =>
+    val seed = scala.util.Random.nextLong()
+    test(s"single $dt with seed $seed") {
+      val rand = new scala.util.Random(seed)
+      val data = RandomDataGenerator.forType(dt, rand = rand).get.apply()
+      val converter = CatalystTypeConverters.createToCatalystConverter(dt)
+      val input = Literal.create(converter(data), dt)
+      roundTripTest(input)
+    }
+  }
+
+  for (_ <- 1 to 5) {
+    val seed = scala.util.Random.nextLong()
+    val rand = new scala.util.Random(seed)
+    val schema = RandomDataGenerator.randomSchema(rand, 5, testingTypes)
+    test(s"flat schema ${schema.catalogString} with seed $seed") {
+      val data = RandomDataGenerator.randomRow(rand, schema)
+      val converter = CatalystTypeConverters.createToCatalystConverter(schema)
+      val input = Literal.create(converter(data), schema)
+

[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203606947
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala ---
@@ -36,4 +40,27 @@ package object avro {
     @scala.annotation.varargs
     def avro(sources: String*): DataFrame = reader.format("avro").load(sources: _*)
   }
+
--- End diff --

Because the Avro data source is an external package, like the Kafka data source, 
it's not available in `org.apache.spark.sql.functions`.


---


[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203606818
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala 
---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+
+import org.apache.avro.generic.GenericDatumWriter
+import org.apache.avro.io.{BinaryEncoder, EncoderFactory}
+
+import org.apache.spark.sql.avro.{AvroSerializer, SchemaConverters}
+import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.types.{BinaryType, DataType}
+
+case class CatalystDataToAvro(child: Expression) extends UnaryExpression with CodegenFallback {
+
+  override lazy val dataType: DataType = BinaryType
--- End diff --

+1


---


[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203606783
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.avro.generic.GenericDatumReader
+import org.apache.avro.io.{BinaryDecoder, DecoderFactory}
+
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema}
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType}
+
+case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema)
+  extends UnaryExpression with CodegenFallback with ExpectsInputTypes {
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)
+
+  override lazy val dataType: DataType =
--- End diff --

The `dataType` is needed on the executor side to build the `AvroDeserializer`, so it's 
better to serialize it than to recompute it on the executor side.


---


[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203606677
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.avro.generic.GenericDatumReader
+import org.apache.avro.io.{BinaryDecoder, DecoderFactory}
+
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema}
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType}
+
+case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema)
+  extends UnaryExpression with CodegenFallback with ExpectsInputTypes {
--- End diff --

Good point. Since the implementation is short, I think it should be easy to 
add codegen for it.


---


[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21774#discussion_r203606284
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.avro.generic.GenericDatumReader
+import org.apache.avro.io.{BinaryDecoder, DecoderFactory}
+
+import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema}
+import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType}
+
--- End diff --

This is not a function expression like the ones in SQL core, so 
`ExpressionDescription` doesn't apply here. I think we can leave it for now.


---


[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20838
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20838
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93253/
Test PASSed.


---


[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-07-18 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20838
  
**[Test build #93253 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93253/testReport)**
 for PR 20838 at commit 
[`2c4f15c`](https://github.com/apache/spark/commit/2c4f15c13efa8b181c8c53bd9a90f4f578a40169).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21700: [SPARK-24717][SS] Split out max retain version of state ...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21700
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21700: [SPARK-24717][SS] Split out max retain version of state ...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21700
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93256/
Test PASSed.


---


[GitHub] spark issue #21700: [SPARK-24717][SS] Split out max retain version of state ...

2018-07-18 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21700
  
**[Test build #93256 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93256/testReport)**
 for PR 21700 at commit 
[`cf78a2a`](https://github.com/apache/spark/commit/cf78a2a25791a683c0ee36b08bdc79edd54f212a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #21782: [SPARK-24816][SQL] SQL interface support repartit...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21782#discussion_r203604973
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -394,6 +394,41 @@ class FilterPushdownBenchmark extends SparkFunSuite with BenchmarkBeforeAndAfter
   }
 }
   }
+
+  ignore("Pushdown benchmark for RANGE PARTITION BY/DISTRIBUTE BY") {
--- End diff --

how is this related to pushdown?


---


[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21469
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21469
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93257/
Test PASSed.


---


[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-07-18 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21469
  
**[Test build #93257 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93257/testReport)**
 for PR 21469 at commit 
[`32d0418`](https://github.com/apache/spark/commit/32d041878b7dcd20794250853063dab4bac09118).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21813: [SPARK 24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-18 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21813
  
**[Test build #93260 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93260/testReport)**
 for PR 21813 at commit 
[`b5ada3f`](https://github.com/apache/spark/commit/b5ada3feb7d243859714c04ec4fb8c225c1781e0).


---


[GitHub] spark issue #21813: [SPARK 24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21813
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified//
Test PASSed.


---


[GitHub] spark issue #21813: [SPARK 24424][SQL] Support ANSI-SQL compliant syntax for...

2018-07-18 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21813
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #21813: [SPARK 24424] Support ANSI-SQL compliant syntax f...

2018-07-18 Thread dilipbiswal

GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/21813

[SPARK 24424] Support ANSI-SQL compliant syntax for GROUPING SET

## What changes were proposed in this pull request?

Enhances the parser and analyzer to support ANSI-compliant syntax for 
GROUPING SETS. As part of this change, we derive the grouping expressions from the 
user-supplied groupings in the GROUPING SETS clause.

```SQL
SELECT c1, c2, max(c3) 
FROM t1
GROUP BY GROUPING SETS ((c1), (c1, c2))
```
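
As a rough illustration of what the query above computes (this is not the parser/analyzer change itself), here is a minimal plain-Scala sketch. It assumes the grouping expressions are the distinct columns named across all grouping sets, and that a column absent from a grouping set is reported as NULL (modeled here as `None`). `GroupingSetsSketch`, `Row`, `groupingColumns`, and `maxByGroupingSets` are hypothetical names used only for this example.

```scala
object GroupingSetsSketch {
  // One input row of the hypothetical table t1(c1, c2, c3).
  final case class Row(c1: String, c2: String, c3: Int)

  // Assumption: grouping expressions are the distinct columns across all
  // grouping sets, in order of first appearance.
  def groupingColumns(sets: Seq[Seq[String]]): Seq[String] = sets.flatten.distinct

  // Evaluate max(c3) once per grouping set. A column that is not part of
  // the current grouping set becomes None (standing in for SQL NULL).
  def maxByGroupingSets(
      rows: Seq[Row],
      sets: Seq[Seq[String]]): Seq[(Option[String], Option[String], Int)] =
    sets.flatMap { set =>
      rows
        .groupBy { r =>
          (if (set.contains("c1")) Some(r.c1) else None,
           if (set.contains("c2")) Some(r.c2) else None)
        }
        .toSeq
        .map { case ((k1, k2), group) => (k1, k2, group.map(_.c3).max) }
    }
}
```

For `GROUPING SETS ((c1), (c1, c2))`, `groupingColumns` returns `c1, c2`, and the result is the union of the rows grouped by `c1` alone and the rows grouped by `(c1, c2)`.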


## How was this patch tested?
Added tests in SQLQueryTestSuite and ResolveGroupingAnalyticsSuite.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark spark-24424

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21813.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21813


commit b5ada3feb7d243859714c04ec4fb8c225c1781e0
Author: Dilip Biswal 
Date:   2018-07-19T05:12:33Z

[SPARK-24424] Support ANSI-SQL compliant syntax for GROUPING SET




---


[GitHub] spark issue #21732: [SPARK-24762][SQL] Aggregator should be able to use Opti...

2018-07-18 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21732
  
ping @cloud-fan 


---


[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-18 Thread jiangxb1987

Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/21698
  
> > checkpoint can not guarantee that you shall always get the same output ...
> 
> IIRC we can checkpoint to HDFS? Then it becomes reliable.

Sure, thanks for clarifying that.


---


[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode

2018-07-18 Thread jiangxb1987

Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/21758
  
@mridulm Sorry, I missed that message. I've now updated the comment, so we can 
continue the discussion on that thread.


---


[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21698
  
> checkpoint can not guarantee that you shall always get the same output ...

IIRC we can checkpoint to HDFS? Then it becomes reliable.


---


[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...

2018-07-18 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20856
  
@HyukjinKwon good analysis!

Currently Spark is a little messy about what gets serialized and sent 
to executors. Sometimes we send an entire query tree but only read a few 
properties of it.

It seems to me it would be better to always do codegen on the driver side, to 
avoid complex expression/plan operations on the executor side (not sure if it's 
possible, cc @viirya @rednaxelafx @kiszk).

For this particular problem, I think we can just change these `val`s to 
`lazy val` or `def` in `FileSourceScanExec`, with a unit test.
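
To illustrate why switching from `val` to `lazy val` helps, here is a minimal plain-Scala sketch. It is not the actual `FileSourceScanExec` code: `Relation`, `EagerScan`, `LazyScan`, and `constructs` are hypothetical names, and the `null` argument stands in for driver-only state that is missing on the executor side.

```scala
object LazyValSketch {
  // Hypothetical stand-in for driver-only state that is null on executors.
  final class Relation(val paths: Seq[String])

  // Eager: `pathCount` is evaluated in the constructor, so a null relation
  // fails as soon as the object is built.
  final class EagerScan(relation: Relation) {
    val pathCount: Int = relation.paths.size
  }

  // Lazy: `pathCount` is evaluated on first access, so construction succeeds
  // and only code that actually reads the field can hit the null.
  final class LazyScan(relation: Relation) {
    lazy val pathCount: Int = relation.paths.size
  }

  // Returns true if `build` completes without a NullPointerException.
  def constructs(build: => Any): Boolean =
    try { build; true } catch { case _: NullPointerException => false }
}
```

With a null relation, building `EagerScan` throws while building `LazyScan` does not. A `def` behaves like the lazy variant here, except that it recomputes the value on every call.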


---


[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-07-18 Thread jiangxb1987

Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/21698

> I see some discussion about making shuffles deterministic, but it proved
> to be very difficult. Is there a prior discussion on this you can point me to?
> Is it that even if you used fetch-to-disk and had the shuffle-fetch side read
> the map-output in a set order, you'd still have random variations in spills?

Related discussion can be found at https://github.com/apache/spark/pull/20414.
Also, let me list below some of the scenarios that might generate
non-deterministic row ordering:

**Random Shuffle Blocks Fetch**

We randomize the ordering of block fetches on purpose, to avoid all the
nodes fetching from the same node at the same time. That means we send out
multiple concurrent fetch requests, and the fetched blocks are processed in
FIFO order. Therefore, the row ordering of the shuffle output is non-deterministic.

**Shuffle Merge With Spills**

A shuffle that uses an Aggregator (for instance, combineByKey) uses
ExternalAppendOnlyMap to combine the values. ExternalAppendOnlyMap claims
that it keeps the row order, but it actually uses the hash to compare the
elements (i.e., HashComparator). Even though the sort algorithm is stable, the
map sizes can differ when spilling happens, and the requests for
additional memory might arrive in different orders. The spilling can therefore be
non-deterministic, and so the resulting order can still be non-deterministic.

**Read From External Data Source**

Some external data sources might generate a different row ordering of outputs
on different read requests.

> since we only need to do this sort on RDDs post shuffle

IIUC this is not the case in RDD.repartition(); see
https://github.com/apache/spark/blob/94c67a76ec1fda908a671a47a2a1fa63b3ab1b06/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L453~L461
. It requires the input rows to be ordered and then performs a round-robin style data
transformation, so I don't see what we can do if the input data type is not
sortable.
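
To make the ordering problem concrete, here is a simplified plain-Scala sketch. It is not the actual `RDD.repartition()` code: `RoundRobinSketch`, `roundRobin`, and `sortedRoundRobin` are hypothetical names, and the real implementation differs in details such as where each partition starts distributing.

```scala
object RoundRobinSketch {
  // Assign rows to partitions in arrival order: the row at position i
  // goes to partition i % numPartitions.
  def roundRobin[A](rows: Seq[A], numPartitions: Int): Vector[Vector[A]] =
    Vector.tabulate(numPartitions) { p =>
      rows.zipWithIndex.collect { case (row, i) if i % numPartitions == p => row }.toVector
    }

  // Sorting first makes the assignment independent of arrival order,
  // which is only possible when the row type has an Ordering.
  def sortedRoundRobin[A: Ordering](rows: Seq[A], numPartitions: Int): Vector[Vector[A]] =
    roundRobin(rows.sorted, numPartitions)
}
```

If a retried task sees the same rows in a different order, `roundRobin` can send a given row to a different partition than it did the first time, while `sortedRoundRobin` produces the same assignment for both attempts. The `Ordering` constraint is exactly what is missing when the input data type is not sortable.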

> on a fetch-failure in repartition, fail the entire job

Currently I can't think of a case where a user would prefer this
behavior change, especially since FetchFailure tends to occur more often in long-running
jobs on big datasets than in interactive queries.

> We could add logic to detect whether even an order-dependent operation
> was safe to retry -- eg. repartition just after reading from hdfs or after a
> checkpoint can be done as it is now. Each stage would need to know this based
> on extra properties of all the RDDs it was defined from.

This is something I'm also trying to figure out: we should enable users
to tell Spark that an RDD will generate deterministic output, so that they don't have
to worry about data correctness issues with these RDDs. Please also note that
checkpointing cannot actually guarantee that you will always get the same output
on each read operation, because you may hit executorLost, after which you have to
recompute the partitions and thus may fetch different data.

> Honestly I consider this bug so serious I'd consider loudly warning from
> every api which suffers from this if we can't fix -- make them deprecated and
> log a warning.

We should definitely update the comments, but should we deprecate the
APIs? I can't say I agree or disagree on this. I'm still trying to extend
the current approach to guarantee data correctness, and the code changes will be
kept behind a flag. Maybe we can revisit the proposal to deprecate the APIs after that?

---

