date:20180515

[GitHub] spark issue #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21341
  
**[Test build #90672 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90672/testReport)**
 for PR 21341 at commit 
[`0674301`](https://github.com/apache/spark/commit/06743015fbfca7060c800daedfd65bc9c52bf7b4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf...

2018-05-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21341
  
LGTM


---


[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers

2018-05-15 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20894
  
ping @gengliangwang 


---


[GitHub] spark issue #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf...

2018-05-15 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21341
  
cc @gatorsmile @viirya @jiangxb1987 


---


[GitHub] spark pull request #21341: Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that ...

2018-05-15 Thread cloud-fan

GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/21341

Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is acces…

…sed only on the driver"

This reverts commit a4206d58e05ab9ed6f01fee57e18dee65cbc4efc.

This is from https://github.com/apache/spark/pull/21299 and to ease the 
review of it.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark revert

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21341.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21341


commit 06743015fbfca7060c800daedfd65bc9c52bf7b4
Author: Wenchen Fan 
Date:   2018-05-16T06:54:08Z

Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed 
only on the driver"

This reverts commit a4206d58e05ab9ed6f01fee57e18dee65cbc4efc.




---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90666/
Test FAILed.


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21291
  
**[Test build #90666 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90666/testReport)**
 for PR 21291 at commit 
[`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21106
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21106
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3251/
Test PASSed.


---


[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21252
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3250/
Test PASSed.


---


[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21252
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21106: [SPARK-23711][SQL] Add fallback generator for UnsafeProj...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21106
  
**[Test build #90671 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90671/testReport)**
 for PR 21106 at commit 
[`f883c2b`](https://github.com/apache/spark/commit/f883c2b8f2b80b2d73e28d78fcaa6530143e0b66).


---


[GitHub] spark issue #21252: [SPARK-24193] Sort by disk when number of limit is big i...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21252
  
**[Test build #90670 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90670/testReport)**
 for PR 21252 at commit 
[`6fa3e58`](https://github.com/apache/spark/commit/6fa3e582582fafffdc469943177e47272ba4c8a0).


---


[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21258
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90665/
Test FAILed.


---


[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21258
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21258
  
**[Test build #90665 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90665/testReport)**
 for PR 21258 at commit 
[`afd2ebb`](https://github.com/apache/spark/commit/afd2ebbb48f45f9763e0e602262f5b558f90077a).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21086: [SPARK-24002] [SQL] Task not serializable caused by org....

2018-05-15 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21086
  
since people hit this issue, let's backport. cc @gatorsmile 


---


[GitHub] spark pull request #21086: [SPARK-24002] [SQL] Task not serializable caused ...

2018-05-15 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21086#discussion_r188504187
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -351,12 +338,26 @@ class ParquetFileFormat
 val timestampConversion: Boolean =
   sparkSession.sessionState.conf.isParquetINT96TimestampConversion
 val capacity = sqlConf.parquetVectorizedReaderBatchSize
+val enableParquetFilterPushDown: Boolean =
+  sparkSession.sessionState.conf.parquetFilterPushDown
 // Whole stage codegen (PhysicalRDD) is able to deal with batches 
directly
 val returningBatch = supportBatch(sparkSession, resultSchema)
 
 (file: PartitionedFile) => {
   assert(file.partitionValues.numFields == partitionSchema.size)
 
+  // Try to push down filters when filter push-down is enabled.
--- End diff --

Now the code is inside the read function, which is executed on the 
executor side, so we no longer need to serialize `ParquetFilters`.
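
The point above can be sketched without Spark. This is only an illustration under stated assumptions: `FilterHelper` is a hypothetical stand-in for a non-serializable helper such as `ParquetFilters`, plain Java serialization stands in for shipping a task to an executor, and the behaviour relies on Scala 2.12 or later, where function literals are serializable when everything they capture is.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical helper that does NOT extend Serializable.
class FilterHelper(val enabled: Boolean) {
  def keep(x: Int): Boolean = !enabled || x > 0
}

object ClosureSketch {
  // Returns true if `f` can be written with plain Java serialization.
  def isSerializable(f: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(f)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // Helper is built outside the function, so the closure captures it
  // and it would have to be serialized together with the closure.
  def capturesHelper(enabled: Boolean): Int => Boolean = {
    val helper = new FilterHelper(enabled)
    (x: Int) => helper.keep(x)
  }

  // Only the Boolean flag is captured; the helper is created inside the
  // function body, i.e. wherever the function eventually runs.
  def buildsHelperInside(enabled: Boolean): Int => Boolean =
    (x: Int) => new FilterHelper(enabled).keep(x)
}
```

Under these assumptions, `ClosureSketch.isSerializable` should reject the first closure and accept the second, which mirrors why reading the flag up front and building the filter inside the read function avoids the serialization problem.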


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20929
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90664/
Test FAILed.


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20929
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20929
  
**[Test build #90664 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90664/testReport)**
 for PR 20929 at commit 
[`53b686d`](https://github.com/apache/spark/commit/53b686dede4e5fbcb2b3e39932602ae0c9974209).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #20611: [SPARK-23425][SQL]Support wildcard in HDFS path for load...

2018-05-15 Thread kevinyu98

Github user kevinyu98 commented on the issue:

https://github.com/apache/spark/pull/20611
  
@sujith71955 Sorry for the delay. I just ran your test case with only my 
fix applied, and it ran successfully. Can you verify it? If so, my fix is 
much simpler. What do you think? Thanks


---


[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21329
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21329
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3249/
Test PASSed.


---


[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21329
  
**[Test build #90669 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90669/testReport)**
 for PR 21329 at commit 
[`353606c`](https://github.com/apache/spark/commit/353606c919d1b61db22e9e9f47ab6ed06d78702e).


---


[GitHub] spark issue #21329: [SPARK-24277][SQL] Code clean up in SQL module: HadoopMa...

2018-05-15 Thread gengliangwang

Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/21329
  
retest this please.


---


[GitHub] spark issue #20872: [SPARK-23264][SQL] Fix scala.MatchError in literals.sql....

2018-05-15 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20872
  
@cloud-fan ok


---


[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21069
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21208
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21069
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3248/
Test PASSed.


---


[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21208
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90662/
Test PASSed.


---


[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21208
  
**[Test build #90662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90662/testReport)**
 for PR 21208 at commit 
[`3bd11e2`](https://github.com/apache/spark/commit/3bd11e2e25cbc172791b9934279589d0cd459ba5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #21069: [SPARK-23920][SQL]add array_remove to remove all ...

2018-05-15 Thread huaxingao

Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/21069#discussion_r188494901
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
 ---
@@ -280,4 +280,35 @@ class CollectionExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper
 
 checkEvaluation(Concat(Seq(aa0, aa1)), Seq(Seq("a", "b"), Seq("c"), 
Seq("d"), Seq("e", "f")))
   }
+
+  test("Array remove") {
+val a0 = Literal.create(Seq(1, 2, 3, 2, 2, 5), ArrayType(IntegerType))
+val a1 = Literal.create(Seq("b", "a", "a", "c", "b"), 
ArrayType(StringType))
+val a2 = Literal.create(Seq[String](null, "", null, ""), 
ArrayType(StringType))
+val a3 = Literal.create(Seq.empty[Integer], ArrayType(IntegerType))
+val a4 = Literal.create(null, ArrayType(StringType))
+val a5 = Literal.create(Seq(1, null, 8, 9, null), 
ArrayType(IntegerType))
+val a6 = Literal.create(Seq(true, false, false, true), 
ArrayType(BooleanType))
+
+checkEvaluation(ArrayRemove(a0, Literal(0)), Seq(1, 2, 3, 2, 2, 5))
+checkEvaluation(ArrayRemove(a0, Literal(1)), Seq(2, 3, 2, 2, 5))
+checkEvaluation(ArrayRemove(a0, Literal(2)), Seq(1, 3, 5))
+checkEvaluation(ArrayRemove(a0, Literal(3)), Seq(1, 2, 2, 2, 5))
+checkEvaluation(ArrayRemove(a0, Literal(5)), Seq(1, 2, 3, 2, 2))
--- End diff --

@ueshin Thank you very much for your comments, and sorry for the late 
reply. I corrected everything except this one. I kept 
`checkEvaluation(ArrayRemove(a0, Literal(0)), Seq(1, 2, 3, 2, 2, 5))` to 
check that nothing is removed when the value is not in the array.
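
As a rough illustration of the behaviour these test cases exercise, here is a plain-collections model. It is a sketch only: `arrayRemove` is a hypothetical helper, not Spark's `ArrayRemove` expression, and it does not model SQL null semantics.

```scala
object ArrayRemoveSketch {
  // Drop every element equal to `elem`; the remaining order is preserved.
  def arrayRemove[T](xs: Seq[T], elem: T): Seq[T] = xs.filterNot(_ == elem)

  def main(args: Array[String]): Unit = {
    val a0 = Seq(1, 2, 3, 2, 2, 5)
    println(arrayRemove(a0, 0)) // 0 is not present, so nothing is removed
    println(arrayRemove(a0, 2)) // every 2 is removed, leaving 1, 3, 5
  }
}
```

The first call corresponds to the `Literal(0)` case discussed above: removing a value that is absent leaves the input unchanged.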


---


[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21069
  
**[Test build #90668 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90668/testReport)**
 for PR 21069 at commit 
[`7fd77d0`](https://github.com/apache/spark/commit/7fd77d01777e7b8bd8b34503cf4d4e7c77df9ecd).


---


[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21069
  
**[Test build #90667 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90667/testReport)**
 for PR 21069 at commit 
[`8011aa9`](https://github.com/apache/spark/commit/8011aa91e0ef6bb13ee7b83532dc6fd236cdf792).


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3247/
Test PASSed.


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-05-15 Thread WeichenXu123

Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r188491670
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{ArrayType, LongType, StructField, 
StructType}
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential 
Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset, rows 
with nulls in this column
+   *are ignored
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (recommended value: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern
+   * (recommended value: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run
+   *   (recommended value: `3200`).
+   * @return A `DataFrame` that contains columns of sequence and 
corresponding frequency.
+   * The schema of it will be:
+   *  - `sequence: Seq[Seq[T]]` (T is the item type)
+   *  - `freq: Long`
+   */
+  @Since("2.4.0")
+  def findFrequentSequentialPatterns(
+  dataset: Dataset[_],
+  sequenceCol: String,
--- End diff --

This way, changing it later into an estimator, e.g. 
`final class PrefixSpan(override val uid: String) extends Params`, would 
seemingly break binary compatibility?



---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21291
  
**[Test build #90666 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90666/testReport)**
 for PR 21291 at commit 
[`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0).


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21291
  
retest this please.


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21092
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21092
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90660/
Test PASSed.


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21092
  
**[Test build #90660 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90660/testReport)**
 for PR 21092 at commit 
[`72953a3`](https://github.com/apache/spark/commit/72953a3ef42ce0aa0d4b55c0f213198b4b468907).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21258
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3246/
Test PASSed.


---


[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21258
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #21325: [R][backport-2.2] backport lint fix

2018-05-15 Thread felixcheung

Github user felixcheung closed the pull request at:

https://github.com/apache/spark/pull/21325


---


[GitHub] spark issue #21258: [SPARK-23933][SQL] Add map_from_arrays function

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21258
  
**[Test build #90665 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90665/testReport)**
 for PR 21258 at commit 
[`afd2ebb`](https://github.com/apache/spark/commit/afd2ebbb48f45f9763e0e602262f5b558f90077a).


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20929
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3245/
Test PASSed.


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20929
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20929
  
**[Test build #90664 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90664/testReport)**
 for PR 20929 at commit 
[`53b686d`](https://github.com/apache/spark/commit/53b686dede4e5fbcb2b3e39932602ae0c9974209).


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20929
  
ok, in this pr, I'll focus on adding a new flag to do so. just a sec for 
the update. Thanks!


---


[GitHub] spark issue #20929: [SPARK-23772][SQL][WIP] Provide an option to ignore colu...

2018-05-15 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20929
  
retest this please


---


[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21340
  
**[Test build #90663 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90663/testReport)**
 for PR 21340 at commit 
[`1c83b32`](https://github.com/apache/spark/commit/1c83b329fb59bb357bcbf4ac14179fa55a8b4aad).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21340
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90663/
Test FAILed.


---


[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21340
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21340
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21340
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3244/
Test PASSed.


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90661/
Test FAILed.


---


[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21340
  
**[Test build #90663 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90663/testReport)**
 for PR 21340 at commit 
[`1c83b32`](https://github.com/apache/spark/commit/1c83b329fb59bb357bcbf4ac14179fa55a8b4aad).


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21291
  
**[Test build #90661 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90661/testReport)**
 for PR 21291 at commit 
[`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #21340: [SPARK-24115] Have logging pass through instrumen...

2018-05-15 Thread MrBago

GitHub user MrBago opened a pull request:

https://github.com/apache/spark/pull/21340

[SPARK-24115] Have logging pass through instrumentation class.

## What changes were proposed in this pull request?

Fixes to tuning instrumentation.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MrBago/spark tunning-instrumentation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21340.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21340


commit 1c83b329fb59bb357bcbf4ac14179fa55a8b4aad
Author: Bago Amirbekian 
Date:   2018-05-16T01:39:31Z

Have logging pass through instrumentation class.




---


[GitHub] spark pull request #21336: [SPARK-24286][Documentation] DataFrameReader.csv ...

2018-05-15 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21336#discussion_r188476857
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -521,7 +521,7 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
*
* You can set the following CSV-specific options to deal with CSV files:
* 
-   * `sep` (default `,`): sets a single character as a separator for 
each
+   * `sep` or  `delimiter` (default `,`): sets a single character as a 
separator for each
--- End diff --

`sep` is preferred and `delimiter` is not documented on purpose.
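
As an illustration of the documented option name, a minimal sketch of reading a semicolon-separated file with `sep` might look like the following. It assumes Spark 2.x or later on the classpath; the local session settings, the app name, and the temporary file are made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("csv-sep-sketch")
  .getOrCreate()

// Write a tiny semicolon-separated file so the sketch is self-contained.
val path = Files.createTempFile("sep-sketch", ".csv")
Files.write(path, "id;name\n1;a\n2;b\n".getBytes("UTF-8"))

// `sep` is the documented option name for the field separator.
val df = spark.read
  .option("header", "true")
  .option("sep", ";")
  .csv(path.toString)

df.show()
```

Sticking to `sep` in user code keeps it aligned with the documented option list.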


---


[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21338
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21338
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90659/
Test FAILed.


---


[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21338
  
**[Test build #90659 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90659/testReport)**
 for PR 21338 at commit 
[`60d058e`](https://github.com/apache/spark/commit/60d058e02be7d2daf4d7c5f0abff3530c2349c00).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark pull request #21086: [SPARK-24002] [SQL] Task not serializable caused ...

2018-05-15 Thread ghoto

Github user ghoto commented on a diff in the pull request:

https://github.com/apache/spark/pull/21086#discussion_r188473831
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -351,12 +338,26 @@ class ParquetFileFormat
 val timestampConversion: Boolean =
   sparkSession.sessionState.conf.isParquetINT96TimestampConversion
 val capacity = sqlConf.parquetVectorizedReaderBatchSize
+val enableParquetFilterPushDown: Boolean =
+  sparkSession.sessionState.conf.parquetFilterPushDown
 // Whole stage codegen (PhysicalRDD) is able to deal with batches 
directly
 val returningBatch = supportBatch(sparkSession, resultSchema)
 
 (file: PartitionedFile) => {
   assert(file.partitionValues.numFields == partitionSchema.size)
 
+  // Try to push down filters when filter push-down is enabled.
--- End diff --

So this code is the same as before. How can this solve the bug described in 
the head of the Conversation?


---


[GitHub] spark pull request #21153: [SPARK-24058][ML][PySpark] Default Params in ML s...

2018-05-15 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21153


---


[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...

2018-05-15 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21153
  
Thanks @jkbradley @WeichenXu123 


---


[GitHub] spark issue #21153: [SPARK-24058][ML][PySpark] Default Params in ML should b...

2018-05-15 Thread jkbradley

Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21153
  
OK thanks @viirya !
Merging with master



---


[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function

2018-05-15 Thread pepinoflo

Github user pepinoflo commented on the issue:

https://github.com/apache/spark/pull/21208
  
Just changed my email address in the last 9 commits. Unfortunately I 
wasn't able to rewrite the first commit, as the first merge could not be 
preserved even with `git rebase -i -p`. Is it OK to merge anyway, or does this 
need to be fixed somehow (maybe by removing the 2 merges entirely and doing a new 
merge)?


---


[GitHub] spark issue #21208: [SPARK-23925][SQL] Add array_repeat collection function

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21208
  
**[Test build #90662 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90662/testReport)**
 for PR 21208 at commit 
[`3bd11e2`](https://github.com/apache/spark/commit/3bd11e2e25cbc172791b9934279589d0cd459ba5).


---


[GitHub] spark pull request #20973: [SPARK-20114][ML] spark.ml parity for sequential ...

2018-05-15 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20973#discussion_r188464083
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/PrefixSpan.scala ---
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.fpm
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.mllib.fpm.{PrefixSpan => mllibPrefixSpan}
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.types.{ArrayType, LongType, StructField, 
StructType}
+
+/**
+ * :: Experimental ::
+ * A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+ * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: 
Mining Sequential Patterns
+ * Efficiently by Prefix-Projected Pattern Growth
+ * (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+ *
+ * @see <a href="https://en.wikipedia.org/wiki/Sequential_Pattern_Mining">Sequential 
Pattern Mining
+ * (Wikipedia)</a>
+ */
+@Since("2.4.0")
+@Experimental
+object PrefixSpan {
+
+  /**
+   * :: Experimental ::
+   * Finds the complete set of frequent sequential patterns in the input 
sequences of itemsets.
+   *
+   * @param dataset A dataset or a dataframe containing a sequence column 
which is
+   *{{{Seq[Seq[_]]}}} type
+   * @param sequenceCol the name of the sequence column in dataset, rows 
with nulls in this column
+   *are ignored
+   * @param minSupport the minimal support level of the sequential 
pattern, any pattern that
+   *   appears more than (minSupport * 
size-of-the-dataset) times will be output
+   *  (recommended value: `0.1`).
+   * @param maxPatternLength the maximal length of the sequential pattern
+   * (recommended value: `10`).
+   * @param maxLocalProjDBSize The maximum number of items (including 
delimiters used in the
+   *   internal storage format) allowed in a 
projected database before
+   *   local processing. If a projected database 
exceeds this size, another
+   *   iteration of distributed prefix growth is 
run
+   *   (recommended value: `3200`).
+   * @return A `DataFrame` that contains columns of sequence and 
corresponding frequency.
+   * The schema of it will be:
+   *  - `sequence: Seq[Seq[T]]` (T is the item type)
+   *  - `freq: Long`
+   */
+  @Since("2.4.0")
+  def findFrequentSequentialPatterns(
+  dataset: Dataset[_],
+  sequenceCol: String,
--- End diff --

It should be easier to keep the `PrefixSpan` name and make it an 
`Estimator` later. For example:

~~~scala
final class PrefixSpan(override val uid: String) extends Params {
  // param, setters, getters
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame
}
~~~

Later we can add `Estimator.fit` and `PrefixSpanModel.transform`. Any issue 
with this approach?
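
A caller-side sketch of the class-based shape suggested above, assuming Spark 2.4 or later on the classpath and spark.ml-style setters (`setSequenceCol`, `setMinSupport`, `setMaxPatternLength`) for the params; the sample data and session settings are illustrative only.

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

// Local session for illustration only; the app name is arbitrary.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("prefixspan-sketch")
  .getOrCreate()
import spark.implicits._

// Each row holds one sequence of itemsets (Seq[Seq[Int]]).
val sequences = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6))
).toDF("sequence")

// Params are configured on the instance instead of being passed
// as a long positional argument list to a method on an object.
val patterns = new PrefixSpan()
  .setSequenceCol("sequence")
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .findFrequentSequentialPatterns(sequences)

patterns.show()
```

With the params held on the instance, a later `fit` method could reuse the same setters without changing call sites.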


---


[GitHub] spark issue #21086: [SPARK-24002] [SQL] Task not serializable caused by org....

2018-05-15 Thread ghoto

Github user ghoto commented on the issue:

https://github.com/apache/spark/pull/21086
  
I'm hitting this issue after upgrading from 2.0.2 to 2.3.0. Please backport 
this PR to Spark 2.3.0


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21291
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3243/
Test PASSed.


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21092
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3146/



---


[GitHub] spark pull request #21337: [SPARK-24234][SS] Reader for continuous processin...

2018-05-15 Thread jose-torres

Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/21337#discussion_r188456856
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala
 ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.continuous.shuffle
+
+import java.util.UUID
+
+import org.apache.spark.{Partition, SparkContext, SparkEnv, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.util.NextIterator
+
+case class ContinuousShuffleReadPartition(index: Int) extends Partition {
+  // Initialized only on the executor, and only once even as we call 
compute() multiple times.
+  lazy val (receiver, endpoint) = {
+val env = SparkEnv.get.rpcEnv
+val receiver = new UnsafeRowReceiver(env)
+val endpoint = env.setupEndpoint(UUID.randomUUID().toString, receiver)
+TaskContext.get().addTaskCompletionListener { ctx =>
+  env.stop(endpoint)
+}
+(receiver, endpoint)
+  }
+}
+
+/**
+ * RDD at the bottom of each continuous processing shuffle task, reading 
from the
+ */
+class ContinuousShuffleReadRDD(sc: SparkContext, numPartitions: Int)
+extends RDD[UnsafeRow](sc, Nil) {
+
+  override protected def getPartitions: Array[Partition] = {
+(0 until numPartitions).map(ContinuousShuffleReadPartition).toArray
+  }
+
+  override def compute(split: Partition, context: TaskContext): 
Iterator[UnsafeRow] = {
+val receiver = 
split.asInstanceOf[ContinuousShuffleReadPartition].receiver
+
+new NextIterator[UnsafeRow] {
+  override def getNext(): UnsafeRow = receiver.poll() match {
+case ReceiverRow(r) => r
+case ReceiverEpochMarker() =>
--- End diff --

It should, but I think that's significant enough to justify its own PR. 
Added an explicit TODO to be safe.


---


[GitHub] spark pull request #21337: [SPARK-24234][SS] Reader for continuous processin...

2018-05-15 Thread jose-torres

Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/21337#discussion_r188456692
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleReadRDD.scala
 ---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.continuous.shuffle
+
+import java.util.UUID
+
+import org.apache.spark.{Partition, SparkContext, SparkEnv, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import org.apache.spark.util.NextIterator
+
+case class ContinuousShuffleReadPartition(index: Int) extends Partition {
+  // Initialized only on the executor, and only once even as we call 
compute() multiple times.
+  lazy val (receiver, endpoint) = {
+val env = SparkEnv.get.rpcEnv
+val receiver = new UnsafeRowReceiver(env)
+val endpoint = env.setupEndpoint(UUID.randomUUID().toString, receiver)
+TaskContext.get().addTaskCompletionListener { ctx =>
+  env.stop(endpoint)
+}
+(receiver, endpoint)
+  }
+}
+
+/**
+ * RDD at the bottom of each continuous processing shuffle task, reading 
from the
--- End diff --

Well, ContinuousShuffleReadRDD is a bit self-documenting as a reader. Added 
that it's receiving shuffle data from upstream tasks.


---


[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...

2018-05-15 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21338
  
I'll reply to the original e-mail on the PMC list.


---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21291
  
**[Test build #90661 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90661/testReport)**
 for PR 21291 at commit 
[`f93738b`](https://github.com/apache/spark/commit/f93738be3a7509d70568b3060a0cc4dd3ff23da0).


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21092
  
Kubernetes integration test starting
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3146/



---


[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...

2018-05-15 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21291
  
retest this please.


---


[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...

2018-05-15 Thread shivaram

Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/21338
  
Can we check this with the appropriate Apache group (is it infra?)? It seems 
odd that the policy would require removing them when Nexus requires them.


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21092
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21092
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3242/
Test PASSed.


---


[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21322
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21322
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90652/
Test PASSed.


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21092
  
**[Test build #90660 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90660/testReport)**
 for PR 21092 at commit 
[`72953a3`](https://github.com/apache/spark/commit/72953a3ef42ce0aa0d4b55c0f213198b4b468907).


---


[GitHub] spark issue #21092: [SPARK-23984][K8S] Initial Python Bindings for PySpark o...

2018-05-15 Thread ifilonenko

Github user ifilonenko commented on the issue:

https://github.com/apache/spark/pull/21092
  
jenkins, retest this please


---


[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21322
  
**[Test build #90652 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90652/testReport)**
 for PR 21322 at commit 
[`6a08c43`](https://github.com/apache/spark/commit/6a08c434cf967b939b8065bb23d64d0715e38a2c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21322
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90651/
Test PASSed.


---


[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21322
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21322: [SPARK-24225][CORE] Support closing AutoClosable objects...

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21322
  
**[Test build #90651 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90651/testReport)**
 for PR 21322 at commit 
[`62d46d3`](https://github.com/apache/spark/commit/62d46d3bf49ef0393a916d3cafaae4947f374f36).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...

2018-05-15 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21338
  
Right. The new policy says we shouldn't use MD5 files, but the Nexus server 
requires them.


---


[GitHub] spark issue #21338: [SPARK-23601][build][follow-up] Keep md5 checksums for n...

2018-05-15 Thread shivaram

Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/21338
  
If I follow this correctly, this is a partial revert only for the Nexus 
artifacts?


---


[GitHub] spark issue #21326: [SPARK-24275][SQL] Revise doc comments in InputPartition

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21326
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21326: [SPARK-24275][SQL] Revise doc comments in InputPartition

2018-05-15 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21326
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90650/
Test PASSed.


---


[GitHub] spark issue #21326: [SPARK-24275][SQL] Revise doc comments in InputPartition

2018-05-15 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21326
  
**[Test build #90650 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90650/testReport)**
 for PR 21326 at commit 
[`f571750`](https://github.com/apache/spark/commit/f571750b26a7da936e48ba5e40528e6a16c43744).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

