[GitHub] spark pull request #16641: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
Github user someorz closed the pull request at: https://github.com/apache/spark/pull/16641 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[GitHub] spark pull request #16641: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
GitHub user someorz opened a pull request: https://github.com/apache/spark/pull/16641 Merge pull request #1 from apache/master update ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16344 **[Test build #71643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71643/testReport)** for PR 16344 at commit

[GitHub] spark pull request #16640: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
Github user someorz closed the pull request at: https://github.com/apache/spark/pull/16640

[GitHub] spark issue #16640: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
Github user someorz commented on the issue: https://github.com/apache/spark/pull/16640 update

[GitHub] spark pull request #16640: Merge pull request #1 from apache/master

2017-01-18 Thread someorz
GitHub user someorz opened a pull request: https://github.com/apache/spark/pull/16640 Merge pull request #1 from apache/master update ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16593 @windpiger Could you do me a favor and add a dedicated test case in this PR? - Create a partitioned Hive Table - Create a partitioned data source Table - Create a partitioned Hive

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16344 jenkins test this please

[GitHub] spark issue #16344: [SPARK-18929][ML] Add Tweedie distribution in GLM

2017-01-18 Thread MLnick
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16344 jenkins add to whitelist

[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96805975 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -87,8 +101,8 @@ case class

[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96805893 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -64,7 +77,7 @@ case class

[GitHub] spark issue #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15415 **[Test build #71642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71642/testReport)** for PR 15415 at commit

[GitHub] spark issue #12064: [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSu...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12064 **[Test build #71641 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71641/testReport)** for PR 12064 at commit

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r96804046 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r96803812 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71640/ Test FAILed.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71640/testReport)** for PR 16639 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 We need to define a new map output statistic to do this.

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf I don't think it would work. Map output statistics are just an approximate number of output bytes. You can't use them to get the correct row count.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71638/testReport)** for PR 16639 at commit

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71638/ Test FAILed.

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-18 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r96802011 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/AssociationRules.scala --- @@ -0,0 +1,113 @@ +/* + * Licensed to the Apache Software

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71640/testReport)** for PR 16639 at commit

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71637/testReport)** for PR 16639 at commit

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71637/ Test FAILed.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread squito
Github user squito commented on the issue: https://github.com/apache/spark/pull/16639 cc @kayousterhout @markhamstra @mateiz This isn't just protecting against crazy user code -- I've seen users hit this with spark sql (because of

[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96800976 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -586,12 +594,12 @@ class SessionCatalog(

[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16552 **[Test build #71639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71639/testReport)** for PR 16552 at commit

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Merged build finished. Test FAILed.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16639 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71636/ Test FAILed.

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71638/testReport)** for PR 16639 at commit

[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96800456 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -586,12 +594,12 @@ class SessionCatalog(

[GitHub] spark issue #16581: [SPARK-18589] [SQL] Fix Python UDF accessing attributes ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16581 **[Test build #3541 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3541/testReport)** for PR 16581 at commit

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71637/testReport)** for PR 16639 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Yes, you are right, we cannot ensure a uniform distribution for global limit. One idea is not to use a special partitioner; after the shuffle we should get the map output statistics for row num of

[GitHub] spark issue #16639: [SPARK-19276][CORE] Fetch Failure handling robust to use...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16639 **[Test build #71636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71636/testReport)** for PR 16639 at commit

[GitHub] spark pull request #16639: [SPARK-19276][CORE] Fetch Failure handling robust...

2017-01-18 Thread squito
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/16639 [SPARK-19276][CORE] Fetch Failure handling robust to user error handling ## What changes were proposed in this pull request? Fault-tolerance in spark requires special handling of shuffle

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71633/ Test PASSed.

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Merged build finished. Test PASSed.

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)** for PR 16633 at commit

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71635/testReport)** for PR 16635 at commit

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 retest this please

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 @cloud-fan Incorporated code review comments

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16605 Okay. But once this issue is finished, I'm planning to take on SPARK-12823 in a similar way. Do you think it's also not worth trying struct? cc: @cloud-fan

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16605 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71631/ Test PASSed.

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16605 Merged build finished. Test PASSed.

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16605 **[Test build #71631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71631/testReport)** for PR 16605 at commit

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Merged build finished. Test PASSed.

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71630/ Test PASSed.

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16605 Well, it will be good if we can support `Array` in `ScalaUDF`, but it's not a big deal as users can easily do `udf { (seq: Seq[Int]) => val a = seq.toArray; // do anything you like with the array

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71630 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71630/testReport)** for PR 16635 at commit

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Merged build finished. Test PASSed.

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71634/ Test PASSed.

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71634 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71634/testReport)** for PR 16593 at commit

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71632/ Test FAILed.

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16593 Merged build finished. Test FAILed.

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71632/testReport)** for PR 16593 at commit

[GitHub] spark pull request #16634: [SPARK-16968][SQL][Backport-2.0]Add additional op...

2017-01-18 Thread gatorsmile
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/16634

[GitHub] spark issue #16634: [SPARK-16968][SQL][Backport-2.0]Add additional options i...

2017-01-18 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16634 Thanks for the review. Merging to 2.0!

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16605 Sure, @maropu. `WrappedArray` is not documented for now. Hi, @gatorsmile and @cloud-fan. Could you review this PR?

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf No. A simple example: suppose there are 5 local limits which produce 1, 2, 1, 1, 1 rows when the limit is 10. If you shuffle to 5 partitions, the distributions for each local limit look like:

[GitHub] spark issue #16213: [SPARK-18020][Streaming][Kinesis] Checkpoint SHARD_END t...

2017-01-18 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16213 @tdas ping

[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790387 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread maropu
Github user maropu commented on the issue: https://github.com/apache/spark/pull/16605 oh, yea. I didn't find that and I think it's a good point. IMO `WrappedArray` is implicitly used inside for implicit conversions, so users do not use `WrappedArray` directly for UDFs in most

[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790264 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with

[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790238 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with

[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96790019 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 Refer to the mailing list: > One issue left is how to decide the shuffle partition number. We can have a config of the maximum number of elements for each GlobalLimit task to process, then do a

[GitHub] spark pull request #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread maropu
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/16605#discussion_r96789868 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala --- @@ -84,7 +86,9 @@ case class ScalaUDF( case 1 =>

[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96789821 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with

[GitHub] spark pull request #16635: [SPARK-19059] [SQL] Unable to retrieve data from ...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16635#discussion_r96789777 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2513,4 +2513,18 @@ class SQLQuerySuite extends QueryTest with

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)** for PR 16633 at commit

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71634/testReport)** for PR 16593 at commit

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf > it uses a special partitioner to do this; the partitioner, like row_number in SQL, gives each row a uniform partition id, so in the reduce task, each task handles num of rows very

[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96788653 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -87,8 +101,8 @@ case class

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue: https://github.com/apache/spark/pull/16633 To be clear, we now have these issues: 1. Local limit computes all partitions, which means it launches many tasks when a small number of tasks may actually be enough. 2. global limit single partition

[GitHub] spark issue #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable work with ...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16593 **[Test build #71632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71632/testReport)** for PR 16593 at commit

[GitHub] spark pull request #16593: [SPARK-19153][SQL]DataFrameWriter.saveAsTable wor...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16593#discussion_r96788538 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1361,6 +1355,22 @@ class HiveDDLSuite }

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71627/ Test FAILed.

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16633 Merged build finished. Test FAILed.

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16633 **[Test build #71627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)** for PR 16633 at commit

[GitHub] spark issue #16605: [SPARK-18884][SQL] Support Array[_] in ScalaUDF

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16605 **[Test build #71631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71631/testReport)** for PR 16605 at commit

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71630/testReport)** for PR 16635 at commit

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 retest this please

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71629/ Test FAILed.

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71629/testReport)** for PR 16635 at commit

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16635 Merged build finished. Test FAILed.

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16635 **[Test build #71629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71629/testReport)** for PR 16635 at commit

[GitHub] spark issue #16635: [SPARK-19059] [SQL] Unable to retrieve data from parquet...

2017-01-18 Thread jayadevanmurali
Github user jayadevanmurali commented on the issue: https://github.com/apache/spark/pull/16635 retest this please

[GitHub] spark issue #16552: [SPARK-19152][SQL]DataFrameWriter.saveAsTable support hi...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16552 The overall idea is to use `InsertIntoTable` to implement appending to a Hive table, but this approach is too hacky. We should follow the way we deal with data source tables, e.g.

[GitHub] spark pull request #16633: [SPARK-19274][SQL] Make GlobalLimit without shuff...

2017-01-18 Thread viirya
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16633#discussion_r96786080 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala --- @@ -90,21 +94,74 @@ trait BaseLimitExec extends UnaryExecNode with

[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96785980 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -118,6 +118,14 @@ class SessionCatalog(

[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16633 @scwf The main issue the user posted on the mailing list is that the limit or the number of partitions is big enough to cause a performance bottleneck when shuffling the data of the local limit. But

[GitHub] spark issue #16638: spark-19115

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16638 Can one of the admins verify this patch?

[GitHub] spark pull request #16638: spark-19115

2017-01-18 Thread ouyangxiaochen
GitHub user ouyangxiaochen opened a pull request: https://github.com/apache/spark/pull/16638 spark-19115 ## What changes were proposed in this pull request? Spark SQL supports the command: create external table if not exists gen_tbl like src_tbl location '/warehouse/gen_tbl' in

[GitHub] spark pull request #16621: [SPARK-19265][SQL] make table relation cache gene...

2017-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16621#discussion_r96785426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -118,6 +118,14 @@ class SessionCatalog( }

[GitHub] spark pull request #16627: [SPARK-19267][SS]Fix a race condition when stoppi...

2017-01-18 Thread tdas
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16627#discussion_r96779672 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala --- @@ -34,6 +35,132 @@ import
