[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/7929#discussion_r36274062 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala --- @@ -62,6 +64,52 @@ private[hive] class ClientWrapper( extends ClientInterface with Logging { + overrideHadoopShims() + + // !! HACK ALERT !! + // + // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2 + // scripts. We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop + // version in the future. + // + // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking + // version information gathered from Hadoop jar files. If the major version number is 1, + // `Hadoop20SShims` will be loaded. Otherwise, if the major version number is 2, `Hadoop23Shims` + // will be chosen. + // + // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical + // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with --- End diff -- I'd also be okay matching against all 2.0.x, if you prefer that, and updating comment to suggest same. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
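The override described in the quoted comment could be sketched roughly as follows. This is a hypothetical illustration, not the code actually merged in the PR: the helper name `loadHadoop20SShims`, the reflected field name, and the reflection approach are all assumptions. The idea is to read the Hadoop version string via `VersionInfo` and, when it identifies the Hadoop-1-like CDH4 build, force Hive to use `Hadoop20SShims` instead of letting `ShimLoader` pick by major version.

```scala
// Hypothetical sketch of overrideHadoopShims() -- names and reflection
// details are assumptions, not the code actually merged in the PR.
import org.apache.hadoop.util.VersionInfo

private def overrideHadoopShims(): Unit = {
  val version = VersionInfo.getVersion // e.g. "2.0.0-mr1-cdh4.1.1"
  if (version == "2.0.0-mr1-cdh4.1.1") {
    // ShimLoader would otherwise see major version "2" and pick
    // Hadoop23Shims; inject the Hadoop-1-style shims instead.
    loadHadoop20SShims()
  }
  // For any other version, fall through and let Hive's ShimLoader
  // choose shims from the jar's version information as usual.
}

// Assumed helper: replace Hive ShimLoader's cached shims via reflection.
private def loadHadoop20SShims(): Unit = {
  val shims = Class.forName("org.apache.hadoop.hive.shims.Hadoop20SShims").newInstance()
  val shimLoader = Class.forName("org.apache.hadoop.hive.shims.ShimLoader")
  val field = shimLoader.getDeclaredField("hadoopShims") // assumed field name
  field.setAccessible(true)
  field.set(null, shims)
}
```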
[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7918#issuecomment-127894609 [Test build #39835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39835/consoleFull) for PR 7918 at commit [`9fb1eb2`](https://github.com/apache/spark/commit/9fb1eb2dd9647fe0a3614ddc8fb7cd4e5075fc16).

[GitHub] spark pull request: [SPARK-6486] [MLlib] [Python] Add BlockMatrix ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7761#issuecomment-127894409 [Test build #39834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39834/consoleFull) for PR 7761 at commit [`27195c2`](https://github.com/apache/spark/commit/27195c236b51d862039905522e317ebc6dc75d7d).
[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7918#issuecomment-127894096 Merged build started.
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127894190 [Test build #39833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39833/consoleFull) for PR 7893 at commit [`81ff97b`](https://github.com/apache/spark/commit/81ff97bcf3c6f368046a53376a3285354000972b).
[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7929#discussion_r36273851 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala --- @@ -62,6 +64,52 @@ private[hive] class ClientWrapper( extends ClientInterface with Logging { + overrideHadoopShims() + + // !! HACK ALERT !! + // + // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2 + // scripts. We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop + // version in the future. + // + // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking + // version information gathered from Hadoop jar files. If the major version number is 1, + // `Hadoop20SShims` will be loaded. Otherwise, if the major version number is 2, `Hadoop23Shims` + // will be chosen. + // + // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical + // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with --- End diff -- My gut is that there's much more reason to believe other 2.0.x builds work the same way. The method in question here (as far as I understand) never appeared in any 2.0.x release. Occam's razor would suggest not special-casing here. I don't know that CDH4 is the only relevant 2.0.x release; certainly upstream Apache Hadoop made a number of 2.0.x releases that this change would (again as far as I understand) affect as well and would be left out. At the least, let's get the comment updated. Also, `mr1` really isn't relevant. I would not special-case cdh4, since the comments will say it's not special.
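Matching all of 2.0.x rather than the single CDH4 version string, as suggested above, could be as simple as a prefix check on the Hadoop version. This is illustrative only; the version-matching strategy was still under discussion at this point in the review:

```scala
// Illustrative: treat every 2.0.x build as Hadoop-1-like, not just CDH4.
val version = org.apache.hadoop.util.VersionInfo.getVersion
val isHadoop1Like = version.startsWith("2.0.")
// "2.0.0-mr1-cdh4.1.1" matches; "2.2.0" does not.
```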
[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7918#issuecomment-127894040 Merged build triggered.
[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7924#issuecomment-127892586 [Test build #39830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39830/consoleFull) for PR 7924 at commit [`581e9e3`](https://github.com/apache/spark/commit/581e9e3f79e98dd4c5f52543a1eb635999bb6e60).
[GitHub] spark pull request: [SPARK-9065][Streaming][PySpark] Add MessageHa...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/7410#issuecomment-127892684 Hi @tdas, would you please help review this patch? Thanks a lot.
[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...
Github user XuTingjun commented on the pull request: https://github.com/apache/spark/pull/7918#issuecomment-127892711 Thanks all, I have added the documentation.
[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7918#issuecomment-127892342 Merged build started.
[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7918#issuecomment-127892312 Merged build triggered.
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127892357 Merged build started.
[GitHub] spark pull request: [SPARK-6486] [MLlib] [Python] Add BlockMatrix ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7761#issuecomment-127892341 Merged build triggered.
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127892327 Merged build triggered.
[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7953#issuecomment-127892353 [Test build #39829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39829/consoleFull) for PR 7953 at commit [`3cac3cc`](https://github.com/apache/spark/commit/3cac3cc68d3c6113b036526b64cfdeab57d57588).
[GitHub] spark pull request: [SPARK-6486] [MLlib] [Python] Add BlockMatrix ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7761#issuecomment-127892387 Merged build started.
[GitHub] spark pull request: [SPARK-9607] [SPARK-9608] fix zinc-port handli...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/7944#issuecomment-127891984 Nah, not a big enough thing to create a new JIRA for. Anyways this LGTM. @JoshRosen feel free to merge.
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127892005 [Test build #225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/225/consoleFull) for PR 7893 at commit [`81ff97b`](https://github.com/apache/spark/commit/81ff97bcf3c6f368046a53376a3285354000972b).
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127891934 Merged build started.
[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7924#issuecomment-127892111 [Test build #224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/224/consoleFull) for PR 7924 at commit [`581e9e3`](https://github.com/apache/spark/commit/581e9e3f79e98dd4c5f52543a1eb635999bb6e60).
[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7924#issuecomment-127891878 Merged build triggered.
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127891747 [Test build #1350 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1350/consoleFull) for PR 7893 at commit [`81ff97b`](https://github.com/apache/spark/commit/81ff97bcf3c6f368046a53376a3285354000972b).
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127891887 Merged build triggered.
[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7924#issuecomment-127891935 Merged build started.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7833#issuecomment-127891028 [Test build #39831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39831/consoleFull) for PR 7833 at commit [`9570bec`](https://github.com/apache/spark/commit/9570bec0d54537e51623b2b5777895c209dd706a).
[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7953#issuecomment-127890921 LGTM pending Jenkins passing.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7833#issuecomment-127890876 Merged build triggered.
[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7953#issuecomment-127890869 Merged build triggered.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7833#issuecomment-127890915 Merged build started.
[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7924#issuecomment-127890873 Merged build triggered.
[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/7924#issuecomment-127890833 retest this please.
[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7953#issuecomment-127890897 Merged build started.
[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7924#issuecomment-127890916 Merged build started.
[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/7893#issuecomment-127890943 retest this please.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7833#issuecomment-127890739 [Test build #223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/223/consoleFull) for PR 7833 at commit [`9570bec`](https://github.com/apache/spark/commit/9570bec0d54537e51623b2b5777895c209dd706a).
[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/7929#issuecomment-127890665 Do feel free to get the comment thing hashed out with @srowen. My time zone is approaching bedtime, so I have to sign off. Would be nice to get something of this nature in soon because of the test issues.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/7833#issuecomment-127890431 retest this please
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7833#issuecomment-127890533 Merged build started.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7833#issuecomment-127890489 Merged build triggered.
[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/7929#discussion_r36273191 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala --- @@ -62,6 +64,52 @@ private[hive] class ClientWrapper( extends ClientInterface with Logging { + overrideHadoopShims() + + // !! HACK ALERT !! + // + // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2 + // scripts. We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop + // version in the future. + // + // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking + // version information gathered from Hadoop jar files. If the major version number is 1, + // `Hadoop20SShims` will be loaded. Otherwise, if the major version number is 2, `Hadoop23Shims` + // will be chosen. + // + // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical + // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with --- End diff -- Yeah I agree the comment is slightly wrong. I think CDH4 named the release with "mr1" because they took the upstream 2.0.X release but then packaged with the older (pre-yarn) version of MR. So this comment could be improved or just made shorter. In terms of covering other Hadoop 2.0.x distributions - as far as I know no one other than cloudera ever really distributed this. I am pretty hesitant to make any assumptions about what other Hadoop 2.0.x distributions might contain, because that in general was not a time of API stability for Hadoop and there generally variance around API's. So my feeling was to just cover the one case we do distribute binary builds for (the chd4 distribution). 
My main feeling was: we should make this work for the cdh4 version that we do provide binary builds for, but not go crazy trying to hypothesize about other one-off Hadoop versions that were packaged around that time, if any exist. I do agree, though, that the comment could be made more succinct and accurate.
[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...
GitHub user yjshen opened a pull request: https://github.com/apache/spark/pull/7953 [SPARK-9628][SQL]Rename int to SQLDate, long to SQLTimestamp for better readability JIRA: https://issues.apache.org/jira/browse/SPARK-9628 You can merge this pull request into a Git repository by running: $ git pull https://github.com/yjshen/spark datetime_alias Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7953.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7953 commit 3cac3cc68d3c6113b036526b64cfdeab57d57588 Author: Yijie Shen Date: 2015-08-05T06:35:04Z rename int to SQLDate, long to SQLTimestamp for better readability
[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7952#issuecomment-127889452 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/7948#discussion_r36272873 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java --- @@ -191,24 +191,28 @@ public void spill() throws IOException { spillWriters.size(), spillWriters.size() > 1 ? " times" : " time"); -final UnsafeSorterSpillWriter spillWriter = - new UnsafeSorterSpillWriter(blockManager, fileBufferSizeBytes, writeMetrics, -inMemSorter.numRecords()); -spillWriters.add(spillWriter); -final UnsafeSorterIterator sortedRecords = inMemSorter.getSortedIterator(); -while (sortedRecords.hasNext()) { - sortedRecords.loadNext(); - final Object baseObject = sortedRecords.getBaseObject(); - final long baseOffset = sortedRecords.getBaseOffset(); - final int recordLength = sortedRecords.getRecordLength(); - spillWriter.write(baseObject, baseOffset, recordLength, sortedRecords.getKeyPrefix()); +// We only write out contents of the inMemSorter if it is not empty. +if (inMemSorter.numRecords() > 0) { + final UnsafeSorterSpillWriter spillWriter = +new UnsafeSorterSpillWriter(blockManager, fileBufferSizeBytes, writeMetrics, + inMemSorter.numRecords()); + spillWriters.add(spillWriter); + final UnsafeSorterIterator sortedRecords = inMemSorter.getSortedIterator(); + while (sortedRecords.hasNext()) { +sortedRecords.loadNext(); +final Object baseObject = sortedRecords.getBaseObject(); +final long baseOffset = sortedRecords.getBaseOffset(); +final int recordLength = sortedRecords.getRecordLength(); +spillWriter.write(baseObject, baseOffset, recordLength, sortedRecords.getKeyPrefix()); + } + spillWriter.close(); + final long spillSize = freeMemory(); --- End diff -- Actually, one comment: should this be outside of the `if` condition? I'm not sure what happens if you call `initializeForWriting()` in a case where you haven't already called `freeMemory()`. 
[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7952#issuecomment-127889358 [Test build #39824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39824/console) for PR 7952 at commit [`8d08090`](https://github.com/apache/spark/commit/8d0809014b76006208b214abc75969a112d21596). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class IsotonicRegression(override val uid: String) extends Estimator[IsotonicRegressionModel]`
[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/7929#discussion_r36272840 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala --- @@ -62,6 +64,52 @@ private[hive] class ClientWrapper( extends ClientInterface with Logging { + overrideHadoopShims() + + // !! HACK ALERT !! + // + // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2 + // scripts. We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop + // version in the future. + // + // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking + // version information gathered from Hadoop jar files. If the major version number is 1, + // `Hadoop20SShims` will be loaded. Otherwise, if the major version number is 2, `Hadoop23Shims` + // will be chosen. + // + // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical + // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with --- End diff -- I still think this comment doesn't make sense. My "more Hadoop 1-like" comment refers to the MapReduce part, which is not relevant here. `2.0.0-mr1-cdh4.1.1` is correctly a 2.0.x Hadoop build. The next line has a typo one way or the other. Right now, the logic is: if Hadoop version = 1.x, then use Hadoop 2.0 shims. Else use the Hadoop 2.3 shims. That's the problem. The desired logic seems to be: if Hadoop version <= 2.0.x, use Hadoop 2.0 shims. Else use the Hadoop 2.3 shims. That's much better, even if the "2.3 shims" name isn't the most accurate. Why is the logic not "Hadoop version <= 2.0.x"? Why is this suggesting CDH4 is a special case -- let alone mr1? Right now this is still not going to work for other Hadoop 2.0.x distributions.
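The rule being debated above — route every Hadoop version up to and including 2.0.x to the older shims, and everything newer to `Hadoop23Shims` — can be sketched as a standalone helper. This is illustrative only, not Hive's actual `ShimLoader` code; the class and method names are hypothetical.

```java
// Hypothetical sketch of the proposed shim-selection rule: versions <= 2.0.x
// (including vendor builds such as "2.0.0-mr1-cdh4.1.1") map to the Hadoop 2.0
// shims, and anything newer maps to the Hadoop 2.3 shims.
public class ShimSelector {
    public static String shimsFor(String hadoopVersion) {
        // Parse the leading "major.minor" pair; suffixes like "-mr1-cdh4.1.1"
        // fall after the first two components and are ignored.
        String[] parts = hadoopVersion.split("[.-]");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return (major < 2 || (major == 2 && minor == 0)) ? "Hadoop20SShims" : "Hadoop23Shims";
    }
}
```

Under this rule `2.0.0-mr1-cdh4.1.1` selects the Hadoop 2.0 shims without being special-cased, which is the generalization asked about here.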
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/7948#issuecomment-127889224 Changes look good overall; just one minor comment RE: a typo in a variable name, plus a comment on tests.
[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/7929#issuecomment-127889147 LGTM - feel free to merge, as it is really taking a toll on our tests right now.
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/7948#discussion_r36272769 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala --- @@ -231,4 +231,109 @@ class UnsafeFixedWidthAggregationMapSuite extends SparkFunSuite with Matchers { map.free() } + + testWithMemoryLeakDetection("test external sorting with an empty map") { +// Calling this make sure we have block manager and everything else setup. +TestSQLContext + +val map = new UnsafeFixedWidthAggregationMap( + emptyAggregationBuffer, + aggBufferSchema, + groupKeySchema, + taskMemoryManager, + shuffleMemoryManager, + 128, // initial capacity + PAGE_SIZE_BYTES, + false // disable perf metrics +) + +// Convert the map into a sorter +val sorter = map.destructAndCreateExternalSorter() + +// Add more keys to the sorter and make sure the results come out sorted. +val additionalKeys = randomStrings(1024) +val keyConverter = UnsafeProjection.create(groupKeySchema) +val valueConverter = UnsafeProjection.create(aggBufferSchema) + +additionalKeys.zipWithIndex.foreach { case (str, i) => + val k = InternalRow(UTF8String.fromString(str)) + val v = InternalRow(str.length) + sorter.insertKV(keyConverter.apply(k), valueConverter.apply(v)) + + if ((i % 100) == 0) { +shuffleMemoryManager.markAsOutOfMemory() +sorter.closeCurrentPage() + } +} + +val out = new scala.collection.mutable.ArrayBuffer[String] +val iter = sorter.sortedIterator() +while (iter.next()) { + // At here, we also test if copy is correct. + val key = iter.getKey.copy() + val value = iter.getValue.copy() + assert(key.getString(0).length === value.getInt(0)) + out += key.getString(0) +} + +assert(out === (additionalKeys).sorted) + +map.free() + } + + testWithMemoryLeakDetection("test external sorting with empty records") { +// Calling this make sure we have block manager and everything else setup. 
+TestSQLContext + +// Memory consumption in the beginning of the task. +val initialMemoryConsumption = shuffleMemoryManager.getMemoryConsumptionForThisTask() + +val map = new UnsafeFixedWidthAggregationMap( + emptyAggregationBuffer, + StructType(Nil), + StructType(Nil), + taskMemoryManager, + shuffleMemoryManager, + 128, // initial capacity + PAGE_SIZE_BYTES, + false // disable perf metrics +) + +(1 to 10).foreach { i => + val buf = map.getAggregationBuffer(InternalRow(0)) + assert(buf != null) +} + +// Convert the map into a sorter +val sorter = map.destructAndCreateExternalSorter() + +withClue(s"destructAndCreateExternalSorter should release memory used by the map") { + // 4096 * 16 is the initial size allocated for the pointer/prefix array in the in-mem sorter. + assert(shuffleMemoryManager.getMemoryConsumptionForThisTask() === +initialMemoryConsumption + 4096 * 16) +} + +// Add more keys to the sorter and make sure the results come out sorted. +(1 to 4096).foreach { i => + sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), UnsafeRow.createFromByteArray(0, 0)) + + if ((i % 100) == 0) { +shuffleMemoryManager.markAsOutOfMemory() +sorter.closeCurrentPage() + } +} + +var count = 0 +val iter = sorter.sortedIterator() +while (iter.next()) { + // At here, we also test if copy is correct. --- End diff -- Is this necessary for this test?
[GitHub] spark pull request: [SPARK-8978][Streaming] Implements the DirectK...
Github user nraychaudhuri commented on a diff in the pull request: https://github.com/apache/spark/pull/7796#discussion_r36272732 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -381,3 +447,20 @@ object DirectKafkaStreamSuite { } } } + +private[streaming] class ConstantEstimator(rates: Double*) extends RateEstimator { --- End diff -- I don't have enough permissions to change the PR title.
[GitHub] spark pull request: [SPARK-8978][Streaming] Implements the DirectK...
Github user nraychaudhuri commented on a diff in the pull request: https://github.com/apache/spark/pull/7796#discussion_r36272684 --- Diff: external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala --- @@ -381,3 +447,20 @@ object DirectKafkaStreamSuite { } } } + +private[streaming] class ConstantEstimator(rates: Double*) extends RateEstimator { --- End diff -- @tdas I tried to reuse that, but it is in a different project. Are the test files from the streaming project shared with the external projects?
[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7904#issuecomment-127888915 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/7948#discussion_r36272567 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala --- @@ -231,4 +231,109 @@ class UnsafeFixedWidthAggregationMapSuite extends SparkFunSuite with Matchers { map.free() } + + testWithMemoryLeakDetection("test external sorting with an empty map") { +// Calling this make sure we have block manager and everything else setup. +TestSQLContext + +val map = new UnsafeFixedWidthAggregationMap( + emptyAggregationBuffer, + aggBufferSchema, + groupKeySchema, + taskMemoryManager, + shuffleMemoryManager, + 128, // initial capacity + PAGE_SIZE_BYTES, + false // disable perf metrics +) + +// Convert the map into a sorter +val sorter = map.destructAndCreateExternalSorter() + +// Add more keys to the sorter and make sure the results come out sorted. +val additionalKeys = randomStrings(1024) +val keyConverter = UnsafeProjection.create(groupKeySchema) +val valueConverter = UnsafeProjection.create(aggBufferSchema) + +additionalKeys.zipWithIndex.foreach { case (str, i) => + val k = InternalRow(UTF8String.fromString(str)) + val v = InternalRow(str.length) + sorter.insertKV(keyConverter.apply(k), valueConverter.apply(v)) + + if ((i % 100) == 0) { +shuffleMemoryManager.markAsOutOfMemory() +sorter.closeCurrentPage() + } +} + +val out = new scala.collection.mutable.ArrayBuffer[String] +val iter = sorter.sortedIterator() +while (iter.next()) { + // At here, we also test if copy is correct. + val key = iter.getKey.copy() + val value = iter.getValue.copy() + assert(key.getString(0).length === value.getInt(0)) + out += key.getString(0) +} + +assert(out === (additionalKeys).sorted) + +map.free() + } + + testWithMemoryLeakDetection("test external sorting with empty records") { +// Calling this make sure we have block manager and everything else setup. 
+TestSQLContext + +// Memory consumption in the beginning of the task. +val initialMemoryConsumption = shuffleMemoryManager.getMemoryConsumptionForThisTask() + +val map = new UnsafeFixedWidthAggregationMap( + emptyAggregationBuffer, + StructType(Nil), + StructType(Nil), + taskMemoryManager, + shuffleMemoryManager, + 128, // initial capacity + PAGE_SIZE_BYTES, + false // disable perf metrics +) + +(1 to 10).foreach { i => + val buf = map.getAggregationBuffer(InternalRow(0)) + assert(buf != null) +} + +// Convert the map into a sorter +val sorter = map.destructAndCreateExternalSorter() + +withClue(s"destructAndCreateExternalSorter should release memory used by the map") { + // 4096 * 16 is the initial size allocated for the pointer/prefix array in the in-mem sorter. + assert(shuffleMemoryManager.getMemoryConsumptionForThisTask() === +initialMemoryConsumption + 4096 * 16) +} + +// Add more keys to the sorter and make sure the results come out sorted. +(1 to 4096).foreach { i => + sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), UnsafeRow.createFromByteArray(0, 0)) + + if ((i % 100) == 0) { +shuffleMemoryManager.markAsOutOfMemory() +sorter.closeCurrentPage() + } +} + +var count = 0 +val iter = sorter.sortedIterator() +while (iter.next()) { + // At here, we also test if copy is correct. + iter.getKey.copy() + iter.getValue.copy() + count += 1; +} + +assert(count === 4097) --- End diff -- To clarify: maybe add a comment saying that one row comes from the map, plus the rest added directly to the KV sorter after creating it.
[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6817#issuecomment-127888097 [Test build #39827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39827/consoleFull) for PR 6817 at commit [`4b2dd75`](https://github.com/apache/spark/commit/4b2dd75abc3469fd6abc13e388c6fb9b2060b962).
[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...
Github user XuTingjun commented on the pull request: https://github.com/apache/spark/pull/6817#issuecomment-127886475 @squito, I have updated the test, thank you very much.
[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6817#issuecomment-127886729 Merged build started.
[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7774#issuecomment-127886593 Merged build triggered.
[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7904#issuecomment-127887018 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6817#issuecomment-127886604 Merged build triggered.
[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7774#issuecomment-127886731 Merged build started.
[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7774#issuecomment-127887044 [Test build #39828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39828/consoleFull) for PR 7774 at commit [`5a2bc99`](https://github.com/apache/spark/commit/5a2bc9937bc26e014842b720fd2096294c9272b7).
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/7948#discussion_r36272385 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java --- @@ -82,8 +82,15 @@ public UnsafeKVExternalSorter(StructType keySchema, StructType valueSchema, pageSizeBytes); } else { // Insert the records into the in-memory sorter. + // We will use the number of elements in the map as the initialSize of the + // UnsafeInMemorySorter. Because UnsafeInMemorySorter does not accept 0 as the initialSize, + // we will use 1 as its initial size if the map is empty. + int initialSoeterSize = map.numElements(); --- End diff -- Typo in this variable name.
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/7948#discussion_r36272417 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java --- @@ -82,8 +82,15 @@ public UnsafeKVExternalSorter(StructType keySchema, StructType valueSchema, pageSizeBytes); } else { // Insert the records into the in-memory sorter. + // We will use the number of elements in the map as the initialSize of the + // UnsafeInMemorySorter. Because UnsafeInMemorySorter does not accept 0 as the initialSize, + // we will use 1 as its initial size if the map is empty. + int initialSoeterSize = map.numElements(); + if (initialSoeterSize == 0) { +initialSoeterSize = 1; + } final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter( -taskMemoryManager, recordComparator, prefixComparator, map.numElements()); +taskMemoryManager, recordComparator, prefixComparator, initialSoeterSize); --- End diff -- Could also do `Math.max(1, map.numElements)` if you want a one-liner.
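The suggested one-liner is simply a clamp on the sorter's initial capacity. A minimal illustration (hypothetical helper name; the real code would inline this at the sorter's construction site):

```java
// Illustrative: UnsafeInMemorySorter rejects an initialSize of 0, so clamp the
// map's element count to at least 1 before using it as the initial capacity.
public class SorterSizing {
    public static int initialSorterSize(int numElements) {
        return Math.max(1, numElements);
    }
}
```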
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/7948#discussion_r36272364 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillMerger.java --- @@ -47,11 +47,19 @@ public int compare(UnsafeSorterIterator left, UnsafeSorterIterator right) { priorityQueue = new PriorityQueue(numSpills, comparator); } - public void addSpill(UnsafeSorterIterator spillReader) throws IOException { + /** + * Add an UnsafeSorterIterator to this merger + */ + public void addSpillIfNotEmpty(UnsafeSorterIterator spillReader) throws IOException { if (spillReader.hasNext()) { + // We only add the spillReader to the priorityQueue if it is not empty. We do this to --- End diff -- Yep, makes sense. Putting empty spill readers into the queue violates an invariant that's maintained by the `loadNext()` loop: if a spill reader is in the priority queue, then `getBaseObject()`, `getBaseOffset()`, etc. point to a row that has not been returned yet. We covered the maintenance of that invariant but didn't establish it properly when there were empty spills. This change fixes that, though.
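The invariant described here — every iterator sitting in the priority queue currently points at a loaded, not-yet-returned element — is exactly why empty spills must be filtered out before insertion. A minimal, self-contained k-way merge over plain `Integer` lists showing the same guard (illustrative only, not the actual `UnsafeSorterSpillMerger`):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Minimal k-way merge over sorted Integer lists. As with addSpillIfNotEmpty(),
// an iterator is only admitted to the queue after a successful first advance,
// so the queue never holds an iterator with no current element.
public class KWayMerge {
    // Pairs an iterator with its current, already-loaded element.
    private static class Head {
        final int current;
        final Iterator<Integer> rest;
        Head(int current, Iterator<Integer> rest) { this.current = current; this.rest = rest; }
    }

    public static List<Integer> merge(List<List<Integer>> sortedInputs) {
        PriorityQueue<Head> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.current, b.current));
        for (List<Integer> input : sortedInputs) {
            Iterator<Integer> it = input.iterator();
            if (it.hasNext()) {                 // the "if not empty" guard
                queue.add(new Head(it.next(), it));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Head head = queue.poll();
            out.add(head.current);
            if (head.rest.hasNext()) {          // re-insert only while elements remain
                queue.add(new Head(head.rest.next(), head.rest));
            }
        }
        return out;
    }
}
```

Without the guard, an empty input would sit in the queue with no current element, and popping it would surface garbage — the failure mode the review comment describes.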
[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7774#issuecomment-127884863 [Test build #1349 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1349/consoleFull) for PR 7774 at commit [`57d4cd2`](https://github.com/apache/spark/commit/57d4cd2edc349bf027ffca5b2e819e7479c3be62).
[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/7948#discussion_r36272160 --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java --- @@ -191,24 +191,28 @@ public void spill() throws IOException { spillWriters.size(), spillWriters.size() > 1 ? " times" : " time"); --- End diff -- Not sure whether we should move this log statement inside the `if` block or not. I suppose it might be useful to know when memory pressure triggered a spill even if we didn't end up writing rows, so it's probably fine to leave this where it is.
[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....
Github user davies commented on the pull request: https://github.com/apache/spark/pull/7925#issuecomment-127884616 Merged into master and 1.5 branch.
[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7925
[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7904#issuecomment-127884531 Merged build triggered.
[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7904#issuecomment-127884542 Merged build started.
[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7925#issuecomment-127884345 [Test build #1344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1344/console) for PR 7925 at commit [`e19701a`](https://github.com/apache/spark/commit/e19701a59bbbc6a709cb3b3a6ff24c141ad2f425). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-9486][SQL][WIP] Add data source aliasin...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7802#issuecomment-127883935 Actually I think the current API breaks binary compatibility for data sources, so we can't merge it as is. In Java (or Scala binary compatibility), RelationProvider now has an extra interface that has no default implementation. We need to find a workaround to provide this information.
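The concern above is a general JVM binary-compatibility rule: adding an abstract member to a published trait breaks third-party implementations compiled against the older release. A hedged sketch of the problem and one possible workaround, with hypothetical names (this is not Spark's actual data source API):

```scala
// Hypothetical: external data sources were compiled against this trait
// in an earlier release.
trait RelationProvider {
  def createRelation(path: String): String
}

// Adding an abstract `shortName()` to RelationProvider itself would break
// those old binaries at link time. One workaround is a separate, optional
// trait with a concrete default, so existing implementations keep linking.
trait DataSourceRegister {
  def shortName(): String = "unregistered" // concrete default preserves old impls
}

class OldJsonSource extends RelationProvider {
  def createRelation(path: String): String = s"json:$path"
}

object BinaryCompatSketch {
  def main(args: Array[String]): Unit = {
    // Old sources still work; new ones can opt in to the register trait.
    val src = new OldJsonSource with DataSourceRegister
    assert(src.shortName() == "unregistered")
    assert(src.createRelation("/tmp/x") == "json:/tmp/x")
    println("ok")
  }
}
```

A separate opt-in trait lets the framework probe for the extra information with a type check, without forcing every existing implementation to recompile.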
[GitHub] spark pull request: [SPARK-9486][SQL][WIP] Add data source aliasin...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7802#issuecomment-127883375 @JDrit what's still WIP about this patch?
[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7904#issuecomment-127883122 Merged build triggered.
[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7904#issuecomment-127883162 Merged build started.
[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/7925#issuecomment-127882225 LGTM (not super familiar with decimals though)
[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...
Github user GraceH commented on a diff in the pull request: https://github.com/apache/spark/pull/7927#discussion_r36271550 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -592,8 +592,14 @@ private[spark] class BlockManager( val locations = Random.shuffle(master.getLocations(blockId)) for (loc <- locations) { logDebug(s"Getting remote block $blockId from $loc") - val data = blockTransferService.fetchBlockSync( -loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() + val data = try { +blockTransferService.fetchBlockSync( + loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer() + } catch { +case e: Throwable => + logWarning(s"Exception during getting remote block $blockId from $loc", e) --- End diff -- @squito So I agree we should behave like ```askWithRetry```: if we can fetch the block from any remote store, the call succeeds. We should not abandon the working path on the first exception. So we may need to catch all kinds of exceptions (not only IOException). If some attempts fail, we should log the exception information but continue fetching. When the final location still throws, we should raise a NEW exception indicating that all attempts failed (i.e., no available location), and perhaps attach the last exception's information to this NEW exception. If we only catch IOException, other exception types from certain locations would still break the whole workflow (fetching from the remaining locations where possible). What do you think?
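The retry semantics proposed above can be sketched as follows (hypothetical helper names, not the actual `BlockManager` code): swallow and log per-location failures, return on the first success, and raise a new exception carrying the last cause only once every location has failed.

```scala
// Hypothetical sketch of "first success wins, fail only when all locations fail".
object FetchSketch {
  def fetchBlock(locations: Seq[String], fetch: String => Array[Byte]): Array[Byte] = {
    var lastFailure: Throwable = null
    for (loc <- locations) {
      try {
        return fetch(loc) // first successful location wins
      } catch {
        case e: Throwable =>
          // Log and keep trying the remaining locations instead of aborting.
          lastFailure = e
      }
    }
    // All attempts failed: surface a NEW exception with the last cause attached.
    throw new java.io.IOException(
      s"Failed to fetch block from all ${locations.size} locations", lastFailure)
  }

  def main(args: Array[String]): Unit = {
    val data = fetchBlock(Seq("bad1", "good"), loc =>
      if (loc == "good") Array[Byte](1, 2)
      else throw new RuntimeException(s"$loc is down"))
    assert(data.length == 2) // succeeded despite the first location failing
    println("ok")
  }
}
```

Catching `Throwable` per location (rather than only `IOException`) is exactly the point of the comment: any failure type at one location should not prevent trying the rest.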
[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7774#issuecomment-127880431 [Test build #1348 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1348/consoleFull) for PR 7774 at commit [`57d4cd2`](https://github.com/apache/spark/commit/57d4cd2edc349bf027ffca5b2e819e7479c3be62).
[GitHub] spark pull request: [SPARK-9581][SQL] Add unit test for JSON UDT
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7917#issuecomment-127879419 [Test build #1347 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1347/consoleFull) for PR 7917 at commit [`93e3954`](https://github.com/apache/spark/commit/93e395486a326ec360923f2fe7de762b42a36252).
[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7952#issuecomment-127878981 [Test build #39824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39824/consoleFull) for PR 7952 at commit [`8d08090`](https://github.com/apache/spark/commit/8d0809014b76006208b214abc75969a112d21596).
[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7952#issuecomment-127873996 Merged build started.
[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7952#issuecomment-127873887 Merged build triggered.
[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7952#issuecomment-127873523 test this please
[GitHub] spark pull request: [SPARK-9217][STREAMING] Make the kinesis recei...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/7825#issuecomment-127872074 LGTM
[GitHub] spark pull request: [SPARK-8266][SQL]add function translate
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7709#issuecomment-127866513 [Test build #1346 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1346/consoleFull) for PR 7709 at commit [`b4c47bf`](https://github.com/apache/spark/commit/b4c47bf9e224beb9cf020fb794c6ba741b0fc2a7).
[GitHub] spark pull request: [SPARK-9065][Streaming][PySpark] Add MessageHa...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7410#issuecomment-127865531 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains
Github user davies commented on the pull request: https://github.com/apache/spark/pull/7949#issuecomment-127865432 Merged into master and 1.5 branch.
[GitHub] spark pull request: [SPARK-9065][Streaming][PySpark] Add MessageHa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7410#issuecomment-127865211 [Test build #39815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39815/console) for PR 7410 at commit [`f375e16`](https://github.com/apache/spark/commit/f375e16640c1670ec907711bf63d2e70e5a19f6c). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class PythonMessageAndMetadata(` * ` class PythonMessageAndMetadataPickler extends IObjectPickler ` * `class KafkaMessageAndMetadata(object):`
[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7580
[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7949
[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7952#issuecomment-127864368 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-8266][SQL]add function translate
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7709#issuecomment-127864330 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...
Github user feynmanliang commented on a diff in the pull request: https://github.com/apache/spark/pull/7937#discussion_r36270670 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -139,200 +202,308 @@ class PrefixSpan private ( run(data.rdd.map(_.asScala.map(_.asScala.toArray).toArray)) } +} + +@Experimental +object PrefixSpan extends Logging { + /** - * Find the complete set of sequential patterns in the input sequences. This method utilizes - * the internal representation of itemsets as Array[Int] where each itemset is represented by - * a contiguous sequence of non-negative integers and delimiters represented by [[DELIMITER]]. - * @param data ordered sequences of itemsets. Items are represented by non-negative integers. - * Each itemset has one or more items and is delimited by [[DELIMITER]]. - * @return a set of sequential pattern pairs, - * the key of pair is pattern (a list of elements), - * the value of pair is the pattern's count. + * Find the complete set of frequent sequential patterns in the input sequences. + * @param data ordered sequences of itemsets. We represent a sequence internally as Array[Int], + * where each itemset is represented by a contiguous sequence of distinct and ordered + * positive integers. We use 0 as the delimiter at itemset boundaries, including the + * first and the last position. 
+ * @return an RDD of (frequent sequential pattern, count) pairs, + * @see [[Postfix]] */ - private[fpm] def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = { + private[fpm] def genFreqPatterns( + data: RDD[Array[Int]], + minCount: Long, + maxPatternLength: Int, + maxLocalProjDBSize: Long): RDD[(Array[Int], Long)] = { val sc = data.sparkContext if (data.getStorageLevel == StorageLevel.NONE) { logWarning("Input data is not cached.") } -// Use List[Set[Item]] for internal computation -val sequences = data.map { seq => splitSequence(seq.toList) } - -// Convert min support to a min number of transactions for this dataset -val minCount = if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong - -// (Frequent items -> number of occurrences, all items here satisfy the `minSupport` threshold -val freqItemCounts = sequences - .flatMap(seq => seq.flatMap(nonemptySubsets(_)).distinct.map(item => (item, 1L))) - .reduceByKey(_ + _) - .filter { case (item, count) => (count >= minCount) } - .collect() - .toMap - -// Pairs of (length 1 prefix, suffix consisting of frequent items) -val itemSuffixPairs = { - val freqItemSets = freqItemCounts.keys.toSet - val freqItems = freqItemSets.flatten - sequences.flatMap { seq => -val filteredSeq = seq.map(item => freqItems.intersect(item)).filter(_.nonEmpty) -freqItemSets.flatMap { item => - val candidateSuffix = LocalPrefixSpan.getSuffix(item, filteredSeq) - candidateSuffix match { -case suffix if !suffix.isEmpty => Some((List(item), suffix)) -case _ => None +val postfixes = data.map(items => new Postfix(items)) + +// Local frequent patterns (prefixes) and their counts. +val localFreqPatterns = mutable.ArrayBuffer.empty[(Array[Int], Long)] +// Prefixes whose projected databases are small. +val smallPrefixes = mutable.Map.empty[Int, Prefix] +val emptyPrefix = Prefix.empty +// Prefixes whose projected databases are large. 
+var largePrefixes = mutable.Map(emptyPrefix.id -> emptyPrefix) +while (largePrefixes.nonEmpty) { + val numLocalFreqPatterns = localFreqPatterns.length + logInfo(s"number of local frequent patterns: $numLocalFreqPatterns") + if (localFreqPatterns.length > 100) { +logWarning( + s""" + | Collected $numLocalFreqPatterns local frequent patterns. You may want to consider: + | 1. increase minSupport, + | 2. decrease maxPatternLength, + | 3. increase maxLocalProjDBSize. + """.stripMargin) + } + logInfo(s"number of small prefixes: ${smallPrefixes.size}") + logInfo(s"number of large prefixes: ${largePrefixes.size}") + val largePrefixArray = largePrefixes.values.toArray + val freqPrefixes = postfixes.flatMap { postfix => --- End diff -- OK, just FYI `for` and `map/flatMap` are equivalent (http://docs.scala-lang.org/tutorials/FAQ/yield.html)
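The equivalence referenced above can be seen directly: the Scala compiler desugars a for-comprehension with `yield` into nested `flatMap`/`map` calls, so the two forms below produce identical results.

```scala
// for-comprehensions are syntactic sugar over flatMap/map (and withFilter).
object ForDesugaring {
  def main(args: Array[String]): Unit = {
    val xs = Seq(1, 2)
    val ys = Seq(10, 20)

    // Sugar form:
    val viaFor = for { x <- xs; y <- ys } yield x * y
    // What the compiler rewrites it into:
    val viaFlatMap = xs.flatMap(x => ys.map(y => x * y))

    assert(viaFor == viaFlatMap)
    println(viaFor) // prints List(10, 20, 20, 40)
  }
}
```

The choice between the two is therefore stylistic for correctness purposes, which is why the review settled on whichever form the authors found clearer.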
[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7937
[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/7594
[GitHub] spark pull request: [SPARK-9360][SQL] Support BinaryType in Prefix...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/7676#discussion_r36270593 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala --- @@ -76,6 +78,7 @@ case class SortPrefix(child: SortOrder) extends UnaryExpression { (DoublePrefixComparator.computePrefix(Double.NegativeInfinity), s"$DoublePrefixCmp.computePrefix((double)$input)") case StringType => (0L, s"$input.getPrefix()") + case BinaryType => (0L, s"$BinaryPrefixCmp.computePrefix((byte[])$input)") --- End diff -- I think we don't need the cast here.
[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/7937#issuecomment-127863143 Merged into master and branch-1.5. Thanks @feynmanliang and @zhangjiajin for review!
[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7929#issuecomment-127862858 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/7937#discussion_r36270520 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala --- @@ -139,200 +202,308 @@ class PrefixSpan private ( run(data.rdd.map(_.asScala.map(_.asScala.toArray).toArray)) } +} + +@Experimental +object PrefixSpan extends Logging { + /** - * Find the complete set of sequential patterns in the input sequences. This method utilizes - * the internal representation of itemsets as Array[Int] where each itemset is represented by - * a contiguous sequence of non-negative integers and delimiters represented by [[DELIMITER]]. - * @param data ordered sequences of itemsets. Items are represented by non-negative integers. - * Each itemset has one or more items and is delimited by [[DELIMITER]]. - * @return a set of sequential pattern pairs, - * the key of pair is pattern (a list of elements), - * the value of pair is the pattern's count. + * Find the complete set of frequent sequential patterns in the input sequences. + * @param data ordered sequences of itemsets. We represent a sequence internally as Array[Int], + * where each itemset is represented by a contiguous sequence of distinct and ordered + * positive integers. We use 0 as the delimiter at itemset boundaries, including the + * first and the last position. 
+ * @return an RDD of (frequent sequential pattern, count) pairs, + * @see [[Postfix]] */ - private[fpm] def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = { + private[fpm] def genFreqPatterns( + data: RDD[Array[Int]], + minCount: Long, + maxPatternLength: Int, + maxLocalProjDBSize: Long): RDD[(Array[Int], Long)] = { val sc = data.sparkContext if (data.getStorageLevel == StorageLevel.NONE) { logWarning("Input data is not cached.") } -// Use List[Set[Item]] for internal computation -val sequences = data.map { seq => splitSequence(seq.toList) } - -// Convert min support to a min number of transactions for this dataset -val minCount = if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong - -// (Frequent items -> number of occurrences, all items here satisfy the `minSupport` threshold -val freqItemCounts = sequences - .flatMap(seq => seq.flatMap(nonemptySubsets(_)).distinct.map(item => (item, 1L))) - .reduceByKey(_ + _) - .filter { case (item, count) => (count >= minCount) } - .collect() - .toMap - -// Pairs of (length 1 prefix, suffix consisting of frequent items) -val itemSuffixPairs = { - val freqItemSets = freqItemCounts.keys.toSet - val freqItems = freqItemSets.flatten - sequences.flatMap { seq => -val filteredSeq = seq.map(item => freqItems.intersect(item)).filter(_.nonEmpty) -freqItemSets.flatMap { item => - val candidateSuffix = LocalPrefixSpan.getSuffix(item, filteredSeq) - candidateSuffix match { -case suffix if !suffix.isEmpty => Some((List(item), suffix)) -case _ => None +val postfixes = data.map(items => new Postfix(items)) + +// Local frequent patterns (prefixes) and their counts. +val localFreqPatterns = mutable.ArrayBuffer.empty[(Array[Int], Long)] +// Prefixes whose projected databases are small. +val smallPrefixes = mutable.Map.empty[Int, Prefix] +val emptyPrefix = Prefix.empty +// Prefixes whose projected databases are large. 
+var largePrefixes = mutable.Map(emptyPrefix.id -> emptyPrefix) +while (largePrefixes.nonEmpty) { + val numLocalFreqPatterns = localFreqPatterns.length + logInfo(s"number of local frequent patterns: $numLocalFreqPatterns") + if (localFreqPatterns.length > 100) { +logWarning( + s""" + | Collected $numLocalFreqPatterns local frequent patterns. You may want to consider: + | 1. increase minSupport, + | 2. decrease maxPatternLength, + | 3. increase maxLocalProjDBSize. + """.stripMargin) + } + logInfo(s"number of small prefixes: ${smallPrefixes.size}") + logInfo(s"number of large prefixes: ${largePrefixes.size}") + val largePrefixArray = largePrefixes.values.toArray + val freqPrefixes = postfixes.flatMap { postfix => --- End diff -- There are several performance issues with `for` in Scala. I don't know whether `for` syntax is better here. I'm more comfortable with `flatMap`, which
[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7929#issuecomment-127862755 [Test build #39818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39818/console) for PR 7929 at commit [`c99b497`](https://github.com/apache/spark/commit/c99b497560d3103acb65076eac023a6bf36f96b5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.