[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/7929#discussion_r36274062
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala ---
@@ -62,6 +64,52 @@ private[hive] class ClientWrapper(
   extends ClientInterface
   with Logging {
 
+  overrideHadoopShims()
+
+  // !! HACK ALERT !!
+  //
+  // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2
+  // scripts.  We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop
+  // version in the future.
+  //
+  // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking
+  // version information gathered from Hadoop jar files.  If the major version number is 1,
+  // `Hadoop20SShims` will be loaded.  Otherwise, if the major version number is 2, `Hadoop23Shims`
+  // will be chosen.
+  //
+  // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical
+  // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with
--- End diff --

I'd also be okay with matching against all 2.0.x versions, if you prefer 
that, and updating the comment to say the same.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7918#issuecomment-127894609
  
[Test build #39835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39835/consoleFull) for PR 7918 at commit [`9fb1eb2`](https://github.com/apache/spark/commit/9fb1eb2dd9647fe0a3614ddc8fb7cd4e5075fc16).





[GitHub] spark pull request: [SPARK-6486] [MLlib] [Python] Add BlockMatrix ...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7761#issuecomment-127894409
  
[Test build #39834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39834/consoleFull) for PR 7761 at commit [`27195c2`](https://github.com/apache/spark/commit/27195c236b51d862039905522e317ebc6dc75d7d).





[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7918#issuecomment-127894096
  
Merged build started.





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127894190
  
[Test build #39833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39833/consoleFull) for PR 7893 at commit [`81ff97b`](https://github.com/apache/spark/commit/81ff97bcf3c6f368046a53376a3285354000972b).





[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7929#discussion_r36273851
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala ---
@@ -62,6 +64,52 @@ private[hive] class ClientWrapper(
   extends ClientInterface
   with Logging {
 
+  overrideHadoopShims()
+
+  // !! HACK ALERT !!
+  //
+  // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2
+  // scripts.  We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop
+  // version in the future.
+  //
+  // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking
+  // version information gathered from Hadoop jar files.  If the major version number is 1,
+  // `Hadoop20SShims` will be loaded.  Otherwise, if the major version number is 2, `Hadoop23Shims`
+  // will be chosen.
+  //
+  // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical
+  // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with
--- End diff --

My gut is that there's much more reason to believe other 2.0.x builds work 
the same way. The method in question here (as far as I understand) never 
appeared in any 2.0.x release. Occam's razor would suggest not special-casing 
here. I don't know that CDH4 is the only relevant 2.0.x release; upstream 
Apache Hadoop certainly made a number of 2.0.x releases that (again, as far 
as I understand) are affected as well and would be left out by this change.

At the least, let's get the comment updated. Also, `mr1` really isn't 
relevant. I would not special-case cdh4, since the comments will say it's not 
special.
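
For illustration only, matching on the whole 2.0.x line rather than on one exact CDH4 version string might look like the sketch below. It is not the code from the pull request: `ShimChoice` and `shimFor` are hypothetical names, and because the quoted code comment is cut off before it names the shim, sending 2.0.x builds to `Hadoop20SShims` is an assumption drawn from the "Hadoop-1-like" remark.

```scala
// Hypothetical helper for discussion; not part of Spark or Hive.
object ShimChoice {
  // Captures the major and minor components of a Hadoop version string,
  // e.g. "2.0.0-mr1-cdh4.1.1" yields major "2" and minor "0".
  private val MajorMinor = """(\d+)\.(\d+).*""".r

  def shimFor(hadoopVersion: String): String = hadoopVersion match {
    // Major version 1: the shim named in the quoted code comment.
    case MajorMinor("1", _) => "Hadoop20SShims"
    // Assumption: every 2.0.x build is treated as Hadoop-1-like,
    // not only 2.0.0-mr1-cdh4.1.1.
    case MajorMinor("2", "0") => "Hadoop20SShims"
    // Any other 2.x release: the shim named in the quoted code comment.
    case MajorMinor("2", _) => "Hadoop23Shims"
    case other =>
      throw new IllegalArgumentException(s"Unrecognized Hadoop version: $other")
  }
}
```

With this shape, the `mr1` and `cdh4` qualifiers never enter the decision, which is the point of not special-casing them.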





[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7918#issuecomment-127894040
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7924#issuecomment-127892586
  
[Test build #39830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39830/consoleFull) for PR 7924 at commit [`581e9e3`](https://github.com/apache/spark/commit/581e9e3f79e98dd4c5f52543a1eb635999bb6e60).





[GitHub] spark pull request: [SPARK-9065][Streaming][PySpark] Add MessageHa...

2015-08-04 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/7410#issuecomment-127892684
  
Hi @tdas, would you please help review this patch? Thanks a lot.





[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...

2015-08-04 Thread XuTingjun
Github user XuTingjun commented on the pull request:

https://github.com/apache/spark/pull/7918#issuecomment-127892711
  
Thanks all, I have added the documentation.





[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7918#issuecomment-127892342
  
Merged build started.





[GitHub] spark pull request: [SPARK-9585] add config to enable inputFormat ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7918#issuecomment-127892312
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127892357
  
Merged build started.





[GitHub] spark pull request: [SPARK-6486] [MLlib] [Python] Add BlockMatrix ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7761#issuecomment-127892341
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127892327
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7953#issuecomment-127892353
  
[Test build #39829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39829/consoleFull) for PR 7953 at commit [`3cac3cc`](https://github.com/apache/spark/commit/3cac3cc68d3c6113b036526b64cfdeab57d57588).





[GitHub] spark pull request: [SPARK-6486] [MLlib] [Python] Add BlockMatrix ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7761#issuecomment-127892387
  
Merged build started.





[GitHub] spark pull request: [SPARK-9607] [SPARK-9608] fix zinc-port handli...

2015-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/7944#issuecomment-127891984
  
Nah - not a big enough deal to create a new JIRA. Anyway, this LGTM. 
@JoshRosen feel free to merge.





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127892005
  
[Test build #225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/225/consoleFull) for PR 7893 at commit [`81ff97b`](https://github.com/apache/spark/commit/81ff97bcf3c6f368046a53376a3285354000972b).





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127891934
  
Merged build started.





[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7924#issuecomment-127892111
  
[Test build #224 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/224/consoleFull) for PR 7924 at commit [`581e9e3`](https://github.com/apache/spark/commit/581e9e3f79e98dd4c5f52543a1eb635999bb6e60).





[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7924#issuecomment-127891878
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127891747
  
[Test build #1350 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1350/consoleFull) for PR 7893 at commit [`81ff97b`](https://github.com/apache/spark/commit/81ff97bcf3c6f368046a53376a3285354000972b).





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127891887
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7924#issuecomment-127891935
  
Merged build started.





[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7833#issuecomment-127891028
  
[Test build #39831 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39831/consoleFull) for PR 7833 at commit [`9570bec`](https://github.com/apache/spark/commit/9570bec0d54537e51623b2b5777895c209dd706a).





[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...

2015-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7953#issuecomment-127890921
  
LGTM pending Jenkins passing.






[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7833#issuecomment-127890876
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7953#issuecomment-127890869
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7833#issuecomment-127890915
  
Merged build started.





[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7924#issuecomment-127890873
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...

2015-08-04 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/7924#issuecomment-127890833
  
retest this please.





[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7953#issuecomment-127890897
  
Merged build started.





[GitHub] spark pull request: [SPARK-9548][SQL] Add a destructive iterator f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7924#issuecomment-127890916
  
Merged build started.





[GitHub] spark pull request: [SPARK-9403][SQL] Add codegen support in In an...

2015-08-04 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/7893#issuecomment-127890943
  
retest this please.





[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7833#issuecomment-127890739
  
[Test build #223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/223/consoleFull) for PR 7833 at commit [`9570bec`](https://github.com/apache/spark/commit/9570bec0d54537e51623b2b5777895c209dd706a).





[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/7929#issuecomment-127890665
  
Do feel free to get the comment wording hashed out with @srowen. It's 
approaching bedtime in my time zone, so I have to sign off. It would be nice 
to get something of this nature in soon because of the test issues.





[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-08-04 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/7833#issuecomment-127890431
  
retest this please





[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7833#issuecomment-127890533
  
Merged build started.





[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7833#issuecomment-127890489
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/7929#discussion_r36273191
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala ---
@@ -62,6 +64,52 @@ private[hive] class ClientWrapper(
   extends ClientInterface
   with Logging {
 
+  overrideHadoopShims()
+
+  // !! HACK ALERT !!
+  //
+  // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2
+  // scripts.  We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop
+  // version in the future.
+  //
+  // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking
+  // version information gathered from Hadoop jar files.  If the major version number is 1,
+  // `Hadoop20SShims` will be loaded.  Otherwise, if the major version number is 2, `Hadoop23Shims`
+  // will be chosen.
+  //
+  // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical
+  // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with
--- End diff --

Yeah, I agree the comment is slightly wrong. I think CDH4 named the release 
with "mr1" because they took the upstream 2.0.x release but then packaged it 
with the older (pre-YARN) version of MR. So this comment could be improved or 
just made shorter.

In terms of covering other Hadoop 2.0.x distributions: as far as I know, no 
one other than Cloudera ever really distributed this. I am pretty hesitant to 
make any assumptions about what other Hadoop 2.0.x distributions might contain, 
because that was in general not a time of API stability for Hadoop and there 
was generally variance around APIs. So my feeling was to just cover the one 
case we do distribute binary builds for (the cdh4 distribution).

My main feeling was that we should make this work for the cdh4 version that we 
do provide binary builds for, but not go crazy trying to hypothesize about 
other one-off Hadoop versions that were packaged around that time, if any exist.

I do agree though the comment could be made more succinct and accurate.





[GitHub] spark pull request: [SPARK-9628][SQL]Rename int to SQLDate, long t...

2015-08-04 Thread yjshen
GitHub user yjshen opened a pull request:

https://github.com/apache/spark/pull/7953

[SPARK-9628][SQL]Rename int to SQLDate, long to SQLTimestamp for better 
readability

JIRA: https://issues.apache.org/jira/browse/SPARK-9628

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yjshen/spark datetime_alias

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/7953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #7953


commit 3cac3cc68d3c6113b036526b64cfdeab57d57588
Author: Yijie Shen 
Date:   2015-08-05T06:35:04Z

rename int to SQLDate, long to SQLTimestamp for better readability







[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7952#issuecomment-127889452
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7948#discussion_r36272873
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java ---
@@ -191,24 +191,28 @@ public void spill() throws IOException {
   spillWriters.size(),
   spillWriters.size() > 1 ? " times" : " time");
 
-final UnsafeSorterSpillWriter spillWriter =
-  new UnsafeSorterSpillWriter(blockManager, fileBufferSizeBytes, writeMetrics,
-inMemSorter.numRecords());
-spillWriters.add(spillWriter);
-final UnsafeSorterIterator sortedRecords = inMemSorter.getSortedIterator();
-while (sortedRecords.hasNext()) {
-  sortedRecords.loadNext();
-  final Object baseObject = sortedRecords.getBaseObject();
-  final long baseOffset = sortedRecords.getBaseOffset();
-  final int recordLength = sortedRecords.getRecordLength();
-  spillWriter.write(baseObject, baseOffset, recordLength, sortedRecords.getKeyPrefix());
+// We only write out contents of the inMemSorter if it is not empty.
+if (inMemSorter.numRecords() > 0) {
+  final UnsafeSorterSpillWriter spillWriter =
+new UnsafeSorterSpillWriter(blockManager, fileBufferSizeBytes, writeMetrics,
+  inMemSorter.numRecords());
+  spillWriters.add(spillWriter);
+  final UnsafeSorterIterator sortedRecords = inMemSorter.getSortedIterator();
+  while (sortedRecords.hasNext()) {
+sortedRecords.loadNext();
+final Object baseObject = sortedRecords.getBaseObject();
+final long baseOffset = sortedRecords.getBaseOffset();
+final int recordLength = sortedRecords.getRecordLength();
+spillWriter.write(baseObject, baseOffset, recordLength, sortedRecords.getKeyPrefix());
+  }
+  spillWriter.close();
+  final long spillSize = freeMemory();
--- End diff --

Actually, one comment: should this be outside of the `if` condition? I'm 
not sure what happens if you call `initializeForWriting()` in a case where you 
haven't already called `freeMemory()`.
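
The concern above can be illustrated with a toy model. This is only a sketch under a stated assumption: that re-initializing a write buffer without first freeing the previous one is an error. `SpillSketch` and all of its members are hypothetical stand-ins, not Spark's `UnsafeExternalSorter`, and the real behavior of `initializeForWriting()` is not claimed here. The sketch only shows the control flow with `freeMemory()` moved outside the `if`.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical class, not Spark's UnsafeExternalSorter.
final class SpillSketch {
    private final List<String> records = new ArrayList<>();
    private final List<List<String>> spills = new ArrayList<>();
    private boolean bufferAllocated = false;

    SpillSketch() {
        initializeForWriting();
    }

    // Assumption for this sketch: allocating a new buffer while the old one
    // is still held is treated as a programming error.
    private void initializeForWriting() {
        if (bufferAllocated) {
            throw new IllegalStateException("previous buffer was never freed");
        }
        bufferAllocated = true;
    }

    private void freeMemory() {
        records.clear();
        bufferAllocated = false;
    }

    void insert(String record) {
        records.add(record);
    }

    void spill() {
        // Only write a spill file when there is something to write.
        if (!records.isEmpty()) {
            spills.add(new ArrayList<>(records));
        }
        // Free and re-initialize unconditionally, so an empty spill
        // still leaves the sorter in a consistent state.
        freeMemory();
        initializeForWriting();
    }

    int spillCount() {
        return spills.size();
    }

    public static void main(String[] args) {
        SpillSketch sorter = new SpillSketch();
        sorter.spill();          // empty: no spill file, no exception
        sorter.insert("a");
        sorter.spill();          // one record: one spill file
        System.out.println(sorter.spillCount());
    }
}
```

If `freeMemory()` stayed inside the `if` in this model, the first (empty) `spill()` call would reach `initializeForWriting()` with the buffer still allocated and throw.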





[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7952#issuecomment-127889358
  
[Test build #39824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39824/console) for PR 7952 at commit [`8d08090`](https://github.com/apache/spark/commit/8d0809014b76006208b214abc75969a112d21596).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class IsotonicRegression(override val uid: String) extends Estimator[IsotonicRegressionModel]`






[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7929#discussion_r36272840
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala ---
@@ -62,6 +64,52 @@ private[hive] class ClientWrapper(
   extends ClientInterface
   with Logging {
 
+  overrideHadoopShims()
+
+  // !! HACK ALERT !!
+  //
+  // This method is a surgical fix for Hadoop version 2.0.0-mr1-cdh4.1.1, which is used by Spark EC2
+  // scripts.  We should remove this after upgrading Spark EC2 scripts to some more recent Hadoop
+  // version in the future.
+  //
+  // Internally, Hive `ShimLoader` tries to load different versions of Hadoop shims by checking
+  // version information gathered from Hadoop jar files.  If the major version number is 1,
+  // `Hadoop20SShims` will be loaded.  Otherwise, if the major version number is 2, `Hadoop23Shims`
+  // will be chosen.
+  //
+  // However, part of APIs in Hadoop 2.0.x and 2.1.x versions were in flux due to historical
+  // reasons. So 2.0.0-mr1-cdh4.1.1 is actually more Hadoop-1-like and should be used together with
--- End diff --

I still think this comment doesn't make sense. My "more Hadoop 1-like" 
comment refers to the MapReduce part, which is not relevant here. 
`2.0.0-mr1-cdh4.1.1` is correctly a 2.0.x Hadoop build. The next line has a 
typo one way or the other.

Right now, the logic is: if Hadoop version = 1.x, then use Hadoop 2.0 
shims. Else use the Hadoop 2.3 shims. That's the problem.

The desired logic seems to be: if Hadoop version <= 2.0.x, use Hadoop 2.0 
shims. Else use the Hadoop 2.3 shims. That's much better, even if the "2.3 
shims" name isn't the most accurate.

Why is the logic not "Hadoop version <= 2.0.x"? Why is this suggesting CDH4 
is a special case, let alone mr1? Right now this is still not going to work 
for other Hadoop 2.0.x distributions.
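
The "Hadoop version <= 2.0.x" rule described above can be sketched as follows. This is a minimal illustration in Java, not the code in this pull request: `ShimChoiceSketch` and `chooseShims` are hypothetical names, the two shim names are taken from the quoted code comment, and the parsing assumes a version string that starts with numeric `major.minor` components.

```java
// Hypothetical sketch; not Hive's ShimLoader and not the code in this PR.
final class ShimChoiceSketch {
    // Shim names as they appear in the quoted code comment.
    static final String HADOOP_20S_SHIMS = "Hadoop20SShims";
    static final String HADOOP_23_SHIMS = "Hadoop23Shims";

    // Chooses a shim name from a version string such as "2.0.0-mr1-cdh4.1.1".
    // Assumes the first two dot-separated components are plain integers.
    static String chooseShims(String hadoopVersion) {
        String[] parts = hadoopVersion.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unexpected Hadoop version: " + hadoopVersion);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        // "<= 2.0.x" means any 1.x release or any 2.0.x release.
        boolean atMost20x = major < 2 || (major == 2 && minor == 0);
        return atMost20x ? HADOOP_20S_SHIMS : HADOOP_23_SHIMS;
    }

    public static void main(String[] args) {
        System.out.println(chooseShims("1.2.1"));              // Hadoop20SShims
        System.out.println(chooseShims("2.0.0-mr1-cdh4.1.1")); // Hadoop20SShims
        System.out.println(chooseShims("2.6.0"));              // Hadoop23Shims
    }
}
```

Under this rule the CDH4 build needs no special case, since it is matched as an ordinary 2.0.x version.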





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/7948#issuecomment-127889224
  
Changes look good overall; just one minor comment RE: a typo in a variable 
name, plus a comment on tests.





[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/7929#issuecomment-127889147
  
LGTM - feel free to merge, as it is really taking a toll on our tests right 
now.





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7948#discussion_r36272769
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala
 ---
@@ -231,4 +231,109 @@ class UnsafeFixedWidthAggregationMapSuite extends 
SparkFunSuite with Matchers {
 
 map.free()
   }
+
+  testWithMemoryLeakDetection("test external sorting with an empty map") {
+// Calling this make sure we have block manager and everything else 
setup.
+TestSQLContext
+
+val map = new UnsafeFixedWidthAggregationMap(
+  emptyAggregationBuffer,
+  aggBufferSchema,
+  groupKeySchema,
+  taskMemoryManager,
+  shuffleMemoryManager,
+  128, // initial capacity
+  PAGE_SIZE_BYTES,
+  false // disable perf metrics
+)
+
+// Convert the map into a sorter
+val sorter = map.destructAndCreateExternalSorter()
+
+// Add more keys to the sorter and make sure the results come out 
sorted.
+val additionalKeys = randomStrings(1024)
+val keyConverter = UnsafeProjection.create(groupKeySchema)
+val valueConverter = UnsafeProjection.create(aggBufferSchema)
+
+additionalKeys.zipWithIndex.foreach { case (str, i) =>
+  val k = InternalRow(UTF8String.fromString(str))
+  val v = InternalRow(str.length)
+  sorter.insertKV(keyConverter.apply(k), valueConverter.apply(v))
+
+  if ((i % 100) == 0) {
+shuffleMemoryManager.markAsOutOfMemory()
+sorter.closeCurrentPage()
+  }
+}
+
+val out = new scala.collection.mutable.ArrayBuffer[String]
+val iter = sorter.sortedIterator()
+while (iter.next()) {
+  // At here, we also test if copy is correct.
+  val key = iter.getKey.copy()
+  val value = iter.getValue.copy()
+  assert(key.getString(0).length === value.getInt(0))
+  out += key.getString(0)
+}
+
+assert(out === (additionalKeys).sorted)
+
+map.free()
+  }
+
+  testWithMemoryLeakDetection("test external sorting with empty records") {
+// Calling this make sure we have block manager and everything else 
setup.
+TestSQLContext
+
+// Memory consumption in the beginning of the task.
+val initialMemoryConsumption = 
shuffleMemoryManager.getMemoryConsumptionForThisTask()
+
+val map = new UnsafeFixedWidthAggregationMap(
+  emptyAggregationBuffer,
+  StructType(Nil),
+  StructType(Nil),
+  taskMemoryManager,
+  shuffleMemoryManager,
+  128, // initial capacity
+  PAGE_SIZE_BYTES,
+  false // disable perf metrics
+)
+
+(1 to 10).foreach { i =>
+  val buf = map.getAggregationBuffer(InternalRow(0))
+  assert(buf != null)
+}
+
+// Convert the map into a sorter
+val sorter = map.destructAndCreateExternalSorter()
+
+withClue(s"destructAndCreateExternalSorter should release memory used 
by the map") {
+  // 4096 * 16 is the initial size allocated for the pointer/prefix 
array in the in-mem sorter.
+  assert(shuffleMemoryManager.getMemoryConsumptionForThisTask() ===
+initialMemoryConsumption + 4096 * 16)
+}
+
+// Add more keys to the sorter and make sure the results come out 
sorted.
+(1 to 4096).foreach { i =>
+  sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), 
UnsafeRow.createFromByteArray(0, 0))
+
+  if ((i % 100) == 0) {
+shuffleMemoryManager.markAsOutOfMemory()
+sorter.closeCurrentPage()
+  }
+}
+
+var count = 0
+val iter = sorter.sortedIterator()
+while (iter.next()) {
+  // At here, we also test if copy is correct.
--- End diff --

Is this necessary for this test?





[GitHub] spark pull request: [SPARK-8978][Streaming] Implements the DirectK...

2015-08-04 Thread nraychaudhuri
Github user nraychaudhuri commented on a diff in the pull request:

https://github.com/apache/spark/pull/7796#discussion_r36272732
  
--- Diff: 
external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
 ---
@@ -381,3 +447,20 @@ object DirectKafkaStreamSuite {
 }
   }
 }
+
+private[streaming] class ConstantEstimator(rates: Double*) extends 
RateEstimator {
--- End diff --

I don't have enough permissions to change the PR title.





[GitHub] spark pull request: [SPARK-8978][Streaming] Implements the DirectK...

2015-08-04 Thread nraychaudhuri
Github user nraychaudhuri commented on a diff in the pull request:

https://github.com/apache/spark/pull/7796#discussion_r36272684
  
--- Diff: 
external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
 ---
@@ -381,3 +447,20 @@ object DirectKafkaStreamSuite {
 }
   }
 }
+
+private[streaming] class ConstantEstimator(rates: Double*) extends 
RateEstimator {
--- End diff --

@tdas I tried to reuse that, but it is in a different project. Are the test 
files from the streaming project shared with external projects?





[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7904#issuecomment-127888915
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7948#discussion_r36272567
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMapSuite.scala
 ---
@@ -231,4 +231,109 @@ class UnsafeFixedWidthAggregationMapSuite extends 
SparkFunSuite with Matchers {
 
 map.free()
   }
+
+  testWithMemoryLeakDetection("test external sorting with an empty map") {
+// Calling this make sure we have block manager and everything else 
setup.
+TestSQLContext
+
+val map = new UnsafeFixedWidthAggregationMap(
+  emptyAggregationBuffer,
+  aggBufferSchema,
+  groupKeySchema,
+  taskMemoryManager,
+  shuffleMemoryManager,
+  128, // initial capacity
+  PAGE_SIZE_BYTES,
+  false // disable perf metrics
+)
+
+// Convert the map into a sorter
+val sorter = map.destructAndCreateExternalSorter()
+
+// Add more keys to the sorter and make sure the results come out 
sorted.
+val additionalKeys = randomStrings(1024)
+val keyConverter = UnsafeProjection.create(groupKeySchema)
+val valueConverter = UnsafeProjection.create(aggBufferSchema)
+
+additionalKeys.zipWithIndex.foreach { case (str, i) =>
+  val k = InternalRow(UTF8String.fromString(str))
+  val v = InternalRow(str.length)
+  sorter.insertKV(keyConverter.apply(k), valueConverter.apply(v))
+
+  if ((i % 100) == 0) {
+shuffleMemoryManager.markAsOutOfMemory()
+sorter.closeCurrentPage()
+  }
+}
+
+val out = new scala.collection.mutable.ArrayBuffer[String]
+val iter = sorter.sortedIterator()
+while (iter.next()) {
+  // At here, we also test if copy is correct.
+  val key = iter.getKey.copy()
+  val value = iter.getValue.copy()
+  assert(key.getString(0).length === value.getInt(0))
+  out += key.getString(0)
+}
+
+assert(out === (additionalKeys).sorted)
+
+map.free()
+  }
+
+  testWithMemoryLeakDetection("test external sorting with empty records") {
+// Calling this make sure we have block manager and everything else 
setup.
+TestSQLContext
+
+// Memory consumption in the beginning of the task.
+val initialMemoryConsumption = 
shuffleMemoryManager.getMemoryConsumptionForThisTask()
+
+val map = new UnsafeFixedWidthAggregationMap(
+  emptyAggregationBuffer,
+  StructType(Nil),
+  StructType(Nil),
+  taskMemoryManager,
+  shuffleMemoryManager,
+  128, // initial capacity
+  PAGE_SIZE_BYTES,
+  false // disable perf metrics
+)
+
+(1 to 10).foreach { i =>
+  val buf = map.getAggregationBuffer(InternalRow(0))
+  assert(buf != null)
+}
+
+// Convert the map into a sorter
+val sorter = map.destructAndCreateExternalSorter()
+
+withClue(s"destructAndCreateExternalSorter should release memory used 
by the map") {
+  // 4096 * 16 is the initial size allocated for the pointer/prefix 
array in the in-mem sorter.
+  assert(shuffleMemoryManager.getMemoryConsumptionForThisTask() ===
+initialMemoryConsumption + 4096 * 16)
+}
+
+// Add more keys to the sorter and make sure the results come out 
sorted.
+(1 to 4096).foreach { i =>
+  sorter.insertKV(UnsafeRow.createFromByteArray(0, 0), 
UnsafeRow.createFromByteArray(0, 0))
+
+  if ((i % 100) == 0) {
+shuffleMemoryManager.markAsOutOfMemory()
+sorter.closeCurrentPage()
+  }
+}
+
+var count = 0
+val iter = sorter.sortedIterator()
+while (iter.next()) {
+  // At here, we also test if copy is correct.
+  iter.getKey.copy()
+  iter.getValue.copy()
+  count += 1;
+}
+
+assert(count === 4097)
--- End diff --

To clarify: maybe add a comment saying that one row comes from the map, 
plus 4096 rows added directly to the KV sorter after creating it.





[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6817#issuecomment-127888097
  
[Test build #39827 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39827/consoleFull) for PR 6817 at commit [`4b2dd75`](https://github.com/apache/spark/commit/4b2dd75abc3469fd6abc13e388c6fb9b2060b962).





[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...

2015-08-04 Thread XuTingjun
Github user XuTingjun commented on the pull request:

https://github.com/apache/spark/pull/6817#issuecomment-127886475
  
@squito, I have updated the test, thank you very much.





[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6817#issuecomment-127886729
  
Merged build started.





[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-127886593
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7904#issuecomment-127887018
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8366] When tasks failed and append new ...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6817#issuecomment-127886604
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-127886731
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-127887044
  
[Test build #39828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39828/consoleFull) for PR 7774 at commit [`5a2bc99`](https://github.com/apache/spark/commit/5a2bc9937bc26e014842b720fd2096294c9272b7).





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7948#discussion_r36272385
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
@@ -82,8 +82,15 @@ public UnsafeKVExternalSorter(StructType keySchema, StructType valueSchema,
 pageSizeBytes);
 } else {
   // Insert the records into the in-memory sorter.
+  // We will use the number of elements in the map as the initialSize of the
+  // UnsafeInMemorySorter. Because UnsafeInMemorySorter does not accept 0 as the initialSize,
+  // we will use 1 as its initial size if the map is empty.
+  int initialSoeterSize = map.numElements();
--- End diff --

Typo in this variable name.





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7948#discussion_r36272417
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java ---
@@ -82,8 +82,15 @@ public UnsafeKVExternalSorter(StructType keySchema, StructType valueSchema,
 pageSizeBytes);
 } else {
   // Insert the records into the in-memory sorter.
+  // We will use the number of elements in the map as the initialSize of the
+  // UnsafeInMemorySorter. Because UnsafeInMemorySorter does not accept 0 as the initialSize,
+  // we will use 1 as its initial size if the map is empty.
+  int initialSoeterSize = map.numElements();
+  if (initialSoeterSize == 0) {
+initialSoeterSize = 1;
+  }
   final UnsafeInMemorySorter inMemSorter = new UnsafeInMemorySorter(
-taskMemoryManager, recordComparator, prefixComparator, map.numElements());
+taskMemoryManager, recordComparator, prefixComparator, initialSoeterSize);
--- End diff --

Could also do `Math.max(1, map.numElements())` if you want a one-liner.
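
As a minimal sketch of that one-liner in isolation (the class and method names here are hypothetical and not part of the patch), the clamp could look like this:

```java
// Hypothetical helper, not part of the PR: clamps a possibly-zero element
// count to at least 1, since the comment in the diff says the in-memory
// sorter does not accept 0 as its initial size.
final class SorterSizing {
    static int initialSorterSize(int numElements) {
        return Math.max(1, numElements);
    }
}
```

This behaves the same as the `if (size == 0) size = 1` form in the diff for any non-negative count.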





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7948#discussion_r36272364
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillMerger.java ---
@@ -47,11 +47,19 @@ public int compare(UnsafeSorterIterator left, UnsafeSorterIterator right) {
 priorityQueue = new PriorityQueue(numSpills, comparator);
   }
 
-  public void addSpill(UnsafeSorterIterator spillReader) throws IOException {
+  /**
+   * Add an UnsafeSorterIterator to this merger
+   */
+  public void addSpillIfNotEmpty(UnsafeSorterIterator spillReader) throws IOException {
 if (spillReader.hasNext()) {
+  // We only add the spillReader to the priorityQueue if it is not empty. We do this to
--- End diff --

Yep, makes sense. Adding empty spill readers to the priority queue violates
an invariant that's maintained by the `loadNext()` loop: if a spill reader is
in the priority queue, then `getBaseObject()`, `getBaseOffset()`, etc. point
to a row that has not been returned yet.  We maintained that invariant
correctly but didn't establish it properly when there were empty spills. This
change fixes that.
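
To illustrate the invariant, here is a simplified sketch of a k-way merge over sorted runs. It is not the actual `UnsafeSorterSpillMerger`; `PeekableRun` and `RunMerger` are hypothetical stand-ins that work on plain integers. A run enters the queue only after its first record is loaded, so an empty run never enters:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a spill reader: exposes the record it last loaded.
final class PeekableRun {
    private final Iterator<Integer> source;
    private int current;

    PeekableRun(List<Integer> sortedValues) {
        this.source = sortedValues.iterator();
    }

    boolean hasNext() { return source.hasNext(); }

    void loadNext() { current = source.next(); }

    int current() { return current; }
}

final class RunMerger {
    // Invariant: every run in the queue has a loaded record not yet returned.
    private final PriorityQueue<PeekableRun> queue =
        new PriorityQueue<>(Comparator.comparingInt(PeekableRun::current));

    // Establishes the invariant: empty runs are skipped instead of enqueued.
    void addRunIfNotEmpty(PeekableRun run) {
        if (run.hasNext()) {
            run.loadNext();
            queue.add(run);
        }
    }

    // Maintains the invariant: a run is re-enqueued only after loading a new record.
    List<Integer> mergeAll() {
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            PeekableRun smallest = queue.poll();
            out.add(smallest.current());
            if (smallest.hasNext()) {
                smallest.loadNext();
                queue.add(smallest);
            }
        }
        return out;
    }
}
```

If an empty run were enqueued without the `hasNext()` guard, the comparator would read a `current` value that was never loaded, which is the kind of violation described above.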





[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-127884863
  
  [Test build #1349 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1349/consoleFull) for PR 7774 at commit [`57d4cd2`](https://github.com/apache/spark/commit/57d4cd2edc349bf027ffca5b2e819e7479c3be62).





[GitHub] spark pull request: [SPARK-9611] [SQL] Fixes a few corner cases wh...

2015-08-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/7948#discussion_r36272160
  
--- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java ---
@@ -191,24 +191,28 @@ public void spill() throws IOException {
   spillWriters.size(),
   spillWriters.size() > 1 ? " times" : " time");
--- End diff --

Not sure whether we should move this log statement inside the `if` block or 
not.  I suppose it might be useful to know when memory pressure triggered a 
spill even if we didn't end up writing rows, so it's probably fine to leave 
this where it is.





[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....

2015-08-04 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/7925#issuecomment-127884616
  
Merged into master and 1.5 branch.





[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....

2015-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7925





[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7904#issuecomment-127884531
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7904#issuecomment-127884542
  
Merged build started.





[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7925#issuecomment-127884345
  
  [Test build #1344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1344/console) for PR 7925 at commit [`e19701a`](https://github.com/apache/spark/commit/e19701a59bbbc6a709cb3b3a6ff24c141ad2f425).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-9486][SQL][WIP] Add data source aliasin...

2015-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7802#issuecomment-127883935
  
Actually I think the current API breaks binary compatibility for data 
sources, so we can't merge it as is.

In Java (or Scala binary compatibility), RelationProvider now has an extra 
interface that has no default implementation. We need to find a workaround to 
provide this information.
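
A rough sketch of the general problem, using hypothetical names rather than Spark's actual `RelationProvider`: when an interface gains a new abstract method, implementers compiled against the old version have no body for it. A Java 8 default method is one way to give them one. The thread does not say which workaround the PR ended up using.

```java
// Hypothetical interface standing in for a data source provider.
interface Provider {
    String createRelation(String path);

    // Added in a later version. Because it has a default body, implementers
    // written against the older interface still compile and run. If this
    // method were abstract, they would have no implementation for it.
    default String alias() {
        return getClass().getSimpleName();
    }
}

// An "old" implementer that knows nothing about alias().
final class LegacyCsvProvider implements Provider {
    @Override
    public String createRelation(String path) {
        return "csv:" + path;
    }
}
```

Here `new LegacyCsvProvider().alias()` falls back to the default body even though the class never mentions the method.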






[GitHub] spark pull request: [SPARK-9486][SQL][WIP] Add data source aliasin...

2015-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7802#issuecomment-127883375
  
@JDrit what's still WIP about this patch?






[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7904#issuecomment-127883122
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7165] [WIP] [SQL] Use sort merge join f...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7904#issuecomment-127883162
  
Merged build started.





[GitHub] spark pull request: [SPARK-9119] [SPARK-8359] [SQL] match Decimal....

2015-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/7925#issuecomment-127882225
  
LGTM (not super familiar with decimals though)





[GitHub] spark pull request: [SPARK-9591][CORE]Job may fail for exception d...

2015-08-04 Thread GraceH
Github user GraceH commented on a diff in the pull request:

https://github.com/apache/spark/pull/7927#discussion_r36271550
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -592,8 +592,14 @@ private[spark] class BlockManager(
 val locations = Random.shuffle(master.getLocations(blockId))
 for (loc <- locations) {
   logDebug(s"Getting remote block $blockId from $loc")
-  val data = blockTransferService.fetchBlockSync(
-loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+  val data = try {
+blockTransferService.fetchBlockSync(
+  loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
+  } catch {
+case e: Throwable =>
+  logWarning(s"Exception during getting remote block $blockId from $loc", e)
--- End diff --

@squito Agreed, let's do it like ```askWithRetry```.  If we can fetch the
block successfully from any remote store, the fetch succeeds. We should not
abort the working path at the first exception.

So maybe we need to catch all kinds of exceptions (not only IOException).
If some attempts fail, we should log the exception but continue fetching.
If we reach the final location and it still throws, we should throw a NEW
exception saying that all attempts failed (i.e., no location is available),
and maybe attach the last exception to this NEW exception.

But if we only handle IOException, then any other type of exception from one
location still breaks the entire workflow (fetching the data from the
remaining locations, if possible).

What do you think?
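
The behaviour proposed above could be sketched roughly as follows. This is an illustration only, not `BlockManager` code: `RemoteFetcher`, `fetchFromAny` and `AllLocationsFailedException` are hypothetical names, and the sketch catches `Exception` rather than `Throwable` so that fatal errors such as `OutOfMemoryError` still propagate, which is a design choice the thread has not settled.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical exception signalling that no location could serve the block.
final class AllLocationsFailedException extends RuntimeException {
    AllLocationsFailedException(String message, Throwable lastFailure) {
        super(message, lastFailure);
    }
}

final class RemoteFetcher {
    // Tries each location in order and returns the first successful result.
    // A failure at one location is logged and the next location is tried.
    static <L, D> D fetchFromAny(List<L> locations, Function<L, D> fetch) {
        Exception lastFailure = null;
        for (L loc : locations) {
            try {
                return fetch.apply(loc);
            } catch (Exception e) {
                System.err.println("Fetch from " + loc + " failed: " + e);
                lastFailure = e;
            }
        }
        // Every location failed (or there were none): raise a new exception
        // that carries the last failure as its cause.
        throw new AllLocationsFailedException(
            "Failed to fetch from all " + locations.size() + " locations", lastFailure);
    }
}
```

A caller would pass the shuffled location list and a function wrapping the remote fetch; a single bad location then no longer aborts the whole lookup.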





[GitHub] spark pull request: [SPARK-8861][SPARK-8862][SQL] Add basic instru...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7774#issuecomment-127880431
  
  [Test build #1348 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1348/consoleFull) for PR 7774 at commit [`57d4cd2`](https://github.com/apache/spark/commit/57d4cd2edc349bf027ffca5b2e819e7479c3be62).





[GitHub] spark pull request: [SPARK-9581][SQL] Add unit test for JSON UDT

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7917#issuecomment-127879419
  
  [Test build #1347 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1347/consoleFull) for PR 7917 at commit [`93e3954`](https://github.com/apache/spark/commit/93e395486a326ec360923f2fe7de762b42a36252).





[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7952#issuecomment-127878981
  
  [Test build #39824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39824/consoleFull) for PR 7952 at commit [`8d08090`](https://github.com/apache/spark/commit/8d0809014b76006208b214abc75969a112d21596).





[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7952#issuecomment-127873996
  
Merged build started.





[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7952#issuecomment-127873887
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...

2015-08-04 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7952#issuecomment-127873523
  
test this please





[GitHub] spark pull request: [SPARK-9217][STREAMING] Make the kinesis recei...

2015-08-04 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/7825#issuecomment-127872074
  
LGTM





[GitHub] spark pull request: [SPARK-8266][SQL]add function translate

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7709#issuecomment-127866513
  
  [Test build #1346 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1346/consoleFull) for PR 7709 at commit [`b4c47bf`](https://github.com/apache/spark/commit/b4c47bf9e224beb9cf020fb794c6ba741b0fc2a7).





[GitHub] spark pull request: [SPARK-9065][Streaming][PySpark] Add MessageHa...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7410#issuecomment-127865531
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-04 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/7949#issuecomment-127865432
  
Merged into master and 1.5 branch.





[GitHub] spark pull request: [SPARK-9065][Streaming][PySpark] Add MessageHa...

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7410#issuecomment-127865211
  
  [Test build #39815 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39815/console) for PR 7410 at commit [`f375e16`](https://github.com/apache/spark/commit/f375e16640c1670ec907711bf63d2e70e5a19f6c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class PythonMessageAndMetadata(`
  * `  class PythonMessageAndMetadataPickler extends IObjectPickler `
  * `class KafkaMessageAndMetadata(object):`






[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7580





[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7949





[GitHub] spark pull request: [SPARK-9493] [ML] add featureIndex to handle v...

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7952#issuecomment-127864368
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8266][SQL]add function translate

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7709#issuecomment-127864330
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...

2015-08-04 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/7937#discussion_r36270670
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
@@ -139,200 +202,308 @@ class PrefixSpan private (
 run(data.rdd.map(_.asScala.map(_.asScala.toArray).toArray))
   }
 
+}
+
+@Experimental
+object PrefixSpan extends Logging {
+
   /**
-   * Find the complete set of sequential patterns in the input sequences. This method utilizes
-   * the internal representation of itemsets as Array[Int] where each itemset is represented by
-   * a contiguous sequence of non-negative integers and delimiters represented by [[DELIMITER]].
-   * @param data ordered sequences of itemsets. Items are represented by non-negative integers.
-   * Each itemset has one or more items and is delimited by [[DELIMITER]].
-   * @return a set of sequential pattern pairs,
-   * the key of pair is pattern (a list of elements),
-   * the value of pair is the pattern's count.
+   * Find the complete set of frequent sequential patterns in the input sequences.
+   * @param data ordered sequences of itemsets. We represent a sequence internally as Array[Int],
+   * where each itemset is represented by a contiguous sequence of distinct and ordered
+   * positive integers. We use 0 as the delimiter at itemset boundaries, including the
+   * first and the last position.
+   * @return an RDD of (frequent sequential pattern, count) pairs,
+   * @see [[Postfix]]
*/
-  private[fpm] def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = {
+  private[fpm] def genFreqPatterns(
+  data: RDD[Array[Int]],
+  minCount: Long,
+  maxPatternLength: Int,
+  maxLocalProjDBSize: Long): RDD[(Array[Int], Long)] = {
 val sc = data.sparkContext
 
 if (data.getStorageLevel == StorageLevel.NONE) {
   logWarning("Input data is not cached.")
 }
 
-// Use List[Set[Item]] for internal computation
-val sequences = data.map { seq => splitSequence(seq.toList) }
-
-// Convert min support to a min number of transactions for this dataset
-val minCount = if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong
-
-// (Frequent items -> number of occurrences, all items here satisfy the `minSupport` threshold
-val freqItemCounts = sequences
-  .flatMap(seq => seq.flatMap(nonemptySubsets(_)).distinct.map(item => (item, 1L)))
-  .reduceByKey(_ + _)
-  .filter { case (item, count) => (count >= minCount) }
-  .collect()
-  .toMap
-
-// Pairs of (length 1 prefix, suffix consisting of frequent items)
-val itemSuffixPairs = {
-  val freqItemSets = freqItemCounts.keys.toSet
-  val freqItems = freqItemSets.flatten
-  sequences.flatMap { seq =>
-val filteredSeq = seq.map(item => freqItems.intersect(item)).filter(_.nonEmpty)
-freqItemSets.flatMap { item =>
-  val candidateSuffix = LocalPrefixSpan.getSuffix(item, filteredSeq)
-  candidateSuffix match {
-case suffix if !suffix.isEmpty => Some((List(item), suffix))
-case _ => None
+val postfixes = data.map(items => new Postfix(items))
+
+// Local frequent patterns (prefixes) and their counts.
+val localFreqPatterns = mutable.ArrayBuffer.empty[(Array[Int], Long)]
+// Prefixes whose projected databases are small.
+val smallPrefixes = mutable.Map.empty[Int, Prefix]
+val emptyPrefix = Prefix.empty
+// Prefixes whose projected databases are large.
+var largePrefixes = mutable.Map(emptyPrefix.id -> emptyPrefix)
+while (largePrefixes.nonEmpty) {
+  val numLocalFreqPatterns = localFreqPatterns.length
+  logInfo(s"number of local frequent patterns: $numLocalFreqPatterns")
+  if (localFreqPatterns.length > 100) {
+logWarning(
+  s"""
+ | Collected $numLocalFreqPatterns local frequent patterns. You may want to consider:
+ |   1. increase minSupport,
+ |   2. decrease maxPatternLength,
+ |   3. increase maxLocalProjDBSize.
+   """.stripMargin)
+  }
+  logInfo(s"number of small prefixes: ${smallPrefixes.size}")
+  logInfo(s"number of large prefixes: ${largePrefixes.size}")
+  val largePrefixArray = largePrefixes.values.toArray
+  val freqPrefixes = postfixes.flatMap { postfix =>
--- End diff --

OK, just FYI `for` and `map/flatMap` are equivalent 
(http://docs.scala-lang.org/tutorials/FAQ/yield.html)



[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...

2015-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7937


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8999][MLlib]Support non-temporal sequen...

2015-08-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/7594





[GitHub] spark pull request: [SPARK-9360][SQL] Support BinaryType in Prefix...

2015-08-04 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/7676#discussion_r36270593
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SortOrder.scala ---
@@ -76,6 +78,7 @@ case class SortPrefix(child: SortOrder) extends UnaryExpression {
 (DoublePrefixComparator.computePrefix(Double.NegativeInfinity),
   s"$DoublePrefixCmp.computePrefix((double)$input)")
   case StringType => (0L, s"$input.getPrefix()")
+  case BinaryType => (0L, s"$BinaryPrefixCmp.computePrefix((byte[])$input)")
--- End diff --

I think we don't need the cast here.





[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...

2015-08-04 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/7937#issuecomment-127863143
  
Merged into master and branch-1.5. Thanks @feynmanliang and @zhangjiajin 
for review!





[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/7929#issuecomment-127862858
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-9540] [MLLIB] optimize PrefixSpan imple...

2015-08-04 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/7937#discussion_r36270520
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
@@ -139,200 +202,308 @@ class PrefixSpan private (
 run(data.rdd.map(_.asScala.map(_.asScala.toArray).toArray))
   }
 
+}
+
+@Experimental
+object PrefixSpan extends Logging {
+
   /**
-   * Find the complete set of sequential patterns in the input sequences. This method utilizes
-   * the internal representation of itemsets as Array[Int] where each itemset is represented by
-   * a contiguous sequence of non-negative integers and delimiters represented by [[DELIMITER]].
-   * @param data ordered sequences of itemsets. Items are represented by non-negative integers.
-   * Each itemset has one or more items and is delimited by [[DELIMITER]].
-   * @return a set of sequential pattern pairs,
-   * the key of pair is pattern (a list of elements),
-   * the value of pair is the pattern's count.
+   * Find the complete set of frequent sequential patterns in the input sequences.
+   * @param data ordered sequences of itemsets. We represent a sequence internally as Array[Int],
+   * where each itemset is represented by a contiguous sequence of distinct and ordered
+   * positive integers. We use 0 as the delimiter at itemset boundaries, including the
+   * first and the last position.
+   * @return an RDD of (frequent sequential pattern, count) pairs,
+   * @see [[Postfix]]
*/
-  private[fpm] def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = {
+  private[fpm] def genFreqPatterns(
+  data: RDD[Array[Int]],
+  minCount: Long,
+  maxPatternLength: Int,
+  maxLocalProjDBSize: Long): RDD[(Array[Int], Long)] = {
 val sc = data.sparkContext
 
 if (data.getStorageLevel == StorageLevel.NONE) {
   logWarning("Input data is not cached.")
 }
 
-// Use List[Set[Item]] for internal computation
-val sequences = data.map { seq => splitSequence(seq.toList) }
-
-// Convert min support to a min number of transactions for this dataset
-val minCount = if (minSupport == 0) 0L else math.ceil(sequences.count() * minSupport).toLong
-
-// (Frequent items -> number of occurrences, all items here satisfy the `minSupport` threshold
-val freqItemCounts = sequences
-  .flatMap(seq => seq.flatMap(nonemptySubsets(_)).distinct.map(item => (item, 1L)))
-  .reduceByKey(_ + _)
-  .filter { case (item, count) => (count >= minCount) }
-  .collect()
-  .toMap
-
-// Pairs of (length 1 prefix, suffix consisting of frequent items)
-val itemSuffixPairs = {
-  val freqItemSets = freqItemCounts.keys.toSet
-  val freqItems = freqItemSets.flatten
-  sequences.flatMap { seq =>
-val filteredSeq = seq.map(item => freqItems.intersect(item)).filter(_.nonEmpty)
-freqItemSets.flatMap { item =>
-  val candidateSuffix = LocalPrefixSpan.getSuffix(item, filteredSeq)
-  candidateSuffix match {
-case suffix if !suffix.isEmpty => Some((List(item), suffix))
-case _ => None
+val postfixes = data.map(items => new Postfix(items))
+
+// Local frequent patterns (prefixes) and their counts.
+val localFreqPatterns = mutable.ArrayBuffer.empty[(Array[Int], Long)]
+// Prefixes whose projected databases are small.
+val smallPrefixes = mutable.Map.empty[Int, Prefix]
+val emptyPrefix = Prefix.empty
+// Prefixes whose projected databases are large.
+var largePrefixes = mutable.Map(emptyPrefix.id -> emptyPrefix)
+while (largePrefixes.nonEmpty) {
+  val numLocalFreqPatterns = localFreqPatterns.length
+  logInfo(s"number of local frequent patterns: $numLocalFreqPatterns")
+  if (localFreqPatterns.length > 100) {
+logWarning(
+  s"""
+ | Collected $numLocalFreqPatterns local frequent patterns. You may want to consider:
+ |   1. increase minSupport,
+ |   2. decrease maxPatternLength,
+ |   3. increase maxLocalProjDBSize.
+   """.stripMargin)
+  }
+  logInfo(s"number of small prefixes: ${smallPrefixes.size}")
+  logInfo(s"number of large prefixes: ${largePrefixes.size}")
+  val largePrefixArray = largePrefixes.values.toArray
+  val freqPrefixes = postfixes.flatMap { postfix =>
--- End diff --

There are several performance issues with `for` in Scala. I don't know 
whether `for` syntax is better here. I'm more comfortable with `flatMap`, which 

[GitHub] spark pull request: [SPARK-9593] [SQL] Fixes Hadoop shims loading

2015-08-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/7929#issuecomment-127862755
  
  [Test build #39818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/39818/console) for PR 7929 at commit [`c99b497`](https://github.com/apache/spark/commit/c99b497560d3103acb65076eac023a6bf36f96b5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.




