[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61052493
  
@pwendell @scwf What I mean is that com.esotericsoftware is shaded again in 
hive as org.apache.hive.com.esotericsoftware. I think that is why the original 
hive package works against spark. But the spark-project:hive-exec does not 
include the shaded org.apache.hive.com.esotericsoftware and needs to relink, 
which causes the version conflict.
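
For context, the relocation being described is the kind of thing Hive's own build does with the maven-shade-plugin. The snippet below is a sketch of that mechanism only; the plugin configuration is not copied from Hive's actual pom:

```xml
<!-- Illustrative maven-shade-plugin relocation: rewrites com.esotericsoftware
     classes into a private package, the way hive-exec embeds its own copy. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.esotericsoftware</pattern>
        <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

With such a relocation, classes compiled against the shaded package name cannot resolve against an unshaded kryo jar, which is the conflict described above.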


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61052515
  
Okay, I think the issue is pretty tough. Unfortunately hive directly 
uses the shaded objenesis classes, whereas Spark needs Kryo 2.21, which 
depends on the original objenesis classes.

Here is the hive code that uses it:


https://github.com/apache/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L186

So we can't just remove the kryo that hive uses. This is pretty ugly. One 
solution might be to update chill in Spark so that Spark uses the same Kryo 
version as Hive.






[GitHub] spark pull request: [SPARK-4102] Remove unused ShuffleReader.stop(...

2014-10-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2966





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61052281
  
@pwendell, right, in hive 0.13.1 it uses the shaded 
```com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy``` 
from kryo 2.22.
So if we exclude it, we will get a ClassNotFoundException, because kryo 
2.21 (which spark's chill depends on) does not have this class (in kryo 
2.21 the class is org.objenesis.strategy.InstantiatorStrategy)





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61052124
  
@pwendell com.esotericsoftware is already shaded in hive. Will it work if 
we keep it in hive-exec.jar? Please advise.





[GitHub] spark pull request: SPARK-3968 Use parquet-mr filter2 api in spark...

2014-10-29 Thread saucam
Github user saucam commented on the pull request:

https://github.com/apache/spark/pull/2841#issuecomment-61052070
  
yes. In the task-side metadata strategy, the tasks are spawned first, and each 
task then reads the metadata and drops its row groups. So if I am using 
yarn, and the data is huge (so the metadata is large), the memory is consumed on 
the yarn side; but in the client-side metadata strategy, the whole of the 
metadata is read on a single node before the tasks are spawned.
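
The trade-off above can be sketched schematically. The types and function names below are invented for illustration; the real logic lives in parquet-mr's filter2 API:

```python
# Schematic contrast of the two row-group filtering strategies discussed above.
# client_side_plan: the driver reads ALL row-group metadata up front, so the
# memory cost lands on a single node before any task starts.
# task_side_plan: each task reads only the metadata for its own split, so the
# cost is spread across the cluster (the "yarn side" in the comment above).

def client_side_plan(footer_metadata, predicate):
    # footer_metadata: full list of row-group stats, held in driver memory
    return [rg for rg in footer_metadata if predicate(rg)]

def task_side_plan(assigned_row_groups, read_metadata, predicate):
    # assigned_row_groups: just this task's split; metadata is read lazily
    return [rg for rg in assigned_row_groups if predicate(read_metadata(rg))]
```

The row-group survivors are the same either way; only where (and when) the metadata is materialized differs.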





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61052076
  
Another thing to notice is that Kryo 2.21 is a really weird release. The [Kryo 
2.21 
POM](https://repo1.maven.org/maven2/com/esotericsoftware/kryo/kryo/2.21/kryo-2.21.pom)
 suggests that the Objenesis classes are relocated to package 
`com.esotericsoftware.shaded.org.objenesis`, but the classes within the Maven 
artifact jar file still reside in package `org.objenesis`. Also, the Kryo GitHub 
repo doesn't provide a 2.21 release download, and the version number in the POM of 
the [kryo-2.21 
tag](https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/pom.xml#L13) is 
actually `2.21-SNAPSHOT`.
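
The POM-versus-jar discrepancy is easy to check directly, since a jar is just a zip archive. A small helper for that (purely illustrative, not part of any build):

```python
# List the Java packages whose .class files are actually present in a jar.
# Useful for verifying whether a jar like kryo-2.21.jar really contains
# relocated (com.esotericsoftware.shaded.org.objenesis) or original
# (org.objenesis) Objenesis classes, regardless of what the POM claims.
import zipfile

def packages_in_jar(path):
    with zipfile.ZipFile(path) as jar:
        return sorted({name.rsplit("/", 1)[0].replace("/", ".")
                       for name in jar.namelist()
                       if "/" in name and name.endswith(".class")})
```

For example, `packages_in_jar("kryo-2.21.jar")` on the Maven Central artifact should list `org.objenesis.*` packages if the classes were never actually relocated.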





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61052007
  
@scwf the hive classes only link against kryo... they don't link against 
objenesis directly. As long as kryo did not make a binary-incompatible change 
between 2.21 and 2.22, it should be fine.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61052028
  

```com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy``` 
is in kryo 2.22





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61051983
  
actually in the most recent failures it is using kryo 2.21





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61051898
  
@pwendell, spark depends on kryo 2.21, which does not shade objenesis, while 
hive 0.13 depends on kryo 2.22, which does. So excluding it will not fix 
the problem, because hive then cannot find the shaded class





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61051870
  
Based on the most recent failures, it seems like somehow the test classpath 
is still using kryo 2.22.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61051691
  
Just to make it more intuitive, I made a dependency graph to illustrate the 
issue:

![dependency-hell](http://tinyurl.com/q5opqe2)







[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61051681
  
@scwf I checked dev/run-tests; it does invoke python/run-tests. Didn't you 
also run it locally and succeed, or am I missing something?





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61051413
  
The problem here is that Hive 0.13 upgrades the Kryo version from 2.21 to 
2.22. Spark depends on Kryo 2.21 via chill. In Kryo 2.22 they made a 
build change where they started inlining the objenesis dependency via shading. 
This patch somehow causes Spark to compile against Kryo 2.21 and run against 
Kryo 2.22, which is the root cause of the errors.

My suggestion was to just exclude Kryo from Hive, hoping that it would 
result in us keeping Kryo 2.21 and that Hive could deal with it. We might 
need to exclude it in places other than hive-exec. That could be the issue.
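
The exclusion being suggested would look roughly like the following in a dependent pom. The coordinates here are assumptions shown only to illustrate the mechanism, not the actual Spark build change:

```xml
<!-- Illustrative Maven exclusion: keep hive-exec but drop the Kryo it pulls
     in, so that Spark's own (chill's) Kryo 2.21 stays on the classpath. -->
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.13.1</version>
  <exclusions>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Note that an exclusion only removes the transitive jar; it does nothing for Hive classes that were compiled against the shaded package names, which is exactly the ClassNotFound problem raised elsewhere in this thread.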





[GitHub] spark pull request: [WIP][SPARK-4094][CORE] checkpoint should stil...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2956#issuecomment-61051336
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22522/
Test PASSed.





[GitHub] spark pull request: [WIP][SPARK-4094][CORE] checkpoint should stil...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2956#issuecomment-61051328
  
  [Test build #22522 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22522/consoleFull)
 for   PR 2956 at commit 
[`a942bfa`](https://github.com/apache/spark/commit/a942bfa41be317cb68fe69ea1becd3059619a909).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH i...

2014-10-29 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/2711#issuecomment-61050455
  
Very good, thanks, @andrewor14 @vanzin 





[GitHub] spark pull request: [SPARK-1720][SPARK-1719] Add the value of LD_L...

2014-10-29 Thread witgo
Github user witgo closed the pull request at:

https://github.com/apache/spark/pull/1031





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61050366
  
@scwf Hmm, you mean dev/run-tests does not run pyspark? I ran dev/run-tests 
locally today and months ago, and didn't hit a pyspark error. How can I 
invoke the pyspark tests locally?





[GitHub] spark pull request: [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH i...

2014-10-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2711





[GitHub] spark pull request: [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH i...

2014-10-29 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2711#issuecomment-61049248
  
Alright thanks, I'm merging this into master!





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61048834
  
@zhzhan, the original hive failed pyspark, see #3004





[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3003#issuecomment-61048818
  
  [Test build #22523 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22523/consoleFull)
 for   PR 3003 at commit 
[`47b144f`](https://github.com/apache/spark/commit/47b144f66badf9484966d3f2c74ccdb594350751).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

2014-10-29 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3003#issuecomment-61048612
  
@mateiz I have re-implemented it; now it checks the result size before it is 
sent from the executor and when it is fetched in the driver, please review again.
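
The idea can be sketched as a bound on the serialized result size before it leaves the executor. Names and the limit value below are invented; Spark's real check lives in its task-result path and the limit is configurable:

```python
# Rough sketch of a driver-protecting result-size limit: serialize the task
# result and refuse to ship it if it exceeds the configured bound. The same
# check can be applied again on the driver side when fetching.
import pickle

MAX_RESULT_SIZE = 1 << 20  # hypothetical 1 MiB cap

def serialize_result(result, max_bytes=MAX_RESULT_SIZE):
    blob = pickle.dumps(result)
    if len(blob) > max_bytes:
        raise ValueError("serialized result is %d bytes, exceeds limit %d"
                         % (len(blob), max_bytes))
    return blob
```

Checking on the executor avoids even sending an oversized blob; re-checking on the driver guards against executors running with a different configuration.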





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2931#issuecomment-61048438
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22521/
Test PASSed.





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2931#issuecomment-61048431
  
  [Test build #22521 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22521/consoleFull)
 for   PR 2931 at commit 
[`ed5fbf0`](https://github.com/apache/spark/commit/ed5fbf0765136da963f6a8447f1ff69191825392).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class WriteAheadLogBackedBlockRDDPartition(`
  * `class WriteAheadLogBackedBlockRDD[T: ClassTag](`






[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

2014-10-29 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2952#discussion_r19588522
  
--- Diff: examples/src/main/python/mllib/word2vec.py ---
@@ -0,0 +1,48 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
+# The file was unziped and split into multiple lines using
+# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
+# This was done so that the example can be run in local mode
--- End diff --

It would be better to include the download and unzip steps.





[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

2014-10-29 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2952#discussion_r19588505
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -162,6 +162,40 @@ for((synonym, cosineSimilarity) <- synonyms) {
 }
 {% endhighlight %}
 
+
+{% highlight python %}
+# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
+# The file was unziped and split into multiple lines using
+# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
+# This was done so that the example can be run in local mode
+
+import sys
+
+from pyspark import SparkContext
+from pyspark.mllib.feature import Word2Vec
+
+USAGE = ("bin/spark-submit --driver-memory 4g "
--- End diff --

it should look like the scala one; I think the following should be enough:
```
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName='Word2Vec')
inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))

word2vec = Word2Vec()
model = word2vec.fit(inp)

synonyms = model.findSynonyms('china', 40)
for word, cosine_distance in synonyms:
    print "{}: {}".format(word, cosine_distance)
```





[GitHub] spark pull request: [WIP][SPARK-4094][CORE] checkpoint should stil...

2014-10-29 Thread liyezhang556520
Github user liyezhang556520 commented on the pull request:

https://github.com/apache/spark/pull/2956#issuecomment-61047597
  
Since checkpointing is done recursively on an RDD's parents, we need to avoid 
traversing one RDD multiple times. E.g. for the following lineage:
A -- B -- C -- D -- E
 `-- F --' 
When we call E.count(), we should avoid traversing A and B twice, since 
there are two paths to those nodes: EDCBA and EDFBA. Otherwise the traversal 
time will increase exponentially.
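
The guard being described is a standard visited-set traversal. A minimal sketch (not Spark's actual code; the dict encoding of the lineage is invented for illustration):

```python
# Walk an RDD lineage graph parent-wards, visiting each node exactly once so
# that shared ancestors (B and A in the example above) are not traversed
# down both paths.

def traverse_lineage(rdd, parents, visit, visited=None):
    if visited is None:
        visited = set()
    if rdd in visited:          # already handled via another path
        return visited
    visited.add(rdd)
    visit(rdd)
    for parent in parents.get(rdd, []):
        traverse_lineage(parent, parents, visit, visited)
    return visited

# Lineage from the example: E <- D; D <- C and F; C, F <- B; B <- A.
parents = {"E": ["D"], "D": ["C", "F"], "C": ["B"], "F": ["B"], "B": ["A"]}
order = []
traverse_lineage("E", parents, order.append)
```

Without the visited set, every diamond in the lineage doubles the work below it, which is the exponential blow-up the comment warns about.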





[GitHub] spark pull request: [WIP][SPARK-4094][CORE] checkpoint should stil...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2956#issuecomment-61047499
  
  [Test build #22522 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22522/consoleFull)
 for   PR 2956 at commit 
[`a942bfa`](https://github.com/apache/spark/commit/a942bfa41be317cb68fe69ea1becd3059619a909).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2983#issuecomment-61046800
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22514/
Test PASSed.





[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2983#issuecomment-61046798
  
  [Test build #22514 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22514/consoleFull)
 for   PR 2983 at commit 
[`69dba42`](https://github.com/apache/spark/commit/69dba425dd28877212e359887d8c6c86f527e4b8).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class DecimalType(DataType):`
  * `case class UnscaledValue(child: Expression) extends UnaryExpression `
  * `case class MakeDecimal(child: Expression, precision: Int, scale: Int) 
extends UnaryExpression `
  * `case class MutableLiteral(var value: Any, dataType: DataType, 
nullable: Boolean = true)`
  * `case class PrecisionInfo(precision: Int, scale: Int)`
  * `case class DecimalType(precisionInfo: Option[PrecisionInfo]) extends 
FractionalType `
  * `final class Decimal extends Ordered[Decimal] with Serializable `
  * `  trait DecimalIsConflicted extends Numeric[Decimal] `






[GitHub] spark pull request: [SPARK-4028][Streaming] ReceivedBlockHandler i...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2940#issuecomment-61046731
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22520/
Test FAILed.





[GitHub] spark pull request: [SPARK-4028][Streaming] ReceivedBlockHandler i...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2940#issuecomment-61046728
  
  [Test build #22520 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22520/consoleFull) for PR 2940 at commit [`f192f47`](https://github.com/apache/spark/commit/f192f47d0e916e2b4b425581a4a76b7aaf782328).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread tianyi
Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61046528
  
Hi, @marmbrus @liancheng 
I have rebased this PR after [#2762](https://github.com/apache/spark/pull/2762). 
Any more comments on this?





[GitHub] spark pull request: [SPARK-4149][SQL] ISO 8601 support for json da...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3012#issuecomment-61046541
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22518/
Test PASSed.





[GitHub] spark pull request: [SPARK-4149][SQL] ISO 8601 support for json da...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3012#issuecomment-61046537
  
  [Test build #22518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22518/consoleFull) for PR 3012 at commit [`c62b7e2`](https://github.com/apache/spark/commit/c62b7e2b924ab2a9d9c21580be1e077a24b8eb5d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4028][Streaming] ReceivedBlockHandler i...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2940#issuecomment-61046333
  
  [Test build #22516 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22516/consoleFull) for PR 2940 at commit [`df5f320`](https://github.com/apache/spark/commit/df5f3204afb1f3b6566df3dbed5f45b371c1ae67).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4028][Streaming] ReceivedBlockHandler i...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2940#issuecomment-61046335
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22516/
Test FAILed.





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61046263
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22519/
Test PASSed.





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61046262
  
  [Test build #22519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22519/consoleFull) for PR 2542 at commit [`b708fc7`](https://github.com/apache/spark/commit/b708fc7636143562b950fda5fda778e1cd447ae1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61045963
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22517/
Test PASSed.





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61045960
  
  [Test build #22517 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22517/consoleFull) for PR 2542 at commit [`b708fc7`](https://github.com/apache/spark/commit/b708fc7636143562b950fda5fda778e1cd447ae1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61045831
  
I'm combing through the shading and dependency relationships among Hive, Spark, Chill, Kryo, and Objenesis, and will post a summary later. I don't think we can fix all of the problems unless the relationships among these key components are crystal clear...





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61045639
  
I am not an expert on this, but it looks like com.esotericsoftware is already shaded in Hive. Would it help if org.spark-project:hive-exec included the shaded classes, since the original Hive jar works without conflicts? Hive's shade configuration relocates the package like this:

    <relocation>
      <pattern>com.esotericsoftware</pattern>
      <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
    </relocation>
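To illustrate why an unshaded hive-exec breaks at runtime, here is a toy Scala sketch (hypothetical class names and helper, not Spark or Hive code): class references are resolved by fully qualified name, so bytecode that references the relocated (shaded) name cannot be satisfied by a jar rebuilt without the relocation.

```scala
// Toy model of classpath resolution: a reference resolves only if its fully
// qualified name is present on the classpath.
def resolves(name: String, classpath: Set[String]): Boolean = classpath(name)

// Shaded hive-exec ships the relocated class; an unshaded rebuild does not.
val shadedHiveExec   = Set("org.apache.hive.com.esotericsoftware.kryo.Kryo")
val unshadedHiveExec = Set("com.esotericsoftware.kryo.Kryo")

// Hive's compiled code references the relocated name:
val ref = "org.apache.hive.com.esotericsoftware.kryo.Kryo"
resolves(ref, shadedHiveExec)    // resolves: the relocated class exists
resolves(ref, unshadedHiveExec)  // fails: would surface as NoClassDefFoundError
```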






[GitHub] spark pull request: [SPARK-4148][PySpark] fix seed distribution an...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3010#issuecomment-61045548
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22513/
Test PASSed.





[GitHub] spark pull request: [SPARK-4148][PySpark] fix seed distribution an...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3010#issuecomment-61045545
  
  [Test build #22513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22513/consoleFull) for PR 3010 at commit [`c1bacd9`](https://github.com/apache/spark/commit/c1bacd9f46fe5559d4affa74dd986c79cced1611).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4150][PySpark] return self in rdd.setNa...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3011#issuecomment-61045471
  
  [Test build #22515 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22515/consoleFull) for PR 3011 at commit [`4ac3bbd`](https://github.com/apache/spark/commit/4ac3bbdba145d5f5bd3a40906c4ca08daee4d9a8).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4150][PySpark] return self in rdd.setNa...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3011#issuecomment-61045477
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22515/
Test FAILed.





[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2994#issuecomment-61045263
  
  [Test build #22512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22512/consoleFull) for PR 2994 at commit [`0a45f1a`](https://github.com/apache/spark/commit/0a45f1ab5ba5f9440a78e47e48b48f0321d440c1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KafkaWriter[T: ClassTag](@transient dstream: DStream[T]) extends Serializable with Logging `






[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2994#issuecomment-61045266
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22512/
Test PASSed.





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/2931#issuecomment-61044847
  
Apart from readability, does one have a performance benefit over the other?





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2931#issuecomment-61044841
  
  [Test build #22521 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22521/consoleFull) for PR 2931 at commit [`ed5fbf0`](https://github.com/apache/spark/commit/ed5fbf0765136da963f6a8447f1ff69191825392).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/2931#issuecomment-61044672
  
@rxin I updated. The only part I am not in agreement with is the preferred-location logic.





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/2931#discussion_r19587466
  
--- Diff: 
streaming/src/test/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDDSuite.scala
 ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.rdd
+
+import java.io.File
+
+import scala.util.Random
+
+import com.google.common.io.Files
+import org.apache.hadoop.conf.Configuration
+import org.scalatest.{BeforeAndAfterAll, FunSuite}
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.storage.{BlockId, BlockManager, StorageLevel, StreamBlockId}
+import org.apache.spark.streaming.util.{WriteAheadLogFileSegment, WriteAheadLogWriter}
+
+class WriteAheadLogBackedBlockRDDSuite extends FunSuite with BeforeAndAfterAll {
+  val conf = new SparkConf()
+    .setMaster("local[2]")
+    .setAppName(this.getClass.getSimpleName)
+  val hadoopConf = new Configuration()
+
+  var sparkContext: SparkContext = null
+  var blockManager: BlockManager = null
+  var dir: File = null
+
+  override def beforeAll(): Unit = {
+    sparkContext = new SparkContext(conf)
+    blockManager = sparkContext.env.blockManager
+    dir = Files.createTempDir()
+  }
+
+  override def afterAll(): Unit = {
+    // Copied from LocalSparkContext; simpler than introducing test dependencies to core tests.
+    sparkContext.stop()
+    dir.delete()
+    System.clearProperty("spark.driver.port")
+  }
+
+  test("Read data available in block manager and write ahead log") {
+    testRDD(5, 5)
+  }
+
+  test("Read data available only in block manager, not in write ahead log") {
+    testRDD(5, 0)
+  }
+
+  test("Read data available only in write ahead log, not in block manager") {
+    testRDD(0, 5)
+  }
+
+  test("Read data available only in write ahead log, and test storing in block manager") {
+    testRDD(0, 5, testStoreInBM = true)
+  }
+
+  test("Read data with partially available in block manager, and rest in write ahead log") {
+    testRDD(3, 2)
+  }
+
+  /**
+   * Test the WriteAheadLogBackedRDD by writing some partitions of the data to the block manager
+   * and the rest to a write ahead log, and then reading it all back using the RDD.
+   * It can also test whether the partitions that were read from the log were again stored in
+   * the block manager.
+   * @param numPartitionssInBM Number of partitions to write to the Block Manager
+   * @param numPartitionsInWAL Number of partitions to write to the Write Ahead Log
+   * @param testStoreInBM Test whether blocks read from log are stored back into block manager
+   */
+  private def testRDD(
+      numPartitionssInBM: Int,
+      numPartitionsInWAL: Int,
+      testStoreInBM: Boolean = false
+    ) {
+    val numBlocks = numPartitionssInBM + numPartitionsInWAL
+    val data = Seq.tabulate(numBlocks) { _ => Seq.fill(10) { scala.util.Random.nextString(50) } }
--- End diff --

Nice! Right!
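For readers following the line under discussion, a small self-contained sketch of what the `Seq.tabulate` plus `Seq.fill` combination produces (same shape as the test helper's data, with a fixed block count for illustration):

```scala
// Generates `numBlocks` blocks, each holding 10 random 50-character strings,
// mirroring the quoted test helper's data setup.
val numBlocks = 3
val data = Seq.tabulate(numBlocks) { _ =>
  Seq.fill(10)(scala.util.Random.nextString(50))
}
// data: 3 blocks, 10 strings per block, 50 characters per string
```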





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/2931#discussion_r19587457
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.rdd
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark._
+import org.apache.spark.rdd.BlockRDD
+import org.apache.spark.storage.{BlockId, StorageLevel}
+import org.apache.spark.streaming.util.{HdfsUtils, WriteAheadLogFileSegment, WriteAheadLogRandomReader}
+
+/**
+ * Partition class for [[org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD]].
+ * It contains information about the id of the blocks having this partition's data and
+ * the segment of the write ahead log that backs the partition.
+ * @param index index of the partition
+ * @param blockId id of the block having the partition data
+ * @param segment segment of the write ahead log having the partition data
+ */
+private[streaming]
+class WriteAheadLogBackedBlockRDDPartition(
+    val index: Int,
+    val blockId: BlockId,
+    val segment: WriteAheadLogFileSegment
+  ) extends Partition
+
+
+/**
+ * This class represents a special case of the BlockRDD where the data blocks in
+ * the block manager are also backed by segments in write ahead logs. For reading
+ * the data, this RDD first looks up the blocks by their ids in the block manager.
+ * If it does not find them, it looks up the corresponding file segment.
+ *
+ * @param sc SparkContext
+ * @param hadoopConfig Hadoop configuration
+ * @param blockIds Ids of the blocks that contain this RDD's data
+ * @param segments Segments in write ahead logs that contain this RDD's data
+ * @param storeInBlockManager Whether to store in the block manager after reading from the segment
+ * @param storageLevel storage level to store when storing in block manager
+ * (applicable when storeInBlockManager = true)
+ */
+private[streaming]
+class WriteAheadLogBackedBlockRDD[T: ClassTag](
+    @transient sc: SparkContext,
+    @transient hadoopConfig: Configuration,
+    @transient override val blockIds: Array[BlockId],
--- End diff --

For that matter, the `val` in the following lines were not needed either.
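For readers skimming the review, a minimal sketch (hypothetical classes, not from this PR) of the difference the `val` modifier makes on a Scala constructor parameter:

```scala
// `val` promotes a constructor parameter to a public field; without it, the
// parameter is only visible inside the class body, which is all a field needs
// when it is read internally.
class WithVal(val x: Int)      // x readable from outside: new WithVal(3).x
class WithoutVal(x: Int) {     // x is just a constructor parameter
  def double: Int = x * 2      // still usable inside the class
}

new WithVal(3).x          // 3
new WithoutVal(3).double  // 6
```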





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/2931#discussion_r19587361
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.rdd
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark._
+import org.apache.spark.rdd.BlockRDD
+import org.apache.spark.storage.{BlockId, StorageLevel}
+import org.apache.spark.streaming.util.{HdfsUtils, WriteAheadLogFileSegment, WriteAheadLogRandomReader}
+
+/**
+ * Partition class for [[org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD]].
+ * It contains information about the id of the blocks having this partition's data and
+ * the segment of the write ahead log that backs the partition.
+ * @param index index of the partition
+ * @param blockId id of the block having the partition data
+ * @param segment segment of the write ahead log having the partition data
+ */
+private[streaming]
+class WriteAheadLogBackedBlockRDDPartition(
+    val index: Int,
+    val blockId: BlockId,
+    val segment: WriteAheadLogFileSegment
+  ) extends Partition
+
+
+/**
+ * This class represents a special case of the BlockRDD where the data blocks in
+ * the block manager are also backed by segments in write ahead logs. For reading
+ * the data, this RDD first looks up the blocks by their ids in the block manager.
+ * If it does not find them, it looks up the corresponding file segment.
+ *
+ * @param sc SparkContext
+ * @param hadoopConfig Hadoop configuration
+ * @param blockIds Ids of the blocks that contain this RDD's data
+ * @param segments Segments in write ahead logs that contain this RDD's data
+ * @param storeInBlockManager Whether to store in the block manager after reading from the segment
+ * @param storageLevel storage level to store when storing in block manager
+ * (applicable when storeInBlockManager = true)
+ */
+private[streaming]
+class WriteAheadLogBackedBlockRDD[T: ClassTag](
+    @transient sc: SparkContext,
+    @transient hadoopConfig: Configuration,
+    @transient override val blockIds: Array[BlockId],
+    @transient val segments: Array[WriteAheadLogFileSegment],
+    val storeInBlockManager: Boolean,
+    val storageLevel: StorageLevel
+  ) extends BlockRDD[T](sc, blockIds) {
+
+  require(
+    blockIds.length == segments.length,
+    s"Number of block ids (${blockIds.length}) must be " +
+      s"the same as number of segments (${segments.length})!")
+
+  // Hadoop configuration is not serializable, so broadcast it wrapped in a serializable class.
+  private val broadcastedHadoopConf = new SerializableWritable(hadoopConfig)
+
+  override def getPartitions: Array[Partition] = {
+    assertValid()
+    Array.tabulate(blockIds.size) { i =>
+      new WriteAheadLogBackedBlockRDDPartition(i, blockIds(i), segments(i))
+    }
+  }
+
+  /**
+   * Gets the partition data by getting the corresponding block from the block manager.
+   * If the block does not exist, then the data is read from the corresponding segment
+   * in the write ahead log files.
+   */
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+    assertValid()
+    val hadoopConf = broadcastedHadoopConf.value
+    val blockManager = SparkEnv.get.blockManager
+    val partition = split.asInstanceOf[WriteAheadLogBackedBlockRDDPartition]
+    val blockId = partition.blockId
+    blockManager.get(blockId) match {
+      case Some(block) => // Data is in Block Manager
+        val iterator = block.data.asInstanceOf[Iterator[T]]
+        logDebug(s"Read partition data of $this from block manager, block $blockId")
+        iterator
+      case None => // Data

[GitHub] spark pull request: [EC2] Factor out Mesos spark-ec2 branch

2014-10-29 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/3008#issuecomment-61044027
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2931#discussion_r19587321
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
 ---
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.rdd
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark._
+import org.apache.spark.rdd.BlockRDD
+import org.apache.spark.storage.{BlockId, StorageLevel}
+import org.apache.spark.streaming.util.{HdfsUtils, 
WriteAheadLogFileSegment, WriteAheadLogRandomReader}
+
+/**
+ * Partition class for [[org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD]].
+ * It contains information about the id of the blocks having this partition's data and
+ * the segment of the write ahead log that backs the partition.
+ * @param index index of the partition
+ * @param blockId id of the block having the partition data
+ * @param segment segment of the write ahead log having the partition data
+ */
+private[streaming]
+class WriteAheadLogBackedBlockRDDPartition(
+    val index: Int,
+    val blockId: BlockId,
+    val segment: WriteAheadLogFileSegment
+  ) extends Partition
+
+
+/**
+ * This class represents a special case of the BlockRDD where the data blocks in
+ * the block manager are also backed by segments in write ahead logs. For reading
+ * the data, this RDD first looks up the blocks by their ids in the block manager.
+ * If it does not find them, it looks up the corresponding file segment.
+ *
+ * @param sc SparkContext
+ * @param hadoopConfig Hadoop configuration
+ * @param blockIds Ids of the blocks that contain this RDD's data
+ * @param segments Segments in write ahead logs that contain this RDD's data
+ * @param storeInBlockManager Whether to store in the block manager after reading from the segment
+ * @param storageLevel storage level to store when storing in block manager
+ * (applicable when storeInBlockManager = true)
+ */
+private[streaming]
+class WriteAheadLogBackedBlockRDD[T: ClassTag](
+    @transient sc: SparkContext,
+    @transient hadoopConfig: Configuration,
+    @transient override val blockIds: Array[BlockId],
+    @transient val segments: Array[WriteAheadLogFileSegment],
+    val storeInBlockManager: Boolean,
+    val storageLevel: StorageLevel
+  ) extends BlockRDD[T](sc, blockIds) {
+
+  require(
+    blockIds.length == segments.length,
+    s"Number of block ids (${blockIds.length}) must be " +
+      s"the same as number of segments (${segments.length})!")
+
+  // Hadoop configuration is not serializable, so wrap it in a serializable wrapper.
+  private val broadcastedHadoopConf = new SerializableWritable(hadoopConfig)
+
+  override def getPartitions: Array[Partition] = {
+    assertValid()
+    Array.tabulate(blockIds.size) { i =>
+      new WriteAheadLogBackedBlockRDDPartition(i, blockIds(i), segments(i)) }
+  }
+
+  /**
+   * Gets the partition data by getting the corresponding block from the block manager.
+   * If the block does not exist, then the data is read from the corresponding segment
+   * in write ahead log files.
+   */
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+    assertValid()
+    val hadoopConf = broadcastedHadoopConf.value
+    val blockManager = SparkEnv.get.blockManager
+    val partition = split.asInstanceOf[WriteAheadLogBackedBlockRDDPartition]
+    val blockId = partition.blockId
+    blockManager.get(blockId) match {
+      case Some(block) => // Data is in Block Manager
+        val iterator = block.data.asInstanceOf[Iterator[T]]
+        logDebug(s"Read partition data of $this from block manager, block $blockId")
+        iterator
+      case None => //

[GitHub] spark pull request: [SPARK-4028][Streaming] ReceivedBlockHandler i...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2940#issuecomment-61043967
  
  [Test build #22520 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22520/consoleFull)
 for   PR 2940 at commit 
[`f192f47`](https://github.com/apache/spark/commit/f192f47d0e916e2b4b425581a4a76b7aaf782328).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4027][Streaming] HDFSBasedBlockRDD to r...

2014-10-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/2931#discussion_r19587299
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/rdd/HDFSBackedBlockRDD.scala
 ---
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.rdd
+
+import scala.reflect.ClassTag
+
+import org.apache.hadoop.conf.Configuration
+
+import org.apache.spark.rdd.BlockRDD
+import org.apache.spark.storage.{BlockId, StorageLevel}
+import org.apache.spark.streaming.util.{WriteAheadLogFileSegment, 
HdfsUtils, WriteAheadLogRandomReader}
+import org.apache.spark._
+
+private[streaming]
+class HDFSBackedBlockRDDPartition(
+    val blockId: BlockId,
+    val index: Int,
+    val segment: WriteAheadLogFileSegment
+  ) extends Partition
+
+private[streaming]
+class HDFSBackedBlockRDD[T: ClassTag](
+    @transient sc: SparkContext,
+    @transient hadoopConfiguration: Configuration,
+    @transient override val blockIds: Array[BlockId],
+    @transient val segments: Array[WriteAheadLogFileSegment],
+    val storeInBlockManager: Boolean,
+    val storageLevel: StorageLevel
+  ) extends BlockRDD[T](sc, blockIds) {
+
+  require(blockIds.length == segments.length,
+    "Number of block ids must be the same as number of segments!")
+
+  // Hadoop Configuration is not serializable, so broadcast it inside a serializable wrapper.
+  val broadcastedHadoopConf = sc.broadcast(new SerializableWritable(hadoopConfiguration))
+
+  override def getPartitions: Array[Partition] = {
+    assertValid()
+    (0 until blockIds.size).map { i =>
+      new HDFSBackedBlockRDDPartition(blockIds(i), i, segments(i))
+    }.toArray
+  }
+
+  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
+    assertValid()
+    val hadoopConf = broadcastedHadoopConf.value.value
+    val blockManager = SparkEnv.get.blockManager
+    val partition = split.asInstanceOf[HDFSBackedBlockRDDPartition]
+    val blockId = partition.blockId
+    blockManager.get(blockId) match {
+      // Data is in Block Manager, grab it from there.
+      case Some(block) =>
+        block.data.asInstanceOf[Iterator[T]]
+      // Data not found in Block Manager, grab it from HDFS
+      case None =>
+        logInfo("Reading partition data from write ahead log " + partition.segment.path)
+        val reader = new WriteAheadLogRandomReader(partition.segment.path, hadoopConf)
+        val dataRead = reader.read(partition.segment)
+        reader.close()
+        // Currently, we support storing the data to BM only in serialized
+        // form and not in deserialized form
+        if (storeInBlockManager) {
+          blockManager.putBytes(blockId, dataRead, storageLevel)
+        }
+        dataRead.rewind()
+        blockManager.dataDeserialize(blockId, dataRead).asInstanceOf[Iterator[T]]
+    }
+  }
+
+  override def getPreferredLocations(split: Partition): Seq[String] = {
+    val partition = split.asInstanceOf[HDFSBackedBlockRDDPartition]
+    val locations = getBlockIdLocations()
+    locations.getOrElse(partition.blockId,
--- End diff --

Isn't the alternative that Josh suggested more intuitive? All the alternatives 
are clearly on one line, and it does not have redundant code such as `case 
Some(loc) => loc`.
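
The read path in `compute` above follows a lookup-then-fallback pattern: try the block manager first, and only on a miss replay the write ahead log segment, optionally re-caching the result. A minimal Python sketch of that pattern (the function and parameter names here are illustrative, not Spark API):

```python
def read_partition(block_id, block_store, segment, read_segment,
                   store_in_block_manager=False):
    """Return partition data, preferring the in-memory block store."""
    data = block_store.get(block_id)
    if data is not None:
        return data  # fast path: the block manager still holds the block
    data = read_segment(segment)  # slow path: replay the WAL segment
    if store_in_block_manager:
        block_store[block_id] = data  # re-cache for subsequent reads
    return data
```

This mirrors the `Some(block)` / `None` match in the Scala code above.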





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61043719
  
  [Test build #22519 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22519/consoleFull)
 for   PR 2542 at commit 
[`b708fc7`](https://github.com/apache/spark/commit/b708fc7636143562b950fda5fda778e1cd447ae1).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4149][SQL] ISO 8601 support for json da...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3012#issuecomment-61043716
  
  [Test build #22518 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22518/consoleFull)
 for   PR 3012 at commit 
[`c62b7e2`](https://github.com/apache/spark/commit/c62b7e2b924ab2a9d9c21580be1e077a24b8eb5d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61043663
  
Two potential workarounds for this:
1. Change the Kryo version in Hive to fix the conflict.
2. Shade chill.
Any other ideas?






[GitHub] spark pull request: [SPARK-4028][Streaming] ReceivedBlockHandler i...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2940#issuecomment-61043455
  
  [Test build #22516 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22516/consoleFull)
 for   PR 2940 at commit 
[`df5f320`](https://github.com/apache/spark/commit/df5f3204afb1f3b6566df3dbed5f45b371c1ae67).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61043457
  
  [Test build #22517 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22517/consoleFull)
 for   PR 2542 at commit 
[`b708fc7`](https://github.com/apache/spark/commit/b708fc7636143562b950fda5fda778e1cd447ae1).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4028][Streaming] ReceivedBlockHandler i...

2014-10-29 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/2940#issuecomment-61043515
  
For reference to others: I spoke with @pwendell and @JoshRosen offline, and 
we decided that a slightly modified version of suggestion 3 (in my earlier 
comment) is the best middle ground that addresses all the concerns. What I have 
done is add a trait `ReceivedBlockStoreResult`. 
`ReceivedBlockHandler.storeBlock` returns a `ReceivedBlockStoreResult` object; 
the contents of that object are of no concern to `ReceiverSupervisorImpl` 
and are simply passed on. Implementations of `ReceivedBlockHandler` all return 
`ReceivedBlockStoreResult`, so no generic typing is needed. This keeps the 
complexity low while keeping the `ReceiverSupervisorImpl` code generic, and it 
addresses Patrick's concern that `Option[Any]` is non-intuitive.
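
The design described above can be sketched in Python: a marker base type that each handler returns, with the supervisor forwarding the result without inspecting it. The subclass and method names here are hypothetical stand-ins for the Scala trait hierarchy, not the PR's actual classes:

```python
from abc import ABC, abstractmethod

class ReceivedBlockStoreResult(ABC):
    """Marker type: whatever a handler reports after storing a block."""

class BlockManagerStoreResult(ReceivedBlockStoreResult):
    def __init__(self, block_id):
        self.block_id = block_id

class WALStoreResult(ReceivedBlockStoreResult):
    def __init__(self, block_id, segment):
        self.block_id = block_id
        self.segment = segment

class ReceivedBlockHandler(ABC):
    @abstractmethod
    def store_block(self, block_id, block) -> ReceivedBlockStoreResult:
        ...

class InMemoryHandler(ReceivedBlockHandler):
    """Toy handler that keeps blocks in a dict."""
    def __init__(self):
        self.blocks = {}

    def store_block(self, block_id, block):
        self.blocks[block_id] = block
        return BlockManagerStoreResult(block_id)

# The supervisor only forwards the result; it never inspects the subtype,
# which is what keeps it generic without `Option[Any]` or type parameters.
def supervise(handler, block_id, block):
    return handler.store_block(block_id, block)
```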





[GitHub] spark pull request: [SPARK-4149][SQL] ISO 8601 support for json da...

2014-10-29 Thread adrian-wang
GitHub user adrian-wang opened a pull request:

https://github.com/apache/spark/pull/3012

[SPARK-4149][SQL] ISO 8601 support for json date time strings

This implements the feature @davies mentioned in 
https://github.com/apache/spark/pull/2901#discussion-diff-19313312
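
For illustration, ISO 8601 timestamp strings of the kind this patch targets can be parsed by trying a few layouts in order. This is a hedged Python sketch (the helper name and the set of accepted formats are assumptions, not the patch's Scala implementation):

```python
from datetime import datetime

# Hypothetical helper: try a few common ISO 8601 layouts until one parses.
# Note: %z accepts a literal "Z" suffix on Python 3.7+.
def parse_json_timestamp(s):
    for fmt in ("%Y-%m-%dT%H:%M:%S.%f%z",
                "%Y-%m-%dT%H:%M:%S%z",
                "%Y-%m-%dT%H:%M:%S",
                "%Y-%m-%d"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"not a recognized ISO 8601 timestamp: {s!r}")

t = parse_json_timestamp("2014-10-30T04:06:09Z")
```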

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adrian-wang/spark iso8601

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3012.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3012


commit c62b7e2b924ab2a9d9c21580be1e077a24b8eb5d
Author: Daoyuan Wang 
Date:   2014-10-30T04:06:09Z

json data timestamp ISO8601 support







[GitHub] spark pull request: [SPARK-4137] [EC2] Don't change working dir on...

2014-10-29 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/2988#discussion_r19587131
  
--- Diff: ec2/spark_ec2.py ---
@@ -718,12 +726,16 @@ def get_num_disks(instance_type):
 return 1
 
 
-# Deploy the configuration file templates in a given local directory to
-# a cluster, filling in any template parameters with information about the
-# cluster (e.g. lists of masters and slaves). Files are only deployed to
-# the first master instance in the cluster, and we expect the setup
-# script to be run on that instance to copy them to other nodes.
 def deploy_files(conn, root_dir, opts, master_nodes, slave_nodes, modules):
+"""
+Deploy the configuration file templates in a given local directory to
--- End diff --

Yeah, I thought I'd make this the first change toward having all the 
function descriptions be in docstrings, but for consistency's sake you're 
right--it should be a comment on top.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61043371
  
It seems we cannot upgrade Kryo in Spark, since the latest chill depends on 
Kryo 2.21.





[GitHub] spark pull request: [SPARK-3688][SQL]LogicalPlan can't resolve col...

2014-10-29 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2542#issuecomment-61043355
  
retest this please





[GitHub] spark pull request: hive 0.13 test issue

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3004#issuecomment-61043157
  
  [Test build #22510 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22510/consoleFull)
 for   PR 3004 at commit 
[`a433434`](https://github.com/apache/spark/commit/a433434910b0d69b32f82e91bd47ded564d490b1).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4137] [EC2] Don't change working dir on...

2014-10-29 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/2988#issuecomment-61043164
  
Functionality LGTM. I left a minor style question for @JoshRosen 





[GitHub] spark pull request: hive 0.13 test issue

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3004#issuecomment-61043160
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22510/
Test FAILed.





[GitHub] spark pull request: [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH i...

2014-10-29 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/2711#issuecomment-61043113
  
@andrewor14 
I've tested on Linux (YARN, Mesos) and Mac OS X (standalone).





[GitHub] spark pull request: [SPARK-4137] [EC2] Don't change working dir on...

2014-10-29 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/2988#discussion_r19586967
  
--- Diff: ec2/spark_ec2.py ---
@@ -718,12 +726,16 @@ def get_num_disks(instance_type):
 return 1
 
 
-# Deploy the configuration file templates in a given local directory to
-# a cluster, filling in any template parameters with information about the
-# cluster (e.g. lists of masters and slaves). Files are only deployed to
-# the first master instance in the cluster, and we expect the setup
-# script to be run on that instance to copy them to other nodes.
 def deploy_files(conn, root_dir, opts, master_nodes, slave_nodes, modules):
+"""
+Deploy the configuration file templates in a given local directory to
--- End diff --

Should we change this style, given that other functions in this file have 
comments on top? Any thoughts, @JoshRosen?





[GitHub] spark pull request: [SPARK-4150][PySpark] return self in rdd.setNa...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3011#issuecomment-61042785
  
  [Test build #22515 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22515/consoleFull)
 for   PR 3011 at commit 
[`4ac3bbd`](https://github.com/apache/spark/commit/4ac3bbdba145d5f5bd3a40906c4ca08daee4d9a8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4150][PySpark] return self in rdd.setNa...

2014-10-29 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/3011

[SPARK-4150][PySpark] return self in rdd.setName

Then we can do `rdd.setName('abc').cache().count()`.
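
The change is the fluent-interface idiom: a setter that returns `self` instead of `None`, so calls can be chained. A toy sketch (this class is a stand-in for illustration, not PySpark's actual `RDD`):

```python
class RDD:
    """Toy stand-in showing why returning self enables chaining."""

    def __init__(self, data):
        self._data = list(data)
        self.name = None

    def setName(self, name):
        self.name = name
        return self  # returning self is what enables the chain below

    def cache(self):
        return self  # no-op here; the real cache() also returns self

    def count(self):
        return len(self._data)

n = RDD([1, 2, 3]).setName("abc").cache().count()
```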

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark rdd-setname

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3011.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3011


commit 4ac3bbdba145d5f5bd3a40906c4ca08daee4d9a8
Author: Xiangrui Meng 
Date:   2014-10-30T03:51:40Z

return self in rdd.setName







[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2983#issuecomment-61042108
  
  [Test build #22514 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22514/consoleFull)
 for   PR 2983 at commit 
[`69dba42`](https://github.com/apache/spark/commit/69dba425dd28877212e359887d8c6c86f527e4b8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61042000
  
I am testing with just upgrading Kryo in Spark, without excluding Hive's Kryo.





[GitHub] spark pull request: [SPARK-4148][PySpark] fix seed distribution an...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3010#issuecomment-61041820
  
  [Test build #22513 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22513/consoleFull)
 for   PR 3010 at commit 
[`c1bacd9`](https://github.com/apache/spark/commit/c1bacd9f46fe5559d4affa74dd986c79cced1611).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4148][PySpark] fix seed distribution an...

2014-10-29 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/3010

[SPARK-4148][PySpark] fix seed distribution and add some tests for 
rdd.sample

The current way of distributing seeds makes the random sequences sampled from 
partitions i and i+1 offset by one draw, so consecutive partitions produce 
correlated samples.

~~~
In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704
~~~
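One way to avoid this overlap is to derive a well-mixed seed per partition instead of seeding partition i with `seed + i`. The helper below is a minimal illustrative sketch (`partition_seed` is a hypothetical name, not the actual PySpark change):

```python
import random

def partition_seed(global_seed, partition_index):
    # Mix the global seed and the partition index through an RNG,
    # rather than using (global_seed + partition_index) directly,
    # so adjacent partitions do not get shifted copies of one stream.
    mixer = random.Random(hash((global_seed, partition_index)))
    return mixer.randrange(2 ** 32)

# Adjacent partitions now draw from unrelated streams:
s0 = random.Random(partition_seed(10, 0))
s1 = random.Random(partition_seed(10, 1))
draws0 = [s0.random() for _ in range(3)]
draws1 = [s1.random() for _ in range(3)]
```

The derivation stays deterministic for a fixed `(global_seed, partition_index)` pair, which is what sampling reproducibility requires.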

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark SPARK-4148

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3010.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3010


commit c1bacd9f46fe5559d4affa74dd986c79cced1611
Author: Xiangrui Meng 
Date:   2014-10-30T03:22:13Z

fix seed distribution and add some tests for rdd.sample







[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2994#issuecomment-61041536
  
  [Test build #22512 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22512/consoleFull)
 for   PR 2994 at commit 
[`0a45f1a`](https://github.com/apache/spark/commit/0a45f1ab5ba5f9440a78e47e48b48f0321d440c1).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4137] [EC2] Don't change working dir on...

2014-10-29 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/2988#issuecomment-61041277
  
@shivaram I took your suggestion and tested to make sure `spark-ec2` still 
creates a functioning EC2 cluster.

This is ready for another review.





[GitHub] spark pull request: [EC2] Factor out Mesos spark-ec2 branch

2014-10-29 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/3008#issuecomment-61041154
  
cc @shivaram @JoshRosen 





[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...

2014-10-29 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/2994#issuecomment-61040608
  
I talked to people working on Kafka, and they assure me it is thread-safe. 
Also see this: 

https://github.com/apache/flume/blob/trunk/flume-ng-channels/flume-kafka-channel/src/main/java/org/apache/flume/channel/kafka/KafkaChannel.java

There is a single producer that is written to by various threads; see the 
corresponding test, where it is written to from multiple threads. I have run it 
in loops several times on Travis and never seen a threading issue.

By creating a producer per partition, this issue is avoided anyway. For now, 
we can keep it simple with one producer per partition; if that turns out to be 
a problem, we can revert to the ProducerCache.
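The producer-per-partition pattern described above can be sketched as follows. `SimpleProducer` and `write_partition` are hypothetical stand-ins, not the real Kafka producer API:

```python
class SimpleProducer:
    """Hypothetical stand-in for a Kafka producer. One instance is
    created per partition, so no instance is ever shared across threads."""
    def __init__(self):
        self.sent = []

    def send(self, topic, message):
        self.sent.append((topic, message))

    def close(self):
        pass

def write_partition(topic, records):
    # One producer per partition: created, used, and closed entirely
    # within the task that processes this partition.
    producer = SimpleProducer()
    try:
        for record in records:
            producer.send(topic, record)
    finally:
        producer.close()
    return len(producer.sent)
```

In Spark this body would run inside `rdd.foreachPartition`, so each task owns its producer for the lifetime of the partition and thread-safety of the producer never comes into play.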





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61040483
  
Just excluding kryo is not enough; should we reshade the hive 0.13.1 jar? 
@pwendell 
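For reference, reshading would amount to a relocation like the one below when building the hive jar, so Hive's references keep resolving to its bundled copy. This is an illustrative maven-shade-plugin fragment under that assumption, not the actual Spark or Hive build configuration:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Rewrite kryo/objenesis references to the shaded package
           that Hive 0.13 code such as Utilities.java expects. -->
      <relocation>
        <pattern>com.esotericsoftware</pattern>
        <shadedPattern>org.apache.hive.com.esotericsoftware</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```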





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61040412
  
yeah, hive-0.13.1 (see 
https://github.com/apache/hive/blob/release-0.13.1/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L186)
uses ```com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy```, 
while com.twitter:chill_2.10:0.3.6 uses 
```org.objenesis.strategy.InstantiatorStrategy```:

```
class EmptyScalaKryoInstantiator extends KryoInstantiator {
  override def newKryo = {
    val k = new KryoBase
    k.setRegistrationRequired(false)
    k.setInstantiatorStrategy(new org.objenesis.strategy.StdInstantiatorStrategy)
    k
  }
}
```






[GitHub] spark pull request: Delete jetty 6.1.26 from spark package

2014-10-29 Thread KaiXinXiaoLei
Github user KaiXinXiaoLei commented on the pull request:

https://github.com/apache/spark/pull/2989#issuecomment-61040188
  
Using the maven-dependency-plugin to build Spark, I generated the dependency 
tree and found that Jetty 6 is introduced by hdfs, yarn, flume, and hbase. From 
the dependency tree output, here is just the Jetty 6 information.
Jetty 6 is brought in by hdfs when building spark-core:
[INFO] +- org.apache.hadoop:hadoop-client:jar:2.4.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-hdfs:jar:2.4.1:compile
[INFO] |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile

Jetty 6 is brought in by yarn when building spark-yarn:
[INFO] +- org.apache.hadoop:hadoop-yarn-server-web-proxy:jar:2.4.1:compile
[INFO] |  +- org.apache.hadoop:hadoop-yarn-server-common:jar:2.4.1:compile
[INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  \- org.mortbay.jetty:jetty:jar:6.1.26:compile
[INFO] +- org.apache.hadoop:hadoop-yarn-client:jar:2.4.1:compile
[INFO] |  +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile

Jetty 6 is brought in by flume when building spark-streaming-flume:
[INFO] +- 
org.apache.spark:spark-streaming-flume-sink_2.10:jar:1.2.0-SNAPSHOT:compile
[INFO] |  \- org.apache.flume:flume-ng-core:jar:1.4.0:compile
[INFO] | +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO] | +- org.mortbay.jetty:jetty:jar:6.1.26:compile

Jetty 6 is brought in by hbase when building spark-examples:
[INFO] +- org.apache.hbase:hbase:jar:0.94.6:compile
[INFO] |  +- org.mortbay.jetty:jetty:jar:6.1.26:compile
[INFO] |  +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
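If the intent is to drop Jetty 6 from the assembly, each of these paths would need an explicit exclusion. Below is an illustrative fragment for the hadoop-client case only (not actual Spark pom content; the other three dependencies would need the same treatment):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.4.1</version>
  <exclusions>
    <!-- Keep the transitive Jetty 6 artifacts out of the assembly. -->
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty-util</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```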





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61039848
  
In the assembly jar the class is at
```/org/objenesis/strategy/InstantiatorStrategy.class```
so the referenced class name seems wrong: it should be 
```org.objenesis.strategy.InstantiatorStrategy```, not the 
```com.esotericsoftware.shaded.``` version.





[GitHub] spark pull request: [EC2] Factor out Mesos spark-ec2 branch

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3008#issuecomment-61039757
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22507/
Test PASSed.





[GitHub] spark pull request: [EC2] Factor out Mesos spark-ec2 branch

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3008#issuecomment-61039753
  
  [Test build #22507 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22507/consoleFull)
 for   PR 3008 at commit 
[`10a6089`](https://github.com/apache/spark/commit/10a6089422fa81cb496363d13e428e33e58008a4).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61039529
  
Still failing, with this error:
```
..
Caused by: java.lang.ClassNotFoundException: 
com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy
```
I am checking whether this class is in the assembly jar.
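Since a jar is just a zip archive, that check can be scripted. A minimal sketch; the jar path in the comment is a placeholder:

```python
import zipfile

def class_in_jar(jar_path, class_name):
    """Return True if the dotted class name is present in the jar.
    Jars are plain zip archives, so we only need to look for the
    corresponding .class entry."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# e.g. class_in_jar("spark-assembly.jar",
#     "com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy")
```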





[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2983#issuecomment-61039238
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22511/
Test FAILed.





[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2983#issuecomment-61039237
  
  [Test build #22511 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22511/consoleFull)
 for   PR 2983 at commit 
[`3360a0e`](https://github.com/apache/spark/commit/3360a0ecb8e383a6d1ae9f023fe343af8418db90).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class DecimalType(DataType):`
  * `case class UnscaledValue(child: Expression) extends UnaryExpression `
  * `case class MakeDecimal(child: Expression, precision: Int, scale: Int) 
extends UnaryExpression `
  * `case class MutableLiteral(var value: Any, dataType: DataType, 
nullable: Boolean = true)`
  * `case class PrecisionInfo(precision: Int, scale: Int)`
  * `case class DecimalType(precisionInfo: Option[PrecisionInfo]) extends 
FractionalType `
  * `final class Decimal extends Ordered[Decimal] with Serializable `
  * `  trait DecimalIsConflicted extends Numeric[Decimal] `






[GitHub] spark pull request: [SPARK-3930] [SPARK-3933] Support fixed-precis...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2983#issuecomment-61039162
  
  [Test build #22511 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22511/consoleFull)
 for   PR 2983 at commit 
[`3360a0e`](https://github.com/apache/spark/commit/3360a0ecb8e383a6d1ae9f023fe343af8418db90).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

2014-10-29 Thread anantasty
Github user anantasty commented on a diff in the pull request:

https://github.com/apache/spark/pull/2952#discussion_r19585598
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -162,6 +162,40 @@ for((synonym, cosineSimilarity) <- synonyms) {
 }
 {% endhighlight %}
 
+
+{% highlight python %}
+# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
+# The file was unziped and split into multiple lines using
+# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
+# This was done so that the example can be run in local mode
+
+import sys
+
+from pyspark import SparkContext
+from pyspark.mllib.feature import Word2Vec
+
+USAGE = ("bin/spark-submit --driver-memory 4g "
--- End diff --

@davies To simplify the docs, should I just remove the Usage line and the 
creation of the context?





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61038986
  
  [Test build #22508 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22508/consoleFull)
 for   PR 2685 at commit 
[`18fb1ff`](https://github.com/apache/spark/commit/18fb1fff1c2a097604b573fffba92b9a7a3f3e8f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-61038997
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22508/
Test FAILed.




