[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112889519 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32655781 --- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala --- @@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String) def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) } + + /** Return all the nodes which are cached. */ + def getCachedNodes: Seq[RDDOperationNode] = { +val cachedNodes = _childNodes.filter(_.cached) +_childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached)) --- End diff -- style: ``` _childClusters.foreach { cluster => cachedNodes ++= cluster._childNodes.filter(_.cached) } ```
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112896107 [Test build #35049 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35049/consoleFull) for PR 6839 at commit [`f98728b`](https://github.com/apache/spark/commit/f98728bdbef0d3388f36928dccd573fa15bc6536).
[GitHub] spark pull request: [SPARK-8371][SQL] improve unit test for MaxOf ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6825#discussion_r32656460 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -109,7 +109,7 @@ trait ExpressionEvalHelper { } val actual = plan(inputRow) -val expectedRow = new GenericRow(Array[Any](CatalystTypeConverters.convertToCatalyst(expected))) +val expectedRow = InternalRow.fromSeq(Array(CatalystTypeConverters.convertToCatalyst(expected))) --- End diff -- Sounds reasonable; it's annoying to have so many `UTF8String.fromString` calls in test cases.
[GitHub] spark pull request: [SPARK-8161] Set externalBlockStoreInitialized...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/6702#issuecomment-112904974 LGTM
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/5694#discussion_r32660292 --- Diff: dev/run-tests.py --- @@ -0,0 +1,536 @@ +#!/usr/bin/env python2 + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import re +import sys +import shutil +import subprocess +from collections import namedtuple + +SPARK_HOME = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..") +USER_HOME = os.environ.get("HOME") + + +def get_error_codes(err_code_file): +"""Function to retrieve all block numbers from the `run-tests-codes.sh` +file to maintain backwards compatibility with the `run-tests-jenkins` +script""" + +with open(err_code_file, 'r') as f: +err_codes = [e.split()[1].strip().split('=') + for e in f if e.startswith("readonly")] +return dict(err_codes) + + +ERROR_CODES = get_error_codes(os.path.join(SPARK_HOME, "dev/run-tests-codes.sh")) + + +def exit_from_command_with_retcode(cmd, retcode): +print "[error] running", cmd, "; received return code", retcode --- End diff -- Minor nit / annoyance here: this ends up printing things like ``` [error] running ['/Users/joshrosen/Documents/Spark/dev/../build/mvn', '-Pyarn', '-Phadoop-2.3', '-Dhadoop.version=2.3.0', '-Pkinesis-asl', '-Phive', '-Phive-thriftserver', 'clean', 'package', '-DskipTests'] ; received return code 1 ``` which makes it hard to copy and paste the command to run it manually in the shell.
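JoshRosen's nit is that printing an argv list uses Python list syntax, which cannot be pasted into a shell. One way to address this — a sketch of the idea, not the fix actually adopted in the patch, and the helper name `format_cmd` is invented here — is to shell-quote each argument before joining. The sketch uses Python 3's `shlex.quote`; on the Python 2 the script targets, `pipes.quote` plays the same role.

```python
import shlex


def format_cmd(cmd):
    # Render an argv list as a single shell-safe string, so the printed
    # command can be copied and pasted into a terminal directly.
    # Arguments containing spaces or shell metacharacters get quoted;
    # plain arguments are left as-is.
    return " ".join(shlex.quote(arg) for arg in cmd)


cmd = ["./build/mvn", "-Pyarn", "-Dhadoop.version=2.3.0", "clean", "package", "-DskipTests"]
print("[error] running: %s; received return code 1" % format_cmd(cmd))
```

With this, the error line contains a command that can be re-run verbatim in the shell.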
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112914737 Merged build triggered.
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6833#issuecomment-112914725 Merged build triggered.
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5694
[GitHub] spark pull request: [SPARK-8077][SQL] Optimization for TreeNodes w...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6673
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32656672 --- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala --- @@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String) def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) } + + /** Return all the nodes which are cached. */ + def getCachedNodes: Seq[RDDOperationNode] = { +val cachedNodes = _childNodes.filter(_.cached) +_childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached)) --- End diff -- also, another way to rewrite this would be: ``` _childNodes.filter(_.cached) ++ _childClusters.flatMap(_.getCachedNodes) ``` I think it's both more concise and easier to read
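The reviewer's one-liner is recursive: filter the cached nodes at the current level, then delegate to each child cluster. The same shape can be sketched in Python with toy stand-in classes — `Node` and `Cluster` below are hypothetical simplifications, not Spark's actual `RDDOperationNode`/`RDDOperationCluster` types:

```python
class Node:
    def __init__(self, cached):
        self.cached = cached


class Cluster:
    # Hypothetical stand-in: a cluster holds direct child nodes
    # plus nested child clusters.
    def __init__(self, child_nodes=(), child_clusters=()):
        self.child_nodes = list(child_nodes)
        self.child_clusters = list(child_clusters)

    def get_cached_nodes(self):
        # Filter this level, then recurse into nested clusters --
        # the shape of the reviewer's suggested rewrite.
        cached = [n for n in self.child_nodes if n.cached]
        for c in self.child_clusters:
            cached.extend(c.get_cached_nodes())
        return cached
```

Unlike the original diff (which only looked one level down into `cluster._childNodes`), the recursive form also collects cached nodes from arbitrarily nested clusters.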
[GitHub] spark pull request: [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to P...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6354
[GitHub] spark pull request: Add ability to set additional tags
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/6857#issuecomment-112908003 Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6833#issuecomment-112914495 ok to test
[GitHub] spark pull request: [SPARK-8077][SQL] Optimization for TreeNodes w...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6673#issuecomment-112928014 Thanks! Merging to master.
[GitHub] spark pull request: [SPARK-7605] [MLlib] [PySpark] Python API for ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6346#discussion_r32670018 --- Diff: python/pyspark/mllib/feature.py --- @@ -525,6 +526,41 @@ def fit(self, data): return Word2VecModel(jmodel) +class ElementwiseProduct(VectorTransformer): + +""".. note:: Experimental + +Scales each column of the vector with the supplied weight vector, +i.e. the elementwise product. + +>>> weight = Vectors.dense([1.0, 2.0, 3.0]) +>>> eprod = ElementwiseProduct(weight) +>>> a = Vectors.dense([2.0, 1.0, 3.0]) +>>> eprod.transform(a) +DenseVector([2.0, 2.0, 9.0]) +>>> b = Vectors.dense([9.0, 3.0, 4.0]) +>>> rdd = sc.parallelize([a, b]) +>>> eprod.transform(rdd).collect() +[DenseVector([2.0, 2.0, 9.0]), DenseVector([9.0, 6.0, 12.0])] +""" + +def __init__(self, vector): +if not isinstance(vector, Vector): --- End diff -- It would be good to support list and np.array too.
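One way to honor the review suggestion is to coerce lists, tuples, and NumPy arrays into a dense representation at construction time. A standalone sketch of that idea with plain NumPy follows — it is an illustration of the input-coercion pattern, not PySpark's actual `Vectors` machinery, and the helper names are invented:

```python
import numpy as np


def to_dense(v):
    # Accept a list, tuple, or numpy array and coerce to a float ndarray --
    # the kind of leniency the reviewer is asking ElementwiseProduct for.
    if isinstance(v, np.ndarray):
        return v.astype(float)
    if isinstance(v, (list, tuple)):
        return np.asarray(v, dtype=float)
    raise TypeError("cannot convert %r to a dense vector" % (v,))


def elementwise_product(weight, vector):
    # Hadamard (elementwise) product of the weight vector and the input.
    return to_dense(weight) * to_dense(vector)
```

With this coercion in `__init__`, callers could pass `[1.0, 2.0, 3.0]` or `np.array([1.0, 2.0, 3.0])` interchangeably instead of being forced to build a `Vectors.dense` first.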
[GitHub] spark pull request: [SPARK-8371][SQL] improve unit test for MaxOf ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6825#discussion_r32656297 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -55,7 +55,7 @@ trait ExpressionEvalHelper { val actual = try evaluate(expression, inputRow) catch { case e: Exception => fail(s"Exception evaluating $expression", e) } -if (actual != expected) { +if (actual !== expected) { --- End diff -- Good catch!
[GitHub] spark pull request: [SPARK-7961][SQL]Refactor SQLConf to display b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6747#issuecomment-112911556 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8395] [DOCS] start-slave.sh docs incorr...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6855#issuecomment-112913218 LGTM, merging into master 1.4
[GitHub] spark pull request: Add ability to set additional tags
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6857#issuecomment-112913080 Hi @armisael, you need to file a JIRA here: https://issues.apache.org/jira/browse/SPARK. Once you have done that, could you change the title of this PR to link against that JIRA? e.g. ``` [SPARK-] [EC2] Add ability to set additional tags ```
[GitHub] spark pull request: spark ssc.textFileStream returns empty
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6837#issuecomment-112913685 @sduchh this is opened against the wrong branch. Please submit the change to the master branch.
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112927192 [Test build #35049 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35049/console) for PR 6839 at commit [`f98728b`](https://github.com/apache/spark/commit/f98728bdbef0d3388f36928dccd573fa15bc6536). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112927231 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112889548 Merged build started.
[GitHub] spark pull request: [SPARK-3561] Initial commit to provide pluggab...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2849#issuecomment-112894251 That's fine, but in the name of trying to clean up stale PRs, would you mind closing this PR? It's not mergeable and seems corrupted anyway. You can reopen another PR if you really want to.
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/5694#discussion_r32658498 --- Diff: dev/run-tests.py --- @@ -0,0 +1,536 @@ +#!/usr/bin/env python2 + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import re +import sys +import shutil +import subprocess +from collections import namedtuple + +SPARK_HOME = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..") +USER_HOME = os.environ.get("HOME") + + +def get_error_codes(err_code_file): +"""Function to retrieve all block numbers from the `run-tests-codes.sh` +file to maintain backwards compatibility with the `run-tests-jenkins` +script""" + +with open(err_code_file, 'r') as f: +err_codes = [e.split()[1].strip().split('=') + for e in f if e.startswith("readonly")] +return dict(err_codes) + + +ERROR_CODES = get_error_codes(os.path.join(SPARK_HOME, "dev/run-tests-codes.sh")) + + +def exit_from_command_with_retcode(cmd, retcode): +print "[error] running", cmd, "; received return code", retcode +sys.exit(int(os.environ.get("CURRENT_BLOCK", 255))) + + +def rm_r(path): +"""Given an arbitrary path properly remove it with the correct python +construct if it exists +- from: http://stackoverflow.com/a/9559881""" + +if os.path.isdir(path): +shutil.rmtree(path) +elif os.path.exists(path): +os.remove(path) + + +def run_cmd(cmd): +"""Given a command as a list of arguments will attempt to execute the +command from the determined SPARK_HOME directory and, on failure, print +an error message""" + +if not isinstance(cmd, list): +cmd = cmd.split() +try: +subprocess.check_call(cmd) +except subprocess.CalledProcessError as e: +exit_from_command_with_retcode(e.cmd, e.returncode) + + +def is_exe(path): +"""Check if a given path is an executable file +- from: http://stackoverflow.com/a/377028""" + +return os.path.isfile(path) and os.access(path, os.X_OK) + + +def which(program): +"""Find and return the given program by its absolute path or 'None' +- from: http://stackoverflow.com/a/377028""" + +fpath, fname = os.path.split(program) + +if fpath: +if is_exe(program): +return program +else: +for path in os.environ.get("PATH").split(os.pathsep): +path = path.strip('"') +exe_file = os.path.join(path, program) +if is_exe(exe_file): +return exe_file +return None + + +def determine_java_executable(): +"""Will return the path of the java executable that will be used by Spark's +tests or `None`""" + +# Any changes in the way that Spark's build detects java must be reflected +# here. Currently the build looks for $JAVA_HOME/bin/java then falls back to +# the `java` executable on the path + +java_home = os.environ.get("JAVA_HOME") + +# check if there is an executable at $JAVA_HOME/bin/java +java_exe = which(os.path.join(java_home, "bin", "java")) if java_home else None +# if the java_exe wasn't set, check for a `java` version on the $PATH +return java_exe if java_exe else which("java") + + +JavaVersion = namedtuple('JavaVersion', ['major', 'minor', 'patch', 'update']) + + +def determine_java_version(java_exe): +"""Given a valid java executable will return its version in named tuple format +with accessors '.major', '.minor', '.patch', '.update'""" + +raw_output = subprocess.check_output([java_exe, "-version"], + stderr=subprocess.STDOUT) +raw_version_str = raw_output.split('\n')[0] # eg 'java version "1.8.0_25"' +version_str = raw_version_str.split()[-1].strip('"') # eg '1.8.0_25' +
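The quoted diff is cut off just after extracting `version_str`. A plausible completion that turns a `java -version` string into the `JavaVersion` named tuple the diff defines might look like the sketch below — an illustration of the parsing step, not the exact code in the patch:

```python
from collections import namedtuple

# Mirrors the named tuple declared in the diff above.
JavaVersion = namedtuple("JavaVersion", ["major", "minor", "patch", "update"])


def parse_java_version(raw_version_str):
    # Parse a line like:  java version "1.8.0_25"
    # into JavaVersion(major=1, minor=8, patch=0, update=25).
    version_str = raw_version_str.split()[-1].strip('"')   # '1.8.0_25'
    version, _, update = version_str.partition("_")        # '1.8.0', '25'
    major, minor, patch = (int(x) for x in version.split("."))
    return JavaVersion(major, minor, patch, int(update or 0))
```

The `or 0` handles version strings with no `_NN` update suffix, which some JVMs emit.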
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112908986 Merged build triggered.
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112908908 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6833#issuecomment-112914749 Merged build started.
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112914683 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112914866 [Test build #35055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35055/consoleFull) for PR 6758 at commit [`6690a08`](https://github.com/apache/spark/commit/6690a080f10aa37a3a00d21f008e1570c812d4e2).
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112914732 Merged build triggered.
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112914763 Merged build started.
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112914747 Merged build started.
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112929808 Merged build triggered.
[GitHub] spark pull request: [SPARK-8218][SQL] Add binary log math function
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6725#discussion_r32653337

--- Diff: python/pyspark/sql/functions.py ---
@@ -404,6 +405,21 @@ def when(condition, value):
 @since(1.4)
--- End diff --

1.5
[GitHub] spark pull request: [SPARK-8056][SQL] Design an easier way to cons...
Github user ilganeli commented on a diff in the pull request: https://github.com/apache/spark/pull/6686#discussion_r32656127

--- Diff: python/pyspark/sql/types.py ---
@@ -368,8 +367,49 @@ def __init__(self, fields):
         struct1 == struct2
         False
-        assert all(isinstance(f, DataType) for f in fields), "fields should be a list of DataType"
-        self.fields = fields
+        if not fields:
+            self.fields = []
+        else:
+            self.fields = fields
+            assert all(isinstance(f, StructField) for f in fields),\
+                "fields should be a list of StructField"
+
+    def add(self, name_or_struct_field, data_type=NullType(), nullable=True, metadata=None):
--- End diff --

Davies - totally agree. This was changed specifically to consolidate to a single method, as suggested by Reynold. I initially had separate add methods - one which accepted a StructField and one which accepted the 4 parameters, the first two of which were defined. What would you suggest? My preference is to break this out into two methods for clarity and to avoid the problem you mention.

Thank you,
Ilya Ganelin

-----Original Message-----
From: Davies Liu [notificati...@github.com]
Sent: Wednesday, June 17, 2015 01:18 PM Eastern Standard Time
To: apache/spark
Cc: Ganelin, Ilya
Subject: Re: [spark] [SPARK-8056][SQL] Design an easier way to construct schema for both Scala and Python (#6686)

In python/pyspark/sql/types.py (https://github.com/apache/spark/pull/6686#discussion_r32650869): What's the use cases that we should have StructType without specifying the dataType of each column? In createDataFrame, if a schema of StructType is provided, it will not try to infer the data types, so it does not work with StructType with NoneType in it.
[GitHub] spark pull request: [SPARK-8320] [Streaming] Add example in stream...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/6862#discussion_r32658252

--- Diff: docs/streaming-programming-guide.md ---
@@ -1937,6 +1937,16 @@ JavaPairDStream<String, String> unifiedStream = streamingContext.union(kafkaStre
 unifiedStream.print();
 {% endhighlight %}
 </div>
+<div data-lang="python" markdown="1">
+{% highlight python %}
+numStreams = 5
+kafkaStreams = []
+for x in range (0, numStreams):
+    kafkaStreams = x.map{ KafkaUtils.createStream(…)}
--- End diff --

Hm, I don't think this can be correct? `x` is an integer; you can't map it. Python collections don't map like that anyway. Are you just trying to `append` the new stream to `kafkaStreams`? Nit: use `...` instead of the ellipsis character. Is the method really `show()` in Pyspark instead of `print()`?
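A minimal stand-alone sketch of the pattern the reviewer is pointing at: build the list with `append` in the loop, then union the collected streams once. The helpers `create_stream` and `union` below are hypothetical stand-ins, not the real `KafkaUtils.createStream` / `StreamingContext.union` API.

```python
# Hypothetical stand-ins for KafkaUtils.createStream and
# streamingContext.union; they only illustrate the list-building shape,
# not the real PySpark streaming API.
def create_stream(i):
    return "stream-%d" % i

def union(streams):
    # a real StreamingContext.union would merge DStreams; here we just
    # collect the stand-in values into one list
    return list(streams)

numStreams = 5
kafkaStreams = []
for x in range(numStreams):
    kafkaStreams.append(create_stream(x))   # append each stream, not map over x
unifiedStream = union(kafkaStreams)
```

The key difference from the snippet under review is that the loop accumulates into the list instead of reassigning `kafkaStreams` on every iteration.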
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112905346 @zsxwing Can you have a look, please?
[GitHub] spark pull request: [SPARK-8283][SQL] Resolve udf_struct test fail...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6828#issuecomment-112910402 Merged build started.
[GitHub] spark pull request: [SPARK-8283][SQL] Resolve udf_struct test fail...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6828#issuecomment-112910156 ok to test
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913681 Merged build triggered.
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913712 Merged build started.
[GitHub] spark pull request: [SPARK-7862][SQL]Fix the deadlock in script tr...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/6404#issuecomment-112913378

After this patch, the pull request builder logs show 5 lines of stdout output, which makes them hard to read:

```
[info] - test script transform for stdout (3 seconds, 806 milliseconds)
17:26:41.401 WARN org.apache.spark.scheduler.TaskSetManager: Stage 1316 contains a task of very large size (2246 KB). The maximum recommended task size is 100 KB.
1	1	1
2	2	2
3	3	3
4	4	4
5	5	5
6	6	6
7	7	7
8	8	8
9	9	9
10	10	10
11	11	11
12	12	12
13	13	13
14	14	14
15	15	15
16	16	16
17	17	17
18	18	18
19	19	19
20	20	20
21	21	21
22	22	22
23	23	23
24	24	24
25	25	25
[...]
```

At least I _think_ that this is the patch that caused this issue. If that's the case, could someone open up a followup PR to fix this?
[GitHub] spark pull request: [SPARK-7515] [DOC] Update documentation for Py...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6842#issuecomment-112913530 LGTM, merging into master and 1.4.
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/5659#issuecomment-112920385 test this please
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5659#issuecomment-112921278 [Test build #35056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35056/consoleFull) for PR 5659 at commit [`cfa79f8`](https://github.com/apache/spark/commit/cfa79f8f05153f58690b7cc5d84ff18770f2f3de).
[GitHub] spark pull request: [SPARK-8056][SQL] Design an easier way to cons...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6686#discussion_r32656949

--- Diff: python/pyspark/sql/types.py ---
@@ -368,8 +367,49 @@ def __init__(self, fields):
         struct1 == struct2
         False
-        assert all(isinstance(f, DataType) for f in fields), "fields should be a list of DataType"
-        self.fields = fields
+        if not fields:
+            self.fields = []
+        else:
+            self.fields = fields
+            assert all(isinstance(f, StructField) for f in fields),\
+                "fields should be a list of StructField"
+
+    def add(self, name_or_struct_field, data_type=NullType(), nullable=True, metadata=None):
--- End diff --

I think you could use `None` as default value of `dataType`, and raise an exception if `name_or_struct_field` is a string and `dataType` is None
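A hedged sketch of the API shape davies is suggesting. This is not the real `pyspark.sql.types` implementation; the simplified `StructField`/`StructType` classes below only illustrate using `None` as the default `data_type` and failing fast when a field is added by name without a type.

```python
# Simplified stand-ins for pyspark.sql.types classes, used only to
# illustrate the None-default + validation pattern under discussion.
class StructField(object):
    def __init__(self, name, data_type, nullable=True, metadata=None):
        self.name = name
        self.dataType = data_type
        self.nullable = nullable
        self.metadata = metadata or {}

class StructType(object):
    def __init__(self, fields=None):
        self.fields = fields or []

    def add(self, name_or_struct_field, data_type=None, nullable=True, metadata=None):
        if isinstance(name_or_struct_field, StructField):
            # a fully-formed field: ignore the remaining arguments
            self.fields.append(name_or_struct_field)
        elif isinstance(name_or_struct_field, str):
            if data_type is None:
                # adding by name without a type is ambiguous, so fail fast,
                # as suggested in the review comment
                raise ValueError("data_type is required when adding a field by name")
            self.fields.append(
                StructField(name_or_struct_field, data_type, nullable, metadata))
        else:
            raise TypeError("expected a StructField or a field name")
        return self  # allow chaining: StructType().add(...).add(...)
```

Keeping a single `add` with a `None` sentinel preserves Reynold's one-method API while still catching the invalid name-without-type combination at call time.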
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112900896 Approach looks fine to me. Once you address the comments I'll merge this.
[GitHub] spark pull request: [SPARK-8391][Core] More efficient usage of mem...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6859#issuecomment-112907518 @viirya I don't see how this patch uses less memory than before. The point of a string builder is to avoid the expensive copying of strings during concatenation. The changes here remove this optimization and actually use more memory by copying the strings.
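The cost being described (every concatenation copying the accumulated prefix) is language-neutral. A rough Python analogue of the builder-versus-concatenation trade-off, with `"".join` playing the role of Scala's `StringBuilder`:

```python
# Rough analogue of the StringBuilder argument: repeated concatenation
# re-copies the accumulated prefix on every step (quadratic total work),
# while collecting the pieces and joining once copies each character
# only once.
def concat_naive(parts):
    s = ""
    for p in parts:
        s = s + p          # copies len(s) + len(p) characters each time
    return s

def concat_builder(parts):
    return "".join(parts)  # single final copy, like a StringBuilder

parts = ["x"] * 1000
assert concat_naive(parts) == concat_builder(parts)
```

Both produce the same string; the difference is purely in how many intermediate copies are made along the way.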
[GitHub] spark pull request: [SPARK-5673] [MLlib] Implement Streaming wrapp...
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4456#issuecomment-112907685 I'm not an expert but it seems that the PR topic is out of place.
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/5694#discussion_r32661175

--- Diff: dev/run-tests.py ---
@@ -0,0 +1,536 @@
+#!/usr/bin/env python2
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import re
+import sys
+import shutil
+import subprocess
+from collections import namedtuple
+
+SPARK_HOME = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..")
+USER_HOME = os.environ.get("HOME")
+
+
+def get_error_codes(err_code_file):
+    """Function to retrieve all block numbers from the `run-tests-codes.sh`
+    file to maintain backwards compatibility with the `run-tests-jenkins`
+    script"""
+
+    with open(err_code_file, 'r') as f:
+        err_codes = [e.split()[1].strip().split('=')
+                     for e in f if e.startswith("readonly")]
+        return dict(err_codes)
+
+
+ERROR_CODES = get_error_codes(os.path.join(SPARK_HOME, "dev/run-tests-codes.sh"))
+
+
+def exit_from_command_with_retcode(cmd, retcode):
+    print "[error] running", cmd, "; received return code", retcode
+    sys.exit(int(os.environ.get("CURRENT_BLOCK", 255)))
+
+
+def rm_r(path):
+    """Given an arbitrary path properly remove it with the correct python
+    construct if it exists
+    - from: http://stackoverflow.com/a/9559881"""
+
+    if os.path.isdir(path):
+        shutil.rmtree(path)
+    elif os.path.exists(path):
+        os.remove(path)
+
+
+def run_cmd(cmd):
+    """Given a command as a list of arguments will attempt to execute the
+    command from the determined SPARK_HOME directory and, on failure, print
+    an error message"""
+
+    if not isinstance(cmd, list):
+        cmd = cmd.split()
+    try:
+        subprocess.check_call(cmd)
+    except subprocess.CalledProcessError as e:
+        exit_from_command_with_retcode(e.cmd, e.returncode)
+
+
+def is_exe(path):
+    """Check if a given path is an executable file
+    - from: http://stackoverflow.com/a/377028"""
+
+    return os.path.isfile(path) and os.access(path, os.X_OK)
+
+
+def which(program):
+    """Find and return the given program by its absolute path or 'None'
+    - from: http://stackoverflow.com/a/377028"""
+
+    fpath, fname = os.path.split(program)
+
+    if fpath:
+        if is_exe(program):
+            return program
+    else:
+        for path in os.environ.get("PATH").split(os.pathsep):
+            path = path.strip('"')
+            exe_file = os.path.join(path, program)
+            if is_exe(exe_file):
+                return exe_file
+    return None
+
+
+def determine_java_executable():
+    """Will return the path of the java executable that will be used by Spark's
+    tests or `None`"""
+
+    # Any changes in the way that Spark's build detects java must be reflected
+    # here. Currently the build looks for $JAVA_HOME/bin/java then falls back to
+    # the `java` executable on the path
+
+    java_home = os.environ.get("JAVA_HOME")
+
+    # check if there is an executable at $JAVA_HOME/bin/java
+    java_exe = which(os.path.join(java_home, "bin", "java")) if java_home else None
+    # if the java_exe wasn't set, check for a `java` version on the $PATH
+    return java_exe if java_exe else which("java")
+
+
+JavaVersion = namedtuple('JavaVersion', ['major', 'minor', 'patch', 'update'])
+
+
+def determine_java_version(java_exe):
+    """Given a valid java executable will return its version in named tuple format
+    with accessors '.major', '.minor', '.patch', '.update'"""
+
+    raw_output = subprocess.check_output([java_exe, "-version"],
+                                         stderr=subprocess.STDOUT)
+    raw_version_str = raw_output.split('\n')[0]  # eg 'java version "1.8.0_25"'
+    version_str = raw_version_str.split()[-1].strip('"')  # eg '1.8.0_25'
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112914027 Hi @andrewor14, As I said on the JIRA, I'm not able to post screenshots at this time, sorry, but I might have the time this weekend.
[GitHub] spark pull request: [SPARK-8372] History server shows incorrect in...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6827#discussion_r32662931

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -282,8 +282,12 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     val newAttempts = logs.flatMap { fileStatus =>
       try {
         val res = replay(fileStatus, bus)
-        logInfo(s"Application log ${res.logPath} loaded successfully.")
-        Some(res)
+        res match {
+          case Some(r) => logDebug(s"Application log ${r.logPath} loaded successfully.")
--- End diff --

oh never mind, just saw @vanzin's comment
[GitHub] spark pull request: [SPARK-8372] History server shows incorrect in...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6827#discussion_r32663412

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -431,7 +435,9 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
    * Replays the events in the specified log file and returns information about the associated
    * application.
    */
-  private def replay(eventLog: FileStatus, bus: ReplayListenerBus): FsApplicationAttemptInfo = {
+  private def replay(
+      eventLog: FileStatus,
+      bus: ReplayListenerBus): Option[FsApplicationAttemptInfo] = {
--- End diff --

need to update the javadoc to explain when we return `Some(...)` vs `None`
[GitHub] spark pull request: [SPARK-7289] handle project - limit - sort e...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/6780#discussion_r32668112

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala ---
@@ -160,17 +166,22 @@ case class TakeOrdered(limit: Int, sortOrder: Seq[SortOrder], child: SparkPlan)
   private val ord: RowOrdering = new RowOrdering(sortOrder, child.output)

-  private def collectData(): Array[InternalRow] =
-    child.execute().map(_.copy()).takeOrdered(limit)(ord)
+  @transient private val projection = projectList.map(newProjection(_, child.output))
--- End diff --

How did you see errors here? I removed `@transient` and `sbt sql/test` still works for me.
[GitHub] spark pull request: [SPARK-7605] [MLlib] [PySpark] Python API for ...
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/6346#issuecomment-112931617 ping @davies Would you be able to have a look at this?
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32656904

--- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala ---
@@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String)
   def getAllNodes: Seq[RDDOperationNode] = {
     _childNodes ++ _childClusters.flatMap(_.childNodes)
   }
+
+  /** Return all the nodes which are cached. */
+  def getCachedNodes: Seq[RDDOperationNode] = {
+    val cachedNodes = _childNodes.filter(_.cached)
+    _childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached))
--- End diff --

I see, is it because we clone fewer nodes? AFAIK `++` on ArrayBuffer actually clones the entire thing first
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32656910

--- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala ---
@@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String)
   def getAllNodes: Seq[RDDOperationNode] = {
     _childNodes ++ _childClusters.flatMap(_.childNodes)
   }
+
+  /** Return all the nodes which are cached. */
+  def getCachedNodes: Seq[RDDOperationNode] = {
+    val cachedNodes = _childNodes.filter(_.cached)
+    _childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached))
+    cachedNodes
--- End diff --

another way to rewrite this would be:
```
_childNodes.filter(_.cached) ++ _childClusters.flatMap(_.getCachedNodes)
```
I think it's both more concise and easier to read
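For readers less familiar with Scala's `filter`/`flatMap` one-liner, a hypothetical Python rendering of the suggested recursive rewrite. The `Cluster` class below is a stand-in model, not Spark's actual `RDDOperationCluster`.

```python
# Stand-in model of a UI operation cluster, used only to illustrate the
# suggested "filter own nodes, then flatMap over child clusters" rewrite.
class Cluster(object):
    def __init__(self, nodes=None, child_clusters=None):
        self.nodes = nodes or []                    # list of (name, cached) pairs
        self.child_clusters = child_clusters or []  # nested Cluster objects

    def cached_nodes(self):
        # equivalent of:
        #   _childNodes.filter(_.cached) ++ _childClusters.flatMap(_.getCachedNodes)
        own = [n for n in self.nodes if n[1]]
        nested = [n for c in self.child_clusters for n in c.cached_nodes()]
        return own + nested
```

Note that, unlike the version under review, the suggested form recurses into arbitrarily deep child clusters rather than stopping one level down.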
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112909451 [Test build #35050 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35050/consoleFull) for PR 6852 at commit [`d464211`](https://github.com/apache/spark/commit/d464211afe6805117cd4d56957a6f773bc8ae3a8).
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112909043 Merged build started.
[GitHub] spark pull request: [SPARK-7961][SQL]Refactor SQLConf to display b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6747#issuecomment-112911506 [Test build #35047 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35047/console) for PR 6747 at commit [`7d09bad`](https://github.com/apache/spark/commit/7d09bad23a7ac25c90735079b01984fa307a6f73).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SetCommand(kv: Option[(String, Option[String])]) extends RunnableCommand with Logging `
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112914311 [Test build #35052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35052/consoleFull) for PR 6845 at commit [`edd0936`](https://github.com/apache/spark/commit/edd093623f01e69cbef00b24b67809afea5ce49d).
[GitHub] spark pull request: [SPARK-8010][SQL]Promote types to StringType a...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/6551#discussion_r32668471
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -39,6 +39,16 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll with SQLTestUtils {
   val sqlContext = TestSQLContext
   import sqlContext.implicits._
+  test("SPARK-8010: promote numeric to string") {
+    val df = Seq((1, 1)).toDF("key", "value")
+    df.registerTempTable("src")
+    val queryCaseWhen = sql("select case when true then 1.0 else '1' end from src")
+    val queryCoalesce = sql("select coalesce(null, 1, '1') from src")
--- End diff --
Seems Hive will use StringType. I am fine with that.
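The coercion being discussed, where branches of mixed numeric and string type are promoted to StringType, can be sketched outside Spark. This is a simplified illustration only; Spark's real rules live in Catalyst's TypeCoercion and are more involved, and the type names here are hypothetical:

```python
# Simplified sketch of Hive-like "promote to string" coercion for the
# branches of CASE WHEN / coalesce. Illustrative only, not Spark's code.

def common_type(types):
    """Pick a common result type for a list of branch types."""
    if len(set(types)) == 1:
        return types[0]
    numeric = {"int", "double", "decimal"}
    if all(t in numeric for t in types):
        return "double"  # widen within the numeric family
    if "string" in types:
        return "string"  # mixed numeric/string promotes to string
    return None

# CASE WHEN true THEN 1.0 ELSE '1' END  ->  string result
print(common_type(["decimal", "string"]))  # string
```

Under this sketch, `coalesce(null, 1, '1')` would likewise resolve to a string result, which matches the Hive behavior yhuai mentions.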
[GitHub] spark pull request: [SPARK-5673] [MLlib] Implement Streaming wrapp...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/4456#discussion_r32658912
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLassoWithSGD.scala ---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+
+/**
+ * :: Experimental ::
+ * Train or predict a linear regression model on streaming data. Training uses
+ * Stochastic Gradient Descent to update the model based on each new batch of
+ * incoming data from a DStream (see `LinearRegressionWithSGD` for model equation).
+ *
+ * Each batch of data is assumed to be an RDD of LabeledPoints.
+ * The number of data points per batch can vary, but the number
+ * of features must be constant. An initial weight
+ * vector must be provided.
+ *
+ * Use a builder pattern to construct a streaming linear regression
+ * analysis in an application, like:
+ *
+ *   val model = new StreamingLassoWithSGD()
+ *     .setStepSize(0.5)
+ *     .setNumIterations(10)
+ *     .setInitialWeights(Vectors.dense(...))
+ *     .trainOn(DStream)
+ */
+@Experimental
+class StreamingLassoWithSGD private[mllib](
--- End diff --
Since Lasso, Ridge and LinearRegression have almost similar methods, I think it might be better to have an abstract class with all three deriving from it and a protected `algorithm` method, to avoid code duplication. WDYT?
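The refactor MechCoder suggests, a shared abstract base holding the builder-style setters with each concrete regressor supplying only its `algorithm`, can be sketched like this. All names below are hypothetical, not MLlib's API:

```python
# Hypothetical sketch of the suggested refactor: the chainable setters live
# in one abstract base, and each subclass only overrides `algorithm`.
from abc import ABC, abstractmethod

class StreamingRegressionBase(ABC):
    def __init__(self):
        self.step_size = 0.1
        self.num_iterations = 50

    def set_step_size(self, s):
        self.step_size = s
        return self  # builder pattern: each setter returns self for chaining

    def set_num_iterations(self, n):
        self.num_iterations = n
        return self

    @abstractmethod
    def algorithm(self):
        """Name of the underlying batch algorithm (illustrative)."""

class StreamingLasso(StreamingRegressionBase):
    def algorithm(self):
        return "LassoWithSGD"

class StreamingRidge(StreamingRegressionBase):
    def algorithm(self):
        return "RidgeRegressionWithSGD"

model = StreamingLasso().set_step_size(0.5).set_num_iterations(10)
print(model.algorithm())  # LassoWithSGD
```

This keeps the duplicated setter plumbing in one place, which is the point of the review comment.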
[GitHub] spark pull request: [SPARK-8283][SQL] Resolve udf_struct test fail...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6828#issuecomment-112910383 Merged build triggered.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/6832#discussion_r32659702
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with `sudo port install gsed`
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --
I don't think that will work, and I 80% remember why -- the published POM doesn't have any notion of activated profiles, and so the artifacts will default to expressing a dependency on 2.10. That's why it had to be hard-coded and changed on update to 2.11.
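The effect of the sed expression in the script, rewriting the Scala suffix of every `artifactId` in the POMs, can be mirrored with a plain regex. This is only an illustration of what the substitution does:

```python
# What the script's sed expression does to each pom.xml line:
# capture everything from "artifactId" up to the version suffix, then
# swap the old Scala version for the new one.
import re

line = "<artifactId>spark-core_2.10</artifactId>"
updated = re.sub(r"(artifactId.*)_2\.10", r"\g<1>_2.11", line)
print(updated)  # <artifactId>spark-core_2.11</artifactId>
```

As srowen notes, this textual rewrite is needed precisely because the published POM cannot express the choice via an activated profile.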
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913463 Hi @BenFradet, could you post a screenshot of what this looks like before and after your change?
[GitHub] spark pull request: [SPARK-8371][SQL] improve unit test for MaxOf ...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/6825#discussion_r32661616
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala ---
@@ -109,7 +109,7 @@ trait ExpressionEvalHelper {
   }
   val actual = plan(inputRow)
-    val expectedRow = new GenericRow(Array[Any](CatalystTypeConverters.convertToCatalyst(expected)))
+    val expectedRow = InternalRow.fromSeq(Array(CatalystTypeConverters.convertToCatalyst(expected)))
   if (actual.hashCode() != expectedRow.hashCode()) {
--- End diff --
If different types of rows have different hash code implementations then we cannot use them as keys in a hash table. This is to check that they all share an implementation.
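The point marmbrus makes, that all row implementations must share a hash code implementation or hash-based lookups silently break, can be demonstrated in miniature. The class names below are illustrative stand-ins, not Catalyst's classes:

```python
# Two "row" representations that compare equal must also hash equally,
# or one cannot be used to look up the other in a hash table.

class GenericRow:
    def __init__(self, values):
        self.values = tuple(values)
    def __eq__(self, other):
        return self.values == getattr(other, "values", None)
    def __hash__(self):
        return hash(self.values)  # shared implementation: hash the contents

class SpecializedRow(GenericRow):
    pass  # a different row type, but same __hash__/__eq__ over contents

table = {GenericRow([1, "a"]): "cached"}
# Lookup with the other row type still hits, because the hashes agree:
print(table[SpecializedRow([1, "a"])])  # cached
```

If `SpecializedRow` defined its own `__hash__`, the lookup above would raise `KeyError` even though the rows compare equal, which is exactly the failure the test guards against.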
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913386 add to whitelist
[GitHub] spark pull request: [SPARK-8372] History server shows incorrect in...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6827#discussion_r32662634
--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -282,8 +282,12 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
   val newAttempts = logs.flatMap { fileStatus =>
     try {
       val res = replay(fileStatus, bus)
-      logInfo(s"Application log ${res.logPath} loaded successfully.")
-      Some(res)
+      res match {
+        case Some(r) => logDebug(s"Application log ${r.logPath} loaded successfully.")
--- End diff --
should this be info?
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112915414 [Test build #35054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35054/consoleFull) for PR 6831 at commit [`1fec395`](https://github.com/apache/spark/commit/1fec3958926cef498a1d8ca6ce5746afec5423c3).
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112939057 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7605] [MLlib] [PySpark] Python API for ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6346#issuecomment-112945746 Merged build finished. Test FAILed.
[GitHub] spark pull request: [MLlib] [SPARK-7667] MLlib Python API consiste...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6856#discussion_r32673593
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -106,12 +109,12 @@ def predict(self, x):
             best_distance = distance
         return best
-    def computeCost(self, rdd):
+    def computeCost(self, data):
--- End diff --
I'm afraid it's too late to make changes like this to the API. The release is already done.
[GitHub] spark pull request: [MLlib] [SPARK-7667] MLlib Python API consiste...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6856#discussion_r32673610
--- Diff: python/pyspark/mllib/tree.py ---
@@ -90,9 +92,11 @@ def predict(self, x):
         else:
             return self.call("predict", _convert_to_vector(x))
+    @property
--- End diff --
I agree. Can you please change the doc test back to what it was to make sure we don't break APIs? If we can't support these as properties, that is OK.
[GitHub] spark pull request: [MLlib] [SPARK-7667] MLlib Python API consiste...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6856#discussion_r32673601
--- Diff: python/pyspark/mllib/feature.py ---
@@ -123,20 +132,6 @@ class StandardScalerModel(JavaVectorTransformer):
     Represents a StandardScaler model that can transform vectors.
-    def transform(self, vector):
-        """
-        Applies standardization transformation on a vector.
-
-        Note: In Python, transform cannot currently be used within
-          an RDD transformation or action.
-          Call transform directly on the RDD instead.
-
-        :param vector: Vector or RDD of Vector to be standardized.
-        :return: Standardized vector. If the variance of a column is
--- End diff --
This part is specific to this transformer. Can you please add it somewhere in the doc for StandardScalerModel?
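The API-compatibility concern running through this thread can be made concrete: once a released API exposes something as a method, turning it into a `@property` breaks every caller that invokes it with parentheses. The class and attribute names below are illustrative, not MLlib's:

```python
# Why changing a released method into a property breaks callers.

class TreeModelMethod:
    def numNodes(self):        # released API shape: a method
        return 15

class TreeModelProperty:
    @property
    def numNodes(self):        # proposed shape: attribute access only
        return 15

m = TreeModelMethod()
p = TreeModelProperty()
print(m.numNodes(), p.numNodes)  # 15 15
# Existing user code written as p.numNodes() would now raise
# TypeError: 'int' object is not callable.
```

This is why jkbradley asks to restore the doc test: it pins the released calling convention so such a break is caught.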
[GitHub] spark pull request: [SPARK-8320] [Streaming] Add example in stream...
Github user nssalian commented on the pull request: https://github.com/apache/spark/pull/6862#issuecomment-112947266 @srowen, I changed the Kafka append, the loop structure and the print method call. Thank you.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-112956321 [Test build #35062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35062/consoleFull) for PR 6796 at commit [`8f6445c`](https://github.com/apache/spark/commit/8f6445c25fa62cd9fa3cb28b7441dd19d8692c6b).
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112958798 If I understand correctly, you are just delaying the failure until `checkAnalysis`. A few followup questions: does that mean you don't check analysis in your code path? Does your custom logical plan produce new attribute references?
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112958839 ok to test
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112959854 Merged build started.
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112959823 Merged build triggered.
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5659
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6852
[GitHub] spark pull request: [SPARK-7712] [SQL] Move Window Functions from ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6278#issuecomment-112963381 Merged build started.
[GitHub] spark pull request: [SPARK-8376][Docs]Add common lang3 to the Spar...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/6829#issuecomment-112969135 I think the assembly is a good idea. Though for that we will have to:
1. Publish the assembly JAR instead of the package JAR. It will be cumbersome to add an additional flume-sink-assembly directory for this. Maybe we can make the existing flume-sink project generate and publish the assembly instead of the package.
2. Update the instructions.
I am not sure this will have much impact on existing deployments, because they are supposed to download and run the version of the sink that is necessary for the version of Spark they are running. @harishreedharan What do you think about this?
[GitHub] spark pull request: [SPARK-8397][SQL] Allow custom configuration f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6844
[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112971247 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8406] [SQL] Adding UUID to output file ...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/6864 [SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-8406 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6864.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6864 commit 0eb6baca71445b9a1c6a26ed72c13981c1e45ea6 Author: Cheng Lian l...@databricks.com Date: 2015-06-17T22:40:20Z Adding UUID to output file name to avoid accidental overwriting
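The idea behind SPARK-8406, salting each output file name with a per-write-job UUID so concurrent or retried writes cannot clobber one another, can be sketched briefly. The exact file-name layout below is illustrative, not Spark's actual scheme:

```python
# Sketch: one UUID per write job, shared by all of that job's part files,
# so two jobs writing into the same directory produce disjoint file names.
import uuid

job_uuid = uuid.uuid4().hex  # generated once per write job

def output_file_name(task_id: int, extension: str = "parquet") -> str:
    return f"part-r-{task_id:05d}-{job_uuid}.{extension}"

print(output_file_name(3))
```

Two independent jobs would draw different `job_uuid` values, so even identical task ids map to distinct file names.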
[GitHub] spark pull request: [SPARK-8333] [SQL] Spark failed to delete temp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6858#issuecomment-112977799 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8301][SQL] Improve UTF8String substring...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/6804#discussion_r32690229
--- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -125,30 +129,37 @@ public UTF8String substring(final int start, final int until) {
   }
   public boolean contains(final UTF8String substring) {
+    if (substring == null) return false;
     final byte[] b = substring.getBytes();
     if (b.length == 0) {
       return true;
     }
     for (int i = 0; i <= bytes.length - b.length; i++) {
-      // TODO: Avoid copying.
-      if (bytes[i] == b[0] && Arrays.equals(Arrays.copyOfRange(bytes, i, i + b.length), b)) {
+      if (bytes[i] == b[0] && startsWith(b, i)) {
         return true;
       }
     }
     return false;
   }
+  private boolean startsWith(final byte[] prefix, int offset) {
--- End diff --
how about renaming it offsetInBytes?
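The optimization under review, checking a prefix match in place at each candidate offset rather than copying a slice per step (the copy is what the removed `Arrays.copyOfRange` did), can be mirrored over Python `bytes`:

```python
# In-place prefix check at an offset -- no per-iteration array copy.

def starts_with(data: bytes, prefix: bytes, offset: int) -> bool:
    if offset + len(prefix) > len(data):
        return False
    return all(data[offset + i] == prefix[i] for i in range(len(prefix)))

def contains(data: bytes, sub: bytes) -> bool:
    if len(sub) == 0:
        return True  # empty substring is always contained
    # Cheap first-byte filter before the full prefix check, as in the diff.
    return any(
        data[i] == sub[0] and starts_with(data, sub, i)
        for i in range(len(data) - len(sub) + 1)
    )

print(contains(b"hello world", b"lo w"))  # True
```

The first-byte comparison keeps the common case cheap; the full check only runs at offsets where the first byte already matches.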
[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112971221 [Test build #35060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35060/console) for PR 6788 at commit [`2875bf4`](https://github.com/apache/spark/commit/2875bf49a4fea8027d2c8224d6d4b2bed09893d5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684159
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml
+
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
+import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
+import org.apache.spark.mllib.evaluation.MulticlassMetrics
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.hive.HiveContext
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example of an end to end machine learning pipeline that classifies text
+ * into one of twenty possible news categories. The dataset is the 20newsgroups
+ * dataset (http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz)
+ *
+ * We assume some minimal preprocessing of this dataset has been done to unzip the dataset and
+ * load the data into HDFS as follows:
+ *   wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
+ *   tar -xvzf 20news-bydate.tar.gz
+ *   hadoop fs -mkdir ${20news.root.dir}
+ *   hadoop fs -copyFromLocal 20news-bydate-train/ ${20news.root.dir}
+ *   hadoop fs -copyFromLocal 20news-bydate-test/ ${20news.root.dir}
+ *
+ * This example uses Hive to schematize the data as tables, in order to map the folder
+ * structure ${20news.root.dir}/{20news-bydate-train, 20news-bydate-test}/{newsgroup}/
+ * to partition columns type, newsgroup, resulting in a dataset with three columns:
+ * type, newsgroup, text
+ *
+ * In order to run this example, Spark needs to be built with Hive, and at runtime there
+ * should be a valid hive-site.xml in $SPARK_HOME/conf with at minimum the following
+ * configuration:
+ *   <configuration>
+ *     <property>
+ *       <name>hive.metastore.uris</name>
+ *       <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
+ *       <value>thrift://${thriftserver.host}:${thriftserver.port}</value>
+ *       <description>URI for client to contact metastore server</description>
+ *     </property>
+ *   </configuration>
+ *
+ * Run with
+ * {{{
+ * bin/spark-submit --class org.apache.spark.examples.ml.ComplexPipelineExample
+ *   --driver-memory 4g [examples JAR path] ${20news.root.dir}
+ * }}}
+ */
+object ComplexPipelineExample {
+
+  def main(args: Array[String]): Unit = {
+    val conf = new SparkConf().setAppName("ComplexPipelineExample")
+    val sc = new SparkContext(conf)
+    val sqlContext = new HiveContext(sc)
+    val path = args(0)
+
+    sqlContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS 20NEWS(text String)
+      PARTITIONED BY (type String, newsgroup String)
+      STORED AS TEXTFILE location '$path'""")
+
+    val newsgroups = Array("alt.atheism", "comp.graphics",
+      "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware",
+      "comp.sys.mac.hardware", "comp.windows.x", "misc.forsale",
+      "rec.autos", "rec.motorcycles", "rec.sport.baseball",
+      "rec.sport.hockey", "sci.crypt", "sci.electronics",
+      "sci.med", "sci.space", "soc.religion.christian",
+      "talk.politics.guns", "talk.politics.mideast",
+      "talk.politics.misc", "talk.religion.misc")
+
+    for (t <- Array("20news-bydate-train", "20news-bydate-test")) {
+      for (newsgroup <- newsgroups) {
+        sqlContext.sql(
+          s"""ALTER TABLE 20NEWS ADD IF NOT EXISTS
+             | PARTITION(type='$t', newsgroup='$newsgroup') LOCATION '$path/$t/$newsgroup/'
+           """.stripMargin)
+      }
+    }
+
+    // shuffle the data
+    val partitions = 100
+    val data = sqlContext.sql("SELECT * FROM 20NEWS")
+      .coalesce(partitions) // by default we have over 19k partitions
+      .repartition(partitions)
+      .cache()
+
+    import
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684165 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala --- (quotes the same diff as above)
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684148 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+ * This example uses Hive to schematize the data as tables, in order to map the folder
+ * structure ${20news.root.dir}/{20news-bydate-train, 20news-bydate-train}/{newsgroup}/
--- End diff --
one should be test
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684157 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+    for (t <- Array("20news-bydate-train", "20news-bydate-train")) {
--- End diff --
I assume one of these should be test. A comment on what this is doing would help.
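Taking the two suggestions above together, a sketch of how the loop might read with the presumed fix ("test" replacing the duplicated "train" literal) and an explanatory comment; table, path, and variable names are taken from the quoted diff, and the snippet assumes `sqlContext`, `path`, and `newsgroups` are in scope as in the example:

```scala
// Register one Hive partition per (split, newsgroup) directory, so that the
// on-disk layout $path/{20news-bydate-train,20news-bydate-test}/{newsgroup}/
// maps onto the table's partition columns (type, newsgroup).
for (t <- Array("20news-bydate-train", "20news-bydate-test")) {
  for (newsgroup <- newsgroups) {
    sqlContext.sql(
      s"""ALTER TABLE 20NEWS ADD IF NOT EXISTS
         |PARTITION(type='$t', newsgroup='$newsgroup') LOCATION '$path/$t/$newsgroup/'"""
        .stripMargin)
  }
}
```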
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684152 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+    sqlContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS 20NEWS(text String)
+      PARTITIONED BY (type String, newsgroup String)
+      STORED AS TEXTFILE location '$path'""")
--- End diff --
A comment about what this is doing would help people not used to *QL
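One way to address that suggestion: the same statement with a short explanatory comment. This is a sketch, not the author's code; it assumes `sqlContext` and `path` from the surrounding example:

```scala
// Declare an external Hive table over the raw files: each line of each file
// becomes one row in the `text` column. No data is copied; the partition
// columns (type, newsgroup) are populated later from the directory names
// via ALTER TABLE ... ADD PARTITION.
sqlContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS 20NEWS(text String)
  PARTITIONED BY (type String, newsgroup String)
  STORED AS TEXTFILE location '$path'""")
```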
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684146 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+ * An example of an end-to-end machine learning pipeline that classifies text
+ * into one of twenty possible news categories. The dataset is the 20newsgroups
+ * dataset (http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz)
--- End diff --
Do you know how stable this URL is? Is the UCI dataset ok to use too, if that's more stable?
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/6654#issuecomment-112974372 @harsha2010 I added a few comments, but for this example, my intention was to have a complex chain of feature encoders (maybe using 6+, since the SimpleTextClassificationPipeline already includes 3 stages), with the focus on demonstrating how feature transformers can feed into each other, maybe in a non-linear graph. This code example actually seems more useful to me as an example of how to work with HDFS and Hive. I haven't looked at the SQL examples much; do they cover similar info? I'm wondering where the best place for these topics is. (But covering them in examples seems valuable to me.)
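The kind of transformer chain described in that comment might look like the following sketch, built from the classes the example already imports. The stage choices and column names here are hypothetical, not the PR's actual code; it assumes the spark.ml API as of Spark 1.4:

```scala
// Chain feature transformers so each stage's output column feeds the next:
// newsgroup string -> numeric label, text -> tokens -> term-frequency vector,
// then a one-vs-rest logistic regression over the resulting features.
val labelIndexer = new StringIndexer()
  .setInputCol("newsgroup")
  .setOutputCol("label")
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol) // wire stages together by column name
  .setOutputCol("features")
val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression())
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, hashingTF, ovr))
// val model = pipeline.fit(trainingData)  // trainingData: DataFrame with text, newsgroup
```

A deeper example could insert additional transformers (e.g. an IDF stage between HashingTF and the classifier) to make the chain less linear, which seems to be the intent described above.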