[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112889519 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32655781 --- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala --- @@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String) def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) } + + /** Return all the nodes which are cached. */ + def getCachedNodes: Seq[RDDOperationNode] = { +val cachedNodes = _childNodes.filter(_.cached) +_childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached)) --- End diff -- style: ``` _childClusters.foreach { cluster => cachedNodes ++= cluster._childNodes.filter(_.cached) } ```
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112896107 [Test build #35049 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35049/consoleFull) for PR 6839 at commit [`f98728b`](https://github.com/apache/spark/commit/f98728bdbef0d3388f36928dccd573fa15bc6536).
[GitHub] spark pull request: [SPARK-8371][SQL] improve unit test for MaxOf ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6825#discussion_r32656460 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -109,7 +109,7 @@ trait ExpressionEvalHelper { } val actual = plan(inputRow) -val expectedRow = new GenericRow(Array[Any](CatalystTypeConverters.convertToCatalyst(expected))) +val expectedRow = InternalRow.fromSeq(Array(CatalystTypeConverters.convertToCatalyst(expected))) --- End diff -- Sounds reasonable; it's annoying to have so many `UTF8String.fromString` calls in test cases.
[GitHub] spark pull request: [SPARK-8161] Set externalBlockStoreInitialized...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/6702#issuecomment-112904974 LGTM
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/5694#discussion_r32660292 --- Diff: dev/run-tests.py --- @@ -0,0 +1,536 @@ +#!/usr/bin/env python2 + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import re +import sys +import shutil +import subprocess +from collections import namedtuple + +SPARK_HOME = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..") +USER_HOME = os.environ.get("HOME") + + +def get_error_codes(err_code_file): +"""Function to retrieve all block numbers from the `run-tests-codes.sh` +file to maintain backwards compatibility with the `run-tests-jenkins` +script""" + +with open(err_code_file, 'r') as f: +err_codes = [e.split()[1].strip().split('=') + for e in f if e.startswith("readonly")] +return dict(err_codes) + + +ERROR_CODES = get_error_codes(os.path.join(SPARK_HOME, "dev/run-tests-codes.sh")) + + +def exit_from_command_with_retcode(cmd, retcode): +print "[error] running", cmd, "; received return code", retcode --- End diff -- Minor nit / annoyance here: this ends up printing things like ``` [error] running ['/Users/joshrosen/Documents/Spark/dev/../build/mvn', '-Pyarn', '-Phadoop-2.3', '-Dhadoop.version=2.3.0', '-Pkinesis-asl', '-Phive', '-Phive-thriftserver', 'clean', 'package', '-DskipTests'] ; received return code 1 ``` which makes it hard to copy and paste the command to run it manually in the shell.
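JoshRosen's nit is that printing an argv list uses Python list syntax, which cannot be pasted into a shell. One way to address this — a sketch of the idea, not the fix actually adopted in the patch, and the helper name `format_cmd` is invented here — is to shell-quote each argument before joining. The sketch uses Python 3's `shlex.quote`; on the Python 2 the script targets, `pipes.quote` plays the same role.

```python
import shlex


def format_cmd(cmd):
    # Render an argv list as a single shell-safe string, so the printed
    # command can be copied and pasted into a terminal directly.
    # Arguments containing spaces or shell metacharacters get quoted;
    # plain arguments are left as-is.
    return " ".join(shlex.quote(arg) for arg in cmd)


cmd = ["./build/mvn", "-Pyarn", "-Dhadoop.version=2.3.0", "clean", "package", "-DskipTests"]
print("[error] running: %s; received return code 1" % format_cmd(cmd))
```

With this, the error line contains a command that can be re-run verbatim in the shell.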
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112914737 Merged build triggered.
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6833#issuecomment-112914725 Merged build triggered.
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5694
[GitHub] spark pull request: [SPARK-8077][SQL] Optimization for TreeNodes w...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6673
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32656672 --- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala --- @@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String) def getAllNodes: Seq[RDDOperationNode] = { _childNodes ++ _childClusters.flatMap(_.childNodes) } + + /** Return all the nodes which are cached. */ + def getCachedNodes: Seq[RDDOperationNode] = { +val cachedNodes = _childNodes.filter(_.cached) +_childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached)) --- End diff -- also, another way to rewrite this would be: ``` _childNodes.filter(_.cached) ++ _childClusters.flatMap(_.getCachedNodes) ``` I think it's both more concise and easier to read
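The reviewer's one-liner is recursive: filter the cached nodes at the current level, then delegate to each child cluster. The same shape can be sketched in Python with toy stand-in classes — `Node` and `Cluster` below are hypothetical simplifications, not Spark's actual `RDDOperationNode`/`RDDOperationCluster` types:

```python
class Node:
    def __init__(self, cached):
        self.cached = cached


class Cluster:
    # Hypothetical stand-in: a cluster holds direct child nodes
    # plus nested child clusters.
    def __init__(self, child_nodes=(), child_clusters=()):
        self.child_nodes = list(child_nodes)
        self.child_clusters = list(child_clusters)

    def get_cached_nodes(self):
        # Filter this level, then recurse into nested clusters --
        # the shape of the reviewer's suggested rewrite.
        cached = [n for n in self.child_nodes if n.cached]
        for c in self.child_clusters:
            cached.extend(c.get_cached_nodes())
        return cached
```

Unlike the original diff (which only looked one level down into `cluster._childNodes`), the recursive form also collects cached nodes from arbitrarily nested clusters.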
[GitHub] spark pull request: [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to P...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6354
[GitHub] spark pull request: Add ability to set additional tags
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/6857#issuecomment-112908003 Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6833#issuecomment-112914495 ok to test
[GitHub] spark pull request: [SPARK-8077][SQL] Optimization for TreeNodes w...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6673#issuecomment-112928014 Thanks! Merging to master.
[GitHub] spark pull request: [SPARK-7605] [MLlib] [PySpark] Python API for ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6346#discussion_r32670018 --- Diff: python/pyspark/mllib/feature.py --- @@ -525,6 +526,41 @@ def fit(self, data): return Word2VecModel(jmodel) +class ElementwiseProduct(VectorTransformer): + +""".. note:: Experimental + +Scales each column of the vector with the supplied weight vector, +i.e. the elementwise product. + +>>> weight = Vectors.dense([1.0, 2.0, 3.0]) +>>> eprod = ElementwiseProduct(weight) +>>> a = Vectors.dense([2.0, 1.0, 3.0]) +>>> eprod.transform(a) +DenseVector([2.0, 2.0, 9.0]) +>>> b = Vectors.dense([9.0, 3.0, 4.0]) +>>> rdd = sc.parallelize([a, b]) +>>> eprod.transform(rdd).collect() +[DenseVector([2.0, 2.0, 9.0]), DenseVector([9.0, 6.0, 12.0])] +""" + +def __init__(self, vector): +if not isinstance(vector, Vector): --- End diff -- It would be good to support list and np.array too.
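One way to honor the review suggestion is to coerce lists, tuples, and NumPy arrays into a dense representation at construction time. A standalone sketch of that idea with plain NumPy follows — it is an illustration of the input-coercion pattern, not PySpark's actual `Vectors` machinery, and the helper names are invented:

```python
import numpy as np


def to_dense(v):
    # Accept a list, tuple, or numpy array and coerce to a float ndarray --
    # the kind of leniency the reviewer is asking ElementwiseProduct for.
    if isinstance(v, np.ndarray):
        return v.astype(float)
    if isinstance(v, (list, tuple)):
        return np.asarray(v, dtype=float)
    raise TypeError("cannot convert %r to a dense vector" % (v,))


def elementwise_product(weight, vector):
    # Hadamard (elementwise) product of the weight vector and the input.
    return to_dense(weight) * to_dense(vector)
```

With this coercion in `__init__`, callers could pass `[1.0, 2.0, 3.0]` or `np.array([1.0, 2.0, 3.0])` interchangeably instead of being forced to build a `Vectors.dense` first.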
[GitHub] spark pull request: [SPARK-8371][SQL] improve unit test for MaxOf ...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6825#discussion_r32656297 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -55,7 +55,7 @@ trait ExpressionEvalHelper { val actual = try evaluate(expression, inputRow) catch { case e: Exception => fail(s"Exception evaluating $expression", e) } -if (actual != expected) { +if (actual !== expected) { --- End diff -- Good catch!
[GitHub] spark pull request: [SPARK-7961][SQL]Refactor SQLConf to display b...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6747#issuecomment-112911556 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8395] [DOCS] start-slave.sh docs incorr...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6855#issuecomment-112913218 LGTM, merging into master 1.4
[GitHub] spark pull request: Add ability to set additional tags
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6857#issuecomment-112913080 Hi @armisael, you need to file a JIRA here: https://issues.apache.org/jira/browse/SPARK. Once you have done that, could you change the title of this PR to link against that JIRA? e.g. ``` [SPARK-] [EC2] Add ability to set additional tags ```
[GitHub] spark pull request: spark ssc.textFileStream returns empty
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6837#issuecomment-112913685 @sduchh this is opened against the wrong branch. Please submit the change to the master branch.
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112927192 [Test build #35049 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35049/console) for PR 6839 at commit [`f98728b`](https://github.com/apache/spark/commit/f98728bdbef0d3388f36928dccd573fa15bc6536). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112927231 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112889548 Merged build started.
[GitHub] spark pull request: [SPARK-3561] Initial commit to provide pluggab...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2849#issuecomment-112894251 That's fine, but in the name of trying to clean up stale PRs, would you mind closing this PR? It's not mergeable and seems corrupted anyway. You can reopen another PR if you really want to.
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/5694#discussion_r32658498 --- Diff: dev/run-tests.py --- @@ -0,0 +1,536 @@ +#!/usr/bin/env python2 + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import os +import re +import sys +import shutil +import subprocess +from collections import namedtuple + +SPARK_HOME = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..") +USER_HOME = os.environ.get("HOME") + + +def get_error_codes(err_code_file): +"""Function to retrieve all block numbers from the `run-tests-codes.sh` +file to maintain backwards compatibility with the `run-tests-jenkins` +script""" + +with open(err_code_file, 'r') as f: +err_codes = [e.split()[1].strip().split('=') + for e in f if e.startswith("readonly")] +return dict(err_codes) + + +ERROR_CODES = get_error_codes(os.path.join(SPARK_HOME, "dev/run-tests-codes.sh")) + + +def exit_from_command_with_retcode(cmd, retcode): +print "[error] running", cmd, "; received return code", retcode +sys.exit(int(os.environ.get("CURRENT_BLOCK", 255))) + + +def rm_r(path): +"""Given an arbitrary path properly remove it with the correct python +construct if it exists +- from: http://stackoverflow.com/a/9559881""" + +if os.path.isdir(path): +shutil.rmtree(path) +elif os.path.exists(path): +os.remove(path) + + +def run_cmd(cmd): +"""Given a command as a list of arguments will attempt to execute the +command from the determined SPARK_HOME directory and, on failure, print +an error message""" + +if not isinstance(cmd, list): +cmd = cmd.split() +try: +subprocess.check_call(cmd) +except subprocess.CalledProcessError as e: +exit_from_command_with_retcode(e.cmd, e.returncode) + + +def is_exe(path): +"""Check if a given path is an executable file +- from: http://stackoverflow.com/a/377028""" + +return os.path.isfile(path) and os.access(path, os.X_OK) + + +def which(program): +"""Find and return the given program by its absolute path or 'None' +- from: http://stackoverflow.com/a/377028""" + +fpath, fname = os.path.split(program) + +if fpath: +if is_exe(program): +return program +else: +for path in os.environ.get("PATH").split(os.pathsep): +path = path.strip('"') +exe_file = os.path.join(path, program) +if is_exe(exe_file): +return exe_file +return None + + +def determine_java_executable(): +"""Will return the path of the java executable that will be used by Spark's +tests or `None`""" + +# Any changes in the way that Spark's build detects java must be reflected +# here. Currently the build looks for $JAVA_HOME/bin/java then falls back to +# the `java` executable on the path + +java_home = os.environ.get("JAVA_HOME") + +# check if there is an executable at $JAVA_HOME/bin/java +java_exe = which(os.path.join(java_home, "bin", "java")) if java_home else None +# if the java_exe wasn't set, check for a `java` version on the $PATH +return java_exe if java_exe else which("java") + + +JavaVersion = namedtuple('JavaVersion', ['major', 'minor', 'patch', 'update']) + + +def determine_java_version(java_exe): +"""Given a valid java executable will return its version in named tuple format +with accessors '.major', '.minor', '.patch', '.update'""" + +raw_output = subprocess.check_output([java_exe, "-version"], + stderr=subprocess.STDOUT) +raw_version_str = raw_output.split('\n')[0] # eg 'java version "1.8.0_25"' +version_str = raw_version_str.split()[-1].strip('"') # eg '1.8.0_25' +
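The quoted diff is cut off just after extracting `version_str`. A plausible completion that turns a `java -version` string into the `JavaVersion` named tuple the diff defines might look like the sketch below — an illustration of the parsing step, not the exact code in the patch:

```python
from collections import namedtuple

# Mirrors the named tuple declared in the diff above.
JavaVersion = namedtuple("JavaVersion", ["major", "minor", "patch", "update"])


def parse_java_version(raw_version_str):
    # Parse a line like:  java version "1.8.0_25"
    # into JavaVersion(major=1, minor=8, patch=0, update=25).
    version_str = raw_version_str.split()[-1].strip('"')   # '1.8.0_25'
    version, _, update = version_str.partition("_")        # '1.8.0', '25'
    major, minor, patch = (int(x) for x in version.split("."))
    return JavaVersion(major, minor, patch, int(update or 0))
```

The `or 0` handles version strings with no `_NN` update suffix, which some JVMs emit.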
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112908986 Merged build triggered.
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112908908 Jenkins, retest this please
[GitHub] spark pull request: [SPARK-8379][SQL]avoid speculative tasks write...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6833#issuecomment-112914749 Merged build started.
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112914683 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112914866 [Test build #35055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35055/consoleFull) for PR 6758 at commit [`6690a08`](https://github.com/apache/spark/commit/6690a080f10aa37a3a00d21f008e1570c812d4e2).
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112914732 Merged build triggered.
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112914763 Merged build started.
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112914747 Merged build started.
[GitHub] spark pull request: [SPARK-8306] [SQL] AddJar command needs to set...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6758#issuecomment-112929808 Merged build triggered.
[GitHub] spark pull request: [SPARK-8218][SQL] Add binary log math function
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6725#discussion_r32653337

--- Diff: python/pyspark/sql/functions.py ---
@@ -404,6 +405,21 @@ def when(condition, value):
 @since(1.4)
--- End diff --

1.5
[GitHub] spark pull request: [SPARK-8056][SQL] Design an easier way to cons...
Github user ilganeli commented on a diff in the pull request: https://github.com/apache/spark/pull/6686#discussion_r32656127

--- Diff: python/pyspark/sql/types.py ---
@@ -368,8 +367,49 @@ def __init__(self, fields):
         struct1 == struct2
         False
-        assert all(isinstance(f, DataType) for f in fields), "fields should be a list of DataType"
-        self.fields = fields
+        if not fields:
+            self.fields = []
+        else:
+            self.fields = fields
+            assert all(isinstance(f, StructField) for f in fields),\
+                "fields should be a list of StructField"
+
+    def add(self, name_or_struct_field, data_type=NullType(), nullable=True, metadata=None):
--- End diff --

Davies - totally agree. This was changed specifically to consolidate to a single method, as suggested by Reynold. I initially had separate add methods - one which accepted a StructField and one which accepted the 4 parameters, the first two of which were defined. What would you suggest? My preference is to break this out into two methods for clarity and to avoid the problem you mention.

Thank you,
Ilya Ganelin

-----Original Message-----
From: Davies Liu [notificati...@github.com]
Sent: Wednesday, June 17, 2015 01:18 PM Eastern Standard Time
To: apache/spark
Cc: Ganelin, Ilya
Subject: Re: [spark] [SPARK-8056][SQL] Design an easier way to construct schema for both Scala and Python (#6686)

In python/pyspark/sql/types.py (https://github.com/apache/spark/pull/6686#discussion_r32650869): What's the use cases that we should have StructType without specifying the dataType of each column? In createDataFrame, if a schema of StructType is provided, it will not try to infer the data types, so it does not work with StructType with NoneType in it.
[GitHub] spark pull request: [SPARK-8320] [Streaming] Add example in stream...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/6862#discussion_r32658252

--- Diff: docs/streaming-programming-guide.md ---
@@ -1937,6 +1937,16 @@ JavaPairDStream<String, String> unifiedStream = streamingContext.union(kafkaStre
 unifiedStream.print();
 {% endhighlight %}
 </div>
+<div data-lang="python" markdown="1">
+{% highlight python %}
+numStreams = 5
+kafkaStreams = []
+for x in range (0, numStreams):
+    kafkaStreams = x.map{ KafkaUtils.createStream(…)}
--- End diff --

Hm, I don't think this can be correct? `x` is an integer; you can't map it. Python collections don't map like that anyway. Are you just trying to `append` the new stream to `kafkaStreams`? Nit: use `...` instead of the ellipsis character. Is the method really `show()` in Pyspark instead of `print()`?
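A minimal stand-alone sketch of the pattern the reviewer is pointing at: build the list with `append` in the loop, then union the collected streams once. The helpers `create_stream` and `union` below are hypothetical stand-ins, not the real `KafkaUtils.createStream` / `StreamingContext.union` API.

```python
# Hypothetical stand-ins for KafkaUtils.createStream and
# streamingContext.union; they only illustrate the list-building shape,
# not the real PySpark streaming API.
def create_stream(i):
    return "stream-%d" % i

def union(streams):
    # a real StreamingContext.union would merge DStreams; here we just
    # collect the stand-in values into one list
    return list(streams)

numStreams = 5
kafkaStreams = []
for x in range(numStreams):
    kafkaStreams.append(create_stream(x))   # append each stream, not map over x
unifiedStream = union(kafkaStreams)
```

The key difference from the snippet under review is that the loop accumulates into the list instead of reassigning `kafkaStreams` on every iteration.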
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112905346 @zsxwing Can you have a look, please?
[GitHub] spark pull request: [SPARK-8283][SQL] Resolve udf_struct test fail...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6828#issuecomment-112910402 Merged build started.
[GitHub] spark pull request: [SPARK-8283][SQL] Resolve udf_struct test fail...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6828#issuecomment-112910156 ok to test
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913681 Merged build triggered.
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913712 Merged build started.
[GitHub] spark pull request: [SPARK-7862][SQL]Fix the deadlock in script tr...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/6404#issuecomment-112913378

After this patch, the pull request builder logs show 5 lines of stdout output, which makes them hard to read:

```
[info] - test script transform for stdout (3 seconds, 806 milliseconds)
17:26:41.401 WARN org.apache.spark.scheduler.TaskSetManager: Stage 1316 contains a task of very large size (2246 KB). The maximum recommended task size is 100 KB.
1	1	1
2	2	2
3	3	3
4	4	4
5	5	5
6	6	6
7	7	7
8	8	8
9	9	9
10	10	10
11	11	11
12	12	12
13	13	13
14	14	14
15	15	15
16	16	16
17	17	17
18	18	18
19	19	19
20	20	20
21	21	21
22	22	22
23	23	23
24	24	24
25	25	25
[...]
```

At least I _think_ that this is the patch that caused this issue. If that's the case, could someone open up a followup PR to fix this?
[GitHub] spark pull request: [SPARK-7515] [DOC] Update documentation for Py...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6842#issuecomment-112913530 LGTM, merging into master and 1.4.
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/5659#issuecomment-112920385 test this please
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5659#issuecomment-112921278 [Test build #35056 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35056/consoleFull) for PR 5659 at commit [`cfa79f8`](https://github.com/apache/spark/commit/cfa79f8f05153f58690b7cc5d84ff18770f2f3de).
[GitHub] spark pull request: [SPARK-8056][SQL] Design an easier way to cons...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/6686#discussion_r32656949

--- Diff: python/pyspark/sql/types.py ---
@@ -368,8 +367,49 @@ def __init__(self, fields):
         struct1 == struct2
         False
-        assert all(isinstance(f, DataType) for f in fields), "fields should be a list of DataType"
-        self.fields = fields
+        if not fields:
+            self.fields = []
+        else:
+            self.fields = fields
+            assert all(isinstance(f, StructField) for f in fields),\
+                "fields should be a list of StructField"
+
+    def add(self, name_or_struct_field, data_type=NullType(), nullable=True, metadata=None):
--- End diff --

I think you could use `None` as default value of `dataType`, and raise an exception if `name_or_struct_field` is a string and `dataType` is None
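A hedged sketch of the API shape davies is suggesting. This is not the real `pyspark.sql.types` implementation; the simplified `StructField`/`StructType` classes below only illustrate using `None` as the default `data_type` and failing fast when a field is added by name without a type.

```python
# Simplified stand-ins for pyspark.sql.types classes, used only to
# illustrate the None-default + validation pattern under discussion.
class StructField(object):
    def __init__(self, name, data_type, nullable=True, metadata=None):
        self.name = name
        self.dataType = data_type
        self.nullable = nullable
        self.metadata = metadata or {}

class StructType(object):
    def __init__(self, fields=None):
        self.fields = fields or []

    def add(self, name_or_struct_field, data_type=None, nullable=True, metadata=None):
        if isinstance(name_or_struct_field, StructField):
            # a fully-formed field: ignore the remaining arguments
            self.fields.append(name_or_struct_field)
        elif isinstance(name_or_struct_field, str):
            if data_type is None:
                # adding by name without a type is ambiguous, so fail fast,
                # as suggested in the review comment
                raise ValueError("data_type is required when adding a field by name")
            self.fields.append(
                StructField(name_or_struct_field, data_type, nullable, metadata))
        else:
            raise TypeError("expected a StructField or a field name")
        return self  # allow chaining: StructType().add(...).add(...)
```

Keeping a single `add` with a `None` sentinel preserves Reynold's one-method API while still catching the invalid name-without-type combination at call time.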
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6839#issuecomment-112900896 Approach looks fine to me. Once you address the comments I'll merge this.
[GitHub] spark pull request: [SPARK-8391][Core] More efficient usage of mem...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6859#issuecomment-112907518 @viirya I don't see how this patch uses less memory than before. The point of a string builder is to avoid the expensive copying of strings during concatenation. The changes here remove this optimization and actually use more memory by copying the strings.
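The cost being described (every concatenation copying the accumulated prefix) is language-neutral. A rough Python analogue of the builder-versus-concatenation trade-off, with `"".join` playing the role of Scala's `StringBuilder`:

```python
# Rough analogue of the StringBuilder argument: repeated concatenation
# re-copies the accumulated prefix on every step (quadratic total work),
# while collecting the pieces and joining once copies each character
# only once.
def concat_naive(parts):
    s = ""
    for p in parts:
        s = s + p          # copies len(s) + len(p) characters each time
    return s

def concat_builder(parts):
    return "".join(parts)  # single final copy, like a StringBuilder

parts = ["x"] * 1000
assert concat_naive(parts) == concat_builder(parts)
```

Both produce the same string; the difference is purely in how many intermediate copies are made along the way.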
[GitHub] spark pull request: [SPARK-5673] [MLlib] Implement Streaming wrapp...
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/4456#issuecomment-112907685 I'm not an expert but it seems that the PR topic is out of place.
[GitHub] spark pull request: [SPARK-7017][Build][Project Infra]: Refactor d...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/5694#discussion_r32661175

--- Diff: dev/run-tests.py ---
@@ -0,0 +1,536 @@
+#!/usr/bin/env python2
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+import os
+import re
+import sys
+import shutil
+import subprocess
+from collections import namedtuple
+
+SPARK_HOME = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..")
+USER_HOME = os.environ.get("HOME")
+
+
+def get_error_codes(err_code_file):
+    """Function to retrieve all block numbers from the `run-tests-codes.sh`
+    file to maintain backwards compatibility with the `run-tests-jenkins`
+    script"""
+
+    with open(err_code_file, 'r') as f:
+        err_codes = [e.split()[1].strip().split('=')
+                     for e in f if e.startswith("readonly")]
+        return dict(err_codes)
+
+
+ERROR_CODES = get_error_codes(os.path.join(SPARK_HOME, "dev/run-tests-codes.sh"))
+
+
+def exit_from_command_with_retcode(cmd, retcode):
+    print "[error] running", cmd, "; received return code", retcode
+    sys.exit(int(os.environ.get("CURRENT_BLOCK", 255)))
+
+
+def rm_r(path):
+    """Given an arbitrary path properly remove it with the correct python
+    construct if it exists
+    - from: http://stackoverflow.com/a/9559881"""
+
+    if os.path.isdir(path):
+        shutil.rmtree(path)
+    elif os.path.exists(path):
+        os.remove(path)
+
+
+def run_cmd(cmd):
+    """Given a command as a list of arguments will attempt to execute the
+    command from the determined SPARK_HOME directory and, on failure, print
+    an error message"""
+
+    if not isinstance(cmd, list):
+        cmd = cmd.split()
+    try:
+        subprocess.check_call(cmd)
+    except subprocess.CalledProcessError as e:
+        exit_from_command_with_retcode(e.cmd, e.returncode)
+
+
+def is_exe(path):
+    """Check if a given path is an executable file
+    - from: http://stackoverflow.com/a/377028"""
+
+    return os.path.isfile(path) and os.access(path, os.X_OK)
+
+
+def which(program):
+    """Find and return the given program by its absolute path or 'None'
+    - from: http://stackoverflow.com/a/377028"""
+
+    fpath, fname = os.path.split(program)
+
+    if fpath:
+        if is_exe(program):
+            return program
+    else:
+        for path in os.environ.get("PATH").split(os.pathsep):
+            path = path.strip('"')
+            exe_file = os.path.join(path, program)
+            if is_exe(exe_file):
+                return exe_file
+    return None
+
+
+def determine_java_executable():
+    """Will return the path of the java executable that will be used by Spark's
+    tests or `None`"""
+
+    # Any changes in the way that Spark's build detects java must be reflected
+    # here. Currently the build looks for $JAVA_HOME/bin/java then falls back to
+    # the `java` executable on the path
+
+    java_home = os.environ.get("JAVA_HOME")
+
+    # check if there is an executable at $JAVA_HOME/bin/java
+    java_exe = which(os.path.join(java_home, "bin", "java")) if java_home else None
+    # if the java_exe wasn't set, check for a `java` version on the $PATH
+    return java_exe if java_exe else which("java")
+
+
+JavaVersion = namedtuple('JavaVersion', ['major', 'minor', 'patch', 'update'])
+
+
+def determine_java_version(java_exe):
+    """Given a valid java executable will return its version in named tuple format
+    with accessors '.major', '.minor', '.patch', '.update'"""
+
+    raw_output = subprocess.check_output([java_exe, "-version"],
+                                         stderr=subprocess.STDOUT)
+    raw_version_str = raw_output.split('\n')[0]  # eg 'java version "1.8.0_25"'
+    version_str = raw_version_str.split()[-1].strip('"')  # eg '1.8.0_25'
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112914027 Hi @andrewor14, As I said on the JIRA, I'm not able to post screenshots at this time, sorry, but I might have the time this weekend.
[GitHub] spark pull request: [SPARK-8372] History server shows incorrect in...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6827#discussion_r32662931

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -282,8 +282,12 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
     val newAttempts = logs.flatMap { fileStatus =>
       try {
         val res = replay(fileStatus, bus)
-        logInfo(s"Application log ${res.logPath} loaded successfully.")
-        Some(res)
+        res match {
+          case Some(r) => logDebug(s"Application log ${r.logPath} loaded successfully.")
--- End diff --

oh never mind, just saw @vanzin's comment
[GitHub] spark pull request: [SPARK-8372] History server shows incorrect in...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6827#discussion_r32663412

--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -431,7 +435,9 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
    * Replays the events in the specified log file and returns information about the associated
    * application.
    */
-  private def replay(eventLog: FileStatus, bus: ReplayListenerBus): FsApplicationAttemptInfo = {
+  private def replay(
+      eventLog: FileStatus,
+      bus: ReplayListenerBus): Option[FsApplicationAttemptInfo] = {
--- End diff --

need to update the javadoc to explain when we return `Some(...)` vs `None`
[GitHub] spark pull request: [SPARK-7289] handle project - limit - sort e...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/6780#discussion_r32668112

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala ---
@@ -160,17 +166,22 @@ case class TakeOrdered(limit: Int, sortOrder: Seq[SortOrder], child: SparkPlan)
   private val ord: RowOrdering = new RowOrdering(sortOrder, child.output)

-  private def collectData(): Array[InternalRow] =
-    child.execute().map(_.copy()).takeOrdered(limit)(ord)
+  @transient private val projection = projectList.map(newProjection(_, child.output))
--- End diff --

How did you see errors here? I removed `@transient` and `sbt sql/test` still works for me.
[GitHub] spark pull request: [SPARK-7605] [MLlib] [PySpark] Python API for ...
Github user MechCoder commented on the pull request: https://github.com/apache/spark/pull/6346#issuecomment-112931617 ping @davies Would you be able to have a look at this?
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32656904

--- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala ---
@@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String)
   def getAllNodes: Seq[RDDOperationNode] = {
     _childNodes ++ _childClusters.flatMap(_.childNodes)
   }
+
+  /** Return all the nodes which are cached. */
+  def getCachedNodes: Seq[RDDOperationNode] = {
+    val cachedNodes = _childNodes.filter(_.cached)
+    _childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached))
--- End diff --

I see, is it because we clone fewer nodes? AFAIK `++` on ArrayBuffer actually clones the entire thing first
[GitHub] spark pull request: [SPARK-8392] Improve the efficiency
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6839#discussion_r32656910

--- Diff: core/src/main/scala/org/apache/spark/ui/scope/RDDOperationGraph.scala ---
@@ -70,6 +70,13 @@ private[ui] class RDDOperationCluster(val id: String, private var _name: String)
   def getAllNodes: Seq[RDDOperationNode] = {
     _childNodes ++ _childClusters.flatMap(_.childNodes)
   }
+
+  /** Return all the nodes which are cached. */
+  def getCachedNodes: Seq[RDDOperationNode] = {
+    val cachedNodes = _childNodes.filter(_.cached)
+    _childClusters.foreach(cluster => cachedNodes ++= cluster._childNodes.filter(_.cached))
+    cachedNodes
--- End diff --

another way to rewrite this would be:
```
_childNodes.filter(_.cached) ++ _childClusters.flatMap(_.getCachedNodes)
```
I think it's both more concise and easier to read
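For readers less familiar with Scala's `filter`/`flatMap` one-liner, a hypothetical Python rendering of the suggested recursive rewrite. The `Cluster` class below is a stand-in model, not Spark's actual `RDDOperationCluster`.

```python
# Stand-in model of a UI operation cluster, used only to illustrate the
# suggested "filter own nodes, then flatMap over child clusters" rewrite.
class Cluster(object):
    def __init__(self, nodes=None, child_clusters=None):
        self.nodes = nodes or []                    # list of (name, cached) pairs
        self.child_clusters = child_clusters or []  # nested Cluster objects

    def cached_nodes(self):
        # equivalent of:
        #   _childNodes.filter(_.cached) ++ _childClusters.flatMap(_.getCachedNodes)
        own = [n for n in self.nodes if n[1]]
        nested = [n for c in self.child_clusters for n in c.cached_nodes()]
        return own + nested
```

Note that, unlike the version under review, the suggested form recurses into arbitrarily deep child clusters rather than stopping one level down.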
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112909451 [Test build #35050 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35050/consoleFull) for PR 6852 at commit [`d464211`](https://github.com/apache/spark/commit/d464211afe6805117cd4d56957a6f773bc8ae3a8).
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112909043 Merged build started.
[GitHub] spark pull request: [SPARK-7961][SQL]Refactor SQLConf to display b...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6747#issuecomment-112911506 [Test build #35047 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35047/console) for PR 6747 at commit [`7d09bad`](https://github.com/apache/spark/commit/7d09bad23a7ac25c90735079b01984fa307a6f73).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SetCommand(kv: Option[(String, Option[String])]) extends RunnableCommand with Logging `
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112914311 [Test build #35052 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35052/consoleFull) for PR 6845 at commit [`edd0936`](https://github.com/apache/spark/commit/edd093623f01e69cbef00b24b67809afea5ce49d).
[GitHub] spark pull request: [SPARK-8010][SQL]Promote types to StringType a...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/6551#discussion_r32668471
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -39,6 +39,16 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll with SQLTestUtils {
   val sqlContext = TestSQLContext
   import sqlContext.implicits._
+  test("SPARK-8010: promote numeric to string") {
+    val df = Seq((1, 1)).toDF("key", "value")
+    df.registerTempTable("src")
+    val queryCaseWhen = sql("select case when true then 1.0 else '1' end from src")
+    val queryCoalesce = sql("select coalesce(null, 1, '1') from src")
--- End diff --
Seems Hive will use StringType. I am fine with that.
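The coercion being discussed, where branches of mixed numeric and string type are promoted to StringType, can be sketched outside Spark. This is a simplified illustration only; Spark's real rules live in Catalyst's TypeCoercion and are more involved, and the type names here are hypothetical:

```python
# Simplified sketch of Hive-like "promote to string" coercion for the
# branches of CASE WHEN / coalesce. Illustrative only, not Spark's code.

def common_type(types):
    """Pick a common result type for a list of branch types."""
    if len(set(types)) == 1:
        return types[0]
    numeric = {"int", "double", "decimal"}
    if all(t in numeric for t in types):
        return "double"  # widen within the numeric family
    if "string" in types:
        return "string"  # mixed numeric/string promotes to string
    return None

# CASE WHEN true THEN 1.0 ELSE '1' END  ->  string result
print(common_type(["decimal", "string"]))  # string
```

Under this sketch, `coalesce(null, 1, '1')` would likewise resolve to a string result, which matches the Hive behavior yhuai mentions.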
[GitHub] spark pull request: [SPARK-5673] [MLlib] Implement Streaming wrapp...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/4456#discussion_r32658912
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLassoWithSGD.scala ---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+
+/**
+ * :: Experimental ::
+ * Train or predict a linear regression model on streaming data. Training uses
+ * Stochastic Gradient Descent to update the model based on each new batch of
+ * incoming data from a DStream (see `LinearRegressionWithSGD` for model equation).
+ *
+ * Each batch of data is assumed to be an RDD of LabeledPoints.
+ * The number of data points per batch can vary, but the number
+ * of features must be constant. An initial weight
+ * vector must be provided.
+ *
+ * Use a builder pattern to construct a streaming linear regression
+ * analysis in an application, like:
+ *
+ *   val model = new StreamingLassoWithSGD()
+ *     .setStepSize(0.5)
+ *     .setNumIterations(10)
+ *     .setInitialWeights(Vectors.dense(...))
+ *     .trainOn(DStream)
+ */
+@Experimental
+class StreamingLassoWithSGD private[mllib](
--- End diff --
Since Lasso, Ridge and LinearRegression have almost similar methods, I think it might be better to have an abstract class with all three deriving from it and a protected `algorithm` method, to avoid code duplication. WDYT?
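The refactor MechCoder suggests, a shared abstract base holding the builder-style setters with each concrete regressor supplying only its `algorithm`, can be sketched like this. All names below are hypothetical, not MLlib's API:

```python
# Hypothetical sketch of the suggested refactor: the chainable setters live
# in one abstract base, and each subclass only overrides `algorithm`.
from abc import ABC, abstractmethod

class StreamingRegressionBase(ABC):
    def __init__(self):
        self.step_size = 0.1
        self.num_iterations = 50

    def set_step_size(self, s):
        self.step_size = s
        return self  # builder pattern: each setter returns self for chaining

    def set_num_iterations(self, n):
        self.num_iterations = n
        return self

    @abstractmethod
    def algorithm(self):
        """Name of the underlying batch algorithm (illustrative)."""

class StreamingLasso(StreamingRegressionBase):
    def algorithm(self):
        return "LassoWithSGD"

class StreamingRidge(StreamingRegressionBase):
    def algorithm(self):
        return "RidgeRegressionWithSGD"

model = StreamingLasso().set_step_size(0.5).set_num_iterations(10)
print(model.algorithm())  # LassoWithSGD
```

This keeps the duplicated setter plumbing in one place, which is the point of the review comment.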
[GitHub] spark pull request: [SPARK-8283][SQL] Resolve udf_struct test fail...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6828#issuecomment-112910383 Merged build triggered.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/6832#discussion_r32659702
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with `sudo port install gsed`
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --
I don't think that will work, and I 80% remember why -- the published POM doesn't have any notion of activated profiles, and so the artifacts will default to expressing a dependency on 2.10. That's why it had to be hard-coded and changed on update to 2.11.
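The effect of the sed expression in the script, rewriting the Scala suffix of every `artifactId` in the POMs, can be mirrored with a plain regex. This is only an illustration of what the substitution does:

```python
# What the script's sed expression does to each pom.xml line:
# capture everything from "artifactId" up to the version suffix, then
# swap the old Scala version for the new one.
import re

line = "<artifactId>spark-core_2.10</artifactId>"
updated = re.sub(r"(artifactId.*)_2\.10", r"\g<1>_2.11", line)
print(updated)  # <artifactId>spark-core_2.11</artifactId>
```

As srowen notes, this textual rewrite is needed precisely because the published POM cannot express the choice via an activated profile.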
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913463 Hi @BenFradet, could you post a screenshot of what this looks like before and after your change?
[GitHub] spark pull request: [SPARK-8371][SQL] improve unit test for MaxOf ...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/6825#discussion_r32661616
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala ---
@@ -109,7 +109,7 @@ trait ExpressionEvalHelper {
   }
   val actual = plan(inputRow)
-    val expectedRow = new GenericRow(Array[Any](CatalystTypeConverters.convertToCatalyst(expected)))
+    val expectedRow = InternalRow.fromSeq(Array(CatalystTypeConverters.convertToCatalyst(expected)))
   if (actual.hashCode() != expectedRow.hashCode()) {
--- End diff --
If different types of rows have different hash code implementations then we cannot use them as keys in a hash table. This is to check that they all share an implementation.
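The point marmbrus makes, that all row implementations must share a hash code implementation or hash-based lookups silently break, can be demonstrated in miniature. The class names below are illustrative stand-ins, not Catalyst's classes:

```python
# Two "row" representations that compare equal must also hash equally,
# or one cannot be used to look up the other in a hash table.

class GenericRow:
    def __init__(self, values):
        self.values = tuple(values)
    def __eq__(self, other):
        return self.values == getattr(other, "values", None)
    def __hash__(self):
        return hash(self.values)  # shared implementation: hash the contents

class SpecializedRow(GenericRow):
    pass  # a different row type, but same __hash__/__eq__ over contents

table = {GenericRow([1, "a"]): "cached"}
# Lookup with the other row type still hits, because the hashes agree:
print(table[SpecializedRow([1, "a"])])  # cached
```

If `SpecializedRow` defined its own `__hash__`, the lookup above would raise `KeyError` even though the rows compare equal, which is exactly the failure the test guards against.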
[GitHub] spark pull request: [SPARK-8399][Streaming][Web UI] Overlap betwee...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/6845#issuecomment-112913386 add to whitelist
[GitHub] spark pull request: [SPARK-8372] History server shows incorrect in...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/6827#discussion_r32662634
--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -282,8 +282,12 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
   val newAttempts = logs.flatMap { fileStatus =>
     try {
       val res = replay(fileStatus, bus)
-      logInfo(s"Application log ${res.logPath} loaded successfully.")
-      Some(res)
+      res match {
+        case Some(r) => logDebug(s"Application log ${r.logPath} loaded successfully.")
--- End diff --
should this be info?
[GitHub] spark pull request: [SPARK-8381][SQL]reuse typeConvert when conver...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6831#issuecomment-112915414 [Test build #35054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35054/consoleFull) for PR 6831 at commit [`1fec395`](https://github.com/apache/spark/commit/1fec3958926cef498a1d8ca6ce5746afec5423c3).
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6852#issuecomment-112939057 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-7605] [MLlib] [PySpark] Python API for ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6346#issuecomment-112945746 Merged build finished. Test FAILed.
[GitHub] spark pull request: [MLlib] [SPARK-7667] MLlib Python API consiste...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6856#discussion_r32673593
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -106,12 +109,12 @@ def predict(self, x):
             best_distance = distance
         return best
-    def computeCost(self, rdd):
+    def computeCost(self, data):
--- End diff --
I'm afraid it's too late to make changes like this to the API. The release is already done.
[GitHub] spark pull request: [MLlib] [SPARK-7667] MLlib Python API consiste...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6856#discussion_r32673610
--- Diff: python/pyspark/mllib/tree.py ---
@@ -90,9 +92,11 @@ def predict(self, x):
         else:
             return self.call("predict", _convert_to_vector(x))
+    @property
--- End diff --
I agree. Can you please change the doc test back to what it was to make sure we don't break APIs? If we can't support these as properties, that is OK.
[GitHub] spark pull request: [MLlib] [SPARK-7667] MLlib Python API consiste...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6856#discussion_r32673601
--- Diff: python/pyspark/mllib/feature.py ---
@@ -123,20 +132,6 @@ class StandardScalerModel(JavaVectorTransformer):
     Represents a StandardScaler model that can transform vectors.
-    def transform(self, vector):
-        """
-        Applies standardization transformation on a vector.
-
-        Note: In Python, transform cannot currently be used within
-          an RDD transformation or action.
-          Call transform directly on the RDD instead.
-
-        :param vector: Vector or RDD of Vector to be standardized.
-        :return: Standardized vector. If the variance of a column is
--- End diff --
This part is specific to this transformer. Can you please add it somewhere in the doc for StandardScalerModel?
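The API-compatibility concern running through this thread can be made concrete: once a released API exposes something as a method, turning it into a `@property` breaks every caller that invokes it with parentheses. The class and attribute names below are illustrative, not MLlib's:

```python
# Why changing a released method into a property breaks callers.

class TreeModelMethod:
    def numNodes(self):        # released API shape: a method
        return 15

class TreeModelProperty:
    @property
    def numNodes(self):        # proposed shape: attribute access only
        return 15

m = TreeModelMethod()
p = TreeModelProperty()
print(m.numNodes(), p.numNodes)  # 15 15
# Existing user code written as p.numNodes() would now raise
# TypeError: 'int' object is not callable.
```

This is why jkbradley asks to restore the doc test: it pins the released calling convention so such a break is caught.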
[GitHub] spark pull request: [SPARK-8320] [Streaming] Add example in stream...
Github user nssalian commented on the pull request: https://github.com/apache/spark/pull/6862#issuecomment-112947266 @srowen, I changed the Kafka append, the loop structure and the print method call. Thank you.
[GitHub] spark pull request: [SPARK-4176][WIP] Support decimal types with p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6796#issuecomment-112956321 [Test build #35062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35062/consoleFull) for PR 6796 at commit [`8f6445c`](https://github.com/apache/spark/commit/8f6445c25fa62cd9fa3cb28b7441dd19d8692c6b).
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112958798 If I understand correctly, you are just delaying the failure until `checkAnalysis`. A few followup questions: does that mean you don't check analysis in your code path? Does your custom logical plan produce new attribute references?
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112958839 ok to test
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112959854 Merged build started.
[GitHub] spark pull request: [SQL][SPARK-7088] Fix analysis for 3rd party l...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6853#issuecomment-112959823 Merged build triggered.
[GitHub] spark pull request: [SPARK-7067][SQL] fix bug when use complex nes...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5659
[GitHub] spark pull request: [SPARK-8404][Streaming][Tests] Use thread-safe...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6852
[GitHub] spark pull request: [SPARK-7712] [SQL] Move Window Functions from ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6278#issuecomment-112963381 Merged build started.
[GitHub] spark pull request: [SPARK-8376][Docs]Add common lang3 to the Spar...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/6829#issuecomment-112969135 I think the assembly is a good idea. Though for that we will have to:
1. Publish the assembly JAR instead of the package JAR. It will be cumbersome to add an additional flume-sink-assembly directory for this. Maybe we can make the existing flume-sink project generate and publish the assembly instead of the package.
2. Update the instructions.
I am not sure this will have much impact on existing deployments, because they are supposed to download and run the version of the sink that is necessary for the version of Spark they are running. @harishreedharan What do you think about this?
[GitHub] spark pull request: [SPARK-8397][SQL] Allow custom configuration f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/6844
[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112971247 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8406] [SQL] Adding UUID to output file ...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/6864 [SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-8406 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6864.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6864 commit 0eb6baca71445b9a1c6a26ed72c13981c1e45ea6 Author: Cheng Lian l...@databricks.com Date: 2015-06-17T22:40:20Z Adding UUID to output file name to avoid accidental overwriting
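The idea behind SPARK-8406, salting each output file name with a per-write-job UUID so concurrent or retried writes cannot clobber one another, can be sketched briefly. The exact file-name layout below is illustrative, not Spark's actual scheme:

```python
# Sketch: one UUID per write job, shared by all of that job's part files,
# so two jobs writing into the same directory produce disjoint file names.
import uuid

job_uuid = uuid.uuid4().hex  # generated once per write job

def output_file_name(task_id: int, extension: str = "parquet") -> str:
    return f"part-r-{task_id:05d}-{job_uuid}.{extension}"

print(output_file_name(3))
```

Two independent jobs would draw different `job_uuid` values, so even identical task ids map to distinct file names.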
[GitHub] spark pull request: [SPARK-8333] [SQL] Spark failed to delete temp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6858#issuecomment-112977799 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-8301][SQL] Improve UTF8String substring...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/6804#discussion_r32690229
--- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -125,30 +129,37 @@ public UTF8String substring(final int start, final int until) {
   }
   public boolean contains(final UTF8String substring) {
+    if (substring == null) return false;
     final byte[] b = substring.getBytes();
     if (b.length == 0) {
       return true;
     }
     for (int i = 0; i <= bytes.length - b.length; i++) {
-      // TODO: Avoid copying.
-      if (bytes[i] == b[0] && Arrays.equals(Arrays.copyOfRange(bytes, i, i + b.length), b)) {
+      if (bytes[i] == b[0] && startsWith(b, i)) {
         return true;
       }
     }
     return false;
   }
+  private boolean startsWith(final byte[] prefix, int offset) {
--- End diff --
how about renaming it offsetInBytes?
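The optimization under review, checking a prefix match in place at each candidate offset rather than copying a slice per step (the copy is what the removed `Arrays.copyOfRange` did), can be mirrored over Python `bytes`:

```python
# In-place prefix check at an offset -- no per-iteration array copy.

def starts_with(data: bytes, prefix: bytes, offset: int) -> bool:
    if offset + len(prefix) > len(data):
        return False
    return all(data[offset + i] == prefix[i] for i in range(len(prefix)))

def contains(data: bytes, sub: bytes) -> bool:
    if len(sub) == 0:
        return True  # empty substring is always contained
    # Cheap first-byte filter before the full prefix check, as in the diff.
    return any(
        data[i] == sub[0] and starts_with(data, sub, i)
        for i in range(len(data) - len(sub) + 1)
    )

print(contains(b"hello world", b"lo w"))  # True
```

The first-byte comparison keeps the common case cheap; the full check only runs at offsets where the first byte already matches.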
[GitHub] spark pull request: [SPARK-8095] Resolve dependencies of --package...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/6788#issuecomment-112971221 [Test build #35060 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35060/console) for PR 6788 at commit [`2875bf4`](https://github.com/apache/spark/commit/2875bf49a4fea8027d2c8224d6d4b2bed09893d5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684159
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml
+
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
+import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
+import org.apache.spark.mllib.evaluation.MulticlassMetrics
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.hive.HiveContext
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example of an end to end machine learning pipeline that classifies text
+ * into one of twenty possible news categories. The dataset is the 20newsgroups
+ * dataset (http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz)
+ *
+ * We assume some minimal preprocessing of this dataset has been done to unzip the dataset and
+ * load the data into HDFS as follows:
+ *   wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
+ *   tar -xvzf 20news-bydate.tar.gz
+ *   hadoop fs -mkdir ${20news.root.dir}
+ *   hadoop fs -copyFromLocal 20news-bydate-train/ ${20news.root.dir}
+ *   hadoop fs -copyFromLocal 20news-bydate-test/ ${20news.root.dir}
+ *
+ * This example uses Hive to schematize the data as tables, in order to map the folder
+ * structure ${20news.root.dir}/{20news-bydate-train, 20news-bydate-test}/{newsgroup}/
+ * to partition columns type, newsgroup, resulting in a dataset with three columns:
+ * type, newsgroup, text
+ *
+ * In order to run this example, Spark needs to be built with Hive, and at runtime there
+ * should be a valid hive-site.xml in $SPARK_HOME/conf with at minimum the following
+ * configuration:
+ *   <configuration>
+ *     <property>
+ *       <name>hive.metastore.uris</name>
+ *       <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
+ *       <value>thrift://${thriftserver.host}:${thriftserver.port}</value>
+ *       <description>URI for client to contact metastore server</description>
+ *     </property>
+ *   </configuration>
+ *
+ * Run with
+ * {{{
+ * bin/spark-submit --class org.apache.spark.examples.ml.ComplexPipelineExample
+ *   --driver-memory 4g [examples JAR path] ${20news.root.dir}
+ * }}}
+ */
+object ComplexPipelineExample {
+
+  def main(args: Array[String]): Unit = {
+    val conf = new SparkConf().setAppName("ComplexPipelineExample")
+    val sc = new SparkContext(conf)
+    val sqlContext = new HiveContext(sc)
+    val path = args(0)
+
+    sqlContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS 20NEWS(text String)
+      PARTITIONED BY (type String, newsgroup String)
+      STORED AS TEXTFILE location '$path'""")
+
+    val newsgroups = Array("alt.atheism", "comp.graphics",
+      "comp.os.ms-windows.misc", "comp.sys.ibm.pc.hardware",
+      "comp.sys.mac.hardware", "comp.windows.x", "misc.forsale",
+      "rec.autos", "rec.motorcycles", "rec.sport.baseball",
+      "rec.sport.hockey", "sci.crypt", "sci.electronics",
+      "sci.med", "sci.space", "soc.religion.christian",
+      "talk.politics.guns", "talk.politics.mideast",
+      "talk.politics.misc", "talk.religion.misc")
+
+    for (t <- Array("20news-bydate-train", "20news-bydate-test")) {
+      for (newsgroup <- newsgroups) {
+        sqlContext.sql(
+          s"""ALTER TABLE 20NEWS ADD IF NOT EXISTS
+             | PARTITION(type='$t', newsgroup='$newsgroup') LOCATION '$path/$t/$newsgroup/'
+           """.stripMargin)
+      }
+    }
+
+    // shuffle the data
+    val partitions = 100
+    val data = sqlContext.sql("SELECT * FROM 20NEWS")
+      .coalesce(partitions) // by default we have over 19k partitions
+      .repartition(partitions)
+      .cache()
+
+    import
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684165 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala --- (quotes the same diff as above)
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684148 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+ * This example uses Hive to schematize the data as tables, in order to map the folder
+ * structure ${20news.root.dir}/{20news-bydate-train, 20news-bydate-train}/{newsgroup}/
--- End diff --
one should be test
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684157 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+    for (t <- Array("20news-bydate-train", "20news-bydate-train")) {
--- End diff --
I assume one of these should be test. A comment on what this is doing would help.
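Taking the two suggestions above together, a sketch of how the loop might read with the presumed fix ("test" replacing the duplicated "train" literal) and an explanatory comment; table, path, and variable names are taken from the quoted diff, and the snippet assumes `sqlContext`, `path`, and `newsgroups` are in scope as in the example:

```scala
// Register one Hive partition per (split, newsgroup) directory, so that the
// on-disk layout $path/{20news-bydate-train,20news-bydate-test}/{newsgroup}/
// maps onto the table's partition columns (type, newsgroup).
for (t <- Array("20news-bydate-train", "20news-bydate-test")) {
  for (newsgroup <- newsgroups) {
    sqlContext.sql(
      s"""ALTER TABLE 20NEWS ADD IF NOT EXISTS
         |PARTITION(type='$t', newsgroup='$newsgroup') LOCATION '$path/$t/$newsgroup/'"""
        .stripMargin)
  }
}
```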
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684152 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+    sqlContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS 20NEWS(text String)
+      PARTITIONED BY (type String, newsgroup String)
+      STORED AS TEXTFILE location '$path'""")
--- End diff --
A comment about what this is doing would help people not used to *QL
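One way to address that suggestion: the same statement with a short explanatory comment. This is a sketch, not the author's code; it assumes `sqlContext` and `path` from the surrounding example:

```scala
// Declare an external Hive table over the raw files: each line of each file
// becomes one row in the `text` column. No data is copied; the partition
// columns (type, newsgroup) are populated later from the directory names
// via ALTER TABLE ... ADD PARTITION.
sqlContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS 20NEWS(text String)
  PARTITIONED BY (type String, newsgroup String)
  STORED AS TEXTFILE location '$path'""")
```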
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/6654#discussion_r32684146 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ComplexPipelineExample.scala ---
+ * An example of an end-to-end machine learning pipeline that classifies text
+ * into one of twenty possible news categories. The dataset is the 20newsgroups
+ * dataset (http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz)
--- End diff --
Do you know how stable this URL is? Is the UCI dataset ok to use too, if that's more stable?
[GitHub] spark pull request: [SPARK-7546][ml][WIP] An example of a complex ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/6654#issuecomment-112974372 @harsha2010 I added a few comments, but for this example, my intention was to have a complex chain of feature encoders (maybe using 6+, since the SimpleTextClassificationPipeline already includes 3 stages), with the focus on demonstrating how feature transformers can feed into each other, maybe in a non-linear graph. This code example actually seems more useful to me as an example of how to work with HDFS and Hive. I haven't looked at the SQL examples much; do they cover similar info? I'm wondering where the best place for these topics is. (But covering them in examples seems valuable to me.)
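The kind of transformer chain described in that comment might look like the following sketch, built from the classes the example already imports. The stage choices and column names here are hypothetical, not the PR's actual code; it assumes the spark.ml API as of Spark 1.4:

```scala
// Chain feature transformers so each stage's output column feeds the next:
// newsgroup string -> numeric label, text -> tokens -> term-frequency vector,
// then a one-vs-rest logistic regression over the resulting features.
val labelIndexer = new StringIndexer()
  .setInputCol("newsgroup")
  .setOutputCol("label")
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol) // wire stages together by column name
  .setOutputCol("features")
val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression())
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, hashingTF, ovr))
// val model = pipeline.fit(trainingData)  // trainingData: DataFrame with text, newsgroup
```

A deeper example could insert additional transformers (e.g. an IDF stage between HashingTF and the classifier) to make the chain less linear, which seems to be the intent described above.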