[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729377 I meant MAP... what's the MAP on the Netflix dataset you have seen before, and with what lambda? I am running MAP experiments with various factorization formulations, including log-likelihood loss with normalization constraints... also, how do you define MAP for implicit feedback (a binary dataset where a click is 1 and no click is 0)? In the label set every rating is 1.0, so there is no ranking defined as such...
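For readers following this thread: even with binary implicit feedback, MAP stays well defined, because the ordering being evaluated comes from the model's predicted scores, while the all-1.0 labels only determine which items count as relevant. A minimal sketch with MLlib's RankingMetrics, assuming the per-user ranked recommendations and held-out click sets have already been computed (the variable and function names here are hypothetical):

```scala
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD

// Hypothetical inputs: for each user, the products ranked by the model's
// predicted score (best first), and the set of held-out clicked products.
// With implicit feedback all labels are 1.0, so the ground truth is just a
// set; the ordering that MAP evaluates comes from the predicted ranking.
def meanAveragePrecision(
    rankedRecommendations: RDD[(Int, Array[Int])],
    heldOutClicks: RDD[(Int, Array[Int])]): Double = {
  val predictionAndLabels: RDD[(Array[Int], Array[Int])] =
    rankedRecommendations.join(heldOutClicks).values
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
}
```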
[GitHub] spark pull request: [SPARK-6712][YARN] Allow lower the log level i...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5362#issuecomment-89729144 Writing to stdout/stderr defeats the point of a logging framework, no. I think you could argue that some of these other messages aren't vital at this log level ("setting up", "preparing", etc.) and turn those down. Otherwise I'm afraid logging configuration can't be controlled at the level of individual statements by end users, and others would say these are useful enough / not enough of a problem to disable.
[GitHub] spark pull request: [SPARK-2808][Streaming][Kafka] update kafka to...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4537#issuecomment-89728917 Jenkins, retest this please
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89728639 @debasish83 do you mean RMSE? It is well-defined but not very useful. MAP is the useful metric. I think that only a rank-dependent metric makes sense.
[GitHub] spark pull request: [SPARK-6712][YARN] Allow lower the log level i...
Github user piaozhexiu commented on the pull request: https://github.com/apache/spark/pull/5362#issuecomment-89728738 @srowen I'd like to turn down pretty much every INFO message from YARN client except the AM url. (See below.) As can be seen, none of these is useful for end users except for the AM url. Unfortunately, I can't selectively turn down other messages since they're all in the same package. How about if I print the tracking url to stdout and leave the INFO log as is? Then, I can turn off INFO in YARN client.

15/04/05 06:36:29 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/04/05 06:36:29 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/04/05 06:36:29 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/04/05 06:36:29 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/04/05 06:36:29 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/04/05 06:36:29 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/04/05 06:36:29 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/04/05 06:36:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/05 06:36:29 INFO spark.SparkContext: Running Spark version 1.3.0
15/04/05 06:36:30 INFO spark.SecurityManager: Changing view acls to: cheolsoop
15/04/05 06:36:30 INFO spark.SecurityManager: Changing modify acls to: cheolsoop
15/04/05 06:36:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cheolsoop); users with modify permissions: Set(cheolsoop)
15/04/05 06:36:30 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/04/05 06:36:30 INFO Remoting: Starting remoting
15/04/05 06:36:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ip-10-99-146-254.ec2.internal:64877]
15/04/05 06:36:30 INFO util.Utils: Successfully started service 'sparkDriver' on port 64877.
15/04/05 06:36:30 INFO spark.SparkEnv: Registering MapOutputTracker
15/04/05 06:36:30 INFO spark.SparkEnv: Registering BlockManagerMaster
15/04/05 06:36:30 INFO storage.DiskBlockManager: Created local directory at /mnt/spark_tmp/spark-e95cf4af-ec65-469a-ad1a-827d1149eeab/blockmgr-ec922fd7-58c5-497a-a952-247fcb3ab779
15/04/05 06:36:30 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
15/04/05 06:36:31 INFO spark.HttpFileServer: HTTP File server directory is /mnt/spark_tmp/spark-84f23ed2-ebf3-4022-93e1-fbb31325ab3f/httpd-128e4efa-c666-4311-b1ee-c0868eaca4bc
15/04/05 06:36:31 INFO spark.HttpServer: Starting HTTP Server
15/04/05 06:36:31 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/05 06:36:31 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36543
15/04/05 06:36:31 INFO util.Utils: Successfully started service 'HTTP file server' on port 36543.
15/04/05 06:36:31 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/04/05 06:36:31 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/05 06:36:31 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:47936
15/04/05 06:36:31 INFO util.Utils: Successfully started service 'SparkUI' on port 47936.
15/04/05 06:36:31 INFO ui.SparkUI: Started SparkUI at http://ip-10-99-146-254.ec2.internal:47936
15/04/05 06:36:31 INFO client.RMProxy: Connecting to ResourceManager at /10.171.119.231:9022
15/04/05 06:36:31 INFO yarn.Client: Requesting a new application from cluster with 1300 NodeManagers
15/04/05 06:36:31 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (10240 MB per container)
15/04/05 06:36:31 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/04/05 06:36:31 INFO yarn.Client: Setting up container launch context for our AM
15/04/05 06:36:31 INFO yarn.Client: Preparing resources for our AM container
15/04/05 06:36:32 INFO yarn.Client: Uploading resource file:/mnt/tmp/bdp-clients/cheolsoop/20150405_063623.027801.prodsparkshell13/jars/spark-1.3.0/lib/spark-assembly-1.3.1-SNAPSHOT-hadoop2.4.0.jar -> hdfs://10.171.119.231:9000/user/cheolsoop/.sparkStaging/application_1426271585556_249126/spark-assembly-1.3.1-SNAPSHOT-hadoop2.4.0.jar
15/04/05 06:36:35 INFO y
[GitHub] spark pull request: [SPARK-6712][YARN] Allow lower the log level i...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5362#issuecomment-89727695 No, println isn't appropriate here. That removes control over the logging entirely. Instead, what log messages do you find noisy? Maybe they can be turned *down*, since this message is appropriate at the INFO level. Or, can you not just selectively disable messages from packages in your log4j config? Or is the noise from the same package?
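For reference, per-package log levels can also be adjusted programmatically from a shell session, using the log4j 1.x API that Spark bundles; the package names below are only examples, and, as noted in the thread, this cannot separate messages coming from within the same package:

```scala
import org.apache.log4j.{Level, Logger}

// Example: quiet a couple of chatty loggers while leaving everything else
// at its configured level. Which packages to touch is up to the user.
Logger.getLogger("org.apache.hadoop.conf.Configuration.deprecation").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage").setLevel(Level.WARN)
```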
[GitHub] spark pull request: [SPARK-6712][YARN] Allow lower the log level i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5362#issuecomment-89724217 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-6712][YARN] Allow lower the log level i...
GitHub user piaozhexiu opened a pull request: https://github.com/apache/spark/pull/5362 [SPARK-6712][YARN] Allow lower the log level in YARN client while keeping AM tracking URL printed In YARN mode, log messages are quite verbose in interactive shells (spark-shell, spark-sql, pyspark), and they sometimes mingle with shell prompts. In fact, it's very easy to tone it down via log4j.properties, but the problem is that the AM tracking URL is not printed if I do that. It would be nice if I could keep the AM tracking URL while disabling the other INFO messages that don't matter to most end users. You can merge this pull request into a Git repository by running: $ git pull https://github.com/piaozhexiu/spark SPARK-6712 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5362.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5362 commit 9b0a047e1e34e351f22329156efb50f4a452e091 Author: Cheolsoo Park Date: 2015-04-05T05:57:05Z Use println instead of logInfo to print AM tracking url in YARN client
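For context, the log4j.properties change the description alludes to might look roughly like the following sketch (the exact logger name is an assumption). With the current code it also hides the AM tracking URL, since that URL is logged at INFO by the same yarn.Client logger, which is exactly the problem this PR describes:

```properties
# Added to conf/log4j.properties (sketch): lower the YARN client logger.
# Today this also suppresses the AM tracking URL, because the URL is logged
# at INFO from the same org.apache.spark.deploy.yarn package.
log4j.logger.org.apache.spark.deploy.yarn=WARN
```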
[GitHub] spark pull request: [SPARK-6521][Core]executors in the same node r...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/5178#issuecomment-89716003 @maropu, yeah, I think it is a common case for YARN mode. We often specify more executors than NodeManagers, which means there is more than one executor on a machine.
[GitHub] spark pull request: SPARK-6698: where RandomForest input specifies...
Github user bien commented on the pull request: https://github.com/apache/spark/pull/5351#issuecomment-89713636 The behavior I was seeing was that RandomTree training tasks were spending ~90% of their time doing GC, and when I turned on verbose GC I would see that most of the time was spent (fruitlessly) on older generation objects. I assumed the baggedInput RDD was the culprit because there were no other RDDs in my code (other than the original input), and this patch did help things somewhat. Under these circumstances I don't have a problem spending time deserializing objects or creating objects in the younger generation.

> An explicit parameter with a reasonable default might be better than making users persist RDDs as a way of specifying the parameter

This sounds fine to me but I don't know the Spark codebase well enough to contribute this.
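A sketch of the usage pattern under discussion: persist the training input with a serialized storage level before calling RandomForest, so that, under the approach described in this PR, the persistence choice can also inform how the internal baggedInput RDD is held. The data path and parameters are hypothetical, and a spark-shell session with `sc` available is assumed:

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.storage.StorageLevel

// Hypothetical input path and parameters.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
// Persist the input in serialized form; the idea discussed in this PR is to
// let this choice also govern how the internal baggedInput RDD is persisted.
data.persist(StorageLevel.MEMORY_AND_DISK_SER)

val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32,
  seed = 42)
```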
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89706247 @coderxiang @mengxr If I have a dataset with implicit feedback (click or 0), then MAP is not that well defined, right, since in the label set everything is 1.0 and so there is no ordering defined... should we add a rank-independent metric for implicit datasets?
[GitHub] spark pull request: [SPARK-4897] [PySpark] Python 3 support
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/5173#discussion_r27773756 --- Diff: python/pyspark/cloudpickle.py --- @@ -40,164 +40,126 @@ NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. """ - +from __future__ import print_function import operator import os +import io import pickle import struct import sys import types from functools import partial import itertools -from copy_reg import _extension_registry, _inverted_registry, _extension_cache -import new import dis import traceback -import platform - -PyImp = platform.python_implementation() - -import logging -cloudLog = logging.getLogger("Cloud.Transport") --- End diff -- I have an open issue to [replace cloudpickle with Dill](https://issues.apache.org/jira/browse/SPARK-4898), but I think it's still blocked by some [open issues](https://github.com/uqfoundation/dill/issues/50) against the Dill project.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89697236 @srowen For the Netflix dataset, what's the MAP you have seen before... I started experiments on the Netflix dataset... lambda is 0.065 for Netflix as well, right? For MovieLens, 0.065 works well...
[GitHub] spark pull request: [SPARK-4897] [PySpark] Python 3 support
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/5173#issuecomment-89697205 > TODO: ec2/spark-ec2.py is not fully tested with python3. I can help with this. Do we want to hold off other spark-ec2 PRs until this one goes in? Do we have a rough goal for when we want to merge this in?
[GitHub] spark pull request: [SPARK-4897] [PySpark] Python 3 support
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/5173#discussion_r27773735 --- Diff: python/pyspark/sql/functions.py --- @@ -116,7 +114,7 @@ def __init__(self, func, returnType): def _create_judf(self): f = self.func # put it in closure `func` -func = lambda _, it: imap(lambda x: f(*x), it) +func = lambda _, it: map(lambda x: f(*x), it) --- End diff -- A common approach I've seen in projects wanting to support both Python 2 and 3 is to use the [`six`](https://pythonhosted.org/six/) compatibility module, which has [support for renamed methods](https://pythonhosted.org/six/#module-six.moves). ``` from six.moves import map ``` We probably don't want to add another external dependency, but just thought I'd throw that out there.
[GitHub] spark pull request: [SPARK-2808][Streaming][Kafka] update kafka to...
Github user zzcclp commented on the pull request: https://github.com/apache/spark/pull/4537#issuecomment-89694634 @koeninger, I can't visit [this url](https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28872/); it's a 404.
[GitHub] spark pull request: [SPARK-6661] Python type errors should print t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5361#issuecomment-89686461 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29717/ Test PASSed.
[GitHub] spark pull request: [SPARK-6661] Python type errors should print t...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5361#issuecomment-89671362 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-6661] Python type errors should print t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5361#issuecomment-89661772 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-6661] Python type errors should print t...
GitHub user 31z4 opened a pull request: https://github.com/apache/spark/pull/5361 [SPARK-6661] Python type errors should print type, not object You can merge this pull request into a Git repository by running: $ git pull https://github.com/31z4/spark spark-6661 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5361.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5361 commit f8a3ef83bdbf4dc0cf93a2002a720a74ab2eb47d Author: Elisey Zanko Date: 2015-04-04T20:39:25Z [SPARK-6661] Python type errors should print type, not object
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89639987 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29716/ Test FAILed.
[GitHub] spark pull request: [SPARK-5990] [MLLIB] Model import/export for I...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5270#issuecomment-89639900 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29715/ Test FAILed.
[GitHub] spark pull request: [SPARK-6602][Core] Replace direct use of Akka ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5268
[GitHub] spark pull request: [SPARK-6602][Core] Replace direct use of Akka ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/5268#issuecomment-89639203 Merging this in master. Thanks.
[GitHub] spark pull request: [SPARK-6264] [MLLIB] Support FPGrowth algorith...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5213#issuecomment-89633708 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29713/ Test PASSed.
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-89633239 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29712/ Test PASSed.
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89632303 Hmm, I see. I will definitely go through these PRs. Anyway, fixed the whitespace problem here.
[GitHub] spark pull request: [SPARK-975][CORE] Visual debugger of stages an...
Github user wbraik commented on the pull request: https://github.com/apache/spark/pull/2077#issuecomment-89632207 Does anyone have a good example of an application which produces multiple (different) jobs that we could use to test this on?
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89631527 Ah, I'm also considering similar optimizations for Spark 1.4 :) The tricky part here is that, when scanning the Parquet table, Spark needs to call `ParquetInputFormat.getSplits` to compute (Spark) partition information. This `getSplits` call can be super expensive as it needs to read footers of all Parquet part-files to compute the Parquet splits. And that's why `ParquetRelation2` caches those footers at the very beginning and injects them into an extended Parquet input format. With all these footers cached, `ParquetRelation2.readSchema()` is actually quite lightweight. So the real bottleneck is reading all those footers. Fortunately, Parquet is also trying to avoid reading footers entirely at the driver side (see https://github.com/apache/incubator-parquet-mr/pull/91 and https://github.com/apache/incubator-parquet-mr/pull/45). After upgrading to Parquet 1.6, which is expected to be released next week, we can do this properly for better performance. So ideally, we don't read footers on the driver side, and when we have a central arbitrative schema at hand, either from the metastore or data source DDL, we don't do schema merging at the driver side either. I haven't had time to walk through all related Parquet code paths and PRs yet, so the above statements may be inaccurate. Please correct me if you find any mistakes.
[GitHub] spark pull request: [SPARK-6602][Core] Replace direct use of Akka ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5268#issuecomment-89629603 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29711/ Test PASSed.
[GitHub] spark pull request: [SPARK-2883][SQL] Spark Support for ORCFile fo...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5275#issuecomment-89623421 @zhzhan I'm right now designing partitioning support for the data sources API, and will hopefully make the design doc next week. Will come back to this PR after that. With that part at hand, I believe we can further simplify the ORC data source.
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89624702 ok to test
[GitHub] spark pull request: [WIP][SQL][SPARK-6632]: Read schema from each ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5298#issuecomment-89624832 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29714/ Test FAILed.
[GitHub] spark pull request: [SPARK-5325] [SQL] Shrink the Hive shim layer
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/4107
[GitHub] spark pull request: [SPARK-5325] [SQL] Shrink the Hive shim layer
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/4107#issuecomment-89621672 Yeah, agreed. Closing this. Though the `callWithAlternatives` utility function can be very neat for simple lightweight reflection tricks.
[GitHub] spark pull request: [SPARK-6201] [SQL] promote string and do widen...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/4945#issuecomment-89621194 The thing that makes me hesitant here is whether we should stick to Hive, because Hive's behavior is actually error-prone and unintuitive. In Hive, `IN` is implemented as a UDF, and function argument type coercion rules apply here. Take `"1.00" IN (1.0, 2.0)` as an example: `"1.00"`, `1.0`, and `2.0` are all arguments of `GenericUDFIn`. When doing type coercion, `1.0` and `2.0` are first converted to the strings `"1.0"` and `"2.0"`, and then compared with `"1.00"`, which thus returns false. Personally I think maybe we should just throw an exception if the left side of `IN` has different data types from the right side.
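A plain Scala illustration of the coercion described above; this is not Hive's implementation, just the string comparison the example reduces to:

```scala
// Coerce the numeric literals to strings, as GenericUDFIn's argument
// coercion does in the example above, then compare the strings.
val coerced = Seq(1.0, 2.0).map(_.toString)   // Seq("1.0", "2.0")
println(coerced.contains("1.00"))             // false: "1.00" is not the string "1.0"
```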
[GitHub] spark pull request: [SQL] [WIP] Blacklists several Hive 0.13.1 spe...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/4851#issuecomment-89615036 No. With the metastore adapter layer, we can always keep our tests consistent with the most recent Hive version.
[GitHub] spark pull request: [SPARK-5684][SQL]: Pass in partition name alon...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4469#issuecomment-89613886 Hi @marmbrus, this is a pretty common scenario in production, where the data is generated in some directory and partitions are then added to tables using `ALTER TABLE <table> ADD PARTITION (<column>=<value>) LOCATION '<path>'`. In the old Parquet path in v1.2.1, this is not possible. It is doable in the new Parquet path in Spark 1.3, though.
[GitHub] spark pull request: [SPARK-6694][SQL]SparkSQL CLI must be able to ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5345#issuecomment-89611702 @adachij2002 Would you mind adding a test case for this in `CliSuite`? We can pass `--database` via `extraArgs` in `runCliWithin` there.
[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5348#discussion_r27770121 --- Diff: docs/sql-programming-guide.md --- @@ -1034,6 +1034,79 @@ df3.printSchema() +### Hive metastore Parquet table conversion + +When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own +Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the +`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default. + + Hive/Parquet Schema Reconciliation + +There are two key differences between Hive and Parquet from the perspective of table schema +processing. + +1. Hive is case insensitive, while Parquet is not +1. Hive considers all columns nullable, while nullability in Parquet is significant + +Due to this reason, we must reconcile Hive metastore schema with Parquet schema when converting a +Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are: + +1. Fields that have the same name in both schema must have the same data type regardless of + nullability. The reconciled field should have the data type of the Parquet side, so that + nullability is respected. + +1. The reconciled schema contains exactly those fields defined in Hive metastore schema. + + - Any fields that only appear in the Parquet schema are dropped in the reconciled schema. + - Any fileds that only appear in the Hive metastore schema are added as nullable field in the + reconciled schema. + + Metadata Refreshing + +Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table --- End diff -- Agree, missing such a section is part of the reason why I put the metadata refreshing section here...
[GitHub] spark pull request: [SPARK-6696] [SQL] Adds HiveContext.refreshTab...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5349#issuecomment-89610598 We need a properly configured Hive environment to run the test. I can add a simple `TestHive`-like class to do metastore / warehouse configurations though.
[GitHub] spark pull request: [SPARK-6607][SQL] Check invalid characters for...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5263
[GitHub] spark pull request: [SPARK-6607][SQL] Check invalid characters for...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5263#issuecomment-89608688 Thanks for working on this! Merging to master.
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-89604947 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29709/ Test FAILed.
[GitHub] spark pull request: [SPARK-6602][Core] Replace direct use of Akka ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5268#issuecomment-89604923 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29710/ Test FAILed.
[GitHub] spark pull request: [SPARK-6006][SQL]: Optimize count distinct for...
Github user saucam commented on the pull request: https://github.com/apache/spark/pull/4764#issuecomment-89604768 Fixed the test case of zero count when there is no data. Rebased with the latest master. Please retest.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27769592 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala --- @@ -167,23 +169,66 @@ object MovieLensALS { .setProductBlocks(params.numProductBlocks) .run(training) -val rmse = computeRmse(model, test, params.implicitPrefs) - -println(s"Test RMSE = $rmse.") +params.metrics match { + case "rmse" => +val rmse = computeRmse(model, test, params.implicitPrefs) +println(s"Test RMSE = $rmse") + case "map" => +val (map, users) = computeRankingMetrics(model, training, test, numMovies.toInt) +println(s"Test users $users MAP $map") + case _ => println(s"Metrics not defined, options are rmse/map") +} sc.stop() } /** Compute RMSE (Root Mean Squared Error). */ - def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) -: Double = { -def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r - + def computeRmse( +model: MatrixFactorizationModel, +data: RDD[Rating], +implicitPrefs: Boolean) : Double = { val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product))) -val predictionsAndRatings = predictions.map{ x => - ((x.user, x.product), mapPredictedRating(x.rating)) +val predictionsAndRatings = predictions.map { x => + ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs)) }.join(data.map(x => ((x.user, x.product), x.rating))).values math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean()) } + + def mapPredictedRating(r: Double, implicitPrefs: Boolean) = { +if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r + } + + /** Compute MAP (Mean Average Precision) statistics for top N product Recommendation */ + def computeRankingMetrics( +model: MatrixFactorizationModel, +train: RDD[Rating], +test: RDD[Rating], +n: Int) : (Double, Long) = { +val ord = Ordering.by[(Int, Double), Double](x => x._2) + +val testUserLabels = test.map { --- End diff -- I will update with topByKey... Is there a better place to move this function? Maybe inside the ALS object, for example? That way I can add a test case to guard it.
[GitHub] spark pull request: [SQL] Use path.makeQualified in newParquet.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5353
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-89594722 @mengxr any insight on it? The runtime issue is only in the first iteration, and I think you can point out if there is any obvious issue in the way I call the solver... looks like something to do with initialization...
[GitHub] spark pull request: [SQL] Use path.makeQualified in newParquet.
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5353#issuecomment-89594648 LGTM, merging to master and branch-1.3.
[GitHub] spark pull request: [SPARK-6262][MLLIB]Implement missing methods f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5359#issuecomment-89591756 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29708/ Test PASSed.
[GitHub] spark pull request: [SPARKR-92] Phase 2: implement sum(rdd)
Github user hqzizania closed the pull request at: https://github.com/apache/spark/pull/5360
[GitHub] spark pull request: [SPARKR-92] Phase 2: implement sum(rdd)
GitHub user hqzizania opened a pull request: https://github.com/apache/spark/pull/5360 [SPARKR-92] Phase 2: implement sum(rdd) You can merge this pull request into a Git repository by running: $ git pull https://github.com/hqzizania/spark R3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5360.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5360 commit 7afa4c9d31fc3a7e9676a75ac51e0983708ccb1a Author: Shivaram Venkataraman Date: 2015-03-01T22:44:59Z Merge pull request #186 from hlin09/funcDep3 [SPARKR-142][SPARKR-196] (Step 2) Replaces getDependencies() with cleanClosure to capture UDF closures and serialize them to worker. commit 6e51c7ff25388bcf05776fa1ee353401b31b9443 Author: Shivaram Venkataraman Date: 2015-03-01T23:00:24Z Fix stderr redirection on executors commit 8c4deaedc570c2753a2103d59aba20178d9ef777 Author: Shivaram Venkataraman Date: 2015-03-01T23:06:29Z Remove unused function commit f7caeb84321f04291214f17a7a6606cb3a0ddee8 Author: Davies Liu Date: 2015-03-01T23:11:37Z Update SparkRBackend.scala commit b457833ea90575fb11840a18ff616f2d94be2aeb Author: Shivaram Venkataraman Date: 2015-03-01T23:15:05Z Merge pull request #189 from shivaram/stdErrFix Fix stderr redirection on executors commit 862f07c337705337ca8719485e6fe301a711bac7 Author: Shivaram Venkataraman Date: 2015-03-01T23:20:35Z Merge pull request #190 from shivaram/SPARKR-79 [SPARKR-79] Remove unused function commit 773baf064c923d3f44ea8fdbb5d2f36194245040 Author: Zongheng Yang Date: 2015-03-02T00:35:23Z Merge pull request #178 from davies/random [SPARKR-204] use random port in backend commit 5c0bb24bd77a6e1ed4474144f14b6458cdd2c157 Author: Felix Cheung Date: 2015-03-02T06:20:41Z Doc updates: build and running on YARN commit 8caf5bb81b027aa9e0dc4c3e9d95028d7865e0b9 Author: Davies Liu Date: 2015-03-02T19:34:10Z use S4 methods commit 7dfe27d06baf5bb00e679ea6a1bb7472295307d4 Author: Davies Liu Date: 2015-03-02T20:24:19Z fix cyclic namespace dependency commit d7b17a428c27aac28d89e1c85f1ba7d9d4b021d2 Author: Davies Liu Date: 2015-03-02T21:07:44Z fix approxCountDistinct commit acae5272f0d3c6e853d767ec489e64999306db0f Author: Davies Liu Date: 2015-03-02T21:18:46Z refactor commit 8ec21af07caea512cc90c66010d3b7b2dc0fc6e3 Author: Davies Liu Date: 2015-03-02T21:40:34Z fix signature commit 71d66a1f75f846c77a6e0ece4c40c6d5d5019c06 Author: Davies Liu Date: 2015-03-02T21:47:44Z fix first(0 commit e9983566f93304f2f5624613aedadd1e9d9a5069 Author: cafreeman Date: 2015-03-02T22:00:29Z define generic for 'first' in RDD API commit f585929cc9edabb3098ed4460eac01237a500e6a Author: cafreeman Date: 2015-03-02T22:02:35Z Fix brackets commit 1955a09f83a269d84139891bc29b41d0bcb9a1ae Author: cafreeman Date: 2015-03-02T23:50:12Z return object instead of a list of one object commit 76cf2e0ded37175550362ea7474dc9f6866b337b Author: Shivaram Venkataraman Date: 2015-03-03T00:02:26Z Merge pull request #192 from cafreeman/sparkr-sql define generic for 'first' in RDD API commit 03402ebdef99be680c4d0c9c475fd08702d3eb9e Author: Felix Cheung Date: 2015-03-03T00:17:17Z Updates as per feedback on sparkR-submit commit 1d0f2ae2097f0838d8c079b0bbcf89fe9805509f Author: Davies Liu Date: 2015-03-03T00:42:34Z Update DataFrame.R commit f798402e5ae02853f0477369273c478f7090700a Author: Davies Liu Date: 2015-03-03T00:43:01Z Update column.R commit 524c122b0b91ccd73a1eddce465a063d76bd3c47 Author: Davies Liu Date: 
2015-03-03T00:44:47Z Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column commit 8a676b19475fcafbad925b7ee7fe91ea68e3f3a5 Author: Shivaram Venkataraman Date: 2015-03-03T00:59:46Z Merge pull request #188 from davies/column [SPARKR-189] [SPARKR-190] Column and expression commit 06cbc2d233e6c0da062d0984e7cb95d3d9a5a1a1 Author: Davies Liu Date: 2015-03-03T01:26:14Z launch R worker by a daemon commit 3beadcf9d5ea3db893e469407d2723cfbe6687ef Author: Davies Liu Date: 2015-03-03T01:39:06Z Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api Conflicts: pkg/R/RDD.R commit e2d144a798f8ef293467ed8a3eb20b6cf77dcb56 Author: Felix Cheung Date: 2015-03-03T01:52:10Z Fixed small typos commit 98cc97a7c94a61f290207e4a8481ae97203014c7 Author: Davies Liu Date: 2015-03-03T02:01:55Z fix test and docs commit 39c253d97224d41abeee52ec486aaed57af270eb Author: Davies Liu Date: 2015-03-03T02:05:19Z Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group Conflicts: pkg/NAMESPACE pkg/R/DataFrame.R pkg/R/utils.R
[GitHub] spark pull request: Implement missing methods for MultivariateStat...
GitHub user Lewuathe opened a pull request: https://github.com/apache/spark/pull/5359

Implement missing methods for MultivariateStatisticalSummary

Add below methods in pyspark for MultivariateStatisticalSummary
- normL1
- normL2

You can merge this pull request into a Git repository by running: $ git pull https://github.com/Lewuathe/spark SPARK-6262 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5359.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5359 commit cbe439e4703bf5c2550d38b06cc4eada5bef6484 Author: lewuathe Date: 2015-04-04T13:34:34Z Implement missing methods for MultivariateStatisticalSummary --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
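The Python wrappers proposed here expose accessors that already exist on the Scala MultivariateStatisticalSummary. A minimal Scala usage sketch of the two norms follows; the toy data and local master are assumptions for illustration, not part of the PR.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

object SummaryNorms {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("norms").setMaster("local[*]"))
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(3.0, 4.0)))
    val summary = Statistics.colStats(data)
    // Column-wise L1 norm (sum of absolute values): [4.0, 6.0]
    println(summary.normL1)
    // Column-wise L2 (Euclidean) norm: [sqrt(10), sqrt(20)]
    println(summary.normL2)
    sc.stop()
  }
}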
[GitHub] spark pull request: [SQL] [WIP] Tries to skip row groups when read...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/5334#discussion_r27768822

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
@@ -226,7 +224,7 @@ private[sql] case class ParquetRelation2(
   private var commonMetadataStatuses: Array[FileStatus] = _

   // Parquet footer cache.
-  var footers: Map[FileStatus, Footer] = _
+  var footers: Map[Path, Footer] = _
--- End diff --

`FileStatus` objects are also cached, so this should be OK. Bounding the size can be a good idea.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
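One way to bound a footer cache keyed by Path is an access-ordered LinkedHashMap that evicts its eldest entry. The sketch below is illustrative only: BoundedCache is a made-up name, not Spark's implementation, and the generic key/value types stand in for org.apache.hadoop.fs.Path and Parquet's Footer.

import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.Entry

// Size-bounded, access-ordered cache: the least recently used entry is
// evicted once the map grows beyond maxEntries.
class BoundedCache[K, V](maxEntries: Int) {
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
    override def removeEldestEntry(eldest: Entry[K, V]): Boolean = size() > maxEntries
  }

  def get(key: K): Option[V] = Option(underlying.get(key))
  def put(key: K, value: V): Unit = { underlying.put(key, value); () }
}

// Hypothetical usage, with Path and Footer coming from Hadoop / Parquet:
//   val footers = new BoundedCache[Path, Footer](maxEntries = 1000)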
[GitHub] spark pull request: [SPARK-5338][MESOS] Add cluster mode support f...
Github user dragos commented on the pull request: https://github.com/apache/spark/pull/5144#issuecomment-89547152 I still want to run this on a local cluster before I say LGTM, but the code looks good so far! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5338][MESOS] Add cluster mode support f...
Github user dragos commented on a diff in the pull request: https://github.com/apache/spark/pull/5144#discussion_r27768214

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/DriverQueue.scala ---
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler.cluster.mesos
+
+import scala.collection.mutable
+
+import org.apache.spark.deploy.mesos.MesosDriverDescription
+
+/**
+ * A request queue for launching drivers in Mesos cluster mode.
+ * This queue automatically stores the state after each pop/push
+ * so it can be recovered later.
+ * This queue is also bounded and rejects offers when it's full.
+ * @param state Mesos state abstraction to fetch persistent state.
+ */
+private[mesos] class DriverQueue(state: MesosClusterPersistenceEngine, capacity: Int) {
+  var queue: mutable.Queue[MesosDriverDescription] = new mutable.Queue[MesosDriverDescription]()
+  private var count = 0
+
+  initialize()
+
+  def initialize(): Unit = {
+    state.fetchAll[MesosDriverDescription]().foreach(d => queue.enqueue(d))
+
+    // This size might be larger than the passed in capacity, but we allow
+    // this so we don't lose queued drivers.
+    count = queue.size
+  }
+
+  def isFull = count >= capacity
+
+  def size: Int = count
+
+  def contains(submissionId: String): Boolean = {
+    queue.exists(s => s.submissionId.equals(submissionId))
--- End diff --

You are right, I missed the fact that the queue isn't storing `submissionId`s directly. Ignore this :)

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
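The exchange above turns on the fact that contains() scans the queued MesosDriverDescription objects and compares their submissionId field rather than keeping a separate set of ids. A tiny self-contained Scala analogue of that membership check follows; SimpleDescription is a stand-in type, not the real MesosDriverDescription.

import scala.collection.mutable

// Stand-in for MesosDriverDescription, keeping only the field used by contains().
case class SimpleDescription(submissionId: String)

object ContainsSketch extends App {
  val queue = mutable.Queue(SimpleDescription("driver-001"), SimpleDescription("driver-002"))

  // Membership is decided by scanning the queued descriptions (an O(n) check),
  // mirroring DriverQueue.contains in the diff above.
  def contains(id: String): Boolean = queue.exists(_.submissionId == id)

  println(contains("driver-001")) // true
  println(contains("driver-042")) // false
}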
[GitHub] spark pull request: [SQL] SPARK-6489: Optimize lateral view with e...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5358#issuecomment-89536841 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [ SQL ] SparkSPARK-6489: Optimize lateral view...
Github user dreamquster closed the pull request at: https://github.com/apache/spark/pull/5346 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [ SQL ] SparkSPARK-6489: Optimize lateral view...
Github user dreamquster commented on the pull request: https://github.com/apache/spark/pull/5346#issuecomment-89536605 OK, I split it into two pull requests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL] SPARK-6489: Optimize lateral view with e...
GitHub user dreamquster opened a pull request: https://github.com/apache/spark/pull/5358 [SQL] SPARK-6489: Optimize lateral view with explode to not read unnecessary columns. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dreamquster/spark spark-explode-optimize Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5358.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5358 commit 1b29835b2beaba53f6cec3c02680002ad89802f5 Author: dreamquster Date: 2015-04-04T08:42:47Z [SQL] SPARK-6489: Optimize lateral view with explode to not read unnecessary columns commit 376d332462a3e2a21a28ecdeab14a3bd1f49ffbf Author: dreamquster Date: 2015-04-04T09:14:04Z [SQL] SPARK-6489: adding test files --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
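The optimization targets queries where a wide table is exploded but only the generated column is selected; without column pruning, the scan still reads every column of the table. A hedged Scala sketch of the query shape follows: wide_table and items are made-up names, and a Hive-enabled Spark build is assumed for the LATERAL VIEW syntax. This illustrates the case the PR optimizes, not the patch itself.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object LateralViewSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lateral-view").setMaster("local[*]"))
    val hiveContext = new HiveContext(sc)
    // Only `item` (derived from `items`) is selected, so ideally the scan of
    // wide_table should not read its remaining columns.
    hiveContext.sql(
      """SELECT item
        |FROM wide_table
        |LATERAL VIEW explode(items) exploded AS item
      """.stripMargin).show()
    sc.stop()
  }
}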
[GitHub] spark pull request: [SPARK-6130] [SQL] support if not exists for i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4865#issuecomment-89527734 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29706/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL] SPARK-6548: Adding stddev to DataFrame f...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5357#issuecomment-89527728 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SQL] SPARK-6548: Adding stddev to DataFrame f...
GitHub user dreamquster opened a pull request: https://github.com/apache/spark/pull/5357 [SQL] SPARK-6548: Adding stddev to DataFrame functions remerge SPARK-6548 https://github.com/apache/spark/pull/5228 You can merge this pull request into a Git repository by running: $ git pull https://github.com/dreamquster/spark spark-stddev Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5357.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5357 commit 58dec463b8280b425b5534bdeb28b013ae02eec4 Author: dreamquster Date: 2015-04-04T08:13:56Z [SQL] SPARK-6548: Adding stddev to DataFrame functions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
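As background for the aggregate this PR adds: standard deviation can already be derived from existing DataFrame aggregates via sqrt(E[x^2] - E[x]^2). The Scala sketch below only illustrates that computation; it is not the API added by the PR, and the sample data and local master are assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object StddevSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stddev-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0))
      .map(Tuple1.apply)
      .toDF("x")

    // Population standard deviation: sqrt(E[x^2] - E[x]^2)
    val row = df.selectExpr("avg(x) AS mean", "avg(x * x) AS meanSq").first()
    val mean = row.getDouble(0)
    val stddev = math.sqrt(row.getDouble(1) - mean * mean)
    println(stddev) // 2.0 for this sample
    sc.stop()
  }
}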
[GitHub] spark pull request: [SPARK-6705][MLLIB] Add fit intercept api to m...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5301#issuecomment-89516868 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29707/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6705][MLLIB] Add fit intercept api to m...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/5301#issuecomment-89516426 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6638] [SQL] Improve performance of Stri...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5350#issuecomment-89514738 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29703/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Bug fix for SPARK-5242: "ec2/spark_ec2.py lauc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4038#issuecomment-89514490 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29705/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org