[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19218 What are multiple expressions? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/19218 @gatorsmile I had tested manually. When table-level compression is not configured, it always takes the session-level compression and ignores the compression of the existing files. That seems like a bug; however, table files with multiple compression codecs do not affect reading and writing. Is it ok to add a test to check reading and writing when there are multiple compression codecs in the existing table files? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
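A minimal sketch of the test idea above, assuming a running `SparkSession` named `spark`; the path and row counts are illustrative, not taken from the PR. Parquet records the compression codec in each file's footer, so a table directory whose existing files were written with different codecs should still read back correctly:

```scala
// Hedged sketch only: mixed-codec files in one Parquet table directory.
val path = "/tmp/parquet_mixed_codecs"

// Existing files written with one codec...
spark.range(0, 10).write.option("compression", "snappy").parquet(path)
// ...and new files appended with a different codec into the same directory.
spark.range(10, 20).write.mode("append").option("compression", "gzip").parquet(path)

// Reading should still work, because the codec is stored per file.
assert(spark.read.parquet(path).count() == 20)
```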
[GitHub] spark pull request #19984: [SPARK-22789] Map-only continuous processing exec...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19984 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/19984 Thanks! Merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19984 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19984 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85332/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19984 **[Test build #85332 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85332/testReport)** for PR 19984 at commit [`b4f7976`](https://github.com/apache/spark/commit/b4f79762c083735011bf98250c39c263876c8cc8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20034: [SPARK-22846][SQL] Fix table owner is null when c...
Github user BruceXu1991 commented on a diff in the pull request: https://github.com/apache/spark/pull/20034#discussion_r158577749

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala ---
@@ -186,7 +186,7 @@ private[hive] class HiveClientImpl(
   /** Returns the configuration for the current session. */
   def conf: HiveConf = state.getConf

-  private val userName = state.getAuthenticator.getUserName
+  private val userName = conf.getUser
--- End diff --

Well, with Spark 2.2.1's current implementation
```scala
private val userName = state.getAuthenticator.getUserName
```
when the implementation of state.getAuthenticator is **HadoopDefaultAuthenticator**, which is the default in the Hive conf, the user name is obtained. However, when the implementation of state.getAuthenticator is **SessionStateUserAuthenticator**, which is what is used in my case, the user name will be null. The simplified code below explains the reason:

1) HadoopDefaultAuthenticator
```java
public class HadoopDefaultAuthenticator implements HiveAuthenticationProvider {
  @Override
  public String getUserName() {
    return userName;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    UserGroupInformation ugi = null;
    try {
      ugi = Utils.getUGI();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    this.userName = ugi.getShortUserName();
    if (ugi.getGroupNames() != null) {
      this.groupNames = Arrays.asList(ugi.getGroupNames());
    }
  }
}

public class Utils {
  public static UserGroupInformation getUGI() throws LoginException, IOException {
    String doAs = System.getenv("HADOOP_USER_NAME");
    if (doAs != null && doAs.length() > 0) {
      return UserGroupInformation.createProxyUser(doAs, UserGroupInformation.getLoginUser());
    }
    return UserGroupInformation.getCurrentUser();
  }
}
```
This shows that HadoopDefaultAuthenticator obtains the user name through Utils.getUGI(), so the user name is the HADOOP_USER_NAME of the login user.

2) SessionStateUserAuthenticator
```java
public class SessionStateUserAuthenticator implements HiveAuthenticationProvider {
  @Override
  public void setConf(Configuration arg0) {
  }

  @Override
  public String getUserName() {
    return sessionState.getUserName();
  }
}
```
This shows that SessionStateUserAuthenticator obtains the user name through sessionState.getUserName(), which is null. Here is the [instantiation of SessionState in HiveClientImpl](https://github.com/apache/spark/blob/1cf3e3a26961d306eb17b7629d8742a4df45f339/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L187)

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
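For illustration, a hedged sketch of how the two sources of the user name discussed above could be combined inside `HiveClientImpl`; this is not the PR's actual change (which simply switches to `conf.getUser`), and the helper name is hypothetical. `HiveAuthenticationProvider` is the interface both authenticators implement:

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider

// Sketch only: prefer the authenticator's user name (works for HadoopDefaultAuthenticator),
// but fall back to the conf user when the configured authenticator,
// e.g. SessionStateUserAuthenticator, returns null.
def resolveUserName(authenticator: HiveAuthenticationProvider, conf: HiveConf): String =
  Option(authenticator.getUserName).getOrElse(conf.getUser)
```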
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85331/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85331 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85331/testReport)** for PR 20059 at commit [`f23bf0f`](https://github.com/apache/spark/commit/f23bf0fdf21f21224895a5c35e0d95956a29abf9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20061 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20061 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85330/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20061 **[Test build #85330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85330/testReport)** for PR 20061 at commit [`e8e4d11`](https://github.com/apache/spark/commit/e8e4d11a504c4169848baeabbec84af2a1b3e6a8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19977 might have...I'll check the performance. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20004: [Spark-22818][SQL] csv escape of quote escape
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20004#discussion_r158576836

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -148,6 +149,9 @@ class CSVOptions(
     format.setDelimiter(delimiter)
     format.setQuote(quote)
     format.setQuoteEscape(escape)
+    if (charToEscapeQuoteEscaping != '\u0000') {
--- End diff --

Because we always call `format.setQuoteEscape(escape)`, this default value becomes wrong. That means, if users set `\u0000`, the actual quote escape in `univocity` is `\`. Shall we use `Option[Char]` as the type of `charToEscapeQuoteEscaping`, and call `format.setCharToEscapeQuoteEscaping` only if it is defined?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
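A hedged sketch of the `Option[Char]` approach suggested above, not the final patch; the helper name is hypothetical, and `parameters` / `escape` stand in for the `CSVOptions` members shown in the diff. It only overrides univocity's default when the user actually set the option:

```scala
import com.univocity.parsers.csv.CsvWriterSettings

// Sketch only: apply charToEscapeQuoteEscaping to the univocity format
// only when the option was explicitly provided.
def configureQuoteEscaping(
    settings: CsvWriterSettings,
    parameters: Map[String, String],
    escape: Char): Unit = {
  val charToEscapeQuoteEscaping: Option[Char] =
    parameters.get("charToEscapeQuoteEscaping").map { v =>
      require(v.length == 1, "charToEscapeQuoteEscaping must be a single character")
      v.charAt(0)
    }

  settings.getFormat.setQuoteEscape(escape)
  // Leave univocity's own default in place unless the user set the option.
  charToEscapeQuoteEscaping.foreach(c => settings.getFormat.setCharToEscapeQuoteEscaping(c))
}
```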
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20004 Thank you for your investigation! Basically, because the option `escape` is set to `\`, the default value of `charToEscapeQuoteEscaping` is actually `\` in effect. Could you update the doc? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user danielvdende commented on the issue: https://github.com/apache/spark/pull/20057 @gatorsmile could you explain why you have doubts about the feature? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19977 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85329/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19977 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19977 **[Test build #85329 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85329/testReport)** for PR 19977 at commit [`3ec440d`](https://github.com/apache/spark/commit/3ec440de914d8f71c35f30cdccec815b199b5f17). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20023

```
db2 => create table decimals_test(id int, a decimal(31,18), b decimal(31,18))
DB2I  The SQL command completed successfully.
db2 => insert into decimals_test values (1, 2.33, 1.123456789123456789)
DB2I  The SQL command completed successfully.
db2 => select a * b from decimals_test

1
-
SQL0802N  Arithmetic overflow or other arithmetic exception occurred.  SQLSTATE=22003
```

I might not get your point. Above is the result I got. Is this your scenario 3 or 2?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
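For context, a rough sketch of the arithmetic behind that DB2 error, under the usual SQL-style typing for decimal multiplication (result scale = s1 + s2); the exact DB2 typing rules may differ in detail, but DB2's maximum DECIMAL precision of 31 is the key constraint:

```scala
// Hedged sketch: DECIMAL(31,18) * DECIMAL(31,18) would need 36 fractional digits,
// which already exceeds DB2's maximum DECIMAL precision of 31, so DB2 raises
// SQL0802N rather than silently losing precision.
val (s1, s2) = (18, 18)
val requiredScale = s1 + s2        // 36
val db2MaxPrecision = 31
assert(requiredScale > db2MaxPrecision)  // hence the arithmetic exception
```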
[GitHub] spark issue #17619: [SPARK-19755][Mesos] Blacklist is always active for Meso...
Github user timout commented on the issue: https://github.com/apache/spark/pull/17619 That does exactly what it is supposed to do. And you are absolutely right, it is related to executors. I am sorry if that was not clear from my previous explanations. Let us say we have a Spark Streaming app, a very long-running app: the driver, started by Marathon using a Docker image, schedules (in the Mesos meaning) executors using Docker images (net=HOST), so every executor is started from a Docker image on some Mesos agent. If some recoverable error happens, for instance "ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Remote RPC client disassociated..." (I do not know about others, but it happens relatively often in my environment), the executor will be dead, and after 2 failures the Mesos agent node will be included in the MesosCoarseGrainedSchedulerBackend blacklist and the driver will never schedule (in the Mesos meaning) an executor on it again. So the app will starve... and, notably, will not die. That is exactly what happened with my streaming apps before this patch. The patch may be incompatible with master already, but I can fix it if needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19984: [SPARK-22789] Map-only continuous processing execution
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19984 **[Test build #85332 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85332/testReport)** for PR 19984 at commit [`b4f7976`](https://github.com/apache/spark/commit/b4f79762c083735011bf98250c39c263876c8cc8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19954 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85327/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19954 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19954 **[Test build #85327 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85327/testReport)** for PR 19954 at commit [`f82c568`](https://github.com/apache/spark/commit/f82c5682565bc0e1bc34ec428faedb53ee5ddecd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20011 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20011 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85326/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20011 **[Test build #85326 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85326/testReport)** for PR 20011 at commit [`931b2d2`](https://github.com/apache/spark/commit/931b2d262aa02880631ca4c693a84fa4c4d12318). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20057: [SPARK-22880][SQL] Add cascadeTruncate option to JDBC da...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20057 Thank you for your contribution! I have doubts about the value of this feature. If you are interested in JDBC-related work, could you take https://issues.apache.org/jira/browse/SPARK-22731? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85331 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85331/testReport)** for PR 20059 at commit [`f23bf0f`](https://github.com/apache/spark/commit/f23bf0fdf21f21224895a5c35e0d95956a29abf9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19977 Yes; otherwise, it will introduce a performance regression, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19218#discussion_r158575144

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
    * Acceptable values are defined in [[shortParquetCompressionCodecNames]].
    */
   val compressionCodecClassName: String = {
-    val codecName = parameters.getOrElse("compression",
-      sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+    // `compression`, `parquet.compression`(i.e., ParquetOutputFormat.COMPRESSION), and
+    // `spark.sql.parquet.compression.codec`
+    // are in order of precedence from highest to lowest.
+    val parquetCompressionConf = parameters.get(ParquetOutputFormat.COMPRESSION)
+    val codecName = parameters
+      .get("compression")
+      .orElse(parquetCompressionConf)
--- End diff --

Yeah, we can submit a separate PR for that issue. The behavior change needs to be documented in the Spark SQL doc.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
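To make the precedence in that comment concrete, here is a hedged usage sketch (the path and codec choices are illustrative, and a running `SparkSession` named `spark` is assumed) showing how the three settings interact when writing Parquet:

```scala
// Lowest precedence: the session-wide SQL conf.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

spark.range(100).write
  // Middle precedence: the Parquet-specific key (ParquetOutputFormat.COMPRESSION).
  .option("parquet.compression", "snappy")
  // Highest precedence: the generic data source option; this one wins.
  .option("compression", "none")
  .parquet("/tmp/parquet_compression_precedence")
```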
[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158575137 --- Diff: docs/running-on-kubernetes.md --- @@ -528,51 +579,90 @@ specific to Spark on Kubernetes. - spark.kubernetes.driver.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. - - - - spark.kubernetes.executor.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. - - - - spark.kubernetes.node.selector.[labelKey] - (none) - - Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the - configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier - will result in the driver pod and executors having a node selector with key identifier and value - myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. - - - - spark.kubernetes.driverEnv.[EnvironmentVariableName] - (none) - - Add the environment variable specified by EnvironmentVariableName to - the Driver process. The user can specify multiple of these to set multiple environment variables. - - - - spark.kubernetes.mountDependencies.jarsDownloadDir -/var/spark-data/spark-jars - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - - - spark.kubernetes.mountDependencies.filesDownloadDir - /var/spark-data/spark-files - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - + spark.kubernetes.driver.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. + + + + spark.kubernetes.executor.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. + + + + spark.kubernetes.node.selector.[labelKey] + (none) + +Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the +configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier +will result in the driver pod and executors having a node selector with key identifier and value + myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. + + + + spark.kubernetes.driverEnv.[EnvironmentVariableName] + (none) + +Add the environment variable specified by EnvironmentVariableName to +the Driver process. The user can specify multiple of these to set multiple environment variables. + + + + spark.kubernetes.mountDependencies.jarsDownloadDir + /var/spark-data/spark-jars + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. 
+ + + + spark.kubernetes.mountDependencies.filesDownloadDir + /var/spark-data/spark-files + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. + + + + spark.kubernetes.mountDependencies.timeout + 5 minutes + + Timeout before aborting the attempt to download and unpack dependencies from remote locations into the driver and executor pods. + + + + spark.kubernetes.mountDependencies.maxThreadPoolSize + 5 + + Maximum size of the thread pool for downloading remote dependencies into the driver and executor pods. --- End diff -- Done. --- - To unsubscribe, e-mail:
[GitHub] spark pull request #20059: [SPARK-22648][K8s] Add documentation covering ini...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158575135 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,57 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods +need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading +the dependencies so the driver and executor containers can use them locally. This requires users to specify the container +image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users +simply add the following option to the `spark-submit` command to specify the init-container image: + +``` +--conf spark.kubernetes.initContainer.image= +``` + +The init-container handles remote dependencies specified in `spark.jars` (or the `--jars` option of `spark-submit`) and +`spark.files` (or the `--files` option of `spark-submit`). It also handles remotely hosted main application resources, e.g., +the main application jar. The following shows an example of using remote dependencies with the `spark-submit` command: + +```bash +$ bin/spark-submit \ +--master k8s://https://: \ +--deploy-mode cluster \ +--name spark-pi \ +--class org.apache.spark.examples.SparkPi \ +--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar +--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2 +--conf spark.executor.instances=5 \ +--conf spark.kubernetes.driver.docker.image= \ +--conf spark.kubernetes.executor.docker.image= \ +--conf spark.kubernetes.initContainer.image= +https://path/to/examples.jar +``` + +## Secret Management +In some cases, a Spark application may need to use some credentials, e.g., for accessing data on a secured HDFS cluster --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20061: [SPARK-22890][TEST] Basic tests for DateTimeOperations
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20061 **[Test build #85330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85330/testReport)** for PR 20061 at commit [`e8e4d11`](https://github.com/apache/spark/commit/e8e4d11a504c4169848baeabbec84af2a1b3e6a8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20061: [SPARK-22890][TEST] Basic tests for DateTimeOpera...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20061 [SPARK-22890][TEST] Basic tests for DateTimeOperations

## What changes were proposed in this pull request?

Test coverage for `DateTimeOperations`; this is a sub-task of [SPARK-22722](https://issues.apache.org/jira/browse/SPARK-22722).

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-22890

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20061.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20061

commit 24b50f0c8371af258ed152363a9ba8148b23d2d2
Author: Yuming Wang
Date: 2017-12-23T02:39:39Z
Basic tests for DateTimeOperations

commit e8e4d11a504c4169848baeabbec84af2a1b3e6a8
Author: Yuming Wang
Date: 2017-12-23T02:53:40Z
Append a blank line

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user fjh100456 commented on a diff in the pull request: https://github.com/apache/spark/pull/19218#discussion_r158574986

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
    * Acceptable values are defined in [[shortParquetCompressionCodecNames]].
    */
   val compressionCodecClassName: String = {
-    val codecName = parameters.getOrElse("compression",
-      sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+    // `compression`, `parquet.compression`(i.e., ParquetOutputFormat.COMPRESSION), and
+    // `spark.sql.parquet.compression.codec`
+    // are in order of precedence from highest to lowest.
+    val parquetCompressionConf = parameters.get(ParquetOutputFormat.COMPRESSION)
+    val codecName = parameters
+      .get("compression")
+      .orElse(parquetCompressionConf)
--- End diff --

If so, parquet's table-level compression may be overwritten by this PR, and that may not be what we want. Shall I fix it first in another PR?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85325/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85325/testReport)** for PR 20059 at commit [`fbb2112`](https://github.com/apache/spark/commit/fbb21121447fe131358042ce05454f88d6fb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85323/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85323 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85323/testReport)** for PR 20059 at commit [`a6a6660`](https://github.com/apache/spark/commit/a6a666060787cadd832b5bd32281940a6c81a9a9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19954 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19954 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85324/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19954 **[Test build #85324 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85324/testReport)** for PR 19954 at commit [`785b90e`](https://github.com/apache/spark/commit/785b90e52580e9175896b22b00b23f30fbe020ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/20039#discussion_r158574199

--- Diff: core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -125,13 +128,39 @@ private[spark] class LiveListenerBus(conf: SparkConf) {
   /** Post an event to all queues. */
   def post(event: SparkListenerEvent): Unit = {
-    if (!stopped.get()) {
-      metrics.numEventsPosted.inc()
-      val it = queues.iterator()
-      while (it.hasNext()) {
-        it.next().post(event)
+    if (stopped.get()) {
+      return
+    }
+
+    metrics.numEventsPosted.inc()
+
+    // If the event buffer is null, it means the bus has been started and we can avoid
+    // synchronization and post events directly to the queues. This should be the most
+    // common case during the life of the bus.
+    if (queuedEvents == null) {
+      postToQueues(event)
--- End diff --

What if stop() is called after the null check and before the postToQueues() call? Do you think we should check stopped.get() in postToQueues()? Like:

```scala
private def postToQueues(event: SparkListenerEvent): Unit = {
  if (!stopped.get()) {
    val it = queues.iterator()
    while (it.hasNext()) {
      it.next().post(event)
    }
  }
}
```

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/20039#discussion_r158573853

--- Diff: core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -149,7 +158,11 @@ private[spark] class LiveListenerBus(conf: SparkConf) {
     }

     this.sparkContext = sc
-    queues.asScala.foreach(_.start(sc))
+    queues.asScala.foreach { q =>
+      q.start(sc)
+      queuedEvents.foreach(q.post)
--- End diff --

OK, **synchronized** can avoid this problem.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19982: [SPARK-22787] [TEST] [SQL] Add a TPC-H query suite
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19982 @gatorsmile Any progress on this? https://github.com/apache/spark/pull/19982#discussion_r157119941 After thinking about your comment, I came up with collecting metrics for each rule, like https://github.com/apache/spark/compare/master...maropu:MetricSpike Does this conflict with your activity, or is it not acceptable? Any comments welcome. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20011 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85320/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20011 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20011 **[Test build #85320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85320/testReport)** for PR 20011 at commit [`be3f130`](https://github.com/apache/spark/commit/be3f1307f0edffdc7c9457ec960781cd28b07bf8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/20039#discussion_r158573360

--- Diff: core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -149,7 +158,11 @@ private[spark] class LiveListenerBus(conf: SparkConf) {
     }

     this.sparkContext = sc
-    queues.asScala.foreach(_.start(sc))
+    queues.asScala.foreach { q =>
+      q.start(sc)
+      queuedEvents.foreach(q.post)
--- End diff --

What if stop() is called before all queuedEvents are posted to the AsyncEventQueues?

```
/**
 * Stop the listener bus. It will wait until the queued events have been processed, but new
 * events will be dropped.
 */
```

**(The "queued events" mentioned in the description above are not the same as the "queuedEvents" here.)** Since queuedEvents are "posted" before the listeners are installed, can they be treated as new events?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20060 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85328/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20060 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20060 **[Test build #85328 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85328/testReport)** for PR 20060 at commit [`6edbd71`](https://github.com/apache/spark/commit/6edbd710d0b73a7c4170dbb78eb42ace5a8c73a0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85321/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85321/testReport)** for PR 20059 at commit [`c0a659a`](https://github.com/apache/spark/commit/c0a659a3117674dfbd5c078badd653886c15cc8e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19929 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85322/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19929 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19929 **[Test build #85322 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85322/testReport)** for PR 19929 at commit [`cc309b0`](https://github.com/apache/spark/commit/cc309b0ce2496365afd8c602c282e3d84aeed940). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/19977 I found the optimizer rule can't combine nested concat, like:

```
scala> :paste
val df = sql("""
  SELECT ((col1 || col2) || (col3 || col4)) col
  FROM (
    SELECT
      encode(string(id), 'utf-8') col1,
      encode(string(id + 1), 'utf-8') col2,
      string(id + 2) col3,
      string(id + 3) col4
    FROM range(10)
  )
""")

scala> df.explain(true)
== Parsed Logical Plan ==
'Project [concat(concat('col1, 'col2), concat('col3, 'col4)) AS col#4]
+- 'SubqueryAlias __auto_generated_subquery_name
   +- 'Project ['encode('string('id), utf-8) AS col1#0, 'encode('string(('id + 1)), utf-8) AS col2#1, 'string(('id + 2)) AS col3#2, 'string(('id + 3)) AS col4#3]
      +- 'UnresolvedTableValuedFunction range, [10]

== Analyzed Logical Plan ==
col: string
Project [concat(cast(concat(col1#0, col2#1) as string), concat(col3#2, col4#3)) AS col#4]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [encode(cast(id#9L as string), utf-8) AS col1#0, encode(cast((id#9L + cast(1 as bigint)) as string), utf-8) AS col2#1, cast((id#9L + cast(2 as bigint)) as string) AS col3#2, cast((id#9L + cast(3 as bigint)) as string) AS col4#3]
      +- Range (0, 10, step=1, splits=None)

== Optimized Logical Plan ==
Project [concat(cast(concat(encode(cast(id#9L as string), utf-8), encode(cast((id#9L + 1) as string), utf-8)) as string), cast((id#9L + 2) as string), cast((id#9L + 3) as string)) AS col#4]
+- Range (0, 10, step=1, splits=None)

== Physical Plan ==
*Project [concat(cast(concat(encode(cast(id#9L as string), utf-8), encode(cast((id#9L + 1) as string), utf-8)) as string), cast((id#9L + 2) as string), cast((id#9L + 3) as string)) AS col#4]
+- *Range (0, 10, step=1, splits=4)
```

Do we need to support the optimization in this case, too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
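The inner `concat` in the plan above ends up wrapped in a `cast` to string (because its inputs are binary), which is why a rule that only looks for a direct `Concat` child cannot flatten it. For illustration, a minimal sketch of such a flattening rule, assuming the standard Catalyst `Concat` expression; this is not Spark's actual rule, and it deliberately ignores the intervening cast:

```scala
import org.apache.spark.sql.catalyst.expressions.Concat
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object FlattenNestedConcat extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    // Only fires when a child is literally another Concat; a Cast(Concat(...)) in
    // between, as in the plan above, blocks this simple pattern.
    case Concat(children) if children.exists(_.isInstanceOf[Concat]) =>
      Concat(children.flatMap {
        case Concat(nested) => nested
        case other => Seq(other)
      })
  }
}
```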
[GitHub] spark issue #19977: [SPARK-22771][SQL] Concatenate binary inputs into a bina...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19977 **[Test build #85329 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85329/testReport)** for PR 19977 at commit [`3ec440d`](https://github.com/apache/spark/commit/3ec440de914d8f71c35f30cdccec815b199b5f17). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20039: [SPARK-22850][core] Ensure queued events are deli...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/20039#discussion_r158572288

--- Diff: core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala ---
@@ -149,7 +158,11 @@ private[spark] class LiveListenerBus(conf: SparkConf) {
     }

     this.sparkContext = sc
-    queues.asScala.foreach(_.start(sc))
+    queues.asScala.foreach { q =>
+      q.start(sc)
+      queuedEvents.foreach(q.post)
--- End diff --

Yes, that's true, semantics only.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19977: [SPARK-22771][SQL] Concatenate binary inputs into...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/19977#discussion_r158572201 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2171,7 +2171,8 @@ object functions { def base64(e: Column): Column = withExpr { Base64(e.expr) } /** - * Concatenates multiple input string columns together into a single string column. + * Concatenates multiple input columns together into a single column. + * If all inputs are binary, concat returns an output as binary. Otherwise, it returns as string. --- End diff -- ok --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20053: [SPARK-22873] [CORE] Init lastReportTimestamp wit...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/20053#discussion_r158572161

--- Diff: core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala ---
@@ -112,6 +112,7 @@ private class AsyncEventQueue(val name: String, conf: SparkConf, metrics: LiveLi
   private[scheduler] def start(sc: SparkContext): Unit = {
     if (started.compareAndSet(false, true)) {
       this.sc = sc
+      lastReportTimestamp = System.currentTimeMillis()
--- End diff --

But there's no 'last time' for the 'first time'.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20060 **[Test build #85328 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85328/testReport)** for PR 20060 at commit [`6edbd71`](https://github.com/apache/spark/commit/6edbd710d0b73a7c4170dbb78eb42ace5a8c73a0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20060: [SPARK-22889][SPARKR] Set overwrite=T when install Spark...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/20060 cc @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20060: [SPARK-22889][SPARKR] Set overwrite=T when instal...
GitHub user shivaram opened a pull request: https://github.com/apache/spark/pull/20060 [SPARK-22889][SPARKR] Set overwrite=T when install SparkR in tests

## What changes were proposed in this pull request?

Since all CRAN checks go through the same machine, the tests fail if there is an older partial download or partial install of Spark left behind. This PR overwrites the install files when running tests. This shouldn't affect Jenkins, as `SPARK_HOME` is set when running Jenkins tests.

## How was this patch tested?

Tested manually by running `R CMD check --as-cran`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shivaram/spark-1 sparkr-overwrite-cran

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20060.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20060

commit 6edbd710d0b73a7c4170dbb78eb42ace5a8c73a0
Author: Shivaram Venkataraman
Date: 2017-12-23T00:27:56Z
Set overwrite=T when install SparkR in tests

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19218#discussion_r158571702

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala ---
@@ -42,8 +43,15 @@ private[parquet] class ParquetOptions(
    * Acceptable values are defined in [[shortParquetCompressionCodecNames]].
    */
   val compressionCodecClassName: String = {
-    val codecName = parameters.getOrElse("compression",
-      sqlConf.parquetCompressionCodec).toLowerCase(Locale.ROOT)
+    // `compression`, `parquet.compression`(i.e., ParquetOutputFormat.COMPRESSION), and
+    // `spark.sql.parquet.compression.codec`
+    // are in order of precedence from highest to lowest.
+    val parquetCompressionConf = parameters.get(ParquetOutputFormat.COMPRESSION)
+    val codecName = parameters
+      .get("compression")
+      .orElse(parquetCompressionConf)
--- End diff --

Could we keep the old behavior? We could add it later. We do not want to mix multiple issues in the same PR.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19218#discussion_r158571657 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala --- @@ -102,4 +111,18 @@ object HiveOptions { "collectionDelim" -> "colelction.delim", "mapkeyDelim" -> "mapkey.delim", "lineDelim" -> "line.delim").map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v } + + def getHiveWriteCompression(tableInfo: TableDesc, sqlConf: SQLConf): Option[(String, String)] = { +tableInfo.getOutputFileFormatClassName.toLowerCase match { + case formatName if formatName.endsWith("parquetoutputformat") => +val compressionCodec = new ParquetOptions(tableInfo.getProperties.asScala.toMap, + sqlConf).compressionCodecClassName +Option((ParquetOutputFormat.COMPRESSION, compressionCodec)) + case formatName if formatName.endsWith("orcoutputformat") => +val compressionCodec = new OrcOptions(tableInfo.getProperties.asScala.toMap, + sqlConf).compressionCodec --- End diff -- Yeah. Just to make it consistent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19218#discussion_r158571649 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala --- @@ -35,7 +39,7 @@ case class TestData(key: Int, value: String) case class ThreeCloumntable(key: Int, value: String, key1: String) class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter -with SQLTestUtils { +with ParquetTest { --- End diff -- Fine to me. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158571445

--- Diff: python/pyspark/sql/functions.py ---
@@ -2075,9 +2075,10 @@ class PandasUDFType(object):
 def udf(f=None, returnType=StringType()):
--- End diff --

Do we need to just add a parameter for deterministic? Is adding it to the end OK for PySpark without breaking existing apps? cc @ueshin

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158571371 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -58,6 +58,7 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends | pythonIncludes: ${udf.func.pythonIncludes} | pythonExec: ${udf.func.pythonExec} | dataType: ${udf.dataType} --- End diff -- Could you also print out `pythonEvalType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158571355 --- Diff: python/pyspark/sql/udf.py --- @@ -157,5 +158,13 @@ def wrapper(*args): wrapper.func = self.func wrapper.returnType = self.returnType wrapper.evalType = self.evalType +wrapper.asNondeterministic = self.asNondeterministic return wrapper + +def asNondeterministic(self): +""" +Updates UserDefinedFunction to nondeterministic. --- End diff -- ``` """ Updates UserDefinedFunction to nondeterministic. .. versionadded:: 2.3 """ ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19929: [SPARK-22629][PYTHON] Add deterministic flag to p...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19929#discussion_r158571320 --- Diff: python/pyspark/sql/tests.py --- @@ -434,6 +434,16 @@ def test_udf_with_array_type(self): self.assertEqual(list(range(3)), l1) self.assertEqual(1, l2) +def test_nondeterministic_udf(self): +from pyspark.sql.functions import udf +import random +udf_random_col = udf(lambda: int(100 * random.random()), IntegerType()).asNondeterministic() +df = self.spark.createDataFrame([Row(1)]).select(udf_random_col().alias('RAND')) +random.seed(1234) +udf_add_ten = udf(lambda rand: rand + 10, IntegerType()) +[row] = df.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).collect() +self.assertEqual(row[0] + 10, row[1]) --- End diff -- Compare the values, since you already set the seed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19954 **[Test build #85327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85327/testReport)** for PR 19954 at commit [`f82c568`](https://github.com/apache/spark/commit/f82c5682565bc0e1bc34ec428faedb53ee5ddecd). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158570531 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,57 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods +need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading +the dependencies so the driver and executor containers can use them locally. This requires users to specify the container +image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users +simply add the following option to the `spark-submit` command to specify the init-container image: + +``` +--conf spark.kubernetes.initContainer.image= +``` + +The init-container handles remote dependencies specified in `spark.jars` (or the `--jars` option of `spark-submit`) and +`spark.files` (or the `--files` option of `spark-submit`). It also handles remotely hosted main application resources, e.g., +the main application jar. The following shows an example of using remote dependencies with the `spark-submit` command: + +```bash +$ bin/spark-submit \ +--master k8s://https://: \ +--deploy-mode cluster \ +--name spark-pi \ +--class org.apache.spark.examples.SparkPi \ +--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar +--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2 +--conf spark.executor.instances=5 \ +--conf spark.kubernetes.driver.docker.image= \ +--conf spark.kubernetes.executor.docker.image= \ +--conf spark.kubernetes.initContainer.image= +https://path/to/examples.jar +``` + +## Secret Management +In some cases, a Spark application may need to use some credentials, e.g., for accessing data on a secured HDFS cluster --- End diff -- I'd rewrite this. "Kubernetes Secrets can be used to provide credentials for a Spark application to access secured services. To mount secrets into a driver container, ..." --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158570653 --- Diff: docs/running-on-kubernetes.md --- @@ -528,51 +579,90 @@ specific to Spark on Kubernetes. - spark.kubernetes.driver.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. - - - - spark.kubernetes.executor.limit.cores - (none) - - Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. - - - - spark.kubernetes.node.selector.[labelKey] - (none) - - Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the - configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier - will result in the driver pod and executors having a node selector with key identifier and value - myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. - - - - spark.kubernetes.driverEnv.[EnvironmentVariableName] - (none) - - Add the environment variable specified by EnvironmentVariableName to - the Driver process. The user can specify multiple of these to set multiple environment variables. - - - - spark.kubernetes.mountDependencies.jarsDownloadDir -/var/spark-data/spark-jars - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - - - spark.kubernetes.mountDependencies.filesDownloadDir - /var/spark-data/spark-files - - Location to download jars to in the driver and executors. - This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. - - + spark.kubernetes.driver.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for the driver pod. + + + + spark.kubernetes.executor.limit.cores + (none) + +Specify the hard CPU [limit](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container) for each executor pod launched for the Spark Application. + + + + spark.kubernetes.node.selector.[labelKey] + (none) + +Adds to the node selector of the driver pod and executor pods, with key labelKey and the value as the +configuration's value. For example, setting spark.kubernetes.node.selector.identifier to myIdentifier +will result in the driver pod and executors having a node selector with key identifier and value + myIdentifier. Multiple node selector keys can be added by setting multiple configurations with this prefix. + + + + spark.kubernetes.driverEnv.[EnvironmentVariableName] + (none) + +Add the environment variable specified by EnvironmentVariableName to +the Driver process. The user can specify multiple of these to set multiple environment variables. + + + + spark.kubernetes.mountDependencies.jarsDownloadDir + /var/spark-data/spark-jars + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. 
+ + + + spark.kubernetes.mountDependencies.filesDownloadDir + /var/spark-data/spark-files + +Location to download jars to in the driver and executors. +This directory must be empty and will be mounted as an empty directory volume on the driver and executor pods. + + + + spark.kubernetes.mountDependencies.timeout + 5 minutes + + Timeout before aborting the attempt to download and unpack dependencies from remote locations into the driver and executor pods. + + + + spark.kubernetes.mountDependencies.maxThreadPoolSize + 5 + + Maximum size of the thread pool for downloading remote dependencies into the driver and executor pods. --- End diff -- I'd clarify this controls how many downloads happen simultaneously; could even change the name of the config
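As a side note on the reviewer's point above: `spark.kubernetes.mountDependencies.maxThreadPoolSize` effectively bounds how many remote dependencies are fetched at the same time. Conceptually it behaves like a fixed-size worker pool, roughly as in the sketch below (illustrative only; this is not Spark's init-container code, and the helper names are invented).

```python
# Illustrative sketch: a fixed-size thread pool bounding concurrent downloads,
# which is what the maxThreadPoolSize setting controls conceptually.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def download_all(urls, dest_dir, max_threads=5):
    """Download every URL into dest_dir, with at most max_threads running at once."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [
            pool.submit(urlretrieve, url, f"{dest_dir}/{url.rsplit('/', 1)[-1]}")
            for url in urls
        ]
        return [f.result() for f in futures]
```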
[GitHub] spark issue #20011: [SPARK-20654][core] Add config to limit disk usage of th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20011 **[Test build #85326 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85326/testReport)** for PR 20011 at commit [`931b2d2`](https://github.com/apache/spark/commit/931b2d262aa02880631ca4c693a84fa4c4d12318). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20011: [SPARK-20654][core] Add config to limit disk usag...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20011#discussion_r158569095 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/HistoryServerDiskManager.scala --- @@ -0,0 +1,310 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.history + +import java.io.File +import java.nio.file.Files +import java.nio.file.attribute.PosixFilePermissions +import java.util.concurrent.atomic.AtomicLong + +import scala.collection.JavaConverters._ +import scala.collection.mutable.{HashMap, ListBuffer} + +import org.apache.commons.io.FileUtils + +import org.apache.spark.SparkConf +import org.apache.spark.internal.Logging +import org.apache.spark.status.KVUtils._ +import org.apache.spark.util.{Clock, Utils} +import org.apache.spark.util.kvstore.KVStore + +/** + * A class used to keep track of disk usage by the SHS, allowing application data to be deleted + * from disk when usage exceeds a configurable threshold. + * + * The goal of the class is not to guarantee that usage will never exceed the threshold; because of + * how application data is written, disk usage may temporarily go higher. But, eventually, it + * should fall back under the threshold. + * + * @param conf Spark configuration. + * @param path Path where to store application data. + * @param listing The listing store, used to persist usage data. + * @param clock Clock instance to use. + */ +private class HistoryServerDiskManager( +conf: SparkConf, +path: File, +listing: KVStore, +clock: Clock) extends Logging { + + import config._ + + private val appStoreDir = new File(path, "apps") + if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) { +throw new IllegalArgumentException(s"Failed to create app directory ($appStoreDir).") + } + + private val tmpStoreDir = new File(path, "temp") + if (!tmpStoreDir.isDirectory() && !tmpStoreDir.mkdir()) { +throw new IllegalArgumentException(s"Failed to create temp directory ($tmpStoreDir).") + } + + private val maxUsage = conf.get(MAX_LOCAL_DISK_USAGE) + private val currentUsage = new AtomicLong(0L) + private val committedUsage = new AtomicLong(0L) + private val active = new HashMap[(String, Option[String]), Long]() + + def initialize(): Unit = { +updateUsage(sizeOf(appStoreDir), committed = true) + +// Clean up any temporary stores during start up. This assumes that they're leftover from other +// instances and are not useful. +tmpStoreDir.listFiles().foreach(FileUtils.deleteQuietly) + +// Go through the recorded store directories and remove any that may have been removed by +// external code. 
+val orphans = listing.view(classOf[ApplicationStoreInfo]).asScala.filter { info => + !new File(info.path).exists() +}.toSeq + +orphans.foreach { info => + listing.delete(info.getClass(), info.path) +} + } + + /** + * Lease some space from the store. The leased space is calculated as a fraction of the given + * event log size; this is an approximation, and doesn't mean the application store cannot + * outgrow the lease. + * + * If there's not enough space for the lease, other applications might be evicted to make room. + * This method always returns a lease, meaning that it's possible for local disk usage to grow + * past the configured threshold if there aren't enough idle applications to evict. + * + * While the lease is active, the data is written to a temporary location, so `openStore()` + * will still return `None` for the application. + */ + def lease(eventLogSize: Long, isCompressed: Boolean = false): Lease = { +val needed = approximateSize(eventLogSize, isCompressed) +makeRoom(needed) + +val perms = PosixFilePermissio
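To make the leasing scheme described in the scaladoc above easier to follow, here is a heavily simplified, language-shifted sketch of the bookkeeping it describes: leases are granted from an approximate size, idle committed stores can be evicted to make room, and usage may still temporarily exceed the configured maximum. This is an illustration only, not Spark's `HistoryServerDiskManager`; all names in it are invented.

```python
# Conceptual sketch only (not Spark's HistoryServerDiskManager): lease approximate
# space, evict idle committed stores when needed, and accept that usage can
# temporarily exceed the configured maximum when nothing idle is left to evict.
class TinyDiskBudget:
    def __init__(self, max_usage_bytes):
        self.max_usage = max_usage_bytes
        self.committed = {}       # app_id -> bytes actually written for finished stores
        self.active_leases = {}   # app_id -> approximate bytes reserved while writing

    def _free_bytes(self):
        return self.max_usage - sum(self.committed.values()) - sum(self.active_leases.values())

    def _evict_idle(self, needed):
        # Drop committed stores that are not currently leased until enough space is free.
        for app_id in list(self.committed):
            if self._free_bytes() >= needed:
                break
            if app_id not in self.active_leases:
                del self.committed[app_id]

    def lease(self, app_id, approx_size):
        # Always grants the lease, so usage may exceed max_usage if nothing can be evicted.
        self._evict_idle(approx_size)
        self.active_leases[app_id] = approx_size

    def commit(self, app_id, actual_size):
        # The real store may outgrow its lease; record what was actually written.
        self.active_leases.pop(app_id, None)
        self.committed[app_id] = actual_size
```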
[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158568498 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading the dependencies so the driver and executor containers can use them locally. This requires users to specify the container image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users simply add the following option to the `spark-submit` command to specify the init-container image: --- End diff -- HDFS and HTTP sound good. We can cover GCS elsewhere. Line breaks were for ease of reviewing by others (being able to comment on individual lines) and for consistency with the rest of the docs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20011: [SPARK-20654][core] Add config to limit disk usag...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/20011#discussion_r158568388 --- Diff: core/src/main/scala/org/apache/spark/deploy/history/HistoryServerDiskManager.scala --- @@ -0,0 +1,310 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.history + +import java.io.File +import java.nio.file.Files +import java.nio.file.attribute.PosixFilePermissions +import java.util.concurrent.atomic.AtomicLong + +import scala.collection.JavaConverters._ +import scala.collection.mutable.{HashMap, ListBuffer} + +import org.apache.commons.io.FileUtils + +import org.apache.spark.SparkConf +import org.apache.spark.internal.Logging +import org.apache.spark.status.KVUtils._ +import org.apache.spark.util.{Clock, Utils} +import org.apache.spark.util.kvstore.KVStore + +/** + * A class used to keep track of disk usage by the SHS, allowing application data to be deleted + * from disk when usage exceeds a configurable threshold. + * + * The goal of the class is not to guarantee that usage will never exceed the threshold; because of + * how application data is written, disk usage may temporarily go higher. But, eventually, it + * should fall back under the threshold. + * + * @param conf Spark configuration. + * @param path Path where to store application data. + * @param listing The listing store, used to persist usage data. + * @param clock Clock instance to use. + */ +private class HistoryServerDiskManager( +conf: SparkConf, +path: File, +listing: KVStore, +clock: Clock) extends Logging { + + import config._ + + private val appStoreDir = new File(path, "apps") + if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) { +throw new IllegalArgumentException(s"Failed to create app directory ($appStoreDir).") + } + + private val tmpStoreDir = new File(path, "temp") + if (!tmpStoreDir.isDirectory() && !tmpStoreDir.mkdir()) { +throw new IllegalArgumentException(s"Failed to create temp directory ($tmpStoreDir).") + } + + private val maxUsage = conf.get(MAX_LOCAL_DISK_USAGE) + private val currentUsage = new AtomicLong(0L) + private val committedUsage = new AtomicLong(0L) + private val active = new HashMap[(String, Option[String]), Long]() + + def initialize(): Unit = { +updateUsage(sizeOf(appStoreDir), committed = true) + +// Clean up any temporary stores during start up. This assumes that they're leftover from other +// instances and are not useful. +tmpStoreDir.listFiles().foreach(FileUtils.deleteQuietly) + +// Go through the recorded store directories and remove any that may have been removed by +// external code. 
+val orphans = listing.view(classOf[ApplicationStoreInfo]).asScala.filter { info => + !new File(info.path).exists() +}.toSeq + +orphans.foreach { info => + listing.delete(info.getClass(), info.path) +} + } + + /** + * Lease some space from the store. The leased space is calculated as a fraction of the given + * event log size; this is an approximation, and doesn't mean the application store cannot + * outgrow the lease. + * + * If there's not enough space for the lease, other applications might be evicted to make room. + * This method always returns a lease, meaning that it's possible for local disk usage to grow + * past the configured threshold if there aren't enough idle applications to evict. + * + * While the lease is active, the data is written to a temporary location, so `openStore()` + * will still return `None` for the application. + */ + def lease(eventLogSize: Long, isCompressed: Boolean = false): Lease = { +val needed = approximateSize(eventLogSize, isCompressed) +makeRoom(needed) + +val perms = PosixFilePermissio
[GitHub] spark issue #20059: [SPARK-22648][Kubernetes] Update documentation to cover ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85325/testReport)** for PR 20059 at commit [`fbb2112`](https://github.com/apache/spark/commit/fbb21121447fe131358042ce05454f88d6fb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][Kubernetes] Update documentation to cover ...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/20059 Updated in https://github.com/apache/spark/pull/20059/commits/fbb21121447fe131358042ce05454f88d6fb. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20039 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85317/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20039 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158568089 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading the dependencies so the driver and executor containers can use them locally. This requires users to specify the container image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users simply add the following option to the `spark-submit` command to specify the init-container image: --- End diff -- Regarding examples, I can add one spark-submit example showing how to use remote jars/files on http/https and hdfs. But gcs requires the connector in the init-container, which is non-trivial. I'm not sure about s3. I think we should avoid doing so. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20039: [SPARK-22850][core] Ensure queued events are delivered t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20039 **[Test build #85317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85317/testReport)** for PR 20039 at commit [`2602fa6`](https://github.com/apache/spark/commit/2602fa68424e984f2cd49f79fb54bcf9676ba5fb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18754: [SPARK-21552][SQL] Add DecimalType support to Arr...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/18754#discussion_r158566190 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala --- @@ -214,6 +216,22 @@ private[arrow] class DoubleWriter(val valueVector: Float8Vector) extends ArrowFi } } +private[arrow] class DecimalWriter( +val valueVector: DecimalVector, +precision: Int, +scale: Int) extends ArrowFieldWriter { + + override def setNull(): Unit = { +valueVector.setNull(count) + } + + override def setValue(input: SpecializedGetters, ordinal: Int): Unit = { +val decimal = input.getDecimal(ordinal, precision, scale) +decimal.changePrecision(precision, scale) --- End diff -- Is it necessary to call `changePrecision` even though `getDecimal` already takes the precision/scale as input - is it not guaranteed to return a decimal with that scale? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18754: [SPARK-21552][SQL] Add DecimalType support to Arr...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/18754#discussion_r158567491 --- Diff: python/pyspark/sql/types.py --- @@ -1617,7 +1617,7 @@ def to_arrow_type(dt): elif type(dt) == DoubleType: arrow_type = pa.float64() elif type(dt) == DecimalType: -arrow_type = pa.decimal(dt.precision, dt.scale) +arrow_type = pa.decimal128(dt.precision, dt.scale) --- End diff -- Yes, that's the right way - it is now fixed at 128 bits internally. I believe the Arrow Java limit is the same as Spark's maximum of precision 38 / scale 38; I'm not sure whether pyarrow has the same limit, but I think so. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
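For readers following the pyarrow side of this: `pa.decimal128(precision, scale)` is the 128-bit decimal type the diff switches to, and it carries the precision/scale metadata explicitly. A small sketch, assuming a pyarrow version that ships `decimal128` (the sample values are made up):

```python
# Minimal check of pyarrow's 128-bit decimal type; assumes pa.decimal128 is available.
import decimal

import pyarrow as pa

dec_type = pa.decimal128(38, 18)   # Spark's DecimalType caps precision at 38
arr = pa.array([decimal.Decimal("1.5"), None], type=dec_type)

print(dec_type.precision, dec_type.scale)  # 38 18
print(arr.type)                            # decimal128(38, 18); exact repr may vary by version
```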
[GitHub] spark issue #20045: [Spark-22360][SQL] Add unit tests for Window Specificati...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20045 **[Test build #4023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4023/testReport)** for PR 20045 at commit [`797c907`](https://github.com/apache/spark/commit/797c90739a4fc57f487d4d73c28ba59bfd598942). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158567222 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading the dependencies so the driver and executor containers can use them locally. This requires users to specify the container image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users simply add the following option to the `spark-submit` command to specify the init-container image: --- End diff -- Do we need to break them into lines? I thought this should be automatically wrapped when being viewed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19954 **[Test build #85324 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85324/testReport)** for PR 19954 at commit [`785b90e`](https://github.com/apache/spark/commit/785b90e52580e9175896b22b00b23f30fbe020ef). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158566909 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading the dependencies so the driver and executor containers can use them locally. This requires users to specify the container image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users simply add the following option to the `spark-submit` command to specify the init-container image: --- End diff -- Maybe we should include 2-3 examples of remote file usage - ideally, showing that one can use http, hdfs, gcs, s3 in dependencies. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20059: [SPARK-22648][Kubernetes] Update documentation to...
Github user foxish commented on a diff in the pull request: https://github.com/apache/spark/pull/20059#discussion_r158566870 --- Diff: docs/running-on-kubernetes.md --- @@ -120,6 +120,23 @@ by their appropriate remote URIs. Also, application dependencies can be pre-moun Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. +### Using Remote Dependencies +When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods need a Kubernetes [init-container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) for downloading the dependencies so the driver and executor containers can use them locally. This requires users to specify the container image for the init-container using the configuration property `spark.kubernetes.initContainer.image`. For example, users simply add the following option to the `spark-submit` command to specify the init-container image: --- End diff -- This and below text should be broken up into multiple lines. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][Kubernetes] Update documentation to cover ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85323/testReport)** for PR 20059 at commit [`a6a6660`](https://github.com/apache/spark/commit/a6a666060787cadd832b5bd32281940a6c81a9a9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org