[GitHub] spark pull request: spark-submit with accept multiple properties-f...
Github user lvsoft commented on the pull request: https://github.com/apache/spark/pull/3490#issuecomment-66414770 Well, I don't see where the complexity of this PR lies. I've reviewed SPARK-3779, which was marked as related, and didn't find anything relevant to this patch. Also, this patch is backward compatible with the current `spark-submit` behavior. Let me address it point by point: 1. Necessity: I've given two reasons, one for the benchmark case and one for the common intuition in most systems. 2. Complexity: this patch maintains backward compatibility, I've described the details at the beginning, and I still don't see the relationship with SPARK-3779. 3. Elegance: I don't think this is the most elegant solution. However, given the goals of maintaining compatibility and minimizing impact on the current system, it is a relatively elegant one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: spark-submit with accept multiple properties-f...
Github user lvsoft commented on the pull request: https://github.com/apache/spark/pull/3490#issuecomment-66405387 Well, those are separate property files, not *common* properties. It would be hard to adjust the common properties that way, and easy to make mistakes. Deleting tmp files is a common requirement in system design. Of course you can simply ignore the tmp files; as I said, I just think this is a more elegant approach.
[GitHub] spark pull request: spark-submit with accept multiple properties-f...
Github user lvsoft commented on the pull request: https://github.com/apache/spark/pull/3490#issuecomment-66404194 Sorry for the late reply. I'll explain the use cases for multiple properties files. Currently I'm working on a benchmark utility for Spark, and it is natural to adjust properties for different workloads. I'd like to split the configuration into two parts: a global conf for common properties, and a private conf for each workload. Without support for multiple properties files, I have to merge the properties into a tmp conf file and remove it after spark-submit finishes. What's more, when submitting multiple workloads concurrently, the tmp conf file names need to be mutually exclusive, and if the benchmark process is interrupted, the tmp conf files are hard to clean up. So I think a more elegant approach is to add support for multiple properties files to Spark. Another reason for this PR: currently Spark uses `spark-defaults.conf` if no properties-file is specified, but uses the specified properties-file and *discards* `spark-defaults.conf` otherwise. This behavior is counter-intuitive for beginners. In most systems, it is a natural assumption that the values in `xxx-defaults.conf` take effect whenever a property is not overridden in the user's config.
[GitHub] spark pull request: spark-submit with accept multiple properties-f...
GitHub user lvsoft opened a pull request: https://github.com/apache/spark/pull/3490 spark-submit accepts multiple properties-files and merges the values Currently ```spark-submit``` accepts only one properties-file, and uses ```spark-defaults.conf``` if none is specified. A more natural approach is to apply the properties-files sequentially on top of ```spark-defaults.conf```. This PR touches: 1. the spark-submit script: join multiple ```--properties-file``` arguments with commas and store them in the ```SPARK_SUBMIT_PROPERTIES_FILES``` environment variable; peek into each properties-file to set the ```SPARK_SUBMIT_BOOTSTRAP_DRIVER``` flag. 2. SparkSubmitArguments.scala: similar to 1. 3. SparkSubmitDriverBootstrapper.scala: accept ```SPARK_SUBMIT_PROPERTIES_FILES``` and call ```getPropertiesFromFiles``` for parsing. 4. Utils.scala: add ```getPropertiesFromFiles``` to parse multiple properties-files. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lvsoft/spark spark_submit_with_multi_properties Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3490.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3490 commit c18a266a1fa0c20331faed1193c168c1021edcf1 Author: Lv, Qi Date: 2014-11-25T08:48:03Z Spark submit accept multiple properties files commit 752a0581fde0692ee05213b51d0fc0368d8fd205 Author: Lv, Qi Date: 2014-11-26T08:56:29Z test pass
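The handling of the comma-joined ```SPARK_SUBMIT_PROPERTIES_FILES``` variable could look roughly like the following Python sketch. The real ```getPropertiesFromFiles``` lives in Utils.scala; the `loader` callable here is a stand-in for the per-file properties parser, so this is a sketch of the splitting-and-merging step only:

```python
def get_properties_from_files(joined, loader):
    """Split a comma-joined list of properties-file paths and merge
    them left to right, so later files override earlier ones."""
    merged = {}
    for path in joined.split(","):
        path = path.strip()
        if path:  # tolerate stray separators and padding
            merged.update(loader(path))
    return merged
```

For example, `get_properties_from_files("global.conf,workload.conf", parse_file)` would yield workload values on top of global ones (names are illustrative).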
[GitHub] spark pull request: [SPARK-4475] change "localhost" to "127.0.0.1"...
Github user lvsoft commented on the pull request: https://github.com/apache/spark/pull/3425#issuecomment-64308603 I ran a doctest in aggregation.py to confirm this fix works when ```localhost``` cannot be resolved. However, I'm not fully confident that Spark as a whole will work well in that situation, and I'd also say a node is not properly configured if ```localhost``` cannot be resolved. Still, I think ```127.0.0.1``` should always be used for local communication rather than ```localhost```: it is more robust and introduces no drawback. After all, making things work in a correctly configured environment is trivial, while making things work in a tolerably misconfigured one is harder and more meaningful, and that is what we are working toward. If you agree, I can do a further pass to eliminate all related uses of ```localhost``` in Spark.
[GitHub] spark pull request: [SPARK-2313] PySpark pass port rather than std...
Github user lvsoft commented on the pull request: https://github.com/apache/spark/pull/3424#issuecomment-64302794 I think this is a better solution. However, passing the port back via a socket would affect py4j too: currently, stdin is the only method py4j supports for passing back the port number.
[GitHub] spark pull request: [SPARK-4475] change "localhost" to "127.0.0.1"...
GitHub user lvsoft opened a pull request: https://github.com/apache/spark/pull/3425 [SPARK-4475] change "localhost" to "127.0.0.1" if "localhost" can't be resolved This will fix [SPARK-4475]. Simply changing "localhost" to the equivalent "127.0.0.1" solves the issue. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lvsoft/spark feature/FixPySpark_failed_to_initialize_if_localhost_can_not_be_resolved Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3425.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3425 commit 25efc78dc766f63888bdae0fdb8dfabb457145ae Author: Lv, Qi Date: 2014-11-24T08:24:56Z change "localhost" to "127.0.0.1" if "localhost" can't be resolved
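The idea behind SPARK-4475 (prefer a name that still works when ```localhost``` is unresolvable) can be illustrated with a small Python sketch. This is not the actual patch, which simply substitutes the literal; the helper name `local_address` is made up:

```python
import socket

def local_address():
    """Return "localhost" when it resolves, otherwise fall back to
    the loopback literal "127.0.0.1" (e.g. missing /etc/hosts entry)."""
    try:
        socket.gethostbyname("localhost")
        return "localhost"
    except socket.error:
        return "127.0.0.1"
```

Using the literal ```127.0.0.1``` unconditionally, as the PR does, avoids even the resolution attempt, which is the "no drawback" argument made in the discussion above.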
[GitHub] spark pull request: [SPARK-2313] PySpark pass port rather than std...
GitHub user lvsoft opened a pull request: https://github.com/apache/spark/pull/3424 [SPARK-2313] PySpark pass port rather than stdin This patch will fix [SPARK-2313]. It picks an available free port and passes the port number to the Py4j Gateway for binding via a command-line argument. The initial port number is scanned starting from a value derived from the PID (mod), which avoids potential concurrency issues such as supporting multiple PySpark instances in the future. The port number printed by Py4j is also parsed as a double check. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lvsoft/spark feature/PySparkPassPortRatherThanSTDIN Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3424.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3424 commit ac603586647c7db7064464ec4bc96d045f664202 Author: Lv, Qi Date: 2014-11-24T07:38:52Z make pyspark accept port via command line argument, and STDIN for double check commit 3f843674ee1c3a5e364acdee3954806f6a6e05d8 Author: Lv, Qi Date: 2014-11-24T07:42:36Z remove useless import
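The port-picking scheme described above (start scanning at a PID-derived offset, verify by binding) might look like the following Python sketch. The function name and the choice of the ephemeral port range are illustrative assumptions, not taken from the patch:

```python
import os
import socket

def find_free_port(attempts=100):
    """Scan for a bindable TCP port, starting at an offset derived
    from the PID so that concurrent PySpark instances tend to probe
    different ports and avoid racing for the same one."""
    base = 49152                      # start of the ephemeral range
    span = 65535 - base
    start = os.getpid() % span        # PID-derived scan offset
    for offset in range(attempts):
        port = base + (start + offset) % span
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port           # bind succeeded: port is free
            except OSError:
                continue              # in use; try the next port
    raise RuntimeError("no free port found")
```

Note the inherent race: the port is released again before the Gateway binds it, which is why the patch also parses the port number Py4j reports back as a double check.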