[GitHub] spark pull request #20436: [MINOR] Fix typos in dev/* scripts.
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/20436#discussion_r164757437

--- Diff: dev/lint-python ---
@@ -60,9 +60,9 @@
 export "PYLINT_HOME=$PYTHONPATH"
 export "PATH=$PYTHONPATH:$PATH"

 # There is no need to write this output to a file
-#+ first, but we do so so that the check status can
-#+ be output before the report, like with the
-#+ scalastyle and RAT checks.
--- End diff --

The `#+` convention is something I picked up from [the Linux Documentation Project](http://tldp.org/LDP/abs/html/here-docs.html#COMMENTBLOCK), if that's what you're referring to. You can safely do away with it and just have the `#`. It was a "phase". I'm over it now...
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

Agreed with @HyukjinKwon. This PR has a very narrow goal -- improving the error messages -- which I think it accomplished. I think @gatorsmile was expecting a more significant set of improvements, but that's not what this PR (or the associated JIRA) is about.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

It's cleaner but less specific. Unless we branch on whether `startPos` and `length` are the same type, we will give the same error message for mixed types and for unsupported types. That seems like a step back to me, as these are two different problems which should get different error messages. If we want to group all the type checking in one place, we should do it as in the first example from [Hyukjin's comment](https://github.com/apache/spark/pull/18926#issuecomment-322393819).
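To make the distinction concrete, here is a rough sketch (my own illustration, not the PR's final code) of branching on the two cases so each gets its own message:

```python
from pyspark.sql import Column


def _check_substr_args(startPos, length):
    """Illustrative only: give mixed types and unsupported types different errors."""
    if type(startPos) != type(length):
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {0} and {1}, respectively.".format(type(startPos), type(length)))
    if not isinstance(startPos, (int, Column)):
        raise TypeError(
            "Unsupported type {0}. startPos and length must be int or Column."
            .format(type(startPos)))
```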
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133186642

--- Diff: python/pyspark/sql/tests.py ---
@@ -1220,6 +1220,18 @@ def test_rand_functions(self):
         rndn2 = df.select('key', functions.randn(0)).collect()
         self.assertEqual(sorted(rndn1), sorted(rndn2))

+    def test_string_functions(self):
+        from pyspark.sql.functions import col, lit
+        df = self.spark.createDataFrame([['nick']], schema=['name'])
+        self.assertRaisesRegexp(
+            TypeError,
+            "must be the same type",
+            lambda: df.select(col('name').substr(0, lit(1))))
--- End diff --

@HyukjinKwon - I opted to just search for a key phrase since that sufficiently captures the intent of the updated error message.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133180053

--- Diff: python/pyspark/sql/tests.py ---
@@ -1220,6 +1220,13 @@ def test_rand_functions(self):
         rndn2 = df.select('key', functions.randn(0)).collect()
         self.assertEqual(sorted(rndn1), sorted(rndn2))

+    def test_string_functions(self):
+        from pyspark.sql.functions import col, lit
+        df = self.spark.createDataFrame([['nick']], schema=['name'])
+        self.assertRaises(TypeError, lambda: df.select(col('name').substr(0, lit(1))))
--- End diff --

I was considering doing that at first, but it felt like just duplicating logic. Looking through the other uses of `assertRaisesRegexp()`, it looks like most of the time we just search for a keyword, but there are also some instances where a large part of the exception message is checked. I can do that here as well.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

@gatorsmile

> Even if we plan to drop `long` in this PR

We are not dropping `long` in this PR. It was [never supported](https://github.com/apache/spark/pull/18926#discussion_r132837359). Both the docstring and actual behavior of `.substr()` make it clear that `long` is not supported. Only `int` and `Column` are supported.

> the checking looks weird to me. Basically, the change just wants to ensure the type of length is int.

Can you elaborate please? As @HyukjinKwon pointed out, `.substr()` accepts either `int` or `Column`, but both arguments must be of the same type. The goal of this PR is to make that clearer. I am not changing any semantics or behavior other than to throw a Python `TypeError` on `long`, as opposed to letting the underlying Scala implementation throw a [messy exception](https://github.com/apache/spark/pull/18926#discussion_r132837359).
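For reference, a small usage sketch of the two supported call styles, assuming an active SparkSession named `spark` as in the PySpark shell:

```python
from pyspark.sql.functions import col, lit

df = spark.createDataFrame([['nick']], schema=['name'])

df.select(col('name').substr(1, 3)).show()            # both ints: supported
df.select(col('name').substr(lit(1), lit(3))).show()  # both Columns: supported

try:
    df.select(col('name').substr(1, lit(3)))          # mixed int/Column: raises TypeError
except TypeError as e:
    print(e)
```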
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

I think my latest commits address the concerns raised here. Let me know if I missed or misunderstood anything.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133029498

--- Diff: python/pyspark/sql/column.py ---
@@ -406,8 +406,14 @@ def substr(self, startPos, length):
         [Row(col=u'Ali'), Row(col=u'Bob')]
         """
         if type(startPos) != type(length):
-            raise TypeError("Can not mix the type")
-        if isinstance(startPos, (int, long)):
+            raise TypeError(
+                "startPos and length must be the same type. "
+                "Got {startPos_t} and {length_t}, respectively."
+                .format(
+                    startPos_t=type(startPos),
+                    length_t=type(length),
+                ))
+        if isinstance(startPos, int):
--- End diff --

Since `long` is [not supported](https://github.com/apache/spark/pull/18926#discussion_r132837359), I just removed it from here.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

To summarize the feedback from @HyukjinKwon and @gatorsmile, I think what I need to do is:

* Add a test for the mixed type case.
* Explicitly check for `long` in Python 2 and throw a `TypeError` from PySpark.
* Add a test for the `long` `TypeError` in Python 2 (see the sketch below).
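A rough sketch of what the Python 2 `long` test from the last item might look like — illustrative only, and it assumes a `self.spark` SparkSession fixture like the one used in `pyspark.sql.tests`:

```python
import sys
import unittest


class SubstrTypeErrorTest(unittest.TestCase):
    # Hypothetical sketch; `long` only exists in Python 2, so skip elsewhere.
    @unittest.skipIf(sys.version_info[0] > 2, "long only exists in Python 2")
    def test_substr_rejects_long(self):
        from pyspark.sql.functions import col
        df = self.spark.createDataFrame([['nick']], schema=['name'])
        self.assertRaises(
            TypeError,
            lambda: df.select(col('name').substr(long(0), long(1))))  # noqa: F821
```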
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

Oh, like a docstring test for the type error?
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

Pinging freshly minted committer @HyukjinKwon for a review on this tiny PR.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/18926

[SPARK-21712] [PySpark] Clarify type error for Column.substr()

Proposed changes:

* Clarify the type error that `Column.substr()` gives.

Test plan:

* Tested this manually.
* Test code:
  ```python
  from pyspark.sql.functions import col, lit
  spark.createDataFrame([['nick']], schema=['name']).select(col('name').substr(0, lit(1)))
  ```
* Before:
  ```
  TypeError: Can not mix the type
  ```
* After:
  ```
  TypeError: startPos and length must be the same type. Got <type 'int'> and <class 'pyspark.sql.column.Column'>, respectively.
  ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nchammas/spark SPARK-21712-substr-type-error

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18926.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18926

commit 753dbe1743f552fe7b4867d3e4d24cdcc2ca1669
Author: Nicholas Chammas
Date: 2017-08-11T18:39:59Z

    clarify type error for Column.substr()
[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r131640333

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala ---
@@ -79,18 +79,6 @@ private[sql] class TypeCollection(private val types: Seq[AbstractDataType])
 private[sql] object TypeCollection {

   /**
-   * Types that can be ordered/compared. In the long run we should probably make this a trait
-   * that can be mixed into each data type, and perhaps create an `AbstractDataType`.
-   */
-  // TODO: Should we consolidate this with RowOrdering.isOrderable?
--- End diff --

Just curious: Do we need to do anything with `RowOrdering.isOrderable` given the change in this PR?
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18820

> I don't think we should allow user to change field nullability while doing replace.

Why not? As long as we correctly update the schema from non-nullable to nullable, it seems OK to me. What would we be protecting against by disallowing this?
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18820

Jenkins test this please. (Let's see if I still have the magic power.)
[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18820#discussion_r131208895

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1423,8 +1434,9 @@ def all_of_(xs):
             subset = [subset]

         # Verify we were not passed in mixed type generics."
-        if not any(all_of_type(rep_dict.keys()) and all_of_type(rep_dict.values())
-                   for all_of_type in [all_of_bool, all_of_str, all_of_numeric]):
+        if not any(key_all_of_type(rep_dict.keys()) and value_all_of_type(rep_dict.values())
+                   for (key_all_of_type, value_all_of_type)
+                   in [all_of_bool, all_of_str, all_of_numeric]):
--- End diff --

Why not just put `None` here and keep the various `all_of_*` variables defined as they were before?
[GitHub] spark issue #3029: [SPARK-4017] show progress bar in console
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/3029

`spark.ui.showConsoleProgress=false` works for me. I pass it via `--conf` to `spark-submit`. Try that if you haven't already.
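For anyone who prefers to set this from code rather than on the command line, the same setting can also be placed on the `SparkConf` before the context starts — a hedged sketch, assuming (as was the behavior around this time) that the context reads `spark.ui.showConsoleProgress` from its conf at startup:

```python
from pyspark import SparkConf, SparkContext

# Equivalent, in code, to passing --conf spark.ui.showConsoleProgress=false to spark-submit.
conf = SparkConf().set("spark.ui.showConsoleProgress", "false")
sc = SparkContext(conf=conf)
```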
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r115497704

--- Diff: python/pyspark/ml/tests.py ---
@@ -71,6 +71,34 @@
 ser = PickleSerializer()

+def generate_multinomial_logistic_input(
+        weights, x_mean, x_variance, add_intercept, n_points, seed=None):
+    """Creates multinomial logistic dataset"""
+
+    if seed:
+        np.random.seed(seed)
+    n_features = x_mean.shape[0]
+
+    x = np.random.randn(n_points, n_features)
+    x = x * np.sqrt(x_variance) + x_mean
+
+    if add_intercept:
+        x = np.hstack([x, np.ones((n_points, 1))])
+
+    # Compute margins
+    margins = np.hstack([np.zeros((n_points, 1)), x.dot(weights.T)])
+    # Shift to avoid overflow and compute probs
+    probs = np.exp(np.subtract(margins, margins.max(axis=1).reshape(n_points, -1)))
+    # Compute cumulative prob
+    cum_probs = np.cumsum(probs / probs.sum(axis=1).reshape(n_points, -1), axis=1)
+    # Asign class
--- End diff --

"Assign class", though IMO you could also just do away with the comments in this section.
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r115497473

--- Diff: python/pyspark/ml/classification.py ---
@@ -374,6 +415,48 @@ def getFamily(self):
         """
         return self.getOrDefault(self.family)

+    @since("2.2.0")
--- End diff --

Since we're voting on 2.2 now, I presume this will make it for 2.3.
[GitHub] spark issue #13257: [SPARK-15474][SQL]ORC data source fails to write and rea...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/13257

The discussion on [ORC-152](https://issues.apache.org/jira/browse/ORC-152) suggests that this is an issue with Spark's DataFrame writer for ORC, not with ORC itself. If you have evidence that this is not the case, it would be good to post it directly on ORC-152 so we can get input from people on that project.
[GitHub] spark pull request #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace impr...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/16793#discussion_r100701818

--- Diff: python/pyspark/sql/tests.py ---
@@ -1591,6 +1591,67 @@ def test_replace(self):
         self.assertEqual(row.age, 10)
         self.assertEqual(row.height, None)

+        # replace with lists
+        row = self.spark.createDataFrame(
+            [(u'Alice', 10, 80.1)], schema).replace([u'Alice'], [u'Ann']).first()
+        self.assertTupleEqual(row, (u'Ann', 10, 80.1))
+
+        # replace with dict
+        row = self.spark.createDataFrame(
+            [(u'Alice', 10, 80.1)], schema).replace({10: 11}).first()
+        self.assertTupleEqual(row, (u'Alice', 11, 80.1))
--- End diff --

This is the only test of "new" functionality (excluding error cases), correct?
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

> the AWS SDK you get will be in sync with hadoop-aws; you have to keep them in sync.

Did you mean here, "you _don't_ have to keep them in sync"?

> Dependency management is an enternal conflict

As an aside, I guess this is what the whole process of shading dependencies is for, right? I always wondered whether that could be done automatically somehow.

Anyway, thanks for orienting me @steveloughran and @srowen. I appreciate your time. I'll step aside and let y'all continue working out what this PR needs to do.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

> This won't be enabled in a default build of Spark.

Okie doke. I don't want to derail the PR review here, but I'll ask since it's on-topic: Is there a way for projects like [Flintrock](https://github.com/nchammas/flintrock) and spark-ec2 to set clusters up such that Spark automatically has S3 support enabled? Do we just name the appropriate packages in `spark-defaults.conf` under `spark.jars.packages`?

Actually, I feel a little silly now. It seems kinda obvious in retrospect. So, to @steveloughran's point, that leaves (for me, at least) the question of knowing what version of the AWS SDK goes with what version of `hadoop-aws`, and so on. Is there a place outside of this PR where one would be able to see that? [This page](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) doesn't have a version mapping, for example.
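For illustration, the idea would be a line like the following in `conf/spark-defaults.conf`. The coordinates and versions here are placeholders only — picking the right `hadoop-aws` and AWS SDK versions for a given Hadoop build is exactly the mapping question raised above:

```
spark.jars.packages  org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4
```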
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

Thanks for elaborating on where this work will help, @steveloughran. Again, just speaking from my own point of view as Spark user and [Flintrock](https://github.com/nchammas/flintrock) maintainer, this sounds like it would be a big help. I hope that after getting something like this in, we can have the default builds of Spark leverage it to bundle support for S3.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

> Does a build of Spark + Hadoop 2.7 right now have no ability at all to read from S3 out of the box, or just not full / ideal support?

No ability at all, as far as I can tell. People have to explicitly start their Spark session with a call to `--packages` like this:

```
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
```

Without that, you get a `java.io.IOException: No FileSystem for scheme: s3n` if you try to read something from S3.

I see the maintainer case for not wanting to have the default builds of Spark include AWS-specific stuff, and at the same time the end-user case for having that is just as clear.
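Once the shell is launched with those packages, a read like the following is the kind of thing that starts working. This is a hypothetical example: the bucket and path are made up, it assumes the shell's `sc` SparkContext, and it assumes AWS credentials are already configured for the s3n connector:

```python
# Hypothetical path; requires launching with --packages as above
# and having s3n credentials configured.
rdd = sc.textFile("s3n://my-bucket/some/path/*.txt")
print(rdd.take(5))
```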
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

As a dumb end-user, and as the maintainer of [Flintrock](https://github.com/nchammas/flintrock), my interest in this PR stems from the hope that we will be able to get builds of Spark against the latest version of Hadoop that can interact with S3 out of the box. Because Spark builds against Hadoop 2.6 and 2.7 don't have that support, many Flintrock users [opt to use Spark built against Hadoop 2.4](https://github.com/nchammas/flintrock/issues/88) since S3 support was still bundled in with those versions. Many users don't know that they can get S3 support at runtime with the right call to `--packages`. Given that Spark and S3 are very commonly used together, I hope there is some way we can address the out-of-the-box use case here.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

@davies - Should this also be cherry-picked into 2.0 and 2.1? I think this config has been there for a while, just without documentation.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

@srowen - OK, I elaborated a bit based on the snippet you posted. Feel free to nitpick on the wording. Would be happy to tweak further.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

@srowen - Good call. Will elaborate a bit based on what you posted.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

cc @davies
[GitHub] spark pull request #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/16151

[SPARK-18719] Add spark.ui.showConsoleProgress to configuration docs

This PR adds `spark.ui.showConsoleProgress` to the configuration docs. I tested this PR by building the docs locally and confirming that this change shows up as expected.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nchammas/spark ui-progressbar-doc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16151.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16151

commit ceef9197513f20f85b9ac73cff014a0dc31adb37
Author: Nicholas Chammas
Date: 2016-12-05T19:04:09Z

    Add spark.ui.showConsoleProgress to configuration docs
[GitHub] spark issue #16130: Update location of Spark YARN shuffle jar
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16130

cc @vanzin?
[GitHub] spark pull request #16130: Update location of Spark YARN shuffle jar
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/16130

Update location of Spark YARN shuffle jar

Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nchammas/spark yarn-doc-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16130.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16130

commit 979a8a1811f471cd333bdde459649974626e612e
Author: Nicholas Chammas
Date: 2016-12-03T20:11:18Z

    update location of Spark shuffle jar
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

LGTM as a first cut. The workflow that I will use during development and that I think should be supported, i.e.

```sh
./dev/make-distribution.sh --pip
pip install -e ./python/
```

works, so I'm happy.

There is room for future improvements -- like building wheels and maybe simplifying the packaging tests -- but I think if we get this in now as an experimental new feature and give people time to use it, it'll help us refine things with more confidence down the line.
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86699002 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +module_home = imp.find_module("pyspark")[1] +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +module_home = os.path.dirname(find_spec("pyspark").origin) +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True --- End diff -- Same nit about `pass`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86698782 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +module_home = imp.find_module("pyspark")[1] +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True --- End diff -- Nit: The idiom in Python for "do nothing" is usually `pass`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
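For anyone skimming, the conventional form being suggested looks like this — a generic illustration, not the PR's exact code:

```python
try:
    import pyspark  # noqa: F401
except ImportError:
    # Not pip-installed; nothing to do.
    pass
```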
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86699184 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +module_home = imp.find_module("pyspark")[1] +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +module_home = os.path.dirname(find_spec("pyspark").origin) +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = [os.path.abspath(p) for p in paths] + +try: +return next(path for path in paths if is_spark_home(path)) +except StopIteration: +print("Could not find valid SPARK_HOME while searching %s".format(paths), file=sys.stderr) --- End diff -- Hmm, did a commit get gobbled up accidentally? This line still uses `%` and is missing an `exit(1)`. I see you changed it for another file, so I assume you meant to do it here too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86698987 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] --- End diff -- I guess if this works we don't have to change it, but to clarify my earlier comment about why `dirname()` is better than joining to `'../'`: ``` >>> os.path.join('/example/path', '../') '/example/path/../' >>> os.path.dirname('/example/path') '/example' ``` There are a few places where this could be changed, but it's not a big deal. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

Dunno why the tests are failing, but it's not related to packaging. Anyway, the install recipe I [posted earlier](https://github.com/apache/spark/pull/15659#issuecomment-258693543) is working now, so that's good. Since the earlier failure I reported was not caught by our packaging test, does that mean our tests are missing something?
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

Jenkins, retest this please.
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

I'll try out your install recipe, but I believe

```sh
./dev/make-distribution.sh --pip
pip install -e ./python/
```

should be a valid way of installing a development version of PySpark. Specifically, `pip install -e` is [how Python users install local projects](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs).
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86692033 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] --- End diff -- I meant you are probably looking for ```python paths = [THIS_DIR, os.path.dirname(THIS_DIR)] ``` The signature of `os.path.dirname()` is the same in Python 3. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86690907 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +paths.append(imp.find_module("pyspark")[1]) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +paths.append(os.path.dirname(find_spec("pyspark").origin)) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = map(lambda path: os.path.abspath(path), paths) --- End diff -- ```python paths = [os.path.abspath(p) for p in paths] ``` This is more Pythonic and eliminates the need to call `list()` on the output of `map()` later, because `map()` returns an iterator. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86690854 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +paths.append(imp.find_module("pyspark")[1]) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +paths.append(os.path.dirname(find_spec("pyspark").origin)) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = map(lambda path: os.path.abspath(path), paths) + +try: +return next(path for path in paths if is_spark_home(path)) +except StopIteration: +print("Could not find valid SPARK_HOME while searching %s" % paths, file=sys.stderr) --- End diff -- ```python print("Could not find valid SPARK_HOME while searching {}".format(paths), file=sys.stderr) ``` Minor point, but `%` is discouraged these days in favor of `format()`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86691246 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] --- End diff -- Couple of comments here: 1. A better way to get a directory relative to the current file is to have something like this at the top of the file and refer to it as necessary: ``` THIS_DIR = os.path.dirname(os.path.realpath(__file__)) ``` 2. The correct way to go up one directory is to just call `dirname()` again. `os.path.join(..., '../')` will just append `'../'` to the end of the path, which may not work as expected later on. So I think you're looking for `THIS_DIR` and `os.path.dirname(THIS_DIR)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
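As a rough illustration of the difference between the two approaches (the directory layout in the comments is hypothetical):

```python
import os

# Hypothetical location of this file: /opt/spark/python/pyspark/find_spark_home.py
THIS_DIR = os.path.dirname(os.path.realpath(__file__))

# Joining with "../" only appends the segment; normalization happens later, if at all.
parent_via_join = os.path.join(THIS_DIR, "../")   # e.g. '/opt/spark/python/pyspark/../'

# Calling dirname() again actually walks up one level.
parent_via_dirname = os.path.dirname(THIS_DIR)    # e.g. '/opt/spark/python'
```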
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86690957 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +paths.append(imp.find_module("pyspark")[1]) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +paths.append(os.path.dirname(find_spec("pyspark").origin)) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = map(lambda path: os.path.abspath(path), paths) + +try: +return next(path for path in paths if is_spark_home(path)) +except StopIteration: +print("Could not find valid SPARK_HOME while searching %s" % paths, file=sys.stderr) --- End diff -- We should raise an exception here or `exit(1)` since this is a fatal error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
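A minimal sketch of what "fail loudly" could look like here (the helper name and wording are illustrative, not the PR's actual code):

```python
from __future__ import print_function
import sys


def fail_no_spark_home(searched_paths):
    # Report the failure on stderr and exit non-zero so calling scripts notice it.
    print("Could not find valid SPARK_HOME while searching {}".format(searched_paths),
          file=sys.stderr)
    sys.exit(1)
    # Alternatively, raise an exception and let the caller decide:
    # raise RuntimeError("Could not find valid SPARK_HOME")
```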
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 I tested this out with Python 3 on my system with the following commands: ``` # Inside ./spark/. python3 -m venv venv source venv/bin/activate ./dev/make-distribution.sh --pip pip install -e ./python/ which pyspark pyspark ``` Seems there is a bug with how `SPARK_HOME` is computed: ``` [make-distribution.sh output snipped] $ pip install -e ./python/ Obtaining file:///.../apache/spark/python Collecting py4j==0.10.4 (from pyspark==2.1.0.dev1) Downloading py4j-0.10.4-py2.py3-none-any.whl (186kB) 100% |████████████████████████████████| 194kB 2.0MB/s Installing collected packages: py4j, pyspark Running setup.py develop for pyspark Successfully installed py4j-0.10.4 pyspark $ which pyspark .../apache/spark/venv/bin/pyspark $ pyspark Could not find valid SPARK_HOME while searching .../apache/spark/venv/bin/pyspark: line 24: None/bin/load-spark-env.sh: No such file or directory .../apache/spark/venv/bin/pyspark: line 77: .../apache/spark/None/bin/spark-submit: No such file or directory ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86668198 --- Diff: docs/building-spark.md --- @@ -259,6 +259,14 @@ or Java 8 tests are automatically enabled when a Java 8 JDK is detected. If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests. +## PySpark pip installable + +If you are building Spark for use in a Python environment and you wish to pip install it, you will first need to build the Spark JARs as described above. Then you can construct an sdist package suitable for setup.py and pip installable package. + +cd python; python setup.py sdist --- End diff -- Just to confirm, if I run this: ``` ./dev/make-distribution.sh --pip ``` It should take care of both building the right JARs _and_ building the Python package. Then I just run: ``` pip install -e ./python/ ``` to install Spark into my Python environment. Is that all correct? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86668059 --- Diff: python/setup.py --- @@ -0,0 +1,180 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages +from shutil import copyfile, copytree, rmtree + +if sys.version_info < (2, 7): +print("Python versions prior to 2.7 are not supported for pip installed PySpark.", + file=sys.stderr) +exit(-1) + +try: +exec(open('pyspark/version.py').read()) +except IOError: +print("Failed to load PySpark version file for packaging you must be in Spark's python dir.", + file=sys.stderr) +sys.exit(-1) +VERSION = __version__ +# A temporary path so we can access above the Python project root and fetch scripts and jars we need +TEMP_PATH = "deps" +SPARK_HOME = os.path.abspath("../") +JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME + +# Use the release jars path if we are in release mode. +if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1): +JARS_PATH = "%s/jars/" % SPARK_HOME + +EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME +SCRIPTS_PATH = "%s/bin" % SPARK_HOME +SCRIPTS_TARGET = "%s/bin" % TEMP_PATH +JARS_TARGET = "%s/jars" % TEMP_PATH +EXAMPLES_TARGET = "%s/examples" % TEMP_PATH + +# Check and see if we are under the spark path in which case we need to build the symlink farm. +# This is important because we only want to build the symlink farm while under Spark otherwise we +# want to use the symlink farm. And if the symlink farm exists under while under Spark (e.g. a +# partially built sdist) we should error and have the user sort it out. +in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or +(os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1)) + +if (in_spark): +# Construct links for setup +try: +os.mkdir(TEMP_PATH) +except: +print("Temp path for symlink to parent already exists %s" % TEMP_PATH, file=sys.stderr) +exit(-1) + +try: +if (in_spark): +# Construct the symlink farm - this is necessary since we can't refer to the path above the +# package root and we need to copy the jars and scripts which are up above the python root. 
+if getattr(os, "symlink", None) is not None: +os.symlink(JARS_PATH, JARS_TARGET) +os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET) +os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET) +else: +# For windows fall back to the slower copytree +copytree(JARS_PATH, JARS_TARGET) +copytree(SCRIPTS_PATH, SCRIPTS_TARGET) +copytree(EXAMPLES_PATH, EXAMPLES_TARGET) +else: +# If we are not inside of SPARK_HOME verify we have the required symlink farm +if not os.path.exists(JARS_TARGET): +print("To build packaging must be in the python directory under the SPARK_HOME.", + file=sys.stderr) +# We copy the shell script to be under pyspark/python/pyspark so that the launcher scripts +# find it where expected. The rest of the files aren't copied because they are accessed +# using Python imports instead which will be resolved correctly. +try: +os.makedirs("pyspark/python/pyspark") +except OSError: +# Don't worry if the directory already exists.
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86667967 --- Diff: python/setup.py --- @@ -0,0 +1,180 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages +from shutil import copyfile, copytree, rmtree + +if sys.version_info < (2, 7): +print("Python versions prior to 2.7 are not supported for pip installed PySpark.", + file=sys.stderr) +exit(-1) + +try: +exec(open('pyspark/version.py').read()) +except IOError: +print("Failed to load PySpark version file for packaging you must be in Spark's python dir.", --- End diff -- Seems like there is a missing sentence break somewhere here. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 @rxin - Not yet, but I will test it this weekend. Yes, PyPI does have a limit, but we can request an exemption. I can help coordinate that with the PyPI admins when we get there. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15733: [SPARK-18138][DOCS] Document that Java 7, Python ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15733#discussion_r86158332 --- Diff: docs/index.md --- @@ -28,8 +28,9 @@ Spark runs on Java 7+, Python 2.6+/3.4+ and R 3.1+. For the Scala API, Spark {{s uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version ({{site.SCALA_BINARY_VERSION}}.x). -Note that support for Java 7, Python 2.6, Scala 2.10 and version of Hadoop before 2.6 are -deprecated as of Spark 2.1.0, and may be removed in Spark 2.2.0. +Note that support for Java 7 and Python 2.6 are deprecated as of Spark 2.0.0, and support for +Scala 2.10 and version of Hadoop before 2.6 are deprecated as of Spark 2.1.0, and may be --- End diff -- "... and versions of Hadoop..." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 Later today (or later this week) I will try actually using this branch to install Spark via pip and report back. ``` pip install git+https://github.com/holdenk/spark@SPARK-1267-pip-install-pyspark ``` @holdenk - I use this method to install development versions of packages straight off of GitHub. Do you expect this pattern to work for Spark as well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15733: [SPARK-18138][DOCS] Document that Java 7, Python ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15733#discussion_r86141486 --- Diff: docs/building-spark.md --- @@ -13,6 +13,7 @@ redirect_from: "building-with-maven.html" The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+. +Note that support for Java 7 is deprecated as of Spark 2.1.0 and may be removed in Spark 2.2.0. --- End diff -- I believe it's been deprecated since 2.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 We have an AppVeyor build now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85531031 --- Diff: python/setup.py --- @@ -0,0 +1,170 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages --- End diff -- pip bundles setuptools, so if you have pip you have setuptools. Specifically, I think if this script is being invoked because the user ran pip, this will work. If it is invoked as `python setup.py`, though, it is possible for this to fail because the user doesn't have setuptools. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
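A common guard for that situation (a sketch, not something this PR does) is to fall back to distutils when setuptools is unavailable, at the cost of losing helpers like `find_packages`:

```python
try:
    from setuptools import setup, find_packages
except ImportError:
    # Plain `python setup.py ...` without setuptools installed lands here;
    # distutils provides setup() but no find_packages(), so packages would
    # have to be listed by hand.
    from distutils.core import setup
    find_packages = None
```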
[GitHub] spark issue #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to be pip i...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 From the PR description: > figure out who owns the pyspark package name on prod PyPI (is it someone with in the project or should we ask PyPI or should we choose a different name to publish with like ApachePySpark?) Don't we want to publish to `apache-spark`? Dunno if Apache has any rules about that. For prior art, see [`apache-libcloud` on PyPI](https://pypi.org/project/apache-libcloud/). Btw, how did you determine that `pyspark` is taken on PyPI? We can definitely reach out to the admins to ask if they can release the name. I'll find out how exactly to do that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to be pip i...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 Thanks for the additional context @holdenk and @rgbkrk. It's important to lay it out somewhere clearly so that the non-Python developers among us (and the forgetful Python developers like me) can understand the benefit we're aiming for here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85377223 --- Diff: pom.xml --- @@ -26,6 +26,7 @@ org.apache.spark spark-parent_2.11 + --- End diff -- Not a sticking point for me, but since it adds a manual step for committers during release ("verify the PySpark version is correct" - maybe this can be automated?) they may object. I remember @davies had an issue with this in the last PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85364365 --- Diff: pom.xml --- @@ -26,6 +26,7 @@ org.apache.spark spark-parent_2.11 + --- End diff -- Something along the lines of `.splitlines()...strip().startswith('<version>')` would work, and it's easy to error out if it broke, no? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
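A sketch of that line-based approach (the helper name is hypothetical, and it assumes the relevant `<version>` element is the first one that appears in `pom.xml`):

```python
def pom_version_from_lines(pom_path):
    # Naive, dependency-free extraction: find the first <version>...</version> line.
    with open(pom_path) as f:
        for line in f.read().splitlines():
            line = line.strip()
            if line.startswith("<version>") and line.endswith("</version>"):
                return line[len("<version>"):-len("</version>")]
    # Error out loudly if the expected line is not there.
    raise RuntimeError("Could not find a <version> element in {}".format(pom_path))
```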
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85365186 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides +high-level APIs in Scala, Java, Python, and R, and an optimized engine that +supports general computation graphs for data analysis. It also supports a +rich set of higher-level tools including Spark SQL for SQL and DataFrames, +MLlib for machine learning, GraphX for graph processing, +and Spark Streaming for stream processing. + +<http://spark.apache.org/> + +## Online Documentation + +You can find the latest Spark documentation, including a programming +guide, on the [project web page](http://spark.apache.org/documentation.html) + + +## Python Packaging + +This README file only contains basic information related to pip installed PySpark. +This packaging is currently experimental and may change in future versions (although we will do our best to keep compatibility). +Using PySpark requires the Spark JARs, and if you are building this from source please see the builder instructions at +["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html). + +The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the [Apache Spark downloads page](http://spark.apache.org/downloads.html). --- End diff -- I see. So `pip install pyspark` can completely replace `brew install apache-spark` for local development, or for submitting from a local machine to a remote cluster. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85364778 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides --- End diff -- I see. And I'm guessing we can't/don't want to somehow reference the README in the root directory? (Perhaps even with a symlink, if necessary...) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85355701 --- Diff: python/setup.py --- @@ -0,0 +1,169 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages +from shutil import copyfile, copytree, rmtree + +exec(open('pyspark/version.py').read()) +VERSION = __version__ +# A temporary path so we can access above the Python project root and fetch scripts and jars we need +TEMP_PATH = "deps" +SPARK_HOME = os.path.abspath("../") +JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME + +# Use the release jars path if we are in release mode. +if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1): +JARS_PATH = "%s/jars/" % SPARK_HOME + +EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME +SCRIPTS_PATH = "%s/bin" % SPARK_HOME +SCRIPTS_TARGET = "%s/bin" % TEMP_PATH +JARS_TARGET = "%s/jars" % TEMP_PATH +EXAMPLES_TARGET = "%s/examples" % TEMP_PATH + +if sys.version_info < (2, 7): +print("Python versions prior to 2.7 are not supported.", file=sys.stderr) +exit(-1) + +# Check and see if we are under the spark path in which case we need to build the symlink farm. +# This is important because we only want to build the symlink farm while under Spark otherwise we +# want to use the symlink farm. And if the symlink farm exists under while under Spark (e.g. a +# partially built sdist) we should error and have the user sort it out. +in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or +(os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1)) + +if (in_spark): +# Construct links for setup +try: +os.mkdir(TEMP_PATH) +except: +print("Temp path for symlink to parent already exists %s" % TEMP_PATH, file=sys.stderr) +exit(-1) + +try: +if (in_spark): +# Construct the symlink farm --- End diff -- What's the purpose of these symlinks? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85352748 --- Diff: pom.xml --- @@ -26,6 +26,7 @@ org.apache.spark spark-parent_2.11 + --- End diff -- Would it be overkill to just have `version.py` parse this file for the version string? Not necessarily with a full XML parser, but with a simple string match or regex and fail noisily if we're unable to extract the version. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
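For illustration, a regex-based version of the same idea (a hypothetical helper; it assumes the version string of interest is in the first `<version>` element the pattern matches):

```python
import re


def pom_version_via_regex(pom_path):
    # Simple string match that fails noisily if the pattern is not found.
    with open(pom_path) as f:
        match = re.search(r"<version>([^<]+)</version>", f.read())
    if match is None:
        raise RuntimeError("Could not extract a version string from {}".format(pom_path))
    return match.group(1)
```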
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85355211 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,65 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(path + "/bin/spark-submit") and os.path.isdir(path + "/jars")) --- End diff -- Instead of building paths with `+`, we should be using `os.path.join()`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
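A quick sketch of the difference (the example path is hypothetical):

```python
import os

spark_home = "/opt/spark"  # hypothetical SPARK_HOME

# String concatenation hard-codes the separator:
submit_concat = spark_home + "/bin/spark-submit"

# os.path.join picks the right separator for the platform and avoids doubled slashes:
submit_joined = os.path.join(spark_home, "bin", "spark-submit")
jars_dir = os.path.join(spark_home, "jars")
```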
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85354868 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides +high-level APIs in Scala, Java, Python, and R, and an optimized engine that +supports general computation graphs for data analysis. It also supports a +rich set of higher-level tools including Spark SQL for SQL and DataFrames, +MLlib for machine learning, GraphX for graph processing, +and Spark Streaming for stream processing. + +<http://spark.apache.org/> + +## Online Documentation + +You can find the latest Spark documentation, including a programming +guide, on the [project web page](http://spark.apache.org/documentation.html) + + +## Python Packaging + +This README file only contains basic information related to pip installed PySpark. +This packaging is currently experimental and may change in future versions (although we will do our best to keep compatibility). +Using PySpark requires the Spark JARs, and if you are building this from source please see the builder instructions at +["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html). + +The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the [Apache Spark downloads page](http://spark.apache.org/downloads.html). --- End diff -- If I am doing local development on my Mac, for example, what does pip installing Spark get me? It sounds like from this line that even if I pip install Spark, I will still need to separately `brew install apache-spark` or something to be able to run Spark programs. Is that correct? How does my workflow change or improve if I can pip install Spark? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85350993 --- Diff: bin/spark-class --- @@ -36,7 +36,7 @@ else fi # Find Spark jars. -if [ -f "${SPARK_HOME}/RELEASE" ]; then +if [ -d "${SPARK_HOME}/jars" ]; then --- End diff -- Why did this get changed from `RELEASE` to `jars`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85355847 --- Diff: python/setup.cfg --- @@ -0,0 +1,22 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +[bdist_wheel] +universal = 1 + +[metadata] +description-file = README.md --- End diff -- Newline here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85353057 --- Diff: python/MANIFEST.in --- @@ -0,0 +1,23 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +recursive-include deps/jars *.jar +recursive-include deps/bin * --- End diff -- Minor point, but `graft` seems more appropriate here. See: https://docs.python.org/3/distutils/commandref.html#sdist-cmd --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85353699 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides --- End diff -- Would it be appropriate to cut this paragraph out and just leave the stuff about packaging? If these blurbs ever change I don't think we want to have to update them in multiple places, and we already have this blurb in at least one other place, I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85351820 --- Diff: dev/create-release/release-build.sh --- @@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then export ZINC_PORT=$ZINC_PORT echo "Creating distribution: $NAME ($FLAGS)" +# Write out the NAME and VERSION to PySpark version info we rewrite the - into a . and SNAPSHOT --- End diff -- Do we want to have a version string that's slightly different from the "original", just for Python? I'm thinking about what will happen if people, for example, want to do the same for R. Having 3 slightly different ways of showing the version string seems unnecessary. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15567: [SPARK-14393][SQL] values generated by non-deterministic...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15567 @mengxr - I think this PR will also address [SPARK-14241](https://issues.apache.org/jira/browse/SPARK-14241). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004 @steveloughran - Is this message in the most recent build log critical? ``` Spark's published dependencies DO NOT MATCH the manifest file (dev/spark-deps). To update the manifest file, run './dev/test-dependencies.sh --replace-manifest'. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r83349121 --- Diff: sbin/spark-daemon.sh --- @@ -146,13 +176,11 @@ run_command() { case "$mode" in (class) - nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null & - newpid="$!" + execute_command nice -n $SPARK_NICENESS ${SPARK_HOME}/bin/spark-class $command $@ ;; (submit) - nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-submit --class $command "$@" >> "$log" 2>&1 < /dev/null & - newpid="$!" + execute_command nice -n $SPARK_NICENESS bash ${SPARK_HOME}/bin/spark-submit --class $command $@ --- End diff -- Same here: I would quote the `SPARK_` environment variables. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r83349154 --- Diff: sbin/spark-daemon.sh --- @@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then export SPARK_NICENESS=0 fi +execute_command() { + local command="$@" + if [ -z ${SPARK_NO_DAEMONIZE+set} ]; then + nohup -- $command >> $log 2>&1 < /dev/null & + newpid="$!" + + echo "$newpid" > "$pid" + + #Poll for up to 5 seconds for the java process to start --- End diff -- Nit: Space after `#`. (I know it was like this before your PR.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r83349053 --- Diff: sbin/spark-daemon.sh --- @@ -146,13 +176,11 @@ run_command() { case "$mode" in (class) - nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null & - newpid="$!" + execute_command nice -n $SPARK_NICENESS ${SPARK_HOME}/bin/spark-class $command $@ --- End diff -- If `SPARK_HOME` contains spaces, this will break. I recommend quoting both `SPARK_HOME` and `SPARK_NICENESS` as they were before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Looks good to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Thanks for the quick overview. That's pretty straightforward, actually! I'll take a look at `PipelinedRDD` for the details. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Hmm, OK I see. (Apologies, I don't understand what pipelined RDDs are for, so the examples are going a bit over my head.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 > So there is no chaining requirement, and it will only work in a with statement. @MLnick - Couldn't we also create a scenario (like @holdenk did earlier) where a user does something like this? ```python persisted_rdd = persisted(rdd) persisted_rdd.map(...).filter(...).count() ``` This would break pipelining too, no? And I think the expectation would be for it not to break pipelining, because existing common context managers in Python don't have a requirement that they _must_ be used in a `with` block. For example, `f = open(file)` works fine, as does `s = requests.Session()`, and the resulting objects have the same behavior as they would inside a `with` block. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
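To make the comparison concrete, a small sketch of how an existing context manager behaves the same with and without `with` (the file name is hypothetical):

```python
# Without `with`: the object works normally; cleanup is just manual.
f = open("example.txt", "w")
f.write("hello\n")
f.close()

# With `with`: same object, same behavior, plus automatic cleanup on exit.
with open("example.txt") as f:
    contents = f.read()
```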
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Ah, I see. I don't fully understand how `PipelinedRDD` works or how it is used so I'll have to defer to y'all on this. Does the `cached()` utility method have this same problem? > We could possibly work around it with some type checking etc but it then starts to feel like adding more complexity than the feature is worth... Agreed. At this point, actually, I'm beginning to feel this feature is not worth it. Context managers seem to work best when the objects they're working on have clear open/close-style semantics. File handles, network connections, and the like fit this pattern well. In fact, the [doc for `with`](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement) says: > This allows common `try...except...finally` usage patterns to be encapsulated for convenient reuse. RDDs and DataFrames, on the other hand, don't have a simple open/close or `try...except...finally` pattern. And when we try to map one onto persist and unpersist, we get the various side-effects we've been discussing here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r74307747 --- Diff: python/pyspark/rdd.py --- @@ -221,6 +227,21 @@ def context(self): def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). + +:py:meth:`cache` can be used in a 'with' statement. The RDD will be automatically +unpersisted once the 'with' block is exited. Note however that any actions on the RDD +that require the RDD to be cached, should be invoked inside the 'with' block; otherwise, +caching will have no effect. --- End diff -- Agreed, especially since this is technically a new Public API that we are potentially committing to for the life of the 2.x line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Sorry, you're right, `__exit__()`'s return value is not going to be consumed anywhere. What I meant is that `unpersist()` would return the base RDD or DataFrame object. But I'm not seeing the issue with the example you posted. Reformatting for clarity: ```python magic = rdd.persist() with magic as awesome: awesome.count() magic.map(lambda x: x + 1) ``` Are you saying `magic.map()` will error? Why would it? `magic` would be an instance of `PersistedRDD`, which in turn is a subclass of `RDD`, which has `map()` and all of the usual methods defined, plus the magic methods we need for the context manager. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 > the subclassing of RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted How so? Wouldn't `__exit__()` simply return the parent RDD or DataFrame object? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
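For context on the pipelining point, here is a rough sketch of the behaviour being discussed; it assumes an existing `SparkContext` named `sc` and is not code from the PR itself:

```python
# Rough sketch of the pipelining concern; assumes an existing SparkContext `sc`.
rdd = sc.parallelize(range(10))

# Chained narrow transformations are normally fused into a single PipelinedRDD,
# so both lambdas run in one Python worker pass per partition.
fused = rdd.map(lambda x: x * 2).map(lambda x: x + 1)
print(type(fused).__name__)  # typically: PipelinedRDD

# Once an RDD is cached, later transformations are no longer pipelined into it,
# because the cached partitions have to be materialized first.
fused.cache()
later = fused.map(lambda x: x - 1)
print(later.count())
```

Whatever `__exit__()` returns would need to keep that fusion intact for transformations applied after the `with` block.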
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 None of our options seems great, but if I had to rank them I would say:

1. Add new `Persisted...` classes.
2. Make no changes.
3. Add separate `persisted()` or `cached()` utility method.
4. Modify base RDD and DataFrame classes.

Adding new internal classes for this use-case honestly seems a bit heavy-handed to me, so if we are against that then I would lean towards not doing anything. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Ah, you're right. So if we want to avoid needing magic methods in the main RDD/DataFrame classes and avoid needing a separate utility method like `cache()`, I think one option available to us is to have separate `PersistedRDD` and `PersistedDataFrame` classes that simply wrap the base RDD and DataFrame classes and add the appropriate magic methods. `.persist()` and `.cache()` would then return instances of these classes, which should satisfy the `type(x).__enter__(x)` behavior while still preserving backwards compatibility and method chaining. What do you think of that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
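A minimal sketch of what that could look like; `PersistedRDD` and the constructor details mentioned afterwards are assumptions for illustration, not code from this PR or from PySpark:

```python
# Hypothetical sketch only -- PersistedRDD does not exist in PySpark.
from pyspark.rdd import RDD


class PersistedRDD(RDD):
    """RDD subclass whose instances can be used as context managers."""

    def __enter__(self):
        # persist()/cache() are assumed to have already marked the RDD as persisted.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
        return False  # let any exception raised inside the block propagate
```

Under this sketch, `RDD.persist()` would return something like `PersistedRDD(self._jrdd, self.ctx, self._jrdd_deserializer)` after setting the storage level, though those constructor arguments are an assumption about PySpark internals rather than a settled design.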
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Thanks @MLnick for taking this on and for breaking down what you've found so far. I took a look through [`contextlib`](https://docs.python.org/3/library/contextlib.html) for inspiration, and I wonder if the source code for [`closing()`](https://docs.python.org/3/library/contextlib.html#contextlib.closing) offers a template we can follow that would let `persist()` return an RDD/DataFrame instance with the correct magic methods, without having to modify the class. Have you taken a look at that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
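For reference, `contextlib.closing` is only a few lines, and a persist-oriented analogue along the same lines might look roughly like this (the name `unpersisting` is made up for illustration and is not part of PySpark):

```python
# Hypothetical helper modeled on contextlib.closing; not part of PySpark.
class unpersisting:
    """Context manager that persists `thing` on entry and unpersists it on exit."""

    def __init__(self, thing):
        self.thing = thing

    def __enter__(self):
        return self.thing.persist()

    def __exit__(self, exc_type, exc_value, traceback):
        self.thing.unpersist()
```

Callers would then write `with unpersisting(rdd) as cached: cached.count()` without any changes to the RDD or DataFrame classes themselves.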
[GitHub] spark issue #14496: [SPARK-16772] [Python] [Docs] Fix API doc references to ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14496 Thanks @srowen. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14496: [SPARK-16772] [Python] [Docs] Fix API doc references to ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14496 cc @rxin - Follow-on to #14393. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14496: [SPARK-16772] [Python] [Docs] Fix API doc referen...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/14496 [SPARK-16772] [Python] [Docs] Fix API doc references to UDFRegistration + Update "important classes"

## Proposed Changes

* Update the list of "important classes" in `pyspark.sql` to match 2.0.
* Fix references to `UDFRegistration` so that the class shows up in the docs. It currently [doesn't](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).
* Remove some unnecessary whitespace in the Python RST doc files.

I reused the [existing JIRA](https://issues.apache.org/jira/browse/SPARK-16772) I created last week for similar API doc fixes.

## How was this patch tested?

* I ran `lint-python` successfully.
* I ran `make clean build` on the Python docs and confirmed the results are as expected locally in my browser.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/nchammas/spark SPARK-16772-UDFRegistration Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14496 commit 62f4f823ed33972d782506f5226b192fc45b1ede Author: Nicholas Chammas Date: 2016-08-04T17:16:31Z fix references to UDFRegistration --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14408: [SPARK-16772] Restore "datatype string" to Python API do...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14408 cc @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14393: [SPARK-16772] Correct API doc references to PySpa...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/14393#discussion_r72853914 --- Diff: python/pyspark/sql/context.py --- @@ -226,28 +226,34 @@ def createDataFrame(self, data, schema=None, samplingRatio=None): from ``data``, which should be an RDD of :class:`Row`, or :class:`namedtuple`, or :class:`dict`. -When ``schema`` is :class:`DataType` or datatype string, it must match the real data, or --- End diff -- Correction here: https://github.com/apache/spark/pull/14408 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14408: [SPARK-16772] Restore "datatype string" to Python...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/14408 [SPARK-16772] Restore "datatype string" to Python API docstrings

## What changes were proposed in this pull request?

This PR corrects [an error made in an earlier PR](https://github.com/apache/spark/pull/14393/files#r72843069).

## How was this patch tested?

```sh
$ ./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
```

I also built the docs and confirmed that they looked good in my browser. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nchammas/spark SPARK-16772 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14408.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14408 commit 58f388533a6300e49de0d239d3ad0f7d17afca50 Author: Nicholas Chammas Date: 2016-07-29T20:03:50Z restore "datatype string" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14393: [SPARK-16772] Correct API doc references to PySpa...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/14393#discussion_r72843069 --- Diff: python/pyspark/sql/context.py --- @@ -226,28 +226,34 @@ def createDataFrame(self, data, schema=None, samplingRatio=None): from ``data``, which should be an RDD of :class:`Row`, or :class:`namedtuple`, or :class:`dict`. -When ``schema`` is :class:`DataType` or datatype string, it must match the real data, or --- End diff -- I made a mistake here, thinking "datatype string" was actually meant to be `StringType()`. I understand now that a datatype string is actually a thing. Correction incoming... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
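For anyone else who trips over the terminology: a "datatype string" is a schema expressed as a string rather than as a `DataType` instance. A rough illustration, assuming an existing `SparkSession` named `spark` (the exact string syntax follows the `createDataFrame` docstring examples):

```python
# Illustration of a "datatype string" vs. an explicit DataType schema;
# assumes an existing SparkSession named `spark`.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

rows = [("nick", 1)]

# Schema given as a datatype string:
df1 = spark.createDataFrame(rows, "name: string, id: int")

# The equivalent schema given as an explicit DataType:
schema = StructType([
    StructField("name", StringType()),
    StructField("id", IntegerType()),
])
df2 = spark.createDataFrame(rows, schema)
```

Both calls should yield DataFrames with the same column names and types.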
[GitHub] spark issue #14393: [SPARK-16772] Correct API doc references to PySpark clas...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14393 Yes, I built the docs and reviewed several (but not all) of the changes locally in my browser and confirmed that the corrections I wanted took place as expected. (Apologies about not using the PR template when I first opened the PR. GitHub Desktop seems not to support that yet. I've updated the PR description to include this info now.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14393: [SPARK-16772] Correct API doc references to PySpark clas...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14393 Apologies for making a fairly "noisy" PR, with changes in several scattered places. However, as a PySpark user it's important to me that the API docs be properly formatted and that docstring class references work. Feel free to ping me on Python docstring changes in the future. I would be happy to review them. cc @rxin @davies - Ready for review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14393: [SPARK-16772] Correct references to DataType + ot...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/14393 [SPARK-16772] Correct references to DataType + other minor tweaks You can merge this pull request into a Git repository by running: $ git pull https://github.com/nchammas/spark python-docstring-fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14393.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14393 commit 3a24f4fb7ce30b4a261c0db2c27be11976dea678 Author: Nicholas Chammas Date: 2016-07-28T16:42:13Z [SPARK-16772] correct references to DataType --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13114: Branch 1.4
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/13114 @srowen @vanzin - Shouldn't some automated process be picking up your comments ("close this PR") and closing this PR? I thought we had something like that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15542][SparkR] Make error message clear...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/13308#discussion_r64774474 --- Diff: R/install-dev.sh --- @@ -38,7 +38,12 @@ pushd $FWDIR > /dev/null if [ ! -z "$R_HOME" ] then R_SCRIPT_PATH="$R_HOME/bin" - else + else +# if system wide R_HOME is not found, then exit +if ! [ `command -v R` ]; then --- End diff -- Yeah, we typically put the `!` after the test: * https://github.com/apache/spark/blob/6d506c9ae9a2519d1a93e788ae5316d4f942d35d/dev/lint-python#L44 * https://github.com/apache/spark/blob/6d506c9ae9a2519d1a93e788ae5316d4f942d35d/dev/lint-java#L25 (In Bash, `[ ... ]` and `test` are synonyms.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15072][SQL][PYSPARK][HOT-FIX] Remove Sp...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/13069#issuecomment-219517952 Okie doke, thanks for the explanation! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org