[GitHub] spark pull request #20436: [MINOR] Fix typos in dev/* scripts.
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/20436#discussion_r164757437

--- Diff: dev/lint-python ---
@@ -60,9 +60,9 @@
 export "PYLINT_HOME=$PYTHONPATH"
 export "PATH=$PYTHONPATH:$PATH"

 # There is no need to write this output to a file
-#+ first, but we do so so that the check status can
-#+ be output before the report, like with the
-#+ scalastyle and RAT checks.
--- End diff --

The `#+` convention is something I picked up from [the Linux Documentation Project](http://tldp.org/LDP/abs/html/here-docs.html#COMMENTBLOCK), if that's what you're referring to. You can safely do away with it and just have the `#`. It was a "phase". I'm over it now...
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

Agreed with @HyukjinKwon. This PR has a very narrow goal -- improving the error messages -- which I think it accomplished. I think @gatorsmile was expecting a more significant set of improvements, but that's not what this PR (or the associated JIRA) is about.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

It's cleaner but less specific. Unless we branch on whether `startPos` and `length` are the same type, we will give the same error message for mixed types and for unsupported types. That seems like a step back to me, as these are two different problems which should get different error messages. If we want to group all the type checking in one place, we should do it as in the first example from [Hyukjin's comment](https://github.com/apache/spark/pull/18926#issuecomment-322393819).
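To make the distinction concrete, here is a rough sketch (my own illustration, not the PR's final code) of branching on the two cases so each gets its own message:

```python
from pyspark.sql import Column


def _check_substr_args(startPos, length):
    """Illustrative only: give mixed types and unsupported types different errors."""
    if type(startPos) != type(length):
        raise TypeError(
            "startPos and length must be the same type. "
            "Got {0} and {1}, respectively.".format(type(startPos), type(length)))
    if not isinstance(startPos, (int, Column)):
        raise TypeError(
            "Unsupported type {0}. startPos and length must be int or Column."
            .format(type(startPos)))
```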
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133186642

--- Diff: python/pyspark/sql/tests.py ---
@@ -1220,6 +1220,18 @@ def test_rand_functions(self):
         rndn2 = df.select('key', functions.randn(0)).collect()
         self.assertEqual(sorted(rndn1), sorted(rndn2))

+    def test_string_functions(self):
+        from pyspark.sql.functions import col, lit
+        df = self.spark.createDataFrame([['nick']], schema=['name'])
+        self.assertRaisesRegexp(
+            TypeError,
+            "must be the same type",
+            lambda: df.select(col('name').substr(0, lit(1))))
--- End diff --

@HyukjinKwon - I opted to just search for a key phrase since that sufficiently captures the intent of the updated error message.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133180053

--- Diff: python/pyspark/sql/tests.py ---
@@ -1220,6 +1220,13 @@ def test_rand_functions(self):
         rndn2 = df.select('key', functions.randn(0)).collect()
         self.assertEqual(sorted(rndn1), sorted(rndn2))

+    def test_string_functions(self):
+        from pyspark.sql.functions import col, lit
+        df = self.spark.createDataFrame([['nick']], schema=['name'])
+        self.assertRaises(TypeError, lambda: df.select(col('name').substr(0, lit(1))))
--- End diff --

I was considering doing that at first, but it felt like just duplicating logic. Looking through the other uses of `assertRaisesRegexp()`, it looks like most of the time we just search for a keyword, but there are also some instances where a large part of the exception message is checked. I can do that here as well.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

@gatorsmile

> Even if we plan to drop `long` in this PR

We are not dropping `long` in this PR. It was [never supported](https://github.com/apache/spark/pull/18926#discussion_r132837359). Both the docstring and actual behavior of `.substr()` make it clear that `long` is not supported. Only `int` and `Column` are supported.

> the checking looks weird to me. Basically, the change just wants to ensure the type of length is int.

Can you elaborate please? As @HyukjinKwon pointed out, `.substr()` accepts either `int` or `Column`, but both arguments must be of the same type. The goal of this PR is to make that clearer. I am not changing any semantics or behavior other than to throw a Python `TypeError` on `long`, as opposed to letting the underlying Scala implementation throw a [messy exception](https://github.com/apache/spark/pull/18926#discussion_r132837359).
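For reference, a small usage sketch of the two supported call styles, assuming an active SparkSession named `spark` as in the PySpark shell:

```python
from pyspark.sql.functions import col, lit

df = spark.createDataFrame([['nick']], schema=['name'])

df.select(col('name').substr(1, 3)).show()            # both ints: supported
df.select(col('name').substr(lit(1), lit(3))).show()  # both Columns: supported

try:
    df.select(col('name').substr(1, lit(3)))          # mixed int/Column: raises TypeError
except TypeError as e:
    print(e)
```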
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

I think my latest commits address the concerns raised here. Let me know if I missed or misunderstood anything.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18926#discussion_r133029498

--- Diff: python/pyspark/sql/column.py ---
@@ -406,8 +406,14 @@ def substr(self, startPos, length):
         [Row(col=u'Ali'), Row(col=u'Bob')]
         """
         if type(startPos) != type(length):
-            raise TypeError("Can not mix the type")
-        if isinstance(startPos, (int, long)):
+            raise TypeError(
+                "startPos and length must be the same type. "
+                "Got {startPos_t} and {length_t}, respectively."
+                .format(
+                    startPos_t=type(startPos),
+                    length_t=type(length),
+                ))
+        if isinstance(startPos, int):
--- End diff --

Since `long` is [not supported](https://github.com/apache/spark/pull/18926#discussion_r132837359), I just removed it from here.
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

To summarize the feedback from @HyukjinKwon and @gatorsmile, I think what I need to do is:

* Add a test for the mixed type case.
* Explicitly check for `long` in Python 2 and throw a `TypeError` from PySpark.
* Add a test for the `long` `TypeError` in Python 2 (see the sketch below).
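A rough sketch of what the Python 2 `long` test from the last item might look like — illustrative only, and it assumes a `self.spark` SparkSession fixture like the one used in `pyspark.sql.tests`:

```python
import sys
import unittest


class SubstrTypeErrorTest(unittest.TestCase):
    # Hypothetical sketch; `long` only exists in Python 2, so skip elsewhere.
    @unittest.skipIf(sys.version_info[0] > 2, "long only exists in Python 2")
    def test_substr_rejects_long(self):
        from pyspark.sql.functions import col
        df = self.spark.createDataFrame([['nick']], schema=['name'])
        self.assertRaises(
            TypeError,
            lambda: df.select(col('name').substr(long(0), long(1))))  # noqa: F821
```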
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

Oh, like a docstring test for the type error?
[GitHub] spark issue #18926: [SPARK-21712] [PySpark] Clarify type error for Column.su...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18926

Pinging freshly minted committer @HyukjinKwon for a review on this tiny PR.
[GitHub] spark pull request #18926: [SPARK-21712] [PySpark] Clarify type error for Co...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/18926

[SPARK-21712] [PySpark] Clarify type error for Column.substr()

Proposed changes:

* Clarify the type error that `Column.substr()` gives.

Test plan:

* Tested this manually.
* Test code:
  ```python
  from pyspark.sql.functions import col, lit
  spark.createDataFrame([['nick']], schema=['name']).select(col('name').substr(0, lit(1)))
  ```
* Before:
  ```
  TypeError: Can not mix the type
  ```
* After:
  ```
  TypeError: startPos and length must be the same type. Got <type 'int'> and <class 'pyspark.sql.column.Column'>, respectively.
  ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nchammas/spark SPARK-21712-substr-type-error

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18926.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18926

commit 753dbe1743f552fe7b4867d3e4d24cdcc2ca1669
Author: Nicholas Chammas
Date: 2017-08-11T18:39:59Z

    clarify type error for Column.substr()
[GitHub] spark pull request #18818: [SPARK-21110][SQL] Structs, arrays, and other ord...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18818#discussion_r131640333

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala ---
@@ -79,18 +79,6 @@ private[sql] class TypeCollection(private val types: Seq[AbstractDataType])
 private[sql] object TypeCollection {

   /**
-   * Types that can be ordered/compared. In the long run we should probably make this a trait
-   * that can be mixed into each data type, and perhaps create an `AbstractDataType`.
-   */
-  // TODO: Should we consolidate this with RowOrdering.isOrderable?
--- End diff --

Just curious: Do we need to do anything with `RowOrdering.isOrderable` given the change in this PR?
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18820

> I don't think we should allow user to change field nullability while doing replace.

Why not? As long as we correctly update the schema from non-nullable to nullable, it seems OK to me. What would we be protecting against by disallowing this?
[GitHub] spark issue #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to replace ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/18820

Jenkins test this please. (Let's see if I still have the magic power.)
[GitHub] spark pull request #18820: [SPARK-14932][SQL] Allow DataFrame.replace() to r...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/18820#discussion_r131208895

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1423,8 +1434,9 @@ def all_of_(xs):
             subset = [subset]

         # Verify we were not passed in mixed type generics."
-        if not any(all_of_type(rep_dict.keys()) and all_of_type(rep_dict.values())
-                   for all_of_type in [all_of_bool, all_of_str, all_of_numeric]):
+        if not any(key_all_of_type(rep_dict.keys()) and value_all_of_type(rep_dict.values())
+                   for (key_all_of_type, value_all_of_type)
+                   in [all_of_bool, all_of_str, all_of_numeric]):
--- End diff --

Why not just put `None` here and keep the various `all_of_*` variables defined as they were before?
[GitHub] spark issue #3029: [SPARK-4017] show progress bar in console
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/3029

`spark.ui.showConsoleProgress=false` works for me. I pass it via `--conf` to `spark-submit`. Try that if you haven't already.
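For anyone who prefers to set this from code rather than on the command line, the same setting can also be placed on the `SparkConf` before the context starts — a hedged sketch, assuming (as was the behavior around this time) that the context reads `spark.ui.showConsoleProgress` from its conf at startup:

```python
from pyspark import SparkConf, SparkContext

# Equivalent, in code, to passing --conf spark.ui.showConsoleProgress=false to spark-submit.
conf = SparkConf().set("spark.ui.showConsoleProgress", "false")
sc = SparkContext(conf=conf)
```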
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r115497704

--- Diff: python/pyspark/ml/tests.py ---
@@ -71,6 +71,34 @@
 ser = PickleSerializer()

+def generate_multinomial_logistic_input(
+        weights, x_mean, x_variance, add_intercept, n_points, seed=None):
+    """Creates multinomial logistic dataset"""
+
+    if seed:
+        np.random.seed(seed)
+    n_features = x_mean.shape[0]
+
+    x = np.random.randn(n_points, n_features)
+    x = x * np.sqrt(x_variance) + x_mean
+
+    if add_intercept:
+        x = np.hstack([x, np.ones((n_points, 1))])
+
+    # Compute margins
+    margins = np.hstack([np.zeros((n_points, 1)), x.dot(weights.T)])
+    # Shift to avoid overflow and compute probs
+    probs = np.exp(np.subtract(margins, margins.max(axis=1).reshape(n_points, -1)))
+    # Compute cumulative prob
+    cum_probs = np.cumsum(probs / probs.sum(axis=1).reshape(n_points, -1), axis=1)
+    # Asign class
--- End diff --

"Assign class", though IMO you could also just do away with the comments in this section.
[GitHub] spark pull request #17922: [SPARK-20601][PYTHON][ML] Python API Changes for ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/17922#discussion_r115497473

--- Diff: python/pyspark/ml/classification.py ---
@@ -374,6 +415,48 @@ def getFamily(self):
         """
         return self.getOrDefault(self.family)

+    @since("2.2.0")
--- End diff --

Since we're voting on 2.2 now, I presume this will make it for 2.3.
[GitHub] spark issue #13257: [SPARK-15474][SQL]ORC data source fails to write and rea...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/13257

The discussion on [ORC-152](https://issues.apache.org/jira/browse/ORC-152) suggests that this is an issue with Spark's DataFrame writer for ORC, not with ORC itself. If you have evidence that this is not the case, it would be good to post it directly on ORC-152 so we can get input from people on that project.
[GitHub] spark pull request #16793: [SPARK-19454][PYTHON][SQL] DataFrame.replace impr...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/16793#discussion_r100701818

--- Diff: python/pyspark/sql/tests.py ---
@@ -1591,6 +1591,67 @@ def test_replace(self):
         self.assertEqual(row.age, 10)
         self.assertEqual(row.height, None)

+        # replace with lists
+        row = self.spark.createDataFrame(
+            [(u'Alice', 10, 80.1)], schema).replace([u'Alice'], [u'Ann']).first()
+        self.assertTupleEqual(row, (u'Ann', 10, 80.1))
+
+        # replace with dict
+        row = self.spark.createDataFrame(
+            [(u'Alice', 10, 80.1)], schema).replace({10: 11}).first()
+        self.assertTupleEqual(row, (u'Alice', 11, 80.1))
--- End diff --

This is the only test of "new" functionality (excluding error cases), correct?
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

> the AWS SDK you get will be in sync with hadoop-aws; you have to keep them in sync.

Did you mean here, "you _don't_ have to keep them in sync"?

> Dependency management is an enternal conflict

As an aside, I guess this is what the whole process of shading dependencies is for, right? I always wondered whether that could be done automatically somehow.

Anyway, thanks for orienting me @steveloughran and @srowen. I appreciate your time. I'll step aside and let y'all continue working out what this PR needs to do.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

> This won't be enabled in a default build of Spark.

Okie doke. I don't want to derail the PR review here, but I'll ask since it's on-topic: Is there a way for projects like [Flintrock](https://github.com/nchammas/flintrock) and spark-ec2 to set clusters up such that Spark automatically has S3 support enabled? Do we just name the appropriate packages in `spark-defaults.conf` under `spark.jars.packages`?

Actually, I feel a little silly now. It seems kinda obvious in retrospect. So, to @steveloughran's point, that leaves (for me, at least) the question of knowing what version of the AWS SDK goes with what version of `hadoop-aws`, and so on. Is there a place outside of this PR where one would be able to see that? [This page](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) doesn't have a version mapping, for example.
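For illustration, the idea would be a line like the following in `conf/spark-defaults.conf`. The coordinates and versions here are placeholders only — picking the right `hadoop-aws` and AWS SDK versions for a given Hadoop build is exactly the mapping question raised above:

```
spark.jars.packages  org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4
```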
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

Thanks for elaborating on where this work will help, @steveloughran. Again, just speaking from my own point of view as Spark user and [Flintrock](https://github.com/nchammas/flintrock) maintainer, this sounds like it would be a big help. I hope that after getting something like this in, we can have the default builds of Spark leverage it to bundle support for S3.
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

> Does a build of Spark + Hadoop 2.7 right now have no ability at all to read from S3 out of the box, or just not full / ideal support?

No ability at all, as far as I can tell. People have to explicitly start their Spark session with a call to `--packages` like this:

```
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
```

Without that, you get a `java.io.IOException: No FileSystem for scheme: s3n` if you try to read something from S3.

I see the maintainer case for not wanting to have the default builds of Spark include AWS-specific stuff, and at the same time the end-user case for having that is just as clear.
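Once the shell is launched with those packages, a read like the following is the kind of thing that starts working. This is a hypothetical example: the bucket and path are made up, it assumes the shell's `sc` SparkContext, and it assumes AWS credentials are already configured for the s3n connector:

```python
# Hypothetical path; requires launching with --packages as above
# and having s3n credentials configured.
rdd = sc.textFile("s3n://my-bucket/some/path/*.txt")
print(rdd.take(5))
```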
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004

As a dumb end-user, and as the maintainer of [Flintrock](https://github.com/nchammas/flintrock), my interest in this PR stems from the hope that we will be able to get builds of Spark against the latest version of Hadoop that can interact with S3 out of the box. Because Spark builds against Hadoop 2.6 and 2.7 don't have that support, many Flintrock users [opt to use Spark built against Hadoop 2.4](https://github.com/nchammas/flintrock/issues/88) since S3 support was still bundled in with those versions. Many users don't know that they can get S3 support at runtime with the right call to `--packages`. Given that Spark and S3 are very commonly used together, I hope there is some way we can address the out-of-the-box use case here.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

@davies - Should this also be cherry-picked into 2.0 and 2.1? I think this config has been there for a while, just without documentation.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

@srowen - OK, I elaborated a bit based on the snippet you posted. Feel free to nitpick on the wording. Would be happy to tweak further.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

@srowen - Good call. Will elaborate a bit based on what you posted.
[GitHub] spark issue #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to config...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16151

cc @davies
[GitHub] spark pull request #16151: [SPARK-18719] Add spark.ui.showConsoleProgress to...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/16151

[SPARK-18719] Add spark.ui.showConsoleProgress to configuration docs

This PR adds `spark.ui.showConsoleProgress` to the configuration docs. I tested this PR by building the docs locally and confirming that this change shows up as expected.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nchammas/spark ui-progressbar-doc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16151.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16151

commit ceef9197513f20f85b9ac73cff014a0dc31adb37
Author: Nicholas Chammas
Date: 2016-12-05T19:04:09Z

    Add spark.ui.showConsoleProgress to configuration docs
[GitHub] spark issue #16130: Update location of Spark YARN shuffle jar
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/16130

cc @vanzin?
[GitHub] spark pull request #16130: Update location of Spark YARN shuffle jar
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/16130

Update location of Spark YARN shuffle jar

Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nchammas/spark yarn-doc-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16130.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16130

commit 979a8a1811f471cd333bdde459649974626e612e
Author: Nicholas Chammas
Date: 2016-12-03T20:11:18Z

    update location of Spark shuffle jar
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

LGTM as a first cut. The workflow that I will use during development and that I think should be supported, i.e.

```sh
./dev/make-distribution.sh --pip
pip install -e ./python/
```

works, so I'm happy.

There is room for future improvements -- like building wheels and maybe simplifying the packaging tests -- but I think if we get this in now as an experimental new feature and give people time to use it, it'll help us refine things with more confidence down the line.
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86699002 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +module_home = imp.find_module("pyspark")[1] +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +module_home = os.path.dirname(find_spec("pyspark").origin) +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True --- End diff -- Same nit about `pass`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86698782 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +module_home = imp.find_module("pyspark")[1] +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True --- End diff -- Nit: The idiom in Python for "do nothing" is usually `pass`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
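For anyone skimming, the conventional form being suggested looks like this — a generic illustration, not the PR's exact code:

```python
try:
    import pyspark  # noqa: F401
except ImportError:
    # Not pip-installed; nothing to do.
    pass
```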
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86699184 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +module_home = imp.find_module("pyspark")[1] +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +module_home = os.path.dirname(find_spec("pyspark").origin) +paths.append(module_home) +# If we are installed in edit mode also look two dirs up +paths.append(os.path.join(module_home, "../../")) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = [os.path.abspath(p) for p in paths] + +try: +return next(path for path in paths if is_spark_home(path)) +except StopIteration: +print("Could not find valid SPARK_HOME while searching %s".format(paths), file=sys.stderr) --- End diff -- Hmm, did a commit get gobbled up accidentally? This line still uses `%` and is missing an `exit(1)`. I see you changed it for another file, so I assume you meant to do it here too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86698987 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,73 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars")) or + os.path.isdir(os.path.join(path, "assembly" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] --- End diff -- I guess if this works we don't have to change it, but to clarify my earlier comment about why `dirname()` is better than joining to `'../'`: ``` >>> os.path.join('/example/path', '../') '/example/path/../' >>> os.path.dirname('/example/path') '/example' ``` There are a few places where this could be changed, but it's not a big deal. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

Dunno why the tests are failing, but it's not related to packaging. Anyway, the install recipe I [posted earlier](https://github.com/apache/spark/pull/15659#issuecomment-258693543) is working now, so that's good. Since the earlier failure I reported was not caught by our packaging test, does that mean our tests are missing something?
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

Jenkins, retest this please.
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659

I'll try out your install recipe, but I believe

```sh
./dev/make-distribution.sh --pip
pip install -e ./python/
```

should be a valid way of installing a development version of PySpark. Specifically, `pip install -e` is [how Python users install local projects](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs).
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86692033 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] --- End diff -- I meant you are probably looking for ```python paths = [THIS_DIR, os.path.dirname(THIS_DIR)] ``` The signature of `os.path.dirname()` is the same in Python 3. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86690907 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +paths.append(imp.find_module("pyspark")[1]) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +paths.append(os.path.dirname(find_spec("pyspark").origin)) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = map(lambda path: os.path.abspath(path), paths) --- End diff -- ```python paths = [os.path.abspath(p) for p in paths] ``` This is more Pythonic and eliminates the need to call `list()` on the output of `map()` later, because `map()` returns an iterator. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86690854 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +paths.append(imp.find_module("pyspark")[1]) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +paths.append(os.path.dirname(find_spec("pyspark").origin)) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = map(lambda path: os.path.abspath(path), paths) + +try: +return next(path for path in paths if is_spark_home(path)) +except StopIteration: +print("Could not find valid SPARK_HOME while searching %s" % paths, file=sys.stderr) --- End diff -- ```python print("Could not find valid SPARK_HOME while searching {}".format(paths), file=sys.stderr) ``` Minor point, but `%` is discouraged these days in favor of `format()`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86691246 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] --- End diff -- Couple of comments here: 1. A better way to get a directory relative to the current file is to have something like this at the top of the file and refer to it as necessary: ``` THIS_DIR = os.path.dirname(os.path.realpath(__file__)) ``` 2. The correct way to go up one directory is to just call `dirname()` again. `os.path.join(..., '../')` will just append `'../'` to the end of the path, which may not work as expected later on. So I think you're looking for `THIS_DIR` and `os.path.dirname(THIS_DIR)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
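As a rough illustration of the difference between the two approaches (the directory layout in the comments is hypothetical):

```python
import os

# Hypothetical location of this file: /opt/spark/python/pyspark/find_spark_home.py
THIS_DIR = os.path.dirname(os.path.realpath(__file__))

# Joining with "../" only appends the segment; normalization happens later, if at all.
parent_via_join = os.path.join(THIS_DIR, "../")   # e.g. '/opt/spark/python/pyspark/../'

# Calling dirname() again actually walks up one level.
parent_via_dirname = os.path.dirname(THIS_DIR)    # e.g. '/opt/spark/python'
```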
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86690957 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,66 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(os.path.join(path, "bin/spark-submit")) and +(os.path.isdir(os.path.join(path, "jars" + +paths = ["../", os.path.join(os.path.dirname(sys.argv[0]), "../")] + +# Add the path of the PySpark module if it exists +if sys.version < "3": +import imp +try: +paths.append(imp.find_module("pyspark")[1]) +except ImportError: +# Not pip installed no worries +True +else: +from importlib.util import find_spec +try: +paths.append(os.path.dirname(find_spec("pyspark").origin)) +except ImportError: +# Not pip installed no worries +True + +# Normalize the paths +paths = map(lambda path: os.path.abspath(path), paths) + +try: +return next(path for path in paths if is_spark_home(path)) +except StopIteration: +print("Could not find valid SPARK_HOME while searching %s" % paths, file=sys.stderr) --- End diff -- We should raise an exception here or `exit(1)` since this is a fatal error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
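A minimal sketch of what "fail loudly" could look like here (the helper name and wording are illustrative, not the PR's actual code):

```python
from __future__ import print_function
import sys


def fail_no_spark_home(searched_paths):
    # Report the failure on stderr and exit non-zero so calling scripts notice it.
    print("Could not find valid SPARK_HOME while searching {}".format(searched_paths),
          file=sys.stderr)
    sys.exit(1)
    # Alternatively, raise an exception and let the caller decide:
    # raise RuntimeError("Could not find valid SPARK_HOME")
```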
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 I tested this out with Python 3 on my system with the following commands: ``` # Inside ./spark/. python3 -m venv venv source venv/bin/activate ./dev/make-distribution.sh --pip pip install -e ./python/ which pyspark pyspark ``` Seems there is a bug with how `SPARK_HOME` is computed: ``` [make-distribution.sh output snipped] $ pip install -e ./python/ Obtaining file:///.../apache/spark/python Collecting py4j==0.10.4 (from pyspark==2.1.0.dev1) Downloading py4j-0.10.4-py2.py3-none-any.whl (186kB) 100% |████████████████████████████████| 194kB 2.0MB/s Installing collected packages: py4j, pyspark Running setup.py develop for pyspark Successfully installed py4j-0.10.4 pyspark $ which pyspark .../apache/spark/venv/bin/pyspark $ pyspark Could not find valid SPARK_HOME while searching .../apache/spark/venv/bin/pyspark: line 24: None/bin/load-spark-env.sh: No such file or directory .../apache/spark/venv/bin/pyspark: line 77: .../apache/spark/None/bin/spark-submit: No such file or directory ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86668198 --- Diff: docs/building-spark.md --- @@ -259,6 +259,14 @@ or Java 8 tests are automatically enabled when a Java 8 JDK is detected. If you have JDK 8 installed but it is not the system default, you can set JAVA_HOME to point to JDK 8 before running the tests. +## PySpark pip installable + +If you are building Spark for use in a Python environment and you wish to pip install it, you will first need to build the Spark JARs as described above. Then you can construct an sdist package suitable for setup.py and pip installable package. + +cd python; python setup.py sdist --- End diff -- Just to confirm, if I run this: ``` ./dev/make-distribution.sh --pip ``` It should take care of both building the right JARs _and_ building the Python package. Then I just run: ``` pip install -e ./python/ ``` to install Spark into my Python environment. Is that all correct? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86668059 --- Diff: python/setup.py --- @@ -0,0 +1,180 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages +from shutil import copyfile, copytree, rmtree + +if sys.version_info < (2, 7): +print("Python versions prior to 2.7 are not supported for pip installed PySpark.", + file=sys.stderr) +exit(-1) + +try: +exec(open('pyspark/version.py').read()) +except IOError: +print("Failed to load PySpark version file for packaging you must be in Spark's python dir.", + file=sys.stderr) +sys.exit(-1) +VERSION = __version__ +# A temporary path so we can access above the Python project root and fetch scripts and jars we need +TEMP_PATH = "deps" +SPARK_HOME = os.path.abspath("../") +JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME + +# Use the release jars path if we are in release mode. +if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1): +JARS_PATH = "%s/jars/" % SPARK_HOME + +EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME +SCRIPTS_PATH = "%s/bin" % SPARK_HOME +SCRIPTS_TARGET = "%s/bin" % TEMP_PATH +JARS_TARGET = "%s/jars" % TEMP_PATH +EXAMPLES_TARGET = "%s/examples" % TEMP_PATH + +# Check and see if we are under the spark path in which case we need to build the symlink farm. +# This is important because we only want to build the symlink farm while under Spark otherwise we +# want to use the symlink farm. And if the symlink farm exists under while under Spark (e.g. a +# partially built sdist) we should error and have the user sort it out. +in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or +(os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1)) + +if (in_spark): +# Construct links for setup +try: +os.mkdir(TEMP_PATH) +except: +print("Temp path for symlink to parent already exists %s" % TEMP_PATH, file=sys.stderr) +exit(-1) + +try: +if (in_spark): +# Construct the symlink farm - this is necessary since we can't refer to the path above the +# package root and we need to copy the jars and scripts which are up above the python root. 
+if getattr(os, "symlink", None) is not None: +os.symlink(JARS_PATH, JARS_TARGET) +os.symlink(SCRIPTS_PATH, SCRIPTS_TARGET) +os.symlink(EXAMPLES_PATH, EXAMPLES_TARGET) +else: +# For windows fall back to the slower copytree +copytree(JARS_PATH, JARS_TARGET) +copytree(SCRIPTS_PATH, SCRIPTS_TARGET) +copytree(EXAMPLES_PATH, EXAMPLES_TARGET) +else: +# If we are not inside of SPARK_HOME verify we have the required symlink farm +if not os.path.exists(JARS_TARGET): +print("To build packaging must be in the python directory under the SPARK_HOME.", + file=sys.stderr) +# We copy the shell script to be under pyspark/python/pyspark so that the launcher scripts +# find it where expected. The rest of the files aren't copied because they are accessed +# using Python imports instead which will be resolved correctly. +try: +os.makedirs("pyspark/python/pyspark") +except OSError: +# Don't worry if the directory already exists.
[GitHub] spark pull request #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r86667967 --- Diff: python/setup.py --- @@ -0,0 +1,180 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages +from shutil import copyfile, copytree, rmtree + +if sys.version_info < (2, 7): +print("Python versions prior to 2.7 are not supported for pip installed PySpark.", + file=sys.stderr) +exit(-1) + +try: +exec(open('pyspark/version.py').read()) +except IOError: +print("Failed to load PySpark version file for packaging you must be in Spark's python dir.", --- End diff -- Seems like there is a missing sentence break somewhere here. :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 @rxin - Not yet, but I will test it this weekend. Yes, PyPI does have a limit, but we can request an exemption. I can help coordinate that with the PyPI admins when we get there. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15733: [SPARK-18138][DOCS] Document that Java 7, Python ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15733#discussion_r86158332 --- Diff: docs/index.md --- @@ -28,8 +28,9 @@ Spark runs on Java 7+, Python 2.6+/3.4+ and R 3.1+. For the Scala API, Spark {{s uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version ({{site.SCALA_BINARY_VERSION}}.x). -Note that support for Java 7, Python 2.6, Scala 2.10 and version of Hadoop before 2.6 are -deprecated as of Spark 2.1.0, and may be removed in Spark 2.2.0. +Note that support for Java 7 and Python 2.6 are deprecated as of Spark 2.0.0, and support for +Scala 2.10 and version of Hadoop before 2.6 are deprecated as of Spark 2.1.0, and may be --- End diff -- "... and versions of Hadoop..." --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 Later today (or later this week) I will try actually using this branch to install Spark via pip and report back. ``` pip install git+https://github.com/holdenk/spark@SPARK-1267-pip-install-pyspark ``` @holdenk - I use this method to install development versions of packages straight off of GitHub. Do you expect this pattern to work for Spark as well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15733: [SPARK-18138][DOCS] Document that Java 7, Python ...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15733#discussion_r86141486 --- Diff: docs/building-spark.md --- @@ -13,6 +13,7 @@ redirect_from: "building-with-maven.html" The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+. +Note that support for Java 7 is deprecated as of Spark 2.1.0 and may be removed in Spark 2.2.0. --- End diff -- I believe it's been deprecated since 2.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [SPARK-1267][SPARK-18129] Allow PySpark to be pip instal...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 We have an AppVeyor build now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85531031 --- Diff: python/setup.py --- @@ -0,0 +1,170 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages --- End diff -- pip bundles setuptools, so if you have pip you have setuptools. Specifically, I think if this script is being invoked because the user ran pip, this will work. If it is invoked as `python setup.py`, though, it is possible for this to fail because the user doesn't have setuptools. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
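A common guard for that situation (a sketch, not something this PR does) is to fall back to distutils when setuptools is unavailable, at the cost of losing helpers like `find_packages`:

```python
try:
    from setuptools import setup, find_packages
except ImportError:
    # Plain `python setup.py ...` without setuptools installed lands here;
    # distutils provides setup() but no find_packages(), so packages would
    # have to be listed by hand.
    from distutils.core import setup
    find_packages = None
```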
[GitHub] spark issue #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to be pip i...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 From the PR description: > figure out who owns the pyspark package name on prod PyPI (is it someone with in the project or should we ask PyPI or should we choose a different name to publish with like ApachePySpark?) Don't we want to publish to `apache-spark`? Dunno if Apache has any rules about that. For prior art, see [`apache-libcloud` on PyPI](https://pypi.org/project/apache-libcloud/). Btw, how did you determine that `pyspark` is taken on PyPI? We can definitely reach out to the admins to ask if they can release the name. I'll find out how exactly to do that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to be pip i...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15659 Thanks for the additional context @holdenk and @rgbkrk. It's important to lay it out somewhere clearly so that the non-Python developers among us (and the forgetful Python developers like me) can understand the benefit we're aiming for here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85377223 --- Diff: pom.xml --- @@ -26,6 +26,7 @@ org.apache.spark spark-parent_2.11 + --- End diff -- Not a sticking point for me, but since it adds a manual step for committers during release ("verify the PySpark version is correct" - maybe this can be automated?) they may object. I remember @davies had an issue with this in the last PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85364365 --- Diff: pom.xml --- @@ -26,6 +26,7 @@ org.apache.spark spark-parent_2.11 + --- End diff -- Something along the lines of `.splitlines()...strip().startswith('<version>')` would work, and it's easy to error out if it broke, no? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
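A sketch of that line-based approach (the helper name is hypothetical, and it assumes the relevant `<version>` element is the first one that appears in `pom.xml`):

```python
def pom_version_from_lines(pom_path):
    # Naive, dependency-free extraction: find the first <version>...</version> line.
    with open(pom_path) as f:
        for line in f.read().splitlines():
            line = line.strip()
            if line.startswith("<version>") and line.endswith("</version>"):
                return line[len("<version>"):-len("</version>")]
    # Error out loudly if the expected line is not there.
    raise RuntimeError("Could not find a <version> element in {}".format(pom_path))
```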
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85365186 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides +high-level APIs in Scala, Java, Python, and R, and an optimized engine that +supports general computation graphs for data analysis. It also supports a +rich set of higher-level tools including Spark SQL for SQL and DataFrames, +MLlib for machine learning, GraphX for graph processing, +and Spark Streaming for stream processing. + +<http://spark.apache.org/> + +## Online Documentation + +You can find the latest Spark documentation, including a programming +guide, on the [project web page](http://spark.apache.org/documentation.html) + + +## Python Packaging + +This README file only contains basic information related to pip installed PySpark. +This packaging is currently experimental and may change in future versions (although we will do our best to keep compatibility). +Using PySpark requires the Spark JARs, and if you are building this from source please see the builder instructions at +["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html). + +The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the [Apache Spark downloads page](http://spark.apache.org/downloads.html). --- End diff -- I see. So `pip install pyspark` can completely replace `brew install apache-spark` for local development, or for submitting from a local machine to a remote cluster. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85364778 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides --- End diff -- I see. And I'm guessing we can't/don't want to somehow reference the README in the root directory? (Perhaps even with a symlink, if necessary...) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85355701 --- Diff: python/setup.py --- @@ -0,0 +1,169 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import glob +import os +import sys +from setuptools import setup, find_packages +from shutil import copyfile, copytree, rmtree + +exec(open('pyspark/version.py').read()) +VERSION = __version__ +# A temporary path so we can access above the Python project root and fetch scripts and jars we need +TEMP_PATH = "deps" +SPARK_HOME = os.path.abspath("../") +JARS_PATH = "%s/assembly/target/scala-2.11/jars/" % SPARK_HOME + +# Use the release jars path if we are in release mode. +if (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1): +JARS_PATH = "%s/jars/" % SPARK_HOME + +EXAMPLES_PATH = "%s/examples/src/main/python" % SPARK_HOME +SCRIPTS_PATH = "%s/bin" % SPARK_HOME +SCRIPTS_TARGET = "%s/bin" % TEMP_PATH +JARS_TARGET = "%s/jars" % TEMP_PATH +EXAMPLES_TARGET = "%s/examples" % TEMP_PATH + +if sys.version_info < (2, 7): +print("Python versions prior to 2.7 are not supported.", file=sys.stderr) +exit(-1) + +# Check and see if we are under the spark path in which case we need to build the symlink farm. +# This is important because we only want to build the symlink farm while under Spark otherwise we +# want to use the symlink farm. And if the symlink farm exists under while under Spark (e.g. a +# partially built sdist) we should error and have the user sort it out. +in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or +(os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1)) + +if (in_spark): +# Construct links for setup +try: +os.mkdir(TEMP_PATH) +except: +print("Temp path for symlink to parent already exists %s" % TEMP_PATH, file=sys.stderr) +exit(-1) + +try: +if (in_spark): +# Construct the symlink farm --- End diff -- What's the purpose of these symlinks? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85352748 --- Diff: pom.xml --- @@ -26,6 +26,7 @@ org.apache.spark spark-parent_2.11 + --- End diff -- Would it be overkill to just have `version.py` parse this file for the version string? Not necessarily with a full XML parser, but with a simple string match or regex and fail noisily if we're unable to extract the version. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
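For illustration, a regex-based version of the same idea (a hypothetical helper; it assumes the version string of interest is in the first `<version>` element the pattern matches):

```python
import re


def pom_version_via_regex(pom_path):
    # Simple string match that fails noisily if the pattern is not found.
    with open(pom_path) as f:
        match = re.search(r"<version>([^<]+)</version>", f.read())
    if match is None:
        raise RuntimeError("Could not extract a version string from {}".format(pom_path))
    return match.group(1)
```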
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85355211 --- Diff: python/pyspark/find_spark_home.py --- @@ -0,0 +1,65 @@ +#!/usr/bin/python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# This script attempt to determine the correct setting for SPARK_HOME given +# that Spark may have been installed on the system with pip. + +from __future__ import print_function +import os +import sys + + +def _find_spark_home(): +"""Find the SPARK_HOME.""" +# If the enviroment has SPARK_HOME set trust it. +if "SPARK_HOME" in os.environ: +return os.environ["SPARK_HOME"] + +def is_spark_home(path): +"""Takes a path and returns true if the provided path could be a reasonable SPARK_HOME""" +return (os.path.isfile(path + "/bin/spark-submit") and os.path.isdir(path + "/jars")) --- End diff -- Instead of building paths with `+`, we should be using `os.path.join()`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
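A quick sketch of the difference (the example path is hypothetical):

```python
import os

spark_home = "/opt/spark"  # hypothetical SPARK_HOME

# String concatenation hard-codes the separator:
submit_concat = spark_home + "/bin/spark-submit"

# os.path.join picks the right separator for the platform and avoids doubled slashes:
submit_joined = os.path.join(spark_home, "bin", "spark-submit")
jars_dir = os.path.join(spark_home, "jars")
```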
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85354868 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides +high-level APIs in Scala, Java, Python, and R, and an optimized engine that +supports general computation graphs for data analysis. It also supports a +rich set of higher-level tools including Spark SQL for SQL and DataFrames, +MLlib for machine learning, GraphX for graph processing, +and Spark Streaming for stream processing. + +<http://spark.apache.org/> + +## Online Documentation + +You can find the latest Spark documentation, including a programming +guide, on the [project web page](http://spark.apache.org/documentation.html) + + +## Python Packaging + +This README file only contains basic information related to pip installed PySpark. +This packaging is currently experimental and may change in future versions (although we will do our best to keep compatibility). +Using PySpark requires the Spark JARs, and if you are building this from source please see the builder instructions at +["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html). + +The Python packaging for Spark is not intended to replace all of the other use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the [Apache Spark downloads page](http://spark.apache.org/downloads.html). --- End diff -- If I am doing local development on my Mac, for example, what does pip installing Spark get me? It sounds like from this line that even if I pip install Spark, I will still need to separately `brew install apache-spark` or something to be able to run Spark programs. Is that correct? How does my workflow change or improve if I can pip install Spark? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85350993 --- Diff: bin/spark-class --- @@ -36,7 +36,7 @@ else fi # Find Spark jars. -if [ -f "${SPARK_HOME}/RELEASE" ]; then +if [ -d "${SPARK_HOME}/jars" ]; then --- End diff -- Why did this get changed from `RELEASE` to `jars`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85355847 --- Diff: python/setup.cfg --- @@ -0,0 +1,22 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +[bdist_wheel] +universal = 1 + +[metadata] +description-file = README.md --- End diff -- Newline here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85353057 --- Diff: python/MANIFEST.in --- @@ -0,0 +1,23 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +recursive-include deps/jars *.jar +recursive-include deps/bin * --- End diff -- Minor point, but `graft` seems more appropriate here. See: https://docs.python.org/3/distutils/commandref.html#sdist-cmd --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85353699 --- Diff: python/README.md --- @@ -0,0 +1,32 @@ +# Apache Spark + +Spark is a fast and general cluster computing system for Big Data. It provides --- End diff -- Would it be appropriate to cut this paragraph out and just leave the stuff about packaging? If these blurbs ever change I don't think we want to have to update them in multiple places, and we already have this blurb in at least one other place, I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15659: [WIP][SPARK-1267][SPARK-18129] Allow PySpark to b...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15659#discussion_r85351820 --- Diff: dev/create-release/release-build.sh --- @@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then export ZINC_PORT=$ZINC_PORT echo "Creating distribution: $NAME ($FLAGS)" +# Write out the NAME and VERSION to PySpark version info we rewrite the - into a . and SNAPSHOT --- End diff -- Do we want to have a version string that's slightly different from the "original", just for Python? I'm thinking about what will happen if people, for example, want to do the same for R. Having 3 slightly different ways of showing the version string seems unnecessary. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15567: [SPARK-14393][SQL] values generated by non-deterministic...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/15567 @mengxr - I think this PR will also address [SPARK-14241](https://issues.apache.org/jira/browse/SPARK-14241). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12004: [SPARK-7481] [build] Add spark-cloud module to pull in o...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/12004 @steveloughran - Is this message in the most recent build log critical? ``` Spark's published dependencies DO NOT MATCH the manifest file (dev/spark-deps). To update the manifest file, run './dev/test-dependencies.sh --replace-manifest'. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r83349121 --- Diff: sbin/spark-daemon.sh --- @@ -146,13 +176,11 @@ run_command() { case "$mode" in (class) - nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null & - newpid="$!" + execute_command nice -n $SPARK_NICENESS ${SPARK_HOME}/bin/spark-class $command $@ ;; (submit) - nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-submit --class $command "$@" >> "$log" 2>&1 < /dev/null & - newpid="$!" + execute_command nice -n $SPARK_NICENESS bash ${SPARK_HOME}/bin/spark-submit --class $command $@ --- End diff -- Same here: I would quote the `SPARK_` environment variables. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r83349154 --- Diff: sbin/spark-daemon.sh --- @@ -122,6 +123,35 @@ if [ "$SPARK_NICENESS" = "" ]; then export SPARK_NICENESS=0 fi +execute_command() { + local command="$@" + if [ -z ${SPARK_NO_DAEMONIZE+set} ]; then + nohup -- $command >> $log 2>&1 < /dev/null & + newpid="$!" + + echo "$newpid" > "$pid" + + #Poll for up to 5 seconds for the java process to start --- End diff -- Nit: Space after `#`. (I know it was like this before your PR.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15338: [SPARK-11653][Deploy] Allow spark-daemon.sh to ru...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/15338#discussion_r83349053 --- Diff: sbin/spark-daemon.sh --- @@ -146,13 +176,11 @@ run_command() { case "$mode" in (class) - nohup nice -n "$SPARK_NICENESS" "${SPARK_HOME}"/bin/spark-class $command "$@" >> "$log" 2>&1 < /dev/null & - newpid="$!" + execute_command nice -n $SPARK_NICENESS ${SPARK_HOME}/bin/spark-class $command $@ --- End diff -- If `SPARK_HOME` contains spaces, this will break. I recommend quoting both `SPARK_HOME` and `SPARK_NICENESS` as they were before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Looks good to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Thanks for the quick overview. That's pretty straightforward, actually! I'll take a look at `PipelinedRDD` for the details. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Hmm, OK I see. (Apologies, I don't understand what pipelined RDDs are for, so the examples are going a bit over my head.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 > So there is no chaining requirement, and it will only work in a with statement. @MLnick - Couldn't we also create a scenario (like @holdenk did earlier) where a user does something like this? ```python persisted_rdd = persisted(rdd) persisted_rdd.map(...).filter(...).count() ``` This would break pipelining too, no? And I think the expectation would be for it not to break pipelining, because existing common context managers in Python don't have a requirement that they _must_ be used in a `with` block. For example, `f = open(file)` works fine, as does `s = requests.Session()`, and the resulting objects have the same behavior as they would inside a `with` block. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
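To make the comparison concrete, a small sketch of how an existing context manager behaves the same with and without `with` (the file name is hypothetical):

```python
# Without `with`: the object works normally; cleanup is just manual.
f = open("example.txt", "w")
f.write("hello\n")
f.close()

# With `with`: same object, same behavior, plus automatic cleanup on exit.
with open("example.txt") as f:
    contents = f.read()
```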
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Ah, I see. I don't fully understand how `PipelinedRDD` works or how it is used so I'll have to defer to y'all on this. Does the `cached()` utility method have this same problem? > We could possibly work around it with some type checking etc but it then starts to feel like adding more complexity than the feature is worth... Agreed. At this point, actually, I'm beginning to feel this feature is not worth it. Context managers seem to work best when the objects they're working on have clear open/close-style semantics. File handles, network connections, and the like fit this pattern well. In fact, the [doc for `with`](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement) says: > This allows common `try...except...finally` usage patterns to be encapsulated for convenient reuse. RDDs and DataFrames, on the other hand, don't have a simple open/close or `try...except...finally` pattern. And when we try to map one onto persist and unpersist, we get the various side-effects we've been discussing here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/ca...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/14579#discussion_r74307747 --- Diff: python/pyspark/rdd.py --- @@ -221,6 +227,21 @@ def context(self): def cache(self): """ Persist this RDD with the default storage level (C{MEMORY_ONLY}). + +:py:meth:`cache` can be used in a 'with' statement. The RDD will be automatically +unpersisted once the 'with' block is exited. Note however that any actions on the RDD +that require the RDD to be cached, should be invoked inside the 'with' block; otherwise, +caching will have no effect. --- End diff -- Agreed, especially since this is technically a new Public API that we are potentially committing to for the life of the 2.x line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Sorry, you're right, `__exit__()`'s return value is not going to be consumed anywhere. What I meant is that `unpersist()` would return the base RDD or DataFrame object. But I'm not seeing the issue with the example you posted. Reformatting for clarity: ```python magic = rdd.persist() with magic as awesome: awesome.count() magic.map(lambda x: x + 1) ``` Are you saying `magic.map()` will error? Why would it? `magic` would be an instance of `PersistedRDD`, which in turn is a subclass of `RDD`, which has `map()` and all of the usual methods defined, plus the magic methods we need for the context manager. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 > the subclassing of RDD approach could cause us to miss out on pipelining if the RDD was used again after it was unpersisted How so? Wouldn't `__exit__()` simply return the parent RDD or DataFrame object? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
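For context on the pipelining point, here is a rough sketch of the behaviour being discussed; it assumes an existing `SparkContext` named `sc` and is not code from the PR itself:

```python
# Rough sketch of the pipelining concern; assumes an existing SparkContext `sc`.
rdd = sc.parallelize(range(10))

# Chained narrow transformations are normally fused into a single PipelinedRDD,
# so both lambdas run in one Python worker pass per partition.
fused = rdd.map(lambda x: x * 2).map(lambda x: x + 1)
print(type(fused).__name__)  # typically: PipelinedRDD

# Once an RDD is cached, later transformations are no longer pipelined into it,
# because the cached partitions have to be materialized first.
fused.cache()
later = fused.map(lambda x: x - 1)
print(later.count())
```

Whatever `__exit__()` returns would need to keep that fusion intact for transformations applied after the `with` block.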
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 None of our options seems great, but if I had to rank them I would say:

1. Add new `Persisted...` classes.
2. Make no changes.
3. Add separate `persisted()` or `cached()` utility method.
4. Modify base RDD and DataFrame classes.

Adding new internal classes for this use-case honestly seems a bit heavy-handed to me, so if we are against that then I would lean towards not doing anything. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Ah, you're right. So if we want to avoid needing magic methods in the main RDD/DataFrame classes and avoid needing a separate utility method like `cache()`, I think one option available to us is to have separate `PersistedRDD` and `PersistedDataFrame` classes that simply wrap the base RDD and DataFrame classes and add the appropriate magic methods. `.persist()` and `.cache()` would then return instances of these classes, which should satisfy the `type(x).__enter__(x)` behavior while still preserving backwards compatibility and method chaining. What do you think of that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
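A minimal sketch of what that could look like; `PersistedRDD` and the constructor details mentioned afterwards are assumptions for illustration, not code from this PR or from PySpark:

```python
# Hypothetical sketch only -- PersistedRDD does not exist in PySpark.
from pyspark.rdd import RDD


class PersistedRDD(RDD):
    """RDD subclass whose instances can be used as context managers."""

    def __enter__(self):
        # persist()/cache() are assumed to have already marked the RDD as persisted.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
        return False  # let any exception raised inside the block propagate
```

Under this sketch, `RDD.persist()` would return something like `PersistedRDD(self._jrdd, self.ctx, self._jrdd_deserializer)` after setting the storage level, though those constructor arguments are an assumption about PySpark internals rather than a settled design.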
[GitHub] spark issue #14579: [SPARK-16921][PYSPARK] RDD/DataFrame persist()/cache() s...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14579 Thanks @MLnick for taking this on and for breaking down what you've found so far. I took a look through [`contextlib`](https://docs.python.org/3/library/contextlib.html) for inspiration, and I wonder if the source code for [`closing()`](https://docs.python.org/3/library/contextlib.html#contextlib.closing) offers a template we can follow that would let `persist()` return an RDD/DataFrame instance with the correct magic methods, without having to modify the class. Have you taken a look at that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
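For reference, `contextlib.closing` is only a few lines, and a persist-oriented analogue along the same lines might look roughly like this (the name `unpersisting` is made up for illustration and is not part of PySpark):

```python
# Hypothetical helper modeled on contextlib.closing; not part of PySpark.
class unpersisting:
    """Context manager that persists `thing` on entry and unpersists it on exit."""

    def __init__(self, thing):
        self.thing = thing

    def __enter__(self):
        return self.thing.persist()

    def __exit__(self, exc_type, exc_value, traceback):
        self.thing.unpersist()
```

Callers would then write `with unpersisting(rdd) as cached: cached.count()` without any changes to the RDD or DataFrame classes themselves.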
[GitHub] spark issue #14496: [SPARK-16772] [Python] [Docs] Fix API doc references to ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14496 Thanks @srowen. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14496: [SPARK-16772] [Python] [Docs] Fix API doc references to ...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14496 cc @rxin - Follow-on to #14393. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14496: [SPARK-16772] [Python] [Docs] Fix API doc referen...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/14496 [SPARK-16772] [Python] [Docs] Fix API doc references to UDFRegistration + Update "important classes"

## Proposed Changes

* Update the list of "important classes" in `pyspark.sql` to match 2.0.
* Fix references to `UDFRegistration` so that the class shows up in the docs. It currently [doesn't](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).
* Remove some unnecessary whitespace in the Python RST doc files.

I reused the [existing JIRA](https://issues.apache.org/jira/browse/SPARK-16772) I created last week for similar API doc fixes.

## How was this patch tested?

* I ran `lint-python` successfully.
* I ran `make clean build` on the Python docs and confirmed the results are as expected locally in my browser.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/nchammas/spark SPARK-16772-UDFRegistration Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14496 commit 62f4f823ed33972d782506f5226b192fc45b1ede Author: Nicholas Chammas Date: 2016-08-04T17:16:31Z fix references to UDFRegistration --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14408: [SPARK-16772] Restore "datatype string" to Python API do...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14408 cc @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14393: [SPARK-16772] Correct API doc references to PySpa...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/14393#discussion_r72853914 --- Diff: python/pyspark/sql/context.py --- @@ -226,28 +226,34 @@ def createDataFrame(self, data, schema=None, samplingRatio=None): from ``data``, which should be an RDD of :class:`Row`, or :class:`namedtuple`, or :class:`dict`. -When ``schema`` is :class:`DataType` or datatype string, it must match the real data, or --- End diff -- Correction here: https://github.com/apache/spark/pull/14408 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14408: [SPARK-16772] Restore "datatype string" to Python...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/14408 [SPARK-16772] Restore "datatype string" to Python API docstrings

## What changes were proposed in this pull request?

This PR corrects [an error made in an earlier PR](https://github.com/apache/spark/pull/14393/files#r72843069).

## How was this patch tested?

```sh
$ ./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.
```

I also built the docs and confirmed that they looked good in my browser. You can merge this pull request into a Git repository by running: $ git pull https://github.com/nchammas/spark SPARK-16772 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14408.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14408 commit 58f388533a6300e49de0d239d3ad0f7d17afca50 Author: Nicholas Chammas Date: 2016-07-29T20:03:50Z restore "datatype string" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14393: [SPARK-16772] Correct API doc references to PySpa...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/14393#discussion_r72843069 --- Diff: python/pyspark/sql/context.py --- @@ -226,28 +226,34 @@ def createDataFrame(self, data, schema=None, samplingRatio=None): from ``data``, which should be an RDD of :class:`Row`, or :class:`namedtuple`, or :class:`dict`. -When ``schema`` is :class:`DataType` or datatype string, it must match the real data, or --- End diff -- I made a mistake here, thinking "datatype string" was actually meant to be `StringType()`. I understand now that a datatype string is actually a thing. Correction incoming... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
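For anyone else who trips over the terminology: a "datatype string" is a schema expressed as a string rather than as a `DataType` instance. A rough illustration, assuming an existing `SparkSession` named `spark` (the exact string syntax follows the `createDataFrame` docstring examples):

```python
# Illustration of a "datatype string" vs. an explicit DataType schema;
# assumes an existing SparkSession named `spark`.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

rows = [("nick", 1)]

# Schema given as a datatype string:
df1 = spark.createDataFrame(rows, "name: string, id: int")

# The equivalent schema given as an explicit DataType:
schema = StructType([
    StructField("name", StringType()),
    StructField("id", IntegerType()),
])
df2 = spark.createDataFrame(rows, schema)
```

Both calls should yield DataFrames with the same column names and types.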
[GitHub] spark issue #14393: [SPARK-16772] Correct API doc references to PySpark clas...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14393 Yes, I built the docs and reviewed several (but not all) of the changes locally in my browser and confirmed that the corrections I wanted took place as expected. (Apologies about not using the PR template when I first opened the PR. GitHub Desktop seems not to support that yet. I've updated the PR description to include this info now.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14393: [SPARK-16772] Correct API doc references to PySpark clas...
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/14393 Apologies for making a fairly "noisy" PR, with changes in several scattered places. However, as a PySpark user it's important to me that the API docs be properly formatted and that docstring class references work. Feel free to ping me on Python docstring changes in the future. I would be happy to review them. cc @rxin @davies - Ready for review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14393: [SPARK-16772] Correct references to DataType + ot...
GitHub user nchammas opened a pull request: https://github.com/apache/spark/pull/14393 [SPARK-16772] Correct references to DataType + other minor tweaks You can merge this pull request into a Git repository by running: $ git pull https://github.com/nchammas/spark python-docstring-fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14393.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14393 commit 3a24f4fb7ce30b4a261c0db2c27be11976dea678 Author: Nicholas Chammas Date: 2016-07-28T16:42:13Z [SPARK-16772] correct references to DataType --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13114: Branch 1.4
Github user nchammas commented on the issue: https://github.com/apache/spark/pull/13114 @srowen @vanzin - Shouldn't some automated process be picking up your comments ("close this PR") and closing this PR? I thought we had something like that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15542][SparkR] Make error message clear...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/13308#discussion_r64774474 --- Diff: R/install-dev.sh --- @@ -38,7 +38,12 @@ pushd $FWDIR > /dev/null if [ ! -z "$R_HOME" ] then R_SCRIPT_PATH="$R_HOME/bin" - else + else +# if system wide R_HOME is not found, then exit +if ! [ `command -v R` ]; then --- End diff -- Yeah, we typically put the `!` after the test: * https://github.com/apache/spark/blob/6d506c9ae9a2519d1a93e788ae5316d4f942d35d/dev/lint-python#L44 * https://github.com/apache/spark/blob/6d506c9ae9a2519d1a93e788ae5316d4f942d35d/dev/lint-java#L25 (In Bash, `[ ... ]` and `test` are synonyms.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15072][SQL][PYSPARK][HOT-FIX] Remove Sp...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/13069#issuecomment-219517952 Okie doke, thanks for the explanation! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org