[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
BTW, it'd be even nicer if the images in https://github.com/apache/spark/pull/15053#issuecomment-249550254 were included in the PR description, as we already have the images.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@mortada I see. Then:

- It'd barely increase the test time.
- It will improve the documentation.
- We have a reference for this: `numpy`.

Therefore, +1 for defining each dataframe within each method's docstring. Can we remove the dataframes from `globs` and then do the same thing across the PySpark documentation?

I just took a quick look, and it seems we have more instances, such as:

```
$ grep -r "globs\['df" .

./pyspark/sql/column.py:globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
./pyspark/sql/context.py:globs['df'] = rdd.toDF()
./pyspark/sql/dataframe.py:globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')])\
./pyspark/sql/dataframe.py:globs['df2'] = sc.parallelize([Row(name='Tom', height=80), Row(name='Bob', height=85)]).toDF()
./pyspark/sql/dataframe.py:globs['df3'] = sc.parallelize([Row(name='Alice', age=2),
./pyspark/sql/dataframe.py:globs['df4'] = sc.parallelize([Row(name='Alice', age=10, height=80),
./pyspark/sql/functions.py:globs['df'] = sc.parallelize([Row(name='Alice', age=2), Row(name='Bob', age=5)]).toDF()
./pyspark/sql/group.py:globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
./pyspark/sql/group.py:globs['df3'] = sc.parallelize([Row(name='Alice', age=2, height=80),
./pyspark/sql/group.py:globs['df4'] = sc.parallelize([Row(course="dotNET", year=2012, earnings=1),
./pyspark/sql/readwriter.py:globs['df'] = spark.read.parquet('python/test_support/sql/parquet_partitioned')
./pyspark/sql/session.py:globs['df'] = rdd.toDF()
./pyspark/sql/streaming.py:globs['df'] = \
```

Also, I think we'd better have a JIRA.
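
To illustrate what "the same thing" would look like per method (Suggestion A), here is a hypothetical sketch for a single docstring, using `distinct` and the same toy data as this PR:

```python
def distinct(self):
    """Returns a new :class:`DataFrame` containing the distinct rows in this :class:`DataFrame`.

    >>> df = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
    >>> df.distinct().count()
    2
    """
```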





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-29 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon do you know how I could run the doctests for these files? I 
found this online: 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals which says 
that I could do 

```
SPARK_TESTING=1 ./bin/pyspark python/pyspark/my_file.py
```
but that doesn't actually work:
```
$ SPARK_TESTING=1 ./bin/pyspark python/pyspark/sql/dataframe.py
python: Error while finding spec for 'python/pyspark/sql/dataframe.py'
(ImportError: No module named 'python/pyspark/sql/dataframe')
```
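
Looking at `bin/pyspark`, under `SPARK_TESTING` it appears to exec `python -m` on its first argument, so passing the dotted module name rather than a file path might work (an untested guess), or alternatively the test runner script:

```
$ SPARK_TESTING=1 ./bin/pyspark pyspark.sql.dataframe
$ ./python/run-tests --modules=pyspark-sql
```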

In terms of the extra time to create these small DataFrames, I think we'd probably be looking at something negligible (a few seconds at most):
```
In [2]: %timeit df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
100 loops, best of 3: 10.3 ms per loop
```

That is, this should be on the order of 1 second even when creating `df` 100 times.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
Also, I guess this change will set an example that affects all the Python documentation, and I worry about disagreement from others in the future. So, I'd rather establish a clear rationale and investigate further.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
IMHO, I think the original author intended to avoid creating each dataframe in each method docstring. I am uncertain about defining one in each method docstring, although I do prefer verbose/complete documentation. If we go for Suggestion A, then I think we should

- verify whether there is any degradation from this change (e.g. a test time increase) and then decide if the change is worth it. I mean, we should be able to defend it with a concrete reason or reference, not only here and now but also in the future.

- find another way to reach this goal.







[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
OK, I just manually fixed and built the docs.

- [**Original**](https://cloud.githubusercontent.com/assets/6477701/18833109/11098e5c-842a-11e6-9e48-6b3cae7c10b0.png)

- [**Suggestion A (each dataframe in each method docstring)**](https://cloud.githubusercontent.com/assets/6477701/18833116/1a400f82-842a-11e6-9da4-e4db0197e9e7.png)

- [**Suggestion B (dataframes in each package docstring)**](https://cloud.githubusercontent.com/assets/6477701/18833098/034207d6-842a-11e6-8bfb-3365303304c3.png)

OK, I tend to agree that Suggestion A looks better, but I am worried that it would slow the PySpark tests down, since each dataframe would now be created for each test. @holdenk and @srowen WDYT?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@mortada Otherwise, please let me try this locally and post some images here myself, if you can bear with me for a few days :)





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@mortada yes, I understand your point. However, I'd like to avoid changing a lot, and I guess @holdenk feels the same way (she is insightful in Python).

I guess it won't take a lot of time to fix this. Let's do it this way first. Then, I am happy to help build the documentation and post the images here so that we can decide which way is better.

I can't currently imagine how the documentation would look, or how pretty or clean it would be, with the package-level docstring approach.

So, it'd be nicer if we had some images or similar artifacts that we can discuss further.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
In other words, currently it is not possible for the user to follow the examples in the docstring below. It's not clear what all these input variables (`df`, `df2`, etc.) are, or where you'd even find them:
```
In [5]: DataFrame.join?
Signature: DataFrame.join(self, other, on=None, how=None)
Docstring:
Joins with another :class:`DataFrame`, using the given join expression.

:param other: Right side of the join
:param on: a string for the join column name, a list of column names,
    a join expression (Column), or a list of Columns.
    If `on` is a string or a list of strings indicating the name of the join column(s),
    the column(s) must exist on both sides, and this performs an equi-join.
:param how: str, default 'inner'.
    One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.

The following performs a full outer join between ``df1`` and ``df2``.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
[Row(name='Tom', height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]

>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name='Bob', height=85)]

>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name='Bob', age=5)]

.. versionadded:: 1.3
File:  ~/code/spark/python/pyspark/sql/dataframe.py
Type:  function
```
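
For context: these inputs exist only in the test globals set up by `_test()`. Judging from the grep listing above, they are built roughly like this (reconstructed, not verbatim):

```python
>>> df = sc.parallelize([(2, 'Alice'), (5, 'Bob')]).toDF(['age', 'name'])
>>> df2 = sc.parallelize([Row(name='Tom', height=80), Row(name='Bob', height=85)]).toDF()
```

which is exactly what a reader of `DataFrame.join?` never gets to see.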





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
But that's what this PR is supposed to fix: the problem that the docstring for each individual method is not self-contained :)

I think I now see where I was confused. It seems like we are assuming the user would be looking at the package-level docstring? I don't think that's the typical workflow.

I think the user would be looking at the docstring of one method and expect the docstring to explain how the method works (hence the example with `numpy` I posted above: https://github.com/apache/spark/pull/15053#issuecomment-247906649).

For instance, in `ipython`, doing `DataFrame.join?` brings up the docstring for the method `join()`, and it just seems really odd that it'd have everything (function signature and parameters, an explanation of how it works, example usage ...) except for how to construct the very input data you need to interact with the example.

I don't think the user would know that the input DataFrame in the example is somehow defined in the package-level docstring.









[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@mortada Exactly!





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
I see, OK, so you mean leave all the docstrings for the individual methods unchanged, and instead just add
```
"""
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+
"""
```

at the top of the file?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
Oh, maybe my previous comments were not clear. I meant https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L18

> Also, is the idea that we would define df globally, and then for the docstring of each function we would not do:

For this one, I meant not redefining the dataframes that are already defined in `_test`, but just printing them in the package-level docstring. So the complete structure would be as below:

```python
# Apache license
# ...

"""
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+
"""

import doctest
import sys

# Other tests..


def _test():
    # Just leave the dataframe definitions as they are
    globs = globals()
    globs['df'] = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
    doctest.testmod(sys.modules[__name__], globs=globs)


if __name__ == "__main__":
    _test()
```





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon I may still be confused about something. First of all, what do you mean by the package-level docstring? Do you mean here: https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1559 or here: https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L18?

Also, is the idea that we would define `df` globally, and then for the docstring of each function we would *not* do:

```
>>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]
```

and instead we do:
```
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+
>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]
```

therefore not showing the user how to construct the DataFrame? 





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
Hm.. @mortada, it seems the dataframe definition was added to each docstring. Could you maybe remove the duplicates and just print the contents in the package-level docstring?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-25 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15053
  
@mortada what do you think of @HyukjinKwon's suggestions?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-19 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/15053
  
Oh yes, I agree that's a much smaller change; I was just explaining the motivation behind my initial comment, as @mortada asked me to elaborate.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@holdenk I am also cautious, but leaving everything as it is while adding `df.show()` to the package docstring and cleaning up the duplicated dataframe definitions in each docstring would be a minimal change and fine for the goal.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/15053
  
I was thinking that the user would probably read the package-level documentation before looking at the individual functions (or if they went looking for the definition of the dataframe). I'm a little hesitant to add a bunch of copy-and-paste code, so it's possible I'm being overly cautious.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
Oh, I meant: don't touch the `globs` in `_test()`, but just print the global dataframes (using `show()`, rather, to display the contents) so that users can understand the input and output. So, my scenario is: users open the doc, see the input dataframes first, and then check the documentation for the function they want.

I do like the idea of defining each dataframe within each docstring, but maybe we could try the minimal-change suggestion to reach the same goal.

I guess she meant defining the global dataframes within the package docstring, which is, for example,

```python
"""
# here
"""

import doctest
import sys
...
```

However, after investigating, it seems we can't easily do that. So, I suggested printing them in the package docstring, such as

```python
"""
>>> df.show()
+-----+
|field|
+-----+
|value|
|value|
+-----+
"""

import doctest
import sys
...
```

rather than defining them in the package docstring.







[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon I understand we can have `py.test` and `doctest`, but I don't quite see how we could define the input DataFrame globally while at the same time having a clear, self-contained docstring for each function.

@holdenk could you please elaborate on what you mean? 

If we want to repeat something like this in every docstring
```
>>> print(df.collect())
```
we might as well simply include how to actually create the DataFrame so the 
user can easily reproduce the example?

It seems to me that the user would often want to see the docstring to 
understand how a function works, and they may not be looking at some global 
documentation as a whole. And the fact that many of the input DataFrames are 
the same is really just a convenience for the doc writer and not a requirement.

For instance, this is the docstring for a numpy method (`numpy.argmax`), with the example's input clearly defined:
```
Examples

>>> a = np.arange(6).reshape(2,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> np.argmax(a)
5
>>> np.argmax(a, axis=0)
array([1, 1, 1])
>>> np.argmax(a, axis=1)
array([2, 2])
```

IMHO it seems odd to require the user to look at some global doc in order to follow the example usage for one single function.
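
A PySpark docstring in the same style would be just as easy to follow; for instance (an illustrative sketch using the same toy data as this PR):

```python
>>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
>>> df.select(df.age).collect()
[Row(age=2), Row(age=5)]
```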





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
Hi @mortada, I am sorry that I am being a bit noisy here, but I just took a look for myself.

I mirrored the PySpark structure and made a draft:

```python
"""
>>> print df
[1, 2, 3]
>>> print df2
[]
"""

import doctest
import sys


def my_function(a, b):
    """
    >>> my_function(df[0], df[1])
    2
    """
    return a * b


if __name__ == "__main__":
    globs = globals()
    globs['df'] = [1, 2, 3]
    globs['df2'] = []
    doctest.testmod(sys.modules[__name__], globs=globs)
```

If we do this, we would be able to just add 

```
"""
>>> print df
[1, 2, 3]
>>> print df2
[]
"""
```

on top as the module docstring and remove all the duplicated dataframe definitions in each docstring.

I guess this would be a minimal change and would also satisfy your demands here by making the docstrings clear.
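
If it helps to verify the draft: `doctest.testmod()` picks up `-v` from `sys.argv`, so (naming the file `draft.py` just for illustration):

```
$ python draft.py       # silent when all doctests pass
$ python draft.py -v    # prints each example as it runs
```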





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
I haven't checked whether a package-level docstring can define global variables that are accessible from other docstrings. So, I would like to defer this to @holdenk (if you are not sure either, then we can all look into this more deeply together).

I think it'd be better if we follow @holdenk's comment first, as I believe it requires the most minimal change, and see if we all agree with fixing it that way. If not, we could follow my comment as a safe choice.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread mortada
Github user mortada commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon thanks for your help! I'm happy to complete this PR and follow what you suggest for testing.

How would the package-level docstring work? The goal (which I think we all agree on) is to let the user easily see, in a self-contained way, how the input is set up for each function's docstring.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
Hi @mortada, do you mind if I ask you to address my or @holdenk's comments? If you find any problem with testing, I am willing to take this over, and I will ask committers to credit it to you.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@holdenk Personally, I like the package-level docstring more, if we can write it well and keep it pretty.
(If we are not sure about that, then I think we can maybe do this for each method as a safe choice.)





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-15 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/15053
  
What if, for the shared DF, we also put the comment in the package-level docstring, so that people reading the documentation would see it first, but we wouldn't need to have it repeated for all of the tests?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@srowen - Thanks for your feedback. In my personal opinion, I believe examples in documentation should be self-contained, at least enough to allow copying and pasting them for testing, and to let users understand the input and output easily. I prefer verbosity and consistency.

As the output is already printed in the examples, I think the input should be written in the examples too. If I were a newbie in PySpark, I guess it'd take a while to understand what each function does in http://spark.apache.org/docs/latest/api/python/pyspark.sql.html.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-15 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15053
  
@HyukjinKwon I think you'd know better than I would. I tend to agree it would make the examples more self-contained and readable. It would lead to some more duplication of code, I guess, but if you judge it's worth it, I'd follow your lead.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
@srowen Would it be sensible to remove all the predefined dataframes from the globals and move them into each doctest, if this change looks good?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
BTW, if you meant how they passed before, not how they ran: it seems some dataframes were created before the tests actually ran; see [dataframe.py#L1600-L1617](https://github.com/mortada/spark/blob/52240bcf8df42dd454e874ce7640d7040c5cdad9/python/pyspark/sql/dataframe.py#L1600-L1617).





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15053
  
Oh, it will run doctests as far as I know: http://www.sphinx-doc.org/en/stable/ext/doctest.html

Maybe I will try running it locally to check for myself.
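
For reference, a minimal Sphinx setup for this (illustrative; not necessarily how the Spark docs build is wired) would enable the extension in `conf.py` and run the doctest builder:

```python
# conf.py: enable Sphinx's doctest extension
extensions = ['sphinx.ext.doctest']
```

and then `make doctest` in the docs directory executes the examples.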





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15053
  
Looks like the expected output doesn't match now. Maybe you all can teach me something -- how did this docstring work as a test at all before? Or, if it doesn't run correctly, is the output not checked?





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15053
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15053
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65242/
Test FAILed.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15053
  
**[Test build #65242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)** for PR 15053 at commit [`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15053
  
**[Test build #65242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)** for PR 15053 at commit [`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9).





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-12 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/15053
  
Jenkins test this please





[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15053
  
Can one of the admins verify this patch?

