[GitHub] spark issue #15053: [Doc] improve python API docstrings
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

BTW, it'd be even nicer if the images in https://github.com/apache/spark/pull/15053#issuecomment-249550254 were included in the PR description, since we already have them.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@mortada I see. Then:

- It would barely increase the test time.
- It will improve the documentation.
- We have a reference for this: `numpy`.

Therefore, +1 for defining each dataframe within each method's docstring. Can we remove the dataframes in `globs` and then do the same thing across the PySpark documentation? I just took a quick look and it seems we have more instances, such as:

```
$ grep -r "globs\['df" .
./pyspark/sql/column.py:globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
./pyspark/sql/context.py:globs['df'] = rdd.toDF()
./pyspark/sql/dataframe.py:globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')])\
./pyspark/sql/dataframe.py:globs['df2'] = sc.parallelize([Row(name='Tom', height=80), Row(name='Bob', height=85)]).toDF()
./pyspark/sql/dataframe.py:globs['df3'] = sc.parallelize([Row(name='Alice', age=2),
./pyspark/sql/dataframe.py:globs['df4'] = sc.parallelize([Row(name='Alice', age=10, height=80),
./pyspark/sql/functions.py:globs['df'] = sc.parallelize([Row(name='Alice', age=2), Row(name='Bob', age=5)]).toDF()
./pyspark/sql/group.py:globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
./pyspark/sql/group.py:globs['df3'] = sc.parallelize([Row(name='Alice', age=2, height=80),
./pyspark/sql/group.py:globs['df4'] = sc.parallelize([Row(course="dotNET", year=2012, earnings=1),
./pyspark/sql/readwriter.py:globs['df'] = spark.read.parquet('python/test_support/sql/parquet_partitioned')
./pyspark/sql/session.py:globs['df'] = rdd.toDF()
./pyspark/sql/streaming.py:globs['df'] = \
```

Also, I think we'd better have a JIRA.
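To make the proposal concrete, here is a hedged before/after sketch of what Suggestion A would mean for a single docstring; the `when` example is the one quoted later in this thread, and `spark` is the SparkSession that PySpark's doctest harness provides:

```python
# Before (Suggestion B style): `df` is injected invisibly by _test(), so the
# docstring alone cannot be reproduced by a reader:
#
#     >>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
#     [Row(age=3), Row(age=4)]
#
# After (Suggestion A style): the same example, now self-contained:
#
#     >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
#     >>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
#     [Row(age=3), Row(age=4)]
```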
Github user mortada commented on the issue: https://github.com/apache/spark/pull/15053

@HyukjinKwon do you know how I could run the doctests for these files? I found this online: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals which says that I could do

```
SPARK_TESTING=1 ./bin/pyspark python/pyspark/my_file.py
```

but that doesn't actually work:

```
$ SPARK_TESTING=1 ./bin/pyspark python/pyspark/sql/dataframe.py
python: Error while finding spec for 'python/pyspark/sql/dataframe.py' (ImportError: No module named 'python/pyspark/sql/dataframe')
```

In terms of the extra time to create these small DataFrames, I think we'd be looking at something negligible (a few seconds at most):

```
In [2]: %timeit df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
100 loops, best of 3: 10.3 ms per loop
```

i.e. this should be on the order of 1 second even if `df` is created 100 times.
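For what it's worth, a hedged alternative: the repo ships its own Python test runner, which handles the doctest setup that the direct invocation above trips over. The `--modules` flag shown here is an assumption from memory; check `./python/run-tests --help` in your checkout:

```
# Run from the repo root; the runner sets SPARK_TESTING, PYTHONPATH, etc. itself.
# --modules is assumed -- verify against ./python/run-tests --help
./python/run-tests --modules=pyspark-sql
```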
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

Also, I guess this change will serve as an example and will affect all the Python documentation, and I worry about disagreement from others in the future. So, I'd rather establish a clear rationale and investigate further.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

IMHO, I think the original author intended to avoid creating each dataframe in each method docstring. I am uncertain about defining them in each method docstring, although I do prefer verbose/complete documentation. If we go for Suggestion A, then I think we should:

- verify whether there is any degradation from this change (e.g. a test time increase) and then decide if the change is worth it. I mean, we should be able to defend it with a concrete reason or reference, not only here and now but also in the future.
- find another way to reach this goal.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

OK, I just manually made the fix and built the docs.

- [**Original**](https://cloud.githubusercontent.com/assets/6477701/18833109/11098e5c-842a-11e6-9e48-6b3cae7c10b0.png)
- [**Suggestion A (each dataframe in each method docstring)**](https://cloud.githubusercontent.com/assets/6477701/18833116/1a400f82-842a-11e6-9da4-e4db0197e9e7.png)
- [**Suggestion B (dataframes in each package docstring)**](https://cloud.githubusercontent.com/assets/6477701/18833098/034207d6-842a-11e6-8bfb-3365303304c3.png)

I tend to agree that Suggestion A looks better, but I am worried it would slow the PySpark tests down, as it now creates each dataframe for each test. @holdenk and @srowen WDYT?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@mortada Otherwise, let me try this locally and post some images here myself, if you can bear with me for a few days :)
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@mortada Yes, I understand your point. However, I'd like to avoid changing a lot, and I guess @holdenk feels the same way (she is insightful about Python). I guess it won't take much time to fix this, so let's do it this way first. Then, I am happy to help build the documentation and post the images here so that we can decide which way is better. I can't currently picture how the documentation would look, or how clean it would be, with the package-level docstring approach. So it'd be nicer if we had some images or similar artifacts that we can discuss further.
Github user mortada commented on the issue: https://github.com/apache/spark/pull/15053

In other words, it is currently not possible for the user to follow the examples in the docstring below. It's not clear what all these input variables (`df`, `df2`, etc.) are, or where you'd even find them:

```
In [5]: DataFrame.join?
Signature: DataFrame.join(self, other, on=None, how=None)
Docstring:
Joins with another :class:`DataFrame`, using the given join expression.

:param other: Right side of the join
:param on: a string for the join column name, a list of column names,
    a join expression (Column), or a list of Columns.
    If `on` is a string or a list of strings indicating the name of the join column(s),
    the column(s) must exist on both sides, and this performs an equi-join.
:param how: str, default 'inner'.
    One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.

The following performs a full outer join between ``df1`` and ``df2``.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
[Row(name='Tom', height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]

>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name='Bob', height=85)]

>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name='Bob', age=5)]

.. versionadded:: 1.3
File: ~/code/spark/python/pyspark/sql/dataframe.py
Type: function
```
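For reference, the hidden setup these examples silently rely on lives in the module's `_test()` function, as surfaced by the grep earlier in the thread; a reader of the rendered docs never sees it. The sketch below is reconstructed from those grep lines, which are truncated, so the exact schemas of `df`, `df3`, and `df4` are partly assumed:

```python
# Reconstructed from the (truncated) grep output above -- schemas partly assumed.
# This is roughly what dataframe.py's _test() injects before the doctests run:
globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
    .toDF(['age', 'name'])  # column names assumed
globs['df2'] = sc.parallelize([Row(name='Tom', height=80),
                               Row(name='Bob', height=85)]).toDF()
globs['df3'] = sc.parallelize([Row(name='Alice', age=2),
                               Row(name='Bob', age=5)]).toDF()   # rows truncated in grep
globs['df4'] = sc.parallelize([Row(name='Alice', age=10, height=80),
                               Row(name='Bob', age=5, height=85)]).toDF()  # rows assumed
```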
Github user mortada commented on the issue: https://github.com/apache/spark/pull/15053

But that's what this PR is supposed to fix: the problem that the docstring for each individual method is not self-contained :) I think I now see where I was confused - it seems we are assuming the user would be looking at the package-level docstring? I don't think that's the typical workflow. I think the user would be looking at the docstring of one method and expect it to explain how that method works (hence the `numpy` example I posted above: https://github.com/apache/spark/pull/15053#issuecomment-247906649).

For instance, in `ipython`, `DataFrame.join?` brings up the docstring for the method `join()`, and it just seems really odd that it would contain everything - function signature and parameters, an explanation of how it works, example usage - except for how to construct the very input data you need to interact with the example. I don't think the user would know that the input DataFrame in the example is somehow defined in the package-level docstring.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@mortada Exactly!
Github user mortada commented on the issue: https://github.com/apache/spark/pull/15053

I see, OK - so you mean leave all the docstrings for the individual methods unchanged, but instead just add

```
"""
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+
"""
```

at the top of the file?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

Oh, maybe my previous comments were not clear. I meant https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L18

> Also, is the idea that we would define df globally, and then for the docstring of each function we would not do:

For this one, I meant not re-defining the dataframes that are already defined in `_test()`, but just printing them in the package-level docstring. So the complete structure would be as below:

```python
# Apache license
# ...

"""
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+
"""
import doctest
import sys

# Other tests..

def _test():
    # Just leave the dataframe definitions as they are
    globs = globals()
    globs['df'] = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
    doctest.testmod(sys.modules[__name__], globs=globs)

if __name__ == "__main__":
    _test()
```
Github user mortada commented on the issue: https://github.com/apache/spark/pull/15053

@HyukjinKwon I may still be confused about something - first of all, what do you mean by the package-level docstring? Do you mean here: https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1559 or here: https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L18?

Also, is the idea that we would define `df` globally, and then for the docstring of each function we would *not* do:

```
>>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]
```

and instead we do:

```
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  2|
|  Bob|  5|
+-----+---+

>>> df.select(when(df['age'] == 2, 3).otherwise(4).alias("age")).collect()
[Row(age=3), Row(age=4)]
```

therefore not showing the user how to construct the DataFrame?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

Hm, @mortada - it seems the dataframe definition is being added to each docstring. Could you maybe remove the duplicates and just print the contents in the package-level docstring?
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15053

@mortada what do you think of @HyukjinKwon's suggestions?
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15053

Oh yes, I agree that's a much smaller change - I was just explaining the motivation behind my initial comment, as @mortada asked me to elaborate.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@holdenk I am also cautious, but leaving everything as it is, adding `df.show()` to the package docstring, and cleaning up the duplicated dataframe definitions in each docstring would be a minimal change and fine for the goal.
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15053

I was thinking that the user would probably read the package-level docstring before looking at the individual functions (or if they went looking for the definition of the dataframe). I'm a little hesitant to add a bunch of copy-and-paste code, so it's possible I'm being overly cautious.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

Oh, I meant: don't touch the `globs` in `_test()`, but just print the global dataframes (with `show()`, to display their contents) so that users can understand the input and output. So, in my scenario, users open the doc, see the input dataframes first, and then check the documentation for the function they want. I do like the idea of defining each dataframe within each docstring, but maybe we could try the minimal-change suggestion to reach the same goal.

I guess she meant defining the global dataframes within the package docstring, for example:

```python
"""
# here
"""
import doctest
import sys
...
```

However, after investigating, it seems we can't easily do that. So, I suggested printing them in the package docstring, such as:

```python
"""
>>> df.show()
+-----+
|field|
+-----+
|value|
|value|
+-----+
"""
import doctest
import sys
...
```

rather than defining them in the package docstring.
Github user mortada commented on the issue: https://github.com/apache/spark/pull/15053

@HyukjinKwon I understand we can have `py.test` and `doctest`, but I don't quite see how we could define the input DataFrame globally while at the same time having a clear, self-contained docstring for each function. @holdenk could you please elaborate on what you mean? If we want to repeat something like this in every docstring

```
>>> print(df.collect())
```

we might as well simply include how to actually create the DataFrame, so the user can easily reproduce the example. It seems to me that the user would often want to read the docstring to understand how a function works, and they may not be looking at the global documentation as a whole. And the fact that many of the input DataFrames are the same is really just a convenience for the doc writer, not a requirement.

For instance, this is the docstring for a numpy method (`numpy.argmax`), and the example is shown with its input clearly defined:

```
Examples
--------
>>> a = np.arange(6).reshape(2,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> np.argmax(a)
5
>>> np.argmax(a, axis=0)
array([1, 1, 1])
>>> np.argmax(a, axis=1)
array([2, 2])
```

IMHO it seems odd to require the user to look at some global doc in order to follow the example usage of one single function.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

Hi @mortada, I am sorry for being a bit noisy here, but I just took a look myself. I mimicked the PySpark structure and made a draft:

```python
"""
>>> print(df)
[1, 2, 3]
>>> print(df2)
[]
"""
import doctest
import sys


def my_function(a, b):
    """
    >>> my_function(df[0], df[1])
    2
    """
    return a * b


if __name__ == "__main__":
    globs = globals()
    globs['df'] = [1, 2, 3]
    globs['df2'] = []
    doctest.testmod(sys.modules[__name__], globs=globs)
```

If we do this, we would be able to just add

```
"""
>>> print(df)
[1, 2, 3]
>>> print(df2)
[]
"""
```

at the top as a docstring and remove all the duplicated dataframe definitions in each docstring. I guess this would be a minimal change and would also satisfy your request here for making the docstrings clear.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

I haven't checked whether a package-level docstring can define global variables that are accessible from other docstrings, so I would like to defer this to @holdenk (if you are not sure either, then we can all look into this more deeply together). I think it'd be better to follow @holdenk's comment first, as I believe it requires the most minimal change, and see if we all agree with fixing it that way. If not, we could follow my comment as a safe choice.
Github user mortada commented on the issue: https://github.com/apache/spark/pull/15053

@HyukjinKwon thanks for your help! I'm happy to complete this PR and follow what you suggest for testing. How would the package-level docstring work? The goal (which I think we all agree on) is to let the user easily see, in a self-contained way, how the input is set up for each function in the docstring.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

Hi @mortada, do you mind if I ask you to address my or @holdenk's comments? If you run into any problem with testing, I am willing to take this over, in which case I will ask the committers to credit it to you.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@holdenk Personally, I like the package-level docstring more, if we can write it nicely. (If we are not sure about that, then I think we can define each dataframe per docstring as a safe choice.)
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/15053

What if, for the shared DataFrame, we also put the definition in the package-level docstring, so that people reading the documentation would see it first, but we wouldn't need to repeat it for all of the tests?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@srowen Thanks for your feedback. In my personal opinion, examples in documentation should be self-contained, at least to allow copying and pasting them for testing, and to let users understand the input and output easily. I prefer verbosity and consistency. As the output is already printed in the examples, I think the input should be written in the examples too. If I were a newbie to PySpark, I guess it'd take a while to understand what each function does in http://spark.apache.org/docs/latest/api/python/pyspark.sql.html.
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15053

@HyukjinKwon I think you'd know better than I would. I tend to agree it would make the examples more self-contained and readable. It would lead to some more duplication of code, I guess, but if you judge it's worth it, I'd follow your lead.
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

@srowen Would it be sensible, if this change looks good, to remove all the predefined global dataframes and define them within each doctest instead?
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

BTW, if you meant how they passed before, not how they ran: it seems some dataframes were created before the tests actually ran - see [dataframe.py#L1600-L1617](https://github.com/mortada/spark/blob/52240bcf8df42dd454e874ce7640d7040c5cdad9/python/pyspark/sql/dataframe.py#L1600-L1617).
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15053

Oh, it will run the doctests as far as I know: http://www.sphinx-doc.org/en/stable/ext/doctest.html. Maybe I will try to run it locally to check for myself.
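For context, a hedged sketch of the standard setup the linked extension describes; whether Spark's docs build actually enables it is exactly what's being checked here, so treat the config below as illustrative only:

```python
# docs conf.py -- the standard way to enable Sphinx's doctest builder (illustrative)
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.doctest']
```

With the extension enabled, `make doctest` (i.e. `sphinx-build -b doctest`) executes the `>>>` examples Sphinx collects.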
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15053

Looks like the expected output doesn't match now. Maybe you all can teach me something - how did this docstring work as a test at all before? Or, if it doesn't run correctly, is the output just not checked?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15053

Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15053

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65242/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15053

**[Test build #65242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)** for PR 15053 at commit [`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15053

**[Test build #65242 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65242/consoleFull)** for PR 15053 at commit [`52240bc`](https://github.com/apache/spark/commit/52240bcf8df42dd454e874ce7640d7040c5cdad9).
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15053

Jenkins test this please
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15053

Can one of the admins verify this patch?