[ https://issues.apache.org/jira/browse/SPARK-44564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruifeng Zheng updated SPARK-44564:
----------------------------------
    Description:
Let's first focus on the documentation of the *PySpark DataFrame APIs*.

*1*, Choose a subset of DF APIs
Since review bandwidth is limited, we recommend that each PR contain at least 5 APIs;

*2*, For each API, copy-paste the function (including the function signature and docstring) into an LLM, and ask it to refine the documentation with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more examples
* please provide more example for function 'unionByName'
* ...

It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former generates better results.

*3*, Note that the LLM is not 100% reliable; the generated docstring may contain mistakes, e.g.
* The example results are incorrect
* The example code doesn't reflect the example title
* The description uses the wrong version, or adds a 'Raises' section for a non-existent exception
* ...

We need to fix them before sending a PR.

We can generate the docs with different prompts, choose the good parts, and combine them into the new docstring.

  was:
Let's first focus on the documentation of the *PySpark DataFrame APIs*.

*1*, Choose a subset of DF APIs
Since review bandwidth is limited, we recommend that each PR contain at least 5 APIs;

*2*, For each API, copy-paste the function (including the function signature and docstring) into an LLM, and ask it to refine the documentation with prompts like:
* please improve the docstring of the 'unionByName' function
* please refine the comments of the 'unionByName' function
* please refine the documents of the 'unionByName' function, and add more examples
* please provide more example for function 'unionByName'
* ...

It is highly recommended to use *GPT-4* instead of GPT-3.5, since the former generates better results.

*3*, The generated docstring may contain some bugs, e.g.
* The example results are incorrect
* The example code doesn't reflect the example title
* The description uses the wrong version, or adds a 'Raises' section for a non-existent exception
* ...

We need to fix them before sending a PR.

We can generate the docs with different prompts, choose the good parts, and combine them into the new docstring.

> Refine the documents with LLM
> -----------------------------
>
>                 Key: SPARK-44564
>                 URL: https://issues.apache.org/jira/browse/SPARK-44564
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Documentation
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>
> Let's first focus on the documentation of the *PySpark DataFrame APIs*.
> *1*, Choose a subset of DF APIs
> Since review bandwidth is limited, we recommend that each PR contain at least
> 5 APIs;
> *2*, For each API, copy-paste the function (including the function signature
> and docstring) into an LLM, and ask it to refine the documentation with
> prompts like:
> * please improve the docstring of the 'unionByName' function
> * please refine the comments of the 'unionByName' function
> * please refine the documents of the 'unionByName' function, and add more
> examples
> * please provide more example for function 'unionByName'
> * ...
> It is highly recommended to use *GPT-4* instead of GPT-3.5, since the
> former generates better results.
> *3*, Note that the LLM is not 100% reliable; the generated docstring may
> contain mistakes, e.g.
> * The example results are incorrect
> * The example code doesn't reflect the example title
> * The description uses the wrong version, or adds a 'Raises' section for a
> non-existent exception
> * ...
> We need to fix them before sending a PR.
> We can generate the docs with different prompts, choose the good parts, and
> combine them into the new docstring.
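
For illustration only, steps *2* and *3* above could be sketched in Python roughly as follows. This is a minimal sketch and not part of the issue: the helper names `build_prompts` and `count_failing_examples` are hypothetical, and `union_by_name` is a toy stand-in that merges plain dicts, not the PySpark `unionByName` API. The sketch assumes that the signature and existing docstring are enough context for the LLM, and it uses the standard-library `doctest` module to catch the first kind of mistake listed in step *3* (incorrect example results) before a PR is sent.

```python
import doctest
import inspect

# Prompt templates taken from the issue text; {name} is filled in per function.
PROMPT_TEMPLATES = [
    "please improve the docstring of the '{name}' function",
    "please refine the comments of the '{name}' function",
    "please refine the documents of the '{name}' function, and add more examples",
    "please provide more example for function '{name}'",
]


def build_prompts(func):
    """Pair each prompt with the function's signature and current docstring."""
    name = func.__name__
    signature = f"def {name}{inspect.signature(func)}:"
    docstring = inspect.getdoc(func) or ""
    context = f'{signature}\n    """{docstring}"""'
    return [t.format(name=name) + "\n\n" + context for t in PROMPT_TEMPLATES]


def count_failing_examples(docstring, globs):
    """Run the >>> examples in a candidate docstring.

    Returns a (failed, attempted) tuple. `globs` maps the names used by the
    examples to real objects, e.g. the function being documented.
    """
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, dict(globs), "generated_doc", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the textual failure report; only the counts are needed here.
    results = runner.run(test, out=lambda text: None)
    return results.failed, results.attempted


def union_by_name(left, right):
    """Toy stand-in (hypothetical): merge two dicts, `right` wins on conflicts."""
    merged = dict(left)
    merged.update(right)
    return merged


# A candidate docstring whose example output is correct ...
GOOD_DOC = """
Merge two dicts.

>>> union_by_name({"a": 1}, {"b": 2})
{'a': 1, 'b': 2}
"""

# ... and one with an incorrect example result, as an LLM might produce.
BAD_DOC = """
Merge two dicts.

>>> union_by_name({"a": 1}, {"b": 2})
{'a': 1, 'b': 3}
"""
```

With these definitions, `count_failing_examples(GOOD_DOC, {"union_by_name": union_by_name})` reports no failures, while the same call on `BAD_DOC` reports one failing example. A check like this only covers wrong example results; a mismatch between an example and its title, a wrong version, or a 'Raises' section for a non-existent exception still needs manual review.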
--
This message was sent by Atlassian Jira (v8.20.10#820010)