The documentation for DataFrame.join() <https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join> lists all the join types we support:
- inner
- cross
- outer
- full
- full_outer
- left
- left_outer
- right
- right_outer
- left_semi
- left_anti

Some of these join types are also listed on the SQL Programming Guide <http://spark.apache.org/docs/2.3.0/sql-programming-guide.html#supported-hive-features>.

Is it obvious to everyone what all these different join types are? For example, I had never heard of a LEFT ANTI join until stumbling on it in the PySpark docs. It’s quite handy! But I had to experiment with it a bit just to understand what it does.

I think it would be a good service to our users if we either documented these join types ourselves clearly, or provided a link to an external resource that documented them sufficiently. I’m happy to file a JIRA about this and do the work itself. It would be great if the documentation could be expressed as a series of simple doc tests, but brief prose describing how each join works would still be valuable.

Does this seem worthwhile to folks here? And does anyone want to offer guidance on how best to provide this kind of documentation so that it’s easy to find by users, regardless of the language they’re using?

Nick
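
To illustrate the two least familiar join types mentioned above, here is a minimal plain-Python sketch of what LEFT SEMI and LEFT ANTI joins return. It is not Spark code: the helper functions (left_semi, left_anti) and the sample rows are made up for illustration, and NULL-key handling is ignored.

```python
# Plain-Python sketch of LEFT SEMI and LEFT ANTI semantics (not Spark code).
# The helper names and sample rows are hypothetical, for illustration only.

def left_semi(left, right, key):
    """Keep each left row whose key has at least one match in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def left_anti(left, right, key):
    """Keep each left row whose key has no match in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


people = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
orders = [
    {"id": 1, "item": "book"},
    {"id": 1, "item": "pen"},   # second match for id 1
    {"id": 3, "item": "mug"},
]

# Only columns from the left side survive, and a left row appears at most
# once even when it matches several right rows.
semi = left_semi(people, orders, "id")
anti = left_anti(people, orders, "id")

print(semi)
print(anti)
```

In PySpark the corresponding call would be along the lines of people.join(orders, on="id", how="left_anti"), assuming DataFrames named people and orders that share an id column.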