[ https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445988#comment-17445988 ]
Tim Schwab commented on SPARK-37348:
------------------------------------

Fair enough. The reasoning for adding a function given in the comment you linked is exactly my reasoning: it turns a runtime check into a compile-time check. That, and it integrates cleanly with the rest of PySpark. (E.g. I can't drop F.expr() into the middle of a chain of other functions; I would have to either rewrite the whole line with F.expr() or break it into several intermediate columns. Not a big deal, obviously, but still not ideal.)

As for whether it is commonly used, I am not sure how to validate that one way or the other. What I can say is that most uses of the % operator in general are after the modulus rather than the remainder: they expect a result in [0, n), not (-n, n). Most of those uses also have a domain restricted to non-negative numbers, in which case there is no difference. But when the domain includes negative numbers, the modulus is usually what is wanted, because % is most often used to map a larger number onto a smaller range [0, n). The counterpoint is cryptographic code, which works fine with remainders, but I would expect manual implementations of cryptographic functions over RDDs or DataFrames to be rare. So, as far as I can see, when the domain includes negative numbers, the modulus is usually the desired result.

It happens that Spark includes a very commonly used function whose range includes negative numbers and whose output is often fed into the % operator: hash(). That is in fact the exact use case that brought me here; I want to map hash() outputs onto [0, n) rather than (-n, n). For that use case alone I think pmod is worth adding to PySpark. On top of that, Python's % operator is a modulus rather than a remainder, unlike the JVM's, so I would expect Python users to feel the need for pmod() more often than, say, Scala users.

> PySpark pmod function
> ---------------------
>
>                 Key: SPARK-37348
>                 URL: https://issues.apache.org/jira/browse/SPARK-37348
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Tim Schwab
>            Priority: Minor
>
> Because Spark is built on the JVM, in PySpark F.lit(-1) % F.lit(2) returns -1. However, the modulus is often desired instead of the remainder.
>
> There is a [PMOD() function in Spark SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions]. So at the moment, the two options for getting the modulus are to use F.expr("pmod(A, B)") or to write a helper function such as:
>
> {code:python}
> def pmod(dividend, divisor):
>     remainder = dividend % divisor
>     # Shift negative remainders into [0, divisor) (assumes a positive divisor).
>     return F.when(remainder < 0, remainder + divisor).otherwise(remainder)
> {code}
>
> Neither is optimal: pmod should be native to PySpark as it is in Spark SQL.
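
As a minimal sketch of the hash-bucketing use case described above (the DataFrame, column name, and bucket count below are illustrative, not taken from the ticket), the gap today looks roughly like this:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(-5, 5).toDF("key")

n = 8  # number of hash buckets (illustrative)

# JVM-style remainder: hash() can return negatives, so this lands in (-n, n).
remainder = (F.hash("key") % n).alias("remainder")

# Spark SQL's pmod() keeps the result in [0, n); from PySpark it currently
# has to go through expr() as a SQL string.
bucket = F.expr(f"pmod(hash(key), {n})").alias("bucket")

df.select("key", remainder, bucket).show()
{code}

A native F.pmod() would let the second expression be written inline as a Column alongside the other functions, instead of dropping into a SQL string.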