Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
Sorry, I didn't try that.

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Russell Jurney
Oliver, just curious: did you get a clean error message when you broke it out into separate statements?

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
Hello, Thanks for the advice. First of all, it looks like I used the wrong *max* function, but *pyspark.sql.functions.max* isn't right either, because it finds the maximum of a given column over groups of rows. To find the maximum among multiple columns, I need

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That's pretty impressive. I'm not sure it's quite right - not clear that the intent is taking a minimum of absolute values (is it? that'd be wild). But I think it might have pointed in the right direction. I'm not quite sure why that error pops out, but I think 'max' is the wrong function. That's

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
I'm trying to learn how to use ChatGPT for coding. So after a little chat I got this. The code you provided seems to calculate the distance between a gene and a variant by finding the maximum value between the difference of the variant position and the gene start position, the difference of the

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
Usually, the solution to these problems is to do less per line, break it out and perform each minute operation as a field, then combine those into a final answer. Can you do that here?

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Here is the complete error: ``` Traceback (most recent call last): File "nearest-gene.py", line 74, in main() File "nearest-gene.py", line 62, in main distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0)) File

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas, not Spark. Are you sure it's this line?

[PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Hello, I'm trying to calculate the distance between a gene (with start and end) and a variant (with position), so I joined gene and variant data by chromosome and then tried to calculate the distance like this: ``` distances = joined.withColumn("distance", max(col("start") -
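For reference, the intended per-row arithmetic can be sketched in plain Python, using the three-term expression quoted in the traceback elsewhere in this thread. The function name `interval_distance` is hypothetical, and the sketch assumes integer coordinates with `start <= end`.

```python
def interval_distance(start: int, end: int, position: int) -> int:
    """Distance from a position to the interval [start, end]; 0 if inside it."""
    # start - position is positive only when the position lies before the gene.
    # position - end is positive only when the position lies after the gene.
    # The 0 term covers a position inside the gene, where both gaps are <= 0.
    return max(start - position, position - end, 0)


print(interval_distance(100, 200, 50))   # 50 (variant upstream of the gene)
print(interval_distance(100, 200, 150))  # 0  (variant inside the gene)
print(interval_distance(100, 200, 260))  # 60 (variant downstream of the gene)
```

The built-in `max` works here because the arguments are ordinary integers; it does not carry over to Spark `Column` expressions, which is the problem the rest of the thread discusses.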