Hello, Thanks for the advice. First of all, it looks like I used the wrong *max* function, but *pyspark.sql.functions.max* isn't right either, because it finds the maximum of a given column over groups of rows. To find the maximum among multiple columns, I need *pyspark.sql.functions.greatest*. Also, instead of 0, I need *lit(0)* to make it a column.
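For anyone wondering where the obscure error came from, it can be reproduced without Spark. This is only a sketch: `FakeColumn` is a made-up stand-in, not the real PySpark class, and it mimics just two behaviours that I believe a Spark `Column` has, namely that a comparison builds another expression instead of returning True or False, and that converting to bool is refused. Python's built-in `max` needs the result of `b > a` as a bool, and that is where the ValueError is raised.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark Column; not the real PySpark class."""

    def __init__(self, expr):
        self.expr = expr

    def __gt__(self, other):
        # Like a Column, a comparison yields another expression, not a bool.
        return FakeColumn(f"({self.expr} > {getattr(other, 'expr', other)})")

    def __bool__(self):
        # Like a Column, truth-testing is refused.
        raise ValueError("Cannot convert column into bool")


def builtin_max_error(*args):
    """Call Python's built-in max and return the error message, if any."""
    try:
        max(*args)
    except ValueError as err:
        return str(err)
    return None


a = FakeColumn("start - position")
b = FakeColumn("position - end")
print(builtin_max_error(a, b, 0))
```

With plain numbers the same call succeeds, so the problem is the built-in `max` meeting column expressions, not the formula itself.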
In short, the correct line is:

*distances = joined.withColumn("distance", greatest(col("start") - col("position"), col("position") - col("end"), lit(0)))*

Again, thanks to all who responded!

Best, Oliver

On Thu, Feb 23, 2023 at 4:54 PM Sean Owen <sro...@gmail.com> wrote:

> That's pretty impressive. I'm not sure it's quite right - not clear that
> the intent is taking a minimum of absolute values (is it? that'd be wild).
> But I think it might have pointed in the right direction. I'm not quite
> sure why that error pops out, but I think 'max' is the wrong function.
> That's an aggregate function. "greatest" is the function that returns the
> max of several cols. Try that?
>
> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> I'm trying to learn how to use chatgpt for coding.
>>
>> So after a lite chat I got this.
>>
>>
>> The code you provided seems to calculate the distance between a gene and
>> a variant by finding the maximum value between the difference of the
>> variant position and the gene start position, the difference of the gene
>> end position and the variant position, and 0. However, the result of this
>> calculation may not always be correct.
>>
>> Here's an alternative approach to calculating the distance between a gene
>> and a variant:
>>
>>
>> from pyspark.sql.functions import abs
>>
>> # Calculate the absolute difference between the variant position and the gene start and end positions
>> distances = joined.withColumn("distance_start", abs(col("position") - col("start"))) \
>>     .withColumn("distance_end", abs(col("position") - col("end")))
>>
>> # Take the minimum distance between the variant and the gene
>> distances = distances.withColumn("distance", least(col("distance_start"), col("distance_end")))
>>
>>
>> In this approach, we first calculate the absolute difference between the
>> variant position and the gene start and end positions using the abs
>> function.
>> Then, we use the least function to take the minimum distance
>> between the variant and the gene. This should give you a more accurate
>> distance calculation.
>>
>>
>> Yes, that's correct. The original code you provided calculates the
>> distance between a gene and a variant as the maximum of three terms: the
>> difference between the variant position and the gene start position, the
>> difference between the gene end position and the variant position, and 0.
>>
>> While this approach might work in some cases, it has a few issues. First,
>> it doesn't take into account the possibility that the variant might be
>> inside the gene (i.e., between the start and end positions). Second, it
>> assumes that the gene is always located upstream or downstream of the
>> variant, which may not be the case.
>>
>> The alternative approach I provided earlier calculates the distance as
>> the minimum of two absolute differences: the absolute difference between
>> the variant position and the gene start position, and the absolute
>> difference between the variant position and the gene end position. This
>> approach should work in all cases, including when the variant is inside the
>> gene or when the gene is located upstream or downstream of the variant.
>>
>>
>> On Thu, Feb 23, 2023 at 20:48, Russell Jurney <
>> russell.jur...@gmail.com> wrote:
>>
>>> Usually, the solution to these problems is to do less per line, break it
>>> out and perform each minute operation as a field, then combine those into a
>>> final answer. Can you do that here?
>>>
>>> Thanks,
>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
>>> <https://calendly.com/rjurney_personal/30min>
>>>
>>>
>>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>
>>>> Here is the complete error:
>>>>
>>>> ```
>>>> Traceback (most recent call last):
>>>>   File "nearest-gene.py", line 74, in <module>
>>>>     main()
>>>>   File "nearest-gene.py", line 62, in main
>>>>     distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0))
>>>>   File "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py", line 907, in __nonzero__
>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>> ```
>>>>
>>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> That error sounds like it's from pandas not spark. Are you sure it's
>>>>> this line?
>>>>>
>>>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>>>> oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm trying to calculate the distance between a gene (with start and
>>>>>> end) and a variant (with position), so I joined gene and variant data by
>>>>>> chromosome and then tried to calculate the distance like this:
>>>>>>
>>>>>> ```
>>>>>> distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0))
>>>>>> ```
>>>>>>
>>>>>> Basically, the distance is the maximum of three terms.
>>>>>>
>>>>>> This line causes an obscure error:
>>>>>>
>>>>>> ```
>>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>>> ```
>>>>>>
>>>>>> How can I do this? Thanks!
>>>>>>
>>>>>> Best, Oliver
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>
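P.S. Since the thread raised the question of whether the two formulas agree, the per-row arithmetic can be checked in plain Python without Spark. This is only a sketch: the function names and the gene coordinates are made up for illustration, and Python's built-in `max` and `min` on plain integers stand in for the row-wise `greatest` and `least`.

```python
def distance_greatest(start, end, position):
    """Row-wise analogue of greatest(start - position, position - end, lit(0))."""
    return max(start - position, position - end, 0)


def distance_least_abs(start, end, position):
    """Row-wise analogue of least(abs(position - start), abs(position - end))."""
    return min(abs(position - start), abs(position - end))


# Made-up gene spanning 100..200 and three made-up variant positions:
# before the gene, inside it, and after it.
for position in (50, 150, 250):
    print(
        position,
        distance_greatest(100, 200, position),
        distance_least_abs(100, 200, position),
    )
```

On these sample values the two versions agree when the variant lies outside the gene. Inside the gene the three-term maximum gives 0, while the minimum of absolute differences gives the distance to the nearer end, so which one is right depends on how the distance should be defined for variants inside a gene.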