Oliver, just curious: did you get a clean error message when you broke it out into separate statements?
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
<https://calendly.com/rjurney_personal/30min>

On Fri, Feb 24, 2023 at 9:53 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

>
> Hello,
>
> Thanks for the advice. First of all, it looks like I used the wrong
> *max* function, but *pyspark.sql.functions.max* isn't right either,
> because it finds the maximum of a given column over groups of rows. To find
> the maximum among multiple columns, I need
> *pyspark.sql.functions.greatest*. Also, instead of 0, I need *lit(0)* to
> make it a column.
>
> In short, the correct line is:
>
> *distances = joined.withColumn("distance", greatest(col("start") - col("position"), col("position") - col("end"), lit(0)))*
>
> Again, thanks to all who responded!
>
> Best, Oliver
>
> On Thu, Feb 23, 2023 at 4:54 PM Sean Owen <sro...@gmail.com> wrote:
>
>> That's pretty impressive. I'm not sure it's quite right - not clear that
>> the intent is taking a minimum of absolute values (is it? that'd be wild).
>> But I think it might have pointed in the right direction. I'm not quite
>> sure why that error pops out, but I think 'max' is the wrong function.
>> That's an aggregate function. "greatest" is the function that returns the
>> max of several cols. Try that?
>>
>> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> I'm trying to learn how to use chatgpt for coding.
>>>
>>> So after a lite chat I got this.
>>>
>>>
>>> The code you provided seems to calculate the distance between a gene and
>>> a variant by finding the maximum value between the difference of the
>>> variant position and the gene start position, the difference of the gene
>>> end position and the variant position, and 0. However, the result of this
>>> calculation may not always be correct.
>>>
>>> Here's an alternative approach to calculating the distance between a
>>> gene and a variant:
>>>
>>>
>>> from pyspark.sql.functions import abs, col, least
>>>
>>> # Calculate the absolute difference between the variant position and the gene start and end positions
>>> distances = joined.withColumn("distance_start", abs(col("position") - col("start"))) \
>>>                   .withColumn("distance_end", abs(col("position") - col("end")))
>>>
>>> # Take the minimum distance between the variant and the gene
>>> distances = distances.withColumn("distance", least(col("distance_start"), col("distance_end")))
>>>
>>>
>>> In this approach, we first calculate the absolute difference between the
>>> variant position and the gene start and end positions using the abs
>>> function. Then, we use the least function to take the minimum distance
>>> between the variant and the gene. This should give you a more accurate
>>> distance calculation.
>>>
>>>
>>>
>>>
>>> Yes, that's correct. The original code you provided calculates the
>>> distance between a gene and a variant as the maximum of three terms: the
>>> difference between the variant position and the gene start position, the
>>> difference between the gene end position and the variant position, and 0.
>>>
>>> While this approach might work in some cases, it has a few issues.
>>> First, it doesn't take into account the possibility that the variant might
>>> be inside the gene (i.e., between the start and end positions). Second, it
>>> assumes that the gene is always located upstream or downstream of the
>>> variant, which may not be the case.
>>>
>>> The alternative approach I provided earlier calculates the distance as
>>> the minimum of two absolute differences: the absolute difference between
>>> the variant position and the gene start position, and the absolute
>>> difference between the variant position and the gene end position.
>>> This approach should work in all cases, including when the variant is
>>> inside the gene or when the gene is located upstream or downstream of
>>> the variant.
>>>
>>>
>>>
>>> On Thu, Feb 23, 2023 at 8:48 PM Russell Jurney
>>> <russell.jur...@gmail.com> wrote:
>>>
>>>> Usually, the solution to these problems is to do less per line, break
>>>> it out and perform each minute operation as a field, then combine those
>>>> into a final answer. Can you do that here?
>>>>
>>>> Thanks,
>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>>> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
>>>> <https://calendly.com/rjurney_personal/30min>
>>>>
>>>>
>>>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker
>>>> <oliv...@broadinstitute.org> wrote:
>>>>
>>>>> Here is the complete error:
>>>>>
>>>>> ```
>>>>> Traceback (most recent call last):
>>>>>   File "nearest-gene.py", line 74, in <module>
>>>>>     main()
>>>>>   File "nearest-gene.py", line 62, in main
>>>>>     distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0))
>>>>>   File "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py", line 907, in __nonzero__
>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>> ```
>>>>>
>>>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> That error sounds like it's from pandas not spark. Are you sure it's
>>>>>> this line?
>>>>>>
>>>>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker
>>>>>> <oliv...@broadinstitute.org> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm trying to calculate the distance between a gene (with start
>>>>>>> and end) and a variant (with position), so I joined gene and variant
>>>>>>> data by chromosome and then tried to calculate the distance like this:
>>>>>>>
>>>>>>> ```
>>>>>>> distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0))
>>>>>>> ```
>>>>>>>
>>>>>>> Basically, the distance is the maximum of three terms.
>>>>>>>
>>>>>>> This line causes an obscure error:
>>>>>>>
>>>>>>> ```
>>>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>>>> ```
>>>>>>>
>>>>>>> How can I do this? Thanks!
>>>>>>>
>>>>>>> Best, Oliver
>>>>>>>
>>>>>>> --
>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>
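
For anyone reading this thread later, here is a minimal plain-Python sketch (no Spark required) of the two distance formulas discussed above, applied to one row at a time. The function names and the sample coordinates are made up for illustration; only the arithmetic comes from the thread. `interval_distance` mirrors the `greatest(start - position, position - end, lit(0))` expression, and `nearest_endpoint_distance` mirrors the ChatGPT suggestion built from `abs` and `least`.

```python
def interval_distance(start, end, position):
    """Distance from position to the interval [start, end]; 0 if position is inside.

    Row-wise equivalent of greatest(start - position, position - end, lit(0)).
    """
    return max(start - position, position - end, 0)


def nearest_endpoint_distance(start, end, position):
    """Distance from position to the nearer of the two endpoints.

    Row-wise equivalent of least(abs(position - start), abs(position - end)).
    """
    return min(abs(position - start), abs(position - end))


if __name__ == "__main__":
    # Hypothetical gene spanning positions 100..200 (illustrative values only).
    gene_start, gene_end = 100, 200
    for position in (50, 150, 260):
        print(
            position,
            interval_distance(gene_start, gene_end, position),
            nearest_endpoint_distance(gene_start, gene_end, position),
        )
```

The two formulas agree when the variant lies outside the gene. For a variant inside the gene, the first returns 0 while the second returns the distance to the nearer end, which is probably not what was intended in the original question.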