Re: [PySpark SQL] New column with the maximum of multiple terms?

Russell Jurney Fri, 24 Feb 2023 13:14:20 -0800

Oliver, just curious: did you get a clean error message when you broke it
out into separate statements?


Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
<https://calendly.com/rjurney_personal/30min>


On Fri, Feb 24, 2023 at 9:53 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>      Hello,
>
>   Thanks for the advice. First of all, it looks like I used the wrong
> *max* function, but *pyspark.sql.functions.max* isn't right either,
> because it finds the maximum of a given column over groups of rows. To find
> the maximum among multiple columns, I need
> *pyspark.sql.functions.greatest*. Also, instead of 0, I need *lit(0)* to
> make it a column.
>
>   In short, the correct line is:
>
>
> distances = joined.withColumn("distance", greatest(col("start") -
> col("position"), col("position") - col("end"), lit(0)))
>
>   Again, thanks to all who responded!
>
>      Best, Oliver
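
A minimal sketch of the corrected line with the imports it needs, added here for illustration and not taken from the thread. The names `reference_distance` and `with_distance` are hypothetical, and `joined` is assumed to be a DataFrame with numeric `start`, `end` and `position` columns. The plain-Python function states the same formula for a single row, so the logic can be checked without a Spark session.

```python
def reference_distance(start, end, position):
    """Plain-Python version of the formula for one row.

    Returns 0 when the variant lies inside the gene, otherwise the gap
    between the variant and the nearer end of the gene.
    """
    return max(start - position, position - end, 0)


def with_distance(joined):
    """Add a "distance" column using the row-wise maximum of three terms.

    Assumes `joined` has numeric columns "start", "end" and "position".
    """
    # Imported here so reference_distance can be used without Spark installed.
    from pyspark.sql.functions import col, greatest, lit

    return joined.withColumn(
        "distance",
        greatest(
            col("start") - col("position"),  # positive when the variant is before the gene
            col("position") - col("end"),    # positive when the variant is after the gene
            lit(0),                          # both terms are <= 0 inside the gene
        ),
    )
```

With a gene spanning 100 to 200, `reference_distance(100, 200, 40)` gives 60, `reference_distance(100, 200, 150)` gives 0, and `reference_distance(100, 200, 250)` gives 50.
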
>
> On Thu, Feb 23, 2023 at 4:54 PM Sean Owen <sro...@gmail.com> wrote:
>
>> That's pretty impressive. I'm not sure it's quite right - not clear that
>> the intent is taking a minimum of absolute values (is it? that'd be wild).
>> But I think it might have pointed in the right direction. I'm not quite
>> sure why that error pops out, but I think 'max' is the wrong function.
>> That's an aggregate function. "greatest" is the function that returns the
>> max of several cols. Try that?
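
One possible explanation for the error, assuming the `max` in the failing line was Python's built-in: the built-in compares its arguments with `>` and then asks whether the result is true. A column comparison builds another column expression instead of a boolean, and converting that to `bool` raises. The sketch below reproduces the mechanism with `FakeColumn`, a hypothetical stand-in class that is not part of PySpark.

```python
class FakeColumn:
    """Hypothetical stand-in for a lazy column expression (not a PySpark class)."""

    def __init__(self, expr):
        self.expr = expr

    def __sub__(self, other):
        return FakeColumn(f"({self.expr} - {_name(other)})")

    def __gt__(self, other):
        # A comparison builds a new expression; it does not return True/False.
        return FakeColumn(f"({self.expr} > {_name(other)})")

    def __bool__(self):
        # Mirrors the behaviour reported in the traceback quoted in this thread.
        raise ValueError("Cannot convert column into bool")


def _name(value):
    """Render an operand as text for the expression string."""
    return value.expr if isinstance(value, FakeColumn) else repr(value)


def builtin_max_error(*terms):
    """Call the built-in max and return the ValueError message, or None."""
    try:
        max(*terms)
    except ValueError as exc:
        return str(exc)
    return None


a = FakeColumn("start") - FakeColumn("position")
b = FakeColumn("position") - FakeColumn("end")
message = builtin_max_error(a, b, 0)
```

Here `message` holds the "Cannot convert column into bool" text, while `builtin_max_error(3, 7, 0)` returns `None` because plain numbers compare to real booleans.
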
>>
>> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> I'm trying to learn how to use ChatGPT for coding.
>>>
>>> So after a little chat I got this.
>>>
>>>
>>> The code you provided seems to calculate the distance between a gene and
>>> a variant by finding the maximum value between the difference of the
>>> variant position and the gene start position, the difference of the gene
>>> end position and the variant position, and 0. However, the result of this
>>> calculation may not always be correct.
>>>
>>> Here's an alternative approach to calculating the distance between a
>>> gene and a variant:
>>>
>>>
>>> from pyspark.sql.functions import abs, col, least
>>>
>>> # Calculate the absolute difference between the variant position and the
>>> # gene start and end positions
>>> distances = (
>>>     joined
>>>     .withColumn("distance_start", abs(col("position") - col("start")))
>>>     .withColumn("distance_end", abs(col("position") - col("end")))
>>> )
>>>
>>> # Take the minimum distance between the variant and the gene
>>> distances = distances.withColumn(
>>>     "distance", least(col("distance_start"), col("distance_end"))
>>> )
>>>
>>>
>>> In this approach, we first calculate the absolute difference between the
>>> variant position and the gene start and end positions using the abs
>>> function. Then, we use the least function to take the minimum distance
>>> between the variant and the gene. This should give you a more accurate
>>> distance calculation.
>>>
>>>
>>>
>>>
>>> Yes, that's correct. The original code you provided calculates the
>>> distance between a gene and a variant as the maximum of three terms: the
>>> difference between the variant position and the gene start position, the
>>> difference between the gene end position and the variant position, and 0.
>>>
>>> While this approach might work in some cases, it has a few issues.
>>> First, it doesn't take into account the possibility that the variant might
>>> be inside the gene (i.e., between the start and end positions). Second, it
>>> assumes that the gene is always located upstream or downstream of the
>>> variant, which may not be the case.
>>>
>>> The alternative approach I provided earlier calculates the distance as
>>> the minimum of two absolute differences: the absolute difference between
>>> the variant position and the gene start position, and the absolute
>>> difference between the variant position and the gene end position. This
>>> approach should work in all cases, including when the variant is inside the
>>> gene or when the gene is located upstream or downstream of the variant.
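
The two formulas discussed above do not always agree. The following plain-Python sketch, which is illustrative and not from the thread, evaluates both for a single hypothetical gene spanning 100 to 200. The function names are made up for this example. Which result is wanted depends on how distance is defined; the later message in this thread keeps the three-term maximum.

```python
def three_term_distance(start, end, position):
    """Maximum of (start - position), (position - end) and 0."""
    return max(start - position, position - end, 0)


def nearest_edge_distance(start, end, position):
    """Minimum of the absolute distances to the gene start and gene end."""
    return min(abs(position - start), abs(position - end))


gene_start, gene_end = 100, 200  # hypothetical gene coordinates

# One variant before the gene, one inside it, one after it.
comparison = {
    position: (
        three_term_distance(gene_start, gene_end, position),
        nearest_edge_distance(gene_start, gene_end, position),
    )
    for position in (40, 150, 250)
}
```

For variants outside the gene (40 and 250) both formulas give the same value, 60 and 50. For the variant at 150, inside the gene, the three-term maximum gives 0 while the nearest-edge minimum gives 50.
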
>>>
>>>
>>>
>>> On Thu, Feb 23, 2023 at 8:48 PM Russell Jurney <
>>> russell.jur...@gmail.com> wrote:
>>>
>>>> Usually, the solution to these problems is to do less per line, break
>>>> it out and perform each minute operation as a field, then combine those
>>>> into a final answer. Can you do that here?
>>>>
>>>>
>>>>
>>>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>>>> oliv...@broadinstitute.org> wrote:
>>>>
>>>>> Here is the complete error:
>>>>>
>>>>> ```
>>>>> Traceback (most recent call last):
>>>>>   File "nearest-gene.py", line 74, in <module>
>>>>>     main()
>>>>>   File "nearest-gene.py", line 62, in main
>>>>>     distances = joined.withColumn("distance", max(col("start") -
>>>>> col("position"), col("position") - col("end"), 0))
>>>>>   File
>>>>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py",
>>>>> line 907, in __nonzero__
>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>> ```
>>>>>
>>>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> That error sounds like it's from pandas, not Spark. Are you sure it's
>>>>>> this line?
>>>>>>
>>>>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>>>>> oliv...@broadinstitute.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>      Hello,
>>>>>>>
>>>>>>>   I'm trying to calculate the distance between a gene (with start
>>>>>>> and end) and a variant (with position), so I joined gene and variant
>>>>>>> data by chromosome and then tried to calculate the distance like this:
>>>>>>>
>>>>>>> ```
>>>>>>> distances = joined.withColumn("distance", max(col("start") -
>>>>>>> col("position"), col("position") - col("end"), 0))
>>>>>>> ```
>>>>>>>
>>>>>>>   Basically, the distance is the maximum of three terms.
>>>>>>>
>>>>>>>   This line causes an obscure error:
>>>>>>>
>>>>>>> ```
>>>>>>> ValueError: Cannot convert column into bool: please use '&' for
>>>>>>> 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean
>>>>>>> expressions.
>>>>>>> ```
>>>>>>>
>>>>>>>   How can I do this? Thanks!
>>>>>>>
>>>>>>>      Best, Oliver
>>>>>>>
>>>>>>> --
>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>>
>>
>
>
