Re: [PySpark SQL] New column with the maximum of multiple terms?

Oliver Ruebenacker Fri, 24 Feb 2023 09:53:22 -0800

     Hello,

  Thanks for the advice. First of all, it looks like I used the wrong *max*
function, but *pyspark.sql.functions.max* isn't right either, because it
finds the maximum of a given column over groups of rows. To find the
maximum among multiple columns, I need *pyspark.sql.functions.greatest*.
Also, instead of 0, I need *lit(0)* to make it a column.
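
  A minimal sketch of why the original call failed, assuming the *max* in
scope was Python's built-in one (the traceback ends in the column's truth
test, which suggests so). FakeColumn below is a hypothetical stand-in, not
the real pyspark Column class: like a Column, it returns another expression
from ">" and refuses to be converted to bool, which is exactly what the
built-in max needs in order to compare its arguments.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column; not the real class."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # Like a Spark column, a comparison yields another lazy expression,
        # not a Python bool.
        return FakeColumn(f"({self.name} > {getattr(other, 'name', other)})")

    def __bool__(self):
        # Mirrors the error message from the traceback in this thread.
        raise ValueError("Cannot convert column into bool")


def builtin_max_fails(*terms):
    """Return True if Python's built-in max raises ValueError on these terms."""
    try:
        max(*terms)
    except ValueError:
        return True
    return False


# The built-in max evaluates `b > a` and then asks whether the result is true.
print(builtin_max_fails(FakeColumn("a"), FakeColumn("b"), 0))
```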


  In short, the correct line is:


    distances = joined.withColumn(
        "distance",
        greatest(col("start") - col("position"), col("position") - col("end"), lit(0)),
    )
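
  For reference, here is a plain-Python sketch of what that expression
computes for a single row. The helper name and the sample coordinates are
made up for illustration, and it assumes start <= end:

```python
def gene_variant_distance(start, end, position):
    """Row-wise equivalent of greatest(start - position, position - end, 0).

    Hypothetical helper for illustration; assumes start <= end.
    """
    return max(start - position, position - end, 0)


# Variant before the gene start, inside the gene, and after the gene end.
print(gene_variant_distance(100, 200, 70))   # 30
print(gene_variant_distance(100, 200, 150))  # 0
print(gene_variant_distance(100, 200, 260))  # 60
```

  At most one of the two differences can be positive, so the result is 0
inside the gene and the gap to the nearer end outside it.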

  Again, thanks to all who responded!

     Best, Oliver

On Thu, Feb 23, 2023 at 4:54 PM Sean Owen <sro...@gmail.com> wrote:

> That's pretty impressive. I'm not sure it's quite right - not clear that
> the intent is taking a minimum of absolute values (is it? that'd be wild).
> But I think it might have pointed in the right direction. I'm not quite
> sure why that error pops out, but I think 'max' is the wrong function.
> That's an aggregate function. "greatest" is the function that returns the
> max of several cols. Try that?
>
> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> I'm trying to learn how to use ChatGPT for coding.
>>
>> So after a little chat I got this.
>>
>>
>> The code you provided seems to calculate the distance between a gene and
>> a variant by finding the maximum value between the difference of the
>> variant position and the gene start position, the difference of the gene
>> end position and the variant position, and 0. However, the result of this
>> calculation may not always be correct.
>>
>> Here's an alternative approach to calculating the distance between a gene
>> and a variant:
>>
>>
>> from pyspark.sql.functions import abs, col, least
>>
>> # Calculate the absolute difference between the variant position and the
>> # gene start and end positions
>> distances = joined.withColumn("distance_start", abs(col("position") - col("start"))) \
>>                  .withColumn("distance_end", abs(col("position") - col("end")))
>>
>> # Take the minimum distance between the variant and the gene
>> distances = distances.withColumn("distance", least(col("distance_start"), col("distance_end")))
>>
>>
>> In this approach, we first calculate the absolute difference between the
>> variant position and the gene start and end positions using the abs
>> function. Then, we use the least function to take the minimum distance
>> between the variant and the gene. This should give you a more accurate
>> distance calculation.
>>
>>
>>
>>
>> Yes, that's correct. The original code you provided calculates the
>> distance between a gene and a variant as the maximum of three terms: the
>> difference between the variant position and the gene start position, the
>> difference between the gene end position and the variant position, and 0.
>>
>> While this approach might work in some cases, it has a few issues. First,
>> it doesn't take into account the possibility that the variant might be
>> inside the gene (i.e., between the start and end positions). Second, it
>> assumes that the gene is always located upstream or downstream of the
>> variant, which may not be the case.
>>
>> The alternative approach I provided earlier calculates the distance as
>> the minimum of two absolute differences: the absolute difference between
>> the variant position and the gene start position, and the absolute
>> difference between the variant position and the gene end position. This
>> approach should work in all cases, including when the variant is inside the
>> gene or when the gene is located upstream or downstream of the variant.
>>
>>
>>
>> tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
>> russell.jur...@gmail.com>:
>>
>>> Usually, the solution to these problems is to do less per line, break it
>>> out and perform each minute operation as a field, then combine those into a
>>> final answer. Can you do that here?
>>>
>>> Thanks,
>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
>>> <https://calendly.com/rjurney_personal/30min>
>>>
>>>
>>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>
>>>> Here is the complete error:
>>>>
>>>> ```
>>>> Traceback (most recent call last):
>>>>   File "nearest-gene.py", line 74, in <module>
>>>>     main()
>>>>   File "nearest-gene.py", line 62, in main
>>>>     distances = joined.withColumn("distance", max(col("start") -
>>>> col("position"), col("position") - col("end"), 0))
>>>>   File
>>>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py",
>>>> line 907, in __nonzero__
>>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>> ```
>>>>
>>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> That error sounds like it's from pandas not spark. Are you sure it's
>>>>> this line?
>>>>>
>>>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>>>> oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>>
>>>>>>      Hello,
>>>>>>
>>>>>>   I'm trying to calculate the distance between a gene (with start and
>>>>>> end) and a variant (with position), so I joined gene and variant data by
>>>>>> chromosome and then tried to calculate the distance like this:
>>>>>>
>>>>>> ```
>>>>>> distances = joined.withColumn("distance", max(col("start") -
>>>>>> col("position"), col("position") - col("end"), 0))
>>>>>> ```
>>>>>>
>>>>>>   Basically, the distance is the maximum of three terms.
>>>>>>
>>>>>>   This line causes an obscure error:
>>>>>>
>>>>>> ```
>>>>>> ValueError: Cannot convert column into bool: please use '&' for
>>>>>> 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean
>>>>>> expressions.
>>>>>> ```
>>>>>>
>>>>>>   How can I do this? Thanks!
>>>>>>
>>>>>>      Best, Oliver
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
<http://kp4cd.org/>, Flannick
Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
