Re: [PySpark SQL] New column with the maximum of multiple terms?

Bjørn Jørgensen Thu, 23 Feb 2023 12:26:58 -0800

I'm trying to learn how to use ChatGPT for coding.

So after a little chat I got this.

The code you provided seems to calculate the distance between a gene and a
variant by finding the maximum value between the difference of the variant
position and the gene start position, the difference of the gene end
position and the variant position, and 0. However, the result of this
calculation may not always be correct.

Here's an alternative approach to calculating the distance between a gene
and a variant:

from pyspark.sql.functions import abs, col, least

# Calculate the absolute difference between the variant position and the
# gene start and end positions
distances = (
    joined
    .withColumn("distance_start", abs(col("position") - col("start")))
    .withColumn("distance_end", abs(col("position") - col("end")))
)

# Take the minimum distance between the variant and the gene
distances = distances.withColumn(
    "distance", least(col("distance_start"), col("distance_end"))
)

In this approach, we first calculate the absolute difference between the
variant position and the gene start and end positions using the abs
function. Then, we use the least function to take the minimum distance
between the variant and the gene. This should give you a more accurate
distance calculation.

Yes, that's correct. The original code you provided calculates the distance
between a gene and a variant as the maximum of three terms: the difference
between the variant position and the gene start position, the difference
between the gene end position and the variant position, and 0.

While this approach might work in some cases, it has a few issues. First,
it doesn't take into account the possibility that the variant might be
inside the gene (i.e., between the start and end positions). Second, it
assumes that the gene is always located upstream or downstream of the
variant, which may not be the case.

The alternative approach I provided earlier calculates the distance as the
minimum of two absolute differences: the absolute difference between the
variant position and the gene start position, and the absolute difference
between the variant position and the gene end position. This approach
should work in all cases, including when the variant is inside the gene or
when the gene is located upstream or downstream of the variant.
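
To sanity-check the two formulas, here is a minimal plain-Python sketch that
evaluates both on single rows. The function names and the sample coordinates
are made up for illustration and are not from the thread. As far as I can
tell, the two formulas agree when the variant lies outside the gene but differ
when it lies inside: the original formula gives 0 there, while the minimum of
the two absolute differences gives the distance to the nearer gene end. The
ValueError in the original question appears to come from calling Python's
builtin max on Column objects; pyspark.sql.functions.greatest is the
column-wise counterpart.

```python
def interval_distance(start, end, position):
    """Formula from the original question: 0 inside the gene, else the gap."""
    return max(start - position, position - end, 0)


def nearest_end_distance(start, end, position):
    """Formula from the suggested alternative: distance to the nearer gene end."""
    return min(abs(position - start), abs(position - end))


# Hypothetical gene spanning positions 100..200 (values chosen for illustration).
start, end = 100, 200

# One variant before the gene, one inside it, one after it.
for position in (50, 150, 260):
    print(position,
          interval_distance(start, end, position),
          nearest_end_distance(start, end, position))

# In PySpark, the column-wise counterpart of interval_distance would use
# pyspark.sql.functions.greatest instead of the builtin max, for example:
#   greatest(col("start") - col("position"), col("position") - col("end"), lit(0))
```

For the variant at 150, which is inside the gene, the first formula gives 0
and the second gives 50, so which one is right depends on whether a variant
inside a gene should count as distance 0.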

On Thu, 23 Feb 2023 at 20:48, Russell Jurney <[email protected]> wrote:

> Usually, the solution to these problems is to do less per line, break it
> out and perform each minute operation as a field, then combine those into a
> final answer. Can you do that here?
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> [email protected] LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
> <https://calendly.com/rjurney_personal/30min>
>
>
> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
> [email protected]> wrote:
>
>> Here is the complete error:
>>
>> ```
>> Traceback (most recent call last):
>>   File "nearest-gene.py", line 74, in <module>
>>     main()
>>   File "nearest-gene.py", line 62, in main
>>     distances = joined.withColumn("distance", max(col("start") -
>> col("position"), col("position") - col("end"), 0))
>>   File
>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py",
>> line 907, in __nonzero__
>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>> ```
>>
>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <[email protected]> wrote:
>>
>>> That error sounds like it's from pandas not spark. Are you sure it's
>>> this line?
>>>
>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>> [email protected]> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   I'm trying to calculate the distance between a gene (with start and
>>>> end) and a variant (with position), so I joined gene and variant data by
>>>> chromosome and then tried to calculate the distance like this:
>>>>
>>>> ```
>>>> distances = joined.withColumn("distance", max(col("start") -
>>>> col("position"), col("position") - col("end"), 0))
>>>> ```
>>>>
>>>>   Basically, the distance is the maximum of three terms.
>>>>
>>>>   This line causes an obscure error:
>>>>
>>>> ```
>>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>> ```
>>>>
>>>>   How can I do this? Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297
