Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
Sorry, I didn't try that.

On Fri, Feb 24, 2023 at 4:13 PM Russell Jurney 
wrote:

> Oliver, just curious: did you get a clean error message when you broke it
> out into separate statements?
>
> Thanks,
> Russell Jurney @rjurney 
> russell.jur...@gmail.com LI  FB
>  datasyndrome.com Book a time on Calendly
> 
>
>
> On Fri, Feb 24, 2023 at 9:53 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   Thanks for the advice. First of all, it looks like I used the wrong
>> *max* function, but *pyspark.sql.functions.max* isn't right either,
>> because it finds the maximum of a given column over groups of rows. To find
>> the maximum among multiple columns, I need
>> *pyspark.sql.functions.greatest*. Also, instead of 0, I need *lit(0)* to
>> make it a column.
>>
>>   In short, the correct line is:
>>
>>
>> *distances = joined.withColumn("distance", greatest(col("start") -
>> col("position"), col("position") - col("end"), lit(0)))*
>>
>>   Again, thanks to all who responded!
>>
>>  Best, Oliver
>>
>> On Thu, Feb 23, 2023 at 4:54 PM Sean Owen  wrote:
>>
>>> That's pretty impressive. I'm not sure it's quite right - not clear that
>>> the intent is taking a minimum of absolute values (is it? that'd be wild).
>>> But I think it might have pointed in the right direction. I'm not quite
>>> sure why that error pops out, but I think 'max' is the wrong function.
>>> That's an aggregate function. "greatest" is the function that returns the
>>> max of several cols. Try that?
>>>
>>> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <
>>> bjornjorgen...@gmail.com> wrote:
>>>
 I'm trying to learn how to use chatgpt for coding.

 So after a lite chat I got this.


 The code you provided seems to calculate the distance between a gene
 and a variant by finding the maximum value between the difference of the
 variant position and the gene start position, the difference of the gene
 end position and the variant position, and 0. However, the result of this
 calculation may not always be correct.

 Here's an alternative approach to calculating the distance between a
 gene and a variant:


 from pyspark.sql.functions import abs

 # Calculate the absolute difference between the variant position and
 the gene start and end positions
 distances = joined.withColumn("distance_start", abs(col("position") -
 col("start"))) \
  .withColumn("distance_end", abs(col("position") -
 col("end")))

 # Take the minimum distance between the variant and the gene
 distances = distances.withColumn("distance",
 least(col("distance_start"), col("distance_end")))


 In this approach, we first calculate the absolute difference between
 the variant position and the gene start and end positions using the abs
 function. Then, we use the least function to take the minimum distance
 between the variant and the gene. This should give you a more accurate
 distance calculation.




 Yes, that's correct. The original code you provided calculates the
 distance between a gene and a variant as the maximum of three terms: the
 difference between the variant position and the gene start position, the
 difference between the gene end position and the variant position, and 0.

 While this approach might work in some cases, it has a few issues.
 First, it doesn't take into account the possibility that the variant might
 be inside the gene (i.e., between the start and end positions). Second, it
 assumes that the gene is always located upstream or downstream of the
 variant, which may not be the case.

 The alternative approach I provided earlier calculates the distance as
 the minimum of two absolute differences: the absolute difference between
 the variant position and the gene start position, and the absolute
 difference between the variant position and the gene end position. This
 approach should work in all cases, including when the variant is inside the
 gene or when the gene is located upstream or downstream of the variant.



 tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
 russell.jur...@gmail.com>:

> Usually, the solution to these problems is to do less per line, break
> it out and perform each minute operation as a field, then combine those
> into a final answer. Can you do that here?
>
> Thanks,
> Russell Jurney @rjurney 
> russell.jur...@gmail.com LI  FB
>  datasyndrome.com Book a time on Calendly
> 
>
>
> On Thu, Feb 23, 2023 at 11:07 AM Oliver 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Russell Jurney
Oliver, just curious: did you get a clean error message when you broke it
out into separate statements?

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com Book a time on Calendly



On Fri, Feb 24, 2023 at 9:53 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   Thanks for the advice. First of all, it looks like I used the wrong
> *max* function, but *pyspark.sql.functions.max* isn't right either,
> because it finds the maximum of a given column over groups of rows. To find
> the maximum among multiple columns, I need
> *pyspark.sql.functions.greatest*. Also, instead of 0, I need *lit(0)* to
> make it a column.
>
>   In short, the correct line is:
>
>
> *distances = joined.withColumn("distance", greatest(col("start") -
> col("position"), col("position") - col("end"), lit(0)))*
>
>   Again, thanks to all who responded!
>
>  Best, Oliver
>
> On Thu, Feb 23, 2023 at 4:54 PM Sean Owen  wrote:
>
>> That's pretty impressive. I'm not sure it's quite right - not clear that
>> the intent is taking a minimum of absolute values (is it? that'd be wild).
>> But I think it might have pointed in the right direction. I'm not quite
>> sure why that error pops out, but I think 'max' is the wrong function.
>> That's an aggregate function. "greatest" is the function that returns the
>> max of several cols. Try that?
>>
>> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen 
>> wrote:
>>
>>> I'm trying to learn how to use chatgpt for coding.
>>>
>>> So after a lite chat I got this.
>>>
>>>
>>> The code you provided seems to calculate the distance between a gene and
>>> a variant by finding the maximum value between the difference of the
>>> variant position and the gene start position, the difference of the gene
>>> end position and the variant position, and 0. However, the result of this
>>> calculation may not always be correct.
>>>
>>> Here's an alternative approach to calculating the distance between a
>>> gene and a variant:
>>>
>>>
>>> from pyspark.sql.functions import abs
>>>
>>> # Calculate the absolute difference between the variant position and the
>>> gene start and end positions
>>> distances = joined.withColumn("distance_start", abs(col("position") -
>>> col("start"))) \
>>>  .withColumn("distance_end", abs(col("position") -
>>> col("end")))
>>>
>>> # Take the minimum distance between the variant and the gene
>>> distances = distances.withColumn("distance",
>>> least(col("distance_start"), col("distance_end")))
>>>
>>>
>>> In this approach, we first calculate the absolute difference between the
>>> variant position and the gene start and end positions using the abs
>>> function. Then, we use the least function to take the minimum distance
>>> between the variant and the gene. This should give you a more accurate
>>> distance calculation.
>>>
>>>
>>>
>>>
>>> Yes, that's correct. The original code you provided calculates the
>>> distance between a gene and a variant as the maximum of three terms: the
>>> difference between the variant position and the gene start position, the
>>> difference between the gene end position and the variant position, and 0.
>>>
>>> While this approach might work in some cases, it has a few issues.
>>> First, it doesn't take into account the possibility that the variant might
>>> be inside the gene (i.e., between the start and end positions). Second, it
>>> assumes that the gene is always located upstream or downstream of the
>>> variant, which may not be the case.
>>>
>>> The alternative approach I provided earlier calculates the distance as
>>> the minimum of two absolute differences: the absolute difference between
>>> the variant position and the gene start position, and the absolute
>>> difference between the variant position and the gene end position. This
>>> approach should work in all cases, including when the variant is inside the
>>> gene or when the gene is located upstream or downstream of the variant.
>>>
>>>
>>>
>>> tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
>>> russell.jur...@gmail.com>:
>>>
 Usually, the solution to these problems is to do less per line, break
 it out and perform each minute operation as a field, then combine those
 into a final answer. Can you do that here?

 Thanks,
 Russell Jurney @rjurney 
 russell.jur...@gmail.com LI  FB
  datasyndrome.com Book a time on Calendly
 


 On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
 oliv...@broadinstitute.org> wrote:

> Here is the complete error:
>
> ```
> Traceback (most recent call last):
>   File "nearest-gene.py", line 74, in 
> main()
>   File 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
 Hello,

  Thanks for the advice. First of all, it looks like I used the wrong *max*
function, but *pyspark.sql.functions.max* isn't right either, because it
finds the maximum of a given column over groups of rows. To find the
maximum among multiple columns, I need *pyspark.sql.functions.greatest*.
Also, instead of 0, I need *lit(0)* to make it a column.

  In short, the correct line is:


*distances = joined.withColumn("distance", greatest(col("start") -
col("position"), col("position") - col("end"), lit(0)))*

  Again, thanks to all who responded!

 Best, Oliver

On Thu, Feb 23, 2023 at 4:54 PM Sean Owen  wrote:

> That's pretty impressive. I'm not sure it's quite right - not clear that
> the intent is taking a minimum of absolute values (is it? that'd be wild).
> But I think it might have pointed in the right direction. I'm not quite
> sure why that error pops out, but I think 'max' is the wrong function.
> That's an aggregate function. "greatest" is the function that returns the
> max of several cols. Try that?
>
> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen 
> wrote:
>
>> I'm trying to learn how to use chatgpt for coding.
>>
>> So after a lite chat I got this.
>>
>>
>> The code you provided seems to calculate the distance between a gene and
>> a variant by finding the maximum value between the difference of the
>> variant position and the gene start position, the difference of the gene
>> end position and the variant position, and 0. However, the result of this
>> calculation may not always be correct.
>>
>> Here's an alternative approach to calculating the distance between a gene
>> and a variant:
>>
>>
>> from pyspark.sql.functions import abs
>>
>> # Calculate the absolute difference between the variant position and the
>> gene start and end positions
>> distances = joined.withColumn("distance_start", abs(col("position") -
>> col("start"))) \
>>  .withColumn("distance_end", abs(col("position") -
>> col("end")))
>>
>> # Take the minimum distance between the variant and the gene
>> distances = distances.withColumn("distance", least(col("distance_start"),
>> col("distance_end")))
>>
>>
>> In this approach, we first calculate the absolute difference between the
>> variant position and the gene start and end positions using the abs
>> function. Then, we use the least function to take the minimum distance
>> between the variant and the gene. This should give you a more accurate
>> distance calculation.
>>
>>
>>
>>
>> Yes, that's correct. The original code you provided calculates the
>> distance between a gene and a variant as the maximum of three terms: the
>> difference between the variant position and the gene start position, the
>> difference between the gene end position and the variant position, and 0.
>>
>> While this approach might work in some cases, it has a few issues. First,
>> it doesn't take into account the possibility that the variant might be
>> inside the gene (i.e., between the start and end positions). Second, it
>> assumes that the gene is always located upstream or downstream of the
>> variant, which may not be the case.
>>
>> The alternative approach I provided earlier calculates the distance as
>> the minimum of two absolute differences: the absolute difference between
>> the variant position and the gene start position, and the absolute
>> difference between the variant position and the gene end position. This
>> approach should work in all cases, including when the variant is inside the
>> gene or when the gene is located upstream or downstream of the variant.
>>
>>
>>
>> tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
>> russell.jur...@gmail.com>:
>>
>>> Usually, the solution to these problems is to do less per line, break it
>>> out and perform each minute operation as a field, then combine those into a
>>> final answer. Can you do that here?
>>>
>>> Thanks,
>>> Russell Jurney @rjurney 
>>> russell.jur...@gmail.com LI  FB
>>>  datasyndrome.com Book a time on Calendly
>>> 
>>>
>>>
>>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>
 Here is the complete error:

 ```
 Traceback (most recent call last):
   File "nearest-gene.py", line 74, in 
 main()
   File "nearest-gene.py", line 62, in main
 distances = joined.withColumn("distance", max(col("start") -
 col("position"), col("position") - col("end"), 0))
   File
 "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
 line 907, in __nonzero__
 ValueError: Cannot convert column into bool: please use '&' for 'and',
 '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
 ```

 On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:

> 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That's pretty impressive. I'm not sure it's quite right - not clear that
the intent is taking a minimum of absolute values (is it? that'd be wild).
But I think it might have pointed in the right direction. I'm not quite
sure why that error pops out, but I think 'max' is the wrong function.
That's an aggregate function. "greatest" is the function that returns the
max of several cols. Try that?

On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen 
wrote:

> I'm trying to learn how to use chatgpt for coding.
>
> So after a lite chat I got this.
>
>
> The code you provided seems to calculate the distance between a gene and a
> variant by finding the maximum value between the difference of the variant
> position and the gene start position, the difference of the gene end
> position and the variant position, and 0. However, the result of this
> calculation may not always be correct.
>
> Here's an alternative approach to calculating the distance between a gene
> and a variant:
>
>
> from pyspark.sql.functions import abs
>
> # Calculate the absolute difference between the variant position and the
> gene start and end positions
> distances = joined.withColumn("distance_start", abs(col("position") -
> col("start"))) \
>  .withColumn("distance_end", abs(col("position") -
> col("end")))
>
> # Take the minimum distance between the variant and the gene
> distances = distances.withColumn("distance", least(col("distance_start"),
> col("distance_end")))
>
>
> In this approach, we first calculate the absolute difference between the
> variant position and the gene start and end positions using the abs
> function. Then, we use the least function to take the minimum distance
> between the variant and the gene. This should give you a more accurate
> distance calculation.
>
>
>
>
> Yes, that's correct. The original code you provided calculates the
> distance between a gene and a variant as the maximum of three terms: the
> difference between the variant position and the gene start position, the
> difference between the gene end position and the variant position, and 0.
>
> While this approach might work in some cases, it has a few issues. First,
> it doesn't take into account the possibility that the variant might be
> inside the gene (i.e., between the start and end positions). Second, it
> assumes that the gene is always located upstream or downstream of the
> variant, which may not be the case.
>
> The alternative approach I provided earlier calculates the distance as the
> minimum of two absolute differences: the absolute difference between the
> variant position and the gene start position, and the absolute difference
> between the variant position and the gene end position. This approach
> should work in all cases, including when the variant is inside the gene or
> when the gene is located upstream or downstream of the variant.
>
>
>
> tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
> russell.jur...@gmail.com>:
>
>> Usually, the solution to these problems is to do less per line, break it
>> out and perform each minute operation as a field, then combine those into a
>> final answer. Can you do that here?
>>
>> Thanks,
>> Russell Jurney @rjurney 
>> russell.jur...@gmail.com LI  FB
>>  datasyndrome.com Book a time on Calendly
>> 
>>
>>
>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>> Here is the complete error:
>>>
>>> ```
>>> Traceback (most recent call last):
>>>   File "nearest-gene.py", line 74, in 
>>> main()
>>>   File "nearest-gene.py", line 62, in main
>>> distances = joined.withColumn("distance", max(col("start") -
>>> col("position"), col("position") - col("end"), 0))
>>>   File
>>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
>>> line 907, in __nonzero__
>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> ```
>>>
>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>>>
 That error sounds like it's from pandas not spark. Are you sure it's
 this line?

 On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
 oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to calculate the distance between a gene (with start and
> end) and a variant (with position), so I joined gene and variant data by
> chromosome and then tried to calculate the distance like this:
>
> ```
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
> ```
>
>   Basically, the distance is the maximum of three terms.
>
>   This line causes an obscure error:
>
> ```
> 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
I'm trying to learn how to use chatgpt for coding.

So after a lite chat I got this.


The code you provided seems to calculate the distance between a gene and a
variant by finding the maximum value between the difference of the variant
position and the gene start position, the difference of the gene end
position and the variant position, and 0. However, the result of this
calculation may not always be correct.

Here's an alternative approach to calculating the distance between a gene
and a variant:


from pyspark.sql.functions import abs

# Calculate the absolute difference between the variant position and the
gene start and end positions
distances = joined.withColumn("distance_start", abs(col("position") -
col("start"))) \
 .withColumn("distance_end", abs(col("position") -
col("end")))

# Take the minimum distance between the variant and the gene
distances = distances.withColumn("distance", least(col("distance_start"),
col("distance_end")))


In this approach, we first calculate the absolute difference between the
variant position and the gene start and end positions using the abs
function. Then, we use the least function to take the minimum distance
between the variant and the gene. This should give you a more accurate
distance calculation.




Yes, that's correct. The original code you provided calculates the distance
between a gene and a variant as the maximum of three terms: the difference
between the variant position and the gene start position, the difference
between the gene end position and the variant position, and 0.

While this approach might work in some cases, it has a few issues. First,
it doesn't take into account the possibility that the variant might be
inside the gene (i.e., between the start and end positions). Second, it
assumes that the gene is always located upstream or downstream of the
variant, which may not be the case.

The alternative approach I provided earlier calculates the distance as the
minimum of two absolute differences: the absolute difference between the
variant position and the gene start position, and the absolute difference
between the variant position and the gene end position. This approach
should work in all cases, including when the variant is inside the gene or
when the gene is located upstream or downstream of the variant.



tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney :

> Usually, the solution to these problems is to do less per line, break it
> out and perform each minute operation as a field, then combine those into a
> final answer. Can you do that here?
>
> Thanks,
> Russell Jurney @rjurney 
> russell.jur...@gmail.com LI  FB
>  datasyndrome.com Book a time on Calendly
> 
>
>
> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>> Here is the complete error:
>>
>> ```
>> Traceback (most recent call last):
>>   File "nearest-gene.py", line 74, in 
>> main()
>>   File "nearest-gene.py", line 62, in main
>> distances = joined.withColumn("distance", max(col("start") -
>> col("position"), col("position") - col("end"), 0))
>>   File
>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
>> line 907, in __nonzero__
>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>> ```
>>
>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>>
>>> That error sounds like it's from pandas not spark. Are you sure it's
>>> this line?
>>>
>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>

  Hello,

   I'm trying to calculate the distance between a gene (with start and
 end) and a variant (with position), so I joined gene and variant data by
 chromosome and then tried to calculate the distance like this:

 ```
 distances = joined.withColumn("distance", max(col("start") -
 col("position"), col("position") - col("end"), 0))
 ```

   Basically, the distance is the maximum of three terms.

   This line causes an obscure error:

 ```
 ValueError: Cannot convert column into bool: please use '&' for 'and',
 '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
 ```

   How can I do this? Thanks!

  Best, Oliver

 --
 Oliver Ruebenacker, Ph.D. (he)
 Senior Software Engineer, Knowledge Portal Network ,
 Flannick Lab , Broad Institute
 

>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
Usually, the solution to these problems is to do less per line, break it
out and perform each minute operation as a field, then combine those into a
final answer. Can you do that here?

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com Book a time on Calendly



On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

> Here is the complete error:
>
> ```
> Traceback (most recent call last):
>   File "nearest-gene.py", line 74, in 
> main()
>   File "nearest-gene.py", line 62, in main
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
>   File
> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
> line 907, in __nonzero__
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> ```
>
> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>
>> That error sounds like it's from pandas not spark. Are you sure it's this
>> line?
>>
>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   I'm trying to calculate the distance between a gene (with start and
>>> end) and a variant (with position), so I joined gene and variant data by
>>> chromosome and then tried to calculate the distance like this:
>>>
>>> ```
>>> distances = joined.withColumn("distance", max(col("start") -
>>> col("position"), col("position") - col("end"), 0))
>>> ```
>>>
>>>   Basically, the distance is the maximum of three terms.
>>>
>>>   This line causes an obscure error:
>>>
>>> ```
>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> ```
>>>
>>>   How can I do this? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Here is the complete error:

```
Traceback (most recent call last):
  File "nearest-gene.py", line 74, in 
main()
  File "nearest-gene.py", line 62, in main
distances = joined.withColumn("distance", max(col("start") -
col("position"), col("position") - col("end"), 0))
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
line 907, in __nonzero__
ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:

> That error sounds like it's from pandas not spark. Are you sure it's this
> line?
>
> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   I'm trying to calculate the distance between a gene (with start and
>> end) and a variant (with position), so I joined gene and variant data by
>> chromosome and then tried to calculate the distance like this:
>>
>> ```
>> distances = joined.withColumn("distance", max(col("start") -
>> col("position"), col("position") - col("end"), 0))
>> ```
>>
>>   Basically, the distance is the maximum of three terms.
>>
>>   This line causes an obscure error:
>>
>> ```
>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>> ```
>>
>>   How can I do this? Thanks!
>>
>>  Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad Institute
>> 
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
, Flannick
Lab , Broad Institute



Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this
line?

On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to calculate the distance between a gene (with start and end)
> and a variant (with position), so I joined gene and variant data by
> chromosome and then tried to calculate the distance like this:
>
> ```
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
> ```
>
>   Basically, the distance is the maximum of three terms.
>
>   This line causes an obscure error:
>
> ```
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> ```
>
>   How can I do this? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


[PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
 Hello,

  I'm trying to calculate the distance between a gene (with start and end)
and a variant (with position), so I joined gene and variant data by
chromosome and then tried to calculate the distance like this:

```
distances = joined.withColumn("distance", max(col("start") -
col("position"), col("position") - col("end"), 0))
```

  Basically, the distance is the maximum of three terms.

  This line causes an obscure error:

```
ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

  How can I do this? Thanks!

 Best, Oliver

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
, Flannick
Lab , Broad Institute