Unable to handle bignumeric datatype in spark/pyspark

2023-02-23 Thread nidhi kher
Hello,


I am facing below issue in pyspark code:

We are running spark code using dataproc serverless batch in google cloud
platform. Spark code is causing issue while writing the data to bigquery
table. In bigquery table , few of the columns have datatype as bignumeric
and spark code is changing the datatype from bignumeric to numeric while
writing the data. We need datatype to be kept as bignumeric only as we need
data of 38,20 precision.


Can we cast a column to bignumeric in spark sql dataframe like below code
for decimal:


df= spark.sql("""SELECT cast(col1 as decimal(38,20)) as col1 from table1""")

Spark version :3.3

Pyspark version : 1.1


Regards,

Nidhi


unsubscribe

2023-02-23 Thread Roberto Jr
please unsubscribe from that email list.
thank you in advance.
roberto.


Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That's pretty impressive. I'm not sure it's quite right - not clear that
the intent is taking a minimum of absolute values (is it? that'd be wild).
But I think it might have pointed in the right direction. I'm not quite
sure why that error pops out, but I think 'max' is the wrong function.
That's an aggregate function. "greatest" is the function that returns the
max of several cols. Try that?

On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen 
wrote:

> I'm trying to learn how to use chatgpt for coding.
>
> So after a lite chat I got this.
>
>
> The code you provided seems to calculate the distance between a gene and a
> variant by finding the maximum value between the difference of the variant
> position and the gene start position, the difference of the gene end
> position and the variant position, and 0. However, the result of this
> calculation may not always be correct.
>
> Here's an alternative approach to calculating the distance between a gene
> and a variant:
>
>
> from pyspark.sql.functions import abs
>
> # Calculate the absolute difference between the variant position and the
> gene start and end positions
> distances = joined.withColumn("distance_start", abs(col("position") -
> col("start"))) \
>  .withColumn("distance_end", abs(col("position") -
> col("end")))
>
> # Take the minimum distance between the variant and the gene
> distances = distances.withColumn("distance", least(col("distance_start"),
> col("distance_end")))
>
>
> In this approach, we first calculate the absolute difference between the
> variant position and the gene start and end positions using the abs
> function. Then, we use the least function to take the minimum distance
> between the variant and the gene. This should give you a more accurate
> distance calculation.
>
>
>
>
> Yes, that's correct. The original code you provided calculates the
> distance between a gene and a variant as the maximum of three terms: the
> difference between the variant position and the gene start position, the
> difference between the gene end position and the variant position, and 0.
>
> While this approach might work in some cases, it has a few issues. First,
> it doesn't take into account the possibility that the variant might be
> inside the gene (i.e., between the start and end positions). Second, it
> assumes that the gene is always located upstream or downstream of the
> variant, which may not be the case.
>
> The alternative approach I provided earlier calculates the distance as the
> minimum of two absolute differences: the absolute difference between the
> variant position and the gene start position, and the absolute difference
> between the variant position and the gene end position. This approach
> should work in all cases, including when the variant is inside the gene or
> when the gene is located upstream or downstream of the variant.
>
>
>
> tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney <
> russell.jur...@gmail.com>:
>
>> Usually, the solution to these problems is to do less per line, break it
>> out and perform each minute operation as a field, then combine those into a
>> final answer. Can you do that here?
>>
>> Thanks,
>> Russell Jurney @rjurney 
>> russell.jur...@gmail.com LI  FB
>>  datasyndrome.com Book a time on Calendly
>> 
>>
>>
>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>> Here is the complete error:
>>>
>>> ```
>>> Traceback (most recent call last):
>>>   File "nearest-gene.py", line 74, in 
>>> main()
>>>   File "nearest-gene.py", line 62, in main
>>> distances = joined.withColumn("distance", max(col("start") -
>>> col("position"), col("position") - col("end"), 0))
>>>   File
>>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
>>> line 907, in __nonzero__
>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> ```
>>>
>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>>>
 That error sounds like it's from pandas not spark. Are you sure it's
 this line?

 On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
 oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to calculate the distance between a gene (with start and
> end) and a variant (with position), so I joined gene and variant data by
> chromosome and then tried to calculate the distance like this:
>
> ```
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
> ```
>
>   Basically, the distance is the maximum of three terms.
>
>   This line causes an obscure error:
>
> ```
> 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
I'm trying to learn how to use chatgpt for coding.

So after a lite chat I got this.


The code you provided seems to calculate the distance between a gene and a
variant by finding the maximum value between the difference of the variant
position and the gene start position, the difference of the gene end
position and the variant position, and 0. However, the result of this
calculation may not always be correct.

Here's an alternative approach to calculating the distance between a gene
and a variant:


from pyspark.sql.functions import abs

# Calculate the absolute difference between the variant position and the
gene start and end positions
distances = joined.withColumn("distance_start", abs(col("position") -
col("start"))) \
 .withColumn("distance_end", abs(col("position") -
col("end")))

# Take the minimum distance between the variant and the gene
distances = distances.withColumn("distance", least(col("distance_start"),
col("distance_end")))


In this approach, we first calculate the absolute difference between the
variant position and the gene start and end positions using the abs
function. Then, we use the least function to take the minimum distance
between the variant and the gene. This should give you a more accurate
distance calculation.




Yes, that's correct. The original code you provided calculates the distance
between a gene and a variant as the maximum of three terms: the difference
between the variant position and the gene start position, the difference
between the gene end position and the variant position, and 0.

While this approach might work in some cases, it has a few issues. First,
it doesn't take into account the possibility that the variant might be
inside the gene (i.e., between the start and end positions). Second, it
assumes that the gene is always located upstream or downstream of the
variant, which may not be the case.

The alternative approach I provided earlier calculates the distance as the
minimum of two absolute differences: the absolute difference between the
variant position and the gene start position, and the absolute difference
between the variant position and the gene end position. This approach
should work in all cases, including when the variant is inside the gene or
when the gene is located upstream or downstream of the variant.



tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney :

> Usually, the solution to these problems is to do less per line, break it
> out and perform each minute operation as a field, then combine those into a
> final answer. Can you do that here?
>
> Thanks,
> Russell Jurney @rjurney 
> russell.jur...@gmail.com LI  FB
>  datasyndrome.com Book a time on Calendly
> 
>
>
> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>> Here is the complete error:
>>
>> ```
>> Traceback (most recent call last):
>>   File "nearest-gene.py", line 74, in 
>> main()
>>   File "nearest-gene.py", line 62, in main
>> distances = joined.withColumn("distance", max(col("start") -
>> col("position"), col("position") - col("end"), 0))
>>   File
>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
>> line 907, in __nonzero__
>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>> ```
>>
>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>>
>>> That error sounds like it's from pandas not spark. Are you sure it's
>>> this line?
>>>
>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>

  Hello,

   I'm trying to calculate the distance between a gene (with start and
 end) and a variant (with position), so I joined gene and variant data by
 chromosome and then tried to calculate the distance like this:

 ```
 distances = joined.withColumn("distance", max(col("start") -
 col("position"), col("position") - col("end"), 0))
 ```

   Basically, the distance is the maximum of three terms.

   This line causes an obscure error:

 ```
 ValueError: Cannot convert column into bool: please use '&' for 'and',
 '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
 ```

   How can I do this? Thanks!

  Best, Oliver

 --
 Oliver Ruebenacker, Ph.D. (he)
 Senior Software Engineer, Knowledge Portal Network ,
 Flannick Lab , Broad Institute
 

>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad 

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
Usually, the solution to these problems is to do less per line, break it
out and perform each minute operation as a field, then combine those into a
final answer. Can you do that here?

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com Book a time on Calendly



On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

> Here is the complete error:
>
> ```
> Traceback (most recent call last):
>   File "nearest-gene.py", line 74, in 
> main()
>   File "nearest-gene.py", line 62, in main
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
>   File
> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
> line 907, in __nonzero__
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> ```
>
> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:
>
>> That error sounds like it's from pandas not spark. Are you sure it's this
>> line?
>>
>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   I'm trying to calculate the distance between a gene (with start and
>>> end) and a variant (with position), so I joined gene and variant data by
>>> chromosome and then tried to calculate the distance like this:
>>>
>>> ```
>>> distances = joined.withColumn("distance", max(col("start") -
>>> col("position"), col("position") - col("end"), 0))
>>> ```
>>>
>>>   Basically, the distance is the maximum of three terms.
>>>
>>>   This line causes an obscure error:
>>>
>>> ```
>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> ```
>>>
>>>   How can I do this? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Here is the complete error:

```
Traceback (most recent call last):
  File "nearest-gene.py", line 74, in 
main()
  File "nearest-gene.py", line 62, in main
distances = joined.withColumn("distance", max(col("start") -
col("position"), col("position") - col("end"), 0))
  File
"/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_01/pyspark.zip/pyspark/sql/column.py",
line 907, in __nonzero__
ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

On Thu, Feb 23, 2023 at 2:00 PM Sean Owen  wrote:

> That error sounds like it's from pandas not spark. Are you sure it's this
> line?
>
> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   I'm trying to calculate the distance between a gene (with start and
>> end) and a variant (with position), so I joined gene and variant data by
>> chromosome and then tried to calculate the distance like this:
>>
>> ```
>> distances = joined.withColumn("distance", max(col("start") -
>> col("position"), col("position") - col("end"), 0))
>> ```
>>
>>   Basically, the distance is the maximum of three terms.
>>
>>   This line causes an obscure error:
>>
>> ```
>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>> ```
>>
>>   How can I do this? Thanks!
>>
>>  Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad Institute
>> 
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
, Flannick
Lab , Broad Institute



Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this
line?

On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to calculate the distance between a gene (with start and end)
> and a variant (with position), so I joined gene and variant data by
> chromosome and then tried to calculate the distance like this:
>
> ```
> distances = joined.withColumn("distance", max(col("start") -
> col("position"), col("position") - col("end"), 0))
> ```
>
>   Basically, the distance is the maximum of three terms.
>
>   This line causes an obscure error:
>
> ```
> ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
> for 'or', '~' for 'not' when building DataFrame boolean expressions.
> ```
>
>   How can I do this? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


[PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
 Hello,

  I'm trying to calculate the distance between a gene (with start and end)
and a variant (with position), so I joined gene and variant data by
chromosome and then tried to calculate the distance like this:

```
distances = joined.withColumn("distance", max(col("start") -
col("position"), col("position") - col("end"), 0))
```

  Basically, the distance is the maximum of three terms.

  This line causes an obscure error:

```
ValueError: Cannot convert column into bool: please use '&' for 'and', '|'
for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

  How can I do this? Thanks!

 Best, Oliver

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
, Flannick
Lab , Broad Institute