Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
ur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly <https://calendly.com/rjurney_personal/30min> On Fri, Feb 24, 2023 at 9:53 AM Oliver Ruebenacker <oliv...@broadinstitut

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
(most recent call last):
  File "nearest-gene.py", line 74, in <module>
    main()
  File "nearest-gene.py", line 62, in main
    distances = joined.withColumn("distance", max(col("start") - co
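
The failing line passes Column expressions to Python's builtin max(), which cannot evaluate a Column. Below is a minimal sketch of a per-row alternative using pyspark.sql.functions.greatest; the interval-distance interpretation and the end/position column names are assumptions, since only col("start") survives in the snippet.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, greatest, lit

spark = SparkSession.builder.getOrCreate()
# Hypothetical data with assumed column names; only "start" appears in the snippet.
joined = spark.createDataFrame(
    [(10, 25, 17), (40, 55, 70)], ["start", "end", "position"]
)

# greatest() picks the largest value per row; lit(0) floors the distance at zero
# when the position falls inside the [start, end] interval.
distances = joined.withColumn(
    "distance",
    greatest(col("start") - col("position"), col("position") - col("end"), lit(0)),
)
distances.show()
```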

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
'&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. On Thu, Feb 23, 2023 at 2:00 PM Sean Owen wrote: That error sounds like it's from pandas not spark. Are you sure it's this line? On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
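
The quoted message is Spark's standard hint when a Column lands in a Python boolean context (and/or/not, or builtins like max). A small sketch with invented column names, combining conditions with &, |, ~ and parentheses:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 5, 3), (1, 5, 9)], ["start", "end", "position"])

# Combine Column conditions with & and |, negate with ~, and parenthesize each
# comparison; plain Python and/or/not would raise the error quoted above.
inside = df.filter((col("start") <= col("position")) & (col("position") <= col("end")))
outside = df.filter(~((col("start") <= col("position")) & (col("position") <= col("end"))))
inside.show()
outside.show()
```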

[PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
when building DataFrame boolean expressions. How can I do this? Thanks! Best, Oliver -- Oliver Ruebenacker, Ph.D. (he) Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>

Re: [PySPark] How to check if value of one column is in array of another column

2023-01-18 Thread Oliver Ruebenacker
must be same type but were: string != array; How do I do this? Thanks! Best, Oliver -- Oliver Ruebenacker, Ph.D. (he) Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flan

[PySPark] How to check if value of one column is in array of another column

2023-01-17 Thread Oliver Ruebenacker
: pyspark.sql.utils.AnalysisException: cannot resolve '(gene IN (nearest))' due to data type mismatch: Arguments must be same type but were: string != array; How do I do this? Thanks! Best, Oliver -- Oliver Ruebenacker, Ph.D. (he) Senior Software Engineer, Knowledge Portal Network <h
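
The AnalysisException arises because IN / isin compares a column against literal values. A sketch using the column names from the error message, testing membership in the array column with array_contains instead:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
# Illustrative rows; "gene" is a string column, "nearest" an array of strings.
df = spark.createDataFrame(
    [("BRCA2", ["BRCA2", "TP53"]), ("EGFR", ["KRAS"])],
    ["gene", "nearest"],
)

# The SQL form works across Spark versions; recent PySpark also accepts a
# Column as the second argument of pyspark.sql.functions.array_contains.
df.withColumn("gene_is_nearest", expr("array_contains(nearest, gene)")).show()
```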

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
,>=1.19.5 in /usr/local/lib64/python3.11/site-packages (from scipy) (1.24.1)
Installing collected packages: scipy
Successfully installed scipy-1.10.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the syste

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
On Fri, Jan 6, 2023 at 16:01, Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Hello, I'm trying to install SciPy using a bootstrap script and then use it to calculate a new field in a dataframe, running

[PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
, but then at this line: *from scipy.stats import norm* I get the following error: *ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject* Any advice on how to proceed? Thanks! Best, Oliver -- Oliver Ruebenacker, Ph.D. (he) Senior
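
This ValueError usually means the scipy wheel was compiled against a newer numpy ABI than the numpy actually installed on the nodes. A small diagnostic sketch follows; the usual remedy (an assumption about this particular setup) is to upgrade or reinstall numpy and scipy together in the same bootstrap step so the wheels match.

```python
# Print the versions seen by the driver; mismatched numpy/scipy builds are the
# usual cause of "numpy.ndarray size changed" errors at import time.
import numpy
import scipy

print("numpy:", numpy.__version__)
print("scipy:", scipy.__version__)

# The same two prints can be run inside a mapPartitions/foreach over a tiny RDD
# to confirm what the executors see.
```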

Re: [PySpark] Getting the best row from each group

2022-12-21 Thread Oliver Ruebenacker
be improved. view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all responsibility for any los

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
it needs to order/sort. -- Raghavendra On Mon, Dec 19, 2022 at 8:57 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Hello, How can I retain from each group only the row for which one value is
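
One ordering-free alternative, sketched with the illustrative country/city/population columns from elsewhere in the thread: aggregate the maximum of a struct so the whole winning row survives the groupBy. Spark 3.3+ also offers max_by for the same purpose.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as spark_max, struct

spark = SparkSession.builder.getOrCreate()
cities = spark.createDataFrame(
    [("US", "New York", 8_400_000), ("US", "Miami", 450_000), ("FR", "Paris", 2_100_000)],
    ["country", "city", "population"],
)

# max over a struct compares field by field, so putting population first keeps
# the row of the most populous city per country without a window or sort step.
best = (
    cities.groupBy("country")
    .agg(spark_max(struct(col("population"), col("city"))).alias("top"))
    .select("country", col("top.city").alias("city"), col("top.population").alias("population"))
)
best.show()
```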

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Miami 6202 Which you could further filter in another CTE or subquery where PopulationRank = 1. As I mentioned, I'm not sure how this translates into PySpark, but that's the general concept in SQL. On Mon, Dec 19
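
A hedged PySpark translation of the SQL idea quoted above: rank rows within each country by population with a window function and keep rank 1. The sample data and column names are invented to mirror the example.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()
cities = spark.createDataFrame(
    [("US", "New York", 8_400_000), ("US", "Miami", 450_000), ("FR", "Paris", 2_100_000)],
    ["country", "city", "population"],
)

# Equivalent of the SQL CTE: number rows per country by descending population,
# then keep only the top-ranked row in each partition.
w = Window.partitionBy("country").orderBy(col("population").desc())
largest = (
    cities.withColumn("population_rank", row_number().over(w))
    .filter(col("population_rank") == 1)
    .drop("population_rank")
)
largest.show()
```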

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
window function? On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Hello, Thank you for the response! I can think of two ways to get the largest city by country, but both

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
On Tue, Dec 6, 2022 at 10:47 AM Holden Karau wrote: Take a look at https://github.com/nielsbasjes/splittablegzip :D On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Hello Holden, Thank
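
A sketch of how the suggested splittable gzip codec might be wired into a PySpark session. The Maven coordinate and codec class name are taken from the nielsbasjes/splittablegzip project as remembered and should be verified against its README.

```python
from pyspark.sql import SparkSession

# Coordinate and class name below are assumptions; check the project README.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "nl.basjes.hadoop:splittablegzip:1.3")
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    .getOrCreate()
)

# With the codec registered, a single large .gz (or bgzipped) file can be
# split across multiple tasks instead of being read by one.
df = spark.read.text("s3://my-bucket/path/*.gz")  # bucket and path are placeholders
print(df.count())
```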

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote: Hello Chris, Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but to start reading from somewhere other than the beginning of the file

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
e either of those, it would require writing a custom Hadoop compression codec to integrate more closely with the data format. Chris Nauroth On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-05 Thread Oliver Ruebenacker
s like Snappy are generally preferred for greater efficiency. (Of course, we're not always in complete control of the data formats we're given, so the support for bz2 is there.) [1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/

[PySpark] Reader/Writer for bgzipped data

2022-12-02 Thread Oliver Ruebenacker
Hello, Is it possible to read/write a DataFrame from/to a set of bgzipped files? Can it read from/write to AWS S3? Thanks! Best, Oliver -- Oliver Ruebenacker, Ph.D. (he) Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flanni
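
A minimal sketch with placeholder S3 paths: because bgzip output is valid gzip, Spark's built-in gzip handling can usually read it directly based on the .gz extension, though each file then goes to a single task unless a splittable codec (like the one discussed above) is configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a set of bgzipped (gzip-compatible) TSV files from S3; the paths and the
# tab-separated layout are placeholders.
df = spark.read.csv("s3://my-bucket/input/*.tsv.gz", sep="\t", header=True)

# Writing with gzip compression produces ordinary .gz part files, not bgzip's
# blocked variant; re-run bgzip downstream if block compression is required.
df.write.csv("s3://my-bucket/output/", sep="\t", header=True, compression="gzip")
```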

Re: [PySpark] Join using condition where each record may be joined multiple times

2022-11-28 Thread Oliver Ruebenacker
On 11/27/22 12:30 PM, Oliver Ruebenacker wrote: Hello, I have two Dataframes I want to join using a condition such that each record from each Dataframe may be joined with multiple records from the other Dataframe. This means the original records shoul

[PySpark] Join using condition where each record may be joined multiple times

2022-11-27 Thread Oliver Ruebenacker
e) & \
    (genes.start - padding <= variants.position) & \
    (genes.end + padding >= variants.position)
gene_variants = genes.join(variants.alias('variants'), cond, "left_outer")
print('Joining genes and variants give ' + str(gene_variants.count()) + ' pairs:')
for row in g
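
A hedged, self-contained reconstruction of the join sketched above: the truncated first condition is assumed to be a chromosome match, and the sample data and padding value are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
padding = 100  # invented value; the original padding is not shown in the snippet

genes = spark.createDataFrame(
    [("1", "GENE_A", 1_000, 2_000), ("2", "GENE_B", 5_000, 6_000)],
    ["chromosome", "gene", "start", "end"],
).alias("genes")

variants = spark.createDataFrame(
    [("1", "rs1", 950), ("1", "rs2", 3_000), ("2", "rs3", 5_500)],
    ["chromosome", "variant", "position"],
).alias("variants")

# Non-equi join condition: same chromosome (assumed) and the variant position
# falls within the gene interval widened by the padding on both sides.
cond = (genes.chromosome == variants.chromosome) & \
       (genes.start - padding <= variants.position) & \
       (genes.end + padding >= variants.position)

# left_outer keeps genes with no matching variant; genes with several matches
# appear once per matching variant.
gene_variants = genes.join(variants, cond, "left_outer")
print('Joining genes and variants gives ' + str(gene_variants.count()) + ' pairs:')
for row in gene_variants.collect():
    print(row)
```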

Re: [scala-user] ERROR TaskResultGetter: Exception while getting task result java.io.IOException: java.lang.ClassNotFoundException: scala.Some

2016-06-16 Thread Oliver Ruebenacker
" at "http://repo.akka.io/releases/" I am getting a TaskResultGetter error with ClassNotFoundException for scala.Some. Can I please get some help on how to fix it? Thanks, S. Sarkar -- You received this message because you are subscribed