[ 
https://issues.apache.org/jira/browse/SPARK-28547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

antonkulaga updated SPARK-28547:
--------------------------------
    Description: 
Spark is extremely slow for all wide data (when there are >15k columns and >15k 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20k and the number of samples as well. The popular GTEx dataset is 
a good example (see, for instance, the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where .gct is 
just a .tsv file with two comment lines at the beginning). Everything done on 
wide tables (even a simple "describe" applied to all the gene columns) either 
takes hours or freezes (because of lost executors), irrespective of memory 
and number of cores, while the same operations run fast (minutes) and well 
in pure pandas (without any Spark involved).
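For scale, the pandas side of the comparison can be sketched as below. This is a scaled-down illustration (100 rows instead of >15k, so it runs quickly; the ~20k gene-column width matches the report), not actual GTEx data:

```python
# Sketch of the wide-data shape from the report: ~20k gene columns.
# Row count is reduced to 100 for a quick run; the column count is
# the part that makes the data "wide". Values are random placeholders.
import numpy as np
import pandas as pd

n_samples, n_genes = 100, 20_000
df = pd.DataFrame(
    np.random.rand(n_samples, n_genes),
    columns=[f"gene_{i}" for i in range(n_genes)],
)

# In pandas, describe() over all gene columns completes quickly even at
# full scale, whereas (per the report) the equivalent Spark job over the
# same table either takes hours or loses executors.
stats = df.describe()
print(stats.shape)  # one summary row per statistic, one column per gene
```

The equivalent Spark call that exhibits the slowdown would be `sparkDf.describe()` (or `summary()`) over the same ~20k columns.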

  was:
Spark is extremely slow for all wide data (when there are >15k columns and >15k 
rows). Most genomics/transcriptomics data is wide because the number of genes 
is usually >20k and the number of samples as well. The popular GTEx dataset is 
a good example (see, for instance, the RNA-Seq data at 
https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where .gct is 
just a .tsv file with two comment lines at the beginning). Everything done on 
wide tables (even a simple "describe" applied to all the gene columns) either 
takes hours or freezes (because of lost executors), irrespective of memory 
and number of cores, while the same operations work well with pure pandas 
(without any Spark involved).


> Make Spark usable for wide data (>10K columns)
> ----------------------------------------------
>
>                 Key: SPARK-28547
>                 URL: https://issues.apache.org/jira/browse/SPARK-28547
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.4, 2.4.3
>         Environment: Ubuntu server, Spark 2.4.3 Scala with >64GB RAM per 
> node, 32 cores (tried different configurations of executors)
>            Reporter: antonkulaga
>            Priority: Critical
>
> Spark is extremely slow for all wide data (when there are >15k columns and 
> >15k rows). Most genomics/transcriptomics data is wide because the number of 
> genes is usually >20k and the number of samples as well. The popular GTEx 
> dataset is a good example (see, for instance, the RNA-Seq data at 
> https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data, where .gct is 
> just a .tsv file with two comment lines at the beginning). Everything done on 
> wide tables (even a simple "describe" applied to all the gene columns) either 
> takes hours or freezes (because of lost executors), irrespective of memory 
> and number of cores, while the same operations run fast (minutes) and well 
> in pure pandas (without any Spark involved).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
