[ https://issues.apache.org/jira/browse/SPARK-13570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-13570.
-------------------------------
    Resolution: Incomplete

> pyspark save with partitionBy is very slow
> ------------------------------------------
>
>                 Key: SPARK-13570
>                 URL: https://issues.apache.org/jira/browse/SPARK-13570
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Shubhanshu Mishra
>              Labels: dataframe, partitioning, pyspark, save
>
> Running the following code to store the data for each year and pos in a
> separate folder is taking a huge amount of time for a very large dataframe
> (>37 hours for 60% of the work).
> {code}
> ## IPython was started using the following command:
> # IPYTHON=1 "$SPARK_HOME/bin/pyspark" --driver-memory 50g
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SQLContext, Row
> from pyspark.sql.types import *
>
> conf = SparkConf()
> conf.setMaster("local[30]")
> conf.setAppName("analysis")
> conf.set("spark.local.dir", "./tmp")
> conf.set("spark.executor.memory", "50g")
> conf.set("spark.driver.maxResultSize", "5g")
> sc = SparkContext(conf=conf)
> sqlContext = SQLContext(sc)
>
> df = sqlContext.read.format("csv") \
>     .options(header=False, inferSchema=True, delimiter="\t") \
>     .load("out/new_features")
> # `columns` holds the 20 desired column names (its definition is not shown)
> df = df.selectExpr(*("%s as %s" % (df.columns[i], k)
>                      for i, k in enumerate(columns)))
> # year takes values in [1902, 2015]
> # pos takes integer values from [-1, 0, 1, 2]
> # df is a dataframe with 20 columns and 1 billion rows
> # running on a machine with 32 cores and 500 GB RAM
> df.write.save("out/model_input_partitioned", format="csv",
>               partitionBy=["year", "pos"], delimiter="\t")
> {code}
> Currently, the job is at:
> [Stage 12:==============================> (1367 + 30) / 2290]
> and it has already been running for more than 37 hours. A single sweep over
> this data for a filter by value takes less than 6.5 minutes.
> The Spark web interface shows the following lines for the two stages of the job:
>
> Stage  Description                               Submitted            Duration  Tasks: succeeded/total  Input    Output  Shuffle Read  Shuffle Write
> 11     load at NativeMethodAccessorImpl.java:-2  2016/02/27 23:07:04  6.5 min   2290/2290               66.8 GB
> 12     save at NativeMethodAccessorImpl.java:-2  2016/02/27 23:15:59  37.1 h    1370/2290               40.9 GB

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)