[ https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375995#comment-16375995 ]

Arman Yazdani commented on SPARK-21177:
---------------------------------------

I configured Spark with Hive. In my case, when I save a partitioned dataset to 
Hive, Spark waits about 10 minutes for the Hive metastore, and the metastore 
process pegs one CPU thread at 100%. I changed the metastore log level to 
DEBUG, and the metastore stalls right after logging the getMTable call in 
ObjectStore. During this 10-minute wait, Spark has no job running; it is simply 
blocked on the Hive metastore. The wait grows as the number of partitions grows.
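
A minimal sketch of the partitioned case described above (the table name, 
partition column, and row counts are illustrative only, not taken from the 
actual job):

{code:java}
// Assumes a Hive-enabled SparkSession named `spark`, as in spark-shell.
import spark.implicits._

// A small DataFrame with ~100 distinct values of the partition column "part".
val df = (1 to 1000).map(i => (i, i % 100)).toDF("value", "part")

// Each append like this triggers metastore calls whose cost grows with the
// number of partitions already registered for the table.
df.write
  .mode("append")
  .partitionBy("part")
  .saveAsTable("t_partitioned")
{code}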

> df.saveAsTable slows down linearly, with number of appends
> ----------------------------------------------------------
>
>                 Key: SPARK-21177
>                 URL: https://issues.apache.org/jira/browse/SPARK-21177
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Prashant Sharma
>            Priority: Major
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>       /_/
>          
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
>      |   val start = System.nanoTime()
>      |   f()
>      |   val end = System.nanoTime()
>      |   val timetaken = end - start
>      |   import scala.concurrent.duration._
>      |   println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>      | }
> printTimeTaken: (str: String, f: () => Unit)Unit
> scala> for(i <- 1 to 100000) { printTimeTaken("time to append to hive:", () => { Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); }) }
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> ....
> {code}
> Why does it matter? In a streaming job, where every batch appends to the same 
> table, this per-append slowdown makes it effectively impossible to keep 
> appending to Hive with this DataFrame operation.



