[ https://issues.apache.org/jira/browse/SPARK-14031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204493#comment-15204493 ]
Sean Owen commented on SPARK-14031:
-----------------------------------

Yes, but what is being executed in that stage? You'll see it in the web UI. What is the GC activity from your updated run? I suspect it is _not_ GC this time given how much you've increased, but it's worth looking.

> Dataframe to csv IO, system performance enters high CPU state and write operation takes 1 hour to complete
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-14031
> URL: https://issues.apache.org/jira/browse/SPARK-14031
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 2.0.0
> Environment: MACOSX 10.11.2 Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB and Ubuntu14.04 Vagrant 4 Cores 8g
> Reporter: Vincent Ohprecio
> Priority: Minor
> Attachments: visualVMscreenshot.png
>
>
> Summary
> When using spark-assembly-2.0.0/spark-shell to write out the results of a dataframe to csv, system performance enters a high CPU state and the write operation takes 1 hour to complete.
> * Affecting: [Stage 5:> (0 + 2) / 21]
> * Stage 5 elapsed time 3488272270000ns
> In comparison, tests were conducted using 1.4, 1.5, 1.6 with the same code/data, and Stage 5 csv write times were between 2 - 22 seconds.
> In addition, Parquet (Stage 3) write tests on 1.4, 1.5, 1.6 and 2.0 were similar, between 2 - 22 seconds.
> Files
> 1. Data File is "2008.csv"
> 2. Data file download http://stat-computing.org/dataexpo/2009/the-data.html
> 3. Code https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 1 - Setup
> High CPU and 58 minute average completion time
> * MACOSX 10.11.2
> * Macbook Pro 16g - 2.2 GHz Intel Core i7 -1TB
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> * Code: https://gist.github.com/bigsnarfdude/581b780ce85d7aaecbcb
> Observation 2 - Setup
> High CPU; waited over an hour for the csv write but didn't wait for it to complete
> * Ubuntu14.04
> * 4cores 8gb
> * spark-assembly-2.0.0
> * spark-csv_2.11-1.4
> Code Output: https://gist.github.com/bigsnarfdude/930f5832c231c3d39651

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org