Re: Low throughput and effect of GC in SparkSql GROUP BY
I hadn't turned on codegen. I enabled it and ran it again, and it is running 4-5 times faster now! :) Since my log statements no longer appear, I presume the code path is quite different from the earlier hashmap-based code in Aggregates.scala?

Pramod

On Wed, May 20, 2015 at 9:18 PM, Reynold Xin r...@databricks.com wrote:

Does this turn codegen on? I think the performance is fairly different when codegen is turned on. For 1.5, we are investigating having codegen on by default, so users get much better performance out of the box.

On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri pramodbilig...@gmail.com wrote:
[original message; see "Low throughput and effect of GC in SparkSql GROUP BY" below]
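For reference, in the Spark 1.x line codegen is controlled by the spark.sql.codegen property. A minimal Scala sketch of turning it on, assuming an existing SparkContext named sc (as in spark-shell):

    import org.apache.spark.sql.SQLContext

    // Sketch: enable Spark SQL codegen (Spark 1.x; 1.5 is expected to
    // turn it on by default, per Reynold's note above).
    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.codegen", "true")

The same property can also be set from SQL with: SET spark.sql.codegen=true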
Re: Low throughput and effect of GC in SparkSql GROUP BY
Hi Zhang,

No, my data is not compressed; I'm trying to minimize the load on the CPU. The GC time went down for me after enabling codegen.

Pramod

On Thu, May 21, 2015 at 3:43 AM, zhangxiongfei zhangxiongfei0...@163.com wrote:

Hi Pramod,

Is your data compressed? I encountered a similar problem; however, after turning codegen on, the GC time was still very long. The input to my map task is an LZO file of about 100 MB. My query is:

    select ip, count(*) as c from stage_bitauto_adclick_d group by ip sort by c limit 100

Thanks,
Zhang Xiongfei

At 2015-05-21 12:18:35, Reynold Xin r...@databricks.com wrote:
[...]
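When GC time stays high even with codegen, turning on GC logging in the executors is a quick way to see which generation is collecting and how often. A minimal Scala sketch for the 1.x line, with the caveat that the executor heap size itself must still be set through spark.executor.memory, not through these flags:

    import org.apache.spark.SparkConf

    // Sketch: verbose GC logging on the executors (Spark 1.x).
    // -Xmx must NOT be set here; use spark.executor.memory for heap size.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

The resulting GC lines show up in each executor's stdout log.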
Low throughput and effect of GC in SparkSql GROUP BY
Hi,

Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I have a data point regarding the performance of GROUP BY: it indicates excessive GC, and the GC is hurting throughput. I want to know whether the new memory manager for aggregations (https://github.com/apache/spark/pull/5725/) is going to address this kind of issue.

I have only a small amount of data on each node (~360 MB) with a large heap (18 GB), yet I still see 2-3 minor collections whenever I run a SELECT SUM() with a GROUP BY. I have tried different young-generation sizes without much effect, though not different GC algorithms (hmm, perhaps I ought to try reducing the RDD storage fraction). I have charted my results [1] by adding timing code to Aggregates.scala. The query is Query 2 from Berkeley's AmpLab benchmark, running over 10 million records; the chart is from one of the 4 worker nodes in the cluster.

I am trying to square this with a claim in the Project Tungsten blog post [2]: "When profiling Spark user applications, we’ve found that a large fraction of the CPU time is spent waiting for data to be fetched from main memory." Am I correct in assuming that Spark SQL has yet to reach that level of efficiency, at least in aggregation operations?

Thanks.

[1] https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
[2] https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

Pramod
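The two knobs mentioned above, young-generation size and the RDD storage fraction, can both be set through the Spark configuration. A minimal Scala sketch for the 1.x line; the specific values are illustrative, not recommendations:

    import org.apache.spark.SparkConf

    // Sketch: grow the young generation and shrink the share of the heap
    // reserved for cached RDDs (spark.storage.memoryFraction defaults to 0.6).
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-Xmn4g")
      .set("spark.storage.memoryFraction", "0.3")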
Re: unable to extract tgz files downloaded from spark
This sometimes happens when the download is interrupted or corrupted. You can verify the integrity of your file by comparing it against the MD5 and SHA checksums published here: http://www.apache.org/dist/spark/spark-1.3.1/

Pramod

On Wed, May 6, 2015 at 7:16 PM, Praveen Kumar Muthuswamy muthusamy...@gmail.com wrote:

Hi,

I have been trying to install the latest Spark version and downloaded the .tgz files (e.g. spark-1.3.1.tgz), but I could not extract them: tar complains of an invalid format. Has anyone seen this issue?

Thanks,
Praveen
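For example, on Linux the check might look like this (shasum prints SHA-1 by default, so pass the -a flag matching whichever SHA variant the published .sha file uses):

    md5sum spark-1.3.1.tgz   # compare the output with the published .md5 file
    shasum spark-1.3.1.tgz   # compare with the published .sha file

If the digests differ, re-download the archive, ideally from a different mirror.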
Re: Speeding up Spark build during development
I had to make a small change to Emre's suggestion above in order for my changes to get picked up. This worked for me:

    mvn --projects sql/core -DskipTests install   # note: install, not package
    mvn --projects assembly/ -DskipTests install

Pramod

On Tue, May 5, 2015 at 2:36 AM, Iulian Dragoș iulian.dra...@typesafe.com wrote:

I'm probably the only Eclipse user here, but it seems I have the best workflow :) At least for me things work as they should: once I have imported the projects into the workspace, I can build and run/debug tests from the IDE. I only go to sbt when I need to re-create the projects or want to run the full test suite.

iulian

On Tue, May 5, 2015 at 7:35 AM, Tathagata Das t...@databricks.com wrote:

In addition to Michael's suggestion, in my sbt workflow I also use ~ to automatically kick off the build and unit tests. For example:

    sbt/sbt ~streaming/test-only *BasicOperationsSuite*

It automatically detects any file changes in the project and starts compiling and testing. So my full workflow is: change code in IntelliJ, then continuously run unit tests in the background on the command line using ~.

TD

On Mon, May 4, 2015 at 2:49 PM, Michael Armbrust mich...@databricks.com wrote:

FWIW, my Spark SQL development workflow is usually to run build/sbt sparkShell or build/sbt 'sql/test-only testSuiteName'. These commands start in as little as 30s on my laptop, automatically figure out which subprojects need to be rebuilt, and don't require the expensive assembly creation.

On Mon, May 4, 2015 at 5:48 AM, Meethu Mathew meethu.mat...@flytxt.com wrote:

Hi,

Is it really necessary to run mvn --projects assembly/ -DskipTests install? Could you please explain why this is needed? I got the changes after running mvn --projects streaming/ -DskipTests package.

Regards,
Meethu

On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote:

Just to give you an example: when I was trying to make a small change only to the Streaming component of Spark, I first built and installed the whole Spark project (this took about 15 minutes on my 4-core, 4 GB RAM laptop). Then, after changing files only in Streaming, I ran something like (in the top-level directory):

    mvn --projects streaming/ -DskipTests package

and then

    mvn --projects assembly/ -DskipTests install

This was much faster than building the whole of Spark from scratch, because Maven was only building one component, in my case the Streaming component. I think you can use a very similar approach.

-- Emre Sevinç

[earlier messages in this thread appear below in the archive]
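A possible shortcut for the two-step sequence above (an assumption on my part, not something from the thread): Maven's --projects flag accepts a comma-separated list, so both modules can be built in one reactor invocation:

    mvn --projects sql/core,assembly -DskipTests install

This still relies on the rest of Spark having been installed to the local repository once beforehand, as Emre describes in this thread.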
Re: Speeding up Spark build during development
Using the in-built Maven and Zinc, it takes around 10 minutes for each build. Is that reasonable? My Maven opts look like this:

    $ echo $MAVEN_OPTS
    -Xmx12000m -XX:MaxPermSize=2048m

I'm running it as:

    build/mvn -DskipTests package

Should I be tweaking my Zinc/Nailgun config?

Pramod

On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com wrote:

https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn

On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com wrote:

This is great. I didn't know about the mvn script in the build directory.

Pramod

[...]
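On the Zinc/Nailgun question: the zinc distribution that build/mvn downloads lives under build/, and its client can report on or shut down the running compile server. A sketch, assuming the zinc version directory created by build/mvn (the version number in the path will vary):

    build/zinc-0.3.5.3/bin/zinc -status     # is the Nailgun compile server up?
    build/zinc-0.3.5.3/bin/zinc -shutdown   # stop it; the next build/mvn starts a fresh one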
Re: Speeding up Spark build during development
No, I just need to build one project at a time; right now it's Spark SQL.

Pramod

On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com wrote:

Hello Pramod,

Do you need to build the whole project every time? Generally you don't. E.g., when I was changing files that belong only to Spark Streaming, I would build only the streaming module (after having built and installed the whole project once), and then the assembly. This was much faster than trying to build the whole of Spark every time.

-- Emre Sevinç

[...]
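For a single-module loop like this, sbt's continuous mode is another option. A sketch, assuming the Spark SQL core module's sbt project is named sql (consistent with the sql/test-only usage quoted elsewhere in this thread):

    build/sbt ~sql/compile   # recompiles the sql project on every source change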
Re: Why does SortShuffleWriter write to disk always?
Thanks for the info. I agree, it makes sense the way it is designed.

Pramod

On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan mri...@gmail.com wrote:

I agree, this is better handled by the filesystem cache, not to mention being able to do zero-copy writes.

Regards,
Mridul

On Sat, May 2, 2015 at 10:26 PM, Reynold Xin r...@databricks.com wrote:

I've personally prototyped completely in-memory shuffle for Spark 3 times. However, it is unclear how big a gain it would be to keep all of this in memory under newer file systems (ext4, xfs). If the shuffle data is small, it is still in the file system buffer cache anyway. Note that network throughput is often lower than disk throughput, so reading the data from disk won't be a problem. And not having to keep all of this in memory substantially simplifies memory management.

On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri pramodbilig...@gmail.com wrote:
[original question; see "Why does SortShuffleWriter write to disk always?" below]
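For experiments, one way to approximate an in-memory shuffle without changing any Spark code (a common trick, not something proposed in this thread) is to point Spark's scratch directory at a RAM-backed tmpfs, so shuffle files never reach a physical disk. A minimal Scala sketch, assuming a tmpfs mounted at /dev/shm as is typical on Linux:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: shuffle output and spill files are written under spark.local.dir,
    // so a tmpfs location keeps them in RAM (at the cost of off-heap system memory).
    val conf = new SparkConf()
      .setMaster("local[4]")            // illustrative
      .setAppName("tmpfs-shuffle-test") // illustrative
      .set("spark.local.dir", "/dev/shm/spark-scratch")
    val sc = new SparkContext(conf)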
Re: Speeding up Spark build during development
This is great. I didn't know about the mvn script in the build directory.

Pramod

On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com wrote:

Following what Ted said, if you leverage the `mvn` from within the `build/` directory of Spark, you'll get Zinc for free, which should help speed up build times.

On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:

Pramod:
Please remember to run Zinc so that the build is faster.

Cheers

On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander alexander.ula...@hp.com wrote:

Hi Pramod,

For cluster-like tests you might want to use the same code as in MLlib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class.

Best regards,
Alexander

-----Original Message-----
From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
Sent: Friday, May 01, 2015 1:46 AM
To: dev@spark.apache.org
Subject: Speeding up Spark build during development

Hi,

I'm making some small changes to the Spark codebase and trying it out on a cluster. I was wondering if there's a faster way to build than running the package target each time. Currently I'm using:

    mvn -DskipTests package

All the nodes have the same filesystem mounted at the same mount point.

Pramod
Why does SortShuffleWriter write to disk always?
Hi,

I was trying to see if I can make Spark avoid hitting the disk for small jobs, but I see that SortShuffleWriter.write() always writes to disk. I found an older thread (http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html) saying that it doesn't call fsync on this write path.

My question is: why does it always write to disk? Does it mean the reduce phase reads the result from disk as well? Isn't it possible to read the data directly from the map-side buffer in ExternalSorter during the reduce phase?

Thanks,
Pramod
Speeding up Spark build during development
Hi,

I'm making some small changes to the Spark codebase and trying it out on a cluster. I was wondering if there's a faster way to build than running the package target each time. Currently I'm using:

    mvn -DskipTests package

All the nodes have the same filesystem mounted at the same mount point.

Pramod