Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread Pramod Biligiri
I hadn't turned on codegen. I enabled it and ran it again, and it is running
4-5 times faster now! :)
Since my log statements are no longer appearing, I presume the code path is
quite different from the earlier hashmap-related code in Aggregates.scala?
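
In case it helps anyone else, here is a minimal sketch of how I am enabling
it (assuming the 1.3/1.4-era spark.sql.codegen flag; the table name below is
just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("codegen-test"))
    val sqlContext = new SQLContext(sc)
    // Compile expression evaluation to JVM bytecode at runtime
    sqlContext.setConf("spark.sql.codegen", "true")
    // Aggregations issued after this point should take the generated-code path
    sqlContext.sql("SELECT key, SUM(value) FROM t GROUP BY key").show()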

Pramod

On Wed, May 20, 2015 at 9:18 PM, Reynold Xin r...@databricks.com wrote:

 Does this turn codegen on? I think the performance is fairly different
 when codegen is turned on.

 For 1.5, we are investigating having codegen on by default, so users get
 much better performance out of the box.


 On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri pramodbilig...@gmail.com
  wrote:

 Hi,
 Somewhat similar to Daniel Mescheder's mail yesterday on SparkSql, I have
 a data point regarding the performance of Group By, indicating there's
 excessive GC and it's impacting the throughput. I want to know if the new
 memory manager for aggregations (
 https://github.com/apache/spark/pull/5725/) is going to address this
 kind of issue.

 I only have a small amount of data on each node (~360MB) with a large
 heap size (18 GB). I still see 2-3 minor collections happening whenever I
 do a SELECT SUM() with a GROUP BY. I have tried different Young Generation
 sizes without much effect, though not different GC algorithms (hmm, I ought
 to try reducing the RDD storage fraction, perhaps).

 I have made a chart of my results [1] by adding timing code to
 Aggregates.scala. The query is actually Query 2 from Berkeley's AmpLab
 benchmark, running over 10 million records. The chart is from one of the 4
 worker nodes in the cluster.

 I am trying to square this with a claim on the Project Tungsten blog post
 [2]: "When profiling Spark user applications, we’ve found that a large
 fraction of the CPU time is spent waiting for data to be fetched from main
 memory."

 Am I correct in assuming that SparkSql is yet to reach that level of
 efficiency, at least in aggregation operations?

 Thanks.

 [1] -
 https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
 [2]
 https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

 Pramod





Re: Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread Pramod Biligiri
Hi Zhang,
No, my data is not compressed; I'm trying to minimize the load on the CPU.
The GC time went down for me after enabling codegen.

Pramod

On Thu, May 21, 2015 at 3:43 AM, zhangxiongfei zhangxiongfei0...@163.com
wrote:

 Hi Pramod

  Is your data compressed? I encountered a similar problem; however, after
 turning codegen on, the GC time was still very long. The input data for my
 map task is about a 100 MB lzo file.
 My query is select ip, count(*) as c from stage_bitauto_adclick_d group
 by ip sort by c limit 100

 Thanks
 Zhang Xiongfei



 At 2015-05-21 12:18:35, Reynold Xin r...@databricks.com wrote:

 Does this turn codegen on? I think the performance is fairly different
 when codegen is turned on.

 For 1.5, we are investigating having codegen on by default, so users get
 much better performance out of the box.


 On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri pramodbilig...@gmail.com
  wrote:

 Hi,
 Somewhat similar to Daniel Mescheder's mail yesterday on SparkSql, I have
 a data point regarding the performance of Group By, indicating there's
 excessive GC and it's impacting the throughput. I want to know if the new
 memory manager for aggregations (
 https://github.com/apache/spark/pull/5725/) is going to address this
 kind of issue.

 I only have a small amount of data on each node (~360MB) with a large
 heap size (18 GB). I still see 2-3 minor collections happening whenever I
 do a SELECT SUM() with a GROUP BY. I have tried different Young Generation
 sizes without much effect, though not different GC algorithms (hmm, I ought
 to try reducing the RDD storage fraction, perhaps).

 I have made a chart of my results [1] by adding timing code to
 Aggregates.scala. The query is actually Query 2 from Berkeley's AmpLab
 benchmark, running over 10 million records. The chart is from one of the 4
 worker nodes in the cluster.

 I am trying to square this with a claim on the Project Tungsten blog post
 [2]: "When profiling Spark user applications, we’ve found that a large
 fraction of the CPU time is spent waiting for data to be fetched from main
 memory."

 Am I correct in assuming that SparkSql is yet to reach that level of
 efficiency, at least in aggregation operations?

 Thanks.

 [1] -
 https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
 [2]
 https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

 Pramod







Low throughput and effect of GC in SparkSql GROUP BY

2015-05-20 Thread Pramod Biligiri
Hi,
Somewhat similar to Daniel Mescheder's mail yesterday on SparkSql, I have a
data point regarding the performance of Group By, indicating there's
excessive GC and it's impacting the throughput. I want to know if the new
memory manager for aggregations (https://github.com/apache/spark/pull/5725/)
is going to address this kind of issue.

I only have a small amount of data on each node (~360MB) with a large heap
size (18 GB). I still see 2-3 minor collections happening whenever I do a
SELECT SUM() with a GROUP BY. I have tried different Young Generation sizes
without much effect, though not different GC algorithms (hmm, I ought to try
reducing the RDD storage fraction, perhaps).
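
For reference, a sketch of the knobs I mean (the values are illustrative
only, not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("groupby-gc-test")
      // Bigger young generation plus GC logging on each executor JVM
      .set("spark.executor.extraJavaOptions",
           "-Xmn4g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
      // Shrink the RDD storage fraction (Spark 1.x static memory manager)
      .set("spark.storage.memoryFraction", "0.3")
    val sc = new SparkContext(conf)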

I have made a chart of my results [1] by adding timing code to
Aggregates.scala. The query is actually Query 2 from Berkeley's AmpLab
benchmark, running over 10 million records. The chart is from one of the 4
worker nodes in the cluster.
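
For context, the query has roughly this shape (column names follow the
benchmark's uservisits schema; the prefix length is a placeholder, and I am
assuming a HiveContext with the benchmark tables already registered):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("amplab-q2"))
    val sqlContext = new HiveContext(sc)
    // Aggregate ad revenue per source IP prefix, as in AmpLab Query 2
    val result = sqlContext.sql("""
      SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue)
      FROM uservisits
      GROUP BY SUBSTR(sourceIP, 1, 8)
    """)
    result.collect()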

I am trying to square this with a claim on the Project Tungsten blog post
[2]: "When profiling Spark user applications, we’ve found that a large
fraction of the CPU time is spent waiting for data to be fetched from main
memory."

Am I correct in assuming that SparkSql is yet to reach that level of
efficiency, at least in aggregation operations?

Thanks.

[1] -
https://docs.google.com/spreadsheets/d/1HSqYfic3n5s9i4Wsi1Qg0FKN_AWz2vV7_6RRMrtzplQ/edit#gid=481134174
[2]
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

Pramod


Re: unable to extract tgz files downloaded from spark

2015-05-07 Thread Pramod Biligiri
This happens sometimes when the download gets stopped or corrupted. You can
verify the integrity of your file by comparing it against the md5 and sha
signatures published here: http://www.apache.org/dist/spark/spark-1.3.1/
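
If it helps, a small sketch of computing the MD5 on the JVM (md5sum or
shasum on the command line work just as well; the file name is an example):

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    // Reads the whole archive into memory, so this is only a sketch for
    // moderately sized downloads.
    def md5Hex(path: String): String = {
      val bytes = Files.readAllBytes(Paths.get(path))
      MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString
    }

    println(md5Hex("spark-1.3.1.tgz"))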

Pramod

On Wed, May 6, 2015 at 7:16 PM, Praveen Kumar Muthuswamy 
muthusamy...@gmail.com wrote:

 Hi
 I have been trying to install the latest Spark version and downloaded the
 .tgz files (e.g. spark-1.3.1.tgz), but I could not extract them. It
 complains of an invalid tar format.
 Has anyone seen this issue?

 Thanks
 Praveen



Re: Speeding up Spark build during development

2015-05-06 Thread Pramod Biligiri
I had to make a small change to Emre's suggestion above, in order for my
changes to get picked up. This worked for me:
mvn --projects sql/core -DskipTests install #not package
mvn --projects assembly/ -DskipTests install

Pramod

On Tue, May 5, 2015 at 2:36 AM, Iulian Dragoș iulian.dra...@typesafe.com
wrote:

 I'm probably the only Eclipse user here, but it seems I have the best
  workflow :) At least for me things work as they should: once I've imported
  the projects into the workspace I can build and run/debug tests from the IDE. I
 only go to sbt when I need to re-create projects or I want to run the full
 test suite.


 iulian



 On Tue, May 5, 2015 at 7:35 AM, Tathagata Das t...@databricks.com wrote:

   In addition to Michael's suggestion, in my SBT workflow I also use ~ to
   automatically kick off builds and unit tests. For example,
 
  sbt/sbt ~streaming/test-only *BasicOperationsSuite*
 
   It will automatically detect any file changes in the project and kick off
   the compilation and testing.
   So my full workflow involves changing code in IntelliJ and then
   continuously running unit tests in the background on the command line
   using this ~.
 
  TD
 
 
  On Mon, May 4, 2015 at 2:49 PM, Michael Armbrust mich...@databricks.com
 
  wrote:
 
    FWIW... My Spark SQL development workflow is usually to run build/sbt
    sparkShell or build/sbt 'sql/test-only testSuiteName'. These commands
    start in as little as 30s on my laptop, automatically figure out which
    subprojects need to be rebuilt, and don't require the expensive assembly
    creation.
  
   On Mon, May 4, 2015 at 5:48 AM, Meethu Mathew 
 meethu.mat...@flytxt.com
   wrote:
  
 Hi,

 Is it really necessary to run mvn --projects assembly/ -DskipTests
 install? Could you please explain why this is needed?
 I got the changes after running mvn --projects streaming/ -DskipTests
 package.
   
Regards,
Meethu
   
   
On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote:
   
Just to give you an example:
   
When I was trying to make a small change only to the Streaming component of
Spark, first I built and installed the whole Spark project (this took about
15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files
only in Streaming, I ran something like (in the top-level directory):

mvn --projects streaming/ -DskipTests package

and then

mvn --projects assembly/ -DskipTests install

This was much faster than trying to build the whole Spark from scratch,
because Maven was only building one component, in my case the Streaming
component, of Spark. I think you can use a very similar approach.
   
--
Emre Sevinç
   
   
   
On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri 
pramodbilig...@gmail.com
wrote:
   
 No, I just need to build one project at a time. Right now SparkSql.
   
Pramod
   
On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc 
 emre.sev...@gmail.com
wrote:
   
 Hello Pramod,
   
Do you need to build the whole project every time? Generally you
   don't,
e.g., when I was changing some files that belong only to Spark
Streaming, I
was building only the streaming (of course after having built and
installed
the whole project, but that was done only once), and then the
   assembly.
This was much faster than trying to build the whole Spark every
  time.
   
--
Emre Sevinç
   
On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri 
pramodbilig...@gmail.com
   
wrote:
Using the inbuilt Maven and Zinc, it takes around 10 minutes for each build.
Is that reasonable?
My Maven opts look like this:
$ echo $MAVEN_OPTS
-Xmx12000m -XX:MaxPermSize=2048m
   
I'm running it as build/mvn -DskipTests package
   
Should I be tweaking my Zinc/Nailgun config?
   
Pramod
   
On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra 
   m...@clearstorydata.com
wrote:
   
   
   
   
  
 
 https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn
   
On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri 
   
pramodbilig...@gmail.com
   
wrote:
   
 This is great. I didn't know about the mvn script in the build
   
directory.
   
Pramod
   
On Fri, May 1, 2015 at 9:51 AM, York, Brennon 
brennon.y...@capitalone.com
wrote:
   
 Following what Ted said, if you leverage the `mvn` from within the
 `build/` directory of Spark you'll get zinc for free which should help
 speed up build times.
   
On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:
   
 Pramod:
Please remember to run Zinc so that the build is faster.
   
Cheers
   
On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
alexander.ula...@hp.com
wrote:
   
 Hi Pramod,
   
For cluster-like tests you might want to use the same code

Re: Speeding up Spark build during development

2015-05-04 Thread Pramod Biligiri
Using the inbuilt Maven and Zinc, it takes around 10 minutes for each build.
Is that reasonable?
My Maven opts look like this:
$ echo $MAVEN_OPTS
-Xmx12000m -XX:MaxPermSize=2048m

I'm running it as build/mvn -DskipTests package

Should I be tweaking my Zinc/Nailgun config?

Pramod

On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com
wrote:


 https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn

 On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com
 wrote:

 This is great. I didn't know about the mvn script in the build directory.

 Pramod

 On Fri, May 1, 2015 at 9:51 AM, York, Brennon 
 brennon.y...@capitalone.com
 wrote:

  Following what Ted said, if you leverage the `mvn` from within the
  `build/` directory of Spark you'll get zinc for free which should help
  speed up build times.
 
  On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:
 
  Pramod:
  Please remember to run Zinc so that the build is faster.
  
  Cheers
  
  On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
  alexander.ula...@hp.com
  wrote:
  
   Hi Pramod,
  
   For cluster-like tests you might want to use the same code as in
 mllib's
   LocalClusterSparkContext. You can rebuild only the package that you
  change
   and then run this main class.
  
   Best regards, Alexander
  
   -Original Message-
   From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
   Sent: Friday, May 01, 2015 1:46 AM
   To: dev@spark.apache.org
   Subject: Speeding up Spark build during development
  
   Hi,
   I'm making some small changes to the Spark codebase and trying it out
  on a
   cluster. I was wondering if there's a faster way to build than
 running
  the
   package target each time.
   Currently I'm using: mvn -DskipTests  package
  
   All the nodes have the same filesystem mounted at the same mount
 point.
  
   Pramod
  
 
  
 
  The information contained in this e-mail is confidential and/or
  proprietary to Capital One and/or its affiliates. The information
  transmitted herewith is intended only for use by the individual or
 entity
  to which it is addressed.  If the reader of this message is not the
  intended recipient, you are hereby notified that any review,
  retransmission, dissemination, distribution, copying or other use of, or
  taking of any action in reliance upon this information is strictly
  prohibited. If you have received this communication in error, please
  contact the sender and delete the material from your computer.
 
 





Re: Speeding up Spark build during development

2015-05-04 Thread Pramod Biligiri
No, I just need to build one project at a time. Right now SparkSql.

Pramod

On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello Pramod,

 Do you need to build the whole project every time? Generally you don't,
 e.g., when I was changing some files that belong only to Spark Streaming, I
 was building only the streaming (of course after having built and installed
 the whole project, but that was done only once), and then the assembly.
 This was much faster than trying to build the whole Spark every time.

 --
 Emre Sevinç

 On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri pramodbilig...@gmail.com
 wrote:

  Using the inbuilt Maven and Zinc, it takes around 10 minutes for each
  build.
  Is that reasonable?
  My Maven opts look like this:
 $ echo $MAVEN_OPTS
 -Xmx12000m -XX:MaxPermSize=2048m

 I'm running it as build/mvn -DskipTests package

 Should I be tweaking my Zinc/Nailgun config?

 Pramod

 On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 
 
 https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn
 
  On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri 
 pramodbilig...@gmail.com
  wrote:
 
  This is great. I didn't know about the mvn script in the build
 directory.
 
  Pramod
 
  On Fri, May 1, 2015 at 9:51 AM, York, Brennon 
  brennon.y...@capitalone.com
  wrote:
 
   Following what Ted said, if you leverage the `mvn` from within the
   `build/` directory of Spark you'll get zinc for free which should help
   speed up build times.
  
   On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:
  
   Pramod:
   Please remember to run Zinc so that the build is faster.
   
   Cheers
   
   On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
   alexander.ula...@hp.com
   wrote:
   
Hi Pramod,
   
For cluster-like tests you might want to use the same code as in
  mllib's
LocalClusterSparkContext. You can rebuild only the package that
 you
   change
and then run this main class.
   
Best regards, Alexander
   
-Original Message-
From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
Sent: Friday, May 01, 2015 1:46 AM
To: dev@spark.apache.org
Subject: Speeding up Spark build during development
   
Hi,
I'm making some small changes to the Spark codebase and trying it
 out
   on a
cluster. I was wondering if there's a faster way to build than
  running
   the
package target each time.
Currently I'm using: mvn -DskipTests  package
   
All the nodes have the same filesystem mounted at the same mount
  point.
   
Pramod
   
  
   
  
   The information contained in this e-mail is confidential and/or
   proprietary to Capital One and/or its affiliates. The information
   transmitted herewith is intended only for use by the individual or
  entity
   to which it is addressed.  If the reader of this message is not the
   intended recipient, you are hereby notified that any review,
   retransmission, dissemination, distribution, copying or other use
 of, or
   taking of any action in reliance upon this information is strictly
   prohibited. If you have received this communication in error, please
   contact the sender and delete the material from your computer.
  
  
 
 
 




 --
 Emre Sevinc



Re: Why does SortShuffleWriter write to disk always?

2015-05-03 Thread Pramod Biligiri
Thanks for the info. I agree, it makes sense the way it is designed.
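
For my small-jobs case, one workaround I can still try is pointing
spark.local.dir at a tmpfs mount, so the shuffle files land in RAM-backed
storage at the OS level. A sketch (the path is just an example, not a
recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    // Shuffle files are written under spark.local.dir, so a RAM-backed
    // mount keeps the "disk" writes in memory without changing Spark itself.
    val conf = new SparkConf()
      .setAppName("shuffle-on-tmpfs")
      .set("spark.local.dir", "/dev/shm/spark-tmp")
    val sc = new SparkContext(conf)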

Pramod

On Sat, May 2, 2015 at 10:37 PM, Mridul Muralidharan mri...@gmail.com
wrote:

 I agree, this is better handled by the filesystem cache - not to
 mention, being able to do zero copy writes.

 Regards,
 Mridul

 On Sat, May 2, 2015 at 10:26 PM, Reynold Xin r...@databricks.com wrote:
  I've personally prototyped completely in-memory shuffle for Spark 3 times.
  However, it is unclear how big of a gain it would be to put all of this in
  memory under newer file systems (ext4, xfs). If the shuffle data is small,
  they are still in the file system buffer cache anyway. Note that network
  throughput is often lower than disk throughput, so it won't be a problem to
  read them from disk. And not having to keep all of this stuff in memory
  substantially simplifies memory management.
 
 
 
  On Fri, May 1, 2015 at 7:59 PM, Pramod Biligiri 
 pramodbilig...@gmail.com
  wrote:
 
  Hi,
  I was trying to see if I can make Spark avoid hitting the disk for small
  jobs, but I see that the SortShuffleWriter.write() always writes to
 disk. I
  found an older thread (
 
 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
  )
  saying that it doesn't call fsync on this write path.
 
  My question is why does it always write to disk?
  Does it mean the reduce phase reads the result from the disk as well?
  Isn't it possible to read the data from map/buffer in ExternalSorter
  directly during the reduce phase?
 
  Thanks,
  Pramod
 



Re: Speeding up Spark build during development

2015-05-03 Thread Pramod Biligiri
This is great. I didn't know about the mvn script in the build directory.

Pramod

On Fri, May 1, 2015 at 9:51 AM, York, Brennon brennon.y...@capitalone.com
wrote:

 Following what Ted said, if you leverage the `mvn` from within the
 `build/` directory of Spark you'll get zinc for free which should help
 speed up build times.

 On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:

 Pramod:
 Please remember to run Zinc so that the build is faster.
 
 Cheers
 
 On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
 alexander.ula...@hp.com
 wrote:
 
  Hi Pramod,
 
   For cluster-like tests you might want to use the same code as in mllib's
   LocalClusterSparkContext. You can rebuild only the package that you change
   and then run this main class.
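
   A rough sketch of what that setup looks like (worker count, cores, and
   memory per worker in MB are placeholders):

       import org.apache.spark.{SparkConf, SparkContext}

       // "local-cluster[workers, coresPerWorker, memoryPerWorkerMB]" spins up
       // separate executor JVMs on one machine, closer to a real cluster than
       // plain local mode.
       val conf = new SparkConf()
         .setMaster("local-cluster[2, 1, 512]")
         .setAppName("local-cluster-test")
       val sc = new SparkContext(conf)
       try {
         // Exercise the rebuilt package here
         println(sc.parallelize(1 to 100).sum())
       } finally {
         sc.stop()
       }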
 
  Best regards, Alexander
 
  -Original Message-
  From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
  Sent: Friday, May 01, 2015 1:46 AM
  To: dev@spark.apache.org
  Subject: Speeding up Spark build during development
 
  Hi,
  I'm making some small changes to the Spark codebase and trying it out
 on a
  cluster. I was wondering if there's a faster way to build than running
 the
  package target each time.
  Currently I'm using: mvn -DskipTests  package
 
  All the nodes have the same filesystem mounted at the same mount point.
 
  Pramod
 

 

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed.  If the reader of this message is not the
 intended recipient, you are hereby notified that any review,
 retransmission, dissemination, distribution, copying or other use of, or
 taking of any action in reliance upon this information is strictly
 prohibited. If you have received this communication in error, please
 contact the sender and delete the material from your computer.




Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Pramod Biligiri
Hi,
I was trying to see if I can make Spark avoid hitting the disk for small
jobs, but I see that the SortShuffleWriter.write() always writes to disk. I
found an older thread (
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html)
saying that it doesn't call fsync on this write path.

My question is why does it always write to disk?
Does it mean the reduce phase reads the result from the disk as well?
Isn't it possible to read the data from map/buffer in ExternalSorter
directly during the reduce phase?

Thanks,
Pramod


Speeding up Spark build during development

2015-05-01 Thread Pramod Biligiri
Hi,
I'm making some small changes to the Spark codebase and trying it out on a
cluster. I was wondering if there's a faster way to build than running the
package target each time.
Currently I'm using: mvn -DskipTests  package

All the nodes have the same filesystem mounted at the same mount point.

Pramod