Re: Spark UI consuming lots of memory
Thanks for your help; most likely this is the memory leak you are fixing in
https://issues.apache.org/jira/browse/SPARK-11126.

-Nick

On Mon, Oct 12, 2015 at 9:00 PM, Shixiong Zhu wrote:

> In addition, you cannot turn off the JobListener and SQLListener now...
>
> Best Regards,
> Shixiong Zhu
>
> 2015-10-13 11:59 GMT+08:00 Shixiong Zhu:
>
>> Is your query very complicated? Could you provide the output of `explain`
>> for the query that consumes an excessive amount of memory? If this is a
>> small query, there may be a bug that leaks memory in SQLListener.
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-10-13 11:44 GMT+08:00 Nicholas Pritchard
>> <nicholas.pritch...@falkonry.com>:
>>
>>> As an update, I did try disabling the UI with "spark.ui.enabled=false",
>>> but the JobListener and SQLListener still consume a lot of memory,
>>> leading to an OOM error. Has anyone encountered this before? Is the only
>>> solution just to increase the driver heap size?
>>>
>>> Thanks,
>>> Nick
>>>
>>> On Mon, Oct 12, 2015 at 8:42 PM, Nicholas Pritchard
>>> <nicholas.pritch...@falkonry.com> wrote:
>>>
>>>> I set those configurations by passing them to the spark-submit script:
>>>> "bin/spark-submit --conf spark.ui.retainedJobs=20 ...". I have verified
>>>> that these configurations are being passed correctly because they are
>>>> listed in the Environment tab and also by counting the number of
>>>> jobs/stages that are listed. The "spark.sql.ui.retainedExecutions=0"
>>>> setting only applies to the number of "completed" executions; there
>>>> will always be a "running" execution. For some reason, I have one
>>>> execution that consumes an excessive amount of memory.
>>>>
>>>> Actually, I am not interested in the SQL UI, as I find the Jobs/Stages
>>>> UI to have sufficient information. I am also using the Spark Standalone
>>>> cluster manager, so I have not had to use the history server.
>>>>
>>>> On Mon, Oct 12, 2015 at 8:17 PM, Shixiong Zhu wrote:
>>>>
>>>>> Could you show how you set the configurations? You need to set these
>>>>> configurations before creating the SparkContext and SQLContext.
>>>>>
>>>>> Moreover, the history server doesn't support the SQL UI, so
>>>>> "spark.eventLog.enabled=true" doesn't work now.
>>>>>
>>>>> Best Regards,
>>>>> Shixiong Zhu
>>>>>
>>>>> 2015-10-13 2:01 GMT+08:00 pnpritchard:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In my application, the Spark UI is consuming a lot of memory,
>>>>>> especially the SQL tab. I have set the following configurations to
>>>>>> reduce the memory consumption:
>>>>>> - spark.ui.retainedJobs=20
>>>>>> - spark.ui.retainedStages=40
>>>>>> - spark.sql.ui.retainedExecutions=0
>>>>>>
>>>>>> However, I still get OOM errors in the driver process with the
>>>>>> default 1GB heap size. The following link is a screenshot of a heap
>>>>>> dump report, showing the SQLListener instance with a retained size
>>>>>> of 600MB.
>>>>>>
>>>>>> https://cloud.githubusercontent.com/assets/5124612/10404379/20fbdcfc-6e87-11e5-9415-27e25193a25c.png
>>>>>>
>>>>>> Rather than just increasing the allotted heap size, does anyone have
>>>>>> any other ideas? Is it possible to disable the SQL tab specifically?
>>>>>> I also thought about serving the UI from disk rather than memory with
>>>>>> "spark.eventLog.enabled=true" and "spark.ui.enabled=false". Has
>>>>>> anyone tried this before?
>>>>>>
>>>>>> Thanks,
>>>>>> Nick
Re: Spark UI consuming lots of memory
As an update, I did try disabling the UI with "spark.ui.enabled=false", but
the JobListener and SQLListener still consume a lot of memory, leading to an
OOM error. Has anyone encountered this before? Is the only solution just to
increase the driver heap size?

Thanks,
Nick

On Mon, Oct 12, 2015 at 8:42 PM, Nicholas Pritchard
<nicholas.pritch...@falkonry.com> wrote:

> I set those configurations by passing them to the spark-submit script:
> "bin/spark-submit --conf spark.ui.retainedJobs=20 ...". I have verified
> that these configurations are being passed correctly because they are
> listed in the Environment tab and also by counting the number of
> jobs/stages that are listed. The "spark.sql.ui.retainedExecutions=0"
> setting only applies to the number of "completed" executions; there will
> always be a "running" execution. For some reason, I have one execution
> that consumes an excessive amount of memory.
>
> Actually, I am not interested in the SQL UI, as I find the Jobs/Stages UI
> to have sufficient information. I am also using the Spark Standalone
> cluster manager, so I have not had to use the history server.
>
> On Mon, Oct 12, 2015 at 8:17 PM, Shixiong Zhu wrote:
>
>> Could you show how you set the configurations? You need to set these
>> configurations before creating the SparkContext and SQLContext.
>>
>> Moreover, the history server doesn't support the SQL UI, so
>> "spark.eventLog.enabled=true" doesn't work now.
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-10-13 2:01 GMT+08:00 pnpritchard:
>>
>>> Hi,
>>>
>>> In my application, the Spark UI is consuming a lot of memory, especially
>>> the SQL tab. I have set the following configurations to reduce the
>>> memory consumption:
>>> - spark.ui.retainedJobs=20
>>> - spark.ui.retainedStages=40
>>> - spark.sql.ui.retainedExecutions=0
>>>
>>> However, I still get OOM errors in the driver process with the default
>>> 1GB heap size. The following link is a screenshot of a heap dump report,
>>> showing the SQLListener instance with a retained size of 600MB.
>>>
>>> https://cloud.githubusercontent.com/assets/5124612/10404379/20fbdcfc-6e87-11e5-9415-27e25193a25c.png
>>>
>>> Rather than just increasing the allotted heap size, does anyone have any
>>> other ideas? Is it possible to disable the SQL tab specifically? I also
>>> thought about serving the UI from disk rather than memory with
>>> "spark.eventLog.enabled=true" and "spark.ui.enabled=false". Has anyone
>>> tried this before?
>>>
>>> Thanks,
>>> Nick
Re: Spark UI consuming lots of memory
I set those configurations by passing them to the spark-submit script:
"bin/spark-submit --conf spark.ui.retainedJobs=20 ...". I have verified that
these configurations are being passed correctly because they are listed in
the Environment tab and also by counting the number of jobs/stages that are
listed. The "spark.sql.ui.retainedExecutions=0" setting only applies to the
number of "completed" executions; there will always be a "running" execution.
For some reason, I have one execution that consumes an excessive amount of
memory.

Actually, I am not interested in the SQL UI, as I find the Jobs/Stages UI to
have sufficient information. I am also using the Spark Standalone cluster
manager, so I have not had to use the history server.

On Mon, Oct 12, 2015 at 8:17 PM, Shixiong Zhu wrote:

> Could you show how you set the configurations? You need to set these
> configurations before creating the SparkContext and SQLContext.
>
> Moreover, the history server doesn't support the SQL UI, so
> "spark.eventLog.enabled=true" doesn't work now.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-10-13 2:01 GMT+08:00 pnpritchard:
>
>> Hi,
>>
>> In my application, the Spark UI is consuming a lot of memory, especially
>> the SQL tab. I have set the following configurations to reduce the memory
>> consumption:
>> - spark.ui.retainedJobs=20
>> - spark.ui.retainedStages=40
>> - spark.sql.ui.retainedExecutions=0
>>
>> However, I still get OOM errors in the driver process with the default
>> 1GB heap size. The following link is a screenshot of a heap dump report,
>> showing the SQLListener instance with a retained size of 600MB.
>>
>> https://cloud.githubusercontent.com/assets/5124612/10404379/20fbdcfc-6e87-11e5-9415-27e25193a25c.png
>>
>> Rather than just increasing the allotted heap size, does anyone have any
>> other ideas? Is it possible to disable the SQL tab specifically? I also
>> thought about serving the UI from disk rather than memory with
>> "spark.eventLog.enabled=true" and "spark.ui.enabled=false". Has anyone
>> tried this before?
>>
>> Thanks,
>> Nick
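A minimal sketch (not from the thread) of setting the same retention limits programmatically rather than via spark-submit flags, assuming Scala and a Spark 1.5-era API; the app name is made up for the example. As noted above, the values only take effect if they are in the SparkConf before the SparkContext and SQLContext are created, because the UI listeners read them when they are constructed.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Retention limits must be set before the contexts are built; changing
  // them afterwards has no effect on listeners that are already registered.
  val conf = new SparkConf()
    .setAppName("ui-retention-example") // hypothetical app name
    .set("spark.ui.retainedJobs", "20")
    .set("spark.ui.retainedStages", "40")
    .set("spark.sql.ui.retainedExecutions", "0")

  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)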
Re: Force RDD evaluation
Thanks, Sean! Yes, I agree that this logging would still have some cost and
so would not be used in production.

On Sat, Feb 21, 2015 at 1:37 AM, Sean Owen wrote:

> I think the cheapest possible way to force materialization is something
> like
>
> rdd.foreachPartition(i => None)
>
> I get the use case, but as you can see there is a cost: you are forced to
> materialize an RDD and cache it just to measure the computation time. In
> principle this could be taking significantly more time than not doing so,
> since otherwise several RDD stages might proceed without ever even having
> to persist intermediate results in memory.
>
> Consider looking at the Spark UI to see how much time a stage took,
> although it's measuring end-to-end wall-clock time, which may overlap with
> other computations.
>
> (Or maybe you are disabling/enabling this logging for prod/test anyway.)
>
> On Sat, Feb 21, 2015 at 4:46 AM, pnpritchard wrote:
>
>> Is there a technique for forcing the evaluation of an RDD?
>>
>> I have used actions to do so, but even the most basic "count" has a
>> non-negligible cost (even on a cached RDD, repeated calls to count take
>> time).
>>
>> My use case is for logging the execution time of the major components in
>> my application. At the end of each component I have a statement like
>> "rdd.cache().count()" and time how long it takes.
>>
>> Thanks in advance for any advice!
>> Nick
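For illustration only (not from the thread), roughly what such a timing wrapper might look like in Scala, combining the "rdd.cache()" pattern with the cheap foreachPartition materialization Sean suggests; the helper name `timed` is made up:

  import org.apache.spark.rdd.RDD

  // Hypothetical helper: force evaluation of an RDD and log the wall-clock
  // time. foreachPartition with a no-op body computes every partition
  // without collecting any results back to the driver.
  def timed[T](label: String, rdd: RDD[T]): RDD[T] = {
    val cached = rdd.cache()           // keep the result so later stages reuse it
    val start = System.nanoTime()
    cached.foreachPartition(_ => ())   // trigger computation of all partitions
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"$label took $elapsedMs ms")
    cached
  }

  // Usage (assuming an existing RDD called `componentOutput`):
  // val materialized = timed("component A", componentOutput)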
Creating time-sequential pairs
Hi Spark community,

I have a design/algorithm question that I assume is common enough for someone
else to have tackled before. I have an RDD of time-series data formatted as
time-value tuples, RDD[(Double, Double)], and am trying to extract threshold
crossings. In order to do so, I first want to transform the RDD into pairs of
time-sequential values. For example:

Input: the time-series data:
(1, 0.05)
(2, 0.10)
(3, 0.15)

Output: transformed into time-sequential pairs:
((1, 0.05), (2, 0.10))
((2, 0.10), (3, 0.15))

My initial thought was to try to utilize a custom partitioner. This
partitioner could ensure sequential data was kept together. Then I could use
"mapPartitions" to transform these lists of sequential data. Finally, I would
need some logic for creating sequential pairs across the boundaries of each
partition.

However, I was hoping to get some feedback and ideas from the community.
Anyone have thoughts on a simpler solution?

Thanks,
Nick
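Not an answer from the thread, but one possible sketch of the pairing step: spark-mllib ships a `sliding` helper (a developer API in org.apache.spark.mllib.rdd.RDDFunctions) that already handles the partition-boundary case described above. This assumes the spark-mllib dependency is on the classpath, an existing SparkContext `sc`, and that the RDD is already sorted by time:

  import org.apache.spark.mllib.rdd.RDDFunctions._
  import org.apache.spark.rdd.RDD

  // Hypothetical input; assumes the series is already ordered by its time
  // component.
  val series: RDD[(Double, Double)] = sc.parallelize(Seq(
    (1.0, 0.05), (2.0, 0.10), (3.0, 0.15)))

  // sliding(2) yields overlapping windows of two consecutive elements and
  // takes care of pairs that straddle partition boundaries.
  val pairs: RDD[((Double, Double), (Double, Double))] =
    series.sliding(2).map { case Array(a, b) => (a, b) }

  // pairs contains ((1.0,0.05),(2.0,0.10)) and ((2.0,0.10),(3.0,0.15)),
  // which can then be filtered for threshold crossings.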