Re: Re: Re: Re: Re: Will the HiveContext cause memory leak ?
Sorry, the bug link in my previous mail was wrong. Here is the real link: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html

At 2016-05-13 09:49:05, "李明伟" <kramer2...@126.com> wrote:

It seems we hit the same issue. There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1. Here is the link about the bug in 1.5.1: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" <ml-node+s1001560n2694...@n3.nabble.com> wrote:

I read from a port with Spark Streaming. The incoming data consists of key and value pairs. I call foreachRDD on each window, create a Dataset from the window, and run some SQL queries on it. On the result I only call show, to see the content. It works well, but the memory usage keeps increasing; when it reaches the maximum, nothing works anymore. When I use more memory, the program runs somewhat longer, but the problem persists. Because I run a program that writes to the port, I can control exactly how much data Spark has to process. The problem is the same whether I write one key/value pair every millisecond or only one per second.

When I don't create a Dataset in foreachRDD and only count the elements in the RDD, everything works fine. I also use groupBy and agg functions in the queries.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
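The per-window query Simon describes (group the incoming key/value pairs by key and sum the values) can be sketched in plain Python. This is only an illustration of the query's semantics, not the actual Spark Streaming code; the function name and sample data are made up:

```python
from collections import defaultdict

def aggregate_window(pairs):
    """Group (key, value) pairs by key and sum the values,
    mirroring a groupBy(...).agg(sum(...)) query on one window."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

window = [("a", 1), ("b", 2), ("a", 3)]
print(aggregate_window(window))  # {'a': 4, 'b': 2}
```

In Spark this aggregation alone is cheap; the reported problem only appears once a Dataset is built and queried inside foreachRDD.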
Re: Will the HiveContext cause memory leak ?
The link below doesn't refer to a specific bug. Can you send the correct link? Thanks.

> On May 12, 2016, at 6:50 PM, "kramer2...@126.com" <kramer2...@126.com> wrote:
>
> It seems we hit the same issue. There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1. Here is the link about the bug in 1.5.1:
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>
> At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" <[hidden email]> wrote:
>
> I read from a port with Spark Streaming. The incoming data consists of key and value pairs. I call foreachRDD on each window, create a Dataset from the window, and run some SQL queries on it. On the result I only call show, to see the content. It works well, but the memory usage keeps increasing; when it reaches the maximum, nothing works anymore. When I use more memory, the program runs somewhat longer, but the problem persists. Because I run a program that writes to the port, I can control exactly how much data Spark has to process. The problem is the same whether I write one key/value pair every millisecond or only one per second.
>
> When I don't create a Dataset in foreachRDD and only count the elements in the RDD, everything works fine. I also use groupBy and agg functions in the queries.
Re: Re: Re: Re: Will the HiveContext cause memory leak ?
It seems we hit the same issue. There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1. Here is the link about the bug in 1.5.1: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" <ml-node+s1001560n2694...@n3.nabble.com> wrote:

I read from a port with Spark Streaming. The incoming data consists of key and value pairs. I call foreachRDD on each window, create a Dataset from the window, and run some SQL queries on it. On the result I only call show, to see the content. It works well, but the memory usage keeps increasing; when it reaches the maximum, nothing works anymore. When I use more memory, the program runs somewhat longer, but the problem persists. Because I run a program that writes to the port, I can control exactly how much data Spark has to process. The problem is the same whether I write one key/value pair every millisecond or only one per second.

When I don't create a Dataset in foreachRDD and only count the elements in the RDD, everything works fine. I also use groupBy and agg functions in the queries.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26946.html
Re: Re: Will the HiveContext cause memory leak ?
Hi Simon,

Can you describe your problem in more detail? I suspect that my problem is caused by the window function (or maybe by the groupBy agg functions). If yours is the same, maybe we should report a bug.

At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]" <ml-node+s1001560n26930...@n3.nabble.com> wrote:

I have the same problem with a Spark 2.0.0 snapshot with Streaming. There I use Datasets instead of DataFrames. I hope you or someone will find a solution.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26934.html
Re: Will the HiveContext cause memory leak ?
Sorry, I have to correct myself again. It may still be a memory leak, because eventually the memory usage goes up again and the streaming program crashed.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26933.html
Re: Will the HiveContext cause memory leak ?
After 8 hours, the memory usage became stable. The top command shows it at 75%, which means 12 GB of memory. But it still does not make sense, because my workload is very small: I use Spark to process one CSV file every 20 seconds, and the file is only 1.3 MB. So Spark is using almost 10,000 times more memory than my workload. Does that mean I would need 1 TB of RAM if the workload were 100 MB?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26927.html
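As a quick sanity check on that ratio (assuming a 16 GB machine, so 75% in top is about 12 GB resident):

```python
memory_bytes = 12 * 1024**3        # ~12 GB resident executor memory, per top
workload_bytes = 1.3 * 1024**2     # one 1.3 MB CSV file per 20-second batch
ratio = memory_bytes / workload_bytes
print(round(ratio))                # roughly 9500, i.e. close to 10,000x
```

So the "10,000 times" figure quoted above is consistent with the numbers given.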
Re: Re: Will the HiveContext cause memory leak ?
Hi Ted,

Spark version: spark-1.6.0-bin-hadoop2.6. I tried increasing the memory of the executor, but I still have the same problem. I can use jmap to capture something, but the output is too difficult to understand.

At 2016-05-11 11:50:14, "Ted Yu" <yuzhih...@gmail.com> wrote:

Which Spark release are you using? I assume the executor crashed due to an OOME. Did you have a chance to capture a jmap on the executor before it crashed? Have you tried giving more memory to the executor?

Thanks

On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com <kramer2...@126.com> wrote:

I submitted my code to a Spark standalone cluster and found that the memory usage of the executor process keeps growing, which causes the program to crash. I modified the code and submitted it several times, and found that the four lines below may be causing the issue:

    dataframe = dataframe.groupBy(['router', 'interface']).agg(func.sum('bits').alias('bits'))
    windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
    rank = func.dense_rank().over(windowSpec)
    ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'], rank.alias('rank')).filter("rank<=2")

It looks a little complicated, but it is just some window functions on a DataFrame. I use the HiveContext because the SQLContext does not support window functions yet. Without these four lines my code can run all night; adding them causes the memory leak, and the program crashes in a few hours.

I have provided the whole code (50 lines) here: ForAsk01.py <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>

Please advise me if it is a bug.

Also here is the submit command:

    nohup ./bin/spark-submit \
        --master spark://ES01:7077 \
        --executor-memory 4G \
        --num-executors 1 \
        --total-executor-cores 1 \
        --conf "spark.storage.memoryFraction=0.2" \
        ./ForAsk.py 1>a.log 2>b.log &

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Re: Will the HiveContext cause memory leak ?
Which Spark release are you using? I assume the executor crashed due to an OOME. Did you have a chance to capture a jmap on the executor before it crashed? Have you tried giving more memory to the executor?

Thanks

On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com <kramer2...@126.com> wrote:
> I submitted my code to a Spark standalone cluster and found that the memory usage of the executor process keeps growing, which causes the program to crash. I modified the code and submitted it several times, and found that the four lines below may be causing the issue:
>
>     dataframe = dataframe.groupBy(['router', 'interface']).agg(func.sum('bits').alias('bits'))
>     windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
>     rank = func.dense_rank().over(windowSpec)
>     ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'], rank.alias('rank')).filter("rank<=2")
>
> It looks a little complicated, but it is just some window functions on a DataFrame. I use the HiveContext because the SQLContext does not support window functions yet. Without these four lines my code can run all night; adding them causes the memory leak, and the program crashes in a few hours.
>
> I have provided the whole code (50 lines) here: ForAsk01.py
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
>
> Please advise me if it is a bug.
>
> Also here is the submit command:
>
>     nohup ./bin/spark-submit \
>         --master spark://ES01:7077 \
>         --executor-memory 4G \
>         --num-executors 1 \
>         --total-executor-cores 1 \
>         --conf "spark.storage.memoryFraction=0.2" \
>         ./ForAsk.py 1>a.log 2>b.log &
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Will the HiveContext cause memory leak ?
I submitted my code to a Spark standalone cluster and found that the memory usage of the executor process keeps growing, which causes the program to crash. I modified the code and submitted it several times, and found that the four lines below may be causing the issue:

    dataframe = dataframe.groupBy(['router', 'interface']).agg(func.sum('bits').alias('bits'))
    windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
    rank = func.dense_rank().over(windowSpec)
    ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'], rank.alias('rank')).filter("rank<=2")

It looks a little complicated, but it is just some window functions on a DataFrame. I use the HiveContext because the SQLContext does not support window functions yet. Without these four lines my code can run all night; adding them causes the memory leak, and the program crashes in a few hours.

I have provided the whole code (50 lines) here: ForAsk01.py <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>

Please advise me if it is a bug.

Also here is the submit command:

    nohup ./bin/spark-submit \
        --master spark://ES01:7077 \
        --executor-memory 4G \
        --num-executors 1 \
        --total-executor-cores 1 \
        --conf "spark.storage.memoryFraction=0.2" \
        ./ForAsk.py 1>a.log 2>b.log &

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
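For anyone trying to follow what those four lines compute, here is a plain-Python illustration of their semantics (sum bits per router/interface pair, then keep the top two interfaces per router by dense rank). It is not the PySpark code itself, and the sample data is made up:

```python
from collections import defaultdict

def top2_by_bits(rows):
    """rows: iterable of (router, interface, bits) records.
    Returns (router, interface, bits, rank) for the interfaces whose
    summed bits rank in the top two per router, using dense ranking."""
    # groupBy(['router', 'interface']).agg(sum('bits'))
    totals = defaultdict(int)
    for router, interface, bits in rows:
        totals[(router, interface)] += bits
    # partitionBy(router) for the window
    per_router = defaultdict(list)
    for (router, interface), bits in totals.items():
        per_router[router].append((interface, bits))
    result = []
    for router, entries in per_router.items():
        # dense_rank over orderBy(bits desc): equal totals share a rank,
        # and ranks have no gaps
        distinct = sorted({b for _, b in entries}, reverse=True)
        rank_of = {b: i + 1 for i, b in enumerate(distinct)}
        for interface, bits in entries:
            rank = rank_of[bits]
            if rank <= 2:  # filter("rank<=2")
                result.append((router, interface, bits, rank))
    return result

rows = [("r1", "i1", 5), ("r1", "i1", 5), ("r1", "i2", 7),
        ("r1", "i3", 2), ("r2", "i1", 1)]
print(top2_by_bits(rows))  # top two interfaces per router, with dense rank
```

In Spark, each of these steps runs as a distributed shuffle and window aggregation per batch, which is where the memory pressure described in this thread shows up.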