Re:Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
Sorry, the bug link in my previous mail was wrong.


Here is the real link:


http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html

At 2016-05-13 09:49:05, "李明伟" <kramer2...@126.com> wrote:

It seems we hit the same issue.

There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.

Here is the link about the bug in 1.5.1:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
<ml-node+s1001560n2694...@n3.nabble.com> wrote:
I read with Spark Streaming from a port. The incoming data consists of key and
value pairs. Then I call foreachRDD on each window. There I create a Dataset
from the window and run some SQL queries on it. On the result I only call show,
to see the content. It works well, but the memory usage increases, and when it
reaches the maximum nothing works anymore. When I use more memory, the program
runs somewhat longer, but the problem persists. Because I run a program which
writes to the port, I can control exactly how much data Spark has to process.
Whether I write one key and value pair every millisecond or only one per
second, the problem is the same.

When I don't create a Dataset in the foreachRDD and only count the elements in
the RDD, everything works fine. I also use groupBy and agg functions in the
queries.
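
(For reference, a minimal PySpark sketch of the pattern described above: a
socket source, a window, a DataFrame built inside foreachRDD, and a groupBy/agg
query whose result is only shown. The host, port, field names and window sizes
are placeholders, not taken from Simon's actual code.)

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
import pyspark.sql.functions as func

sc = SparkContext(appName="StreamingMemorySketch")
ssc = StreamingContext(sc, 1)          # 1-second batches (assumed)
sqlContext = SQLContext(sc)

# Lines of the form "key,value" arriving on a TCP port (host/port assumed)
lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.map(lambda l: l.split(",")) \
             .map(lambda kv: Row(key=kv[0], value=int(kv[1])))

windowed = pairs.window(60, 20)        # 60 s window, 20 s slide (assumed)

def process(rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(rdd)   # DataFrame created inside foreachRDD
    df.groupBy("key").agg(func.sum("value").alias("total")).show()

windowed.foreachRDD(process)

ssc.start()
ssc.awaitTermination()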


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread Ted Yu
The link below doesn't refer to a specific bug.

Can you send the correct link?

Thanks 

> On May 12, 2016, at 6:50 PM, "kramer2...@126.com" <kramer2...@126.com> wrote:
> 
> It seems we hit the same issue.
> 
> There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.
> 
> Here is the link about the bug in 1.5.1:
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> 
> 
> 
> 
> 
> At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" <[hidden 
> email]> wrote:
> I read with Spark Streaming from a port. The incoming data consists of key
> and value pairs. Then I call foreachRDD on each window. There I create a
> Dataset from the window and run some SQL queries on it. On the result I only
> call show, to see the content. It works well, but the memory usage increases,
> and when it reaches the maximum nothing works anymore. When I use more
> memory, the program runs somewhat longer, but the problem persists. Because I
> run a program which writes to the port, I can control exactly how much data
> Spark has to process. Whether I write one key and value pair every
> millisecond or only one per second, the problem is the same.
> 
> When I don't create a Dataset in the foreachRDD and only count the elements
> in the RDD, everything works fine. I also use groupBy and agg functions in
> the queries.
> 
> 
> 
>  
> 
> 
> View this message in context: Re:Re: Re:Re: Will the HiveContext cause memory 
> leak ?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
It seems we hit the same issue.

There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1.

Here is the link about the bug in 1.5.1:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
<ml-node+s1001560n2694...@n3.nabble.com> wrote:
I read with Spark Streaming from a port. The incoming data consists of key and
value pairs. Then I call foreachRDD on each window. There I create a Dataset
from the window and run some SQL queries on it. On the result I only call show,
to see the content. It works well, but the memory usage increases, and when it
reaches the maximum nothing works anymore. When I use more memory, the program
runs somewhat longer, but the problem persists. Because I run a program which
writes to the port, I can control exactly how much data Spark has to process.
Whether I write one key and value pair every millisecond or only one per
second, the problem is the same.

When I don't create a Dataset in the foreachRDD and only count the elements in
the RDD, everything works fine. I also use groupBy and agg functions in the
queries.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26946.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re:Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
Hi Simon,

Can you describe your problem in more detail?
I suspect that my problem is caused by the window function (or maybe by the
groupBy/agg functions).
If yours is the same, maybe we should report a bug.

At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]" 
<ml-node+s1001560n26930...@n3.nabble.com> wrote:
I have the same problem with a Spark 2.0.0 snapshot with Streaming. There I use
Datasets instead of DataFrames. I hope you or someone will find a solution.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26934.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
Sorry, I have to make a correction again. It may still be a memory leak,
because in the end the memory usage goes up again and eventually the streaming
program crashed.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
After 8 hours the memory usage becomes stable. The top command shows it at 75%,
which means about 12 GB of memory.

But it still does not make sense, because my workload is very small.

I use Spark to do a calculation on one CSV file every 20 seconds. The size of
the CSV file is 1.3 MB.

So Spark is using almost 10,000 times more memory than my workload. Does that
mean I would need 1 TB of RAM if the workload were 100 MB?
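
(A quick back-of-the-envelope check of those figures, as a small Python
snippet; the 16 GB total RAM is an assumption inferred from "75% is about
12 GB", the other numbers are from above.)

total_ram_gb = 16.0                    # assumed host RAM (75% of it is ~12 GB)
resident_gb = 0.75 * total_ram_gb      # ~12 GB reported by top
workload_mb = 1.3                      # one CSV file processed every 20 seconds
ratio = resident_gb * 1024 / workload_mb
print(round(ratio))                    # ~9452, i.e. roughly 10,000x the input
print(100 * ratio / (1024 * 1024))     # ~0.9 TB projected for a 100 MB workload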



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26927.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re:Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread 李明伟
Hi Ted,

Spark version: spark-1.6.0-bin-hadoop2.6
I tried increasing the memory of the executor, but I still have the same
problem.
I can use jmap to capture something, but the output is too difficult for me to
understand.

On 2016-05-11 11:50:14, "Ted Yu" <yuzhih...@gmail.com> wrote:

Which Spark release are you using ?


I assume the executor crashed due to an OOME.


Did you have a chance to capture jmap on the executor before it crashed ?


Have you tried giving more memory to the executor ?


Thanks


On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com <kramer2...@126.com> wrote:
I submit my code to a Spark standalone cluster and find that the memory usage
of the executor process keeps growing, which causes the program to crash.

I modified the code and submitted it several times, and found that the four
lines below may be causing the issue:

dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'],
                       rank.alias('rank')).filter("rank<=2")

It looks a little complicated, but it is just some window functions on a
DataFrame. I use the HiveContext because the SQLContext does not support window
functions yet. Without these 4 lines, my code can run all night; adding them
causes the memory leak and the program crashes in a few hours.

I have provided the whole code (50 lines) here: ForAsk01.py
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
Please advise me if it is a bug.

Also here is the submit command

nohup ./bin/spark-submit  \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2"  \
./ForAsk.py 1>a.log 2>b.log &





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.






Re: Will the HiveContext cause memory leak ?

2016-05-10 Thread Ted Yu
Which Spark release are you using ?

I assume the executor crashed due to an OOME.

Did you have a chance to capture jmap on the executor before it crashed ?

Have you tried giving more memory to the executor ?

Thanks

On Tue, May 10, 2016 at 8:25 PM, kramer2...@126.com <kramer2...@126.com>
wrote:

> I submit my code to a Spark standalone cluster and find that the memory usage
> of the executor process keeps growing, which causes the program to crash.
>
> I modified the code and submitted it several times, and found that the four
> lines below may be causing the issue:
>
> dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
> windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
> rank = func.dense_rank().over(windowSpec)
> ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'],
>                        rank.alias('rank')).filter("rank<=2")
>
> It looks a little complicated, but it is just some window functions on a
> DataFrame. I use the HiveContext because the SQLContext does not support
> window functions yet. Without these 4 lines, my code can run all night;
> adding them causes the memory leak and the program crashes in a few hours.
>
> I have provided the whole code (50 lines) here: ForAsk01.py
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
> Please advise me if it is a bug.
>
> Also here is the submit command
>
> nohup ./bin/spark-submit  \
> --master spark://ES01:7077 \
> --executor-memory 4G \
> --num-executors 1 \
> --total-executor-cores 1 \
> --conf "spark.storage.memoryFraction=0.2"  \
> ./ForAsk.py 1>a.log 2>b.log &
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Will the HiveContext cause memory leak ?

2016-05-10 Thread kramer2...@126.com
I submit my code to a Spark standalone cluster and find that the memory usage
of the executor process keeps growing, which causes the program to crash.

I modified the code and submitted it several times, and found that the four
lines below may be causing the issue:

dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'],
                       rank.alias('rank')).filter("rank<=2")

It looks a little complicated, but it is just some window functions on a
DataFrame. I use the HiveContext because the SQLContext does not support window
functions yet. Without these 4 lines, my code can run all night; adding them
causes the memory leak and the program crashes in a few hours.

I have provided the whole code (50 lines) here: ForAsk01.py
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26921/ForAsk01.py>
Please advise me if it is a bug.
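
(Since ForAsk01.py itself is not reproduced in this thread, here is a minimal,
self-contained sketch of the setup those 4 lines need, i.e. a HiveContext for
the window function. The CSV path and the router/interface/bits schema are
assumptions for illustration, not the original script.)

from pyspark import SparkContext
from pyspark.sql import HiveContext, Row
from pyspark.sql.window import Window
import pyspark.sql.functions as func

sc = SparkContext(appName="TopInterfacesPerRouter")
sqlContext = HiveContext(sc)   # per the mail above, SQLContext lacked window function support

# Assumed CSV layout: router,interface,bits
rows = sc.textFile("/path/to/input.csv") \
         .map(lambda line: line.split(",")) \
         .map(lambda f: Row(router=f[0], interface=f[1], bits=int(f[2])))
dataframe = sqlContext.createDataFrame(rows)

# The 4 lines from above: sum bits per (router, interface), then keep the
# top-2 interfaces per router by dense rank.
dataframe = dataframe.groupBy(['router', 'interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'], dataframe['interface'], dataframe['bits'],
                       rank.alias('rank')).filter("rank<=2")

ret.show()
sc.stop()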

Also here is the submit command 

nohup ./bin/spark-submit  \  
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2"  \
./ForAsk.py 1>a.log 2>b.log &





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
