Will Spark swap memory out to disk if there is not enough memory?

2016-05-16 Thread kramer2...@126.com
I know the cache operation can cache data in memory/disk... But I would like to know whether other operations will do the same. For example, I created a dataframe called df. The df is big, so when I run some action like df.sort(column_name).show() or df.collect(), it throws an error like:
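For reference, a minimal PySpark (Spark 1.6-era) sketch of the caching side of this question; the input path and column name are placeholders, not from the original thread. Only data that is explicitly persisted with MEMORY_AND_DISK can spill to disk, while collect() still pulls every row into the driver.

from pyspark import SparkContext, StorageLevel
from pyspark.sql import SQLContext

sc = SparkContext(appName="spill-demo")
sqlContext = SQLContext(sc)

df = sqlContext.read.json("hdfs:///data/big_input")     # placeholder input
df.persist(StorageLevel.MEMORY_AND_DISK)                # partitions that do not fit in memory spill to disk

df.sort("column_name").show()                           # runs on the executors, prints only a few rows
# df.collect() pulls every row into the driver and can still run out of memory there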

Re:Re:Re: Re:Re: Will the HiveContext cause a memory leak?

2016-05-12 Thread kramer2...@126.com
Sorry, the bug link in the previous mail was wrong. Here is the real link: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html At 2016-05-13 09:49:05, "李明伟" <kramer2...@126.com>

Re:Re: Re:Re: Will the HiveContext cause a memory leak?

2016-05-12 Thread kramer2...@126.com
It seems we hit the same issue. There was a bug in 1.5.1 about a memory leak, but I am using 1.6.1. Here is the link about the bug in 1.5.1: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]"

Re:Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
Hi Simon, can you describe your problem in more detail? I suspect that my problem is caused by the window function (or maybe the groupBy agg functions). If yours is the same, maybe we should report a bug. At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]"
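For context, a hypothetical sketch of the kind of window-plus-groupBy pattern under suspicion (device_id, ts, value and the example rows are made up, not the poster's code); in Spark 1.6 window functions require a HiveContext, which is why the HiveContext is involved at all.

from pyspark import SparkContext
from pyspark.sql import HiveContext, Window
from pyspark.sql import functions as F

sc = SparkContext(appName="window-agg-demo")
hc = HiveContext(sc)                               # window functions need a HiveContext in Spark 1.6

df = hc.createDataFrame(
    [("dev1", 1, 10.0), ("dev1", 2, 12.0), ("dev2", 1, 7.0)],
    ["device_id", "ts", "value"])

w = Window.partitionBy("device_id").orderBy(F.desc("ts"))
latest = df.withColumn("rn", F.row_number().over(w)).where(F.col("rn") == 1)
report = latest.groupBy("device_id").agg(F.avg("value").alias("avg_value"))
report.show()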

Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
Sorry, I have to make a correction again. It may still be a memory leak, because in the end the memory usage goes up again... Eventually, the streaming program crashed.

Re: Will the HiveContext cause a memory leak?

2016-05-11 Thread kramer2...@126.com
After 8 hours, the memory usage becomes stable. Using the top command I find it is 75%, which means 12GB of memory. But it still does not make sense, because my workload is very small. I use Spark to run a calculation on one CSV file every 20 seconds. The size of the CSV file is 1.3M. So Spark

Will the HiveContext cause a memory leak?

2016-05-10 Thread kramer2...@126.com
I submit my code to a Spark standalone cluster and find that the memory usage of the executor process keeps growing, which causes the program to crash. I modified the code and submitted it several times, and found that the four lines below may be causing the issue: dataframe =

What does the Spark standalone cluster do?

2016-05-10 Thread kramer2...@126.com
Hello. My question here is what the Spark standalone cluster does here, because when we submit a program like below ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2" We specified the
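As a rough illustration, the same resource settings can be expressed through SparkConf (values copied from the submit command above; note that --num-executors is primarily a YARN option, while in standalone mode the executor count follows from the cores and memory you allow):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://ES01:7077")
        .setAppName("standalone-demo")
        .set("spark.executor.memory", "4g")
        .set("spark.cores.max", "1")                   # standalone counterpart of --total-executor-cores
        .set("spark.storage.memoryFraction", "0.2"))   # legacy option; ignored in 1.6 unless spark.memory.useLegacyMode is true
sc = SparkContext(conf=conf)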

Re: Why do I have a memory leak with such simple Spark Streaming code?

2016-05-09 Thread kramer2...@126.com
To add more info: these are the settings in my spark-env.sh: [root@ES01 conf]# grep -v "#" spark-env.sh SPARK_EXECUTOR_CORES=1 SPARK_EXECUTOR_INSTANCES=1 SPARK_DAEMON_MEMORY=4G So I did not set the executor to use more memory here. Also, here is the top output: KiB Mem : 16268156 total, 161116

Why do I have a memory leak with such simple Spark Streaming code?

2016-05-09 Thread kramer2...@126.com
Hi, I wrote some Python code to do a calculation on a Spark stream. The code works fine for about half an hour, then the memory usage of the executor becomes very high. I assign 4GB in the submit command, but it is using 80% of my physical memory, which is 16GB. I see this from the top command. In this

How big could the Spark Streaming window be?

2016-05-08 Thread kramer2...@126.com
We have some stream data that needs to be calculated, and we are considering using Spark Streaming to do it. We need to generate three kinds of reports, based on: 1. the last 5 minutes of data, 2. the last 1 hour of data, 3. the last 24 hours of data. The frequency of the reports is 5 minutes. After reading the
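A minimal PySpark Streaming sketch of three sliding windows recomputed every 5 minutes (the input path and report logic are placeholders); the 24-hour window means Spark has to keep roughly a day of batch data available, which is where the sizing question comes from.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="window-report-demo")
ssc = StreamingContext(sc, 300)                    # one batch every 5 minutes

lines = ssc.textFileStream("hdfs:///data")         # placeholder input directory

report_5m = lines.window(300, 300)                 # last 5 minutes, recomputed every 5 minutes
report_1h = lines.window(3600, 300)                # last hour, recomputed every 5 minutes
report_24h = lines.window(86400, 300)              # last 24 hours, recomputed every 5 minutes

report_24h.count().pprint()                        # stand-in for the real report logic

ssc.start()
ssc.awaitTermination()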

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread kramer2...@126.com
I am using Python and Spark. I think one problem might be communicating between Spark and third-party products. For example, to combine Spark with Elasticsearch you have to use Java or Scala; Python is not supported.

Why does a very small workload cause a GC overhead limit error?

2016-04-19 Thread kramer2...@126.com
Hi all, I use Spark to do some calculation. The situation is: 1. A new file comes into a folder periodically. 2. I turn the new files into a data frame and insert it into a previous data frame. The code is like below: # Get the file list in the HDFS directory client =
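A rough sketch of that accumulation pattern in Spark 1.6-era PySpark (the HDFS listing step is omitted and the file paths are placeholders); repeatedly unionAll-ing new files onto the old frame keeps the full lineage of every appended file, which is one place GC pressure can build up.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="append-demo")
sqlContext = SQLContext(sc)

accumulated = None
for path in ["hdfs:///data/file1.csv", "hdfs:///data/file2.csv"]:   # stand-in for the periodic file list
    new_df = sqlContext.read.text(path)          # or a proper CSV reader, depending on the format
    accumulated = new_df if accumulated is None else accumulated.unionAll(new_df)

print(accumulated.count())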

Is it possible that a Spark worker requires more resources than the cluster has?

2016-04-18 Thread kramer2...@126.com
I have a standalone cluster running on one node. The ps command shows that the Worker has 1 GB of memory and the Driver has 256m. root 23182 1 0 Apr01 ? 00:19:30 java -cp

Why is Spark throwing an OutOfMemory exception?

2016-04-11 Thread kramer2...@126.com
I use Spark to do some very simple calculation. The description is like below (pseudo code): While timestamp == 5 minutes df = read_hdf() # Read HDFS to get a dataframe every 5 minutes my_dict[timestamp] = df # Put the data frame into a dict delete_old_dataframe( my_dict )
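A runnable rendering of that pseudo code (read_hdf, the example schema, and the 5-minute trigger are placeholders, not the poster's actual code):

import time
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dict-of-frames-demo")
sqlContext = SQLContext(sc)

def read_hdf(ts):
    # placeholder for reading the file that arrived in the last 5 minutes
    return sqlContext.createDataFrame([(ts, 1.0)], ["ts", "value"])

my_dict = {}
KEEP_SECONDS = 3600                       # keep only the last hour of frames

while True:
    now = int(time.time())
    my_dict[now] = read_hdf(now)          # put the new data frame into the dict
    for old_ts in [t for t in my_dict if now - t > KEEP_SECONDS]:
        del my_dict[old_ts]               # delete_old_dataframe(my_dict) in the pseudo code
    time.sleep(300)                       # wait for the next 5-minute file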

How to release a data frame to avoid a memory leak

2016-03-31 Thread kramer2...@126.com
Hi, I have data frames created every 5 minutes. I use a dict to keep the most recent 1 hour of data frames, so only 12 data frames are kept in the dict. New data frames come in, old data frames pop out. My question is: when I pop out an old data frame, do I have to call dataframe.unpersist to release the
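A small sketch of the insert/evict cycle with explicit persist/unpersist calls (the function and names are hypothetical; whether unpersist is strictly required is exactly the open question, so this only shows where the calls would sit):

def insert_frame(frames, ts, df):
    df.persist()                          # cache the new 5-minute frame
    frames[ts] = df
    if len(frames) > 12:                  # keep only the last hour (12 x 5 minutes)
        oldest = min(frames)
        frames.pop(oldest).unpersist()    # release the evicted frame's cached blocks on the executors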

How to design the input source for Spark Streaming

2016-03-30 Thread kramer2...@126.com
Hi, my environment is described like below: 5 nodes, each node generates a big CSV file every 5 minutes. I need Spark Streaming to analyze these 5 files every five minutes to generate some reports. I am planning to do it this way: 1. Put those 5 files into an HDFS directory called /data 2. Merge
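One possible shape of step 2, assuming the five files all land under /data: Spark can read the whole directory in one call, so the files do not need to be merged by hand first.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="merge-demo")
sqlContext = SQLContext(sc)

# one call reads every file currently under /data as a single DataFrame
df = sqlContext.read.text("hdfs:///data")
print(df.count())

With Spark Streaming, ssc.textFileStream("hdfs:///data") would instead pick up new files as they arrive each batch interval.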