Thanks for the extra details and explanations, Chaocai. I will try to
reproduce this when I get a chance.
Cheng
On 6/12/15 3:44 PM, 姜超才 wrote:
I said "OOM occurred on slave node", because I monitored memory
utilization during the query task, on driver, very few memory was
ocupied. And i remember i have ever seen the OOM stderr log on slave
node.
But recently there seems no OOM log on slave node.
Follow the cmd 、data 、env and the code I gave you, the OOM can 100%
repro on cluster mode.
Thanks,
SuperJ
--------- Original Message ---------
*From:* "Cheng Lian" <l...@databricks.com>
*To:* "姜超才" <jiangchao...@haiyisoft.com>, "Hester wang"
<hester9...@gmail.com>, <user@spark.apache.org>
*Subject:* Re: Re: Re: Re: Re: Re: Re: Met OOM when fetching more
than 1,000,000 rows.
*Date:* 2015/06/12 15:30:08 (Fri)
Hi Chaocai,
Glad that 1.4 fixes your case. However, I'm a bit confused by your
last comment saying "The OOM or lost heartbeat occurred on the slave
node", because judging from the log files you attached at first, those
OOMs actually happened on the driver side (the Thrift server log only
contains log lines from the driver side). Did you see OOM in the
executor stderr output?
I ask because a large portion of users are still on 1.3, and we may
want to deliver a fix if there really is a bug causing unexpected OOM.
Cheng
On 6/12/15 3:14 PM, 姜超才 wrote:
Hi Lian,
Today I updated my Spark to v1.4, and this issue is resolved.
Thanks,
SuperJ
--------- Original Message ---------
*From:* "姜超才"
*To:* "Cheng Lian" , "Hester wang" ,
*Subject:* Re: Re: Re: Re: Re: Re: Met OOM when fetching more than
1,000,000 rows.
*Date:* 2015/06/11 08:56:28 (Thu)
No problem in local mode; I can get all rows with:
select * from foo;
The OOM or lost heartbeat occurred on the slave node.
Thanks,
SuperJ
--------- Original Message ---------
*From:* "Cheng Lian"
*To:* "姜超才" , "Hester wang" ,
*Subject:* Re: Re: Re: Re: Re: Met OOM when fetching more than
1,000,000 rows.
*Date:* 2015/06/10 19:58:59 (Wed)
Hm, I tried the following with 0.13.1 and 0.13.0 on my laptop (I don't
have access to a cluster right now) but couldn't reproduce this issue.
Your program just ran through smoothly... :-/
Command line used to start the Thrift server:
./sbin/start-thriftserver.sh --driver-memory 4g --master local
SQL statements used to create the table with your data:
create table foo(k string, v double);
load data local inpath '/tmp/bar' into table foo;
Tried this via Beeline:
select * from foo limit 1600000;
I also tried the Java program you provided.
Could you also verify whether this single-node local mode works for
you? I will investigate on a cluster when I get a chance.
Cheng
On 6/10/15 5:19 PM, 姜超才 wrote:
When set "spark.sql.thriftServer.incrementalCollect" and set driver
memory to 7G, Things seems stable and simple:
It can quickly run through the query line, but when traversal the
result set ( while rs.hasNext ), it can quickly get the OOM: java
heap space. See attachment.
/usr/local/spark/spark-1.3.0/sbin/start-thriftserver.sh --master
spark://cx-spark-001:7077 --conf spark.executor.memory=4g --conf
spark.driver.memory=7g --conf spark.shuffle.consolidateFiles=true
--conf spark.shuffle.manager=sort --conf
"spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --conf
spark.file.transferTo=false --conf spark.akka.timeout=2000 --conf
spark.storage.memoryFraction=0.4 --conf spark.cores.max=8 --conf
spark.kryoserializer.buffer.mb=256 --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
spark.akka.frameSize=512 --driver-class-path
/usr/local/hive/lib/classes12.jar --conf
spark.sql.thriftServer.incrementalCollect=true
Thanks,
SuperJ
--------- Original Message ---------
*From:* "Cheng Lian"
*To:* "姜超才" , "Hester wang" ,
*Subject:* Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.
*Date:* 2015/06/10 16:37:34 (Wed)
Also, if the data isn't confidential, would you mind sending me a
compressed copy (without cc'ing user@spark.apache.org)?
Cheng
On 6/10/15 4:23 PM, 姜超才 wrote:
Hi Lian,
Thanks for your quick response.
I forgot to mention that I had already tuned driver memory from 2G to
4G, which gave a minor improvement: the failure mode when fetching
1,400,000 rows changed from "OOM: GC overhead limit exceeded" to "lost
worker heartbeat after 120s".
I will try setting "spark.sql.thriftServer.incrementalCollect", continue
increasing driver memory to 7G, and send you the result.
Thanks,
SuperJ
--------- Original Message ---------
*From:* "Cheng Lian"
*To:* "Hester wang" ,
*Subject:* Re: Met OOM when fetching more than 1,000,000 rows.
*Date:* 2015/06/10 16:15:47 (Wed)
Hi Xiaohan,
Would you please try setting
"spark.sql.thriftServer.incrementalCollect" to "true" and increasing
the driver memory size? With this option, HiveThriftServer2 uses
RDD.toLocalIterator rather than RDD.collect().iterator to return the
result set. The key difference is that RDD.toLocalIterator retrieves a
single partition at a time, thus avoiding holding the whole result set
on the driver side. The memory issue happens on the driver side rather
than the executor side, so tuning the executor memory size doesn't
help.
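(Not the actual HiveThriftServer2 code path, just a minimal Java sketch
of the difference on hypothetical data: collect() materializes every
partition in driver memory at once, while toLocalIterator() pulls one
partition at a time.)

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CollectVsLocalIterator {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("collect-vs-toLocalIterator")
            .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical small data set split into 4 partitions.
        JavaRDD<Integer> rdd =
            sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

        // collect(): every partition is shipped to the driver and the
        // whole result set is held in driver memory at the same time.
        List<Integer> all = rdd.collect();
        System.out.println("collect() size: " + all.size());

        // toLocalIterator(): partitions are fetched one at a time, so
        // only a single partition needs to fit in driver memory while
        // iterating.
        Iterator<Integer> it = rdd.toLocalIterator();
        long count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println("toLocalIterator() count: " + count);

        sc.stop();
    }
}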
Cheng
On 6/10/15 3:46 PM, Hester wang wrote:
Hi Lian,
I've hit a Spark SQL problem and would really appreciate your help!
Below is a detailed description of the problem; attached are the
original code and the logs you may need.
Problem:
I want to query a table stored in Hive through the Spark SQL JDBC
interface and fetch more than 1,000,000 rows, but I hit an OOM.
sql = "select * from TEMP_ADMIN_150601_000001 limit XXX ";
My Env:
5 nodes = one master + 4 workers, 1000M network switch, Red Hat 6.5
Each node: 8G RAM, 500G hard disk
Java 1.6, Scala 2.10.4, Hadoop 2.6, Spark 1.3.0, Hive 0.13
Data:
A table of users and their electricity charge data.
About 1,600,000 rows, about 28 MB; each row occupies about 18 bytes.
2 columns: user_id String, total_num Double
Repro Steps:
1. Start Spark
2. Start SparkSQL thriftserver, command:
/usr/local/spark/spark-1.3.0/sbin/start-thriftserver.sh --master
spark://cx-spark-001:7077 --conf spark.executor.memory=4g --conf
spark.driver.memory=2g --conf spark.shuffle.consolidateFiles=true
--conf spark.shuffle.manager=sort --conf
"spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --conf
spark.file.transferTo=false --conf spark.akka.timeout=2000 --conf
spark.storage.memoryFraction=0.4 --conf spark.cores.max=8 --conf
spark.kryoserializer.buffer.mb=256 --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
spark.akka.frameSize=512 --driver-class-path
/usr/local/hive/lib/classes12.jar
3. Run the test code (see the attached file testHiveJDBC.java; a
minimal sketch of such a client follows after these steps).
4. Get OOM: GC overhead limit exceeded, OOM: Java heap space, or lost
worker heartbeat after 120s. See the attached logs.
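(The attached testHiveJDBC.java is not reproduced here; the following
is only a minimal sketch of such a JDBC client, assuming the standard
Hive JDBC driver and the default Thrift server port 10000 on the
master host.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TestHiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2-compatible JDBC driver, also used for the
        // Spark SQL Thrift server.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host and port are assumptions; adjust to wherever the
        // Thrift server actually runs.
        Connection conn = DriverManager.getConnection(
            "jdbc:hive2://cx-spark-001:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // The query from the report; the limit stands in for XXX.
        ResultSet rs = stmt.executeQuery(
            "select * from TEMP_ADMIN_150601_000001 limit 1600000");

        long rows = 0;
        while (rs.next()) {      // traverse the result set row by row
            rs.getString(1);     // user_id
            rs.getDouble(2);     // total_num
            rows++;
        }
        System.out.println("Fetched " + rows + " rows");

        rs.close();
        stmt.close();
        conn.close();
    }
}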
Preliminary diagnosis:
1. When fetching fewer than 1,000,000 rows, it always succeeds.
2. When fetching more than 1,300,000 rows, it always fails with OOM: GC
overhead limit exceeded.
3. When fetching about 1,040,000-1,200,000 rows, a query issued right
after the Thrift server starts up usually succeeds; but if I query
successfully once and then retry the same query, it fails.
4. There are 3 failure patterns: OOM: GC overhead limit exceeded, OOM:
Java heap space, or lost worker heartbeat after 120s.
5. I tried starting the Thrift server with different configurations,
giving each worker 4G or 2G of memory, and got the same behavior. That
is, regardless of the total worker memory, I can fetch fewer than
1,000,000 rows but cannot fetch more than 1,300,000 rows.
Preliminary conclusions:
1. The total data is less than 30 MB, which is very small, and there is
no complex computation involved, so the failure is not caused by
excessive memory requirements. I therefore suspect a defect in the
Spark SQL code.
2. Allocating 2G or 4G of memory to each worker gives the same
behavior, which strengthens my suspicion that there is a defect in the
code, but I can't find its specific location.
Thank you so much!
Best,
Xiaohan Wang