Re: Reply: Re: Met OOM when fetching more than 1,000,000 rows.

Cheng Lian Wed, 10 Jun 2015 01:37:20 -0700

Also, if the data isn't confidential, would you mind sending me a compressed copy (don't cc user@spark.apache.org)?


Cheng


On 6/10/15 4:23 PM, 姜超才 wrote:

Hi Lian,

Thanks for your quick response.
I forgot to mention that I have raised the driver memory from 2G to 4G, which seems to give a minor improvement: the failure when fetching 1,400,000 rows changed from "OOM: GC overhead limit exceeded" to "lost worker heartbeat after 120s".
I will try setting "spark.sql.thriftServer.incrementalCollect" and keep increasing the driver memory up to 7G, then send you the results.
Thanks,

SuperJ


--------- Original Message ---------
*From:* "Cheng Lian" <l...@databricks.com>
*To:* "Hester wang" <hester9...@gmail.com>, <user@spark.apache.org>
*Subject:* Re: Met OOM when fetching more than 1,000,000 rows.
*Date:* 2015/06/10 16:15:47 (Wed)

Hi Xiaohan,
Would you please try setting "spark.sql.thriftServer.incrementalCollect" to "true" and increasing the driver memory size? With this setting, HiveThriftServer2 uses RDD.toLocalIterator rather than RDD.collect().iterator to return the result set. The key difference is that RDD.toLocalIterator retrieves a single partition at a time, which avoids holding the whole result set on the driver side. The memory issue happens on the driver side rather than the executor side, so tuning the executor memory size doesn't help.
Cheng
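The collect() versus toLocalIterator() difference described in the reply above can be illustrated with a small, Spark-free simulation in Java (the language of the attached test code). This is a sketch under stated assumptions, not Spark's implementation: partitions are modeled as plain lists, and the hypothetical class IncrementalCollectSketch only counts how many rows the simulated driver holds at once. The collect()-style path buffers every partition before handing out rows; the toLocalIterator()-style path fetches one partition, drains it, and drops it before fetching the next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical simulation only: no Spark classes are used. Partitions are
// plain lists, and we count how many rows the "driver" buffers at once.
public class IncrementalCollectSketch {

    /** Rows delivered to the client and the peak number of rows held on the driver. */
    public static final class Result {
        public final int rows;
        public final int peak;
        Result(int rows, int peak) { this.rows = rows; this.peak = peak; }
    }

    /** Tracks rows currently buffered on the simulated driver. */
    private static final class DriverBuffer {
        int current = 0;
        int peak = 0;
        void hold(int n) { current += n; peak = Math.max(peak, current); }
        void release(int n) { current -= n; }
    }

    /** collect()-style: materialize every partition into one list, then iterate it. */
    public static Result withCollect(List<List<String>> partitions) {
        DriverBuffer buf = new DriverBuffer();
        List<String> all = new ArrayList<>();
        for (List<String> p : partitions) {
            all.addAll(p);              // the whole result set accumulates here
            buf.hold(p.size());
        }
        int rows = 0;
        for (String row : all) {
            rows++;                     // stand-in for sending a row to the JDBC client
        }
        buf.release(all.size());
        return new Result(rows, buf.peak);
    }

    /** toLocalIterator()-style: fetch one partition, drain it, drop it, fetch the next. */
    public static Result withLocalIterator(List<List<String>> partitions) {
        DriverBuffer buf = new DriverBuffer();
        Iterator<String> it = new Iterator<String>() {
            private int nextPartition = 0;          // index of the next partition to fetch
            private List<String> current = new ArrayList<>();
            private int pos = 0;                    // position inside the current partition

            @Override
            public boolean hasNext() {
                while (pos >= current.size()) {
                    buf.release(current.size());    // drop the drained partition
                    current = new ArrayList<>();
                    pos = 0;
                    if (nextPartition >= partitions.size()) {
                        return false;
                    }
                    // "Fetch" exactly one partition onto the driver.
                    current = new ArrayList<>(partitions.get(nextPartition++));
                    buf.hold(current.size());
                }
                return true;
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.get(pos++);
            }
        };
        int rows = 0;
        while (it.hasNext()) {
            it.next();
            rows++;
        }
        return new Result(rows, buf.peak);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("u1", "u2", "u3"),
                Arrays.asList("u4", "u5", "u6"),
                Arrays.asList("u7", "u8", "u9"),
                Arrays.asList("u10"));
        Result c = withCollect(partitions);
        Result l = withLocalIterator(partitions);
        System.out.println("collect():         rows=" + c.rows + ", peak=" + c.peak);
        System.out.println("toLocalIterator(): rows=" + l.rows + ", peak=" + l.peak);
    }
}
```

With partitions of 3, 3, 3 and 1 rows, both paths deliver all 10 rows, but the collect path peaks at 10 rows on the driver while the iterator path peaks at 3, the size of the largest partition. Under this model, incremental collection bounds driver memory by the largest partition rather than by the whole result set, so a single very large partition could still exhaust the driver.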

On 6/10/15 3:46 PM, Hester wang wrote:
Hi Lian,
I ran into a Spark SQL problem, and I would really appreciate any help you could give me! Below is a detailed description of the problem; for more information, I have attached the original code and the logs you may need.
Problem:
I want to query my table, which is stored in Hive, through the Spark SQL JDBC interface,
and fetch more than 1,000,000 rows, but I hit an OOM.
sql = "select * from TEMP_ADMIN_150601_000001 limit XXX ";

My Env:
5 nodes = 1 master + 4 workers, 1000M network switch, Red Hat 6.5
Each node: 8G RAM, 500G hard disk
Java 1.6, Scala 2.10.4, Hadoop 2.6, Spark 1.3.0, Hive 0.13

Data:
A table of users and their electricity charges.
About 1,600,000 rows, about 28MB in total.
Each row occupies about 18 bytes.
2 columns: user_id String, total_num Double

Repro Steps:
1. Start Spark
2. Start the Spark SQL Thrift server with this command:
/usr/local/spark/spark-1.3.0/sbin/start-thriftserver.sh --master spark://cx-spark-001:7077 --conf spark.executor.memory=4g --conf spark.driver.memory=2g --conf spark.shuffle.consolidateFiles=true --conf spark.shuffle.manager=sort --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit" --conf spark.file.transferTo=false --conf spark.akka.timeout=2000 --conf spark.storage.memoryFraction=0.4 --conf spark.cores.max=8 --conf spark.kryoserializer.buffer.mb=256 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.akka.frameSize=512 --driver-class-path /usr/local/hive/lib/classes12.jar
3. Run the test code (see the attached file testHiveJDBC.java).
4. Get "OOM: GC overhead limit exceeded", "OOM: Java heap space", or "lost worker heartbeat after 120s"; see the attached logs.
Preliminary diagnosis:
1. When fetching fewer than 1,000,000 rows, it always succeeds.
2. When fetching more than 1,300,000 rows, it always fails with "OOM: GC overhead limit exceeded".
3. When fetching about 1,040,000-1,200,000 rows, the query usually succeeds if it runs right after the Thrift server starts up. If I query successfully once and then retry the same query, it fails.
4. There are 3 failure patterns: "OOM: GC overhead limit exceeded", "OOM: Java heap space", or "lost worker heartbeat after 120s".
5. I tried starting the Thrift server with different configurations, giving each worker 4G or 2G of memory, and got the same behavior. That means that, regardless of the total worker memory, I can fetch fewer than 1,000,000 rows but cannot fetch more than 1,300,000 rows.
Preliminary conclusions:
1. The total data is less than 30MB, which is very small, and there is no complex computation involved.
So the failure should not be caused by excessive memory requirements.
So I suspect there is a defect in the Spark SQL code.
2. Allocating 2G or 4G of memory to each worker gives the same behavior.
This strengthens my suspicion that there is a defect in the code, but I can't find the specific location.
Thank you so much!

Best,
Xiaohan Wang
