[ https://issues.apache.org/jira/browse/HIVE-15527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827376#comment-15827376 ]
Dapeng Sun edited comment on HIVE-15527 at 1/18/17 3:32 AM:
------------------------------------------------------------

Thanks [~csun] and [~Ferd], here is the detailed log:

{noformat}
17/01/17 xx:xx:xx INFO client.RemoteDriver: Failed to run job xxxxxxxxxxxxxxxxxxxx
java.lang.NumberFormatException: null
	at java.lang.Long.parseLong(Long.java:552)
	at java.lang.Long.parseLong(Long.java:631)
	at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:202)
	at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:141)
	at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
	at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:335)
	at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:366)
	at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
17/01/17 xx:xx:xx INFO client.RemoteDriver: Shutting down remote driver.
{noformat}


> Memory usage is unbound in SortByShuffler for Spark
> ---------------------------------------------------
>
>                 Key: HIVE-15527
>                 URL: https://issues.apache.org/jira/browse/HIVE-15527
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Chao Sun
>         Attachments: HIVE-15527.0.patch, HIVE-15527.0.patch, HIVE-15527.1.patch,
>                      HIVE-15527.2.patch, HIVE-15527.3.patch, HIVE-15527.4.patch,
>                      HIVE-15527.5.patch, HIVE-15527.6.patch, HIVE-15527.7.patch,
>                      HIVE-15527.8.patch, HIVE-15527.patch
>
> In SortByShuffler.java, an ArrayList is used to back the iterator for values that have the same key in the shuffled result produced by the Spark transformation sortByKey.
> It's possible that memory can be exhausted by a large key group.
> {code}
>     @Override
>     public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
>       // TODO: implement this by accumulating rows with the same key into a list.
>       // Note that this list needs to improved to prevent excessive
>       // memory usage, but this can be done in later phase.
>       while (it.hasNext()) {
>         Tuple2<HiveKey, BytesWritable> pair = it.next();
>         if (curKey != null && !curKey.equals(pair._1())) {
>           HiveKey key = curKey;
>           List<BytesWritable> values = curValues;
>           curKey = pair._1();
>           curValues = new ArrayList<BytesWritable>();
>           curValues.add(pair._2());
>           return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
>         }
>         curKey = pair._1();
>         curValues.add(pair._2());
>       }
>       if (curKey == null) {
>         throw new NoSuchElementException();
>       }
>       // if we get here, this should be the last element we have
>       HiveKey key = curKey;
>       curKey = null;
>       return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
>     }
> {code}
> Since the output from sortByKey is already sorted on key, it's possible to back the value iterable using the same input iterator.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
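A note on the terse {{java.lang.NumberFormatException: null}} in the driver log above: that "null" is the exception message `Long.parseLong` produces when its input string is null, so the failure at SparkPlanGenerator.java:202 means the string being parsed was never set. A minimal reproduction (the class name `ParseNullDemo` is just for illustration):

```java
// Demonstrates why the log reads "java.lang.NumberFormatException: null":
// Long.parseLong throws with the literal message "null" when given a null
// string, i.e. the value SparkPlanGenerator tried to parse was missing.
public class ParseNullDemo {
    public static void main(String[] args) {
        try {
            Long.parseLong(null); // same call as in the stack trace
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage()); // prints: null
        }
    }
}
```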
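The closing suggestion, backing the value iterable with the same input iterator, could be sketched as below. This is a hypothetical illustration, not the actual HIVE-15527 patch: the class name `LazyGroupIterator` is made up, and plain `Map.Entry` stands in for Hive's `Tuple2<HiveKey, BytesWritable>`. It assumes the input iterator is sorted by key (as sortByKey guarantees) and that each group's values are fully consumed before the next group is requested, a single-pass streaming contract the real fix would also have to impose.

```java
import java.util.AbstractMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

// Sketch: stream each (key, values) group straight off the shared sorted
// input iterator instead of buffering values in an ArrayList, so memory
// use stays constant even for a very large key group.
public class LazyGroupIterator<K, V> implements Iterator<Map.Entry<K, Iterable<V>>> {
  private final Iterator<Map.Entry<K, V>> it;
  private Map.Entry<K, V> pending; // read-ahead pair belonging to the next group

  public LazyGroupIterator(Iterator<Map.Entry<K, V>> sortedIterator) {
    this.it = sortedIterator;
  }

  @Override
  public boolean hasNext() {
    return pending != null || it.hasNext();
  }

  @Override
  public Map.Entry<K, Iterable<V>> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    final Map.Entry<K, V> first = (pending != null) ? pending : it.next();
    pending = null;
    final K key = first.getKey();
    // The Iterable pulls lazily from the shared iterator; nothing is buffered.
    Iterable<V> values = () -> new Iterator<V>() {
      private Map.Entry<K, V> cur = first;

      @Override
      public boolean hasNext() {
        if (cur != null) {
          return true;
        }
        if (pending != null || !it.hasNext()) {
          return false; // group exhausted (or next group already read ahead)
        }
        Map.Entry<K, V> e = it.next();
        if (e.getKey().equals(key)) {
          cur = e;
          return true;
        }
        pending = e; // first pair of the next group; stop this one
        return false;
      }

      @Override
      public V next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        V v = cur.getValue();
        cur = null;
        return v;
      }
    };
    return new AbstractMap.SimpleEntry<>(key, values);
  }
}
```

The trade-off versus the ArrayList version quoted above is that the returned iterable is single-use and must be drained in order; skipping ahead before a group is exhausted would misattribute its remaining values to the next group.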