hadoop.ParquetOutputCommitter: could not write summary file
An error occurred when writing Parquet files to disk. Any advice? I would like to know the cause. Thanks.
```
16/03/29 18:31:48 WARN hadoop.ParquetOutputCommitter: could not write summary file for file:/tmp/goods/2015-6
java.lang.NullPointerException
    at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
    at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
    at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
```
How to read compressed parquet file
I think having too many Parquet files may hurt read performance, so I used Hadoop archive (HAR) to combine them, but sql_context.read.parquet(output_path) does not work on the archive. How can I fix this? Please help me. :)
Re: How to read compressed parquet file
It works on Spark 1.4. Thanks a lot.

2015-09-09 17:21 GMT+08:00 Cheng Lian <lian.cs@gmail.com>:
> You need to use "har://" instead of "hdfs://" to read HAR files. Just
> tested against Spark 1.5, and it works as expected.
>
> Cheng
>
> On 9/9/15 3:29 PM, 李铖 wrote:
> > I think too many parquet files may affect reading capability, so I use
> > hadoop archive to combine them, but sql_context.read.parquet(output_path)
> > does not work on the file.
> > How to fix it, please help me.
> > :)
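A minimal sketch of the path change Cheng describes, assuming the archive lives on HDFS. The helper name `to_har_uri`, the host `namenode:8020`, and the archive path are hypothetical and only illustrate the idea; the `har://scheme-host:port/path` layout is the Hadoop Archives URI convention.

```python
from urllib.parse import urlparse


def to_har_uri(archive_uri: str) -> str:
    """Rewrite a path to a .har archive as a har:// URI.

    "hdfs://host:port/path" becomes "har://hdfs-host:port/path".
    A scheme-less absolute path becomes "har:///path", which leaves
    the choice of underlying file system to the Hadoop defaults.
    """
    parsed = urlparse(archive_uri)
    if not parsed.path.startswith("/"):
        raise ValueError("expected an absolute archive path")
    if not parsed.scheme or not parsed.netloc:
        # No explicit file system authority: fall back to the default one.
        return "har://" + parsed.path
    return f"har://{parsed.scheme}-{parsed.netloc}{parsed.path}"


# Hypothetical archive location, used only for illustration.
print(to_har_uri("hdfs://namenode:8020/user/hadoop/goods.har"))
```

The result would then be passed to the reader from the question, for example `sql_context.read.parquet(to_har_uri(output_path))`.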
Differences between loading data using the Spark data source API and using JDBC
Hi, everyone. I have a question about loading data with the Spark data source API versus JDBC: which way is more effective?
Differences in loading data
What is the difference between loading data using JDBC and loading data using the Spark data source API? Or between loading data using mongo-hadoop and loading data using the native Java driver? Which way is better?
Re: How to increase the number of tasks
Did you change the value of 'spark.default.parallelism'? Try a bigger number.

2015-06-05 17:56 GMT+08:00 Evo Eftimov evo.efti...@isecc.com:
> It may be that your system runs out of resources (i.e. 174 is the ceiling) due to the following:
>
> 1. RDD Partition = (Spark) Task
> 2. RDD Partition != (Spark) Executor
> 3. (Spark) Task != (Spark) Executor
> 4. (Spark) Task = JVM Thread
> 5. (Spark) Executor = JVM instance
>
> From: ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com]
> Sent: Friday, June 5, 2015 10:48 AM
> To: user
> Subject: How to increase the number of tasks
>
> I have a stage that spawns 174 tasks when I run repartition on Avro data. Tasks read between 512/317/316/214/173 MB of data. Even if I increase the number of executors or the number of partitions (when calling repartition), the number of tasks launched remains fixed at 174.
>
> 1) I want to speed up this task. How do I do it?
> 2) A few tasks finish in 20 mins, a few in 15, and a few in less than 10. Why is this? Since this is a repartition stage, it should not depend on the nature of the data.
>
> It's taking more than 30 mins and I want to speed it up by throwing more executors at it. Please suggest.
>
> Deepak
Re: How to increase the number of tasks
Just multiply the number of CPU cores on the node by 2 to 4.

2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
> I did not change spark.default.parallelism. What is the recommended value for it?
>
> On Fri, Jun 5, 2015 at 3:31 PM, 李铖 lidali...@gmail.com wrote:
> > Did you change the value of 'spark.default.parallelism'? Try a bigger number.
>
> --
> Deepak
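The rule of thumb above can be written out as a small calculation. This is only a sketch of the arithmetic; the function name `suggested_parallelism` and the 8-core figure are hypothetical, chosen for illustration.

```python
def suggested_parallelism(cores: int, low_factor: int = 2, high_factor: int = 4) -> tuple:
    """Return the (low, high) range of 2x to 4x the CPU core count."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return (cores * low_factor, cores * high_factor)


# Hypothetical node with 8 CPU cores.
low, high = suggested_parallelism(8)
print(f"spark.default.parallelism between {low} and {high}")
```

A value from that range could then be set as `spark.default.parallelism`, for example in spark-defaults.conf.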
'Java heap space' error occurred when querying a 4G data file from HDFS
In my dev/test environment I have 3 virtual machines; each machine has 12 GB of memory and 8 CPU cores. Here are my spark-defaults.conf and spark-env.sh. Maybe some setting is not right. I ran this command:

spark-submit --master yarn-client --driver-memory 7g --executor-memory 6g /home/hadoop/spark/main.py

and an exception was raised.

spark-defaults.conf:

spark.master spark://cloud1:7077
spark.default.parallelism 100
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.driver.maxResultSize 6g
spark.kryoserializer.buffer.mb 256
spark.kryoserializer.buffer.max.mb 512
spark.executor.memory 4g
spark.rdd.compress true
spark.storage.memoryFraction 0
spark.akka.frameSize 50
spark.shuffle.compress true
spark.shuffle.spill.compress false
spark.local.dir /home/hadoop/tmp

spark-env.sh:

export SCALA=/home/hadoop/softsetup/scala
export JAVA_HOME=/home/hadoop/softsetup/jdk1.7.0_71
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=4g
export HADOOP_CONF_DIR=/opt/cloud/hadoop/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4g
export SPARK_DRIVER_MEMORY=4g

Exception:

15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on cloud3:38109 (size: 162.7 MB)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on cloud3:38109 (size: 162.7 MB)
15/04/07 18:11:03 INFO TaskSetManager: Starting task 31.0 in stage 1.0 (TID 31, cloud3, NODE_LOCAL, 1296 bytes)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on cloud2:49451 (size: 163.7 MB)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on cloud2:49451 (size: 163.7 MB)
15/04/07 18:11:03 INFO TaskSetManager: Starting task 30.0 in stage 1.0 (TID 32, cloud2, NODE_LOCAL, 1296 bytes)
15/04/07 18:11:03 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
    at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:73)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread task-result-getter-0 java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
    at org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:73)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on cloud3:38109 (size: 162.7 MB)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on
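The OutOfMemoryError is thrown on a task-result-getter thread, which runs in the driver while it deserializes task results (about 163 MB each in the log). One thing that may be worth checking, offered as an assumption and not as a confirmed diagnosis, is how spark.driver.maxResultSize (6g) compares with the driver heap (7g from the command line, which takes precedence over the 5g in spark-defaults.conf). The sketch below only does that arithmetic; the helper names `parse_size` and `result_size_ratio` are hypothetical.

```python
import re

# Binary multipliers for the size suffixes used in the posted configuration.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_size(text: str) -> int:
    """Convert a size string such as '6g' or '512m' to bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def result_size_ratio(driver_memory: str, max_result_size: str) -> float:
    """Fraction of the driver heap that collected results are allowed to fill."""
    return parse_size(max_result_size) / parse_size(driver_memory)


# Values taken from the post: --driver-memory 7g, spark.driver.maxResultSize 6g.
ratio = result_size_ratio("7g", "6g")
print(f"maxResultSize is {ratio:.0%} of the driver heap")
```

If the ratio is close to 1, the limit leaves little room for the deserialized objects themselves, so lowering spark.driver.maxResultSize or returning less data to the driver may be worth trying.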
Re: 'Java heap space' error occurred when querying a 4G data file from HDFS
Any help, please? Help me get the configuration right.

On Tuesday, April 7, 2015, 李铖 lidali...@gmail.com wrote:
> In my dev/test environment I have 3 virtual machines; each machine has 12 GB of memory and 8 CPU cores. Here are my spark-defaults.conf and spark-env.sh. Maybe some setting is not right. I ran this command:
>
> spark-submit --master yarn-client --driver-memory 7g --executor-memory 6g /home/hadoop/spark/main.py
>
> and an exception was raised.
Missing an output location for shuffle. : (
Again, when I run a Spark SQL query on a larger file, an error occurs. Has anyone fixed this? Please help me. Here is the stack trace.

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
    at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
Re: Missing an output location for shuffle. : (
2015-03-26 23:01 GMT+08:00 Michael Armbrust mich...@databricks.com:
> I would suggest looking for errors in the logs of your executors.
>
> On Thu, Mar 26, 2015 at 3:20 AM, 李铖 lidali...@gmail.com wrote:
> > Again, when I run a Spark SQL query on a larger file, an error occurs. Has anyone fixed this? Please help me. Here is the stack trace.
> >
> > org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
Spark-sql query got exception. Help
It works when I query data from a small HDFS file, but if the HDFS file is 152 MB, I get this exception. I tried this code: sc.setSystemProperty('spark.kryoserializer.buffer.mb', '256'), but the error persists.
```
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135
    at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
    at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
    at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
    at
```
Re: Spark-sql query got exception. Help
Yes, it works after I added the two properties to spark-defaults.conf. Since I program in Python on Spark, and the Python API does not have the SparkConf API, I set them there. Thanks.

2015-03-25 21:07 GMT+08:00 Cheng Lian lian.cs@gmail.com:
> Oh, just noticed that you were calling sc.setSystemProperty. Actually you need to set this property in SparkConf or in spark-defaults.conf. And there are two configurations related to Kryo buffer size:
>
> - spark.kryoserializer.buffer.mb, which is the initial size, and
> - spark.kryoserializer.buffer.max.mb, which is the max buffer size.
>
> Make sure the 2nd one is larger (it seems that Kryo doesn't check for it).
>
> Cheng
>
> On 3/25/15 7:31 PM, 李铖 wrote:
> > Here is the full stack trace.
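For readers who prefer to set the two properties in code instead of spark-defaults.conf, here is a minimal sketch. PySpark does ship a `pyspark.SparkConf` class, so Cheng's first option is also available from Python. The helper names `kryo_buffer_settings` and `build_context` and the 256/512 values are hypothetical; only the two property names come from the thread.

```python
def kryo_buffer_settings(initial_mb: int, max_mb: int) -> list:
    """Return (key, value) pairs for the two Kryo buffer properties.

    Raises ValueError unless the max buffer is larger than the initial
    one, since the thread notes that Kryo does not seem to check this.
    """
    if max_mb <= initial_mb:
        raise ValueError("max buffer size must be larger than the initial size")
    return [
        ("spark.kryoserializer.buffer.mb", str(initial_mb)),
        ("spark.kryoserializer.buffer.max.mb", str(max_mb)),
    ]


def build_context(app_name: str, initial_mb: int = 256, max_mb: int = 512):
    """Create a SparkContext with the Kryo buffer settings applied."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name)
    conf.setAll(kryo_buffer_settings(initial_mb, max_mb))
    return SparkContext(conf=conf)
```

The settings have to be in place before the SparkContext is created, which is consistent with the call on an existing `sc` having no effect in the original question.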
Re: Spark-sql query got exception. Help
Here is the full stack trace.

15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 39135
    at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
    at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
    at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
    at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
    at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2015-03-25 19:05 GMT+08:00 Cheng Lian lian.cs@gmail.com:
> Could you please provide the full stack trace?
>
> On 3/25/15 6:26 PM, 李铖 wrote:
> > It works when I query data from a small HDFS file, but if the HDFS file is 152 MB, I get this exception. I tried this code: sc.setSystemProperty('spark.kryoserializer.buffer.mb', '256'), but the error persists.
Re: Spark-sql query got exception. Help
One more exception. How can I fix it? Can anybody help me, please?

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
    at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)

2015-03-26 10:39 GMT+08:00 李铖 lidali...@gmail.com:
> Yes, it works after I added the two properties to spark-defaults.conf. Since I program in Python on Spark, and the Python API does not have the SparkConf API, I set them there. Thanks.
Should I run Spark SQL queries on HDFS or Apache Hive?
Hi, everybody. I am new to Spark. I want to run interactive SQL queries using Spark SQL. Spark SQL can run against Hive or load files from HDFS. Which is better or faster? Thanks.
Re: Should I run Spark SQL queries on HDFS or Apache Hive?
Did you mean that, for Spark SQL, Parquet is faster than the Hive format, and the Hive format is faster than HDFS? : )

2015-03-18 1:23 GMT+08:00 Michael Armbrust mich...@databricks.com:
> The performance has more to do with the particular format you are using, not where the metadata is coming from. Even Hive tables are usually read from files in HDFS.
>
> You probably should use HiveContext, as its query language is more powerful than SQLContext's. Also, Parquet is usually the faster data format for Spark SQL.
>
> On Tue, Mar 17, 2015 at 3:41 AM, 李铖 lidali...@gmail.com wrote:
> > Hi, everybody. I am new to Spark. I want to run interactive SQL queries using Spark SQL. Spark SQL can run against Hive or load files from HDFS. Which is better or faster? Thanks.