hadoop.ParquetOutputCommitter: could not write summary file

2016-03-29 Thread
An error occurred while writing Parquet files to disk.
Any advice?
I want to know the reason. Thanks.
```
16/03/29 18:31:48 WARN hadoop.ParquetOutputCommitter: could not write
summary file for file:/tmp/goods/2015-6
java.lang.NullPointerException
at
org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
at
org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
at
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
```
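
This is only a warning raised while the committer writes the Parquet summary
files (_metadata/_common_metadata) after the job has already committed. If the
summary files are not needed, one possible workaround (a sketch, not a
confirmed fix) is to disable Parquet summary metadata so mergeFooters is never
called. A minimal PySpark example; the input path and DataFrame are made up
for illustration, and only the output path comes from the log above:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-without-summary-files")
# _jsc is PySpark's internal JavaSparkContext handle; the same setting can be
# passed as: spark-submit --conf spark.hadoop.parquet.enable.summary-metadata=false
sc._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false")

sql_context = SQLContext(sc)
df = sql_context.read.json("file:/tmp/goods-input.json")  # hypothetical input
df.write.parquet("file:/tmp/goods/2015-6")                # output path from the warning
```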


How to read compressed parquet file

2015-09-09 Thread
I think too many Parquet files may hurt read performance, so I used Hadoop
Archive to combine them, but sql_context.read.parquet(output_path) does not
work on the archived file.
How can I fix it? Please help me.
:)


Re: How to read compressed parquet file

2015-09-09 Thread
It works on Spark 1.4.
Thanks a lot.

2015-09-09 17:21 GMT+08:00 Cheng Lian <lian.cs@gmail.com>:

> You need to use "har://" instead of "hdfs://" to read HAR files. Just
> tested against Spark 1.5, and it works as expected.
>
> Cheng
>
>
> On 9/9/15 3:29 PM, 李铖 wrote:
>
> I think too many parquet files may be affect reading capability,so I use
> hadoop archive to combine them,but  sql_context.read.parquet(output_path)
> does not work on the file.
> How to fix it ,please help me.
> :)
>
>
>
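
For reference, a minimal sketch of the fix Cheng describes. The archive name
and paths below are hypothetical; the key point is the har:// scheme, which
(with no host given) resolves the archive against the default filesystem:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="read-parquet-from-har")
sql_context = SQLContext(sc)

# The .har directory is addressed with the har:// scheme, and files inside the
# archive are referenced as if the archive were an ordinary directory.
output_path = "har:///user/hadoop/archives/goods.har/parquet-output"
df = sql_context.read.parquet(output_path)
df.show()
```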


Differences in loading data using the Spark data source API and using JDBC

2015-08-10 Thread
Hi, everyone.

I have a question about loading data: which is more efficient, the Spark data
source API or JDBC?


Differences in loading data

2015-08-10 Thread
What is the difference between loading data using JDBC and loading data using
the Spark data source API?
Or between loading data using mongo-hadoop and loading data using the native
Java driver?

Which way is better?
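
The thread got no reply, but as a rough sketch of what is being compared: the
Spark JDBC data source can split a table into several range scans that run on
the executors in parallel, while a plain JDBC connection reads through a single
cursor. All connection details and column names below are hypothetical:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="jdbc-datasource-read")
sql_context = SQLContext(sc)

# Spark data source API over JDBC (Spark 1.4+).  The partitioning options turn
# the read into numPartitions parallel range scans on the partition column.
# The JDBC driver jar must be on the classpath, e.g. via --jars.
df = (sql_context.read.format("jdbc")
      .options(url="jdbc:mysql://db-host:3306/shop",
               dbtable="orders",
               partitionColumn="id",
               lowerBound="1",
               upperBound="1000000",
               numPartitions="8")
      .load())
print(df.count())
```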


Re: How to increase the number of tasks

2015-06-05 Thread
Did you try changing the value of 'spark.default.parallelism'? Set it to a
bigger number.

2015-06-05 17:56 GMT+08:00 Evo Eftimov evo.efti...@isecc.com:

 It may be that your system runs out of resources (ie 174 is the ceiling)
 due to the following



 1.   RDD Partition = (Spark) Task

 2.   RDD Partition != (Spark) Executor

 3.   (Spark) Task != (Spark) Executor

 4.   (Spark) Task = JVM Thread

 5.   (Spark) Executor = JVM instance



 *From:* ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com]
 *Sent:* Friday, June 5, 2015 10:48 AM
 *To:* user
 *Subject:* How to increase the number of tasks



 I have a  stage that spawns 174 tasks when i run repartition on avro data.

 Tasks read between 512/317/316/214/173  MB of data. Even if i increase
 number of executors/ number of partitions (when calling repartition) the
 number of tasks launched remains fixed to 174.



 1) I want to speed up this task. How do i do it ?

 2) Few tasks finish in 20 mins, few in 15 and few in less than 10. Why is
 this behavior ?

 Since this is a repartition stage, it should not depend on the nature of
 data.



 Its taking more than 30 mins and i want to speed it up by throwing more
 executors at it.



 Please suggest



 Deepak





Re: How to increase the number of tasks

2015-06-05 Thread
Just multiply 2-4 by the number of CPU cores per node.
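
A short sketch combining both suggestions from this thread, with made-up
numbers (assuming, say, 10 worker nodes of 8 cores each, so 2-4 tasks per core
gives roughly 160-320) and a hypothetical input path:

```
from pyspark import SparkConf, SparkContext

# Default parallelism for shuffles that do not specify a partition count.
conf = SparkConf().set("spark.default.parallelism", "240")
sc = SparkContext(conf=conf, appName="more-tasks")

rdd = sc.textFile("hdfs:///data/avro-input")   # hypothetical input
repartitioned = rdd.repartition(240)           # explicit target partition count
print(repartitioned.getNumPartitions())
```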

2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:

 I did not change spark.default.parallelism,
 What is recommended value for it.

 On Fri, Jun 5, 2015 at 3:31 PM, 李铖 lidali...@gmail.com wrote:

 Did you have a change of the value of 'spark.default.parallelism'?be a
 bigger number.

 2015-06-05 17:56 GMT+08:00 Evo Eftimov evo.efti...@isecc.com:

 It may be that your system runs out of resources (ie 174 is the ceiling)
 due to the following



 1.   RDD Partition = (Spark) Task

 2.   RDD Partition != (Spark) Executor

 3.   (Spark) Task != (Spark) Executor

 4.   (Spark) Task = JVM Thread

 5.   (Spark) Executor = JVM instance



 *From:* ÐΞ€ρ@Ҝ (๏̯͡๏) [mailto:deepuj...@gmail.com]
 *Sent:* Friday, June 5, 2015 10:48 AM
 *To:* user
 *Subject:* How to increase the number of tasks



 I have a  stage that spawns 174 tasks when i run repartition on avro
 data.

 Tasks read between 512/317/316/214/173  MB of data. Even if i increase
 number of executors/ number of partitions (when calling repartition) the
 number of tasks launched remains fixed to 174.



 1) I want to speed up this task. How do i do it ?

 2) Few tasks finish in 20 mins, few in 15 and few in less than 10. Why
 is this behavior ?

 Since this is a repartition stage, it should not depend on the nature of
 data.



 Its taking more than 30 mins and i want to speed it up by throwing more
 executors at it.



 Please suggest



 Deepak







 --
 Deepak




'Java heap space' error occurred when querying a 4 GB data file from HDFS

2015-04-07 Thread
In my dev/test environment I have 3 virtual machines; each machine has 12 GB of
memory and 8 CPU cores.

Here are spark-defaults.conf and spark-env.sh. Maybe some of the configuration
is not right.

I run this command: *spark-submit --master yarn-client --driver-memory 7g
--executor-memory 6g /home/hadoop/spark/main.py*
and the exception below is raised.

*spark-defaults.conf*

spark.master spark://cloud1:7077
spark.default.parallelism 100
spark.eventLog.enabled   true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory  5g
spark.driver.maxResultSize 6g
spark.kryoserializer.buffer.mb 256
spark.kryoserializer.buffer.max.mb 512
spark.executor.memory 4g
spark.rdd.compress true
spark.storage.memoryFraction 0
spark.akka.frameSize 50
spark.shuffle.compress true
spark.shuffle.spill.compress false
spark.local.dir /home/hadoop/tmp

*spark-env.sh*

export SCALA=/home/hadoop/softsetup/scala
export JAVA_HOME=/home/hadoop/softsetup/jdk1.7.0_71
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=4g
export HADOOP_CONF_DIR=/opt/cloud/hadoop/etc/hadoop
export SPARK_EXECUTOR_MEMORY=4g
export SPARK_DRIVER_MEMORY=4g

*Exception:*

15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on
cloud3:38109 (size: 162.7 MB)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on
cloud3:38109 (size: 162.7 MB)
15/04/07 18:11:03 INFO TaskSetManager: Starting task 31.0 in stage 1.0 (TID
31, cloud3, NODE_LOCAL, 1296 bytes)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on
cloud2:49451 (size: 163.7 MB)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on
cloud2:49451 (size: 163.7 MB)
15/04/07 18:11:03 INFO TaskSetManager: Starting task 30.0 in stage 1.0 (TID
32, cloud2, NODE_LOCAL, 1296 bytes)
15/04/07 18:11:03 ERROR Utils: Uncaught exception in thread
task-result-getter-0
java.lang.OutOfMemoryError: Java heap space
at
org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
at
org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:73)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Exception in thread task-result-getter-0 java.lang.OutOfMemoryError: Java
heap space
at
org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
at
org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:73)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on
cloud3:38109 (size: 162.7 MB)
15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on

Re: 'Java heap space' error occurred when querying a 4 GB data file from HDFS

2015-04-07 Thread
Any help, please?

Help me get the configuration right.


On Tuesday, April 7, 2015, 李铖 lidali...@gmail.com wrote:

 In my dev-test env .I have 3 virtual machines ,every machine have 12G
 memory,8 cpu core.

 Here is spark-defaults.conf,and spark-env.sh.Maybe some config is not
 right.

 I run this command :*spark-submit --master yarn-client --driver-memory 7g
 --executor-memory 6g /home/hadoop/spark/main.py*
 exception rised.

 *spark-defaults.conf*

 spark.master spark://cloud1:7077
 spark.default.parallelism 100
 spark.eventLog.enabled   true
 spark.serializer org.apache.spark.serializer.KryoSerializer
 spark.driver.memory  5g
 spark.driver.maxResultSize 6g
 spark.kryoserializer.buffer.mb 256
 spark.kryoserializer.buffer.max.mb 512
 spark.executor.memory 4g
 spark.rdd.compress true
 spark.storage.memoryFraction 0
 spark.akka.frameSize 50
 spark.shuffle.compress true
 spark.shuffle.spill.compress false
 spark.local.dir /home/hadoop/tmp

 * spark-evn.sh*

 export SCALA=/home/hadoop/softsetup/scala
 export JAVA_HOME=/home/hadoop/softsetup/jdk1.7.0_71
 export SPARK_WORKER_CORES=1
 export SPARK_WORKER_MEMORY=4g
 export HADOOP_CONF_DIR=/opt/cloud/hadoop/etc/hadoop
 export SPARK_EXECUTOR_MEMORY=4g
 export SPARK_DRIVER_MEMORY=4g

 *Exception:*

 15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on
 cloud3:38109 (size: 162.7 MB)
 15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_28 on disk on
 cloud3:38109 (size: 162.7 MB)
 15/04/07 18:11:03 INFO TaskSetManager: Starting task 31.0 in stage 1.0
 (TID 31, cloud3, NODE_LOCAL, 1296 bytes)
 15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on
 cloud2:49451 (size: 163.7 MB)
 15/04/07 18:11:03 INFO BlockManagerInfo: Added taskresult_29 on disk on
 cloud2:49451 (size: 163.7 MB)
 15/04/07 18:11:03 INFO TaskSetManager: Starting task 30.0 in stage 1.0
 (TID 32, cloud2, NODE_LOCAL, 1296 bytes)
 15/04/07 18:11:03 ERROR Utils: Uncaught exception in thread
 task-result-getter-0
 java.lang.OutOfMemoryError: Java heap space
 at
 org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
 at
 org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)
 at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:73)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Exception in thread task-result-getter-0 java.lang.OutOfMemoryError:
 Java heap space
 at
 org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:61)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:985)
 at
 org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:58)
 at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:73)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:49)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
 at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:48)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615

Missing an output location for shuffle. : (

2015-03-26 Thread
Again, when I run a Spark SQL query over a larger file, an error occurs. Has
anyone fixed this? Please help me.
Here is the stack trace.

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
at
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)

)


Re: Missing an output location for shuffle. : (

2015-03-26 Thread
(a table of per-device network traffic statistics — IP address, MAC address, bytes transferred, connection duration — pasted here)

2015-03-26 23:01 GMT+08:00 Michael Armbrust mich...@databricks.com:

 I would suggest looking for errors in the logs of your executors.

 On Thu, Mar 26, 2015 at 3:20 AM, 李铖 lidali...@gmail.com wrote:

 Again,when I do larger file Spark-sql query, error occured.Anyone have
 got fix it .Please help me.
 Here is the track.

 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle 0
 at
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
 at
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at
 org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
 at
 org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
 at
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
 at
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
 at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.sql.SchemaRDD.compute

Spark-sql query got exception. Help

2015-03-25 Thread
It is OK when I query data from a small HDFS file,
but if the HDFS file is 152 MB I get this exception.
I tried this code:
'sc.setSystemProperty("spark.kryoserializer.buffer.mb", "256")'
and the error persists.

```
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 39135
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at


```


Re: Spark-sql query got exception. Help

2015-03-25 Thread
Yes, it works after I appended the two properties to spark-defaults.conf.

As I program in Python on the Spark platform, and the Python API does not give
me the SparkConf API, I set them there instead.

Thanks.
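
For reference, the two entries appended to spark-defaults.conf might look like
this (the sizes are illustrative; per Cheng's note below, the max value must be
the larger of the two):

```
spark.kryoserializer.buffer.mb     256
spark.kryoserializer.buffer.max.mb 512
```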

2015-03-25 21:07 GMT+08:00 Cheng Lian lian.cs@gmail.com:

  Oh, just noticed that you were calling sc.setSystemProperty. Actually
 you need to set this property in SparkConf or in spark-defaults.conf. And
 there are two configurations related to Kryo buffer size,

- spark.kryoserializer.buffer.mb, which is the initial size, and
- spark.kryoserializer.buffer.max.mb, which is the max buffer size.

 Make sure the 2nd one is larger (it seems that Kryo doesn’t check for it).

 Cheng

 On 3/25/15 7:31 PM, 李铖 wrote:

   Here is the full track

  15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow.
 Available: 0, required: 39135
  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
  at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
  at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
  at
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

 2015-03-25 19:05 GMT+08:00 Cheng Lian lian.cs@gmail.com:

  Could you please provide the full stack trace?


 On 3/25/15 6:26 PM, 李铖 wrote:

  It is ok when I do query data from a small hdfs file.
 But if the hdfs file is 152m,I got this exception.
 I try this code
 .'sc.setSystemProperty(spark.kryoserializer.buffer.mb,'256')'.error
 still.

  ```
 com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
 required: 39135
  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
  at


 ```






Re: Spark-sql query got exception. Help

2015-03-25 Thread
Here is the full stack trace

15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow.
Available: 0, required: 39135
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

2015-03-25 19:05 GMT+08:00 Cheng Lian lian.cs@gmail.com:

  Could you please provide the full stack trace?


 On 3/25/15 6:26 PM, 李铖 wrote:

  It is ok when I do query data from a small hdfs file.
 But if the hdfs file is 152m,I got this exception.
 I try this code
 .'sc.setSystemProperty(spark.kryoserializer.buffer.mb,'256')'.error
 still.

  ```
 com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
 required: 39135
  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
  at


 ```





Re: Spark-sql query got exception. Help

2015-03-25 Thread
One more exception. How do I fix it? Anybody help me, please.


org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
at
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)


2015-03-26 10:39 GMT+08:00 李铖 lidali...@gmail.com:

 Yes, it works after I append the two properties in spark-defaults.conf.

 As I  use python programing on spark platform,the python api does not have
 SparkConf api.

 Thanks.

 2015-03-25 21:07 GMT+08:00 Cheng Lian lian.cs@gmail.com:

  Oh, just noticed that you were calling sc.setSystemProperty. Actually
 you need to set this property in SparkConf or in spark-defaults.conf. And
 there are two configurations related to Kryo buffer size,

- spark.kryoserializer.buffer.mb, which is the initial size, and
- spark.kryoserializer.buffer.max.mb, which is the max buffer size.

 Make sure the 2nd one is larger (it seems that Kryo doesn’t check for it).

 Cheng

 On 3/25/15 7:31 PM, 李铖 wrote:

   Here is the full track

  15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow.
 Available: 0, required: 39135
  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
  at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
  at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
  at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
  at
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)

 2015-03-25 19:05 GMT+08:00 Cheng Lian lian.cs@gmail.com

Should I do spark-sql query on HDFS or Apache Hive?

2015-03-17 Thread
Hi, everybody.

I am new to Spark. I want to run interactive SQL queries using Spark SQL.
Spark SQL can run on top of Hive, or it can load files directly from HDFS.

Which is better or faster?

Thanks.


Should I do spark-sql query on HDFS or Hive?

2015-03-17 Thread
Hi, everybody.

I am new to Spark. I want to run interactive SQL queries using Spark SQL.
Spark SQL can run on top of Hive, or it can load files directly from HDFS.

Which is better or faster?

Thanks.


Re: Should I do spark-sql query on HDFS or Apache Hive?

2015-03-17 Thread
Did you mean that Parquet is faster than the Hive format, and the Hive format
is faster than raw HDFS files, for Spark SQL?

: )

2015-03-18 1:23 GMT+08:00 Michael Armbrust mich...@databricks.com:

 The performance has more to do with the particular format you are using,
 not where the metadata is coming from.   Even hive tables are read from
 files HDFS usually.

 You probably should use HiveContext as its query language is more powerful
 than SQLContext.  Also, parquet is usually the faster data format for Spark
 SQL.

 On Tue, Mar 17, 2015 at 3:41 AM, 李铖 lidali...@gmail.com wrote:

 Hi,everybody.

 I am new in spark. Now I want to do interactive sql query using spark
 sql. spark sql can run under hive or loading files from hdfs.

 Which is better or faster?

 Thanks.
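
A sketch of the setup Michael suggests, using HiveContext for the richer query
language and Parquet as the storage format. Table and path names are made up,
and the Spark 1.4+ reader/writer API is used for brevity (earlier releases use
jsonFile/parquetFile/saveAsParquetFile):

```
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="interactive-sql")
hive_context = HiveContext(sc)   # HiveQL support; works even without an existing Hive install

# One-time conversion of the raw HDFS data to Parquet.
raw = hive_context.read.json("hdfs:///data/events-json")
raw.write.parquet("hdfs:///data/events-parquet")

# Interactive queries then run against the Parquet copy.
events = hive_context.read.parquet("hdfs:///data/events-parquet")
events.registerTempTable("events")
hive_context.sql("SELECT count(*) FROM events").show()
```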