I'm using CDH 5.1.0 and Spark 1.0.0. I'd like to write out data as Snappy-compressed files, but I encountered a problem.
My code is as follows:

    import org.apache.spark.{SparkConf, SparkContext}

    val InputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/text/new.txt"
    val OutputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/compressedText/"

    // Master is the cluster's master URL, defined elsewhere (not shown).
    val sparkConf = new SparkConf()
      .setAppName("compress data")
      .setMaster(Master)
      .set("spark.executor.memory", "8g")
      .set("spark.cores.max", "12")
      .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
    val sc = new SparkContext(sparkConf)

    val rdd = sc.textFile(InputTextFilePath)
    // Write the lines back out, compressed with Hadoop's SnappyCodec.
    rdd.saveAsTextFile(OutputTextFilePath, classOf[org.apache.hadoop.io.compress.SnappyCodec])

    sc.stop()

When I submitted the job, the following exception was thrown:

    14/12/08 21:16:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/12/08 21:16:21 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
    14/12/08 21:16:21 WARN TaskSetManager: Loss was due to java.lang.UnsatisfiedLinkError
    java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
        at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
        at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
        at org.apache.hadoop.io.compress.SnappyCodec.createCompressor(SnappyCodec.java:143)
        at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:98)
        at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:136)
        at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)
        at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:825)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:840)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:840)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

It seems that Spark could not find Snappy's native library. I searched for solutions on the Internet and tried the following:

- Adding "spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec" to */etc/spark/conf.cloudera.spark/spark-defaults.conf*
- Adding the following settings to */etc/spark/conf.cloudera.spark/spark-env.sh*:

    export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native
    export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native/libsnappy.so
    export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/snappy-java-1.0.4.1.jar

Neither approach worked. Can anybody tell me what I should do to read and write Snappy-compressed files in Spark? Any answer is appreciated.
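
Since the failing call in the trace, `buildSupportsSnappy`, is a native method, one way to narrow this down is to check which directories the JVM actually searches for native libraries and whether the expected files are there. The following is only a minimal diagnostic sketch, not a fix. It assumes Linux library file names (`libhadoop.so`, `libsnappy.so`), and the object name `NativeLibCheck` and its helpers are hypothetical names introduced for illustration.

```scala
import java.io.File

object NativeLibCheck {
  // Split a java.library.path-style string into its directory entries.
  def libraryDirs(path: String): Seq[File] =
    path.split(File.pathSeparator).toSeq.filter(_.nonEmpty).map(new File(_))

  // Keep only the directories that contain a file whose name starts with `prefix`.
  // File.list() returns null for a missing or non-directory path, hence the Option.
  def dirsContaining(dirs: Seq[File], prefix: String): Seq[File] =
    dirs.filter { dir =>
      Option(dir.list()).getOrElse(Array.empty[String]).exists(_.startsWith(prefix))
    }

  def main(args: Array[String]): Unit = {
    val dirs = libraryDirs(System.getProperty("java.library.path", ""))
    println("java.library.path entries: " + dirs.mkString(", "))
    // Assumed Linux file names; adjust for other platforms.
    for (lib <- Seq("libhadoop.so", "libsnappy.so"))
      println(lib + " found in: " + dirsContaining(dirs, lib).mkString(", "))
  }
}
```

Run as written, this inspects only the JVM it is started in. The exception above is raised inside a task, so the same two helpers would need to be called from within a Spark task to see what the executors see.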