I'm using CDH 5.1.0 and Spark 1.0.0, and I'd like to write out data
as Snappy-compressed files, but I've encountered a problem.

My code is as follows:

  import org.apache.hadoop.io.compress.SnappyCodec
  import org.apache.spark.{SparkConf, SparkContext}

  val InputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/text/new.txt"
  val OutputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/compressedText/"

  val sparkConf = new SparkConf()
    .setAppName("compress data").setMaster(Master) // Master (the cluster URL) is defined elsewhere
    .set("spark.executor.memory", "8g")
    .set("spark.cores.max", "12")
    .set("spark.io.compression.codec",
      "org.apache.spark.io.SnappyCompressionCodec")

  val sc = new SparkContext(sparkConf)

  // read the plain-text input and write it back out Snappy-compressed
  val rdd = sc.textFile(InputTextFilePath)
  rdd.saveAsTextFile(OutputTextFilePath, classOf[SnappyCodec])

  sc.stop()
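
For the read side, my understanding is that sc.textFile picks the
decompression codec from the file-name extension, so reading the output
back should need no explicit codec (this part is an assumption on my end):

  // assumption: .snappy part files are decompressed transparently on read
  val readBack = sc.textFile(OutputTextFilePath)

It's the write path where things go wrong.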


When I submitted the job, the following exception was thrown:
14/12/08 21:16:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/08 21:16:21 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/12/08 21:16:21 WARN TaskSetManager: Loss was due to java.lang.UnsatisfiedLinkError
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
 at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
 at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
 at org.apache.hadoop.io.compress.SnappyCodec.createCompressor(SnappyCodec.java:143)
 at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:98)
 at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:136)
 at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)
 at org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeToFile$1(PairRDDFunctions.scala:825)
 at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:840)
 at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:840)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

It seems that the Spark executors could not find Snappy's native library.
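
As a sanity check, a snippet like the one below (a minimal sketch built
around the NativeCodeLoader class from the stack trace) should report
whether the JVM running it can see the native library at all:

  import org.apache.hadoop.util.NativeCodeLoader

  // true only if libhadoop was found on java.library.path
  println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded)
  try {
    // throws UnsatisfiedLinkError when the native library is missing,
    // which is exactly what the stack trace above shows
    println("snappy supported: " + NativeCodeLoader.buildSupportsSnappy())
  } catch {
    case e: UnsatisfiedLinkError =>
      println("native snappy not available: " + e)
  }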
I searched for solutions online and tried the following two changes:

Adding "
spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec" to
*/etc/spark/conf.cloudera.spark/spark-defaults.conf*

2. Adding the following exports to */etc/spark/conf.cloudera.spark/spark-env.sh*:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native/libsnappy.so
export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/snappy-java-1.0.4.1.jar

Yet neither works. Can anybody tell me what I should do to be able to
read from and write out Snappy-compressed files in Spark?
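
I also wonder whether pointing the executors at the native library
directly through the SparkConf would be the right approach; if I'm
reading the Spark configuration docs correctly, something like this
sketch (using the same parcel path as above) might do it:

  // assumption: spark.executor.extraLibraryPath adds this directory
  // to the executors' java.library.path
  sparkConf.set("spark.executor.extraLibraryPath",
    "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/native")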

Any answer is appreciated.
