Is this a Scala-only feature?
http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile
On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> For textFile I believe we overload it and let you set a codec directly:
>
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>
> For saveAsSequenceFile, yep, I think Mark is right: you need an Option.
>
> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>
>> The signature is 'def saveAsSequenceFile(path: String, codec:
>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a
>> Class, not an Option[Class].
>>
>> Try counts.saveAsSequenceFile(output,
>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>
>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev
>> <kudryavtsev.konstan...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I've started using Spark recently and am evaluating possible use cases
>>> in our company.
>>>
>>> I'm trying to save an RDD as a compressed sequence file. I'm able to
>>> save a non-compressed file by calling:
>>>
>>> counts.saveAsSequenceFile(output)
>>>
>>> where counts is my RDD of (IntWritable, Text). However, I didn't manage
>>> to compress the output. I tried several configurations and always got a
>>> type-mismatch error:
>>>
>>> counts.saveAsSequenceFile(output,
>>>   classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>> <console>:21: error: type mismatch;
>>>  found   : Class[org.apache.hadoop.io.compress.SnappyCodec]
>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>
>>> counts.saveAsSequenceFile(output,
>>>   classOf[org.apache.spark.io.SnappyCompressionCodec])
>>> <console>:21: error: type mismatch;
>>>  found   : Class[org.apache.spark.io.SnappyCompressionCodec]
>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>
>>> and it doesn't work even for Gzip:
>>>
>>> counts.saveAsSequenceFile(output,
>>>   classOf[org.apache.hadoop.io.compress.GzipCodec])
>>> <console>:21: error: type mismatch;
>>>  found   : Class[org.apache.hadoop.io.compress.GzipCodec]
>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>
>>> Could you please suggest a solution? Also, I didn't find how it is
>>> possible to specify compression parameters (i.e. the compression type
>>> for Snappy). I wonder if you could share code snippets for
>>> writing/reading an RDD with compression?
>>>
>>> Thank you in advance,
>>> Konstantin Kudryavtsev
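[Editor's note] Mark's fix can be sketched in plain Scala without a Spark installation. The trait, class, and method below are hypothetical stand-ins, not Spark's implementation; they only mirror the shape of the signature in question, `codec: Option[Class[_ <: CompressionCodec]] = None`, to show why a bare `Class` fails to type-check and why wrapping it in `Some(...)` fixes it:

```scala
// Hypothetical stand-ins for the Hadoop codec hierarchy.
trait CompressionCodec
class SnappyCodec extends CompressionCodec

object SaveSketch {
  // Mirrors only the signature of RDD.saveAsSequenceFile: the codec
  // parameter is an Option with default None, so callers must pass
  // Some(classOf[...]) rather than classOf[...] directly.
  def saveAsSequenceFile(path: String,
      codec: Option[Class[_ <: CompressionCodec]] = None): String =
    codec match {
      case Some(c) => s"$path (compressed with ${c.getSimpleName})"
      case None    => s"$path (uncompressed)"
    }

  def main(args: Array[String]): Unit = {
    // Does not compile: found Class[SnappyCodec],
    // required Option[Class[_ <: CompressionCodec]]
    // saveAsSequenceFile("out", classOf[SnappyCodec])

    // Omitting the argument uses the default None:
    println(saveAsSequenceFile("out"))
    // Wrapping the Class in Some satisfies the Option parameter:
    println(saveAsSequenceFile("out", Some(classOf[SnappyCodec])))
  }
}
```

The same `Some(classOf[...])` wrapping applies to any codec class, e.g. GzipCodec, as long as it extends the expected CompressionCodec bound.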