Re: Test
Got it. But this doesn't indicate that everyone can receive this test; the mailing list has been unstable recently. Sent from my iPhone5s

On May 10, 2014, at 13:31, Matei Zaharia matei.zaha...@gmail.com wrote:

This message has no content.
Re: Spark output compression on HDFS
There is no compression type setting for Snappy. Sent from my iPhone5s

On April 4, 2014, at 23:06, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:

Can anybody suggest how to change the compression level (Record, Block) for Snappy, if it is possible, of course? Thank you in advance.
Thank you, Konstantin Kudryavtsev

On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote:

Thanks all, it works fine now and I managed to compress the output. However, I am still stuck... How is it possible to set the compression type for Snappy? I mean setting record- or block-level compression for the output.

On Apr 3, 2014 1:15 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Thanks for pointing that out.

On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.com wrote:

First, you shouldn't be using spark.incubator.apache.org anymore, just spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in the Python API at this point.

On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Is this a Scala-only feature?

On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote:

For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
For saveAsSequenceFile, yep, I think Mark is right, you need an Option.

On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra m...@clearstorydata.com wrote:

http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
The signature is 'def saveAsSequenceFile(path: String, codec: Option[Class[_ <: CompressionCodec]] = None)', but you are providing a Class, not an Option[Class]. Try:
counts.saveAsSequenceFile(output, Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))

On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote:

Hi there,

I've started using Spark recently and am evaluating possible use cases in our company. I'm trying to save an RDD as a compressed SequenceFile. I'm able to save a non-compressed file by calling:

counts.saveAsSequenceFile(output)

where counts is my RDD[(IntWritable, Text)]. However, I didn't manage to compress the output. I tried several configurations and always got an exception:

counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
console:21: error: type mismatch;
 found: Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])

counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
console:21: error: type mismatch;
 found: Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])

And it doesn't work even for Gzip:

counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
console:21: error: type mismatch;
 found: Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])

Could you please suggest a solution?
Also, I didn't find how it is possible to specify compression parameters (i.e., the compression type for Snappy). I wondered if you could share code snippets for writing/reading an RDD with compression? Thank you in advance, Konstantin Kudryavtsev
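[Putting the thread's answer together, here is a minimal sketch of writing and reading a Snappy-compressed SequenceFile. It is not taken from the thread itself: the output path is a placeholder, and the use of the old Hadoop property mapred.output.compression.type for block-level compression is an assumption, not something confirmed above.]

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.spark.{SparkConf, SparkContext}

object CompressedSequenceFileExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snappy-seqfile"))

    // Hypothetical output path; adjust to your HDFS location.
    val output = "hdfs:///tmp/counts-snappy"

    // SequenceFile compression type (RECORD vs BLOCK) is a Hadoop-level setting,
    // not a Spark one; the old-API property name is assumed here.
    sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")

    val counts = sc.parallelize(Seq((1, "a"), (2, "b")))
      .map { case (k, v) => (new IntWritable(k), new Text(v)) }

    // The codec class must be wrapped in an Option, as pointed out in the thread.
    counts.saveAsSequenceFile(output, Some(classOf[SnappyCodec]))

    // Reading the compressed SequenceFile back; the codec is detected automatically.
    // Convert Writables to plain values right away, since Hadoop reuses the objects.
    val readBack = sc.sequenceFile(output, classOf[IntWritable], classOf[Text])
      .map { case (k, v) => (k.get, v.toString) }
    readBack.collect().foreach(println)

    sc.stop()
  }
}

Note that Snappy also requires the native Hadoop/Snappy libraries to be available on the cluster nodes.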
Re: Relation between DStream and RDDs
Thanks for sharing here. Sent from my iPhone5s

On March 21, 2014, at 20:44, Sanjay Awatramani sanjay_a...@yahoo.com wrote:

Hi,

I searched more articles and ran a few examples and have clarified my doubts. This answer by TD in another thread (https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ) helped me a lot. Here is the summary of my findings:
1) A DStream can consist of 0, 1, or more RDDs.
2) Even if you have multiple files to be read in a time interval, the DStream will have only 1 RDD.
3) Functions like reduce and count return as many RDDs as there were in the input DStream. Since the internal computation in every batch has only 1 RDD, these functions will return 1 RDD in the returned DStream. On the other hand, if you are using window functions to get more RDDs and run reduce/count on the windowed DStream, your returned DStream will have more than 1 RDD.
Hope this helps someone. Thanks everyone for the answers.
Regards, Sanjay

On Thursday, 20 March 2014 9:30 PM, andy petrella andy.petre...@gmail.com wrote:

I haven't seen an example, but conceptually it looks like you'll need an appropriate structure such as a Monoid. If the computation is not tied to a window, it is an overall computation that has to be grown over time (otherwise it would land in the batch world, see below), and that is the purpose of a Monoid, especially probabilistic sets (to avoid consuming all the memory). If it falls in the batch job's world because you have enough information encapsulated in one conceptual RDD, it might be helpful to have the DStream store it in HDFS, then use the SparkContext within the StreamingContext to run a batch job on the data. But I'm only thinking out loud, so I might be completely wrong.
hth

Andy Petrella
Belgium (Liège)
Data Engineer at NextLab sprl (owner)
Engaged Citizen Coder for WAJUG (co-founder)
Author of Learning Play! Framework 2
Bio: on visify
Mobile: +32 495 99 11 04
Mails: andy.petre...@nextlab.be, andy.petre...@gmail.com
Socials: Twitter: https://twitter.com/#!/noootsab, LinkedIn: http://be.linkedin.com/in/andypetrella, Blogger: http://ska-la.blogspot.com/, GitHub: https://github.com/andypetrella, Masterbranch: https://masterbranch.com/andy.petrella

On Thu, Mar 20, 2014 at 12:18 PM, Pascal Voitot Dev pascal.voitot@gmail.com wrote:

On Thu, Mar 20, 2014 at 11:57 AM, andy petrella andy.petre...@gmail.com wrote:

Also consider creating pairs and using the *byKey* operators; the key will then be the structure used to consolidate or deduplicate your data. my2c

One thing I wonder: imagine I want to sub-divide the RDDs in a DStream into several RDDs, but not according to a time window; I don't see any trivial way to do it...

On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote:

Actually it's quite simple... DStream[T] is a stream of RDD[T]. So applying count on a DStream is just applying count on each RDD of this DStream. At the end of count, you have a DStream[Int] containing the same number of RDDs as before, but each RDD contains just one element: the count result for the corresponding original RDD. For reduce, it's the same, using the reduce operation... The only operations that are a bit more complex are reduceByWindow and countByValueAndWindow, which union RDDs over the time window...

On Thu, Mar 20, 2014 at 9:51 AM, Sanjay Awatramani sanjay_a...@yahoo.com wrote:

@TD: I do not need multiple RDDs in a DStream in every batch. On the contrary, my logic would work fine if there is only 1 RDD.
But then the description of functions like reduce and count ("Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.") left me confused about whether I should account for the fact that a DStream can have multiple RDDs. My streaming code processes a batch every hour. In the 2nd batch, I checked that the DStream contains only 1 RDD, i.e. the 2nd batch's RDD. I verified this using sysout in foreachRDD. Does that mean that the DStream will always contain only 1 RDD?

A DStream creates an RDD for each window corresponding to your batch duration (maybe if there is no data in the current time window, no RDD is created, but I'm not sure about that). So no, there is not one single RDD in a DStream; it just depends on the batch duration and the collected data.

Is there a way to access the RDD of the 1st batch in the 2nd batch? The 1st batch may contain some records which were not relevant to the first batch and are to be processed in the 2nd batch. I know I can use the sliding window mechanism of streaming, but if I'm not using it and there is no way to access the previous batch's RDD, then it means that functions like count will always return a DStream containing only 1 RDD,
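[A small sketch, not from the thread, illustrating both points under discussion: count producing one single-element RDD per batch, and updateStateByKey as one way to carry information from earlier batches into later ones without using a window. The socket source, port, and checkpoint directory are placeholders.]

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object DStreamCountExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-count").setMaster("local[2]")
    // One batch (and therefore one RDD per DStream) every 10 seconds.
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/dstream-count-checkpoint") // required for stateful operations

    // Hypothetical source: a socket stream with one word per line.
    val lines = ssc.socketTextStream("localhost", 9999)

    // count() runs per batch: the resulting DStream has one single-element
    // RDD per batch interval, holding that batch's record count.
    lines.count().foreachRDD { rdd =>
      rdd.foreach(c => println(s"records in this batch: $c"))
    }

    // To carry information from earlier batches into later ones without a
    // window, keep it as keyed state with updateStateByKey.
    val runningCounts = lines
      .map(word => (word, 1L))
      .updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum)
      }
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}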