Re: Test
Got it. But that doesn't indicate everyone can receive this test; the mailing list has been unstable recently. Sent from my iPhone 5s > On May 10, 2014, at 13:31, Matei Zaharia wrote: > > This message has no content.
Re: Ooyala Server - plans to merge it into Apache ?
Hi, good to know. Can the Ooyala Spark job server run on YARN? Is there a job scheduler? On Fri, Apr 18, 2014 at 12:12 PM, All In A Days Work < allinadays...@gmail.com> wrote: > Hi, > > At the 2013 Spark Summit, Ooyala presented their Spark job server and > indicated that they wanted to open source the work. > > Is there any plan to merge this functionality into the Spark offering > itself, rather than offering it only as Ooyala's open-sourced version? > > Thanks, >
Re: sbt assembly error
It is just a network issue; you have restricted network access in China. On Thu, Apr 17, 2014 at 2:27 PM, Sean Owen wrote: > The error is "Connection timed out", not 404. The build references > many repos, and only one will contain any given artifact. You are > seeing it fail through trying many different repos, many of which > don't even have the artifact either, but that's not the underlying > cause. > > FWIW I can build the assembly just fine and I suspect there is no > noise on this since others can too. > > wget does not have the same proxy settings that SBT / Maven would use, > and the issue may be with HTTPS connections. > > That said, it could be a transient network problem. I was going to > suggest that maybe something in the SBT build changed recently -- it's > a little weird that it's trying snapshot repos? -- but you are looking > at a very recent 0.9.1. > > Try HEAD to confirm? > > > On Wed, Apr 16, 2014 at 11:36 PM, Yiou Li wrote: > > Hi Sean, > > > > It's true that sbt is trying different links, but ALL of them have > > connection issues (which are actually 404 File Not Found errors), and the > build > > process takes forever connecting to different links. > > > > I don't think it's a proxy issue because my other projects (other than > > Spark) build fine and I can wget things like www.yahoo.com. > > > > I am surprised that there is not much noise raised on this issue, at > > least on the most recent Spark release. > > > > Best, > > Leo > > >
Re: Spark output compression on HDFS
There is no compression type option for Snappy. Sent from my iPhone 5s > On April 4, 2014, at 23:06, Konstantin Kudryavtsev > wrote: > > Can anybody suggest how to change the compression type (Record, Block) for > Snappy, if it is possible? > > thank you in advance > > Thank you, > Konstantin Kudryavtsev > > >> On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev >> wrote: >> Thanks all, it works fine now and I managed to compress the output. However, I >> am still stuck... How is it possible to set the compression type for Snappy? >> I mean setting record- or block-level compression for the output >> >>> On Apr 3, 2014 1:15 AM, "Nicholas Chammas" >>> wrote: >>> Thanks for pointing that out. >>> >>> On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra wrote: First, you shouldn't be using spark.incubator.apache.org anymore, just spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in the Python API at this point. > On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas > wrote: > Is this a Scala-only feature? > > >> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell >> wrote: >> For textFile I believe we overload it and let you set a codec directly: >> >> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 >> >> For saveAsSequenceFile, yep, I think Mark is right: you need an Option. >> >> >>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra >>> wrote: >>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option >>> >>> The signature is 'def saveAsSequenceFile(path: String, codec: >>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a >>> Class, not an Option[Class]. >>> >>> Try counts.saveAsSequenceFile(output, >>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec])) >>> >>> >>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev wrote: Hi there, I've started using Spark recently and am evaluating possible use cases in our company. I'm trying to save an RDD as a compressed SequenceFile. I'm able to save a non-compressed file by calling: counts.saveAsSequenceFile(output) where counts is my RDD (IntWritable, Text). However, I didn't manage to compress the output. I tried several configurations and always got an exception: counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec]) :21: error: type mismatch; found : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec]) required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]] counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec]) counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec]) :21: error: type mismatch; found : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec]) required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]] counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec]) and it doesn't work even for Gzip: counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec]) :21: error: type mismatch; found : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec]) required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]] counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec]) Could you please suggest a solution? Also, I couldn't find how to specify compression parameters (i.e. the compression type for Snappy).
I wondered if you could share code snippets for writing/reading an RDD with compression? Thank you in advance, Konstantin Kudryavtsev >
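A minimal sketch of the fix suggested above (wrapping the codec class in Some(...)), written against the Spark 1.x-era Scala API used in this thread. The sample pair RDD, the local master, and the /tmp output path are made up for illustration; SnappyCodec also needs the native Snappy libraries on the machine, so GzipCodec can be substituted if they are missing.

import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit Writable conversions for pair RDDs (Spark 1.x)

object CompressedSequenceFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snappy-seqfile").setMaster("local[2]"))

    // Hypothetical pair RDD; (String, Int) is converted to (Text, IntWritable) implicitly.
    val counts = sc.parallelize(Seq(("spark", 1), ("hdfs", 2)))

    // saveAsSequenceFile expects Option[Class[_ <: CompressionCodec]],
    // so the codec class must be wrapped in Some(...).
    counts.saveAsSequenceFile("/tmp/counts-snappy", Some(classOf[SnappyCodec]))

    // Read the compressed sequence file back; the codec is detected automatically.
    val reloaded = sc.sequenceFile[String, Int]("/tmp/counts-snappy")
    println(reloaded.collect().mkString(", "))

    sc.stop()
  }
}

Note that this only selects the codec; the record- vs block-level question above is a Hadoop SequenceFile setting (mapred.output.compression.type / SequenceFileOutputFormat.setOutputCompressionType) rather than a parameter of saveAsSequenceFile itself, if I'm reading the Hadoop API correctly.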
Re: Relation between DStream and RDDs
Thanks for sharing here. Sent from my iPhone 5s > On March 21, 2014, at 20:44, Sanjay Awatramani wrote: > > Hi, > > I searched more articles and ran a few examples, which clarified my doubts. > This answer by TD in another thread ( > https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ ) helped > me a lot. > > Here is a summary of my findings: > 1) A DStream can consist of 0, 1, or more RDDs. > 2) Even if you have multiple files to be read in a time interval, the DStream > will have only 1 RDD. > 3) Functions like reduce & count return as many RDDs as there were in > the input DStream. However, the internal computation in every batch will have > only 1 RDD, so these functions will return 1 RDD in the returned DStream. > However, if you are using window functions to get more RDDs and run > reduce/count on the windowed DStream, your returned DStream will have more > than 1 RDD. > > Hope this helps someone. > Thanks everyone for the answers. > > Regards, > Sanjay > > > On Thursday, 20 March 2014 9:30 PM, andy petrella > wrote: > I don't have an example at hand, but conceptually it looks like you'll need a suitable > structure like a Monoid. I mean, if it's not tied to a window, it's > an overall computation that has to be grown over time (otherwise it would > land in the batch world, see below), and that is the purpose of a Monoid, > and especially probabilistic sets (to avoid consuming all the memory). > > If it falls into the batch job's world because you have enough information > encapsulated in one conceptual RDD, it might be helpful to have the DStream > store it in HDFS, then use the SparkContext within the StreamingContext to > run a batch job on the data. > > But I'm only thinking out loud, so I might be completely wrong. > > hth > > Andy Petrella > Belgium (Liège) > > Data Engineer in NextLab sprl (owner) > Engaged Citizen Coder for WAJUG (co-founder) > Author of Learning Play! Framework 2 > Bio: on visify > > Mobile: +32 495 99 11 04 > Mails: > andy.petre...@nextlab.be > andy.petre...@gmail.com > > Socials: > Twitter: https://twitter.com/#!/noootsab > LinkedIn: http://be.linkedin.com/in/andypetrella > Blogger: http://ska-la.blogspot.com/ > GitHub: https://github.com/andypetrella > Masterbranch: https://masterbranch.com/andy.petrella > > > On Thu, Mar 20, 2014 at 12:18 PM, Pascal Voitot Dev > wrote: > > > > On Thu, Mar 20, 2014 at 11:57 AM, andy petrella > wrote: > also consider creating pairs and using the *byKey* operators; then the key will > be the structure used to consolidate or deduplicate your data > my2c > > > One thing I wonder: imagine I want to sub-divide the RDDs in a DStream into > several RDDs, but not according to a time window; I don't see any trivial way to > do it... > > > > On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev > wrote: > Actually it's quite simple... > > DStream[T] is a stream of RDD[T]. > So applying count on a DStream just applies count to each RDD of this > DStream. > So at the end of count, you have a DStream[Int] containing the same number of > RDDs as before, but each RDD contains just one element: the count result > for the corresponding original RDD. > > For reduce, it's the same, using the reduce operation... > > The only operations that are a bit more complex are reduceByWindow & > countByValueAndWindow, which union RDDs over the time window... > > On Thu, Mar 20, 2014 at 9:51 AM, Sanjay Awatramani > wrote: > @TD: I do not need multiple RDDs in a DStream in every batch.
On the contrary, > my logic would work fine if there is only 1 RDD. But then the description of > functions like reduce & count (Return a new DStream of single-element RDDs by > counting the number of elements in each RDD of the source DStream.) left me > confused about whether I should account for the fact that a DStream can have > multiple RDDs. My streaming code processes a batch every hour. In the 2nd > batch, I checked that the DStream contains only 1 RDD, i.e. the 2nd batch's > RDD. I verified this using sysout in foreachRDD. Does that mean that the > DStream will always contain only 1 RDD? > > A DStream creates an RDD for each window corresponding to your batch duration > (maybe if there is no data in the current time window, no RDD is created, but > I'm not sure about that). > So no, there is not one single RDD in a DStream; it just depends on the batch > duration and the collected data. > > > Is there a way to access the RDD of the 1st batch in the 2nd batch? The 1st > batch may contain some records which were not relevant to the first batch and > are to be processed in the 2nd batch. I know I can use the sliding window > mechanism of streaming, but if I'm not using it and there is no way to access > the previous batch's RDD, then it means that functions like count will always > return a DStream containing only 1 RDD, am I correct?
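To make the per-batch behaviour discussed above concrete, here is a minimal sketch against the Spark 1.x-era Streaming API; the 60-second batch interval, the local master, and the socket source on localhost:9999 are hypothetical. Each batch interval yields one RDD in the input DStream; count() turns it into a single-element RDD holding that batch's count, while window() unions RDDs across batches before counting.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-count-sketch").setMaster("local[2]")
    // One batch, and hence one RDD per input DStream, every 60 seconds.
    val ssc = new StreamingContext(conf, Seconds(60))

    // Hypothetical source: text lines arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // count() produces a DStream with one single-element RDD per batch:
    // the element is the number of lines received in that batch.
    val perBatchCounts = lines.count()
    perBatchCounts.foreachRDD { rdd =>
      rdd.collect().foreach(n => println(s"lines in this batch: $n"))
    }

    // window() unions the RDDs of the last 3 batches, so this count covers
    // 180 seconds of data and slides forward every 60 seconds.
    val windowedCounts = lines.window(Seconds(180), Seconds(60)).count()
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

This sketch does not solve the question of reaching back into the previous batch's RDD; without windowing, the records would have to be carried forward explicitly (e.g. with updateStateByKey or by writing them to external storage).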
Re: Spark in YARN HDP problem
Can you check the container logs and paste them here? On Feb 26, 2014 7:07 PM, "aecc" wrote: > The error happens after the application is accepted: > > 14/02/26 12:05:08 INFO YarnClientImpl: Submitted application > application_1390483691679_0157 to ResourceManager at X:8050 > 3483 [main] INFO > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application > report from ASM: > appMasterRpcPort: 0 > appStartTime: 1393412708680 > yarnAppState: ACCEPTED > > 4484 [main] INFO > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application > report from ASM: > appMasterRpcPort: 0 > appStartTime: 1393412708680 > yarnAppState: ACCEPTED > > 5485 [main] INFO > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application > report from ASM: > appMasterRpcPort: 0 > appStartTime: 1393412708680 > yarnAppState: ACCEPTED > > 6487 [main] INFO > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application > report from ASM: > appMasterRpcPort: 0 > appStartTime: 1393412708680 > yarnAppState: FAILED > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-in-YARN-HDP-problem-tp2041p2072.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. >