Re: Test

2014-05-11 Thread Azuryy
Got it.

But that doesn't mean everyone can receive this test.

The mailing list has been unstable recently.


Sent from my iPhone5s

> On May 10, 2014, at 13:31, Matei Zaharia wrote:
> 
> This message has no content.


Re: Ooyala Server - plans to merge it into Apache ?

2014-04-17 Thread Azuryy Yu
Hi,
Good to know. Can the Ooyala Spark job server run on YARN? Is there a job
scheduler?


On Fri, Apr 18, 2014 at 12:12 PM, All In A Days Work <
allinadays...@gmail.com> wrote:

> Hi,
>
> At the 2013 Spark Summit, Ooyala presented their Spark job server and
> indicated that they wanted to open source the work.
>
> Is there any plan to merge this functionality into Spark itself, rather
> than offering it only as Ooyala's open-sourced version?
>
> Thanks,
>


Re: sbt assembly error

2014-04-16 Thread Azuryy Yu
It is only a network issue. You have limited network access in China.


On Thu, Apr 17, 2014 at 2:27 PM, Sean Owen  wrote:

> The error is "Connection timed out", not 404. The build references
> many repos, and only one will contain any given artifact. You are
> seeing it fail through trying many different repos, many of which
> don't even have the artifact either, but that's not the underlying
> cause.
>
> FWIW I can build the assembly just fine and I suspect there is no
> noise on this since others can too.
>
> wget does not have the same proxy settings that SBT / Maven would use,
> and the issue may be HTTPS connections.
>
> That said it could be a transient network problem. I was going to
> suggest that maybe something in the SBT build changed recently -- it's
> a little weird that it's trying snapshot repos? -- but you are looking
> at a very recent 0.9.1.
>
> Try HEAD to confirm?
>
>
> On Wed, Apr 16, 2014 at 11:36 PM, Yiou Li  wrote:
> > Hi Sean,
> >
> > It's true that sbt is trying different links, but ALL of them have
> > connection issues (actually 404 File Not Found errors), and the build
> > process takes forever connecting to the different repositories.
> >
> > I don't think it's a proxy issue, because my other projects (other than
> > Spark) build fine and I can wget things like www.yahoo.com.
> >
> > I am surprised that there is not much noise raised about this issue, at
> > least on the most recent Spark release.
> >
> > Best,
> > Leo
> >
>
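
For reference, a minimal sketch of how the repository and proxy settings Sean mentions can be supplied to sbt; the mirror URL and proxy host below are placeholders, not values from this thread.

    // build.sbt (sketch): point sbt at a reachable Maven mirror when the default
    // repositories are blocked. The URL is a placeholder.
    resolvers += "reachable-mirror" at "https://mirror.example.com/maven2"

    // Proxy settings are JVM system properties rather than build settings; they
    // are usually passed on the sbt command line or via SBT_OPTS, e.g.:
    //   -Dhttp.proxyHost=proxy.example.com  -Dhttp.proxyPort=8080
    //   -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080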


Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compression type setting for Snappy.


Sent from my iPhone5s

> On April 4, 2014, at 23:06, Konstantin Kudryavtsev wrote:
> 
> Can anybody suggest how to change the compression type (Record, Block) for
> Snappy, if it is possible, of course?
> 
> thank you in advance
> 
> Thank you,
> Konstantin Kudryavtsev
> 
> 
>> On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev 
>>  wrote:
>> Thanks all, it works fine now and I managed to compress the output. However, I
>> am still stuck... How is it possible to set the compression type for Snappy?
>> I mean setting record- or block-level compression for the output.
>> 
>>> On Apr 3, 2014 1:15 AM, "Nicholas Chammas"  
>>> wrote:
>>> Thanks for pointing that out.
>>> 
>>> 
 On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra  
 wrote:
 First, you shouldn't be using spark.incubator.apache.org anymore, just 
 spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist in 
 the Python API at this point. 
 
 
> On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas 
>  wrote:
> Is this a Scala-only feature?
> 
> 
>> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell  
>> wrote:
>> For textFile I believe we overload it and let you set a codec directly:
>> 
>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>> 
>> For saveAsSequenceFile yep, I think Mark is right, you need an option.
>> 
>> 
>>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra  
>>> wrote:
>>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>> 
>>> The signature is 'def saveAsSequenceFile(path: String, codec: 
>>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a 
>>> Class, not an Option[Class].  
>>> 
>>> Try counts.saveAsSequenceFile(output, 
>>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>> 
>>> 
>>> 
 On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev 
  wrote:
Hi there,

I've started using Spark recently and am evaluating possible use cases in
our company.

I'm trying to save an RDD as a compressed SequenceFile. I'm able to save a
non-compressed file by calling:

counts.saveAsSequenceFile(output)

where counts is my RDD of (IntWritable, Text). However, I didn't manage to
compress the output. I tried several configurations and always got an
exception:
counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
:21: error: type mismatch;
 found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
       counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])

counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
:21: error: type mismatch;
 found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
       counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
and it doesn't work even for Gzip:

counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
:21: error: type mismatch;
 found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
 required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
       counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
Could you please suggest a solution? Also, I didn't find how it is possible
to specify compression parameters (i.e. the compression type for Snappy). I
wondered if you could share code snippets for writing/reading an RDD with
compression.
 
 Thank you in advance,
 
 Konstantin Kudryavtsev
> 
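
Following Mark's suggestion above, here is a minimal Scala sketch of the working calls, assuming a spark-shell session where sc is available and that the native Snappy libraries are installed; the output paths are placeholders.

    import org.apache.spark.SparkContext._   // adds saveAsSequenceFile to pair RDDs of Writables (needed on Spark 0.9/1.x)
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.io.compress.{GzipCodec, SnappyCodec}

    // A small RDD[(IntWritable, Text)] standing in for Konstantin's `counts`
    val counts = sc.parallelize(Seq((1, "a"), (2, "b")))
      .map { case (k, v) => (new IntWritable(k), new Text(v)) }

    // The codec parameter is an Option[Class[_ <: CompressionCodec]],
    // so wrap the codec class in Some(...):
    counts.saveAsSequenceFile("/tmp/counts-snappy", Some(classOf[SnappyCodec]))  // needs native Snappy
    counts.saveAsSequenceFile("/tmp/counts-gzip", Some(classOf[GzipCodec]))

    // For text output the codec class is passed directly, not wrapped in an Option,
    // as Patrick notes above:
    counts.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("/tmp/counts-text-gz", classOf[GzipCodec])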


Re: Relation between DStream and RDDs

2014-03-21 Thread Azuryy
Thanks for sharing here.

Sent from my iPhone5s

> On March 21, 2014, at 20:44, Sanjay Awatramani wrote:
> 
> Hi,
> 
> I searched more articles and ran a few examples, and have clarified my doubts.
> This answer by TD in another thread ( 
> https://groups.google.com/d/msg/spark-users/GQoxJHAAtX4/0kiRX0nm1xsJ ) helped 
> me a lot.
> 
> Here is a summary of my findings:
> 1) A DStream can consist of 0, 1, or more RDDs.
> 2) Even if you have multiple files to be read in a time interval, the DStream
> will have only 1 RDD.
> 3) Functions like reduce & count return as many RDDs as there were in the
> input DStream. Since the internal computation in every batch involves only 1
> RDD, these functions will return 1 RDD per batch in the returned DStream.
> However, if you are using window functions to get more RDDs and run
> reduce/count on the windowed DStream, the returned DStream will have more
> than 1 RDD.
> 
> Hope this helps someone.
> Thanks everyone for the answers.
> 
> Regards,
> Sanjay
> 
> 
> On Thursday, 20 March 2014 9:30 PM, andy petrella  
> wrote:
> I don't see an example, but conceptually it looks like you'll need a suitable
> structure like a Monoid. If it's not tied to a window, it's an overall
> computation that has to be updated over time (otherwise it would land in the
> batch world, see below), and that is the purpose of a Monoid, especially
> probabilistic sets (to avoid consuming all the memory).
> 
> If it falls into the batch job world because you have enough information
> encapsulated in one conceptual RDD, it might be helpful to have the DStream
> store it in HDFS, and then use the SparkContext within the StreamingContext
> to run a batch job on the data.
> 
> But I'm only thinking out loud, so I might be completely wrong.
> 
> hth
> 
> Andy Petrella
> Belgium (Liège)
>
>  Data Engineer in NextLab sprl (owner)
>  Engaged Citizen Coder for WAJUG (co-founder)
>  Author of Learning Play! Framework 2
>  Bio: on visify
>
> Mobile: +32 495 99 11 04
> Mails:  
> andy.petre...@nextlab.be
> andy.petre...@gmail.com
>
> Socials:
> Twitter: https://twitter.com/#!/noootsab
> LinkedIn: http://be.linkedin.com/in/andypetrella
> Blogger: http://ska-la.blogspot.com/
> GitHub:  https://github.com/andypetrella
> Masterbranch: https://masterbranch.com/andy.petrella
> 
> 
> On Thu, Mar 20, 2014 at 12:18 PM, Pascal Voitot Dev 
>  wrote:
> 
> 
> 
> On Thu, Mar 20, 2014 at 11:57 AM, andy petrella  
> wrote:
> Also consider creating pairs and using *byKey* operators; then the key will
> be the structure used to consolidate or deduplicate your data.
> my2c
> 
> 
> One thing I wonder: imagine I want to sub-divide the RDDs in a DStream into
> several RDDs, but not according to a time window; I don't see any trivial way
> to do it...
>  
> 
> 
> On Thu, Mar 20, 2014 at 11:50 AM, Pascal Voitot Dev 
>  wrote:
> Actually it's quite simple...
> 
> DStream[T] is a stream of RDD[T].
> So applying count on a DStream is just applying count to each RDD of the
> DStream.
> So at the end of count, you have a DStream[Long] containing the same number
> of RDDs as before, but each RDD contains just one element: the count result
> for the corresponding original RDD.
> 
> For reduce, it's the same, using the reduce operation...
> 
> The only operations that are a bit more complex are reduceByWindow &
> countByValueAndWindow, which union RDDs over the time window...
> 
> On Thu, Mar 20, 2014 at 9:51 AM, Sanjay Awatramani  
> wrote:
> @TD: I do not need multiple RDDs in a DStream in every batch. On the contrary,
> my logic would work fine if there is only 1 RDD. But the description for
> functions like reduce & count ("Return a new DStream of single-element RDDs by
> counting the number of elements in each RDD of the source DStream.") left me
> confused about whether I should account for the fact that a DStream can have
> multiple RDDs. My streaming code processes a batch every hour. In the 2nd
> batch, I checked that the DStream contains only 1 RDD, i.e. the 2nd batch's
> RDD. I verified this using a sysout in foreachRDD. Does that mean that the
> DStream will always contain only 1 RDD?
> 
> A DStream creates an RDD for each interval corresponding to your batch
> duration (maybe if there is no data in the current time window, no RDD is
> created, but I'm not sure about that).
> So no, there is not one single RDD in a DStream; it just depends on the batch
> duration and the collected data.
> 
>  
> Is there a way to access the RDD of the 1st batch in the 2nd batch? The 1st
> batch may contain some records which were not relevant to the first batch and
> are to be processed in the 2nd batch. I know I can use the sliding window
> mechanism of streaming, but if I'm not using it and there is no way to access
> the previous batch's RDD, then it means that functions like count will always
> return a DStream containing only 1 RDD, am I correct?
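
To make the points above concrete, here is a minimal Spark Streaming sketch in Scala (host, port, and durations below are placeholders, not values from this thread): count produces one single-element RDD per batch, a *byKey* operator consolidates records within each batch as Andy suggests, and counting on a windowed DStream spans several of the original batches.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations such as reduceByKey

    object DStreamCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("dstream-count-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))   // one RDD per 10-second batch

        // Input DStream from a socket (placeholder host and port)
        val lines = ssc.socketTextStream("localhost", 9999)

        // count() yields a DStream whose RDDs each hold a single element:
        // the number of records in the corresponding batch.
        lines.count().print()

        // A *byKey* operator consolidates duplicate lines within each batch.
        lines.map(line => (line, 1)).reduceByKey(_ + _).print()

        // Running count on a windowed DStream covers the last 30 seconds of
        // batches, sliding every 10 seconds.
        lines.window(Seconds(30), Seconds(10)).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }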

Re: Spark in YARN HDP problem

2014-02-26 Thread Azuryy Yu
Can you check the container logs and paste them here?
 On Feb 26, 2014 7:07 PM, "aecc"  wrote:

> The error happens after the application is accepted:
>
> 14/02/26 12:05:08 INFO YarnClientImpl: Submitted application
> application_1390483691679_0157 to ResourceManager at X:8050
> 3483 [main] INFO
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application
> report from ASM:
>  appMasterRpcPort: 0
>  appStartTime: 1393412708680
>  yarnAppState: ACCEPTED
>
> 4484 [main] INFO
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application
> report from ASM:
>  appMasterRpcPort: 0
>  appStartTime: 1393412708680
>  yarnAppState: ACCEPTED
>
> 5485 [main] INFO
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application
> report from ASM:
>  appMasterRpcPort: 0
>  appStartTime: 1393412708680
>  yarnAppState: ACCEPTED
>
> 6487 [main] INFO
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Application
> report from ASM:
>  appMasterRpcPort: 0
>  appStartTime: 1393412708680
>  yarnAppState: FAILED
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-in-YARN-HDP-problem-tp2041p2072.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>