Re: Livy Failed error on Yarn with Spark

2018-05-24 Thread Jeff Zhang
Could you check the spark app's yarn log and livy log? Chetan Khatri wrote on Thu, May 10, 2018 at 4:18 AM: > All, > > I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0, when I > am running the same spark job using spark-submit it is getting success with all >
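For the YARN side, the aggregated container logs can be pulled with the `yarn logs` CLI once the app has finished (a command fragment; the application id and the Livy log path are placeholders — the actual path varies by distribution):

```shell
# Fetch aggregated container logs for the failed Spark app
# (requires yarn.log-aggregation-enable=true on the cluster)
yarn logs -applicationId application_1526000000000_0001 > app.log

# Livy server log; location is distribution-specific, e.g. on HDP it is
# commonly somewhere under /var/log/livy*/ (illustrative path):
tail -n 200 /var/log/livy/livy-server.log
```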

Re: [Spark] Supporting python 3.5?

2018-05-24 Thread Jeff Zhang
It supports python 3.5, and IIRC, spark also supports python 3.6. Irving Duran wrote on Thu, May 10, 2018 at 9:08 PM: > Does spark now support python 3.5 or is it just 3.4.x? > > https://spark.apache.org/docs/latest/rdd-programming-guide.html > > Thank You, > > Irving Duran >
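To pin PySpark to a specific interpreter, the standard environment variables can be set in `conf/spark-env.sh` or in the shell before submitting (a config fragment; the interpreter paths are illustrative):

```shell
# Interpreter used by the executors
export PYSPARK_PYTHON=/usr/bin/python3.5
# Interpreter used by the driver (defaults to PYSPARK_PYTHON if unset)
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.5
```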

Re: Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-24 Thread Jeff Zhang
I don't think it is possible to have less than 1 core for the AM; this is due to yarn, not spark. The number of AMs compared to the number of executors should be small and acceptable. If you do want to save more resources, I would suggest using yarn cluster mode, where the driver and AM run in the
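The two deploy modes can be contrasted with a submit fragment (the class and jar names are hypothetical; in client mode the separate AM container's cores come from `spark.yarn.am.cores`, which defaults to 1):

```shell
# Client mode: driver runs locally, and YARN allocates a separate AM container
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp my-app.jar

# Cluster mode: the driver runs inside the AM container itself,
# so no extra AM-only container is consumed
spark-submit --master yarn --deploy-mode cluster \
  --driver-cores 1 --driver-memory 2g \
  --class com.example.MyApp my-app.jar
```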

Re: help with streaming batch interval question needed

2018-05-24 Thread Peter Liu
Hi there, from the apache spark streaming docs (see links below): - the batch interval is set when a spark StreamingContext is constructed (see example (a) quoted below) - the StreamingContext is available in both older and newer Spark versions (v1.6, v2.2 to v2.3.0) (see
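A minimal DStream setup showing where the batch interval goes (a sketch, not runnable without a Spark install and a socket source; the `5` is the batch interval in seconds, fixed at construction time):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "BatchIntervalDemo")
ssc = StreamingContext(sc, 5)  # batch interval: one micro-batch every 5 seconds

# Hypothetical source: text lines from a local socket
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()  # print the record count of each 5-second batch

ssc.start()
ssc.awaitTermination()
```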

Re: [Beginner][StructuredStreaming] Using Spark aggregation - WithWatermark on old data

2018-05-24 Thread karthikjay
My data looks like this: { "ts2" : "2018/05/01 00:02:50.041", "serviceGroupId" : "123", "userId" : "avv-0", "stream" : "", "lastUserActivity" : "00:02:50", "lastUserActivityCount" : "0" } { "ts2" : "2018/05/01 00:09:02.079", "serviceGroupId" : "123", "userId" : "avv-0",
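A watermarked aggregation over records of this shape could be sketched as follows (assumptions: the column names and timestamp format are taken from the sample above, but the input path, window width, and watermark delay are illustrative; not runnable without a Spark install):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, window

spark = SparkSession.builder.appName("WatermarkDemo").getOrCreate()

events = (spark.readStream.format("json")
          .schema("ts2 STRING, serviceGroupId STRING, userId STRING, "
                  "stream STRING, lastUserActivity STRING, lastUserActivityCount STRING")
          .load("/data/events"))  # hypothetical input path

agg = (events
       # parse the "2018/05/01 00:02:50.041" string into an event-time column
       .withColumn("eventTime", to_timestamp(col("ts2"), "yyyy/MM/dd HH:mm:ss.SSS"))
       # state for windows older than (max event time - 10 min) is dropped;
       # old data arriving behind the watermark is ignored, which matters
       # when replaying historical data
       .withWatermark("eventTime", "10 minutes")
       .groupBy(window(col("eventTime"), "5 minutes"), col("userId"))
       .count())
```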

Re: Time series data

2018-05-24 Thread Vadim Semenov
Yeah, it depends on what you want to do with that timeseries data. We at Datadog process trillions of points daily using Spark. I cannot really go into what exactly we do with the data, but I can say that Spark can handle the volume, scale well and be fault-tolerant, albeit everything I said

Streaming : WAL ignored

2018-05-24 Thread Walid Lezzar
Hi, I have a spark streaming application running on yarn that consumes from a JMS source. I have checkpointing and the WAL enabled to ensure zero data loss. However, when I suddenly kill my application and restart it, sometimes it recovers the data from the WAL but sometimes it doesn't! In
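For checkpoint-based recovery to work, the documented pattern is `StreamingContext.getOrCreate` with *all* DStream setup inside the factory function; a common pitfall is recreating the context unconditionally on restart, which silently starts fresh. A sketch (the checkpoint path is hypothetical; not runnable without a Spark install):

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///checkpoints/my-app"  # hypothetical path

def create_context():
    conf = (SparkConf().setAppName("WALDemo")
            # the WAL must be enabled before the context exists
            .set("spark.streaming.receiver.writeAheadLog.enable", "true"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint(CHECKPOINT_DIR)
    # ...define all sources and transformations here, not outside...
    return ssc

# On a clean start this calls create_context(); after a kill/restart it
# rebuilds the context (and replays WAL data) from the checkpoint instead
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```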

Positive log-likelihood with Gaussian mixture

2018-05-24 Thread Simon Dirmeier
Dear all, I am fitting a very trivial GMM with 2-10 components on 100 samples and 5 features in pyspark and observe some of the log-likelihoods being positive (see below). I don't understand how this is possible. Is this a bug or intended behaviour? Furthermore, for different seeds,
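A positive log-likelihood is actually possible for continuous densities: the likelihood is a density value, not a probability, so it can exceed 1 when a component's variance is small, making its log positive. A pure-Python illustration with a univariate Gaussian:

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log of the univariate normal density N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# With a small standard deviation, the density at the mean exceeds 1...
print(math.exp(gaussian_logpdf(0.0, 0.0, 0.1)))  # ≈ 3.989
# ...so the log-density (and a sum of such terms) is positive
print(gaussian_logpdf(0.0, 0.0, 0.1))            # ≈ 1.384
# With sigma = 1 the density stays below 1 and the log is negative
print(gaussian_logpdf(0.0, 0.0, 1.0))            # ≈ -0.919
```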

Re: Time series data

2018-05-24 Thread Jörn Franke
There is not one answer to this. It really depends what kind of time series analysis you do with the data and what time series database you are using. Then it also depends what ETL you need to do. You seem to also need to join data - is it with existing data of the same type, or do you join

Time series data

2018-05-24 Thread amin mohebbi
Could you please help me to understand the performance that we get from using spark with any nosql or TSDB? We receive 1 mil meters x 288 readings = 288 mil rows (approx. 360 GB per day) – therefore, we will end up with 10's or 100's of TBs of data, and I feel that NoSQL will be much quicker
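The volume arithmetic in the question, written out (the ~1.25 KB/row figure is simply what 360 GB / 288 mil rows implies, not a stated fact):

```python
meters = 1_000_000
readings_per_day = 288          # one reading every 5 minutes
rows_per_day = meters * readings_per_day
print(rows_per_day)             # 288,000,000 rows/day, matching the estimate

gb_per_day = 360
bytes_per_row = gb_per_day * 10**9 / rows_per_day
print(bytes_per_row)            # 1250.0 implied bytes per row

tb_per_year = gb_per_day * 365 / 1000
print(tb_per_year)              # 131.4 TB/year, i.e. 100+ TB within a year
```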