Re: Livy Failed error on Yarn with Spark

2018-05-24 Thread Jeff Zhang
Could you check the Spark app's YARN log and the Livy log?


Chetan Khatri wrote on Thursday, May 10, 2018 at 4:18 AM:

> All,
>
> I am running on Hortonworks HDP Hadoop with Livy and Spark 2.2.0. When I
> run the same Spark job using spark-submit, it succeeds and all
> transformations complete.
>
> When I submit the same job through Livy, the Spark job is invoked and
> succeeds, but the YARN status says FAILED. When I look at the logs for that
> attempt, they say SUCCESS and there is no
> error log.
>
> Has anyone faced this weird experience?
>
> Thank you.
>


Re: [Spark] Supporting python 3.5?

2018-05-24 Thread Jeff Zhang
It supports Python 3.5, and IIRC Spark also supports Python 3.6.

Irving Duran wrote on Thursday, May 10, 2018 at 9:08 PM:

> Does Spark now support Python 3.5, or is it just 3.4.x?
>
> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>
> Thank You,
>
> Irving Duran
>


Re: Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-24 Thread Jeff Zhang
I don't think it is possible to allocate less than 1 core for the AM; this is
a YARN limitation, not a Spark one.

The number of AMs compared to the number of executors should be small and
acceptable. If you do want to save more resources, I would suggest using yarn
cluster mode, where the driver and the AM run in the same process.

You can use either Livy or Zeppelin, both of which support interactive work in
yarn cluster mode.

http://livy.incubator.apache.org/
https://zeppelin.apache.org/
https://medium.com/@zjffdu/zeppelin-0-8-0-new-features-ea53e8810235
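
For example, a minimal sketch (placeholder host; Livy's default port 8998; the
submitted code is just an illustration) of starting an interactive PySpark
session through Livy's REST API and running a statement in it:

import json
import requests

livy = "http://livy-host:8998"   # hypothetical Livy endpoint
headers = {"Content-Type": "application/json"}

# Create an interactive PySpark session. With Livy configured for yarn
# cluster mode, the driver runs inside the YARN application master.
session = requests.post(livy + "/sessions",
                        data=json.dumps({"kind": "pyspark"}),
                        headers=headers).json()

# Once the session reaches the "idle" state, submit a statement to it.
stmt = requests.post(livy + "/sessions/%d/statements" % session["id"],
                     data=json.dumps({"code": "sc.parallelize(range(10)).sum()"}),
                     headers=headers).json()
print(stmt)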


Another approach to save resources is to share a SparkContext across your
applications, since your scenario is interactive work (I guess it is some kind
of notebook). Zeppelin supports sharing a SparkContext across users and notes.



peay wrote on Friday, May 18, 2018 at 6:20 PM:

> Hello,
>
> I run a Spark cluster on YARN, and we have a bunch of client-mode
> applications we use for interactive work. Whenever we start one of these, an
> application master container is started.
>
> My understanding is that this is mostly an empty shell, used to request
> further containers or get status from YARN. Is that correct?
>
> spark.yarn.am.cores is 1, and that AM gets one full vCore on the cluster.
> Because I am using DominantResourceCalculator to take vCores into account
> for scheduling, this results in a lot of unused CPU capacity overall
> because all those AMs each block one full vCore. With enough jobs, this
> adds up quickly.
>
> I am trying to understand if we can work around that -- ideally, by
> allocating fractional vCores (e.g., give 100 millicores to the AM), or by
> allocating no vCores at all for the AM (I am fine with a bit of
> oversubscription because of that).
>
> Any idea on how to avoid blocking so many YARN vCores just for the Spark
> AMs?
>
> Thanks!
>
>


re: help with streaming batch interval question needed

2018-05-24 Thread Peter Liu
 Hi there,

From the Apache Spark streaming documentation (see links below):

   - the batch interval is set when a Spark StreamingContext is constructed
   (see example (a) quoted below)
   - the StreamingContext is available in both older and newer Spark versions
   (v1.6, v2.2 to v2.3.0) (see
   https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html
   and https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html
   )
   - however, example (b) below doesn't use a StreamingContext, but a
   SparkSession object to set up a streaming flow;

What does the usage difference between (a) and (b) mean? I was wondering
whether this reflects a different streaming approach ("traditional" DStream
streaming vs. structured streaming)?

Basically, I need to find a way to set the batch interval in (b), similar to
what is done in (a) below.

It would be great if someone could share some insights here.

Thanks!

Peter

(a)
( from https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html )

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))


(b)
( from databricks'
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
)

spark = SparkSession.builder \
  .appName(appName) \
  .getOrCreate()
...

jsonOptions = { "timestampFormat": nestTimestampFormat }
parsed = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "localhost:9092") \
  .option("subscribe", "nest-logs") \
  .load() \
  .select(from_json(col("value").cast("string"), schema,
          jsonOptions).alias("parsed_value"))
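
For reference, a minimal sketch (assuming the parsed stream from (b) above) of
setting a processing-time trigger on the query, which seems to be how
structured streaming controls the micro-batch cadence instead of a
context-level batch interval:

# Roughly the counterpart of the StreamingContext batch interval in (a):
# the trigger sets how often the streaming query fires a micro-batch.
query = parsed \
  .writeStream \
  .trigger(processingTime="10 seconds") \
  .format("console") \
  .outputMode("append") \
  .start()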


Re: [Beginner][StructuredStreaming] Using Spark aggregation - WithWatermark on old data

2018-05-24 Thread karthikjay
My data looks like this:


{
  "ts2" : "2018/05/01 00:02:50.041",
  "serviceGroupId" : "123",
  "userId" : "avv-0",
  "stream" : "",
  "lastUserActivity" : "00:02:50",
  "lastUserActivityCount" : "0"
}
{
  "ts2" : "2018/05/01 00:09:02.079",
  "serviceGroupId" : "123",
  "userId" : "avv-0",
  "stream" : "",
  "lastUserActivity" : "00:09:02",
  "lastUserActivityCount" : "0"
}
{
  "ts2" : "2018/05/01 00:09:02.086",
  "serviceGroupId" : "123",
  "userId" : "avv-2",
  "stream" : "",
  "lastUserActivity" : "00:09:02",
  "lastUserActivityCount" : "0"
}
...

And my aggregation is :

val sdvTuneInsAgg1 = df
  .withWatermark("ts2", "10 seconds")
  .groupBy(window(col("ts2"),"10 seconds"))
  .agg(count("*") as "count")
  .as[CountMetric1]

The only anomaly is that the current date is 2018/05/24 but the records' ts2
values have old dates. Will the aggregation / count work in this scenario?
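
For reference, a minimal PySpark sketch of the same aggregation (assuming df is
the streaming DataFrame above and ts2 is first parsed into a timestamp); as far
as I understand, the watermark is tracked against the maximum event time seen
in the stream, not the wall-clock date:

from pyspark.sql.functions import col, count, to_timestamp, window

# Parse the string column ts2 using the format shown in the records above.
parsed = df.withColumn("ts2", to_timestamp(col("ts2"), "yyyy/MM/dd HH:mm:ss.SSS"))

agg = (parsed
       .withWatermark("ts2", "10 seconds")
       .groupBy(window(col("ts2"), "10 seconds"))
       .agg(count("*").alias("count")))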



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: Time series data

2018-05-24 Thread Vadim Semenov
Yeah, it depends on what you want to do with that time series data. We at
Datadog process trillions of points daily using Spark. I cannot really go into
what exactly we do with the data, but I can say that Spark can handle the
volume, scale well and be fault-tolerant, albeit everything I said comes with
multiple asterisks.

On Thursday, May 24, 2018, amin mohebbi  wrote:

> Could you please help me to understand the performance we would get from
> using Spark with a NoSQL store or a TSDB? We receive 1 million meters x 288
> readings = 288 million rows (approx. 360 GB per day), so we will end up
> with tens or hundreds of TBs of data, and I feel that NoSQL will be much
> quicker than Hadoop/Spark. This is time series data coming from many
> devices in the form of flat files; it is currently extracted / transformed
> / loaded into another database which is connected to BI tools. We might use
> Azure Data Factory to collect the flat files, then use Spark to do the ETL
> (not sure if that is the correct way), then use Spark to join tables or do
> the aggregations and save them into a db (preferably NoSQL, not sure).
> Finally, deploy Power BI to visualize the data from the NoSQL db.
> My questions are:
>
> 1- Is the above the correct architecture? I am thinking of Spark with
> NoSQL, as the combination of the two could give random access and support
> many queries by different users.
> 2- Do we really need to use a time series db?
>
>
> Best Regards ... Amin
> Mohebbi PhD candidate in Software Engineering   at university of Malaysia
> Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my
> amin_...@me.com
>


-- 
Sent from my iPhone


Streaming : WAL ignored

2018-05-24 Thread Walid Lezzar
Hi,

I have a Spark Streaming application running on YARN that consumes from a JMS
source. I have checkpointing and the WAL enabled to ensure zero data loss.
However, when I suddenly kill my application and restart it, sometimes it
recovers the data from the WAL and sometimes it doesn't! In all cases, I can
see the WAL written correctly on HDFS.

Can someone explain to me why my WAL is sometimes ignored on restart? What are
the conditions for Spark to decide whether or not to recover from the WAL?
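
For reference, a minimal PySpark sketch of the recovery pattern (checkpoint
path, batch interval and the socket source are placeholders standing in for
the JMS receiver); as far as I understand, recovery from the checkpoint and
WAL is only attempted when the context is rebuilt through
StreamingContext.getOrCreate:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///user/app/checkpoints"   # hypothetical path

def create_context():
    conf = (SparkConf()
            .setAppName("wal-demo")
            .set("spark.streaming.receiver.writeAheadLog.enable", "true"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)                    # 10-second batches
    ssc.checkpoint(checkpoint_dir)                    # metadata checkpointing
    lines = ssc.socketTextStream("localhost", 9999)   # stand-in for the JMS source
    lines.count().pprint()
    return ssc

# On restart, an existing checkpoint makes getOrCreate rebuild the context
# (and replay the WAL); otherwise create_context() builds a fresh one.
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
ssc.start()
ssc.awaitTermination()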

Thanks,
Walid.



Positive log-likelihood with Gaussian mixture

2018-05-24 Thread Simon Dirmeier

Dear all,

I am fitting a very trivial GMM with 2-10 components on 100 samples and
5 features in pyspark and observe that some of the log-likelihoods are
positive (see below). I don't understand how this is possible. Is this a
bug or intended behaviour? Furthermore, for different seeds, the
likelihoods sometimes even change sign. Is this due to EM only
converging to a local maximum?
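
(As a side note on why a positive value might not be impossible here: the
log-likelihood is a sum of log densities, and a continuous density can exceed
1 when a component's variance is small. A minimal sketch in plain Python:)

```
import math

# Density of a univariate N(0, sigma^2) at its mean is 1 / (sigma * sqrt(2*pi)),
# which exceeds 1 for small sigma, so its log is positive.
sigma = 0.01
density_at_mean = 1.0 / (sigma * math.sqrt(2 * math.pi))
print(density_at_mean)            # ~39.89
print(math.log(density_at_mean))  # ~3.69 > 0
```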


Cheers and thanks for your help,

Simon


```

from pyspark.ml.clustering import GaussianMixture

for i in range(2, 10 + 1):
    km = GaussianMixture(tol=0.1, maxIter=1000, k=i, seed=23)
    model = km.fit(df)
    print(i, model.summary.logLikelihood)

# 2 -197.37852947736653
# 3 -129.9873268616941
# 4 252.856072127079
# 5 58.104854133211305
# 6 102.05184634221902
# 7 -438.69872950609897
# 8 -521.9157414809579
# 9 684.7223627089136
# 10 -596.7165760632951

for i in range(2, 10 + 1):
    km = GaussianMixture(tol=0.1, maxIter=1000, k=i, seed=5)
    model = km.fit(df)
    print(i, model.summary.logLikelihood)

# 2 -237.6569055995205
# 3 193.6716647064348
# 4 222.8175404052819
# 5 201.28821925102105
# 6 74.02720327261291
# 7 -540.8607659051879
# 8 144.837051544231
# 9 -507.48261722455305
# 10 -689.1844483249996
```



Re: Time series data

2018-05-24 Thread Jörn Franke
There is not one answer to this.

It really depends on what kind of time series analysis you do with the data
and what time series database you are using. It also depends on what ETL you
need to do.
You also seem to need to join data - is it with existing data of the same
type, or do you join completely different data? If so, where does this data
come from?

360 GB / day uncompressed does not sound like terribly much.

> On 24. May 2018, at 08:49, amin mohebbi  wrote:
> 
> Could you please help me to understand the performance we would get from
> using Spark with a NoSQL store or a TSDB? We receive 1 million meters x 288
> readings = 288 million rows (approx. 360 GB per day), so we will end up with
> tens or hundreds of TBs of data, and I feel that NoSQL will be much quicker
> than Hadoop/Spark. This is time series data coming from many devices in the
> form of flat files; it is currently extracted / transformed / loaded into
> another database which is connected to BI tools. We might use Azure Data
> Factory to collect the flat files, then use Spark to do the ETL (not sure if
> that is the correct way), then use Spark to join tables or do the
> aggregations and save them into a db (preferably NoSQL, not sure). Finally,
> deploy Power BI to visualize the data from the NoSQL db. My questions are:
> 
> 1- Is the above the correct architecture? I am thinking of Spark with NoSQL,
> as the combination of the two could give random access and support many
> queries by different users.
> 2- Do we really need to use a time series db?
> 
> 
> Best Regards ... Amin 
> Mohebbi PhD candidate in Software Engineering   at university of Malaysia   
> Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my   
> amin_...@me.com