Unable to connect to cluster 2.2.0

2017-12-05 Thread Imran Rajjad
Hi,

Recently upgraded from 2.1.1 to 2.2.0, and my streaming job seems to have
broken. The submitted application is unable to connect to the cluster, even
though the cluster is up and running.

Below is the verbose spark-submit output:
Spark Master:spark://192.168.10.207:7077
Job Arguments:
-appName orange_watch -directory /u01/watch/stream/
Spark Configuration:
[spark.executor.memory]: 6g
[spark.driver.memory]: 4g
[spark.app.name]: orange_watch
[spark.executor.cores]: 2
Spark Arguments:
[--packages]:graphframes:graphframes:0.5.0-spark2.1-s_2.11
Using properties file:
/home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf
Adding default property:
spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
Parsed arguments:
  master                  spark://192.168.10.207:7077
  deployMode              null
  executorMemory          6g
  executorCores           2
  totalExecutorCores      null
  propertiesFile          /home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf
  driverMemory            4g
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               com.my_user.MainClassWatch
  primaryResource         file:/home/my_user/cluster-testing/job.jar
  name                    orange_watch
  childArgs               [-watchId 3199 -appName orange_watch -directory /u01/watch/stream/]
  jars                    null
  packages                graphframes:graphframes:0.5.0-spark2.1-s_2.11
  packagesExclusions      null
  repositories            null
  verbose                 true
Spark properties used, including those specified through
 --conf and those from the properties file
/home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf:
  (spark.driver.memory,4g)
  (spark.executor.memory,6g)
  (spark.jars.packages,graphframes:graphframes:0.5.0-spark2.1-s_2.11)
  (spark.app.name,orange_watch)
  (spark.executor.cores,2)

Ivy Default Cache set to: /home/my_user/.ivy2/cache
The jars for the packages stored in: /home/my_user/.ivy2/jars
:: loading settings :: url =
jar:file:/home/my_user/spark-2.2.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.5.0-spark2.1-s_2.11 in spark-list
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in
central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in
central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in spark-list
:: resolution report :: resolve 191ms :: artifacts dl 5ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from
central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from
central in [default]
graphframes#graphframes;0.5.0-spark2.1-s_2.11 from spark-list in
[default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from spark-list in [default]

	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 5 already retrieved (0kB/7ms)
Main class:
com.my_user.MainClassWatch
Arguments:
-watchId
3199
-appName
orange_watch
-directory
/u01/watch/stream/
System properties:
(spark.executor.memory,6g)
(spark.driver.memory,4g)
(SPARK_SUBMIT,true)
(spark.jars.packages,graphframes:graphframes:0.5.0-spark2.1-s_2.11)
(spark.app.name,orange_watch)
(spark.jars,file:/home/my_user/.ivy2/jars/graphframes_graphframes-0.5.0-spark2.1-s_2.11.jar,file:/home/my_user/.ivy2/jars/com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar,file:/home/my_user/.ivy2/jars/com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar,file:/home/my_user/.ivy2/jars/org.scala-lang_scala-reflect-2.11.0.jar,file:/home/my_user/.ivy2/jars/org.slf4j_slf4j-api-1.7.7.jar,file:/home/my_user/cluster-testing/job.jar)
(spark.submit.deployMode,client)
(spark.master,spark://192.168.10.207:7077)

Spark job only starts tasks on a single node

2017-12-05 Thread Ji Yan
Hi all,

I am running Spark 2.0 on Mesos 1.1 and was trying to split my job across
several nodes. I set the number of executors using the formula
(spark.cores.max / spark.executor.cores). The behavior I saw was that Spark
packs as many executors as it can onto a single Mesos node and then stops
going to the other Mesos nodes, even though it has not yet scheduled all the
executors I asked for. This is very strange!
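A minimal Scala sketch of the configuration being described; the Mesos master
URL and the core counts are placeholders, not the actual job settings:

import org.apache.spark.sql.SparkSession

// Hedged sketch of the setup above: with spark.cores.max = 12 and
// spark.executor.cores = 4 one would expect 12 / 4 = 3 executors,
// ideally spread across several Mesos agents. All values are placeholders.
val spark = SparkSession.builder()
  .appName("spread-executors-demo")
  .master("mesos://zk://zk1:2181/mesos")   // placeholder Mesos master
  .config("spark.cores.max", "12")         // total cores for the application
  .config("spark.executor.cores", "4")     // cores per executor
  .getOrCreate()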

Did anyone notice this behavior before? Any help appreciated!

Ji

-- 
 

The information in this email is confidential and may be legally 
privileged. It is intended solely for the addressee. Access to this email 
by anyone else is unauthorized. If you are not the intended recipient, any 
disclosure, copying, distribution or any action taken or omitted to be 
taken in reliance on it, is prohibited and may be unlawful.


How to export the Spark SQL jobs from the HiveThriftServer2

2017-12-05 Thread wenxing zheng
Dear all,

I have a HiveThriftServer2 server running, and most of our Spark SQL queries
go there for execution. From the YARN GUI, I can see the application ID and
the attempt ID of the thrift server. But with the REST API described on the
page (https://spark.apache.org/docs/latest/monitoring.html#rest-api), I
still can't get the jobs for a given application from the endpoint:
*/applications/[app-id]/jobs*

Can anyone kindly advise how to dump the Spark SQL jobs for audit, just like
the one for the MapReduce jobs (
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html
)?
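A minimal Scala sketch of querying that endpoint, assuming the thrift server's
driver UI is reachable; the host, port, and application ID below are placeholders:

import scala.io.Source

// Hedged sketch: the REST API is served by the driver UI (port 4040 by default)
// or by the history server (port 18080). Host, port, and appId are placeholders.
val appId = "application_1512345678901_0042"
val url   = s"http://thriftserver-host:4040/api/v1/applications/$appId/jobs"
val jobsJson = Source.fromURL(url).mkString   // raw JSON array of jobs for the app
println(jobsJson)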

Thanks again,
Wenxing


Re: Access to Applications metrics

2017-12-05 Thread Holden Karau
I've written a SparkListener to record metrics for validation (it's a bit out
of date). Are you just looking to have graphing/alerting set up on the Spark
metrics?
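A minimal sketch of that kind of listener in Scala, assuming Spark 2.x; the
class name and the choice of metrics are illustrative, not the actual code
mentioned above:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hedged sketch: log a few task-level metrics as tasks finish.
class TaskMetricsLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"runTimeMs=${metrics.executorRunTime} " +
        s"recordsRead=${metrics.inputMetrics.recordsRead}")
    }
  }
}

// Register it on an existing SparkContext, e.g.:
//   sc.addSparkListener(new TaskMetricsLogger)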

On Tue, Dec 5, 2017 at 1:53 PM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> You can also get the metrics from the Spark application events log file.
>
>
>
> See https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling
>
>
>
>
>
> *From: *"Qiao, Richard" 
> *Date: *Monday, December 4, 2017 at 6:09 PM
> *To: *Nick Dimiduk , "user@spark.apache.org" <
> user@spark.apache.org>
> *Subject: *Re: Access to Applications metrics
>
>
>
> It works to collect job-level metrics through the Jolokia Java agent.
>
>
>
> Best Regards
>
> Richard
>
>
>
>
>
> *From: *Nick Dimiduk 
> *Date: *Monday, December 4, 2017 at 6:53 PM
> *To: *"user@spark.apache.org" 
> *Subject: *Re: Access to Applications metrics
>
>
>
> Bump.
>
>
>
> On Wed, Nov 15, 2017 at 2:28 PM, Nick Dimiduk  wrote:
>
> Hello,
>
>
>
> I'm wondering if it's possible to get access to the detailed
> job/stage/task level metrics via the metrics system (JMX, Graphite, ).
> I've enabled the wildcard sink and I do not see them. It seems these values
> are only available over http/json and to SparkListener instances, is this
> the case? Has anyone worked on a SparkListener that would bridge data from
> one to the other?
>
>
>
> Thanks,
>
> Nick
>
>
>
>
> --
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
@Richard I don't see any error in the executor log but let me run again to
make sure.

@Gerard Thanks much! But would your answer on .collect() change depending on
whether the Spark app runs in client vs cluster mode?

Thanks!

On Tue, Dec 5, 2017 at 1:54 PM, Gerard Maas  wrote:

> The general answer to your initial question is that "it depends". If the
> operation in the rdd.foreach() closure can be parallelized, then you don't
> need to collect first. If it needs some local context (e.g. a socket
> connection), then you need to do rdd.collect first to bring the data to the
> driver, which has a performance penalty and is also restricted by the memory
> size of the driver process.
>
> Given the further clarification:
> >Reads from Kafka and outputs to Kafka. so I check the output from Kafka.
>
> If it's writing to Kafka, that operation can be done in a distributed
> form.
>
> You could use this lib: https://github.com/BenFradet/spark-kafka-writer
>
> Or, if you can upgrade to Spark 2.2 version, you can pave your way to
> migrate to structured streaming by already adopting the 'structured' APIs
> within Spark Streaming:
>
> case class KV(key: String, value: String)
>
> dstream.map(...).reduce(...).foreachRDD { rdd =>
>   import spark.implicits._
>   // needs to be in a (key, value) shape
>   val kv = rdd.map { e => KV(extractKey(e), extractValue(e)) }
>   val dataFrame = kv.toDF()
>   dataFrame.write
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
>     .option("topic", "topic1")
>     .save()
> }
>
> -kr, Gerard.
>
>
>
> On Tue, Dec 5, 2017 at 10:38 PM, kant kodali  wrote:
>
>> Reads from Kafka and outputs to Kafka. so I check the output from Kafka.
>>
>> On Tue, Dec 5, 2017 at 1:26 PM, Qiao, Richard <
>> richard.q...@capitalone.com> wrote:
>>
>>> Where do you check the output result for both case?
>>>
>>> Sent from my iPhone
>>>
>>>
>>> > On Dec 5, 2017, at 15:36, kant kodali  wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I have a simple stateless transformation using Dstreams (stuck with
>>> the old API for one of the Application). The pseudo code is rough like this
>>> >
>>> > dstream.map().reduce().forEachRdd(rdd -> {
>>> >  rdd.collect(),forEach(); // Is this necessary ? Does execute fine
>>> but a bit slow
>>> > })
>>> >
>>> > I understand collect collects the results back to the driver but is
>>> that necessary? can I just do something like below? I believe I tried both
>>> and somehow the below code didn't output any results (It can be issues with
>>> my env. I am not entirely sure) but I just would like some clarification on
>>> .collect() since it seems to slow things down for me.
>>> >
>>> > dstream.map().reduce().forEachRdd(rdd -> {
>>> >  rdd.forEach(() -> {} ); //
>>> > })
>>> >
>>> > Thanks!
>>> >
>>> >
>>> 
>>>
>>> The information contained in this e-mail is confidential and/or
>>> proprietary to Capital One and/or its affiliates and may only be used
>>> solely in performance of work or services for Capital One. The information
>>> transmitted herewith is intended only for use by the individual or entity
>>> to which it is addressed. If the reader of this message is not the intended
>>> recipient, you are hereby notified that any review, retransmission,
>>> dissemination, distribution, copying or other use of, or taking of any
>>> action in reliance upon this information is strictly prohibited. If you
>>> have received this communication in error, please contact the sender and
>>> delete the material from your computer.
>>>
>>>
>>
>


Re: learning Spark

2017-12-05 Thread makoto
This gitbook explains Spark components in detail.

'Mastering Apache Spark 2'

https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details




2017-12-04 12:48 GMT+09:00 Manuel Sopena Ballesteros <
manuel...@garvan.org.au>:

> Dear Spark community,
>
>
>
> Is there any resource (books, online courses, etc.) available that you know
> of to learn about Spark? I am interested in the sys admin side of it: the
> different parts inside Spark, how Spark works internally, the best ways to
> install/deploy/monitor it, and how to get the best performance possible.
>
>
>
> Any suggestion?
>
>
>
> Thank you very much
>
>
>
> *Manuel Sopena Ballesteros *| Systems Engineer
> *Garvan Institute of Medical Research *
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> 
> *T:* + 61 (0)2 9355 5760 | *F:* +61 (0)2 9295 8507 | *E:* manuel...@garvan.org.au
>
>
> NOTICE
> Please consider the environment before printing this email. This message
> and any attachments are intended for the addressee named and may contain
> legally privileged/confidential/copyright information. If you are not the
> intended recipient, you should not read, use, disclose, copy or distribute
> this communication. If you have received this message in error please
> notify us at once by return email and then delete both messages. We accept
> no liability for the distribution of viruses or similar in electronic
> communications. This notice should not be removed.
>


Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Gerard Maas
The general answer to your initial question is that "it depends". If the
operation in the rdd.foreach() closure can be parallelized, then you don't
need to collect first. If it needs some local context (e.g. a socket
connection), then you need to do rdd.collect first to bring the data to the
driver, which has a performance penalty and is also restricted by the memory
size of the driver process.

Given the further clarification:
>Reads from Kafka and outputs to Kafka. so I check the output from Kafka.

If it's writing to Kafka, that operation can be done in a distributed form.

You could use this lib: https://github.com/BenFradet/spark-kafka-writer

Or, if you can upgrade to Spark 2.2 version, you can pave your way to
migrate to structured streaming by already adopting the 'structured' APIs
within Spark Streaming:

case class KV(key: String, value: String)

dstream.map(...).reduce(...).foreachRDD { rdd =>
  import spark.implicits._
  // needs to be in a (key, value) shape
  val kv = rdd.map { e => KV(extractKey(e), extractValue(e)) }
  val dataFrame = kv.toDF()
  dataFrame.write
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "topic1")
    .save()
}

-kr, Gerard.



On Tue, Dec 5, 2017 at 10:38 PM, kant kodali  wrote:

> Reads from Kafka and outputs to Kafka. so I check the output from Kafka.
>
> On Tue, Dec 5, 2017 at 1:26 PM, Qiao, Richard  > wrote:
>
>> Where do you check the output result for both case?
>>
>> Sent from my iPhone
>>
>>
>> > On Dec 5, 2017, at 15:36, kant kodali  wrote:
>> >
>> > Hi All,
>> >
>> > I have a simple stateless transformation using Dstreams (stuck with the
>> old API for one of the Application). The pseudo code is rough like this
>> >
>> > dstream.map().reduce().forEachRdd(rdd -> {
>> >  rdd.collect(),forEach(); // Is this necessary ? Does execute fine
>> but a bit slow
>> > })
>> >
>> > I understand collect collects the results back to the driver but is
>> that necessary? can I just do something like below? I believe I tried both
>> and somehow the below code didn't output any results (It can be issues with
>> my env. I am not entirely sure) but I just would like some clarification on
>> .collect() since it seems to slow things down for me.
>> >
>> > dstream.map().reduce().forEachRdd(rdd -> {
>> >  rdd.forEach(() -> {} ); //
>> > })
>> >
>> > Thanks!
>> >
>> >
>> 
>>
>> The information contained in this e-mail is confidential and/or
>> proprietary to Capital One and/or its affiliates and may only be used
>> solely in performance of work or services for Capital One. The information
>> transmitted herewith is intended only for use by the individual or entity
>> to which it is addressed. If the reader of this message is not the intended
>> recipient, you are hereby notified that any review, retransmission,
>> dissemination, distribution, copying or other use of, or taking of any
>> action in reliance upon this information is strictly prohibited. If you
>> have received this communication in error, please contact the sender and
>> delete the material from your computer.
>>
>>
>


Re: Access to Applications metrics

2017-12-05 Thread Thakrar, Jayesh
You can also get the metrics from the Spark application events log file.

See https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling


From: "Qiao, Richard" 
Date: Monday, December 4, 2017 at 6:09 PM
To: Nick Dimiduk , "user@spark.apache.org" 

Subject: Re: Access to Applications metrics

It works to collect job-level metrics through the Jolokia Java agent.

Best Regards
Richard


From: Nick Dimiduk 
Date: Monday, December 4, 2017 at 6:53 PM
To: "user@spark.apache.org" 
Subject: Re: Access to Applications metrics

Bump.

On Wed, Nov 15, 2017 at 2:28 PM, Nick Dimiduk 
> wrote:
Hello,

I'm wondering if it's possible to get access to the detailed job/stage/task 
level metrics via the metrics system (JMX, Graphite, ). I've enabled the 
wildcard sink and I do not see them. It seems these values are only available 
over http/json and to SparkListener instances, is this the case? Has anyone 
worked on a SparkListener that would bridge data from one to the other?

Thanks,
Nick




The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Qiao, Richard
In the 2nd case, is there any producer error thrown in the executor's log?

Best Regards
Richard


From: kant kodali 
Date: Tuesday, December 5, 2017 at 4:38 PM
To: "Qiao, Richard" 
Cc: "user @spark" 
Subject: Re: Do I need to do .collect inside forEachRDD

Reads from Kafka and outputs to Kafka. so I check the output from Kafka.

On Tue, Dec 5, 2017 at 1:26 PM, Qiao, Richard 
> wrote:
Where do you check the output result for both case?

Sent from my iPhone

> On Dec 5, 2017, at 15:36, kant kodali 
> > wrote:
>
> Hi All,
>
> I have a simple stateless transformation using Dstreams (stuck with the old 
> API for one of the Application). The pseudo code is rough like this
>
> dstream.map().reduce().forEachRdd(rdd -> {
>  rdd.collect(),forEach(); // Is this necessary ? Does execute fine but a 
> bit slow
> })
>
> I understand collect collects the results back to the driver but is that 
> necessary? can I just do something like below? I believe I tried both and 
> somehow the below code didn't output any results (It can be issues with my 
> env. I am not entirely sure) but I just would like some clarification on 
> .collect() since it seems to slow things down for me.
>
> dstream.map().reduce().forEachRdd(rdd -> {
>  rdd.forEach(() -> {} ); //
> })
>
> Thanks!
>
>


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.





Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread Marcelo Vanzin
On Tue, Dec 5, 2017 at 12:43 PM, bsikander  wrote:
> 2) If I use context.addSparkListener, I can customize the listener but then
> I miss the onApplicationStart event. Also, I don't know Spark's logic for
> changing the state of an application from WAITING -> RUNNING.

I'm not sure I follow you here. This is something that you are
defining, not Spark.

"SparkLauncher" has its own view of what those mean, and it doesn't match yours.

"SparkListener" has no notion of whether an app is running or not.

It's up to you to define what waiting and running mean in your code,
and map the events Spark provides you to those concepts.

e.g., a job is running after your listener gets an "onJobStart" event.
But the application might have been running already before that job
started.
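A hedged Scala sketch of such a mapping, where "running" is defined by the
caller as "at least one executor has registered"; the state names and the
class are the application's own definitions, not anything Spark provides:

import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart,
  SparkListenerExecutorAdded, SparkListenerJobStart}

// Hedged sketch: map listener events onto caller-defined states.
class AppStateListener extends SparkListener {
  val state = new AtomicReference[String]("WAITING")

  override def onApplicationStart(event: SparkListenerApplicationStart): Unit =
    state.set("SUBMITTED")

  // Treat the app as RUNNING once at least one executor has registered.
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    state.set("RUNNING")

  override def onJobStart(event: SparkListenerJobStart): Unit =
    state.set("RUNNING")
}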

-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
Reads from Kafka and outputs to Kafka. so I check the output from Kafka.

On Tue, Dec 5, 2017 at 1:26 PM, Qiao, Richard 
wrote:

> Where do you check the output result for both case?
>
> Sent from my iPhone
>
> > On Dec 5, 2017, at 15:36, kant kodali  wrote:
> >
> > Hi All,
> >
> > I have a simple stateless transformation using Dstreams (stuck with the
> old API for one of the Application). The pseudo code is rough like this
> >
> > dstream.map().reduce().forEachRdd(rdd -> {
> >  rdd.collect(),forEach(); // Is this necessary ? Does execute fine
> but a bit slow
> > })
> >
> > I understand collect collects the results back to the driver but is that
> necessary? can I just do something like below? I believe I tried both and
> somehow the below code didn't output any results (It can be issues with my
> env. I am not entirely sure) but I just would like some clarification on
> .collect() since it seems to slow things down for me.
> >
> > dstream.map().reduce().forEachRdd(rdd -> {
> >  rdd.forEach(() -> {} ); //
> > })
> >
> > Thanks!
> >
> >
> 
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>
>


Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Qiao, Richard
Where do you check the output result in both cases?

Sent from my iPhone

> On Dec 5, 2017, at 15:36, kant kodali  wrote:
> 
> Hi All,
> 
> I have a simple stateless transformation using Dstreams (stuck with the old 
> API for one of the Application). The pseudo code is rough like this
> 
> dstream.map().reduce().forEachRdd(rdd -> {
>  rdd.collect(),forEach(); // Is this necessary ? Does execute fine but a 
> bit slow
> })
> 
> I understand collect collects the results back to the driver but is that 
> necessary? can I just do something like below? I believe I tried both and 
> somehow the below code didn't output any results (It can be issues with my 
> env. I am not entirely sure) but I just would like some clarification on 
> .collect() since it seems to slow things down for me.
> 
> dstream.map().reduce().forEachRdd(rdd -> {
>  rdd.forEach(() -> {} ); // 
> })
> 
> Thanks!
> 
> 


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread bsikander
Thank you for the reply.

I am not a Spark expert but I was reading through the code and I thought
that the state was changed from SUBMITTED to RUNNING only after executors
(CoarseGrainedExecutorBackend) were registered.
https://github.com/apache/spark/commit/015f7ef503d5544f79512b626749a1f0c48b#diff-a755f3d892ff2506a7aa7db52022d77cR95

As you mentioned that Launcher has no idea about executors, probably my
understanding is not correct.



SparkListener is an option but it has its own pitfalls. 
1) If I use spark.extraListeners, I get all the events but I cannot
customize the Listener, since I have to pass the class as a string to
spark-submit/Launcher. 
2) If I use context.addSparkListener, I can customize the listener, but then
I miss the onApplicationStart event. Also, I don't know Spark's logic for
changing the state of an application from WAITING -> RUNNING.

Maybe you can answer this: if I have a Spark job which needs 3 executors and
the cluster can only provide 1 executor, will the application be in WAITING
or RUNNING?
If I know Spark's logic, then I can program something with the
SparkListener.onExecutorAdded event to correctly figure out the state.

Another alternative could be to use the Spark Master JSON endpoint
(http://<>:8080/json), but the problem with this is that it returns everything,
and I was not able to find any way to filter it.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread Marcelo Vanzin
SparkLauncher operates at a different layer than Spark applications.
It doesn't know about executors or driver or anything, just whether
the Spark application was started or not. So it doesn't work for your
case.

The best option for your case is to install a SparkListener and
monitor events. But that will not tell you when things do not happen,
just when they do happen, so maybe even that is not enough for you.


On Mon, Dec 4, 2017 at 1:06 AM, bsikander  wrote:
> So, I tried to use SparkAppHandle.Listener with SparkLauncher as you
> suggested. The behavior of Launcher is not what I expected.
>
> 1- If I start the job (using SparkLauncher) and my Spark cluster has enough
> cores available, I receive events in my class extending
> SparkAppHandle.Listener and I see the status getting changed from
> UNKNOWN -> CONNECTED -> SUBMITTED -> RUNNING. All good here.
>
> 2- If my Spark cluster has cores only for my Driver process (running in
> cluster mode) but no cores for my executor, then I still receive the RUNNING
> event. I was expecting something else since my executor has no cores and
> Master UI shows WAITING state for executors, listener should respond with
> SUBMITTED state instead of RUNNING.
>
> 3- If my Spark cluster has no cores for even the driver process then
> SparkLauncher invokes no events at all. The state stays in UNKNOWN. I would
> have expected it to be in the SUBMITTED state at least.
>
> *Is there any way with which I can reliably get the WAITING state of a job?*
> Driver=RUNNING, executor=RUNNING, overall state should be RUNNING
> Driver=RUNNING, executor=WAITING overall state should be SUBMITTED/WAITING
> Driver=WAITING, executor=WAITING overall state should be
> CONNECTED/SUBMITTED/WAITING
>
>
>
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
Hi All,

I have a simple stateless transformation using Dstreams (stuck with the old
API for one of the Application). The pseudo code is rough like this

dstream.map(...).reduce(...).foreachRDD(rdd -> {
    rdd.collect().forEach(...); // Is this necessary? It executes fine, but is a bit slow.
});

I understand that collect brings the results back to the driver, but is that
necessary? Can I just do something like the code below? I believe I tried both,
and somehow the code below didn't output any results (it could be an issue with
my environment, I am not entirely sure), but I would just like some clarification
on .collect() since it seems to slow things down for me.

dstream.map(...).reduce(...).foreachRDD(rdd -> {
    rdd.foreach(...); // no collect: the work runs on the executors
});
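A minimal Scala sketch of doing the Kafka write in distributed form, which the
replies above suggest is possible; it uses the plain kafka-clients producer per
partition rather than the libraries mentioned in the replies. It assumes a
DStream[(String, String)]; the broker list and topic are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Hedged sketch: one producer per partition, records sent from the executors,
// so nothing needs to be collected to the driver.
def writeToKafka(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "host1:9092,host2:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        records.foreach { case (k, v) =>
          producer.send(new ProducerRecord[String, String]("topic1", k, v))
        }
      } finally {
        producer.close()
      }
    }
  }
}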

Thanks!


Apache Spark 2.3 and Apache ORC 1.4 finally

2017-12-05 Thread Dongjoon Hyun
Hi, All.

Today, Apache Spark starts to use Apache ORC 1.4 as a `native` ORC
implementation.

SPARK-20728 Make OrcFileFormat configurable between `sql/hive` and
`sql/core`.
-
https://github.com/apache/spark/commit/326f1d6728a7734c228d8bfaa69442a1c7b92e9b

Thank you so much for all your supports for this!

I'll proceed with more ORC issues in order to build synergy between both
communities.

Please see https://issues.apache.org/jira/browse/SPARK-20901 for the updates.
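A minimal Scala sketch of switching between the two implementations, assuming
the spark.sql.orc.impl setting introduced by SPARK-20728 (available from Spark
2.3); the application name and output path are placeholders:

import org.apache.spark.sql.SparkSession

// Hedged sketch: select the new native ORC reader/writer; "hive" keeps the
// old sql/hive-based implementation. The path is a placeholder.
val spark = SparkSession.builder()
  .appName("orc-native-demo")
  .master("local[*]")
  .config("spark.sql.orc.impl", "native")
  .getOrCreate()

spark.range(0, 10).write.mode("overwrite").orc("/tmp/orc_native_demo")
spark.read.orc("/tmp/orc_native_demo").show()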

Bests,
Dongjoon.


Re: How to persistent database/table created in sparkSession

2017-12-05 Thread Wenchen Fan
Try with `SparkSession.builder().enableHiveSupport` ?
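A minimal Scala sketch of that suggestion, assuming a Hive metastore is
available to the application; the database name and HDFS location are taken
from the quoted code below:

import org.apache.spark.sql.SparkSession

// Hedged sketch: enabling Hive support records databases/tables created via
// spark.sql in the Hive metastore, so they remain visible to later sessions
// and to spark-sql. Assumes a metastore (e.g. hive-site.xml) is configured.
val spark = SparkSession.builder()
  .appName("Create persistent table")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS testdb LOCATION 'hdfs://node1:8020/testdb'")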

On Tue, Dec 5, 2017 at 3:22 PM, 163  wrote:

> Hi,
> How can I persistent database/table created in spark application?
>
> object TestPersistentDB {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("Create persistent table")
>       .config("spark.master", "local")
>       .getOrCreate()
>     import spark.implicits._
>     spark.sql("create database testdb location \"hdfs://node1:8020/testdb\"")
>   }
> }
>
>   When I use spark.sql("create database") in a SparkSession and then close
> the session, the created database is not persisted to the metastore, so I
> cannot find it in spark-sql with "show databases".
>
>
>
> regards
> wendy
>


Support for storing date time fields as TIMESTAMP_MILLIS(INT64)

2017-12-05 Thread Rahul Raj
Hi,

I believe Spark writes datetime fields as INT96. What are the implications
of https://issues.apache.org/jira/browse/SPARK-10364 (Support Parquet
logical type TIMESTAMP_MILLIS), which is part of 2.2.0?

I am having issues reading Spark-generated Parquet dates using Apache Drill
(Drill supports INT64 by default). Will the change allow me to store datetime
fields as INT64?

Regards,
Rahul

-- 
 This email and any files transmitted with it are confidential and 
intended solely for the use of the individual or entity to whom it is 
addressed. If you are not the named addressee then you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately and delete this e-mail from your system.


Re: learning Spark

2017-12-05 Thread Jean Georges Perrin
When you pick a book, make sure it covers the version of Spark you want to
deploy. There are a lot of books out there that focus on Spark 1.x. Spark 2.x
generalizes the DataFrame API, introduces Tungsten, etc. All of this might not
be relevant to pure "sys admin" learning, but it is good to know.

jg

> On Dec 3, 2017, at 22:48, Manuel Sopena Ballesteros  
> wrote:
> 
> Dear Spark community,
>  
> Is there any resource (books, online courses, etc.) available that you know of 
> to learn about Spark? I am interested in the sys admin side of it: the 
> different parts inside Spark, how Spark works internally, the best ways to 
> install/deploy/monitor it, and how to get the best performance possible.
>  
> Any suggestion?
>  
> Thank you very much
>  
> Manuel Sopena Ballesteros | Systems Engineer
> Garvan Institute of Medical Research 
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: manuel...@garvan.org.au 
> 
>  
> NOTICE
> Please consider the environment before printing this email. This message and 
> any attachments are intended for the addressee named and may contain legally 
> privileged/confidential/copyright information. If you are not the intended 
> recipient, you should not read, use, disclose, copy or distribute this 
> communication. If you have received this message in error please notify us at 
> once by return email and then delete both messages. We accept no liability 
> for the distribution of viruses or similar in electronic communications. This 
> notice should not be removed.