unable to connect to cluster 2.2.0

2017-12-05 Thread Imran Rajjad
Hi, I recently upgraded from 2.1.1 to 2.2.0 and my Streaming job seems to have broken. The submitted application is unable to connect to the cluster even though everything is running. Below is my stack trace. Spark Master: spark://192.168.10.207:7077 Job Arguments: -appName orange_watch -directory

Spark job only starts tasks on a single node

2017-12-05 Thread Ji Yan
Hi all, I am running Spark 2.0 on Mesos 1.1 and was trying to split my job across several nodes. I tried to set the number of executors via the formula (spark.cores.max / spark.executor.cores). The behavior I saw was that Spark fills up one Mesos node with as many executors as it can, then
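For reference, a minimal sketch of the two settings that formula refers to (the values and app name are placeholders, not a recommendation):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("executor-spread-example")      // placeholder name
      .config("spark.cores.max", "16")         // total cores the app may use
      .config("spark.executor.cores", "4")     // cores per executor
      // expected executor count ~ spark.cores.max / spark.executor.cores = 4
      .getOrCreate()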

How to export the Spark SQL jobs from the HiveThriftServer2

2017-12-05 Thread wenxing zheng
Dear all, I have a HiveThriftServer2 server running, and most of our Spark SQL queries go there for calculation. From the YARN GUI, I can see the application ID and the attempt ID of the thrift server. But with the REST API described on the page
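For readers following along, a minimal sketch of querying Spark's monitoring REST API for a given application id (the host, port, and application id are placeholders; on YARN with multiple attempts the attempt id is an extra path segment):

    import scala.io.Source

    // List the jobs of a running application via the monitoring REST API.
    val appId = "application_1512345678901_0001"                      // placeholder
    val url   = s"http://driver-host:4040/api/v1/applications/$appId/jobs"
    val json  = Source.fromURL(url).mkString
    println(json)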

Re: Access to Applications metrics

2017-12-05 Thread Holden Karau
I've written a SparkListener to record metrics for validation (it's a bit out of date). Are you just looking to have graphing/alerting set up on the Spark metrics? On Tue, Dec 5, 2017 at 1:53 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote: > You can also get the metrics from the Spark
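A minimal sketch of the SparkListener approach mentioned above (what it records and where it prints are placeholders):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    // Record a couple of task-level metrics as tasks finish.
    class TaskMetricsListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          println(s"stage=${taskEnd.stageId} runTime=${m.executorRunTime}ms " +
            s"recordsRead=${m.inputMetrics.recordsRead}")
        }
      }
    }

    val spark = SparkSession.builder().appName("metrics-demo").getOrCreate()
    spark.sparkContext.addSparkListener(new TaskMetricsListener)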

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
@Richard I don't see any error in the executor log, but let me run again to make sure. @Gerard Thanks much! But would your answer on .collect() change depending on whether the Spark app runs in client vs cluster mode? Thanks! On Tue, Dec 5, 2017 at 1:54 PM, Gerard Maas wrote:

Re: learning Spark

2017-12-05 Thread makoto
This GitBook explains Spark components in detail: 'Mastering Apache Spark 2' https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details 2017-12-04 12:48 GMT+09:00 Manuel Sopena Ballesteros <manuel...@garvan.org.au>: > Dear Spark community, > > > > Is there any resource

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Gerard Maas
The general answer to your initial question is that "it depends". If the operation in the rdd.foreach() closure can be parallelized, then you don't need to collect first. If it needs some local context (e.g. a socket connection), then you need to do rdd.collect first to bring the data locally,
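A minimal sketch of the two patterns described above; send() and driverSideWrite() are hypothetical stand-ins for your own per-record work:

    import org.apache.spark.streaming.dstream.DStream

    def send(r: String): Unit = println(s"executor-side: $r")          // placeholder
    def driverSideWrite(r: String): Unit = println(s"driver-side: $r") // placeholder

    def process(dstream: DStream[String]): Unit = {
      // Case 1: the per-record work parallelizes -- keep it on the executors.
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition(records => records.foreach(send))
      }

      // Case 2: the work needs driver-local context (e.g. one socket connection)
      // -- collect first, accepting that each batch must fit in driver memory.
      dstream.foreachRDD { rdd =>
        rdd.collect().foreach(driverSideWrite)
      }
    }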

Re: Access to Applications metrics

2017-12-05 Thread Thakrar, Jayesh
You can also get the metrics from the Spark application events log file. See https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling From: "Qiao, Richard" Date: Monday, December 4, 2017 at 6:09 PM To: Nick Dimiduk ,
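For context, that events log only exists when event logging is turned on; a minimal sketch (the log directory is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("event-log-demo")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", "hdfs:///tmp/spark-events")   // placeholder path
      .getOrCreate()
    // A JSON file of events (including task metrics), named after the application
    // id, is written under spark.eventLog.dir and can be parsed offline.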

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Qiao, Richard
In the 2nd case, is there any producer error thrown in the executor's log? Best Regards Richard From: kant kodali Date: Tuesday, December 5, 2017 at 4:38 PM To: "Qiao, Richard" Cc: "user @spark" Subject: Re: Do I need to

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread Marcelo Vanzin
On Tue, Dec 5, 2017 at 12:43 PM, bsikander wrote: > 2) If I use context.addSparkListener, I can customize the listener but then > I miss the onApplicationStart event. Also, I don't know Spark's logic for > changing the state of the application from WAITING -> RUNNING. I'm not

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
It reads from Kafka and outputs to Kafka, so I check the output from Kafka. On Tue, Dec 5, 2017 at 1:26 PM, Qiao, Richard wrote: > Where do you check the output result for both cases? > > Sent from my iPhone > > > On Dec 5, 2017, at 15:36, kant kodali

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread Qiao, Richard
Where do you check the output result for both cases? Sent from my iPhone > On Dec 5, 2017, at 15:36, kant kodali wrote: > > Hi All, > > I have a simple stateless transformation using DStreams (stuck with the old > API for one of the applications). The pseudo code is rough

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread bsikander
Thank you for the reply. I am not a Spark expert but I was reading through the code and I thought that the state was changed from SUBMITTED to RUNNING only after executors (CoarseGrainedExecutorBackend) were registered.

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread Marcelo Vanzin
SparkLauncher operates at a different layer than Spark applications. It doesn't know about executors or the driver or anything, just whether the Spark application was started or not, so it doesn't work for your case. The best option for your case is to install a SparkListener and monitor events. But
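A minimal sketch of that listener approach, watching the lifecycle events this thread is about (the println bodies are placeholders and com.example is a hypothetical package):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart, SparkListenerExecutorAdded}

    // Monitor application/executor lifecycle events from inside the application.
    class LifecycleListener extends SparkListener {
      override def onApplicationStart(start: SparkListenerApplicationStart): Unit =
        println(s"application started: ${start.appName}")

      override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
        println(s"executor registered: ${added.executorId} on ${added.executorInfo.executorHost}")
    }

    // Register it via configuration so it is installed before the first events fire:
    //   --conf spark.extraListeners=com.example.LifecycleListener
    // or programmatically (which may miss onApplicationStart, as noted elsewhere in
    // this thread):
    //   sc.addSparkListener(new LifecycleListener)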

Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
Hi All, I have a simple stateless transformation using DStreams (stuck with the old API for one of the applications). The pseudo code is roughly like this: dstream.map().reduce().foreachRDD(rdd -> { rdd.collect().forEach(); // Is this necessary? It does execute fine but is a bit slow }) I
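A tidied-up sketch of that pseudo code with the collect() removed (the map and reduce functions are placeholders; the output just goes to the executor logs):

    import org.apache.spark.streaming.dstream.DStream

    def pipeline(lines: DStream[String]): Unit = {
      lines
        .map(line => (line, 1))                          // placeholder map
        .reduce((a, b) => if (a._2 >= b._2) a else b)    // placeholder reduce
        .foreachRDD { rdd =>
          // Runs on the executors; collect() is only needed if the work below
          // must happen on the driver.
          rdd.foreach(record => println(record))
        }
    }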

Apache Spark 2.3 and Apache ORC 1.4 finally

2017-12-05 Thread Dongjoon Hyun
Hi, all. Today, Apache Spark has started to use Apache ORC 1.4 as a `native` ORC implementation. SPARK-20728: Make OrcFileFormat configurable between `sql/hive` and `sql/core` - https://github.com/apache/spark/commit/326f1d6728a7734c228d8bfaa69442a1c7b92e9b Thank you so much for all your support
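A minimal sketch of switching between the two implementations, assuming the configuration key added by SPARK-20728 is spark.sql.orc.impl with values `hive` and `native` (the path is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-native-demo")
      // Assumption: "native" selects the sql/core reader, "hive" the old sql/hive one.
      .config("spark.sql.orc.impl", "native")
      .getOrCreate()

    val df = spark.read.orc("/tmp/example.orc")   // placeholder path
    df.printSchema()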

Re: How to persist a database/table created in SparkSession

2017-12-05 Thread Wenchen Fan
Try with `SparkSession.builder().enableHiveSupport`? On Tue, Dec 5, 2017 at 3:22 PM, 163 wrote: > Hi, > How can I persist a database/table created in a Spark application? > > object TestPersistentDB { > def main(args:Array[String]): Unit = { >
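A minimal sketch of that suggestion (database and table names are placeholders); with Hive support enabled, saveAsTable writes to a persistent metastore-backed catalog instead of the in-memory one:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("persistent-table-demo")
      .enableHiveSupport()                 // back the catalog with a Hive metastore
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS testdb")   // placeholder names
    spark.range(10).toDF("id").write.mode("overwrite").saveAsTable("testdb.numbers")
    // The database/table outlive the application because they live in the metastore.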

Support for storing date time fields as TIMESTAMP_MILLIS(INT64)

2017-12-05 Thread Rahul Raj
Hi, I believe Spark writes datetime fields as INT96. What are the implications of https://issues.apache.org/jira/browse/SPARK-10364 (Support Parquet logical type TIMESTAMP_MILLIS), which is part of 2.2.0? I am having issues reading Spark-generated Parquet dates using Apache Drill (Drill
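A minimal sketch of writing Parquet timestamps as TIMESTAMP_MILLIS, assuming the switch added by SPARK-10364 in 2.2.0 is the spark.sql.parquet.int64AsTimestampMillis flag (please verify against the JIRA; the path is a placeholder):

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("timestamp-millis-demo")
      // Assumption: this flag stores timestamps as INT64 TIMESTAMP_MILLIS instead of INT96.
      .config("spark.sql.parquet.int64AsTimestampMillis", "true")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("row1", Timestamp.valueOf("2017-12-05 00:00:00"))).toDF("id", "ts")
    df.write.mode("overwrite").parquet("/tmp/ts_millis.parquet")   // placeholder path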

Re: learning Spark

2017-12-05 Thread Jean Georges Perrin
When you pick a book, make sure it covers the version of Spark you want to deploy. There are a lot of books out there that still focus on Spark 1.x. Spark 2.x generalizes the DataFrame API, introduces Tungsten, etc. Not all of it may be relevant to pure "sys admin" learning, but it is good to