Re: Do I need to do .collect inside forEachRDD

2017-12-06 Thread kant kodali
@Richard I had pasted the two versions of the code below and I still couldn't figure out why it wouldn't work without .collect? Any help would be great. *The code below doesn't work and sometimes I also run into an OutOfMemory error.* jsonMessagesDStream.window(new Duration(6), new
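A minimal sketch (Scala) of the pattern usually suggested in this thread: act on the windowed stream inside foreachRDD with per-partition work on the executors, so nothing large has to reach the driver via collect(). The durations and the println sink are placeholder assumptions:

    import org.apache.spark.streaming.Duration
    import org.apache.spark.streaming.dstream.DStream

    // Assumes jsonMessagesDStream carries JSON strings; the 60s window and slide are placeholders.
    def process(jsonMessagesDStream: DStream[String]): Unit = {
      jsonMessagesDStream
        .window(new Duration(60000), new Duration(60000)) // window and slide, in ms
        .foreachRDD { rdd =>
          // This closure runs on the driver, but the per-partition work below runs on executors.
          rdd.foreachPartition { partition =>
            partition.foreach(msg => println(msg)) // placeholder for a real sink
          }
        }
    }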

Spark ListenerBus

2017-12-06 Thread KhajaAsmath Mohammed
Hi, I am running a Spark SQL job and it completes without any issues, but I am getting errors such as: ERROR: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate after completion of the job. Could anyone share suggestions on how to avoid it? Thanks, Asmath
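This warning typically means events were still being posted after the SparkContext shut down. A hedged sketch of the usual mitigation, stopping the session exactly once after all work has completed (the query itself is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("orderly-shutdown").getOrCreate()
    try {
      spark.sql("SELECT 1").collect() // placeholder for the actual SQL job
    } finally {
      spark.stop() // stop once, only after every job has finished
    }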

Re: Json Parsing.

2017-12-06 Thread satyajit vegesna
Thank you for the info. Is there a way to get all keys of the JSON, so that I can create a dataframe with the JSON keys, as below: fieldsDataframe.withColumn("data", functions.get_json_object($"RecordString", "$.id")). This appends a single column to the dataframe for the id key. I would like to
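A hedged sketch (Scala, Spark 2.2) of one way to get all keys: let Spark infer the JSON schema from the string column, then expand every key with from_json; the sample data and column names are assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}

    val spark = SparkSession.builder.master("local[*]").appName("json-keys").getOrCreate()
    import spark.implicits._

    val fieldsDataframe = Seq("""{"id":1,"name":"a"}""").toDF("RecordString")

    // Infer a schema covering every key present in the JSON strings.
    val inferredSchema =
      spark.read.json(fieldsDataframe.select("RecordString").as[String]).schema

    // Expand all inferred keys into top-level columns in one shot.
    val expanded = fieldsDataframe
      .withColumn("data", from_json(col("RecordString"), inferredSchema))
      .select(col("RecordString"), col("data.*"))

    expanded.printSchema()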

Re: Json Parsing.

2017-12-06 Thread ayan guha
You can use get_json_object. On Thu, 7 Dec 2017 at 10:39 am, satyajit vegesna wrote: > Does Spark support automatic detection of schema from a JSON string in a dataframe? > I am trying to parse a JSON string and do some transformations on it (would like to append new

Re: Json Parsing.

2017-12-06 Thread ayan guha
On Thu, 7 Dec 2017 at 11:37 am, ayan guha wrote: > You can use get_json_object > On Thu, 7 Dec 2017 at 10:39 am, satyajit vegesna <satyajit.apas...@gmail.com> wrote: >> Does Spark support automatic detection of schema from a JSON string in a dataframe? >> I am

Json Parsing.

2017-12-06 Thread satyajit vegesna
Does Spark support automatic detection of schema from a JSON string in a dataframe? I am trying to parse a JSON string and do some transformations on it (would like to append new columns to the dataframe) from the data I stream from Kafka. But I am not very sure how I can parse the JSON in
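Schema inference is not available on a streaming source, so the usual route is an explicit schema plus from_json. A hedged sketch (Scala, Structured Streaming); the broker, topic, and field names are assumptions, and the spark-sql-kafka-0-10 package must be on the classpath:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    val spark = SparkSession.builder.appName("kafka-json").getOrCreate()

    // Explicit schema for the incoming JSON; adjust to the real payload.
    val schema = new StructType()
      .add("id", LongType)
      .add("name", StringType)

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .select(from_json(col("value").cast("string"), schema).as("json"))
      .select("json.*") // each JSON key becomes a column that can be extended further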

Explode schema name question

2017-12-06 Thread tj5527
Searching online with keywords such as "auto explode schema" doesn't turn up what I am looking for, so I am asking here... I want to explode a Dataset schema where the schema is a nested structure derived from complicated XML, and its structure changes a lot, with high frequency. After checking the API doc, I
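For a schema that changes frequently, one option is to walk the schema programmatically rather than hard-coding paths. A hedged sketch (Scala) that flattens nested structs; arrays would additionally need explode, and the underscore naming is an assumption:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{StructField, StructType}

    // Recursively turn a nested schema into a flat list of aliased columns.
    def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
      schema.fields.flatMap {
        case StructField(name, inner: StructType, _, _) =>
          flattenSchema(inner, s"$prefix$name.")
        case StructField(name, _, _, _) =>
          Seq(col(s"$prefix$name").as(s"$prefix$name".replace(".", "_")))
      }

    def flatten(df: DataFrame): DataFrame = df.select(flattenSchema(df.schema): _*)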

Re: Spark job only starts tasks on a single node

2017-12-06 Thread Ji Yan
I am sure that the other agents have plentiful enough resources, but I don't know why Spark only scheduled executors on one single node, up to that node's capacity (it is a different node every time I run, btw). I checked the DEBUG log from the Spark driver and didn't see any mention of a decline. But
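If a single large offer is being filled before other agents are considered, capping per-executor cores and setting a total can force Spark to take offers from more agents. A hedged sketch; the exact values are assumptions, not a confirmed fix:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("spread-executors")
      .config("spark.executor.cores", "4") // cap cores per executor so one agent cannot absorb everything
      .config("spark.cores.max", "32")     // total cores requested across the cluster
      .config("spark.executor.memory", "8g")
      .getOrCreate()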

sparkSession.sql("sql query") vs df.sqlContext().sql(this.query) ?

2017-12-06 Thread kant kodali
Hi All, I have the following snippets of code and I wonder what the difference between these two is and which one I should use? I am using Spark 2.2. Dataset df = sparkSession.readStream() .format("kafka") .load(); df.createOrReplaceTempView("table"); df.printSchema(); *Dataset
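In Spark 2.x both calls resolve against the same session state, so for a temp view registered on this session they should be interchangeable, with sparkSession.sql being the preferred 2.x entry point. A minimal sketch (Scala); the batch data and the view name table1 are assumptions made to keep the example self-contained:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("sql-entrypoints").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    df.createOrReplaceTempView("table1")

    // Both queries go through the same session state and return the same result.
    val viaSession    = spark.sql("SELECT count(*) FROM table1")
    val viaSqlContext = df.sqlContext.sql("SELECT count(*) FROM table1")

    viaSession.show()
    viaSqlContext.show()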

Buffer/cache exhaustion Spark standalone inside a Docker container

2017-12-06 Thread Stein Welberg
Hi All! I have a very weird memory issue (which is what a lot of people will most likely say ;-)) with Spark running in standalone mode inside a Docker container. Our setup is as follows: we have a Docker container in which we have a Spring Boot application that runs Spark in standalone mode.

A possible bug? Must call persist to make code run

2017-12-06 Thread kwunlyou
I prepared a simple example (Python) as follows to illustrate what I found: - the code works well by calling persist beforehand under all Spark versions - without calling persist, the code works well under Spark 2.2.0 but doesn't work under Spark 2.1.1 and Spark 2.1.2 - it really looks like a
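The poster's Python reproduction is not included in the digest; below is only a generic sketch (Scala) of the workaround being described, persisting a dataset before reusing it in downstream actions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder.master("local[*]").appName("persist-workaround").getOrCreate()
    import spark.implicits._

    val base = Seq(1, 2, 3).toDS().map(_ * 2)
    base.persist(StorageLevel.MEMORY_ONLY) // the workaround: persist before the actions below
    println(base.count())
    println(base.reduce(_ + _))
    base.unpersist()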

Re: Spark job only starts tasks on a single node

2017-12-06 Thread Art Rand
Hello Ji, Spark will launch Executors round-robin on offers, so when the resources on an agent get broken into multiple resource offers it's possible that many Executors get placed on a single agent. However, from your description, it's not clear why your other agents do not get Executors

[ML] LogisticRegression and dataset's standardization before training

2017-12-06 Thread Filipp Zhinkin
Hi, LogisticAggregator [1] scales every sample on every iteration. Without scaling, binaryUpdateInPlace could be rewritten using BLAS.dot, and that would significantly improve performance. However, there is a comment [2] saying that standardization and caching of the dataset before training will
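A hedged sketch (Scala, spark.ml) of the approach that comment alludes to: standardize and cache the data once up front, then tell LogisticRegression not to standardize again; the toy data and column names are assumptions:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.StandardScaler
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("lr-standardize").getOrCreate()
    import spark.implicits._

    val training = Seq(
      (0.0, Vectors.dense(1.0, 20.0)),
      (1.0, Vectors.dense(3.0, 50.0))
    ).toDF("label", "features")

    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(false) // preserve sparsity
      .setWithStd(true)

    val scaled = scaler.fit(training).transform(training).cache() // scale once, reuse every iteration

    val lr = new LogisticRegression()
      .setFeaturesCol("scaledFeatures")
      .setStandardization(false) // data is already standardized above
    val model = lr.fit(scaled)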

Re: unable to connect to cluster 2.2.0

2017-12-06 Thread Imran Rajjad
Thanks, the machine where the Spark job was being submitted had SPARK_HOME pointing to the old 2.1.1 directory. On Wed, Dec 6, 2017 at 1:35 PM, Qiao, Richard wrote: > Are you now building your app using Spark 2.2 or 2.1? > Best Regards > Richard > *From:

Re: How to export the Spark SQL jobs from the HiveThriftServer2

2017-12-06 Thread wenxing zheng
The words "[app-id] will actually be [base-app-id]/[attempt-id], where [base-app-id] is the YARN application ID" are not quite correct, as after I changed [app-id] to [base-app-id], it worked. Maybe we need to fix the document? From the information of the Spark job or the Spark stages, I can't
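A hedged sketch of the REST call that worked, with [base-app-id] in place of [app-id]; the host, port, and application ID are made-up placeholders:

    import scala.io.Source

    val baseAppId = "application_1512345678901_0042" // hypothetical YARN application ID
    val url = s"http://history-server:18080/api/v1/applications/$baseAppId/jobs"
    println(Source.fromURL(url).mkString)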

Re: Do I need to do .collect inside forEachRDD

2017-12-06 Thread Gerard Maas
Hi Kant, > but would your answer on .collect() change depending on running the spark app in client vs cluster mode? No, it should make no difference. -kr, Gerard. On Tue, Dec 5, 2017 at 11:34 PM, kant kodali wrote: > @Richard I don't see any error in the executor log but

Re: unable to connect to cluster 2.2.0

2017-12-06 Thread Qiao, Richard
Are you now building your app using Spark 2.2 or 2.1? Best Regards Richard From: Imran Rajjad Date: Wednesday, December 6, 2017 at 2:45 AM To: "user @spark" Subject: unable to connect to cluster 2.2.0 Hi, Recently upgraded from 2.1.1 to