Re: spark streaming exception

2019-11-10 Thread Akshay Bhardwaj
Hi, Could you provide the code snippet of how you are connecting to and reading data from Kafka? Akshay Bhardwaj +91-97111-33849 On Thu, Oct 17, 2019 at 8:39 PM Amit Sharma wrote: > Please update me if any one knows about it. > > > Thanks > Amit > > On Thu, Oct 10,

Re: driver crashes: need to find out why driver keeps crashing

2019-10-23 Thread Akshay Bhardwaj
Standalone Spark process (master set to local[*])? Spark master-slave cluster? YARN or Mesos cluster, etc? Akshay Bhardwaj +91-97111-33849 On Mon, Oct 21, 2019 at 11:20 AM Manuel Sopena Ballesteros < manuel...@garvan.org.au> wrote: > Dear Apache Spark community, > > >

Re: Why my spark job STATE--> Running FINALSTATE --> Undefined.

2019-06-11 Thread Akshay Bhardwaj
was then irrespective of the Cluster manager used. Akshay Bhardwaj +91-97111-33849 On Tue, Jun 11, 2019 at 7:41 PM Shyam P wrote: > Hi, > Any clue why spark job goes into UNDEFINED state ? > > More detail are in the url. > > https://stackoverflow.com/questions/56545644/why-my-spark-sql-

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Akshay Bhardwaj
Additionally, there is a "uuid" function available as well, if that helps your use case. Akshay Bhardwaj +91-97111-33849 On Thu, Jun 6, 2019 at 3:18 PM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi Marcelo, > > If you are using spark 2.3+ and dataset
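As an illustration of the suggestion above, a minimal Scala sketch (assuming Spark 2.3+, where the SQL `uuid()` function is available; the input DataFrame here is hypothetical, since the thread's actual data is not shown):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("uuid-example").getOrCreate()
import spark.implicits._

// Hypothetical input; stands in for the thread's denormalized dataframe.
val cities = Seq(("BR", "Sao Paulo"), ("BR", "Rio")).toDF("country", "city")

// Add a generated unique-id column via the SQL uuid() function.
val withId = cities.withColumn("id", expr("uuid()"))
withId.show(truncate = false)
```

Each row gets an independently generated UUID string, which avoids coordinating a sequence across partitions.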

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Akshay Bhardwaj
functions. Akshay Bhardwaj +91-97111-33849 On Thu, May 30, 2019 at 4:05 AM Marcelo Valle wrote: > Hi all, > > I am new to spark and I am trying to write an application using dataframes > that normalize data. > > So I have a dataframe `denormalized_cities` with 3 columns: CO

Re: Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-29 Thread Akshay Bhardwaj
object stores before they can be referenced in Spark. As you mention you are using Azure blob files, this should explain the behaviour where everything seems to stop. You can reduce this time by ensuring you have a small number of large files in your blob store to read from, instead of the reverse. Aksha

Re: double quote is automatically added when sinking as CSV

2019-05-21 Thread Akshay Bhardwaj
Hi, Add writeStream.option("quoteMode", "NONE") By default the Spark Dataset API assumes that all values MUST BE enclosed in the quote character (default: ") while writing to CSV files. Akshay Bhardwaj +91-97111-33849 On Tue, May 21, 2019 at 5:34 PM 杨浩 wrote: > We us
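As a sketch, the suggested option on a streaming CSV sink might look like the following (Scala; the rate source, paths, and app name are placeholders since the thread's actual pipeline is not shown, and support for the "quoteMode" option can vary by Spark version):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("csv-sink").getOrCreate()

// Placeholder streaming source standing in for the thread's real input.
val df = spark.readStream.format("rate").load()
  .selectExpr("CAST(value AS STRING) AS value")

val query = df.writeStream
  .format("csv")
  .option("quoteMode", "NONE")               // option suggested in this thread
  .option("path", "/tmp/csv-out")            // placeholder output path
  .option("checkpointLocation", "/tmp/csv-ckpt")
  .start()
```

This is a sketch under the thread's assumption; checking the CSV data-source options for the Spark version in use is advisable before relying on it.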

Re: Spark-YARN | Scheduling of containers

2019-05-20 Thread Akshay Bhardwaj
Hi Hari, Thanks for this information. Do you have any resources on, or can you explain, why YARN has this as the default behaviour? What would be the advantages/scenarios of having multiple assignments in a single heartbeat? Regards Akshay Bhardwaj +91-97111-33849 On Mon, May 20, 2019 at 1:29 PM Hariharan

Re: Spark-YARN | Scheduling of containers

2019-05-19 Thread Akshay Bhardwaj
Hi All, Just floating this email again. Grateful for any suggestions. Akshay Bhardwaj +91-97111-33849 On Mon, May 20, 2019 at 12:25 AM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi All, > > I am running Spark 2.3 on YARN using HDP 2.6 > > I am running sp

Spark-YARN | Scheduling of containers

2019-05-19 Thread Akshay Bhardwaj
YARN decide which nodes to launch containers on? I have around 12 YARN nodes running in the cluster, but still I see repeated patterns of 3-4 containers launched on the same node for a particular job. What is the best way to start debugging this? Akshay Bhardwaj +91-97111-33849

Re: Spark job gets hung on cloudera cluster

2019-05-16 Thread Akshay Bhardwaj
can communicate with Name node service? Akshay Bhardwaj +91-97111-33849 On Thu, May 16, 2019 at 4:27 PM Rishi Shah wrote: > on yarn > > On Thu, May 16, 2019 at 1:36 AM Akshay Bhardwaj < > akshay.bhardwaj1...@gmail.com> wrote: > >> Hi Rishi, >> >> Are you

Re: Running spark with javaagent configuration

2019-05-15 Thread Akshay Bhardwaj
Hi Anton, Do you have the option of storing the JAR file on HDFS, which can be accessed via spark in your cluster? Akshay Bhardwaj +91-97111-33849 On Thu, May 16, 2019 at 12:04 AM Oleg Mazurov wrote: > You can see what Uber JVM does at > https://github.com/uber-common/jvm-pr

Re: Spark job gets hung on cloudera cluster

2019-05-15 Thread Akshay Bhardwaj
Hi Rishi, Are you running spark on YARN or spark's master-slave cluster? Akshay Bhardwaj +91-97111-33849 On Thu, May 16, 2019 at 7:15 AM Rishi Shah wrote: > Any one please? > > On Tue, May 14, 2019 at 11:51 PM Rishi Shah > wrote: > >> Hi All, >> >> At times

Spark Elasticsearch Connector | Index and Update

2019-05-10 Thread Akshay Bhardwaj
Say if I have 2 documents in the partition, and based on a field I want to index the document, and based on another field I want to update the document with an inline script. Is there a possibility to do this in the same writeStream for Elasticsearch in Spark Structured Streaming? Akshay Bh

Re: Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-08 Thread Akshay Bhardwaj
…progress status displays a lot of metrics that should be your first diagnostic for identifying issues. The progress status with a Kafka stream displays the "startOffset" and "endOffset" values per batch. These are listed per topic-partition, giving the start-to-end offsets per trigger batch of stre
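These per-batch offsets can also be inspected programmatically; a small sketch (Scala, assuming a running `StreamingQuery` handle named `query`, which is hypothetical here):

```scala
// lastProgress returns the most recent StreamingQueryProgress (or null
// before the first batch completes). Its JSON includes numInputRows and,
// for Kafka sources, per-partition startOffset/endOffset values.
val progress = query.lastProgress
if (progress != null) {
  println(progress.json)
}
```

Polling this in a monitoring loop (or via a `StreamingQueryListener`) gives the same diagnostics described above without reading driver logs.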

What is Spark context cleaner in structured streaming

2019-05-02 Thread Akshay Bhardwaj
have streaming interval of 500ms, reading data from Kafka topic with max batch size as 1000. Akshay Bhardwaj +91-97111-33849

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
proved to be unreliable, as I have encountered corrupted files which cause errors on job restarts. Akshay Bhardwaj +91-97111-33849 On Wed, May 1, 2019 at 3:20 PM Anastasios Zouzias wrote: > Hi, > > Have you checked the docs, i.e., > https://spark.apache.org/docs/latest/structur

Re: Spark Structured Streaming | Highly reliable de-duplication strategy

2019-05-01 Thread Akshay Bhardwaj
Hi All, Floating this again. Any suggestions? Akshay Bhardwaj +91-97111-33849 On Tue, Apr 30, 2019 at 7:30 PM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi Experts, > > I am using spark structured streaming to read message from Kafka, with a > producer that w

Spark Structured Streaming | Highly reliable de-duplication strategy

2019-04-30 Thread Akshay Bhardwaj
if the checksum is not present in the KV store. My doubt with this approach is how to ensure a safe write to both the 2nd topic and the KV store for storing the checksum, in the case of unwanted failures. How does that guarantee exactly-once with restarts? Any suggestions are highly appreciated. Akshay

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Akshay Bhardwaj
Hi Austin, Are you using Spark Streaming or Structured Streaming? For better understanding, could you also provide sample code/config params for your spark-kafka connector for the said streaming job? Akshay Bhardwaj +91-97111-33849 On Mon, Apr 29, 2019 at 10:34 PM Austin Weaver wrote

Re: spark structured streaming crash due to decompressing gzip file failure

2019-03-07 Thread Akshay Bhardwaj
Hi, In your spark-submit command, try using the below config property and see if this solves the problem: --conf spark.sql.files.ignoreCorruptFiles=true For me, this worked to skip empty or partially uploaded gzip files in an S3 bucket. Akshay Bhardwaj +91-97111-33849 On Thu, Mar 7, 2019
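A command-line sketch of the suggested invocation (a config fragment; the class and JAR names are placeholders, not from the thread):

```shell
spark-submit \
  --class com.example.MyStreamingApp \
  --conf spark.sql.files.ignoreCorruptFiles=true \
  my-streaming-app.jar
```

With this flag set, file-based sources log and skip files that fail to read (e.g. truncated gzip parts) instead of failing the job.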

Re: Structured Streaming to Kafka Topic

2019-03-06 Thread Akshay Bhardwaj
Hi Pankaj, What version of Spark are you using? If you are using 2.4+ then there is an inbuilt function "to_json" which converts the columns of your dataset to JSON format. https://spark.apache.org/docs/2.4.0/api/sql/#to_json Akshay Bhardwaj +91-97111-33849 On Wed, Mar 6, 2019 a
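A minimal sketch of `to_json` over all columns (Scala, Spark 2.4+; the input data here is hypothetical), which is the usual shape for producing a Kafka "value" column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder.appName("to-json-example").getOrCreate()
import spark.implicits._

// Hypothetical input standing in for the thread's dataset.
val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Pack all columns into a struct and serialize it as one JSON string column.
val json = df.select(to_json(struct(col("*"))).alias("value"))
json.show(truncate = false)
```

The resulting single string column named "value" is what the Kafka sink expects when writing the stream out to a topic.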

Re: "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-06 Thread Akshay Bhardwaj
Also, what is the average kafka record message size in bytes? Akshay Bhardwaj +91-97111-33849 On Wed, Mar 6, 2019 at 1:26 PM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi, > > To better debug the issue, please check the below co

Re: "java.lang.AssertionError: assertion failed: Failed to get records for **** after polling for 180000" error

2019-03-05 Thread Akshay Bhardwaj
is not set, then poll.ms defaults to spark.network.timeout). Akshay Bhardwaj +91-97111-33849 On Wed, Mar 6, 2019 at 8:39 AM JF Chen wrote: > When my kafka executor reads data from kafka, sometimes it throws the > error "java.lang.AssertionError: assertion failed: Failed to

Re: Is there a way to validate the syntax of raw spark sql query?

2019-03-05 Thread Akshay Bhardwaj
as the schema of tables/views used. If there is an issue with your SQL syntax, the method throws the below exception, which you can catch: org.apache.spark.sql.catalyst.parser.ParseException Hope this helps! Akshay Bhardwaj +91-97111-33849 On Fri, Mar 1, 2019 at 10:23 PM kant kodali wrote:
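A sketch of syntax-only validation along these lines (Scala; note this catches parse errors only, while unresolved tables or columns surface later as an AnalysisException when the query is analyzed):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.parser.ParseException

val spark = SparkSession.builder.appName("sql-validate").getOrCreate()

// Parses the statement without executing it or resolving any schemas.
def isValidSyntax(sql: String): Boolean =
  try {
    spark.sessionState.sqlParser.parsePlan(sql)
    true
  } catch {
    case _: ParseException => false
  }

println(isValidSyntax("SELECT a FROM t"))   // syntactically valid
println(isValidSyntax("SELEC a FRM t"))     // parse error, returns false
```

`sessionState` is an internal, unstable API, so this is a pragmatic sketch rather than a supported validation entry point.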

Re: Spark 2.4.0 Master going down

2019-02-28 Thread Akshay Bhardwaj
are accessible? 3) Have you checked the memory consumption of the executors/driver running in the cluster? Akshay Bhardwaj +91-97111-33849 On Wed, Feb 27, 2019 at 8:27 PM lokeshkumar wrote: > Hi All > > We are running Spark version 2.4.0 and we run few Spark streaming jobs > listening on

Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Akshay Bhardwaj
Hi Gabor, I guess you are looking at Kafka 2.1 but Guillermo mentioned initially that they are working with Kafka 1.0 Akshay Bhardwaj +91-97111-33849 On Wed, Feb 27, 2019 at 5:41 PM Gabor Somogyi wrote: > Where exactly? In Kafka broker configuration section here it's 10080: >

Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Akshay Bhardwaj
Hi Gabor, I am talking about offsets.retention.minutes, which defaults to 1440 (i.e., 24 hours). Akshay Bhardwaj +91-97111-33849 On Wed, Feb 27, 2019 at 4:47 PM Gabor Somogyi wrote: > Hi Akshay, > > The feature what you've mentioned has a default value of 7 days... > > BR,

Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-26 Thread Akshay Bhardwaj
Hi Guillermo, What was the interval between restarting the spark job? As a feature in Kafka, a broker deletes offsets for a consumer group after 24 hours of inactivity. In such a case, the newly started spark streaming job will read offsets from the beginning for the same groupId. Akshay Bhardwaj
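The broker setting in question is a config fragment in the Kafka broker's server.properties; per the thread, the default was 1440 minutes (24 hours) on the Kafka 1.0 line, while later Kafka releases raised it to 10080 (7 days):

```properties
# server.properties (Kafka broker)
# How long the broker retains committed offsets for an inactive consumer group.
offsets.retention.minutes=1440
```

Raising this value (or keeping the consumer group active) prevents a restarted streaming job with the same groupId from falling back to its auto.offset.reset behaviour.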

Re: Spark 2.3 | Structured Streaming | Metric for numInputRows

2019-02-26 Thread Akshay Bhardwaj
"startOffset" : { "kafka_events_topic" : { "2" : 32822078, "1" : 114248484, "0" : 114242134 } }, "endOffset" : { "kafka_events_topic" : { "2" : 32822496,

Spark 2.3 | Structured Streaming | Metric for numInputRows

2019-02-26 Thread Akshay Bhardwaj
while also filtering events fetched from a CSV file? I am also open to suggestions if there is a better way of filtering out the prohibited events in Structured Streaming. Thanks in advance. Akshay Bhardwaj +91-97111-33849