Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Not authorized to access group: spark-kafka-source-060f3ceb-09f4-4e28-8210-3ef8a845fc92--2038748645-driver-2

2019-02-12 Thread Allu Thomas
Hi there, my use case is to read a simple JSON message from a Kafka queue using Spark Structured Streaming. But I'm getting the following error message when I run my Kafka consumer. I don't get this error when using the Spark direct stream; the issue is happening only with Structured Streaming. Any
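
A minimal sketch of the kind of Structured Streaming reader in play (illustrative, not from the thread; the broker and topic are placeholders). Note that Structured Streaming generates its own consumer group id with exactly the spark-kafka-source-<uuid>-...-driver-<n> shape seen in the error above, so on a secured cluster the Kafka ACLs must authorize that generated group, for example with a prefix or wildcard rule, since the exact id cannot be predicted in advance:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-json-reader").getOrCreate()
    import spark.implicits._

    // Read the raw JSON payloads as strings; the consumer group id used here
    // is generated by Spark itself (spark-kafka-source-...-driver-...).
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING)").as[String]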

Re: Dataset experimental interfaces

2019-02-12 Thread yeikel
If you tested it end to end with the current version and it works fine, I'd say go ahead unless there is another similar way. If they change the functionality, you can always update it. Regarding "non-experimental" functions, they could also be marked as deprecated and then removed in later

Spark with Kubernetes connecting to pod id, not address

2019-02-12 Thread Pat Ferrel
We have a k8s deployment of several services including Apache Spark. All services seem to be operational. Our application
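
A hedged sketch of the configuration knobs usually involved when a driver advertises an unresolvable pod identity rather than a routable address (the service name below is hypothetical, and none of this is confirmed by the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Listen on all interfaces inside the pod...
      .config("spark.driver.bindAddress", "0.0.0.0")
      // ...but advertise a name that executors and other pods can resolve,
      // e.g. a headless k8s Service for the driver pod (hypothetical name).
      .config("spark.driver.host", "driver-svc.default.svc.cluster.local")
      .getOrCreate()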

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-12 Thread Vadim Semenov
Yeah, then the easiest would be to fork Spark and run using the forked version, and in the case of YARN it should be pretty easy to do:

    git clone https://github.com/apache/spark.git
    cd spark
    export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=512m"
    ./build/mvn -DskipTests clean package
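
A hedged follow-up, not stated in the thread: once the fork is built, one way to make a YARN cluster run it instead of the stock build is to upload the assembled jars somewhere the cluster can read and point spark.yarn.jars at them in spark-defaults.conf (the HDFS path here is a placeholder):

    spark.yarn.jars  hdfs:///apps/spark/forked-jars/*.jar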

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
* It is not getPartitions() but getNumPartitions(). On Tue, Feb 12, 2019 at 13:08, Pedro Tuero (tuerope...@gmail.com) wrote: > And this is happening in every job I run. It is not just one case. If I > add a forced repartition it works fine, even better than before. But I run >

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
And this is happening in every job I run. It is not just one case. If I add a forced repartition it works fine, even better than before. But I run the same code for different inputs, so the number of partitions to repartition to must be related to the input. On Tue, Feb 12, 2019 at 11:22, Pedro

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
Hi Jacek. I'm not using Spark SQL, I'm using the RDD API directly. I can confirm that the jobs and stages are the same on both executions. In the Environment tab of the web UI, spark.default.parallelism=128 is shown when using Spark 2.4, while in 2.3.1 it is not. But in 2.3.1 it should be the same, because
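
A quick way to see what is driving the task counts (a sketch only; the input RDD and the 128 are placeholders):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val sc = spark.sparkContext
    val rdd: RDD[Int] = sc.parallelize(1 to 1000000) // stand-in for the real input

    // Compare what the config says with what the RDD actually has.
    println(s"spark.default.parallelism = ${sc.defaultParallelism}")
    println(s"rdd partitions            = ${rdd.getNumPartitions}")

    // Passing numPartitions explicitly to shuffle operations removes the
    // dependency on spark.default.parallelism, which this thread suspects
    // behaves differently between 2.3.1 and 2.4.
    val counts = rdd.keyBy(_ % 10).reduceByKey(_ + _, 128)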

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Jacek Laskowski
Hi, Can you show the plans with explain(extended=true) for both versions? That's where I'd start to pinpoint the issue. Perhaps the underlying execution engine changed in a way that affects keyBy? Dunno, just guessing... Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL
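
For reference, the two inspection calls in question, one line each (df and rdd stand for whatever Dataset or RDD is being compared across versions):

    // Dataset/DataFrame: print the parsed, analyzed, optimized and physical plans.
    df.explain(extended = true)

    // RDD API (which this thread is on): print the lineage, including
    // shuffle boundaries and the partition count at each step.
    println(rdd.toDebugString)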

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-12 Thread Serega Sheypak
I tried a similar approach; it works well for user functions, but I need to crash tasks or executors when the Spark application runs "repartition". I didn't find any way to inject a "poison pill" into the repartition call :( On Mon, Feb 11, 2019 at 21:19, Vadim Semenov wrote: > something like this > > import
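
Vadim's code is cut off above; purely as an illustration of the approach he seems to describe (my sketch, not his code), a "poison pill" in a user function can key off TaskContext:

    import org.apache.spark.{SparkEnv, TaskContext}
    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().getOrCreate().sparkContext
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    // Fail partition 0 on its first attempt only: the retry succeeds,
    // simulating a transient task crash.
    val poisoned = rdd.map { x =>
      val ctx = TaskContext.get()
      if (ctx.partitionId() == 0 && ctx.attemptNumber() == 0)
        throw new RuntimeException(s"poison pill on executor ${SparkEnv.get.executorId}")
      x
    }
    poisoned.count()

As Serega notes, this only reaches user code; the shuffle that repartition runs is internal to Spark, which is what motivates the fork suggestion earlier in this digest.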

Re: Data growth vs Cluster Size planning

2019-02-12 Thread Phillip Henry
Too little information to give an answer, if indeed an a priori answer is possible. However, I would do the following on your test instances: - Run jstat -gc on all your nodes. It might be that the GC is taking a lot of time. - Poll with jstack semi-frequently. I can give you a fairly good idea

Re: structured streaming handling validation and json flattening

2019-02-12 Thread Phillip Henry
Hi, I'm in a somewhat similar situation. Here's what I do (it seems to be working so far):
1. Stream in the JSON as a plain string.
2. Feed this string into a JSON library to validate it (I use Circe).
3. Using the same library, parse the JSON and extract fields X, Y and Z.
4. Create a dataset
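
A sketch of steps 1-4 under stated assumptions (Circe for validation and parsing, as mentioned; the broker, topic, field names x/y/z and the Record schema are all placeholders):

    import io.circe.parser.parse
    import org.apache.spark.sql.SparkSession

    case class Record(x: String, y: Long, z: Double) // placeholder schema

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // 1. Stream the JSON in as plain strings.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .load()
      .selectExpr("CAST(value AS STRING)").as[String]

    // 2./3. Validate with Circe, then extract the fields; rows that fail
    // either step are dropped here (they could instead go to a dead-letter sink).
    // 4. The result is a typed Dataset[Record].
    val records = raw.flatMap { s =>
      parse(s).toOption.flatMap { json =>
        val c = json.hcursor
        for {
          x <- c.get[String]("x").toOption
          y <- c.get[Long]("y").toOption
          z <- c.get[Double]("z").toOption
        } yield Record(x, y, z)
      }
    }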