Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-26 Thread Gabor Somogyi
Hi Karan, Please have a look at the Stack Overflow comment I made two days ago. G On Fri, 25 Feb 2022, 23:31 karan alang, wrote: > Hello All, > I'm running a StructuredStreaming program on GCP Dataproc, which reads > data from Kafka, does some processing and puts processed data back into > Kafka.

Re: JDBCConnectionProvider in Spark

2022-01-07 Thread Gabor Somogyi
We expected that it would be hard to understand all the aspects at first, so we created an explanation for it. Please see the following readme which hopefully answers most of your questions: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md On

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Gabor Somogyi
+1 on Sean's opinion On Wed, Apr 7, 2021 at 2:17 PM Sean Owen wrote: > You shouldn't be modifying your cluster install. You may at this point > have conflicting, excess JARs in there somewhere. I'd start it over if you > can. > > On Wed, Apr 7, 2021 at 7:15 AM Gabor S

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Gabor Somogyi
> > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-07 Thread Gabor Somogyi
>>>> Now I deleted ~/.ivy2 directory and ran the job again >>>> >>>> Ivy Default Cache set to: /home/hduser/.ivy2/cache >>>> The jars for the packages stored in: /home/hduser/.ivy2/jars >>>> org.apache.spark#spark-sql-kafka-0-10_2.12 added as

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
> I gather from your second mail, there seems to be an issue with >>> spark-sql-kafka-0-10_2.12-3.*1*.1.jar ? >>> >>> HTH >>> >>> >>> >>>view my Linkedin profile >>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
and passed (just checked it). If you think there is an issue in the code and/or packaging please open a jira with more details. BR, G On Tue, Apr 6, 2021 at 12:54 PM Gabor Somogyi wrote: > Since you've not shared too much details I presume you've updated the > spark-sql-kafka >

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
Since you haven't shared many details, I presume you've updated only the spark-sql-kafka jar. KafkaTokenUtil is in the token provider jar. As a general note, if I'm right, please update Spark as a whole on all nodes and not just individual jars. BR, G On Tue, Apr 6, 2021 at 10:21 AM Mich

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Gabor Somogyi
>>>> arising from such loss, damage or destruction. >>>> >>>> >>>> >>>> On Sun, 4 Apr 2021 at 16:27, Ali Gouta wrote: >>>> >>>>> Thank you guys for your answers, I will dig more this new way of doing >>

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Gabor Somogyi
There is no way to store offsets in Kafka and restart from the stored offset. Structured Streaming stores offsets in the checkpoint and restarts from there without any user code. Offsets can be stored with a listener, but that can only be used for lag calculation. BR, G On Sat, 3 Apr 2021, 21:09 Ali
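The listener-based lag bookkeeping mentioned above can be sketched in plain Python: a minimal, hypothetical parser for the progress payload a streaming query reports (the shape follows Spark's JSON progress output, where the Kafka source's `endOffset` field is itself a JSON string; the sample data is made up):

```python
import json

def kafka_end_offsets(progress: dict) -> dict:
    """Pull per-partition end offsets out of a streaming query progress
    dict. In Spark's progress JSON the Kafka source's `endOffset` is a
    JSON string of the shape {"topic": {"partition": offset, ...}}."""
    offsets = {}
    for source in progress.get("sources", []):
        per_topic = json.loads(source["endOffset"])
        for topic, partitions in per_topic.items():
            for partition, offset in partitions.items():
                offsets[(topic, int(partition))] = offset
    return offsets

# Hypothetical progress payload of the shape Spark reports:
sample = {"sources": [{"endOffset": json.dumps({"events": {"0": 42, "1": 17}})}]}
print(kafka_end_offsets(sample))  # {('events', 0): 42, ('events', 1): 17}
```

Comparing these values against the broker's latest offsets gives the lag; as the mail says, this is monitoring only, not a restart mechanism.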

Re: Spark streaming giving error for version 2.4

2021-03-16 Thread Gabor Somogyi
Well, this is not much. Please provide driver and executor logs... G On Tue, Mar 16, 2021 at 6:03 AM Renu Yadav wrote: > Hi Team, > > > I have upgraded my spark streaming from 2.2 to 2.4 but getting below error: > > > spark-streaming-kafka_0-10.2.11_2.4.0 > > > scala 2.11 > > > Any Idea? > >

Re: How to upgrade kafka client in spark_streaming_kafka 2.2

2021-03-12 Thread Gabor Somogyi
Regards, > Renu yadav > > On Fri, Mar 12, 2021 at 7:25 PM Renu Yadav wrote: > >> Thanks Gabor, >> This is very useful. >> >> Regards, >> Renu Yadav >> >> On Fri, Mar 12, 2021 at 5:36 PM Gabor Somogyi >> wrote: >> >>>

Re: Issue while consuming message in kafka using structured streaming

2021-03-12 Thread Gabor Somogyi
> > On Fri, Mar 12, 2021 at 5:30 PM Gabor Somogyi > wrote: > >> Since you've not provided any version I guess you're using 2.x and you're >> hitting this issue: https://issues.apache.org/jira/browse/SPARK-28367 >> The executor side must be resolved out of the box in the l

Re: How to upgrade kafka client in spark_streaming_kafka 2.2

2021-03-12 Thread Gabor Somogyi
Kafka client upgrade is not a trivial change which may or may not work since new versions can contain incompatible API and/or behavior changes. I've collected how Spark evolved in terms of Kafka client and there I've gathered the breaking changes to make our life easier. Have a look and based on

Re: Issue while consuming message in kafka using structured streaming

2021-03-12 Thread Gabor Somogyi
Since you haven't provided any version, I guess you're using 2.x and hitting this issue: https://issues.apache.org/jira/browse/SPARK-28367 The executor side must be resolved out of the box in the latest Spark version, however on the driver side one must set "

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Gabor Somogyi
Good to hear and great work Hyukjin!  On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, wrote: > Thanks Hyukjin for driving the huge release, and thanks everyone for > contributing the release! > > On Wed, Mar 3, 2021 at 6:54 PM angers zhu wrote: > >> Great work, Hyukjin ! >> >> Bests, >> Angers >> >>

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Gabor Somogyi
The most interesting part is that you've added this: kafka-clients-0.10.2.2.jar Spark 3.0.1 uses Kafka clients 2.4.1. Downgrading with such a big step doesn't help. Please remove that as well, together with the Spark-Kafka dependency. G On Thu, 21 Jan 2021, 22:45 gshen, wrote: > Thanks for the tips! >
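One way to avoid hand-copied, mismatched Kafka jars altogether is to let spark-submit resolve a connector version matching the Spark build; a sketch (coordinates are an example for Spark 3.0.1 / Scala 2.12, and the app name is a placeholder):

```shell
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 \
  my_streaming_app.py
```

This pulls in a consistent spark-sql-kafka and kafka-clients pair instead of whatever happens to sit on the cluster.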

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Gabor Somogyi
/GoogleHadoopOutputStream.java#L114 The linked code is the master but it just doesn't fit... G On Thu, Jan 21, 2021 at 9:18 AM Gabor Somogyi wrote: > I've doubled checked this and came to the same conclusion just like > Jungtaek. > I've added a comment to the stackoverflow post to reach mo

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Gabor Somogyi
I've double-checked this and came to the same conclusion, just like Jungtaek. I've added a comment to the stackoverflow post to reach more people with the answer. G On Thu, Jan 21, 2021 at 6:53 AM Jungtaek Lim wrote: > I quickly looked into the attached log in SO post, and the problem doesn't

Re: Data source v2 streaming sinks does not support Update mode

2021-01-19 Thread Gabor Somogyi
>>>>> the repo, I believe just commit it to your personal repo and that should >>>>> be >>>>> it. >>>>> >>>>> Regards >>>>> >>>>> On Mon, 18 Jan 2021 at 15:46, Eric Beabes >>>>> wrote: >>>

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Gabor Somogyi
> <execution> <phase>generate-test-sources</phase> <goals><goal>add-test-source</goal></goals> <configuration> <sources><source>src/test/scala</source></sources> </configuration> </execution>

Re: Data source v2 streaming sinks does not support Update mode

2021-01-13 Thread Gabor Somogyi
Just reached this thread. +1 on creating a simple reproducer app, and I suggest creating a jira attaching the full driver and executor logs. Ping me on the jira and I'll pick this up right away... Thanks! G On Wed, Jan 13, 2021 at 8:54 AM Jungtaek Lim wrote: > Would you mind if I ask for a

Re: Spark standalone - reading kerberos hdfs

2021-01-08 Thread Gabor Somogyi
A TGT is not enough, you need an HDFS token, which can be obtained by Spark. Please check the logs... On Fri, 8 Jan 2021, 18:51 Sudhir Babu Pothineni, wrote: > I spin up a spark standalone cluster (spark.autheticate=false), submitted > a job which reads remote kerberized HDFS, > > val spark =
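For reference, the usual way to let Spark obtain the HDFS delegation token itself is to pass a principal and keytab at submit time; a sketch (principal and path are placeholders, and availability depends on deploy mode and Spark version):

```shell
spark-submit \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  ...
```

With these, Spark logs in from the keytab and fetches/renews the HDFS token on the application's behalf.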

Re: Jdbc Hook in Spark Batch Application

2020-12-25 Thread Gabor Somogyi
problem that the classes referenced in the code need to be > modified. I want to try not to change the existing code. > > Gabor Somogyi 于2020年12月25日周五 上午12:16写道: > >> One can wrap the JDBC driver and such a way eveything can be sniffed. >> >> On Thu, 24 Dec 2020, 03:51 lec

Re: Jdbc Hook in Spark Batch Application

2020-12-24 Thread Gabor Somogyi
One can wrap the JDBC driver and that way everything can be sniffed. On Thu, 24 Dec 2020, 03:51 lec ssmi, wrote: > Hi: > guys, I have some spark programs that have database connection > operations. I want to acquire the connection information, such as jdbc > connection properties, but

Re: Cannot perform operation after producer has been closed

2020-12-09 Thread Gabor Somogyi
for over 2 weeks! > > THANKS once again. Hopefully, at some point we can switch to Spark 3.0. > > > On Fri, Nov 20, 2020 at 7:30 AM Gabor Somogyi > wrote: > >> Happy that saved some time for you :) >> We've invested quite an effort in the latest releases into streaming a

Re: Is there a better way to read kerberized impala tables by spark jdbc?

2020-12-08 Thread Gabor Somogyi
At the moment I can't think of any better way, but we've added a custom JdbcConnectionProvider API in Spark 3.1. Hope that will make life easier in the future... G On Tue, Dec 8, 2020 at 3:55 AM eab...@163.com wrote: > Hi: > > I want to use spark jdbc to read kerberized impala tables, like: > ``` >

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Gabor Somogyi
e problem took for in the cluster? > > Regards > Amit > > On Monday, December 7, 2020, Gabor Somogyi > wrote: > >> + Adding back user list. >> >> I've had a look at the Spark code and it's not >> modifying "partition.assignment.strategy" so the problem &

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Gabor Somogyi
+ Adding back user list. I've had a look at the Spark code and it's not modifying "partition.assignment.strategy" so the problem must be either in your application or in your cluster setup. G On Mon, Dec 7, 2020 at 12:31 PM Gabor Somogyi wrote: > It's super interesting because t

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Gabor Somogyi
+1 on the mentioned change, Spark uses the following kafka-clients library: 2.4.1 G On Mon, Dec 7, 2020 at 9:30 AM German Schiavon wrote: > Hi, > > I think the issue is that you are overriding the kafka-clients that comes > with spark-sql-kafka-0-10_2.12 > > > I'd try removing the

Re: Cannot perform operation after producer has been closed

2020-11-20 Thread Gabor Somogyi
2020 at 8:18 AM Gabor Somogyi > wrote: > >> "spark.kafka.producer.cache.timeout" is available since 2.2.1 which can >> be increased as a temporary workaround. >> This is not super elegant but works which gives enough time to migrate to >> Spark 3. >> &

Re: Cannot perform operation after producer has been closed

2020-11-19 Thread Gabor Somogyi
"spark.kafka.producer.cache.timeout" is available since 2.2.1 which can be increased as a temporary workaround. This is not super elegant but works which gives enough time to migrate to Spark 3. On Wed, Nov 18, 2020 at 11:12 PM Eric Beabes wrote: > I must say.. *Spark has let me down in this

Re: Custom JdbcConnectionProvider

2020-10-29 Thread Gabor Somogyi
I've compiled the latest master locally and then put it into my m2 repo. mvn clean install... On Wed, Oct 28, 2020 at 8:36 PM rafaelkyrdan wrote: > Thank you Gabor Somogyi for highlighting the issue. > > I cannot compile the example because 3.1.0-SNAPSHOT is not available in > m

Re: Custom JdbcConnectionProvider

2020-10-28 Thread Gabor Somogyi
Please use the latest snapshot as dependency + update force the snapshots. On Wed, Oct 28, 2020 at 5:42 PM rafaelkyrdan wrote: > I cloned your example and it is also cannot be compiled > < > http://apache-spark-user-list.1001560.n3.nabble.com/file/t10977/Screenshot_2020-10-28_182213.png> > > >

Re: spark-submit parameters about two keytab files to yarn and kafka

2020-10-28 Thread Gabor Somogyi
Hi, Cross-realm trust must be configured. One can find several docs on how to do that. BR, G On Wed, Oct 28, 2020 at 8:21 AM big data wrote: > Hi, > > We want to submit spark streaming job to YARN and consume Kafka topic. > > YARN and Kafka are in two different clusters, and they have the >

Re: Custom JdbcConnectionProvider

2020-10-27 Thread Gabor Somogyi
Thanks Takeshi for sharing it, that can be used as an example. The user and developer guide will come soon... On Tue, Oct 27, 2020 at 2:31 PM Takeshi Yamamuro wrote: > Hi, > > Please see an example code in > https://github.com/gaborgsomogyi/spark-jdbc-connection-provider ( >

Re: Spark JDBC- OAUTH example

2020-10-01 Thread Gabor Somogyi
If I > understand this correctly, you use OAuth to gain user access at the web > portal level, but use DBMS authentication at the JDBC level. > > -- ND > On 9/30/20 2:11 PM, Gabor Somogyi wrote: > > Not sure there is already a way. I'm just implementing JDBC connection > provider

Re: Spark JDBC- OAUTH example

2020-09-30 Thread Gabor Somogyi
Not sure there is already a way. I'm just implementing the JDBC connection provider which will make it available in 3.1. Just to be clear, when the API is available a custom connection provider must be implemented. With the current Spark one can try to write a driver wrapper which does the authentication. G

Re: Offset Management in Spark

2020-09-30 Thread Gabor Somogyi
Hi, Structured Streaming stores offsets only in HDFS-compatible filesystems, and Kafka and S3 are not such. Custom offset storage was only an option in DStreams. G On Wed, Sep 30, 2020 at 9:45 AM Siva Samraj wrote: > Hi all, > > I am using Spark Structured Streaming (Version 2.3.2). I need to

Re: Structured Streaming Checkpoint Error

2020-09-17 Thread Gabor Somogyi
Hi, Structured Streaming simply does not work when the checkpoint location is on S3, due to S3's lack of read-after-write consistency. Please choose an HDFS-compliant filesystem and it will work like a charm. BR, G On Wed, Sep 16, 2020 at 4:12 PM German Schiavon wrote: > Hi! > > I have an Structured
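The advice can be captured in a small hypothetical helper that flags checkpoint URIs on object-store schemes lacking the read-after-write guarantees referred to above (the scheme list and paths are illustrative only):

```python
from urllib.parse import urlparse

# Hypothetical check: object-store schemes that (at the time of this mail)
# lacked read-after-write consistency and so break checkpointing.
NON_COMPLIANT_SCHEMES = {"s3", "s3a", "s3n"}

def is_safe_checkpoint(path: str) -> bool:
    """Return True when the checkpoint URI scheme is not a known
    non-HDFS-compliant object store."""
    return urlparse(path).scheme not in NON_COMPLIANT_SCHEMES

print(is_safe_checkpoint("hdfs://namenode/checkpoints/q1"))  # True
print(is_safe_checkpoint("s3a://bucket/checkpoints/q1"))     # False
```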

Re: Spark Streaming Checkpointing

2020-09-04 Thread Gabor Somogyi
Hi Andras, A general suggestion is to use Structured Streaming instead of DStreams because it provides several things out of the box (stateful streaming, etc...). Kafka 0.8 is super old and deprecated (no security...). Do you have a specific reason to use that? BR, G On Thu, Sep 3, 2020 at

Re: [Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-28 Thread Gabor Somogyi
Hi Amit, The answer is no. G On Fri, Aug 28, 2020 at 9:16 AM Jungtaek Lim wrote: > Hi Amit, > > if I remember correctly, you don't need to restart the query to reflect > the newly added topic and partition, if your subscription covers the topic > (like subscribe pattern). Please try it out.

Re: Kafka with Spark Streaming work on local but it doesn't work in Standalone mode

2020-07-24 Thread Gabor Somogyi
Hi Davide, Please see the doc: *Note: Kafka 0.8 support is deprecated as of Spark 2.3.0.* Have you tried the same with Structured Streaming and not with DStreams? If you insist somehow to DStreams you can use spark-streaming-kafka-0-10 connector instead. BR, G On Fri, Jul 24, 2020 at 12:08 PM

Re: How To Access Hive 2 Through JDBC Using Kerberos

2020-07-06 Thread Gabor Somogyi
Hi Daniel, I'm just working on the developer API where any custom JDBC connection provider(including Hive) can be added. Not sure what you mean by third-party connection but AFAIK there is no workaround at the moment. BR, G On Mon, Jul 6, 2020 at 12:09 PM Daniel de Oliveira Mantovani <

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread Gabor Somogyi
In 3.0 the community just added it. On Sun, 5 Jul 2020, 14:28 KhajaAsmath Mohammed, wrote: > Hi, > > We are trying to move our existing code from spark dstreams to structured > streaming for one of the old application which we built few years ago. > > Structured streaming job doesn’t have

Re: Spark streaming with Confluent kafka

2020-07-03 Thread Gabor Somogyi
The error is clear: Caused by: java.lang.IllegalArgumentException: No serviceName defined in either JAAS or Kafka config On Fri, 3 Jul 2020, 15:40 dwgw, wrote: > Hi > > I am trying to stream confluent kafka topic in the spark shell. For that i > have invoked spark shell using following command.

Re: Spark interrupts S3 request backoff

2020-04-14 Thread Gabor Somogyi
+1 on the previous guess, and additionally I suggest reproducing it with vanilla Spark. Amazon Spark contains modifications which are not available in vanilla Spark, which makes problem hunting hard or impossible. In such cases Amazon can help... On Tue, Apr 14, 2020 at 11:20 AM ZHANG Wei wrote: > I

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-14 Thread Gabor Somogyi
The simplest way is to take a thread dump, which doesn't require any fancy tool (it's available on the Spark UI). Without a thread dump it's hard to say anything... On Tue, Apr 14, 2020 at 11:32 AM jane thorpe wrote: > Here a is another tool I use Logic Analyser 7:55 > https://youtu.be/LnzuMJLZRdU > >
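As an alternative to the Spark UI, a dump can also be taken from a shell on the node with the JDK's own tools (the pid below is hypothetical, found e.g. via jps):

```shell
# list JVM pids, then dump all thread stacks of the chosen one
jps -lm
jstack 12345 > driver-threads.txt
```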

Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
a console consumer on the exact same host where the executor ran... If that works you can open a Spark jira with driver and executor logs, otherwise fix the connection issue. BR, G On Tue, Apr 14, 2020 at 1:32 PM Gabor Somogyi wrote: > The symptom is simple, the broker is not responding in

Re: Spark Streaming not working

2020-04-14 Thread Gabor Somogyi
The symptom is simple, the broker is not responding in 120 seconds. That's the reason why Debabrata asked for the broker config. What I can suggest is to check the previous printout which logs the Kafka consumer settings. On Tue, Apr 14, 2020 at 11:44 AM ZHANG Wei wrote: > Here is the

Re: Problem with Kafka group.id

2020-03-24 Thread Gabor Somogyi
Hi Sjoerd, We've added the kafka.group.id config to Spark 3.0... kafka.group.id (string, default: none, streaming and batch): The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query generates a unique group id for reading data. This ensures that each

Re: How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

2020-01-10 Thread Gabor Somogyi
Hi, Please open a jira + attach the spark application with configuration and logs you have and just pull me in. I'm going to check it... All in all this happens when "sasl.jaas.config" is set on a consumer. Presume somehow Spark obtained a token and set the mentioned property. BR, G On Wed,

Re: Structured Streaming Kafka change maxOffsetsPerTrigger won't apply

2019-11-20 Thread Gabor Somogyi
Hi Roland, Not much was shared apart from it's not working. The latest partition offset is used when the size of a TopicPartition is negative. This can be found out by checking the following log entry in the logs: logDebug(s"rateLimit $tp size is $size") If you've double-checked and still think it's an

Re: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey

2019-10-21 Thread Gabor Somogyi
With the mentioned parameter the capacity can be increased, but the main question is more like why that is happening. Even on a really beefy machine having more than 64 consumers is quite extreme. Maybe horizontal scaling (more executors) would be a better option to reach maximum
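For reference, the cache size the log message refers to is governed by a Spark SQL config; a hedged sketch (config name as used by Spark 2.x's Kafka consumer cache, the value is only an example):

```shell
spark-submit \
  --conf spark.sql.kafkaConsumerCache.capacity=128 \
  ...
```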

Re: Access all of the custom streaming query listeners that were registered to spark session

2019-09-11 Thread Gabor Somogyi
Hi Natalie, In an ideal situation it would be good to keep track of the registered listeners. If this is not possible, in 3.0 a new feature was added here: https://github.com/apache/spark/pull/25518 BR, G On Tue, Sep 10, 2019 at 10:18 PM Natalie Ruiz wrote: > Hello, > > > > Is there a way to access

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-06 Thread Gabor Somogyi
-> Constants.TRUSTSTORE_PASSWORD, > > "ssl.keystore.location" -> Constants.KEYSTORE_PATH, > > "ssl.keystore.password" -> Constants.KEYSTORE_PASSWORD, > > "ssl.key.password" -> Constants.KEYSTORE_PASSWORD > > ) > > val stream = Kafk

Re: [Spark Streaming Kafka 0-10] - What was the reason for adding "spark-executor-" prefix to group id in executor configurations

2019-09-05 Thread Gabor Somogyi
Hi, Let me share the Spark 3.0 documentation part (Structured Streaming and not DStreams as you've mentioned, but still relevant): kafka.group.id (string, default: none, streaming and batch): The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Gabor Somogyi
hen. Does that > mean DSV2 is not really for production use yet? > > Any idea what the best documentation would be? I'd probably start by > looking at existing code. > > Cheers, > Lars > > On Fri, Jun 28, 2019 at 1:06 PM Gabor Somogyi > wrote: > >> Hi Lars,

Re: Spark Structured Streaming Custom Sources confusion

2019-06-28 Thread Gabor Somogyi
Hi Lars, Since Structured Streaming doesn't support receivers at all, that source/sink can't be used. Data source v2 is under development and because of that it's a moving target, so I suggest implementing it with v1 (unless special features are required from v2). Additionally, since I've just

Re: Exposing JIRA issue types at GitHub PRs

2019-06-17 Thread Gabor Somogyi
Dongjoon, I think it's useful. Thanks for adding it! On Mon, Jun 17, 2019 at 8:05 AM Dongjoon Hyun wrote: > Thank you, Hyukjin ! > > On Sun, Jun 16, 2019 at 4:12 PM Hyukjin Kwon wrote: > >> Labels look good and useful. >> >> On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun, >> wrote: >> >>> Now, you

Re: Structred Streaming Error

2019-05-22 Thread Gabor Somogyi
Have you tried what the exception suggests? If startingOffsets contains specific offsets, you must specify all TopicPartitions. BR, G On Tue, May 21, 2019 at 9:16 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am getting below errror when running sample strreaming app.
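The exception refers to the JSON form of startingOffsets; a minimal sketch of building one (the topic and partition values are made up; per the Kafka source documentation, -2 means earliest, -1 means latest, and every subscribed partition must be listed):

```python
import json

# Hypothetical assignment: every partition of every subscribed topic must
# appear; -2 = earliest, -1 = latest, other values are concrete offsets.
assignments = {"mytopic": {"0": 100, "1": -2, "2": -1}}

starting_offsets = json.dumps(assignments)
print(starting_offsets)
```

The resulting string is what gets passed as `.option("startingOffsets", starting_offsets)` on the Kafka source.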

Re: how to get spark-sql lineage

2019-05-16 Thread Gabor Somogyi
Hi, spark.lineage.enabled is Cloudera specific and doesn't work with vanilla Spark. BR, G On Thu, May 16, 2019 at 4:44 AM lk_spark wrote: > hi,all: > When I use spark , if I run some SQL to do ETL how can I get > lineage info. I found that , CDH spark have some config about lineage :

Re: Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-13 Thread Gabor Somogyi
> Where exactly would I see the start/end offset values per batch, is that in the spark logs? Yes, it's in the Spark logs. Please see this: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reading-metrics-interactively On Mon, May 13, 2019 at 10:53 AM Austin

Re: Streaming data out of spark to a Kafka topic

2019-03-27 Thread Gabor Somogyi
Hi Mich, Please take a look at how to write data into Kafka topic with DStreams: https://github.com/gaborgsomogyi/spark-dstream-secure-kafka-sink-app/blob/62d64ce368bc07b385261f85f44971b32fe41327/src/main/scala/com/cloudera/spark/examples/DirectKafkaSinkWordCount.scala#L77 (DStreams has no native

Re: Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Gabor Somogyi
Hi Matt, Maybe you could set maxFilesPerTrigger to 1. BR, G On Wed, Mar 27, 2019 at 4:45 PM Matt Kuiper wrote: > Hello, > > I am new to Spark and Structured Streaming and have the following File > Output Sink question: > > Wondering what (and how to modify) triggers a Spark Sturctured

Re: Writing the contents of spark dataframe to Kafka with Spark 2.2

2019-03-20 Thread Gabor Somogyi
fka. > In my deployment project I do use the provided scope. > > > On Tue, Mar 19, 2019 at 8:50 AM Gabor Somogyi > wrote: > >> Hi Anna, >> >> Looks like some sort of version mismatch. >> >> Presume scala version double checked... >> >> Not sure

Re: Writing the contents of spark dataframe to Kafka with Spark 2.2

2019-03-19 Thread Gabor Somogyi
> <groupId>org.apache.spark</groupId> <artifactId>spark-sql_2.11</artifactId> <version>2.2.2</version> > <groupId>org.apache.spark</groupId> <artifactId>spark-sql-kafka-0-10_2.11</artifactId> <version>2.2.2</version> > and my kafka version is kafka_2.11-1.1.0 > > On Tue, Mar 19, 2019 at 12:48 AM Gabor Somogyi > wrote: >> Hi Anna, >>

Re: Writing the contents of spark dataframe to Kafka with Spark 2.2

2019-03-19 Thread Gabor Somogyi
Hi Anna, Have you added spark-sql-kafka-0-10_2.11:2.2.0 package as well? Further info can be found here: https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html#deploying The same --packages option can be used with spark-shell as well... BR, G On Mon, Mar 18, 2019 at

Re: [Kubernets] [SPARK-27061] Need to expose 4040 port on driver service

2019-03-05 Thread Gabor Somogyi
Hi, It will be automatically assigned when one creates a PR. BR, G On Tue, Mar 5, 2019 at 4:51 PM Chandu Kavar wrote: > My Jira username is: *cckavar* > > On Tue, Mar 5, 2019 at 11:46 PM Chandu Kavar wrote: > >> Hi Team, >> I have created a JIRA ticket to expose 4040 port on driver service.

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-03-01 Thread Gabor Somogyi
g of writing out the dfKV > dataframe to disk and then use Avro apis to read the data. > > This smells like a bug somewhere. > > Cheers, > > Hien > > On Thu, Feb 28, 2019 at 4:02 AM Gabor Somogyi > wrote: > >> No, just take a look at the schema of dfStruct sinc

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-28 Thread Gabor Somogyi
ing is > needed :( > > On Wed, Feb 27, 2019 at 1:19 AM Gabor Somogyi > wrote: > >> Hi, >> >> I was dealing with avro stuff lately and most of the time it has >> something to do with the schema. >> One thing I've pinpointed quickly (where I was struggling

Re: Spark Streaming - Proeblem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Gabor Somogyi
t they are working with Kafka 1.0 > > Akshay Bhardwaj > +91-97111-33849 > > > On Wed, Feb 27, 2019 at 5:41 PM Gabor Somogyi > wrote: > >> Where exactly? In Kafka broker configuration section here it's 10080: >> https://kafka.apache.org/documentation/ >> >&

Re: Spark Streaming - Proeblem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Gabor Somogyi
> Hi Gabor, >> >> I am talking about offset.retention.minutes which is set default as 1440 >> (or 24 hours) >> >> Akshay Bhardwaj >> +91-97111-33849 >> >> >> On Wed, Feb 27, 2019 at 4:47 PM Gabor Somogyi >> wrote: >> >>> Hi Akshay,

Re: Spark Streaming - Proeblem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Gabor Somogyi
, a broker deleted offsets for a consumer group after inactivity of 24 > hours. > In such a case, the newly started spark streaming job will read offsets > from beginning for the same groupId. > > Akshay Bhardwaj > +91-97111-33849 > > > On Thu, Feb 21, 2019 at 9:08 PM Gabo

Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-27 Thread Gabor Somogyi
Hi, I was dealing with avro stuff lately and most of the time it has something to do with the schema. One thing I've pinpointed quickly (where I was struggling also) is that the name field should be nullable, but the result is not yet correct so further digging is needed... scala> val expectedSchema =

Re: Spark Streaming - Proeblem to manage offset Kafka and starts from the beginning.

2019-02-21 Thread Gabor Somogyi
From the info you've provided there is not much to say. Maybe you could collect a sample app, logs etc, open a jira and we can take a deeper look at it... BR, G On Thu, Feb 21, 2019 at 4:14 PM Guillermo Ortiz wrote: > I' working with Spark Streaming 2.0.2 and Kafka 1.0.0 using Direct Stream > as

Re: Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Not authorized to access group: spark-kafka-source-060f3ceb-09f4-4e28-8210-3ef8a845fc92--2038748645-driver-2

2019-02-13 Thread Gabor Somogyi
Hi Thomas, The issue occurs when the user does not have the READ permission on the consumer groups. In DStreams group ID is configured in application, for example:

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-11 Thread Gabor Somogyi
What blocks you from putting if conditions inside the mentioned map function? On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak wrote: > Yeah, but I don't need to crash entire app, I want to fail several tasks > or executors and then wait for completion. > > вс, 10 февр. 2019 г. в 21:49, G

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Gabor Somogyi
e issue and kill tasks at a > certain stage. Killing executor is also an option for me. > I'm curious how do core spark contributors test spark fault tolerance? > > > вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi : > >> Hi Serega, >> >> If I understand your problem c

Re: Spark on YARN, HowTo kill executor or individual task?

2019-02-10 Thread Gabor Somogyi
Hi Serega, If I understand your problem correctly you would like to kill one executor only and leave the rest of the app untouched. If that's true, yarn -kill is not what you want because it stops the whole application. I've done a similar thing when testing Spark's HA features. - jps

Re: java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-07 Thread Gabor Somogyi
Hi Hande, "Unsupported class file major version 55" means Java incompatibility. This error means you're trying to load a Java "class" file that was compiled with a newer version of Java than you have installed. For example, your .class file could have been compiled for JDK 8, and you're trying to
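The class-file major version mapping can be checked directly from the file's first bytes; a small sketch (the magic/minor/major layout comes from the JVM class-file format, and the mapping table covers only a few releases):

```python
import struct

# Class-file major version -> minimum Java release (JVM spec, partial table).
MAJOR_TO_JAVA = {52: 8, 53: 9, 54: 10, 55: 11, 56: 12, 57: 13}

def class_file_major(data: bytes) -> int:
    """Read the major version from the first 8 bytes of a .class file:
    4-byte 0xCAFEBABE magic, 2-byte minor, 2-byte major (big-endian)."""
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    return major

# A synthetic header for a class compiled with major version 55 (JDK 11):
header = struct.pack(">IHH", 0xCAFEBABE, 0, 55)
print(MAJOR_TO_JAVA[class_file_major(header)])  # 11
```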

Re: Structured streaming from Kafka by timestamp

2019-01-24 Thread Gabor Somogyi
Hi Tomas, As a general note, I don't fully understand your use-case. You've mentioned structured streaming but your query is more like a one-time SQL statement. Kafka doesn't support predicates in the way it's integrated with Spark. What can be done from the Spark perspective is to look for an offset for a
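The "look for an offset for a timestamp" step can be illustrated in plain Python; a hypothetical per-partition index of (timestamp, offset) pairs stands in for what Kafka's offsetsForTimes API resolves broker-side (all data below is made up):

```python
import bisect
from typing import Optional

# Hypothetical (timestamp_ms, offset) index for one partition, sorted by time.
index = [(1000, 0), (2000, 10), (3000, 25), (4000, 40)]

def offset_for_timestamp(ts: int) -> Optional[int]:
    """Return the offset of the first message at or after `ts`,
    or None when no such message exists."""
    timestamps = [t for t, _ in index]
    i = bisect.bisect_left(timestamps, ts)
    return index[i][1] if i < len(index) else None

print(offset_for_timestamp(2500))  # 25
print(offset_for_timestamp(9999))  # None
```

Once such offsets are known per partition, they can be fed to the Kafka source as concrete startingOffsets.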

Re: Reading compacted Kafka topic is slow

2019-01-24 Thread Gabor Somogyi
Hi Tomas, I presume the 60 sec window means the trigger interval. Maybe a quick win could be to try structured streaming, because there the trigger interval is optional. If it is not specified, the system will check for availability of new data as soon as the previous processing has completed. BR, G

Re: "failed to get records for spark-executor after polling for ***" error

2018-12-03 Thread Gabor Somogyi
Hi, There are not much details in the mail so hard to tell exactly. Maybe you're facing https://issues.apache.org/jira/browse/SPARK-19275 BR, G On Mon, Dec 3, 2018 at 10:32 AM JF Chen wrote: > Some kafka consumer tasks throw "failed to get records for spark-executor > after polling for ***"

Re: PySpark Direct Streaming : SASL Security Compatibility Issue

2018-11-28 Thread Gabor Somogyi
Hi Muthuraman, Prior to 0.9, Kafka had no built-in security features, but DStreams supports only Kafka 0.8. I suggest using Structured Streaming, where 0.10+ (2.0.0 in Spark 2.4) support is available. BR, G On Wed, Nov 28, 2018 at 4:15 AM Ramaswamy, Muthuraman < muthuraman.ramasw...@viasat.com>