StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-25 Thread karan alang
Hello All,
I'm running a Structured Streaming program on GCP Dataproc, which reads data
from Kafka, does some processing, and puts the processed data back into Kafka.
The program was running fine until I stopped it (to make minor changes) and
restarted it.

On restart it fails with: pyspark.sql.utils.StreamingQueryException:
batch 44 doesn't exist

Here is the error:

22/02/25 22:14:08 ERROR
org.apache.spark.sql.execution.streaming.MicroBatchExecution: Query
[id = 0b73937f-1140-40fe-b201-cf4b7443b599, runId =
43c9242d-993a-4b9a-be2b-04d8c9e05b06] terminated with error
java.lang.IllegalStateException: batch 44 doesn't exist
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$populateStartOffsets$1(MicroBatchExecution.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.populateStartOffsets(MicroBatchExecution.scala:286)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:197)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:334)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:317)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
Traceback (most recent call last):
  File "/tmp/0149aedd804c42f288718e013fb16f9c/StructuredStreaming_GCP_Versa_Sase_gcloud.py", line 609, in 
    query.awaitTermination()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 101, in awaitTermination
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist


Question: what is the cause of this error, and how can I debug or fix it? I
also notice that the checkpoint location occasionally gets corrupted when I
do multiple restarts; after the checkpoint is corrupted, the query does not
return any records.

For the above issue (as well as when the checkpoint was corrupted), clearing
the checkpoint location and restarting the program made it run through fine.

Please note: in readStream, I've set failOnDataLoss=false.
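For context, here is a minimal sketch (not the author's actual job; broker, topic, and checkpoint names are placeholders) of the two settings at play. Spark's Kafka source takes every option as a string, so the boolean is the string "false", not a Python False:

```python
# Sketch only: placeholder broker/topic/checkpoint names, not the author's job.
read_options = {
    "kafka.bootstrap.servers": "broker:9092",   # placeholder
    "subscribe": "input-topic",                 # placeholder
    # Tolerate offsets that no longer exist (expired retention, recreated
    # topic) instead of aborting the query:
    "failOnDataLoss": "false",
}

write_options = {
    "kafka.bootstrap.servers": "broker:9092",   # placeholder
    "topic": "output-topic",                    # placeholder
    # Progress (offsets and commits) is replayed from here on restart; if its
    # contents no longer line up with the current run of the query, errors
    # like "batch 44 doesn't exist" can surface, and pointing at a new, empty
    # directory resets the query's progress (at the cost of reprocessing).
    "checkpointLocation": "gs://my-bucket/checkpoints/my-query",  # placeholder
}

# The dicts map one-to-one onto .option(k, v) calls on a
# DataStreamReader/DataStreamWriter.
def as_option_calls(options):
    return [f'.option("{k}", "{v}")' for k, v in options.items()]

print(as_option_calls(read_options)[-1])  # → .option("failOnDataLoss", "false")
```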

Additional details are on Stack Overflow:

https://stackoverflow.com/questions/71272328/structuredstreaming-error-pyspark-sql-utils-streamingqueryexception-batch-44

Any input on this?

Thanks in advance!


RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
Ahh, ok.  So, Kafka 3.1 is supported for Spark 3.2.1.  Thank you very much.

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, February 25, 2022 2:50 PM
To: Michael Williams (SSI) 
Cc: user@spark.apache.org
Subject: Re: Spark Kafka Integration

These are the old and new ones.

For Spark 3.1.1 I needed these jar files to make it work:

kafka-clients-2.7.0.jar  -->  kafka-clients-3.1.0.jar
commons-pool2-2.9.0.jar  -->  commons-pool2-2.11.1.jar
spark-streaming_2.12-3.1.1.jar  -->  spark-streaming_2.12-3.2.1.jar
spark-sql-kafka-0-10_2.12-3.1.0.jar  -->  spark-sql-kafka-0-10_2.12-3.2.1.jar
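One hedged way to bake such jars into a Docker image (a sketch, not from the thread: the base image is a placeholder, and the URLs follow the standard Maven Central repository layout for the versions listed above) is to fetch them at build time so versions are pinned and reproducible:

```dockerfile
# Sketch: fetch pinned connector jars into Spark's jars directory at image
# build time. Base image is a placeholder; /opt/spark/jars is an assumption
# that depends on where your image installs Spark.
FROM your-spark-base:3.2.1
ARG MAVEN=https://repo1.maven.org/maven2
ADD ${MAVEN}/org/apache/spark/spark-sql-kafka-0-10_2.12/3.2.1/spark-sql-kafka-0-10_2.12-3.2.1.jar /opt/spark/jars/
ADD ${MAVEN}/org/apache/kafka/kafka-clients/3.1.0/kafka-clients-3.1.0.jar /opt/spark/jars/
ADD ${MAVEN}/org/apache/commons/commons-pool2/2.11.1/commons-pool2-2.11.1.jar /opt/spark/jars/
ADD ${MAVEN}/org/apache/spark/spark-token-provider-kafka-0-10_2.12/3.2.1/spark-token-provider-kafka-0-10_2.12-3.2.1.jar /opt/spark/jars/
```

Alternatively, `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1` resolves the same set transitively from Maven Central at submit time.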



HTH


 
view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 25 Feb 2022 at 20:40, Michael Williams (SSI) <michael.willi...@ssigroup.com> wrote:
I believe it is 3.1, but if there is a different version that “works better” 
with spark, any advice would be appreciated.  Our entire team is totally new to 
spark and kafka (this is a poc trial).

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, February 25, 2022 2:30 PM
To: Michael Williams (SSI) <michael.willi...@ssigroup.com>
Cc: user@spark.apache.org
Subject: Re: Spark Kafka Integration

and what version of kafka do you have 2.7?

for spark 3.1.1 I needed these jar files to make it work

kafka-clients-2.7.0.jar
commons-pool2-2.9.0.jar
spark-streaming_2.12-3.1.1.jar
spark-sql-kafka-0-10_2.12-3.1.0.jar



HTH


 







On Fri, 25 Feb 2022 at 20:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
What is the use case? Is this for spark structured streaming?

HTH




 




RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
Thank you, that is good to know.

From: Sean Owen [mailto:sro...@gmail.com]
Sent: Friday, February 25, 2022 2:46 PM
To: Michael Williams (SSI) 
Cc: Mich Talebzadeh ; user@spark.apache.org
Subject: Re: Spark Kafka Integration

Spark 3.2.1 is compiled vs Kafka 2.8.0; the forthcoming Spark 3.3 against Kafka 
3.1.0.
It may well be mutually compatible though.

On Fri, Feb 25, 2022 at 2:40 PM Michael Williams (SSI) <michael.willi...@ssigroup.com> wrote:
I believe it is 3.1, but if there is a different version that “works better” 
with spark, any advice would be appreciated.  Our entire team is totally new to 
spark and kafka (this is a poc trial).

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, February 25, 2022 2:30 PM
To: Michael Williams (SSI) <michael.willi...@ssigroup.com>
Cc: user@spark.apache.org
Subject: Re: Spark Kafka Integration

and what version of kafka do you have 2.7?

for spark 3.1.1 I needed these jar files to make it work

kafka-clients-2.7.0.jar
commons-pool2-2.9.0.jar
spark-streaming_2.12-3.1.1.jar
spark-sql-kafka-0-10_2.12-3.1.0.jar



HTH


 







On Fri, 25 Feb 2022 at 20:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
What is the use case? Is this for spark structured streaming?

HTH




 







On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <michael.willi...@ssigroup.com> wrote:
After reviewing Spark's Kafka Integration guide, it indicates that
spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies are needed for Spark
3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the cleanest,
most repeatable (reliable) way to acquire these jars for inclusion in a Spark
Docker image without introducing version compatibility issues?

Thank you,
Mike


This electronic message may contain information that is Proprietary, 
Confidential, or legally privileged or protected. It is intended only for the 
use of the individual(s) and entity named in the message. If you are not an 
intended recipient of this message, please notify the sender immediately and 
delete the material from your computer. Do not deliver, distribute or copy this 
message and do not disclose its contents or take any action in reliance on the 
information it contains. Thank You.



Re: Spark Kafka Integration

2022-02-25 Thread Mich Talebzadeh
These are the old and new ones.

For Spark 3.1.1 I needed these jar files to make it work:

kafka-clients-2.7.0.jar  --> kafka-clients-3.1.0.jar

commons-pool2-2.9.0.jar  --> commons-pool2-2.11.1.jar

spark-streaming_2.12-3.1.1.jar  --> spark-streaming_2.12-3.2.1.jar

spark-sql-kafka-0-10_2.12-3.1.0.jar -> spark-sql-kafka-0-10_2.12-3.2.1.jar



HTH









On Fri, 25 Feb 2022 at 20:40, Michael Williams (SSI) <
michael.willi...@ssigroup.com> wrote:

> I believe it is 3.1, but if there is a different version that “works
> better” with spark, any advice would be appreciated.  Our entire team is
> totally new to spark and kafka (this is a poc trial).
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Friday, February 25, 2022 2:30 PM
> *To:* Michael Williams (SSI) 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Kafka Integration
>
>
>
> and what version of kafka do you have 2.7?
>
>
>
> for spark 3.1.1 I needed these jar files to make it work
>
>
>
> kafka-clients-2.7.0.jar
> commons-pool2-2.9.0.jar
> spark-streaming_2.12-3.1.1.jar
> spark-sql-kafka-0-10_2.12-3.1.0.jar
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
> On Fri, 25 Feb 2022 at 20:15, Mich Talebzadeh 
> wrote:
>
> What is the use case? Is this for spark structured streaming?
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <
> michael.willi...@ssigroup.com> wrote:
>
> After reviewing Spark's Kafka Integration guide, it indicates that
> spark-sql-kafka-0-10_2.12_3.2.1.jar and its dependencies are needed for
> Spark 3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the
> cleanest, most repeatable (reliable) way to acquire these jars for
> including in a Spark Docker image without introducing version compatibility
> issues?
>
>
>
> Thank you,
>
> Mike
>
>
>

Re: Spark Kafka Integration

2022-02-25 Thread Sean Owen
Spark 3.2.1 is compiled vs Kafka 2.8.0; the forthcoming Spark 3.3 against
Kafka 3.1.0.
It may well be mutually compatible though.
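As an aside, a hedged alternative to hand-matching jar versions (not mentioned in this reply; the app name is a placeholder) is to let spark-submit resolve the connector and its transitive dependencies from Maven Central:

```shell
# Resolves spark-sql-kafka-0-10 plus its transitive dependencies
# (kafka-clients, commons-pool2, spark-token-provider-kafka-0-10, ...)
# at submit time. The version must match the Spark build (3.2.1 here)
# and the Scala suffix (2.12).
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1 \
  your_streaming_app.py
```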

On Fri, Feb 25, 2022 at 2:40 PM Michael Williams (SSI) <
michael.willi...@ssigroup.com> wrote:

> I believe it is 3.1, but if there is a different version that “works
> better” with spark, any advice would be appreciated.  Our entire team is
> totally new to spark and kafka (this is a poc trial).
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Friday, February 25, 2022 2:30 PM
> *To:* Michael Williams (SSI) 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Kafka Integration
>
>
>
> and what version of kafka do you have 2.7?
>
>
>
> for spark 3.1.1 I needed these jar files to make it work
>
>
>
> kafka-clients-2.7.0.jar
> commons-pool2-2.9.0.jar
> spark-streaming_2.12-3.1.1.jar
> spark-sql-kafka-0-10_2.12-3.1.0.jar
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
> On Fri, 25 Feb 2022 at 20:15, Mich Talebzadeh 
> wrote:
>
> What is the use case? Is this for spark structured streaming?
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <
> michael.willi...@ssigroup.com> wrote:
>
> After reviewing Spark's Kafka Integration guide, it indicates that
> spark-sql-kafka-0-10_2.12_3.2.1.jar and its dependencies are needed for
> Spark 3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the
> cleanest, most repeatable (reliable) way to acquire these jars for
> including in a Spark Docker image without introducing version compatibility
> issues?
>
>
>
> Thank you,
>
> Mike
>
>
>
>


Re: Help With unstructured text file with spark scala

2022-02-25 Thread Danilo Sousa
Rafael Mendes,

Are you from ?

Thanks.
> On 21 Feb 2022, at 15:33, Danilo Sousa  wrote:
> 
> Yes, this a only single file.
> 
> Thanks Rafael Mendes.
> 
>> On 13 Feb 2022, at 07:13, Rafael Mendes wrote:
>> 
>> Hi, Danilo.
>> Do you have a single large file, only?
>> If so, I guess you can use tools like sed/awk to split it into more files 
>> based on layout, so you can read these files into Spark.
>> 
>> 
>> On Wed, 9 Feb 2022 at 09:30, Bitfox wrote:
>> Hi
>> 
>> I am not sure about the whole situation, but if you want a Scala solution,
>> you can use regexes to match and capture the fields. Here is one you can
>> adapt on your end.
>> 
>> import scala.io.Source
>> import scala.collection.mutable.ArrayBuffer
>> 
>> // lines with two or more '#' separators land here
>> val list1 = ArrayBuffer[(String, String, String)]()
>> // lines with exactly one '#' separator land here
>> val list2 = ArrayBuffer[(String, String)]()
>> 
>> val patt1 = """^(.*)#(.*)#([^#]*)$""".r
>> val patt2 = """^(.*)#([^#]*)$""".r
>> 
>> val file = "1.txt"
>> val source = Source.fromFile(file)
>> 
>> for (x <- source.getLines()) {
>>   x match {
>>     case patt1(k, v, z) => list1 += ((k, v, z))
>>     case patt2(k, v)    => list2 += ((k, v))
>>     case _              => println("no match")
>>   }
>> }
>> source.close()
>> 
>> 
>> Now the list1 and list2 have the elements you wanted, you can convert them 
>> to a dataframe easily.
>> 
>> Thanks.
>> 
>> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa wrote:
>> Hello
>> 
>> 
>> Yes, I can open this block as CSV with '#' as the delimiter, but the other
>> block is not in CSV format.
>> 
>> That one is more like key/value pairs.
>> 
>> We have two different layouts in the same file. That is the "problem".
>> 
>> Thanks for your time.
>> 
>> 
>> 
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>> 
>>> Contrato#123456 - Test
>>> Empresa#Test
>> 
>>> On 9 Feb 2022, at 00:58, Bitfox wrote:
>>> Hello
>>> 
>>> You can treat it as a CSV file and load it from Spark:
>>> 
>>> >>> df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", "#").load(csv_file)
>>> >>> df.show()
>>> +--------------------+-------------------+-----------------+
>>> |               Plano|Código Beneficiário|Nome Beneficiário|
>>> +--------------------+-------------------+-----------------+
>>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>>> +--------------------+-------------------+-----------------+
>>> 
>>> 
>>> cat csv_file:
>>> 
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>> 
>>> 
>>> Regards
>>> 
>>> 
>>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa wrote:
>>> Hi
>>> I have to transform unstructured text to dataframe.
>>> Could anyone please help with Scala code ?
>>> 
>>> Dataframe need as:
>>> 
>>> operadora filial unidade contrato empresa plano codigo_beneficiario 
>>> nome_beneficiario
>>> 
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>> 
>>> Contrato#123456 - Test
>>> Empresa#Test
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>> 
>>> Contrato#898011000 - FUNDACAO GERDAU
>>> Empresa#FUNDACAO GERDAU
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>> 
>> 
> 
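The two-layout split discussed in this thread can be sketched in plain Python (a hypothetical helper, not code from the thread: header keywords and field counts are assumptions based on the sample data above). Lines whose first '#'-field is a known header keyword become metadata; three-field '#' rows become records:

```python
# Hypothetical sketch of separating the two layouts in one pass:
# key/value header lines vs '#'-delimited data records.
HEADER_KEYS = {"Operadora", "Filial", "Contrato", "Empresa", "Carteira em"}

def split_layouts(lines):
    metadata, records = {}, []
    for line in lines:
        if not line.strip():
            continue
        parts = line.split("#")
        if parts[0] in HEADER_KEYS:
            # key/value (possibly key/value/key/value) metadata line
            for key, value in zip(parts[::2], parts[1::2]):
                metadata[key] = value
        elif parts[0] == "Plano":
            continue  # column-header row for the record block
        elif len(parts) == 3:
            records.append(tuple(parts))  # (plano, codigo, nome)
    return metadata, records

sample = [
    "Operadora#AMIL",
    "Filial#SÃO PAULO#Unidade#Guarulhos",
    "Contrato#123456 - Test",
    "Plano#Código Beneficiário#Nome Beneficiário",
    "58693 - NACIONAL R COPART PJCE#073930312#Joao Silva",
]
meta, recs = split_layouts(sample)
print(meta["Operadora"])  # → AMIL
print(recs[0][2])         # → Joao Silva
```

From here, `records` can be turned into a DataFrame and the metadata joined on as constant columns.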



RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
Thank you

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, February 25, 2022 2:35 PM
To: Michael Williams (SSI) 
Cc: Sean Owen ; user@spark.apache.org
Subject: Re: Spark Kafka Integration

Please see my earlier reply: the 3.1.1 jars were tested and worked in a
Google Dataproc environment.

Also this article of mine may be useful

Processing Change Data Capture with Spark Structured Streaming

HTH




 







On Fri, 25 Feb 2022 at 20:30, Michael Williams (SSI) <michael.willi...@ssigroup.com> wrote:
The use case is for Spark Structured Streaming (a Spark app will be launched
by a worker service that monitors the Kafka topic for new messages; once the
messages are consumed, the Spark app will terminate). One possible hitch is
that the Spark environment includes the MS .NET for Spark wrapper, which means
each Spark app will consume from one Kafka topic and will be written in C#.
If possible, I'd really like to be able to manually download the necessary
jars and do the Kafka client installation as part of the Docker image build,
so that the dependencies already exist on disk. If that makes any sense.

Thank you

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, February 25, 2022 2:16 PM
To: Michael Williams (SSI) <michael.willi...@ssigroup.com>
Cc: user@spark.apache.org
Subject: Re: Spark Kafka Integration

What is the use case? Is this for spark structured streaming?

HTH




 







On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <michael.willi...@ssigroup.com> wrote:
After reviewing Spark's Kafka Integration guide, it indicates that 
spark-sql-kafka-0-10_2.12_3.2.1.jar and its dependencies are needed for Spark 
3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the cleanest, 
most repeatable (reliable) way to acquire these jars for including in a Spark 
Docker image without introducing version compatibility issues?

Thank you,
Mike



RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
I believe it is 3.1, but if there is a different version that “works better” 
with spark, any advice would be appreciated.  Our entire team is totally new to 
spark and kafka (this is a poc trial).

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, February 25, 2022 2:30 PM
To: Michael Williams (SSI) 
Cc: user@spark.apache.org
Subject: Re: Spark Kafka Integration

and what version of kafka do you have 2.7?

for spark 3.1.1 I needed these jar files to make it work

kafka-clients-2.7.0.jar
commons-pool2-2.9.0.jar
spark-streaming_2.12-3.1.1.jar
spark-sql-kafka-0-10_2.12-3.1.0.jar



HTH


 







On Fri, 25 Feb 2022 at 20:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
What is the use case? Is this for spark structured streaming?

HTH




 







On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <michael.willi...@ssigroup.com> wrote:
After reviewing Spark's Kafka Integration guide, it indicates that 
spark-sql-kafka-0-10_2.12_3.2.1.jar and its dependencies are needed for Spark 
3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the cleanest, 
most repeatable (reliable) way to acquire these jars for including in a Spark 
Docker image without introducing version compatibility issues?

Thank you,
Mike




Re: Spark Kafka Integration

2022-02-25 Thread Mich Talebzadeh
Please see my earlier reply: the 3.1.1 jars were tested and worked in a
Google Dataproc environment.

Also this article of mine may be useful

Processing Change Data Capture with Spark Structured Streaming


HTH










On Fri, 25 Feb 2022 at 20:30, Michael Williams (SSI) <
michael.willi...@ssigroup.com> wrote:

> The use case is for spark structured streaming (a spark app will be
> launched by a worker service that monitors the kafka topic for new
> messages, once the messages are consumed, the spark app will terminate),
> but if there is a hitch here, it is that the Spark environment includes the
> MS dotnet for Spark wrapper, which means the each spark app will consume
> from one kafka topic and will be written in C#.  If possible, I’d really
> like to be able to manually download the necessary jars and do the kafka
> client installation as part of the docker image build, so that the
> dependencies already exist on disk.  If that makes any sense.
>
>
>
> Thank you
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> *Sent:* Friday, February 25, 2022 2:16 PM
> *To:* Michael Williams (SSI) 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark Kafka Integration
>
>
>
> What is the use case? Is this for spark structured streaming?
>
>
>
> HTH
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <
> michael.willi...@ssigroup.com> wrote:
>
> After reviewing Spark's Kafka Integration guide, it indicates that
> spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies are needed for
> Spark 3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the
> cleanest, most repeatable (reliable) way to acquire these jars for
> including in a Spark Docker image without introducing version compatibility
> issues?
>
>
>
> Thank you,
>
> Mike
>
>
>
> This electronic message may contain information that is Proprietary,
> Confidential, or legally privileged or protected. It is intended only for
> the use of the individual(s) and entity named in the message. If you are
> not an intended recipient of this message, please notify the sender
> immediately and delete the material from your computer. Do not deliver,
> distribute or copy this message and do not disclose its contents or take
> any action in reliance on the information it contains. Thank You.
>
>
>


Re: Spark Kafka Integration

2022-02-25 Thread Mich Talebzadeh
And what version of Kafka do you have, 2.7?

For Spark 3.1.1 I needed these jar files to make it work:

kafka-clients-2.7.0.jar
commons-pool2-2.9.0.jar
spark-streaming_2.12-3.1.1.jar
spark-sql-kafka-0-10_2.12-3.1.0.jar
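
For reproducible Docker builds, jars like these can be fetched directly from
Maven Central, whose URL layout is groupId-with-slashes/artifactId/version/
artifactId-version.jar. A minimal sketch of building those URLs (the
coordinates below mirror the list above; verify the versions against your own
Spark and Kafka setup before baking them into an image):

```python
# Build Maven Central download URLs for a list of jar coordinates.
BASE = "https://repo1.maven.org/maven2"

def maven_url(group, artifact, version):
    """Return the Maven Central URL for a plain jar artifact."""
    return "{}/{}/{}/{}/{}-{}.jar".format(
        BASE, group.replace(".", "/"), artifact, version, artifact, version)

# Coordinates matching the jar list above (Spark 3.1.x / Scala 2.12 / Kafka 2.7).
coords = [
    ("org.apache.kafka", "kafka-clients", "2.7.0"),
    ("org.apache.commons", "commons-pool2", "2.9.0"),
    ("org.apache.spark", "spark-streaming_2.12", "3.1.1"),
    ("org.apache.spark", "spark-sql-kafka-0-10_2.12", "3.1.0"),
]

for g, a, v in coords:
    # In a Docker build step one could download each URL into $SPARK_HOME/jars,
    # e.g. with urllib.request.urlretrieve(url, "/opt/spark/jars/...").
    print(maven_url(g, a, v))
```

Downloading at image-build time keeps the dependencies on disk, as requested
earlier in the thread, instead of resolving them at job startup.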


HTH






On Fri, 25 Feb 2022 at 20:15, Mich Talebzadeh 
wrote:

> What is the use case? Is this for spark structured streaming?
>
> HTH
>
>
>
>
>
>
>
> On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <
> michael.willi...@ssigroup.com> wrote:
>
>> After reviewing Spark's Kafka Integration guide, it indicates that
>> spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies are needed for
>> Spark 3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the
>> cleanest, most repeatable (reliable) way to acquire these jars for
>> including in a Spark Docker image without introducing version compatibility
>> issues?
>>
>>
>>
>> Thank you,
>>
>> Mike
>>
>>
>>
>


RE: Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
The use case is for spark structured streaming (a spark app will be launched by 
a worker service that monitors the kafka topic for new messages, once the 
messages are consumed, the spark app will terminate), but if there is a hitch 
here, it is that the Spark environment includes the MS dotnet for Spark 
wrapper, which means that each spark app will consume from one kafka topic and 
will be written in C#.  If possible, I’d really like to be able to manually 
download the necessary jars and do the kafka client installation as part of the 
docker image build, so that the dependencies already exist on disk.  If that 
makes any sense.

Thank you

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, February 25, 2022 2:16 PM
To: Michael Williams (SSI) 
Cc: user@spark.apache.org
Subject: Re: Spark Kafka Integration

What is the use case? Is this for spark structured streaming?

HTH





On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) 
mailto:michael.willi...@ssigroup.com>> wrote:
After reviewing Spark's Kafka Integration guide, it indicates that 
spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies are needed for Spark 
3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the cleanest, 
most repeatable (reliable) way to acquire these jars for including in a Spark 
Docker image without introducing version compatibility issues?

Thank you,
Mike




Re: Spark Kafka Integration

2022-02-25 Thread Sean Owen
That .jar is available on Maven; typically you depend on it in your app and
compile an uber JAR, which will contain it and all its dependencies. You
could, I suppose, build an uber JAR from that dependency itself with tooling
if needed.
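
One way to follow this advice without hand-collecting jars is Spark's
`--packages` flag, which resolves the named artifact and its transitive
dependencies from Maven at launch. A sketch that merely composes such a
command (the coordinates are assumed for Spark 3.2.1 / Scala 2.12, and the
app name is illustrative; this is not a definitive build recipe):

```python
# Compose a spark-submit invocation that lets Spark resolve the Kafka
# connector and its transitive dependencies from Maven Central at startup.
def submit_command(app, packages):
    """Build a spark-submit command line with --packages coordinates."""
    return "spark-submit --packages {} {}".format(",".join(packages), app)

# groupId:artifactId:version for the Kafka source (assumed coordinates).
pkgs = ["org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1"]
print(submit_command("my_streaming_app.py", pkgs))
```

The trade-off versus the uber-JAR route: `--packages` downloads at job start
(cacheable in the Docker image's ivy cache), while an uber JAR fixes all
dependency versions at build time.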

On Fri, Feb 25, 2022 at 1:37 PM Michael Williams (SSI) <
michael.willi...@ssigroup.com> wrote:

> After reviewing Spark's Kafka Integration guide, it indicates that
> spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies are needed for
> Spark 3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the
> cleanest, most repeatable (reliable) way to acquire these jars for
> including in a Spark Docker image without introducing version compatibility
> issues?
>
>
>
> Thank you,
>
> Mike
>
>
>


Re: Spark Kafka Integration

2022-02-25 Thread Mich Talebzadeh
What is the use case? Is this for spark structured streaming?

HTH







On Fri, 25 Feb 2022 at 19:38, Michael Williams (SSI) <
michael.willi...@ssigroup.com> wrote:

> After reviewing Spark's Kafka Integration guide, it indicates that
> spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies are needed for
> Spark 3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the
> cleanest, most repeatable (reliable) way to acquire these jars for
> including in a Spark Docker image without introducing version compatibility
> issues?
>
>
>
> Thank you,
>
> Mike
>
>
>


Spark Kafka Integration

2022-02-25 Thread Michael Williams (SSI)
After reviewing Spark's Kafka Integration guide, it indicates that 
spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies are needed for Spark 
3.2.1 (+ Scala 2.12) to work with Kafka.  Can anybody clarify the cleanest, 
most repeatable (reliable) way to acquire these jars for including in a Spark 
Docker image without introducing version compatibility issues?

Thank you,
Mike





Re: Non-Partition based Workload Distribution

2022-02-25 Thread Gourav Sengupta
Hi,

not quite sure here, but can you please share your code?

Regards,
Gourav Sengupta

On Thu, Feb 24, 2022 at 8:25 PM Artemis User  wrote:

> We got a Spark program that iterates through a while loop on the same
> input DataFrame and produces different results per iteration. I see
> through Spark UI that the workload is concentrated on a single core of
> the same worker.  Is there any way to distribute the workload to
> different cores/workers, e.g. per iteration, since each iteration is not
> dependent from each other?
>
> Certainly this type of problem could be easily implemented using
> threads, e.g. spawn a child thread for each iteration, and wait at the
> end of the loop.  But threads apparently don't go beyond the worker
> boundary.  We also thought about using MapReduce, but it won't be
> straightforward since mapping only deals with rows, not at the dataframe
> level.  Any thoughts/suggestions are highly appreciated.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
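
One common workaround is to run the independent iterations as concurrent
Spark jobs from driver-side threads: each thread triggers its own action, and
the cluster scheduler (ideally with spark.scheduler.mode=FAIR) interleaves
their tasks across the workers. Driver threads do not need to cross the
worker boundary, because each action they launch is already distributed. A
sketch of the driver-side pattern, with the per-iteration Spark action
stubbed out as a plain function since the actual job is not shown in the
thread:

```python
from concurrent.futures import ThreadPoolExecutor

def run_iteration(i):
    # Stand-in for an independent Spark action on the shared input
    # DataFrame, e.g. df.withColumn(...).filter(...).count() for iteration i.
    return i * i

# Spawn one driver thread per independent iteration and wait for all results;
# pool.map preserves the iteration order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_iteration, range(8)))

print(results)
```

Each call inside a thread blocks only that thread; the heavy lifting of each
stubbed action would still run as distributed tasks on the executors.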


Re: Structured Streaming + UDF - logic based on checking if a column is present in the Dataframe

2022-02-25 Thread Gourav Sengupta
Hi,

can you please let us know the following:
1. the spark version
2. a few samples of input data
3. a few samples of what is the expected output that you want


Regards,
Gourav Sengupta

On Wed, Feb 23, 2022 at 8:43 PM karan alang  wrote:

> Hello All,
>
> I'm using StructuredStreaming, and am trying to use UDF to parse each row.
> Here is the requirement:
>
>- we can get alerts of a particular KPI with type 'major' OR 'critical'
>- for a KPI, if we get alerts of type 'major' eg _major, and we have a
>critical alert as well _critical, we need to ignore the _major alert, and
>consider _critical alert only
>
> There are ~25 alerts which are stored in the array (AlarmKeys.alarm_all)
>
> UDF Code (draft):
>
> @udf(returnType=StringType())
> def convertStructToStr(APP_CAUSE, tenantName, window,,__major,__major,
>                        __critical, five__major, __critical):
>
> res = "{window: "+ str(window) + "type: 10m, applianceName: "+ 
> str(APP_CAUSE)+","
> first = True
> for curr_alarm in AlarmKeys.alarms_all:
> alsplit = curr_alarm.split('__')
> if len(alsplit) == 2:
> # Only account for critical row if both major & critical are there
> if alsplit[1] == 'major':
> critical_alarm = alsplit[0] + "__critical"
> if int(col(critical_alarm)) > 0:
> continue
> if int(col(curr_alarm)) > 0:
> if first:
> mystring = "{} {}({})".format(mystring, 
> AlarmKeys.alarm_exp[curr_alarm], str(col(curr_alarm)))
> first = False
> else:
> mystring = "{}, {}({})".format(mystring, 
> AlarmKeys.alarm_exp[curr_alarm], str(col(curr_alarm)))
> res+="insight: "+mystring +"}"
>
> # structured streaming using udf, this is printing data on console
> # eventually, i'll put data into Kafka instead
> df.select(convertStructToStr(*df.columns)) \
> .write \
> .format("console") \
> .option("numRows",100)\
> .option("checkpointLocation", 
> "/Users/karanalang/PycharmProjects/Kafka/checkpoint") \
> .option("outputMode", "complete")\
> .save("output")
>
> Additional Details in stackoverflow :
>
> https://stackoverflow.com/questions/71243726/structured-streaming-udf-logic-based-on-checking-if-a-column-is-present-in-t
>
>
> Question is -
>
> Can this be done using a UDF? Since I'm passing column values to the UDF, I
> have no way to check if a particular KPI of type 'critical' is available in
> the dataframe?
>
> Any suggestions on the best way to solve this problem ?
> tia!
>
>
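
A note on the draft above: inside a Python UDF you receive plain values, not
Columns, so `col()` cannot be used there; and column presence is best checked
once on the driver via `df.columns` before building the expression. The
suppression rule itself (drop an `X__major` alert whenever the matching
`X__critical` alert fired) can be plain Python over a dict of column name to
value, which a UDF could then apply row by row. A sketch with illustrative
alert names, not the actual schema from the question:

```python
def active_alerts(row):
    """Return alert names that fired, ignoring X__major when X__critical > 0."""
    active = []
    for name, value in row.items():
        if not value or value <= 0:
            continue  # alert did not fire
        kpi, _, severity = name.partition("__")
        # Suppress the major alert if the same KPI also has a critical alert.
        if severity == "major" and row.get(kpi + "__critical", 0) > 0:
            continue
        active.append(name)
    return active

# Illustrative row: cpu fired both levels, mem fired only major.
row = {"cpu__major": 2, "cpu__critical": 1, "mem__major": 3, "mem__critical": 0}
print(sorted(active_alerts(row)))  # ['cpu__critical', 'mem__major']
```

On the driver, one could consult `set(df.columns)` to see which critical
columns actually exist before registering the UDF, or skip the UDF entirely
and express the same rule with `F.when`/`F.col` expressions, which keeps the
logic optimizable by Spark.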