Re: Pyspark UDF as a data source for streaming

2024-01-08 Thread Mich Talebzadeh
Hi,

Have you come back with any ideas for implementing this, specifically
integrating Spark Structured Streaming with a REST API? FYI, I did some work
on it, as it can have potentially wider use cases, i.e. the seamless
integration of Spark Structured Streaming with a Flask REST API for real-time
data ingestion and analytics. My use case revolves around a scenario where
data is generated through REST API requests in real time with PySpark. The
Flask REST API captures and processes this data, saving it to a
sink of your choice, such as a data warehouse or Kafka.
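To make the moving parts concrete, here is a minimal, framework-agnostic
sketch of that ingestion loop. This is illustrative only: `fetch` and `send`
are hypothetical stand-ins for, say, a `requests.get(...).json()` call against
the REST API and a `KafkaProducer.send(...)` to the sink topic, and they are
injected so the loop can be exercised without a live API or broker:

```python
import json
import time
import uuid
from typing import Callable, Dict, List, Optional


def poll_api_to_sink(fetch: Callable[[], List[Dict]],
                     send: Callable[[str, str], None],
                     interval_s: float = 5.0,
                     max_polls: Optional[int] = None) -> int:
    """Poll a REST API and forward each record to a sink as a K/V pair.

    fetch: returns the latest batch of records (list of dicts).
    send:  delivers one (key, value) message, e.g. to a Kafka topic.
    Returns the total number of messages sent.
    """
    sent = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        for record in fetch():
            key = str(uuid.uuid4())  # unique key per message
            send(key, json.dumps(record, sort_keys=True))
            sent += 1
        polls += 1
        if max_polls is not None and polls >= max_polls:
            break
        time.sleep(interval_s)
    return sent
```

Spark Structured Streaming would then consume the sink topic in the usual way,
e.g. `spark.readStream.format("kafka")...`.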

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 27 Dec 2023 at 12:16, Поротиков Станислав Вячеславович
 wrote:

> Hello!
>
> Is it possible to write a PySpark UDF that generates data for a streaming
> dataframe?
>
> I want to get some data from REST API requests in real time, save this
> data to a dataframe,
>
> and then put it into Kafka.
>
> I can't work out how to create a streaming dataframe from the generated data.
>
>
>
> I am new to Spark streaming.
>
> Could you give me some hints?
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>
>


Re: Pyspark UDF as a data source for streaming

2023-12-29 Thread Mich Talebzadeh
Hi,

Do you have more info on this Jira besides the GitHub link? I don't seem to
be able to find it.

Thanks





On Thu, 28 Dec 2023 at 09:33, Hyukjin Kwon  wrote:

> Just fyi streaming python data source is in progress
> https://github.com/apache/spark/pull/44416 we will likely release this in
> spark 4.0


Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
Hi Stanislav,

On the PySpark DF, can you run the following

df.printSchema()

and send the output, please?

HTH





On Thu, 28 Dec 2023 at 12:31, Поротиков Станислав Вячеславович <
s.poroti...@skbkontur.ru> wrote:

> Ok. Thank you very much!
>
>
>
> Best regards,
>
> Stanislav Porotikov

RE: Pyspark UDF as a data source for streaming

2023-12-28 Thread Поротиков Станислав Вячеславович
Ok. Thank you very much!

Best regards,
Stanislav Porotikov

From: Mich Talebzadeh 
Sent: Thursday, December 28, 2023 5:14 PM
To: Hyukjin Kwon 
Cc: Поротиков Станислав Вячеславович ; 
user@spark.apache.org
Subject: Re: Pyspark UDF as a data source for streaming




Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
You can work around this issue by writing your DF to a flat file
and using Kafka to pick it up from the flat file and stream it in.

Bear in mind that Kafka will require a unique identifier as the key of each
K/V pair. Check this link on how to generate a UUID for this purpose:

https://stackoverflow.com/questions/49785108/spark-streaming-with-python-how-to-add-a-uuid-column
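As a plain-Python sketch of that hand-off (illustrative only, not the Spark
API itself: in a real job you would emit these lines from `foreachBatch` or
`df.write`, and a Kafka producer or connector would tail the file), each
record is appended to a JSON-lines flat file with a fresh UUID key:

```python
import json
import uuid
from typing import Dict, List


def append_records_jsonl(records: List[Dict], path: str) -> int:
    """Append records to a JSON-lines flat file, one message per line.

    Each line carries a UUID "key" plus the record as "value", matching the
    K/V shape a downstream Kafka producer expects. Returns lines written.
    """
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            line = {"key": str(uuid.uuid4()), "value": rec}
            f.write(json.dumps(line, sort_keys=True) + "\n")
    return len(records)
```

Opening the file in append mode keeps the hand-off incremental: each micro-batch
adds new lines, and the consumer only needs to remember its last file offset.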

HTH





On Thu, 28 Dec 2023 at 09:33, Hyukjin Kwon  wrote:

> Just fyi streaming python data source is in progress
> https://github.com/apache/spark/pull/44416 we will likely release this in
> spark 4.0


Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, the streaming Python data source is in progress:
https://github.com/apache/spark/pull/44416. We will likely release this in
Spark 4.0.

On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович
 wrote:

> Yes, it's actual data.


RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Yes, it's actual data.

Best regards,
Stanislav Porotikov

From: Mich Talebzadeh 
Sent: Wednesday, December 27, 2023 9:43 PM
Cc: user@spark.apache.org
Subject: Re: Pyspark UDF as a data source for streaming

Is this generated data actual data, or are you testing the application?

Sounds like a form of Lambda architecture here, with some decision/processing
not far from the attached diagram.

HTH


RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Actually, it's JSON with a specific structure from the API server.
But the task is to constantly check whether new data appears on the API server
and load it to Kafka.
The full pipeline can be presented like this:
REST API -> Kafka -> some processing -> Kafka/Mongo -> …
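One simple way to implement the "check constantly whether new data appears"
step, sketched here in plain Python (the `fetch_all` callable is a
hypothetical stand-in for the actual REST call, and deduplication is done by
hashing each JSON record, under the assumption that records are immutable):

```python
import hashlib
import json
from typing import Callable, Dict, List, Set


def new_records(fetch_all: Callable[[], List[Dict]],
                seen: Set[str]) -> List[Dict]:
    """Fetch the current state of the API and return only unseen records.

    Each record is identified by the SHA-256 of its canonical JSON form;
    hashes of returned records are added to `seen`, so repeated polls
    only yield data that has newly appeared on the server.
    """
    fresh = []
    for rec in fetch_all():
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            fresh.append(rec)
    return fresh
```

Each batch returned by `new_records` would then be published to the first
Kafka topic in the pipeline.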

Best regards,
Stanislav Porotikov

From: Mich Talebzadeh 
Sent: Wednesday, December 27, 2023 6:17 PM
To: Поротиков Станислав Вячеславович 
Cc: user@spark.apache.org
Subject: Re: Pyspark UDF as a data source for streaming






Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
Ok, so you want to generate some random data and load it into Kafka on a
regular interval, and the rest?

HTH





On Wed, 27 Dec 2023 at 12:16, Поротиков Станислав Вячеславович
 wrote:
