Re: What does Apache Spark do?

2022-05-18 Thread Pasha Finkelshtein
Hi Mr. Turritopsis Dohrnii Teo En Ming,

Spark can perform a variety of different tasks, but the most important thing
to know about it is that it's a distributed computation framework.

Usually it's used for ETL (extract-transform-load) pipelines, but there is
also a plethora of extensions, for example for stream processing or for ML.
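
For a first taste of what that looks like in practice, here is a minimal batch
ETL sketch in Scala (the paths and column names are invented for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    // A tiny batch ETL job: read raw CSV, aggregate, write Parquet.
    object MinimalEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("minimal-etl").getOrCreate()

        spark.read
          .option("header", "true")
          .csv("/data/raw/orders.csv")                            // extract
          .filter(col("status") === "COMPLETED")                  // transform
          .groupBy("customer_id")
          .agg(sum(col("amount").cast("double")).as("total_spent"))
          .write
          .mode("overwrite")
          .parquet("/data/curated/customer_totals")               // load

        spark.stop()
      }
    }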

You can read more on its website: https://spark.apache.org

Welcome to the beautiful world of data engineering!

Cheers,
Pasha

Wed, 18 May 2022, 16:09 Turritopsis Dohrnii Teo En Ming <
ceo.teo.en.m...@gmail.com>:

> Subject: What does Apache Spark do?
>
> Good day from Singapore,
>
> I notice that my company/organization uses Apache Spark. What does it do?
>
> Just being curious.
>
> Regards,
>
> Mr. Turritopsis Dohrnii Teo En Ming
> Targeted Individual in Singapore
> 18 May 2022 Wed
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Data ingestion

2022-08-17 Thread Pasha Finkelshtein
Hello

Spark does not have any built-in solution for this problem. Most probably
you will want to use Debezium + Kafka and read from Kafka with Spark.
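
A rough sketch of the Spark side of that setup, using Structured Streaming to
read Debezium's change events from Kafka (the brokers, topic and paths are
placeholders, and the spark-sql-kafka connector is assumed to be on the
classpath):

    import org.apache.spark.sql.SparkSession

    // Read Debezium change events from Kafka and land them as Parquet files.
    object KafkaCdcReader {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cdc-from-kafka").getOrCreate()

        val changes = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "dbserver1.inventory.customers")   // Debezium topic
          .option("startingOffsets", "earliest")
          .load()
          .selectExpr("CAST(value AS STRING) AS json")            // raw change event

        changes.writeStream
          .format("parquet")
          .option("path", "/data/cdc/customers")
          .option("checkpointLocation", "/checkpoints/customers")
          .start()
          .awaitTermination()
      }
    }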



Pasha Finkelshteyn

Developer Advocate for Data Engineering

JetBrains



asm0...@jetbrains.com
https://linktr.ee/asm0dey

Find out more 



Wed, 17 Aug 2022 at 19:51, Akash Vellukai :

> Dear Sir,
>
>
> How could we do data ingestion from MySQL to Hive with the help of Spark
> Streaming, and not with Kafka?
>
> Thanks and regards
> Akash
>


Re: Data ingestion

2022-08-17 Thread Pasha Finkelshtein
But not in streaming, right? It will be a usual batch approach, but the
initial question was about streaming.
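
To make the difference concrete, here is a sketch of the batch approach
described below: Spark has no streaming JDBC source, so a MySQL-to-Hive copy
without Kafka ends up as a (possibly scheduled) batch job. Connection details
and table names are placeholders, and the MySQL JDBC driver has to be on the
classpath:

    import org.apache.spark.sql.SparkSession

    // Batch copy: read a MySQL table over JDBC, append it into a Hive table.
    object MysqlToHiveBatch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("mysql-to-hive")
          .enableHiveSupport()
          .getOrCreate()

        val orders = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://mysql-host:3306/shop")
          .option("dbtable", "orders")
          .option("user", "etl")
          .option("password", sys.env.getOrElse("MYSQL_PASSWORD", ""))
          .load()

        orders.write
          .mode("append")
          .saveAsTable("warehouse.orders")   // Hive-managed target table
      }
    }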



Pasha Finkelshteyn

Developer Advocate for Data Engineering

JetBrains



asm0...@jetbrains.com
https://linktr.ee/asm0dey

Find out more 



Thu, 18 Aug 2022 at 03:12, pengyh :

> From my experience, Spark can read/write from/to both MySQL and Hive
> fluently.
>
> regards.
>
>
> Akash Vellukai wrote:
> > How could we do data ingestion from MySQL to Hive with the help of Spark
> > Streaming, and not with Kafka?
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
To me it seems like it's the best possible use case for PF4J.
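
For context, a rough sketch of what a PF4J-based host could look like on the
driver side; the extension point, plugin directory and the plugins themselves
are hypothetical, not part of any Spark API:

    import org.pf4j.{DefaultPluginManager, ExtensionPoint}
    import java.nio.file.Paths

    // Hypothetical extension point: each plugin contributes a DataFrame transformation.
    trait TransformPlugin extends ExtensionPoint {
      def transform(df: org.apache.spark.sql.DataFrame): org.apache.spark.sql.DataFrame
    }

    object PluginHost {
      def main(args: Array[String]): Unit = {
        // Each plugin jar gets its own classloader, which is what provides the
        // per-plugin classpath isolation discussed in this thread.
        val manager = new DefaultPluginManager(Paths.get("/opt/app/plugins"))
        manager.loadPlugins()
        manager.startPlugins()

        manager.getExtensions(classOf[TransformPlugin])
          .forEach(t => println(s"Loaded transform: ${t.getClass.getName}"))
      }
    }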



Pasha Finkelshteyn

Developer Advocate for Data Engineering

JetBrains



asm0...@jetbrains.com
https://linktr.ee/asm0dey

Find out more 



On Tue, 28 Nov 2023 at 12:47, Holden Karau  wrote:

> So I don’t think we make any particular guarantees around class path
> isolation there, so even if it does work it’s something you’d need to pay
> attention to on upgrades. Class path isolation is tricky to get right.
>
> On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde  wrote:
>
>> Hello,
>>
>> We are using spark 3.5.0 and were wondering if the following is
>> achievable using spark-core
>>
>> Our use case involves spinning up a spark cluster where the driver
>> application loads user jars containing spark transformations at runtime. A
>> single spark application can load multiple user jars ( same cluster ) that
>> can have class path conflicts if care is not taken
>>
>> AFAIK, to get this right requires the Executor to be designed in a way
>> that allows for class path isolation ( UDF, lambda expressions ). Ideally
>> per Spark Session is what we want
>>
>> I know Spark connect has been designed this way but Spark connect is not
>> an option for us at the moment. I had some luck using a private method
>> inside spark called JobArtifactSet.withActiveJobArtifactState
>>
>> Is it sufficient for me to run the user code enclosed
>> within JobArtifactSet.withActiveJobArtifactState to achieve my requirement?
>>
>> Thank you
>>
>>
>> Faiz
>>
>


Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
I actually think it should be totally possible to use it on the executor
side. Maybe it will require a small extension/UDF, but generally there are no
issues here. PF4J is very lightweight, so you'll only have a small classloader
overhead.

There's still a small question of how to distribute the plugins/extensions,
but you probably already have storage where you can keep them.
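
A hedged sketch of what executor-side loading might look like, assuming the
plugin jars are already present at the same path on every executor (shared
storage, a baked image, or files shipped with the job); the extension point
and paths are invented for illustration:

    import org.apache.spark.sql.SparkSession
    import org.pf4j.{DefaultPluginManager, ExtensionPoint}
    import java.nio.file.Paths
    import scala.jdk.CollectionConverters._

    // Hypothetical extension point implemented by the user-provided plugins.
    trait StringTransform extends ExtensionPoint {
      def apply(s: String): String
    }

    object ExecutorSidePlugins {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("pf4j-on-executors").getOrCreate()
        val lines = spark.sparkContext.parallelize(Seq("a", "b", "c"), numSlices = 3)

        val transformed = lines.mapPartitions { it =>
          // Each task builds its own plugin manager; PF4J gives every plugin jar
          // a dedicated classloader, so plugins with conflicting dependencies
          // do not clash on the executor classpath.
          val manager = new DefaultPluginManager(Paths.get("/opt/app/plugins"))
          manager.loadPlugins()
          manager.startPlugins()
          val plugins = manager.getExtensions(classOf[StringTransform]).asScala.toList
          it.map(s => plugins.foldLeft(s)((acc, p) => p.apply(acc)))
        }

        transformed.collect().foreach(println)
        spark.stop()
      }
    }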




Pasha Finkelshteyn

Developer Advocate for Data Engineering

JetBrains



asm0...@jetbrains.com
https://linktr.ee/asm0dey

Find out more <https://jetbrains.com>



On Tue, 28 Nov 2023 at 17:04, Faiz Halde  wrote:

> Hey Pasha,
>
> Is your suggestion towards the spark team? I can make use of the plugin
> system on the driver side of spark but considering spark is distributed,
> the executor side of spark needs to adapt to the pf4j framework I believe
> too
>
> Thanks
> Faiz
>