Re: deciding Spark tasks & optimization resource

2022-08-29 Thread Gibson
Hello Rajat,

Look up the Spark *pipelining* concept: any sequence of operations that
feed data directly into each other without the need for a shuffle will be
packed into a single stage, i.e. select -> filter -> select (Spark SQL) or
map -> filter -> map (RDD). Any operation that requires a shuffle (sort,
group, reduce) starts a new stage. Before the new stage runs, the shuffle
files are persisted to local disk (referred to as *shuffle persistence*)
and are read by the group/reduce tasks. This offers a degree of fault
tolerance, in the sense that a group task can be relaunched after a
failure, e.g. when there are not enough executors.
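For illustration, a minimal sketch (the input path and column names are made
up) of where the stage boundary appears; explain() shows the pipelined
operators in one stage and an Exchange where the shuffle happens:

import static org.apache.spark.sql.functions.col;

Dataset<Row> df = spark.read().parquet("/data/events");   // hypothetical input

// select -> filter -> select: no shuffle needed, so these are pipelined into one stage
Dataset<Row> narrow = df.select(col("user_id"), col("amount"))
                        .filter(col("amount").gt(0))
                        .select(col("user_id"));

// groupBy requires a shuffle, so a new stage starts at the Exchange
Dataset<Row> counts = narrow.groupBy("user_id").count();
counts.explain();   // look for "Exchange" in the physical plan -- that is the stage boundary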

Regarding the number of partitions, look up the
*spark.sql.shuffle.partitions* parameter; it sets the default number of
partitions created by a shuffle operation, and it defaults to 200.
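For example (a sketch; the value 64 is illustrative only), the setting can
be changed per session before the shuffle runs:

spark.conf().set("spark.sql.shuffle.partitions", "64");

// any subsequent shuffle (groupBy, join, ...) now targets 64 partitions
Dataset<Row> counts = narrow.groupBy("user_id").count();
System.out.println(counts.rdd().getNumPartitions());   // 64, unless AQE coalesces them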


Reference:
Spark: The Definitive Guide; Bill Chambers & Matei Zaharia




On Mon, Aug 29, 2022 at 3:36 PM rajat kumar 
wrote:

> Hello Members,
>
> I have a query for spark stages:-
>
> why every stage has a different number of tasks/partitions in spark. Or
> how is it determined?
>
> Moreover, where can i see the improvements done in spark3+
>
>
> Thanks in advance
> Rajat
>


Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
The idea behind Spark Streaming is to process change events as they occur,
hence the suggestions in this thread that rely on capturing change events
with Debezium.

But you can also use JDBC drivers to connect Spark to relational databases
for batch reads.
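A minimal sketch of a JDBC batch read (host, database, table and credentials
below are placeholders, and `spark` is an existing SparkSession):

Dataset<Row> orders = spark.read()
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")   // placeholder host/database
    .option("dbtable", "orders")                       // placeholder table
    .option("user", "spark_reader")
    .option("password", "******")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load();

orders.write().mode("overwrite").saveAsTable("staging.orders");   // e.g. land it in Hive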


On Wed, Aug 17, 2022 at 6:21 PM Akash Vellukai 
wrote:

> I am a beginner with Spark; may I also know how to connect a MySQL database
> with Spark Streaming?
>
> Thanks and regards
> Akash P
>
> On Wed, 17 Aug, 2022, 8:28 pm Saurabh Gulati, 
> wrote:
>
>> Another take:
>>
>>- Debezium
>><https://debezium.io/documentation/reference/stable/connectors/mysql.html>
>>to read Write Ahead logs(WAL) and send to Kafka
>>- Kafka connect to write to cloud storage -> Hive
>>   - OR
>>
>>
>>- Spark streaming to parse WAL -> Storage -> Hive
>>
>> Regards
>> --
>> *From:* Gibson 
>> *Sent:* 17 August 2022 16:53
>> *To:* Akash Vellukai 
>> *Cc:* user@spark.apache.org 
>> *Subject:* [EXTERNAL] Re: Spark streaming - Data Ingestion
>>
>> If you have space for a message log like Kafka, then you should try:
>>
>> MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS
>> -> Hive
>>
>> On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai <
>> akashvellukai...@gmail.com> wrote:
>>
>> Dear sir
>>
>> I have tried a lot on this; could you help me with this?
>>
>> Data ingestion from MySQL to Hive with Spark Streaming?
>>
>> Could you give me an overview?
>>
>>
>> Thanks and regards
>> Akash P
>>
>>


Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
If you have space for a message log like Kafka, then you should try:

MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS ->
Hive
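A rough sketch of the Kafka -> Spark -> storage step (topic, servers and
paths are placeholders, and the payload layout depends on how the CDC
connector is configured):

import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

Dataset<Row> cdc = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   // placeholder
    .option("subscribe", "mysql.shop.orders")          // placeholder CDC topic
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(value AS STRING) AS json_value", "timestamp");

StreamingQuery query = cdc.writeStream()
    .format("parquet")
    .option("path", "s3a://lake/raw/orders/")               // placeholder
    .option("checkpointLocation", "s3a://lake/chk/orders/")
    .trigger(Trigger.ProcessingTime("1 minute"))
    .start();

// a Hive external table can then be defined over the Parquet path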

On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai 
wrote:

> Dear sir
>
> I have tried a lot on this; could you help me with this?
>
> Data ingestion from MySQL to Hive with Spark Streaming?
>
> Could you give me an overview?
>
>
> Thanks and regards
> Akash P
>


Spark Convert Column to String

2022-07-16 Thread Gibson
Hi Folks,


I have created a UDF that queries a Confluent schema registry for a schema,
which is then used within a Dataset select with the from_avro function to
decode an Avro-encoded value (reading from a bunch of Kafka topics):


Dataset<Row> recordDF = df.select(
    callUDF("getjsonSchemaUDF", col("topic")).as("schema"),
    from_avro(col("avro_value"),
              callUDF("getjsonSchemaUDF", col("topic"))).as("record")
);


from_avro expects a String as its second argument, not a Column.

I need to be able to convert the JSON schema returned by the UDF (a Column)
to a String.

Any ideas?
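One possible workaround, as a sketch only: resolve the schema on the driver
per topic and pass it to from_avro as a plain String. Here fetchJsonSchema(topic)
is a hypothetical helper standing in for your schema-registry lookup, and
`topics` is the list of topic names collected beforehand:

import static org.apache.spark.sql.avro.functions.from_avro;
import static org.apache.spark.sql.functions.col;

for (String topic : topics) {
    String jsonSchema = fetchJsonSchema(topic);     // plain String, resolved on the driver
    Dataset<Row> recordDF = df
        .filter(col("topic").equalTo(topic))
        .select(from_avro(col("avro_value"), jsonSchema).as("record"));
    // ... write/process recordDF for this topic
}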