Re: implement a distribution without shuffle like RDD.coalesce for DataSource V2 write

2023-06-18 Thread Mich Talebzadeh
Is this the point you are trying to implement?

I have a state data source which enables the state in SS (Structured
Streaming) to be rewritten, which enables repartitioning, schema
evolution, etc. via a batch query. The writer requires hash partitioning
against the group key, with the "desired number of partitions", which is
the same as what Spark does when it reads and writes state.

This is now implemented as DSv1, and the requirement is *simply done
by calling repartition with the "desired number".*

```
val fullPathsForKeyColumns = keySchema.map(key => new Column(s"key.${key.name}"))
data
  .repartition(newPartitions, fullPathsForKeyColumns: _*)
  .queryExecution
  .toRdd
  .foreachPartition(
    writeFn(resolvedCpLocation, version, operatorId, storeName,
      keySchema, valueSchema, storeConf, hadoopConfBroadcast, queryId))
```

Well, Spark will not know the optimum value of newPartitions and you will
need to work that out from the size of the streaming state.

Is that a correct understanding?

HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 18 Jun 2023 at 10:12, Pengfei Li  wrote:

> Hi All,
>
> I'm developing a DataSource on Spark 3.2 to write data to our system,
> and using DataSource V2 API. I want to implement the interface
> RequiresDistributionAndOrdering
> <https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/RequiresDistributionAndOrdering.java>
>  to
> set the number of partitions used for write. But I don't know how to
> implement a distribution without shuffle as  RDD.coalesce does. Is there
> any example or advice?
>
> Thank You
> Best Regards
>
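
For illustration, a minimal sketch (Scala, untested, against the Spark 3.2 DataSource V2
connector API) of what implementing RequiresDistributionAndOrdering typically looks like;
MyWrite, the key columns and the wrapped BatchWrite are hypothetical. Note that a clustered
distribution plus requiredNumPartitions still implies a shuffle on the write side; as far as
I can tell the 3.2 interface offers no coalesce-style, shuffle-free option, which is the gap
being asked about.

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.{BatchWrite, RequiresDistributionAndOrdering}

// Hypothetical Write implementation: cluster rows by the given key columns
// and ask Spark for a fixed number of output partitions before the write.
class MyWrite(keyColumns: Array[String],
              numPartitions: Int,
              underlying: BatchWrite) extends RequiresDistributionAndOrdering {

  // hash-distribute the incoming rows by the key columns (this adds a shuffle)
  override def requiredDistribution(): Distribution =
    Distributions.clustered(keyColumns.map(c => Expressions.column(c): Expression))

  // no ordering requirement inside each write partition
  override def requiredOrdering(): Array[SortOrder] = Array.empty[SortOrder]

  // 0 means "no preference"; a positive value fixes the number of write partitions
  override def requiredNumPartitions(): Int = numPartitions

  override def toBatch(): BatchWrite = underlying
}
```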


Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread Mich Talebzadeh
Try sending it to d...@spark.apache.org (and join that group)

You need to raise a JIRA for this request, plus the related documentation.


Example JIRA

https://issues.apache.org/jira/browse/SPARK-42485

and the related *Spark Project Improvement Proposal (SPIP)* to be filled in

https://spark.apache.org/improvement-proposals.html


HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 4 Jun 2023 at 12:38, keen  wrote:

> Do Spark **devs** read this mailing list?
> Is there another/a better way to make feature requests?
> I tried in the past to write a mail to the dev mailing list but it did not
> show at all.
>
> Cheers
>
> keen  schrieb am Do., 1. Juni 2023, 07:11:
>
>> Hi all,
>> currently only *temporary* Spark Views can be created from a DataFrame
>> (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView).
>>
>> When I want a *permanent* Spark View I need to specify it via Spark SQL
>> (CREATE VIEW AS SELECT ...).
>>
>> Sometimes it is easier to specify the desired logic of the View through
>> Spark/PySpark DataFrame API.
>> Therefore, I'd like to suggest to implement a new PySpark method that
>> allows creating a *permanent* Spark View from a DataFrame
>> (df.createOrReplaceView).
>>
>> see also:
>>
>> https://community.databricks.com/s/question/0D53f1PANVgCAP/is-there-a-way-to-create-a-nontemporary-spark-view-with-pyspark
>>
>> Regards
>> Martin
>>
>
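
For anyone landing here, the SQL route described above is, as far as I know, the only way to
get a *permanent* view today; a minimal sketch follows (the database, table and view names are
hypothetical). Note that a permanent view cannot be created over a temporary view registered
from a DataFrame, which is exactly why the requested df.createOrReplaceView would be useful.

```scala
import org.apache.spark.sql.SparkSession

// assumes a persistent metastore (e.g. Hive), so the view outlives the Spark session
val spark = SparkSession.builder()
  .appName("permanent-view-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Today the view's logic has to be expressed in SQL over catalog tables;
// it cannot be built from a DataFrame plus a temp view.
spark.sql("""
  CREATE OR REPLACE VIEW my_db.daily_totals AS
  SELECT customer_id, SUM(amount) AS total_amount
  FROM my_db.orders
  GROUP BY customer_id
""")
```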


Re: ChatGPT and prediction of Spark future

2023-06-01 Thread Mich Talebzadeh
Great stuff Winston. I added a channel in Slack Community for Spark

https://sparkcommunitytalk.slack.com/archives/C05ACMS63RT

cheers

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 1 Jun 2023 at 01:55, Winston Lai  wrote:

> Hi Mich,
>
> I have been using ChatGPT free version, Bing AI, Google Bard and other AI
> chatbots.
>
> My use cases so far include writing, debugging code, generating
> documentation and explanations of key Spark terminology for beginners to
> quickly pick up new concepts, summarizing pros and cons or use cases of
> different tools available in data engineering for my decision making (as we
> know, the modern tech stack has too many tools available), rephrasing
> difficult data concepts into something a 7-year-old could understand,
> generating pictures for illustrations ...
>
> With all those use cases mentioned, I feel free version ChatGPT, Bing AI
> and Google Bard do quite well already. I usually can find a satisfying
> answer from one of these three with one attempt. A better prompt
> engineering would help you get the better answer, too. So my take is no
> need to pay for premium version of ChatGPT unless you feel all these free
> tools can't serve your needs.
>
> I started using all these three almost since the first week of the beta
> test by signing up for the wait-list. My personal preference is Bing AI
> first for all generic questions and personal fun stuff. Google Bard second
> if I would like to send the answer as email or Google doc with the built-in
> export integration and ChatGPT last if both Bing and Bard failed me, which
> happened to me once or twice only even though I used all these three daily.
>
> Hope it helps! We can always chat more about it in the slack community if
> you are interested. Thank you a lot for your continuous help in this Spark
> community! I'd be glad if my reply is useful to you 
>
> Thank You & Best Regards
> Winston Lai
> --
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, June 1, 2023 4:51:43 AM
> *To:* user @spark 
> *Subject:* ChatGPT and prediction of Spark future
>
> I have started looking into ChatGPT as a consumer. The one I have tried
> is the free version, not the Plus version.
>
> I asked a question entitled "what is the future for spark" and asked for
> a concise response
>
> This was the answer
>
> "Spark has a promising future due to its capabilities in data processing,
> scalability, performance, and integration with other technologies. It is
> well-suited for big data analytics and machine learning tasks. With a
> strong community and extensive ecosystem, Spark is positioned to continue
> evolving and meeting the growing demands of data-driven applications."
>
> I have a naive question for those who know far more than me about AI (on
> which I believe ChatGPT is based). How reliable are the responses (subject
> to interpretation), and is it worthwhile paying the $20 subscription for
> the Plus version?
>
> With regard to the response on Spark's future, I too agree that Spark will
> be used for years to come, especially in empowering Data Fabric and Cloud
> Data Fusion.
>
> Thanks,
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


ChatGPT and prediction of Spark future

2023-05-31 Thread Mich Talebzadeh
I have started looking into ChatGPT as a consumer. The one I have tried is
the free version, not the Plus version.

I asked a question entitled "what is the future for spark" and asked for a
concise response

This was the answer

"Spark has a promising future due to its capabilities in data processing,
scalability, performance, and integration with other technologies. It is
well-suited for big data analytics and machine learning tasks. With a
strong community and extensive ecosystem, Spark is positioned to continue
evolving and meeting the growing demands of data-driven applications."

I have a naive question for those who know far more than me about AI (on
which I believe ChatGPT is based). How reliable are the responses (subject
to interpretation), and is it worthwhile paying the $20 subscription for
the Plus version?

With regard to the response on Spark's future, I too agree that Spark will
be used for years to come, especially in empowering Data Fabric and Cloud
Data Fusion.

Thanks,

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>spark</groupId>
  <version>3.0</version>
  <artifactId>ReduceByKey</artifactId>
  <name>${project.artifactId}</name>

  <properties>
    <maven.compiler.source>11.0.1</maven.compiler.source>
    <maven.compiler.target>11.0.1</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.13.8</scala.version>
    <!-- 2.15.2: property name lost in the archive -->
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.13.8</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.13</artifactId>
      <version>3.4.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.13</artifactId>
      <version>3.4.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>

Thanks


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 29 May 2023 at 13:44, Bjørn Jørgensen 
wrote:

> Change
>
> <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>2.13.11-M2</version>
> </dependency>
>
> to
>
> <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>${scala.version}</version>
> </dependency>
>
> man. 29. mai 2023 kl. 13:20 skrev Lingzhe Sun :
>
>> Hi Mich,
>>
>> Spark 3.4.0 prebuilt with scala 2.13 is built with version 2.13.8
>> <https://github.com/apache/spark/blob/88f69d6f92860823b1a90bc162ebca2b7c8132fc/pom.xml#L170>.
>> Since you are using spark-core_2.13 and spark-sql_2.13, you should stick to
>> the major (13) and minor (8) versions. Not doing so may cause unexpected
>> behaviour (though Scala claims compatibility among minor version changes,
>> I've encountered problems using a Scala package with the same major version
>> and a different minor version; that may be due to bug fixes and upgrades of
>> Scala itself). And although I have not encountered such a problem, this
>> <https://stackoverflow.com/a/26411339/19476830> can be a pitfall for you.
>>
>> --
>> Best Regards!
>>
>> ...
>> Lingzhe Sun
>> Hirain Technology
>>
>>
>> *From:* Mich Talebzadeh 
>> *Date:* 2023-05-29 17:55
>> *To:* Bjørn Jørgensen 
>> *CC:* user @spark 
>> *Subject:* Re: maven with Spark 3.4.0 fails compilation
>> Thanks for your helpful comments Bjorn.
>>
>> I managed to compile the code with maven but when it runs it fails with
>>
>>   Application is ReduceByKey
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> scala.package$.Seq()Lscala/collection/immutable/Seq$;
>> at ReduceByKey$.main(ReduceByKey.scala:23)
>> at ReduceByKey.main(ReduceByKey.scala)
>> at
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>> Method)
>> at
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>> at
>> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>> at org.apache.spark.deploy.SparkSubmit.org
>> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
>> at
>> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
>> at
>> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
>> at
>> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
>> at
>> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
>> at
>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> I attach the pom.xml and the sample scala code is self contained and
>> basic. Again it runs with SBT with no issues.
>>
>> FYI, my scala version on host is
>>
>>  scala -version
>> Scala code runner version 2.13.6 -- Copyright 2002-2021, LAMP/EPFL and
>> Lightbend, Inc.
>>
>> I think I have a Scala incompatibility somewhere again
>>
>> Cheers
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>&

Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
Thanks for your helpful comments Bjorn.

I managed to compile the code with maven but when it runs it fails with

  Application is ReduceByKey

Exception in thread "main" java.lang.NoSuchMethodError:
scala.package$.Seq()Lscala/collection/immutable/Seq$;
at ReduceByKey$.main(ReduceByKey.scala:23)
at ReduceByKey.main(ReduceByKey.scala)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
at
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
at
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I attach the pom.xml; the sample Scala code is self-contained and basic.
Again, it runs with SBT with no issues.

FYI, my scala version on host is

 scala -version
Scala code runner version 2.13.6 -- Copyright 2002-2021, LAMP/EPFL and
Lightbend, Inc.

I think I have a Scala incompatibility somewhere again

Cheers


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 28 May 2023 at 20:29, Bjørn Jørgensen 
wrote:

> From chatgpt4
>
>
> The problem appears to be that there is a mismatch between the version of
> Scala used by the Scala Maven plugin and the version of the Scala library
> defined as a dependency in your POM. You've defined your Scala version in
> your properties as `2.12.17` but you're pulling in `scala-library` version
> `2.13.6` as a dependency.
>
> The Scala Maven plugin will be using the Scala version defined in the
> `scala.version` property for compilation, but then it tries to load classes
> from a different Scala version, hence the error.
>
> To resolve this issue, make sure the `scala.version` property matches the
> version of `scala-library` defined in your dependencies. In your case, you
> may want to change `scala.version` to `2.13.6`.
>
> Here's the corrected part of your POM:
>
> ```xml
> <properties>
>   <maven.compiler.source>1.7</maven.compiler.source>
>   <maven.compiler.target>1.7</maven.compiler.target>
>   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   <scala.version>2.13.6</scala.version>
>   <!-- 2.15.2: property name lost in the archive -->
> </properties>
> ```
>
> Additionally, ensure that the Scala versions in the Spark dependencies
> match the `scala.version` property as well. If you've updated the Scala
> version to `2.13.6`, the artifactIds for Spark dependencies should be
> `spark-core_2.13` and `spark-sql_2.13`.
>
> Another thing to consider: your Java version defined in
> `maven.compiler.source` and `maven.compiler.target` is `1.7`, which is
> quite outdated and might not be compatible with the latest versions of
> these libraries. Consider updating to a more recent version of Java, such
> as Java 8 or above, depending on the requirements of the libraries you're
> using.
>
>
>
> The same problem persists in this updated POM file - there's a mismatch in
> the Scala version declared in the properties and the version used in your
> dependencies. Here's what you need to update:
>
> 1. Update the Scala version in your properties to match the Scala library
> and your Spark dependencies:
>
> ```xml
> <properties>
>   <maven.compiler.source>1.7</maven.compiler.source>
>   <maven.compiler.target>1.7</maven.compiler.target>
>   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   <scala.version>2.13.6</scala.version>
>   <!-- 2.15.2: property name lost in the archive -->
> </properties>
> ```
>
> 2. Make sure all your Spark dependencies use the same Scala version. In
> this case, I see `spark-streaming-kafka_2.11` which should be
> `spark-streaming-kafka_2.13` if you're using Scala `2.13.6`.
>
> ```xml
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka_2.13</artifactId>
>   <version>1.6.3</version>
>   <scope>provided</scope>
> </dependency>
> ```
>
> 3. As mentioned in the previous message, your Java version
> (`maven.compiler.source` and `maven.compiler.target`) is also quite
> outdated. Depending on the requir

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Mich Talebzadeh
Hi,
Autoscaling is not compatible with Spark Structured Streaming
<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
since
Spark Structured Streaming currently does not support dynamic allocation
(see SPARK-24815: Structured Streaming should support dynamic allocation
<https://issues.apache.org/jira/browse/SPARK-24815>).

That ticket is still open

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 25 May 2023 at 18:44, Aishwarya Panicker <
aishwaryapanicke...@gmail.com> wrote:

> Hi Team,
>
> I have been working on Spark Structured Streaming and trying to autoscale
> our application through dynamic allocation. But I couldn't find any
> documentation or configurations that supports dynamic scaling in Spark
> Structured Streaming, due to which I had been using Spark Batch mode
> dynamic scaling which is not so efficient with streaming use case.
>
> I also tried with Spark streaming dynamic allocation configurations which
> didn't work with structured streaming.
>
> Below are the configurations I tried for dynamic scaling of my Spark
> Structured Streaming Application:
>
> With Batch Spark configurations:
>
> spark.dynamicAllocation.enabled: true
> spark.dynamicAllocation.executorAllocationRatio: 0.5
> spark.dynamicAllocation.minExecutors: 1
> spark.dynamicAllocation.maxExecutors: 5
>
>
> With Streaming Spark configurations:
>
> spark.dynamicAllocation.enabled: false
> spark.streaming.dynamicAllocation.enabled: true
> spark.streaming.dynamicAllocation.scaleUpRatio: 0.7
> spark.streaming.dynamicAllocation.scaleDownRatio: 0.2
> spark.streaming.dynamicAllocation.minExecutors: 1
> spark.streaming.dynamicAllocation.maxExecutors: 5
>
> Kindly let me know if there is any configuration for the dynamic
> allocation of Spark Structured Streaming which I'm missing due to which
> autoscaling of my application is not working properly.
>
> Awaiting your response.
>
> Thanks and Regards,
> Aishwarya
>
>
>
>
>


Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Just to correct the last sentence: if we end up starting a new instance of
Spark, I don't think it will be able to read the shuffle data persisted in
storage by another instance. I stand corrected.


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 22 May 2023 at 15:27, Mich Talebzadeh 
wrote:

> Hi Maksym.
>
> Let us understand the basics here first
> My thoughts: Spark replicates the partitions among multiple nodes. If one
> executor fails, it moves the processing over to the other executor.
> However, if the data is lost, it re-executes the processing that generated
> the data,
> and might have to go back to the source. In case of failure, there will
> be delay in getting the results. The amount of delay depends on how much
> reprocessing Spark needs to do.
> Spark, by itself, doesn't add executors when executors fail. It just moves
> the tasks to other executors. If you are installing plain vanilla Spark
> on your own cluster, you need to figure out how to bring back executors.
> Most of the popular platforms built on top of Spark (Glue, EMR, GKS) will
> replace failed nodes. However, I don't think that applies to your case.
>
> With regard to below point you raised
>
> "" One of the offerings from the service we use is EBS migration which
> basically means if a host is about to get evicted, a new host is created
> and the EBS volume is attached to it. When Spark assigns a new executor
> to the newly created instance, it basically can recover all the shuffle
> files that are already persisted in the migrated EBS volume Is this how
> it works? Do executors recover / re-register the shuffle files that they
> found?"""
>
> My understanding is that RDD lineage keeps track of records of what needs
> to be re-executed. It uses RDD lineage to figure out what needs to be
> re-executed in the same Spark instance. For example, if you have done a
> groupBy Key , you will have 2 stages. After the first stage, the data will
> be shuffled by hashing the groupBy key , so that data for the same value of
> key lands in same partition. Now, if one of those partitions is lost
> during execution of second stage, I am guessing Spark will have to go back
> and re-execute all the tasks in the first stage.
>
>
> So in your case, starting a new instance will not have the same issue; it
> re-executes the job.
>
>
> HTH
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 22 May 2023 at 13:19, Maksym M 
> wrote:
>
>> Hey vaquar,
>>
>> The link does't explain the crucial detail we're interested in - does
>> executor
>> re-use the data that exists on a node from previous executor and if not,
>> how
>> can we configure it to do so?
>>
>> We are not running on kubernetes, so EKS/Kubernetes-specific advice isn't
>> very relevant.
>>
>> We are running spark standalone mode.
>>
>> Best regards,
>> maksym
>>
>> On 2023/05/17 12:28:35 vaquar khan wrote:
>> > Following link you will get all required details
>> >
>> >
>> https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/
>> >
>> > Let me know if you required further informations.
>> >
>> >
>> > Regards,
>> > Vaquar khan
>> >
>> >
>> >
>> >
>> > On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh 
>> > wrote:
>> >
>> > > Couple of points
>> > >
>> > > Why use spot or pre-empt intantes when your application as you stated
>> > &

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Hi Maksym.

Let us understand the basics here first
My thoughts: Spark replicates the partitions among multiple nodes. If one
executor fails, it moves the processing over to the other executor.
However, if the data is lost, it re-executes the processing that generated
the data,
and might have to go back to the source. In case of failure, there will be
delay in getting the results. The amount of delay depends on how much
reprocessing Spark needs to do.
Spark, by itself, doesn't add executors when executors fail. It just moves
the tasks to other executors. If you are installing plain vanilla Spark on
your own cluster, you need to figure out how to bring back executors.
Most of the popular platforms built on top of Spark (Glue, EMR, GKS) will
replace failed nodes. However, I don't think that applies to your case.

With regard to below point you raised

"" One of the offerings from the service we use is EBS migration which
basically means if a host is about to get evicted, a new host is created
and the EBS volume is attached to it. When Spark assigns a new executor to
the newly created instance, it basically can recover all the shuffle files
that are already persisted in the migrated EBS volume Is this how it works?
Do executors recover / re-register the shuffle files that they found?"""

My understanding is that RDD lineage keeps track of records of what needs
to be re-executed. It uses RDD lineage to figure out what needs to be
re-executed in the same Spark instance. For example, if you have done a
groupBy Key , you will have 2 stages. After the first stage, the data will
be shuffled by hashing the groupBy key , so that data for the same value of
key lands in same partition. Now, if one of those partitions is lost
during execution of second stage, I am guessing Spark will have to go back
and re-execute all the tasks in the first stage.
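
To make the lineage point concrete, a small sketch (Scala, illustrative only): a single
action over a key-based aggregation produces two stages, and toDebugString prints the
lineage Spark would replay if a shuffle partition were lost.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lineage-sketch").getOrCreate()

val counts = spark.sparkContext
  .parallelize(1 to 1000, numSlices = 8)  // stage 1 input
  .map(i => (i % 10, i))                  // map side of the shuffle
  .groupByKey()                           // shuffle boundary -> second stage
  .mapValues(_.sum)

// The lineage chain ends in a ShuffledRDD over the map-side RDDs; if a partition
// of the shuffled data is lost, Spark re-runs the stage-1 tasks that feed it.
println(counts.toDebugString)

counts.count()   // one action, two stages in the Spark UI
```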


So in your case, starting a new instance will not have the same issue; it
re-executes the job.


HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 22 May 2023 at 13:19, Maksym M 
wrote:

> Hey vaquar,
>
> The link does't explain the crucial detail we're interested in - does
> executor
> re-use the data that exists on a node from previous executor and if not,
> how
> can we configure it to do so?
>
> We are not running on kubernetes, so EKS/Kubernetes-specific advice isn't
> very relevant.
>
> We are running spark standalone mode.
>
> Best regards,
> maksym
>
> On 2023/05/17 12:28:35 vaquar khan wrote:
> > Following link you will get all required details
> >
> >
> https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/
> >
> > Let me know if you required further informations.
> >
> >
> > Regards,
> > Vaquar khan
> >
> >
> >
> >
> > On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh 
> > wrote:
> >
> > > Couple of points
> > >
> > > Why use spot or pre-emptible instances when your application, as you stated,
> > > shuffles heavily?
> > > Have you looked at why you are having these shuffles? What is the
> cause of
> > > these large transformations ending up in shuffle
> > >
> > > Also on your point:
> > > "..then ideally we should expect that when an executor is killed/OOM'd
> > > and a new executor is spawned on the same host, the new executor
> registers
> > > the shuffle files to itself. Is that so?"
> > >
> > > What guarantee is that the new executor with inherited shuffle files
> will
> > > succeed?
> > >
> > > Also OOM is often associated with some form of skewed data
> > >
> > > HTH
> > > .
> > > Mich Talebzadeh,
> > > Lead Solutions Architect/Engineering Lead
> > > Palantir Technologies Limited
> > > London
> > > United Kingdom
> > >
> > >
> > >view my Linkedin profile
> > > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> > >
> > >
> > >  https://en.everybodywiki.com/Mich_Talebzadeh
> > >
> > >
> > >
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any
&

Re: Spark shuffle and inevitability of writing to Disk

2023-05-17 Thread Mich Talebzadeh
OK, I did a bit of a test that shows that the shuffle spills to memory and
then to disk, if my assertion is valid.

The sample code I wrote is as follows:

import sys
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, TimestampType
import time

def main():
    appName = "skew"
    spark = SparkSession.builder.appName(appName).getOrCreate()
    spark_context = SparkContext.getOrCreate()
    spark_context.setLogLevel("ERROR")
    df_uniform = spark.createDataFrame([i for i in range(1000)], IntegerType())
    df_uniform = df_uniform.withColumn("partitionId", spark_partition_id())
    print("Number of Partitions: " + str(df_uniform.rdd.getNumPartitions()))

    df_uniform.groupby([df_uniform.partitionId]).count().sort(df_uniform.partitionId).show()
    df_uniform.alias("left").join(df_uniform.alias("right"), "value", "inner").count()

    print(f"""Spark.sql.shuffle.partitions is {spark.conf.get("spark.sql.shuffle.partitions")}""")
    df0 = spark.createDataFrame([0] * 998, IntegerType()).repartition(1)
    df1 = spark.createDataFrame([1], IntegerType()).repartition(1)
    df2 = spark.createDataFrame([2], IntegerType()).repartition(1)
    df_skew = df0.union(df1).union(df2)
    df_skew = df_skew.withColumn("partitionId", spark_partition_id())
    ## If we apply the same function call again, we get what we want to see
    ## for the one partition with much more data than the other two.
    df_skew.groupby([df_skew.partitionId]).count().sort(df_skew.partitionId).show()
    ## simulate reading to first round robin distribute the key
    #df_skew = df_skew.repartition(3)
    df_skew.join(df_uniform.select("value"), "value", "inner").count()
    # salt range is from 1 to spark.conf.get("spark.sql.shuffle.partitions")
    df_left = df_skew.withColumn("salt",
        (rand() * spark.conf.get("spark.sql.shuffle.partitions")).cast("int")).show()
    df_right = df_uniform.withColumn("salt_temp",
        array([lit(i) for i in range(int(spark.conf.get("spark.sql.shuffle.partitions")))])).show()
    time.sleep(60)  # Pause

if __name__ == "__main__":
    main()

The PySpark code file is attached

There is a 60 sec wait at the end to allow one to examine Spark UI

Run it with meager memory and cores

bin/spark-submit --master local[1] --driver-memory 550M skew.py

This is the run results

Number of Partitions: 1
+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0| 1000|
+-----------+-----+

Spark.sql.shuffle.partitions is 200




+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|  998|
|          1|    1|
|          2|    1|
+-----------+-----+

+-----+-----------+----+
|value|partitionId|salt|
+-----+-----------+----+
|    0|          0|  89|
|    0|          0|  56|
|    0|          0| 169|
|    0|          0| 130|
|    0|          0|  94|
+-----+-----------+----+
only showing top 5 rows

+-----+-----------+--------------------+
|value|partitionId|           salt_temp|
+-----+-----------+--------------------+
|    0|          0|[0, 1, 2, 3, 4, 5...|
|    1|          0|[0, 1, 2, 3, 4, 5...|
|    2|          0|[0, 1, 2, 3, 4, 5...|
|    3|          0|[0, 1, 2, 3, 4, 5...|
|    4|          0|[0, 1, 2, 3, 4, 5...|
+-----+-----------+--------------------+
only showing top 5 rows

I have attached the screenshot from spark UI


So we can see the skew as below
+-----------+-----+
|partitionId|count|
+-----------+-----+
|          0|  998|
|          1|    1|
|          2|    1|
+-----------+-----+

Now I have attached the UI plot as well. The section "Aggregated Metrics
by Executor" shows the columns Spill (Memory) and Spill (Disk), highlighted
with a yellow circle.

My deduction is that Spark will try to use memory for the shuffle if it can,
but will revert to disk if it has to. So it is not true that a Spark shuffle
will always end up using disk?

I stand corrected.
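
As an aside, the salt and salt_temp columns built in the code above are the usual first half
of a salted join; a sketch (Scala, untested) of how it is typically finished: the skewed side
keeps its random salt, the uniform side is exploded over every salt value, and the join key
becomes (value, salt). dfSkew and dfUniform stand in for the DataFrames built above.

```scala
import org.apache.spark.sql.functions.{array, explode, lit, rand}

val nSalts = spark.conf.get("spark.sql.shuffle.partitions").toInt

// skewed side: one random salt per row, spreading the hot key over nSalts buckets
val skewSalted = dfSkew.withColumn("salt", (rand() * nSalts).cast("int"))

// uniform side: replicate each row once per salt value
val uniformSalted = dfUniform
  .withColumn("salt", explode(array((0 until nSalts).map(lit): _*)))

// join on the composite key so no single partition receives all of the hot key
val joined = skewSalted.join(uniformSalted, Seq("value", "salt"), "inner")
println(joined.count())
```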

Thanks

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 16 May 2023 at 18:07, Mich Talebzadeh 
wrote:

> Hi,
>

Spark shuffle and inevitability of writing to Disk

2023-05-16 Thread Mich Talebzadeh
Hi,

On the issue of Spark shuffle, it is accepted that a shuffle *often involves*
the following, if not all of the below:

   - Disk I/O
   - Data serialization and deserialization
   - Network I/O

Excluding the external shuffle service, and without relying on the configuration
options provided by Spark for shuffle, does the operation always involve
disk usage (any HCFS-compatible file system), or will it use the existing
persistent memory if it can?

Thanks

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
Couple of points

Why use spot or pre-emptible instances when your application, as you stated,
shuffles heavily?
Have you looked at why you are having these shuffles? What is the cause of
these large transformations ending up in a shuffle?

Also on your point:
"..then ideally we should expect that when an executor is killed/OOM'd and
a new executor is spawned on the same host, the new executor registers the
shuffle files to itself. Is that so?"

What guarantee is there that the new executor with the inherited shuffle files
will succeed?

Also OOM is often associated with some form of skewed data

HTH
.
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 15 May 2023 at 13:11, Faiz Halde 
wrote:

> Hello,
>
> We've been in touch with a few spark specialists who suggested us a
> potential solution to improve the reliability of our jobs that are shuffle
> heavy
>
> Here is what our setup looks like
>
>- Spark version: 3.3.1
>- Java version: 1.8
>- We do not use external shuffle service
>- We use spot instances
>
> We run spark jobs on clusters that use Amazon EBS volumes. The
> spark.local.dir is mounted on this EBS volume. One of the offerings from
> the service we use is EBS migration which basically means if a host is
> about to get evicted, a new host is created and the EBS volume is attached
> to it
>
> When Spark assigns a new executor to the newly created instance, it
> basically can recover all the shuffle files that are already persisted in
> the migrated EBS volume
>
> Is this how it works? Do executors recover / re-register the shuffle files
> that they found?
>
> So far I have not come across any recovery mechanism. I can only see
>
> KubernetesLocalDiskShuffleDataIO
>
>  that has a pre-init step where it tries to register the available shuffle
> files to itself
>
> A natural follow-up on this,
>
> If what they claim is true, then ideally we should expect that when an
> executor is killed/OOM'd and a new executor is spawned on the same host,
> the new executor registers the shuffle files to itself. Is that so?
>
> Thanks
>
> --
> Confidentiality note: This e-mail may contain confidential information
> from Nu Holdings Ltd and/or its affiliates. If you have received it by
> mistake, please let us know by e-mail reply and delete it from your system;
> you may not copy this message or disclose its contents to anyone; for
> details about what personal information we collect and why, please refer to
> our privacy policy
> <https://api.mziq.com/mzfilemanager/v2/d/59a081d2-0d63-4bb5-b786-4c07ae26bc74/6f4939b9-5f74-a528-1835-596b481dca54>
> .
>


Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When I run this job in local mode  spark-submit --master local[4]

with

spark = SparkSession.builder \
.appName("tests") \
.enableHiveSupport() \
.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
df3.explain(extended=True)

and no caching

I see this plan

== Parsed Logical Plan ==
'Join UsingJoin(Inner, [index])
:- Relation [index#0,0#1] csv
+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS avg(0)#7]
   +- Relation [index#11,0#12] csv

== Analyzed Logical Plan ==
index: string, 0: string, avg(0): double
Project [index#0, 0#1, avg(0)#7]
+- Join Inner, (index#0 = index#11)
   :- Relation [index#0,0#1] csv
   +- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
avg(0)#7]
  +- Relation [index#11,0#12] csv

== Optimized Logical Plan ==
Project [index#0, 0#1, avg(0)#7]
+- Join Inner, (index#0 = index#11)
   :- Filter isnotnull(index#0)
   :  +- Relation [index#0,0#1] csv
   +- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
avg(0)#7]
  +- Filter isnotnull(index#11)
 +- Relation [index#11,0#12] csv

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [index#0, 0#1, avg(0)#7]
   +- BroadcastHashJoin [index#0], [index#11], Inner, BuildRight, false
  :- Filter isnotnull(index#0)
  :  +- FileScan csv [index#0,0#1] Batched: false, DataFilters:
[isnotnull(index#0)], Format: CSV, Location: InMemoryFileIndex(1
paths)[hdfs://rhes75:9000/tmp/df1.csv], PartitionFilters: [],
PushedFilters: [IsNotNull(index)], ReadSchema: struct
  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0,
string, true]),false), [plan_id=174]
 +- HashAggregate(keys=[index#11], functions=[avg(cast(0#12 as
double))], output=[index#11, avg(0)#7])
+- Exchange hashpartitioning(index#11, 200),
ENSURE_REQUIREMENTS, [plan_id=171]
   +- HashAggregate(keys=[index#11],
functions=[partial_avg(cast(0#12 as double))], output=[index#11, sum#28,
count#29L])
  +- Filter isnotnull(index#11)
 +- FileScan csv [index#11,0#12] Batched: false,
DataFilters: [isnotnull(index#11)], Format: CSV, Location:
InMemoryFileIndex(1 paths)[hdfs://rhes75:9000/tmp/df1.csv],
PartitionFilters: [], PushedFilters: [IsNotNull(index)], ReadSchema:
struct


so two in-memory file scans for the CSV file. So it caches the data already,
given the small result set. Do you see this?
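
If the goal is to guarantee a single scan of the file, the usual approach (also suggested
later in this thread) is an explicit cache on df1; a sketch (Scala, untested), reusing the
schema and HDFS path from the plan above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()

val schema = StructType(Seq(
  StructField("index", StringType, nullable = true),
  StructField("0", StringType, nullable = true)))

val df1 = spark.read
  .schema(schema)
  .option("header", "true")
  .csv("hdfs://rhes75:9000/tmp/df1.csv")   // path taken from the plan above
  .cache()                                 // materialised on first use, reused afterwards

val df2 = df1.groupBy("index").agg(avg("0").as("avg(0)"))
val df3 = df1.join(df2, "index")

// With the cache in place, the plan should show one CSV scan feeding an
// InMemoryTableScan reused by both sides of the join, instead of two scans.
df3.count()
```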

HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 7 May 2023 at 17:48, Nitin Siwach  wrote:

> Thank you for the help Mich :)
>
> I have not started with a pandas DF. I have used pandas to create a dummy
> .csv which I dump on the disk that I intend to use to showcase my pain
> point. Providing pandas code was to ensure an end-to-end runnable example
> is provided and the effort on anyone trying to help me out is minimized
>
> I don't think Spark validating the file existence qualifies as an action
> according to Spark parlance. Sure there would be an analysis exception in
> case the file is not found as per the location provided, however, if you
> provided a schema and a valid path then no job would show up on the spark
> UI validating (IMO) that no action has been taken. (1 Action necessarily
> equals at least one job). If you don't provide the schema then a job is
> triggered (an action) to infer the schema for subsequent logical planning.
>
> Since I am just demonstrating my lack of understanding I have chosen local
> mode. Otherwise, I do use google buckets to host all the data
>
> This being said I think my question is something entirely different. It is
> that calling one action  (df3.count()) is reading the same csv twice. I do
> not understand that. So far, I always thought that data should be persisted
> only in case a DAG subset is to be reused by several actions.
>
>
> On Sun, May 7, 2023 at 9:47 PM Mich Talebzadeh 
> wrote:
>
>> You have started with panda DF which won't scale outside of the driver
>> itself.
>>
>> Let us put that aside.
>> df1.to_csv("./df1.csv",index_label = "index")  ## write the dataframe to
>> the underlying file system
>>
>> starting with spark
>>
>> df1 = spark.read.csv("./df1.csv", header=True, schema = schema) ## read
>> the da

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When you run this in YARN mode, it uses a Broadcast Hash Join for the join
operation, as shown in the following output. The datasets here are the same
size, so it broadcasts one dataset to all of the executors and then reads
the same dataset and does a hash join.

This is typical of joins. No surprises here. It has to read it twice to
perform this operation. HJ was not invented by Spark; it has been around
in databases for years, along with NLJ and MJ.

[image: image.png]

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 8 May 2023 at 09:38, Nitin Siwach  wrote:

> I do not think InMemoryFileIndex means it is caching the data. The caches
> get shown as InMemoryTableScan. InMemoryFileIndex is just for partition
> discovery and partition pruning.
> Any read will always show up as a scan from InMemoryFileIndex. It is not
> cached data. It is a cached file index. Please correct my understanding if
> I am wrong
>
> Even the following code shows a scan from an InMemoryFileIndex
> ```
> df1 = spark.read.csv("./df1.csv", header=True, schema = schema)
> df1.explain(mode = "extended")
> ```
>
> output:
> ```
>
> == Parsed Logical Plan ==
> Relation [index#50,0#51] csv
>
> == Analyzed Logical Plan ==
> index: string, 0: string
> Relation [index#50,0#51] csv
>
> == Optimized Logical Plan ==
> Relation [index#50,0#51] csv
>
> == Physical Plan ==
> FileScan csv [index#50,0#51] Batched: false, DataFilters: [], Format: CSV, 
> Location: InMemoryFileIndex(1 paths)[file:/home/nitin/work/df1.csv], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>
> ```
>
> On Mon, May 8, 2023 at 1:07 AM Mich Talebzadeh 
> wrote:
>
>> When I run this job in local mode  spark-submit --master local[4]
>>
>> with
>>
>> spark = SparkSession.builder \
>> .appName("tests") \
>> .enableHiveSupport() \
>> .getOrCreate()
>> spark.conf.set("spark.sql.adaptive.enabled", "true")
>> df3.explain(extended=True)
>>
>> and no caching
>>
>> I see this plan
>>
>> == Parsed Logical Plan ==
>> 'Join UsingJoin(Inner, [index])
>> :- Relation [index#0,0#1] csv
>> +- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS avg(0)#7]
>>+- Relation [index#11,0#12] csv
>>
>> == Analyzed Logical Plan ==
>> index: string, 0: string, avg(0): double
>> Project [index#0, 0#1, avg(0)#7]
>> +- Join Inner, (index#0 = index#11)
>>:- Relation [index#0,0#1] csv
>>+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
>> avg(0)#7]
>>   +- Relation [index#11,0#12] csv
>>
>> == Optimized Logical Plan ==
>> Project [index#0, 0#1, avg(0)#7]
>> +- Join Inner, (index#0 = index#11)
>>:- Filter isnotnull(index#0)
>>:  +- Relation [index#0,0#1] csv
>>+- Aggregate [index#11], [index#11, avg(cast(0#12 as double)) AS
>> avg(0)#7]
>>   +- Filter isnotnull(index#11)
>>  +- Relation [index#11,0#12] csv
>>
>> == Physical Plan ==
>> AdaptiveSparkPlan isFinalPlan=false
>> +- Project [index#0, 0#1, avg(0)#7]
>>+- BroadcastHashJoin [index#0], [index#11], Inner, BuildRight, false
>>   :- Filter isnotnull(index#0)
>>   :  +- FileScan csv [index#0,0#1] Batched: false, DataFilters:
>> [isnotnull(index#0)], Format: CSV, Location: InMemoryFileIndex(1
>> paths)[hdfs://rhes75:9000/tmp/df1.csv], PartitionFilters: [],
>> PushedFilters: [IsNotNull(index)], ReadSchema:
>> struct
>>   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0,
>> string, true]),false), [plan_id=174]
>>  +- HashAggregate(keys=[index#11], functions=[avg(cast(0#12 as
>> double))], output=[index#11, avg(0)#7])
>> +- Exchange hashpartitioning(index#11, 200),
>> ENSURE_REQUIREMENTS, [plan_id=171]
>>+- HashAggregate(keys=[index#11],
>> functions=[partial_avg(cast(0#12 as double))], output=[index#11, sum#28,
>> count#29L])
>>   +- Filter isnotnull(index#11)
>>  +- FileScan csv [index#11,0#12] Batched: fa

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
You have started with a pandas DF, which won't scale outside of the driver
itself.

Let us put that aside.
df1.to_csv("./df1.csv",index_label = "index")  ## write the dataframe to
the underlying file system

starting with spark

df1 = spark.read.csv("./df1.csv", header=True, schema = schema) ## read the
dataframe from the underlying file system

That is your first action, because Spark needs to validate that the file
exists and check the schema. What will happen if that file does not exist?

csvlocation="/tmp/df1.csv"
csvlocation2="/tmp/df5.csv"
df1=  pd.DataFrame(np.arange(1_000).reshape(-1,1))
df1.index = np.random.choice(range(10),size=1000)
df1.to_csv(csvlocation,index_label = "index")
Schema = StructType([StructField('index', StringType(), True),
 StructField('0', StringType(), True)])

df1 = spark.read.csv(csvlocation2, header=True, schema = Schema).cache()
## incorrect location

df2 = df1.groupby("index").agg(F.mean("0"))
df3 = df1.join(df2,on='index')
df3.show()
#df3.explain()
df3.count()


error

pyspark.errors.exceptions.captured.AnalysisException: [PATH_NOT_FOUND] Path
does not exist: hdfs://rhes75:9000/tmp/df5.csv.

In a distributed environment, that CSV file has to be available to all Spark
workers. Either you copy that file to all worker nodes, or you put it in HDFS,
S3 or gs:// locations so it is available to all.

It does not even get to df3.count()

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 7 May 2023 at 15:53, Nitin Siwach  wrote:

> Thank you for your response, Sir.
>
> My understanding is that the final ```df3.count()``` is the only action in
> the code I have attached. In fact, I tried running the rest of the code
> (commenting out just the final df3.count()) and, as I expected, no
> computations were triggered
>
> On Sun, 7 May, 2023, 20:16 Mich Talebzadeh, 
> wrote:
>
>>
>> ...However, In my case here I am calling just one action. ..
>>
>> OK, which line in your code is the one action being called?
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sun, 7 May 2023 at 14:13, Nitin Siwach  wrote:
>>
>>> @Vikas Kumar 
>>> I am sorry but I thought that you had answered the other question that I
>>> had raised to the same email address yesterday. It was around the SQL tab
>>> in web UI and the output of .explain showing different plans.
>>>
>>> I get how using .cache I can ensure that the data from a particular
>>> checkpoint is reused and the computations do not happen again.
>>>
>>> However, In my case here I am calling just one action. Within the
>>> purview of one action Spark should not rerun the overlapping parts of the
>>> DAG. I do not understand why the file scan is happening several times. I
>>> can easily mitigate the issue by using window functions and creating all
>>> the columns in one go without having to use several joins later on. That
>>> being said this particular behavior is what I am trying ot understand. The
>>> golden rule "The DAG overlaps wont run several times for one action" seems
>>> not to be apocryphal. If you can shed some light on this matter I would
>>> appreciate it
>>>
>>> @weiruanl...@gmail.com  My datasets are very
>>> small as you can see in the sample examples that I am creating as the first
>>> part of the code
>>>
>>> Really appreciate you guys helping me out with this :)
>>>
>>> O

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
...However, In my case here I am calling just one action. ..

OK, which line in your code is the one action being called?


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 7 May 2023 at 14:13, Nitin Siwach  wrote:

> @Vikas Kumar 
> I am sorry but I thought that you had answered the other question that I
> had raised to the same email address yesterday. It was around the SQL tab
> in web UI and the output of .explain showing different plans.
>
> I get how using .cache I can ensure that the data from a particular
> checkpoint is reused and the computations do not happen again.
>
> However, In my case here I am calling just one action. Within the purview
> of one action Spark should not rerun the overlapping parts of the DAG. I do
> not understand why the file scan is happening several times. I can easily
> mitigate the issue by using window functions and creating all the columns
> in one go without having to use several joins later on. That being said
> this particular behavior is what I am trying ot understand. The golden rule
> "The DAG overlaps wont run several times for one action" seems not to be
> apocryphal. If you can shed some light on this matter I would appreciate it
>
> @weiruanl...@gmail.com  My datasets are very small
> as you can see in the sample examples that I am creating as the first part
> of the code
>
> Really appreciate you guys helping me out with this :)
>
> On Sun, May 7, 2023 at 12:23 PM Winston Lai  wrote:
>
>> When your memory is not sufficient to keep the cached data for your jobs
>> in two different stages, it might be read twice because Spark might have to
>> clear the previous cache for other jobs. In those cases, a spill may be
>> triggered when Spark writes your data from memory to disk.
>>
>> One way to to check is to read Spark UI. When Spark cache the data, you
>> will see a little green dot connected to the blue rectangle in the Spark
>> UI. If you see this green dot twice on your two stages, likely Spark spill
>> the data after your first job and read it again in the second run. You can
>> also confirm it in other metrics from Spark UI.
>>
>> That is my personal understanding based on what I have read and seen on
>> my job runs. If there is any mistake, be free to correct me.
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* Nitin Siwach 
>> *Sent:* Sunday, May 7, 2023 12:22:32 PM
>> *To:* Vikas Kumar 
>> *Cc:* User 
>> *Subject:* Re: Does spark read the same file twice, if two stages are
>> using the same DataFrame?
>>
>> Thank you tons, Vikas :). That makes so much sense now
>>
>> I'm in learning phase and was just browsing through various concepts of
>> spark with self made small examples.
>>
>> It didn't make sense to me that the two physical plans should be
>> different. But, now I understand what you're saying.
>>
>> Again, thank you for helping me out
>>
>> On Sun, 7 May, 2023, 07:48 Vikas Kumar,  wrote:
>>
>>
>> Spark came up with a plan, but that may or may not be an optimal plan given
>> the system settings.
>> If you do df1.cache(), I am guessing Spark will not read df1 twice.
>>
>> Btw, why do you have adaptive query execution set to false?
>>
>> On Sat, May 6, 2023, 1:46 PM Nitin Siwach  wrote:
>>
>> I hope this email finds you well :)
>>
>> The following code reads the same csv twice even though only one action
>> is called
>>
>> End to end runnable example:
>> ```
>> import pandas as pd
>> import numpy as np
>>
>> df1=  pd.DataFrame(np.arange(1_000).reshape(-1,1))
>> df1.index = np.random.choice(range(10),size=1000)
>> df1.to_csv("./df1.csv",index_label = "index")
>>
>> 
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql import functions as F
>> from pyspark.sql.types import StructType, StringType, StructField
>>
>> spark =
>> SparkSession.builder.config("spark.sql.autoBroadcastJoinThreshold", ...
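
A minimal, untested sketch of the cache() suggestion above, assuming the toy
df1.csv from this thread; the two downstream branches and the output path are
illustrative only, not the original code:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# cache the source so both branches below reuse one scan instead of
# re-reading df1.csv (assumption: this mirrors the intent of the thread)
df1 = spark.read.option("header", True).csv("./df1.csv")
df1.cache()

counts = df1.groupBy("index").count()        # first branch over df1
joined = df1.join(counts, on="index")        # second branch reuses the cached data

# single action; with the cache in place the CSV should be scanned once
joined.write.mode("overwrite").parquet("/tmp/joined_out")
```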

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
You can create a DataFrame from your SQL result set and work with that in
Python the way you want.

## you don't need all those extra imports; a SparkSession is enough
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sqltext = """
SELECT aggregate(array(1, 2, 3, 4),
                 named_struct('sum', 0, 'cnt', 0),
                 (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
                 acc -> acc.sum / acc.cnt) AS avg
"""
df = spark.sql(sqltext)
df.printSchema()

root
 |-- avg: double (nullable = true)


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 5 May 2023 at 20:33, Yong Zhang  wrote:

> Hi, This is on Spark 3.1 environment.
>
> For some reason, I can ONLY do this in Spark SQL, instead of either Scala
> or PySpark environment.
>
> I want to aggregate an array into a Map of element count, within that
> array, but in Spark SQL.
> I know that there is an aggregate function available like
>
> aggregate(expr, start, merge [, finish])
>
>
> But I want to know if this can be done in the Spark SQL only, and:
>
>- How to represent an empty Map as "start" element above
>- How to merge each element (as String type) into Map (as adding count
>if exist in the Map, or add as (element -> 1) as new entry in the Map if
>not exist)
>
> Like the following example ->
> https://docs.databricks.com/sql/language-manual/functions/aggregate.html
>
> SELECT aggregate(array(1, 2, 3, 4),
>                  named_struct('sum', 0, 'cnt', 0),
>                  (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
>                  acc -> acc.sum / acc.cnt) AS avg
>
>
> I wonder:
> select
>   aggregate(
>     array('a','b','a'),
>     map('', 0),
>     (acc, x) -> ???,
>     acc -> acc) as output
>
> How to do the logic after "(acc, x) -> ", so I can output a map of count
> of each element in the array?
> I know I can "explode", then groupBy + count, but since I have multiple array
> columns to transform, I want to do it in a higher-order-function way,
> and in pure Spark SQL.
>
> Thanks
>
>
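
For the element-count part of the question, an untested sketch that stays in
pure Spark SQL (wrapped in PySpark only to run it); the empty-map start value
and the merge lambda are assumptions, relying on the Spark 3.x built-ins
map_concat, map_filter and element_at:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Untested sketch: count array elements into a map using only SQL
# higher-order functions.
spark.sql("""
SELECT aggregate(
         array('a', 'b', 'a'),
         cast(map() as map<string, int>),                  -- empty map as the start value
         (acc, x) -> map_concat(
                       map_filter(acc, (k, v) -> k != x),           -- drop any old entry for x
                       map(x, coalesce(element_at(acc, x), 0) + 1)  -- re-add x with count + 1
                     )
       ) AS counts
""").show(truncate=False)
# expected something like: {a -> 2, b -> 1}
```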


Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-06 Thread Mich Talebzadeh
So what are you intending to do with the resultset produced?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 5 May 2023 at 15:06, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> Hi Mich,
>
> Thank you. Ah, I want to avoid bringing all data to the driver node. That
> is my understanding of what will happen in that case. Perhaps, I'll trigger
> a Lambda to rename/combine the files after PySpark writes them.
>
> Cheers,
> Marco.
>
> On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh 
> wrote:
>
>> you can try
>>
>> df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")
>>
>> hdfs dfs -ls /tmp/pairs.json
>> Found 2 items
>> -rw-r--r--   3 hduser supergroup  0 2023-05-04 22:21
>> /tmp/pairs.json/_SUCCESS
>> -rw-r--r--   3 hduser supergroup 96 2023-05-04 22:21
>> /tmp/pairs.json/part-0-21f12540-c1c6-441d-a9b2-a82ce2113853-c000.json
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 4 May 2023 at 22:14, Marco Costantini <
>> marco.costant...@rocketfncl.com> wrote:
>>
>>> Hi Mich,
>>> Thank you.
>>> Are you saying this satisfies my requirement?
>>>
>>> On the other hand, I am smelling something going on. Perhaps the Spark
>>> 'part' files should not be thought of as files, but rather pieces of a
>>> conceptual file. If that is true, then your approach (of which I'm well
>>> aware) makes sense. Question: what are some good methods, tools, for
>>> combining the parts into a single, well-named file? I imagine that is
>>> outside of the scope of PySpark, but any advice is welcome.
>>>
>>> Thank you,
>>> Marco.
>>>
>>> On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> AWS S3, or Google gs are hadoop compatible file systems (HCFS) , so
>>>> they do sharding to improve read performance when writing to HCFS file
>>>> systems.
>>>>
>>>> Let us take your code for a drive
>>>>
>>>> import findspark
>>>> findspark.init()
>>>> from pyspark.sql import SparkSession
>>>> from pyspark.sql.functions import struct
>>>> from pyspark.sql.types import *
>>>> spark = SparkSession.builder \
>>>> .getOrCreate()
>>>> pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
>>>> Schema = StructType([ StructField("ID", IntegerType(), False),
>>>>   StructField("datA" , StringType(), True)])
>>>> df = spark.createDataFrame(data=pairs,schema=Schema)
>>>> df.printSchema()
>>>> df.show()
>>>> df2 = df.select(df.ID.alias("ID"), struct(df.datA).alias("Struct"))
>>>> df2.printSchema()
>>>> df2.show()
>>>> df2.write.mode("overwrite").json("/tmp/pairs.json")
>>>>
>>>> root
>>>>  |-- ID: integer (nullable = false)
>>>>  |-- datA: string (nullable = true)
>>>>
>>>> +---++
>>>> | ID|datA|
>>>> +---++
>>>> |  1|  a1|
>>>> |  2|  a2|
>>>> |  3|  a3|
>>>> +---++
>>>>
>>>> root
>>>>  |-- ID: integer (nullable = false)
>

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
you can try

df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")

hdfs dfs -ls /tmp/pairs.json
Found 2 items
-rw-r--r--   3 hduser supergroup  0 2023-05-04 22:21
/tmp/pairs.json/_SUCCESS
-rw-r--r--   3 hduser supergroup 96 2023-05-04 22:21
/tmp/pairs.json/part-0-21f12540-c1c6-441d-a9b2-a82ce2113853-c000.json

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 4 May 2023 at 22:14, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> Hi Mich,
> Thank you.
> Are you saying this satisfies my requirement?
>
> On the other hand, I am smelling something going on. Perhaps the Spark
> 'part' files should not be thought of as files, but rather pieces of a
> conceptual file. If that is true, then your approach (of which I'm well
> aware) makes sense. Question: what are some good methods, tools, for
> combining the parts into a single, well-named file? I imagine that is
> outside of the scope of PySpark, but any advice is welcome.
>
> Thank you,
> Marco.
>
> On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh 
> wrote:
>
>> AWS S3, or Google gs are hadoop compatible file systems (HCFS) , so they
>> do sharding to improve read performance when writing to HCFS file systems.
>>
>> Let us take your code for a drive
>>
>> import findspark
>> findspark.init()
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import struct
>> from pyspark.sql.types import *
>> spark = SparkSession.builder \
>> .getOrCreate()
>> pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
>> Schema = StructType([ StructField("ID", IntegerType(), False),
>>   StructField("datA" , StringType(), True)])
>> df = spark.createDataFrame(data=pairs,schema=Schema)
>> df.printSchema()
>> df.show()
>> df2 = df.select(df.ID.alias("ID"), struct(df.datA).alias("Struct"))
>> df2.printSchema()
>> df2.show()
>> df2.write.mode("overwrite").json("/tmp/pairs.json")
>>
>> root
>>  |-- ID: integer (nullable = false)
>>  |-- datA: string (nullable = true)
>>
>> +---++
>> | ID|datA|
>> +---++
>> |  1|  a1|
>> |  2|  a2|
>> |  3|  a3|
>> +---++
>>
>> root
>>  |-- ID: integer (nullable = false)
>>  |-- Struct: struct (nullable = false)
>>  ||-- datA: string (nullable = true)
>>
>> +---+--+
>> | ID|Struct|
>> +---+--+
>> |  1|  {a1}|
>> |  2|  {a2}|
>> |  3|  {a3}|
>> +---+--+
>>
>> Look at the last line where json format is written
>> df2.write.mode("overwrite").json("/tmp/pairs.json")
>> Under the bonnet this happens
>>
>> hdfs dfs -ls /tmp/pairs.json
>> Found 5 items
>> -rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
>> /tmp/pairs.json/_SUCCESS
>> -rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
>> /tmp/pairs.json/part-0-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
>> /tmp/pairs.json/part-1-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
>> /tmp/pairs.json/part-2-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>> -rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
>> /tmp/pairs.json/part-3-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaim

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
AWS S3, or Google gs are hadoop compatible file systems (HCFS) , so they do
sharding to improve read performance when writing to HCFS file systems.

Let us take your code for a drive

import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
from pyspark.sql.types import *
spark = SparkSession.builder \
.getOrCreate()
pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
Schema = StructType([ StructField("ID", IntegerType(), False),
  StructField("datA" , StringType(), True)])
df = spark.createDataFrame(data=pairs,schema=Schema)
df.printSchema()
df.show()
df2 = df.select(df.ID.alias("ID"), struct(df.datA).alias("Struct"))
df2.printSchema()
df2.show()
df2.write.mode("overwrite").json("/tmp/pairs.json")

root
 |-- ID: integer (nullable = false)
 |-- datA: string (nullable = true)

+---++
| ID|datA|
+---++
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---++

root
 |-- ID: integer (nullable = false)
 |-- Struct: struct (nullable = false)
 ||-- datA: string (nullable = true)

+---+--+
| ID|Struct|
+---+--+
|  1|  {a1}|
|  2|  {a2}|
|  3|  {a3}|
+---+--+

Look at the last line where json format is written
df2.write.mode("overwrite").json("/tmp/pairs.json")
Under the bonnet this happens

hdfs dfs -ls /tmp/pairs.json
Found 5 items
-rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
/tmp/pairs.json/_SUCCESS
-rw-r--r--   3 hduser supergroup  0 2023-05-04 21:53
/tmp/pairs.json/part-0-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
-rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
/tmp/pairs.json/part-1-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
-rw-r--r--   3 hduser supergroup 32 2023-05-04 21:53
/tmp/pairs.json/part-2-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
-rw-r--r--   3 hduser supergroup     32 2023-05-04 21:53
/tmp/pairs.json/part-3-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 4 May 2023 at 21:38, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> Hello,
>
> I am testing writing my DataFrame to S3 using the DataFrame `write`
> method. It mostly does a great job. However, it fails one of my
> requirements. Here are my requirements.
>
> - Write to S3
> - use `partitionBy` to automatically make folders based on my chosen
> partition columns
> - control the resultant filename (whole or in part)
>
> I can get the first two requirements met but not the third.
>
> Here's an example. When I use the commands...
>
> df.write.partitionBy("year","month").mode("append")\
> .json('s3a://bucket_name/test_folder/')
>
> ... I get the partitions I need. However, the filenames are something
> like:part-0-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json
>
>
> Now, I understand Spark's need to include the partition number in the
> filename. However, it sure would be nice to control the rest of the file
> name.
>
>
> Any advice? Please and thank you.
>
> Marco.
>
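
On the third requirement (controlling the file name), one common workaround is
to rename the part files after the write. Below is an untested sketch that uses
the Hadoop FileSystem API through Spark's private JVM gateway; the bucket,
folder and target name are illustrative, and it assumes one part file per
partition directory:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df is assumed to be the DataFrame from this thread
base = "s3a://bucket_name/test_folder"
df.write.partitionBy("year", "month").mode("append").json(base)

jvm = spark.sparkContext._jvm
conf = spark.sparkContext._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path
fs = Path(base).getFileSystem(conf)

# collect the part files first, then rename them in place
part_files = []
it = fs.listFiles(Path(base), True)           # True = recursive listing
while it.hasNext():
    p = it.next().getPath()
    if p.getName().startswith("part-"):
        part_files.append(p)

for p in part_files:
    # chosen name; assumes a single part file per partition directory
    fs.rename(p, Path(p.getParent(), "orders.json"))
```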


Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Mich Talebzadeh
Is this all of your writeStream?

df.writeStream()
.foreachBatch(new KafkaS3PipelineImplementation(applicationId, appConfig))
.start()
.awaitTermination();

What happened to the checkpoint location?

option('checkpointLocation', checkpoint_path).

example

 checkpoint_path = "file:///ssd/hduser/MDBatchBQ/chkpt"


ls -l  /ssd/hduser/MDBatchBQ/chkpt
total 24
-rw-r--r--. 1 hduser hadoop   45 Mar  1 09:27 metadata
drwxr-xr-x. 5 hduser hadoop 4096 Mar  1 09:27 .
drwxr-xr-x. 4 hduser hadoop 4096 Mar  1 10:31 ..
drwxr-xr-x. 3 hduser hadoop 4096 Apr 22 11:27 sources
drwxr-xr-x. 2 hduser hadoop 4096 Apr 24 11:09 offsets
drwxr-xr-x. 2 hduser hadoop 4096 Apr 24 11:09 commits

so you can see what is going on
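
For reference, a PySpark-flavoured sketch of the same sink with an explicit
checkpoint location (the Java pipeline in this thread would take the equivalent
.option call); the path and the process_batch function are placeholders, not
the original code:

```
def process_batch(batch_df, batch_id):
    # placeholder for the KafkaS3PipelineImplementation logic in this thread
    batch_df.write.mode("append").json("s3a://my-bucket/output/")  # illustrative sink

checkpoint_path = "s3a://my-bucket/checkpoints/kafka-s3-pipeline"  # illustrative path

# df is the Kafka readStream built as in this thread
query = (df.writeStream
           .option("checkpointLocation", checkpoint_path)
           .foreachBatch(process_batch)
           .start())
query.awaitTermination()
```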

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 27 Apr 2023 at 15:46, Abhishek Singla 
wrote:

> Hi Team,
>
> I am using Spark Streaming to read from Kafka and write to S3.
>
> Version: 3.1.2
> Scala Version: 2.12
> Spark Kafka connector: spark-sql-kafka-0-10_2.12
>
> Dataset df =
> spark
> .readStream()
> .format("kafka")
> .options(appConfig.getKafka().getConf())
> .load()
> .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
>
> df.writeStream()
> .foreachBatch(new KafkaS3PipelineImplementation(applicationId, appConfig))
> .start()
> .awaitTermination();
>
> kafka.conf = {
>"kafka.bootstrap.servers": "localhost:9092",
>"subscribe": "test-topic",
>"minOffsetsPerTrigger": 1000,
>"maxOffsetsPerTrigger": 1100,
>"maxTriggerDelay": "15m",
>"groupIdPrefix": "test",
>"startingOffsets": "latest",
>"includeHeaders": true,
>"failOnDataLoss": false
>   }
>
> spark.conf = {
>"spark.master": "spark://localhost:7077",
>"spark.app.name": "app",
>"spark.sql.streaming.kafka.useDeprecatedOffsetFetching": false,
>"spark.sql.streaming.metricsEnabled": true
>  }
>
>
> But these configs do not seem to be working as I can see Spark processing
> batches of 3k-15k immediately one after another. Is there something I am
> missing?
>
> Ref:
> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
>
> Regards,
> Abhishek Singla
>
>
>
>
>
>
>
>
>


Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Again, one try is worth many opinions. Try it, gather metrics from the Spark
UI and see how it performs.

On Wed, 26 Apr 2023 at 14:57, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> Thanks team,
> Email was just an example. The point was to illustrate that some actions
> could be chained using Spark's foreach. In reality, this is an S3 write and
> a Kafka message production, which I think is quite reasonable for spark to
> do.
>
> To answer Ayan's first question: yes, all of a user's orders, prepared for
> each and every user.
>
> Other than the remarks that email transmission is unwise (which I've now
> reminded is irrelevant) I am not seeing an alternative to using Spark's
> foreach. Unless, your proposal is for the Spark job to target 1 user, and
> just run the job 1000's of times taking the user_id as input. That doesn't
> sound attractive.
>
> Also, while we say that foreach is not optimal, I cannot find any evidence
> of it; neither here nor online. If there are any docs about the inner
> workings of this functionality, please pass them to me. I continue to
> search for them. Even late last night!
>
> Thanks for your help team,
> Marco.
>
> On Wed, Apr 26, 2023 at 6:21 AM Mich Talebzadeh 
> wrote:
>
>> Indeed very valid points by Ayan. How is email going to handle 1000s of
>> records? As a solution architect I tend to replace users by customers, and
>> for each order there must be products, a sort of many-to-many relationship. If
>> I were a customer I would also be interested in product details as
>> well. Sending via email sounds like a Jurassic Park solution 
>>
>> On Wed, 26 Apr 2023 at 10:24, ayan guha  wrote:
>>
>>> Adding to what Mitch said,
>>>
>>> 1. Are you trying to send statements of all orders to all users? Or the
>>> latest order only?
>>>
>>> 2. Sending email is not a good use of spark. instead, I suggest to use a
>>> notification service or function. Spark should write to a queue (kafka,
>>> sqs...pick your choice here).
>>>
>>> Best regards
>>> Ayan
>>>
>>> On Wed, 26 Apr 2023 at 7:01 pm, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well OK in a nutshell you want the result set for every user prepared
>>>> and email to that user right.
>>>>
>>>> This is a form of ETL where those result sets need to be posted
>>>> somewhere. Say you create a table based on the result set prepared for each
>>>> user. You may have many raw target tables at the end of the first ETL. How
>>>> does this differ from using forEach? Performance wise forEach may not be
>>>> optimal.
>>>>
>>>> Can you take the sample tables and try your method?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 26 Apr 2023 at 04:10, Marco Costantini <
>>>> marco.costant...@rocketfncl.com> wrote:
>>>>
>>>>> Hi Mich,
>>>>> First, thank you for that. Great effort put into helping.
>>>>>
>>>>> Second, I don't think this tackles the technical challenge here. I
>>>>> understand the windowing as it serves those ranks you created, but I don't
>>>>> see how the ranks contribute to the solution.
>>>>> Third, the core of the challenge is about performing this kind of
>>>>> 'statement' but for all users. In this example we target Mich, but that
>>>>> reduces the complexity by a lot! In fact, a simple join and filter would
>>>>> solve that one.
>>>>>
>>>>> Any thoughts on that? For me, the foreach is desirable because 

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Indeed very valid points by Ayan. How is email going to handle 1000s of
records? As a solution architect I tend to replace users by customers, and
for each order there must be products, a sort of many-to-many relationship. If
I were a customer I would also be interested in product details as
well. Sending via email sounds like a Jurassic Park solution 

On Wed, 26 Apr 2023 at 10:24, ayan guha  wrote:

> Adding to what Mitch said,
>
> 1. Are you trying to send statements of all orders to all users? Or the
> latest order only?
>
> 2. Sending email is not a good use of spark. instead, I suggest to use a
> notification service or function. Spark should write to a queue (kafka,
> sqs...pick your choice here).
>
> Best regards
> Ayan
>
> On Wed, 26 Apr 2023 at 7:01 pm, Mich Talebzadeh 
> wrote:
>
>> Well OK in a nutshell you want the result set for every user prepared and
>> email to that user right.
>>
>> This is a form of ETL where those result sets need to be posted
>> somewhere. Say you create a table based on the result set prepared for each
>> user. You may have many raw target tables at the end of the first ETL. How
>> does this differ from using forEach? Performance wise forEach may not be
>> optimal.
>>
>> Can you take the sample tables and try your method?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 26 Apr 2023 at 04:10, Marco Costantini <
>> marco.costant...@rocketfncl.com> wrote:
>>
>>> Hi Mich,
>>> First, thank you for that. Great effort put into helping.
>>>
>>> Second, I don't think this tackles the technical challenge here. I
>>> understand the windowing as it serves those ranks you created, but I don't
>>> see how the ranks contribute to the solution.
>>> Third, the core of the challenge is about performing this kind of
>>> 'statement' but for all users. In this example we target Mich, but that
>>> reduces the complexity by a lot! In fact, a simple join and filter would
>>> solve that one.
>>>
>>> Any thoughts on that? For me, the foreach is desirable because I can
>>> have the workers chain other actions to each iteration (send email, send
>>> HTTP request, etc).
>>>
>>> Thanks Mich,
>>> Marco.
>>>
>>> On Tue, Apr 25, 2023 at 6:06 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi Marco,
>>>>
>>>> First thoughts.
>>>>
>>>> foreach() is an action operation that is to iterate/loop over each
>>>> element in the dataset, meaning cursor based. That is different from
>>>> operating over the dataset as a set which is far more efficient.
>>>>
>>>> So in your case as I understand it correctly, you want to get order for
>>>> each user (say Mich), convert the result set to json and send it to Mich
>>>> via email
>>>>
>>>> Let us try this based on sample data
>>>>
>>>> Put your csv files into HDFS directory
>>>>
>>>> hdfs dfs -put users.csv /data/stg/test
>>>> hdfs dfs -put orders.csv /data/stg/test
>>>>
>>>> Then create dataframes from csv files, create temp views and do a join
>>>> on result sets with some slicing and dicing on orders table
>>>>
>>>> #! /usr/bin/env python3
>>>> from __future__ import print_function
>>>> import sys
>>>> import findspark
>>>> findspark.init()
>>>> from pyspark.sql import SparkSession
>>>> from pyspark import SparkContext
>>>> from pyspark.sql import SQLContext, HiveContext
>>>> from pyspark.sql.window import Window
>>>>
>>>> def spark_session(appName):
>>>>   return SparkSession.builder \
>>>> .appName(appName) \
>>>>   

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Well, OK, in a nutshell you want the result set for every user prepared and
emailed to that user, right?

This is a form of ETL where those result sets need to be posted somewhere.
Say you create a table based on the result set prepared for each user. You
may have many raw target tables at the end of the first ETL. How does this
differ from using forEach? Performance-wise, forEach may not be optimal.

Can you take the sample tables and try your method?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 26 Apr 2023 at 04:10, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> Hi Mich,
> First, thank you for that. Great effort put into helping.
>
> Second, I don't think this tackles the technical challenge here. I
> understand the windowing as it serves those ranks you created, but I don't
> see how the ranks contribute to the solution.
> Third, the core of the challenge is about performing this kind of
> 'statement' but for all users. In this example we target Mich, but that
> reduces the complexity by a lot! In fact, a simple join and filter would
> solve that one.
>
> Any thoughts on that? For me, the foreach is desirable because I can have
> the workers chain other actions to each iteration (send email, send HTTP
> request, etc).
>
> Thanks Mich,
> Marco.
>
> On Tue, Apr 25, 2023 at 6:06 PM Mich Talebzadeh 
> wrote:
>
>> Hi Marco,
>>
>> First thoughts.
>>
>> foreach() is an action operation that is to iterate/loop over each
>> element in the dataset, meaning cursor based. That is different from
>> operating over the dataset as a set which is far more efficient.
>>
>> So in your case as I understand it correctly, you want to get order for
>> each user (say Mich), convert the result set to json and send it to Mich
>> via email
>>
>> Let us try this based on sample data
>>
>> Put your csv files into HDFS directory
>>
>> hdfs dfs -put users.csv /data/stg/test
>> hdfs dfs -put orders.csv /data/stg/test
>>
>> Then create dataframes from csv files, create temp views and do a join on
>> result sets with some slicing and dicing on orders table
>>
>> #! /usr/bin/env python3
>> from __future__ import print_function
>> import sys
>> import findspark
>> findspark.init()
>> from pyspark.sql import SparkSession
>> from pyspark import SparkContext
>> from pyspark.sql import SQLContext, HiveContext
>> from pyspark.sql.window import Window
>>
>> def spark_session(appName):
>>   return SparkSession.builder \
>> .appName(appName) \
>> .enableHiveSupport() \
>> .getOrCreate()
>>
>> def main():
>> appName = "ORDERS"
>> spark =spark_session(appName)
>> # get the sample
>> users_file="hdfs://rhes75:9000/data/stg/test/users.csv"
>> orders_file="hdfs://rhes75:9000/data/stg/test/orders.csv"
>> users_df =
>> spark.read.format("com.databricks.spark.csv").option("inferSchema",
>> "true").option("header", "true").load(users_file)
>> users_df.printSchema()
>> """
>> root
>> |-- id: integer (nullable = true)
>> |-- name: string (nullable = true)
>> """
>>
>> print(f"""\n Reading from  {users_file}\n""")
>> users_df.show(5,False)
>> orders_df =
>> spark.read.format("com.databricks.spark.csv").option("inferSchema",
>> "true").option("header", "true").load(orders_file)
>> orders_df.printSchema()
>> """
>> root
>> |-- id: integer (nullable = true)
>> |-- description: string (nullable = true)
>> |-- amount: double (nullable = true)
>> |-- user_id: integer (nullable = true)
>>  """
>> print(f"""\n Reading from  {orders_file}\n""")
>> orders_df.show(50,False)
>> users_df.createOrReplaceTempView("users")
>> orders_df.

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco,

First thoughts.

foreach() is an action operation used to iterate/loop over each element
in the dataset, meaning it is cursor based. That is different from operating
over the dataset as a set, which is far more efficient.

So in your case, if I understand it correctly, you want to get the orders for
each user (say Mich), convert the result set to JSON and send it to Mich
via email.

Let us try this based on sample data

Put your csv files into HDFS directory

hdfs dfs -put users.csv /data/stg/test
hdfs dfs -put orders.csv /data/stg/test

Then create dataframes from csv files, create temp views and do a join on
result sets with some slicing and dicing on orders table

#! /usr/bin/env python3
from __future__ import print_function
import sys
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql.window import Window

def spark_session(appName):
  return SparkSession.builder \
.appName(appName) \
.enableHiveSupport() \
.getOrCreate()

def main():
appName = "ORDERS"
spark =spark_session(appName)
# get the sample
users_file="hdfs://rhes75:9000/data/stg/test/users.csv"
orders_file="hdfs://rhes75:9000/data/stg/test/orders.csv"
users_df =
spark.read.format("com.databricks.spark.csv").option("inferSchema",
"true").option("header", "true").load(users_file)
users_df.printSchema()
"""
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
"""

print(f"""\n Reading from  {users_file}\n""")
users_df.show(5,False)
orders_df =
spark.read.format("com.databricks.spark.csv").option("inferSchema",
"true").option("header", "true").load(orders_file)
orders_df.printSchema()
"""
root
|-- id: integer (nullable = true)
|-- description: string (nullable = true)
|-- amount: double (nullable = true)
|-- user_id: integer (nullable = true)
 """
print(f"""\n Reading from  {orders_file}\n""")
orders_df.show(50,False)
users_df.createOrReplaceTempView("users")
orders_df.createOrReplaceTempView("orders")
# Create a list of orders for each user
print(f"""\n Doing a join on two temp views\n""")

sqltext = """
SELECT u.name, t.order_id, t.description, t.amount, t.maxorders
FROM
(
SELECT
user_id AS user_id
,   id as order_id
,   description as description
,   amount AS amount
,  DENSE_RANK() OVER (PARTITION by user_id ORDER BY amount) AS RANK
,  MAX(amount) OVER (PARTITION by user_id ORDER BY id) AS maxorders
FROM orders
) t
INNER JOIN users u ON t.user_id = u.id
AND  u.name = 'Mich'
ORDER BY t.order_id
"""
spark.sql(sqltext).show(50)
if __name__ == '__main__':
main()

Final outcome displaying orders for user Mich

Doing a join on two temp views

 Doing a join on two temp views

+++-+--+-+
|name|order_id|  description|amount|maxorders|
+++-+--+-+
|Mich|   50001| Mich's 1st order|101.11|   101.11|
|Mich|   50002| Mich's 2nd order|102.11|   102.11|
|Mich|   50003| Mich's 3rd order|103.11|   103.11|
|Mich|   50004| Mich's 4th order|104.11|   104.11|
|Mich|   50005| Mich's 5th order|105.11|   105.11|
|Mich|   50006| Mich's 6th order|106.11|   106.11|
|Mich|   50007| Mich's 7th order|107.11|   107.11|
|Mich|   50008| Mich's 8th order|108.11|   108.11|
|Mich|   50009| Mich's 9th order|109.11|   109.11|
|Mich|   50010|Mich's 10th order|210.11|   210.11|
+++-+--+-+

You can start on this.  Happy coding

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 25 Apr 2023 at 18:50, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> Thanks Mich,
>
> Great idea. I have done it. Those files are attached. I'm interested to
> know your thoughts. Let's imagine this same structure, but with huge
> amounts of data as well.
>
> Please and thank you,
> Marco.
>
> On Tue, 

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco,

Let us start simple,

Provide a csv file of 5 rows for the users table. Each row has a unique
user_id and one or two other columns like fictitious email etc.

Also, for each user_id, provide 10 rows in the orders table, meaning that the
orders table has 5 x 10 = 50 rows in total.

Both as comma-separated csv files.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 25 Apr 2023 at 14:07, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> Thanks Mich,
> I have not but I will certainly read up on this today.
>
> To your point that all of the essential data is in the 'orders' table; I
> agree! That distills the problem nicely. Yet, I still have some questions
> on which someone may be able to shed some light.
>
> 1) If my 'orders' table is very large, and will need to be aggregated by
> 'user_id', how will Spark intelligently optimize on that constraint (only
> read data for relevant 'user_id's). Is that something I have to instruct
> Spark to do?
>
> 2) Without #1, even with windowing, am I asking each partition to search
> too much?
>
> Please, if you have any links to documentation I can read on *how* Spark
> works under the hood for these operations, I would appreciate it if you
> give them. Spark has become a pillar on my team and knowing it in more
> detail is warranted.
>
> Slightly pivoting the subject here; I have tried something. It was a
> suggestion by an AI chat bot and it seemed reasonable. In my main Spark
> script I now have the line:
>
> ```
> grouped_orders_df =
> orders_df.groupBy('user_id').agg(collect_list(to_json(struct('user_id',
> 'timestamp', 'total', 'description'))).alias('orders'))
> ```
> (json is ultimately needed)
>
> This actually achieves my goal by putting all of the 'orders' in a single
> Array column. Now my worry is, will this column become too large if there
> are a great many orders. Is there a limit? I have searched for documentation
> on such a limit but could not find any.
>
> I truly appreciate your help Mich and team,
> Marco.
>
>
> On Tue, Apr 25, 2023 at 5:40 AM Mich Talebzadeh 
> wrote:
>
>> Have you thought of using  windowing function
>> <https://sparkbyexamples.com/spark/spark-sql-window-functions/>s to
>> achieve this?
>>
>> Effectively all your information is in the orders table.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
>> marco.costant...@rocketfncl.com> wrote:
>>
>>> I have two tables: {users, orders}. In this example, let's say that for
>>> each 1 User in the users table, there are 10 Orders in the orders table.
>>>
>>> I have to use pyspark to generate a statement of Orders for each User.
>>> So, a single user will need his/her own list of Orders. Additionally, I
>>> need to send this statement to the real-world user via email (for example).
>>>
>>> My first intuition was to apply a DataFrame.foreach() on the users
>>> DataFrame. This way, I can rely on the spark workers to handle the email
>>> sending individually. However, I now do not know the best way to get each
>>> User's Orders.
>>>
>>> I will soon try the following (pseudo-code):
>>>
>>> ```
>>> users_df = 
>>> orders_df = 
>>>
>>> #this is poorly named for max understandability in this context
>>> def foreach_function(row):
>>>   u

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Have you thought of using windowing functions
<https://sparkbyexamples.com/spark/spark-sql-window-functions/> to
achieve this?

Effectively all your information is in the orders table.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 25 Apr 2023 at 00:15, Marco Costantini <
marco.costant...@rocketfncl.com> wrote:

> I have two tables: {users, orders}. In this example, let's say that for
> each 1 User in the users table, there are 10 Orders in the orders table.
>
> I have to use pyspark to generate a statement of Orders for each User. So,
> a single user will need his/her own list of Orders. Additionally, I need to
> send this statement to the real-world user via email (for example).
>
> My first intuition was to apply a DataFrame.foreach() on the users
> DataFrame. This way, I can rely on the spark workers to handle the email
> sending individually. However, I now do not know the best way to get each
> User's Orders.
>
> I will soon try the following (pseudo-code):
>
> ```
> users_df = 
> orders_df = 
>
> #this is poorly named for max understandability in this context
> def foreach_function(row):
>   user_id = row.user_id
>   user_orders_df = orders_df.select(f'user_id = {user_id}')
>
>   #here, I'd get any User info from 'row'
>   #then, I'd convert all 'user_orders' to JSON
>   #then, I'd prepare the email and send it
>
> users_df.foreach(foreach_function)
> ```
>
> It is my understanding that if I do my user-specific work in the foreach
> function, I will capitalize on Spark's scalability when doing that work.
> However, I am worried of two things:
>
> If I take all Orders up front...
>
> Will that work?
> Will I be taking too much? Will I be taking Orders on partitions who won't
> handle them (different User).
>
> If I create the orders_df (filtered) within the foreach function...
>
> Will it work?
> Will that be too much IO to DB?
>
> The question ultimately is: How can I achieve this goal efficiently?
>
> I have not yet tried anything here. I am doing so as we speak, but am
> suffering from choice-paralysis.
>
> Please and thank you.
>
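
For what it is worth, an untested sketch of one way to avoid a per-user lookup
inside foreach: build every user's order list in a single grouped pass, then let
each partition deliver its own users' statements. Column names follow the toy
schema in this thread, and send_statement is only a placeholder for the
email/queue call:

```
from pyspark.sql import functions as F

# users_df and orders_df are assumed to be the two DataFrames from this thread
order_lists = (orders_df
               .groupBy("user_id")
               .agg(F.collect_list(
                        F.to_json(F.struct("id", "description", "amount"))
                    ).alias("orders")))

statements = (order_lists
              .join(users_df, order_lists.user_id == users_df.id)
              .select(users_df.id, users_df.name, order_lists.orders))

def send_statement(name, orders):
    # placeholder for the real email / notification / queue call
    print(f"statement for {name}: {len(orders)} orders")

def deliver(rows):
    for row in rows:
        send_statement(row.name, row.orders)

statements.foreachPartition(deliver)
```

The same statements frame could equally be written to a Kafka topic or to S3
instead of sending anything directly from the executors.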


Re: Spark Kubernetes Operator

2023-04-14 Thread Mich Talebzadeh
Hi,

What exactly are you trying to achieve? Spark on GKE works fine and you can
run Dataproc now on GKE:
https://www.linkedin.com/pulse/running-google-dataproc-kubernetes-engine-gke-spark-mich/?trackingId=lz12GC5dRFasLiaJm5qDSw%3D%3D

Unless I misunderstood your point.

HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 14 Apr 2023 at 17:42, Yuval Itzchakov  wrote:

> Hi,
>
> ATM I see the most used option for a Spark operator is the one provided by
> Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
>
> Unfortunately, it doesn't seem actively maintained. Are there any plans to
> support an official Apache Spark community driven operator?
>


Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-14 Thread Mich Talebzadeh
OK, I managed to load the zipped Python package and the runner .py file onto s3
for AWS EKS to work.

It is a bit of a nightmare compared to the same on the Google SDK, which is simpler.

Anyhow you will require additional jar files to be added to
$SPARK_HOME/jars. These two files will be picked up after you build the
docker image and will be available to pods.


   1. hadoop-aws-3.2.0.jar
   2. aws-java-sdk-bundle-1.11.375.jar

Then build your docker image and push the image to ecr registry on AWS.

This will allow you to refer to both the zipped package and your source
file as

 spark-submit --verbose \
   --master k8s://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --py-files s3a://spark-on-k8s/codes/spark_on_eks.zip \
   s3a://1spark-on-k8s/codes/

Note that you refer to the bucket as* s3a rather than s3*

Output from driver log

kubectl logs   -n spark

Started at
14/04/2023 15:08:11.11
starting at ID =  1 ,ending on =  100
root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)
 |-- op_type: integer (nullable = false)
 |-- op_time: timestamp (nullable = false)

+---+-+-+--+--+--+--+---+---+
|ID |CLUSTERED|SCATTERED|RANDOMISED|RANDOM_STRING
  |SMALL_VC  |PADDING   |op_type|op_time|
+---+-+-+--+--+--+--+---+---+
|1  |0.0  |0.0  |17.0
 |KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ|
1|xx|1  |2023-04-14 15:08:15.534|
|2  |0.01 |1.0  |7.0
|ffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxGR| 2|xx|1
 |2023-04-14 15:08:15.534|
|3  |0.02 |2.0  |30.0
 |LIixMEOLeMaEqJomTEIJEzOjoOjHyVaQXekWLctXbrEMUyTYBz|
3|xx|1  |2023-04-14 15:08:15.534|
|4  |0.03 |3.0  |30.0
 |tgUzEjfebzJsZWdoHIxrXlgqnbPZqZrmktsOUxfMvQyGplpErf|
4|xx|1  |2023-04-14 15:08:15.534|
|5  |0.04 |4.0  |79.0
 |qVwYSVPHbDXpPdkhxEpyIgKpaUnArlXykWZeiNNCiiaanXnkks|
5|xx|1  |2023-04-14 15:08:15.534|
|6  |0.05 |5.0  |73.0
 |fFWqcajQLEWVxuXbrFZmUAIIRgmKJSZUqQZNRfBvfxZAZqCSgW|
6|xx|1  |2023-04-14 15:08:15.534|
|7  |0.06 |6.0  |41.0
 |jzPdeIgxLdGncfBAepfJBdKhoOOLdKLzdocJisAjIhKtJRlgLK|
7|xx|1  |2023-04-14 15:08:15.534|
|8  |0.07 |7.0  |29.0
 |xyimTcfipZGnzPbDFDyFKmzfFoWbSrHAEyUhQqgeyNygQdvpSf|
8|xx|1  |2023-04-14 15:08:15.534|
|9  |0.08 |8.0  |59.0
 |NxrilRavGDMfvJNScUykTCUBkkpdhiGLeXSyYVgsnRoUYAfXrn|
9|xx|1  |2023-04-14 15:08:15.534|
|10 |0.09 |9.0  |73.0
 |cBEKanDFrPZkcHFuepVxcAiMwyAsRqDlRtQxiDXpCNycLapimt|
 10|xx|1  |2023-04-14 15:08:15.534|
+---+-+-+--+--+--+--+---+---+
only showing top 10 rows

Finished at
14/04/2023 15:08:16.16

I will provide the details under section *spark-on-aws *in
http://sparkcommunitytalk.slack.com/
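
As a side note, if the AWS credentials are not baked into the image, the s3a
settings can also be set on the session. A sketch with assumed configuration
values; the credentials provider in particular is an assumption and should be
swapped for whatever the EKS service account actually uses:

```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparkOnEks")
         .config("spark.hadoop.fs.s3a.impl",
                 "org.apache.hadoop.fs.s3a.S3AFileSystem")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")  # assumption
         .getOrCreate())

# quick smoke test that the s3a filesystem is wired up (path is illustrative)
spark.range(1).write.mode("overwrite").json("s3a://spark-on-k8s/tmp/smoke_test")
```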

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 Apr 2023 at 19:04, Mich Talebzadeh 
wrote:

> Thanks! I will have a look.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 12 Apr 2023 at 18:26, Bjørn Jørgensen 
> wrote:
>
>> Yes, it looks inside the docker containers folder. It will work if you
>> are using s3 or gs.
>>
>> On Wed, 12 Apr 2023 at 18:02, Mich Talebzad

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Thanks! I will have a look.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 Apr 2023 at 18:26, Bjørn Jørgensen 
wrote:

> Yes, it looks inside the docker containers folder. It will work if you are
> using s3 or gs.
>
> On Wed, 12 Apr 2023 at 18:02, Mich Talebzadeh wrote:
>
>> Hi,
>>
>> In my spark-submit to eks cluster, I use the standard code to submit to
>> the cluster as below:
>>
>> spark-submit --verbose \
>>--master k8s://$KUBERNETES_MASTER_IP:443 \
>>--deploy-mode cluster \
>>--name sparkOnEks \
>>--py-files local://$CODE_DIRECTORY/spark_on_eks.zip \
>>   local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>>
>> In Google Kubernetes Engine (GKE) I simply load them from gs:// storage
>> bucket.and it works fine.
>>
>> I am getting the following error in driver pod
>>
>>  + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
>> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
>> "$@")
>> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
>> spark.driver.bindAddress=192.168.39.251 --deploy-mode client 
>> --properties-file /opt/spark/conf/spark.properties --class 
>> org.apache.spark.deploy.PythonRunner 
>> local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>> 23/04/11 23:07:23 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> /usr/bin/python3: can't open file 
>> '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py': [Errno 
>> 2] No such file or directory
>> log4j:WARN No appenders could be found for logger 
>> (org.apache.spark.util.ShutdownHookManager).
>> It says  can't open file 
>> '/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py':
>>
>>
>> [Errno 2] No such file or directory but it is there!
>>
>> ls -l /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>> -rw-rw-rw- 1 hduser hadoop 5060 Mar 18 14:16 
>> /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
>> So not sure what is going on. I have suspicion that it is looking inside the 
>> docker itself for this file?
>>
>>
>> Is that a correct assumption?
>>
>>
>> Thanks
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Hi,

In my spark-submit to eks cluster, I use the standard code to submit to the
cluster as below:

spark-submit --verbose \
   --master k8s://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name sparkOnEks \
   --py-files local://$CODE_DIRECTORY/spark_on_eks.zip \
  local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py

In Google Kubernetes Engine (GKE) I simply load them from gs:// storage
bucket.and it works fine.

I am getting the following error in driver pod

 + CMD=("$SPARK_HOME/bin/spark-submit" --conf
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode
client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf
spark.driver.bindAddress=192.168.39.251 --deploy-mode client
--properties-file /opt/spark/conf/spark.properties --class
org.apache.spark.deploy.PythonRunner
local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
23/04/11 23:07:23 WARN NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
/usr/bin/python3: can't open file
'/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py':
[Errno 2] No such file or directory
log4j:WARN No appenders could be found for logger
(org.apache.spark.util.ShutdownHookManager).
It says it can't open file
'/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py':
[Errno 2] No such file or directory, but it is there!

ls -l /home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
-rw-rw-rw- 1 hduser hadoop 5060 Mar 18 14:16
/home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py
So I am not sure what is going on. I have a suspicion that it is looking
inside the docker image itself for this file.


Is that a correct assumption?


Thanks


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Mich Talebzadeh
Hi Lingzhe Sun,

Thanks for your comments. I am afraid I won't be able to take part in this
project and contribute.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 Apr 2023 at 02:55, Lingzhe Sun  wrote:

> Hi Mich,
>
> FYI we have been using the spark operator (
> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build
> stateful structured streaming on k8s for a year. Haven't tested it using the
> non-operator way.
>
> Besides that, the main contributor of the spark operator, Yinan Li, has
> been inactive for quite a long time. Kind of worried that this project might
> eventually become outdated as k8s is evolving. So if anyone is interested,
> please support the project.
>
> --
> Lingzhe Sun
> Hirain Technologies
>
>
> *From:* Mich Talebzadeh 
> *Date:* 2023-04-11 02:06
> *To:* Rajesh Katkar 
> *CC:* user 
> *Subject:* Re: spark streaming and kinesis integration
> What I said was this
> "In so far as I know k8s does not support spark structured streaming?"
>
> So it is an open question. I just recalled it. I have not tested myself. I
> know structured streaming works on Google Dataproc cluster but I have not
> seen any official link that says Spark Structured Streaming is supported on
> k8s.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 10 Apr 2023 at 06:31, Rajesh Katkar 
> wrote:
>
>> Do you have any link or ticket which justifies that k8s does not support
>> spark streaming ?
>>
>> On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, 
>> wrote:
>>
>>> Do you have a high level diagram of the proposed solution?
>>>
>>> In so far as I know k8s does not support spark structured streaming?
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar 
>>> wrote:
>>>
>>>> Use case is , we want to read/write to kinesis streams using k8s
>>>> Officially I could not find the connector or reader for kinesis from
>>>> spark like it has for kafka.
>>>>
>>>> Checking here if anyone used kinesis and spark streaming combination ?
>>>>
>>>> On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Rajesh,
>>>>>
>>>>> What is the use case for Kinesis here? I have not used it personally,
>>>>> Which use case it concerns
>>>>>
>>>>> https://aws.amazon.com/kinesis/
>>>>>
>>>>> Can you use something else instead?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Lead Solutions Archi

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
Just to clarify, a major benefit of k8s in this case is to host your Spark
applications in the form of containers in an automated fashion so that one
can easily deploy as many instances of the application as required
(autoscaling). From below:

https://price2meet.com/gcp/docs/dataproc_docs_concepts_configuring-clusters_autoscaling.pdf

Autoscaling does not support Spark Structured Streaming
(https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html);
see "Autoscaling and Spark Structured Streaming"
(#autoscaling_and_spark_structured_streaming).

By the same token, k8s is (as of now) more suitable for batch jobs than
Spark Structured Streaming.
https://issues.apache.org/jira/browse/SPARK-12133

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 10 Apr 2023 at 19:06, Mich Talebzadeh 
wrote:

> What I said was this
> "In so far as I know k8s does not support spark structured streaming?"
>
> So it is an open question. I just recalled it. I have not tested myself. I
> know structured streaming works on Google Dataproc cluster but I have not
> seen any official link that says Spark Structured Streaming is supported on
> k8s.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 10 Apr 2023 at 06:31, Rajesh Katkar 
> wrote:
>
>> Do you have any link or ticket which justifies that k8s does not support
>> spark streaming ?
>>
>> On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, 
>> wrote:
>>
>>> Do you have a high level diagram of the proposed solution?
>>>
>>> In so far as I know k8s does not support spark structured streaming?
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar 
>>> wrote:
>>>
>>>> Use case is , we want to read/write to kinesis streams using k8s
>>>> Officially I could not find the connector or reader for kinesis from
>>>> spark like it has for kafka.
>>>>
>>>> Checking here if anyone used kinesis and spark streaming combination ?
>>>>
>>>> On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Rajesh,
>>>>>
>>>>> What is the use case for Kinesis here? I have not used it personally,
>>>>> Which use case it concerns
>>>>>
>>>>> https://aws.amazon.com/kinesis/
>>>>>
>>>>> Can you use something else instead?
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Lead Solutions Architect/Engineering Lead
>>>>> Palantir Technologies
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
What I said was this
"In so far as I know k8s does not support spark structured streaming?"

So it is an open question. I just recalled it. I have not tested myself. I
know structured streaming works on Google Dataproc cluster but I have not
seen any official link that says Spark Structured Streaming is supported on
k8s.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 10 Apr 2023 at 06:31, Rajesh Katkar  wrote:

> Do you have any link or ticket which justifies that k8s does not support
> spark streaming ?
>
> On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, 
> wrote:
>
>> Do you have a high level diagram of the proposed solution?
>>
>> In so far as I know k8s does not support spark structured streaming?
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar 
>> wrote:
>>
>>> Use case is , we want to read/write to kinesis streams using k8s
>>> Officially I could not find the connector or reader for kinesis from
>>> spark like it has for kafka.
>>>
>>> Checking here if anyone used kinesis and spark streaming combination ?
>>>
>>> On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, 
>>> wrote:
>>>
>>>> Hi Rajesh,
>>>>
>>>> What is the use case for Kinesis here? I have not used it personally,
>>>> Which use case it concerns
>>>>
>>>> https://aws.amazon.com/kinesis/
>>>>
>>>> Can you use something else instead?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar 
>>>> wrote:
>>>>
>>>>> Hi Spark Team,
>>>>>
>>>>> We need to read/write the kinesis streams using spark streaming.
>>>>>
>>>>>  We checked the official documentation -
>>>>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>>>
>>>>> It does not mention kinesis connector. Alternative is -
>>>>> https://github.com/qubole/kinesis-sql which is not active now.  This
>>>>> is now handed over here -
>>>>> https://github.com/roncemer/spark-sql-kinesis
>>>>>
>>>>> Also according to SPARK-18165
>>>>> <https://issues.apache.org/jira/browse/SPARK-18165> , Spark
>>>>> officially do not have any kinesis connector
>>>>>
>>>>> We have few below questions , It would be great if you can answer
>>>>>
>>>>>1. Does Spark provides officially any kinesis connector which have
>>>>>readstream/writestream and endorse any connector for production use 
>>>>> cases ?
>>>>>
>>>>>2.
>>>>>
>>>>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html 
>>>>> This
>>>>>documentation does not mention how to write to kinesis. This method has
>>>>>default dynamodb as checkpoint, can we override it ?
>>>>>3. We have rocksdb as a state store but when we ran an application
>>>>>using official
>>>>>
>>>>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html 
>>>>> rocksdb
>>>>>configurations were not effective. Can you please confirm if rocksdb 
>>>>> is not
>>>>>applicable in these cases?
>>>>>4. rocksdb however works with qubole connector , do you have any
>>>>>plan to release kinesis connector?
>>>>>5. Please help/recommend us for any good stable kinesis connector
>>>>>or some pointers around it
>>>>>
>>>>>


Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Do you have a high level diagram of the proposed solution?

In so far as I know k8s does not support spark structured streaming?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Apr 2023 at 16:40, Rajesh Katkar  wrote:

> Use case is , we want to read/write to kinesis streams using k8s
> Officially I could not find the connector or reader for kinesis from spark
> like it has for kafka.
>
> Checking here if anyone used kinesis and spark streaming combination ?
>
> On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, 
> wrote:
>
>> Hi Rajesh,
>>
>> What is the use case for Kinesis here? I have not used it personally,
>> Which use case it concerns
>>
>> https://aws.amazon.com/kinesis/
>>
>> Can you use something else instead?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar 
>> wrote:
>>
>>> Hi Spark Team,
>>>
>>> We need to read/write the kinesis streams using spark streaming.
>>>
>>>  We checked the official documentation -
>>> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>>>
>>> It does not mention kinesis connector. Alternative is -
>>> https://github.com/qubole/kinesis-sql which is not active now.  This is
>>> now handed over here - https://github.com/roncemer/spark-sql-kinesis
>>>
>>> Also according to SPARK-18165
>>> <https://issues.apache.org/jira/browse/SPARK-18165> , Spark officially
>>> do not have any kinesis connector
>>>
>>> We have few below questions , It would be great if you can answer
>>>
>>>1. Does Spark provides officially any kinesis connector which have
>>>readstream/writestream and endorse any connector for production use 
>>> cases ?
>>>
>>>2.
>>>https://spark.apache.org/docs/latest/streaming-kinesis-integration.html 
>>> This
>>>documentation does not mention how to write to kinesis. This method has
>>>default dynamodb as checkpoint, can we override it ?
>>>3. We have rocksdb as a state store but when we ran an application
>>>using official
>>>https://spark.apache.org/docs/latest/streaming-kinesis-integration.html 
>>> rocksdb
>>>configurations were not effective. Can you please confirm if rocksdb is 
>>> not
>>>applicable in these cases?
>>>4. rocksdb however works with qubole connector , do you have any
>>>plan to release kinesis connector?
>>>5. Please help/recommend us for any good stable kinesis connector or
>>>some pointers around it
>>>
>>>


Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Hi Rajesh,

What is the use case for Kinesis here? I have not used it personally, Which
use case it concerns

https://aws.amazon.com/kinesis/

Can you use something else instead?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar  wrote:

> Hi Spark Team,
>
> We need to read/write the kinesis streams using spark streaming.
>
>  We checked the official documentation -
> https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
>
> It does not mention kinesis connector. Alternative is -
> https://github.com/qubole/kinesis-sql which is not active now.  This is
> now handed over here - https://github.com/roncemer/spark-sql-kinesis
>
> Also according to SPARK-18165
> <https://issues.apache.org/jira/browse/SPARK-18165> , Spark officially do
> not have any kinesis connector
>
> We have few below questions , It would be great if you can answer
>
>1. Does Spark provides officially any kinesis connector which have
>readstream/writestream and endorse any connector for production use cases ?
>
>2.
>https://spark.apache.org/docs/latest/streaming-kinesis-integration.html 
> This
>documentation does not mention how to write to kinesis. This method has
>default dynamodb as checkpoint, can we override it ?
>3. We have rocksdb as a state store but when we ran an application
>using official
>https://spark.apache.org/docs/latest/streaming-kinesis-integration.html 
> rocksdb
>configurations were not effective. Can you please confirm if rocksdb is not
>applicable in these cases?
>4. rocksdb however works with qubole connector , do you have any plan
>to release kinesis connector?
>5. Please help/recommend us for any good stable kinesis connector or
>some pointers around it
>
>


Re: Portability of dockers built on different cloud platforms

2023-04-05 Thread Mich Talebzadeh
The whole idea of creating a Docker container is to have a deployable,
self-contained utility. A Docker container image is a lightweight, standalone,
executable package of software that includes everything needed to run an
application: code, runtime, system tools, system libraries and settings. The
concepts are explained in the http://sparkcommunitytalk.slack.com/ slack
under section https://sparkcommunitytalk.slack.com/archives/C051KFWK9TJ

Back to the AWS/GCP use case: we are currently creating an Istio mesh for GCP
to AWS k8s fail-over using the same docker image in both gcr
<https://cloud.google.com/container-registry> and ecr
<https://docs.aws.amazon.com/AmazonECR/latest/userguide/Registries.html>
(container registries).

 HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 5 Apr 2023 at 10:59, Ken Peng  wrote:

>
>
> ashok34...@yahoo.com.INVALID wrote:
> > Is it possible to use Spark docker built on GCP on AWS without
> > rebuilding from new on AWS?
>
> I am using the spark image from bitnami for running on k8s.
> And yes, it's deployed by helm.
>
>
> --
> https://kenpeng.pages.dev/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread Mich Talebzadeh
OK Spark Structured Streaming.

How are you getting messages into Spark?  Is it Kafka?

This to me indicates that the message is incomplete or contains an unexpected
value in the JSON.
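
If it is Kafka, one way to guard against incomplete or malformed messages is
to parse the payload against an explicit schema and drop records that fail to
parse, rather than letting them fail downstream. A minimal sketch, assuming a
hypothetical broker address, topic name and message schema (adjust to your
actual layout):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder.appName("json-guard").getOrCreate()

// Hypothetical schema; replace with the real layout of your messages
val schema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption
  .option("subscribe", "my_topic")                  // assumption
  .load()

// from_json yields null for records that do not match the schema,
// so malformed messages can be filtered out instead of failing the query
val parsed = raw
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), schema).alias("data"))
  .filter(col("data").isNotNull)
  .select("data.*")
```

The dropped records could also be routed to a dead-letter sink for inspection
instead of being discarded.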

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 5 Apr 2023 at 12:58, me  wrote:

> Dear Apache Spark users,
> I have a long running Spark application that is encountering an
> ArrayIndexOutOfBoundsException once every two weeks. The exception does not
> disrupt the operation of my app, but I'm still concerned about it and would
> like to find a solution.
>
> Here's some additional information about my setup:
>
> Spark is running in standalone mode
> Spark version is 3.3.1
> Scala version is 2.12.15
> I'm using Spark in Structured Streaming
>
> Here's the relevant error message:
> java.lang.ArrayIndexOutOfBoundsException Index 59 out of bounds for length
> 16
> I've reviewed the code and searched online, but I'm still unable to find a
> solution. The full stacktrace can be found at this link:
> https://gist.github.com/rsi2m/ae54eccac93ae602d04d383e56c1a737
> I would appreciate any insights or suggestions on how to resolve this
> issue. Thank you in advance for your help.
>
> Best regards,
> rsi2m
>
>
>


Re: Slack for PySpark users

2023-04-04 Thread Mich Talebzadeh
That 3 months retention is just a soft setting. For low volume traffic, it
can be negotiated to a year’s retention. Let me see what we can do about it.

HTH

On Tue, 4 Apr 2023 at 09:31, Bjørn Jørgensen 
wrote:

> One of the things that I don't like about this slack solution is that
> questions and answers disappear after 90 days. Today's maillist solution is
> indexed by search engines and when one day you wonder about something, you
> can find solutions with the help of just searching the web. Another
> question that I have is why has apache superset
> <https://superset.apache.org> taken away their slack channel. They have
> it linked to on the website but the link is giving errors
> <https://apache-superset.slack.com/join/shared_invite/zt-1pj34ugpe-23fJZ7DIH~F~ffIkHsRv1g#/shared-invite/error>.
> Have they had any experience that this was not the best solution or is it
> just that the link does not work.
>
> tir. 4. apr. 2023 kl. 09:06 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> Hi Shani,
>>
>> I believe I am an admin so that is fine by me.
>>
>> Hi Dongioon,
>>
>> With regard to summarising the discussion etc, no need, It is like
>> flogging the dead horse, we have already discussed it enough. I don't see
>> the point of it.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 4 Apr 2023 at 07:06,  wrote:
>>
>>> Hey Dongjoon, Denny and all,
>>>
>>> I’ve created the current slack.
>>> All users have the option to create channels for different topics.
>>>
>>> I don’t see a reason for creating a new one.
>>>
>>> If anyone want to be admin on the current slack channel you all are
>>> welcome to send me a msg and I’ll grand permission.
>>>
>>> Have a great week,
>>> Shani Alisar
>>>
>>>
>>>
>>>
>>>
>>> On 4 Apr 2023, at 3:51, Dongjoon Hyun  wrote:
>>>
>>> 
>>> Thank you, Denny.
>>>
>>> May I interpret your comment as a request to support multiple channels
>>> in ASF too?
>>>
>>> > because it would allow us to create multiple channels for different
>>> topics
>>>
>>> Any other reasons?
>>>
>>> Dongjoon.
>>>
>>>
>>> On Mon, Apr 3, 2023 at 5:31 PM Denny Lee  wrote:
>>>
>>>> I do think creating a new Slack channel would be helpful because it
>>>> would allow us to create multiple channels for different topics -
>>>> streaming, graph, ML, etc.
>>>>
>>>> We would need a volunteer core to maintain it so we can keep the spirit
>>>> and letter of ASF / code of conduct.  I’d be glad to volunteer to keep this
>>>> active.
>>>>
>>>>
>>>>
>>>> On Mon, Apr 3, 2023 at 16:46 Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Shall we summarize the discussion so far?
>>>>>
>>>>> To sum up, "ASF Slack" vs "3rd-party Slack" was the real background to
>>>>> initiate this thread instead of "Slack" vs "Mailing list"?
>>>>>
>>>>> If ASF Slack provides what you need, is it better than creating a
>>>>> new Slack channel?
>>>>>
>>>>> Or, is there another reason for us to create a new Slack channel?
>>>>>
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Mon, Apr 3, 2023 at 3:27 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I agree, whatever individual sentiments are.
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Lead Solutions Architect/Engineering Lead
>>>>>> Palantir Technologies Limited
>>>>>>
>>&

Re: Slack for PySpark users

2023-04-04 Thread Mich Talebzadeh
Hi Shani,

I believe I am an admin so that is fine by me.

Hi Dongjoon,

With regard to summarising the discussion etc., no need. It is like flogging
a dead horse; we have already discussed it enough. I don't see the point
of it.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 4 Apr 2023 at 07:06,  wrote:

> Hey Dongjoon, Denny and all,
>
> I’ve created the current slack.
> All users have the option to create channels for different topics.
>
> I don’t see a reason for creating a new one.
>
> If anyone want to be admin on the current slack channel you all are
> welcome to send me a msg and I’ll grand permission.
>
> Have a great week,
> Shani Alisar
>
>
>
>
>
> On 4 Apr 2023, at 3:51, Dongjoon Hyun  wrote:
>
> 
> Thank you, Denny.
>
> May I interpret your comment as a request to support multiple channels in
> ASF too?
>
> > because it would allow us to create multiple channels for different
> topics
>
> Any other reasons?
>
> Dongjoon.
>
>
> On Mon, Apr 3, 2023 at 5:31 PM Denny Lee  wrote:
>
>> I do think creating a new Slack channel would be helpful because it would
>> allow us to create multiple channels for different topics - streaming,
>> graph, ML, etc.
>>
>> We would need a volunteer core to maintain it so we can keep the spirit
>> and letter of ASF / code of conduct.  I’d be glad to volunteer to keep this
>> active.
>>
>>
>>
>> On Mon, Apr 3, 2023 at 16:46 Dongjoon Hyun 
>> wrote:
>>
>>> Shall we summarize the discussion so far?
>>>
>>> To sum up, "ASF Slack" vs "3rd-party Slack" was the real background to
>>> initiate this thread instead of "Slack" vs "Mailing list"?
>>>
>>> If ASF Slack provides what you need, is it better than creating a
>>> new Slack channel?
>>>
>>> Or, is there another reason for us to create a new Slack channel?
>>>
>>> Dongjoon.
>>>
>>>
>>> On Mon, Apr 3, 2023 at 3:27 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> I agree, whatever individual sentiments are.
>>>>
>>>> Mich Talebzadeh,
>>>> Lead Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 3 Apr 2023 at 23:21, Jungtaek Lim 
>>>> wrote:
>>>>
>>>>> Just to be clear, if there is no strong volunteer to make the new
>>>>> community channel stay active, I'd probably be OK to not fork the channel.
>>>>> You can see a strong counter example from #spark channel in ASF. It is the
>>>>> place where there are only questions and promos but zero answers. I see
>>>>> volunteers here demanding for another channel, so I want to see us go with
>>>>> the most preferred way for these volunteers.
>>>>>
>>>>> User mailing list does not go in a good shape. I hope we give another
>>>>> try with recent technology to see whether we can gain traction - if we
>>>>> fail, the user mailing list will still be there.
>>>>>
>>>>> On Tue, Apr 4, 2023 at 7:04 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> The number of subscribers doesn't give any meaningful value. Please
>

Re: Slack for PySpark users

2023-04-03 Thread Mich Talebzadeh
I agree, whatever individual sentiments are.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 3 Apr 2023 at 23:21, Jungtaek Lim 
wrote:

> Just to be clear, if there is no strong volunteer to make the new
> community channel stay active, I'd probably be OK to not fork the channel.
> You can see a strong counter example from #spark channel in ASF. It is the
> place where there are only questions and promos but zero answers. I see
> volunteers here demanding for another channel, so I want to see us go with
> the most preferred way for these volunteers.
>
> User mailing list does not go in a good shape. I hope we give another try
> with recent technology to see whether we can gain traction - if we fail,
> the user mailing list will still be there.
>
> On Tue, Apr 4, 2023 at 7:04 AM Jungtaek Lim 
> wrote:
>
>> The number of subscribers doesn't give any meaningful value. Please look
>> into the number of mails being sent to the list.
>>
>> https://lists.apache.org/list.html?user@spark.apache.org
>> The latest month there were more than 200 emails being sent was Feb 2022,
>> more than a year ago. It was more than 1k in 2016, and more than 2k in 2015
>> and earlier.
>> Let's face the fact. User mailing list is dying, even before we start
>> discussion about alternative communication methods.
>>
>> Users never go with the way if it's just because PMC members (or ASF)
>> have preference. They are going with the way they are convenient.
>>
>> Same applies here - if ASF Slack requires a restricted invitation
>> mechanism then it won't work. Looks like there is a link for an invitation,
>> but we are also talking about the cost as well.
>> https://cwiki.apache.org/confluence/display/INFRA/Slack+Guest+Invites
>> As long as we are being serious about the cost, I don't think we are
>> going to land in the way "users" are convenient.
>>
>> On Tue, Apr 4, 2023 at 4:59 AM Dongjoon Hyun 
>> wrote:
>>
>>> As Mich Talebzadeh pointed out, Apache Spark has an official Slack
>>> channel.
>>>
>>> > It's unavoidable if "users" prefer to use an alternative communication
>>> mechanism rather than the user mailing list.
>>>
>>> The following is the number of people in the official channels.
>>>
>>> - user@spark.apache.org has 4519 subscribers.
>>> - d...@spark.apache.org has 3149 subscribers.
>>> - ASF Official Slack channel has 602 subscribers.
>>>
>>> May I ask if the users prefer to use the ASF Official Slack channel
>>> than the user mailing list?
>>>
>>> Dongjoon.
>>>
>>>
>>>
>>> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> I'm reading through the page "Briefing: The Apache Way", and in the
>>>> section of "Open Communications", restriction of communication inside ASF
>>>> INFRA (mailing list) is more about code and decision-making.
>>>>
>>>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>>>
>>>> It's unavoidable if "users" prefer to use an alternative communication
>>>> mechanism rather than the user mailing list. Before Stack Overflow days,
>>>> there had been a meaningful number of questions around user@. It's
>>>> just impossible to let them go back and post to the user mailing list.
>>>>
>>>> We just need to make sure it is not the purpose of employing Slack to
>>>> move all discussions about developments, direction of the project, etc
>>>> which must happen in dev@/private@. The purpose of Slack thread here
>>>> does not seem to aim to serve the purpose.
>>>>
>>>>
>>>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Good discussions and proposals.all around.
>>>>>
>>>>> I have used slack in anger on a customer site before. For small and
>>>

Re: Slack for PySpark users

2023-04-03 Thread Mich Talebzadeh
I myself prefer to use the newly formed Slack.

sparkcommunitytalk.slack.com

In summary, it may be a good idea to take a tour of it and see for
yourself. Topics are sectioned as per user requests.

I trust this answers your question.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 3 Apr 2023 at 20:59, Dongjoon Hyun  wrote:

> As Mich Talebzadeh pointed out, Apache Spark has an official Slack channel.
>
> > It's unavoidable if "users" prefer to use an alternative communication
> mechanism rather than the user mailing list.
>
> The following is the number of people in the official channels.
>
> - user@spark.apache.org has 4519 subscribers.
> - d...@spark.apache.org has 3149 subscribers.
> - ASF Official Slack channel has 602 subscribers.
>
> May I ask if the users prefer to use the ASF Official Slack channel
> than the user mailing list?
>
> Dongjoon.
>
>
>
> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim 
> wrote:
>
>> I'm reading through the page "Briefing: The Apache Way", and in the
>> section of "Open Communications", restriction of communication inside ASF
>> INFRA (mailing list) is more about code and decision-making.
>>
>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>
>> It's unavoidable if "users" prefer to use an alternative communication
>> mechanism rather than the user mailing list. Before Stack Overflow days,
>> there had been a meaningful number of questions around user@. It's just
>> impossible to let them go back and post to the user mailing list.
>>
>> We just need to make sure it is not the purpose of employing Slack to
>> move all discussions about developments, direction of the project, etc
>> which must happen in dev@/private@. The purpose of Slack thread here
>> does not seem to aim to serve the purpose.
>>
>>
>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Good discussions and proposals.all around.
>>>
>>> I have used slack in anger on a customer site before. For small and
>>> medium size groups it is good and affordable. Alternatives have been
>>> suggested as well so those who like investigative search can agree and come
>>> up with a freebie one.
>>> I am inclined to agree with Bjorn that this slack has more social
>>> dimensions than the mailing list. It is akin to a sports club using
>>> WhatsApp groups for communication. Remember we were originally looking for
>>> space for webinars, including Spark on Linkedin that Denney Lee suggested.
>>> I think Slack and mailing groups can coexist happily. On a more serious
>>> note, when I joined the user group back in 2015-2016, there was a lot of
>>> traffic. Currently we hardly get many mails daily <> less than 5. So having
>>> a slack type medium may improve members participation.
>>>
>>> so +1 for me as well.
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 30 Mar 2023 at 22:19, Denny Lee  wrote:
>>>
>>>> +1.
>>>>
>>>> To Shani’s point, there are multiple OSS projects that use the free
>>>> Slack version - top of mind include Delta, Presto, Flink, Trino, Datahub,
>>>> MLflow, etc.
>>>>
>>>> On Thu, Mar 30, 2023 at 14:15  wrote:
>>>

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Philippe,

Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks. They
can be used, for example, to give every node a copy of a large input
dataset in an efficient manner. Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms to reduce communication cost.

If you have enough memory, the smaller table is cached on the driver and
distributed to every node of the cluster, reducing the lift and shift of data.
Check this link:

https://sparkbyexamples.com/spark/broadcast-join-in-spark/#:~:text=Broadcast%20join%20is%20an%20optimization,always%20collected%20at%20the%20driver
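
As a minimal sketch of that idea for this use case (assuming the reference
telephone numbers fit comfortably in driver memory; the paths and column
positions below are illustrative only):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("broadcast-reference").getOrCreate()

// Collect the small reference dataset on the driver (assumption: it fits in memory)
val referenceTels: Set[String] = spark.read
  .option("header", "false")
  .csv("/path/to/reference_tels.csv")   // hypothetical path
  .collect()
  .map(_.getString(0))
  .toSet

// Ship one read-only copy to every executor instead of one copy per task
val telsBc = spark.sparkContext.broadcast(referenceTels)

// Use the broadcast value inside a transformation
val ds = spark.read.parquet("/path/to/telephonedirectory")   // hypothetical path
val matched = ds.filter(row => telsBc.value.exists(t => row.getString(0).contains(t)))
matched.show()
```

Each executor then holds the whole reference set once, which is the same
effect the broadcast join gives you at the SQL level.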

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 2 Apr 2023 at 20:05, Philippe de Rochambeau  wrote:

> Hi Mich,
> what exactly do you mean by « if you prefer to broadcast the reference
> data »?
> Philippe
>
> Le 2 avr. 2023 à 18:16, Mich Talebzadeh  a
> écrit :
>
> Hi Phillipe,
>
> These are my thoughts besides comments from Sean
>
> Just to clarify, you receive a CSV file periodically and you already have
> a file that contains valid patterns for phone numbers (reference)
>
> In a pseudo language you can probe your csv DF against the reference DF
>
> // load your reference dataframe
> val reference_DF = sqlContext.parquetFile("path")
> // mark this smaller dataframe to be stored in memory
> reference_DF.cache()
>
> // Create a temp table
> reference_DF.createOrReplaceTempView("reference")
>
> // Do the same on the CSV, change the line below
> val csvDF = spark.read.format("com.databricks.spark.csv")
>   .option("inferSchema", "true")
>   .option("header", "false")
>   .load("path")
>
> csvDF.cache()  // This may or may not work if the CSV is large, however it is worth trying
> csvDF.createOrReplaceTempView("csv")
>
> sqlContext.sql("JOIN Query").show
>
> If you prefer to broadcast the reference data, you must first collect it on 
> the driver before you broadcast it. This requires that your RDD fits in 
> memory on your driver (and executors).
>
> You can then play around with that join.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau 
> wrote:
>
>> Many thanks, Mich.
>> Is « foreach »  the best construct to  lookup items is a dataset  such as
>> the below «  telephonedirectory » data set?
>>
>> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  
>> tel3 » …)) // the telephone sequence
>>
>> // was read for a CSV file
>>
>> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>>
>>   rdd .foreach(tel => {
>> longAcc.select(«  * » ).rlike(«  + »  + tel)
>>   })
>>
>>
>>
>>
>> Le 1 avr. 2023 à 22:36, Mich Talebzadeh  a
>> écrit :
>>
>> This may help
>>
>> Spark rlike() Working with Regex Matching Examples
>> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical c

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Phillipe,

These are my thoughts besides comments from Sean

Just to clarify, you receive a CSV file periodically and you already have a
file that contains valid patterns for phone numbers (reference)

In a pseudo language you can probe your csv DF against the reference DF

// load your reference dataframe
val reference_DF = sqlContext.parquetFile("path")
// mark this smaller dataframe to be stored in memory
reference_DF.cache()

// Create a temp table
reference_DF.createOrReplaceTempView("reference")

// Do the same on the CSV, change the line below
val csvDF = spark.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "false")
  .load("path")

csvDF.cache()  // This may or may not work if the CSV is large, however it is worth trying
csvDF.createOrReplaceTempView("csv")

sqlContext.sql("JOIN Query").show

If you prefer to broadcast the reference data, you must first collect
it on the driver before you broadcast it. This requires that your RDD
fits in memory on your driver (and executors).

You can then play around with that join.
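
For illustration, the "JOIN Query" placeholder above could look something like
the sketch below; the column name tel on both sides is an assumption, so adapt
it to your actual schemas:

```
// Ask Spark to broadcast the small reference side of the join
val joined = sqlContext.sql(
  """SELECT /*+ BROADCAST(r) */ c.*
    |FROM csv c
    |JOIN reference r
    |  ON c.tel LIKE CONCAT('%', r.tel, '%')
  """.stripMargin)

joined.show()
```

Because this is a LIKE-based (non-equi) join it will execute as a broadcast
nested loop join, which is fine as long as the reference side stays small.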

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 2 Apr 2023 at 09:17, Philippe de Rochambeau  wrote:

> Many thanks, Mich.
> Is « foreach »  the best construct to  lookup items is a dataset  such as
> the below «  telephonedirectory » data set?
>
> val telrdd = spark.sparkContext.parallelize(Seq(«  tel1 » , «  tel2 » , «  
> tel3 » …)) // the telephone sequence
>
> // was read for a CSV file
>
> val ds = spark.read.parquet(«  /path/to/telephonedirectory » )
>
>   rdd .foreach(tel => {
> longAcc.select(«  * » ).rlike(«  + »  + tel)
>   })
>
>
>
>
> Le 1 avr. 2023 à 22:36, Mich Talebzadeh  a
> écrit :
>
> This may help
>
> Spark rlike() Working with Regex Matching Examples
> <https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau 
> wrote:
>
>> Hello,
>> I’m looking for an efficient way in Spark to search for a series of
>> telephone numbers, contained in a CSV file, in a data set column.
>>
>> In pseudo code,
>>
>> for tel in [tel1, tel2, …. tel40,000]
>> search for tel in dataset using .like(« %tel% »)
>> end for
>>
>> I’m using the like function because the telephone numbers in the data set
>> main contain prefixes, such as « + « ; e.g., « +331222 ».
>>
>> Any suggestions would be welcome.
>>
>> Many thanks.
>>
>> Philippe
>>
>>
>>
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Looping through a series of telephone numbers

2023-04-01 Thread Mich Talebzadeh
This may help

Spark rlike() Working with Regex Matching Examples
<https://sparkbyexamples.com/spark/spark-rlike-regex-matching-examples/>
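
For example, instead of looping over 40,000 individual like() calls, the
numbers can be collapsed into a single alternation pattern and matched once
with rlike. A minimal sketch, assuming a column called tel, an existing
SparkSession named spark, and illustrative sample numbers:

```
import java.util.regex.Pattern
import org.apache.spark.sql.functions.col

// Illustrative numbers; in practice read them from the CSV file
val tels = Seq("331222", "447700900123", "2125550100")

// Build one pattern of the form (331222|447700900123|2125550100)
val pattern = tels.map(Pattern.quote).mkString("(", "|", ")")

val ds = spark.read.parquet("/path/to/telephonedirectory")  // hypothetical path
val matched = ds.filter(col("tel").rlike(pattern))
matched.show()
```

Whether a single 40,000-way alternation performs acceptably is worth
measuring; the broadcast join approach discussed earlier in this thread may
scale better.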
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau  wrote:

> Hello,
> I’m looking for an efficient way in Spark to search for a series of
> telephone numbers, contained in a CSV file, in a data set column.
>
> In pseudo code,
>
> for tel in [tel1, tel2, …. tel40,000]
> search for tel in dataset using .like(« %tel% »)
> end for
>
> I’m using the like function because the telephone numbers in the data set
> main contain prefixes, such as « + « ; e.g., « +331222 ».
>
> Any suggestions would be welcome.
>
> Many thanks.
>
> Philippe
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Mich Talebzadeh
Good stuff Khalid.

I have created a section in the Apache Spark Community Slack called spark
foundation: spark-foundation - Apache Spark Community - Slack
<https://app.slack.com/client/T04URTRBZ1R/C051CL5T1KL/thread/C0501NBTNQG-1680132989.091199>

I invite you to add your weblink to that section.

HTH
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 1 Apr 2023 at 13:12, Khalid Mammadov 
wrote:

> Hey AN-TRUONG
>
> I have got some articles about this subject that should help.
> E.g.
> https://khalidmammadov.github.io/spark/spark_internals_rdd.html
>
> Also check other Spark Internals on web.
>
> Regards
> Khalid
>
> On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, 
> wrote:
>
>> Thank you for your information,
>>
>> I have tracked the spark history server on port 18080 and the spark UI on
>> port 4040. I see the result of these two tools as similar right?
>>
>> I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in
>> the images does, is it possible?
>> https://i.stack.imgur.com/Azva4.png
>>
>> Best regards,
>>
>> An - Truong
>>
>>
>> On Fri, Mar 31, 2023 at 9:38 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are you familiar with spark GUI default on port 4040?
>>>
>>> have a look.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan <
>>> tr.phan.tru...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am learning about Apache Spark and want to know the meaning of each
>>>> Task created on the Jobs recorded on Spark history.
>>>>
>>>> For example, the application I write creates 17 jobs, in which job 0
>>>> runs for 10 minutes, there are 2384 small tasks and I want to learn about
>>>> the meaning of these 2384, is it possible?
>>>>
>>>> I found a picture of DAG in the Jobs and want to know the relationship
>>>> between DAG and Task, is it possible (Specifically from the attached file
>>>> DAG and 2384 tasks below)?
>>>>
>>>> Thank you very much, have a nice day everyone.
>>>>
>>>> Best regards,
>>>>
>>>> An-Trường.
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Trân Trọng,
>>
>> An Trường.
>>
>


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Yes, history refers to completed jobs; 4040 shows the running jobs.

You should have screenshots for the executors and stages tabs as well.
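
As an aside, completed jobs only appear in the history server (port 18080) if
event logging is switched on. A minimal sketch of the relevant settings (the
log directory is an assumption; point it at a location the history server also
reads via spark.history.fs.logDirectory):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("my-app")
  // write event logs so the history server can replay the finished application
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-logs")   // assumption
  .getOrCreate()
```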

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 31 Mar 2023 at 16:17, AN-TRUONG Tran Phan 
wrote:

> Thank you for your information,
>
> I have tracked the spark history server on port 18080 and the spark UI on
> port 4040. I see the result of these two tools as similar right?
>
> I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in
> the images does, is it possible?
> https://i.stack.imgur.com/Azva4.png
>
> Best regards,
>
> An - Truong
>
>
> On Fri, Mar 31, 2023 at 9:38 PM Mich Talebzadeh 
> wrote:
>
>> Are you familiar with spark GUI default on port 4040?
>>
>> have a look.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan <
>> tr.phan.tru...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am learning about Apache Spark and want to know the meaning of each
>>> Task created on the Jobs recorded on Spark history.
>>>
>>> For example, the application I write creates 17 jobs, in which job 0
>>> runs for 10 minutes, there are 2384 small tasks and I want to learn about
>>> the meaning of these 2384, is it possible?
>>>
>>> I found a picture of DAG in the Jobs and want to know the relationship
>>> between DAG and Task, is it possible (Specifically from the attached file
>>> DAG and 2384 tasks below)?
>>>
>>> Thank you very much, have a nice day everyone.
>>>
>>> Best regards,
>>>
>>> An-Trường.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> --
> Trân Trọng,
>
> An Trường.
>


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Are you familiar with the Spark GUI, by default on port 4040?

Have a look.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan 
wrote:

> Hi,
>
> I am learning about Apache Spark and want to know the meaning of each Task
> created on the Jobs recorded on Spark history.
>
> For example, the application I write creates 17 jobs, in which job 0 runs
> for 10 minutes, there are 2384 small tasks and I want to learn about the
> meaning of these 2384, is it possible?
>
> I found a picture of DAG in the Jobs and want to know the relationship
> between DAG and Task, is it possible (Specifically from the attached file
> DAG and 2384 tasks below)?
>
> Thank you very much, have a nice day everyone.
>
> Best regards,
>
> An-Trường.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Creating InMemory relations with data in ColumnarBatches

2023-03-30 Thread Mich Talebzadeh
Is this purely for performance consideration?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 30 Mar 2023 at 19:56, praveen sinha  wrote:

> Hi,
>
> I have been trying to implement InMemoryRelation based on spark
> ColumnarBatches, so far I have not been able to store the vectorised
> columnarbatch into the relation. Is there a way to achieve this without
> going with an intermediary representation like Arrow, so as to enable spark
> to do fast columnar aggregations in memory. The code so far, using just the
> high level APIs is as follows -
>
> ```
>   //Load csv into Datafram
>   val csvDF: DataFrame = context.sqlctx.read
> .format("com.databricks.spark.csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .load(csvFile)
>
>   //Create in memory relation using schema from csv dataframe
>   val relation = InMemoryRelation(
> useCompression = true,
> batchSize = 100,
> storageLevel = StorageLevel.MEMORY_ONLY,
> child = csvDF.queryExecution.sparkPlan, //Do I need to alter this
> to suggest columnar plans?
> tableName = Some("nyc_taxi"),
> optimizedPlan = csvDF.queryExecution.optimizedPlan
>   )
>
>   //create vectorized columnar batches
>   val rows = csvDF.collect()
>   import scala.collection.JavaConverters._
>   val vectorizedRows: ColumnarBatch =
> ColumnVectorUtils.toBatch(csvDF.schema, MemoryMode.ON_HEAP,
> rows.iterator.asJava)
>
>   //store the vectorized rows in the relation
>   //relation.store(vectorizedRows)
> ```
>
> Obviously the last line is the one which is not an API. Need help to
> understand if this approach can work and if it does, need help and pointers
> in trying to come up with how to implement this API using low level spark
> constructs.
>
> Thanks and Regards,
> Praveen
>


Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Good discussions and proposals.all around.

I have used Slack in anger on a customer site before. For small and medium
size groups it is good and affordable. Alternatives have been suggested as
well, so those who like to investigate can agree on and come up with a
free one.
I am inclined to agree with Bjorn that Slack has more of a social
dimension than the mailing list. It is akin to a sports club using
WhatsApp groups for communication. Remember, we were originally looking for
space for webinars, including the Spark LinkedIn page that Denny Lee suggested.
I think Slack and the mailing lists can coexist happily. On a more serious
note, when I joined the user group back in 2015-2016, there was a lot of
traffic. Currently we hardly get many mails daily, fewer than 5. So having
a Slack-type medium may improve member participation.

so +1 for me as well.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 30 Mar 2023 at 22:19, Denny Lee  wrote:

> +1.
>
> To Shani’s point, there are multiple OSS projects that use the free Slack
> version - top of mind include Delta, Presto, Flink, Trino, Datahub, MLflow,
> etc.
>
> On Thu, Mar 30, 2023 at 14:15  wrote:
>
>> Hey everyone,
>>
>> I think we should remain on a free program in slack.
>>
>> In my opinion the free program is more than enough; the only downside is
>> we can only see the last 90 days' messages.
>>
>> From what I know, the Airflow community (which has a strong, active community
>> in Slack) also uses the free program (you can tell by the 90-day limit
>> notice in their workspace).
>>
>> You can find the pricing and features comparison between the slack
>> programs here <https://slack.com/intl/en-gb/pricing> .
>>
>> Have a great day,
>> Shani
>>
>> On 30 Mar 2023, at 23:38, Mridul Muralidharan  wrote:
>>
>> 
>>
>>
>> Thanks for flagging the concern Dongjoon, I was not aware of the
>> discussion - but I can understand the concern.
>> Would be great if you or Matei could update the thread on the result of
>> deliberations, once it reaches a logical consensus: before we set up
>> official policy around it.
>>
>> Regards,
>> Mridul
>>
>>
>> On Thu, Mar 30, 2023 at 4:23 PM Bjørn Jørgensen 
>> wrote:
>>
>>> I like the idea of having a talk channel. It can make it easier for
>>> everyone to say hello. Or to dare to ask about small or big matters that
>>> you would not have dared to ask about before on mailing lists.
>>> But then there is the price and what is the best for an open source
>>> project.
>>>
>>> The price for using Slack is expensive.
>>> Right now, for those that have joined the Spark Slack:
>>> $8.75 USD per member per month x 72 members = $630 USD for one month.
>>>
>>> https://app.slack.com/plans/T04URTRBZ1R/checkout/form?entry_point=hero_banner_upgrade_cta=2
>>>
>>> And Slack does not have an option for open source projects.
>>>
>>> There seem to be some alternatives for open source software. I have not
>>> tried them.
>>> Like https://www.rocket.chat/blog/slack-open-source-alternatives
>>>
>>> 
>>>
>>>
>>> rocket chat is open source https://github.com/RocketChat/Rocket.Chat
>>>
>>> tor. 30. mar. 2023 kl. 18:54 skrev Mich Talebzadeh <
>>> mich.talebza...@gmail.com>:
>>>
>>>> Hi Dongjoon
>>>>
>>>> to your points if I may
>>>>
>>>> - Do you have any reference from other official ASF-related Slack
>>>> channels?
>>>>No, I don't have any reference from other official ASF-related Slack
>>>> channels because I don't think that matters. However, I stand corrected
>>>> - To be clear, I intentionally didn't refer to any specific mailing
>>>> list because we didn't set up any rule here yet.
>>>>fair enough
>>>>
>>>> going back to your original point
>>>>
>>>> ..There is a concern expressed by ASF board because recent Slack
>>>> activities created an isol

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Hi Dongjoon

to your points if I may

- Do you have any reference from other official ASF-related Slack channels?
   No, I don't have any reference from other official ASF-related Slack
channels because I don't think that matters. However, I stand corrected
- To be clear, I intentionally didn't refer to any specific mailing list
because we didn't set up any rule here yet.
   fair enough

going back to your original point

..There is a concern expressed by ASF board because recent Slack activities
created an isolated silo outside of ASF mailing list archive...
Well, there are activities around Spark, and indeed other open source software,
everywhere. One way or another they do help the community (inside the
user groups and beyond) get interested and involved. Slack happens to be
one of them.
I am of the opinion that creating such silos is already a reality and we
ought to be pragmatic. Unless there is an overriding reason, we should
embrace it, as Slack can co-exist with the other mailing lists and channels
like LinkedIn etc.

Hope this clarifies my position

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 30 Mar 2023 at 17:28, Dongjoon Hyun  wrote:

> To Mich.
> - Do you have any reference from other official ASF-related Slack channels?
> - To be clear, I intentionally didn't refer to any specific mailing list
> because we didn't set up any rule here yet.
>
> To Xiao. I understand what you mean. That's the reason why I added Matei
> from your side.
> > I did not see an objection from the ASF board.
>
> There is on-going discussion about the communication channels outside ASF
> email which is specifically concerning Slack.
> Please hold on any official action for this topic. We will know how to
> support it seamlessly.
>
> Dongjoon.
>
>
> On Thu, Mar 30, 2023 at 9:21 AM Xiao Li  wrote:
>
>> Hi, Dongjoon,
>>
>> The other communities (e.g., Pinot, Druid, Flink) created their own Slack
>> workspaces last year. I did not see an objection from the ASF board. At the
>> same time, Slack workspaces are very popular and useful in most non-ASF
>> open source communities. TBH, we are kind of late. I think we can do the
>> same in our community?
>>
>> We can follow the guide when the ASF has an official process for ASF
>> archiving. Since our PMC are the owner of the slack workspace, we can make
>> a change based on the policy. WDYT?
>>
>> Xiao
>>
>>
>> Dongjoon Hyun  于2023年3月30日周四 09:03写道:
>>
>>> Hi, Xiao and all.
>>>
>>> (cc Matei)
>>>
>>> Please hold on the vote.
>>>
>>> There is a concern expressed by ASF board because recent Slack
>>> activities created an isolated silo outside of ASF mailing list archive.
>>>
>>> We need to establish a way to embrace it back to ASF archive before
>>> starting anything official.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li  wrote:
>>>
>>>> +1
>>>>
>>>> + @d...@spark.apache.org 
>>>>
>>>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>>>> Flink) have created their own dedicated Slack workspaces for faster
>>>> communication. We can do the same in Apache Spark. The Slack workspace will
>>>> be maintained by the Apache Spark PMC. I propose to initiate a vote for the
>>>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>>>
>>>> Cheers,
>>>>
>>>> Xiao
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>>>>
>>>>> I created one at slack called pyspark
>>>>>
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Lead Solutions Architect/Engineering Lead
>>>>> Palantir Technologies Limited
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>&g

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Hi Dongjoon,

Thanks for your point.

I gather you are referring to the archive below

https://lists.apache.org/list.html?user@spark.apache.org

Otherwise, correct me.

Thanks


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 30 Mar 2023 at 17:03, Dongjoon Hyun  wrote:

> Hi, Xiao and all.
>
> (cc Matei)
>
> Please hold on the vote.
>
> There is a concern expressed by ASF board because recent Slack activities
> created an isolated silo outside of ASF mailing list archive.
>
> We need to establish a way to embrace it back to ASF archive before
> starting anything official.
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 29, 2023 at 11:32 PM Xiao Li  wrote:
>
>> +1
>>
>> + @d...@spark.apache.org 
>>
>> This is a good idea. The other Apache projects (e.g., Pinot, Druid,
>> Flink) have created their own dedicated Slack workspaces for faster
>> communication. We can do the same in Apache Spark. The Slack workspace will
>> be maintained by the Apache Spark PMC. I propose to initiate a vote for the
>> creation of a new Apache Spark Slack workspace. Does that sound good?
>>
>> Cheers,
>>
>> Xiao
>>
>>
>>
>>
>>
>>
>>
>> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>>
>>> I created one at slack called pyspark
>>>
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>>>
>>>> +1 good idea, I d like to join as well.
>>>>
>>>> Le mar. 28 mars 2023 à 04:09, Winston Lai  a
>>>> écrit :
>>>>
>>>>> Please let us know when the channel is created. I'd like to join :)
>>>>>
>>>>> Thank You & Best Regards
>>>>> Winston Lai
>>>>> --
>>>>> *From:* Denny Lee 
>>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>>> *To:* Hyukjin Kwon 
>>>>> *Cc:* keen ; user@spark.apache.org <
>>>>> user@spark.apache.org>
>>>>> *Subject:* Re: Slack for PySpark users
>>>>>
>>>>> +1 I think this is a great idea!
>>>>>
>>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>> Yeah, actually I think we should better have a slack channel so we can
>>>>> easily discuss with users and developers.
>>>>>
>>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>>
>>>>> Hi all,
>>>>> I really like *Slack *as communication channel for a tech community.
>>>>> There is a Slack workspace for *delta lake users* (
>>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>>> I was wondering if there is something similar for PySpark users.
>>>>>
>>>>> If not, would there be anything wrong with creating a new
>>>>> Slack workspace for PySpark users? (when explicitly mentioning that this 
>>>>> is
>>>>> *not* officially part of Apache Spark)?
>>>>>
>>>>> Cheers
>>>>> Martin
>>>>>
>>>>>
>>>>
>>>> --
>>>> Asma ZGOLLI
>>>>
>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>
>>>>


Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
The ownership of the Slack workspace belongs to the Spark community.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 30 Mar 2023 at 08:05,  wrote:

> Hey there,
>
> I agree, If Apache Spark PMC can maintain the spark community workspace,
> that would be great!
> Instead of creating a new one, they can also become the owner of the
> current one
> <https://join.slack.com/t/sparkcommunitytalk/shared_invite/zt-1rk11diac-hzGbOEdBHgjXf02IZ1mvUA>
>  .
>
> Best regards,
> Shani
>
> On 30 Mar 2023, at 9:32, Xiao Li  wrote:
>
> 
> +1
>
> + @d...@spark.apache.org 
>
> This is a good idea. The other Apache projects (e.g., Pinot, Druid, Flink)
> have created their own dedicated Slack workspaces for faster communication.
> We can do the same in Apache Spark. The Slack workspace will be maintained
> by the Apache Spark PMC. I propose to initiate a vote for the creation of a
> new Apache Spark Slack workspace. Does that sound good?
>
> Cheers,
>
> Xiao
>
>
>
>
>
>
>
> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>
>> I created one at slack called pyspark
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>>
>>> +1 good idea, I d like to join as well.
>>>
>>> Le mar. 28 mars 2023 à 04:09, Winston Lai  a
>>> écrit :
>>>
>>>> Please let us know when the channel is created. I'd like to join :)
>>>>
>>>> Thank You & Best Regards
>>>> Winston Lai
>>>> --
>>>> *From:* Denny Lee 
>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>> *To:* Hyukjin Kwon 
>>>> *Cc:* keen ; user@spark.apache.org <
>>>> user@spark.apache.org>
>>>> *Subject:* Re: Slack for PySpark users
>>>>
>>>> +1 I think this is a great idea!
>>>>
>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>> Yeah, actually I think we should better have a slack channel so we can
>>>> easily discuss with users and developers.
>>>>
>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>
>>>> Hi all,
>>>> I really like *Slack *as communication channel for a tech community.
>>>> There is a Slack workspace for *delta lake users* (
>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>> I was wondering if there is something similar for PySpark users.
>>>>
>>>> If not, would there be anything wrong with creating a new
>>>> Slack workspace for PySpark users? (when explicitly mentioning that this is
>>>> *not* officially part of Apache Spark)?
>>>>
>>>> Cheers
>>>> Martin
>>>>
>>>>
>>>
>>> --
>>> Asma ZGOLLI
>>>
>>> Ph.D. in Big Data - Applied Machine Learning
>>>
>>>


Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
We already have it

general - Apache Spark Community - Slack
<https://app.slack.com/client/T04URTRBZ1R/C0501NBTNQG/thread/C050F0J5YNA-1680070839.296179>

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 30 Mar 2023 at 07:31, Xiao Li  wrote:

> +1
>
> + @d...@spark.apache.org 
>
> This is a good idea. The other Apache projects (e.g., Pinot, Druid, Flink)
> have created their own dedicated Slack workspaces for faster communication.
> We can do the same in Apache Spark. The Slack workspace will be maintained
> by the Apache Spark PMC. I propose to initiate a vote for the creation of a
> new Apache Spark Slack workspace. Does that sound good?
>
> Cheers,
>
> Xiao
>
>
>
>
>
>
>
> Mich Talebzadeh  于2023年3月28日周二 07:07写道:
>
>> I created one at slack called pyspark
>>
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:
>>
>>> +1 good idea, I d like to join as well.
>>>
>>> Le mar. 28 mars 2023 à 04:09, Winston Lai  a
>>> écrit :
>>>
>>>> Please let us know when the channel is created. I'd like to join :)
>>>>
>>>> Thank You & Best Regards
>>>> Winston Lai
>>>> --
>>>> *From:* Denny Lee 
>>>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>>>> *To:* Hyukjin Kwon 
>>>> *Cc:* keen ; user@spark.apache.org <
>>>> user@spark.apache.org>
>>>> *Subject:* Re: Slack for PySpark users
>>>>
>>>> +1 I think this is a great idea!
>>>>
>>>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>> Yeah, actually I think we should better have a slack channel so we can
>>>> easily discuss with users and developers.
>>>>
>>>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>>>
>>>> Hi all,
>>>> I really like *Slack *as communication channel for a tech community.
>>>> There is a Slack workspace for *delta lake users* (
>>>> https://go.delta.io/slack) that I enjoy a lot.
>>>> I was wondering if there is something similar for PySpark users.
>>>>
>>>> If not, would there be anything wrong with creating a new
>>>> Slack workspace for PySpark users? (when explicitly mentioning that this is
>>>> *not* officially part of Apache Spark)?
>>>>
>>>> Cheers
>>>> Martin
>>>>
>>>>
>>>
>>> --
>>> Asma ZGOLLI
>>>
>>> Ph.D. in Big Data - Applied Machine Learning
>>>
>>>


Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
https://join.slack.com/t/sparkcommunitytalk/shared_invite/zt-1rk11diac-hzGbOEdBHgjXf02IZ1mvUA

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 28 Mar 2023 at 19:38, Mich Talebzadeh 
wrote:

> Hi Bjorn,
>
> you just need to create an account on slack and join any topic I believe
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 28 Mar 2023 at 18:57, Bjørn Jørgensen 
> wrote:
>
>> Do I need to get an invite before joining?
>>
>>
>> tir. 28. mar. 2023 kl. 18:51 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> Hi all,
>>>
>>> There is a section in slack called webinars
>>>
>>>
>>> https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG
>>>
>>> Asma Zgolli, agreed to prepare materials for  Spark internals and/or
>>> comparing spark 3 and 2.
>>>
>>> I like to contribute to "Spark Streaming & Spark Structured Streaming"
>>> plus "Spark on k8s for both GCP and EKS concepts and contrasts"
>>>
>>> Other topics are mentioned below.
>>>
>>> -- Spark UI
>>> -- Dynamic allocation
>>> -- Tuning of jobs
>>> -- Collecting spark metrics for monitoring and alerting
>>> -- For those who prefer to use Pandas API on Spark since the release of
>>> Spark 3.2, What are some important notes for those users? For example, what
>>> are the additional factors affecting the Spark -- Performance using Pandas
>>> API on Spark? How to tune them in addition to the conventional Spark tuning
>>> methods applied to Spark SQL users.
>>> -- Spark internals and/or comparing spark 3 and 2
>>> -- Spark Streaming & Spark Structured Streaming
>>> -- Spark on notebooks
>>> -- Spark on serverless (for example Spark on Google Cloud)
>>> -- Spark on k8s
>>>
>>>
>>> If you are willing to contribute to presentation materials, please
>>> register your interest in slack/webinars.
>>>
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>>
>>> Spark on k8s for both GCP and EKS concepts and contrasts
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 28 Mar 2023 at 13:55, asma zgolli  wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I suggest using the slack for the spark community created recently to
&g

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
Hi Bjorn,

you just need to create an account on slack and join any topic I believe

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 28 Mar 2023 at 18:57, Bjørn Jørgensen 
wrote:

> Do I need to get an invite before joining?
>
>
> tir. 28. mar. 2023 kl. 18:51 skrev Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>> Hi all,
>>
>> There is a section in slack called webinars
>>
>>
>> https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG
>>
>> Asma Zgolli, agreed to prepare materials for  Spark internals and/or
>> comparing spark 3 and 2.
>>
>> I like to contribute to "Spark Streaming & Spark Structured Streaming"
>> plus "Spark on k8s for both GCP and EKS concepts and contrasts"
>>
>> Other topics are mentioned below.
>>
>> -- Spark UI
>> -- Dynamic allocation
>> -- Tuning of jobs
>> -- Collecting spark metrics for monitoring and alerting
>> -- For those who prefer to use Pandas API on Spark since the release of
>> Spark 3.2, What are some important notes for those users? For example, what
>> are the additional factors affecting the Spark -- Performance using Pandas
>> API on Spark? How to tune them in addition to the conventional Spark tuning
>> methods applied to Spark SQL users.
>> -- Spark internals and/or comparing spark 3 and 2
>> -- Spark Streaming & Spark Structured Streaming
>> -- Spark on notebooks
>> -- Spark on serverless (for example Spark on Google Cloud)
>> -- Spark on k8s
>>
>>
>> If you are willing to contribute to presentation materials, please
>> register your interest in slack/webinars.
>>
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>>
>> Spark on k8s for both GCP and EKS concepts and contrasts
>>
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 28 Mar 2023 at 13:55, asma zgolli  wrote:
>>
>>> Hello everyone,
>>>
>>> I suggest using the slack for the spark community created recently to
>>> collaborate and work together on these topics and use the LinkedIn page to
>>> publish the events and the webinars.
>>>
>>> Cheers,
>>> Asma
>>>
>>> Le jeu. 16 mars 2023 à 01:39, Denny Lee  a
>>> écrit :
>>>
>>>> What we can do is get into the habit of compiling the list on LinkedIn
>>>> but making sure this list is shared and broadcast here, eh?!
>>>>
>>>> As well, when we broadcast the videos, we can do this using zoom/jitsi/
>>>> riverside.fm as well as simulcasting this on LinkedIn. This way you
>>>> can view directly on the former without ever logging in with a user ID.
>>>>
>>>> HTH!!
>>>>
>>>> On Wed, Mar 15, 2023 at 4:30 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Understood Nitin It would be wrong to act against one's conviction. I
>>>>> am sure we can find a way around providing the contents
>>>>>
>>>>> Regards
>>>>>
>>>>> Mich

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
Hi all,

There is a section in slack called webinars

https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG

Asma Zgolli agreed to prepare materials for Spark internals and/or
comparing Spark 3 and 2.

I would like to contribute to "Spark Streaming & Spark Structured Streaming" plus
"Spark on k8s for both GCP and EKS concepts and contrasts".

Other topics are mentioned below.

-- Spark UI
-- Dynamic allocation
-- Tuning of jobs
-- Collecting spark metrics for monitoring and alerting
-- For those who prefer to use the Pandas API on Spark since the release of
Spark 3.2, what are some important notes for those users? For example, what
are the additional factors affecting Spark performance when using the Pandas
API on Spark? How to tune them in addition to the conventional Spark tuning
methods applied to Spark SQL users.
-- Spark internals and/or comparing spark 3 and 2
-- Spark Streaming & Spark Structured Streaming
-- Spark on notebooks
-- Spark on serverless (for example Spark on Google Cloud)
-- Spark on k8s


If you are willing to contribute to presentation materials, please register
your interest in slack/webinars.


Thanks

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh




Spark on k8s for both GCP and EKS concepts and contrasts

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 28 Mar 2023 at 13:55, asma zgolli  wrote:

> Hello everyone,
>
> I suggest using the slack for the spark community created recently to
> collaborate and work together on these topics and use the LinkedIn page to
> publish the events and the webinars.
>
> Cheers,
> Asma
>
> Le jeu. 16 mars 2023 à 01:39, Denny Lee  a écrit :
>
>> What we can do is get into the habit of compiling the list on LinkedIn
>> but making sure this list is shared and broadcast here, eh?!
>>
>> As well, when we broadcast the videos, we can do this using zoom/jitsi/
>> riverside.fm as well as simulcasting this on LinkedIn. This way you can
>> view directly on the former without ever logging in with a user ID.
>>
>> HTH!!
>>
>> On Wed, Mar 15, 2023 at 4:30 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Understood Nitin It would be wrong to act against one's conviction. I am
>>> sure we can find a way around providing the contents
>>>
>>> Regards
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 15 Mar 2023 at 22:34, Nitin Bhansali 
>>> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> Thanks for your prompt response ... much appreciated. I know how to and
>>>> can create login IDs on such sites but I had taken conscious decision some
>>>> 20 years ago ( and i will be going against my principles) not to be on such
>>>> sites. Hence I had asked for is there any other way I can join/view
>>>> recording of webinar.
>>>>
>>>> Anyways not to worry.
>>>>
>>>> Thanks & Regards
>>>>
>>>> Nitin.
>>>>
>>>>
>>>> On Wednesday, 15 March 2023 at 20:37:55 GMT, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>
>>>> Hi Nitin,
>>>>
>>

Re: Slack for PySpark users

2023-03-28 Thread Mich Talebzadeh
I created one at slack called pyspark


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 28 Mar 2023 at 03:52, asma zgolli  wrote:

> +1 good idea, I d like to join as well.
>
> Le mar. 28 mars 2023 à 04:09, Winston Lai  a
> écrit :
>
>> Please let us know when the channel is created. I'd like to join :)
>>
>> Thank You & Best Regards
>> Winston Lai
>> --
>> *From:* Denny Lee 
>> *Sent:* Tuesday, March 28, 2023 9:43:08 AM
>> *To:* Hyukjin Kwon 
>> *Cc:* keen ; user@spark.apache.org > >
>> *Subject:* Re: Slack for PySpark users
>>
>> +1 I think this is a great idea!
>>
>> On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon  wrote:
>>
>> Yeah, actually I think we should better have a slack channel so we can
>> easily discuss with users and developers.
>>
>> On Tue, 28 Mar 2023 at 03:08, keen  wrote:
>>
>> Hi all,
>> I really like *Slack *as communication channel for a tech community.
>> There is a Slack workspace for *delta lake users* (
>> https://go.delta.io/slack) that I enjoy a lot.
>> I was wondering if there is something similar for PySpark users.
>>
>> If not, would there be anything wrong with creating a new Slack workspace
>> for PySpark users? (when explicitly mentioning that this is *not*
>> officially part of Apache Spark)?
>>
>> Cheers
>> Martin
>>
>>
>
> --
> Asma ZGOLLI
>
> Ph.D. in Big Data - Applied Machine Learning
>
>


Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Mich Talebzadeh
Agreed. How does asynchronous communication relate to Spark Structured
Streaming?

In your previous post, you made Spark run on the driver in a single JVM.
You attempted to increase the number of executors to 3 after submission of
the job, which (as Sean alluded to) would not work. So if you want to
improve the performance of the Spark job, you will need to submit it along
the lines below (illustration only), specifying your configuration
(parameters such as the number of executors) at the time of submission:

 spark-submit --verbose \
   --deploy-mode client \
 .
   --conf "spark.driver.memory"=4G \
   --conf "spark.executor.memory"=4G \
   --conf "spark.num.executors"=4 \
   --conf "spark.executor.cores"=2 \

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 26 Mar 2023 at 16:47, Sean Owen  wrote:

> What do you mean by asynchronously here?
>
> On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis <
> kritharakismano...@gmail.com> wrote:
>
>> Hello again,
>>
>> Do we have any news for the above question?
>> I would really appreciate it.
>>
>> Thank you,
>>
>> --
>>
>> Emmanouil (Manos) Kritharakis
>>
>> Ph.D. candidate in the Department of Computer Science
>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>
>> Boston University
>>
>>
>> On Tue, Mar 14, 2023 at 12:04 PM Emmanouil Kritharakis <
>> kritharakismano...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I hope this email finds you well!
>>>
>>> I have a simple dataflow in which I read from a kafka topic, perform a
>>> map transformation and then I write the result to another topic. Based on
>>> your documentation here
>>> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
>>> I need to work with Dataset data structures. Even though my solution works,
>>> I need to utilize map transformation asynchronously. So my question is how
>>> can I asynchronously call map transformation with Dataset data structures
>>> in a java structured streaming environment? Can you please share a working
>>> example?
>>>
>>> I am looking forward to hearing from you as soon as possible. Thanks in
>>> advance!
>>>
>>> Kind regards
>>>
>>> --
>>>
>>> Emmanouil (Manos) Kritharakis
>>>
>>> Ph.D. candidate in the Department of Computer Science
>>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>>
>>> Boston University
>>>
>>


Re: Adding OpenSearch as a secondary index provider to SparkSQL

2023-03-24 Thread Mich Talebzadeh
Hi,

Are you talking about intelligent index scan here?

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 24 Mar 2023 at 07:03, Anirudha Jadhav  wrote:

> Hello community, wanted your opinion on this implementation demo.
>
> / support for Materialized views, skipping indices and covered indices
> with bloom filter optimizations with opensearch via SparkSQL
>
> https://github.com/opensearch-project/sql/discussions/1465
> ( see video with voice over )
>
> Ani
> --
> Anirudha P. Jadhav
>


Re: Question related to parallelism using structed streaming parallelism

2023-03-21 Thread Mich Talebzadeh
or download it from here

https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 21 Mar 2023 at 15:38, Mich Talebzadeh 
wrote:

> Hi Emmanouil and anyone else interested
>
> Sounds like you may benefit from this booklet. Not the latest but good
> enough.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 21 Mar 2023 at 12:21, Sean Owen  wrote:
>
>> Yes more specifically, you can't ask for executors once the app starts,
>> in SparkConf like that. You set this when you launch it against a Spark
>> cluster in spark-submit or otherwise.
>>
>> On Tue, Mar 21, 2023 at 4:23 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Emmanouil,
>>>
>>> This means that your job is running on the driver as a single JVM, hence
>>> active(1)
>>>
>>>


Topics for Spark online classes & webinars, next steps

2023-03-21 Thread Mich Talebzadeh
Hi all,



As you may be aware we are proposing to set-up community classes and
webinars for Spark interest group or simply for those who could benefit
from them.



@Denny Lee and I had a discussion on how to
put this framework forward. The idea is first and foremost to get support
from peers (whether presenters or participants), with the notion of
contributions from everyone.



We propose that these presentations could be at the starter, intermediate
and advanced levels to appeal to everyone. We thus welcome all candidates
who want to contribute and present, regardless of their depth of Spark
knowledge. You have already seen the list of topics of interest.



https://www.linkedin.com/posts/apachespark_topics-for-spark-online-classes-webinars-activity-7041754636860411904-wkfj?utm_source=share_medium=member_desktop



 If you would like to contribute, you can send me or @Denny Lee
  an email stating which topic and at what level you
would like to take part. We propose to do a peer review of the draft
presentation so no worries.



Looking forward to hearing from you.



HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark StructuredStreaming - watermark not working as expected

2023-03-17 Thread Mich Talebzadeh
Hi Karan,

The version tested was 3.1.1. Are you running on Dataproc serverless  3.1.3?


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 16 Mar 2023 at 23:49, karan alang  wrote:

> Fyi .. apache spark version is 3.1.3
>
> On Wed, Mar 15, 2023 at 4:34 PM karan alang  wrote:
>
>> Hi Mich, this doesn't seem to be working for me .. the watermark seems to
>> be getting ignored !
>>
>> Here is the data put into Kafka :
>>
>> ```
>>
>>
>> +---++
>>
>> |value
>> |key |
>>
>>
>> +---++
>>
>>
>> |{"temparature":14,"insert_ts":"2023-03-15T16:04:33.003-07:00","ts":"2023-03-15T15:12:00.000-07:00"}|null|
>>
>>
>> |{"temparature":10,"insert_ts":"2023-03-15T16:05:58.816-07:00","ts":"2023-03-15T16:12:00.000-07:00"}|null|
>>
>>
>> |{"temparature":17,"insert_ts":"2023-03-15T16:07:55.222-07:00","ts":"2023-03-15T16:12:00.000-07:00"}|null|
>>
>> |{"temparature":6,"insert_ts":"2023-03-15T16:11:41.759-07:00","ts":"2023-03-13T10:12:00.000-07:00"}
>> |null|
>>
>>
>> +---++
>> ```
>> Note :
>> insert_ts - specifies when the data was inserted
>>
>> Here is the output of the Structured Stream:
>>
>> ---
>>
>> Batch: 2
>>
>> ---
>>
>> +---+---+---+
>>
>> |startOfWindowFrame |endOfWindowFrame   |Sum_Temperature|
>>
>> +---+---+---+
>>
>> |2023-03-15 16:10:00|2023-03-15 16:15:00|27 |
>>
>> |2023-03-15 15:10:00|2023-03-15 15:15:00|14 |
>>
>> |2023-03-13 10:10:00|2023-03-13 10:15:00|6  |
>>
>> +---+---+---+
>>
>> Note: I'm summing up the temperatures (for easy verification)
>>
>> As per the above - all the 3 'ts' are included in the DataFrame, even
>> when I added   "ts":"2023-03-13T10:12:00.000-07:00", as the last record.
>> Since the watermark is set to "5 minutes" and the max(ts) ==
>> 2023-03-15T16:12:00.000-07:00
>> record with ts = "2023-03-13T10:12:00.000-07:00" should have got
>> dropped, it is more than 2 days old (i.e. dated - 2023-03-13)!
>>
>> Any ideas what needs to be changed to make this work ?
>>
>> Here is the code (modified for my requirement, but essentially the same)
>> ```
>>
>> schema = StructType([
>> StructField("temparature", LongType(), False),
>> StructField("ts", TimestampType(), False),
>> StructField("insert_ts", TimestampType(), False)
>> ])
>>
>> streamingDataFrame = spark \
>> .readStream \
>> .format("kafka") \
>> .option("kafka.bootstrap.servers", kafkaBrokers) \
>> .option("group.id", 'watermark-grp') \
>> .option("subscribe", topic) \
>> .option("failOnDataLoss", "false") \
>> .option("includeHeaders", "true") \
>> .option("startingOffsets", "latest") \
>> .load() \
>> .select(from_json(col("value").cast("string"), 
>> schema=schema).alias("parsed_value"))
>>
>> resultC = streamingDataFrame.select( 
>> col("parsed_value.ts").alias("timesta

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
Understood Nitin. It would be wrong to act against one's conviction. I am
sure we can find a way around providing the contents.

Regards

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Mar 2023 at 22:34, Nitin Bhansali 
wrote:

> Hi Mich,
>
> Thanks for your prompt response ... much appreciated. I know how to and
> can create login IDs on such sites but I had taken conscious decision some
> 20 years ago ( and i will be going against my principles) not to be on such
> sites. Hence I had asked for is there any other way I can join/view
> recording of webinar.
>
> Anyways not to worry.
>
> Thanks & Regards
>
> Nitin.
>
>
> On Wednesday, 15 March 2023 at 20:37:55 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Hi Nitin,
>
> Linkedin is more of a professional media.  FYI, I am only a member of
> Linkedin, no facebook, etc.There is no reason for you NOT to create a
> profile for yourself  in linkedin :)
>
>
> https://www.linkedin.com/help/linkedin/answer/a1338223/sign-up-to-join-linkedin?lang=en
>
> see you there as well.
>
> Best of luck.
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead,
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Mar 2023 at 18:31, Nitin Bhansali 
> wrote:
>
> Hello Mich,
>
> My apologies  ...  but I am not on any of such social/professional sites?
> Any other way to access such webinars/classes?
>
> Thanks & Regards
> Nitin.
>
> On Wednesday, 15 March 2023 at 18:26:51 GMT, Denny Lee <
> denny.g@gmail.com> wrote:
>
>
> Thanks Mich for tackling this!  I encourage everyone to add to the list so
> we can have a comprehensive list of topics, eh?!
>
> On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
> wrote:
>
> Hi all,
>
> Thanks to @Denny Lee   to give access to
>
> https://www.linkedin.com/company/apachespark/
>
> and contribution from @asma zgolli 
>
> You will see my post at the bottom. Please add anything else on topics to
> the list as a comment.
>
> We will then put them together in an article perhaps. Comments and
> contributions are welcome.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead,
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh 
> wrote:
>
> Hi Denny,
>
> That Apache Spark Linkedin page
> https://www.linkedin.com/company/apachespark/ looks fine. It also allows
> a wider audience to benefit from it.
>
> +1 for me
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:
>
> In the past, we've been using the Apache 

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
Hi Nitin,

LinkedIn is more of a professional medium. FYI, I am only a member of
LinkedIn, no Facebook, etc. There is no reason for you NOT to create a
profile for yourself on LinkedIn :)

https://www.linkedin.com/help/linkedin/answer/a1338223/sign-up-to-join-linkedin?lang=en

see you there as well.

Best of luck.


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead,
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Mar 2023 at 18:31, Nitin Bhansali 
wrote:

> Hello Mich,
>
> My apologies  ...  but I am not on any of such social/professional sites?
> Any other way to access such webinars/classes?
>
> Thanks & Regards
> Nitin.
>
> On Wednesday, 15 March 2023 at 18:26:51 GMT, Denny Lee <
> denny.g@gmail.com> wrote:
>
>
> Thanks Mich for tackling this!  I encourage everyone to add to the list so
> we can have a comprehensive list of topics, eh?!
>
> On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh 
> wrote:
>
> Hi all,
>
> Thanks to @Denny Lee   to give access to
>
> https://www.linkedin.com/company/apachespark/
>
> and contribution from @asma zgolli 
>
> You will see my post at the bottom. Please add anything else on topics to
> the list as a comment.
>
> We will then put them together in an article perhaps. Comments and
> contributions are welcome.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead,
> Palantir Technologies Limited
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh 
> wrote:
>
> Hi Denny,
>
> That Apache Spark Linkedin page
> https://www.linkedin.com/company/apachespark/ looks fine. It also allows
> a wider audience to benefit from it.
>
> +1 for me
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:
>
> In the past, we've been using the Apache Spark LinkedIn page
> <https://www.linkedin.com/company/apachespark/> and group to broadcast
> these type of events - if you're cool with this?  Or we could go through
> the process of submitting and updating the current
> https://spark.apache.org or request to leverage the original Spark
> confluence page <https://cwiki.apache.org/confluence/display/SPARK>.
>  WDYT?
>
> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh 
> wrote:
>
> Well that needs to be created first for this purpose. The appropriate name
> etc. to be decided. Maybe @Denny Lee   can
> facilitate this as he offered his help.
>
>
> cheers
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>
> Hello Mich,
>
> Can you please provide the link for the confluence page?
>
> Many thanks
> Asma
> Ph.D. in Big Data - Applied Machine Learning
>
> Le lun. 13 mars 2023 à 17:21, Mich T

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
Hi all,

Thanks to @Denny Lee   to give access to

https://www.linkedin.com/company/apachespark/

and contribution from @asma zgolli 

You will see my post at the bottom. Please add anything else on topics to
the list as a comment.

We will then put them together in an article perhaps. Comments and
contributions are welcome.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead,
Palantir Technologies Limited



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh 
wrote:

> Hi Denny,
>
> That Apache Spark Linkedin page
> https://www.linkedin.com/company/apachespark/ looks fine. It also allows
> a wider audience to benefit from it.
>
> +1 for me
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:
>
>> In the past, we've been using the Apache Spark LinkedIn page
>> <https://www.linkedin.com/company/apachespark/> and group to broadcast
>> these type of events - if you're cool with this?  Or we could go through
>> the process of submitting and updating the current
>> https://spark.apache.org or request to leverage the original Spark
>> confluence page <https://cwiki.apache.org/confluence/display/SPARK>.
>>  WDYT?
>>
>> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well that needs to be created first for this purpose. The appropriate
>>> name etc. to be decided. Maybe @Denny Lee   can
>>> facilitate this as he offered his help.
>>>
>>>
>>> cheers
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>>>
>>>> Hello Mich,
>>>>
>>>> Can you please provide the link for the confluence page?
>>>>
>>>> Many thanks
>>>> Asma
>>>> Ph.D. in Big Data - Applied Machine Learning
>>>>
>>>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Apologies I missed the list.
>>>>>
>>>>> To move forward I selected these topics from the thread "Online
>>>>> classes for spark topics".
>>>>>
>>>>> To take this further I propose a confluence page to be set up.
>>>>>
>>>>>
>>>>>1. Spark UI
>>>>>2. Dynamic allocation
>>>>>3. Tuning of jobs
>>>>>4. Collecting spark metrics for monitoring and alerting
>>>>>5.  For those who prefer to use Pandas API on Spark since the
>>>>>release of Spark 3.2, What are some important notes for those users? 
>>>>> For
>>>>>example, what are the additional factors affecting the Spark 
>>>>> performance
>>>>>using Pandas API on Spark? How to tune them in addition to the 
>>>>> conventional
>>>>>Spark tuning methods applied to Spark SQL users.
>>>>>    6. Spark internals and/or comparing spar

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Mich Talebzadeh
In Spark Structured Streaming we cannot apply repartition() to a running
query without stopping and restarting the streaming process.

Admittedly, it is not a parameter that I have played around with. I
still think the Spark GUI should provide some insight.
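
For what it is worth, here is a minimal PySpark sketch of where repartitioning can
still be applied while a query is running, namely inside a foreachBatch sink where
each micro-batch is a plain DataFrame. The broker address, topic name, paths and
the value 8 below are all placeholders, not recommendations.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-in-foreachBatch").getOrCreate()

# hypothetical Kafka source; broker and topic are placeholders
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "temperature") \
    .load()

def write_batch(batch_df, batch_id):
    # batch_df is a regular DataFrame, so repartition() can be applied here
    # on every micro-batch without restarting the query
    batch_df.repartition(8) \
        .write.mode("append") \
        .parquet("/tmp/temperature_sink")

query = df.writeStream \
    .foreachBatch(write_batch) \
    .option("checkpointLocation", "/tmp/temperature_chk") \
    .start()
```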








   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 14 Mar 2023 at 16:42, Sean Owen  wrote:

> That's incorrect, it's spark.default.parallelism, but as the name
> suggests, that is merely a default. You control partitioning directly with
> .repartition()
>
> On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Check this link
>>
>>
>> https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/
>>
>> You can set it
>>
>> spark.conf.set("sparkDefaultParallelism", value)
>>
>>
>> Have a look at Streaming statistics in the Spark GUI, especially *Processing
>> Time*, defined by the Spark GUI as the time taken to process all jobs of a
>> batch. The *Scheduling Delay* and the *Total Delay* are additional
>> indicators of health.
>>
>>
>> then decide how to set the value.
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
>> kritharakismano...@gmail.com> wrote:
>>
>>> Yes I need to check the performance of my streaming job in terms of
>>> latency and throughput. Is there any working example of how to increase the
>>> parallelism with spark structured streaming  using Dataset data structures?
>>> Thanks in advance.
>>>
>>> Kind regards,
>>>
>>> --
>>>
>>> Emmanouil (Manos) Kritharakis
>>>
>>> Ph.D. candidate in the Department of Computer Science
>>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>>
>>> Boston University
>>>
>>>
>>> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> What benefits are you expecting from increasing parallelism? Better
>>>> throughput?
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>>>> kritharakismano...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I hope this email finds you well!
>>>>>
>>>>> I have a simple dataflow in which I read from a kafka topic, perform a
>>>>> map transformation and then I write the result to another topic. Based on
>>>>> your documentation here
>>>>> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
>>>>> I need to work with Dataset data structures. Even though my solution 
>>>>> works,
>>>>> I need

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Mich Talebzadeh
Check this link

https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/

You can set it

spark.conf.set("sparkDefaultParallelism", value])


Have a look at Streaming statistics in the Spark GUI, especially *Processing
Time*, defined by the Spark GUI as the time taken to process all jobs of a batch.
The *Scheduling Delay* and the *Total Delay* are additional indicators
of health.


then decide how to set the value.
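
For illustration, a minimal sketch of inspecting and setting these values. The
numbers are arbitrary examples, not recommendations; spark.default.parallelism
mainly affects RDD operations and is normally fixed before the SparkContext is
created, while spark.sql.shuffle.partitions governs DataFrame/SQL shuffles.

```
from pyspark.sql import SparkSession

# arbitrary example values, set before the context is created
spark = SparkSession.builder \
    .appName("parallelism-demo") \
    .config("spark.default.parallelism", "16") \
    .config("spark.sql.shuffle.partitions", "16") \
    .getOrCreate()

# inspect the effective settings
print(spark.sparkContext.defaultParallelism)
print(spark.conf.get("spark.sql.shuffle.partitions"))

# partitioning of a specific DataFrame can also be controlled directly
df = spark.range(1_000_000).repartition(16)
print(df.rdd.getNumPartitions())
```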


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 14 Mar 2023 at 16:04, Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Yes I need to check the performance of my streaming job in terms of
> latency and throughput. Is there any working example of how to increase the
> parallelism with spark structured streaming  using Dataset data structures?
> Thanks in advance.
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> <https://sites.bu.edu/casp/people/ekritharakis/>
>
> Boston University
>
>
> On Tue, Mar 14, 2023 at 12:01 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> What benefits are you expecting from increasing parallelism? Better throughput?
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
>> kritharakismano...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I hope this email finds you well!
>>>
>>> I have a simple dataflow in which I read from a kafka topic, perform a
>>> map transformation and then I write the result to another topic. Based on
>>> your documentation here
>>> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
>>> I need to work with Dataset data structures. Even though my solution works,
>>> I need to increase the parallelism. The spark documentation includes a lot
>>> of parameters that I can change based on specific data structures like
>>> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The
>>> former is the default number of partitions in RDDs returned by
>>> transformations like join, reduceByKey, while the latter is not recommended
>>> for structured streaming as it is described in documentation: "Note: For
>>> structured streaming, this configuration cannot be changed between query
>>> restarts from the same checkpoint location".
>>>
>>> So my question is how can I increase the parallelism for a simple
>>> dataflow based on datasets with a map transformation only?
>>>
>>> I am looking forward to hearing from you as soon as possible. Thanks in
>>> advance!
>>>
>>> Kind regards,
>>>
>>> --
>>>
>>> Emmanouil (Manos) Kritharakis
>>>
>>> Ph.D. candidate in the Department of Computer Science
>>> <https://sites.bu.edu/casp/people/ekritharakis/>
>>>
>>> Boston University
>>>
>>


Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Mich Talebzadeh
What benefits are you expecting from increasing parallelism? Better throughput?



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 14 Mar 2023 at 15:58, Emmanouil Kritharakis <
kritharakismano...@gmail.com> wrote:

> Hello,
>
> I hope this email finds you well!
>
> I have a simple dataflow in which I read from a kafka topic, perform a map
> transformation and then I write the result to another topic. Based on your
> documentation here
> <https://spark.apache.org/docs/3.3.2/structured-streaming-kafka-integration.html#content>,
> I need to work with Dataset data structures. Even though my solution works,
> I need to increase the parallelism. The spark documentation includes a lot
> of parameters that I can change based on specific data structures like
> *spark.default.parallelism* or *spark.sql.shuffle.partitions*. The former
> is the default number of partitions in RDDs returned by transformations
> like join, reduceByKey, while the latter is not recommended for structured
> streaming as it is described in documentation: "Note: For structured
> streaming, this configuration cannot be changed between query restarts from
> the same checkpoint location".
>
> So my question is how can I increase the parallelism for a simple dataflow
> based on datasets with a map transformation only?
>
> I am looking forward to hearing from you as soon as possible. Thanks in
> advance!
>
> Kind regards,
>
> --
>
> Emmanouil (Manos) Kritharakis
>
> Ph.D. candidate in the Department of Computer Science
> <https://sites.bu.edu/casp/people/ekritharakis/>
>
> Boston University
>


Re: Topics for Spark online classes & webinars

2023-03-14 Thread Mich Talebzadeh
Hi Denny,

That Apache Spark Linkedin page
https://www.linkedin.com/company/apachespark/ looks fine. It also allows a
wider audience to benefit from it.

+1 for me



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 14 Mar 2023 at 14:23, Denny Lee  wrote:

> In the past, we've been using the Apache Spark LinkedIn page
> <https://www.linkedin.com/company/apachespark/> and group to broadcast
> these type of events - if you're cool with this?  Or we could go through
> the process of submitting and updating the current
> https://spark.apache.org or request to leverage the original Spark
> confluence page <https://cwiki.apache.org/confluence/display/SPARK>.
>  WDYT?
>
> On Mon, Mar 13, 2023 at 9:34 AM Mich Talebzadeh 
> wrote:
>
>> Well that needs to be created first for this purpose. The appropriate
>> name etc. to be decided. Maybe @Denny Lee   can
>> facilitate this as he offered his help.
>>
>>
>> cheers
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:
>>
>>> Hello Mich,
>>>
>>> Can you please provide the link for the confluence page?
>>>
>>> Many thanks
>>> Asma
>>> Ph.D. in Big Data - Applied Machine Learning
>>>
>>> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh 
>>> wrote:
>>>
>>>> Apologies I missed the list.
>>>>
>>>> To move forward I selected these topics from the thread "Online classes
>>>> for spark topics".
>>>>
>>>> To take this further I propose a confluence page to be set up.
>>>>
>>>>
>>>>1. Spark UI
>>>>2. Dynamic allocation
>>>>3. Tuning of jobs
>>>>4. Collecting spark metrics for monitoring and alerting
>>>>5.  For those who prefer to use Pandas API on Spark since the
>>>>release of Spark 3.2, What are some important notes for those users? For
>>>>example, what are the additional factors affecting the Spark performance
>>>>using Pandas API on Spark? How to tune them in addition to the 
>>>> conventional
>>>>Spark tuning methods applied to Spark SQL users.
>>>>6. Spark internals and/or comparing spark 3 and 2
>>>>7. Spark Streaming & Spark Structured Streaming
>>>>8. Spark on notebooks
>>>>9. Spark on serverless (for example Spark on Google Cloud)
>>>>10. Spark on k8s
>>>>
>>>> Opinions and how to is welcome
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi guys
>>>>>
>>>>> To move forward I selected these topics from the thread "Online
>>>>> classes for spark topics".
>>>>>
>>>>> To take this further I propose a confluence page to be set up.
>>>>>
>>>>> Opinions and how to is welcome
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>


Re: Topics for Spark online classes & webinars

2023-03-13 Thread Mich Talebzadeh
Well that needs to be created first for this purpose. The appropriate name
etc. to be decided. Maybe @Denny Lee   can
facilitate this as he offered his help.


cheers



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 13 Mar 2023 at 16:29, asma zgolli  wrote:

> Hello Mich,
>
> Can you please provide the link for the confluence page?
>
> Many thanks
> Asma
> Ph.D. in Big Data - Applied Machine Learning
>
> On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh 
> wrote:
>
>> Apologies I missed the list.
>>
>> To move forward I selected these topics from the thread "Online classes
>> for spark topics".
>>
>> To take this further I propose a confluence page to be set up.
>>
>>
>>1. Spark UI
>>2. Dynamic allocation
>>3. Tuning of jobs
>>4. Collecting spark metrics for monitoring and alerting
>>5.  For those who prefer to use Pandas API on Spark since the release
>>of Spark 3.2, What are some important notes for those users? For example,
>>what are the additional factors affecting the Spark performance using
>>Pandas API on Spark? How to tune them in addition to the conventional 
>> Spark
>>tuning methods applied to Spark SQL users.
>>6. Spark internals and/or comparing spark 3 and 2
>>7. Spark Streaming & Spark Structured Streaming
>>8. Spark on notebooks
>>9. Spark on serverless (for example Spark on Google Cloud)
>>10. Spark on k8s
>>
>> Opinions and how to is welcome
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh 
>> wrote:
>>
>>> Hi guys
>>>
>>> To move forward I selected these topics from the thread "Online classes
>>> for spark topics".
>>>
>>> To take this further I propose a confluence page to be set up.
>>>
>>> Opinions and how to is welcome
>>>
>>> Cheers
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>
>
>


Re: Topics for Spark online classes & webinars

2023-03-13 Thread Mich Talebzadeh
Apologies I missed the list.

To move forward I selected these topics from the thread "Online classes for
spark topics".

To take this further I propose a confluence page to be set up.


   1. Spark UI
   2. Dynamic allocation
   3. Tuning of jobs
   4. Collecting spark metrics for monitoring and alerting
   5. For those who prefer to use the Pandas API on Spark since the release of
   Spark 3.2, what are some important notes for those users? For example, what
   are the additional factors affecting Spark performance when using the Pandas
   API on Spark? How to tune them in addition to the conventional Spark tuning
   methods applied to Spark SQL users?
   6. Spark internals and/or comparing spark 3 and 2
   7. Spark Streaming & Spark Structured Streaming
   8. Spark on notebooks
   9. Spark on serverless (for example Spark on Google Cloud)
   10. Spark on k8s

Opinions and how to is welcome


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 13 Mar 2023 at 16:16, Mich Talebzadeh 
wrote:

> Hi guys
>
> To move forward I selected these topics from the thread "Online classes
> for spark topics".
>
> To take this further I propose a confluence page to be set up.
>
> Opinions and how to is welcome
>
> Cheers
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Topics for Spark online classes & webinars

2023-03-13 Thread Mich Talebzadeh
Hi guys

To move forward I selected these topics from the thread "Online classes for
spark topics".

To take this further I propose a confluence page to be set up.

Opinions and how to is welcome

Cheers



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-13 Thread Mich Talebzadeh
Hi Gary

Thanks for the update. So this is serverless Dataproc on 3.3.1. Maybe an
autoscaling policy could be an option. What is the y-axis? Is that the capacity?

Can you break down the join into multiple parts and save the intermediate
result set?
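
As a rough sketch of what breaking the job up might look like, with the
intermediate result persisted so a shuffle failure in the later stage does not
redo the earlier joins. The toy DataFrames, column names and paths below are
placeholders.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("staged-joins").getOrCreate()

# toy stand-ins for the real inputs
df_a = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "a_val"])
df_b = spark.createDataFrame([(1, 10), (2, 20)], ["id", "b_val"])
df_c = spark.createDataFrame([(1, 0.5), (2, 0.7)], ["id", "c_val"])

# stage 1: first join, persisted to intermediate storage
stage1 = df_a.join(df_b, "id")
stage1.write.mode("overwrite").orc("/tmp/stage1")

# stage 2: read the intermediate result back and finish the work
stage1_back = spark.read.orc("/tmp/stage1")
result = stage1_back.join(df_c, "id") \
    .withColumn("total", F.col("b_val") * F.col("c_val"))
result.write.mode("overwrite").orc("/tmp/final")
```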


HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 13 Mar 2023 at 14:56, Gary Liu  wrote:

> Hi Mich,
> I used the serverless spark session, not the local mode in the notebook.
> So machine type does not matter in this case. Below is the chart for
> serverless spark session execution. I also tried to increase executor
> memory and cores, but the issue did not get resolved. I will try shutting
> down autoscaling, and see what will happen.
> [image: Serverless Session Executors-4core.png]
>
>
> On Fri, Mar 10, 2023 at 11:55 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> for your dataproc what type of machines are you using for example
>> n2-standard-4 with 4vCPU and 16GB or something else? how many nodes and if
>> autoscaling turned on.
>>
>> most likely executor memory limit?
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 10 Mar 2023 at 15:35, Gary Liu  wrote:
>>
>>> Hi ,
>>>
>>> I have a job in GCP dataproc server spark session (spark 3.3.2), it is a
>>> job involving multiple joinings, as well as a complex UDF. I always got the
>>> below FetchFailedException, but the job can be done and the results look
>>> right. Neither of 2 input data is very big (one is 6.5M rows*11 columns,
>>> ~150M in orc format and 17.7M rows*11 columns, ~400M in orc format). It ran
>>> very smoothly on an on-premise Spark environment though.
>>>
>>> According to Google's document (
>>> https://cloud.google.com/dataproc/docs/support/spark-job-tuning#shuffle_fetch_failures),
>>> it has 3 solutions:
>>> 1. Using EFM mode
>>> 2. Increase executor memory
>>> 3, decrease the number of job partitions.
>>>
>>> 1. I started the session from a vertex notebook, so I don't know how to
>>> use EFM mode.
>>> 2. I increased executor memory from the default 12GB to 25GB, and the
>>> number of cores from 4 to 8, but it did not solve the problem.
>>> 3. Wonder how to do this? repartition the input dataset to have less
>>> partitions? I used df.rdd.getNumPartitions() to check the input data
>>> partitions, they have 9 and 17 partitions respectively, should I decrease
>>> them further? I also read a post on StackOverflow (
>>> https://stackoverflow.com/questions/34941410/fetchfailedexception-or-metadatafetchfailedexception-when-processing-big-data-se),
>>> saying increasing partitions may help.Which one makes more sense? I
>>> repartitioned the input data to 20 and 30 partitions, but still no luck.
>>>
>>> Any suggestions?
>>>
>>> 23/03/10 14:32:19 WARN TaskSetManager: Lost task 58.1 in stage 27.0 (TID 
>>> 3783) (10.1.0.116 executor 33): FetchFailed(BlockManagerId(72, 10.1.15.199, 
>>> 36791, None), shuffleId=24, mapIndex=77, mapId=3457, reduceId=58, message=
>>> org.apache.spark.shuffle.FetchFailedException
>>> at 
>>> org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:312)
>>> at 
>>> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180)
>>> at 
>>> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:918)
>>> at 
>>> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(Shuf

Re: Online classes for spark topics

2023-03-12 Thread Mich Talebzadeh
Hi Denny,

Thanks for the offer. How do you envisage that structure to be?


Also it would be good to have a webinar (for a given topic)  for different
target audiences as we have a mixture of members in Spark forums. For
example, beginners, intermediate and advanced.


Do we have a Confluence page for Spark so we can use it? I guess that would
be part of the structure you mentioned.


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 12 Mar 2023 at 22:59, Denny Lee  wrote:

> Looks like we have some good topics here - I'm glad to help with setting
> up the infrastructure to broadcast if it helps?
>
> On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani 
> wrote:
>
>> I am happy to be a part of this discussion as well.
>>
>> Regards,
>> Neeraj
>>
>> On Wed, 8 Mar 2023 at 22:41, Winston Lai  wrote:
>>
>>> +1, any webinar on Spark related topic is appreciated 
>>>
>>> Thank You & Best Regards
>>> Winston Lai
>>> --
>>> *From:* asma zgolli 
>>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>>> *To:* karan alang 
>>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com
>>> ; User 
>>> *Subject:* Re: Online classes for spark topics
>>>
>>> +1
>>>
>>> On Wed, 8 Mar 2023 at 21:32, karan alang
>>> wrote:
>>>
>>> +1 .. I'm happy to be part of these discussions as well !
>>>
>>>
>>>
>>>
>>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I guess I can schedule this work over a course of time. I for myself can
>>> contribute plus learn from others.
>>>
>>> So +1 for me.
>>>
>>> Let us see if anyone else is interested.
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>>> wrote:
>>>
>>>
>>> Hello Mich.
>>>
>>> Greetings. Would you be able to arrange for Spark Structured Streaming
>>> learning webinar.?
>>>
>>> This is something I haven been struggling with recently. it will be very
>>> helpful.
>>>
>>> Thanks and Regard
>>>
>>> AK
>>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>> This might  be a worthwhile exercise on the assumption that the
>>> contributors will find the time and bandwidth to chip in so to speak.
>>>
>>> I am sure there are many but on top of my head I can think of Holden
>>> Karau for k8s, and Sean Owen for data science stuff. They are both very
>>> experienced.
>>>
>>> Anyone else 樂
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>>  wrote:
>>>
>>> Hello gurus,
>>>
>>> Does Spark arranges online webinars for special topics like Spark on
>>> K8s, data science and Spark Structured Streaming?
>>>
>>> I would be most grateful if experts can share their experience with
>>> learners with intermediate knowledge like myself. Hopefully we will find
>>> the practical experiences told valuable.
>>>
>>> Respectively,
>>>
>>> AK
>>>
>>>
>>>
>>>
>>


Re: Spark StructuredStreaming - watermark not working as expected

2023-03-12 Thread Mich Talebzadeh
OK

ts is the timestamp right?

This is similar code that works out the average temperature within a time
frame of 5 minutes.

Note the comments and the error handling with try/except:

try:

    # construct a streaming dataframe streamingDataFrame that
    # subscribes to topic temperature
    streamingDataFrame = self.spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", config['MDVariables']['bootstrapServers']) \
        .option("schema.registry.url", config['MDVariables']['schemaRegistryURL']) \
        .option("group.id", config['common']['appName']) \
        .option("zookeeper.connection.timeout.ms", config['MDVariables']['zookeeperConnectionTimeoutMs']) \
        .option("rebalance.backoff.ms", config['MDVariables']['rebalanceBackoffMS']) \
        .option("zookeeper.session.timeout.ms", config['MDVariables']['zookeeperSessionTimeOutMs']) \
        .option("auto.commit.interval.ms", config['MDVariables']['autoCommitIntervalMS']) \
        .option("subscribe", "temperature") \
        .option("failOnDataLoss", "false") \
        .option("includeHeaders", "true") \
        .option("startingOffsets", "latest") \
        .load() \
        .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))

    resultC = streamingDataFrame.select(
          col("parsed_value.rowkey").alias("rowkey")
        , col("parsed_value.timestamp").alias("timestamp")
        , col("parsed_value.temperature").alias("temperature"))

    """
    We work out the window and the AVG(temperature) in the window's timeframe below.
    This should return the following Dataframe as struct

     root
      |-- window: struct (nullable = false)
      |    |-- start: timestamp (nullable = true)
      |    |-- end: timestamp (nullable = true)
      |-- avg(temperature): double (nullable = true)
    """
    resultM = resultC \
        .withWatermark("timestamp", "5 minutes") \
        .groupBy(window(resultC.timestamp, "5 minutes", "5 minutes")) \
        .avg('temperature')

    # We take the above Dataframe and flatten it to get the columns aliased as
    # "startOfWindowFrame", "endOfWindowFrame" and "AVGTemperature"
    resultMF = resultM \
        .select(
              F.col("window.start").alias("startOfWindowFrame")
            , F.col("window.end").alias("endOfWindowFrame")
            , F.col("avg(temperature)").alias("AVGTemperature"))

    resultMF.printSchema()

    result = resultMF \
        .writeStream \
        .outputMode('complete') \
        .option("numRows", 1000) \
        .option("truncate", "false") \
        .format('console') \
        .option('checkpointLocation', checkpoint_path) \
        .queryName("temperature") \
        .start()

except Exception as e:
    print(f"""{e}, quitting""")
    sys.exit(1)



HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 11 Mar 2023 at 04:33, karan alang  wrote:

> Hi Mich -
> Here is the output of the ldf.printSchema() & ldf.show() commands.
>
> ldf.printSchema()
>
> root
>  |-- applianceName: string (nullable = true)
>  |-- timeslot: long (nullable = true)
>  |-- customer: string (nullable = true)
>  |-- window: struct (nullable = false)
>  ||-- start: timestamp (nullable = true)
>  ||-- end: timestamp (nullable = true)
>  |-- sentOctets: long (nullable = true)
>  |-- recvdOctets: long (nullable = true)
>
>
>  ldf.show() :
>
>
>  
> +--+---++--++--+--+-

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread Mich Talebzadeh
collectAsList brings all the data into the driver, which is a single JVM on
a single node. In this case your program may work because effectively you
are not using Spark on YARN across the Hadoop cluster. The benefit of Spark
is that you can process a large amount of data using the memory and
processors of multiple executors on multiple nodes.
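
By way of contrast, a small PySpark sketch; collectAsList in the Java API
corresponds to collect() here, and the paths are placeholders.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-cluster").getOrCreate()
df = spark.read.option("header", "true").csv("hdfs:///tmp/input.csv")

# collect() (collectAsList in Java) pulls every row into the driver JVM:
# fine for small results, a bottleneck or out-of-memory risk for large ones
rows = df.collect()

# a distributed action such as write keeps the work on the executors
df.write.mode("overwrite").parquet("hdfs:///tmp/output_parquet")
```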


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 11 Mar 2023 at 19:01, sam smith  wrote:

> not sure what you mean by your question, but it is not helping in any case
>
>
> Le sam. 11 mars 2023 à 19:54, Mich Talebzadeh 
> a écrit :
>
>>
>>
>> ... To note that if I execute collectAsList on the dataset at the
>> beginning of the program
>>
>> What do you think  collectAsList does?
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 11 Mar 2023 at 18:29, sam smith 
>> wrote:
>>
>>> Hello guys,
>>>
>>> I am launching through code (client mode) a Spark program to run in
>>> Hadoop. If I execute on the dataset methods of the likes of show() and
>>> count() or collectAsList() (that are displayed in the Spark UI) after
>>> performing heavy transformations on the columns then the mentioned methods
>>> will cause the execution to freeze on Hadoop and that independently of the
>>> dataset size (intriguing issue for small size datasets!).
>>> Any idea what could be causing this type of issue?
>>> To note that if I execute collectAsList on the dataset at the beginning
>>> of the program (before performing the transformations on the columns) then
>>> the method yields results correctly.
>>>
>>> Thanks.
>>> Regards
>>>
>>>


Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread Mich Talebzadeh
... To note that if I execute collectAsList on the dataset at the beginning
of the program

What do you think  collectAsList does?



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 11 Mar 2023 at 18:29, sam smith  wrote:

> Hello guys,
>
> I am launching through code (client mode) a Spark program to run in
> Hadoop. If I execute on the dataset methods of the likes of show() and
> count() or collectAsList() (that are displayed in the Spark UI) after
> performing heavy transformations on the columns then the mentioned methods
> will cause the execution to freeze on Hadoop and that independently of the
> dataset size (intriguing issue for small size datasets!).
> Any idea what could be causing this type of issue?
> To note that if I execute collectAsList on the dataset at the beginning of
> the program (before performing the transformations on the columns) then the
> method yields results correctly.
>
> Thanks.
> Regards
>
>


Re: Spark StructuredStreaming - watermark not working as expected

2023-03-10 Thread Mich Talebzadeh
Just looking at the code here:


ldf = ldf.groupBy("applianceName", "timeslot", "customer",
                  window(col("ts"), "15 minutes")) \
         .agg({'sentOctets': "sum", 'recvdOctets': "sum"}) \
         .withColumnRenamed('sum(sentOctets)', 'sentOctets') \
         .withColumnRenamed('sum(recvdOctets)', 'recvdOctets') \
         .fillna(0)

What does ldf.printSchema() return?


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 10 Mar 2023 at 07:16, karan alang  wrote:

>
> Hello All -
>
> I've a structured Streaming job which has a trigger of 10 minutes, and I'm
> using watermark to account for late data coming in. However, the watermark
> is not working - and instead of a single record with total aggregated
> value, I see 2 records.
>
> Here is the code :
>
> ```
>
> 1) StructuredStreaming - Reading from Kafka every 10 mins
>
>
> df_stream = self.spark.readStream.format('kafka') \
> .option("kafka.security.protocol", "SSL") \
> .option("kafka.ssl.truststore.location", 
> self.ssl_truststore_location) \
> .option("kafka.ssl.truststore.password", 
> self.ssl_truststore_password) \
> .option("kafka.ssl.keystore.location", 
> self.ssl_keystore_location_bandwidth_intermediate) \
> .option("kafka.ssl.keystore.password", 
> self.ssl_keystore_password_bandwidth_intermediate) \
> .option("kafka.bootstrap.servers", self.kafkaBrokers) \
> .option("subscribe", topic) \
> .option("startingOffsets", "latest") \
> .option("failOnDataLoss", "false") \
> .option("kafka.metadata.max.age.ms", "1000") \
> .option("kafka.ssl.keystore.type", "PKCS12") \
> .option("kafka.ssl.truststore.type", "PKCS12") \
> .load()
>
> 2. calling foreachBatch(self.process)
> # note - outputMode is set to "update" (tried setting outputMode = 
> append as well)
>
> # 03/09 ::: outputMode - update instead of append
> query = df_stream.selectExpr("CAST(value AS STRING)", "timestamp", 
> "topic").writeStream \
> .outputMode("update") \
> .trigger(processingTime='10 minutes') \
> .option("truncate", "false") \
> .option("checkpointLocation", self.checkpoint) \
> .foreachBatch(self.process) \
> .start()
>
>
> self.process - where i do the bulk of the  processing, which calls the 
> function  'aggIntfLogs'
>
> In function aggIntfLogs - i'm using watermark of 15 mins, and doing  groupBy 
> to calculate the sum of sentOctets & recvdOctets
>
>
> def aggIntfLogs(ldf):
> if ldf and ldf.count() > 0:
>
> ldf = ldf.select('applianceName', 'timeslot', 'sentOctets', 
> 'recvdOctets','ts', 'customer') \
> .withColumn('sentOctets', 
> ldf["sentOctets"].cast(LongType())) \
> .withColumn('recvdOctets', 
> ldf["recvdOctets"].cast(LongType())) \
> .withWatermark("ts", "15 minutes")
>
> ldf = ldf.groupBy("applianceName", "timeslot", "customer",
>  
> window(col("ts"), "15 minutes")) \
> .agg({'sentOctets':"sum", 'recvdOctets':"sum"}) \
> .withColumnRenamed('sum(sentOctets)', 'sentOctets') \
> .withColumnRenamed('sum(recvdOctets)', 'recvdOctets') \
> .fillna(0)
> return ldf
> return ldf
>
>
> Dataframe 'ldf' returned from the function aggIntfLogs - is written 
> to Kafka topic
>
> ```
>
> I was expecting that using the watermark will account for late coming data
> .. i.e. the sentOctets & recvdOctets are calculated for the consolidated
> data
> (including late-coming data, since the late coming data comes within 15
> mins), however, I'm seeing 2 records for some of the data (i.e. key -
> applianceName/timeslot/customer) i.e. the aggregated data is calculated
> individually for the records and I see 2 records instead of single record
> accounting for late coming data within watermark.
>
> What needs to be done to fix this & make this work as desired?
>
> tia!
>
>
> Here is the Stackoverflow link as well -
>
>
> https://stackoverflow.com/questions/75693171/spark-structuredstreaming-watermark-not-working-as-expected
>
>
>
>


Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-10 Thread Mich Talebzadeh
for your dataproc what type of machines are you using for example
n2-standard-4 with 4vCPU and 16GB or something else? how many nodes and if
autoscaling turned on.

most likely executor memory limit?


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 10 Mar 2023 at 15:35, Gary Liu  wrote:

> Hi ,
>
> I have a job in GCP dataproc server spark session (spark 3.3.2), it is a
> job involving multiple joinings, as well as a complex UDF. I always got the
> below FetchFailedException, but the job can be done and the results look
> right. Neither of 2 input data is very big (one is 6.5M rows*11 columns,
> ~150M in orc format and 17.7M rows*11 columns, ~400M in orc format). It ran
> very smoothly on an on-premise Spark environment though.
>
> According to Google's document (
> https://cloud.google.com/dataproc/docs/support/spark-job-tuning#shuffle_fetch_failures),
> it has 3 solutions:
> 1. Using EFM mode
> 2. Increase executor memory
> 3, decrease the number of job partitions.
>
> 1. I started the session from a vertex notebook, so I don't know how to
> use EFM mode.
> 2. I increased executor memory from the default 12GB to 25GB, and the
> number of cores from 4 to 8, but it did not solve the problem.
> 3. Wonder how to do this? repartition the input dataset to have less
> partitions? I used df.rdd.getNumPartitions() to check the input data
> partitions, they have 9 and 17 partitions respectively, should I decrease
> them further? I also read a post on StackOverflow (
> https://stackoverflow.com/questions/34941410/fetchfailedexception-or-metadatafetchfailedexception-when-processing-big-data-se),
> saying increasing partitions may help.Which one makes more sense? I
> repartitioned the input data to 20 and 30 partitions, but still no luck.
>
> Any suggestions?
>
> 23/03/10 14:32:19 WARN TaskSetManager: Lost task 58.1 in stage 27.0 (TID 
> 3783) (10.1.0.116 executor 33): FetchFailed(BlockManagerId(72, 10.1.15.199, 
> 36791, None), shuffleId=24, mapIndex=77, mapId=3457, reduceId=58, message=
> org.apache.spark.shuffle.FetchFailedException
>   at 
> org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:312)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1180)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:918)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:85)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
>   at scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:587)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
>   at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
>   at 
> org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.

Re: How to share a dataset file across nodes

2023-03-09 Thread Mich Talebzadeh
Try something like below

1) Put your csv say cities.csv in HDFS as below
hdfs dfs -put cities.csv /data/stg/test
2) Read it into a DataFrame in PySpark as below

csv_file = "hdfs://:PORT/data/stg/test/cities.csv"
# read it in spark
listing_df = spark.read.format("com.databricks.spark.csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load(csv_file)
listing_df.printSchema()
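
As an aside, on recent Spark releases the built-in csv reader does the same job;
a one-line equivalent, assuming the same csv_file variable:

```
listing_df = spark.read.csv(csv_file, header=True, inferSchema=True)
```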


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 9 Mar 2023 at 21:07, Sean Owen  wrote:

> Put the file on HDFS, if you have a Hadoop cluster?
>
> On Thu, Mar 9, 2023 at 3:02 PM sam smith 
> wrote:
>
>> Hello,
>>
>> I use Yarn client mode to submit my driver program to Hadoop, the dataset
>> I load is from the local file system, when i invoke load("file://path")
>> Spark complains about the csv file being not found, which i totally
>> understand, since the dataset is not in any of the workers or the
>> applicationMaster but only where the driver program resides.
>> I tried to share the file using the configurations:
>>
>>> *spark.yarn.dist.files* OR *spark.files *
>>
>> but both ain't working.
>> My question is how to share the csv dataset across the nodes at the
>> specified path?
>>
>> Thanks.
>>
>


Re: read a binary file and save in another location

2023-03-09 Thread Mich Talebzadeh
Does this need any action in PySpark?


How about using the shutil package instead?


https://sparkbyexamples.com/python/how-to-copy-files-in-python/
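
A minimal sketch with shutil, with hypothetical paths; note this runs on the
driver's local filesystem only and does not cover HDFS or object-store locations.

```
import shutil

# copy a single binary file; source and destination are placeholder paths
shutil.copy("/tmp/source/image.bin", "/tmp/target/image.bin")

# or copy a whole directory tree (Python 3.8+ for dirs_exist_ok)
shutil.copytree("/tmp/source", "/tmp/backup", dirs_exist_ok=True)
```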



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 9 Mar 2023 at 17:46, Russell Jurney 
wrote:

> https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
>
> This says "Binary file data source does not support writing a DataFrame
> back to the original files." which I take to mean this isn't possible...
>
> I haven't done this, but going from the docs, it would be:
>
> spark.read.format("binaryFile").option("pathGlobFilter", 
> "*.png").load("/path/to/data").write.format("binaryFile").save("/new/path/to/data")
>
> Looking at the DataFrameWriter code on master branch
> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala>
> for DataFrameWriter, let's see if there is a binaryFile format option...
>
> At this point I get lost. I can't figure out how this works either, but
> hopefully I have helped define the problem. The format() method of
> DataFrameWriter isn't documented
> <https://spark.apache.org/docs/3.1.3/api/java/org/apache/spark/sql/DataFrameWriter.html#format-java.lang.String->
> .
>
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
> <https://calendly.com/rjurney_personal/30min>
>
>
> On Thu, Mar 9, 2023 at 12:52 AM second_co...@yahoo.com.INVALID
>  wrote:
>
>> any example on how to read a binary file using pySpark and save it in
>> another location . copy feature
>>
>>
>> Thank you,
>> Teoh
>>
>


Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread Mich Talebzadeh
most probably we will require an  additional method pause()

https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html

to allow us to pause (as opposed to stop()) the streaming process and
resume after changing the parameters. The state of streaming needs to be
preserved.
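
Until something like pause() exists, a rough sketch of the current workaround,
assuming an existing query handle, Kafka source and checkpoint path (query,
spark, process, checkpoint_path and the option values below are placeholders):
stop the query gracefully, rebuild the source with the changed option, and start
a new query against the same checkpoint so the committed offsets are preserved.

```
# query, spark, process and checkpoint_path are assumed to exist already
query.stop()   # graceful stop of the running query

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "temperature") \
    .option("maxOffsetsPerTrigger", 50000) \
    .load()

new_query = df.writeStream \
    .foreachBatch(process) \
    .option("checkpointLocation", checkpoint_path) \
    .start()
```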

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 7 Mar 2023 at 17:25, Mich Talebzadeh 
wrote:

> hm interesting proposition. I guess you mean altering one of following
> parameters in flight
>
>
>   streamingDataFrame = self.spark \
> .readStream \
> .format("kafka") \
> .option("kafka.bootstrap.servers",
> config['MDVariables']['bootstrapServers'],) \
> .option("schema.registry.url",
> config['MDVariables']['schemaRegistryURL']) \
> .option("group.id", config['common']['appName']) \
> .option("zookeeper.connection.timeout.ms",
> config['MDVariables']['zookeeperConnectionTimeoutMs']) \
> .option("rebalance.backoff.ms",
> config['MDVariables']['rebalanceBackoffMS']) \
> .option("zookeeper.session.timeout.ms",
> config['MDVariables']['zookeeperSessionTimeOutMs']) \
> .option("auto.commit.interval.ms",
> config['MDVariables']['autoCommitIntervalMS']) \
> .option("subscribe", config['MDVariables']['topic']) \
> .option("failOnDataLoss", "false") \
> .option("includeHeaders", "true") \
> .option("startingOffsets", "latest") \
> .load() \
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))
>
> Ok, one secure way of doing it though shutting down the streaming process
> gracefully without loss of data that impacts consumers. The other method
> implies inflight changes as suggested by the topic with zeio interruptions.
> Interestingly one of our clients requested a similar solution. As solutions
> architect /engineering manager I should come back with few options. I am on
> the case so to speak. There is a considerable interest in Spark Structured
> Streaming across the board, especially in trading systems.
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 16 Feb 2023 at 04:12, hueiyuan su  wrote:
>
>> *Component*: Spark Structured Streaming
>> *Level*: Advanced
>> *Scenario*: How-to
>>
>> -
>> *Problems Description*
>> I would like to confirm could we directly apply new options of
>> readStream/writeStream without stopping current running spark structured
>> streaming applications? For example, if we just want to adjust throughput
>> properties of readStream with kafka. Do we have method can just adjust it
>> without stopping application? If you have any ideas, please let me know. I
>> will be appreciate it and your answer.
>>
>>
>> --
>> Best Regards,
>>
>> Mars Su
>> *Phone*: 0988-661-013
>> *Email*: hueiyua...@gmail.com
>>
>


Re: Online classes for spark topics

2023-03-09 Thread Mich Talebzadeh
Hi Deepak,

The priority list of topics is a very good point. The thread owner
mentioned Spark on k8s, data science and Spark Structured Streaming. What
other topics need to be included, I guess, depends on demand. I suggest
we wait a couple of days to see the demand.

We just need to create a draft list of topics of interest and share them in
the forum to get the priority order.

Well, those are my thoughts.

Cheers





   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 9 Mar 2023 at 06:13, Deepak Sharma  wrote:

> I can prepare some topics and present as well , if we have a prioritised
> list of topics already .
>
> On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee  wrote:
>
>> We used to run Spark webinars on the Apache Spark LinkedIn group
>> <https://www.linkedin.com/company/apachespark/?viewAsMember=true> but
>> honestly the turnout was pretty low.  We had dove into various features.
>> If there are particular topics that. you would like to discuss during a
>> live session, please let me know and we can try to restart them.  HTH!
>>
>> On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World  wrote:
>>
>>> +1
>>>
>>> On Wed, Mar 8, 2023 at 10:40 PM Winston Lai 
>>> wrote:
>>>
>>>> +1, any webinar on a Spark-related topic is appreciated
>>>>
>>>> Thank You & Best Regards
>>>> Winston Lai
>>>> --
>>>> *From:* asma zgolli 
>>>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>>>> *To:* karan alang 
>>>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com
>>>> ; User 
>>>> *Subject:* Re: Online classes for spark topics
>>>>
>>>> +1
>>>>
>>>> Le mer. 8 mars 2023 à 21:32, karan alang  a
>>>> écrit :
>>>>
>>>> +1 .. I'm happy to be part of these discussions as well !
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I guess I can schedule this work over a course of time. I for myself
>>>> can contribute plus learn from others.
>>>>
>>>> So +1 for me.
>>>>
>>>> Let us see if anyone else is interested.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>>>> wrote:
>>>>
>>>>
>>>> Hello Mich.
>>>>
>>>> Greetings. Would you be able to arrange a Spark Structured Streaming
>>>> learning webinar?
>>>>
>>>> This is something I have been struggling with recently. It will be
>>>> very helpful.
>>>>
>>>> Thanks and Regard
>>>>
>>>> AK
>>>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> This might be a worthwhile exercise on the assumption that the
>>>> contributors will find the time and bandwidth to chip in, so to speak.
>>>>
>>>> I am sure there are many but off the top of my head I can think of Holden
>>>> Karau for k8s, and Sean Owen for data science stuff. They are both very
>>>> experienced.
>>>>
>>>> Anyone else?
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>>>  wrote:
>>>>
>>>> Hello gurus,
>>>>
>>>> Does Spark arrange online webinars for special topics like Spark on
>>>> K8s, data science and Spark Structured Streaming?
>>>>
>>>> I would be most grateful if experts could share their experience with
>>>> learners of intermediate knowledge like myself. Hopefully we will find
>>>> the practical experiences shared valuable.
>>>>
>>>> Respectively,
>>>>
>>>> AK
>>>>
>>>>
>>>>
>>>>
>>>


Re: Online classes for spark topics

2023-03-08 Thread Mich Talebzadeh
Hi,

I guess I can schedule this work over a period of time. I myself can
contribute as well as learn from others.

So +1 for me.

Let us see if anyone else is interested.

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
wrote:

>
> Hello Mich.
>
> Greetings. Would you be able to arrange a Spark Structured Streaming
> learning webinar?
>
> This is something I have been struggling with recently. It will be very
> helpful.
>
> Thanks and Regard
>
> AK
> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Hi,
>
> This might be a worthwhile exercise on the assumption that the
> contributors will find the time and bandwidth to chip in, so to speak.
>
> I am sure there are many but off the top of my head I can think of Holden Karau
> for k8s, and Sean Owen for data science stuff. They are both very
> experienced.
>
> Anyone else?
>
> HTH
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>  wrote:
>
> Hello gurus,
>
> Does Spark arrange online webinars for special topics like Spark on K8s,
> data science and Spark Structured Streaming?
>
> I would be most grateful if experts could share their experience with
> learners of intermediate knowledge like myself. Hopefully we will find
> the practical experiences shared valuable.
>
> Respectively,
>
> AK
>
>


Re: Online classes for spark topics

2023-03-07 Thread Mich Talebzadeh
Hi,

This might be a worthwhile exercise on the assumption that the
contributors will find the time and bandwidth to chip in, so to speak.

I am sure there are many but off the top of my head I can think of Holden Karau
for k8s, and Sean Owen for data science stuff. They are both very
experienced.

Anyone else?

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
 wrote:

> Hello gurus,
>
> Does Spark arrange online webinars for special topics like Spark on K8s,
> data science and Spark Structured Streaming?
>
> I would be most grateful if experts could share their experience with
> learners of intermediate knowledge like myself. Hopefully we will find
> the practical experiences shared valuable.
>
> Respectively,
>
> AK
>


Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-07 Thread Mich Talebzadeh
Hm, an interesting proposition. I guess you mean altering one of the following
parameters in flight:


  streamingDataFrame = self.spark \
      .readStream \
      .format("kafka") \
      .option("kafka.bootstrap.servers", config['MDVariables']['bootstrapServers']) \
      .option("schema.registry.url", config['MDVariables']['schemaRegistryURL']) \
      .option("group.id", config['common']['appName']) \
      .option("zookeeper.connection.timeout.ms", config['MDVariables']['zookeeperConnectionTimeoutMs']) \
      .option("rebalance.backoff.ms", config['MDVariables']['rebalanceBackoffMS']) \
      .option("zookeeper.session.timeout.ms", config['MDVariables']['zookeeperSessionTimeOutMs']) \
      .option("auto.commit.interval.ms", config['MDVariables']['autoCommitIntervalMS']) \
      .option("subscribe", config['MDVariables']['topic']) \
      .option("failOnDataLoss", "false") \
      .option("includeHeaders", "true") \
      .option("startingOffsets", "latest") \
      .load() \
      .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))

Ok, one secure way of doing it is shutting down the streaming process
gracefully, without loss of data that impacts consumers. The other method
implies in-flight changes, as suggested by the topic, with zero interruptions.
Interestingly, one of our clients requested a similar solution. As solutions
architect/engineering manager I should come back with a few options. I am on
the case, so to speak. There is considerable interest in Spark Structured
Streaming across the board, especially in trading systems.
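
As a rough sketch of the stop-and-restart route (the broker address, topic name
and checkpoint path below are placeholders, and maxOffsetsPerTrigger is just one
example of a throughput option that is only read when the query starts):

def restart_with_new_options(spark, query, new_max_offsets):
    # source options are fixed at query start, so stop the running query first;
    # the checkpoint keeps the last committed offsets for the restart
    query.stop()
    query.awaitTermination()
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")     # placeholder broker
            .option("subscribe", "md")                            # placeholder topic
            .option("startingOffsets", "latest")
            .option("maxOffsetsPerTrigger", new_max_offsets)      # new throughput cap
            .load()
            .selectExpr("CAST(value AS STRING)")
            .writeStream
            .outputMode("append")
            .format("console")                                    # placeholder sink
            .option("checkpointLocation", "/tmp/chkpt/md")        # same checkpoint as before
            .trigger(processingTime='30 seconds')
            .start())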

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 16 Feb 2023 at 04:12, hueiyuan su  wrote:

> *Component*: Spark Structured Streaming
> *Level*: Advanced
> *Scenario*: How-to
>
> -
> *Problems Description*
> I would like to confirm whether we can directly apply new options of
> readStream/writeStream without stopping a currently running Spark Structured
> Streaming application. For example, if we just want to adjust the throughput
> properties of readStream with Kafka, do we have a method to adjust it
> without stopping the application? If you have any ideas, please let me know.
> I would appreciate your answer.
>
>
> --
> Best Regards,
>
> Mars Su
> *Phone*: 0988-661-013
> *Email*: hueiyua...@gmail.com
>


Re: [Spark Structured Streaming] Do spark structured streaming is support sink to AWS Kinesis currently and how to handle if achieve quotas of kinesis?

2023-03-06 Thread Mich Talebzadeh
Spark Structured Streaming can write to anything as long as an appropriate
API or JDBC connection exists.

I have not tried Kinesis, but have you thought about how you want to write
it as a sink?

Those quota limitations, much like quotas set by the vendors (say, Google on
BigQuery writes, etc.), are defaults but can be negotiated with the vendor to
increase them.

What facts have you established so far?
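
If it helps, in the absence of a built-in Kinesis sink the pattern I would try
first is foreachBatch plus the AWS SDK, chunking each micro-batch to stay under
the per-call limits. A rough sketch only: the stream name, region and the
500-records-per-PutRecords chunk size are assumptions to verify against your own
quotas, and collecting to the driver is only sensible for small micro-batches.

import boto3

def send_to_kinesis(df, batch_id):
    # foreachBatch runs this on the driver for every micro-batch
    rows = df.toJSON().collect()                                  # only viable for small batches
    client = boto3.client("kinesis", region_name="eu-west-2")     # assumed region
    for i in range(0, len(rows), 500):                            # PutRecords accepts at most 500 records per call
        records = [{"Data": r.encode("utf-8"), "PartitionKey": str(batch_id)}
                   for r in rows[i:i + 500]]
        client.put_records(StreamName="my-stream", Records=records)  # assumed stream name

# df.writeStream.foreachBatch(send_to_kinesis).option("checkpointLocation", "/tmp/chkpt/kinesis").start()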

HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 6 Mar 2023 at 04:20, hueiyuan su  wrote:

> *Component*: Spark Structured Streaming
> *Level*: Advanced
> *Scenario*: How-to
>
> 
> *Problems Description*
> 1. I currently would like to use PySpark Structured Streaming to
> write data to Kinesis. But it seems there is no corresponding
> connector I can use. I would like to confirm whether there is another method in addition
> to this solution
> <https://repost.aws/questions/QUP_OJomilTO6oIgvK00VHEA/writing-data-to-kinesis-stream-from-py-spark>
> 2. Because AWS Kinesis has quota limitations (like 1 MB/s and 1,000
> records/s), if the Spark Structured Streaming micro-batch size is too large, how
> can we handle this?
>
> --
> Best Regards,
>
> Mars Su
> *Phone*: 0988-661-013
> *Email*: hueiyua...@gmail.com
>


Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-05 Thread Mich Talebzadeh
OK I found a workaround.

Basically, state is not kept across the two streams, and I have two streams.
One is a business topic and the other one is created to shut down Spark
Structured Streaming gracefully.

I was interested in printing the value of the most recent batchId for the
business topic, called "md" here, just before terminating it gracefully. I
made some effort to use a single function to get the batchId for both topics,
but this did not work.

However, under the checkpoint directory in Spark Structured Streaming
<https://docs.databricks.com/structured-streaming/async-checkpointing.html>
there is a subdirectory called offsets that is maintained by Spark. This
directory contains a list of batchIds for recovery/restart-from-where-you-left-off
purposes.

So I read from that directory, which is visible to Spark, and take the last
value as the business topic's batchId.

def sendToControl(dfnewtopic, batchId2):
    if len(dfnewtopic.take(1)) > 0:
        print(f"""From sendToControl, newtopic batchId is {batchId2}""")
        dfnewtopic.show(100, False)
        queue = dfnewtopic.first()[2]
        status = dfnewtopic.first()[3]
        print(f"""testing queue is {queue}, and status is {status}""")
        if (queue == config['MDVariables']['topic']) & (status == 'false'):
            ## shutdown issued
            # get the last batchId for the main topic from the {checkpoint_path}/offsets sub-directory
            dir_path = "///ssd/hduser/MDBatchBQ/chkpt/offsets"
            dir_list = os.listdir(dir_path)
            batchIdMD = max(dir_list)
            spark_session = s.spark_session(config['common']['appName'])
            active = spark_session.streams.active
            for e in active:
                name = e.name
                if name == config['MDVariables']['topic']:
                    print(f"""\n==> Request terminating streaming process for topic {name} with batchId = {batchIdMD} at {datetime.now()}\n """)
                    e.stop()
    else:
        print("DataFrame newtopic is empty")



From sendToControl, newtopic batchId is 191
From sendToSink, md, batchId is 676, at 2023-03-05 19:57:00.095001
+------------------------------------+-------------------+-----+------+
|uuid                                |timeissued         |queue|status|
+------------------------------------+-------------------+-----+------+
|f26544ab-95f5-47f3-ad8b-03beb25239c2|2023-03-05 19:56:49|md   |true  |
+------------------------------------+-------------------+-----+------+

testing queue is md, and status is true
From sendToSink, md, batchId is 677, at 2023-03-05 19:57:30.072476
From sendToSink, md, batchId is 678, at 2023-03-05 19:58:00.081617
From sendToControl, newtopic batchId is 192
+------------------------------------+-------------------+-----+------+
|uuid                                |timeissued         |queue|status|
+------------------------------------+-------------------+-----+------+
|92f0b6fc-d683-42de-af17-f8a021048196|2023-03-05 19:57:29|md   |false |
+------------------------------------+-------------------+-----+------+


==> Request terminating streaming process for topic md with batchId = 678
at 2023-03-05 19:58:01.589334

2023-03-05 19:58:01,649 ERROR streaming.MicroBatchExecution: Query newtopic
[id = 19f4c6ad-11b8-451f-acf1-8bfbea7c370b, runId =
dd26db7d-f4bf-4176-ae75-116eb67eb237] terminated with error


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 4 Mar 2023 at 22:13, Mich Talebzadeh 
wrote:

> This might help
>
> https://docs.databricks.com/structured-streaming/foreach.html
>
> streamingDF.writeStream.foreachBatch(...) allows you to specify a
> function that is executed on the output data of every micro-batch of the
> streaming query. It takes two parameters: a DataFrame or Dataset that has
> the output data of a micro-batch and the unique ID of the micro-batch
>
>
> So there are two different function calls in my case. I cannot put them
> together in one function.
>
>
>newtopicResult = streamingNewtopic.select( \
>
>  col("newtopic_value.uuid").alias("uuid") \
>
>, col("newtopic_value.timeissued").alias("timeissued") \
>
>, col("newtopic_value.queue").alias("queue") \
>
>, col("newtopic_value.status").alias("status"

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
This might help

https://docs.databricks.com/structured-streaming/foreach.html

streamingDF.writeStream.foreachBatch(...) allows you to specify a function
that is executed on the output data of every micro-batch of the streaming
query. It takes two parameters: a DataFrame or Dataset that has the output
data of a micro-batch and the unique ID of the micro-batch


So there are two different function calls in my case. I cannot put them
together in one function.


    newtopicResult = streamingNewtopic.select( \
          col("newtopic_value.uuid").alias("uuid") \
        , col("newtopic_value.timeissued").alias("timeissued") \
        , col("newtopic_value.queue").alias("queue") \
        , col("newtopic_value.status").alias("status")). \
        writeStream. \
        outputMode('append'). \
        option("truncate", "false"). \
        foreachBatch(sendToControl). \
        trigger(processingTime='30 seconds'). \
        option('checkpointLocation', checkpoint_path_newtopic). \
        queryName(config['MDVariables']['newtopic']). \
        start()

    #print(newtopicResult)

    result = streamingDataFrame.select( \
          col("parsed_value.rowkey").alias("rowkey") \
        , col("parsed_value.ticker").alias("ticker") \
        , col("parsed_value.timeissued").alias("timeissued") \
        , col("parsed_value.price").alias("price")). \
        writeStream. \
        outputMode('append'). \
        option("truncate", "false"). \
        foreachBatch(sendToSink). \
        trigger(processingTime='30 seconds'). \
        option('checkpointLocation', checkpoint_path). \
        queryName(config['MDVariables']['topic']). \
        start()




   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 4 Mar 2023 at 21:51, Mich Talebzadeh 
wrote:

> I am aware of your point that globals don't work in a distributed
> environment.
> With regard to your other point, these are two different topics with their
> own streams. The point of the second stream is to set the status to false, so
> it can gracefully shut down the main stream (the one called "md") here
>
> For example, the second stream has this row
>
>
> ++---+-+--+
>
> |uuid|timeissued |queue|status|
>
> ++---+-+--+
>
> |ac74d419-58aa-4879-945d-a2a41bb64873|2023-03-04 21:29:18|md   |true  |
>
> ++---+-+--+
>
> So every 30 seconds it checks the status, and if status = false, it shuts
> down the main stream gracefully. It works OK.
>
> def sendToControl(dfnewtopic, batchId2):
> if(len(dfnewtopic.take(1))) > 0:
> print(f"""From sendToControl, newtopic batchId is {batchId2}""")
> dfnewtopic.show(100,False)
> queue = dfnewtopic.first()[2]
> status = dfnewtopic.first()[3]
> print(f"""testing queue is {queue}, and status is {status}""")
> if((queue == config['MDVariables']['topic']) & (status == 'false')
> ):
>   spark_session = s.spark_session(config['common']['appName'])
>   active = spark_session.streams.active
>   for e in active:
>  name = e.name
>  if(name == config['MDVariables']['topic']):
> print(f"""\n==> Request terminating streaming process for
> topic {name} at {datetime.now()}\n """)
> e.stop()
> else:
> print("DataFrame newtopic is empty")
>
> And so, when the status is set to false in the second stream, it does as below:
>
> From sendToControl, newtopic batchId is 93
> ++---+-+--+
> |uuid|

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
I am aware of your point that globals don't work in a distributed
environment.
With regard to your other point, these are two different topics with their
own streams. The point of the second stream is to set the status to false, so
it can gracefully shut down the main stream (the one called "md") here.

For example, the second stream has this row


+------------------------------------+-------------------+-----+------+
|uuid                                |timeissued         |queue|status|
+------------------------------------+-------------------+-----+------+
|ac74d419-58aa-4879-945d-a2a41bb64873|2023-03-04 21:29:18|md   |true  |
+------------------------------------+-------------------+-----+------+

So every 30 seconds it checks the status, and if status = false, it shuts
down the main stream gracefully. It works OK.

def sendToControl(dfnewtopic, batchId2):
    if len(dfnewtopic.take(1)) > 0:
        print(f"""From sendToControl, newtopic batchId is {batchId2}""")
        dfnewtopic.show(100, False)
        queue = dfnewtopic.first()[2]
        status = dfnewtopic.first()[3]
        print(f"""testing queue is {queue}, and status is {status}""")
        if (queue == config['MDVariables']['topic']) & (status == 'false'):
            spark_session = s.spark_session(config['common']['appName'])
            active = spark_session.streams.active
            for e in active:
                name = e.name
                if name == config['MDVariables']['topic']:
                    print(f"""\n==> Request terminating streaming process for topic {name} at {datetime.now()}\n """)
                    e.stop()
    else:
        print("DataFrame newtopic is empty")

And so, when the status is set to false in the second stream, it does as below:

From sendToControl, newtopic batchId is 93
+------------------------------------+-------------------+-----+------+
|uuid                                |timeissued         |queue|status|
+------------------------------------+-------------------+-----+------+
|c4736bc7-bee7-4dce-b67a-3b1d674b243a|2023-03-04 21:36:52|md   |false |
+------------------------------------+-------------------+-----+------+

testing queue is md, and status is false

==> Request terminating streaming process for topic md at 2023-03-04
21:36:55.590162

and shuts down.

I want to print this:

  print(f"""\n==> Request terminating streaming process for topic {name}
and batch {BatchId for md} at {datetime.now()}\n """)

That {BatchId for md} should come from this function:

def sendToSink(df, batchId):
    if len(df.take(1)) > 0:
        print(f"""From sendToSink, md, batchId is {batchId}, at {datetime.now()} """)
        #df.show(100,False)
        df.persist()
        # write to BigQuery batch table
        #s.writeTableToBQ(df, "append", config['MDVariables']['targetDataset'], config['MDVariables']['targetTable'])
        df.unpersist()
        #print(f"""wrote to DB""")
        batchidMD = batchId
        print(batchidMD)
    else:
        print("DataFrame md is empty")

I trust I explained it adequately

cheers


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 4 Mar 2023 at 21:22, Sean Owen  wrote:

> I don't quite get it - aren't you applying to the same stream, and
> batches? Worst case, why not apply these as one function?
> Otherwise, how do you mean to associate one call with another?
> Globals don't help here. They aren't global beyond the driver, and which
> one would be which batch?
>
> On Sat, Mar 4, 2023 at 3:02 PM Mich Talebzadeh 
> wrote:
>
>> Thanks. they are different batchIds
>>
>> From sendToControl, newtopic batchId is 76
>> From sendToSink, md, batchId is 563
>>
>> As a matter of interest, why does a global variable not work?
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from 

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
Thanks. They are different batchIds:

From sendToControl, newtopic batchId is 76
From sendToSink, md, batchId is 563

As a matter of interest, why does a global variable not work?



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 4 Mar 2023 at 20:13, Sean Owen  wrote:

> It's the same batch ID already, no?
> Or why not simply put the logic of both in one function? or write one
> function that calls both?
>
> On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh 
> wrote:
>
>>
>> This is probably pretty straightforward, but somehow it does not look
>> that way.
>>
>>
>>
>> On Spark Structured Streaming,  "foreachBatch" performs custom write
>> logic on each micro-batch through a call function. Example,
>>
>> foreachBatch(sendToSink) expects 2 parameters, first: micro-batch as
>> DataFrame or Dataset and second: unique id for each batch
>>
>>
>>
>> In my case I simultaneously read two topics through two separate functions
>>
>>
>>
>>1. foreachBatch(sendToSink). \
>>2. foreachBatch(sendToControl). \
>>
>> This is  the code
>>
>> def sendToSink(df, batchId):
>> if(len(df.take(1))) > 0:
>> print(f"""From sendToSink, md, batchId is {batchId}, at
>> {datetime.now()} """)
>> #df.show(100,False)
>> df. persist()
>> # write to BigQuery batch table
>> #s.writeTableToBQ(df, "append",
>> config['MDVariables']['targetDataset'],config['MDVariables']['targetTable'])
>> df.unpersist()
>> #print(f"""wrote to DB""")
>>else:
>> print("DataFrame md is empty")
>>
>> def sendToControl(dfnewtopic, batchId2):
>> if(len(dfnewtopic.take(1))) > 0:
>> print(f"""From sendToControl, newtopic batchId is {batchId2}""")
>> dfnewtopic.show(100,False)
>> queue = dfnewtopic.first()[2]
>> status = dfnewtopic.first()[3]
>> print(f"""testing queue is {queue}, and status is {status}""")
>> if((queue == config['MDVariables']['topic']) & (status ==
>> 'false')):
>>   spark_session = s.spark_session(config['common']['appName'])
>>   active = spark_session.streams.active
>>   for e in active:
>>  name = e.name
>>  if(name == config['MDVariables']['topic']):
>> print(f"""\n==> Request terminating streaming process for
>> topic {name} at {datetime.now()}\n """)
>> e.stop()
>> else:
>> print("DataFrame newtopic is empty")
>>
>>
>> The problem I have is sharing the batchId from the first function with the
>> second function, sendToControl(dfnewtopic, batchId2), so I can print it
>> out.
>>
>>
>> Defining a global did not work, so it sounds like I am missing something
>> rudimentary here!
>>
>>
>> Thanks
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
This is probably pretty straightforward, but somehow it does not look that
way.



In Spark Structured Streaming, "foreachBatch" performs custom write logic
on each micro-batch through a callback function. For example,

foreachBatch(sendToSink) expects 2 parameters: first, the micro-batch as a
DataFrame or Dataset, and second, the unique id for each batch.



In my case I simultaneously read two topics through two separate functions



   1. foreachBatch(sendToSink). \
   2. foreachBatch(sendToControl). \

This is  the code

def sendToSink(df, batchId):
    if len(df.take(1)) > 0:
        print(f"""From sendToSink, md, batchId is {batchId}, at {datetime.now()} """)
        #df.show(100,False)
        df.persist()
        # write to BigQuery batch table
        #s.writeTableToBQ(df, "append", config['MDVariables']['targetDataset'], config['MDVariables']['targetTable'])
        df.unpersist()
        #print(f"""wrote to DB""")
    else:
        print("DataFrame md is empty")

def sendToControl(dfnewtopic, batchId2):
    if len(dfnewtopic.take(1)) > 0:
        print(f"""From sendToControl, newtopic batchId is {batchId2}""")
        dfnewtopic.show(100, False)
        queue = dfnewtopic.first()[2]
        status = dfnewtopic.first()[3]
        print(f"""testing queue is {queue}, and status is {status}""")
        if (queue == config['MDVariables']['topic']) & (status == 'false'):
            spark_session = s.spark_session(config['common']['appName'])
            active = spark_session.streams.active
            for e in active:
                name = e.name
                if name == config['MDVariables']['topic']:
                    print(f"""\n==> Request terminating streaming process for topic {name} at {datetime.now()}\n """)
                    e.stop()
    else:
        print("DataFrame newtopic is empty")


The problem I have is sharing the batchId from the first function with the
second function, sendToControl(dfnewtopic, batchId2), so I can print it out.


Defining a global did not work, so it sounds like I am missing something
rudimentary here!
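
One thing I have not tried yet: assigning to a name inside sendToSink creates a
new local variable unless the name is declared global, so a small mutable holder
shared by both functions (both foreachBatch callbacks run in the driver process)
might be enough. A rough, untested sketch; the holder name is made up for
illustration:

shared_state = {"last_md_batch_id": None}   # lives on the driver, visible to both callbacks

def sendToSink(df, batchId):
    if len(df.take(1)) > 0:
        shared_state["last_md_batch_id"] = batchId   # mutate the dict, don't rebind a global name
        print(f"""From sendToSink, md, batchId is {batchId}""")

def sendToControl(dfnewtopic, batchId2):
    if len(dfnewtopic.take(1)) > 0:
        print(f"""From sendToControl, newtopic batchId is {batchId2}, """
              f"""last md batchId seen so far is {shared_state['last_md_batch_id']}""")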


Thanks


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: SPIP architecture diagrams

2023-03-04 Thread Mich Talebzadeh
OK, I decided to bite the bullet and use a Visio diagram for my SPIP "Shutting
down Spark Structured Streaming when the streaming process has completed the
current process". Details here:
https://issues.apache.org/jira/browse/SPARK-42485


This is not meant to be complete; it is an indication. I have tried to
make it generic. However, trademarks are acknowledged. I have tried not to
use color, but I guess pointers are fair.


Let me know your thoughts.


Regards



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 24 Feb 2023 at 20:12, Mich Talebzadeh 
wrote:

>
> Sounds like I have to decide for myself what to use. A correction: Vision
> should read *Visio*.
>
>
> ideally the SPIP guide https://spark.apache.org/improvement-proposals.html
>  should include this topic. Additionally there should be a repository for
> the original diagrams as well. From the said guide:
>
>
> *Appendix B. Optional Design Sketch: How are the goals going to be
> accomplished? Give sufficient technical detail to allow a contributor to
> judge whether it’s likely to be feasible. Note that this is not a full
> design document.*
>
> *Appendix C. Optional Rejected Designs: What alternatives were considered?
> Why were they rejected? If no alternatives have been considered, the
> problem needs more thought.*
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 20 Feb 2023 at 15:11, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Can someone advise me what architecture tools I can use to create
>> diagrams for SPIP document purposes?
>>
>>
>> For example, Vision, Excalidraw, Draw IO etc or does it matter as I just
>> need to create a PNG file from whatever?
>>
>>
>> Thanks
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spike on number of tasks - dynamic allocation

2023-02-27 Thread Mich Talebzadeh
Hi Murat,

I have dealt with EMR but have used a Spark cluster on Google Dataproc with
3.1.1 and an autoscaling policy.

My understanding is that the autoscaling policy will decide how to scale, if
needed, without manual intervention. Is this the case with yours?
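
If the scaling you are seeing comes from Spark's dynamic allocation rather than
the cluster autoscaler, it may also be worth pinning the executor range
explicitly so runs are comparable across instance types. These are standard
Spark properties; the numbers below are only placeholders to adjust:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("autoscale-check")                                   # placeholder app name
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")          # placeholder floor
         .config("spark.dynamicAllocation.maxExecutors", "20")         # placeholder cap
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .config("spark.sql.shuffle.partitions", "200")                # keeps the shuffle task count fixed
         .getOrCreate())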


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 27 Feb 2023 at 14:16, murat migdisoglu 
wrote:

> Hey Mich,
> This cluster is running spark 2.4.6 on EMR
>
> On Mon, Feb 27, 2023 at 12:20 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> What is the spark version and what type of cluster is it, spark on
>> dataproc or other?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 27 Feb 2023 at 09:06, murat migdisoglu <
>> murat.migdiso...@gmail.com> wrote:
>>
>>> On an auto-scaling cluster using YARN as resource manager, we observed
>>> that when we decrease the number of worker nodes after upscaling instance
>>> types, the number of tasks for the same spark job spikes. (the total
>>> cpu/memory capacity of the cluster remains identical)
>>>
>>> the same spark job, with the same spark settings (dynamic allocation is
>>> on), spins up 4-5 times more tasks. Related to that, we see 4-5 times more
>>> executors being allocated.
>>>
>>> As far as I understand, dynamic allocation decides to start a new
>>> executor if it sees tasks pending being queued up. But I don't know why the
>>> same spark application with identical input files runs 4-5 times higher
>>> number of tasks.
>>>
>>> Any clues would be much appreciated, thank you.
>>>
>>> Murat
>>>
>>>
>
> --
> "Talkers aren’t good doers. Rest assured that we’re going there to use
> our hands, not our tongues."
> W. Shakespeare
>


Re: Spike on number of tasks - dynamic allocation

2023-02-27 Thread Mich Talebzadeh
Hi,

What is the Spark version, and what type of cluster is it: Spark on Dataproc
or other?

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 27 Feb 2023 at 09:06, murat migdisoglu 
wrote:

> On an auto-scaling cluster using YARN as resource manager, we observed
> that when we decrease the number of worker nodes after upscaling instance
> types, the number of tasks for the same spark job spikes. (the total
> cpu/memory capacity of the cluster remains identical)
>
> the same spark job, with the same spark settings (dynamic allocation is
> on), spins up 4-5 times more tasks. Related to that, we see 4-5 times more
> executors being allocated.
>
> As far as I understand, dynamic allocation decides to start a new executor
> if it sees tasks pending being queued up. But I don't know why the same
> spark application with identical input files runs 4-5 times higher number
> of tasks.
>
> Any clues would be much appreciated, thank you.
>
> Murat
>
>


Fwd: Auto-reply: Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-26 Thread Mich Talebzadeh
Hi,

Can someone disable the login below from the Spark forums, please?

It sounds like someone left this email address behind and we receive a spam-type
auto-reply any time we respond.

thanks



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




-- Forwarded message -
From: xyqiao 
Date: Sun, 26 Feb 2023 at 22:42
Subject: Auto-reply: Re: [DISCUSS] Show Python code examples first in Spark
documentation
To: Mich Talebzadeh 


This is an automatic vacation reply from QQ Mail.

Hello, I am currently on vacation and cannot reply to your email in person. I will get back to you as soon as possible after the vacation ends.


Re: Unable to handle bignumeric datatype in spark/pyspark

2023-02-25 Thread Mich Talebzadeh
Sounds like it is cosmetic. The important point is whether the data stored
in GBQ is valid.


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 25 Feb 2023 at 18:12, Rajnil Guha  wrote:

> Hi All,
>
> I created an issue on Stack Overflow (linked below) a few months back
> about problems handling bignumeric type values of BigQuery in Spark.
>
> link
> <https://stackoverflow.com/questions/74719503/getting-error-while-reading-bignumeric-data-type-from-a-bigquery-table-using-apa>
>
> On Fri, Feb 24, 2023 at 3:54 PM Mich Talebzadeh 
> wrote:
>
>> Hi Nidhi,
>>
>> can you create a BigQuery table with a  bignumeric and numeric column
>> types, add a few lines and try to read into spark. through DF
>>
>> and do
>>
>>
>> df.printSchema()
>>
>> df.show(5,False)
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 24 Feb 2023 at 02:47, nidhi kher  wrote:
>>
>>> Hello,
>>>
>>>
>>> I am facing below issue in pyspark code:
>>>
>>> We are running spark code using dataproc serverless batch in google
>>> cloud platform. Spark code is causing issue while writing the data to
>>> bigquery table. In bigquery table , few of the columns have datatype as
>>> bignumeric and spark code is changing the datatype from bignumeric to
>>> numeric while writing the data. We need datatype to be kept as bignumeric
>>> only as we need data of 38,20 precision.
>>>
>>>
>>> Can we cast a column to bignumeric in spark sql dataframe like below
>>> code for decimal:
>>>
>>>
>>> df= spark.sql("""SELECT cast(col1 as decimal(38,20)) as col1 from
>>> table1""")
>>>
>>> Spark version :3.3
>>>
>>> Pyspark version : 1.1
>>>
>>>
>>> Regards,
>>>
>>> Nidhi
>>>
>>


Re: SPIP architecture diagrams

2023-02-24 Thread Mich Talebzadeh
Sounds like I have to decide for myself what to use. A correction: Vision
should read *Visio*.


Ideally the SPIP guide
https://spark.apache.org/improvement-proposals.html should
include this topic. Additionally, there should be a repository for the
original diagrams as well. From the said guide:


*Appendix B. Optional Design Sketch: How are the goals going to be
accomplished? Give sufficient technical detail to allow a contributor to
judge whether it’s likely to be feasible. Note that this is not a full
design document.*

*Appendix C. Optional Rejected Designs: What alternatives were considered?
Why were they rejected? If no alternatives have been considered, the
problem needs more thought.*


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 20 Feb 2023 at 15:11, Mich Talebzadeh 
wrote:

> Hi,
>
> Can someone advise me what architecture tools I can use to create diagrams
> for SPIP document purposes?
>
>
> For example, Vision, Excalidraw, Draw IO etc or does it matter as I just
> need to create a PNG file from whatever?
>
>
> Thanks
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Unable to handle bignumeric datatype in spark/pyspark

2023-02-24 Thread Mich Talebzadeh
Hi Nidhi,

Can you create a BigQuery table with bignumeric and numeric column
types, add a few rows, and try to read it into Spark through a DataFrame,

and do


df.printSchema()

df.show(5, False)
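
Something along these lines, assuming the spark-bigquery connector is on the
classpath; the project, dataset and table names are placeholders:

df = (spark.read.format("bigquery")
      .option("table", "my_project.my_dataset.numeric_test")   # placeholder table with BIGNUMERIC/NUMERIC columns
      .load())

df.printSchema()
df.show(5, False)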


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 24 Feb 2023 at 02:47, nidhi kher  wrote:

> Hello,
>
>
> I am facing below issue in pyspark code:
>
> We are running spark code using dataproc serverless batch in google cloud
> platform. Spark code is causing issue while writing the data to bigquery
> table. In bigquery table , few of the columns have datatype as bignumeric
> and spark code is changing the datatype from bignumeric to numeric while
> writing the data. We need datatype to be kept as bignumeric only as we need
> data of 38,20 precision.
>
>
> Can we cast a column to bignumeric in spark sql dataframe like below code
> for decimal:
>
>
> df= spark.sql("""SELECT cast(col1 as decimal(38,20)) as col1 from
> table1""")
>
> Spark version :3.3
>
> Pyspark version : 1.1
>
>
> Regards,
>
> Nidhi
>


<    1   2   3   4   5   6   7   8   9   10   >