Re: Looping through a series of telephone numbers

2023-04-02 Thread Bjørn Jørgensen
dataset.csv

id,tel_in_dataset
1,+33
2,+331222
3,+331333
4,+331222
5,+331222
6,+331444
7,+331222
8,+331555

telephone_numbers.csv

tel
+331222
+331222
+331222
+331222

Start Spark with all of your CPU and RAM:

import os
import multiprocessing

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Philippe, These are my thoughts, in addition to the comments from Sean. Just to clarify: you receive a CSV file periodically, and you already have a file that contains valid patterns for phone numbers (the reference). In pseudocode, you can probe your CSV DataFrame against the reference DataFrame: // load your

Re: Looping through a series of telephone numbers

2023-04-02 Thread Sean Owen
That won't work, you can't use Spark within Spark like that. If it were exact matches, the best solution would be to load both datasets and join on telephone number. For this case, I think your best bet is a UDF that contains the telephone numbers as a list and decides whether a given number
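A minimal sketch of the per-row logic that such a UDF could wrap, written in plain Python so it runs without a cluster. The helper name `make_matcher` and the sample values are assumptions for illustration; registering the predicate as a Spark UDF is deliberately left out.

```python
from typing import Callable, Iterable, Optional


def make_matcher(telephone_numbers: Iterable[str]) -> Callable[[Optional[str]], bool]:
    """Return a predicate that is True when any known number occurs in the value.

    Hypothetical helper: this is the check a UDF could apply to each row,
    equivalent to LIKE '%tel%' for every tel in the list.
    """
    # De-duplicate once, up front, so each row scans every distinct number only once.
    needles = tuple(dict.fromkeys(t for t in telephone_numbers if t))

    def matches(value: Optional[str]) -> bool:
        if value is None:
            return False
        return any(tel in value for tel in needles)

    return matches


# Sample values in the shape of a telephone column and a lookup list.
dataset = ["+33", "+331222", "+331333", "+331444", "+331555"]
matches = make_matcher(["+331222", "+331222", "+331444"])
hits = [value for value in dataset if matches(value)]
print(hits)
```

Each row is compared against every distinct number, so the cost grows with the size of the list; for exact matches the join Sean mentions avoids that scan.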

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Many thanks, Mich. Is "foreach" the best construct to look up items in a dataset such as the "telephonedirectory" dataset below?

val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2", "tel3" …)) // the telephone sequence was read from a CSV file
val ds =

Re: Looping through a series of telephone numbers

2023-04-01 Thread Mich Talebzadeh
This may help: Spark rlike() Working with Regex Matching Examples. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead, Palantir Technologies Limited
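One way to use a regex match for this problem is to combine the whole list into a single alternation pattern, so the column is scanned once rather than once per number. The sketch below only builds and checks such a pattern with Python's `re` module; the function name `build_rlike_pattern` is an assumption, and whether a pattern with tens of thousands of alternatives performs acceptably in Spark is not something this thread establishes.

```python
import re
from typing import Iterable


def build_rlike_pattern(telephone_numbers: Iterable[str]) -> str:
    """Join the distinct numbers into one alternation, escaping characters such as '+'."""
    distinct = sorted(set(telephone_numbers))
    return "|".join(re.escape(tel) for tel in distinct)


pattern = build_rlike_pattern(["+331222", "+331222", "+331444"])
print(pattern)

# Check the pattern locally against a few sample column values.
compiled = re.compile(pattern)
column = ["+33", "+331222", "call +331444 now", "+331333"]
hits = [value for value in column if compiled.search(value)]
print(hits)
```

The leading `+` has to be escaped because it is a quantifier in regular expressions. An empty list yields an empty pattern, which matches every value, so that case needs a guard before the pattern is used.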

Looping through a series of telephone numbers

2023-04-01 Thread Philippe de Rochambeau
Hello, I’m looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a dataset column. In pseudocode:

for tel in [tel1, tel2, …, tel40,000]
    search for tel in dataset using .like("%tel%")
end for

I’m using the like function

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Mich Talebzadeh
Good stuff, Khalid. I have created a section in the Apache Spark Community Slack called spark foundation (spark-foundation - Apache Spark Community - Slack). I invite you to add your weblink to that section.

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Khalid Mammadov
Hey AN-TRUONG I have got some articles about this subject that should help. E.g. https://khalidmammadov.github.io/spark/spark_internals_rdd.html Also check other Spark Internals on web. Regards Khalid On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, wrote: > Thank you for your information, >

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Yes, history refers to completed jobs; port 4040 shows the running jobs. You should have screenshots for executors and stages as well. HTH

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread AN-TRUONG Tran Phan
Thank you for your information, I have tracked the spark history server on port 18080 and the spark UI on port 4040. I see the result of these two tools as similar right? I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in the images does, is it possible?

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Are you familiar with the Spark GUI, by default on port 4040? Have a look. HTH

Re: Creating InMemory relations with data in ColumnarBatches

2023-03-31 Thread praveen sinha
Yes, purely for performance. On Thu, Mar 30, 2023, 3:01 PM Mich Talebzadeh wrote: > Is this purely for performance consideration? > > Mich Talebzadeh, > Lead Solutions Architect/Engineering Lead > Palantir Technologies Limited > > >view my Linkedin profile >

Re: Slack for PySpark users

2023-03-30 Thread Jungtaek Lim
I'm reading through the page "Briefing: The Apache Way", and in the section of "Open Communications", restriction of communication inside ASF INFRA (mailing list) is more about code and decision-making. https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define It's

Re: Creating InMemory relations with data in ColumnarBatches

2023-03-30 Thread Mich Talebzadeh
Is this purely for performance consideration? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Good discussions and proposals all around. I have used Slack in anger on a customer site before. For small and medium-sized groups it is good and affordable. Alternatives have been suggested as well, so those who like investigative search can agree and come up with a freebie one. I am inclined to

Re: Slack for PySpark users

2023-03-30 Thread Denny Lee
+1. To Shani’s point, there are multiple OSS projects that use the free Slack version - top of mind include Delta, Presto, Flink, Trino, Datahub, MLflow, etc. On Thu, Mar 30, 2023 at 14:15 wrote: > Hey everyone, > > I think we should remain on a free program in slack. > > In my option the free

Re: Slack for PySpark users

2023-03-30 Thread shani . alishar
Hey everyone, I think we should remain on a free program in Slack. In my opinion the free program is more than enough; the only downside is that we can only see the last 90 days of messages. From what I know, the Airflow community (which has a strong, active community in Slack) also uses the free program (You

Re: Slack for PySpark users

2023-03-30 Thread Mridul Muralidharan
Thanks for flagging the concern Dongjoon, I was not aware of the discussion - but I can understand the concern. Would be great if you or Matei could update the thread on the result of deliberations, once it reaches a logical consensus: before we set up official policy around it. Regards, Mridul

unsubscribe

2023-03-30 Thread Daniel Tavares de Santana
unsubscribe

Re: Slack for PySpark users

2023-03-30 Thread Bjørn Jørgensen
I like the idea of having a talk channel. It can make it easier for everyone to say hello. Or to dare to ask about small or big matters that you would not have dared to ask about before on mailing lists. But then there is the price and what is the best for an open source project. The price for

Creating InMemory relations with data in ColumnarBatches

2023-03-30 Thread praveen sinha
Hi, I have been trying to implement InMemoryRelation based on spark ColumnarBatches, so far I have not been able to store the vectorised columnarbatch into the relation. Is there a way to achieve this without going with an intermediary representation like Arrow, so as to enable spark to do fast

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Hi Dongjoon to your points if I may - Do you have any reference from other official ASF-related Slack channels? No, I don't have any reference from other official ASF-related Slack channels because I don't think that matters. However, I stand corrected - To be clear, I intentionally didn't

Re: Slack for PySpark users

2023-03-30 Thread Dongjoon Hyun
To Mich. - Do you have any reference from other official ASF-related Slack channels? - To be clear, I intentionally didn't refer to any specific mailing list because we didn't set up any rule here yet. To Xiao. I understand what you mean. That's the reason why I added Matei from your side. > I

Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
Hi, Dongjoon, The other communities (e.g., Pinot, Druid, Flink) created their own Slack workspaces last year. I did not see an objection from the ASF board. At the same time, Slack workspaces are very popular and useful in most non-ASF open source communities. TBH, we are kind of late. I think we

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Hi Dongjoon, Thanks for your point. I gather you are referring to archive as below https://lists.apache.org/list.html?user@spark.apache.org Otherwise, correct me. Thanks Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Slack for PySpark users

2023-03-30 Thread Dongjoon Hyun
Hi, Xiao and all. (cc Matei) Please hold on the vote. There is a concern expressed by ASF board because recent Slack activities created an isolated silo outside of ASF mailing list archive. We need to establish a way to embrace it back to ASF archive before starting anything official. Bests,

[ANNOUNCE] Apache Celeborn(incubating) 0.2.1 available

2023-03-30 Thread rexxiong
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.2.1 Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
The ownership of Slack belongs to the Spark community.

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
We already have it: general - Apache Spark Community - Slack

Re: Slack for PySpark users

2023-03-30 Thread shani . alishar
Hey there, I agree. If the Apache Spark PMC can maintain the Spark community workspace, that would be great! Instead of creating a new one, they can also become the owner of the current one. Best regards, Shani On 30 Mar 2023, at 9:32, Xiao Li wrote: +1 + @d...@spark.apache.org This is a good idea. The

Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
+1 + @d...@spark.apache.org This is a good idea. The other Apache projects (e.g., Pinot, Druid, Flink) have created their own dedicated Slack workspaces for faster communication. We can do the same in Apache Spark. The Slack workspace will be maintained by the Apache Spark PMC. I propose to

Re: spark.catalog.listFunctions type signatures

2023-03-28 Thread Guillaume Masse
Hi Jacek, Thanks for the hints, I would rather have the information statically rather than build a logical plan. I'm using Apache Calcite to build SQL expressions and then I feed them to spark to run, so the pipeline goes like this: initial query in SQL (from the user) + schema definition (from

Re: spark.catalog.listFunctions type signatures

2023-03-28 Thread Jacek Laskowski
Hi, Interesting question indeed! The closest I could get would be to use lookupFunctionBuilder(name: FunctionIdentifier): Option[FunctionBuilder] [1] followed by extracting the dataType from T in `type FunctionBuilder = Seq[Expression] => T` which can be Expression (regular functions) or

spark.catalog.listFunctions type signatures

2023-03-28 Thread Guillaume Masse
Hi, I'm using Apache Calcite to run some SQL transformations on Apache Spark SQL statements. I would like to extract the type signatures out of spark.catalog.listFunctions to be able to register the functions in Calcite with their proper signatures. From the API, I can get the fully qualified class name

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
https://join.slack.com/t/sparkcommunitytalk/shared_invite/zt-1rk11diac-hzGbOEdBHgjXf02IZ1mvUA Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
Hi Bjorn, you just need to create an account on Slack and join any topic, I believe. HTH

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Bjørn Jørgensen
Do I need to get an invite before joining? tir. 28. mar. 2023 kl. 18:51 skrev Mich Talebzadeh < mich.talebza...@gmail.com>: > Hi all, > > There is a section in slack called webinars > > > https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG > >

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
Hi all, There is a section in slack called webinars https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG Asma Zgolli, agreed to prepare materials for Spark internals and/or comparing spark 3 and 2. I like to contribute to "Spark Streaming &

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread lee
That is, every pagerank value has no relationship to 1, right? As long as we focus on the size of each pagerank value in Graphx, we don't need to focus on the range, is that right? | | 李杰 | | leedd1...@163.com | Replied Message | From | Sean Owen | | Date | 3/28/2023 22:33 | | To |

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread Sean Owen
From the docs: Note that this is not the "normalized" PageRank and as a consequence pages that have no inlinks will have a PageRank of alpha. In particular, the pageranks may have some values greater than 1. On Tue, Mar 28, 2023 at 9:11 AM lee wrote: > When I calculate pagerank using
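The quoted note can be checked with a small plain-Python iteration of an unnormalized update of the form rank = alpha + (1 - alpha) * (sum of incoming contributions). This is a sketch in the spirit of that note, not GraphX's implementation; the function name, the starting rank of 1.0, and the toy graph are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def unnormalized_pagerank(
    edges: List[Tuple[str, str]], alpha: float = 0.15, iterations: int = 20
) -> Dict[str, float]:
    """Iterate rank = alpha + (1 - alpha) * sum(rank[src] / out_degree[src])."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree: Dict[str, int] = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    # Assumption: every vertex starts at 1.0 rather than 1/N.
    ranks = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        incoming: Dict[str, float] = defaultdict(float)
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: alpha + (1.0 - alpha) * incoming[v] for v in vertices}
    return ranks


# Ten pages that all link to one hub; none of the ten has an inlink.
edges = [(f"p{i}", "hub") for i in range(10)]
ranks = unnormalized_pagerank(edges)

# Dividing by the total gives values that sum to 1, as in a normalized PageRank.
total = sum(ranks.values())
normalized = {v: r / total for v, r in ranks.items()}
print(ranks["p0"], ranks["hub"], sum(normalized.values()))
```

In this toy graph the pages without inlinks settle at alpha and the hub ends above 1. The ordering of the values is unchanged by the final division, which is why comparing relative sizes is enough for ranking.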

What is the range of the PageRank value of graphx

2023-03-28 Thread lee
When I calculate pagerank using HugeGraph, each pagerank value is less than 1, and the total of pageranks is 1. However, the PageRank value of GraphX is often greater than 1, so what is the range of the PageRank value of GraphX? 李杰 | leedd1...@163.com

Re: Slack for PySpark users

2023-03-28 Thread Mich Talebzadeh
I created one at Slack called pyspark.

Re: Topics for Spark online classes & webinars

2023-03-28 Thread asma zgolli
Hello everyone, I suggest using the slack for the spark community created recently to collaborate and work together on these topics and use the LinkedIn page to publish the events and the webinars. Cheers, Asma Le jeu. 16 mars 2023 à 01:39, Denny Lee a écrit : > What we can do is get into the

Re: Slack for PySpark users

2023-03-28 Thread asma zgolli
Hello @Mich Talebzadeh , I suggest we use this slack to plan and organize "Online classes for spark topics". Best, Asma Le mar. 28 mars 2023 à 14:37, Shani Alisar a écrit : > Hi all, > > We recently opened an unofficial spark community slack workspace > Please join so we can increase the

Re: Slack for PySpark users

2023-03-28 Thread Shani Alisar
Hi all, We recently opened an unofficial spark community slack workspace Please join so we can increase the community and knowledge - Link Cheers, Shani >> From: שוהם יהודה >> Date: 28 March 2023

Re: Slack for PySpark users

2023-03-27 Thread asma zgolli
+1 good idea, I'd like to join as well. On Tue, 28 Mar 2023 at 04:09, Winston Lai wrote: > Please let us know when the channel is created. I'd like to join :) > > Thank You & Best Regards > Winston Lai > -- > *From:* Denny Lee > *Sent:* Tuesday, March 28, 2023

Re: Slack for PySpark users

2023-03-27 Thread Winston Lai
Please let us know when the channel is created. I'd like to join :) Thank You & Best Regards Winston Lai From: Denny Lee Sent: Tuesday, March 28, 2023 9:43:08 AM To: Hyukjin Kwon Cc: keen ; user@spark.apache.org Subject: Re: Slack for PySpark users +1 I think

Re: Slack for PySpark users

2023-03-27 Thread Denny Lee
+1 I think this is a great idea! On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon wrote: > Yeah, actually I think we should better have a slack channel so we can > easily discuss with users and developers. > > On Tue, 28 Mar 2023 at 03:08, keen wrote: > >> Hi all, >> I really like *Slack *as

Re: Slack for PySpark users

2023-03-27 Thread Hyukjin Kwon
Yeah, actually I think we should better have a slack channel so we can easily discuss with users and developers. On Tue, 28 Mar 2023 at 03:08, keen wrote: > Hi all, > I really like *Slack *as communication channel for a tech community. > There is a Slack workspace for *delta lake users* ( >

Slack for PySpark users

2023-03-27 Thread keen
Hi all, I really like *Slack *as communication channel for a tech community. There is a Slack workspace for *delta lake users* (https://go.delta.io/slack) that I enjoy a lot. I was wondering if there is something similar for PySpark users. If not, would there be anything wrong with creating a new

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Mich Talebzadeh
Agreed. How does asynchronous communication relate to Spark Structured streaming? In the previous post of yours, you made your Spark to run on the driver in a single JVM. You attempted to increase the number of executors to 3 after submission of the job that (as Sean alluded to) would not

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Sean Owen
What do you mean by asynchronously here? On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello again, > > Do we have any news for the above question? > I would really appreciate it. > > Thank you, > >

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Emmanouil Kritharakis
Hello again, Do we have any news for the above question? I would really appreciate it. Thank you, -- Emmanouil (Manos) Kritharakis Ph.D. candidate in the Department of Computer Science

How to calculate the spark.kryoserializer.buffer.max?

2023-03-26 Thread Arthur Li
Hello all, The data is generated by the vendors. On some days the data size is very large and overflows the default value of spark.kryoserializer.buffer.max. So how can I calculate spark.kryoserializer.buffer.max when the data size changes, ahead of raising the exception

Re: Kind help request

2023-03-25 Thread Sean Owen
It is telling you that the UI can't bind to any port. I presume that's because of container restrictions? If you don't want the UI at all, just set spark.ui.enabled to false On Sat, Mar 25, 2023 at 8:28 AM Lorenzo Ferrando < lorenzo.ferra...@edu.unige.it> wrote: > Dear Spark team, > > I am

Kind help request

2023-03-25 Thread Lorenzo Ferrando
Dear Spark team, I am Lorenzo from the University of Genoa. I am currently using the nextflow/sarek pipeline (on Ubuntu 18.04) to analyse genomic data through a Singularity container. One of the steps of the pipeline uses GATK4, which uses Spark. However, after some time I get the following error:

Re: Adding OpenSearch as a secondary index provider to SparkSQL

2023-03-24 Thread Mich Talebzadeh
Hi, Are you talking about intelligent index scan here? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

Adding OpenSearch as a secondary index provider to SparkSQL

2023-03-24 Thread Anirudha Jadhav
Hello community, I wanted your opinion on this implementation demo: support for materialized views, skipping indices and covered indices with bloom filter optimizations with OpenSearch via SparkSQL https://github.com/opensearch-project/sql/discussions/1465 (see video with voice-over) Ani --

Re: Question related to parallelism using structed streaming parallelism

2023-03-21 Thread Mich Talebzadeh
or download it from here https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Question related to parallelism using structed streaming parallelism

2023-03-21 Thread Sean Owen
Yes more specifically, you can't ask for executors once the app starts, in SparkConf like that. You set this when you launch it against a Spark cluster in spark-submit or otherwise. On Tue, Mar 21, 2023 at 4:23 AM Mich Talebzadeh wrote: > Hi Emmanouil, > > This means that your job is running on

Topics for Spark online classes & webinars, next steps

2023-03-21 Thread Mich Talebzadeh
Hi all, As you may be aware we are proposing to set-up community classes and webinars for Spark interest group or simply for those who could benefit from them. @Denny Lee and myself had a discussion on how to put this framework forward. The idea is first and foremost getting support from

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-17 Thread karan alang
Hi Mich, I'm currently testing this on my mac .. are you able to reproduce this issue ? Note - the code is similar .. except outputMode is set to update. wrt outputMode - when using aggregation + watermark, the outputMode should be either append Or update, in your code - you have used 'complete'

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-17 Thread Mich Talebzadeh
Hi Karan, The version tested was 3.1.1. Are you running on Dataproc serverless 3.1.3?

Single node spark issue in Sparkly/RStudio

2023-03-16 Thread elango vaidyanathan
Hi team, In a single Linux node, I would like to set up Rstudio with Sparkly. Three to four people make up the dev team. I am aware of the single-node spark cluster's constraints. When there is a resource problem with Spark, I want to know when more users join in to use Sparkly in Rstudio. It

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-16 Thread karan alang
Fyi .. apache spark version is 3.1.3 On Wed, Mar 15, 2023 at 4:34 PM karan alang wrote: > Hi Mich, this doesn't seem to be working for me .. the watermark seems to > be getting ignored ! > > Here is the data put into Kafka : > > ``` > > >

Re: Understanding executor memory behavior

2023-03-16 Thread Sean Owen
All else equal it is better to have the same resources in fewer executors. More tasks are local to other tasks which helps perf. There is more possibility of 'borrowing' extra mem and CPU in a task. On Thu, Mar 16, 2023, 2:14 PM Nikhil Goyal wrote: > Hi folks, > I am trying to understand what

Understanding executor memory behavior

2023-03-16 Thread Nikhil Goyal
Hi folks, I am trying to understand what would be the difference in running 8G 1 core executor vs 40G 5 core executors. I see that on yarn it can cause bin fitting issues but other than that are there any pros and cons on using either? Thanks Nikhil
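The "bin fitting" concern can be made concrete with a little arithmetic. The sketch below assumes a hypothetical node with 64 GB and 16 cores available to containers and ignores any per-executor memory overhead the resource manager adds on top of the requested size; the names `Packing` and `pack_node` are illustrative only.

```python
from typing import NamedTuple


class Packing(NamedTuple):
    executors: int
    unused_memory_gb: int
    unused_cores: int


def pack_node(node_memory_gb: int, node_cores: int,
              executor_memory_gb: int, executor_cores: int) -> Packing:
    """Count how many identical executors fit on one node and what is left over."""
    executors = min(node_memory_gb // executor_memory_gb,
                    node_cores // executor_cores)
    return Packing(
        executors=executors,
        unused_memory_gb=node_memory_gb - executors * executor_memory_gb,
        unused_cores=node_cores - executors * executor_cores,
    )


# Hypothetical node: 64 GB and 16 cores available to containers.
small = pack_node(64, 16, executor_memory_gb=8, executor_cores=1)
large = pack_node(64, 16, executor_memory_gb=40, executor_cores=5)
print(small)
print(large)
```

On this assumed node the 8G executors use all of the memory, while a single 40G executor leaves 24 GB that no second executor of the same size can claim. Real clusters differ in node size and overhead, so the numbers are only an illustration of the trade-off.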

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
What we can do is get into the habit of compiling the list on LinkedIn but making sure this list is shared and broadcast here, eh?! As well, when we broadcast the videos, we can do this using zoom/jitsi/ riverside.fm as well as simulcasting this on LinkedIn. This way you can view directly on the

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-15 Thread karan alang
Hi Mich, this doesn't seem to be working for me .. the watermark seems to be getting ignored ! Here is the data put into Kafka : ``` +---++ |value |key |

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
Understood Nitin It would be wrong to act against one's conviction. I am sure we can find a way around providing the contents Regards Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
Hi Nitin, LinkedIn is more of a professional medium. FYI, I am only a member of LinkedIn, no Facebook, etc. There is no reason for you NOT to create a profile for yourself on LinkedIn :) https://www.linkedin.com/help/linkedin/answer/a1338223/sign-up-to-join-linkedin?lang=en see you there as

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Bjørn Jørgensen
Great. A case that I hope can be better documented, especially now that we have Pandas API on Spark and many potential new users coming from Pandas. Is how to start Spark with full available memory and CPU. I use this function to do this in a notebook. import multiprocessing import os import sys
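The message above is cut off before the function itself, so the following is not that function. It is a separate minimal sketch of how the host's CPU count and physical memory could be turned into settings for a local session, assuming a Linux host; the 0.8 memory fraction and the name `local_spark_settings` are assumptions, and passing the values to Spark is left out.

```python
import multiprocessing
import os
from typing import Dict


def local_spark_settings(memory_fraction: float = 0.8) -> Dict[str, str]:
    """Derive a local master URL and driver memory from the host's CPU and RAM."""
    cores = multiprocessing.cpu_count()
    # Physical memory in bytes; these sysconf names are available on Linux.
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    memory_gb = max(1, int(total_bytes * memory_fraction / 1024**3))
    return {
        "spark.master": f"local[{cores}]",
        "spark.driver.memory": f"{memory_gb}g",
    }


settings = local_spark_settings()
for key, value in settings.items():
    print(f"{key}={value}")
```

Leaving a share of memory to the operating system and the notebook process is a design choice of this sketch; the driver memory also has to be in place before the driver JVM starts, so where these values are applied matters.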

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
Thanks Mich for tackling this! I encourage everyone to add to the list so we can have a comprehensive list of topics, eh?! On Wed, Mar 15, 2023 at 10:27 Mich Talebzadeh wrote: > Hi all, > > Thanks to @Denny Lee to give access to > > https://www.linkedin.com/company/apachespark/ > > and

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
Hi all, Thanks to @Denny Lee to give access to https://www.linkedin.com/company/apachespark/ and contribution from @asma zgolli You will see my post at the bottom. Please add anything else on topics to the list as a comment. We will then put them together in an article perhaps. Comments

Re: logging pickle files on local run of spark.ml Pipeline model

2023-03-15 Thread Sean Owen
Pickle won't work. But the others should. I think you are specifying an invalid path in both cases but hard to say without more detail On Wed, Mar 15, 2023, 9:13 AM Mnisi, Caleb wrote: > Good Day > > > > I am having trouble saving a spark.ml Pipeline model to a pickle file, > when running

logging pickle files on local run of spark.ml Pipeline model

2023-03-15 Thread Mnisi, Caleb
Good Day I am having trouble saving a spark.ml Pipeline model to a pickle file, when running locally on my PC. I've tried a few ways to save the model: 1. mlflow.spark.log_model(artifact_path=experiment.artifact_location, spark_model= model, registered_model_name="myModel") * with

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Mich Talebzadeh
In Spark Structured Streaming we cannot perform repartition() without stopping the streaming process. Admittedly, it is not a parameter that I have played around with. I still think the Spark GUI should provide some insight.

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
That's incorrect, it's spark.default.parallelism, but as the name suggests, that is merely a default. You control partitioning directly with .repartition() On Tue, Mar 14, 2023 at 11:37 AM Mich Talebzadeh wrote: > Check this link > > >

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Mich Talebzadeh
Check this link https://sparkbyexamples.com/spark/difference-between-spark-sql-shuffle-partitions-and-spark-default-parallelism/ You can set it spark.conf.set("sparkDefaultParallelism", value]) Have a look at Streaming statistics in Spark GUI, especially *Processing Time*, defined by

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Sean Owen
Are you just looking for DataFrame.repartition()? On Tue, Mar 14, 2023 at 10:57 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello, > > I hope this email finds you well! > > I have a simple dataflow in which I read from a kafka topic, perform a map > transformation and then

Question related to asynchronously map transformation using java spark structured streaming

2023-03-14 Thread Emmanouil Kritharakis
Hello, I hope this email finds you well! I have a simple dataflow in which I read from a kafka topic, perform a map transformation and then I write the result to another topic. Based on your documentation here

Re: Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Mich Talebzadeh
What benefits are you going for with increasing parallelism? Better throughput?

Question related to parallelism using structed streaming parallelism

2023-03-14 Thread Emmanouil Kritharakis
Hello, I hope this email finds you well! I have a simple dataflow in which I read from a kafka topic, perform a map transformation and then I write the result to another topic. Based on your documentation here

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Mich Talebzadeh
Hi Denny, That Apache Spark LinkedIn page https://www.linkedin.com/company/apachespark/ looks fine. It also allows a wider audience to benefit from it. +1 for me

Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-14 Thread Gary Liu
Hi Mich, The y-axis is the number of executors. The code ran on Dataproc Serverless Spark 3.3.2. I tried disabling autoscaling by setting the following:

spark.dynamicAllocation.enabled=false
spark.executor.instances=60

and still got the FetchFailedException error. I wonder why it can run

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
In the past, we've been using the Apache Spark LinkedIn page and group to broadcast these type of events - if you're cool with this? Or we could go through the process of submitting and updating the current https://spark.apache.org or request to

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Joris Billen
This is a very good idea-would love to read such a confluence page. Adding a section “common mistakes/misconceptions” might be useful for many of these sections. It would describe undesired behaviour/errors one would get in case of not following some best practices. On 13 Mar 2023, at 17:20,

Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread yangjie01
From the release notes of antlr4, there are two key changes in antlr4 4.10: 1. 4.10-generated parsers are incompatible with previous runtimes 2. The minimum Java version is increased to Java 11. So I personally think it is temporarily impossible for Spark to upgrade to an antlr4 version above

Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sean Owen
You want Antlr 3 and Spark is on 4? no I don't think Spark would downgrade. You can shade your app's dependencies maybe. On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna wrote: > Hi Team > > > > We are upgrading a legacy application using Spring boot , Spark and > Hibernate. While upgrading

Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sahu, Karuna
Hi Team We are upgrading a legacy application using Spring boot , Spark and Hibernate. While upgrading Hibernate to 6.1.6.Final version there is a mismatch for antlr4 runtime jar with Hibernate and latest Spark version. Details for the issue are posted on StackOverflow as well: Issue in

Re: spark on k8s daemonset collect log

2023-03-14 Thread Cheng Pan
Filebeat supports multiline matching; here is an example [1]. BTW, I’m working on External Log Service integration [2]; it may be useful in your case, feel free to review and leave comments. [1] https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html#multiline [2]

spark on k8s daemonset collect log

2023-03-14 Thread 404
hi, all Spark runs on k8s, uses daemonset filebeat to collect logs, and writes them to elasticsearch. The docker logs are in json format, and each line is a json string. How to merge multi-line exceptions?
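The idea behind multiline matching can be sketched in plain Python: a line that looks like the start of a log event opens a new record, and every other line is appended to the previous one. This is not Filebeat configuration; the start-of-event pattern (a leading date), the function name `merge_multiline`, and the sample records are assumptions about the log layout.

```python
import json
import re
from typing import Iterable, List

# Assumption: a new log event starts with a date such as "2023-03-14" or "23/03/14";
# any other line (stack frames, "Caused by: ...") continues the previous event.
EVENT_START = re.compile(r"^\d{2,4}[-/]\d{2}[-/]\d{2}")


def merge_multiline(docker_json_lines: Iterable[str]) -> List[str]:
    """Group Docker json-file records into events, one string per event."""
    events: List[str] = []
    for raw in docker_json_lines:
        message = json.loads(raw)["log"].rstrip("\n")
        if EVENT_START.match(message) or not events:
            events.append(message)
        else:
            events[-1] += "\n" + message
    return events


sample = [
    json.dumps({"log": "2023-03-14 10:00:00 ERROR Executor: Exception in task 0.0\n", "stream": "stderr"}),
    json.dumps({"log": "java.lang.RuntimeException: boom\n", "stream": "stderr"}),
    json.dumps({"log": "\tat com.example.Job.run(Job.scala:42)\n", "stream": "stderr"}),
    json.dumps({"log": "2023-03-14 10:00:01 INFO Executor: Finished task 1.0\n", "stream": "stdout"}),
]
events = merge_multiline(sample)
print(len(events))
```

Each JSON record is decoded first, so the pattern is applied to the log text rather than to the raw JSON line; a collector configured for multiline handling has to do the same decoding before matching.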

Extending spark connectors versus providing utility libraries

2023-03-13 Thread Jarus Local
Hi team, I had a design question around whether it’s a good idea to write wrappers over all existing Spark connectors to add some functionality or improve usability in terms of the options passed to the connector, in contrast to providing utility libraries that take parameters and call the

Extending Spark Connector versus Providing utility library

2023-03-13 Thread Jarus Local
Hi team, I had a design question around whether it’s a good idea to write wrappers over all existing Spark connectors to add some functionality or improve usability in terms of the options passed to the connector, in contrast to providing utility libraries that take parameters and call the

Re: Topics for Spark online classes & webinars

2023-03-13 Thread Mich Talebzadeh
Well that needs to be created first for this purpose. The appropriate name etc. to be decided. Maybe @Denny Lee can facilitate this as he offered his help. cheers view my Linkedin profile

Re: Topics for Spark online classes & webinars

2023-03-13 Thread asma zgolli
Hello Mich, Can you please provide the link for the confluence page? Many thanks Asma Ph.D. in Big Data - Applied Machine Learning Le lun. 13 mars 2023 à 17:21, Mich Talebzadeh a écrit : > Apologies I missed the list. > > To move forward I selected these topics from the thread "Online classes

Re: Topics for Spark online classes & webinars

2023-03-13 Thread Mich Talebzadeh
Apologies, I missed the list. To move forward I selected these topics from the thread "Online classes for spark topics". To take this further I propose that a Confluence page be set up. 1. Spark UI 2. Dynamic allocation 3. Tuning of jobs 4. Collecting spark metrics for monitoring and

Topics for Spark online classes & webinars

2023-03-13 Thread Mich Talebzadeh
Hi guys, To move forward I selected these topics from the thread "Online classes for spark topics". To take this further I propose that a Confluence page be set up. Opinions and how-tos are welcome. Cheers

Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-13 Thread Mich Talebzadeh
Hi Gary, Thanks for the update. So this is serverless Dataproc on 3.3.1. Maybe an autoscaling policy could be an option. What is the y-axis? Is that the capacity? Can you break down the join into multiple parts and save the intermediate result set? HTH

Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-13 Thread Gary Liu
Hi Mich, I used the serverless Spark session, not the local mode in the notebook, so machine type does not matter in this case. Below is the chart for serverless Spark session execution. I also tried to increase executor memory and cores, but the issue did not get resolved. I will try shutting down
