Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
If no directories are explicitly specified, a default directory is created and configured appropriately. emptyDir volumes use the ephemeral storage feature of Kubernetes and do not persist beyond the life of the pod. (Thu 11 Apr 2024 at 10:29, Bjørn Jørgensen wrote:) > "In the end for my usecase…
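For contrast with the PVC approach discussed in this thread, a minimal sketch of backing executor scratch space with an emptyDir volume instead (the volume name spark-local-dir-1 follows Spark's naming convention for local storage; the size limit is an arbitrary example):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # emptyDir scratch space: lives only as long as the executor pod
    .config("spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-1.mount.path",
            "/tmp/spark-local")
    .config("spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-1.mount.readOnly",
            "false")
    .config("spark.kubernetes.executor.volumes.emptyDir.spark-local-dir-1.options.sizeLimit",
            "100Gi")
    .getOrCreate()
)
```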

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
You can make a PVC on K8S, call it 300GB.

Make a folder in your Dockerfile:

    WORKDIR /opt/spark/work-dir
    RUN chmod g+w /opt/spark/work-dir

Start Spark with these settings added:

    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
    .config("spark.local.dir", "/opt/spark/work-dir")

(Sat 6 Apr 2024 at 15:45, Mich Talebzadeh wrote:)

> I have seen some older references to a shuffle service for k8s, although it is not clear they are talking about a generic shuffle service for k8s.
>
> Anyhow, with the advent of genai and the need to allow for a larger volume of data, I was wondering if there has been any more work on this matter. Specifically, larger and scalable file systems like HDFS, GCS, S3, etc. offer significantly larger storage capacity than local disks on individual worker nodes in a k8s cluster, thus allowing much larger datasets to be handled more efficiently. The degree of parallelism and fault tolerance of these file systems also comes into it. I will be interested in hearing more about any progress on this.
>
> Thanks
>
> Mich Talebzadeh

Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen

Re: [External] Re: Issue of spark with antlr version

2024-04-06 Thread Bjørn Jørgensen
… on April 1st, 2024. 3. Apache Spark 4.0.0 RC1 on May 1st, 2024. 4. Apache Spark 4.0.0 release in June 2024.

(Tue 2 Apr 2024 at 12:06, Chawla, Parul wrote:)
> ++ Ashima

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Bjørn Jørgensen
[image: image.png]

(Wed 28 Feb 2024 at 11:28, Chawla, Parul wrote:)
> Hi,
> Can we get the Spark version on which this is resolved?

Re: Issue of spark with antlr version

2024-02-27 Thread Bjørn Jørgensen

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Bjørn Jørgensen
…at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:286)
>> at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:209)
>>
>> Dataset<Row> df =
>>     spark
>>         .readStream()
>>         .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
>>         .options(appConfig.getKafka().getConf())
>>         .load()
>>         .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
>>
>> df.writeStream()
>>     .foreachBatch(new KafkaS3PipelineImplementation(applicationId, appConfig))
>>     .option("checkpointLocation", appConfig.getChk().getPath())
>>     .start()
>>     .awaitTermination();
>>
>> Regards,
>> Abhishek Singla
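The DiskChecker error in the subject typically means the local directory that buffers s3a block uploads is missing, unwritable, or full. A hedged sketch of pointing both Spark's scratch space and the S3A buffer directory at a known-writable location (the paths are examples, not from the thread):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # scratch space for shuffle and spills; must exist and be writable
    .config("spark.local.dir", "/mnt/tmp/spark")
    # local directory the S3A connector buffers blocks in before upload
    .config("spark.hadoop.fs.s3a.buffer.dir", "/mnt/tmp/s3a")
    .getOrCreate()
)
```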

Re: Okio Vulnerability in Spark 3.4.1

2024-01-11 Thread Bjørn Jørgensen
[SPARK-46662][K8S][BUILD] Upgrade kubernetes-client to 6.10.0 <https://github.com/apache/spark/pull/44672>: a new version of kubernetes-client with okio version 1.17.6 is now merged to master and will be in the Spark 4.0 version.

(Tue 14 Nov 2023 at 15:21, Bjørn Jørgensen wrote:)
> FYI

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
…framework and a small dataset.
>>> - For complex queries, using a linter or code analysis tool can help identify potential issues.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh
>>>
>>> (On Sun, 24 Dec 2023 at 07:57, ram manickam wrote:)
>>>> Hello,
>>>> Is there a way to validate pyspark sql for syntax errors only? I cannot connect to the actual data set to perform this validation. Any help would be appreciated.
>>>>
>>>> Thanks
>>>> Ram
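A hedged sketch of one syntax-only check: Spark's SQL parser can build a logical plan without resolving tables, so parse failures surface while missing datasets do not. This goes through the private _jsparkSession handle, so treat it as a workaround rather than a supported API:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def is_valid_syntax(query: str) -> bool:
    try:
        # parsePlan only parses; it does not resolve tables or columns
        spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
        return True
    except Exception:
        return False

print(is_valid_syntax("select id frm tbl"))   # False: parse error
print(is_valid_syntax("select id from tbl"))  # True even if tbl does not exist
```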

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen

Re: Okio Vulnerability in Spark 3.4.1

2023-11-14 Thread Bjørn Jørgensen
…<https://nvd.nist.gov/vuln/detail/CVE-2023-3635>
>>
>> Any guidance here would be of great help.
>>
>> Thanks,
>> Sanket A.

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
[SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution <https://github.com/apache/spark/pull/40893>

(Thu 2 Nov 2023 at 09:15, Bjørn Jørgensen wrote:)
> In Spark 3.5.0, jackson-core-asl and jackson-mapper-asl were removed; those are…

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
…requires Jackson Databind version >= 2.10.0 and < 2.11.0
>
> According to the Spark 3.3.0 release notes ("Upgrade Jackson to 2.13.3"), but the Spark 3.4.1 package contains Jackson 2.10.5 (https://spark.apache.org/releases/spark-release-3-3-0.html). What am I missing?
>
> Moshik Vitas
> Senior Software Developer, Crossix, Veeva Systems

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Bjørn Jørgensen
…rev.scode = p.bcode
>>
>> The item has two rows which have common attributes, and the final join should result in 2 rows, but I am seeing 4 rows instead.
>>
>> left join item I
>>   on rev.sys = i.sys
>>   rev.custumer_id = I.custumer_id
>>   rev.scode = I.scode
>>
>> Regards,
>> Meena

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Bjørn Jørgensen

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
EDIT: I don't think the question asker wants only the top 25 percent returned.

(Sat 16 Sep 2023 at 21:54, Bjørn Jørgensen wrote:)
> percentile_approx returns the approximate percentile(s) <https://github.com/apache/spark/pull/14868>. The memory consumption…

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
(On Sat, 16 Sept 2023 at 11:46, Mich Talebzadeh wrote:)
>> Happy Saturday coding

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen

Re: Spark stand-alone mode

2023-09-15 Thread Bjørn Jørgensen
>> As this is local mode, we are facing performance issues (as there is only one executor) when dealing with large datasets.
>>
>> Can I convert these 4 nodes into a Spark standalone cluster? We don't have Hadoop, so YARN mode is out of scope.

Re: Filter out 20% of rows

2023-09-15 Thread Bjørn Jørgensen
(Fri 15 Sep 2023 at 20:14, ashok34...@yahoo.com.INVALID wrote:)
> Hi team,
>
> I am using PySpark 3.4.
>
> I have a table of a million rows with a few columns, among them incoming IPs, what is known as gbps (Gigabytes per second), and the date and time of the incoming IP.
>
> I want to filter out the 20% least active IPs and work on the remainder of the data. How can I do this in PySpark?
>
> Thanks
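A minimal sketch of one way to answer the quoted question: aggregate activity per IP, take the 20th percentile with approxQuantile, and keep only IPs above it (the source path and the ip/gbps column names are assumptions based on the question):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("traffic")  # hypothetical source table

# total activity per ip
activity = df.groupBy("ip").agg(F.sum("gbps").alias("total_gbps"))

# 20th percentile of per-ip activity; the last argument is the relative error
cutoff = activity.approxQuantile("total_gbps", [0.2], 0.01)[0]

# drop the least active 20% of ips, then restrict the original rows to the rest
active_ips = activity.filter(F.col("total_gbps") > cutoff).select("ip")
result = df.join(active_ips, "ip", "inner")
```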

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Bjørn Jørgensen

Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Bjørn Jørgensen
…Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
> 2023-08-20T19:45:19,691 INFO [shutdown-hook-0] util.ShutdownHookManager: Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64

Re: Spark Vulnerabilities

2023-08-14 Thread Bjørn Jørgensen

Re: [PySpark][UDF][PickleException]

2023-08-10 Thread Bjørn Jørgensen
)], "constant", > constant_values=0.0).tolist() > > But this works: > @udf("array>") > def pad(arr, n): > padded_arr = [] > for i in range(n): > padded_arr.append([0.0] * len(arr[0])) > padded_arr.extend(arr) > return padded_

Re: Rename columns without manually setting them all

2023-06-21 Thread Bjørn Jørgensen
> …tus`' for col in date_columns])}) as (`Date`, `Status`)"])
>
> result = df.groupby("Date", "Status").count()
>
> (On 21 Jun 2023, at 11:45, John Paul Jayme wrote:)
> Hi,
>
> This is currently my column definition:
>
> Employee ID | Name    | Client  | Project | Team   | 01/01/2022 | 02/01/2022 | 03/01/2022 | 04/01/2022 | 05/01/2022
> 12345       | Dummy x | Dummy a | abc     | team a | OFF        | WO         | WH         | WH         | WH
>
> As you can see, the outer columns are just daily attendance dates. My goal is to count the employees who were OFF / WO / WH on said dates. I need to transpose them so it would look like this: [table image]
>
> I am still new to pandas. Can you guide me on how to produce this? I am reading about melt() and set_index() but I am not sure if they are the correct functions to use.
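A sketch of the melt() approach the question asks about, using the pandas API on Spark (column names follow the example table; the counting step is an assumption about the desired output):

```
from pyspark import pandas as ps

# toy attendance table shaped like the example above
df = ps.DataFrame({
    "Employee ID": [12345], "Name": ["Dummy x"], "Client": ["Dummy a"],
    "Project": ["abc"], "Team": ["team a"],
    "01/01/2022": ["OFF"], "02/01/2022": ["WO"], "03/01/2022": ["WH"],
})

id_cols = ["Employee ID", "Name", "Client", "Project", "Team"]
date_cols = [c for c in df.columns if c not in id_cols]

# melt turns one column per date into (Date, Status) rows
long = df.melt(id_vars=id_cols, value_vars=date_cols,
               var_name="Date", value_name="Status")

# employees per date and status
print(long.groupby(["Date", "Status"]).size())
```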

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
(On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen wrote:)
>> This is the pandas API on Spark:
>>
>> from pyspark import pandas as ps
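Continuing that suggestion, a minimal sketch of reading an Excel file through the pandas API on Spark (ps.read_excel needs openpyxl installed; the file name and sheet are placeholders):

```
from pyspark import pandas as ps

pdf = ps.read_excel("report.xlsx", sheet_name=0)
sdf = pdf.to_spark()  # convert to a regular Spark DataFrame when needed
sdf.printSchema()
```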

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>2.13.8</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.13</artifactId>
>   <version>3.4.0</version>
>   <scope>provided</scope>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql_2.13</artifactId>
>   <version>3.4.0</version>
>   <scope>provided</scope>
> </dependency>
>
> Thanks

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
> …26411339/19476830> can be a pitfall for you.
>
> Best Regards!
> Lingzhe Sun
> Hirain Technology

Re: maven with Spark 3.4.0 fails compilation

2023-05-28 Thread Bjørn Jørgensen

Re: Change column values using several when conditions

2023-05-01 Thread Bjørn Jørgensen
> …, the call to withColumn() gets ignored. How can I do exactly that in a more efficient way using Spark in Java?

Re: Non string type partitions

2023-04-15 Thread Bjørn Jørgensen
>>>> We are running into the below error when trying to run a simple query on a partitioned table in Spark:
>>>>
>>>> MetaException(message: Filtering is supported only on partition keys of type string)

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
Yes, it looks inside the Docker container's folders. It will work if you are using s3 or gs.

(Wed 12 Apr 2023 at 18:02, Mich Talebzadeh wrote:)
> Hi,
>
> In my spark-submit to the EKS cluster, I use the standard code to submit to the cluster as below:
>
> spark-submit --verbose \
>     --master …

Re: Slack for PySpark users

2023-04-04 Thread Bjørn Jørgensen
>>>>>>>>>> …been suggested as well, so those who like investigative search can agree and come up with a freebie one.

Re: Looping through a series of telephone numbers

2023-04-02 Thread Bjørn Jørgensen
(On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau wrote:)
>>> Hello,
>>> I'm looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a data set column.
>>>
>>> In pseudo code:
>>>
>>> for tel in [tel1, tel2, …, tel40,000]
>>>     search for tel in dataset using .like("%tel%")
>>> end for
>>>
>>> I'm using the like function because the telephone numbers in the data set may contain prefixes, such as "+"; e.g., "+331222".
>>>
>>> Any suggestions would be welcome.
>>>
>>> Many thanks.
>>>
>>> Philippe
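One hedged alternative to the 40,000-iteration loop sketched in the question: collect the numbers into a single alternation regex and evaluate it once per row (file names and the phone column are hypothetical, and this assumes the list fits comfortably in one pattern):

```
import re
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

tels = [r["tel"] for r in spark.read.csv("tels.csv", header=True).collect()]
df = spark.read.parquet("dataset")

# one regex with many alternatives instead of 40,000 .like() filters;
# re.escape keeps characters such as "+" literal
pattern = "|".join(re.escape(t) for t in tels)
matches = df.filter(F.col("phone").rlike(pattern))
```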

Re: Slack for PySpark users

2023-03-30 Thread Bjørn Jørgensen
(On Tue, 28 Mar 2023 at 03:52, asma zgolli wrote:)
>>>>>>> +1 good idea, I'd like to join as well.
>>>>>>>
>>>>>>> (On Tue, 28 Mar 2023 at 04:09, Winston Lai wrote:)
>>>>>>>> Please let us know when the channel is created. I'd like to join :)
>>>>>>>>
>>>>>>>> (From: Denny Lee, Tuesday, March 28, 2023 9:43:08 AM)
>>>>>>>> +1 I think this is a great idea!
>>>>>>>>
>>>>>>>> (On Mon, Mar 27, 2023 at 6:24 PM Hyukjin Kwon wrote:)
>>>>>>>> Yeah, actually I think we'd better have a Slack channel so we can easily discuss with users and developers.
>>>>>>>>
>>>>>>>> (On Tue, 28 Mar 2023 at 03:08, keen wrote:)
>>>>>>>> Hi all,
>>>>>>>> I really like Slack as a communication channel for a tech community. There is a Slack workspace for delta lake users (https://go.delta.io/slack) that I enjoy a lot. I was wondering if there is something similar for PySpark users.
>>>>>>>>
>>>>>>>> If not, would there be anything wrong with creating a new Slack workspace for PySpark users (when explicitly mentioning that this is *not* officially part of Apache Spark)?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Martin

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Bjørn Jørgensen
(On Mon, 13 Mar 2023 at 16:29, asma zgolli wrote:)
>>>>> Hello Mich,
>>>>> Can you please provide the link for the confluence page?
>>>>> Many thanks
>>>>> Asma
>>>>>
>>>>> (On Mon, 13 Mar 2023 at 17:21, Mich Talebzadeh wrote:)
>>>>> Apologies, I missed the list.
>>>>>
>>>>> To move forward, I selected these topics from the thread "Online classes for spark topics". To take this further, I propose a confluence page to be set up.
>>>>>
>>>>> 1. Spark UI
>>>>> 2. Dynamic allocation
>>>>> 3. Tuning of jobs
>>>>> 4. Collecting spark metrics for monitoring and alerting
>>>>> 5. For those who prefer to use the Pandas API on Spark since the release of Spark 3.2: what are some important notes for those users? For example, what are the additional factors affecting Spark performance when using the Pandas API on Spark, and how to tune them in addition to the conventional Spark tuning methods applied to Spark SQL users?
>>>>> 6. Spark internals and/or comparing Spark 3 and 2
>>>>> 7. Spark Streaming & Spark Structured Streaming
>>>>> 8. Spark on notebooks
>>>>> 9. Spark on serverless (for example Spark on Google Cloud)
>>>>> 10. Spark on k8s
>>>>>
>>>>> Opinions and how-tos are welcome.

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Bjørn Jørgensen
>>>>>>> …using the Pandas API on Spark? How to tune them in addition to the conventional Spark tuning methods applied to Spark SQL users.
>>>>>>> 6. Spark internals and/or comparing Spark 3 and 2

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
>>>> …end"), 0))
>>>> ```
>>>>
>>>> Basically, the distance is the maximum of three terms.
>>>>
>>>> This line causes an obscure error:
>>>>
>>>> ```
>>>> ValueError: Cannot convert column into bool: please use '&…
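That ValueError is what PySpark raises when a Column ends up in a boolean context, for instance when Python's built-in max() compares Columns. A minimal sketch of the usual fix with F.greatest (column names are hypothetical):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 5, 3)], ["a", "b", "c"])

# greatest() picks the per-row maximum; Python's max() would try to
# coerce the Columns to booleans and fail with the error quoted above
df = df.withColumn("distance", F.greatest("a", "b", "c"))
df.show()
```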

Re: Graceful shutdown SPARK Structured Streaming

2023-02-19 Thread Bjørn Jørgensen
>>>>> e.stop()
>>>>> else:
>>>>>     print("DataFrame newtopic is empty")
>>>>>
>>>>> This seems to work, as I checked it to ensure that in this case data was written and saved to the target sink (BigQuery table). It will wait until data is written completely, meaning the current streaming message is processed, and there is a latency there (waiting for graceful completion).
>>>>>
>>>>> This is the output:
>>>>>
>>>>> Terminating streaming process md
>>>>> wrote to DB  ## this is the flag I added to ensure the current micro-batch was completed
>>>>> 2021-04-23 09:59:18,029 ERROR streaming.MicroBatchExecution: Query md [id = 6bbccbfe-e770-4fb0-b83d-0dedd0ee571b, runId = 2ae55673-6bc2-4dbe-af60-9fdc0447bff5] terminated with error
>>>>>
>>>>> The various termination processes are described in the Structured Streaming Programming Guide <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries>.
>>>>>
>>>>> This is the idea I came up with, which allows ending the streaming process with the least cost.
>>>>>
>>>>> HTH
>>>>>
>>>>> (On Wed, 5 May 2021 at 17:30, Gourav Sengupta wrote:)
>>>>>> Hi,
>>>>>>
>>>>>> Just thought of reaching out once again and seeking your kind help to find out the best way to stop SPARK streaming gracefully. Do we still use the method of creating a file as in SPARK 2.4.x, which is a several-years-old method, or do we have a better approach in SPARK 3.1?
>>>>>>
>>>>>> (Forwarded: Graceful shutdown SPARK Structured Streaming, Wed, Apr 21, 2021)
>>>>>>
>>>>>> Dear friends,
>>>>>>
>>>>>> Is there any documentation available for gracefully stopping SPARK Structured Streaming in 3.1.x? I am referring to articles which are 4 to 5 years old and was wondering whether there is a better way available today to gracefully shut down a SPARK streaming job.
>>>>>>
>>>>>> Thanks a ton in advance for all your kind help.
>>>>>>
>>>>>> Regards,
>>>>>> Gourav Sengupta

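A compact sketch of the flag-based pattern described above: poll for an external stop signal and only call stop() once no trigger is active, so the in-flight micro-batch can complete (the rate source and flag file are stand-ins):

```
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (spark.readStream.format("rate").load()
         .writeStream.format("console").start())

while query.isActive:
    query.awaitTermination(10)  # returns after 10 seconds if still running
    if os.path.exists("/tmp/stop_streaming") and not query.status["isTriggerActive"]:
        query.stop()  # no batch in flight, so this is a clean stop
```
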
Re: How to explode array columns of a dataframe having the same length

2023-02-16 Thread Bjørn Jørgensen
> Hello guys,
>
> I have the following dataframe:
>
> col1              col2              col3
> ["A","B","null"]  ["C","D","null"]  ["E","null","null"]
>
> I want to explode it to the following dataframe:
>
> col1    col2    col3
> "A"     "C"     "E"
> "B"     "D"     "null"
> "null"  "null"  "null"
>
> How to do that (preferably in Java) using the explode() method? Knowing that something like the following won't yield correct output:
>
> for (String colName : dataset.columns())
>     dataset = dataset.withColumn(colName, explode(dataset.col(colName)));
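A sketch of the usual answer, shown in Python for brevity: zip the equal-length arrays into one array of structs, explode once, and unpack, which avoids the row multiplication that chained explode() calls produce:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(["A", "B", "null"], ["C", "D", "null"], ["E", "null", "null"])],
    ["col1", "col2", "col3"],
)

# one struct per array position, then a single explode over the zipped array
zipped = df.select(F.explode(F.arrays_zip("col1", "col2", "col3")).alias("z"))
result = zipped.select("z.col1", "z.col2", "z.col3")
result.show()
```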

Re: Spark SQL question

2023-01-28 Thread Bjørn Jørgensen
>> …as data, I thought the SQL had to look like this:
>>
>> select 1 as `data.group` from tbl group by `data.group`
>>
>> But that gives an error (cannot resolve '`data.group`')… I'm no expert in SQL, but it feels like strange behavior… does anybody have a good explanation for it?
>>
>> Thanks
>>
>> Kohki Nishio

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
…, "credits" or "license" for more information.
>>> from scipy.stats import norm
>>>

(Fri 6 Jan 2023 at 18:12, Oliver Ruebenacker wrote:)
> Thank you for the link. I already tried most of what was suggested there, but…

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp

(Fri 6 Jan 2023 at 16:01, Oliver Ruebenacker wrote:)
> Hello,
>
> I'm trying to install SciPy using a bootstrap script and then use…
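The linked answer comes down to a binary mismatch: SciPy compiled against a newer NumPy than the one present at runtime. A hedged sketch of a bootstrap line that sidesteps it by upgrading NumPy in the same step (the version floor is an assumption):

```
#!/bin/bash
# upgrade numpy together with scipy so the compiled C headers match
pip install --upgrade "numpy>=1.20" scipy
```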

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Bjørn Jørgensen
https://github.com/apache/spark/pull/39134

(Tue 20 Dec 2022 at 22:42, Oliver Ruebenacker wrote:)
> Thank you for the suggestion. This would, however, involve converting my DataFrame to an RDD (and back later), which involves additional costs.

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
…name) tuple.
> (On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen wrote:)
>> We have the pandas API on Spark <https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html>, which is very good.
>>
>> from pyspark import pandas as ps

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
t;>>>> >>>>>> Hello, >>>>>> >>>>>> How can I retain from each group only the row for which one value >>>>>> is the maximum of the group? For example, imagine a DataFrame containing >>>>>> all major cities in the world, with three columns: (1) City name (2) >>>>>> Country (3) population. How would I get a DataFrame that only contains >>>>>> the >>>>>> largest city in each country? Thanks! >>>>>> >>>>>> Best, Oliver >>>>>> >>>>>> -- >>>>>> Oliver Ruebenacker, Ph.D. (he) >>>>>> Senior Software Engineer, Knowledge Portal Network >>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad >>>>>> Institute <http://www.broadinstitute.org/> >>>>>> >>>>> >>>> >>>> -- >>>> Oliver Ruebenacker, Ph.D. (he) >>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute >>>> <http://www.broadinstitute.org/> >>>> >>> >> >> -- >> Oliver Ruebenacker, Ph.D. (he) >> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >> Flannick >> Lab <http://www.flannicklab.org/>, Broad Institute >> <http://www.broadinstitute.org/> >> > -- Bjørn Jørgensen Vestre Aspehaug 4, 6010 Ålesund Norge +47 480 94 297

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Bjørn Jørgensen
>>>>>>> at scala.tools.nsc.transform.TypingTransformers$TypingTransformer.transform(TypingTransformers.scala:57)
>>>>>>> at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:275)
>>>>>>> at scala.tools.nsc.transform.ExtensionMethods$Extender.transform(ExtensionMethods.scala:133)
>>>>>>> at scala.reflect.internal.Trees.itransform(Trees.scala:1409)
>>>>>>> at scala.reflect.internal.Trees.itransform$(Trees.scala:1400)
>>>>>>> at scala.reflect.internal.SymbolTable.itransform(SymbolTable.scala:28)
>>>>>>> at scala.reflect.api.Trees$Transformer.transform(Trees.scala:2563)
>>>>>>> … (the same frames repeat)
>>>
>>> Thanks
>>> Gnana

Re: How can I use backticks in column names?

2022-12-05 Thread Bjørn Jørgensen
…Dots / Periods in PySpark Column Names <https://mungingdata.com/pyspark/avoid-dots-periods-column-names/>

(Mon 5 Dec 2022 at 06:56, 한승후 wrote:)
> Spark throws an exception if there are backticks in the column name.
> Please help me.
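A small sketch of the workaround the linked article describes: rewrite problematic column names once, up front, so later references never need backtick escaping (the replacement scheme is an assumption):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a.b", "c`d"])

# strip dots and backticks once instead of escaping them in every query
cleaned = df.toDF(*[c.replace(".", "_").replace("`", "_") for c in df.columns])
print(cleaned.columns)  # ['a_b', 'c_d']
```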

Re: Creating a Spark 3 Connector

2022-11-23 Thread Bjørn Jørgensen

Re: spark - local question

2022-11-05 Thread Bjørn Jørgensen
….withColumnRenamed("category", "cateid") \
>     .withColumnRenamed("weight", "score") \
>     .withColumnRenamed("tag", "item_tags") \
>     .withColumnRenamed("modify_time", "item_modify_time") \
>     .withColumnRenamed("start_time", "dg_start_time") \
>     .withCo…

Re: spark - local question

2022-11-04 Thread Bjørn Jørgensen
…as follows:
>>
>> Our team wants to develop an ETL component based on Python. Data can be transferred between various data sources.
>>
>> If there is no YARN environment, can we read data from database A and write it to database B in local mode? Will this function be guaranteed to be stable and available?
>>
>> Thanks, we look forward to your reply.

Re: [Spark Core][Release]Can we consider add SPARK-39725 into 3.3.1 or 3.3.2 release?

2022-10-04 Thread Bjørn Jørgensen
>> …048 (High), which was set for the 3.4.0 release, but that will happen in Feb 2023. Is it possible to have it in an earlier release such as 3.3.1 or 3.3.2?

Re: Issue with SparkContext

2022-09-20 Thread Bjørn Jørgensen
…JavaError while running SparkContext.
> Can you please help me resolve this issue?

Re: Re: [how to]RDD using JDBC data source in PySpark

2022-09-20 Thread Bjørn Jørgensen
…("jdbc") is a good way to resolve it. But for some reasons I can't use the DataFrame API; I can only use the RDD API in PySpark.
>
> Thanks for all your help, but I still need a new idea to resolve this.

Re: Reply: [how to] RDD using JDBC data source in PySpark

2022-09-19 Thread Bjørn Jørgensen
> Is there some way to let an RDD use a JDBC data source in PySpark?
>
> I want to get data from MySQL, but PySpark does not support JdbcRDD like Java/Scala, and I searched the docs on the website without finding an answer.
>
> So I need yo…

Re: Jupyter notebook on Dataproc versus GKE

2022-09-14 Thread Bjørn Jørgensen
…for scheduling.
> (On Tue, Sep 6, 2022 at 10:01 AM Mich Talebzadeh wrote:)
>> Thank you all.
>> Has anyone used Argo as a k8s scheduler, by any chance?

Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Bjørn Jørgensen

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Bjørn Jørgensen

Re: Pyspark and multiprocessing

2022-07-21 Thread Bjørn Jørgensen
…IMHO. Wouldn't that create a CPU bottleneck?
> Also, on a side note: why do you need Spark if you use it only locally? Spark's power can mainly be observed in a cluster environment. I have achieved great parallelism using pandas and pools on a local machine in the past.

Fwd: Pyspark and multiprocessing

2022-07-20 Thread Bjørn Jørgensen
So now I have tried to run this function in a ThreadPool, but it doesn't seem to work. [image: image.png]

-- Forwarded message --
From: Sean Owen
Date: Wed 20 Jul 2022 at 22:43
Subject: Re: Pyspark and multiprocessing
To: Bjørn Jørgensen

I don't think you ever say what…

Pyspark and multiprocessing

2022-07-20 Thread Bjørn Jørgensen
…name for col_name in df.columns if col_name not in drop_column_list] + key_cols)
    # recompute remaining complex fields in the schema
    complex_fields = dict([
        (field.name, field.dataType)
        for field in df.schema.fields
        if type(field…

Re: How use pattern matching in spark

2022-07-14 Thread Bjørn Jørgensen
…file and .TXT file.
>
> So, as I see it, I could do validation for all these 3 file formats using spark.read.text().rdd and performing the intended operations on RDDs; just the validation part.
>
> Therefore, I wanted to understand: is there any better way to achieve this?

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
Ehh… what is a "duplicate column"? I don't think Spark supports that. Duplicate column = duplicate rows.

(Tue 5 Jul 2022 at 22:13, Bjørn Jørgensen wrote:)
> "but I am getting the issue of the duplicate column which was present in the old dataset."
>
> So…

Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
…I am getting the issue of the duplicate column which was present in the old dataset. So, I am trying to understand how Spark reads the data. Does it read the full dataset and filter on the basis of the last saved timestamp, or does it filter only what is required? If the second case is true, then it should have read the data, since the latest data is correct.
>>>
>>> So just trying to understand. Could anyone help here?
>>>
>>> Thanks,
>>> Sid

Re: Glue is serverless? how?

2022-06-26 Thread Bjørn Jørgensen
https://en.m.wikipedia.org/wiki/Serverless_computing

(Sun 26 Jun 2022 at 10:26, Sid wrote:)
> Hi Team,
>
> I am developing a Spark job in Glue and have read that Glue is serverless. I know that using Glue Studio we can autoscale the workers. However, I want to understand how it is…

Re: to find Difference of locations in Spark Dataframe rows

2022-06-09 Thread Bjørn Jørgensen
….getOrCreate()
>
> val housingDataDF = spark.read.csv("~/Downloads/real-estate-sample-data.csv")
>
> // searching for the property by `ref_id`
> val searchPropertyDF = housingDataDF.filter(col("ref_id") === search_property_id)
>
> // similar house in the same city (same postal code) and the group-one condition
> val similarHouseAndSameCity = housingDataDF.join(searchPropertyDF, groupThreeCriteria ++ groupOneCriteria, "inner")
>
> // similar house not in the same city but within 10 km range…

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
Yes, but how do you read it with Spark?

(Thu 26 May 2022 at 18:30, Sid wrote:)
> I am not reading it through pandas. I am using Spark, because when I tried to use pandas, which comes under import pyspark.pandas, it gives me an error.

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
…an escape character.
>> Can you check if this may cause any issues?
>>
>> Regards,
>> Apostolos
>>
>> (On 26/5/22 16:31, Sid wrote:)
>> Thanks for opening the issue, Bjorn. However, could you help me to add…

Re: Complexity with the data

2022-05-26 Thread Bjørn Jørgensen
names. > > PFB link: > > > https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark > > Thanks, > Sid > > On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen > wrote: > >> Sid, dump one of yours files. >> >> htt

Re: Complexity with the data

2022-05-25 Thread Bjørn Jørgensen
Sid, dump one of your files.

https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/

(Wed 25 May 2022 at 23:04, Sid wrote:)
> I have 10 columns, but in the dataset I observed that some records have 11 columns of data (for the additional column it is marked as null).
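A hedged sketch of one way to surface those 11-column records while reading: declare the 10 expected columns plus a corrupt-record column and read in PERMISSIVE mode, so ragged rows can be inspected rather than silently mangled (schema and path are assumptions):

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# declared schema: the 10 expected columns plus a catch-all column
fields = [StructField(f"c{i}", StringType()) for i in range(10)]
schema = StructType(fields + [StructField("_corrupt_record", StringType())])

df = (spark.read
      .option("header", True)
      .option("mode", "PERMISSIVE")  # keep malformed rows instead of failing
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("data.csv"))

# rows whose raw text did not fit the 10-column schema
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)
```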

Re: Count() action leading to errors | Pyspark

2022-05-07 Thread Bjørn Jørgensen
…could be the possible reason for that simple count error?
>
> Environment:
> AWS Glue 1.X
> 10 workers
> Spark 2.4.3
>
> Thanks,
> Sid

Re: Spark error with jupyter

2022-05-03 Thread Bjørn Jørgensen
I am working on Spark in Jupyter, but I get a small error on each run.
> Does anyone have the same error or a solution? Please tell me.

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
…it using the below script:
>
> find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
>
> Thanks,
> Sid
>
> (On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen wrote:)
>> And the bash script seems to read txt files, not json.

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
…with the below problem?
>>
>> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
>>
>> Thanks,
>> Sid

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
df = spark.read.json("/*.json")

Use the *.json glob.

(Tue 26 Apr 2022 at 16:44, Sid wrote:)
> Hello,
>
> Can somebody help me with the below problem?
> https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark

Re: Vulnerabilities in htrace-core4-4.1.0-incubating.jar jar used in spark.

2022-04-26 Thread Bjørn Jørgensen
…CVE-2019-14893, CVE-2019-14892, CVE-2019-14540, CVE-2019-14439, CVE-2019-14379, CVE-2019-12086, CVE-2018-7489, CVE-2018-5968, CVE-2018-14719, CVE-2018-14718, CVE-2018-12022, CVE-2018-11307, CVE-2017-7525, CVE-2017-17485, CVE-2017-15095
>
> Kind Regards,
> Harsh Takkar

Re: Vulnerabilities in htrace-core4-4.1.0-incubating.jar jar used in spark.

2022-04-26 Thread Bjørn Jørgensen
…CVE-2019-16335, CVE-2019-14893, CVE-2019-14892, CVE-2019-14540, CVE-2019-14439, CVE-2019-14379, CVE-2019-12086, CVE-2018-7489, CVE-2018-5968, …

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Bjørn Jørgensen
…integers if needed, and restore them later.
>>
>> Maybe I would be better off using many small machines? I assume memory is the limiting resource, not CPU. I notice that memory usage will reach 100%. I added several TBs of local SSD. I am not convinced that Spark is using the local disk.
>>
>> Will this perform better than join?
>>
>> - The rows before the final pivot will be very, very wide (over 5 million columns)
>> - There will only be 10114 rows before the pivot
>>
>> I assume the pivots will shuffle all the data. I assume the column vectors are trivial. The file table pivot will be expensive, but will only need to be done once.
>>
>> Comments and suggestions appreciated
>>
>> Andy

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Bjørn Jørgensen
|   |   |-- end: integer (nullable = false)
> |   |   |-- result: string (nullable = true)
> |   |   |-- metadata: map (nullable = true)
> |   |   |   |-- key: string
> |   |   |   |-- value: string (valueContainsNull = true)
> |   |   |-- …

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet

Change spark = sparknlp.start() to spark = sparknlp.start(spark32=True).

(Tue 19 Apr 2022 at 21:10, Bjørn Jørgensen wrote:)
> Yes, there are some that have that issue.
>
> Please open a new issue at
> https:…

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
…4.3, while the Java version is 8. I've tried with a different model, but the error is still the same, so what could be causing it?
>
> If this error is solved, I think Spark NLP will be the solution I was looking for to reduce memory consumption, so thank you again for suggesting…

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Bjørn Jørgensen
…sentiment").collect()]
> counts = [int(row.asDict()['count']) for row in df.select("count").collect()]
>
> print(entities, sentiments, counts)
>
> At first I tried with other NER models from Flair; they have the same effect. After printing the first batch, memory use starts increasing until it fails and stops the execution because of the memory error. When applying a "simple" function instead of the NER model, such as return words.split(), in the UDF there's no such error, so the data ingested should not be what's causing the overload, but the model.
>
> Is there a way to prevent the excessive RAM consumption? Why is there only the driver executor and no other executors generated? How could I prevent it from collapsing when applying the NER model?
>
> Thanks in advance!

Re: Spark Write BinaryType Column as continues file to S3

2022-04-09 Thread Bjørn Jørgensen
…LAS format specification, see http://www.asprs.org/wp-content/uploads/2019/07/LAS_1_4_r15.pdf, section 2.6, Table 7.
>
> Thanks

Re: Spark Write BinaryType Column as continues file to S3

2022-04-08 Thread Bjørn Jørgensen
In the new Spark 3.3 there will be an SQL function <https://github.com/apache/spark/commit/25dd4254fed71923731fd59838875c0dd1ff665a>; hope this can help you.

(Fri 8 Apr 2022 at 17:14, Philipp Kraus wrote:)
> Hello,
>
> I have got a data frame with numerical data…

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-07 Thread Bjørn Jørgensen
>>>>> I am using pyspark. Basically my code (simplified) is:
>>>>>
>>>>> df = spark.read.csv("hdfs://somehdfslocation")
>>>>> df1 = spark.sql(complex statement using df)
>>>>> ...
>>>>> dfx = spark.sql(co…

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-04-06 Thread Bjørn Jørgensen
>>>>> dfx = spark.sql(complex statement using df x-1)
>>>>> ...
>>>>> dfx15.write()
>>>>>
>>>>> What exactly is meant by "closing resources"? Is it just unpersisting…

Re: how to change data type for columns of dataframe

2022-04-02 Thread Bjørn Jørgensen

Re: loop of spark jobs leads to increase in memory on worker nodes and eventually faillure

2022-03-30 Thread Bjørn Jørgensen
…a healthy Spark 2.4 and was optimized already to come to a stable job in terms of spark-submit resource parameters like driver-memory / num-executors / executor-memory / executor-cores / spark.locality.wait).
>
> Any clue how to "really" clear the memory in between jobs? So basically, currently I can loop 10x and then need to restart my cluster so all memory is cleared completely.
>
> Thanks for any info!

Re: Question for so many SQL tools

2022-03-25 Thread Bjørn Jørgensen
> Just a question: why do so many SQL-based tools exist for data jobs?
>
> The ones I know:
>
> Spark
> Flink
> Ignite
> Impala
> Drill
> Hive
> …
>
> They are doing similar jobs, IMO.
>
> Thanks

Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
…speak to the support of this Spark feature? Is there active development, or is GraphX in maintenance mode (e.g., updated to ensure functionality with new Spark releases)?
>>>
>>> Thanks in advance for your help!

Re: [EXTERNAL] Re: GraphX Support

2022-03-25 Thread Bjørn Jørgensen
…possible solution. Would someone be able to speak to the support of this Spark feature? Is there active development, or is GraphX in maintenance mode (e.g., updated to ensure functionality with new Spark releases)?
>
> Thanks in advance for your help!

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen
…looking for a pyspark data frame column_bind() solution for several months. Hopefully pyspark.pandas works. The only other solution I was aware of was to use spark.dataframe.join(), which does not scale, for obvious reasons.
>
> Andy

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen
> Andy
>
> p.s. My real problem is that Spark does not allow you to bind columns. You can use union() to bind rows. I could get the equivalent of cbind() using union().transform().
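For the cbind() point above: the pandas API on Spark does offer a column-wise concat, aligned on the index. A minimal sketch (the data is illustrative):

```
from pyspark import pandas as ps

a = ps.DataFrame({"x": [1, 2, 3]})
b = ps.DataFrame({"y": [4, 5, 6]})

# axis=1 binds columns, the pandas-on-Spark equivalent of R's cbind()
wide = ps.concat([a, b], axis=1)
print(wide)
```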

Re: pivoting panda dataframe

2022-03-15 Thread Bjørn Jørgensen
