Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-13 Thread Gary Liu
Hi Mich, I used the serverless spark session, not the local mode in the notebook. So machine type does not matter in this case. Below is the chart for the serverless spark session execution. I also tried to increase executor memory and cores, but the issue did not get resolved. I will try shutting down

unsubscribe

2023-03-13 Thread ypl
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

unsubscribe

2023-03-13 Thread Jatinder Assi
unsubscribe

Re: Online classes for spark topics

2023-03-12 Thread vaquar khan
I saw you are looking for Holden's video. Please find the following link: https://www.oreilly.com/library/view/debugging-apache-spark/9781492039174/ Regards, Vaquar khan On Sun, Mar 12, 2023, 6:56 PM Mich Talebzadeh wrote: > Hi Denny, > > Thanks for the offer. How do you envisage that structure to be?

Re: Online classes for spark topics

2023-03-12 Thread Mich Talebzadeh
Hi Denny, Thanks for the offer. How do you envisage that structure to be? Also, it would be good to have a webinar (for a given topic) for different target audiences, as we have a mixture of members in Spark forums, for example beginners, intermediate and advanced. Do we have a confluence

Re: Online classes for spark topics

2023-03-12 Thread Denny Lee
Looks like we have some good topics here - I'm glad to help with setting up the infrastructure to broadcast if it helps? On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani wrote: > I am happy to be a part of this discussion as well. > > Regards, > Neeraj > > On Wed, 8 Mar 2023 at 22:41, Winston Lai

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-12 Thread Mich Talebzadeh
OK, ts is the timestamp, right? This is similar code that works out the average temperature over a time frame of 5 minutes. Note the comments and catching errors with try: try: # construct a streaming dataframe streamingDataFrame that subscribes to topic temperature
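For readers following along, here is a minimal runnable sketch of that pattern, a windowed average wrapped in try/except. The broker address, JSON schema and column names are assumptions, not taken from the original post.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("avg-temperature").getOrCreate()

# assumed message schema; "ts" is the event-time column
schema = StructType([
    StructField("rowkey", StringType()),
    StructField("ts", TimestampType()),
    StructField("temperature", DoubleType()),
])

try:
    # construct a streaming dataframe that subscribes to topic "temperature"
    streamingDataFrame = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
        .option("subscribe", "temperature")
        .load())

    # average temperature over 5-minute windows, tolerating 5 minutes of late data
    avgDF = (streamingDataFrame
        .select(from_json(col("value").cast("string"), schema).alias("v"))
        .select("v.*")
        .withWatermark("ts", "5 minutes")
        .groupBy(window(col("ts"), "5 minutes"))
        .agg(avg("temperature").alias("avg_temperature")))

    query = avgDF.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()
except Exception as e:
    print(f"Streaming query failed: {e}")
```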

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
" In this case your program may work because effectively you are not using the spark in yarn on the hadoop cluster " I am actually using Yarn as mentioned (client mode) I already know that, but it is not just about collectAsList, the execution freezes also for example when using save() on the

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread Mich Talebzadeh
collectAsList brings all the data into the driver, which is a single JVM on a single node. In this case your program may work because effectively you are not using Spark on YARN on the Hadoop cluster. The benefit of Spark is that you can process a large amount of data using the memory and
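A small sketch of the contrast being drawn, assuming a SparkSession and hypothetical HDFS paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-write").getOrCreate()

df = spark.read.parquet("hdfs:///data/input")   # hypothetical path; data stays distributed

# collect()/collectAsList() pulls every row into the single driver JVM -
# acceptable for small samples, a bottleneck (or OOM) for large datasets
sample = df.limit(1000).collect()

# keeping the work distributed: executors write their partitions in parallel
df.write.mode("overwrite").parquet("hdfs:///data/output")
```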

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
not sure what you mean by your question, but it is not helping in any case. On Sat, 11 Mar 2023 at 19:54, Mich Talebzadeh wrote: > > > ... To note that if I execute collectAsList on the dataset at the > beginning of the program > > What do you think collectAsList does? > > > >view

Re: What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread Mich Talebzadeh
... To note that if I execute collectAsList on the dataset at the beginning of the program What do you think collectAsList does? view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it

What could be the cause of an execution freeze on Hadoop for small datasets?

2023-03-11 Thread sam smith
Hello guys, I am launching through code (client mode) a Spark program to run on Hadoop. If I execute on the dataset methods like show(), count() or collectAsList() (which are displayed in the Spark UI) after performing heavy transformations on the columns, then the mentioned methods

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-10 Thread karan alang
Hi Mich - Here is the output of the ldf.printSchema() & ldf.show() commands. ldf.printSchema() root |-- applianceName: string (nullable = true) |-- timeslot: long (nullable = true) |-- customer: string (nullable = true) |-- window: struct (nullable = false) | |-- start: timestamp

How to allocate vcores to driver (client mode)

2023-03-10 Thread sam smith
Hi, I am launching through code (client mode) a Spark program to run in Hadoop. Whenever I check the executors tab of Spark UI I always get 0 as the number of vcores for the driver. I tried to change that using *spark.driver.cores*, or also *spark.yarn.am.cores* in the SparkSession configuration

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-10 Thread Mich Talebzadeh
Just looking at the code in here ldf = ldf.groupBy("applianceName", "timeslot", "customer", window(col("ts"), "15 minutes")) \ .agg({'sentOctets':"sum", 'recvdOctets':"sum"}) \ .withColumnRenamed('sum(sentOctets)', 'sentOctets') \
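For context, here is a sketch of that aggregation with an explicit watermark added before the groupBy, which is what the append/update output modes need in order to finalise windows. The 10-minute threshold, the second rename and the existence of a streaming `ldf` with a `ts` column are assumptions.

```python
from pyspark.sql.functions import col, window

ldf = (ldf
    .withWatermark("ts", "10 minutes")                     # bound how late data may arrive
    .groupBy("applianceName", "timeslot", "customer",
             window(col("ts"), "15 minutes"))
    .agg({'sentOctets': "sum", 'recvdOctets': "sum"})
    .withColumnRenamed('sum(sentOctets)', 'sentOctets')
    .withColumnRenamed('sum(recvdOctets)', 'recvdOctets'))
```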

Re: org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-10 Thread Mich Talebzadeh
For your dataproc, what type of machines are you using, for example n2-standard-4 with 4 vCPUs and 16GB, or something else? How many nodes, and is autoscaling turned on? Most likely an executor memory limit? HTH view my Linkedin profile

org.apache.spark.shuffle.FetchFailedException in dataproc

2023-03-10 Thread Gary Liu
Hi, I have a job in a GCP dataproc server spark session (spark 3.3.2); it is a job involving multiple joins, as well as a complex UDF. I always get the below FetchFailedException, but the job can be done and the results look right. Neither of the 2 input datasets is very big (one is 6.5M rows*11

Spark StructuredStreaming - watermark not working as expected

2023-03-09 Thread karan alang
Hello All - I've a structured Streaming job which has a trigger of 10 minutes, and I'm using watermark to account for late data coming in. However, the watermark is not working - and instead of a single record with total aggregated value, I see 2 records. Here is the code : ``` 1)

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread hueiyuan su
Dear Mich, Sure, that is a good idea. If we have a pause() function, we can temporarily stop streaming and adjust the configuration, maybe from an environment variable. Once these parameters are adjusted, we can resume the streaming to apply the newest parameters without stopping the Spark streaming application.

Re: How to share a dataset file across nodes

2023-03-09 Thread Mich Talebzadeh
Try something like below 1) Put your csv say cities.csv in HDFS as below hdfs dfs -put cities.csv /data/stg/test 2) Read it into dataframe in PySpark as below csv_file="hdfs://:PORT/data/stg/test/cities.csv" # read it in spark listing_df =
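A fuller sketch of those two steps; the namenode host and port are placeholders (the original elides them), as are the header/schema options.

```python
# 1) shell:  hdfs dfs -put cities.csv /data/stg/test
# 2) PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-hdfs-csv").getOrCreate()

csv_file = "hdfs://namenode:8020/data/stg/test/cities.csv"   # placeholder host:port
listing_df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(csv_file))

listing_df.printSchema()
listing_df.show(5, truncate=False)
```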

Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster? On Thu, Mar 9, 2023 at 3:02 PM sam smith wrote: > Hello, > > I use Yarn client mode to submit my driver program to Hadoop, the dataset > I load is from the local file system, when i invoke load("file://path") > Spark complains about the csv

How to share a dataset file across nodes

2023-03-09 Thread sam smith
Hello, I use Yarn client mode to submit my driver program to Hadoop. The dataset I load is from the local file system; when I invoke load("file://path") Spark complains about the csv file being not found, which I totally understand, since the dataset is not in any of the workers or the

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
Yeah, that's the right answer! Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com Book a time on Calendly On Thu, Mar 9,

Re: read a binary file and save in another location

2023-03-09 Thread Mich Talebzadeh
Does this need any action in PySpark? How about importing using the shutil package? https://sparkbyexamples.com/python/how-to-copy-files-in-python/ view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html This says "Binary file data source does not support writing a DataFrame back to the original files." which I take to mean this isn't possible... I haven't done this, but going from the docs, it would be:
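Going from the docs the same way, a sketch could read the files with the binaryFile source for inspection, and then do the actual copy through Hadoop's FileSystem API since the source cannot write the files back; the paths and the use of FileUtil.copy here are illustrative assumptions, not from the thread.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("binary-copy").getOrCreate()

# read-only view of the binary files (columns: path, modificationTime, length, content)
df = spark.read.format("binaryFile").load("hdfs:///data/in/*.bin")   # placeholder path
df.select("path", "length").show(truncate=False)

# the actual copy, using Hadoop's FileSystem through the JVM gateway
jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
jvm.org.apache.hadoop.fs.FileUtil.copy(
    fs, Path("hdfs:///data/in"),
    fs, Path("hdfs:///data/out"),
    False,    # deleteSource
    conf)
```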

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread Mich Talebzadeh
most probably we will require an additional method pause() https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html to allow us to pause (as opposed to stop()) the streaming process and resume after changing the parameters. The state of streaming

Re: Online classes for spark topics

2023-03-09 Thread neeraj bhadani
I am happy to be a part of this discussion as well. Regards, Neeraj On Wed, 8 Mar 2023 at 22:41, Winston Lai wrote: > +1, any webinar on Spark related topic is appreciated  > > Thank You & Best Regards > Winston Lai > -- > *From:* asma zgolli > *Sent:* Thursday,

eqNullSafe breaks Sorted Merge Bucket Join?

2023-03-09 Thread Thomas Wang
Hi, I have two tables t1 and t2. Both are bucketed and sorted on user_id into 32 buckets. When I use a regular equal join, Spark triggers the expected Sorted Merge Bucket Join. Please see my code and the physical plan below. from pyspark.sql import SparkSession def
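For reference, a sketch of the setup being described, with the bucketed writes shown as comments and both join conditions side by side so the physical plans can be compared; the table and column names follow the post, everything else is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smb-join").getOrCreate()

# assumes df1/df2 already exist; both tables written bucketed and sorted on user_id:
# df1.write.bucketBy(32, "user_id").sortBy("user_id").saveAsTable("t1")
# df2.write.bucketBy(32, "user_id").sortBy("user_id").saveAsTable("t2")

t1 = spark.table("t1")
t2 = spark.table("t2")

# regular equality join - eligible for a sort-merge bucket join with no extra shuffle/sort
eq_join = t1.join(t2, t1.user_id == t2.user_id)
eq_join.explain()

# null-safe equality (<=>); per the report above, this may not preserve the bucketed plan
ns_join = t1.join(t2, t1.user_id.eqNullSafe(t2.user_id))
ns_join.explain()
```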

Re: [EXTERNAL] Spark Thrift Server - Autoscaling on K8

2023-03-09 Thread Saurabh Gulati
Hey Jayabindu, We use Thrift Server on K8S. May I ask why you are not going for Trino instead? I know it didn't support autoscaling when we tested it in the past, but not sure if it does now. Autoscaling also means that users might have to wait for the cluster to autoscale, but that usually

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread asma zgolli
Hello spark community, Adding a new topic. - Spark UI - Dynamic allocation - Tuning of jobs - Collecting spark metrics for monitoring and alerting - For those who prefer to use Pandas API on Spark since the release of Spark 3.2, What are some important notes for those users?

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread Winston Lai
Hi everyone, I would like to add one topic to Saurabh's list as well. * Spark UI * Dynamic allocation * Tuning of jobs * Collecting spark metrics for monitoring and alerting * For those who prefer to use Pandas API on Spark since the release of Spark 3.2, What are some

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread Saurabh Gulati
Hey guys, Its a nice idea and appreciate the effort you guys are taking. I can add to the list of topics which might be of interest: * Spark UI * Dynamic allocation * Tuning of jobs * Collecting spark metrics for monitoring and alerting HTH From:

read a binary file and save in another location

2023-03-09 Thread second_co...@yahoo.com.INVALID
Any example on how to read a binary file using PySpark and save it in another location (a copy feature)? Thank you, Teoh

Re: Online classes for spark topics

2023-03-09 Thread Mich Talebzadeh
Hi Deepak, The priority list of topics is a very good point. The thread owner mentioned Spark on k8s, Data Science and Spark Structured Streaming. What other topics need to be included, I guess, depends on demand. I suggest we wait a couple of days to see the demand. We just need to create a

How to use Fair Scheduler Pools

2023-03-08 Thread 李杰
I have two questions to ask: I wrote a demo referring to the official website (https://spark.apache.org/docs/latest/job-scheduling.html), but it didn't meet my expectations. I don't know if there was a problem with my writing. I hope that when I use the following fairscheduler.xml, pool1
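As a point of comparison, here is a minimal sketch of how pools are usually wired up; the pool settings and the allocation-file path are illustrative, not the poster's actual files.

```python
# fairscheduler.xml (referenced via spark.scheduler.allocation.file), for example:
#
#   <allocations>
#     <pool name="pool1">
#       <schedulingMode>FAIR</schedulingMode>
#       <weight>2</weight>
#       <minShare>2</minShare>
#     </pool>
#   </allocations>
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("fair-pools")
    .config("spark.scheduler.mode", "FAIR")
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    .getOrCreate())

# jobs submitted from this thread go to pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
spark.range(1_000_000).count()

# clear the property to fall back to the default pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)
```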

Re: Online classes for spark topics

2023-03-08 Thread Deepak Sharma
I can prepare some topics and present as well, if we have a prioritised list of topics already. On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee wrote: > We used to run Spark webinars on the Apache Spark LinkedIn group > but >

Re: Online classes for spark topics

2023-03-08 Thread Denny Lee
We used to run Spark webinars on the Apache Spark LinkedIn group but honestly the turnout was pretty low. We had dived into various features. If there are particular topics that you would like to discuss during a live session,

Re: Online classes for spark topics

2023-03-08 Thread Sofia’s World
+1 On Wed, Mar 8, 2023 at 10:40 PM Winston Lai wrote: > +1, any webinar on Spark related topic is appreciated  > > Thank You & Best Regards > Winston Lai > -- > *From:* asma zgolli > *Sent:* Thursday, March 9, 2023 5:43:06 AM > *To:* karan alang > *Cc:* Mich

Spark Thrift Server - Autoscaling on K8

2023-03-08 Thread Jayabindu Singh
Hi All, We are in the process of moving our workloads to K8 and looking for some guidance to run Spark Thrift Server on K8. We need the executor pods to autoscale based on the workload vs running it with a static number of executors. If any one has done it and can share the details, it will be

[Spark] How to find which type of key is illegal during from_json() function

2023-03-08 Thread hueiyuan su
*Component*: Spark *Level*: Advanced *Scenario*: How-to - *Problems Description* I have a nested json string value in some field of a spark dataframe, and I would like to use from_json() to parse the json object. Especially, if one of the key types does not match our defined
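One way to locate such records, sketched under an assumed two-field schema: rows whose JSON cannot be parsed against the schema come back from from_json with a null struct or null fields, so filtering on those nulls surfaces the offending raw strings.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("find-bad-json").getOrCreate()

# assumed schema for the nested JSON value
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

df = spark.createDataFrame(
    [('{"id": 1, "name": "ok"}',), ('{"id": "not-a-number", "name": "bad"}',)],
    ["raw"])

parsed = df.withColumn("parsed", from_json(col("raw"), schema))

# keep the raw string next to the parsed struct, then filter for rows where the
# struct (or a field within it) failed to parse; note this also flags genuinely
# missing/null ids, so inspect the raw strings it returns
parsed.filter(col("parsed").isNull() | col("parsed.id").isNull()) \
      .select("raw").show(truncate=False)
```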

Re: Online classes for spark topics

2023-03-08 Thread Winston Lai
+1, any webinar on Spark related topic is appreciated  Thank You & Best Regards Winston Lai From: asma zgolli Sent: Thursday, March 9, 2023 5:43:06 AM To: karan alang Cc: Mich Talebzadeh ; ashok34...@yahoo.com ; User Subject: Re: Online classes for spark

spark-submit: No "driver-" id printed in standalone mode

2023-03-08 Thread Travis Athougies
Hello, I'm trying to get Airflow to work with spark in cluster mode. I can successfully submit jobs via spark-submit and see them complete successfully. However, 'spark-submit' doesn't seem to print any driver- ID to the console. Clearly the drivers have an ID, as they are listed with one in the

Re: Online classes for spark topics

2023-03-08 Thread asma zgolli
+1 Le mer. 8 mars 2023 à 21:32, karan alang a écrit : > +1 .. I'm happy to be part of these discussions as well ! > > > > > On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh > wrote: > >> Hi, >> >> I guess I can schedule this work over a course of time. I for myself can >> contribute plus learn

Re: Online classes for spark topics

2023-03-08 Thread karan alang
+1 .. I'm happy to be part of these discussions as well ! On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh wrote: > Hi, > > I guess I can schedule this work over a course of time. I for myself can > contribute plus learn from others. > > So +1 for me. > > Let us see if anyone else is

Re: Online classes for spark topics

2023-03-08 Thread Mich Talebzadeh
Hi, I guess I can schedule this work over a course of time. I for myself can contribute plus learn from others. So +1 for me. Let us see if anyone else is interested. HTH view my Linkedin profile

Re: Online classes for spark topics

2023-03-08 Thread ashok34...@yahoo.com.INVALID
Hello Mich. Greetings. Would you be able to arrange a Spark Structured Streaming learning webinar? This is something I have been struggling with recently; it would be very helpful. Thanks and Regards, AK. On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh wrote: Hi, This might

[ANNOUNCE] Apache Kyuubi released 1.7.0

2023-03-07 Thread Cheng Pan
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.0 has been released! Apache Kyuubi is a distributed multi-tenant Lakehouse gateway for large-scale data processing and analytics, built on top of Apache Spark, Apache Flink, Trino and also supports other computing

Re: Online classes for spark topics

2023-03-07 Thread Mich Talebzadeh
Hi, This might be a worthwhile exercise on the assumption that the contributors will find the time and bandwidth to chip in, so to speak. I am sure there are many, but off the top of my head I can think of Holden Karau for k8s, and Sean Owen for data science stuff. They are both very experienced.

Online classes for spark topics

2023-03-07 Thread ashok34...@yahoo.com.INVALID
Hello gurus, Does Spark arrange online webinars for special topics like Spark on K8s, data science and Spark Structured Streaming? I would be most grateful if experts can share their experience with learners with intermediate knowledge like myself. Hopefully we will find the practical

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-07 Thread Mich Talebzadeh
hm interesting proposition. I guess you mean altering one of following parameters in flight streamingDataFrame = self.spark \ .readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", config['MDVariables']['bootstrapServers'],)
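Spelled out with placeholder values (the broker list, topic and trigger knob below are assumptions, not the original config dictionary), the kind of options being discussed looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("md-stream").getOrCreate()

streamingDataFrame = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
    .option("subscribe", "md")                                       # placeholder topic
    .option("startingOffsets", "latest")
    .option("maxOffsetsPerTrigger", 10000)   # the kind of knob one might want to retune in flight
    .load())
```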

Re: Re: Build SPARK from source with SBT failed

2023-03-07 Thread Tufan Rakshit
I use M1 Apple silicon, use Java 11 from Zulu, and run SBT-based build jobs in Kubernetes. Best, Tufan On Tue, 7 Mar 2023 at 16:11, Sean Owen wrote: > No, it's that JAVA_HOME wasn't set to .../Home. It is simply not finding > javac, in the error. Zulu supports M1. > > On Tue, Mar 7, 2023 at

Re: Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
No, it's that JAVA_HOME wasn't set to .../Home. It is simply not finding javac, in the error. Zulu supports M1. On Tue, Mar 7, 2023 at 9:05 AM Artemis User wrote: > Looks like Maven build did find the javac, just can't run it. So it's not > a path problem but a compatibility problem. Are you

Re: Re: Build SPARK from source with SBT failed

2023-03-07 Thread Artemis User
Looks like the Maven build did find javac, it just can't run it. So it's not a path problem but a compatibility problem. Are you doing this on a Mac with M1/M2? I don't think that Zulu JDK supports Apple silicon. Your best option would be to use homebrew to install the dev tools (including

Re: Build SPARK from source with SBT failed

2023-03-07 Thread ckgppl_yan
No. I haven't installed Apple Developer Tools. I have installed Zulu OpenJDK 11.0.17 manually. So I need to install Apple Developer Tools? ----- Original Message ----- From: Sean Owen To: ckgppl_...@sina.cn Cc: user Subject: Re: Build SPARK from source with SBT failed Date: 2023-03-07 20:58 This says you don't have

Re: Pandas UDFs vs Inbuilt pyspark functions

2023-03-07 Thread Sean Owen
It's hard to evaluate without knowing what you're doing. Generally, using a built-in function will be fastest. pandas UDFs can be faster than normal UDFs if you can take advantage of processing multiple rows at once. On Tue, Mar 7, 2023 at 6:47 AM neha garde wrote: > Hello All, > > I need help
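A small illustration of that comparison, the same transform written as a built-in function and as a vectorised pandas UDF; the toy data and column name are assumptions.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-compare").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# 1) built-in function: runs inside the JVM, usually the fastest option
df.withColumn("upper_builtin", upper(col("name"))).show()

# 2) pandas UDF: vectorised, operates on a whole pandas Series per batch
@pandas_udf(StringType())
def upper_pandas(s: pd.Series) -> pd.Series:
    return s.str.upper()

df.withColumn("upper_pandas", upper_pandas(col("name"))).show()
```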

Re: Build SPARK from source with SBT failed

2023-03-07 Thread Sean Owen
This says you don't have the java compiler installed. Did you install the Apple Developer Tools package? On Tue, Mar 7, 2023 at 1:42 AM wrote: > Hello, > > I have tried to build SPARK source codes with SBT in my local dev > environment (MacOS 13.2.1). But it reported following error: > [error]

Pandas UDFs vs Inbuilt pyspark functions

2023-03-07 Thread neha garde
Hello All, I need help deciding on what is better, pandas UDFs or inbuilt functions. I have to perform a transformation where I managed to compare the two for a few thousand records, and pandas_udf in fact performed better. Given the complexity of the transformation, I also found pandas_udf makes it

Re: [Spark Structured Streaming] Does Spark Structured Streaming currently support sinking to AWS Kinesis, and how to handle hitting Kinesis quotas?

2023-03-06 Thread Mich Talebzadeh
Spark Structured Streaming can write to anything as long as an appropriate API or JDBC connection exists. I have not tried Kinesis, but have you thought about how you want to write it as a sink? Those quota limitations, much like quotas set by the vendors (say Google on BigQuery writes etc) are

unsubscribe

2023-03-06 Thread Deepthi Sathia Raj
> unsubscribe >

unsubscribe

2023-03-06 Thread William R
unsubscribe

[Spark Structured Streaming] Does Spark Structured Streaming currently support sinking to AWS Kinesis, and how to handle hitting Kinesis quotas?

2023-03-05 Thread hueiyuan su
*Component*: Spark Structured Streaming *Level*: Advanced *Scenario*: How-to *Problems Description* 1. I currently would like to use pyspark structured streaming to write data to Kinesis. But it seems there is no corresponding connector we can use. I would like to confirm

Data duplication and loss occur after executing 'insert overwrite...' in Spark 3.1.1

2023-03-05 Thread 周锋
Hi all, We are currently using Spark version 3.1.1 in our production environment. We have noticed that occasionally, after executing 'insert overwrite ... select', the resulting data is inconsistent, with some data being duplicated or lost. This issue does not occur all the time and seems to

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-05 Thread Mich Talebzadeh
OK I found a workaround. Basically each stream state is not kept and I have two streams. One is a business topic and the other one created to shut down spark structured streaming gracefully. I was interested to print the value for the most recent batch Id for the business topic called "md" here

Re: Unable to handle bignumeric datatype in spark/pyspark

2023-03-04 Thread Atheeth SH
Hi Rajnil, Sorry for the multiple emails. It seems you are getting the ModuleNotFoundError error; I was curious, have you tried the solution mentioned in the readme file? Below is the link: https://github.com/GoogleCloudDataproc/spark-bigquery-connector#bignumeric-support

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
This might help https://docs.databricks.com/structured-streaming/foreach.html streamingDF.writeStream.foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of the streaming query. It takes two parameters: a DataFrame or Dataset that has the
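A sketch of that contract, assuming streamingDF is an existing streaming DataFrame; the sink path is hypothetical.

```python
def send_to_sink(batch_df, batch_id):
    # custom write logic for this micro-batch: batch_df is a plain DataFrame,
    # batch_id identifies the micro-batch
    print(f"writing batch {batch_id} with {batch_df.count()} rows")
    batch_df.write.mode("append").parquet("hdfs:///data/sink/md")   # placeholder path

query = (streamingDF.writeStream
    .foreachBatch(send_to_sink)
    .outputMode("update")
    .trigger(processingTime="30 seconds")
    .start())
```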

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
I am aware of your point that globals don't work in a distributed environment. With regard to your other point, these are two different topics with their own streams. The point of the second stream is to set the status to false, so it can gracefully shut down the main stream (the one called "md") here

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
I don't quite get it - aren't you applying to the same stream, and batches? worst case why not apply these as one function? Otherwise, how do you mean to associate one call to another? globals don't help here. They aren't global beyond the driver, and, which one would be which batch? On Sat, Mar

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
Thanks, they are different batchIds. From sendToControl, newtopic batchId is 76. From sendToSink, md, batchId is 563. As a matter of interest, why does a global variable not work? view my Linkedin profile

Re: How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Sean Owen
It's the same batch ID already, no? Or why not simply put the logic of both in one function? or write one function that calls both? On Sat, Mar 4, 2023 at 2:07 PM Mich Talebzadeh wrote: > > This is probably pretty straight forward but somehow is does not look > that way > > > > On Spark

How to pass variables across functions in spark structured streaming (PySpark)

2023-03-04 Thread Mich Talebzadeh
This is probably pretty straightforward but somehow it does not look that way. On Spark Structured Streaming, "foreachBatch" performs custom write logic on each micro-batch through a call function. For example, foreachBatch(sendToSink) expects 2 parameters, first: the micro-batch as a DataFrame or

Re: SPIP architecture diagrams

2023-03-04 Thread Mich Talebzadeh
OK, I decided to bite the bullet and use a Visio diagram for my SPIP "Shutting down spark structured streaming when the streaming process completed the current process". Details here: https://issues.apache.org/jira/browse/SPARK-42485 This is not meant to be complete, just an indication. I

Re: Unable to handle bignumeric datatype in spark/pyspark

2023-03-03 Thread Atheeth SH
Hi Rajnil, Just curious, what version of spark-bigquery-connector are you using? Thanks, Atheeth On Sat, 25 Feb 2023 at 23:48, Mich Talebzadeh wrote: > sounds like it is cosmetric. The important point is that if the data > stored in GBQ is valid? > > > THT > > >view my Linkedin profile >

Re: Unsubscribe

2023-03-03 Thread Atheeth SH
please send an empty email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Thanks On Thu, 23 Feb 2023 at 07:07, Tang Jinxin wrote: > Unsubscribe >

Re: unsubscribe

2023-03-03 Thread Atheeth SH
please send an empty email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Thanks, Atheeth On Fri, 24 Feb 2023 at 03:58, Roberto Jr wrote: > please unsubscribe from that email list. > thank you in advance. > roberto. >

[ANNOUNCE] Apache Celeborn(incubating) 0.2.0 available

2023-03-01 Thread Ethan Feng
Hi all, The Apache Celeborn (Incubating) community is glad to announce the new release of Apache Celeborn (Incubating) 0.2.0. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, highly efficient service for intermediate data including

Re: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Russell Jurney
I think it is awesome. Brilliant interface that is missing from Spark. Would you integrate with something like MLFlow? Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com

Fwd: [New Project] sparksql-ml : Distributed Machine Learning using SparkSQL.

2023-02-27 Thread Chitral Verma
Hi All, I worked on this idea a few years back as a pet project to bridge *SparkSQL* and *SparkML* and empower anyone to implement production grade, distributed machine learning over Apache Spark as long as they have SQL skills. In principle the idea works exactly like Google's BigQueryML but at

Re: Spike on number of tasks - dynamic allocation

2023-02-27 Thread Mich Talebzadeh
Hi Murat, I have dealt with EMR but have used Spark cluster on Google Dataproc with 3.1.1 with autoscaling policy. My understanding is that autoscaling policy will decide on how to scale if needed without manual intervention. Is this the case with yours? HTH view my Linkedin profile

Re: Spike on number of tasks - dynamic allocation

2023-02-27 Thread murat migdisoglu
Hey Mich, This cluster is running spark 2.4.6 on EMR On Mon, Feb 27, 2023 at 12:20 PM Mich Talebzadeh wrote: > Hi, > > What is the spark version and what type of cluster is it, spark on > dataproc or other? > > HTH > > > >view my Linkedin profile >

Re: Spike on number of tasks - dynamic allocation

2023-02-27 Thread Mich Talebzadeh
Hi, What is the spark version and what type of cluster is it, spark on dataproc or other? HTH view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all

Spike on number of tasks - dynamic allocation

2023-02-27 Thread murat migdisoglu
On an auto-scaling cluster using YARN as resource manager, we observed that when we decrease the number of worker nodes after upscaling instance types, the number of tasks for the same spark job spikes. (the total cpu/memory capacity of the cluster remains identical) the same spark job, with the

Fwd: Automatic reply: Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-26 Thread Mich Talebzadeh
Hi, Can someone disable the below login from the Spark forums please? It sounds like someone left this email address and we are receiving a spam-type message any time we respond. Thanks view my Linkedin profile

[JDBC] [PySpark] Possible bug when comparing incoming data frame from mssql and empty delta table

2023-02-26 Thread lennart
Hello, I have been working on a small ETL framework for pyspark/delta/databricks on my spare time. It looks like I might have encountered a bug, however I'm not totally sure its actually caused by spark itself and not one of the other technologies. The error shows up when using spark sql

Re: Unable to handle bignumeric datatype in spark/pyspark

2023-02-25 Thread Mich Talebzadeh
Sounds like it is cosmetic. The important point is whether the data stored in GBQ is valid? HTH view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all

Re: Unable to handle bignumeric datatype in spark/pyspark

2023-02-25 Thread Rajnil Guha
Hi All, I had created an issue on Stackoverflow(linked below) a few months back about issues while handling bignumeric type values of BigQuery in Spark. link On Fri,

Late arriving updates to fact tables

2023-02-25 Thread rajat kumar
Hi Users, We are getting updates in a Kafka topic (through CDC). Can you please tell me how to correct/replay/reprocess the late-arriving records in the data lake? Thanks Rajat

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
Sorry, I didn't try that. On Fri, Feb 24, 2023 at 4:13 PM Russell Jurney wrote: > Oliver, just curious: did you get a clean error message when you broke it > out into separate statements? > > Thanks, > Russell Jurney @rjurney > russell.jur...@gmail.com LI

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Russell Jurney
Oliver, just curious: did you get a clean error message when you broke it out into separate statements? Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com Book a time on

Re: SPIP architecture diagrams

2023-02-24 Thread Mich Talebzadeh
Sounds like I have to decide for myself what to use. A correction: Vision should read *Visio*. Ideally the SPIP guide https://spark.apache.org/improvement-proposals.html should include this topic. Additionally there should be a repository for the original diagrams as well. From the said guide:

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
Hello, Thanks for the advice. First of all, it looks like I used the wrong *max* function, but *pyspark.sql.functions.max* isn't right either, because it finds the maximum of a given column over groups of rows. To find the maximum among multiple columns, I need
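One candidate for a row-wise maximum across several expressions is pyspark.sql.functions.greatest; a sketch under the thread's column names, not necessarily the poster's final fix.

```python
from pyspark.sql.functions import col, greatest, lit

# assumes `joined` is the DataFrame from the thread, with start/end/position columns
distances = joined.withColumn(
    "distance",
    greatest(col("start") - col("position"), col("position") - col("end"), lit(0)))
```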

Re: Unable to handle bignumeric datatype in spark/pyspark

2023-02-24 Thread Mich Talebzadeh
Hi Nidhi, can you create a BigQuery table with bignumeric and numeric column types, add a few rows, and try to read it into Spark through a DF, then do df.printSchema() and df.show(5, False)? HTH view my Linkedin profile
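A sketch of that suggested check, assuming the spark-bigquery-connector is on the classpath and using placeholder project/dataset/table names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-check").getOrCreate()

df = (spark.read
    .format("bigquery")
    .option("table", "my_project.my_dataset.numeric_test")   # placeholder table
    .load())

df.printSchema()
df.show(5, False)
```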

Unable to handle bignumeric datatype in spark/pyspark

2023-02-23 Thread nidhi kher
Hello, I am facing the below issue in pyspark code: We are running spark code using dataproc serverless batch in Google Cloud Platform. The Spark code is causing an issue while writing the data to a BigQuery table. In the BigQuery table, a few of the columns have datatype bignumeric and the Spark code is changing

unsubscribe

2023-02-23 Thread Roberto Jr
please unsubscribe from that email list. thank you in advance. roberto.

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That's pretty impressive. I'm not sure it's quite right - not clear that the intent is taking a minimum of absolute values (is it? that'd be wild). But I think it might have pointed in the right direction. I'm not quite sure why that error pops out, but I think 'max' is the wrong function. That's

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
I'm trying to learn how to use chatgpt for coding. So after a little chat I got this. The code you provided seems to calculate the distance between a gene and a variant by finding the maximum value between the difference of the variant position and the gene start position, the difference of the

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
Usually, the solution to these problems is to do less per line, break it out and perform each minute operation as a field, then combine those into a final answer. Can you do that here? Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Here is the complete error: ``` Traceback (most recent call last): File "nearest-gene.py", line 74, in main() File "nearest-gene.py", line 62, in main distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0)) File

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this line? On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I'm trying to calculate the distance between a gene (with start and end) > and a variant (with position),

[PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Hello, I'm trying to calculate the distance between a gene (with start and end) and a variant (with position), so I joined gene and variant data by chromosome and then tried to calculate the distance like this: ``` distances = joined.withColumn("distance", max(col("start") -
