Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
Hm, you are getting the AnalysisException below: "Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark". The problem seems to be that you are using the append output mode when writing the streaming query results to Kafka. This mode

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hi Mich, Thank you so much for your response. I really appreciate your help! You mentioned "defining the watermark using the withWatermark function on the streaming_df before creating the temporary view" - I believe this is what I'm doing and it's not working for me. Here is the exact code
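
A minimal sketch of the pattern Mich suggested (the "rate" source, view name, and window size are stand-ins, not the thread's actual Kafka job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("watermark-sql").getOrCreate()

    # Stand-in streaming source; it emits `timestamp` and `value` columns.
    streaming_df = (
        spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    )

    # Attach the watermark *before* registering the view, so the plan
    # behind the view carries the event-time watermark.
    streaming_df.withWatermark("timestamp", "10 minutes") \
        .createOrReplaceTempView("events")

    agg = spark.sql("""
        SELECT window(timestamp, '5 minutes') AS win, count(*) AS cnt
        FROM events
        GROUP BY window(timestamp, '5 minutes')
    """)

    # Append mode on a streaming aggregation is only accepted when a
    # watermark like the one above is in effect.
    query = agg.writeStream.outputMode("append").format("console").start()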

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
OK, let us take it for a test. My original code: def fetch_data(self): self.sc.setLogLevel("ERROR") schema = StructType() \ .add("rowkey", StringType()) \ .add("timestamp", TimestampType()) \ .add("temperature", IntegerType())

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello! I am attempting to write a streaming pipeline that would consume data from a Kafka source, manipulate the data, and then write results to a downstream sink (Kafka, Redis, etc). I want to write fully formed SQL instead of using the function API that Spark offers. I read a few guides on

Re: [External] Re: Issue of spark with antlr version

2024-04-01 Thread Chawla, Parul
Hi Team, Can you let us know when Spark 4.x will be released to Maven? Regards, Parul Get Outlook for iOS From: Bjørn Jørgensen Sent: Wednesday, February 28, 2024 5:06:54 PM To: Chawla, Parul Cc: Sahni, Ashima ;

Apache Spark integration with Spring Boot 3.0.0+

2024-03-28 Thread Szymon Kasperkiewicz
Hello, I've got a project which has to use the newest versions of both Apache Spark and Spring Boot due to vulnerability issues. I build my project using Gradle, and when I try to run it I get an unsatisfied dependency exception about javax/servlet/Servlet. I've tried to add the Jakarta servlet,

Community Over Code NA 2024 Travel Assistance Applications now open!

2024-03-27 Thread Gavin McDonald
Hello to all users, contributors and Committers! [ You are receiving this email as a subscriber to one or more ASF project dev or user mailing lists; it is not being sent to you directly. It is important that we reach all of our users and contributors/committers so that they may get a chance

Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Dongjoon Hyun
Hi, Cheng. Thank you for the suggestion. Your suggestion seems to have at least two themes. A. Adding a new Apache Spark community policy (contract) to guarantee MySQL LTS Versions Support. B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1) And, it brings me three questions.

[DISCUSS] MySQL version support policy

2024-03-24 Thread Cheng Pan
Hi, Spark community, I noticed that the Spark JDBC connector MySQL dialect is testing against 8.3.0[1] now, a non-LTS version. MySQL changed its version policy recently[2]; it is now very similar to the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are the LTS versions, and 8.1, 8.2, 8.3

Is one Spark partition mapped to one and only one Spark task?

2024-03-24 Thread Sreyan Chakravarty
I am trying to understand the Spark architecture for my upcoming certification; however, there seems to be conflicting information available. https://stackoverflow.com/questions/47782099/what-is-the-relationship-between-tasks-and-partitions Does Spark assign a Spark partition to only a single

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Winston Lai
+1 -- Thank You & Best Regards Winston Lai From: Jay Han Date: Sunday, 24 March 2024 at 08:39 To: Kiran Kumar Dusi Cc: Farshid Ashouri , Matei Zaharia , Mich Talebzadeh , Spark dev list , user @spark Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome! Kiran Kumar Dusi wrote on Thu, 21 Mar 2024 at 14:16: > +1 > > On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri < > farsheed.asho...@gmail.com> wrote: > >> +1 >> >> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, >> wrote: > >>> Some of you may be aware that Databricks community Home |

Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry, use this link: Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn Mich Talebzadeh, Technologist | Data | Generative AI |

Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
You may find this link of mine on LinkedIn for the said article. We can use LinkedIn for now. Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn Mich Talebzadeh, Technologist | Data | Generative AI | Financial Fraud London United Kingdom view my LinkedIn

Re:

2024-03-21 Thread Mich Talebzadeh
You can try this val kafkaReadStream = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", broker) .option("subscribe", topicName) .option("startingOffsets", startingOffsetsMode) .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger) .load() kafkaReadStream

Bug in org.apache.spark.util.sketch.BloomFilter

2024-03-21 Thread Nathan Conroy
Hi All, I believe that there is a bug that affects the Spark BloomFilter implementation when creating a bloom filter with large n. Since this implementation uses integer hash functions, it doesn’t work properly when the number of bits exceeds MAX_INT. I asked a question about this on
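
The sizing arithmetic behind that observation, as a quick back-of-the-envelope check (the standard Bloom filter formula, not code from the reported implementation):

    import math

    def optimal_num_bits(n, fpp):
        # Optimal bit count: m = -n * ln(p) / (ln 2)^2
        return int(-n * math.log(fpp) / (math.log(2) ** 2))

    MAX_INT = 2**31 - 1  # Java Integer.MAX_VALUE
    for n in (100_000_000, 250_000_000, 4_000_000_000):
        m = optimal_num_bits(n, 0.01)
        print(f"n={n:>13,} bits={m:>14,} exceeds MAX_INT: {m > MAX_INT}")

At a 1% false-positive rate the bit count already passes Integer.MAX_VALUE at roughly 225 million items, which is exactly the large-n regime the report describes.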

[no subject]

2024-03-21 Thread Рамик И
Hi! I want to execute code inside forEachBatch that will trigger regardless of whether there is data in the batch or not. val kafkaReadStream = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", broker) .option("subscribe", topicName) .option("startingOffsets",
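
A sketch of the foreachBatch side of this, in PySpark (broker, topic, and trigger interval are placeholders; whether an empty micro-batch fires at all depends on the Spark version and source, so side effects that must run unconditionally may belong in an external scheduler instead):

    # Given an existing `spark` session.
    kafka_df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
    )

    def process_batch(batch_df, batch_id):
        # Runs once per micro-batch; DataFrame.isEmpty() (Spark 3.3+)
        # lets you branch on empty batches instead of skipping the work.
        print(f"batch {batch_id}, empty={batch_df.isEmpty()}")
        if not batch_df.isEmpty():
            batch_df.selectExpr("CAST(value AS STRING)").show(5)

    query = (
        kafka_df.writeStream
        .foreachBatch(process_batch)
        .trigger(processingTime="30 seconds")
        .start()
    )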

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Kiran Kumar Dusi
+1 On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri wrote: > +1 > > On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, > wrote: > >> Some of you may be aware that Databricks community Home | Databricks >> have just launched a knowledge sharing hub. I thought it would be a >> good idea for the Apache

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Farshid Ashouri
+1 On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, wrote: > Some of you may be aware that Databricks community Home | Databricks > have just launched a knowledge sharing hub. I thought it would be a > good idea for the Apache Spark user group to have the same, especially > for repeat questions on

Announcing the Community Over Code 2024 Streaming Track

2024-03-20 Thread James Hughes
Hi all, Community Over Code , the ASF conference, will be held in Denver, Colorado, October 7-10, 2024. The call for presentations is open

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
One option that comes to my mind, is that given the cyclic nature of these types of proposals in these two forums, we should be able to use Databricks's existing knowledge sharing hub Knowledge Sharing Hub - Databricks

Spark-UI stages and other tabs not accessible in standalone mode when reverse-proxy is enabled

2024-03-19 Thread sharad mishra
Hi Team, We're encountering an issue with the Spark UI. I've documented the details here: https://issues.apache.org/jira/browse/SPARK-47232 When reverse proxy is enabled in the master and worker configOptions, we're not able to access the different tabs available in the Spark UI, e.g. stages, environment, storage

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Joris Billen
+1 On 18 Mar 2024, at 21:53, Mich Talebzadeh wrote: Well, as long as it works. Please all check this link from Databricks and let us know your thoughts. Will something similar work for us? Of course Databricks has much deeper pockets than our ASF community. Will it require moderation on

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-19 Thread Varun Shah
Hi @Mich Talebzadeh, community, Where can I find such insights on the Spark architecture? I found a few sites below which cover the internals: 1. https://github.com/JerryLead/SparkInternals 2. https://books.japila.pl/apache-spark-internals/overview/ 3.

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Varun Shah
+1 Great initiative. QQ: Stack Overflow has a similar feature called "Collectives", but I am not sure of the expenses to create one for Apache Spark. With SO being used (at least before ChatGPT became quite the norm for searching questions), it already has a lot of questions asked and answered

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1 . I can contribute to it as well . On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage wrote: > +1 > > Thanks for proposing > > On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud > wrote: > >> Good idea. Will be useful >> >> >> >> +1 >> >> >> >> >> >> >> >> *From: *ashok34...@yahoo.com.INVALID >>

[ANNOUNCE] Apache Kyuubi released 1.9.0

2024-03-18 Thread Binjie Yang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.9.0 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is SparkR releases in Conda channel ( https://github.com/conda-forge/r-sparkr-feedstock). This is fully run by the community unofficially. On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh wrote: > +1 for me > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect |

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
OK thanks for the update. What does officially blessed signify here? Can we have and run it as a sister site? The reason this comes to my mind is that the interested parties should have easy access to this site (from ISUG Spark sites) as a reference repository. I guess the advice would be that

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that the ASF couldn't have officially blessed venues beyond the already approved ones. So that's something to look into. Now of course you are welcome to run unofficial things unblessed as long as they follow trademark

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Well, as long as it works. Please all check this link from Databricks and let us know your thoughts. Will something similar work for us? Of course Databricks has much deeper pockets than our ASF community. Will it require moderation on our side to block spam and nutcases? Knowledge Sharing Hub

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen
Something like this: Spark community · GitHub. On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud wrote: > Good idea. Will be useful > > > > +1 > > > > > > > > *From: *ashok34...@yahoo.com.INVALID > *Date: *Monday, March 18, 2024 at 6:36 AM > *To: *user @spark ,

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Code Tutelage
+1 Thanks for proposing On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud wrote: > Good idea. Will be useful > > > > +1 > > > > > > > > *From: *ashok34...@yahoo.com.INVALID > *Date: *Monday, March 18, 2024 at 6:36 AM > *To: *user @spark , Spark dev list < > d...@spark.apache.org>, Mich

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > > "I may need something like that for synthetic data for testing. Any way to > do that ?" > > Have a look at this. > > https://github.com/joke2k/faker > No, I was not actually referring to data that can be faked. I want data to actually

pyspark - Use Spark to generate a large dataset on the fly

2024-03-18 Thread Sreyan Chakravarty
Hi, I have a specific problem where I have to get the data from REST APIs, store it, do some transformations on it, and then write to an RDBMS table. I am wondering if Spark will help in this regard. I am confused as to how I store the data while I actually acquire it on the driver
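
One common shape for this, sketched under the assumption of a paged HTTP/JSON API and a JDBC-reachable database (all URLs, table names, and credentials are placeholders): build only the URL list on the driver and fetch on the executors, so the full payload never has to fit in driver memory.

    import requests
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("rest-ingest").getOrCreate()

    # Only this small URL list lives on the driver.
    urls = [f"https://api.example.com/items?page={p}" for p in range(100)]

    def fetch(urls_in_partition):
        for url in urls_in_partition:
            for rec in requests.get(url, timeout=30).json():
                yield Row(**rec)

    # Payloads are fetched on the executors, partition by partition.
    df = spark.sparkContext.parallelize(urls, numSlices=20) \
        .mapPartitions(fetch).toDF()

    (df.write.format("jdbc")
        .option("url", "jdbc:postgresql://db:5432/app")
        .option("dbtable", "items")
        .mode("append")
        .save())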

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
+1 for me Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct to the

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Parsian, Mahmoud
Good idea. Will be useful +1 From: ashok34...@yahoo.com.INVALID Date: Monday, March 18, 2024 at 6:36 AM To: user @spark , Spark dev list , Mich Talebzadeh Cc: Matei Zaharia Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community External message, be mindful

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread ashok34...@yahoo.com.INVALID
Good idea. Will be useful +1 On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh wrote: Some of you may be aware that Databricks community Home | Databricks have just launched a knowledge sharing hub. I thought it would be a good idea for the Apache Spark user group to have the

Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
Hi, I must admit I don't know much about this Fruchterman-Reingold (call it FR) visualization using GraphX and Kubernetes. But you are suggesting this slowdown issue starts after the second iteration, and caching/persisting the graph after each iteration does not help. FR involves many

A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Some of you may be aware that Databricks community Home | Databricks have just launched a knowledge sharing hub. I thought it would be a good idea for the Apache Spark user group to have the same, especially for repeat questions on Spark core, Spark SQL, Spark Structured Streaming, Spark MLlib and

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are only performed when necessary, usually when an action is called. This lazy evaluation helps in optimizing the execution of Spark jobs by allowing Spark to optimize the execution plan and perform optimizations such as

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation

[GraphX]: Prevent recomputation of DAG

2024-03-17 Thread Marek Berith
Dear community, for my diploma thesis, we are implementing a distributed version of the Fruchterman-Reingold visualization algorithm, using GraphX and Kubernetes. Our solution is a backend that continuously computes new positions of vertices in a graph and sends them via RabbitMQ to a consumer.

Re: [External] Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-17 Thread Ofir Manor
Just to add - the latest version is 0.8.3, it seems to support 3.3: "Support Spark 3.3 / Scala 2.12 , Spark 3.4 / Scala 2.12 and Scala 2.13, Spark 3.5 / Scala 2.12 and Scala 2.13" Releases · graphframes/graphframes (github.com) Ofir

Python library that generates fake data using Faker

2024-03-16 Thread Mich Talebzadeh
I came across this a few weeks ago. In a nutshell, you can use it for generating test data and other scenarios where you need realistic-looking but not necessarily real data. With so many regulations, copyrights, etc. it is a viable alternative. I used it to generate 1000 lines of mixed true and
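
A small sketch of that kind of generation (column names and the true/false flag are illustrative, not the original script):

    import random
    from faker import Faker
    from pyspark.sql import Row, SparkSession

    fake = Faker()
    rows = [
        Row(name=fake.name(),
            email=fake.email(),
            signed_up=fake.date_time_this_year(),
            is_valid=random.choice([True, False]))  # the mixed true/false flag
        for _ in range(1000)
    ]

    spark = SparkSession.builder.getOrCreate()
    spark.createDataFrame(rows).show(5, truncate=False)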

Re: [GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-03-15 Thread Russell Jurney
There is an implementation for Spark 3, but GraphFrames isn't released often enough to match every point version. It supports Spark 3.4. Try it - it will probably work. https://spark-packages.org/package/graphframes/graphframes Thanks, Russell Jurney @rjurney

Requesting further assistance with Spark Scala code coverage

2024-03-14 Thread 里昂
I have sent out an email regarding Spark coverage but haven't received any response. I'm hoping someone could provide an answer on whether any code coverage statistics are currently available for Scala code in Spark. https://lists.apache.org/thread/hob7x42gk3q244t9b0d8phwjtxjk2plt

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here it goes: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The role of lazy Evaluation:* Spark
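
The point compresses to a few lines (a sketch mirroring the courses example from the question; names are illustrative):

    import datetime
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    courses = [Row(course_id=1, course_title="Spark",
                   enrolled=datetime.date(2024, 3, 1))]

    # Plan only: the rows still live in this Python process (the driver).
    df = spark.createDataFrame(courses)

    # First action: only now are the rows shipped out and processed.
    print(df.count())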

pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand the Spark architecture. For DataFrames that are created from Python objects, i.e. that are *created in memory, where are they stored?* Take the following example: from pyspark.sql import Row import datetime courses = [ { 'course_id': 1, 'course_title':

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
Thanks for the clarification. That makes sense. In the code below, we can see def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
Oh I see why the confusion. microbatch_data = event.progress means that microbatch_data is a StreamingQueryProgress instance, it's not a dictionary, so you should use ` microbatch_data.processedRowsPerSecond`, instead of the `get` method which is used for dictionaries. But weirdly, for
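
So the listener would read the field via attribute access, roughly like this (a sketch against the PySpark 3.5 listener API; `spark` is an existing session):

    from pyspark.sql.streaming import StreamingQueryListener

    class ProgressListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            pass

        def onQueryProgress(self, event):
            # StreamingQueryProgress is an object, not a dict:
            # use the attribute, not .get().
            print("rows/sec:", event.progress.processedRowsPerSecond)

        def onQueryIdle(self, event):
            pass

        def onQueryTerminated(self, event):
            pass

    spark.streams.addListener(ProgressListener())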

Data ingestion into elastic failing using pyspark

2024-03-11 Thread Karthick Nk
Hi all, I am using a PySpark program to write data into an Elastic index using the upsert operation (sample code snippet below). def writeDataToES(final_df): write_options = { "es.nodes": elastic_host, "es.net.ssl": "false", "es.nodes.wan.only": "true",
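
For an upsert through the ES-Hadoop connector, the snippet would continue along these lines (the document-id column `doc_id` and the index name are assumptions; `elastic_host` is from the original code):

    def writeDataToES(final_df):
        write_options = {
            "es.nodes": elastic_host,
            "es.net.ssl": "false",
            "es.nodes.wan.only": "true",
            "es.write.operation": "upsert",   # update-or-insert per document
            "es.mapping.id": "doc_id",        # column used as the ES _id
        }
        (final_df.write
            .format("org.elasticsearch.spark.sql")
            .options(**write_options)
            .mode("append")
            .save("my_index"))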

Re: Bugs with joins and SQL in Structured Streaming

2024-03-11 Thread Andrzej Zera
Hi, Do you think there is any chance for this issue to get resolved? Should I create another bug report? As mentioned in my message, there is one open already: https://issues.apache.org/jira/browse/SPARK-45637 but it covers only one of the problems. Andrzej wt., 27 lut 2024 o 09:58 Andrzej Zera

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi, Thank you for your advice This is the amended code def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not 刘唯 wrote on Sun, 10 Mar 2024 at 22:04: > Have you tried using microbatch_data.get("processedRowsPerSecond")? > Camel case now snake case > > Mich Talebzadeh wrote on Sun, 10 Mar 2024 at 11:46: > >> >> There is a paper from Databricks on this subject >> >> >>

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")? Camel case now snake case Mich Talebzadeh wrote on Sun, 10 Mar 2024 at 11:46: > > There is a paper from Databricks on this subject > > > https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html > > But

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
There is a paper from Databricks on this subject: https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html But having tested it, there seems to be a bug there that I reported to the Databricks forum as well (in answer to a user question). I have come to the conclusion

Spark on Kubenets, execute dataset.show raise exceptions

2024-03-09 Thread BODY NO
Hi, I encountered a strange issue. I run spark-shell in client mode on Kubernetes, with the below command: val data=spark.read.parquet("datapath") When I run "data.show", it may raise exceptions; the stack trace is like below: DEBUG BlockManagerMasterEndpoint: Updating block info on master taskresult_3

Spark-UI stages and other tabs not accessible in standalone mode when reverse-proxy is enabled

2024-03-08 Thread sharad mishra
Hi Team, We're encountering an issue with the Spark UI. When reverse proxy is enabled in the master and worker configOptions, we're not able to access the different tabs available in the Spark UI, e.g. stages, environment, storage etc. We're deploying Spark through the Bitnami Helm chart:

Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
The error message shows a mismatch between the configured warehouse directory and the actual location accessible by the Spark application running in the container. You have configured the SparkSession with spark.sql.warehouse.dir="file:/data/hive/warehouse". This tells Spark where to store
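
A sketch of the fix being described: point the warehouse at a location that resolves identically wherever the driver and executors actually run (the bucket path is a placeholder):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("remote-tables")
        # A container-local file: path on the client will not line up with
        # what the cluster sees; shared object storage (or an identically
        # mounted volume) avoids the mismatch.
        .config("spark.sql.warehouse.dir", "s3a://my-bucket/warehouse")
        .enableHiveSupport()
        .getOrCreate()
    )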

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay, that was some caching issue. Now there is a shared mount point between the place the PySpark code is executed and the Spark nodes it runs on. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my assumption

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay interesting, maybe my assumption was incorrect, although I'm still confused. I tried to mount a central mount point that would be the same on my local machine and the container. Same error although I moved the path to /tmp/hive/data/hive/ but when I rerun the test code to save a table,

Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Wonder if anyone can just sort my brain out here as to what's possible or not. I have a container running Spark, with Hive and a ThriftServer. I want to run code against it remotely. If I take something simple like this from pyspark.sql import SparkSession from pyspark.sql.types import

Dark mode logo

2024-03-06 Thread Mike Drob
Hi Spark Community, I see that y'all have a logo uploaded to https://www.apache.org/logos/#spark but it has black text. Is there an official, alternate logo with lighter text that would look good on a dark background? Thanks, Mike

RE: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, let me double-check it carefully. Thank you very much for your help! From: Jungtaek Lim Sent: March 5, 2024, 21:56:41 To: Pan,Bingkun Cc: Dongjoon Hyun; dev; user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yeah the approach seems OK to me - please double

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
Sorry, I forgot. The below is catered for YARN mode. If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit: spark-submit --verbose \ --master yarn \

S3 committer for dynamic partitioning

2024-03-05 Thread Nikhil Goyal
Hi folks, We have been following this doc for writing data from a Spark job to S3. However, it fails when writing to dynamic partitions. Any suggestions on what config should be used to avoid the cost of renaming in S3?
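
For the rename cost itself, the usual answer on S3A is the "magic" committer, wired up roughly like this (a sketch of the standard spark-hadoop-cloud settings; note that, as far as I know, the S3A committers do not support dynamic partition overwrite, so that combination still needs care):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Commit via S3 multipart uploads instead of renames.
        .config("spark.hadoop.fs.s3a.committer.name", "magic")
        .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate()
    )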

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
I personally use a zip file and pass the application name (in your case main.py) as the last input line, like below. APPLICATION is your main.py. It does not need to be called main.py; it could be anything, like testpython.py. CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes" ## replace gs with s3 #

It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Pedro, Chuck
Hi all, I am working in Databricks. When I submit a Spark job with the --py-files argument, it seems the first two are read in but the third is ignored. "--py-files", "s3://some_path/appl_src.py", "s3://some_path/main.py", "s3://a_different_path/common.py", I can see the first two acknowledged
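
This matches how spark-submit parses its arguments: --py-files takes a single comma-separated value, and the first bare path after the options is taken as the application, with everything after it passed as application arguments. So a likely fix (paths from the question) is:

    spark-submit \
      --py-files "s3://some_path/appl_src.py,s3://a_different_path/common.py" \
      s3://some_path/main.py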

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Yeah the approach seems OK to me - please double check that the doc generation in Spark repo won't fail after the move of the js file. Other than that, it would be probably just a matter of updating the release process. On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun wrote: > Okay, I see. > >

RE: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, I see. Perhaps we can solve this confusion by sharing the same file `version.json` across `all versions` in the `Spark website repo`? Make each version of the document display the `same` data in the dropdown menu. From: Jungtaek Lim Sent: March 5, 2024

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Let me be more specific. We have two active release version lines, 3.4.x and 3.5.x. We just released Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the dropdown of Spark 3.4.3, there will be 3.5.1 and

RE: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Based on my understanding, we should not update versions that have already been released, such as the situation you mentioned: `But what about dropout of version D? Should we add E in the dropdown?` We only need to record the latest `version.json` file that has already been published at the

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
But this does not answer my question about updating the dropdown for the doc of "already released versions", right? Let's say we just released version D, and the dropdown has version A, B, C. We have another release tomorrow as version E, and it's probably easy to add A, B, C, D in the dropdown

RE: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
According to my understanding, the original intention of this feature is that when a user has entered the PySpark documentation, if he finds that the version he is currently on is not the version he wants, he can easily jump to the version he wants by clicking on the drop-down box. Additionally, in

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion. From: Jungtaek Lim Date: Tuesday, March 5, 2024, 10:46 To: Hyukjin Kwon Cc: yangjie01 , Dongjoon Hyun , dev , user Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released Yes, it's relevant to that PR. I wonder, if we want to expose a version switcher, it should be in

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder, if we want to expose version switcher, it should be in versionless doc (spark-website) rather than the doc being pinned to a specific version. On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon wrote: > Is this related to

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions.

Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format: test.ft.txt.bz2.zip So it is a text file that is compressed by bz2, followed by zip. Now I'd like to do all these operations in PySpark. In
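
Since Hadoop's text readers handle .bz2 natively but zip is not one of its stream codecs, one workable approach is to strip the outer zip first and let Spark read the inner .bz2 directly (the inner member name and paths are assumptions; `spark` is an existing session):

    import zipfile

    # Remove the outer zip layer once, on the driver.
    with zipfile.ZipFile("test.ft.txt.bz2.zip") as z:
        z.extract("test.ft.txt.bz2", "/tmp")

    # Spark decompresses .bz2 transparently (and bz2 is splittable).
    df = spark.read.text("/tmp/test.ft.txt.bz2")
    print(df.count())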

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built with individual versions, and for each individual version we depend on other released versions. This does not seem to be right to me. Also, the functionality is only in PySpark API doc which does not seem to be consistent as well. I don't

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations and thanks Jungtaek for driving this! Xinrong Meng wrote (on Fri, 1 Mar 2024 at 5:24): > Congratulations! > > Thanks, > Xinrong > > On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun > wrote: > >> Congratulations! >> >> Bests, >> Dongjoon. >> >> On Wed, Feb 28, 2024 at

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
Thanks for reporting - this is odd - the dropdown did not exist in other recent releases. https://spark.apache.org/docs/3.5.0/api/python/index.html https://spark.apache.org/docs/3.4.2/api/python/index.html https://spark.apache.org/docs/3.3.4/api/python/index.html Looks like the dropdown feature

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek. PySpark document seems to show a wrong branch. At this time, `master`. https://spark.apache.org/docs/3.5.1/api/python/index.html PySpark Overview Date: Feb 24, 2024 Version: master

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations! On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Prem Sahoo
Congratulations Sent from my iPhoneOn Feb 29, 2024, at 4:54 PM, Xinrong Meng wrote:Congratulations!Thanks,XinrongOn Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote:Congratulations!Bests,Dongjoon.On Wed, Feb 28, 2024 at 11:43 AM beliefer

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want, how to join two DFs with a string column in one and an array of strings in the other, keeping only rows where the string is present in the array. from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import expr spark =
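
A compact version of that join (toy data; the thread's real frames have more columns):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.createDataFrame([("a1", "x"), ("a2", "y")], ["id", "key"])
    df2 = spark.createDataFrame([("b1", ["x", "z"]), ("b2", ["q"])],
                                ["id2", "keys"])

    # Inner join keeps only rows whose string appears in the array column.
    df1.join(df2, expr("array_contains(keys, key)"), "inner").show()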

pyspark dataframe join with two different data type

2024-02-29 Thread Karthick Nk
Hi All, I have two dataframes with the below structure, and I have to join them. The scenario is that the join column is a string in one dataframe and an array of strings in the other, so we have to inner join the two dataframes and keep the data if the string value is present in any of the array of strings

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Xinrong Meng
Congratulations! Thanks, Xinrong On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun wrote: > Congratulations! > > Bests, > Dongjoon. > > On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > >> Congratulations! >> >> >> >> At 2024-02-28 17:43:25, "Jungtaek Lim" >> wrote: >> >> Hi everyone, >> >> We

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations! Bests, Dongjoon. On Wed, Feb 28, 2024 at 11:43 AM beliefer wrote: > Congratulations! > > > > At 2024-02-28 17:43:25, "Jungtaek Lim" > wrote: > > Hi everyone, > > We are happy to announce the availability of Spark 3.5.1! > > Spark 3.5.1 is a maintenance release containing

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Chawla, Parul
Hi, Can we get the Spark version on which this is resolved? From: Bjørn Jørgensen Sent: Tuesday, February 27, 2024 7:05:36 PM To: Sahni, Ashima Cc: Chawla, Parul ; user@spark.apache.org ; Misra Parashar, Jyoti ; Mekala, Rajesh ; Grandhi, Venkatesh ; George,

Re: [External] Re: Issue of spark with antlr version

2024-02-28 Thread Bjørn Jørgensen
[image: image.png] On Wed, 28 Feb 2024 at 11:28, Chawla, Parul wrote: > > Hi, > Can we get the Spark version on which this is resolved? > -- > *From:* Bjørn Jørgensen > *Sent:* Tuesday, February 27, 2024 7:05:36 PM > *To:* Sahni, Ashima > *Cc:* Chawla, Parul ;

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread beliefer
Congratulations! At 2024-02-28 17:43:25, "Jungtaek Lim" wrote: Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly

[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
Hi everyone, We are happy to announce the availability of Spark 3.5.1! Spark 3.5.1 is a maintenance release containing stability fixes. This release is based on the branch-3.5 maintenance branch of Spark. We strongly recommend all 3.5 users to upgrade to this stable release. To download Spark

Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh
Hi, Quick observations from what you have provided - The observed discrepancy between rdd.count() and rdd.map(Item::getType).countByValue() in distributed mode suggests a potential aggregation issue with countByValue(). The correct results in local mode give credence to this theory. - Workarounds
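
A way to make that cross-check concrete (the thread's code is Java; this PySpark sketch assumes an `rdd` of records carrying a `type` field):

    from operator import add

    # countByValue() collects a tally to the driver; reduceByKey keeps the
    # aggregation distributed. Both should agree with the row count.
    counts_a = rdd.map(lambda r: r.type).countByValue()
    counts_b = dict(rdd.map(lambda r: (r.type, 1)).reduceByKey(add).collect())
    assert sum(counts_a.values()) == rdd.count() == sum(counts_b.values())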

[Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Stuart Fehr
Hello, I recently encountered a bug with the results from JavaRDD#countByValue that does not reproduce when running locally. For background, we are running a Spark 3.5.0 job on AWS EMR 7.0.0. The code in question is something like this: JavaRDD rdd = // ... > rdd.count(); // 75187 // Get the

Re: Issue of spark with antlr version

2024-02-27 Thread Bjørn Jørgensen
[SPARK-44366][BUILD] Upgrade antlr4 to 4.13.1 On Tue, 27 Feb 2024 at 13:25, Sahni, Ashima wrote: > Hi Team, > > > > Can you please let us know the update on the below. > > > > Thanks, > > Ashima > > > > *From:* Chawla, Parul > *Sent:* Sunday, February

Re: Issue of spark with antlr version

2024-02-27 Thread Mich Talebzadeh
Hi, You have provided little information about where Spark fits in here. So I am guessing :) Data Source (JSON, XML, log file, etc.) --> Preprocessing (Spark jobs for filtering, cleaning, etc.)? --> Antlr Parser (Generated tool) --> Extracted Data (Mapped to model) --> Spring Data Model (Java

RE: Issue of spark with antlr version

2024-02-27 Thread Sahni, Ashima
Hi Team, Can you please let us know the update on the below. Thanks, Ashima From: Chawla, Parul Sent: Sunday, February 25, 2024 11:57 PM To: user@spark.apache.org Cc: Sahni, Ashima ; Misra Parashar, Jyoti Subject: Issue of spark with antlr version Hi Spark Team, Our application is currently

Unsubscribe

2024-02-27 Thread benson fang
Unsubscribe Regards
