Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Xiaolong Wang
Hi, Spark community, I encountered a weird bug when using the Spark Connect server to integrate with Iceberg. I added the iceberg-spark-runtime dependency with the `--packages` param, and the driver/connect-server pod did get the correct dependencies. But when looking at the executor's library, the jar was
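A hedged sketch of one way to express the same requirement programmatically (the Iceberg coordinate below is illustrative and must match your Spark/Scala build): `spark.jars.packages` is the config behind `--packages`, and setting it asks Spark to resolve the artifact via Ivy and ship it to executors as well.

```python
from pyspark.sql import SparkSession

# Sketch only: the coordinate is an example; pick the runtime matching
# your Spark and Scala versions. spark.jars.packages is the programmatic
# equivalent of the --packages flag.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2")
    .getOrCreate()
)
```

If executors still lack the jar, comparing the Environment tab of the Spark UI (which shows the effective `spark.jars.packages`) for driver vs. executors can narrow down where resolution stops.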

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-04 Thread Enrico Minack
Hi Michail, observations as well as ordinary accumulators only observe / process rows that are iterated / consumed by downstream stages. If the query plan decides to skip one side of the join, that one will be removed from the final plan completely. Then, the Observation will not retrieve any
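A plain-Python analogy of the behavior Enrico describes (illustrative only, not Spark internals): an observation only counts rows that something downstream actually iterates, so a pruned side of a plan contributes nothing.

```python
def observed(rows, metrics):
    # Counts rows only as they are actually consumed downstream,
    # mirroring how Spark Observations piggyback on row iteration.
    for row in rows:
        metrics["count"] = metrics.get("count", 0) + 1
        yield row

metrics = {}
left = observed([1, 2, 3], metrics)

# If the downstream "plan" never iterates this side (as when the
# optimizer removes one side of a join), nothing is observed:
assert metrics == {}

# Once a downstream consumer iterates the rows, the metric appears:
result = list(left)
print(metrics)  # {'count': 3}
```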

[PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-02 Thread Михаил Кулаков
Hey folks, I'm actively using the observe method in my spark jobs and noticed an interesting behavior. Here is an example of working and non-working code: https://gist.github.com/Coola4kov/8aeeb05abd39794f8362a3cf1c66519c In a few words, if I'm joining a dataframe after some filter rules and it became

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actions/runs/7061665420 > >

ML using Spark Connect

2023-12-01 Thread Faiz Halde
Hello, Is it possible to run SparkML using Spark Connect 3.5.0? So far I've had no success setting up a connect client that uses the ML package. The ML package uses spark core/sql afaik, which seems to be shadowing the Spark connect client classes. Do I have to exclude any dependencies from the mllib

[FYI] SPARK-45981: Improve Python language test coverage

2023-12-01 Thread Dongjoon Hyun
Hi, All. As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community starts to have test coverage for all supported Python versions from Today. - https://github.com/apache/spark/actions/runs/7061665420 Here is a summary. 1. Main CI: All PRs and commits on `master` branch are

Re: [Streaming (DStream) ] : Does Spark Streaming support pause/resume consumption of messages from Kafka?

2023-12-01 Thread Mich Talebzadeh
Ok, pause/continue does throw up some challenges. The implication is to pause gracefully and resume the same. First have a look at this SPIP of mine: [SPARK-42485] SPIP: Shutting down spark structured streaming when the streaming process completed current process - ASF JIRA (apache.org)

[Streaming (DStream) ] : Does Spark Streaming support pause/resume consumption of messages from Kafka?

2023-12-01 Thread Saurabh Agrawal (180813)
Hi Spark Team, I am using Spark version 3.4.0 in my application, which is used to consume messages from Kafka topics. I have the below queries: 1. Does DStream support pause/resume of streaming message consumption at runtime on a particular condition? If yes, please provide details. 2. I tried to revoke

Re:[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread beliefer
Congratulations! At 2023-12-01 01:23:55, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.2! Spark 3.4.2 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance

[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.2! Spark 3.4.2 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this

unsubscribe

2023-11-30 Thread Dharmin Siddesh J
unsubscribe

[sql] how to connect query stage to Spark job/stages?

2023-11-29 Thread Chenghao Lyu
Hi, I am seeking advice on measuring the performance of each QueryStage (QS) when AQE is enabled in Spark SQL. Specifically, I need help to automatically map a QS to its corresponding jobs (or stages) to get the QS runtime metrics. I recorded the QS structure via a customized injected Query

Re: Tuning Best Practices

2023-11-29 Thread Bryant Wright
Thanks, Jack! Please let me know if you find any other guides specific to tuning shuffles and joins. Currently, the best way I know how to handle joins across large datasets that can't be broadcast is by rewriting the source tables HIVE partitioned by one or two join keys, and then breaking down

RE: Re: Spark Compatibility with Spring Boot 3.x

2023-11-29 Thread Guru Panda
Team, Do we have any updates on when the spark 4.x version will be released, in order to address the below issue? > java.lang.NoClassDefFoundError: javax/servlet/Servlet Thanks and Regards, Guru On 2023/10/05 17:19:51 Angshuman Bhattacharya wrote: > Thanks Ahmed. I am trying to bring this up with

Re: Tuning Best Practices

2023-11-28 Thread Jack Goodson
Hi Bryant, the below docs are a good start on performance tuning https://spark.apache.org/docs/latest/sql-performance-tuning.html Hope it helps! On Wed, Nov 29, 2023 at 9:32 AM Bryant Wright wrote: > Hi, I'm looking for a comprehensive list of Tuning Best Practices for > spark. > > I did a

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
I actually think it should be totally possible to use it on an executor side. Maybe it will require a small extension/udf, but generally no issues here. Pf4j is very lightweight, so you'll only have a small overhead for classloaders. There's still a small question of distribution of

Tuning Best Practices

2023-11-28 Thread Bryant Wright
Hi, I'm looking for a comprehensive list of Tuning Best Practices for spark. I did a search on the archives for "tuning" and the search returned no results. Thanks for your help.

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Faiz Halde
Hey Pasha, Is your suggestion towards the spark team? I can make use of the plugin system on the driver side of spark but considering spark is distributed, the executor side of spark needs to adapt to the pf4j framework I believe too Thanks Faiz On Tue, Nov 28, 2023, 16:57 Pasha Finkelshtein

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
To me it seems like it's the best possible use case for PF4J. Pasha Finkelshteyn, Developer Advocate

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Faiz Halde
Thanks Holden, So you're saying even Spark connect is not going to provide that guarantee? The code referred to above is taken up from Spark connect implementation Could you explain which parts are tricky to get right? Just to be well prepared of the consequences On Tue, Nov 28, 2023, 01:30

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path isolation there, so even if it does work it’s something you’d need to pay attention to on upgrades. Class path isolation is tricky to get right. On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > Hello, > > We are using spark

Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Faiz Halde
Hello, We are using spark 3.5.0 and were wondering if the following is achievable using spark-core Our use case involves spinning up a spark cluster where the driver application loads user jars containing spark transformations at runtime. A single spark application can load multiple user jars (

Re: Spark structured streaming tab is missing from spark web UI

2023-11-24 Thread Jungtaek Lim
The feature was added in Spark 3.0. Btw, you may want to check out the EOL date for Apache Spark releases - https://endoflife.date/apache-spark 2.x is already EOLed. On Fri, Nov 24, 2023 at 11:13 PM mallesh j wrote: > Hi Team, > > I am trying to test the performance of a spark streaming

[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi, all. The ANALYZE TABLE command is run from Spark on a Hive table. Question: before I ran the 'ANALYZE TABLE' command on the Spark-sql client, I ran the 'ANALYZE TABLE' command on the Hive client, and the wrong statistics info shows up. For example: 1. run the analyze table command on hive client - create table

Query fails on CASE statement depending on order of summed columns

2023-11-22 Thread Evgenii Ignatev
Good day, Recently we faced an issue that was pinpointed to the following situation - https://github.com/YevIgn/spark-case-issue Basically, the query in question has differently ordered summation of three columns, (`1` + `2` + `3`) versus (`1` + `3` + `2`), in a CASE and fails with the

Re: How exactly does dropDuplicatesWithinWatermark work?

2023-11-21 Thread Jungtaek Lim
I'll probably reply the same to SO but posting here first. This is mentioned in the JIRA ticket, design doc, and also the API doc, but to reiterate, the contract/guarantee of the new API is that the API will deduplicate events properly when the max distance of all your duplicate events is less than
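A rough plain-Python sketch of the stated contract (illustrative only, not Spark's state-store implementation): a later event with the same key is dropped only when its distance from the first-seen event is within the delay threshold.

```python
def dedup_within_watermark(events, max_distance):
    # Illustrative sketch: emit an event unless a prior event with the
    # same key was emitted within max_distance time units. Spark's real
    # implementation manages this state per key with watermark eviction.
    first_seen = {}
    for ts, key in events:
        if key in first_seen and ts - first_seen[key] < max_distance:
            continue  # duplicate within the allowed distance: drop it
        first_seen[key] = ts
        yield (ts, key)

events = [(0, "a"), (5, "a"), (20, "a")]
print(list(dedup_within_watermark(events, 10)))  # [(0, 'a'), (20, 'a')]
```

The (5, "a") event is deduplicated because it is within 10 units of the first "a"; the (20, "a") event is far enough away to count as a new event.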

How exactly does dropDuplicatesWithinWatermark work?

2023-11-19 Thread Perfect Stranger
Hello, I have trouble understanding how dropDuplicatesWithinWatermark works. And I posted this stackoverflow question: https://stackoverflow.com/questions/77512507/how-exactly-does-dropduplicateswithinwatermark-work Could somebody answer it please? Best Regards, Pavel.

Setting fs.s3a.aws.credentials.provider through a connect server.

2023-11-17 Thread Leandro Martelli
Hi all! Has anyone been through this already? I have spark docker images that are used in 2 different environments, and each one requires a different credentials provider for s3a. That parameter is the only difference between them. When passing via --conf, it works as expected. When --conf is

Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
Hi, How are you submitting your spark job from your client? Your files can either be on HDFS or HCFS such as gs, s3 etc. With reference to --py-files hdfs://yarn-master-url hdfs://foo.py', I assume you want your spark-submit --verbose \ --deploy-mode cluster \

RE: The job failed when we upgraded from spark 3.3.1 to spark 3.4.1

2023-11-16 Thread Stevens, Clay
Perhaps you also need to upgrade Scala? Clay Stevens From: Hanyu Huang Sent: Wednesday, 15 November, 2023 1:15 AM To: user@spark.apache.org Subject: The job failed when we upgraded from spark 3.3.1 to spark3.4.1 Caution, this email may be from a sender outside Wolters Kluwer. Verify the

Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
I am not 100% sure but I do not think this works - the driver would need access to HDFS. What you could try (have not tested it though in your scenario): - use SparkConnect: https://spark.apache.org/docs/latest/spark-connect-overview.html - host the zip file on a https server and use that url (I

Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, As the logs indicate, when executing spark-submit, Spark will package and upload spark/conf to HDFS, along with uploading spark/jars. These files are uploaded to HDFS unless you specify uploading them to another OSS. To do so, you'll need to modify the configuration in

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey! Thanks for the response. We are getting the error because there is no network connectivity to the data nodes - that's expected. What I am trying to find out is WHY we need access to the data nodes, and if there is a way to submit a job without it. Cheers, Eugene On Wed, Nov 15, 2023 at

Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, I think you should check if the HDFS service is running properly. From the logs, it appears that there are two datanodes in HDFS, but none of them are healthy. Please investigate the reasons why the datanodes are not functioning properly. It seems that the issue might be due

Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey All, We are running Pyspark spark-submit from a client outside the cluster. The client has network connectivity only to the Yarn Master, not the HDFS Datanodes. How can we submit the jobs? The idea would be to preload all the dependencies (job code, libraries, etc) to HDFS, and just submit

[Spark Structured Streaming] Two sink from Single stream

2023-11-15 Thread Subash Prabanantham
Hi Team, I am working on a basic streaming aggregation where I have one file stream source and two write sinks (Hudi table). The only difference is the aggregation performed is different, hence I am using the same spark session to perform both operations. (File Source) --> Agg1 -> DF1

The job failed when we upgraded from spark 3.3.1 to spark 3.4.1

2023-11-15 Thread Hanyu Huang
Our job originally ran on spark 3.3.1 and Apache Iceberg 1.2.0, but since we upgraded to spark 3.4.1 and Apache Iceberg 1.3.1, jobs have started to fail frequently. We tried to upgrade only iceberg without upgrading spark, and the job did not report an error. Detailed description:

The job failed when we upgraded from spark 3.3.1 to spark 3.4.1

2023-11-14 Thread Hanyu Huang
Our job originally ran on spark 3.3.1 and Apache Iceberg 1.2.0, but since we upgraded to spark 3.4.1 and Apache Iceberg 1.3.1, jobs have started to fail frequently. We tried to upgrade only iceberg without upgrading spark, and the job did not report an error. Detailed description:

Re: Okio Vulnerability in Spark 3.4.1

2023-11-14 Thread Bjørn Jørgensen
FYI, I have opened "Update okio to version 1.17.6" for this now. tor. 31. aug. 2023 kl. 21:18 skrev Sean Owen : > It's a dependency of some other HTTP library. Use mvn dependency:tree to > see where it comes from. It may be more

Why create/drop/alter/rename partition does not post listener event in ExternalCatalogWithListener?

2023-11-14 Thread 李响
Dear Spark Community: In ExternalCatalogWithListener , I see postToAll() is called for create/drop/alter/rename database/table/function to post

unsubscribe

2023-11-09 Thread Duflot Patrick
unsubscribe

Pass xmx values to SparkLauncher launched Java process

2023-11-09 Thread Deepthi Sathia Raj
Hi, We have a use case where we are submitting multiple spark jobs using SparkLauncher from a Java class. We are currently in a memory-crunch situation on our edge node, where we see that the Java processes spawned by the launcher are taking around 1 GB each. Is there a way to pass JVM memory parameters (Xmx) to this

How grouping rows without shuffle

2023-11-09 Thread Yoel Benharrous
Hi all, I'm trying to group X rows into a single one without shuffling the data. I was thinking of doing something like this: val myDF = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).toDF("myColumn") myDF.withColumn("myColumn", expr("sliding(myColumn, 3)")) expected result: myColumn [1,2,3] [4,5,6]
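There is no built-in `sliding` SQL function, but the grouping itself needs no shuffle because it only touches consecutive rows. A minimal sketch of the chunking logic in plain Python (in PySpark the same idea can run per partition, e.g. via `rdd.mapPartitions`, so no data moves between executors):

```python
from itertools import islice

def grouped(rows, n):
    # Group consecutive rows into fixed-size chunks; the last chunk
    # may be shorter. Applied per partition, this avoids any shuffle.
    it = iter(rows)
    while chunk := list(islice(it, n)):
        yield chunk

print(list(grouped(range(1, 12), 3)))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11]]
```

Note that per-partition grouping means chunk boundaries follow partition boundaries, which matches the "no shuffle" requirement but may differ from a global row order.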

help needed with SPARK-45598 and SPARK-45769

2023-11-09 Thread Maksym M
Greetings, tl;dr there must have been a regression in spark *connect*'s ability to retrieve data, more details in linked issues https://issues.apache.org/jira/browse/SPARK-45598 https://issues.apache.org/jira/browse/SPARK-45769 we have projects that depend on spark connect 3.5 and we'd

Storage Partition Joins only works for buckets?

2023-11-08 Thread Arwin Tio
Hey team, I was reading through the Storage Partition Join SPIP (https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl) but it seems like it only supports buckets, not partitions. Is that true? And if so does anybody have an intuition for

Re: Unsubscribe

2023-11-08 Thread Xin Zhang
Unsubscribe -- Email:josseph.zh...@gmail.com

Unsubscribe

2023-11-07 Thread Kiran Kumar Dusi
Unsubscribe

unsubscribe

2023-11-07 Thread Kalhara Gurugamage
unsubscribe (Sent from my phone) - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

unsubscribe

2023-11-07 Thread Suraj Choubey
unsubscribe

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
Any update on this? On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, wrote: > This issue is related to CharVarcharCodegenUtils readSidePadding method . > > Appending white spaces while reading ENUM data from mysql > > Causing issue in querying , writing the same data to Cassandra. > > On Thu, 12

org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory ClassNotFoundException

2023-11-07 Thread Yi Zheng
Hi, The problem I’ve encountered is: after “spark-shell” command, when I first enter “spark.sql("select * from test.test_3 ").show(false)” command, it throws “ERROR session.SessionState: Error setting up authorization: java.lang.ClassNotFoundException:

Re: Spark master shuts down when one of zookeeper dies

2023-11-07 Thread Mich Talebzadeh
Hi, Spark standalone mode does not use or rely on ZooKeeper by default. The Spark master and workers communicate directly with each other without using ZooKeeper. However, it appears that in your case you are relying on ZooKeeper to provide high availability for your standalone cluster. By

unsubscribe

2023-11-07 Thread Kelvin Qin
unsubscribe

[ANNOUNCE] Apache Kyuubi released 1.8.0

2023-11-06 Thread Cheng Pan
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.8.0 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

unsubscribe

2023-11-06 Thread Stefan Hagedorn

Spark master shuts down when one of zookeeper dies

2023-11-06 Thread Kaustubh Ghode
I am using spark-3.4.1. I have a setup with three ZooKeeper servers. Spark master shuts down when a ZooKeeper instance is down; a new master is elected as leader and the cluster is up, but the original master that was down never comes up. Can you please help me with this issue? Stackoverflow link:-

How to configure authentication from a pySpark client to a Spark Connect server ?

2023-11-05 Thread Xiaolong Wang
Hi, Our company is currently introducing the Spark Connect server to production. Most of the issues have been solved, yet I don't know how to configure authentication from a pySpark client to the Spark Connect server. I noticed that there are some interceptor configs at the Scala client side,

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the

Re: Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Mich Talebzadeh
General The reason why os.path.join appends double backslashes on Windows is that this is how Windows paths are represented. However, GCS paths (on a Hadoop Compatible File System, HCFS) use forward slashes, as in Linux. This can cause problems if you are trying to use a Windows path in a
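A small self-contained demonstration of the difference (the bucket name is made up). `ntpath` mirrors what `os.path` does on Windows, and `posixpath.join` is a safe way to build forward-slash HCFS paths on any OS:

```python
import ntpath      # Windows path rules, importable on any OS
import posixpath   # POSIX path rules, which HCFS URIs like gs:// expect

bucket = "gs://my-bucket"  # hypothetical bucket

# On Windows, os.path is ntpath, so joins insert backslashes:
print(ntpath.join(bucket, "data", "input.csv"))
# gs://my-bucket\data\input.csv

# Building HCFS/GCS paths with posixpath.join keeps forward slashes:
print(posixpath.join(bucket, "data", "input.csv"))
# gs://my-bucket/data/input.csv
```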

Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Richard Smith
Hi All, I've just encountered and worked around a problem that is pretty obscure and unlikely to affect many people, but I thought I'd better report it anyway All the data I'm using is inside Google Cloud Storage buckets (path starts with gs://) and I'm running Spark 3.5.0 locally (for

Re: Data analysis issues

2023-11-02 Thread Mich Talebzadeh
Hi, Your mileage varies, so to speak. Whether or not the data you use to analyze in Spark through RStudio will be seen by Spark's back-end depends on how you deploy Spark and RStudio. If you are deploying Spark and RStudio on your own premises or in a private cloud environment, then the data you

Re: Spark / Scala conflict

2023-11-02 Thread Harry Jamison
Thanks Alonso, I think this gives me some ideas. My code is written in Python, and I use spark-submit to submit it. I am not sure what code is written in scala. Maybe the Phoenix driver, based on the stack trace? How do I tell which version of scala it was compiled against? Is there a jar
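One quick heuristic for the question above (a naming convention only, not a guarantee): Scala artifacts usually encode the Scala binary version in the jar name after an underscore, e.g. `_2.12`. A shaded jar like phoenix-server-hbase may omit it, in which case you have to check its build files or MANIFEST instead.

```python
import re

def scala_suffix(jar_name):
    # Scala artifact names conventionally carry the Scala binary
    # version after an underscore, e.g. spark-sql_2.12-3.5.0.jar.
    m = re.search(r"_(2\.\d{2}|3)\b", jar_name)
    return m.group(1) if m else None

print(scala_suffix("spark-sql_2.12-3.5.0.jar"))            # 2.12
print(scala_suffix("phoenix-server-hbase-2.5-5.1.3.jar"))  # None: not encoded in the name
```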

RE: jackson-databind version mismatch

2023-11-02 Thread moshik.vitas
Thanks for replying, The issue was import of spring-boot-dependencies on my dependencyManagement pom that forced invalid jar version. Removed this section and got valid spark dependencies. Regards, Moshik Vitas From: Bjørn Jørgensen Sent: Thursday, 2 November 2023 10:40 To:

Data analysis issues

2023-11-02 Thread Jauru Lin
Hello all, I have a question about Apache Spark, I would like to ask if I use Rstudio to connect to Spark to analyze data, will the data I use be seen by Spark's back-end personnel? Hope someone can solve my problem. Thanks!

Re: Re: jackson-databind version mismatch

2023-11-02 Thread eab...@163.com
Hi, But in fact, it does have those packages. D:\02_bigdata\spark-3.5.0-bin-hadoop3\jars: 2023/09/09 10:08 75,567 jackson-annotations-2.15.2.jar 2023/09/09 10:08 549,207 jackson-core-2.15.2.jar 2023/09/09 10:08

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
[SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution tor. 2. nov. 2023 kl. 09:15 skrev Bjørn Jørgensen : > In spark 3.5.0 removed jackson-core-asl and jackson-mapper-asl those > are with groupid

Re: Spark / Scala conflict

2023-11-02 Thread Aironman DirtDiver
The error message Caused by: java.lang.ClassNotFoundException: scala.Product$class indicates that the Spark job is trying to load a class that is not available in the classpath. This can happen if the Spark job is compiled with a different version of Scala than the version of Scala that is used to

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
In spark 3.5.0, jackson-core-asl and jackson-mapper-asl were removed; those are with groupid org.codehaus.jackson. The other jackson-* jars are with groupid com.fasterxml.jackson.core tor. 2.

Spark / Scala conflict

2023-11-01 Thread Harry Jamison
I am getting the error below when I try to run a spark job connecting to phoenix. It seems like I have an incorrect scala version that some part of the code is expecting. I am using spark 3.5.0, and I have copied these phoenix jars into the spark lib: phoenix-server-hbase-2.5-5.1.3.jar

Re: jackson-databind version mismatch

2023-11-01 Thread eab...@163.com
Hi, Please check the versions of jar files starting with "jackson-". Make sure all versions are consistent. jackson jar list in spark-3.3.0: 2022/06/10 04:37 75,714 jackson-annotations-2.13.3.jar 2022/06/10 04:37 374,895 jackson-core-2.13.3.jar

Fixed byte array issue

2023-11-01 Thread KhajaAsmath Mohammed
Hi, I am facing a fixed byte array issue when reading a spark dataframe. Setting spark.sql.parquet.enableVectorizedReader = false solves my issue, but it causes a significant performance issue. Any resolution for this? Thanks, Asmath

jackson-databind version mismatch

2023-11-01 Thread moshik.vitas
Hi Spark team, On upgrading spark version from 3.2.1 to 3.4.1 got the following issue: java.lang.NoSuchMethodError: 'com.fasterxml.jackson.core.JsonGenerator com.fasterxml.jackson.databind.ObjectMapper.createGenerator(java.io.OutputStream, com.fasterxml.jackson.core.JsonEncoding)'

Elasticity and scalability for Spark in Kubernetes

2023-10-30 Thread Mich Talebzadeh
I was thinking in line of elasticity and autoscaling for Spark in the context of Kubernetes. My experience with Kubernetes and Spark on the so-called autopilot has not been that great. This is mainly from my experience that in autopilot you let the choice of nodes be decided by the vendor's

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-29 Thread Nagatomi Yasukazu
Hi, eabour Thank you for the insights. Based on the information you provided, along with the PR [SPARK-42371][CONNECT] that adds the "./sbin/start-connect-server.sh" script, I'll experiment with launching the Spark Connect Server in Cluster Mode on Kubernetes. [SPARK-42371][CONNECT] Add scripts to

Re: Spark join produce duplicate rows in resultset

2023-10-27 Thread Meena Rajani
Thanks all: Patrick, selecting rev.* and i.* cleared the confusion. The item table actually brought 4 rows, hence the final result set had 4 rows. Regards, Meena On Sun, Oct 22, 2023 at 10:13 AM Bjørn Jørgensen wrote: > alos remove the space in rev. scode > > søn. 22. okt. 2023 kl. 19:08 skrev Sadha

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-27 Thread Andrzej Zera
Hi, thank you very much for an update! Thanks, Andrzej On 2023/10/27 01:50:35 Jungtaek Lim wrote: > Hi, we are aware of your ticket and plan to look into it. We can't say > about ETA but just wanted to let you know that we are going to look into > it. Thanks for reporting! > > Thanks, >

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Jungtaek Lim
Hi, we are aware of your ticket and plan to look into it. We can't say about ETA but just wanted to let you know that we are going to look into it. Thanks for reporting! Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Oct 27, 2023 at 5:22 AM Andrzej Zera wrote: > Hey All, > > I'm trying to

[Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Andrzej Zera
Hey All, I'm trying to reproduce the following streaming operation: "Time window aggregation in separate streams followed by stream-stream join". According to documentation, this should be possible in Spark 3.5.0 but I had no success despite different tries. Here is a documentation snippet I'm

[Resolved] Re: spark.stop() cannot stop spark connect session

2023-10-25 Thread eab...@163.com
Hi all. I read source code at spark/python/pyspark/sql/connect/session.py at master · apache/spark (github.com) and the comment for the "stop" method is described as follows: def stop(self) -> None: # Stopping the session will only close the connection to the current session

spark schema conflict behavior records being silently dropped

2023-10-24 Thread Carlos Aguni
hi all, i noticed a weird behavior in how spark parses nested json with a schema conflict. i also just noticed that spark "fixed" this in the most recent release, 3.5.0, but since i'm working with AWS services, being: * EMR 6: spark 3.3.*, spark 3.4.* * Glue 3: spark 3.1.1 * Glue 4: spark 3.3.0

Re: automatically/dynamically renew aws temporary token

2023-10-24 Thread Carlos Aguni
hi all, thank you for your reply. > Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole ? yes Jorn, i also believe this is the best approach. but here we're dealing with company policies and all the bureaucracy that comes along. in parallel i'm

Re: Maximum executors in EC2 Machine

2023-10-24 Thread Riccardo Ferrari
Hi, I would refer to their documentation to better understand the concepts behind cluster overview and submitting applications: - https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types - https://spark.apache.org/docs/latest/submitting-applications.html When

submitting tasks failed in Spark standalone mode due to missing failureaccess jar file

2023-10-24 Thread eab...@163.com
Hi Team. I use spark 3.5.0 to start Spark cluster with start-master.sh and start-worker.sh, when I use ./bin/spark-shell --master spark://LAPTOP-TC4A0SCV.:7077 and get error logs: ``` 23/10/24 12:00:46 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): Command exited with code

Contribution Recommendations

2023-10-23 Thread Phil Dakin
Per the "Contributing to Spark" guide, I am requesting guidance on selecting a good ticket to take on. I've opened documentation/test PRs: https://github.com/apache/spark/pull/43369 https://github.com/apache/spark/pull/43405 If you have

Maximum executors in EC2 Machine

2023-10-23 Thread KhajaAsmath Mohammed
Hi, I am running a spark job on a spark EC2 machine which has 40 cores. Driver and executor memory is 16 GB. I am using local[*] but I still get only one executor (the driver). Is there a way to get more executors with this config? I am not using yarn or mesos in this case. Only one machine which is

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Pol Santamaria
Hi Carlos! Take a look at this project, it's 6 years old but the approach is still valid: https://github.com/zillow/aws-custom-credential-provider The credential provider gets called each time an S3 or Glue Catalog is accessed, and then you can decide whether to use a cached token or renew.

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Jörn Franke
Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole ? Assumerole can make sense if you use an AWS IAM user and STS authentication, but this would make no sense within AWS for cross-account access as attaching the permissions to the Glue job role is

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Bjørn Jørgensen
also remove the space in rev. scode søn. 22. okt. 2023 kl. 19:08 skrev Sadha Chilukoori : > Hi Meena, > > I'm asking to clarify, are the *on *& *and* keywords optional in the join > conditions? > > Please try this snippet, and see if it helps > > select rev.* from rev > inner join customer c >

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Sadha Chilukoori
Hi Meena, I'm asking to clarify, are the *on *& *and* keywords optional in the join conditions? Please try this snippet, and see if it helps select rev.* from rev inner join customer c on rev.custumer_id =c.id inner join product p on rev.sys = p.sys and rev.prin = p.prin and rev.scode= p.bcode

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Patrick Tucci
Hi Meena, It's not impossible, but it's unlikely that there's a bug in Spark SQL randomly duplicating rows. The most likely explanation is there are more records in the item table that match your sys/custumer_id/scode criteria than you expect. In your original query, try changing select rev.* to
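A tiny plain-Python illustration of Patrick's point (the data is made up): an inner join emits one combined row per matching pair, so four matching item rows turn a single rev row into four output rows.

```python
# Hypothetical data: one rev row, four item rows sharing its key.
rev = [{"customer_id": 1, "amount": 100}]
item = [{"customer_id": 1, "sku": sku} for sku in ("A", "B", "C", "D")]

# Inner-join semantics: emit the merged row for every matching pair.
joined = [
    {**r, **i}
    for r in rev
    for i in item
    if r["customer_id"] == i["customer_id"]
]
print(len(joined))  # 4 rows, not 1
```

This is not a Spark bug; the "duplicates" disappear once the join condition is tightened so each left row matches at most one right row.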

automatically/dynamically renew aws temporary token

2023-10-22 Thread Carlos Aguni
hi all, i've a scenario where I need to assume a cross account role to have S3 bucket access. the problem is that this role only allows for 1h time span (no negotiation). that said. does anyone know a way to tell spark to automatically renew the token or to dynamically renew the token on each

Spark join produce duplicate rows in resultset

2023-10-21 Thread Meena Rajani
Hello all: I am using spark sql to join two tables. To my surprise I am getting redundant rows. What could be the cause. select rev.* from rev inner join customer c on rev.custumer_id =c.id inner join product p rev.sys = p.sys rev.prin = p.prin rev.scode= p.bcode left join item I on rev.sys =

Error when trying to get the data from Hive Materialized View

2023-10-21 Thread Siva Sankar Reddy
Hi Team, We are not getting any error when retrieving the data from a hive table in PYSPARK, but getting the error (scala.MatchError: MATERIALIZED_VIEW (of class org.apache.hadoop.hive.metastore.TableType)). Please let me know the resolution for this? Thanks

spark.stop() cannot stop spark connect session

2023-10-20 Thread eab...@163.com
Hi, my code: from pyspark.sql import SparkSession spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate() import pandas as pd # create a pandas dataframe pdf = pd.DataFrame({ "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35], "gender": ["F", "M", "M"] }) #

"Premature end of Content-Length" Error

2023-10-19 Thread Sandhya Bala
Hi all, I am running into the following error with spark 2.4.8 Job aborted due to stage failure: Task 9 in stage 2.0 failed 4 times, most > recent failure: Lost task 9.3 in stage 2.0 (TID 100, 10.221.8.73, executor > 2): org.apache.http.ConnectionClosedException: Premature end of >

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi, I have found three important classes: org.apache.spark.sql.connect.service.SparkConnectServer : the ./sbin/start-connect-server.sh script uses the SparkConnectServer class as its main class. In the main function, it uses SparkSession.builder.getOrCreate() to create a local session, and start

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi all, Has the spark connect server running on k8s functionality been implemented? From: Nagatomi Yasukazu Date: 2023-09-05 17:51 To: user Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes Dear Spark Community, I've been exploring the capabilities of the Spark

Re: hive: spark as execution engine. class not found problem

2023-10-17 Thread Vijay Shankar
UNSUBSCRIBE On Tue, Oct 17, 2023 at 5:09 PM Amirhossein Kabiri < amirhosseikab...@gmail.com> wrote: > I used Ambari to config and install Hive and Spark. I want to insert into > a hive table using Spark execution Engine but I face to this weird error. > The error is: > > Job failed with

hive: spark as execution engine. class not found problem

2023-10-17 Thread Amirhossein Kabiri
I used Ambari to config and install Hive and Spark. I want to insert into a hive table using Spark execution Engine but I face to this weird error. The error is: Job failed with java.lang.ClassNotFoundException: ive_20231017100559_301568f9-bdfa-4f7c-89a6-f69a65b30aaf:1 2023-10-17 10:07:42,972
