Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-13 Thread Patil, Atul
Hello Team, Does Spark support role-based authentication and access to Amazon S3 for Kubernetes deployment? Note: we have deployed our Spark application in the Kubernetes cluster. Below is the Hadoop-AWS dependency we are using: org.apache.hadoop:hadoop-aws:3.3.4. We are using the
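For readers with the same question: the common pattern on Kubernetes (EKS) is IAM Roles for Service Accounts plus S3A's web-identity credentials provider. A minimal PySpark sketch, assuming IRSA is already configured on the pod's service account; the bucket name is hypothetical:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-role-based-auth")
        # Assumes EKS IRSA mounts a web-identity token into the pod;
        # hadoop-aws 3.3.4 bundles the AWS SDK v1 class named below.
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://example-bucket/data/")  # hypothetical bucket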

Unsubscribe

2023-12-12 Thread Daniel Maangi

Unsubscribe

2023-12-12 Thread Klaus Schaefers
-- “Overfitting” is not about an excessive amount of physical exercise...

Unsubscribe

2023-12-12 Thread Sergey Boytsov
Unsubscribe --

Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread murat migdisoglu
Hey Jack, EMR Serverless is a great fit for this. You can get these metrics for each job when they are completed. Besides that, if you create separate "EMR applications" per group and tag them appropriately, you can use Cost Explorer to see the amount of resources being used. If EMR

Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread Jörn Franke
It could be simpler and faster to use tagging of resources for billing: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags-billing.html That could also include other resources (e.g. S3). > On 12.12.2023 at 04:47, Jack Wells wrote: > Hello Spark experts - I’m running
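A rough boto3 illustration of the tagging approach (cluster id, region, and tag names hypothetical):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    # Tags on the cluster propagate to its EC2 instances; once activated as
    # cost-allocation tags, they can be grouped on in Cost Explorer.
    emr.add_tags(
        ResourceId="j-XXXXXXXXXXXXX",  # hypothetical cluster id
        Tags=[{"Key": "team", "Value": "analytics"},
              {"Key": "job-group", "Value": "nightly-etl"}],
    )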

Cluster-mode job compute-time/cost metrics

2023-12-11 Thread Jack Wells
Hello Spark experts - I’m running Spark jobs in cluster mode using a dedicated cluster for each job. Is there a way to see how much compute time each job takes via Spark APIs, metrics, etc.? In case it makes a difference, I’m using AWS EMR - I’d ultimately like to be able to say this job costs $X
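Beyond the EMR-side suggestions in the replies above, one way to pull per-application wall-clock time programmatically is Spark's monitoring REST API; a sketch, with the History Server address assumed:

    import requests

    # Assumption: a Spark History Server is reachable at this address.
    base = "http://history-server:18080/api/v1"
    for app in requests.get(f"{base}/applications").json():
        for attempt in app.get("attempts", []):
            # 'duration' is wall-clock milliseconds; combining it with the
            # cluster's hourly rate to get $X has to be done externally.
            print(app["id"], attempt.get("duration"))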

Unsubscribe

2023-12-11 Thread 18706753459
Unsubscribe

Unsubscribe

2023-12-11 Thread Dusty Williams
Unsubscribe

unsubscribe

2023-12-11 Thread Stevens, Clay
unsubscribe

Spark 3.1.3 with Hive dynamic partitions fails while driver moves the staged files

2023-12-11 Thread Shay Elbaz
Hi all, Running on Dataproc 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert new (time) partitions into existing Hive tables. But we see too many failures coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles. This is where the driver moves the successful files from
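For context, the write pattern in question looks roughly like this sketch (table and column names hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Dynamic-partition overwrite of time partitions in an existing Hive table.
    # The final move of committed files into the table location happens on the
    # driver (Hive.replaceFiles), which is where the reported failures occur.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE events PARTITION (dt)
        SELECT payload, dt FROM staging_events
    """)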

unsubscribe

2023-12-11 Thread Sergey Boytsov
-- Sergei Boitsov JetBrains GmbH Christoph-Rapparini-Bogen 23 80639 München Handelsregister: Amtsgericht München, HRB 187151 Geschäftsführer: Yury Belyaev

unsubscribe

2023-12-11 Thread Klaus Schaefers
-- “Overfitting” is not about an excessive amount of physical exercise...

Re: [EXTERNAL] Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Eugene Miretsky
Hey Mich, Thanks for the detailed response. I get most of these options. However, what we are trying to do is avoid having to upload the source configs and pyspark.zip files to the cluster every time we execute the job using spark-submit. Here is the code that does it:

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-11 Thread Михаил Кулаков
Hey Enrico, it does help to understand it, thanks for explaining. Regarding this comment > PySpark and Scala should behave identically here Is it ok that Scala and PySpark optimization works differently in this case? On Tue, Dec 5, 2023 at 20:08, Enrico Minack wrote: > Hi Michail, > > with

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Mich Talebzadeh
Hi Eugene, With regard to your points: What are the PYTHONPATH and SPARK_HOME env variables in your script? OK, let us look at a typical Spark project structure of mine -
project_root
|-- README.md
|-- __init__.py
|-- conf
|   |-- (configuration files for Spark)
|-- deployment
|   |--

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't understand a few things: 1) The default behaviour is that if PYSPARK_ARCHIVES_PATH is empty, pyspark.zip is uploaded from the local SPARK_HOME. If it is set to "local://" the upload is skipped. I would expect the latter to be the
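For readers following along, the workaround amounts to pointing PYSPARK_ARCHIVES_PATH at a pre-uploaded archive so spark-submit skips the local upload; a sketch (all paths hypothetical, spark-submit assumed on PATH):

    import os
    import subprocess

    env = dict(os.environ)
    # With this set, spark-submit does not re-upload pyspark.zip from the
    # local SPARK_HOME; it points executors at the pre-staged copy instead.
    env["PYSPARK_ARCHIVES_PATH"] = "hdfs:///spark/pyspark.zip"

    subprocess.run(
        ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
         "job.py"],
        env=env, check=True,
    )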

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Thanks Mich, Tried this and still getting INFO Client: "Uploading resource file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip -> hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and __spark_conf__.zip. It is working now because I enabled direct access to HDFS to allow copying

unsubscribe

2023-12-10 Thread Rajanikant V

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
It is just a goal… however I would not tune the number of regions or region size yet. Simply specify the GC algorithm and max heap size. Try to tune other options only if there is a need, only one at a time (otherwise it is difficult to determine cause/effects), and have a performance testing framework in

Re: Spark on Java 17

2023-12-09 Thread Faiz Halde
Thanks, I'll check them out. Curious though, the official G1GC page https://www.oracle.com/technical-resources/articles/java/g1gc.html says that there must be no more than 2048 regions and region size is limited up to 32 MB. That's strange because our heaps go up to 100 GB and that would require 64 MB

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
If you do tests with newer Java versions you can also try: - UseNUMA: -XX:+UseNUMA. See https://openjdk.org/jeps/345 You can also assess the new Java GC algorithms: - -XX:+UseShenandoahGC - works with terabytes of heap - more memory efficient than ZGC with heaps <32 GB. See also:

RE: Spark on Java 17

2023-12-09 Thread Luca Canali
Hi Faiz, We find G1GC works well for some of our workloads that are Parquet-read intensive, and we have been using G1GC with Spark on Java 8 already (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions set to "-XX:+UseG1GC"), while currently we are mostly running Spark (3.3 and higher)
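A minimal sketch of the settings Luca describes (memory value illustrative):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # G1GC on driver and executors, as in the thread. Heap size is set
        # via spark.driver.memory / spark.executor.memory, not -Xmx here.
        .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
        .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
        .config("spark.executor.memory", "8g")
        .getOrCreate()
    )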

Spark on Java 17

2023-12-07 Thread Faiz Halde
Hello, We are planning to switch to Java 17 for Spark and were wondering if there are any obvious learnings from anybody related to JVM tuning? We've been running on Java 8 for a while now and used to use the parallel GC, as that used to be a general recommendation for high-throughput systems. How

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the clarification. I will try to do a plain JDBC connection in Scala/Java and will update this thread on how it goes. Thanks, Venkat. On Thu, Dec 7, 2023 at 9:40 AM Nicholas Chammas wrote: > PyMySQL has its own implementation >

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation of the MySQL client-server protocol. It does not use JDBC. > On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan > wrote: > > Thanks for the advice

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the advice, Nicholas. As mentioned in the original email, I have tried JDBC + SSH tunnel using pymysql and sshtunnel and it worked fine. The problem happens only with Spark. Thanks, Venkat. On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas wrote: > This is not a question for the

SSH Tunneling issue with Apache Spark

2023-12-05 Thread Venkatesan Muniappan
Hi Team, I am facing an issue with SSH tunneling in Apache Spark. The behavior is the same as the one in this Stack Overflow question, but there are no answers there. This is what I am trying:
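For context, the working non-Spark path looks roughly like the sketch below; the usual catch with Spark is that a tunnel opened on the driver is not reachable from executors running on other machines (hosts and credentials hypothetical; `spark` is an existing SparkSession):

    from sshtunnel import SSHTunnelForwarder

    with SSHTunnelForwarder(
        ("bastion.example.com", 22),
        ssh_username="user",
        ssh_pkey="/home/user/.ssh/id_rsa",
        remote_bind_address=("mysql.internal", 3306),
        local_bind_address=("127.0.0.1", 3307),
    ):
        # Fine for a plain client on the driver host; executors elsewhere
        # cannot reach 127.0.0.1:3307, so a distributed JDBC read fails.
        df = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://127.0.0.1:3307/mydb")
              .option("dbtable", "mytable")
              .option("user", "dbuser")
              .option("password", "...")
              .load())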

Re: ordering of rows in dataframe

2023-12-05 Thread Enrico Minack
Looks like what you want is to add a column such that, when ordered by that column, the current order of the dataframe is preserved. All you need is the monotonically_increasing_id() function: spark.range(0, 10, 1, 5).withColumn("row", monotonically_increasing_id()).show()
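A runnable version of Enrico's example:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import monotonically_increasing_id

    spark = SparkSession.builder.getOrCreate()

    # Ids increase within each partition and encode the partition id in the
    # upper bits, so ordering by "row" reproduces the current row order.
    df = spark.range(0, 10, 1, 5).withColumn("row", monotonically_increasing_id())
    df.orderBy("row").show()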

ordering of rows in dataframe

2023-12-05 Thread Som Lima
I want to maintain the order of the rows in the data frame in PySpark. Is there any way to achieve this for this function? Here we have the row ID, which will give numbering to each row. Currently, the below function results in the rearrangement of the rows in the data frame. def createRowIdColumn(

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-05 Thread Enrico Minack
Hi Michail, with spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see how Spark optimizes the query plan. In PySpark, the plan is optimized into
Project ...
+- CollectMetrics 2, [count(1) AS count(1)#200L]
   +- LocalTableScan , [col1#125, col2#126L, col3#127, col4#132L]
The

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Holden Karau
So I think this sounds like a bug to me, in the help options for both regular spark-submit and ./sbin/start-connect-server.sh we say: " --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths.

ML advice

2023-12-04 Thread Zahid Rahman
Hi, I heard some big things about machine learning and data science. To upgrade my skill set I took a Udemy course, Python and Spark for Big Data with Spark. It took about a week to learn the concepts and the workflow to follow when using each of the Spark APIs. To complete a machine learning

Re: Do we have any mechanism to control requests per second for a Kafka connect sink?

2023-12-04 Thread Yeikel Santana
Apologies to everyone. I sent this to the wrong email list. Please discard On Mon, 04 Dec 2023 10:48:11 -0500 Yeikel Santana wrote --- Hello everyone, Is there any mechanism to force Kafka Connect to ingest at a given rate per second as opposed to tasks? I am operating in a

Do we have any mechanism to control requests per second for a Kafka connect sink?

2023-12-04 Thread Yeikel Santana
Hello everyone, Is there any mechanism to force Kafka Connect to ingest at a given rate per second as opposed to tasks? I am operating in a shared environment where the ingestion rate needs to be as low as possible (for example, 5 requests/second as an upper limit), and as far as I can

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Aironman DirtDiver
The issue you're encountering with the iceberg-spark-runtime dependency not being properly passed to the executors in your Spark Connect server deployment could be due to a couple of factors: 1. Spark Submit Packaging: When you use the --packages parameter in spark-submit, it only

Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Xiaolong Wang
Hi, Spark community, I encountered a weird bug when using Spark Connect server to integrate with Iceberg. I added the iceberg-spark-runtime dependency with `--packages` param, the driver/connect-server pod did get the correct dependencies. But when looking at the executor's library, the jar was

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-04 Thread Enrico Minack
Hi Michail, observations as well as ordinary accumulators only observe / process rows that are iterated / consumed by downstream stages. If the query plan decides to skip one side of the join, that one will be removed from the final plan completely. Then, the Observation will not retrieve any

[PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-02 Thread Михаил Кулаков
Hey folks, I am actively using the observe method in my Spark jobs and noticed an interesting behavior. Here is an example of working and non-working code: https://gist.github.com/Coola4kov/8aeeb05abd39794f8362a3cf1c66519c In a few words, if I'm joining a dataframe after some filter rules and it became
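For readers unfamiliar with the API under discussion, a minimal observe example (PySpark 3.3+):

    from pyspark.sql import SparkSession, Observation
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    obs = Observation("my_metrics")
    df = spark.range(100).observe(obs, F.count(F.lit(1)).alias("cnt"))

    # Metrics are only collected for rows actually consumed by an action;
    # the crux of the thread is a plan that skips one join side entirely.
    df.collect()
    print(obs.get)  # e.g. {'cnt': 100}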

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actions/runs/7061665420 > >

ML using Spark Connect

2023-12-01 Thread Faiz Halde
Hello, Is it possible to run SparkML using Spark Connect 3.5.0? So far I've had no success setting up a Connect client that uses the ML package. The ML package uses spark core/sql afaik, which seems to be shadowing the Spark Connect client classes. Do I have to exclude any dependencies from the mllib

[FYI] SPARK-45981: Improve Python language test coverage

2023-12-01 Thread Dongjoon Hyun
Hi, All. As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community starts to have test coverage for all supported Python versions from Today. - https://github.com/apache/spark/actions/runs/7061665420 Here is a summary. 1. Main CI: All PRs and commits on `master` branch are

Re: [Streaming (DStream) ] : Does Spark Streaming supports pause/resume consumption of message from Kafka?

2023-12-01 Thread Mich Talebzadeh
Ok, pause/continue does throw some challenges. The implication is to pause gracefully and resume the same. First have a look at this SPIP of mine: [SPARK-42485] SPIP: Shutting down spark structured streaming when the streaming process completed current process - ASF JIRA (apache.org)

[Streaming (DStream) ] : Does Spark Streaming supports pause/resume consumption of message from Kafka?

2023-12-01 Thread Saurabh Agrawal (180813)
Hi Spark Team, I am using Spark version 3.4.0 in my application, which is used to consume messages from Kafka topics. I have the below queries: 1. Does DStream support pause/resume streaming message consumption at runtime on a particular condition? If yes, please provide details. 2. I tried to revoke

Re: [ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread beliefer
Congratulations! At 2023-12-01 01:23:55, "Dongjoon Hyun" wrote: We are happy to announce the availability of Apache Spark 3.4.2! Spark 3.4.2 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance

[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.2! Spark 3.4.2 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this

unsubscribe

2023-11-30 Thread Dharmin Siddesh J
unsubscribe

[sql] how to connect query stage to Spark job/stages?

2023-11-29 Thread Chenghao Lyu
Hi, I am seeking advice on measuring the performance of each QueryStage (QS) when AQE is enabled in Spark SQL. Specifically, I need help to automatically map a QS to its corresponding jobs (or stages) to get the QS runtime metrics. I recorded the QS structure via a customized injected Query

Re: Tuning Best Practices

2023-11-29 Thread Bryant Wright
Thanks, Jack! Please let me know if you find any other guides specific to tuning shuffles and joins. Currently, the best way I know to handle joins across large datasets that can't be broadcast is by rewriting the source tables as Hive tables partitioned by one or two join keys, and then breaking down

RE: Re: Spark Compatibility with Spring Boot 3.x

2023-11-29 Thread Guru Panda
Team, Do we have any updates on when the Spark 4.x version will be released in order to address the below issue? > java.lang.NoClassDefFoundError: javax/servlet/Servlet Thanks and Regards, Guru On 2023/10/05 17:19:51 Angshuman Bhattacharya wrote: > Thanks Ahmed. I am trying to bring this up with

Re: Tuning Best Practices

2023-11-28 Thread Jack Goodson
Hi Bryant, the below docs are a good start on performance tuning https://spark.apache.org/docs/latest/sql-performance-tuning.html Hope it helps! On Wed, Nov 29, 2023 at 9:32 AM Bryant Wright wrote: > Hi, I'm looking for a comprehensive list of Tuning Best Practices for > spark. > > I did a

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
I actually think it should be totally possible to use it on an executor side. Maybe it will require a small extension/udf, but generally no issues here. Pf4j is very lightweight, so you'll only have a small overhead for classloaders. There's still a small question of distribution of

Tuning Best Practices

2023-11-28 Thread Bryant Wright
Hi, I'm looking for a comprehensive list of Tuning Best Practices for spark. I did a search on the archives for "tuning" and the search returned no results. Thanks for your help.

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Faiz Halde
Hey Pasha, Is your suggestion directed towards the Spark team? I can make use of the plugin system on the driver side of Spark, but considering Spark is distributed, the executor side of Spark needs to adapt to the pf4j framework too, I believe. Thanks, Faiz. On Tue, Nov 28, 2023, 16:57 Pasha Finkelshtein

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-28 Thread Pasha Finkelshtein
To me it seems like it's the best possible use case for PF4J. Pasha Finkelshteyn, Developer Advocate

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Faiz Halde
Thanks Holden, So you're saying even Spark Connect is not going to provide that guarantee? The code referred to above is taken from the Spark Connect implementation. Could you explain which parts are tricky to get right? Just to be well prepared for the consequences. On Tue, Nov 28, 2023, 01:30

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path isolation there, so even if it does work it’s something you’d need to pay attention to on upgrades. Class path isolation is tricky to get right. On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > Hello, > > We are using spark

Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Faiz Halde
Hello, We are using spark 3.5.0 and were wondering if the following is achievable using spark-core Our use case involves spinning up a spark cluster where the driver application loads user jars containing spark transformations at runtime. A single spark application can load multiple user jars (

Re: Spark structured streaming tab is missing from spark web UI

2023-11-24 Thread Jungtaek Lim
The feature was added in Spark 3.0. Btw, you may want to check out the EOL date for Apache Spark releases - https://endoflife.date/apache-spark 2.x is already EOLed. On Fri, Nov 24, 2023 at 11:13 PM mallesh j wrote: > Hi Team, > > I am trying to test the performance of a spark streaming

[Spark-sql 3.2.4] Wrong Statistic INFO From 'ANALYZE TABLE' Command

2023-11-24 Thread Nick Luo
Hi all, The ANALYZE TABLE command is run from Spark on a Hive table. Question: Before running the ANALYZE TABLE command on the Spark SQL client, I ran the ANALYZE TABLE command on the Hive client, and the wrong statistics info shows up. For example: 1. run the analyze table command on the Hive client - create table
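For reference, the commands involved, expressed from the Spark side (database and table names hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hive and Spark keep statistics under different table properties, so
    # stats written by one engine can look wrong when read from the other.
    spark.sql("ANALYZE TABLE testdb.t COMPUTE STATISTICS")
    spark.sql("DESCRIBE TABLE EXTENDED testdb.t").show(truncate=False)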

Query fails on CASE statement depending on order of summed columns

2023-11-22 Thread Evgenii Ignatev
Good day, Recently we have faced an issue that was pinpointed to the following situation: https://github.com/YevIgn/spark-case-issue Basically, the query in question has differently ordered summation of three columns, (`1` + `2` + `3`) versus (`1` + `3` + `2`), in a CASE and fails with the

Re: How exactly does dropDuplicatesWithinWatermark work?

2023-11-21 Thread Jungtaek Lim
I'll probably reply the same to SO but posting here first. This is mentioned in the JIRA ticket, design doc, and also the API doc, but to reiterate, the contract/guarantee of the new API is that the API will deduplicate events properly when the max distance of all your duplicate events is less than
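A minimal streaming sketch of the API in question (Spark 3.5+; broker, topic, and column names hypothetical; `spark` is an existing SparkSession):

    # Duplicates are guaranteed to be dropped only when all copies of an
    # event arrive within the watermark delay of each other.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING) AS id", "timestamp AS ts"))

    deduped = (events
               .withWatermark("ts", "10 minutes")
               .dropDuplicatesWithinWatermark(["id"]))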

How exactly does dropDuplicatesWithinWatermark work?

2023-11-19 Thread Perfect Stranger
Hello, I have trouble understanding how dropDuplicatesWithinWatermark works. And I posted this stackoverflow question: https://stackoverflow.com/questions/77512507/how-exactly-does-dropduplicateswithinwatermark-work Could somebody answer it please? Best Regards, Pavel.

Setting fs.s3a.aws.credentials.provider through a connect server.

2023-11-17 Thread Leandro Martelli
Hi all! Has anyone been through this already? I have Spark docker images that are used in 2 different environments, and each one requires a different credentials provider for s3a. That parameter is the only difference between them. When passing via --conf, it works as expected. When --conf is

Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
Hi, How are you submitting your Spark job from your client? Your files can either be on HDFS or an HCFS such as gs, s3 etc. With reference to --py-files hdfs://yarn-master-url hdfs://foo.py, I assume you want your
spark-submit --verbose \
  --deploy-mode cluster \

RE: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-16 Thread Stevens, Clay
Perhaps you also need to upgrade Scala? Clay Stevens From: Hanyu Huang Sent: Wednesday, 15 November, 2023 1:15 AM To: user@spark.apache.org Subject: The job failed when we upgraded from spark 3.3.1 to spark3.4.1

Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
I am not 100% sure but I do not think this works - the driver would need access to HDFS. What you could try (have not tested it though in your scenario): - use Spark Connect: https://spark.apache.org/docs/latest/spark-connect-overview.html - host the zip file on an https server and use that url (I

Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, As the logs indicate, when executing spark-submit, Spark will package and upload spark/conf to HDFS, along with uploading spark/jars. These files are uploaded to HDFS unless you specify uploading them to another OSS. To do so, you'll need to modify the configuration in

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey! Thanks for the response. We are getting the error because there is no network connectivity to the data nodes - that's expected. What I am trying to find out is WHY we need access to the data nodes, and if there is a way to submit a job without it. Cheers, Eugene On Wed, Nov 15, 2023 at

Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, I think you should check if the HDFS service is running properly. From the logs, it appears that there are two datanodes in HDFS, but none of them are healthy. Please investigate the reasons why the datanodes are not functioning properly. It seems that the issue might be due

Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey All, We are running Pyspark spark-submit from a client outside the cluster. The client has network connectivity only to the Yarn Master, not the HDFS Datanodes. How can we submit the jobs? The idea would be to preload all the dependencies (job code, libraries, etc) to HDFS, and just submit

[Spark Structured Streaming] Two sink from Single stream

2023-11-15 Thread Subash Prabanantham
Hi Team, I am working on a basic streaming aggregation where I have one file stream source and two write sinks (Hudi table). The only difference is the aggregation performed is different, hence I am using the same spark session to perform both operations. (File Source) --> Agg1 -> DF1

The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-15 Thread Hanyu Huang
The version our job originally ran was Spark 3.3.1 with Apache Iceberg 1.2.0, but since we upgraded to Spark 3.4.1 and Apache Iceberg 1.3.1, jobs started to fail frequently. We tried to upgrade only Iceberg without upgrading Spark, and the job did not report an error. Detailed description:

The job failed when we upgraded from spark 3.3.1 to spark3.4.1

2023-11-14 Thread Hanyu Huang
The version our job originally ran was Spark 3.3.1 with Apache Iceberg 1.2.0, but since we upgraded to Spark 3.4.1 and Apache Iceberg 1.3.1, jobs started to fail frequently. We tried to upgrade only Iceberg without upgrading Spark, and the job did not report an error. Detailed description:

Re: Okio Vulnerability in Spark 3.4.1

2023-11-14 Thread Bjørn Jørgensen
FYI, I have opened "Update okio to version 1.17.6" for this now. On Thu, Aug 31, 2023 at 21:18, Sean Owen wrote: > It's a dependency of some other HTTP library. Use mvn dependency:tree to > see where it comes from. It may be more

Why create/drop/alter/rename partition does not post listener event in ExternalCatalogWithListener?

2023-11-14 Thread 李响
Dear Spark Community: In ExternalCatalogWithListener, I see postToAll() is called for create/drop/alter/rename database/table/function to post

unsubscribe

2023-11-09 Thread Duflot Patrick
unsubscribe

Pass xmx values to SparkLauncher launched Java process

2023-11-09 Thread Deepthi Sathia Raj
Hi, We have a use case where we are submitting multiple Spark jobs using SparkLauncher from a Java class. We are currently in a memory crunch situation on our edge node, where we see that the Java processes spawned by the launcher are taking around 1 GB. Is there a way to pass JVM parameters to this

How grouping rows without shuffle

2023-11-09 Thread Yoel Benharrous
Hi all, I'm trying to group X rows into a single one without shuffling the data. I was thinking of doing something like this:
val myDF = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11).toDF("myColumn")
myDF.withColumn("myColumn", expr("sliding(myColumn, 3)"))
expected result:
myColumn
[1,2,3]
[4,5,6]
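There is no built-in sliding/grouping SQL function like the one sketched above; one shuffle-free alternative (an illustration, not from the thread) chunks rows inside each partition via the RDD API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(i,) for i in range(1, 12)], ["myColumn"])

    def chunk(rows, size=3):
        buf = []
        for r in rows:
            buf.append(r["myColumn"])
            if len(buf) == size:
                yield (buf,)
                buf = []
        if buf:
            yield (buf,)

    # Groups form within each partition, so no shuffle is needed, but
    # chunk boundaries follow partition boundaries rather than row order.
    df.rdd.mapPartitions(chunk).toDF(["myColumn"]).show(truncate=False)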

help needed with SPARK-45598 and SPARK-45769

2023-11-09 Thread Maksym M
Greetings, tl;dr: there must have been a regression in Spark Connect's ability to retrieve data; more details in the linked issues: https://issues.apache.org/jira/browse/SPARK-45598 https://issues.apache.org/jira/browse/SPARK-45769 We have projects that depend on Spark Connect 3.5 and we'd

Storage Partition Joins only works for buckets?

2023-11-08 Thread Arwin Tio
Hey team, I was reading through the Storage Partition Join SPIP (https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl) but it seems like it only supports buckets, not partitions. Is that true? And if so does anybody have an intuition for

Re: Unsubscribe

2023-11-08 Thread Xin Zhang
Unsubscribe -- Email:josseph.zh...@gmail.com

Unsubscribe

2023-11-07 Thread Kiran Kumar Dusi
Unsubscribe

unsubscribe

2023-11-07 Thread Kalhara Gurugamage
unsubscribe - Sent from my phone

unsubscribe

2023-11-07 Thread Suraj Choubey
unsubscribe

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
Any update on this? On Fri, Oct 13, 2023, 12:56 pm Suyash Ajmera wrote: > This issue is related to the CharVarcharCodegenUtils readSidePadding method. > > Appending white spaces while reading ENUM data from MySQL > > causes issues in querying and writing the same data to Cassandra. > On Thu, 12

org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory ClassNotFoundException

2023-11-07 Thread Yi Zheng
Hi, The problem I’ve encountered is: after the “spark-shell” command, when I first enter the “spark.sql("select * from test.test_3").show(false)” command, it throws “ERROR session.SessionState: Error setting up authorization: java.lang.ClassNotFoundException:

Re: Spark master shuts down when one of zookeeper dies

2023-11-07 Thread Mich Talebzadeh
Hi, Spark standalone mode does not use or rely on ZooKeeper by default. The Spark master and workers communicate directly with each other without using ZooKeeper. However, it appears that in your case you are relying on ZooKeeper to provide high availability for your standalone cluster. By
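For readers hitting the same setup: standalone HA is driven by spark.deploy.recoveryMode=ZOOKEEPER together with spark.deploy.zookeeper.url (and optionally spark.deploy.zookeeper.dir), typically set through SPARK_DAEMON_JAVA_OPTS on each master. Note that Spark itself does not restart a master process that has died; bringing the old master back requires an external supervisor such as systemd.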

unsubscribe

2023-11-07 Thread Kelvin Qin
unsubscribe

[ANNOUNCE] Apache Kyuubi released 1.8.0

2023-11-06 Thread Cheng Pan
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.8.0 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

unsubscribe

2023-11-06 Thread Stefan Hagedorn

Spark master shuts down when one of zookeeper dies

2023-11-06 Thread Kaustubh Ghode
I am using Spark 3.4.1. I have a setup with three ZooKeeper servers. The Spark master shuts down when a ZooKeeper instance is down; a new master is elected as leader and the cluster is up. But the original master that was down never comes up. Can you please help me with this issue? Stack Overflow link:

How to configure authentication from a pySpark client to a Spark Connect server ?

2023-11-05 Thread Xiaolong Wang
Hi, Our company is currently introducing the Spark Connect server to production. Most of the issues have been solved, yet I don't know how to configure authentication from a pySpark client to the Spark Connect server. I noticed that there are some interceptor configs on the Scala client side,

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the
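For context, the calls in question (paths hypothetical; `spark` is an existing SparkSession):

    # checkpoint() materializes the dataframe and truncates the lineage and
    # analyzed plan, which is the usual way to cut extreme planning times;
    # the thread reports a column-resolution error appearing after adding it.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
    df = spark.read.parquet("hdfs:///data/input")
    df = df.checkpoint()  # eager by default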

Re: Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Mich Talebzadeh
General: The reason why os.path.join is appending a double backslash on Windows is that this is how Windows paths are represented. However, GCS paths (a Hadoop Compatible File System, HCFS) use forward slashes, as in Linux. This can cause problems if you are trying to use a Windows path in a
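The usual fix, consistent with the explanation above, is to build HCFS URIs with forward slashes regardless of the host OS, e.g. posixpath.join instead of os.path.join:

    import os
    import posixpath

    bucket = "gs://my-bucket"  # hypothetical
    # os.path.join uses '\\' as the separator on Windows, breaking gs:// URIs:
    print(os.path.join(bucket, "data", "file.parquet"))
    # posixpath.join always uses '/', which is what HCFS paths expect:
    print(posixpath.join(bucket, "data", "file.parquet"))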

Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Richard Smith
Hi All, I've just encountered and worked around a problem that is pretty obscure and unlikely to affect many people, but I thought I'd better report it anyway. All the data I'm using is inside Google Cloud Storage buckets (paths start with gs://) and I'm running Spark 3.5.0 locally (for

Re: Data analysis issues

2023-11-02 Thread Mich Talebzadeh
Hi, Your mileage varies, so to speak. Whether or not the data you use to analyze in Spark through RStudio will be seen by Spark's back-end depends on how you deploy Spark and RStudio. If you are deploying Spark and RStudio on your own premises or in a private cloud environment, then the data you

Re: Spark / Scala conflict

2023-11-02 Thread Harry Jamison
Thanks Alonso, I think this gives me some ideas. My code is written in Python, and I use spark-submit to submit it. I am not sure what code is written in Scala. Maybe the Phoenix driver, based on the stack trace? How do I tell which version of Scala it was compiled against? Is there a jar

RE: jackson-databind version mismatch

2023-11-02 Thread moshik.vitas
Thanks for replying. The issue was the import of spring-boot-dependencies in my dependencyManagement pom, which forced an invalid jar version. Removed this section and got valid Spark dependencies. Regards, Moshik Vitas From: Bjørn Jørgensen Sent: Thursday, 2 November 2023 10:40 To:
