Data analysis issues

2023-11-02 Thread Jauru Lin
Hello all, I have a question about Apache Spark: if I use RStudio to connect to Spark to analyze data, will the data I use be visible to Spark's back-end personnel? I hope someone can answer this. Thanks!

Re: Re: jackson-databind version mismatch

2023-11-02 Thread eab...@163.com
Hi, But in fact, it does have those packages. D:\02_bigdata\spark-3.5.0-bin-hadoop3\jars 2023/09/09 10:08 75,567 jackson-annotations-2.15.2.jar 2023/09/09 10:08 549,207 jackson-core-2.15.2.jar 2023/09/09 10:08

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
[SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution. On Thu, Nov 2, 2023 at 09:15 Bjørn Jørgensen wrote: > In spark 3.5.0 removed jackson-core-asl and jackson-mapper-asl those > are with groupid

Re: Spark / Scala conflict

2023-11-02 Thread Aironman DirtDiver
The error message Caused by: java.lang.ClassNotFoundException: scala.Product$class indicates that the Spark job is trying to load a class that is not available in the classpath. This can happen if the Spark job is compiled with a different version of Scala than the version of Scala that is used to
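
The `scala.Product$class` symbol only exists in the Scala 2.11 trait encoding, so this error usually means a jar built for Scala 2.11 ended up on a Scala 2.12+ classpath. A minimal plain-Python sketch (a hypothetical helper, not part of Spark) that scans jar filenames for mixed Scala binary-version suffixes:

```python
import re
from collections import defaultdict

def scala_binary_versions(jar_names):
    """Group jar filenames by the Scala binary-version suffix (_2.11, _2.12,
    _2.13) embedded in their names; jars without a suffix are ignored."""
    versions = defaultdict(list)
    for name in jar_names:
        m = re.search(r"_(2\.1[123])(?:-|\.jar$)", name)
        if m:
            versions[m.group(1)].append(name)
    return dict(versions)

# Illustrative classpath: more than one key in the result means mixed
# Scala binary versions, the usual cause of this ClassNotFoundException.
jars = [
    "phoenix-server-hbase-2.5-5.1.3.jar",  # no Scala suffix in the name
    "spark-sql_2.12-3.5.0.jar",
    "some-legacy-lib_2.11-1.0.jar",        # built for Scala 2.11
]
found = scala_binary_versions(jars)
print(sorted(found))  # ['2.11', '2.12'] -- two versions, so a conflict
```

This is only a filename heuristic; jars that shade Scala classes without a version suffix would not be caught.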

Re: jackson-databind version mismatch

2023-11-02 Thread Bjørn Jørgensen
In Spark 3.5.0, jackson-core-asl and jackson-mapper-asl were removed; those have groupId org.codehaus.jackson. The other jackson-* artifacts have groupId com.fasterxml.jackson.core. Thu, 2.

Spark / Scala conflict

2023-11-01 Thread Harry Jamison
I am getting the error below when I try to run a Spark job connecting to Phoenix. It seems like I have an incorrect Scala version that some part of the code is expecting. I am using Spark 3.5.0, and I have copied these Phoenix jars into the Spark lib: phoenix-server-hbase-2.5-5.1.3.jar

Re: jackson-databind version mismatch

2023-11-01 Thread eab...@163.com
Hi, Please check the versions of jar files starting with "jackson-". Make sure all versions are consistent. jackson jar list in spark-3.3.0: 2022/06/10 04:37 75,714 jackson-annotations-2.13.3.jar 2022/06/10 04:37 374,895 jackson-core-2.13.3.jar
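
The consistency check suggested above can be automated. A small stdlib-only sketch (a hypothetical helper; real deployments would point it at the `jars/` directory) that parses jackson jar filenames and groups them by version:

```python
import re
from collections import defaultdict

def jackson_versions(jar_names):
    """Map version -> jackson artifact names, parsed from filenames like
    jackson-core-2.13.3.jar. More than one key signals a version mismatch."""
    by_version = defaultdict(list)
    for name in jar_names:
        m = re.match(r"(jackson-[a-z-]+)-(\d+(?:\.\d+)+)\.jar$", name)
        if m:
            by_version[m.group(2)].append(m.group(1))
    return dict(by_version)

# Illustrative jar list with one stray version mixed in.
jars = [
    "jackson-annotations-2.13.3.jar",
    "jackson-core-2.13.3.jar",
    "jackson-databind-2.12.7.jar",  # mismatched: would cause NoSuchMethodError
]
print(len(jackson_versions(jars)) == 1)  # False: versions are inconsistent
```

In practice one would feed `os.listdir` of the Spark `jars/` directory into this function.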

Fixed byte array issue

2023-11-01 Thread KhajaAsmath Mohammed
Hi, I am facing a fixed byte array issue when reading a Spark dataframe. Setting spark.sql.parquet.enableVectorizedReader = false solves my issue, but it causes a significant performance problem. Any resolution for this? Thanks, Asmath

jackson-databind version mismatch

2023-11-01 Thread moshik.vitas
Hi Spark team, On upgrading spark version from 3.2.1 to 3.4.1 got the following issue: java.lang.NoSuchMethodError: 'com.fasterxml.jackson.core.JsonGenerator com.fasterxml.jackson.databind.ObjectMapper.createGenerator(java.io.OutputStream, com.fasterxml.jackson.core.JsonEncoding)'

Elasticity and scalability for Spark in Kubernetes

2023-10-30 Thread Mich Talebzadeh
I was thinking in line of elasticity and autoscaling for Spark in the context of Kubernetes. My experience with Kubernetes and Spark on the so-called autopilot has not been that great. This is mainly from my experience that in autopilot you let the choice of nodes be decided by the vendor's

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-29 Thread Nagatomi Yasukazu
Hi, eabour Thank you for the insights. Based on the information you provided, along with the PR [SPARK-42371][CONNECT] that adds the "./sbin/start-connect-server.sh" script, I'll experiment with launching the Spark Connect Server in Cluster Mode on Kubernetes. [SPARK-42371][CONNECT] Add scripts to

Re: Spark join produce duplicate rows in resultset

2023-10-27 Thread Meena Rajani
Thanks all: Patrick, selecting rev.* and I.* cleared the confusion. The item table actually brought 4 rows, hence the final result set had 4 rows. Regards, Meena On Sun, Oct 22, 2023 at 10:13 AM Bjørn Jørgensen wrote: > also remove the space in rev. scode > > Sun, Oct 22, 2023 at 19:08 Sadha

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-27 Thread Andrzej Zera
Hi, thank you very much for an update! Thanks, Andrzej On 2023/10/27 01:50:35 Jungtaek Lim wrote: > Hi, we are aware of your ticket and plan to look into it. We can't say > about ETA but just wanted to let you know that we are going to look into > it. Thanks for reporting! > > Thanks, >

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Jungtaek Lim
Hi, we are aware of your ticket and plan to look into it. We can't say about ETA but just wanted to let you know that we are going to look into it. Thanks for reporting! Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Oct 27, 2023 at 5:22 AM Andrzej Zera wrote: > Hey All, > > I'm trying to

[Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Andrzej Zera
Hey All, I'm trying to reproduce the following streaming operation: "Time window aggregation in separate streams followed by stream-stream join". According to documentation, this should be possible in Spark 3.5.0 but I had no success despite different tries. Here is a documentation snippet I'm

[Resolved] Re: spark.stop() cannot stop spark connect session

2023-10-25 Thread eab...@163.com
Hi all. I read source code at spark/python/pyspark/sql/connect/session.py at master · apache/spark (github.com) and the comment for the "stop" method is described as follows: def stop(self) -> None: # Stopping the session will only close the connection to the current session

spark schema conflict behavior records being silently dropped

2023-10-24 Thread Carlos Aguni
Hi all, I noticed a weird behavior when Spark parses nested JSON with a schema conflict. I also just noticed that Spark "fixed" this in the most recent release, 3.5.0, but I'm working with AWS services: * EMR 6: Spark 3.3.* / 3.4.* * Glue 3: Spark 3.1.1 * Glue 4: Spark 3.3.0

Re: automatically/dynamically renew aws temporary token

2023-10-24 Thread Carlos Aguni
Hi all, thank you for your reply. > Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole ? Yes Jörn, I also believe this is the best approach, but here we're dealing with company policies and all the bureaucracy that comes along. In parallel I'm

Re: Maximum executors in EC2 Machine

2023-10-24 Thread Riccardo Ferrari
Hi, I would refer to their documentation to better understand the concepts behind cluster overview and submitting applications: - https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types - https://spark.apache.org/docs/latest/submitting-applications.html When

submitting tasks failed in Spark standalone mode due to missing failureaccess jar file

2023-10-24 Thread eab...@163.com
Hi Team. I use Spark 3.5.0 to start a Spark cluster with start-master.sh and start-worker.sh. When I use ./bin/spark-shell --master spark://LAPTOP-TC4A0SCV.:7077 I get error logs: ``` 23/10/24 12:00:46 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): Command exited with code

Contribution Recommendations

2023-10-23 Thread Phil Dakin
Per the "Contributing to Spark" guide, I am requesting guidance on selecting a good ticket to take on. I've opened documentation/test PRs: https://github.com/apache/spark/pull/43369 https://github.com/apache/spark/pull/43405 If you have

Maximum executors in EC2 Machine

2023-10-23 Thread KhajaAsmath Mohammed
Hi, I am running a Spark job on an EC2 machine which has 40 cores. Driver and executor memory is 16 GB. I am using local[*] but I still get only one executor (the driver). Is there a way to get more executors with this config? I am not using YARN or Mesos in this case. Only one machine which is

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Pol Santamaria
Hi Carlos! Take a look at this project, it's 6 years old but the approach is still valid: https://github.com/zillow/aws-custom-credential-provider The credential provider gets called each time an S3 or Glue Catalog is accessed, and then you can decide whether to use a cached token or renew.

Re: automatically/dynamically renew aws temporary token

2023-10-23 Thread Jörn Franke
Can’t you attach the cross account permission to the glue job role? Why the detour via AssumeRole ? Assumerole can make sense if you use an AWS IAM user and STS authentication, but this would make no sense within AWS for cross-account access as attaching the permissions to the Glue job role is

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Bjørn Jørgensen
also remove the space in rev. scode Sun, Oct 22, 2023 at 19:08 Sadha Chilukoori wrote: > Hi Meena, > > I'm asking to clarify, are the *on *& *and* keywords optional in the join > conditions? > > Please try this snippet, and see if it helps > > select rev.* from rev > inner join customer c >

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Sadha Chilukoori
Hi Meena, I'm asking to clarify, are the *on *& *and* keywords optional in the join conditions? Please try this snippet, and see if it helps select rev.* from rev inner join customer c on rev.custumer_id =c.id inner join product p on rev.sys = p.sys and rev.prin = p.prin and rev.scode= p.bcode

Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Patrick Tucci
Hi Meena, It's not impossible, but it's unlikely that there's a bug in Spark SQL randomly duplicating rows. The most likely explanation is there are more records in the item table that match your sys/custumer_id/scode criteria than you expect. In your original query, try changing select rev.* to
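
Patrick's point can be reproduced without Spark at all: inner-join semantics emit one output row per matching pair, so a key that is not unique on one side multiplies rows. A plain-Python sketch with illustrative data:

```python
# Inner join semantics: every matching pair of rows is emitted. A key that
# appears once in `rev` but four times in `item` therefore yields four rows.
rev = [{"sys": "A", "custumer_id": 1, "amount": 100}]
item = [{"sys": "A", "qty": q} for q in (1, 2, 3, 4)]  # 4 rows share the key

joined = [
    {**r, **i}
    for r in rev
    for i in item
    if r["sys"] == i["sys"]
]
print(len(joined))  # 4 -- one rev row became four; not a Spark bug
```

Selecting columns from both sides (the `rev.*` plus `I.*` suggestion above) makes the multiplying side visible in the output.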

automatically/dynamically renew aws temporary token

2023-10-22 Thread Carlos Aguni
Hi all, I have a scenario where I need to assume a cross-account role to have S3 bucket access. The problem is that this role only allows a 1h time span (no negotiation). That said, does anyone know a way to tell Spark to automatically renew the token, or to dynamically renew the token on each

Spark join produce duplicate rows in resultset

2023-10-21 Thread Meena Rajani
Hello all: I am using Spark SQL to join two tables. To my surprise I am getting redundant rows. What could be the cause? select rev.* from rev inner join customer c on rev.custumer_id =c.id inner join product p rev.sys = p.sys rev.prin = p.prin rev.scode= p.bcode left join item I on rev.sys =

Error when trying to get the data from Hive Materialized View

2023-10-21 Thread Siva Sankar Reddy
Hi Team, We are not getting any error when retrieving the data from a Hive table in PySpark, but we are getting the error (scala.MatchError: MATERIALIZED_VIEW (of class org.apache.hadoop.hive.metastore.TableType)). Please let me know the resolution for this? Thanks

spark.stop() cannot stop spark connect session

2023-10-20 Thread eab...@163.com
Hi, my code: from pyspark.sql import SparkSession spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate() import pandas as pd # create a pandas DataFrame pdf = pd.DataFrame({ "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35], "gender": ["F", "M", "M"] }) #

"Premature end of Content-Length" Error

2023-10-19 Thread Sandhya Bala
Hi all, I am running into the following error with spark 2.4.8 Job aborted due to stage failure: Task 9 in stage 2.0 failed 4 times, most > recent failure: Lost task 9.3 in stage 2.0 (TID 100, 10.221.8.73, executor > 2): org.apache.http.ConnectionClosedException: Premature end of >

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi, I have found three important classes: org.apache.spark.sql.connect.service.SparkConnectServer: the ./sbin/start-connect-server.sh script uses the SparkConnectServer class as its main class. In the main function, it uses SparkSession.builder.getOrCreate() to create a local session, and starts

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi all, Has the spark connect server running on k8s functionality been implemented? From: Nagatomi Yasukazu Date: 2023-09-05 17:51 To: user Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes Dear Spark Community, I've been exploring the capabilities of the Spark

Re: hive: spark as execution engine. class not found problem

2023-10-17 Thread Vijay Shankar
UNSUBSCRIBE On Tue, Oct 17, 2023 at 5:09 PM Amirhossein Kabiri < amirhosseikab...@gmail.com> wrote: > I used Ambari to config and install Hive and Spark. I want to insert into > a hive table using Spark execution Engine but I face to this weird error. > The error is: > > Job failed with

hive: spark as execution engine. class not found problem

2023-10-17 Thread Amirhossein Kabiri
I used Ambari to configure and install Hive and Spark. I want to insert into a Hive table using the Spark execution engine, but I am facing this weird error. The error is: Job failed with java.lang.ClassNotFoundException: ive_20231017100559_301568f9-bdfa-4f7c-89a6-f69a65b30aaf:1 2023-10-17 10:07:42,972

Re: Spark stand-alone mode

2023-10-17 Thread Ilango
Hi all, Thanks a lot for your suggestions and knowledge sharing. I'd like to let you know that I completed setting up the standalone cluster, and a couple of data science users have already been using it for the last two weeks. The performance is really good: almost 10X performance improvement

Re: Can not complete the read csv task

2023-10-14 Thread Khalid Mammadov
This command only defines a new DataFrame; in order to see some results you need to do something like merged_spark_data.show() on a new line. Regarding the error, I think it's the typical error you get when you run Spark on Windows OS. You can suppress it using the Winutils tool (Google it or ChatGPT

[ANNOUNCE] Apache Celeborn(incubating) 0.3.1 available

2023-10-13 Thread Cheng Pan
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.3.1. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including

Fwd: Fw: Can not complete the read csv task

2023-10-13 Thread KP Youtuber
Dear group members, I'm trying to get a fresh start with Spark, but came across the following issue: I tried to read a few CSV files from a folder, but the task got stuck and didn't complete (copied content from the terminal). Can someone help me understand what is going wrong? Versions; java

Fw: Can not complete the read csv task

2023-10-13 Thread Kelum Perera
From: Kelum Perera Sent: Thursday, October 12, 2023 11:40 AM To: user@spark.apache.org ; Kelum Perera ; Kelum Gmail Subject: Can not complete the read csv task Dear friends, I'm trying to get a fresh start with Spark. I tried to read a few CSV files in a

Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to the CharVarcharCodegenUtils readSidePadding method. It appends white spaces while reading ENUM data from MySQL, causing issues when querying and writing the same data to Cassandra. On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, wrote: > I have upgraded my spark job from spark
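
The padding effect described above can be illustrated without Spark. A plain-Python stand-in (a simplified mimic of read-side CHAR padding, not the real CharVarcharCodegenUtils code) shows why an equality on the padded value fails until it is trimmed:

```python
def read_side_pad(value, length):
    """Mimic CHAR(n) read-side padding: pad with spaces to the declared
    length (simplified stand-in for readSidePadding)."""
    return value.ljust(length)

raw = "ERICSSON"
padded = read_side_pad(raw, 12)               # 'ERICSSON    '
print(padded.upper() == "ERICSSON")           # False: trailing spaces break equality
print(padded.upper().rstrip() == "ERICSSON")  # True once the padding is trimmed
```

This is why a query that worked on 3.3.1 can return no rows after an upgrade that pads CHAR-like columns on read: the comparison value needs an rtrim (or the column type needs revisiting).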

[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my spark job from Spark 3.3.1 to Spark 3.5.0. I am querying a MySQL database and applying `*UPPER(col) = UPPER(value)*` in the subsequent SQL query. It works as expected in Spark 3.3.1, but not with 3.5.0. Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st)

Can not complete the read csv task

2023-10-12 Thread Kelum Perera
Dear friends, I'm trying to get a fresh start with Spark. I tried to read a few CSV files in a folder, but the task got stuck and did not complete, as shown in the copied content from the terminal. Can someone help me understand what is going wrong? Versions; java version "11.0.16" 2022-07-19 LTS

Re: Autoscaling in Spark

2023-10-10 Thread Mich Talebzadeh
This has been brought up a few times. I will focus on Spark Structured Streaming. Autoscaling does not support Spark Structured Streaming (SSS). Why? Because streaming jobs are typically long-running jobs that need to maintain state across micro-batches. Autoscaling is designed to scale up and down

Autoscaling in Spark

2023-10-10 Thread Kiran Biswal
Hello Experts Is there any true auto scaling option for spark? The dynamic auto scaling works only for batch. Any guidelines on spark streaming autoscaling and how that will be tied to any cluster level autoscaling solutions? Thanks

Re: Updating delta file column data

2023-10-10 Thread Mich Talebzadeh
Hi, Since you mentioned that there could be duplicate records with the same unique key in the Delta table, you will need a way to handle these duplicate records. One approach I can suggest is to use a timestamp to determine the latest or most relevant record among duplicates, the so-called
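
The timestamp-based approach suggested above (keep the latest record per key among duplicates) can be sketched in plain Python; in Spark it would typically be a `row_number()` window ordered by the timestamp. Illustrative data only:

```python
from datetime import datetime

# Duplicate records sharing the key "k1"; keep only the latest per key --
# the same idea as deduplicating with a window function over a timestamp.
rows = [
    {"key": "k1", "val": "old", "ts": datetime(2023, 10, 1)},
    {"key": "k1", "val": "new", "ts": datetime(2023, 10, 9)},
    {"key": "k2", "val": "only", "ts": datetime(2023, 10, 5)},
]

latest = {}
for row in rows:
    cur = latest.get(row["key"])
    if cur is None or row["ts"] > cur["ts"]:
        latest[row["key"]] = row

print(sorted((k, v["val"]) for k, v in latest.items()))
# [('k1', 'new'), ('k2', 'only')]
```

The single pass keeps memory proportional to the number of distinct keys, which is the same reason the windowed dedup in Spark shuffles by key.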

Re: Log file location in Spark on K8s

2023-10-09 Thread Prashant Sharma
Hi Sanket, Driver and executor logs are written to stdout by default; this can be configured using the SPARK_HOME/conf/log4j.properties file. The file, including the entire SPARK_HOME/conf, is auto-propagated to all driver and executor containers and mounted as a volume. Thanks On Mon, 9 Oct, 2023, 5:37

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Danilo Sousa
Unsubscribe > On Oct 9, 2023, at 07:03, Mich Talebzadeh > wrote: > > Hi, > > Please see my responses below: > > 1) In Spark Structured Streaming does commit mean streaming data has been > delivered to the sink like Snowflake? > > No. a commit does not refer to data being

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Your mileage varies. Often there is a flavour of cloud data warehouse already there. CDWs like BigQuery, Redshift, Snowflake and so forth. They can all do a good job to varying degrees. - Use efficient data types. Choose data types that are efficient for Spark to process. For example, use

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread ashok34...@yahoo.com.INVALID
Thank you for your feedback Mich. In general, how can one optimise cloud data warehouses (the sink part) to handle streaming Spark data efficiently, avoiding the bottlenecks discussed? AK On Monday, 9 October 2023 at 11:04:41 BST, Mich Talebzadeh wrote: Hi, Please see my

Re: Updating delta file column data

2023-10-09 Thread Mich Talebzadeh
In a nutshell, is this what you are trying to do? 1. Read the Delta table into a Spark DataFrame. 2. Explode the string column into a struct column. 3. Convert the hexadecimal field to an integer. 4. Write the DataFrame back to the Delta table in merge mode with a unique key. Is
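
Steps 2 and 3 above (turn the string cell into a struct, then convert the hex field to an integer) can be sketched in plain Python; the field names here are illustrative, since the real schema lives in the Delta table:

```python
import json

# A string column value holding struct-like JSON, with one field stored as
# hex (illustrative data; assumed field names "id" and "flags").
cell = '{"id": "item-7", "flags": "0x1f"}'

record = json.loads(cell)                   # step 2: string -> struct
record["flags"] = int(record["flags"], 16)  # step 3: hex string -> integer
print(record)  # {'id': 'item-7', 'flags': 31}
```

In Spark the same transformation would be `from_json` plus a `conv`/UDF on the hex field, followed by the merge write described in step 4.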

Log file location in Spark on K8s

2023-10-09 Thread Agrawal, Sanket
Hi All, We are trying to send the spark logs using fluent-bit. We validated that fluent-bit is able to move logs of all other pods except the driver/executor pods. It would be great if someone can guide us where should I look for spark logs in Spark on Kubernetes with client/cluster mode

Re: Clarification with Spark Structured Streaming

2023-10-09 Thread Mich Talebzadeh
Hi, Please see my responses below: 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? No. A commit does not refer to data being delivered to a sink like Snowflake or BigQuery. The term commit refers to Spark Structured Streaming (SS)

Re: Updating delta file column data

2023-10-09 Thread Karthick Nk
Hi All, I have included the sample data below and the operation I need to perform on it. I have Delta tables with columns; in those columns I have data of the string data type (containing struct data). I need to update one key value in the struct field data in the string column

Clarification with Spark Structured Streaming

2023-10-08 Thread ashok34...@yahoo.com.INVALID
Hello team 1) In Spark Structured Streaming does commit mean streaming data has been delivered to the sink like Snowflake? 2) If sinks like Snowflake cannot absorb or digest streaming data in a timely manner, will there be an impact on Spark streaming itself? Thanks AK

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Igor Calabria
You might be affected by this issue: https://github.com/apache/iceberg/issues/8601 It was already patched but it isn't released yet. On Thu, Oct 5, 2023 at 7:47 PM Prashant Sharma wrote: > Hi Sanket, more details might help here. > > How does your spark configuration look like? > > What

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Angshuman Bhattacharya
Thanks Ahmed. I am trying to bring this up with Spark DE community On Thu, Oct 5, 2023 at 12:32 PM Ahmed Albalawi < ahmed.albal...@capitalone.com> wrote: > Hello team, > > We are in the process of upgrading one of our apps to Spring Boot 3.x > while using Spark, and we have encountered an issue

Re: Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Sean Owen
I think we already updated this in Spark 4. However for now you would have to also include a JAR with the jakarta.* classes instead. You are welcome to try Spark 4 now by building from master, but it's far from release. On Thu, Oct 5, 2023 at 11:53 AM Ahmed Albalawi wrote: > Hello team, > > We

Spark Compatibility with Spring Boot 3.x

2023-10-05 Thread Ahmed Albalawi
Hello team, We are in the process of upgrading one of our apps to Spring Boot 3.x while using Spark, and we have encountered an issue with Spark compatibility, specifically with Jakarta Servlet. Spring Boot 3.x uses Jakarta Servlet while Spark uses Javax Servlet. Can we get some guidance on how

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Mich Talebzadeh
The fact that you have 60 partitions or brokers in Kafka is not directly correlated to Spark Structured Streaming (SSS) executors by itself. See below. Spark starts with 200 partitions. However, by default, Spark/PySpark creates partitions that are equal to the number of CPU cores in the node,

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Perez
You can try the 'optimize' command of Delta Lake. That will help you for sure. It merges small files. Also, it depends on the file format. If you are working with Parquet, then small files should still not cause any issues. P. On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong wrote: > Hi

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Prashant Sharma
Hi Sanket, more details might help here. What does your Spark configuration look like? What exactly was done when this happened? On Thu, 5 Oct, 2023, 2:29 pm Agrawal, Sanket, wrote: > Hello Everyone, > > > > We are trying to stream the changes in our Iceberg tables stored in AWS > S3. We are

Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Agrawal, Sanket
Hello Everyone, We are trying to stream the changes in our Iceberg tables stored in AWS S3. We are achieving this through Spark-Iceberg Connector and using JAR files for Spark-AWS. Suddenly we have started receiving error "Connection pool shut down". Spark Version: 3.4.1 Iceberg: 1.3.1 Any

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi Raghavendra, Yes, we are trying to reduce the number of files in delta as well (the small file problem [0][1]). We already have a scheduled app to compact files, but the number of files is still large, at 14K files per day. [0]:

[PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi all on user@spark: We are looking for advice and suggestions on how to tune the .repartition() parameter. We are using Spark Streaming on our data pipeline to consume messages and persist them to a Delta Lake (https://delta.io/learn/getting-started/). We read messages from a Kafka topic,

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, What is the purpose for which you want to use repartition() .. to reduce the number of files in delta? Also note that there is an alternative option of using coalesce() instead of repartition(). -- Raghavendra On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote: > Hi all on user@spark: >

[Spark Core]: Recomputation cost of a job due to executor failures

2023-10-04 Thread Faiz Halde
Hello, Due to the way Spark implements shuffle, a loss of an executor sometimes results in the recomputation of partitions that were lost The definition of a *partition* is the tuple ( RDD-ids, partition id ) RDD-ids is a sequence of RDD ids In our system, we define the unit of work performed

Updating delta file column data

2023-10-02 Thread Karthick Nk
Hi community members, In Databricks ADLS2 Delta tables I need to perform the below operation; could you help me with your thoughts. I have Delta tables with one column of data type string, which contains JSON data as a string. I need to do the following: 1. I have to update one

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
Dear Jörn Franke, Jayabindu Singh and Spark Community members, Thank you profoundly for your initial insights. I feel it's necessary to provide more precision on our setup to facilitate a deeper understanding. We're interfacing with S3 Compatible storages, but our operational context is somewhat

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
There is nowadays more a trend to move away from static credentials/certificates that are stored in a secret vault. The issue is that the rotation of them is complex, once they are leaked they can be abused, making minimal permissions feasible is cumbersome etc. That is why keyless approaches are

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
With OIDC something comparable is possible: https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html On 01.10.2023 at 11:13, Mich Talebzadeh wrote: It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Mich Talebzadeh
It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment on exposed storage such as s3 and gcs. The sensitive metadata protected by metadata concealment is also

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jayabindu Singh
Hi Jon, Using IAM as suggested by Jorn is the best approach. We recently moved our spark workload from HDP to Spark on K8 and utilizing IAM. It will save you from secret management headaches and also allows a lot more flexibility on access control and option to allow access to multiple S3 buckets

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
Don’t use static IAM (S3) credentials. It is an outdated, insecure method; even AWS recommends against using this for anything (cf. e.g. https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html). It is almost a guarantee to get your data stolen and your account manipulated. If

using facebook Prophet + pyspark for forecasting - Dataframe has less than 2 non-NaN rows

2023-09-29 Thread karan alang
Hello - Anyone used Prophet + pyspark for forecasting ? I'm trying to backfill forecasts, and running into issues (error - Dataframe has less than 2 non-NaN rows) I'm removing all records with NaN values, yet getting this error. details are in stackoverflow link ->

Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-29 Thread Jon Rodríguez Aranguren
Dear Spark Community Members, I trust this message finds you all in good health and spirits. I'm reaching out to the collective expertise of this esteemed community with a query regarding Spark on Kubernetes. As a newcomer, I have always admired the depth and breadth of knowledge shared within

Re: Inquiry about Processing Speed

2023-09-28 Thread Jack Goodson
Hi Haseeb, I think the user mailing list is what you're looking for; people are usually pretty active on here if you present a direct question about Apache Spark. I've linked below the community guidelines, which say which mailing lists are for what, etc. https://spark.apache.org/community.html

Thread dump only shows 10 shuffle clients

2023-09-28 Thread Nebi Aydin
Hi all, I set the spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads to *800* But when I click Thread dump from the Spark UI for the executor: I only see 10 shuffle client threads for the executor. Is that normal, am I missing something?

Re: Inquiry about Processing Speed

2023-09-27 Thread Deepak Goel
Hi "Processing Speed" can be at a software level (Code Optimization) and at a hardware level (Capacity Planning) Deepak "The greatness of a nation can be judged by the way its animals are treated - Mahatma Gandhi" +91 73500 12833 deic...@gmail.com Facebook: https://www.facebook.com/deicool

Files io threads vs shuffle io threads

2023-09-27 Thread Nebi Aydin
Hi all, Can someone explain the difference between Files io threads and shuffle io threads, as I couldn't find any explanation. I'm specifically asking about these: spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads spark.files.io.serverThreads spark.files.io.clientThreads

Inquiry about Processing Speed

2023-09-27 Thread Haseeb Khalid
Dear Support Team, I hope this message finds you well. My name is Haseeb Khalid, and I am reaching out to discuss a scenario related to processing speed in Apache Spark. I have been utilizing these technologies in our projects, and we have encountered a specific use case where we are seeking to

Reading Glue Catalog Views through Spark.

2023-09-25 Thread Agrawal, Sanket
Hello Everyone, We have set up Spark and the Iceberg-Glue connectors as mentioned at https://iceberg.apache.org/docs/latest/aws/ to integrate Spark, Iceberg, and AWS Glue Catalog. We are able to read tables through this, but we are unable to read data through views. PFB, the error:

[PySpark][Spark logs] Is it possible to dynamically customize Spark logs?

2023-09-25 Thread Ayman Rekik
Hello, What would be the right way, if any, to inject a runtime variable into Spark logs? So that, for example, if Spark (driver/worker) logs some info/warning/error message, the variable will be output there (in order to help filter logs for the sake of monitoring and troubleshooting).

[ANNOUNCE] Apache Kyuubi released 1.7.3

2023-09-25 Thread Zhen Wang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.3 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Spark Connect Multi-tenant Support

2023-09-22 Thread Kezhi Xiong
Hi, From Spark Connect's official site's image, it mentions the "Multi-tenant Application Gateway" on the driver. Are there any more documents about it? Can I know how users can utilize such a feature? Thanks, Kezhi

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-22 Thread Karthick
Hi All, It would be helpful if anyone could give pointers on the problem described. Thanks Karthick. On Wed, Sep 20, 2023 at 3:03 PM Gowtham S wrote: > Hi Spark Community, > > Thank you for bringing up this issue. We've also encountered the same > challenge and are actively working on finding a

Re: Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad
Found this issue reported earlier, but it was bulk-closed: https://issues.apache.org/jira/browse/SPARK-27030 Regards, Shrikant On Fri, 22 Sep 2023 at 12:03 AM, Shrikant Prasad wrote: > Hi all, > > We have multiple spark jobs running in parallel trying to write into same > hive table but each job

Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad
Hi all, We have multiple spark jobs running in parallel trying to write into same hive table but each job writing into different partition. This was working fine with Spark 2.3 and Hadoop 2.7. But after upgrading to Spark 3.2 and Hadoop 3.2.2, these parallel jobs are failing with FileNotFound
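The failure mode (likely what SPARK-27030 describes, if it is indeed the same bug) is that both jobs stage output under the same `_temporary` directory inside the table path, and whichever job commits first cleans that directory up, deleting the other job's staged files. A stdlib-only sketch of the collision — directory names are illustrative, not Spark's exact layout:

```python
import os
import shutil
import tempfile

table = tempfile.mkdtemp()
# Both jobs share one staging dir, as FileOutputCommitter does per table path.
staging = os.path.join(table, "_temporary")

def stage(job, part):
    """Simulate a job staging one partition file before commit."""
    d = os.path.join(staging, job)
    os.makedirs(d, exist_ok=True)
    path = os.path.join(d, f"{part}.parquet")
    open(path, "w").close()
    return path

a = stage("job_a", "dt=2023-09-21")
b = stage("job_b", "dt=2023-09-22")

# Job A commits first and cleans up the shared _temporary dir...
shutil.rmtree(staging)

# ...so Job B's staged file is gone when it tries to commit it.
print(os.path.exists(b))  # False -> Job B fails with FileNotFound
shutil.rmtree(table)
```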

Re: Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread Mich Talebzadeh
In general you can probably do all of this in spark-sql: read the Hive table into a DataFrame in PySpark, create a TempView on that DF, select the PM data with a CAST() on the time column, and then use a windowing function with DENSE_RANK() to select the top 5. #Read Hive table as a DataFrame df =
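Since a full Spark session isn't runnable in an email, here is the same pattern — filter to PM rows, aggregate volume per IP, DENSE_RANK() window, keep rank <= 5 — expressed against SQLite's window functions, which accept the same SQL shape as Spark SQL for this query (table contents invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_data (incoming_ip TEXT, time_in TEXT, volume INT)")
rows = [
    ("10.0.0.1", "2023-09-21 13:00:00", 500),
    ("10.0.0.1", "2023-09-21 14:30:00", 300),
    ("10.0.0.2", "2023-09-21 15:00:00", 900),
    ("10.0.0.3", "2023-09-21 09:00:00", 9999),  # AM row: must be excluded
    ("10.0.0.4", "2023-09-21 18:00:00", 100),
]
conn.executemany("INSERT INTO sample_data VALUES (?, ?, ?)", rows)

query = """
WITH pm AS (                      -- keep only PM rows (hour >= 12)
    SELECT incoming_ip, volume
    FROM sample_data
    WHERE CAST(strftime('%H', time_in) AS INTEGER) >= 12
),
totals AS (                       -- total volume per IP, ranked by volume
    SELECT incoming_ip,
           SUM(volume) AS total_volume,
           DENSE_RANK() OVER (ORDER BY SUM(volume) DESC) AS rnk
    FROM pm
    GROUP BY incoming_ip
)
SELECT incoming_ip, total_volume FROM totals WHERE rnk <= 5 ORDER BY rnk
"""
top5 = conn.execute(query).fetchall()
print(top5)  # -> [('10.0.0.2', 900), ('10.0.0.1', 800), ('10.0.0.4', 100)]
```

In PySpark the identical query can be run via `spark.sql(...)` after `df.createOrReplaceTempView("sample_data")`.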

Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread ashok34...@yahoo.com.INVALID
Hello gurus, I have a Hive table created as below (there are more columns) CREATE TABLE hive.sample_data ( incoming_ip STRING, time_in TIMESTAMP, volume INT ); Data is stored in that table. In PySpark, I want to select the top 5 incoming IP addresses with the highest total volume of data

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Oh, I saw it now. Thanks! On Wed, Sep 20, 2023 at 1:04 PM Sean Owen wrote: > [ External sender. Exercise caution. ] > > I think the announcement mentioned there were some issues with pypi and > the upload size this time. I am sure it's intended to be there when > possible. > > On Wed, Sep 20,

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( >

PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Hi, Are there any plans to upload PySpark 3.5.0 to PyPI ( https://pypi.org/project/pyspark/)? It's still 3.4.1. Thanks, Kezhi

[Spark 3.5.0] Is the protobuf-java JAR no longer shipped with Spark?

2023-09-20 Thread Gijs Hendriksen
Hi all, This week, I tried upgrading to Spark 3.5.0, as it contained some fixes for spark-protobuf that I need for my project. However, my code is no longer running under Spark 3.5.0. My build.sbt file is configured as follows: val sparkV  = "3.5.0" val hadoopV = "3.3.6"
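If the missing classes are indeed `com.google.protobuf.*`, one workaround to try — an assumption, since the preview cuts off before the actual error — is declaring `protobuf-java` explicitly in `build.sbt`; the version below is a guess and should be matched to whatever spark-protobuf 3.5.0 was compiled against:

```scala
// Hypothetical pin; verify the version spark-protobuf 3.5.0 actually expects.
libraryDependencies += "com.google.protobuf" % "protobuf-java" % "3.23.4"
```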

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-20 Thread Sean Owen
This has turned into a big thread for a simple thing and has been answered 3 times over now. Neither is better, they just calculate different things. That the 'default' is sample stddev is just convention. stddev_pop is the simple standard deviation of a set of numbers stddev_samp is used when
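To make the distinction concrete in plain Python — the stdlib `statistics` module makes the same n vs. n-1 choice as Spark's `stddev_pop`/`stddev_samp` and Excel's `STDEV.P`/`STDEV.S`:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population stddev: squared deviations divided by n.
pop = statistics.pstdev(data)    # what stddev_pop / STDEV.P compute

# Sample stddev: divided by n - 1 (Bessel's correction).
samp = statistics.stdev(data)    # what stddev_samp / STDEV.S compute

print(pop, samp)  # pop is exactly 2.0 for this data; samp is slightly larger (~2.138)
```

Neither is "wrong"; which one a tool reports by default is pure convention, which is exactly the discrepancy observed between PySpark and Excel.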

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-20 Thread Gowtham S
Hi Spark Community, Thank you for bringing up this issue. We've also encountered the same challenge and are actively working on finding a solution. It's reassuring to know that we're not alone in this. If you have any insights or suggestions regarding how to address this problem, please feel

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-20 Thread Mich Talebzadeh
Spark uses the sample standard deviation stddev_samp by default, whereas *Hive* uses the population standard deviation stddev_pop by default. My understanding is that spark uses sample standard deviation by default because - It is more commonly used. - It is more efficient to calculate. -

unsubscribe

2023-09-19 Thread Danilo Sousa
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
