Structured Streaming Process Each Record Individually

2024-01-10 Thread PRASHANT L
Hi, I have a use case where I need to process JSON payloads coming from Kafka using Structured Streaming, but the thing is the JSON can have different formats; the schema is not fixed, and each JSON will have a @type tag, so based on the tag, the JSON has to be parsed and loaded to a table with the tag name, and if a
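
A minimal PySpark sketch of one way to route records by tag (assumptions: the spark-sql-kafka package is on the classpath, the broker/topic names, checkpoint path and bronze_ table prefix are placeholders, and the @type field is reachable with get_json_object bracket syntax):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("route-by-type").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
       .option("subscribe", "events")                          # placeholder topic
       .load()
       .select(col("value").cast("string").alias("json")))

# Extract the @type tag; bracket syntax because @ is not a plain identifier.
tagged = raw.withColumn("type", get_json_object(col("json"), "$['@type']"))

def route(batch_df, batch_id):
    # For every tag seen in this micro-batch, append that slice to its own table.
    for (tag,) in batch_df.select("type").distinct().collect():
        if tag is None:
            continue
        # A per-tag schema and from_json() would go here; as a sketch we keep the raw JSON.
        (batch_df.filter(col("type") == tag)
                 .write.mode("append")
                 .saveAsTable(f"bronze_{tag}"))   # hypothetical table naming

query = (tagged.writeStream
         .foreachBatch(route)
         .option("checkpointLocation", "/tmp/checkpoints/route-by-type")  # placeholder path
         .start())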

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Yes, I agree. But apart from maintaining this state internally (in memory or in memory+disk as in case of RocksDB), every trigger it saves some information about this state in a checkpoint location. I'm afraid we can't do much about this checkpointing operation. I'll continue looking for

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, You may have a point on scenario 2. Caching Streaming DataFrames: In Spark Streaming, each batch of data is processed incrementally, and it may not fit the typical caching we discussed. Instead, Spark Streaming has its mechanisms to manage and optimize the processing of streaming data. Case

[Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-10 Thread Andrzej Zera
I'm struggling with the following issue in Spark >=3.4, related to multiple stateful operations. When spark.sql.streaming.statefulOperator.allowMultiple is enabled, Spark keeps track of two types of watermarks: eventTimeWatermarkForEviction and eventTimeWatermarkForLateEvents. Introducing them

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Hey, Yes, that's how I understood it (scenario 1). However, I'm not sure if scenario 2 is possible. I think cache on streaming DataFrame is supported only in forEachBatch (in which it's actually no longer a streaming DF). Wed, 10 Jan 2024 at 15:01, Mich Talebzadeh wrote: > Hi, > > With

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, With regard to your point - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try to cache batch ones to save on reads. But I'm not sure if it's what you mean, and I don't know how to apply what

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Thank you very much for your suggestions. Yes, my main concern is checkpointing costs. I went through your suggestions and here're my comments: - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try

unsubscribe

2024-01-10 Thread Daniel Maangi

[apache-spark] documentation on File Metadata _metadata struct

2024-01-10 Thread Jason Horner
All, the only documentation about the File Metadata (hidden _metadata struct) I can find is on the Databricks website https://docs.databricks.com/en/ingestion/file-metadata-column.html#file-metadata-column for reference here is the struct: _metadata: struct (nullable = false) |-- file_path:

Unsubscribe

2024-01-09 Thread qi bryce
Unsubscribe

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread Mich Talebzadeh
Hi Ashok, Thanks for pointing out the databricks article Scalable Spark Structured Streaming for REST API Destinations | Databricks Blog I browsed it and it is basically similar to many of us involved

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread ashok34...@yahoo.com.INVALID
Hey Mich, Thanks for this introduction on your forthcoming proposal "Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I recently came across an article by Databricks with title Scalable Spark Structured Streaming for REST API Destinations. Their use

Unsubscribe

2024-01-09 Thread mahzad kalantari
Unsubscribe

Unsubscribe

2024-01-09 Thread Kalhara Gurugamage
Unsubscribe

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Please also note that Flask, by default, is a single-threaded web framework. While it is suitable for development and small-scale applications, it may not handle concurrent requests efficiently in a production environment. In production, one can utilise Gunicorn (Green Unicorn) which is a WSGI (

Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Thought it might be useful to share my idea with fellow forum members. During the breaks, I worked on the *seamless integration of Spark Structured Streaming with Flask REST API for real-time data ingestion and analytics*. The use case revolves around a scenario where data is generated through

Re: Pyspark UDF as a data source for streaming

2024-01-08 Thread Mich Talebzadeh
Hi, Have you come back with some ideas for implementing this? Specifically integrating Spark Structured Streaming with REST API? FYI, I did some work on it as it can have potential wider use cases, i.e. the seamless integration of Spark Structured Streaming with Flask REST API for real-time data

[ANNOUNCE] Apache Celeborn(incubating) 0.3.2 available

2024-01-07 Thread Nicholas Jiang
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.3.2. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Mich Talebzadeh
OK I assume that your main concern is checkpointing costs. - Caching: If your queries read the same data multiple times, caching the data might reduce the amount of data that needs to be checkpointed. - Optimize Checkpointing Frequency i.e - Consider Changelog Checkpointing with RocksDB.
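
For reference, the Spark 3.4+/3.5 settings behind the RocksDB/changelog suggestion look roughly like this (a sketch; the retention value is only an example, not from the thread):

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("checkpoint-tuning")
         # Keep operator state in RocksDB rather than the default in-memory/HDFS-backed store.
         .config("spark.sql.streaming.stateStore.providerClass",
                 "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
         # Write incremental changelogs per micro-batch instead of full snapshots (Spark 3.4+).
         .config("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")
         # Retain fewer old checkpoint batches to cut LIST/DELETE traffic on the object store.
         .config("spark.sql.streaming.minBatchesToRetain", "20")
         .getOrCreate())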

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Andrzej Zera
Usually one or two topics per query. Each query has its own checkpoint directory. Each topic has a few partitions. Performance-wise I don't experience any bottlenecks in terms of checkpointing. It's all about the number of requests (including a high number of LIST requests) and the associated

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-06 Thread Mich Talebzadeh
How many topics and checkpoint directories are you dealing with? Does each topic have its own checkpoint on S3? All these checkpoints are sequential writes so even SSD would not really help. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view

[Structured Streaming] Keeping checkpointing cost under control

2024-01-05 Thread Andrzej Zera
Hey, I'm running a few Structured Streaming jobs (with Spark 3.5.0) that require near-real time accuracy with trigger intervals in the level of 5-10 seconds. I usually run 3-6 streaming queries as part of the job and each query includes at least one stateful operation (and usually two or more).
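
The shape of such a query, as a minimal sketch (broker, topic and the s3a checkpoint path are placeholders; the spark-sql-kafka package is assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("near-real-time").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

# One stateful operation (a windowed count) behind a watermark.
counts = (events
          .withWatermark("timestamp", "30 seconds")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Every 5-second trigger commits offsets, state and metadata under the checkpoint
# location, which is where the per-trigger S3 request costs come from.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="5 seconds")
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/near-real-time")
         .start())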

Re: Issue with Spark Session Initialization in Kubernetes Deployment

2024-01-05 Thread Mich Talebzadeh
Hi, I personally do not use the Spark operator. Anyhow, the Spark Operator automates the deployment and management of Spark applications within Kubernetes. However, it does not eliminate the need to configure Spark sessions for proper communication with the k8s cluster. So specifying the master
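
When the session is built by the application itself rather than injected by spark-submit, specifying the master looks roughly like this (a sketch; API-server address, image and service account are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("structured-streaming-on-k8s")
         .master("k8s://https://<k8s-api-server>:6443")
         .config("spark.kubernetes.container.image", "<registry>/spark:3.5.0")
         .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
         .getOrCreate())

With the Spark Operator / spark-submit path, the master is normally passed in for you, so an explicit master() is only needed when the session is created outside that flow.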

Issue with Spark Session Initialization in Kubernetes Deployment

2024-01-04 Thread Atul Patil
Hello Team, I am currently working on initializing a Spark session using Spark Structured Streaming within a Kubernetes deployment managed by the Spark Operator. During the initialization process, I encountered an error message indicating the necessity to set a master URL: *"Caused by:

Unsubscribe

2024-01-02 Thread Atlas - Samir Souidi
Unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Select Columns from Dataframe in Java

2023-12-30 Thread Grisha Weintraub
Hi, Have a look here - https://repost.aws/knowledge-center/spark-driver-logs-emr-cluster. Usually, you have application logs out-of-the-box in the driver stdout. It looks like

Re: Select Columns from Dataframe in Java

2023-12-30 Thread PRASHANT L
Hi Grisha, This is great :) It worked, thanks a lot. I have this requirement: I will be running my Spark application on EMR and want to build custom logging to create logs on S3. Any idea what I should do? Or, in general, if I create a custom log (with my application name), where will the logs be generated

Re: Select Columns from Dataframe in Java

2023-12-30 Thread Grisha Weintraub
In Java, it expects an array of Columns, so you can simply cast your list to an array: array_df.select(fields.toArray(new Column[0])) On Fri, Dec 29, 2023 at 10:58 PM PRASHANT L wrote: > > Team > I am using Java and want to select columns from Dataframe , columns are > stored in List >

Unsubscribe

2023-12-29 Thread Vinti Maheshwari

Re: Pyspark UDF as a data source for streaming

2023-12-29 Thread Mich Talebzadeh
Hi, Do you have more info on this Jira besides the github link as I don't seem to find it! Thanks Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: the life cycle shuffle Dependency

2023-12-29 Thread murat migdisoglu
Hello, why would you like to delete the shuffle data yourself in the first place? On Thu, Dec 28, 2023, 10:08 yang chen wrote: > > hi, I'm learning spark, and wonder when to delete shuffle data, I find the > ContextCleaner class which clean the shuffle data when shuffle dependency > is GC-ed.

Select Columns from Dataframe in Java

2023-12-29 Thread PRASHANT L
Team, I am using Java and want to select columns from a DataFrame; the columns are stored in a List. The equivalent Scala code is array_df=array_df.select(fields: _*). When I try array_df=array_df.select(fields), I get an error saying "Cast to Column". I am using Spark 3.4

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
Hi Stanislav, On the PySpark DF can you run the following df.printSchema() and send the output please. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

RE: Pyspark UDF as a data source for streaming

2023-12-28 Thread Поротиков Станислав Вячеславович
Ok. Thank you very much! Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Thursday, December 28, 2023 5:14 PM To: Hyukjin Kwon Cc: Поротиков Станислав Вячеславович ; user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming You can work around this issue by

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
You can work around this issue by trying to write your DF to a flat file and use Kafka to pick it up from the flat file and stream it in. Bear in mind that Kafka will require a unique identifier as a K/V pair. Check this link on how to generate a UUID for this purpose
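
On the K/V point, a minimal sketch of writing a DataFrame straight to Kafka with a generated UUID key (needs the spark-sql-kafka package; broker and topic names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-to-kafka").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2)], ["name", "value"])

# The Kafka sink expects key/value columns; uuid() gives each record a unique key.
(df.selectExpr("uuid() AS key", "to_json(struct(*)) AS value")
   .write.format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "ingest")
   .save())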

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, the streaming Python data source is in progress (https://github.com/apache/spark/pull/44416); we will likely release this in Spark 4.0. On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович wrote: > Yes, it's actual data. > > > > Best regards, > > Stanislav Porotikov > > > > *From:*

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Yes, it's actual data. Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 9:43 PM Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Is this generated data actual data or you are testing the application? Sounds like a form

Fwd: the life cycle shuffle Dependency

2023-12-27 Thread yang chen
Hi, I'm learning Spark and wonder when shuffle data is deleted. I found the ContextCleaner class, which cleans the shuffle data when the shuffle dependency is GC-ed. Based on the source code, the shuffle dependency is GC-ed only when the active job finishes, but I'm not sure. Could you explain the life cycle of

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Actually it's json with specific structure from API server. But the task is to check constantly if new data appears on API server and load it to Kafka. Full pipeline can be presented like that: REST API -> Kafka -> some processing -> Kafka/Mongo -> … Best regards, Stanislav Porotikov From: Mich

Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
Ok so you want to generate some random data and load it into Kafka on a regular interval and the rest? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Hello! Is it possible to write a PySpark UDF that generates data for a streaming dataframe? I want to get some data from REST API requests in real time and save this data to a dataframe, and then put it into Kafka. I can't figure out how to create a streaming dataframe from generated data. I am new in

Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend, thanks a ton was looking for linting for SQL for a long time, looks like https://sqlfluff.com/ is something that can be used :) Thank you so much, and wish you all a wonderful new year. Regards, Gourav On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen wrote: > You can try sqlfluff

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Worth trying EXPLAIN statement as suggested by @tianlangstudio HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
You can try sqlfluff; it's a linter for SQL code and it seems to have support for sparksql. Mon, 25 Dec 2023 at 17:13, ram manickam wrote: > Thanks Mich, Nicholas. I tried looking over the stack overflow post and > none of them >
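
If useful, sqlfluff also exposes a simple Python API, which (assuming the package is installed and my recollection of the signature is right) can lint a Spark SQL string directly:

import sqlfluff  # pip install sqlfluff

violations = sqlfluff.lint(
    "SELECT id, name FROM people WHERE age > 30",
    dialect="sparksql",
)
for v in violations:
    print(v)  # each entry carries a rule code, position and description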

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
Mailing lists For broad, opinion based, ask for external resources, debug issues, bugs, contributing to the project, and scenarios, it is recommended you use the user@spark.apache.org mailing list. - user@spark.apache.org is for

回复:Validate spark sql

2023-12-25 Thread tianlangstudio
What about EXPLAIN? https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content Fusion Zhu
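
As a quick illustration of the EXPLAIN route (a sketch; the table is created only for the example): a syntactically invalid statement or a missing table fails at analysis time, without running the query.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("validate-sql").getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS people (id INT, name STRING) USING parquet")

# Returns the parsed/analyzed/optimized plans as text instead of executing the query.
spark.sql("EXPLAIN EXTENDED SELECT id, name FROM people WHERE id > 10").show(truncate=False)

try:
    spark.sql("EXPLAIN SELECT * FROM no_such_table")
except Exception as e:
    print("analysis failed:", e)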

Re: Validate spark sql

2023-12-24 Thread ram manickam
Thanks Mich, Nicholas. I tried looking over the stack overflow post and none of them seems to cover syntax validation. Do you know if it's even possible to do syntax validation in Spark? Thanks Ram On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh wrote: > Well, not to put too fine a point on

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
Well, not to put too fine a point on it: in a public forum, one ought to respect the importance of open communication. Everyone has the right to ask questions, seek information, and engage in discussions without facing unnecessary patronization. Mich Talebzadeh, Dad | Technologist | Solutions

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation to the user list and BCC-ing the dev list. Also, this statement > We are not validating against table or column existence. is not correct. When you call spark.sql(…), Spark will lookup the table references and

Unsubscribe

2023-12-21 Thread yxj1141
Unsubscribe

India Scala & Big Data Job Referral

2023-12-21 Thread sri hari kali charan Tummala
Hi Community, I was laid off from Apple in February 2023, which led to my relocation from the USA due to immigration issues related to my H1B visa. I have over 12 years of experience as a consultant in Big Data, Spark, Scala, Python, and Flink. Despite my move to India, I haven't secured a

About shuffle partition size

2023-12-20 Thread Nebi Aydin
Hi all, What happens when the number of unique join keys is less than the number of shuffle partitions? Are we going to end up with lots of empty partitions? If yes, is there any point in having more shuffle partitions than unique join keys?
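
One hedged illustration: with adaptive query execution (enabled by default in recent 3.x releases), the small or empty post-shuffle partitions get coalesced at runtime, so the static setting matters less than it used to.

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("aqe-demo")
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .getOrCreate())

left = spark.range(1_000_000).withColumnRenamed("id", "k")
right = spark.range(10).withColumnRenamed("id", "k")   # only 10 distinct join keys

joined = left.join(right, "k")
joined.explain()        # AdaptiveSparkPlan; shuffle reads are coalesced at runtime
print(joined.count())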

[ANNOUNCE] Apache Spark 3.3.4 released

2023-12-16 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.3.4! Spark 3.3.4 is the last maintenance release based on the branch-3.3 maintenance branch of Spark. It contains many fixes including security and correctness domains. We strongly recommend all 3.3 users to upgrade to this or higher

Unsubscribe

2023-12-16 Thread Andrew Milkowski

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-15 Thread Mich Talebzadeh
Apologies Koert! Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-15 Thread Mich Talebzadeh
Hi Koert, I read this document of yours. Indeed interesting and pretty recent (9th Dec). I am more focused on GCP and GKE, but obviously the concepts are the same. One thing I noticed: there was a lack of mention of Workload Identity federation

Re: Architecture of Spark Connect

2023-12-14 Thread Hyukjin Kwon
By default for now, yes. One Spark Connect server handles multiple Spark Sessions. To multiplex or run multiple Drivers, you need some work such as gateway. On Thu, 14 Dec 2023 at 12:03, Kezhi Xiong wrote: > Hi, > > My understanding is there is only one driver/spark context for all user >

Re: Architecture of Spark Connect

2023-12-14 Thread Kezhi Xiong
Hi, My understanding is there is only one driver/spark context for all user sessions. When you run the bin/start-connect-server script, you are submitting one long standing spark job / application. Every time a new user request comes in, a new user session is created under that. Please correct me

Re: Architecture of Spark Connect

2023-12-14 Thread Nikhil Goyal
If multiple applications are running, we would need multiple spark connect servers? If so, is the user responsible for creating these servers or they are just created on the fly when the user requests a new spark session? On Thu, Dec 14, 2023 at 10:28 AM Nikhil Goyal wrote: > Hi folks, > I am

Architecture of Spark Connect

2023-12-14 Thread Nikhil Goyal
Hi folks, I am trying to understand one question. Does Spark Connect create a new driver in the backend for every user, or is there a fixed number of drivers running to which requests are sent? Thanks Nikhil

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-13 Thread Koert Kuipers
yes it does using IAM roles for service accounts. see: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html i wrote a little bit about this also here: https://technotes.tresata.com/spark-on-k8s/ On Wed, Dec 13, 2023 at 7:52 AM Atul Patil wrote: > Hello Team, > >
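
A sketch of the Spark side of that setup (assumes an EKS service account annotated with an IAM role, so the web-identity token and role ARN are injected into the pods; the bucket name is a placeholder):

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("s3-irsa")
         # hadoop-aws 3.3.x bundles the v1 AWS SDK, which provides this credentials provider.
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
         .getOrCreate())

spark.read.parquet("s3a://my-bucket/some/prefix/").show(5)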

Unsubscribe

2023-12-13 Thread kritika jain

Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-13 Thread Atul Patil
Hello Team, Does Spark support role-based authentication and access to Amazon S3 for Kubernetes deployment? *Note: we have deployed our spark application in the Kubernetes cluster.* Below are the Hadoop-AWS dependencies we are using: org.apache.hadoop hadoop-aws 3.3.4 We are

Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-13 Thread Patil, Atul
Hello Team, Does Spark support role-based authentication and access to Amazon S3 for Kubernetes deployment? Note: we have deployed our spark application in the Kubernetes cluster. Below are the Hadoop-AWS dependencies we are using: org.apache.hadoop hadoop-aws 3.3.4 We are using the

Unsubscribe

2023-12-12 Thread Daniel Maangi

Unsubscribe

2023-12-12 Thread Klaus Schaefers
-- “Overfitting” is not about an excessive amount of physical exercise...

Unsubscribe

2023-12-12 Thread Sergey Boytsov
Unsubscribe --

Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread murat migdisoglu
Hey Jack, Emr serverless is a great fit for this. You can get these metrics for each job when they are completed. Besides that, if you create separate "emr applications" per group and tag them appropriately, you can use the cost explorer to see the amount of resources being used. If emr

Re: Cluster-mode job compute-time/cost metrics

2023-12-12 Thread Jörn Franke
It could be simpler and faster to use tagging of resources for billing: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-tags-billing.html That could also include other resources (e.g. S3). > On 12.12.2023 at 04:47, Jack Wells wrote: > >  > Hello Spark experts - I’m running

Cluster-mode job compute-time/cost metrics

2023-12-11 Thread Jack Wells
Hello Spark experts - I’m running Spark jobs in cluster mode using a dedicated cluster for each job. Is there a way to see how much compute time each job takes via Spark APIs, metrics, etc.? In case it makes a difference, I’m using AWS EMR - I’d ultimately like to be able to say this job costs $X
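
One option besides cost-allocation tags: the Spark monitoring REST API (Spark UI / History Server) exposes per-application executor task time, which can be rolled up into an approximate compute-time figure. A sketch, with the history-server host as a placeholder:

import requests

BASE = "http://history-server:18080/api/v1"   # placeholder host/port

for app in requests.get(f"{BASE}/applications", timeout=10).json():
    app_id = app["id"]
    executors = requests.get(f"{BASE}/applications/{app_id}/allexecutors", timeout=10).json()
    task_seconds = sum(e.get("totalDuration", 0) for e in executors) / 1000.0
    print(app_id, app["name"], f"~{task_seconds:.0f}s of task time")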

Unsubscribe

2023-12-11 Thread 18706753459
Unsubscribe

Unsubscribe

2023-12-11 Thread Dusty Williams
Unsubscribe

unsubscribe

2023-12-11 Thread Stevens, Clay
unsubscribe

Spark 3.1.3 with Hive dynamic partitions fails while driver moves the staged files

2023-12-11 Thread Shay Elbaz
Hi all, Running on Dataproc 2.0/1.3/1.4, we use INSERT INTO OVERWRITE command to insert new (time) partitions into existing Hive tables. But we see too many failures coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles. This is where the driver moves the successful files from

unsubscribe

2023-12-11 Thread Sergey Boytsov
-- Sergei Boitsov JetBrains GmbH Christoph-Rapparini-Bogen 23 80639 München Handelsregister: Amtsgericht München, HRB 187151 Geschäftsführer: Yury Belyaev

unsubscribe

2023-12-11 Thread Klaus Schaefers
-- “Overfitting” is not about an excessive amount of physical exercise...

Re: [EXTERNAL] Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Eugene Miretsky
Hey Mich, Thanks for the detailed response. I get most of these options. However, what we are trying to do is avoid having to upload the source configs and pyspark.zip files to the cluster every time we execute the job using spark-submit. Here is the code that does it:

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-11 Thread Михаил Кулаков
Hey Enrico, it does help to understand it, thanks for explaining. Regarding this comment > PySpark and Scala should behave identically here Is it OK that Scala and PySpark optimization works differently in this case? Tue, 5 Dec 2023 at 20:08, Enrico Minack wrote: > Hi Michail, > > with

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Mich Talebzadeh
Hi Eugene, With regard to your points What are the PYTHONPATH and SPARK_HOME env variables in your script? OK let us look at a typical of my Spark project structure - project_root |-- README.md |-- __init__.py |-- conf | |-- (configuration files for Spark) |-- deployment | |--

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't understand a few things 1) The default behaviour is if PYSPARK_ARCHIVES_PATH is empty, pyspark.zip is uploaded from the local SPARK_HOME. If it is set to "local://" the upload is skipped. I would expect the latter to be the

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Thanks Mich, Tried this and still getting INFO Client: "Uploading resource file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip -> hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and __spark_conf__.zip. It is working now because I enabled direct access to HDFS to allow copying

unsubscribe

2023-12-10 Thread Rajanikant V

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
It is just a goal… However, I would not tune the number of regions or region size yet. Simply specify the GC algorithm and max heap size. Try to tune other options only if there is a need, only one at a time (otherwise it is difficult to determine cause/effect), and have a performance testing framework in

Re: Spark on Java 17

2023-12-09 Thread Faiz Halde
Thanks, I'll check them out. Curious though, the official G1GC page https://www.oracle.com/technical-resources/articles/java/g1gc.html says that there must be no more than 2048 regions and the region size is limited to up to 32 MB. That's strange because our heaps go up to 100 GB and that would require 64 MB

Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
If you do tests with newer Java versions you can also try: - UseNUMA: -XX:+UseNUMA. See https://openjdk.org/jeps/345 You can also assess the new Java GC algorithms: - -XX:+UseShenandoahGC - works with terabyte of heaps - more memory efficient than zgc with heaps <32 GB. See also:

RE: Spark on Java 17

2023-12-09 Thread Luca Canali
Hi Faiz, We find G1GC works well for some of our workloads that are Parquet-read intensive and we have been using G1GC with Spark on Java 8 already (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions= “-XX:+UseG1GC”), while currently we are mostly running Spark (3.3 and higher)
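
For completeness, passing those options when the session is built looks like this (the G1 flag is the one from the thread; the memory value is only an example; note that -Xmx cannot go in extraJavaOptions, the heap size is set via spark.driver.memory / spark.executor.memory):

from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("gc-tuning")
         .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
         .config("spark.executor.memory", "16g")   # example heap size, not from the thread
         .getOrCreate())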

Spark on Java 17

2023-12-07 Thread Faiz Halde
Hello, We are planning to switch to Java 17 for Spark and were wondering if there are any obvious learnings from anybody related to JVM tuning? We've been running on Java 8 for a while now and used to use the parallel GC, as that used to be a general recommendation for high throughput systems. How

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the clarification. I will try to do plain jdbc connection on Scala/Java and will update this thread on how it goes. *Thanks,* *Venkat* On Thu, Dec 7, 2023 at 9:40 AM Nicholas Chammas wrote: > PyMySQL has its own implementation >

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation of the MySQL client-server protocol. It does not use JDBC. > On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan > wrote: > > Thanks for the advice

Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the advice Nicholas. As mentioned in the original email, I have tried JDBC + SSH Tunnel using pymysql and sshtunnel and it worked fine. The problem happens only with Spark. *Thanks,* *Venkat* On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas wrote: > This is not a question for the

SSH Tunneling issue with Apache Spark

2023-12-05 Thread Venkatesan Muniappan
Hi Team, I am facing an issue with SSH tunneling in Apache Spark. The behavior is the same as the one in this Stack Overflow question, but there are no answers there. This is what I am trying:

Re: ordering of rows in dataframe

2023-12-05 Thread Enrico Minack
Looks like what you want is to add a column such that, when ordered by that column, the current order of the dataframe is preserved. All you need is the monotonically_increasing_id() function: spark.range(0, 10, 1, 5).withColumn("row", monotonically_increasing_id()).show() +---+---+ | id|

ordering of rows in dataframe

2023-12-05 Thread Som Lima
I want to maintain the order of the rows in the data frame in PySpark. Is there any way to achieve this? For this function, we have the row ID which will give numbering to each row. Currently, the below function results in the rearrangement of the rows in the data frame. def createRowIdColumn(

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-05 Thread Enrico Minack
Hi Michail, with spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see how Spark optimizes the query plan. In PySpark, the plan is optimized into Project ...   +- CollectMetrics 2, [count(1) AS count(1)#200L]   +- LocalTableScan , [col1#125, col2#126L, col3#127, col4#132L] The

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Holden Karau
So I think this sounds like a bug to me, in the help options for both regular spark-submit and ./sbin/start-connect-server.sh we say: " --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths.

ML advice

2023-12-04 Thread Zahid Rahman
Hi, I heard some big things about machine learning and data science. To upgrade my skill set I took a Udemy course, Python and Spark for Big Data with Spark. It took about a week to learn the concepts and the workflow to follow when using each of the Spark APIs. To complete a Machine Learning

Re: Do we have any mechanism to control requests per second for a Kafka connect sink?

2023-12-04 Thread Yeikel Santana
Apologies to everyone. I sent this to the wrong email list. Please discard On Mon, 04 Dec 2023 10:48:11 -0500 Yeikel Santana wrote --- Hello everyone, Is there any mechanism to force Kafka Connect to ingest at a given rate per second as opposed to tasks? I am operating in a

Do we have any mechanism to control requests per second for a Kafka connect sink?

2023-12-04 Thread Yeikel Santana
Hello everyone, Is there any mechanism to force Kafka Connect to ingest at a given rate per second as opposed to tasks? I am operating in a shared environment where the ingestion rate needs to be as low as possible (for example, 5 requests/second as an upper limit), and as far as I can

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Aironman DirtDiver
The issue you're encountering with the iceberg-spark-runtime dependency not being properly passed to the executors in your Spark Connect server deployment could be due to a couple of factors: 1. *Spark Submit Packaging:* When you use the --packages parameter in spark-submit, it only
