Re: [Issue] Spark SQL - broadcast failure

2024-08-01 Thread Sudharshan V
Hi all, Do we have any idea on this. Thanks On Tue, 23 Jul, 2024, 12:54 pm Sudharshan V, wrote: > We removed the explicit broadcast for that particular table and it took > longer time since the join type changed from BHJ to SMJ. > > I wanted to understand how I can find what went wrong with

Re: [Spark Connect] connection issue

2024-07-29 Thread Prabodh Agarwal
Glad it worked! On Tue, 30 Jul, 2024, 11:12 Ilango, wrote: > > Thanks Prabodh. I copied the spark connect jar to $SPARK_HOME/jars > folder. And passed the location as --jars attr. It's working now. I could > submit spark jobs via spark connect. > > Really appreciate the help. > > > > Thanks, >

Re: [Spark Connect] connection issue

2024-07-29 Thread Ilango
Thanks Prabodh. I copied the spark connect jar to the $SPARK_HOME/jars folder. And passed the location as --jars attr. It's working now. I could submit spark jobs via spark connect. Really appreciate the help. Thanks, Elango On Tue, 30 Jul 2024 at 11:05 AM, Prabodh Agarwal wrote: > Yeah. I

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Meena Rajani
You probably have to increase jvm/jdk memory size https://stackoverflow.com/questions/1565388/increase-heap-size-in-java On Mon, Jul 29, 2024 at 9:36 PM mike Jadoo wrote: > Thanks. I just downloaded the corretto but I got this error message, > which was the same as before. [It was shared
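
A minimal PySpark sketch of the suggestion (the 4g figure and local master are illustrative assumptions): driver memory has to be fixed before the JVM is launched, so set it at session-creation time or via spark-submit --driver-memory.

    from pyspark.sql import SparkSession

    # Hypothetical value: adjust "4g" to what the machine can spare.
    # spark.driver.memory must be set before the JVM starts, so pass it
    # when the session is created (or via spark-submit --driver-memory 4g).
    spark = (
        SparkSession.builder
        .appName("local-heap-sketch")
        .master("local[*]")
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )

    print(spark.sparkContext.getConf().get("spark.driver.memory"))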

Re: [Spark Connect] connection issue

2024-07-29 Thread Prabodh Agarwal
Yeah. I understand the problem. One of the ways is to actually place the spark connect jar in the $SPARK_HOME/jars folder. That is how we run spark connect. Using the `--packages` or the `--jars` option is flaky in case of spark connect. You can instead manually place the relevant spark connect
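
For reference, a minimal client-side sketch once the server-side jar is in place (assumes a Spark Connect server already running on the default port 15002 and a PySpark client with the connect extras installed):

    from pyspark.sql import SparkSession

    # Assumes start-connect-server.sh has been run on the cluster after the
    # spark-connect jar was copied into $SPARK_HOME/jars.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    spark.range(5).show()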

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike, This appears to be an access issue on Windows + Python. Can you try setting up the PYTHON_PATH environment variable as described in this stackoverflow post https://stackoverflow.com/questions/60414394/createprocess-error-5-access-is-denied-pyspark - Sadha On Mon, Jul 29, 2024 at 3:39 

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread mike Jadoo
Thanks. I just downloaded the corretto but I got this error message, which was the same as before. [It was shared with me that this saying that I have limited resources, i think] ---Py4JJavaError

Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike, I'm not sure about the minimum requirements of a machine for running Spark. But to run some PySpark scripts (and Jupyter notebooks) on a local machine, I found the following steps are the easiest. I installed Amazon Corretto and updated the JAVA_HOME variable as instructed here

Re: [Spark Connect] connection issue

2024-07-29 Thread Ilango
Thanks Prabodh, Yes I can see the spark connect logs in $SPARK_HOME/logs path. It seems like the spark connect dependency issue. My spark node is air gapped node so no internet is allowed. Can I download the spark connect jar and pom files locally and share the local paths? How can I share the

Re: [Spark Connect] connection issue

2024-07-29 Thread Prabodh Agarwal
The spark connect startup prints the log location. Is that not feasible for you? For me log comes to $SPARK_HOME/logs On Mon, 29 Jul, 2024, 15:30 Ilango, wrote: > > Hi all, > > > I am facing issues with a Spark Connect application running on a Spark > standalone cluster (without YARN and HDFS).

Re: Issue with comparing structs (possible bug)

2024-07-26 Thread Dhruv Singla
The spark version 3.5.1 On Fri, Jul 26, 2024 at 6:54 PM Dhruv Singla wrote: > Hey Everyone > > Hope you are doing well > > I am trying to compare structs with structs using the IN clause. Here is > what I found. > The following query comparing structs gives error > > SELECT struct(1, 2) IN ( >

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
We removed the explicit broadcast for that particular table and it took longer time since the join type changed from BHJ to SMJ. I wanted to understand how I can find what went wrong with the broadcast now. How do I know the size of the table inside of spark memory. I have tried to cache the
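
One way to inspect what Spark thinks a table's size is (a sketch; "lookup_table" is a hypothetical name and an existing SparkSession `spark` is assumed): compute statistics and read them back, or cache the DataFrame and check the Storage tab of the UI.

    # "Statistics" appears in the DESCRIBE EXTENDED output after ANALYZE runs.
    spark.sql("ANALYZE TABLE lookup_table COMPUTE STATISTICS")
    spark.sql("DESCRIBE EXTENDED lookup_table").show(truncate=False)

    # For the in-memory footprint: cache, materialize, then check the
    # Storage tab of the Spark UI for the cached size.
    df = spark.table("lookup_table").cache()
    df.count()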

Re: [Issue] Spark SQL - broadcast failure

2024-07-23 Thread Sudharshan V
Hi all, apologies for the delayed response. We are using spark version 3.4.1 in jar and EMR 6.11 runtime. We have disabled the auto broadcast always and would broadcast the smaller tables using explicit broadcast. It was working fine historically and only now it is failing. The data sizes I

Re: issue forwarding SPARK_CONF_DIR to start workers

2024-07-20 Thread Holden Karau
This might a good discussion for the dev@ list, I don’t know much about SLURM deployments personally. Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams:

Re: issue forwarding SPARK_CONF_DIR to start workers

2024-07-20 Thread Patrice Duroux
Hi, Here is a small patch that solves this issue. Considering all the scripts, I'm not sure if sbin/stop-workers.sh and sbin/stop-worker.sh need a similar change. Do they really care about SPARK_CONF_DIR to do the job? Note that I have also removed the following part in the script: cd

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Meena Rajani
Can you try disabling broadcast join and see what happens? On Mon, Jul 8, 2024 at 12:03 PM Sudharshan V wrote: > Hi all, > > Been facing a weird issue lately. > In our production code base , we have an explicit broadcast for a small > table. > It is just a look up table that is around 1gb in
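
A sketch of the suggestion (table names are hypothetical, an existing SparkSession `spark` is assumed): setting the auto-broadcast threshold to -1 disables automatic broadcasting, after which only an explicit hint forces a broadcast hash join.

    from pyspark.sql.functions import broadcast

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # disable auto BHJ

    large_df = spark.table("facts")
    small_df = spark.table("lookup")

    plain_join = large_df.join(small_df, "id")              # now planned as SMJ
    hinted_join = large_df.join(broadcast(small_df), "id")  # explicit BHJ hint

    plain_join.explain()
    hinted_join.explain()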

Re: [Issue] Spark SQL - broadcast failure

2024-07-16 Thread Mich Talebzadeh
It will help if you mention the Spark version and the piece of problematic code HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD Imperial College London

Re: Help wanted on securing spark with Apache Knox / JWT

2024-07-12 Thread Adam Binford
You need to use the spark.ui.filters setting on the history server https://spark.apache.org/docs/latest/configuration.html#spark-ui: spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter

Re: AttributeError: 'MulticlassMetrics' object has no attribute '_sc'

2024-06-23 Thread Saurabh Kumar
Please Unsubscribe me On Mon, 24 Jun 2024 at 07:02, Azhuvath, RajeevX wrote: > Getting the error “AttributeError: 'MulticlassMetrics' object has no > attribute '_sc'” while executing the standalone attached code in a bare > metal system. > > > > Thanks and Regards, > > Rajeev > >

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-06-20 Thread Anil Dasari
Hello @Tathagata Das Could you share your thoughts on https://issues.apache.org/jira/browse/SPARK-48418 ? Let me know if you have any questions. thanks. Regards, Anil On Fri, May 24, 2024 at 12:13 AM Anil Dasari wrote: > It appears that structured streaming and Dstream have entirely different

Re: Spark Decommission

2024-06-20 Thread Rajesh Mahindra
Thanks Ahmed, that's useful information On Wed, Jun 19, 2024 at 1:36 AM Khaldi, Ahmed wrote: > Hey Rajesh, > > > > From my experience, it's a stable feature, however you must keep in mind > that it will not guarantee that you will not lose the data that is on the > pods of the nodes getting a

Re: Help in understanding Exchange in Spark UI

2024-06-20 Thread Mich Talebzadeh
OK, I gave an answer in StackOverflow. Happy reading Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime PhD Imperial College London London, United

Re: [SPARK-48423] Unable to save ML Pipeline to azure blob storage

2024-06-19 Thread Chhavi Bansal
Hello Team, I am pinging back on this thread to get a pair of eyes on this issue. Ticket: https://issues.apache.org/jira/browse/SPARK-48423 On Thu, 6 Jun 2024 at 00:19, Chhavi Bansal wrote: > Hello team, > I was exploring on how to save ML pipeline to azure blob storage, but was > setback by

Re: Spark Decommission

2024-06-19 Thread Khaldi, Ahmed
Hey Rajesh, From my experience, it's a stable feature, however you must keep in mind that it will not guarantee that you will not lose the data that is on the pods of the nodes getting a spot kill. Once you have a spot kill, you have 120s to give the node back to the cloud provider. This is

Re: Update mode in spark structured streaming

2024-06-15 Thread Mich Talebzadeh
Best to qualify your thoughts with an example By using the foreachBatch function combined with the update output mode in Spark Structured Streaming, you can effectively handle and integrate late-arriving data into your aggregations. This approach will allow you to continuously update your
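
A minimal sketch of that pattern (source, sink path, window and watermark durations are all illustrative assumptions; an existing SparkSession `spark` is assumed):

    from pyspark.sql import functions as F

    # The rate source and the parquet sink path stand in for the real
    # source and target table.
    events = (
        spark.readStream.format("rate").option("rowsPerSecond", 10).load()
        .withColumnRenamed("timestamp", "eventTime")
    )

    agg = (
        events
        .withWatermark("eventTime", "10 minutes")
        .groupBy(F.window("eventTime", "5 minutes"))
        .count()
    )

    def upsert_batch(batch_df, batch_id):
        # Replace with a MERGE/upsert into the real sink; update mode hands
        # this function only the rows whose aggregates changed in this batch.
        batch_df.write.mode("append").parquet("/tmp/update-demo-sink")

    query = (
        agg.writeStream
        .outputMode("update")
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/tmp/update-demo-chk")
        .start()
    )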

Re: Re: OOM issue in Spark Driver

2024-06-11 Thread Mich Talebzadeh
se OOM. > Checking logs will always be a good start. And it would be better if some > colleague of you is familiar with JVM and OOM related issues. > > BS > Lingzhe Sun > > > *From:* Karthick Nk > *Date:* 2024-06-11 13:28 > *To:* Lingzhe Sun > *CC:* Andrzej Zera ;

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-10 Thread Gourav Sengupta
Hi, another thing we can consider while parallelising connection with the upstream sources is that it means you are querying the system simultaneously and that causes usage spikes, and in case the source system is facing a lot of requests during production workloads the best time to parallelise

Re: Unable to load MongoDB atlas data via PySpark because of BsonString error

2024-06-09 Thread Perez
Hi Team, Any help in this matter would be greatly appreciated. TIA On Sun, Jun 9, 2024 at 11:26 AM Perez wrote: > Hi Team, > > this is the problem > https://stackoverflow.com/questions/78593858/unable-to-load-mongodb-atlas-data-via-pyspark-jdbc-in-glue > > I can't go ahead with *StructType*

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Someshwar Kale
Hi Chhavi, Currently there is no way to handle backtick (`) in Spark StructType. Hence the field names a.b and `a.b` are completely different within StructType. To handle that, I have added a custom implementation fixing StringIndexer# validateAndTransformSchema. You can refer to the code on my

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Chhavi Bansal
Hi Someshwar, Thanks for the response, I have added my comments to the ticket . Thanks, Chhavi Bansal On Thu, 6 Jun 2024 at 17:28, Someshwar Kale wrote: > As a fix, you may consider adding a transformer to rename columns (perhaps > replace

Re: OOM issue in Spark Driver

2024-06-08 Thread Andrzej Zera
Hey, do you perform stateful operations? Maybe your state is growing indefinitely - a screenshot with state metrics would help (you can find it in Spark UI -> Structured Streaming -> your query). Do you have a driver-only cluster or do you have workers too? What's the memory usage profile at

Re: 7368396 - Apache Spark 3.5.1 (Support)

2024-06-07 Thread Sadha Chilukoori
Hi Alex, Spark is an open source software available under Apache License 2.0 ( https://www.apache.org/licenses/), further details can be found here in the FAQ page (https://spark.apache.org/faq.html). Hope this helps. Thanks, Sadha On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX wrote: >

Re: Kubernetes cluster: change log4j configuration using uploaded `--files`

2024-06-06 Thread Mich Talebzadeh
The issue you are encountering is due to the order of operations when Spark initializes the JVM for driver and executor pods. The JVM options (-Dlog4j2.configurationFile) are evaluated when the JVM starts, but the --files option copies the files after the JVM has already started. Hence, the log4j

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Also can I take my lower bound starting from 1 or is it index? On Thu, Jun 6, 2024 at 8:42 PM Perez wrote: > Thanks again Mich. It gives the clear picture but I have again couple of > doubts: > > 1) I know that there will be multiple threads that will be executed with > 10 segment sizes each

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks again Mich. It gives the clear picture but I have again couple of doubts: 1) I know that there will be multiple threads that will be executed with 10 segment sizes each until the upper bound is reached but I didn't get this part of the code exactly segments = [(i, min(i + segment_size,

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-06 Thread Someshwar Kale
As a fix, you may consider adding a transformer to rename columns (perhaps replace all columns with dot to underscore) and use the renamed columns in your pipeline as below- val renameColumn = new RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude") val si = new
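
A PySpark alternative to the custom RenameColumn transformer above (a sketch, not the thread's code; `df` is a hypothetical DataFrame with a nested struct column "location"): flatten the nested field to a dot-free top-level column before the ML stage.

    from pyspark.sql import functions as F
    from pyspark.ml.feature import StringIndexer

    # Pulling the nested field up to a flat, dot-free name sidesteps the
    # "location.longitude" vs `location.longitude` ambiguity entirely.
    flat_df = df.withColumn("location_longitude", F.col("location.longitude"))

    indexer = StringIndexer(
        inputCol="location_longitude",
        outputCol="location_longitude_idx",
        handleInvalid="keep",
    )
    indexed = indexer.fit(flat_df).transform(flat_df)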

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
well you can dynamically determine the upper bound by first querying the database to find the maximum value of the partition column and use it as the upper bound for your partitioning logic. def get_max_value(spark, mongo_config, column_name): max_value_df =
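
A hedged JDBC-flavoured sketch of the same idea (the thread also involves MongoDB; the URL, table, column and credentials below are placeholders): push a MAX() query down to the source and use the result as the upper bound.

    # Assumes an existing SparkSession `spark` and a JDBC-accessible source.
    def get_max_value(spark, jdbc_url, props, table, column_name):
        # Push a tiny aggregate down to the database and read back one row.
        query = f"(SELECT MAX({column_name}) AS max_val FROM {table}) t"
        return (
            spark.read.jdbc(url=jdbc_url, table=query, properties=props)
            .collect()[0]["max_val"]
        )

    props = {"user": "reader", "password": "secret",
             "driver": "org.postgresql.Driver"}
    upper_bound = get_max_value(
        spark, "jdbc:postgresql://dbhost:5432/sales", props, "orders", "order_id"
    )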

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks, Mich for your response. However, I have multiple doubts as below: 1) I am trying to load the data for the incremental batch so I am not sure what would be my upper bound. So what can we do? 2) So as each thread loads the desired segment size's data into a dataframe if I want to aggregate

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
Yes, partitioning and parallel loading can significantly improve the performance of data extraction from JDBC sources or databases like MongoDB. This approach can leverage Spark's distributed computing capabilities, allowing you to load data in parallel, thus speeding up the overall data loading
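
For reference, a minimal partitioned JDBC read (connection details, column and bounds are placeholder assumptions; an existing SparkSession `spark` is assumed). Spark issues numPartitions concurrent queries, each covering one slice of [lowerBound, upperBound] on partitionColumn.

    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/sales")
        .option("dbtable", "orders")
        .option("user", "reader")
        .option("password", "secret")
        .option("driver", "org.postgresql.Driver")
        .option("partitionColumn", "order_id")
        .option("lowerBound", "1")
        .option("upperBound", "1000000")
        .option("numPartitions", "10")
        .load()
    )
    print(orders.rdd.getNumPartitions())  # up to 10 parallel reads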

Re: Terabytes data processing via Glue

2024-06-05 Thread Perez
Thanks Nitin and Russel for your responses. Much appreciated. On Mon, Jun 3, 2024 at 9:47 PM Russell Jurney wrote: > You could use either Glue or Spark for your job. Use what you’re more > comfortable with. > > Thanks, > Russell Jurney @rjurney >

Re: Classification request

2024-06-04 Thread Dirk-Willem van Gulik
Actually - that answer may oversimplify things / be rather incorrect depending on the exact question of the entity that asks and the exact situation (who ships what code from where). For this reason it is probably best to refer the original poster to:

Re: Classification request

2024-06-04 Thread Artemis User
Sara, Apache Spark is open source under Apache License 2.0 (https://github.com/apache/spark/blob/master/LICENSE).  It is not under export control of any country!  Please feel free to use, reproduce and distribute, as long as your practice is compliant with the license. Having said that, some

Re: Terabytes data processing via Glue

2024-06-03 Thread Russell Jurney
You could use either Glue or Spark for your job. Use what you’re more comfortable with. Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com On Sun, Jun 2, 2024 at 9:59 PM

Re: Terabytes data processing via Glue

2024-06-02 Thread Perez
Hello, Can I get some suggestions? On Sat, Jun 1, 2024 at 1:18 PM Perez wrote: > Hi Team, > > I am planning to load and process around 2 TB historical data. For that > purpose I was planning to go ahead with Glue. > > So is it ok if I use glue if I calculate my DPUs needed correctly? or >

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Amin Mosayyebzadeh
I am reading from a single file: df = spark.read.text("s3a://test-bucket/testfile.csv") On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote: > Tell Spark to read from a single file > > data = spark.read.text("s3a://test-bucket/testfile.csv") > > This clarifies to Spark that you are dealing

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Mich Talebzadeh
Tell Spark to read from a single file data = spark.read.text("s3a://test-bucket/testfile.csv") This clarifies to Spark that you are dealing with a single file and avoids any bucket-like interpretation. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime
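
A sketch of such a single-file read against a custom (Ceph-based) S3 endpoint; the endpoint and credentials are placeholders, and hadoop-aws (plus its AWS SDK dependency) must be on the classpath for the s3a:// scheme.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("s3a-single-file")
        .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-gateway:8080")
        .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
        .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )

    df = spark.read.text("s3a://test-bucket/testfile.csv")
    df.show(truncate=False)  # should print the file's lines, not an empty list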

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-30 Thread Subhasis Mukherjee
From: Gera Shegalov Sent: Wednesday, May 29, 2024 7:57:56 am To: Prem Sahoo Cc: eab...@163.com ; Vibhor Gupta ; user @spark Subject: Re: Re: EXT: Dual Write to HDFS and MinIO in faster way I agree with the previous answers that (if requirements allow it) it is much easier

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
I will work on the first two possible causes. For the third one, which I guess is the real problem, Spark treats the testfile.csv object with the url s3a://test-bucket/testfile.csv as a bucket to access _spark_metadata with url s3a://test-bucket/testfile.csv/_spark_metadata testfile.csv is an

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
ok some observations - Spark job successfully lists the S3 bucket containing testfile.csv. - Spark job can retrieve the file size (33 Bytes) for testfile.csv. - Spark job fails to read the actual data from testfile.csv. - The printed content from testfile.csv is an empty list. -

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
The code should read the testfile.csv file from S3 and print the content. It only prints an empty list although the file has content. I have also checked our custom S3 storage (Ceph-based) logs and I see only LIST operations coming from Spark; there is no GET object operation for testfile.csv The

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
Hello, Overall, the exit code of 0 suggests a successful run of your Spark job. Analyze the intended purpose of your code and verify the output or Spark UI for further confirmation. 24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0. what to check 1. Verify

Re: [s3a] Spark is not reading s3 object content

2024-05-29 Thread Amin Mosayyebzadeh
Hi Mich, Thank you for the help and sorry about the late reply. I ran your provided but I got "exitCode 0". Here is the complete output: === 24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0 24/05/30 01:23:38 INFO SparkContext: OS info Linux,

Re: OOM concern

2024-05-29 Thread Perez
Thanks Mich for the detailed explanation. On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage > varies. For a 100 GB data transfer, the speed difference between Glue and > EMR might not be significant, especially

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-28 Thread Gera Shegalov
Tue, May 21, 2024 at 9:15 PM eab...@163.com wrote: > >> Hi, >> I think you should write to HDFS then copy file (parquet or orc) >> from HDFS to MinIO. >> >> -- >> eabour >> >> >> *From:* Prem Sahoo >>

Re: OOM concern

2024-05-28 Thread Russell Jurney
If Glue lets you take a configuration based approach, and you don't have to operate any servers as with EMR... use Glue. Try EMR if that is troublesome. Russ On Tue, May 28, 2024 at 9:23 AM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage >

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Russell mentioned some of these issues before. So in short your mileage varies. For a 100 GB data transfer, the speed difference between Glue and EMR might not be significant, especially considering the benefits of Glue's managed service aspects. However, for much larger datasets or scenarios

Re: OOM concern

2024-05-28 Thread Perez
Thanks Mich. Yes, I agree on the costing part but how would the data transfer speed be impacted? Is it because Glue takes some time to initialize underlying resources and then process the data? On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh wrote: > Your mileage varies as usual > > Glue with

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Your mileage varies as usual Glue with DPUs seems like a strong contender for your data transfer needs based on the simplicity, scalability, and managed service aspects. However, if data transfer speed is critical or costs become a concern after testing, consider EMR as an alternative. HTH Mich

Re: OOM concern

2024-05-27 Thread Perez
Thank you everyone for your response. I am not getting any errors as of now. I am just trying to choose the right tool for my task which is data loading from an external source into s3 via Glue/EMR. I think Glue job would be the best fit for me because I can calculate DPUs needed (maybe keeping

Re: Spark Protobuf Deserialization

2024-05-27 Thread Sandish Kumar HN
Did you try using to_protobuf and from_protobuf ? https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html On Mon, May 27, 2024 at 15:45 Satyam Raj wrote: > Hello guys, > We're using Spark 3.5.0 for processing Kafka source that contains protobuf > serialized data. The format is as
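
A sketch of the from_protobuf route (Spark 3.4+ with the spark-protobuf package on the classpath; broker, topic, message name and descriptor path are placeholders, and an existing SparkSession `spark` is assumed):

    from pyspark.sql.protobuf.functions import from_protobuf

    # The descriptor set would come from something like
    # `protoc --descriptor_set_out=events.desc --include_imports ...`.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
    )

    decoded = raw.select(
        from_protobuf(raw.value, "com.example.Event",
                      descFilePath="/path/to/events.desc").alias("event")
    ).select("event.*")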

Re: OOM concern

2024-05-27 Thread Russell Jurney
If you’re using EMR and Spark, you need to choose nodes with enough RAM to accommodate any given partition in your data or you can get an OOM error. Not sure if this job involves a reduce, but I would choose a single 128GB+ memory optimized instance and then adjust parallelism via the Spark

Re: OOM concern

2024-05-27 Thread Meena Rajani
What exactly is the error? Is it erroring out while reading the data from db? How are you partitioning the data? How much memory currently do you have? What is the network time out? Regards, Meena On Mon, May 27, 2024 at 4:22 PM Perez wrote: > Hi Team, > > I want to extract the data from DB

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it arrives, regardless of whether certain columns, such as the timestamp column, contain NULL values. This behavior is useful where you want to handle incomplete or missing data gracefully within your stateful processing logic.

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
rk.apache.org Subject: Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing This message contains hyperlinks, take precaution before opening these links. Few ideas on top of my head for how to go about solving the problem 1. Try with subsets: Try reproduc

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
Few ideas on top of my head for how to go about solving the problem 1. Try with subsets: Try reproducing the issue with smaller subsets of your data to pinpoint the specific operation causing the memory problems. 2. Explode or Flatten Nested Structures: If your DataFrame schema
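
A small sketch of step 2 (`df`, the struct column "address" and the array column "orders" are hypothetical names): star-expansion flattens the struct, and explode_outer gives one row per array element so downstream stages handle smaller rows.

    from pyspark.sql import functions as F

    flat = (
        df.select("id", "address.*", "orders")
          .withColumn("order", F.explode_outer("orders"))
          .drop("orders")
    )
    flat.printSchema()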

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
sorry i thought i gave an explanation The issue you are encountering with incorrect record numbers in the "ShuffleWrite Size/Records" column in the Spark DAG UI when data is read from cache/persist is a known limitation. This discrepancy arises due to the way Spark handles and reports shuffle

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Just to further clarify that the Shuffle Write Size/Records column in the Spark UI can be misleading when working with cached/persisted data because it reflects the shuffled data size and record count, not the entire cached/persisted data., So it is fair to say that this is a limitation of the

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's Shuffle Write Size/Records" column can sometimes show incorrect record counts *when data is retrieved from cache or persisted data*. This happens because the record count reflects the number of records written to disk for shuffling, and not the actual number of records in the

Re: BUG :: UI Spark

2024-05-26 Thread Sathi Chowdhury
Can you please explain how you realized it’s wrong? Did you check CloudWatch for the same metrics and compare? Also, are you using df.cache() and expecting the shuffle read/write to go away? Sent from Yahoo Mail for iPhone On Sunday, May 26, 2024, 7:53 AM, Prem Sahoo wrote: Can anyone

Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me ? On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote: > Does anyone have a clue ? > > On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > >> Hello Team, >> in spark DAG UI , we have Stages tab. Once you click on each stage you >> can view the tasks. >> >> In each

Re: Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread Mich Talebzadeh
Something like this in Python from pyspark.sql import SparkSession # Configure Spark Session with JDBC URLs spark_conf = SparkConf() \ .setAppName("SparkCatalogMultipleSources") \ .set("hive.metastore.uris", "thrift://hive1-metastore:9080,thrift://hive2-metastore:9080") jdbc_urls =
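
A related sketch (placeholder URLs, tables and credentials; an existing SparkSession `spark` is assumed): registering two different JDBC sources as temp views in one SparkSession and joining them in SQL.

    customers = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql-host:3306/shop")
        .option("dbtable", "customers")
        .option("user", "reader").option("password", "secret")
        .load()
    )
    invoices = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://pg-host:5432/billing")
        .option("dbtable", "invoices")
        .option("user", "reader").option("password", "secret")
        .load()
    )

    customers.createOrReplaceTempView("customers")
    invoices.createOrReplaceTempView("invoices")

    spark.sql("""
        SELECT c.id, c.name, SUM(i.amount) AS total
        FROM customers c JOIN invoices i ON i.customer_id = c.id
        GROUP BY c.id, c.name
    """).show()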

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-24 Thread Anil Dasari
It appears that structured streaming and Dstream have entirely different microbatch metadata representation Can someone assist me in finding the following Dstream microbatch metadata equivalent in Structured streaming. 1. microbatch timestamp : structured streaming foreachBatch gives batchID

Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue ? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > Hello Team, > in spark DAG UI , we have Stages tab. Once you click on each stage you can > view the tasks. > > In each task we have a column "ShuffleWrite Size/Records " that column > prints wrong data when it gets

Re: [s3a] Spark is not reading s3 object content

2024-05-23 Thread Mich Talebzadeh
Could be a number of reasons First test reading the file with a cli aws s3 cp s3a://input/testfile.csv . cat testfile.csv Try this code with debug option to diagnose the problem from pyspark.sql import SparkSession from pyspark.sql.utils import AnalysisException try: # Initialize Spark

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
You are right. - another question on migration. Is there a way to get the microbatch id during the microbatch dataset `transform` operation like in rdd transform? I am attempting to implement the following pseudo functionality with structured streaming. In this approach, recordCategoriesMetadata

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
With regard to this sentence *Offset Tracking with Structured Streaming:: While storing offsets in an external storage with DStreams was necessary, SSS handles this automatically through checkpointing. The checkpoints include the offsets processed by each micro-batch. However, you can still

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
Hi Anil, Ok let us put the complete picture here * Current DStreams Setup:* - Data Source: Kafka - Processing Engine: Spark DStreams - Data Transformation with Spark - Sink: S3 - Data Format: Parquet - Exactly-Once Delivery (Attempted): You're attempting exactly-once

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
The right way to associate microbatches when committing to external storage is to use the microbatch id that you can get in foreachBatch. That microbatch id guarantees that the data produced in the batch is always the same no matter the recomputations (assuming all processing logic is
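
A minimal sketch of that idempotent-commit pattern (`stream_df`, the sink path and the commit-log lookup are placeholders):

    # batch_id is stable across retries of the same micro-batch, so a sink
    # can record it and skip batches it has already committed.
    def commit_batch(batch_df, batch_id):
        already_committed = False  # look batch_id up in the sink's commit log
        if not already_committed:
            batch_df.write.mode("append").parquet("/tmp/idempotent-sink")
            # ...then record batch_id in the commit log, ideally atomically
            # with the data itself.

    query = (
        stream_df.writeStream
        .foreachBatch(commit_batch)
        .option("checkpointLocation", "/tmp/idempotent-chk")
        .start()
    )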

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Thanks Das, Mtich. Mitch, We process data from Kafka and write it to S3 in Parquet format using Dstreams. To ensure exactly-once delivery and prevent data loss, our process records micro-batch offsets to an external storage at the end of each micro-batch in foreachRDD, which is then used when the

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at the StreamingQuery.lastProgress or use the QueryProgressListener . Both of these
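
A sketch of reading offsets from a running query (`q` is assumed to be a StreamingQuery over a Kafka source):

    import json

    progress = q.lastProgress          # dict describing the latest micro-batch
    if progress:
        for src in progress["sources"]:
            print(src["description"])
            print("startOffset:", json.dumps(src["startOffset"]))
            print("endOffset:  ", json.dumps(src["endOffset"]))

    # q.recentProgress keeps a rolling window of the last few batches.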

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
OK to understand better your current model relies on streaming data input through a Kafka topic, Spark does some ETL and you send to a sink, a database or file storage like HDFS etc? Your current architecture relies on Direct Streams (DStream) and RDDs and you want to move to Spark Structured

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread ashok34...@yahoo.com.INVALID
Hello, what options are you considering yourself? On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari wrote: Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in structure

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
eabour > > > *From:* Prem Sahoo > *Date:* 2024-05-22 00:38 > *To:* Vibhor Gupta ; user > > *Subject:* Re: EXT: Dual Write to HDFS and MinIO in faster way > > > On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: > >> Hello Vibhor, >> Thanks for the sugge

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread eab...@163.com
Hi, I think you should write to HDFS then copy file (parquet or orc) from HDFS to MinIO. eabour From: Prem Sahoo Date: 2024-05-22 00:38 To: Vibhor Gupta; user Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: Hello Vibhor

Re: A handy tool called spark-column-analyser

2024-05-21 Thread ashok34...@yahoo.com.INVALID
Great work. Very handy for identifying problems thanks On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh wrote: A colleague kindly pointed out about giving an example of output which will be added to README Doing analysis for column Postcode Json formatted output {    "Postcode":

Re: A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
A colleague kindly pointed out about giving an example of output which will be added to README Doing analysis for column Postcode Json formatted output { "Postcode": { "exists": true, "num_rows": 93348, "data_type": "string", "null_count": 21921,

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: > Hello Vibhor, > Thanks for the suggestion . > I am looking for some other alternatives where I can use the same > dataframe can be written to two destinations without re execution and cache > or persist . > > Can some one

Re: pyspark dataframe join with two different data type

2024-05-17 Thread Karthick Nk
Hi All, I have tried the same result with pyspark and with SQL query by creating with tempView, I could able to achieve whereas I have to do in the pyspark code itself, Could you help on this incoming_data = [["a"], ["b"], ["d"]] column_names = ["column1"] df =

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk
Thanks Mich, I have tried this solution, but i want all the columns from the dataframe df_1, if i explode the df_1 i am getting only data column. But the resultant should get the all the column from the df_1 with distinct result like below. Results in *df:* +---+ |column1| +---+ |

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
You can use a combination of explode and distinct before joining. from pyspark.sql import SparkSession from pyspark.sql.functions import explode # Create a SparkSession spark = SparkSession.builder \ .appName("JoinExample") \ .getOrCreate() sc = spark.sparkContext # Set the log level to
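
A hedged end-to-end sketch of that approach (column names follow the thread's examples; the sample data is illustrative and an existing SparkSession `spark` is assumed):

    from pyspark.sql import functions as F

    # Assumed shapes: df_1 has a plain string column "column1";
    # df_2 has an array<string> column "data".
    df_1 = spark.createDataFrame([["a"], ["b"], ["d"]], ["column1"])
    df_2 = spark.createDataFrame([(["a", "b", "c"],), (["d"],)], ["data"])

    exploded = df_2.select(F.explode("data").alias("column1")).distinct()

    # Inner join on the exploded values keeps every column of df_1 and avoids
    # the duplicate rows an array_contains-style join can produce.
    result = df_1.join(exploded, on="column1", how="inner")
    result.show()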

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Karthick Nk
Hi All, Could anyone have any idea or suggestion of any alternate way to achieve this scenario? Thanks. On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote: > Right now, with the structure of your data, it isn't possible. > > The rows aren't duplicates of each other. "a" and "b" both exist in

Re: [spark-graphframes]: Generating incorrect edges

2024-05-11 Thread Nijland, J.G.W. (Jelle, Student M-CS)
nt M-CS) ; user@spark.apache.org Subject: Re: [spark-graphframes]: Generating incorrect edges Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Karthick Nk
Hi Mich, Thanks for the solution, but I am getting duplicate results by using array_contains. I have explained the scenario below; could you help me with how to achieve this? I have tried different ways but couldn't achieve it. For example data = [ ["a"], ["b"], ["d"], ]

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
you may consider - Increase Watermark Retention: Consider increasing the watermark retention duration. This allows keeping records for a longer period before dropping them. However, this might increase processing latency and violate at-least-once semantics if the watermark lags behind real-time.
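
A sketch of the first suggestion (`events` is a hypothetical streaming DataFrame; the column name and durations are illustrative): widening the watermark keeps late records eligible for aggregation longer, at the cost of more state and later finalization of each window.

    from pyspark.sql import functions as F

    # e.g. widened from 10 minutes to 2 hours
    agg = (
        events
        .withWatermark("eventTime", "2 hours")
        .groupBy(F.window("eventTime", "15 minutes"), "key")
        .count()
    )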

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Kartrick, Unfortunately Materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) as a feature request. Let me think

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
Thanks Mich, can you please confirm me is my understanding correct? First, we have to create the materialized view based on the mapping details we have by using multiple tables as source(since we have multiple join condition from different tables). From the materialised view we can stream the

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly Apache Spark sounds like it has nothing to do within materialised views. I was hoping it could read it! >>> *spark.sql("SELECT * FROM test.mv ").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/python/pyspark/sql/session.py", line 1440, in

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation is that using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI |

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views,(MV) but it can work with them through integration with the underlying data storage systems like Hive or through iceberg. I believe databricks through unity catalog support MVs as
