Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-08 Thread Chhavi Bansal
Hi Someshwar, Thanks for the response, I have added my comments to the ticket. Thanks, Chhavi Bansal On Thu, 6 Jun 2024 at 17:28, Someshwar Kale wrote: > As a fix, you may consider adding a transformer to rename columns (perhaps > replace

Re: OOM issue in Spark Driver

2024-06-08 Thread Andrzej Zera
Hey, do you perform stateful operations? Maybe your state is growing indefinitely - a screenshot with state metrics would help (you can find it in Spark UI -> Structured Streaming -> your query). Do you have a driver-only cluster or do you have workers too? What's the memory usage profile at

Re: 7368396 - Apache Spark 3.5.1 (Support)

2024-06-07 Thread Sadha Chilukoori
Hi Alex, Spark is an open source software available under Apache License 2.0 ( https://www.apache.org/licenses/), further details can be found here in the FAQ page (https://spark.apache.org/faq.html). Hope this helps. Thanks, Sadha On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX wrote: >

Re: Kubernetes cluster: change log4j configuration using uploaded `--files`

2024-06-06 Thread Mich Talebzadeh
The issue you are encountering is due to the order of operations when Spark initializes the JVM for driver and executor pods. The JVM options (-Dlog4j2.configurationFile) are evaluated when the JVM starts, but the --files option copies the files after the JVM has already started. Hence, the log4j

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Also, can I take my lower bound starting from 1, or is it the index? On Thu, Jun 6, 2024 at 8:42 PM Perez wrote: > Thanks again Mich. It gives the clear picture but I have again couple of > doubts: > > 1) I know that there will be multiple threads that will be executed with > 10 segment sizes each

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks again, Mich. That gives a clear picture, but I again have a couple of doubts: 1) I know that there will be multiple threads executed with 10 segment sizes each until the upper bound is reached, but I didn't get this part of the code exactly: segments = [(i, min(i + segment_size,

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-06 Thread Someshwar Kale
As a fix, you may consider adding a transformer to rename columns (perhaps replace the dot in all column names with an underscore) and use the renamed columns in your pipeline as below: val renameColumn = new RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude") val si = new
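A minimal PySpark sketch of the same workaround; RenameColumn above appears to be a custom transformer, so this sketch simply copies the nested field to a flat column before the MLlib stage (column and stage names here are illustrative, not from the thread):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("nested-column-workaround").getOrCreate()

df = spark.createDataFrame(
    [((10.0, 20.0), "a"), ((30.0, 40.0), "b")],
    "location struct<longitude:double,latitude:double>, label string",
)

# Copy the nested field to a top-level column whose name contains no dot
flat_df = df.withColumn("location_longitude", col("location.longitude"))

# Downstream MLlib stages can now reference the flat column name
indexer = StringIndexer(inputCol="label", outputCol="label_idx")
indexer.fit(flat_df).transform(flat_df).select("location_longitude", "label_idx").show()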

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
Well, you can dynamically determine the upper bound by first querying the database to find the maximum value of the partition column and using it as the upper bound for your partitioning logic. def get_max_value(spark, mongo_config, column_name): max_value_df =
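The quoted snippet is truncated; below is a hedged sketch of the same idea against a generic JDBC source (the connection details, table and column names are placeholders, not from the thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-upper-bound").getOrCreate()

jdbc_opts = {
    "url": "jdbc:postgresql://dbhost:5432/sales",   # placeholder connection details
    "user": "app_user",
    "password": "secret",
    "driver": "org.postgresql.Driver",
}

def get_max_value(spark, opts, table, column_name):
    # Push a tiny aggregate query down to the database
    max_df = (spark.read.format("jdbc")
              .options(**opts)
              .option("query", f"SELECT MAX({column_name}) AS max_val FROM {table}")
              .load())
    return max_df.collect()[0]["max_val"]

upper_bound = get_max_value(spark, jdbc_opts, "orders", "order_id")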

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks, Mich, for your response. However, I have multiple doubts, as below: 1) I am trying to load the data for an incremental batch, so I am not sure what my upper bound would be. So what can we do? 2) As each thread loads the desired segment size's data into a dataframe, if I want to aggregate

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
Yes, partitioning and parallel loading can significantly improve the performance of data extraction from JDBC sources or databases like MongoDB. This approach can leverage Spark's distributed computing capabilities, allowing you to load data in parallel, thus speeding up the overall data loading
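A minimal sketch of such a partitioned, parallel JDBC read (connection details and bounds are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")   # placeholder
      .option("dbtable", "orders")
      .option("user", "app_user")
      .option("password", "secret")
      .option("partitionColumn", "order_id")   # numeric, date or timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "10")           # ~10 parallel queries, one per segment
      .load())

print(df.rdd.getNumPartitions())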

Re: Terabytes data processing via Glue

2024-06-05 Thread Perez
Thanks Nitin and Russel for your responses. Much appreciated. On Mon, Jun 3, 2024 at 9:47 PM Russell Jurney wrote: > You could use either Glue or Spark for your job. Use what you’re more > comfortable with. > > Thanks, > Russell Jurney @rjurney >

Re: Classification request

2024-06-04 Thread Dirk-Willem van Gulik
Actually - that answer may oversimplify things / be rather incorrect depending on the exact question of the entity that asks and the exact situation (who ships what code from where). For this reason it is probably best to refer the original poster to:

Re: Classification request

2024-06-04 Thread Artemis User
Sara, Apache Spark is open source under Apache License 2.0 (https://github.com/apache/spark/blob/master/LICENSE).  It is not under export control of any country!  Please feel free to use, reproduce and distribute, as long as your practice is compliant with the license. Having said that, some

Re: Terabytes data processing via Glue

2024-06-03 Thread Russell Jurney
You could use either Glue or Spark for your job. Use what you’re more comfortable with. Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com On Sun, Jun 2, 2024 at 9:59 PM

Re: Terabytes data processing via Glue

2024-06-02 Thread Perez
Hello, Can I get some suggestions? On Sat, Jun 1, 2024 at 1:18 PM Perez wrote: > Hi Team, > > I am planning to load and process around 2 TB historical data. For that > purpose I was planning to go ahead with Glue. > > So is it ok if I use glue if I calculate my DPUs needed correctly? or >

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Amin Mosayyebzadeh
I am reading from a single file: df = spark.read.text("s3a://test-bucket/testfile.csv") On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote: > Tell Spark to read from a single file > > data = spark.read.text("s3a://test-bucket/testfile.csv") > > This clarifies to Spark that you are dealing

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Mich Talebzadeh
Tell Spark to read from a single file data = spark.read.text("s3a://test-bucket/testfile.csv") This clarifies to Spark that you are dealing with a single file and avoids any bucket-like interpretation. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-30 Thread Subhasis Mukherjee
I agree with the previous answers that (if requirements allow it) it is much easier

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
I will work on the first two possible causes. For the third one, which I guess is the real problem, Spark treats the testfile.csv object with the url s3a://test-bucket/testfile.csv as a bucket to access _spark_metadata with url s3a://test-bucket/testfile.csv/_spark_metadata testfile.csv is an

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
ok some observations - Spark job successfully lists the S3 bucket containing testfile.csv. - Spark job can retrieve the file size (33 Bytes) for testfile.csv. - Spark job fails to read the actual data from testfile.csv. - The printed content from testfile.csv is an empty list. -

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
The code should read the testfile.csv file from S3 and print the content. It only prints an empty list although the file has content. I have also checked our custom S3 storage (Ceph based) logs and I see only LIST operations coming from Spark; there is no GET object operation for testfile.csv. The

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
Hello, Overall, the exit code of 0 suggests a successful run of your Spark job. Analyze the intended purpose of your code and verify the output or Spark UI for further confirmation. 24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0. what to check 1. Verify

Re: [s3a] Spark is not reading s3 object content

2024-05-29 Thread Amin Mosayyebzadeh
Hi Mich, Thank you for the help and sorry about the late reply. I ran your provided code, but I got "exitCode 0". Here is the complete output: === 24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0 24/05/30 01:23:38 INFO SparkContext: OS info Linux,

Re: OOM concern

2024-05-29 Thread Perez
Thanks Mich for the detailed explanation. On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage > varies. For a 100 GB data transfer, the speed difference between Glue and > EMR might not be significant, especially

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-28 Thread Gera Shegalov
Tue, May 21, 2024 at 9:15 PM eab...@163.com wrote: > >> Hi, >> I think you should write to HDFS then copy file (parquet or orc) >> from HDFS to MinIO. >> >> -- >> eabour >> >> >> *From:* Prem Sahoo >>

Re: OOM concern

2024-05-28 Thread Russell Jurney
If Glue lets you take a configuration based approach, and you don't have to operate any servers as with EMR... use Glue. Try EMR if that is troublesome. Russ On Tue, May 28, 2024 at 9:23 AM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage >

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Russell mentioned some of these issues before. So in short your mileage varies. For a 100 GB data transfer, the speed difference between Glue and EMR might not be significant, especially considering the benefits of Glue's managed service aspects. However, for much larger datasets or scenarios

Re: OOM concern

2024-05-28 Thread Perez
Thanks Mich. Yes, I agree on the costing part, but how would the data transfer speed be impacted? Is it because Glue takes some time to initialize underlying resources and then process the data? On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh wrote: > Your mileage varies as usual > > Glue with

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Your mileage varies as usual Glue with DPUs seems like a strong contender for your data transfer needs based on the simplicity, scalability, and managed service aspects. However, if data transfer speed is critical or costs become a concern after testing, consider EMR as an alternative. HTH Mich

Re: OOM concern

2024-05-27 Thread Perez
Thank you everyone for your response. I am not getting any errors as of now. I am just trying to choose the right tool for my task which is data loading from an external source into s3 via Glue/EMR. I think Glue job would be the best fit for me because I can calculate DPUs needed (maybe keeping

Re: Spark Protobuf Deserialization

2024-05-27 Thread Sandish Kumar HN
Did you try using to_protobuf and from_protobuf ? https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html On Mon, May 27, 2024 at 15:45 Satyam Raj wrote: > Hello guys, > We're using Spark 3.5.0 for processing Kafka source that contains protobuf > serialized data. The format is as
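A short sketch of the documented from_protobuf/to_protobuf round trip (Spark 3.4+), assuming the spark-protobuf and Kafka connector packages are on the classpath; the broker, topic, message name and descriptor path are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf, to_protobuf

spark = SparkSession.builder.appName("protobuf-round-trip").getOrCreate()

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Deserialize the binary Kafka value into a struct column
parsed = raw.select(
    from_protobuf("value", "Event", descFilePath="/path/to/events.desc").alias("event")
)

# Serialize the struct back to protobuf bytes if needed
reserialized = parsed.select(
    to_protobuf("event", "Event", descFilePath="/path/to/events.desc").alias("value")
)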

Re: OOM concern

2024-05-27 Thread Russell Jurney
If you’re using EMR and Spark, you need to choose nodes with enough RAM to accommodate any given partition in your data, or you can get an OOM error. Not sure if this job involves a reduce, but I would choose a single 128GB+ memory-optimized instance and then adjust parallelism via the Spark

Re: OOM concern

2024-05-27 Thread Meena Rajani
What exactly is the error? Is it erroring out while reading the data from db? How are you partitioning the data? How much memory currently do you have? What is the network time out? Regards, Meena On Mon, May 27, 2024 at 4:22 PM Perez wrote: > Hi Team, > > I want to extract the data from DB

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it arrives, regardless of whether certain columns, such as the timestamp column, contain NULL values. This behavior is useful where you want to handle incomplete or missing data gracefully within your stateful processing logic.

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
Few ideas on top of my head for how to go about solving the problem 1. Try with subsets: Try reproduc

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
Few ideas on top of my head for how to go about solving the problem 1. Try with subsets: Try reproducing the issue with smaller subsets of your data to pinpoint the specific operation causing the memory problems. 2. Explode or Flatten Nested Structures: If your DataFrame schema

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Sorry, I thought I gave an explanation. The issue you are encountering with incorrect record numbers in the "Shuffle Write Size/Records" column in the Spark DAG UI when data is read from cache/persist is a known limitation. This discrepancy arises due to the way Spark handles and reports shuffle

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Just to further clarify that the "Shuffle Write Size/Records" column in the Spark UI can be misleading when working with cached/persisted data because it reflects the shuffled data size and record count, not the entire cached/persisted data. So it is fair to say that this is a limitation of the

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show incorrect record counts *when data is retrieved from cache or persisted data*. This happens because the record count reflects the number of records written to disk for shuffling, and not the actual number of records in the

Re: BUG :: UI Spark

2024-05-26 Thread Sathi Chowdhury
Can you please explain how you realized it's wrong? Did you check CloudWatch for the same metrics and compare? Also, are you using df.cache() and expecting the shuffle read/write to go away? Sent from Yahoo Mail for iPhone On Sunday, May 26, 2024, 7:53 AM, Prem Sahoo wrote: Can anyone

Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me ? On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote: > Does anyone have a clue ? > > On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > >> Hello Team, >> in spark DAG UI , we have Stages tab. Once you click on each stage you >> can view the tasks. >> >> In each

Re: Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread Mich Talebzadeh
Something like this in Python: from pyspark import SparkConf from pyspark.sql import SparkSession # Configure Spark Session with JDBC URLs spark_conf = SparkConf() \ .setAppName("SparkCatalogMultipleSources") \ .set("hive.metastore.uris", "thrift://hive1-metastore:9080,thrift://hive2-metastore:9080") jdbc_urls =

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-24 Thread Anil Dasari
It appears that Structured Streaming and DStream have entirely different microbatch metadata representations. Can someone assist me in finding the following DStream microbatch metadata equivalents in Structured Streaming? 1. microbatch timestamp: Structured Streaming foreachBatch gives batchID

Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue ? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > Hello Team, > in spark DAG UI , we have Stages tab. Once you click on each stage you can > view the tasks. > > In each task we have a column "ShuffleWrite Size/Records " that column > prints wrong data when it gets

Re: [s3a] Spark is not reading s3 object content

2024-05-23 Thread Mich Talebzadeh
Could be a number of reasons First test reading the file with a cli aws s3 cp s3a://input/testfile.csv . cat testfile.csv Try this code with debug option to diagnose the problem from pyspark.sql import SparkSession from pyspark.sql.utils import AnalysisException try: # Initialize Spark

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
You are right. Another question on migration: is there a way to get the microbatch id during the microbatch dataset `transform` operation, like in RDD transform? I am attempting to implement the following pseudo functionality with Structured Streaming. In this approach, recordCategoriesMetadata

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
With regard to this sentence: *Offset Tracking with Structured Streaming:* While storing offsets in external storage with DStreams was necessary, SSS handles this automatically through checkpointing. The checkpoints include the offsets processed by each micro-batch. However, you can still
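A minimal sketch of relying on checkpointing for offset tracking (the Kafka connector package is assumed; broker, topic and paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sss-offset-checkpointing").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

query = (stream.writeStream
         .format("parquet")
         .option("path", "s3a://bucket/output/")
         .option("checkpointLocation", "s3a://bucket/checkpoints/events/")  # offsets tracked here
         .start())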

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
Hi Anil, OK, let us put the complete picture here. *Current DStreams Setup:* - Data Source: Kafka - Processing Engine: Spark DStreams - Data Transformation with Spark - Sink: S3 - Data Format: Parquet - Exactly-Once Delivery (Attempted): You're attempting exactly-once

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
The right way to associate microbatches when committing to external storage is to use the microbatch id that you can get in foreachBatch. That microbatch id guarantees that the data produced in the batch is always the same no matter the recomputations (assuming all processing logic is
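A small sketch of using that batch id for idempotent writes (the Kafka connector package is assumed; source and sink locations are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachbatch-id").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

def write_batch(batch_df, batch_id):
    # batch_id is stable across retries of the same micro-batch, so it can be
    # used to make the external write idempotent
    batch_df.write.mode("overwrite").parquet(f"s3a://bucket/output/batch_id={batch_id}/")

query = (stream.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3a://bucket/checkpoints/events/")
         .start())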

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Thanks Das, Mtich. Mitch, We process data from Kafka and write it to S3 in Parquet format using Dstreams. To ensure exactly-once delivery and prevent data loss, our process records micro-batch offsets to an external storage at the end of each micro-batch in foreachRDD, which is then used when the

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at the StreamingQuery.lastProgress or use the QueryProgressListener . Both of these
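A minimal sketch of reading offsets from lastProgress, which is a dict in PySpark (the rate/noop source and sink are just for illustration):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offset-progress").getOrCreate()

query = (spark.readStream.format("rate").load()
         .writeStream.format("noop").start())

time.sleep(5)                       # let a couple of micro-batches complete
progress = query.lastProgress       # None until the first batch finishes
if progress:
    for source in progress["sources"]:
        print(source["description"], source["startOffset"], source["endOffset"])
query.stop()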

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
OK, to understand better: your current model relies on streaming data input through a Kafka topic, Spark does some ETL, and you send it to a sink, a database or file storage like HDFS etc.? Your current architecture relies on Direct Streams (DStream) and RDDs and you want to move to Spark Structured

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread ashok34...@yahoo.com.INVALID
Hello, what options are you considering yourself? On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari wrote: Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in structure

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
eabour > > > *From:* Prem Sahoo > *Date:* 2024-05-22 00:38 > *To:* Vibhor Gupta ; user > > *Subject:* Re: EXT: Dual Write to HDFS and MinIO in faster way > > > On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: > >> Hello Vibhor, >> Thanks for the sugge

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread eab...@163.com
Hi, I think you should write to HDFS then copy file (parquet or orc) from HDFS to MinIO. eabour From: Prem Sahoo Date: 2024-05-22 00:38 To: Vibhor Gupta; user Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: Hello Vibhor

Re: A handy tool called spark-column-analyser

2024-05-21 Thread ashok34...@yahoo.com.INVALID
Great work. Very handy for identifying problems, thanks. On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh wrote: A colleague kindly suggested giving an example of the output, which will be added to the README. Doing analysis for column Postcode Json formatted output {    "Postcode":

Re: A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
A colleague kindly suggested giving an example of the output, which will be added to the README. Doing analysis for column Postcode Json formatted output { "Postcode": { "exists": true, "num_rows": 93348, "data_type": "string", "null_count": 21921,

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: > Hello Vibhor, > Thanks for the suggestion . > I am looking for some other alternatives where I can use the same > dataframe can be written to two destinations without re execution and cache > or persist . > > Can some one

Re: pyspark dataframe join with two different data type

2024-05-17 Thread Karthick Nk
Hi All, I have tried to get the same result with PySpark and with a SQL query by creating a tempView; I could achieve it that way, whereas I have to do it in the PySpark code itself. Could you help with this? incoming_data = [["a"], ["b"], ["d"]] column_names = ["column1"] df =

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk
Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1; if I explode df_1 I am getting only the data column. But the result should have all the columns from df_1 with distinct results, like below. Results in *df:* +---+ |column1| +---+ |

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
You can use a combination of explode and distinct before joining. from pyspark.sql import SparkSession from pyspark.sql.functions import explode # Create a SparkSession spark = SparkSession.builder \ .appName("JoinExample") \ .getOrCreate() sc = spark.sparkContext # Set the log level to
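A hedged completion of the truncated snippet above, with illustrative column names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("JoinExample").getOrCreate()

incoming = spark.createDataFrame([["a"], ["b"], ["d"]], ["column1"])
df_1 = spark.createDataFrame([(["a", "b"],), (["c"],)], ["data"])

# Explode the array column, de-duplicate, then join on the scalar values
exploded = df_1.select(explode("data").alias("column1")).distinct()
incoming.join(exploded, on="column1", how="inner").show()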

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Karthick Nk
Hi All, Does anyone have an idea or suggestion of an alternate way to achieve this scenario? Thanks. On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote: > Right now, with the structure of your data, it isn't possible. > > The rows aren't duplicates of each other. "a" and "b" both exist in

Re: [spark-graphframes]: Generating incorrect edges

2024-05-11 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Karthick Nk
Hi Mich, Thanks for the solution, but I am getting duplicate results by using array_contains. I have explained the scenario below; could you help me with how we can achieve it? I have tried different ways but could not achieve it. For example data = [ ["a"], ["b"], ["d"], ]

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
you may consider - Increase Watermark Retention: Consider increasing the watermark retention duration. This allows keeping records for a longer period before dropping them. However, this might increase processing latency and violate at-least-once semantics if the watermark lags behind real-time.
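A minimal sketch of widening the watermark delay so late records are retained longer before being dropped (the column names and the 2-hour threshold are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("watermark-retention").getOrCreate()

events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time"))

aggregated = (events
              .withWatermark("event_time", "2 hours")   # widened from e.g. "10 minutes"
              .groupBy(window("event_time", "5 minutes"))
              .agg(count("*").alias("cnt")))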

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Karthick, Unfortunately materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) as a feature request. Let me think

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
Thanks Mich, can you please confirm whether my understanding is correct? First, we have to create the materialized view based on the mapping details we have, using multiple tables as the source (since we have multiple join conditions across different tables). From the materialised view we can stream the

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly, it sounds like Apache Spark has nothing to do with materialised views. I was hoping it could read them! >>> *spark.sql("SELECT * FROM test.mv ").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/python/pyspark/sql/session.py", line 1440, in

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation: using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI |

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks, through Unity Catalog, supports MVs as

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is dev@ related thread but more about "question") My understanding is that Apache Spark does not support Materialized View. That's all. IMHO it's not a proper expectation that all operations in Apache Hive will be supported in Apache Spark. They are different

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark and that is why you are running to the issue with DROP first. There is

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are on edge so to speak. If the Spark application runs for a very long time or encounters restarts, the

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can

Re: spark.sql.shuffle.partitions=auto

2024-04-30 Thread Mich Talebzadeh
spark.sql.shuffle.partitions=auto Because Apache Spark does not build clusters. This configuration option is specific to Databricks, with their managed Spark offering. It allows Databricks to automatically determine an optimal number of shuffle partitions for your workload. HTH Mich Talebzadeh,
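For comparison, in open-source Spark a similar effect comes from Adaptive Query Execution coalescing shuffle partitions at runtime; a minimal sketch:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe-shuffle-partitions")
         .config("spark.sql.adaptive.enabled", "true")                      # AQE on
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")   # merge small shuffle partitions
         .getOrCreate())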

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi, In k8s the driver is responsible for executor creation. The likelihood of your problem is that Insufficient memory allocated for executors in the K8s cluster. Even with dynamic allocation, k8s won't schedule executor pods if there is not enough free memory to fulfill their resource requests.

Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
Mitch, you are right these days the attention span is getting shorter. Christian could work on a completely new thing for 3 hours and is proud to explain. It is amazing. Thanks for sharing. On Sat, Apr 27, 2024 at 9:40 PM Farshid Ashouri wrote: > Mich, this is absolutely amazing. > > Thanks

Re: Python for the kids and now PySpark

2024-04-27 Thread Farshid Ashouri
Mich, this is absolutely amazing. Thanks for sharing. On Sat, 27 Apr 2024, 22:26 Mich Talebzadeh, wrote: > Python for the kids. Slightly off-topic but worthwhile sharing. > > One of the things that may benefit kids is starting to learn something > new. Basically anything that can focus their

Re: [spark-graphframes]: Generating incorrect edges

2024-04-25 Thread Nijland, J.G.W. (Jelle, Student M-CS)
"128G" ).set("spark.executor.memoryOverhead", "32G" ).set("spark.driver.cores", "16" ).set("spark.driver.memory", "64G" ) I dont think b) applies as its a single machine. Kind regards, Jelle Fr

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
o 100K records. > Once I go past that amount of records the results become inconsistent and > incorrect. > > Kind regards, > Jelle Nijland > > > -- > *From:* Mich Talebzadeh > *Sent:* Wednesday, April 24, 2024 4:40 PM > *To:* Nijland, J.G.W. (Jelle, Student M-CS) < > j.g.

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
OK few observations 1) ID Generation Method: How are you generating unique IDs (UUIDs, seque

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
ithub issue of someone with significantly larger graphs >have similar issues. >One suggestion there blamed indexing using strings rather than ints or >longs. >I rewrote my system to use int for IDs but I ran into the same issue. >The amount of incorrect edges wa

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, that is at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table query and CTE query against MSSQL server since what you send in is not actually what
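A minimal sketch of the prepareQuery/query combination against SQL Server (Spark 3.4+; the connection details and the CTE are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mssql-preparequery").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")   # placeholder
      .option("user", "app_user")
      .option("password", "secret")
      # prepareQuery is prepended to query to form the final statement sent to the server
      .option("prepareQuery", "WITH recent AS (SELECT * FROM orders WHERE order_date > '2024-01-01')")
      .option("query", "SELECT * FROM recent")
      .load())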

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Thanks, Mich, for your reply. I agree, it is not so scalable and efficient. But it works correctly for Kafka transactions, and there is no problem with committing offsets to Kafka asynchronously for now. Let me give you some more details about my streaming job. CustomReceiver does not receive anything

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously. HTH Mich Talebzadeh, Technologist |

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because Spark Streaming for Kafka transactions does not work correctly to suit my need, I moved to another approach using a raw Kafka consumer which handles read_committed messages from Kafka correctly. My code looks like the following. JavaDStream stream = ssc.receiverStream(new CustomReceiver());

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
Thank you Mich for your reply. Actually, I tried to do most of your advice. When spark.streaming.kafka.allowNonConsecutiveOffsets=false, I got the following error. Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 3)

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
Hi Kidong, There may be a few potential reasons why the message counts from your Kafka producer and Spark Streaming consumer might not match, especially with transactional messages and the read_committed isolation level. 1) Just ensure that both your Spark Streaming job and the Kafka consumer written
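For comparison, in Structured Streaming the Kafka source passes kafka.-prefixed options through to the underlying consumer, so read_committed isolation can be requested directly; a hedged sketch with placeholder broker and topic names (the Kafka connector package is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-read-committed").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "tx-topic")
      .option("kafka.isolation.level", "read_committed")   # only committed transactional records
      .option("startingOffsets", "earliest")
      .load())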

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage Spark supports using volumes to spill
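A minimal sketch of those documented volume options, shown as in-code config for brevity (they are normally passed as --conf on spark-submit; the claim settings, storage class and sizes are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("k8s-pvc-local-storage")
         # Mount an on-demand PVC as a spark-local-dir-* volume so shuffle/spill data lands on it
         .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName", "OnDemand")
         .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass", "gp3")
         .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit", "500Gi")
         .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path", "/data")
         .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly", "false")
         .getOrCreate())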

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared also

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
" In the end for my usecase I started using pvcs and pvc aware scheduling along with decommissioning. So far performance is good with this choice." How did you do this? tor. 11. apr. 2024 kl. 04:13 skrev Arun Ravi : > Hi Everyone, > > I had to explored IBM's and AWS's S3 shuffle plugins (some

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, I had explored IBM's and AWS's S3 shuffle plugins (some time back). I had also explored AWS FSx Lustre in a few of my production jobs which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase, however I

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) # Define schema for parsing Kafka messages schema = StructType([
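A hedged sketch of the general pattern discussed in the ticket: define the watermark on the DataFrame, register the temp view from that watermarked DataFrame, then run .sql() on it (the rate source, column names and paths are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-with-watermark").getOrCreate()

events = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time")
          .withWatermark("event_time", "10 minutes"))

# Register the view from the already-watermarked DataFrame
events.createOrReplaceTempView("events")

windowed = spark.sql("""
    SELECT window(event_time, '5 minutes') AS w, count(*) AS cnt
    FROM events
    GROUP BY window(event_time, '5 minutes')
""")

query = (windowed.writeStream
         .outputMode("append")
         .format("console")
         .option("checkpointLocation", "/tmp/chk-sql-watermark")
         .start())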

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry this is not a bug but essentially a user error. Spark throws a really confusing error and I'm also confused. Please see the reply in the ticket for how to make things correct. https://issues.apache.org/jira/browse/SPARK-47718 刘唯 wrote on Sat, 6 Apr 2024 at 11:41: > This indeed looks like a bug. I will

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Femi Anthony
If you're using just Spark you could try turning on the history server and try to glean statistics from there. But there is no one location or log file which stores them all. Databricks, which is a managed Spark solution, provides such

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition with

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
Well, you can do a fair bit with the available tools. The Spark UI, particularly the Stages and Executors tabs, does provide some valuable insights related to database health metrics for applications using a JDBC source. Stage Overview: This section provides a summary of all the stages executed

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi, First thanks everyone for their contributions I was going to reply to @Enrico Minack but noticed additional info. As I understand for example, Apache Uniffle is an incubating project aimed at providing a pluggable shuffle service for Spark. So basically, all these "external shuffle

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In case someone is using S3/HDFS, I wonder what the advantages would be of using Celeborn or Uniffle vs the IBM shuffle service plugin or the Cloud Shuffle Storage Plugin from AWS
