Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
", "5 minutes")). \ avg('temperature') - Write to Sink: Write the filtered records (dropped records) to a separate Kafka topic. - Consume and Store: Consume the dropped-records topic with another streaming job and store them in a Postgres table or S3 usin
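The routing described in this snippet can be sketched without Spark as a pure-Python cutoff rule. This is only an illustration of how a 5-minute watermark classifies late records; the record layout and the split_by_watermark helper are hypothetical, and real Structured Streaming tracks the watermark across micro-batches rather than within a single batch of records:

```python
from datetime import datetime, timedelta

def split_by_watermark(records, delay_minutes=5):
    """Partition records into (kept, dropped) the way a 5-minute
    watermark would: anything older than max(event_time) - delay
    is considered late and eligible for the side (dropped) topic."""
    max_event_time = max(r["event_time"] for r in records)
    cutoff = max_event_time - timedelta(minutes=delay_minutes)
    kept = [r for r in records if r["event_time"] >= cutoff]
    dropped = [r for r in records if r["event_time"] < cutoff]
    return kept, dropped

records = [
    {"id": 1, "event_time": datetime(2024, 5, 8, 12, 0)},
    {"id": 2, "event_time": datetime(2024, 5, 8, 12, 4)},
    {"id": 3, "event_time": datetime(2024, 5, 8, 11, 50)},  # more than 5 min late
]
kept, dropped = split_by_watermark(records)
```

In the real pipeline, the dropped list is what would be serialised and written to the side Kafka topic for later storage in Postgres or S3.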

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
think about another way and revert HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
a look at the ticket and add your comments. Thanks Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provided is correct to the best

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
nCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed . It is essentia

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation: using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
hat using materialized views with Spark Structured Streaming and Change Data Capture (CDC) is a potential solution for efficiently streaming view data updates in this scenario. Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view m

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
ilar issue or if there are any insights into why this discrepancy exists between Spark SQL and Hive. Thanks Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
, the monotonically_increasing_id() sequence might restart from the beginning. This could again cause duplicate IDs if other Spark applications are running concurrently or if data is processed across multiple runs of the same application. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer

Re: spark.sql.shuffle.partitions=auto

2024-04-30 Thread Mich Talebzadeh
spark.sql.shuffle.partitions=auto does not work because vanilla Apache Spark does not support it. This configuration option is specific to Databricks and its managed Spark offering. It allows Databricks to automatically determine an optimal number of shuffle partitions for your workload. HTH Mich Talebzadeh

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
ces are limited (memory, CPU). b) Data Skew: Uneven distribution of values in certain columns could lead to imbalanced processing across machines. Check the Spark UI (port 4040) Stages and Executors tabs. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London Uni

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
creation? Are joins matching columns correctly? 4) Specific Edge Issues: Can you share examples of vertex IDs with incorrect connections? Is this related to ID generation or edge creation logic? HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI, FinCrime London United

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously. HTH Mich Talebzadeh, Technologist

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
ssages. HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The informat

Spark column headings, camelCase or snake case?

2024-04-11 Thread Mich Talebzadeh
Now I recently saw a note (if I recall correctly) that Spark should be using camelCase in new Spark-related documents. What are the accepted views or does it matter? Thanks Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom
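Whichever convention the documentation settles on, column headings can be converted mechanically between the two styles. A small sketch (the helper names are hypothetical, not part of any Spark API):

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a camelCase column name to snake_case, e.g. eventTime -> event_time."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

def snake_to_camel(name: str) -> str:
    """Convert a snake_case column name to camelCase, e.g. event_time -> eventTime."""
    head, *rest = name.split("_")
    return head + "".join(w.capitalize() for w in rest)
```

With helpers like these, a whole DataFrame's columns can be renamed in one pass regardless of which style the upstream source used.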

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
tes')
    ORDER BY window.start
"""

# Write the aggregated results to Kafka sink
stream = session.sql(query) \
    .writeStream \
    .format("kafka") \
    .option("checkpointLocation", "checkpoint") \
    .option("kafka.bootstrap.servers", "localhost:9092

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
, numBytes) => host }.toArray } } HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
your work. say with Oracle (as an example), utilise tools like OEM, VM StatPack, SQL*Plus scripts etc or third-party monitoring tools to collect detailed database health metrics directly from the Oracle database server. HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Gen

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
anks Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is cor

Re: Idiomatic way to rate-limit streaming sources to avoid OutOfMemoryError?

2024-04-07 Thread Mich Talebzadeh
thin the trigger interval, preventing backlogs and potential OOM issues. From the Spark UI, look at the streaming tab. There are various statistics there. In general, your Processing Time has to be less than your batch interval. The Scheduling Delay and Total Delay are additional indicators of

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look. Cheers Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywi

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
a Kubernetes cluster. They can include these configurations in the Spark application code or pass them as command-line arguments or environment variables during application submission. HTH Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
better performance and scalability for handling larger datasets efficiently. Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
with these file systems come into it. I will be interested in hearing more about any progress on this. Thanks. Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
n window-generated 'start'

# Rest of the code remains the same
streaming_df.createOrReplaceTempView("streaming_df")
spark.sql("""
    SELECT window.start, window.end, provinceId, totalPayAmount
    FROM streaming_df
    ORDER BY window.start
""") \
    .writeStream \
    .format

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
    GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
    ORDER BY window.start
""")
    .writeStream
    .format("kafka")
    .option("checkpointLocation", "checkpoint")
    .option("kafka.bootstr

Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry from this link Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn <https://www.linkedin.com/pulse/leveraging-generative-ai-apache-spark-transforming-mich-lxbte/?trackingId=aqZMBOg4O1KYRB4Una7NEg%3D%3D> Mich Talebzadeh, Technologist | Data | Generat

Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
You may find this link of mine in Linkedin for the said article. We can use Linkedin for now. Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn Mich Talebzadeh, Technologist | Data | Generative AI | Financial Fraud London United Kingdom view my Linkedin

Re:

2024-03-21 Thread Mich Talebzadeh
g("MDVariables.targetDataset"), config.getString("MDVariables.targetTable"))
    df.unpersist()
    // println("wrote to DB")
  } else {
    println("DataFrame df is empty")
  }
}

If the DataFrame is empty, it prints a message indicating that the DataFrame is empty. You

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
ertain this idea. They seem to have a well defined structure for hosting topics. Let me know your thoughts Thanks <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub> Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
be that the information (topics) are provided as best efforts and cannot be guaranteed. Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywi

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
- Databricks <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub> Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/&

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
+1 for me Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The information provided is c

Re: [GraphX]: Prevent recomputation of DAG

2024-03-18 Thread Mich Talebzadeh
tools like Spark UI or third-party libraries for this purpose. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provided is correct to the best

A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
hat should not be that difficult. If anyone is supportive of this proposal, let the usual +1, 0, -1 decide HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The informatio

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
such as pipelining transformations and removing unnecessary computations. "I may need something like that for synthetic data for testing. Any way to do that ?" Have a look at this: https://github.com/joke2k/faker <https://github.com/joke2k/faker> HTH Mich Talebzadeh, Dad | Technologist | Sol

Python library that generates fake data using Faker

2024-03-16 Thread Mich Talebzadeh
and fraudulent transactions to build a machine learning model to detect fraudulent transactions using PySpark's MLlib library. You can install it via pip install Faker Details from https://github.com/joke2k/faker HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
You can get additional info from Spark UI default port 4040 tabs like SQL and executors - Spark uses Catalyst optimiser for efficient execution plans. df.explain("extended") shows both logical and physical plans HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect |

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
edRowsPerSecond")) fails because it uses the .get() method, while the second line (processedRowsPerSecond = microbatch_data.processedRowsPerSecond) accesses the attribute directly. In short, they need to ensure that event.progress returns a dictionary. Cheers Mich Talebzadeh, Dad | Tec
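The .get() versus attribute-access mismatch described here can be reproduced without Spark. In this sketch, Progress is a hypothetical stand-in for the streaming progress object, not the real PySpark class; the point is only that .get() exists on dicts but not on plain objects:

```python
class Progress:
    """Stand-in for a streaming progress object: attribute access works,
    dict-style .get() raises AttributeError."""
    processedRowsPerSecond = 120.0

p = Progress()

def rows_per_second(progress):
    """Read the metric whether progress arrives as a dict or an object."""
    if isinstance(progress, dict):
        return progress.get("processedRowsPerSecond")
    return getattr(progress, "processedRowsPerSecond", None)
```

Guarding on the shape this way avoids the reported failure regardless of whether event.progress returns a dictionary or an object.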

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
+--------------------+-------------+-------+--------------------+
|                 key|doubled_value|op_type|             op_time|
+--------------------+-------------+-------+--------------------+
|a960f663-d13a-49c...|            2|      1|2024-03-11 12:17:...|
+--------------------+-------------+-------+--------------------+

I am afraid it is not working. Not even printing anything. Cheers

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
the output sink (for example, console sink)
query = (
    processed_streaming_df.select(
        col("key").alias("key"),
        col("doubled_value").alias("doubled_value"),
        col("op_type"

Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
"spark.sql.warehouse.dir") to print the configured warehouse directory after creating the SparkSession to confirm all is OK HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-tale

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
ent: Use --py-files if your application code consists mainly of Python files and doesn't require a separate virtual environment. - Separate Virtual Environment: Use --conf spark.yarn.dist.archives if you manage dependencies in a separate virtual environment archive. HTH Mich Talebzadeh

Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
--conf spark.executor.cores=3 \
--conf spark.driver.memory=1024m \
--conf spark.executor.memory=1024m \
$CODE_DIRECTORY_CLOUD/${APPLICATION}

HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://ww

Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
d") My question is can these operations be done more efficiently in PySpark itself, ideally with one df operation reading the original file (.bz2.zip)? Thanks Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile http

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
)

# Show the result
joined_df.show()

root
 |-- combined_id: array (nullable = true)
 |    |-- element: long (containsNull = true)

root
 |-- mr_id: long (nullable = true)

+-----------+-----+
|combined_id|mr_id|
+-----------+-----+
|  [1, 2, 3]|    2|
|  [4, 5, 6]|    5|
+-----------+-----+

HTH
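The join shown above (an array column matched against a scalar column) follows array_contains semantics, which can be illustrated in plain Python. The join_on_membership helper is hypothetical and is only meant to show which pairs survive the join:

```python
def join_on_membership(left, right):
    """Mimic joining an array column to a scalar column with
    array_contains: keep pairs where mr_id appears in combined_id."""
    return [
        {"combined_id": ids, "mr_id": m}
        for ids in left
        for m in right
        if m in ids
    ]

left = [[1, 2, 3], [4, 5, 6]]   # combined_id values
right = [2, 5]                  # mr_id values
result = join_on_membership(left, right)
```

Each output row pairs an array with the scalar it contains, matching the two rows in the show() output above.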

Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh
cause partial aggregation if a single executor processes most items of a particular type. - Partial Aggregations, Spark might be combining partial counts from executors incorrectly, leading to inaccuracies. - Finally a bug in 3.5 is possible. HTH Mich Talebzadeh, Dad | Technologist | Solutions

Re: Issue of spark with antlr version

2024-02-27 Thread Mich Talebzadeh
Model (Java objects) --> Spring Application Logic (Controllers, Services, Repositories) etc. Is this a good guess? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d

Re: Bugs with joins and SQL in Structured Streaming

2024-02-26 Thread Mich Talebzadeh
Hi, These are all on spark 3.5, correct? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disc

Re: [Beginner Debug]: Executor OutOfMemoryError

2024-02-23 Thread Mich Talebzadeh
executor overhead. This is important for tasks that require additional memory beyond the executor memory setting. Example 5. --conf spark.executor.memoryOverhead=1000 HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-21 Thread Mich Talebzadeh
s detailed, but potentially functional, explanation. - Manual Analysis: analyze the query structure and logical steps yourself. - Spark UI: review the Spark UI (accessible through your Spark application on port 4040) for delving into query execution and potential bottlenecks. HTH Mich Taleb

Kafka-based Spark Streaming and Vertex AI for Sentiment Analysis

2024-02-21 Thread Mich Talebzadeh
": "Sleek and modern design, but lacking some features.", "user_feedback": "Negative", "review_source": "online", "sentiment_confidence": 0.33, "product_features": "user-friendly", "timestamp": "

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
and the host spark version you are submitting your spark-submit from? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Tale

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
        # Log the exception
        print(f"Exception in Spark job: {str(e)}")
        # Increment the retry count
        retries += 1
        # Sleep
        time.sleep(60)
    else:
        # Break out of the loop if the job completes successfully
        break

HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom
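The retry pattern in this snippet can be factored into a standalone helper. A minimal sketch, assuming the job is any callable; flaky_job below is a stand-in for a real Spark submission, and the helper name is hypothetical:

```python
import time

def run_with_retries(job, max_retries=5, backoff_seconds=0):
    """Re-run `job` until it succeeds or retries are exhausted,
    mirroring the while-loop/retry-count pattern in the snippet."""
    retries = 0
    while retries < max_retries:
        try:
            return job()
        except Exception as e:
            # Log the exception, count the attempt, then back off
            print(f"Exception in Spark job: {e}")
            retries += 1
            time.sleep(backoff_seconds)
    raise RuntimeError(f"Job failed after {max_retries} retries")

attempts = {"n": 0}

def flaky_job():
    """Fails twice, then succeeds - simulates a transient executor failure."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ValueError("transient failure")
    return "done"
```

In the Kubernetes case the backoff would be the 60-second sleep from the snippet rather than zero.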

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-19 Thread Mich Talebzadeh
Ok thanks for your clarifications Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The infor

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
and handling within Spark itself using max_retries = 5 etc? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Tale

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Sure but first it would be beneficial to understand the way Spark works on Kubernetes and the concepts. Have a look at this article of mine, Spark on Kubernetes, A Practitioner’s Guide <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-%3Ftrackin

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
OK you have a jar file that you want to work with when running using Spark on k8s as the execution engine (EKS) as opposed to YARN on EMR as the execution engine? Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <ht

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
query-with-dependencies_2.12-0.22.2.jar "${SPARK_EXTRA_JARS_DIR}" Here I am accessing Google BigQuery DW from EKS cluster HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Mich Talebzadeh
true)
Shuffle cleanup successful.
Iteration 5
root
 |-- column_to_check: string (nullable = true)
 |-- partition_column: long (nullable = true)
Shuffle cleanup successful.

HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profil

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-18 Thread Mich Talebzadeh
force with shutil.rmtree(path) to remove these files. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* The
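A minimal sketch of the shutil.rmtree clean-up mentioned here, exercised on a throwaway directory rather than a real Spark work directory; the force_remove helper name is hypothetical:

```python
import os
import shutil
import tempfile

def force_remove(path: str) -> bool:
    """Remove a leftover shuffle/temp directory tree if it exists;
    return True when something was actually deleted."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False

# Demonstrate on a throwaway directory that mimics leftover shuffle files
work_dir = tempfile.mkdtemp(prefix="spark-shuffle-")
open(os.path.join(work_dir, "shuffle_0_0_0.data"), "w").close()
removed = force_remove(work_dir)
```

Returning a boolean makes it easy to log which directories were actually cleaned during a long-lived application's housekeeping pass.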

Re: job uuid not unique

2024-02-16 Thread Mich Talebzadeh
ther brute force is that instead of *SaveMode.Append*, you can try using *SaveMode.Overwrite*. This will overwrite the existing data if it already exists. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedi

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-16 Thread Mich Talebzadeh
Hi Chao, As a cool feature - Compared to standard Spark, what kind of performance gains can be expected with Comet? - Can one use Comet on k8s in conjunction with something like a Volcano addon? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-15 Thread Mich Talebzadeh
Hi, I gather from the replies that the plugin is not currently available in the expected form, although I am aware of the shell script. Also have you got some benchmark results from your tests that you can possibly share? Thanks, Mich Talebzadeh, Dad | Technologist | Solutions Architect

Re: Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-02-13 Thread Mich Talebzadeh
verify that the spark.local.dir property in your Spark configuration points to a writable directory with enough space. - Permission issues: check directories listed in spark.local.dir are accessible by the Spark user with read/write permissions. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer
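The two checks suggested here (writable directory, enough free space) can be wrapped in a small pre-flight helper before pointing spark.local.dir at a path. A sketch with a hypothetical check_local_dir helper; the 1 MB free-space threshold is arbitrary and should be sized for your shuffle spill volume:

```python
import os
import shutil
import tempfile

def check_local_dir(path: str, min_free_bytes: int = 1 << 20):
    """Return a list of problems with a candidate spark.local.dir:
    missing, not writable, or short on free space. Empty list = OK."""
    problems = []
    if not os.path.isdir(path):
        problems.append("does not exist")
        return problems
    if not os.access(path, os.W_OK):
        problems.append("not writable")
    if shutil.disk_usage(path).free < min_free_bytes:
        problems.append("insufficient free space")
    return problems

# Example: the system temp directory should normally pass all checks
problems = check_local_dir(tempfile.gettempdir())
```

Running this against every entry of spark.local.dir at startup catches the DiskErrorException condition before the s3a block buffer ever tries to allocate there.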

Re: Null pointer exception while replying WAL

2024-02-12 Thread Mich Talebzadeh
gesJson.write.mode("append").parquet(data) } catch { case ex: Exception => ex.printStackTrace() } } } This modification uses Option to handle potential null values in the rdd and filters out any elements that are still "NO records found" after

Re: Null pointer exception while replying WAL

2024-02-11 Thread Mich Talebzadeh
to review the part of the code where the exception is thrown and identifying which object or method call is resulting in *null* can help the debugging process plus checking the logs. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my

Re: Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
The full code is available from the link below https://github.com/michTalebzadeh/Event_Driven_Real_Time_data_processor_with_SSS_and_API_integration Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.

Building an Event-Driven Real-Time Data Processor with Spark Structured Streaming and API Integration

2024-02-09 Thread Mich Talebzadeh
esktop> HTH, Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own ri

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
only showing top 1 row. rows is 50. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebza

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
eckpoint_path)
            .queryName(f"{query_name}")
            .start()
        )
        break  # Exit the loop after starting the streaming query
    else:
        time.sleep(5)  # Sleep for a while before checking for the next event

HTH Mich Talebzadeh, Dad | Technologis

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-29 Thread Mich Talebzadeh
"append")
    .format("console")
    .trigger(processingTime="1 second")
    .option("checkpointLocation", checkpoint_path)
    .start()
)
query.awaitTermination()

In this example, it listens to a socket on localhost: and expects a single integer value per line. Y

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-28 Thread Mich Talebzadeh
dcb-8acfd2e9a61e, runId = f71cd26f-7e86-491c-bcf2-ab2e3dc63eca] terminated with exception: No usable value for offset Did not find value which can be converted into long Seems like there might be an issue with the *rate-micro-batch* source when using the *startTimestamp* option. You can

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Mich Talebzadeh
tate updates, and available resources. In summary, what works well for one workload might not be optimal for another. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph

Re: Structured Streaming Process Each Records Individually

2024-01-11 Thread Mich Talebzadeh
ror(f"Error processing table {table_name}: {e}") else: print("DataFrame is empty") # Handle empty DataFrame HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-ta

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Mich Talebzadeh
Use an intermediate work table to land the streaming JSON data in the first place and then, according to the tag, store the data in the correct table. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <ht

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
the necessary state. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any a

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
ntermediate_df.cache() # Use cached intermediate_df for further transformations or actions HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybo

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread Mich Talebzadeh
ements. Cheers Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk.

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
and performance of your Flask application. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at yo

Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
yone's benefit. Hopefully your comments will help me to improve it. Cheers Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywiki.com/Mic

Re: Pyspark UDF as a data source for streaming

2024-01-08 Thread Mich Talebzadeh
ingestion and analytics. My use case revolves around a scenario where data is generated through REST API requests in real time with PySpark. The Flask REST API efficiently captures and processes this data, saving it to a sink of your choice like a data warehouse or Kafka. HTH Mich Talebzadeh, Dad

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Mich Talebzadeh
67108864") These configurations provide a starting point for tuning RocksDB. Depending on your specific use case and requirements, of course, your mileage varies. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <ht

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-06 Thread Mich Talebzadeh
How many topics and checkpoint directories are you dealing with? Does each topic have its own checkpoint on S3? All these checkpoints are sequential writes, so even SSD would not really help. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view

Re: Issue with Spark Session Initialization in Kubernetes Deployment

2024-01-05 Thread Mich Talebzadeh
URL is mandatory even when using the Spark Operator in Kubernetes deployments.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AppConstants.APPLICATION_NAME") \
    .config("spark.master", "k8s://https://:") \
    .getOrCreat

Re: Pyspark UDF as a data source for streaming

2023-12-29 Thread Mich Talebzadeh
Hi, Do you have more info on this Jira besides the github link as I don't seem to find it! Thanks

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
Hi Stanislav, On the Pyspark DF can you run the following: df.printSchema() and send the output please. HTH

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
/49785108/spark-streaming-with-python-how-to-add-a-uuid-column HTH

Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
Ok, so you want to generate some random data and load it into Kafka at a regular interval, and then the rest? HTH
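A minimal stdlib sketch of the generation side discussed here, synthetic records with a UUID key emitted at an interval. The field names are illustrative, and `send` is a hypothetical stand-in for a real Kafka producer call, not code from the thread.

```python
import json
import random
import time
import uuid
from typing import Callable

def generate_record() -> dict:
    """One synthetic event; the fields are illustrative assumptions."""
    return {
        "rowkey": str(uuid.uuid4()),
        "timestamp": time.time(),
        "temperature": round(random.uniform(15.0, 35.0), 2),
    }

def produce(send: Callable[[bytes], None], n: int,
            interval_s: float = 0.0) -> None:
    """Emit n JSON-encoded records via send().

    In the real job, send() would wrap a Kafka producer's
    send(topic, value); here it is just a callback.
    """
    for _ in range(n):
        send(json.dumps(generate_record()).encode("utf-8"))
        if interval_s:
            time.sleep(interval_s)
```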

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Worth trying the EXPLAIN <https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html> statement as suggested by @tianlangstudio. HTH

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
Well, not to put too fine a point on it: in a public forum, one ought to respect the importance of open communication. Everyone has the right to ask questions, seek information, and engage in discussions without facing unnecessary patronization.

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-15 Thread Mich Talebzadeh
Apologies Koert!

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-15 Thread Mich Talebzadeh
rs to impersonate Identity and Access Management (IAM) service accounts to access Google Cloud services." *Cheers*

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Mich Talebzadeh
and manage the Spark application's execution on the YARN cluster. HTH

Re: [Streaming (DStream) ] : Does Spark Streaming supports pause/resume consumption of message from Kafka?

2023-12-01 Thread Mich Talebzadeh
ocess_stream) stream.start() HTH
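Since the DStream API exposes no built-in pause/resume switch, one common application-level workaround is a gate checked between fetches. This is a stdlib sketch of that pattern under that assumption; `fetch` and `handle` are hypothetical stand-ins for the Kafka polling and record processing, not code from the thread.

```python
import threading
from typing import Any, Callable, Optional

gate = threading.Event()
gate.set()  # set = consuming; cleared = paused

def pause() -> None:
    gate.clear()

def resume() -> None:
    gate.set()

def consume_loop(fetch: Callable[[], Optional[Any]],
                 handle: Callable[[Any], None]) -> None:
    """Poll-style consumer loop that blocks at the gate while paused.

    fetch() returns the next batch, or None to stop the loop.
    """
    while True:
        gate.wait()          # blocks here whenever pause() was called
        batch = fetch()
        if batch is None:
            break
        handle(batch)
```

The loop would typically run on its own thread, with pause()/resume() called from a control path (e.g. on backpressure or a maintenance window).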

Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
file of my project and copy it across to the cloud storage, and then put the application (py file) there as well and use them in spark-submit. I trust this answers your question. HTH

Re: Spark master shuts down when one of zookeeper dies

2023-11-07 Thread Mich Talebzadeh
Zookeeper cluster. HTH
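For reference, the ZooKeeper-backed master recovery discussed here is wired up through a few spark-defaults.conf properties. The property names are from the Spark standalone-mode docs; the hosts and the znode path below are placeholders.

```
# spark-defaults.conf -- standalone master HA via ZooKeeper
spark.deploy.recoveryMode        ZOOKEEPER
spark.deploy.zookeeper.url       zk1:2181,zk2:2181,zk3:2181
spark.deploy.zookeeper.dir       /spark
```

With these set on every master, a standby master takes over leadership when the active one dies, provided the ZooKeeper ensemble itself retains quorum.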

Re: Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Mich Talebzadeh
= "gs://etcbucket/data-file" normalized_path = os.path.normpath(path) # Pass the normalized path to Spark *In Scala* import java.io.File val path = "gs://etcbucket/data-file" val normalizedPath = new File(path).getCanonicalPath() // Pass the normalized path to Spark
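One caution about the normpath approach in the snippet above: os.path.normpath can itself mangle an object-store URI (it collapses the `//` after the scheme and, on Windows, rewrites `/` to `\`), which is a common source of this kind of parser error. A stdlib alternative sketch that validates the gs:// URI and passes it through untouched; the function name and checks are illustrative, not from the thread.

```python
from urllib.parse import urlparse

def check_gcs_uri(path: str) -> str:
    """Validate a gs:// URI without passing it through os.path.normpath,
    which would rewrite the separators and break the scheme."""
    parsed = urlparse(path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a GCS URI: {path!r}")
    return path  # hand the untouched URI to Spark

uri = check_gcs_uri("gs://etcbucket/data-file")
```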

Re: Data analysis issues

2023-11-02 Thread Mich Talebzadeh
on their own servers. Try using encryption combined with RBAC (who can access what) to protect your data privacy. Also beware of the security risks associated with third-party libraries if you are deploying them. HTH

Elasticity and scalability for Spark in Kubernetes

2023-10-30 Thread Mich Talebzadeh
ike to explore ideas on it with the other members. Thanks
