Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Mich Talebzadeh
not particularly recommend that: since we are dealing with edge cases here, recovering from an error can be more costly than simply using disk space. HTH
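
For anyone experimenting with this, a minimal sketch of keeping a DataFrame purely in RAM with StorageLevel.MEMORY_ONLY (illustrative only; names are made up, and note that partitions that do not fit in memory are recomputed from lineage rather than spilled):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("memOnlyDemo").getOrCreate()
    df = spark.range(1000000)
    # MEMORY_ONLY keeps cached blocks in RAM only; anything that does not
    # fit is recomputed from lineage instead of being spilled to disk
    df.persist(StorageLevel.MEMORY_ONLY)
    print(df.count())

Even with MEMORY_ONLY caching, shuffles still write intermediate files to local disk, which is part of why a truly no-disk Spark is hard.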

Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Mich Talebzadeh
physical memory for other uses"

free
              total        used        free      shared  buff/cache   available
Mem:       65659732    30116700     1429436     2341772    34113596    32665372
Swap:     104857596      550912   104306684

HTH

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-14 Thread Mich Talebzadeh

Creating docker images for Data Science

2021-08-14 Thread Mich Talebzadeh
to be added. Regards, Mich

Re: K8S submit client vs. cluster

2021-08-12 Thread Mich Talebzadeh
switch spark-submit --verbose HTH

Re: K8S submit client vs. cluster

2021-08-12 Thread Mich Talebzadeh
OK. As I see it, with PySpark even if the job is submitted in cluster mode, it will be converted to client mode anyway. Are you running this on AWS or GCP?

Re: K8S submit client vs. cluster

2021-08-12 Thread Mich Talebzadeh
Is this Spark or PySpark?

Performance of PySpark jobs on the Kubernetes cluster

2021-08-09 Thread Mich Talebzadeh
Hi, I have a basic question to ask. I am running a Google k8s cluster (AKA GKE) with three nodes, each with the configuration below: e2-standard-2 (2 vCPUs, 8 GB memory). spark-submit is launched from another node (actually a Dataproc single node that I have just upgraded to e2-custom (4 vCPUs, 8

Re: error: java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

2021-08-08 Thread Mich Talebzadeh
ces before checking secrets! So in summary, PySpark 3.1.1 with Java 8 and spark-bigquery-latest_2.12.jar works fine inside the docker image. The problem is that Debian Buster no longer supports Java 8. HTH

Re: error: java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

2021-08-08 Thread Mich Talebzadeh
On Sat, 7 Aug 2021 at 21:49, Mich Talebzadeh wrote: > Thanks Kartik, > > Yes indeed this is a BigQuery issue with Spark. Those two settings (below) > did not work in spark-submit or adding to > $SUBASE_

Re: error: java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

2021-08-07 Thread Mich Talebzadeh
ctor/issues/256 > and > https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/350. > These issues also mention a few possible solutions. > > Regards > Kartik > > On Sun, Aug 8, 2021 at 1:02 AM Mich Talebzadeh > wrote: > >> Thanks for the hint

Re: error: java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

2021-08-07 Thread Mich Talebzadeh
eeing the code and the whole stack trace, just a wild guess: have > you set the config param for enabling Arrow > (spark.sql.execution.arrow.pyspark.enabled)? If not in your code, you > would have to set it in spark-defaults.conf. Please note that the > parameter spark.sql.exec
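
For reference, a minimal sketch of enabling the Arrow parameter mentioned above in the session itself rather than in spark-defaults.conf (app name is illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("arrowDemo")
             # enable Arrow-based columnar transfer for pandas conversions
             .config("spark.sql.execution.arrow.pyspark.enabled", "true")
             .getOrCreate())
    pdf = spark.range(10).toPandas()  # uses Arrow when enabled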

error: java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

2021-08-07 Thread Mich Talebzadeh
Hi, I encounter the error "java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available" when reading from a Google BigQuery (GBQ) table using a Kubernetes cluster built on Debian Buster. The current Debian Buster from the docker image is:

Is the operation of subtracting two dataframes valid?

2021-08-06 Thread Mich Talebzadeh
e calling o116.count. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: OK, this may be specific to BigQuery because, as I recall, this operation could be done against an Oracle table. Thanks

Re: How can I write data to hive with jdbc

2021-08-04 Thread Mich Talebzadeh

Re: How can I write data to hive with jdbc

2021-08-04 Thread Mich Talebzadeh
"ods_job_log")
cfg += ("user" -> "jztwk")
cfg += ("passwrod" -> "123456")

You have repeated your username and password. The url should be simply

val url = "jdbc:hive2://" + "hiveHost" + ":" + "hivePort" +

Re: UnspecifiedDistribution Error using AQE

2021-08-03 Thread Mich Talebzadeh
our query causing the error with this configuration parameter turned on. You may get additional info. HTH

Re: Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-30 Thread Mich Talebzadeh

Re: How can I write data to hive with jdbc

2021-07-30 Thread Mich Talebzadeh
"")
    sys.exit(1)

Example:

url = "jdbc:hive2://" + config['hiveVariables']['hiveHost'] + ':' + config['hiveVariables']['hivePort'] + '/default'
driver = "com.cloudera.hive.jdbc41.HS2Driver"

Note the correct driver. HTH
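
Putting the pieces of this thread together, a hedged sketch of writing a DataFrame to Hive over JDBC; the host, port, table, columns and credentials below are placeholders lifted from the thread, not a verified configuration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hiveJdbcWrite").getOrCreate()
    df = spark.createDataFrame([(1, "job1")], ["id", "job_name"])  # stand-in data
    (df.write
       .format("jdbc")
       .option("url", "jdbc:hive2://hiveHost:10000/default")
       # the Cloudera HS2 driver jar must be on the driver/executor classpath
       .option("driver", "com.cloudera.hive.jdbc41.HS2Driver")
       .option("dbtable", "ods_job_log")
       .option("user", "jztwk")
       .option("password", "...")
       .mode("append")
       .save())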

Hacking my way through Kubernetes docker file

2021-07-29 Thread Mich Talebzadeh
u 0 -it be686602970d bash
root@addbe3ffb9fa:/opt/spark/work-dir# id
uid=0(root) gid=0(root) groups=0(root)

Hope this helps someone. It is a hack but it works for now.

Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Mich Talebzadeh
it will take significant time and space and probably will not fit in a single machine's RAM. HTH

Why does PySpark with spark-submit throw an error trying to untar --archives pyspark_venv.tar.gz

2021-07-26 Thread Mich Talebzadeh
Hi, Maybe someone can shed some light on this. Running a PySpark job in minikube. Because it is PySpark, the following two conf parameters are used:

spark-submit --verbose \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name pytest \
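
For context, a sketch of the venv-shipping mechanism under discussion, expressed as session config rather than command-line flags; the archive name comes from the subject line, while the alias and paths are assumptions:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # ship the packed virtualenv; Spark untars it under the alias
             .config("spark.archives", "pyspark_venv.tar.gz#environment")
             # point the executors' Python at the unpacked venv
             .config("spark.pyspark.python", "./environment/bin/python")
             .getOrCreate())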

Fwd: Installing Distributed apache spark cluster with Cluster mode on Docker

2021-07-25 Thread Mich Talebzadeh
r Spark to be installed on each or a few designated nodes of Hadoop on your physical hosts on premise. Kindly help me to accomplish this. Let me know what else you need. Thanks, Dinakar On Sun, Jul 25, 2021 at 10:35 PM Mich Talebzadeh wrote: > Hi, > > Right, you seem to have an on

Re: Installing Distributed apache spark cluster with Cluster mode on Docker

2021-07-25 Thread Mich Talebzadeh
Having Hadoop implies that you already have a YARN resource manager plus HDFS. YARN is the most widely used on-premise resource manager for Spark. Provide some additional info and we will go from there. HTH

spark-submit --files causes spark context to fail in Kubernetes

2021-07-24 Thread Mich Talebzadeh
user/minikube/spark-upload-065d87cf-a1ee-4448-8199-5ec018aacfde
total 16
-rw-r--r--. 1 hduser hadoop 4433 Jul 24 11:26 config.yml

Sounds like a bug.

Kubernetes reading a yaml file with Pyspark fails

2021-07-23 Thread Mich Talebzadeh
determine_encoding
  File "/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/reader.py", line 178, in update_raw
AttributeError: 'list' object has no attribute 'read'

Which throws an error! I am sure there is a solution to read this yaml file inside the pod? Thanks
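
The traceback suggests a list was handed to the yaml reader rather than a file object or string. A minimal sketch of the usual pattern, assuming the file is shipped to the pod and readable from the working directory (the filename is from earlier in this thread):

    import yaml

    # yaml.safe_load wants a file handle or a string, not a list of lines
    with open("config.yml") as f:
        config = yaml.safe_load(f)
    print(config)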

Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-23 Thread Mich Talebzadeh
ation issues that have dampened my enthusiasm, and the inevitable question: regardless of performance, can Kubernetes be used in anger today at industrial scale? Regards, Mich

Unpacking and using external modules with PySpark inside k8s

2021-07-20 Thread Mich Talebzadeh
It just cannot find the modules. How can I find out if the gz file is unpacked OK? Thanks

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Mich Talebzadeh
"jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")

There is some confusion somewhere!

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-20 Thread Mich Talebzadeh
oreign key:28:27, org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329 Sounds like a mismatch between the columns of the Spark DataFrame and the underlying table. HTH

import yaml fails with docker or kubernetes but works ok when run with YARN

2021-07-19 Thread Mich Talebzadeh
py", line 17, in main()
  File "/tmp/spark-c34d1329-7a5a-49a7-a1bb-1889ba5a659d/testyml.py", line 13, in main
    import yaml
ModuleNotFoundError: No module named 'yaml'

Well, yaml is a bit of an issue, so I was wondering if anyone has seen this before? Thanks

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-19 Thread Mich Talebzadeh
erved word for table columns? What is your DDL for this table? HTH

Re: Naming files while saving a Dataframe

2021-07-17 Thread Mich Talebzadeh
e
>   .partitionBy(partitionDate)
>   .mode("overwrite")
>   .format("parquet")
>   .save(*someDirectory*)
>
> Now where would I add a 'prefix' in this one?
>
> On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh < > mich.talebza...@gmail.

Re: Naming files while saving a Dataframe

2021-07-17 Thread Mich Talebzadeh
Jobs have names in Spark. You can prefix it to the file name when writing to the directory, I guess:

val sparkConf = new SparkConf().
  setAppName(sparkAppName).

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-15 Thread Mich Talebzadeh
from

val start = spark.sql("SELECT MAX(id) FROM test.randomData").collect.apply(0).getInt(0) + 1

HTH

Re: Unable to write data into hive table using Spark via Hive JDBC driver Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED

2021-07-15 Thread Mich Talebzadeh
Have you created that table in Hive or are you trying to create it from Spark itself? Your Hive is local. In this case you don't need a JDBC connection. Have you tried: df2.write.mode("overwrite").saveAsTable("mydb.mytable") HTH

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Mich Talebzadeh
ce degradation.

+---+---+
| id|idx|
+---+---+
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
+---+---+

HTH
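
A sketch of one way to produce the id/idx pairing shown above with a window function; column names are illustrative, and the single-partition warning quoted elsewhere in this list applies, since the window has no partitionBy:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import row_number

    spark = SparkSession.builder.appName("idxDemo").getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
    # row_number over an unpartitioned window gives a gap-free sequence,
    # at the cost of pulling all rows into one partition
    df = df.withColumn("idx", row_number().over(Window.orderBy("id")))
    df.show()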

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Mich Talebzadeh
        print(f"""wrote to DB""")
    else:
        print("DataFrame md is empty")

That value batchId can be used for each batch. Otherwise you can do this:

startval = 1
df = df.withColumn('id', monotonically_increasing_id() + startval)

HTH

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
_path) \
    .queryName("orphanedTransactions") \
    .start()

And consume it somewhere else. HTH

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
he essence is not time critical, this can be done through a scheduled job every x minutes via Airflow or something similar, on the database alone. HTH

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
a query in Postgres for say transaction_id 3 but they don't exist yet? When are they expected to arrive?

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
(len(df.take(1))) > 0:
        print(f"""md batchId is {batchId}""")
        df.show(100, False)
        df.persist()
        # write to BigQuery batch table
        s.writeTableToBQ(df, "append", config['MDVariables']['targetDataset'], config['MDVariables'][
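
For readers following along, a cut-down sketch of the foreachBatch pattern the snippet above comes from; writeTableToBQ and the config dictionary are the author's own helpers, so a console-side stand-in marks where the BigQuery write goes, and a rate source stands in for the real stream:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("foreachBatchDemo").getOrCreate()
    # rate source stands in for the real Kafka stream
    streaming_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    def send_to_sink(df, batch_id):
        if len(df.take(1)) > 0:
            df.persist()
            print(f"batchId is {batch_id}")
            df.show(5, False)  # stand-in for the real BigQuery write
            df.unpersist()
        else:
            print("DataFrame is empty")

    query = streaming_df.writeStream.foreachBatch(send_to_sink).start()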

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
Can you please clarify whether you are reading these two topics separately or within the same Scala or Python script in Spark Structured Streaming? HTH

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Mich Talebzadeh
Splendid. Please invite me to the next meeting: mich.talebza...@gmail.com Timezone London, UK (GMT+1). Thanks,

Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-06 Thread Mich Talebzadeh
chanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn > > -Regards > Aditya > > On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh > wrote: > >> Thanks Yuri. Those are very valid points. >> >> Let me clarify my point. Let us ass

Re: OutOfMemoryError

2021-07-06 Thread Mich Talebzadeh
emory 8G \
  --num-executors 1 \
  --master local \
  --executor-cores 2 \
  --conf "spark.scheduler.mode=FIFO" \
  --conf "spark.ui.port=5" \
  --conf spark.executor.memoryOverhead=3000 \

HTH

Re: Spark AQE Post-Shuffle partitions coalesce don't work as expected, and even make data skew in some partitions. Need help to debug issue.

2021-07-06 Thread Mich Talebzadeh
to come to a meaningful conclusion. HTH

Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
cloud buckets are independent. So that is a question we can ask the author, who is an ex-Databricks guy. Cheers

Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
Thanks Aditya for the link. I will have a look. Cheers

Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
as they do a single function. Cheers, Mich

Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
nnot see how one can pigeon-hole Spark into a container and make it performant, but I may be totally wrong. Thanks

Re: Spark AQE Post-Shuffle partitions coalesce don't work as expected, and even make data skew in some partitions. Need help to debug issue.

2021-07-04 Thread Mich Talebzadeh

Re: Structuring a PySpark Application

2021-07-02 Thread Mich Talebzadeh
in/python/DSBQ/src/RandomData.py

echo `date` ", ===> Cleaning up files"
[ -f ${pyspark_venv}.tar.gz ] && rm -r -f ${pyspark_venv}.tar.gz
[ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip

HTH

Re: Increase batch interval in case of delay

2021-07-01 Thread Mich Talebzadeh
being processed after a delay reaches a certain time" The only way you can do this is by allocating more resources to your cluster at the start so that additional capacity is made available. HTH

Re: Structuring a PySpark Application

2021-07-01 Thread Mich Talebzadeh
e/hduser/dba/bin/python/dynamic_ARRAY_generator_parquet.py HTH

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
DataPy(
  ID INT,
  CLUSTERED INT,
  SCATTERED INT,
  RANDOMISED INT,
  RANDOM_STRING VARCHAR(50),
  SMALL_VC VARCHAR(50),
  PADDING VARCHAR(4000)
)
STORED AS PARQUET
"""
spark.sql(sqltext)

HTH

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Mich Talebzadeh
will face will be compatibility problems between versions of Hive and Spark. My suggestion would be to use Spark as a massively parallel processing engine and Hive as a storage layer. However, you need to test what can be migrated or not. HTH Mich

Re: Structuring a PySpark Application

2021-06-30 Thread Mich Talebzadeh

Re: Structuring a PySpark Application

2021-06-30 Thread Mich Talebzadeh
ell itself for clarification. HTH, Mich

Re: PySpark dependency management in minikube on prem

2021-06-28 Thread Mich Talebzadeh
r.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.680748 s
Pi is roughly 3.134995674978375
2021-06-28 16:48:57,873 INFO server.AbstractConnector: Stopped Spark@a20b94b{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}

HTH

PySpark dependency management in minikube on prem

2021-06-28 Thread Mich Talebzadeh

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
sure you have some inroads/ideas on this subject as well; then truly, I guess, love would be in the air for Kubernetes <https://www.youtube.com/watch?v=NNC0kIzM1Fo> HTH

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Thanks Klaus. That will be great. It would also be helpful if you elaborated on the need for this feature in light of the limitations of the current batch workload. Regards, Mich

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Mich Talebzadeh
ceived limitation of SSS in dealing with chains of aggregation (if I am correct, it is not yet supported on streaming datasets). These are my opinions and they are not facts, just opinions so to speak :)

Re: Insert into table with one the value is derived from DB function using spark

2021-06-20 Thread Mich Talebzadeh
with id = 1

scratch...@orasource.mich.LOCAL> select count(1) from scratchpad.randomdata;

  COUNT(1)
----------
        10

If you repeat the Spark code again you will get a primary key constraint violation in Oracle and the rows will be rejected:

cx_Oracle.IntegrityError: ORA-00001: unique constraint

Re: Insert into table with one the value is derived from DB function using spark

2021-06-19 Thread Mich Talebzadeh
way of inserting columns directly from Spark. HTH

Re: Insert into table with one the value is derived from DB function using spark

2021-06-18 Thread Mich Talebzadeh
Spark dataframe to that staging table? HTH Mich

Re: Insert into table with one the value is derived from DB function using spark

2021-06-18 Thread Mich Talebzadeh
leName). \

You can replace *tableName* with the equivalent SQL insert statement. You will need the JDBC driver for Oracle, say ojdbc6.jar, on the classpath in $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.extraClassPath /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar

HTH
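
To make the JDBC route concrete, a hedged sketch of appending a DataFrame to the Oracle table named elsewhere in this thread; the connection string, credentials and columns are placeholders, not a verified setup:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracleWrite").getOrCreate()
    df = spark.createDataFrame([(11, "xyz")], ["ID", "RANDOM_STRING"])  # stand-in data
    (df.write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//oracleHost:1521/orasource")  # placeholder host/service
       .option("driver", "oracle.jdbc.OracleDriver")
       .option("dbtable", "scratchpad.randomdata")
       .option("user", "scratchpad")
       .option("password", "...")
       .mode("append")
       .save())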

Re: Does Rollups work with spark structured streaming with state.

2021-06-17 Thread Mich Talebzadeh
Great Amit, best of luck. Cheers, Mich

Re: class KafkaCluster related errors

2021-06-17 Thread Mich Talebzadeh

Re: class KafkaCluster related errors

2021-06-17 Thread Mich Talebzadeh
Hi Kiran, You need kafka-clients for the version of Kafka you are using. So if it is the correct version, keep it. Try running and see what the error says. HTH

Re: Migrating from hive to spark

2021-06-17 Thread Mich Talebzadeh
o/2019/02/19/spark-ad-hoc-querying/ > > Would like to know if anybody else has migrated like this, and any challenges > or pre-requisites to migrate (like hardware)? Any tools to evaluate before > we migrate? > > *From: *M

Re: Does Rollups work with spark structured streaming with state.

2021-06-17 Thread Mich Talebzadeh
")) Now basic aggregation on singular columns can be done, like avg('temperature'), max(), stddev() etc. For cube() and rollup() I will require additional columns like location etc. in my Kafka topic. Personally I have not tried it, but it will be interesting to see if it works. Have you tri
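
A small sketch of what rollup() on such columns looks like on a static DataFrame; the column names follow the ones mentioned above, and whether this works on a streaming dataset is exactly the open question in this thread:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rollupDemo").getOrCreate()
    df = spark.createDataFrame(
        [("london", 21.0), ("london", 23.0), ("paris", 25.0)],
        ["location", "temperature"])
    # rollup computes hierarchical subtotals left to right:
    # per-location rows plus a grand-total row
    df.rollup("location").agg(F.avg("temperature")).show()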

Re: Does Rollups work with spark structured streaming with state.

2021-06-16 Thread Mich Talebzadeh
Hi, Just to clarify: are we talking about *rollup* as a subset of a cube that computes hierarchical subtotals from left to right?

Re: Spark-sql can replace Hive ?

2021-06-15 Thread Mich Talebzadeh
OK, you mean use spark.sql as opposed to HiveContext.sql?

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("")

replace with spark.sql("")?

Re: class KafkaCluster related errors

2021-06-13 Thread Mich Talebzadeh

Re: class KafkaCluster related errors

2021-06-11 Thread Mich Talebzadeh

Re: Spark-sql can replace Hive ?

2021-06-10 Thread Mich Talebzadeh
These are different things. Spark provides a computational layer and a dialect of SQL based on Hive. Hive is a DW on top of HDFS. What are you trying to replace? HTH

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Mich Talebzadeh
Are you running this in a Managed Instance Group (MIG)? https://cloud.google.com/compute/docs/instance-groups

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Mich Talebzadeh

Re: class KafkaCluster related errors

2021-06-08 Thread Mich Talebzadeh
.option('checkpointLocation', checkpoint_path) \
    .queryName("avgtemperature") \
    .start()

Now within that checkpoint_path directory you have five sub-directories containing all you need, including offsets:

/ssd/hduser/avgtemperature/chkpt> ls
commits  metadata  offsets  source

Re: class KafkaCluster related errors

2021-06-07 Thread Mich Talebzadeh
" -> autoCommitIntervalMS
)

//val topicsSet = topics.split(",").toSet
val topicsValue = Set(topics)
val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsValue)
d

Re: class KafkaCluster related errors

2021-06-07 Thread Mich Talebzadeh
Hi, Are you trying to read topics from Kafka in Spark 3.0.1? Have you checked the Spark 3.0.1 documentation? Integrating Spark with Kafka is pretty straightforward with 3.0.1 and higher. HTH
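
As a pointer, a minimal Structured Streaming read from Kafka in the 3.x API; the broker address and topic name are borrowed from examples later in this list, purely for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafkaRead").getOrCreate()
    # requires the spark-sql-kafka-0-10 package matching your Spark version
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "temperature")
          .load())
    messages = df.selectExpr("CAST(value AS STRING)")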

Re: Kube estimate for Spark

2021-06-03 Thread Mich Talebzadeh

Re: Spark Structured Streaming

2021-05-31 Thread Mich Talebzadeh
thread in this forum, "Calculate average from Spark stream". And for the triggering mechanism you can see an example in my LinkedIn article below: https://www.linkedin.com/pulse/processing-change-data-capture-spark-structured-talebzadeh-ph-d-/ HTH

Re: About Spark executs sqlscript

2021-05-24 Thread Mich Talebzadeh
that all rows are there:

if df2.subtract(read_df).count() == 0:
    print("Data has been loaded OK to BQ table")
else:
    print("Data could not be loaded to BQ table, quitting")
    sys.exit(1)

HTH

Re: About Spark executs sqlscript

2021-05-24 Thread Mich Talebzadeh
o joins etc. in Spark itself. At times it is more efficient for BigQuery to do the join itself and create a result-set table in the BigQuery dataset that you can import into Spark. Whatever the approach, there is a solution, and as usual your mileage varies, so to speak. HTH

Re: Spark query performance of cached data affected by RDD lineage

2021-05-24 Thread Mich Talebzadeh
. What is the output of print(rdd.toDebugString())? I doubt this issue is caused by the RDD lineage adding additional steps that are not required. HTH Mich

Re: Calculate average from Spark stream

2021-05-21 Thread Mich Talebzadeh
|
|2021-05-17 19:50:00|2021-05-17 19:55:00|25.0  |
|2021-05-17 20:30:00|2021-05-17 20:35:00|25.8  |
|2021-05-17 20:10:00|2021-05-17 20:15:00|25.25 |
|2021-05-17 19:30:00|2021-05-17 19:35:00|27.0  |
|2021-05-17 20:15:00|2021-05-17 20:20:00|23.8  |
|2021-05-17 20:00:00|2021-05-17 20:05:00|24.6

Re: PySpark Write File Container exited with a non-zero exit code 143

2021-05-19 Thread Mich Talebzadeh
gh 'pyspark' is not supported as of Spark 2.0. Use ./bin/spark-submit. This works:

$SPARK_HOME/bin/spark-submit --master local[4] dynamic_ARRAY_generator_parquet.py

See https://spark.apache.org/docs/latest/submitting-applications.html HTH

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
without a partitionBy() clause, data will be skewed towards one executor:

WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Cheers

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
be able to live with it. Let me think about what else can be done. HTH Mich

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
+---------+---------+
|amount_6m|amount_9m|
+---------+---------+
|      100|      500|
|      200|      600|
|      300|      700|
|      400|      800|
|      500|      900|
+---------+---------+

HTH
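
For completeness, a sketch of one way to pair two single-column DataFrames by position to get the result above; this is the row_number-over-window idea discussed in this thread, and the warning about moving all data to a single partition applies:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import row_number, monotonically_increasing_id

    spark = SparkSession.builder.appName("mergeDemo").getOrCreate()
    df6 = spark.createDataFrame([(100,), (200,), (300,), (400,), (500,)], ["amount_6m"])
    df9 = spark.createDataFrame([(500,), (600,), (700,), (800,), (900,)], ["amount_9m"])
    # give each row a positional index, then join on it
    w = Window.orderBy(monotonically_increasing_id())
    merged = (df6.withColumn("rn", row_number().over(w))
                 .join(df9.withColumn("rn", row_number().over(w)), "rn")
                 .drop("rn"))
    merged.show()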

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Hi Kushagra, A bit late on this, but what is the business use case for this merge? You have two data frames, each with one column, and you want to UNION them in a certain way, but the correlation is not known. In other words this UNION is as-is? amount_6m | amount_9m 100

Re: Calculate average from Spark stream

2021-05-18 Thread Mich Talebzadeh

Re: Calculate average from Spark stream

2021-05-18 Thread Mich Talebzadeh
"CAST(temperature AS > STRING)") \
> .writeStream \
> .format("kafka") \
> .option("kafka.bootstrap.servers", "localhost:9092") \
> .option('checkpointLocation', > "/home/kafka/Documenti/confluent/examples-6.1.0-po

Re: Calculate average from Spark stream

2021-05-17 Thread Mich Talebzadeh
Hi Giuseppe, How have you defined your resultM above in qK? Cheers

Re: Calculate average from Spark stream

2021-05-17 Thread Mich Talebzadeh
    . \
    queryName("temperature"). \
    start()
except Exception as e:
    print(f"""{e}, quitting""")
    sys.exit(1)

#print(result.status)
#print(result.recentProgress)

Re: Calculate average from Spark stream

2021-05-15 Thread Mich Talebzadeh
avg(temperature)|
+------------------------------------------+----------------+
|{2021-05-15 22:35:00, 2021-05-15 22:40:00}|27.0            |
|{2021-05-15 22:30:00, 2021-05-15 22:35:00}|25.5            |
+------------------------------------------+----------------+

--- B
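
The output above comes from a windowed aggregation. A minimal sketch of that pattern with the 5-minute windows seen in this thread; the rate source and random temperature column below stand in for the real Kafka stream:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("windowedAvg").getOrCreate()
    # rate source stands in for the Kafka temperature stream
    df = (spark.readStream.format("rate").option("rowsPerSecond", 1).load()
          .withColumn("temperature", F.rand() * 30))
    avg_df = (df
              .withWatermark("timestamp", "5 minutes")
              .groupBy(F.window(F.col("timestamp"), "5 minutes"))
              .agg(F.avg("temperature")))
    query = (avg_df.writeStream.outputMode("complete").format("console")
             .option("truncate", "false").start())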

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-15 Thread Mich Talebzadeh
-site.xml
lrwxrwxrwx 1 hduser hadoop 50 Mar  3 08:08 hdfs-site.xml -> /home/hduser/hadoop-3.1.0/etc/hadoop/hdfs-site.xml
lrwxrwxrwx 1 hduser hadoop 43 Mar  3 08:07 hive-site.xml -> /data6/hduser/hive-3.0.0/conf/hive-site.xml

This works. HTH
