not particularly recommend that, bearing in mind that since we are
dealing with edge cases, recovering from an error can be more costly
than simply using the disk space.
HTH
physical memory for
other uses"
free
              total        used        free      shared  buff/cache   available
Mem:       65659732    30116700     1429436     2341772    34113596    32665372
Swap:     104857596      550912   104306684
HTH
to be added.
Regards,
Mich
switch
spark-submit --verbose
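For instance, a minimal sketch (the script name and master here are
placeholders, not from the thread):

spark-submit --verbose \
  --master local[2] \
  your_app.py

The --verbose flag makes spark-submit print the parsed arguments and the
resolved Spark properties at launch, which helps confirm which conf files
and jars were actually picked up.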
HTH
OK.
As I see it, with PySpark even if it is submitted in cluster mode, it will
be converted to client mode anyway.
Are you running this on AWS or GCP?
Is this Spark or PySpark?
Hi,
I have a basic question to ask.
I am running a Google k8s cluster (aka GKE) with three nodes, each with
the configuration below:
e2-standard-2 (2 vCPUs, 8 GB memory)
spark-submit is launched from another node (actually a Dataproc single
node that I have just upgraded to e2-custom (4 vCPUs, 8
ces before
checking secrets!
So in summary, PySpark 3.1.1 with Java 8 with spark-bigquery-latest_2.12.jar
works fine inside the docker image.
The problem is that Debian buster no longer supports Java 8.
HTH
On Sat, 7 Aug 2021 at 21:49, Mich Talebzadeh
wrote:
> Thanks Kartik,
>
> Yes indeed this is a BigQuery issue with Spark. Those two settings (below)
> did not work in spark-submit or adding to
> $SUBASE_
> https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/256
> and
> https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/350.
> These issues also mention a few possible solutions.
>
> Regards
> Kartik
>
> On Sun, Aug 8, 2021 at 1:02 AM Mich Talebzadeh
> wrote:
>
>> Thanks for the hint
> Without seeing the code and the whole stack trace, just a wild guess: have
> you set the config param for enabling arrow
> (spark.sql.execution.arrow.pyspark.enabled)? If not in your code, you
> would have to set it in spark-defaults.conf. Please note that the
> parameter spark.sql.exec
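As a sketch of the suggestion above, assuming an existing SparkSession
named spark (and pandas/pyarrow installed in the environment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrowTest").getOrCreate()
# enable Arrow-based columnar transfers between the JVM and Python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = spark.range(5).toPandas()  # this conversion now goes through Arrow

Equivalently, add the line "spark.sql.execution.arrow.pyspark.enabled true"
to spark-defaults.conf.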
Hi,
I encounter the error:
"java.lang.UnsupportedOperationException: sun.misc.Unsafe or
java.nio.DirectByteBuffer.(long, int) not available"
When reading from a Google BigQuery (GBQ) table using a Kubernetes cluster
built on Debian buster.
The current Debian buster release from the docker image is:
An error occurred while calling o116.count.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
tree:
OK, this may be specific to BigQuery, because as I recall this operation
could be done against an Oracle table.
Thanks
;ods_job_log")
cfg += ("user"->"jztwk")
cfg += ("passwrod"-> "123456")
You have repeated your username and password.
The url should simply be
val url = "jdbc:hive2://" + hiveHost + ":" + hivePort + "/default"
Rerun your query causing the error with this configuration parameter
turned on. You may get additional info.
HTH
quot;")
sys.exit(1)
Example:
url = "jdbc:hive2://" + config['hiveVariables']['hiveHost'] + ':' + config['hiveVariables']['hivePort'] + '/default'
driver = "com.cloudera.hive.jdbc41.HS2Driver"
Note the correct driver.
HTH
docker run -u 0 -it be686602970d bash
root@addbe3ffb9fa:/opt/spark/work-dir# id
uid=0(root) gid=0(root) groups=0(root)
Hope this helps someone. It is a hack but it works for now.
it will take significant time and space and probably will not fit in a
single machine's RAM.
HTH
Hi,
Maybe someone can shed some light on this.
Running a PySpark job in minikube.
Because it is PySpark, the following two conf parameters are used:
spark-submit --verbose \
--master k8s://$K8S_SERVER \
--deploy-mode cluster \
--name pytest \
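The command above is cut off; purely as a hypothetical completion for
context, a PySpark submission on Kubernetes typically also carries the
container image and, on GCP, an upload path (the image name, bucket and
script below are made up):

spark-submit --verbose \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name pytest \
  --conf spark.kubernetes.container.image=eu.gcr.io/myproject/spark-py:3.1.1 \
  --conf spark.kubernetes.file.upload.path=gs://mybucket/spark-upload \
  pytest.py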
for Spark to be installed on each or a few designated nodes of
Hadoop on your physical hosts on-premise.
Kindly help me, to accomplish this.
let me know, what else you need.
Thanks,
Dinakar
On Sun, Jul 25, 2021 at 10:35 PM Mich Talebzadeh
wrote:
> Hi,
>
> Right, you seem to have an on-premise setup.
Having Hadoop implies that you have a YARN resource manager already, plus
HDFS. YARN is the most widely used resource manager on-premise for Spark.
Provide some additional info and we will go from there.
HTH
user/minikube/spark-upload-065d87cf-a1ee-4448-8199-5ec018aacfde
total 16
-rw-r--r--. 1 hduser hadoop 4433 Jul 24 11:26 config.yml
Sounds like a bug.
determine_encoding
File
"/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/reader.py",
line 178, in update_raw
AttributeError: 'list' object has no attribute 'read'
Which throws an error!
I am sure there is a solution to read this yaml file inside the pod?
Thanks
ation issues that have dampened my enthusiasm, and the inevitable
question, regardless of performance: can Kubernetes be used in anger
today at industrial scale?
Regards,
Mich
It just cannot find the
modules. How can I find out if the gz file is unpacked OK?
Thanks
"jdbc:hive2://localhost:1/foundation;AuthMech=2;UseNativeQuery=0")
There is some confusion somewhere!
oreign key:28:27,
org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:329
Sounds like a mismatch between the columns in the Spark DataFrame and the
underlying table.
HTH
py", line
17, in
main()
File "/tmp/spark-c34d1329-7a5a-49a7-a1bb-1889ba5a659d/testyml.py", line
13, in main
import yaml
ModuleNotFoundError: No module named 'yaml'
Well, yaml is a bit of an issue, so I was wondering if anyone has seen
this before?
Thanks
erved word for table columns? What is your DDL for this
table?
HTH
e
> .partitionBy(partitionDate)
> .mode("overwrite")
> .format("parquet")
>
> .save(*someDirectory*)
>
>
> Now where would I add a 'prefix' in this one?
>
>
> On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh <
> mich.talebza...@gmail.
Jobs have names in Spark. You can prefix it to the file name when writing
to the directory, I guess:
val sparkConf = new SparkConf().
    setAppName(sparkAppName).
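A PySpark sketch of the same idea, assuming df is the DataFrame being
written (the app name and directory below are placeholders):

appName = "dailyLoad"
output_dir = f"/tmp/output/{appName}_transactions"  # job name prefixed onto the target directory
df.write.mode("overwrite").partitionBy("partitionDate").parquet(output_dir)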
from
val start = spark.sql("SELECT MAX(id) FROM test.randomData").collect.apply(0).getInt(0) + 1
HTH
Have you created that table in Hive, or are you trying to create it from
Spark itself?
Your Hive is local. In this case you don't need a JDBC connection. Have you
tried:
df2.write.mode("overwrite").saveAsTable("mydb.mytable")
HTH
ce degradation.
+---+---+
| id|idx|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
HTH
int(f"""wrote to DB""")
else:
print("DataFrame md is empty")
That value batchId can be used for each Batch.
Otherwise you can do this
startval = 1
df = df.withColumn('id', monotonicallyIncreasingId + startval)
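Note that monotonically_increasing_id() is guaranteed increasing and
unique, but not consecutive. If gapless ids are needed, row_number() over
a window is a common alternative; a sketch, reusing startval from above
(at the cost of pulling the data into a single partition, since the window
has no partition key):

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

w = Window.orderBy(lit(1))  # no partitionBy: single partition, gapless numbering
df = df.withColumn("id", row_number().over(w) + startval - 1)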
HTH
_path) \
.queryName("orphanedTransactions") \
.start()
And consume it somewhere else
HTH
If it is not time critical in essence, this can be done through a
scheduled job every x minutes through Airflow or something similar on the
database alone.
HTH
a query in Postgres for say transaction_id 3 but they don't exist yet?
When are they expected to arrive?
if len(df.take(1)) > 0:
    print(f"""md batchId is {batchId}""")
    df.show(100, False)
    df.persist()
    # write to BigQuery batch table
    s.writeTableToBQ(df, "append",
        config['MDVariables']['targetDataset'], config['MDVariables'][
Can you please clarify whether you are reading these two topics separately
or within the same Scala or Python script in Spark Structured Streaming?
HTH
Splendid.
Please invite me to the next meeting
mich.talebza...@gmail.com
Timezone London, UK *GMT+1*
Thanks,
chanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn
>
> -Regards
> Aditya
>
> On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh
> wrote:
>
>> Thanks Yuri. Those are very valid points.
>>
>> Let me clarify my point. Let us ass
emory 8G \
--num-executors 1 \
--master local \
--executor-cores 2 \
--conf "spark.scheduler.mode=FIFO" \
--conf "spark.ui.port=5" \
--conf spark.executor.memoryOverhead=3000 \
HTH
to come to a meaningful conclusion.
HTH
cloud buckets are independent. So that is a question we can ask the
author, who is an ex-Databricks guy.
Cheers
Thanks Aditya for the link. I will have a look.
Cheers
as they do a single function.
Cheers,
Mich
cannot see how one can pigeonhole Spark into a container and make it
performant, but I may be totally wrong.
Thanks
in/python/DSBQ/src/RandomData.py
echo `date` ", ===> Cleaning up files"
[ -f ${pyspark_venv}.tar.gz ] && rm -r -f ${pyspark_venv}.tar.gz
[ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
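For context, the tarball cleaned up above is the packed virtual
environment shipped with the job. A hedged sketch of how such an archive
is typically passed to spark-submit (paths and names are placeholders):

spark-submit \
  --archives pyspark_venv.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  your_app.py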
HTH
being processed after a delay reaches a certain time"
The only way you can do this is by allocating more resources to your
cluster at the start, so that additional capacity is available.
HTH
e/hduser/dba/bin/python/dynamic_ARRAY_generator_parquet.py
HTH
DataPy(
ID INT
, CLUSTERED INT
, SCATTERED INT
, RANDOMISED INT
, RANDOM_STRING VARCHAR(50)
, SMALL_VC VARCHAR(50)
, PADDING VARCHAR(4000)
)
STORED AS PARQUET
"""
spark.sql(sqltext)
HTH
will face will be compatibility problems between versions of Hive and
Spark.
My suggestion would be to use Spark as a massively parallel processing
engine and Hive as a storage layer. However, you need to test what can be
migrated or not.
HTH
Mich
shell itself for clarification. HTH,
Mich
r.DAGScheduler: Job 0 finished: reduce
at SparkPi.scala:38, took 0.680748 s
Pi is roughly 3.134995674978375
2021-06-28 16:48:57,873 INFO server.AbstractConnector: Stopped
Spark@a20b94b{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
HTH
sure you have some inroads/ideas
on this subject as well, then truly I guess love would be in the air for
Kubernetes <https://www.youtube.com/watch?v=NNC0kIzM1Fo>
HTH
Thanks Klaus. That will be great.
It would also be helpful if you elaborated on the need for this feature in
line with the limitations of the current batch workload.
Regards,
Mich
perceived limitation of SSS in dealing with chains of aggregation (if I am
correct, it is not yet supported on streaming datasets).
These are my opinions and they are not facts, just opinions so to speak :)
with id = 1
scratch...@orasource.mich.LOCAL> select count(1) from scratchpad.randomdata;

  COUNT(1)
----------
        10

If you repeat the Spark code again you will get a primary key constraint
violation in Oracle and rows will be rejected:
cx_Oracle.IntegrityError: ORA-00001: unique constraint
way of inserting columns directly from
Spark.
HTH
Spark dataframe to that staging table?
HTH
Mich
leName). \
You can replace *tableName* with the equivalent SQL insert statement
You will need the JDBC driver for Oracle, say ojdbc6.jar, declared in
$SPARK_HOME/conf/spark-defaults.conf:
spark.driver.extraClassPath /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar
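With the driver on the classpath, a minimal sketch of writing a DataFrame
to Oracle over JDBC (all connection details below are placeholders):

df.write \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//oracleHost:1521/ORASERVICE") \
    .option("dbtable", "scratchpad.mytable") \
    .option("user", "scratchpad") \
    .option("password", "xxxxx") \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .mode("append") \
    .save()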
HTH
Great Amit, best of luck
Cheers,
Mich
Hi Kiran,
You need kafka-clients for the version of Kafka you are using, so if it is
the correct version, keep it.
Try running and see what the error says.
HTH
o/2019/02/19/spark-ad-hoc-querying/
>
> would like to know, like this anybody else migrated..? and any challenges
> or pre-requisites to migrate (like hardware)..? any tools to evaluate before
> we migrate?
>
> *From: *M
"))
Now basic aggregation on singular columns can be done, like
avg('temperature'), max(), stddev() etc.
For cube() and rollup() I will require additional columns like location
etc. in my Kafka topic. Personally I have not tried it, but it will be
interesting to see if it works.
Have you tri
Hi,
Just to clarify
Are we talking about* rollup* as a subset of a cube that computes
hierarchical subtotals from left to right?
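That is the distinction; a small PySpark sketch of both, assuming a
hypothetical df with location, device and temperature columns:

# rollup: hierarchical subtotals (location, device), (location), grand total
df.rollup("location", "device").avg("temperature").show()
# cube: all combinations, additionally including (device) on its own
df.cube("location", "device").avg("temperature").show()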
OK you mean use spark.sql as opposed to HiveContext.sql?
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("")
replace with
spark.sql("")
?
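For reference, in Spark 2.x and later the SparkSession subsumes
HiveContext; a minimal PySpark sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hiveTest") \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("show databases").show()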
These are different things. Spark provides a computational layer and a
dialect of SQL based on Hive.
Hive is a DW on top of HDFS. What are you trying to replace?
HTH
Are you running this in Managed Instance Group (MIG)?
https://cloud.google.com/compute/docs/instance-groups
.option('checkpointLocation', checkpoint_path) \
.queryName("avgtemperature") \
.start()
Now within that checkpoint_path directory you have five sub-directories
containing all you need, including offsets:
/ssd/hduser/avgtemperature/chkpt> ls
commits  metadata  offsets  sources  state
" -> autoCommitIntervalMS
)
//val topicsSet = topics.split(",").toSet
val topicsValue = Set(topics)
val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsValue)
d
Hi,
Are you trying to read topics from Kafka in Spark 3.0.1?
Have you checked the Spark 3.0.1 documentation?
Integrating Spark with Kafka is pretty straightforward with 3.0.1 and
higher.
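A minimal sketch of reading a Kafka topic with Structured Streaming on
3.0.1+ (broker and topic names are placeholders):

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "mytopic") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

This needs the spark-sql-kafka-0-10 package on the classpath, e.g.
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1.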
HTH
thread in this forum,
"Calculate average from Spark stream".
And for the triggering mechanism you can see an example in my LinkedIn
article below:
https://www.linkedin.com/pulse/processing-change-data-capture-spark-structured-talebzadeh-ph-d-/
HTH
# check that all rows are there
if df2.subtract(read_df).count() == 0:
    print("Data has been loaded OK to BQ table")
else:
    print("Data could not be loaded to BQ table, quitting")
    sys.exit(1)
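Here read_df would be the table read back for verification; a sketch
using the BigQuery connector (the table name is a placeholder):

read_df = spark.read \
    .format("bigquery") \
    .option("table", "mydataset.mytable") \
    .load()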
HTH
do joins etc. in Spark itself.
At times it is more efficient for BigQuery to do the join itself and create
a result set table in the BigQuery dataset that you can import into Spark.
Whatever the approach, there is a solution, and as usual your mileage
varies, so to speak.
HTH
What is the output of print(rdd.toDebugString())?
I doubt this issue is caused by RDD lineage adding additional steps that
are not required.
HTH
Mich
|
|2021-05-17 19:50:00|2021-05-17 19:55:00|25.0 |
|2021-05-17 20:30:00|2021-05-17 20:35:00|25.8 |
|2021-05-17 20:10:00|2021-05-17 20:15:00|25.25 |
|2021-05-17 19:30:00|2021-05-17 19:35:00|27.0 |
|2021-05-17 20:15:00|2021-05-17 20:20:00|23.8 |
|2021-05-17 20:00:00|2021-05-17 20:05:00|24.6
Running python applications through 'pyspark' is not supported as of Spark
2.0.
Use ./bin/spark-submit <python file>
This works:
$SPARK_HOME/bin/spark-submit --master local[4]
dynamic_ARRAY_generator_parquet.py
See
https://spark.apache.org/docs/latest/submitting-applications.html
HTH
Without a partitionBy() clause, data will be skewed towards one executor:
WARN window.WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation.
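To avoid that warning, give the window a partition key; a sketch with
made-up column names:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

w = Window.partitionBy("customer_id").orderBy("event_time")
df = df.withColumn("idx", row_number().over(w))  # work is now spread across executors by key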
Cheers
be able to live with it. Let me think about what else can be done.
HTH
Mich
+---------+---------+
|amount_6m|amount_9m|
+---------+---------+
|      100|      500|
|      200|      600|
|      300|      700|
|      400|      800|
|      500|      900|
+---------+---------+
HTH
Hi Kushagra,
A bit late on this but what is the business use case for this merge?
You have two DataFrames, each with one column, and you want to UNION them
in a certain way, but the correlation is not known. In other words, this
UNION is as-is?
amount_6m | amount_9m
100
On Tue, 18 May 2021 a
uot;CAST(temperature AS
> STRING)") \
> .writeStream \
> .format("kafka") \
> .option("kafka.bootstrap.servers", "localhost:9092") \
> .option('checkpointLocation',
> "/home/kafka/Documenti/confluent/examples-6.1.0-po
Hi Giuseppe,
How have you defined your resultM above in qK?
Cheers
. \
    queryName("temperature"). \
    start()
except Exception as e:
    print(f"""{e}, quitting""")
    sys.exit(1)
#print(result.status)
#print(result.recentProgress)
+------------------------------------------+----------------+
|window                                    |avg(temperature)|
+------------------------------------------+----------------+
|{2021-05-15 22:35:00, 2021-05-15 22:40:00}|27.0            |
|{2021-05-15 22:30:00, 2021-05-15 22:35:00}|25.5            |
+------------------------------------------+----------------+
-site.xml
lrwxrwxrwx 1 hduser hadoop 50 Mar 3 08:08 hdfs-site.xml ->
/home/hduser/hadoop-3.1.0/etc/hadoop/hdfs-site.xml
lrwxrwxrwx 1 hduser hadoop 43 Mar 3 08:07 hive-site.xml ->
/data6/hduser/hive-3.0.0/conf/hive-site.xml
This works
HTH