Hi Karthick,
Sorry to say it but there's not enough "data" to help you. There should be
something more above or below this exception snippet you posted that could
pinpoint the root cause.
Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://book
/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L60
[4]
https://github.com/apache/spark/blob/c124037b97538b2656d29ce547b2a42209a41703/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala#L24
Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Bo
properly using the custom catalog impl.
HTH
Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Fri, Apr 14, 2023 at 2:10 PM 许新浩 <948718...@qq.
give you that level
of detail. You'd have to intercept execution events and correlate them. Not
an easy task yet doable. HTH.
Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi,
You could use QueryExecutionListener or Spark listeners to intercept query
execution events and extract whatever is required. That's what web UI does
(as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y
;-)).
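Something along these lines (a rough sketch; the app name and the println
handlers are just placeholders):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().appName("listener-demo").getOrCreate()

spark.listenerManager.register(new QueryExecutionListener {
  // called after a query (an action, really) finishes successfully
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"$funcName took ${durationNs / 1e6} ms")
  // called when a query fails
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"$funcName failed: ${exception.getMessage}")
})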
Pozdrawiam,
Jacek Laskowski
"The Internals Of
ub.com/apache/spark/blob/e60ce3e85081ca8bb247aeceb2681faf6a59a056/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L91
Pozdrawiam,
Jacek Laskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Yoohoo! Thanks Yuming for driving this release. A tiny step for Spark, a
huge one for my clients (who are still on 3.2.1 or even older :))
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi Raj,
Do you want to do the following?
spark.read.format("prometheus").load...
I haven't heard of such a data source / format before.
What would you like it for?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <
am really curious (not implying that one is better or
worse than the other(s)).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
You should not really be doing such risky config changes (unless you've got
no other choice and you know what you're doing).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
ght but wanted to share as I think it's worth investigating.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On T
coming from broadcast joins perhaps?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Mon, Aug 30, 2021 at 3:26
ided
How are the above different from yours?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Thu, Aug 19, 2021 at 5:
Hi Bobby,
What a great summary of what happens behind the scenes! Enjoyed every
sentence!
"The default shuffle implementation will always write out to disk." <--
that's what I wasn't sure about the most. Thanks again!
/me On digging deeper...
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
OOMEs).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi Pedro,
No idea what might be causing it. Do you perhaps have some code to
reproduce it locally?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Big shout-out to you, Dongjoon! Thank you.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Wed, Jun 2, 20
Hi,
The easiest (though perhaps not the most flexible) option is simply to use
two copies of the spark-submit script with the env var set to two different
values. Have you tried it yet?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" On
what happens in Kafka Streams too
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Sun, Apr 4, 2021 at 12:28 P
Hi Vaquar,
Thanks a lot! Accepted as the answer (yet there was the other answer that
was very helpful too). Tons of reading ahead to understand it more.
That once again makes me feel that Hadoop MapReduce experience would help a
great deal (and I've got none).
Pozdrawiam,
Jacek Laskowski
Hi Bartosz,
This is not a question about whether the data source supports fixed or
user-defined schema but what schema to use when requested for a streaming
batch in Source.getBatch.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Bo
"safe" and "safety" meanings.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Sat, Apr 3,
calls are there under the
covers? How to know it for GCS?
Thank you for any help you can provide. Merci beaucoup mes amis :)
[1] https://stackoverflow.com/q/66933229/1305344
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://bo
e/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L61
[2]
https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L35
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Hi,
I think I found it. I should be using OnDemand claim name so it gets
replaced to be unique per executor (?)
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
-kubernetes-book/demo/persistentvolumeclaims/
Please help. Thank you!
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi,
On GCP I'd go for buckets in Google Storage. Not sure how reliable it is in
production deployments though. Only demo experience here.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi,
> as Executors terminates after their work completes.
--conf spark.kubernetes.executor.deleteOnTermination=false ?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
they're
not deleted as they simply wait forever. I might be mistaken here though.
What property is this for "this timeout of 60 sec."?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi Filip,
Care to share the code behind "The only thing I found so far involves using
forEachBatch and manually updating my aggregates. "?
I'm not completely sure I understand your use case and hope the code could
shed more light on it. Thank you.
Pozdrawiam,
Jacek Laskowski
Hi,
Never heard of it (and have once been tasked to explore a similar use
case). I'm curious how you'd like it to work? (no idea how Hive does this
either)
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.p
Hi,
Can you use console sink and make sure that the pipeline shows some
progress?
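A quick sketch (the rate source is just a stand-in for your own source):
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val df = spark.readStream.format("rate").load()

val query = df.writeStream
  .format("console")          // prints every micro-batch to stdout
  .option("truncate", false)  // show full column values
  .start()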
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi Brett,
No idea why it happens, but got curious about this "Cores" column being 0.
Is this always the case?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi,
I'd look at stages and jobs as it's possible that the only task running is
the missing one in a stage of a job. Just guessing...
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
as a container of a driver pod.
There's no point using cluster deploy mode...ever. Makes sense?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi Marco,
A Scala dev here.
In short: yet another reason against Python :)
Honestly, I've got no idea why the code gives this output. Ran it with
3.1.1-rc1 and got the very same results. Hoping pyspark/python devs will
chime in and shed more light on this.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
pment IMHO).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Wed, Jan 20, 2021 at 2:44 PM Marco Firrincieli
wrote:
d also forward that to ElasticSearch via log4j
for monitoring
Think SparkListener API would help here too.
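A rough sketch of the idea (the listener and what it prints are made up):
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
import org.apache.spark.sql.SparkSession

// logs job completions; a log4j appender could then ship these to ElasticSearch
class JobEndLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

val spark = SparkSession.builder().getOrCreate()
spark.sparkContext.addSparkListener(new JobEndLogger)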
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
is used to look up
any cached queries.
Again, I'm not really sure, and if I had to answer it (e.g. as part of an
interview) I'd say nothing would be shared / re-used.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.p
Hey Yurii,
> which is unavailable from executors.
Register it on the driver and use accumulators on executors to update the
values (on the driver)?
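A minimal sketch of the pattern (the accumulator name and the condition are
made up):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// registered on the driver...
val matched = sc.longAccumulator("matched")

sc.parallelize(1 to 100).foreach { n =>
  if (n % 10 == 0) matched.add(1)  // ...updated on executors...
}

println(matched.value)  // ...and read back on the driver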
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
te
.insertInto(sqlView)
In summary, you should report this to JIRA, but don't expect it to get fixed
other than by catching this case just to throw this exception
from ResolveRelations: "Inserting into a view is not allowed"
Unless I'm mistaken...
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Hi,
Can you post the whole message? I'm trying to find what might be causing
it. A small reproducible example would be of help too. Thank you.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi,
Start with DataStreamWriter.foreachBatch.
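A rough sketch (the rate source and the /tmp paths are made up):
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()
val rates = spark.readStream.format("rate").load()

val query = rates.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // every micro-batch arrives as a plain DataFrame you can write anywhere
    batch.write.mode("append").parquet(s"/tmp/out/batch-$batchId")
  }
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()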
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
On Thu, Jan 7, 2
I wish someone with more skills in this area would chime in...
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
ble (as it's
on a stable HDFS file system, not on an ephemeral executor). In either case,
the lineage should be the same = cut.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
ing queries, respectively.
Please note that I'm not a PMC member or even a committer so I'm speaking
for myself only (not representing the project in an official way).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japil
Hi Emma,
I'm curious about the purpose of the email. Mind elaborating?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi Neeraj,
I'd start from "Contributing Documentation Changes" in
https://spark.apache.org/contributing.html
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi,
It's been a while since I worked with Spark Standalone, but I'd check the
logs of the workers. How do you spark-submit the app?
Did you check the /grid/1/spark/work/driver-20200508153502-1291 directory?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of&qu
g at the very least and/or use YARN as the
cluster manager".
Another thought was that the user code (your code) could be leaking
resources so Spark eventually reports heap-related errors that may not
necessarily be Spark's.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
; on twitter @
https://twitter.com/adamwathan/status/1257641015835611138. You could borrow
some ideas from the docs that are claimed to be "the best".
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
Hi,
Thanks Dongjoon Hyun for stepping up as a release manager!
Much appreciated.
If there's a volunteer to cut a release, I'm always happy to support it.
In addition, the more frequent the releases the better for end users, as
they have a choice to upgrade and get all the latest fixes, or wait. It's their
Hi,
I've got a talk "The internals of stateful stream processing in Spark
Structured Streaming" at https://dataxday.fr/ today and am going to include
the tool on the slides to thank you for the work. Thanks.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
The Internal
ink and DataFlow allow changing
>> parallelism number, and by my knowledge of Spark Streaming, it seems it is
>> also able to do that: if some “key interval” concept is used, then state
>> can somehow decoupled from partition number by consistent hashing.
>
> Regards
>
> Jialei
>
> *From: *Jacek Laskowski
> *Date: *Wednesday, June 26, 2019 at 11:00
Hi,
It's not allowed to change the number of partitions after your streaming
query is started.
The reason is the number of state stores, which is exactly the number of
partitions (perhaps multiplied by the number of stateful operators).
I think you'll even get a warning or an exception
.org/docs/latest/sql-programming-guide.html and then hop
onto
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
to get to
know the Spark API better. I'm sure you'll quickly find out the answer(s).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
The Internals of Spar
Hi,
What are "the spark driver and executor threads information" and "spark
application logging"?
Spark uses log4j so set up logging levels appropriately and you should be
done.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
The Internals of Spark SQL http
staken, in your case, what you really need is to replace
`withColumn` with `select("id")` itself and you're done.
As I'm writing this I realize I'm saying exactly what you already have, and
I'm feeling confused.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
The I
t "timestamp").show
++
| ts|
++
|null|
|null|
++
scala> Seq("1", "2").toDF("ts").select($"ts" cast "long").select($"ts" cast
"timestamp").show
+-------------------+
|                 ts|
+-------------------+
$"b", $"e" - $"b" + 1) as "demo").show
+-----+
| demo|
+-----+
|hello|
+-----+
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Hi,
I'm curious about "I found the bug code". Can you point me at it? Thanks.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Hi,
Not possible. What are you really trying to do? Why do you need to share
dataframes? They're nothing but metadata of a distributed computation (no
data inside) so what would be the purpose of such sharing?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Hi,
Add the following line to conf/log4j.properties and you should have all the
logs:
log4j.logger.org.apache.spark.scheduler.DAGScheduler=ALL
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi,
Can you show the plans with explain(extended=true) for both versions?
That's where I'd start to pinpoint the issue. Perhaps the underlying
execution engine changed in a way that affects keyBy? Dunno and guessing...
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Hi Lian,
"What have you tried?" would be a good starting point. Any help on this?
How do you read the JSONs? readStream.json? You could use readStream.text
followed by filter to include/exclude good/bad JSONs.
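A sketch of the latter (the schema and the input directory are made up):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val schema = new StructType().add("id", LongType).add("name", StringType)

val lines = spark.readStream.text("/tmp/in")  // one raw JSON string per row
// malformed JSONs parse to null, so a simple filter separates good from bad
val good = lines
  .select(from_json($"value", schema) as "json")
  .where($"json".isNotNull)
  .select("json.*")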
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
e're talking about an RDD, this
abstraction is planned as a set of tasks (one per partition of the RDD).
And yes, the tasks are sent out over the wire to executors. It's been like
this since Spark 1.0 (and even earlier).
Hope I helped a bit.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
that helps.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
Hi Elior,
Could you show the query that led to the exception?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Hi,
What about https://issues.apache.org/jira/projects/SPARK/versions/12342385?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
ming since the
former uses Dataset API while the latter RDD API.
Don't touch the RDD API or Spark Streaming unless you know what you're doing :)
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
import org.apache.spark.sql.streaming.Trigger
df.writeStream.trigger(Trigger.ProcessingTime("1 second"))
See
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi,
The exception message should be self-explanatory and says that you cannot
join two streaming Datasets. This feature was added in 2.3 if I'm not
mistaken.
Just to be sure that you work with two streaming Datasets, can you show the
query plan of the join query?
Jacek
On Sat, 12 May 2018,
that I'm 100% sure that the query is indeed the same since
I'm working on a reproducible test case and only when I've got it will I
really be).
Sorry for the vague description, but I've got nothing more to share yet.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Hi,
What's the deployment process then (if not using spark-submit)? How is the
AM deployed? Why would you want to skip spark-submit?
Jacek
On 19 Mar 2018 00:20, "Serega Sheypak" wrote:
> Hi, Is it even possible to run spark on yarn as usual java application?
> I've
Hi,
Filed https://issues.apache.org/jira/browse/SPARK-23731 and am working on
a workaround (aka fix).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Hi,
Since I'm new to Hive, what does `stored by` do? I might be able to help a
bit on the Spark side if I only knew a bit about Hive :)
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi,
What would you expect? The data is simply dropped as that's the purpose of
watermarking it. That's my understanding at least.
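For the record, a minimal sketch of what I mean (rate source as a stand-in;
the intervals are arbitrary):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().getOrCreate()

// the rate source gives a streaming Dataset with a "timestamp" column
val events = spark.readStream.format("rate").load()

// rows lagging more than 10 minutes behind the max event time seen so far
// are considered too late and get dropped from the aggregation state
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()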
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi,
Start with spark.executor.memory 2g. You may also
give spark.yarn.executor.memoryOverhead a try.
See https://spark.apache.org/docs/latest/configuration.html and
https://spark.apache.org/docs/latest/running-on-yarn.html for more in-depth
information.
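A sketch of setting these programmatically (they can equally go on
spark-submit's --conf; the values are just examples to start from):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.memory", "2g")
  .config("spark.yarn.executor.memoryOverhead", "512")  // MiB, YARN only
  .getOrCreate()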
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
t records which are lagging behind (based on event
time) by a certain amount of time."
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Hi Esa,
I'd say https://stackoverflow.com/questions/tagged/apache-spark is where
many active sparkians hang out :)
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
| id|
+---+---+
|  0|  0|
+---+---+
Am I missing something? When aliasing a table, use the identifier in column
refs (inside).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi Michael,
-dev +user
What's the query? How do you "fool spark"?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Do you have any other suggestion/recommendation?
What's wrong with the current solution? I don't think you should change how
you do things currently. You should just avoid collect on large datasets
(which you should avoid anywhere in Spark anyway).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
available on the driver in
cluster deploy mode? That should give you a definitive answer (or at least
get you closer).
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Thanks Silvio!
In the meantime, with help of Adam and code review of WholeStageCodegenExec
and CollapseCodegenStages, I found out that anything that's codegend is as
fast as the tasks in a stage. In this case, union of two codegend subtrees
is indeed parallel.
Pozdrawiam,
Jacek Laskowski
at rdd at <console>:26 []
| MapPartitionsRDD[13] at rdd at <console>:26 []
| ParallelCollectionRDD[12] at rdd at <console>:26 []
What am I missing and how to be certain whether and what parts of a query
are going to be executed in parallel?
Please help...
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
ave datasets of different schema in a query. You'd have to use
the widest schema to cover all schemas.
p.s. Have you tried anything...spark-shell's your friend, my friend :)
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi,
https://issues.apache.org/jira/browse/SPARK-19076
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com
Hi,
What about a custom streaming Sink that would stop the query after addBatch
has been called?
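Should a custom sink feel like too much work, a simpler route with the
public API alone (rate source as a stand-in) would be:
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val df = spark.readStream.format("rate").load()

val query = df.writeStream.format("console").start()
query.processAllAvailable()  // blocks until all currently available data is processed
query.stop()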
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Hi,
What about memory sink? That could work.
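A sketch (the query name is made up; rate source as a stand-in):
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val df = spark.readStream.format("rate").load()

// results land in an in-memory table named by queryName
val query = df.writeStream
  .format("memory")
  .queryName("demo")
  .outputMode("append")
  .start()

spark.sql("select * from demo").show()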
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
Hi,
Not that I'm aware of, but in your case checking whether a JSON message
fits your schema and the pipeline works could've been done with pyspark
alone and JSONs on disk, couldn't it?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi Satyajit,
That's exactly what Dataset.rdd does -->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala?utf8=%E2%9C%93#L2916-L2921
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Hi,
Use explode function, filter operator and collect_list function.
Or "heavier" flatMap.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Hi,
You're right...killing the spark streaming job is the way to go. If a batch
was completed successfully, Spark Streaming will recover from the
controlled failure and start where it left off. I don't think there's any
other way to do it.
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Hi,
Guessing it's a timing issue. Once you started the query, batch 0 either
had no rows to save or hadn't started yet (it runs on a separate thread),
so spark.sql ran once and saved nothing.
You should rather use a foreach writer to save results to Hive.
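If it helps, the rough shape of such a writer (the Hive-specific logic is
left out):
import org.apache.spark.sql.{ForeachWriter, Row}

class HiveWriter extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true  // e.g. open a connection
  override def process(row: Row): Unit = { /* write the row out */ }
  override def close(errorOrNull: Throwable): Unit = ()                // release resources
}

// df.writeStream.foreach(new HiveWriter).start()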
Jacek
On 29 Sep 2017 11:36 am, "HanPan"
to, e.g. topic\d [2]
[1]
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html#subscribe
[2]
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html#subscribepattern
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Hi,
Ah, right! Start the queries and once they're running, awaitTermination
them.
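Something like this (the sinks and paths are made up; rate sources as
stand-ins):
val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val df1 = spark.readStream.format("rate").load()
val df2 = spark.readStream.format("rate").load()

// start both queries first...
val q1 = df1.writeStream.format("console").start()
val q2 = df2.writeStream
  .format("parquet")
  .option("path", "/tmp/out")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()

// ...and only then block on them
q1.awaitTermination()
q2.awaitTermination()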
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Hi,
What's the code in readFromKafka to read from hello2 and hello1?
Pozdrawiam,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming (Apache Spark 2.2+)
https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me