Re: Error while merge in delta table

2023-05-11 Thread Jacek Laskowski
Hi Karthick, Sorry to say it but there's not enough "data" to help you. There should be something more above or below this exception snippet you posted that could pinpoint the root cause. Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://book

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-14 Thread Jacek Laskowski
/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L60 [4] https://github.com/apache/spark/blob/c124037b97538b2656d29ce547b2a42209a41703/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala#L24 Pozdrawiam, Jacek Laskowski "The Internals Of" Online Bo

Re: How to create spark udf use functioncatalog?

2023-04-14 Thread Jacek Laskowski
properly using the custom catalog impl. HTH Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Fri, Apr 14, 2023 at 2:10 PM 许新浩 <948718...@qq.

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Jacek Laskowski
give you that level of detail. You'd have to intercept execution events and correlate them. Not an easy task yet doable. HTH. Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitt

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
Hi, You could use QueryExecutionListener or Spark listeners to intercept query execution events and extract whatever is required. That's what web UI does (as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y ;-)). Pozdrawiam, Jacek Laskowski "The Internals Of

Re: spark.catalog.listFunctions type signatures

2023-03-28 Thread Jacek Laskowski
ub.com/apache/spark/blob/e60ce3e85081ca8bb247aeceb2681faf6a59a056/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L91 Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitt

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Jacek Laskowski
Yoohoo! Thanks Yuming for driving this release. A tiny step for Spark a huge one for my clients (who still are on 3.2.1 or even older :)) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on htt

Re: Prometheus with spark

2022-10-25 Thread Jacek Laskowski
Hi Raj, Do you want to do the following? spark.read.format("prometheus").load... I haven't heard of such a data source / format before. What would you like it for? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <

Re: query time comparison to several SQL engines

2022-04-07 Thread Jacek Laskowski
am really curious (not implying that one is better or worse than the other(s)). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklas

Re: Spark Stream on Kubernetes Cannot Set up JavaSparkContext

2021-09-05 Thread Jacek Laskowski
You should not really be doing such risky config changes (unless you've got no other choice and you know what you're doing). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://tw

Re: Spark Stream on Kubernetes Cannot Set up JavaSparkContext

2021-08-31 Thread Jacek Laskowski
ght but wanted to share as I think it's worth investigating. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On T

Re: Connection reset by peer : failed to remove cache rdd

2021-08-30 Thread Jacek Laskowski
coming from broadcast joins perhaps? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Mon, Aug 30, 2021 at 3:26

Re: Java : Testing RDD aggregateByKey

2021-08-21 Thread Jacek Laskowski
ided How are the above different from yours? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Aug 19, 2021 at 5:

Re: Is memory-only no-disk Spark possible?

2021-08-21 Thread Jacek Laskowski
Hi Bobby, What a great summary of what happens behind the scenes! Enjoyed every sentence! "The default shuffle implementation will always write out to disk." <-- that's what I wasn't sure about the most. Thanks again! /me On digging deeper... Pozdrawiam, Jacek Laskowski htt

Is memory-only no-disk Spark possible?

2021-08-20 Thread Jacek Laskowski
OOMEs). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>

Re: Java : Testing RDD aggregateByKey

2021-08-19 Thread Jacek Laskowski
Hi Pedro, No idea what might be causing it. Do you perhaps have some code to reproduce it locally? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-02 Thread Jacek Laskowski
Big shout-out to you, Dongjoon! Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jun 2, 20

Re: Updating spark-env.sh per application

2021-05-09 Thread Jacek Laskowski
Hi, The easiest (but perhaps not necessarily the most flexible) is simply to use two different versions of spark-submit script with the env var set to two different values. Have you tried it yet? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" On

Re: Spark structured streaming + offset management in kafka + kafka headers

2021-04-04 Thread Jacek Laskowski
what happens in Kafka Streams too Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sun, Apr 4, 2021 at 12:28 P

Re: Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-04 Thread Jacek Laskowski
Hi Vaquar, Thanks a lot! Accepted as the answer (yet there was the other answer that was very helpful too). Tons of reading ahead to understand it more. That once again makes me feel that Hadoop MapReduce experience would help a great deal (and I've got none). Pozdrawiam, Jacek Laskowski

Re: Source.getBatch and schema vs qe.analyzed.schema?

2021-04-03 Thread Jacek Laskowski
Hi Bartosz, This is not a question about whether the data source supports fixed or user-defined schema but what schema to use when requested for a streaming batch in Source.getBatch. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Bo

Re: Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-03 Thread Jacek Laskowski
uot;safe" and "safety" meanings. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Sat, Apr 3,

Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-03 Thread Jacek Laskowski
calls are there under the covers? How to know it for GCS? Thank you for any help you can provide. Merci beaucoup mes amis :) [1] https://stackoverflow.com/q/66933229/1305344 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://bo

Source.getBatch and schema vs qe.analyzed.schema?

2021-03-29 Thread Jacek Laskowski
e/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L61 [2] https://github.com/apache/spark/blob/053dd858d38e6107bc71e0aa3a4954291b74f8c8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L35 Pozdrawiam, Jacek Laskowski https://about.me/Jace

Re: [k8s] PersistentVolumeClaim support in 3.1.1 on minikube

2021-03-15 Thread Jacek Laskowski
Hi, I think I found it. I should be using OnDemand claim name so it gets replaced to be unique per executor (?) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jacekl

[k8s] PersistentVolumeClaim support in 3.1.1 on minikube

2021-03-15 Thread Jacek Laskowski
-kubernetes-book/demo/persistentvolumeclaims/ Please help. Thank you! Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski>

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Jacek Laskowski
Hi, On GCP I'd go for buckets in Google Storage. Not sure how reliable it is in production deployments though. Only demo experience here. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on htt

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Jacek Laskowski
Hi, > as Executors terminates after their work completes. --conf spark.kubernetes.executor.deleteOnTermination=false ? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter

Re: Spark Kubernetes 3.0.1 | podcreationTimeout not working

2021-02-10 Thread Jacek Laskowski
they're not deleted as they simply wait forever. I might be mistaken here though. What property is this for "this timeout of 60 sec."? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow

Re: Spark structured streaming - efficient way to do lots of aggregations on the same input files

2021-01-22 Thread Jacek Laskowski
Hi Filip, Care to share the code behind "The only thing I found so far involves using forEachBatch and manually updating my aggregates. "? I'm not completely sure I understand your use case and hope the code could shed more light on it. Thank you. Pozdrawiam, Jacek Laskowski

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Jacek Laskowski
Hi, Never heard of it (and have once been tasked to explore a similar use case). I'm curious how you'd like it to work? (no idea how Hive does this either) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.p

Re: Process each kafka record for structured streaming

2021-01-21 Thread Jacek Laskowski
Hi, Can you use console sink and make sure that the pipeline shows some progress? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter

Re: Application Timeout

2021-01-21 Thread Jacek Laskowski
Hi Brett, No idea why it happens, but got curious about this "Cores" column being 0. Is this always the case? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Jacek Laskowski
Hi, I'd look at stages and jobs as it's possible that the only task running is the missing one in a stage of a job. Just guessing... Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on htt

Re: Query on entrypoint.sh Kubernetes spark

2021-01-21 Thread Jacek Laskowski
as a container of a driver pod. There's no point using cluster deploy mode...ever. Makes sense? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Jacek Laskowski
Hi Marco, A Scala dev here. In short: yet another reason against Python :) Honestly, I've got no idea why the code gives the output. Ran it with 3.1.1-rc1 and got the very same results. Hoping pyspark/python devs will chime in and shed more light on this. Pozdrawiam, Jacek Laskowski https

Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Jacek Laskowski
pment IMHO). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Wed, Jan 20, 2021 at 2:44 PM Marco Firrincieli wrote:

Re: Spark Event Log Forwarding and Offset Tracking

2021-01-17 Thread Jacek Laskowski
d also forward that to ElasticSearch via log4j for monitoring Think SparkListener API would help here too. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski

Re: understanding spark shuffle file re-use better

2021-01-17 Thread Jacek Laskowski
is used to look up any cached queries. Again, I'm not really sure and if I'd have to answer it (e.g. as part of an interview) I'd say nothing would be shared / re-used. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.p

Re: Dynamic Spark metrics creation

2021-01-17 Thread Jacek Laskowski
Hey Yurii, > which is unavailable from executors. Register it on the driver and use accumulators on executors to update the values (on the driver)? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/>

Re: Insertable records in Datasource v2.

2021-01-14 Thread Jacek Laskowski
te .insertInto(sqlView) In summary, you should report this to JIRA, but don't expect this get fixed other than to catch this case just to throw this exception from ResolveRelations: Inserting into a view is not allowed" Unless I'm mistaken... Pozdrawiam, Jacek Laskowski https:

Re: Data source v2 streaming sinks does not support Update mode

2021-01-12 Thread Jacek Laskowski
Hi, Can you post the whole message? I'm trying to find what might be causing it. A small reproducible example would be of help too. Thank you. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Fo

Re: Converting spark batch to spark streaming

2021-01-08 Thread Jacek Laskowski
Hi, Start with DataStreamWriter.foreachBatch. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski> On Thu, Jan 7, 2

Re: Impact of .localCheckpoint() and executor dying

2021-01-06 Thread Jacek Laskowski
I wish myself that someone with more skills in this area chimed in... Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklaskowski&g

Re: Impact of .localCheckpoint() and executor dying

2021-01-06 Thread Jacek Laskowski
ble (as it's on a stable HDFS file system not on an ephemeral executor). In either case, the lineage should be the same = cut. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitte

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-18 Thread Jacek Laskowski
ing queries, respectively. Please note that I'm not a PMC member or even a committer so I'm speaking for myself only (not representing the project in an official way). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japil

Re: BOOK review of Spark: WARNING to spark users

2020-05-21 Thread Jacek Laskowski
Hi Emma, I'm curious about the purpose of the email. Mind elaborating? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.com/jaceklaskowski <https://twitter.com/jaceklasko

Re: Spark Window Documentation

2020-05-08 Thread Jacek Laskowski
Hi Neeraj, I'd start from "Contributing Documentation Changes" in https://spark.apache.org/contributing.html Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> Follow me on https://twitter.

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Jacek Laskowski
Hi, It's been a while since I worked with Spark Standalone, but I'd check the logs of the workers. How do you spark-submit the app? DId you check /grid/1/spark/work/driver-20200508153502-1291 directory? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of&qu

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Jacek Laskowski
g at the very least and/or use YARN as the cluster manager". Another thought was that the user code (your code) could be leaking resources so Spark eventually reports heap-related errors that may not necessarily be Spark's. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowsk

Re: Spark Window Documentation

2020-05-08 Thread Jacek Laskowski
; on twitter @ https://twitter.com/adamwathan/status/1257641015835611138. You could borrow some ideas of the docs that are claimed "the best". Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books <https://books.japila.pl/> F

Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-11 Thread Jacek Laskowski
Hi, Thanks Dongjoon Hyun for stepping up as a release manager! Much appreciated. If there's a volunteer to cut a release, I'm always to support it. In addition, the more frequent releases the better for end users so they have a choice to upgrade and have all the latest fixes or wait. It's their

Re: Change parallelism number in Spark Streaming

2019-06-27 Thread Jacek Laskowski
Hi, I've got a talk "The internals of stateful stream processing in Spark Structured Streaming" at https://dataxday.fr/ today and am going to include the tool on the slides to thank you for the work. Thanks. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internal

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jacek Laskowski
ink and DataFlow allow changing >> parallelism number, and by my knowledge of Spark Streaming, it seems it is >> also able to do that: if some “key interval” concept is used, then state >> can somehow decoupled from partition number by consistent hashing. >> >> >>

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jacek Laskowski
is > also able to do that: if some “key interval” concept is used, then state > can somehow decoupled from partition number by consistent hashing. > > > > > > Regards > > Jialei > > > > *From: *Jacek Laskowski > *Date: *Wednesday, June 26, 2019 at 11:00

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jacek Laskowski
Hi, It's not allowed to change the numer of partitions after your streaming query is started. The reason is exactly the number of state stores which is exactly the number of partitions (perhaps multiplied by the number of stateful operators). I think you'll even get a warning or an exception

Re: Spark 2.2 With Column usage

2019-06-11 Thread Jacek Laskowski
.org/docs/latest/sql-programming-guide.html and then hop onto http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package to know the Spark API better. I'm sure you'll quickly find out the answer(s). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spar

Re: Spark logging questions

2019-06-08 Thread Jacek Laskowski
Hi, What are "the spark driver and executor threads information" and "spark application logging"? Spark uses log4j so set up logging levels appropriately and you should be done. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL http

Re: Spark 2.2 With Column usage

2019-06-08 Thread Jacek Laskowski
staken, in your case, what you really need is to replace `withColumn` with `select("id")` itself and you're done. When I'm writing this (I'm saying exactly what you actually have already) and I'm feeling confused. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski The I

[SQL] Why casting string column to timestamp gives null?

2019-06-07 Thread Jacek Laskowski
t "timestamp").show ++ | ts| ++ |null| |null| ++ scala> Seq("1", "2").toDF("ts").select($"ts" cast "long").select($"ts" cast "timestamp").show +---+ | ts| +

Re: What is the difference for the following UDFs?

2019-05-14 Thread Jacek Laskowski
$"b", $"e" - $"b" + 1) as "demo").show +-+ | demo| +-+ |hello| +-+ Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Ma

Re: Spark SQL met "Block broadcast_xxx not found"

2019-05-07 Thread Jacek Laskowski
Hi, I'm curious about "I found the bug code". Can you point me at it? Thanks. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Str

Re: How to use same SparkSession in another app?

2019-04-16 Thread Jacek Laskowski
Hi, Not possible. What are you really trying to do? Why do you need to share dataframes? They're nothing but metadata of a distributed computation (no data inside) so what would be the purpose of such sharing? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL

Re: Observing DAGScheduler Log Messages

2019-04-07 Thread Jacek Laskowski
Hi, Add the following line to conf/log4j.properties and you should have all the logs: log4j.logger.org.apache.spark.scheduler.DAGScheduler=ALL Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Jacek Laskowski
Hi, Can you show the plans with explain(extended=true) for both versions? That's where I'd start to pinpoint the issue. Perhaps the underlying execution engine change to affect keyBy? Dunno and guessing... Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https

Re: structured streaming handling validation and json flattening

2019-02-11 Thread Jacek Laskowski
Hi Lian, "What have you tried?" would be a good starting point. Any help on this? How do you read the JSONs? readStream.json? You could use readStream.text followed by filter to include/exclude good/bad JSONs. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering

Re: Where is the DAG stored before catalyst gets it?

2018-10-06 Thread Jacek Laskowski
e're talking about an RDD, this abstraction is planned as a set of tasks (one per partition of the RDD). And yes, the tasks are sent out over the wire to executors. It's been like this from Spark 1.0 (and even earlier). Hope I helped a bit. Pozdrawiam, Jacek Laskowski https://about.me/JacekL

Re: Spark code to write to MySQL and Hive

2018-08-29 Thread Jacek Laskowski
that helps. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com

Re: Bug in Window Function

2018-07-25 Thread Jacek Laskowski
Hi Elior, Could you show the query that led to the exception? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly

Re: Spark 2.4 release date

2018-06-18 Thread Jacek Laskowski
Hi, What about https://issues.apache.org/jira/projects/SPARK/versions/12342385? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-28 Thread Jacek Laskowski
ming since the former uses Dataset API while the latter RDD API. Don't touch RDD API and Spark Streaming unless you know what you're doing :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bi

Re: help with streaming batch interval question needed

2018-05-25 Thread Jacek Laskowski
ort org.apache.spark.sql.streaming.Trigger df.writeStream.trigger(Trigger.ProcessingTime("1 second")) See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spa

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-13 Thread Jacek Laskowski
Hi, The exception message should be self-explanatory and says that you cannot join two streaming Datasets. This feature was added in 2.3 if I'm not mistaken. Just to be sure that you work with two streaming Datasets, can you show the query plan of the join query? Jacek On Sat, 12 May 2018,

Curious case of Spark SQL 2.3 - number of stages different for the same query ever?

2018-04-16 Thread Jacek Laskowski
that I'm 100% sure that the query is indeed the same since I'm working on a reproducible test case and only when I got it I'll really be). Sorry for the vague description, but I've got nothing more to share yet. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https

Re: Run spark 2.2 on yarn as usual java application

2018-03-19 Thread Jacek Laskowski
Hi, What's the deployment process then (if not using spark-submit)? How is the AM deployed? Why would you want to skip spark-submit? Jacek On 19 Mar 2018 00:20, "Serega Sheypak" wrote: > Hi, Is it even possible to run spark on yarn as usual java application? > I've

Re: NPE in Subexpression Elimination optimization

2018-03-18 Thread Jacek Laskowski
Hi, Filled https://issues.apache.org/jira/browse/SPARK-23731 and am working on a workaround (aka fix). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming

NPE in Subexpression Elimination optimization

2018-03-16 Thread Jacek Laskowski
) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https

Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-08 Thread Jacek Laskowski
Hi, Since I'm new to Hive, what does `stored by` do? I might help a bit in Spark if I only knew a bit about Hive :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread Jacek Laskowski
Hi, What would you expect? The data is simply dropped as that's the purpose of watermarking it. That's my understanding at least. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly

Re: spark job error

2018-01-30 Thread Jacek Laskowski
Hi, Start with spark.executor.memory 2g. You may also give spark.yarn.executor.memoryOverhead a try. See https://spark.apache.org/docs/latest/configuration.html and https://spark.apache.org/docs/latest/running-on-yarn.html for more in-depth information. Pozdrawiam, Jacek Laskowski https

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread Jacek Laskowski
t records which are lagging behind (based on event time) by a certain amount of time." Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka St

Re: Best active groups, forums or contacts for Spark ?

2018-01-26 Thread Jacek Laskowski
Hi Esa, I'd say https://stackoverflow.com/questions/tagged/apache-spark is where many active sparkians hang out :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured

Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
| id| +---+---+ | 0| 0| +---+---+ Am I missing something? When aliasing a table, use the identifier in column refs (inside). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-s

Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael, -dev +user What's the query? How do you "fool spark"? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams http

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread Jacek Laskowski
gt; Do you have any other suggestion/recommendation ? What's wrong with the current solution? I don't think you should change how you do things currently. You should just avoid collect on large datasets (which you have to do anywhere in Spark). Pozdrawiam, Jacek Laskowski https://ab

Re: Is spark-env.sh sourced by Application Master and Executor for Spark on YARN?

2018-01-03 Thread Jacek Laskowski
available on the driver in cluster deploy mode? That should give you a definitive answer (or at least get you closer). Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured

Re: How to...UNION ALL of two SELECTs over different data sources in parallel?

2017-12-17 Thread Jacek Laskowski
Thanks Silvio! In the meantime, with help of Adam and code review of WholeStageCodegenExec and CollapseCodegenStages, I found out that anything that's codegend is as fast as the tasks in a stage. In this case, union of two codegend subtrees is indeed parallel. Pozdrawiam, Jacek Laskowski

How to...UNION ALL of two SELECTs over different data sources in parallel?

2017-12-16 Thread Jacek Laskowski
at rdd at :26 [] | MapPartitionsRDD[13] at rdd at :26 [] | ParallelCollectionRDD[12] at rdd at :26 [] What am I missing and how to be certain whether and what parts of a query are going to be executed in parallel? Please help... Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski

Re: is Union or Join Supported for Spark Structured Streaming Queries in 2.2.0?

2017-12-16 Thread Jacek Laskowski
ave datasets of different schema in a query. You'd have to use the most wide schema to cover all schemas. p.s. Have you tried anything...spark-shell's your friend, my friend :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark-structu

Re: Why Spark 2.2.1 still bundles old Hive jars?

2017-12-11 Thread Jacek Laskowski
Hi, https://issues.apache.org/jira/browse/SPARK-19076 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com

Re: Infer JSON schema in structured streaming Kafka.

2017-12-11 Thread Jacek Laskowski
Hi, What about a custom streaming Sink that would stop the query after addBatch has been called? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark

Re: Infer JSON schema in structured streaming Kafka.

2017-12-10 Thread Jacek Laskowski
Hi, What about memory sink? That could work. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski

Re: pyspark + from_json(col("col_name"), schema) returns all null

2017-12-10 Thread Jacek Laskowski
Hi, Not that I'm aware of, but in your case checking out whether a JSON message fit your schema and the pipeline would've taken pyspark alone with JSONs on disk, wouldn't it? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark

Re: RDD[internalRow] -> DataSet

2017-12-09 Thread Jacek Laskowski
Hi Satyajit, That's exactly what Dataset.rdd does --> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala?utf8=%E2%9C%93#L2916-L2921 Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/sp

Re: Struct Type

2017-11-17 Thread Jacek Laskowski
Hi, Use explode function, filter operator and collect_list function. Or "heavier" flatMap. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-sp

Re: Restart Spark Streaming after deployment

2017-11-16 Thread Jacek Laskowski
Hi, You're right...killing the spark streaming job is the way to go. If a batch was completed successfully, Spark Streaming will recover from the controlled failure and start where it left off. I don't think there's other way to do it. Pozdrawiam, Jacek Laskowski https://about.me

Re: Structured Streaming and Hive

2017-09-30 Thread Jacek Laskowski
Hi, Guessing it's a timing issue. Once you started the query the batch 0 did not have rows to save or didn't start yet (it's a separate thread) and so spark.sql ran once and saved nothing. You should rather use foreach writer to save results to Hive. Jacek On 29 Sep 2017 11:36 am, "HanPan"

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread Jacek Laskowski
to, e.g. topic\d [2] [1] https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html#subscribe [2] https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html#subscribepattern Pozdrawiam, Jacek Laskowski https

Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi, Ah, right! Start the queries and once they're running, awaitTermination them. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache

Re: Structured streaming coding question

2017-09-19 Thread Jacek Laskowski
Hi, What's the code in readFromKafka to read from hello2 and hello1? Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me

  1   2   3   4   5   >