Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(Removing dev@ as I don't think this is a dev@-related thread; it's more of a question.) My understanding is that Apache Spark does not support Materialized Views. That's all. IMHO it's not reasonable to expect that every operation in Apache Hive will be supported in Apache Spark. They are different

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
see. > > Perhaps we can solve this confusion by sharing the same file `version.json` > across `all versions` in the `Spark website repo`? Make each version of > the document display the `same` data in the dropdown menu. > -- > *From:* Jungtaek Lim > *Sent:* 2

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
possible. > > Only by sharing the same version.json file in each version. > ------ > *From:* Jungtaek Lim > *Sent:* March 5, 2024, 16:47:30 > *To:* Pan,Bingkun > *Cc:* Dongjoon Hyun; dev; user > *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released &

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
https://github.com/apache/spark/pull/42881 > > So, we need to manually update this file. I can manually submit an update > first to get this feature working. > ------ > *From:* Jungtaek Lim > *Sent:* March 4, 2024, 6:34:42 > *To:* Dongjoon Hyun > *Cc:* dev;

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
ull/42428? > > cc @Yang,Jie(INF) > > On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim > wrote: > >> Shall we revisit this functionality? The API doc is built with individual >> versions, and for each individual version we depend on other released >> versions. Thi

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
? What are the criteria for pruning versions? Unless we have good answers to these questions, I think it's better to revert the functionality - it missed various considerations. On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim wrote: > Thanks for reporting - this is odd - the dropdown did not ex

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
to update the version. (For automatic bumping I don't have a good idea.) I'll look into it. Please expect some delay during the holiday weekend in S. Korea. Thanks again. Jungtaek Lim (HeartSaVioR) On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun wrote: > BTW, Jungtaek. > > PySpark document se

[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
you. Jungtaek Lim P.S. Yikun is helping us with releasing the official Docker image for Spark 3.5.1 (thanks Yikun!). It may take some time to be generally available.

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Jungtaek Lim
Hi, a streaming query clones the Spark session - when you create a temp view from a DataFrame, the temp view is created under the cloned session. You will need to use micro_batch_df.sparkSession to access the cloned session. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Jan 31, 2024 at 3:29 PM
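As an illustrative sketch (not from the original thread), the pattern above might look like this in PySpark inside foreachBatch; the view name and query are assumptions:

    def process_batch(micro_batch_df, batch_id):
        # the DataFrame belongs to the cloned session, so register and query the view there
        session = micro_batch_df.sparkSession
        micro_batch_df.createOrReplaceTempView("events")          # hypothetical view name
        session.sql("SELECT count(*) AS cnt FROM events").show()  # uses the cloned session, not the original `spark`

    query = streaming_df.writeStream.foreachBatch(process_batch).start()  # streaming_df is assumed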

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Jungtaek Lim
If you use the RocksDB state store provider, you can turn on changelog checkpointing to upload a single changelog file per partition per batch. With changelog checkpointing disabled, Spark uploads the newly created SST files and some log files. If compaction has happened, most SST files have to be re-uploaded.
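A minimal sketch of the relevant configs, assuming a Spark release where RocksDB changelog checkpointing is available (3.4+); please verify the exact config names against the version you run:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.streaming.stateStore.providerClass",
                     "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
             # assumed config name for changelog checkpointing; double-check in the SS guide
             .config("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")
             .getOrCreate())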

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-11 Thread Jungtaek Lim
wed more late events to be accepted. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Jan 11, 2024 at 6:13 AM Andrzej Zera wrote: > I'm struggling with the following issue in Spark >=3.4, related to > multiple stateful operations. > > When spark.sql.streaming.statef

Re: Spark structured streaming tab is missing from spark web UI

2023-11-24 Thread Jungtaek Lim
The feature was added in Spark 3.0. Btw, you may want to check out the EOL date for Apache Spark releases - https://endoflife.date/apache-spark 2.x is already EOLed. On Fri, Nov 24, 2023 at 11:13 PM mallesh j wrote: > Hi Team, > > I am trying to test the performance of a spark streaming

Re: How exactly does dropDuplicatesWithinWatermark work?

2023-11-21 Thread Jungtaek Lim
is new API will ensure that these duplicated writes are deduplicated once users provide the max distance of time (max - min) among duplicated events as delay threshold of watermark. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Nov 20, 2023 at 10:18 AM Perfect Stranger wrote: > Hello, I

Re: [Structured Streaming] Joins after aggregation don't work in streaming

2023-10-26 Thread Jungtaek Lim
Hi, we are aware of your ticket and plan to look into it. We can't say about ETA but just wanted to let you know that we are going to look into it. Thanks for reporting! Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Oct 27, 2023 at 5:22 AM Andrzej Zera wrote: > Hey All, > > I

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
be there. On Tue, Apr 4, 2023 at 7:04 AM Jungtaek Lim wrote: > The number of subscribers doesn't give any meaningful value. Please look > into the number of mails being sent to the list. > > https://lists.apache.org/list.html?user@spark.apache.org > The latest month there were more than 2

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
subscribers. > > May I ask if the users prefer to use the ASF Official Slack channel > than the user mailing list? > > Dongjoon. > > > > On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim > wrote: > >> I'm reading through the page "Briefing: The Apache Way"

Re: Slack for PySpark users

2023-03-30 Thread Jungtaek Lim
I'm reading through the page "Briefing: The Apache Way", and in the section of "Open Communications", restriction of communication inside ASF INFRA (mailing list) is more about code and decision-making. https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define It's

Re: Create a Jira account

2022-12-01 Thread Jungtaek Lim
There is a guide in the page - send the request mail to priv...@spark.apache.org. On Thu, Dec 1, 2022 at 10:07 PM ideal wrote: > hello > i need to open a Jira ticket for spark about thrift server operation > log output is empty. but i do not have an ASF Jira account. recently Infra > ended

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Jungtaek Lim
Thanks Chao for driving the release! On Wed, Nov 30, 2022 at 6:03 PM Wenchen Fan wrote: > Thanks, Chao! > > On Wed, Nov 30, 2022 at 1:33 AM Chao Sun wrote: > >> We are happy to announce the availability of Apache Spark 3.2.3! >> >> Spark 3.2.3 is a maintenance release containing stability

Re: Creating a Spark 3 Connector

2022-11-23 Thread Jungtaek Lim
tors. It's encouraged to look at reference implementations like Kafka and understand interfaces. Each interface has its own documentation so it will guide you to implement your own. Please post any question on dev@ mailing list if you have doubts or are stuck with implementing it. Thanks, Ju

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Jungtaek Lim
I have no context on ML, but your "streaming" query exposes the possibility of memory issues. flattenedNER.registerTempTable("df") >>> >>> >>> querySelect = "SELECT col as entity, avg(sentiment) as sentiment, >>> count(col) as count FROM df GROUP BY col" >>> finalDF =

Re: DataStreamReader cleanSource option

2022-02-03 Thread Jungtaek Lim
ads than 1.) If it doesn't help, please turn on the DEBUG log level for the package "org.apache.spark.sql.execution.streaming" and grep the log messages from SourceFileArchiver & SourceFileRemover. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Jan 27, 2022 at 9:56 PM Gabriela Dvořáková wrote:

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Jungtaek Lim
Thanks to Gengliang for driving this huge release! On Wed, Oct 20, 2021 at 1:50 AM Dongjoon Hyun wrote: > Thank you so much, Gengliang and all! > > Dongjoon. > > On Tue, Oct 19, 2021 at 8:48 AM Xiao Li wrote: > >> Thank you, Gengliang! >> >> Congrats to our community and all the contributors!

Re: Appending a static dataframe to a stream create Parquet file fails

2021-09-06 Thread Jungtaek Lim
a static dataframe. > > Anyhow, best regards > Eugen > > On Fri, 2021-09-03 at 11:44 +0900, Jungtaek Lim wrote: > > Hi, > > The file stream sink maintains the metadata in the output directory. The > metadata retains the list of files written by the streaming query, and

Re: Appending a static dataframe to a stream create Parquet file fails

2021-09-02 Thread Jungtaek Lim
will only read the files which are written from the streaming query. There are 3rd party projects dealing with transactional write from multiple writes, (alphabetically) Apache Iceberg, Delta Lake, and so on. You may want to check them out. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Sep 2, 2021 at 10:04

Re: How to generate unique incrementing identifier in a structured streaming dataframe

2021-07-13 Thread Jungtaek Lim
Theoretically, the composed value of batchId + monotonically_increasing_id() would achieve the goal. The major downside is that you'll need to deal with "deduplication" of output based on batchID as monotonically_increasing_id() is nondeterministic. You need to ensure there's NO overlap on output
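A hedged sketch of that composition inside foreachBatch (the output path and column names are made up for illustration); the per-row ids are nondeterministic across retries, so downstream consumers must deduplicate per batch_id:

    from pyspark.sql.functions import lit, monotonically_increasing_id

    def write_batch(micro_batch_df, batch_id):
        out = (micro_batch_df
               .withColumn("batch_id", lit(batch_id))
               .withColumn("row_id", monotonically_increasing_id()))
        out.write.mode("append").parquet("/tmp/output")  # hypothetical sink path

    query = streaming_df.writeStream.foreachBatch(write_batch).start()  # streaming_df is assumed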

Re: ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception

2021-03-18 Thread Jungtaek Lim
We've fixed the single case for "onJobStart", please check SPARK-34731 [1]. The patch will be available in Spark 3.1.2 / 3.2.0, but if someone reports the same for lower version lines I think we could port back to lower version lines as well. 1. https://issues.apache.org/jira/browse/SPARK-34731

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Jungtaek Lim
), but it's worth knowing in any case that it's not officially supported by the Hadoop community. On Wed, Mar 17, 2021 at 6:54 AM Jungtaek Lim wrote: > Hadoop 2.x doesn't support JDK 11. See Hadoop Java version compatibility > with JDK: > > https://cwiki.apache.org/confluence/display/HADOOP

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Jungtaek Lim
Hadoop 2.x doesn't support JDK 11. See Hadoop Java version compatibility with JDK: https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions That said, you'll need to use Spark 3.x with Hadoop 3.1 profile to make Spark work with JDK 11. On Tue, Mar 16, 2021 at 10:06 PM Sean Owen

Re: Spark Structured Streaming and Kafka message schema evolution

2021-03-15 Thread Jungtaek Lim
integrated, and the latest schema is "backward compatible" with the integrated schema. Hope this helps. Thanks Jungtaek Lim (HeartSaVioR) On Mon, Mar 15, 2021 at 9:25 PM Mich Talebzadeh wrote: > This is just a query. > > In general Kafka-connect requires means to register that schema such t

Re: Using Spark as a fail-over platform for Java app

2021-03-12 Thread Jungtaek Lim
That's what resource managers provide to you. So you can code against and deal with resource managers, but I assume you're looking for ways to avoid dealing with resource managers directly and to let Spark do it instead. I admit I have no experience (I did something similar with Apache Storm on a standalone setup 5+ years

Re: Detecting latecomer events in Spark structured streaming

2021-03-11 Thread Jungtaek Lim
omes to the conclusion that it's worth putting in the effort. If your business logic requires it, you could be a hacker and try to deal with this, and share if you succeed in making it work.) I'd skip answering the questions, as I explained you'd be stuck even before raising them. Hope this helps. Thanks, Jungt

Re: FlatMapGroupsWithStateFunction is called thrice - Production use case.

2021-03-11 Thread Jungtaek Lim
that others aren't interested in your own code even if they are interested in the problematic behavior itself. It'd be nice if you could minimize the hurdles to debugging. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Mar 11, 2021 at 4:54 PM Kuttaiah Robin wrote: > Hello, > > I have a use cas

Re: Spark job crashing - Spark Structured Streaming with Kafka

2021-03-03 Thread Jungtaek Lim
c480-4d00-866b-0fbd88e9520e, runId = > 8f1f1756-da8d-4983-9f76-dc1af626ad84] > Current Committed Offsets: {} > Current Available Offsets: {KafkaV2[Subscribe[test-topic]]: > {"test-topic":{"0":4628}}} > Current State: ACTIVE > Thread State: RUNNABLE > Logical Plan: > WriteToMicro

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Jungtaek Lim
Thanks Hyukjin for driving the huge release, and thanks everyone for contributing to the release! On Wed, Mar 3, 2021 at 6:54 PM angers zhu wrote: > Great work, Hyukjin ! > > Bests, > Angers > > On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan wrote: > >> Great work and congrats! >> >> On Wed, Mar 3, 2021 at 3:51 PM

Re: Spark job crashing - Spark Structured Streaming with Kafka

2021-03-02 Thread Jungtaek Lim
I feel this quite lacks information. Full stack traces from driver/executors are essential at least to determine what was happening. On Tue, Mar 2, 2021 at 5:26 PM Sachit Murarka wrote: > Hi All, > > My spark job is crashing (Structured stream) . Can anyone help please. I > am using spark 3.0.1

Re: Spark 2.3 Stream-Stream Join with left outer join lost left stream value

2021-02-27 Thread Jungtaek Lim
We found an edge case in stream-stream left/right outer joins in Spark 2.x and fixed it in Spark 3.0.0. Please refer to SPARK-26154 for more details. The fix brought another regression which was fixed in 3.0.1, so you may want to move to Spark

Re: Structured streaming, Writing Kafka topic to BigQuery table, throws error

2021-02-23 Thread Jungtaek Lim
If your code doesn't require "end to end exactly-once" then you could leverage foreachBatch, which enables you to use a batch sink. If your code requires "end to end exactly-once", then well, that's a different story. I'm not familiar with BigQuery and have no idea how the sink is implemented,
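A hedged sketch of the foreachBatch route; the "bigquery" format name, table, and options are assumptions about a third-party batch connector, and this path is not end-to-end exactly-once by itself:

    def write_to_bigquery(micro_batch_df, batch_id):
        (micro_batch_df.write
            .format("bigquery")                      # assumed connector name
            .option("table", "my_dataset.my_table")  # hypothetical target table
            .mode("append")
            .save())

    query = streaming_df.writeStream.foreachBatch(write_to_bigquery).start()  # streaming_df is assumed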

Re: Controlling Spark StateStore retention

2021-02-20 Thread Jungtaek Lim
n upcoming Spark 3.1, the query having such a pattern is disallowed unless end users set the config explicitly to force run. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Sun, Feb 21, 2021 at 8:49 AM Sergey Oboguev wrote: > I am trying to write a Spark Structured Streaming applicati

Re: [Spark SQL] - Not able to consume Kafka topics

2021-02-18 Thread Jungtaek Lim
(Dropping the Kafka user mailing list as this is more likely a Spark issue.) Do you have a full stack trace for the log message? It would help to make clear where the issue lies. On Thu, Feb 18, 2021 at 8:01 PM Rathore, Yashasvini wrote: > Hello, > > Issues : > > * I and my team are trying to

Re: KafkaUtils module not found on spark 3 pyspark

2021-02-17 Thread Jungtaek Lim
to maintain DStream. Honestly, contributions on DStream have been quite rare.) Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Feb 17, 2021 at 4:19 PM aupres wrote: > I use hadoop 3.3.0 and spark 3.0.1-bin-hadoop3.2. And my python ide is > eclipse version 2020-12. I try to develop

Re: How to handle spark state which is growing too big even with timeout set.

2021-02-14 Thread Jungtaek Lim
rs migrate their state from state store provider A to B. The hopeful plan is to support any arbitrary providers between the two. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Feb 11, 2021 at 5:01 PM Kuttaiah Robin wrote: > Hello, > > I have a use case where I need to read events(non corre

Re: Structured Streaming Spark 3.0.1

2021-01-21 Thread Jungtaek Lim
Looks like it's a driver-side error log, and I think the executor log would have many more warning/error logs, probably with stack traces. I'd also suggest excluding external dependencies wherever possible while experimenting/investigating. If you're suspecting Apache Spark I'd rather say you'll

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Jungtaek Lim
I'm not sure how many people could even guess possible reasons - I'd say there's not enough information. No driver/executor logs, no job/stage/executor information, no code. On Thu, Jan 21, 2021 at 8:25 PM Jacek Laskowski wrote: > Hi, > > I'd look at stages and jobs as it's possible that the

Re: Structured Streaming Spark 3.0.1

2021-01-20 Thread Jungtaek Lim
I quickly looked into the attached log in the SO post, and the problem doesn't seem to be related to Kafka. The error stack trace is from checkpointing to GCS, and the implementation of OutputStream for GCS seems to be provided by Google. Could you please elaborate on the stack trace or upload the log

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Jungtaek Lim
eleasing Spark 3.0.0. Clearing Spark artifacts in your .m2 cache may work. On Tue, Jan 19, 2021 at 8:50 AM Jungtaek Lim wrote: > And also include some test data as well. I quickly looked through the code > and the code may require a specific format of the record. > > On Tue, Jan 19,

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Jungtaek Lim
>>> spark-sql-kafka-0-10_${scala.binary.version} >>>> ${spark.version} >>>> >>>> >>>> >>>> org.slf4j >>>> slf4j-log4j12 >>>> 1.7.7 >>>> runtime

Re: Data source v2 streaming sinks does not support Update mode

2021-01-12 Thread Jungtaek Lim
1.0. > > On Wed, Jan 13, 2021 at 11:35 AM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Which exact Spark version did you use? Did you make sure the version for >> Spark and the version for spark-sql-kafka artifact are the same? (I asked >> this be

Re: Data source v2 streaming sinks does not support Update mode

2021-01-12 Thread Jungtaek Lim
Which exact Spark version did you use? Did you make sure the version for Spark and the version for spark-sql-kafka artifact are the same? (I asked this because you've said you've used Spark 3.0 but spark-sql-kafka dependency pointed to 3.1.0.) On Tue, Jan 12, 2021 at 11:04 PM Eric Beabes wrote:

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread Jungtaek Lim
Please refer my previous answer - https://lists.apache.org/thread.html/r7dfc9e47cd9651fb974f97dde756013fd0b90e49d4f6382d7a3d68f7%40%3Cuser.spark.apache.org%3E Probably we may want to add it in the SS guide doc. We didn't need it as it just didn't work with eventually consistent model, and now it

Re: Structured Streaming Checkpoint Error

2020-12-02 Thread Jungtaek Lim
In theory it would work, but checkpointing becomes very inefficient. If I understand correctly, it will write the content to a temp file on S3, and then rename the file, which actually fetches the temp file from S3 and writes its content to the final path on S3. Compared to checkpoint with

Re: Excessive disk IO with Spark structured streaming

2020-11-05 Thread Jungtaek Lim
FYI, SPARK-30294 is merged and will be available in Spark 3.1.0. This reduces the number of temp files for the state store to half when you use streaming aggregation. 1. https://issues.apache.org/jira/browse/SPARK-30294 On Thu, Oct 8, 2020 at 11:55 AM Jungtaek Lim wrote: > I can't spend

Re: Best way to emit custom metrics to Prometheus in spark structured streaming

2020-11-02 Thread Jungtaek Lim
You can try out "Dataset.observe" added in Spark 3, which enables arbitrary metrics to be logged and exposed to streaming query listeners. On Tue, Nov 3, 2020 at 3:25 AM meetwes wrote: > Hi I am looking for the right approach to emit custom metrics for spark > structured streaming job. *Actual
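A minimal sketch of Dataset.observe in PySpark (the metric name is an assumption; note the Python observe API and streaming listener support arrived in later 3.x releases than the Scala API):

    from pyspark.sql.functions import count, lit

    observed = streaming_df.observe("my_metrics", count(lit(1)).alias("rows"))  # streaming_df is assumed
    query = observed.writeStream.format("console").start()
    # per micro-batch, the observed values surface in the query progress, e.g.
    # query.lastProgress["observedMetrics"], and in StreamingQueryListener events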

Re: Cannot perform operation after producer has been closed

2020-11-02 Thread Jungtaek Lim
Which Spark version do you use? There's a known issue with the Kafka producer pool in Spark 2.x which was fixed in Spark 3.0, so you may want to check whether your case is bound to the known issue or not. https://issues.apache.org/jira/browse/SPARK-21869 On Tue, Nov 3, 2020 at 1:53 AM Eric Beabes

Re: States get dropped in Structured Streaming

2020-10-23 Thread Jungtaek Lim
Unfortunately your information doesn't provide any hint as to whether rows in the state are evicted correctly on watermark advance, or whether there's an unknown bug where some of the rows in state are silently dropped. I haven't heard of a case of the latter - you'd probably want to double-check it with

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Jungtaek Lim
I can't spend too much time on explaining one by one. I strongly encourage you to do a deep-dive instead of just looking around as you want to know about "details" - that's how open source works. I'll go through a general explanation instead of replying inline; probably I'd write a blog doc if

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
and in Python: > > from gresearch.spark.dgraph.connector import * > triples = spark.read.dgraph.triples("localhost:9080") > > I agree that 3rd parties should also support the official > spark.read.format() and the new catalog approaches. > > Enrico > > On 05.10.20 at 14:03,

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
edge of a previously found utility location, and > repeats the search from the very start causing useless file system search > operations over and over again. > > This may or may not matter when HDFS is used for checkpoint store > (depending on how HDFS server implements the calls), but it does mat

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
checkpoint works in SS, the temp file is atomically renamed to be the final file), and as a workaround (SPARK-28025 [3]) Spark tries to delete the crc file, which means two additional operations (exists -> delete) may occur per crc file. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) 1. ht

Re: Arbitrary stateful aggregation: updating state without setting timeout

2020-10-05 Thread Jungtaek Lim
ove state for the group (key). Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Oct 5, 2020 at 6:16 PM Yuri Oleynikov <yur...@gmail.com> wrote: > Hi all, I have following question: > > What happens to the state (in terms of expiration) if I’m updating the

Re: [Spark Core] Why no spark.read.delta / df.write.delta?

2020-10-05 Thread Jungtaek Lim
andra connector leverages it. I see some external data sources starting to support catalog, and in Spark itself there's some effort to support catalog for JDBC. https://databricks.com/fr/session_na20/datasource-v2-and-cassandra-a-whole-new-world Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR

Re: Query around Spark Checkpoints

2020-09-29 Thread Jungtaek Lim
a workaround then probably something is going wrong. Please feel free to share it. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Sep 30, 2020 at 1:14 AM, Bryan Jeffrey wrote: > Jungtaek, > > How would you contrast stateful streaming with checkpoint vs. the idea of > writing updates to a Delt

Re: Query around Spark Checkpoints

2020-09-27 Thread Jungtaek Lim
it, and evaluate whether your target storage can fulfill the requirement. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Sep 28, 2020 at 3:04 AM Amit Joshi wrote: > Hi, > > As far as I know, it depends on whether you are using spark streaming or > structured streaming. > In spark streamin

Re: Elastic Search sink showing -1 for numOutputRows

2020-09-07 Thread Jungtaek Lim
I don't know about the ES sink. The availability of "numOutputRows" depends on the API version the sink implements (DSv1 vs DSv2), so you may be better off asking the author of the ES sink to confirm the case. On Tue, Sep 8, 2020 at 5:15 AM jainshasha wrote: > Hi, > > Using structured

Re: Keeping track of how long something has been in a queue

2020-09-07 Thread Jungtaek Lim
te. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Sep 4, 2020 at 11:21 PM Hamish Whittal wrote: > Sorry, I moved a paragraph, > > (2) If Ms green.th was first seen at 13:04:04, then at 13:04:05 and >> finally at 13:04:17, she's been in the queue for 13 seconds (ignoring the >> ms). >> >

Re: [Spark Kafka Structured Streaming] Adding partition and topic to the kafka dynamically

2020-08-28 Thread Jungtaek Lim
Hi Amit, if I remember correctly, you don't need to restart the query to reflect the newly added topic and partition, if your subscription covers the topic (like subscribe pattern). Please try it out. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Fri, Aug 28, 2020 at 1:56 PM Amit
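A sketch of the subscribe-pattern approach mentioned above (bootstrap servers and the topic pattern are placeholders):

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "host:9092")
              # topics matching the pattern, including ones created later, are picked up
              .option("subscribePattern", "events-.*")
              .load())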

Re: Structured Streaming metric for count of delayed/late data

2020-08-22 Thread Jungtaek Lim
s there any workaround for this limitation of inaccurate count, maybe by > adding some additional streaming operation in SS job without impacting perf > too much ? > > > > Regards, > > Rajat > > > > *From: *Jungtaek Lim > *Date: *Friday, 21 August 2020 at 12:07 PM >

Re: Structured Streaming metric for count of delayed/late data

2020-08-21 Thread Jungtaek Lim
One more thing to say: unfortunately, the number is not accurate compared to the input rows on streaming aggregation, because Spark does local aggregation and counts dropped inputs based on "pre-locally-aggregated" rows. You may want to treat the number as an indicator of whether dropping inputs is happening or

Re: [SPARK-STRUCTURED-STREAMING] IllegalStateException: Race while writing batch 4

2020-08-12 Thread Jungtaek Lim
the table in good shape) Thanks, Jungtaek Lim (HeartSaVioR) On Sat, Aug 8, 2020 at 4:19 AM Amit Joshi wrote: > Hi, > > I have 2spark structure streaming queries writing to the same outpath in > object storage. > Once in a while I am getting the "IllegalStateException: Race w

Re: Lazy Spark Structured Streaming

2020-08-02 Thread Jungtaek Lim
; refer if not fixing > the "need to add a dummy record to move watermark forward"? > > Kind regards, > > Phillip > > > > > On Mon, Jul 27, 2020 at 11:41 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> I'm not sure what exac

Re: Pyspark: Issue using sql in foreachBatch sink

2020-07-31 Thread Jungtaek Lim
Python doesn't allow abbreviating () with no param, whereas Scala does. Use `write()`, not `write`. On Wed, Jul 29, 2020 at 9:09 AM muru wrote: > In a pyspark SS job, trying to use sql instead of sql functions in > foreachBatch sink > throws AttributeError: 'JavaMember' object has no attribute

Re: Lazy Spark Structured Streaming

2020-07-27 Thread Jungtaek Lim
d compared to OutputMode.Update. Unfortunately there's no mechanism in SSS to move only the watermark forward without actual input, so if you want to test some behavior with OutputMode.Append you would need to add a dummy record to move the watermark forward. Hope this helps. Thanks, Jungtaek Lim (H

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Jungtaek Lim
Please provide logs and dump file for the OOM case - otherwise no one could say what's the cause. Add JVM options to driver/executor => -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="...dir..." On Sun, Jul 19, 2020 at 6:56 PM Rachana Srivastava wrote: > *Issue:* I am trying to process 5000+
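A sketch of wiring those JVM options in through Spark configs (the dump directory is a placeholder); note they must be in place before the JVM starts, so for the driver prefer passing them via spark-submit --conf or spark-defaults.conf:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.extraJavaOptions",
                     "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps")
             .config("spark.driver.extraJavaOptions",
                     "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps")
             .getOrCreate())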

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread Jungtaek Lim
me. > > Thanks, > Asmath > > On Sun, Jul 5, 2020 at 6:22 PM Jungtaek Lim > wrote: > >> There're sections in SS programming guide which exactly answer these >> questions: >> >> >> http://spark.apache.org/docs/latest/structured-streaming-programming-guid

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread Jungtaek Lim
in point of Kafka's view, especially the gap between highest offset and committed offset. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Jul 6, 2020 at 2:53 AM Gabor Somogyi wrote: > In 3.0 the community just added it. > > On Sun, 5 Jul 2020, 14:28 KhajaAsmath Mohammed,

Re: Spark streaming with Kafka

2020-07-02 Thread Jungtaek Lim
I can't reproduce. Could you please make sure you're running spark-shell with official spark 3.0.0 distribution? Please try out changing the directory and using relative path like "./spark-shell". On Thu, Jul 2, 2020 at 9:59 PM dwgw wrote: > Hi > I am trying to stream kafka topic from spark

Re: Failure Threshold in Spark Structured Streaming?

2020-07-02 Thread Jungtaek Lim
Structured Streaming basically follows SQL semantics, which have no notion of a "max allowance of failures". If you'd like to tolerate malformed data, please read it in a raw format (string or binary), which won't fail on such data, and try converting it. e.g. from_json() will produce
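A hedged sketch of that pattern against a Kafka source (topic, servers, and the schema are placeholders): read the value as a raw string, then convert with from_json(), which yields null for malformed records instead of failing the query.

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    schema = StructType([StructField("id", LongType()), StructField("name", StringType())])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "host:9092")
           .option("subscribe", "some-topic")
           .load())

    parsed = raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    malformed = parsed.where(col("data").isNull())  # route or count bad records here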

Re: REST Structured Steaming Sink

2020-07-01 Thread Jungtaek Lim
would be simply implementing your own with foreachBatch, but someone might jump in and provide a pointer if there is something in the Spark ecosystem. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin wrote: > Hi All, > > > We ingest alot of restful APIs i

Re: Question about 'maxOffsetsPerTrigger'

2020-06-30 Thread Jungtaek Lim
As Spark uses micro-batches for streaming, it's unavoidable to tune the batch size to meet your expectations of throughput vs latency. In particular, since Spark uses a global watermark which doesn't propagate (change) during a micro-batch, you'd want to make the batch relatively small to make
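A sketch of capping the per-micro-batch input from Kafka with maxOffsetsPerTrigger (servers, topic, and the limit are placeholders to tune against your own throughput/latency target):

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "host:9092")
              .option("subscribe", "some-topic")
              # upper bound on offsets consumed across all partitions per micro-batch
              .option("maxOffsetsPerTrigger", "10000")
              .load())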

Re: [Structured spak streaming] How does cassandra connector readstream deals with deleted record

2020-06-26 Thread Jungtaek Lim
I'm not sure how it is implemented, but in general I wouldn't expect such behavior from connectors which read from storage in a non-streaming fashion. The query result may depend on "when" the records are fetched. If you need to reflect the changes in your query you'll probably want to find a way

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Jungtaek Lim
Great, thanks all for your efforts on the huge step forward! On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon wrote: > Yay! > > On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan wrote: > >> Great job everyone ! Congratulations :-) >> >> Regards, >> Mridul >> >> On Thu, Jun 18, 2020 at 10:21 AM Reynold

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Jungtaek Lim
Just in case if anyone prefers ASF projects then there are other alternative projects in ASF as well, alphabetically, Apache Hudi [1] and Apache Iceberg [2]. Both are recently graduated as top level projects. (DISCLAIMER: I'm not involved in both.) BTW it would be nice if we make the metadata

Re: Structured Streaming using File Source - How to handle live files

2020-06-07 Thread Jungtaek Lim
how many lines/bytes it reads "per file"; even in the bad case it may read an incomplete line (if the writer doesn't guarantee that) and crash or produce incorrect results. Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Mon, Jun 8, 2020 at 2:43 AM ArtemisDev wrote: > We were

Re: RecordTooLargeException in Spark *Structured* Streaming

2020-05-25 Thread Jungtaek Lim
igurations Hope this helps. Thanks, Jungtaek Lim (HeartSaVioR) On Tue, May 26, 2020 at 6:42 AM Something Something < mailinglist...@gmail.com> wrote: > I keep getting this error message: > > > *The message is 1169350 bytes when serialized which is larger than the > maximum r

Re: [structured streaming] [stateful] Null value appeared in non-nullable field

2020-05-23 Thread Jungtaek Lim
Hi, with only a stack trace there's not much to look into. It'd be better to provide a simple reproducer (code and problematic inputs) so that someone can give it a try. You may also want to try it with 3.0.0; RC2 is better to test against, but preview2 would be easier for end users to

Re: No. of active states?

2020-05-07 Thread Jungtaek Lim
these metrics (e.g. numInputRows, > inputRowsPerSecond). > > I am talking about "No. of States" in the memory at any given time. > > On Thu, May 7, 2020 at 4:31 PM Jungtaek Lim > wrote: > >> If you're referring total "entries" in all states in SS job

Re: [E] Re: Pyspark Kafka Structured Stream not working.

2020-05-07 Thread Jungtaek Lim
at 3:55 PM Vijayant Kumar wrote: > Hi Jungtek, > > > > Thanks for the response. It appears to be #1. > > I will appreciate if you can share some sample command to submit the Spark > application.? > > > > *From:* Jungtaek Lim [mailto:kabhwan.opensou...@gmail.

Re: No. of active states?

2020-05-07 Thread Jungtaek Lim
If you're referring to the total "entries" across all states in an SS job, it's provided via StreamingQueryListener. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries Hope this helps. On Fri, May 8, 2020 at 3:26 AM Something Something wrote:
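For illustration, a sketch of pulling the same numbers from the query progress (assumes `query` is a running StreamingQuery; numRowsTotal is reported per stateful operator):

    progress = query.lastProgress
    if progress is not None:
        for op in progress.get("stateOperators", []):
            print(op.get("numRowsTotal"))  # entries currently held in the state store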

Re: Pyspark Kafka Structured Stream not working.

2020-05-06 Thread Jungtaek Lim
be adopted with such convention. (e.g. no space) Hope this helps, Thanks, Jungtaek Lim (HeartSaVioR) On Wed, May 6, 2020 at 5:36 PM Vijayant Kumar wrote: > Hi All, > > > > I am getting the below error while using Pyspark Structured Streaming from > Kafka Producer. > > > >

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Jungtaek Lim
which is the characteristic of streaming query, hence hybrid one. > > > On Sun, May 3, 2020 at 3:20 AM Jungtaek Lim > wrote: > >> If I understand correctly, Trigger.once executes only one micro-batch and >> terminates, that's all. Your understanding of structured stream

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Jungtaek Lim
If I understand correctly, Trigger.once executes only one micro-batch and terminates, that's all. Your understanding of structured streaming applies there as well. It's like a hybrid approach as bringing incremental processing from micro-batch but having processing interval as batch. That said,
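A sketch of the Trigger.Once pattern described above (paths are placeholders): one micro-batch over the newly available input per run, with the checkpoint tracking what was already processed between runs.

    query = (streaming_df.writeStream                      # streaming_df is assumed
               .format("parquet")
               .option("path", "/tmp/out")                  # hypothetical output path
               .option("checkpointLocation", "/tmp/ckpt")   # hypothetical checkpoint path
               .trigger(once=True)
               .start())
    query.awaitTermination()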

Re: [Structured Streaming] NullPointerException in long running query

2020-04-28 Thread Jungtaek Lim
The root cause of the exception occurred on the executor side ("Lost task 10.3 in stage 1.0 (TID 81, spark6, executor 1)"), so you may need to check there. On Tue, Apr 28, 2020 at 2:52 PM lec ssmi wrote: > Hi: > One of my long-running queries occasionally encountered the following > exception: > >

Re: is RosckDB backend available in 3.0 preview?

2020-04-22 Thread Jungtaek Lim
to the implementation on Spark ecosystem. On Thu, Apr 23, 2020 at 1:22 AM kant kodali wrote: > is it going to make it in 3.0? > > On Tue, Apr 21, 2020 at 9:24 PM Jungtaek Lim > wrote: > >> Unfortunately, the short answer is no. Please refer the last part of >> discussion on the P

Re: is RosckDB backend available in 3.0 preview?

2020-04-21 Thread Jungtaek Lim
Unfortunately, the short answer is no. Please refer to the last part of the discussion on the PR https://github.com/apache/spark/pull/24922 Unless we get a native implementation of this, I guess this project is the most widely known implementation of a RocksDB backend state store -

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Jungtaek Lim
limits debugging help, but wanted > to understand if anyone has encountered a similar issue. > > On Tue, Apr 21, 2020 at 7:12 PM Jungtaek Lim > wrote: > >> If there's no third party libraries in the dump then why not share the >> thread dump? (I mean, the output of jstack) >&g

Re: Spark hangs while reading from jdbc - does nothing Removing Guess work from trouble shooting

2020-04-21 Thread Jungtaek Lim
cks after being clicked, then >>>> you can check the root cause of holding locks like this(Thread 48 of >>>> above): >>>> >>>> org.fusesource.jansi.internal.Kernel32.ReadConsoleInputW(Native >>>> Method) >>>> >>>> org.fusesource.jansi.internal.Kernel32.r

Re: Spark Structure Streaming | FileStreamSourceLog not deleting list of input files | Spark -2.4.0

2020-04-21 Thread Jungtaek Lim
/rb4ebf1d20d13db0a78694e8d301e51c326f803cb86fc1a1f66f2ae7e%40%3Cuser.spark.apache.org%3E Thanks, Jungtaek Lim (HeartSaVioR) On Tue, Apr 21, 2020 at 8:23 PM Pappu Yadav wrote: > Hi Team, > > While Running Spark Below are some finding. > >1. FileStreamSourceLog is responsible for maintaining input s

Re: [Structured Streaming] Checkpoint file compact file grows big

2020-04-19 Thread Jungtaek Lim
each the bad state). That said, it doesn't completely get rid of the necessity of TTL, but it opens the chance to have a longer TTL without encountering the bad state. If you're adventurous you can apply these patches on your version of Spark and see whether it helps. Hope this helps. Thanks, Jungtaek Lim (HeartSa

Re: Using startingOffsets latest - no data from structured streaming kafka query

2020-04-19 Thread Jungtaek Lim
Did you provide more records to topic "after" you started the query? That's the only one I can imagine based on such information. On Fri, Apr 17, 2020 at 9:13 AM Ruijing Li wrote: > Hi all, > > Apologies if this has been asked before, but I could not find the answer > to this question. We have

Re: Spark structured streaming - Fallback to earliest offset

2020-04-19 Thread Jungtaek Lim
hang? > > On Tue, Apr 14, 2020 at 10:48 PM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> I think Spark is trying to ensure that it reads the input "continuously" >> without any missing. Technically it may be valid to say the situation is a
