[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier

checkpoint file deletion

2023-06-29 Thread Lingzhe Sun
Hi all, I performed a stateful structure streaming job, and configured spark.cleaner.referenceTracking.cleanCheckpoints to true spark.cleaner.periodicGC.interval to 1min in the config. But the checkpoint files are not deleted and the number of them keeps growing. Did I miss something

Re: structured streaming- checkpoint metadata growing indefinetely

2022-05-04 Thread Wojciech Indyk
, Wojciech Indyk sob., 30 kwi 2022 o 12:35 Wojciech Indyk napisał(a): > Hi Gourav! > I use stateless processing, no watermarking, no aggregations. > I don't want any data loss, so changing checkpoint location is not an > option to me. > > -- > Kind regards/ Pozdrawiam, > Wojciech

Re: structured streaming- checkpoint metadata growing indefinetely

2022-04-30 Thread Wojciech Indyk
Hi Gourav! I use stateless processing, no watermarking, no aggregations. I don't want any data loss, so changing checkpoint location is not an option to me. -- Kind regards/ Pozdrawiam, Wojciech Indyk pt., 29 kwi 2022 o 11:07 Gourav Sengupta napisał(a): > Hi, > > this may

Re: structured streaming- checkpoint metadata growing indefinetely

2022-04-29 Thread Gourav Sengupta
Hi, this may not solve the problem, but have you tried to stop the job gracefully, and then restart without much delay by pointing to a new checkpoint location? The approach will have certain uncertainties for scenarios where the source system can loose data, or we do not expect duplicates

Re: structured streaming- checkpoint metadata growing indefinetely

2022-04-29 Thread Wojciech Indyk
Update for the scenario of deleting compact files: it recovers from the recent (not compacted) checkpoint file, but when it comes to compaction of checkpoint then it fails with missing recent compaction file. I use Spark 3.1.2 -- Kind regards/ Pozdrawiam, Wojciech Indyk pt., 29 kwi 2022 o 07:00

structured streaming- checkpoint metadata growing indefinetely

2022-04-28 Thread Wojciech Indyk
Hello! I use spark struture streaming. I need to use s3 for storing checkpoint metadata (I know, it's not optimal storage for checkpoint metadata). Compaction interval is 10 (default) and I set "spark.sql.streaming.minBatchesToRetain"=5. When the job was running for a few weeks then che

Re: [Spark Core][Intermediate][How-to]: Measure the time for a checkpoint

2021-10-07 Thread Lalwani, Jayesh
1. Is your join and aggregation based on the same keys? You might want to look at the execution plan. It is possible that without checkpointing, Spark puts join and aggregation into the same stage to eliminate shuffling. With a checkpoint, you might have forced Spark to introduce a shuffle

Re: [Spark Core][Intermediate][How-to]: Measure the time for a checkpoint

2021-10-07 Thread Mich Talebzadeh
onfiguration:* 10 executor instances with 4 CPU cores >and 8GB of memory each > > > *Assumptions I want to verify:* > >1. *Spark application A *will take less time than* Spark application >B *because *Spark application B* needs time to write a checkpoint to >reliable

[Spark Core][Intermediate][How-to]: Measure the time for a checkpoint

2021-10-07 Thread Schneider, Felix Jan
A will take less time than Spark application B because Spark application B needs time to write a checkpoint to reliable storage (HDFS) 2. The mean runtime difference of Spark application A and Spark application B will be the mean of the time it will take to write the checkpoint. Experiment sequence

Re: Structured Streaming Checkpoint Error

2020-12-03 Thread German Schiavon
Thanks Jungtaek! It makes sense, we are currently changing to an HDFS-Compatible FS, I was wondering how this change would impact the checkpoint, but after what you said it is more clear now. On Thu, 3 Dec 2020 at 00:23, Jungtaek Lim wrote: > In theory it would work, but works v

Re: Structured Streaming Checkpoint Error

2020-12-02 Thread Jungtaek Lim
In theory it would work, but works very inefficiently on checkpointing. If I understand correctly, it will write the content to the temp file on s3, and rename the file which actually gets the temp file from s3 and write the content of temp file to the final path on s3. Compared to checkpoint

Re: Structured Streaming Checkpoint Error

2020-12-02 Thread German Schiavon
nks a lot! > > On Thu, 17 Sep 2020 at 11:51, Gabor Somogyi > wrote: > >> Hi, >> >> Structured Streaming is simply not working when checkpoint location is on >> S3 due to it's read-after-write consistency. >> Please choose an HDFS compliant filesystem and it wil

Re: how to disable replace HDFS checkpoint location in structured streaming in spark3.0.1

2020-10-13 Thread lec ssmi
sorry, the mail title is a little problematic. "How to disable or replace .." lec ssmi 于2020年10月14日周三 上午9:27写道: > I have written a demo using spark3.0.0, and the location where the > checkpoint file is saved has been explicitly specified like >> >> strea

how to disable replace HDFS checkpoint location in structured streaming in spark3.0.1

2020-10-13 Thread lec ssmi
I have written a demo using spark3.0.0, and the location where the checkpoint file is saved has been explicitly specified like > > stream.option("checkpointLocation","file:///C:\\Users\\Administrator\\ > Desktop\\test") But the app still throws an excepti

Re: Structured Streaming Checkpoint Error

2020-09-17 Thread German Schiavon
Hi Gabor, Makes sense, thanks a lot! On Thu, 17 Sep 2020 at 11:51, Gabor Somogyi wrote: > Hi, > > Structured Streaming is simply not working when checkpoint location is on > S3 due to it's read-after-write consistency. > Please choose an HDFS compliant filesystem and it will wo

Re: Structured Streaming Checkpoint Error

2020-09-17 Thread Gabor Somogyi
Hi, Structured Streaming is simply not working when checkpoint location is on S3 due to it's read-after-write consistency. Please choose an HDFS compliant filesystem and it will work like a charm. BR, G On Wed, Sep 16, 2020 at 4:12 PM German Schiavon wrote: > Hi! > > I have an S

Structured Streaming Checkpoint Error

2020-09-16 Thread German Schiavon
Hi! I have an Structured Streaming Application that reads from kafka, performs some aggregations and writes in S3 in parquet format. Everything seems to work great except that from time to time I get a checkpoint error, at the beginning I thought it was a random error but it happened more than 3

Re: Appropriate checkpoint interval in a spark streaming application

2020-08-15 Thread Sheel Pancholi
Guys any inputs explaining the rationale on the below question will really help. Requesting some expert opinion. Regards, Sheel On Sat, 15 Aug, 2020, 1:47 PM Sheel Pancholi, wrote: > Hello, > > I am trying to figure an appropriate checkpoint interval for my spark > streaming appl

Appropriate checkpoint interval in a spark streaming application

2020-08-15 Thread Sheel Pancholi
Hello, I am trying to figure an appropriate checkpoint interval for my spark streaming application. Its Spark Kafka integration based on Direct Streams. If my *micro batch interval is 2 mins*, and let's say *each microbatch takes only 15 secs to process* then shouldn't my checkpoint interval

[Apache Spark][Streaming Job][Checkpoint]Spark job failed on Checkpoint recovery with Batch not found error

2020-05-28 Thread taylorwu
Hi, We have a Spark 2.4 job failed on Checkpoint recovery every few hours with the following errors (from the Driver Log): driver spark-kubernetes-driver ERROR 20:38:51 ERROR MicroBatchExecution: Query impressionUpdate [id = 54614900-4145-4d60-8156-9746ffc13d1f, runId = 3637c2f3-49b6-40c2-b6d0

Re: spark on k8s - can driver and executor have separate checkpoint location?

2020-05-16 Thread Ali Gouta
. On Sat, May 16, 2020 at 6:06 AM wzhan wrote: > Hi guys, > > I'm running spark applications on kubernetes. According to spark > documentation > > https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing > Spark needs distributed file system to store i

spark on k8s - can driver and executor have separate checkpoint location?

2020-05-15 Thread wzhan
Hi guys, I'm running spark applications on kubernetes. According to spark documentation https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing Spark needs distributed file system to store its checkpoint data so that in case of failure, it can recover from checkpoint

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Jungtaek Lim
t;>> >>>> Hi Rishi, >>>> >>>> That is exactly why Trigger.Once was created for Structured Streaming. >>>> The way we look at streaming is that it doesn't have to be always real >>>> time, or 24-7 always on. We see streaming as a workfl

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Magnus Nilsson
this blog post for more details! >>> >>> https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html >>> >>> Best, >>> Burak >>> >>> On Fri, May 1, 2020 at 2:55 PM Rishi Shah >>> wrote: >>>

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Jungtaek Lim
bs-day-10x-cost-savings.html >> >> Best, >> Burak >> >> On Fri, May 1, 2020 at 2:55 PM Rishi Shah >> wrote: >> >>> Hi All, >>> >>> I recently started playing with spark streaming, and checkpoint location >>> feat

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Magnus Nilsson
> wrote: > >> Hi All, >> >> I recently started playing with spark streaming, and checkpoint location >> feature looks very promising. I wonder if anyone has an opinion about using >> spark streaming with checkpoint location option as a slow batch pro

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
have to repeat > indefinitely. See this blog post for more details! > > https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html > > Best, > Burak > > On Fri, May 1, 2020 at 2:55 PM Rishi Shah > wrote: > >> Hi All, >> >>

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Burak Yavuz
://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html Best, Burak On Fri, May 1, 2020 at 2:55 PM Rishi Shah wrote: > Hi All, > > I recently started playing with spark streaming, and checkpoint location > feature looks very promising. I wonder if anyone has an o

[spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
Hi All, I recently started playing with spark streaming, and checkpoint location feature looks very promising. I wonder if anyone has an opinion about using spark streaming with checkpoint location option as a slow batch processing solution. What would be the pros and cons of utilizing streaming

Re: [Structured Streaming] Checkpoint file compact file grows big

2020-04-19 Thread Jungtaek Lim
0866 https://issues.apache.org/jira/browse/SPARK-30900 https://issues.apache.org/jira/browse/SPARK-30915 https://issues.apache.org/jira/browse/SPARK-30946 SPARK-30946 is closely related to the issue - it will help the size of checkpoint file much smaller and also much shorter elapsed time to com

Re:[Structured Streaming] Checkpoint file compact file grows big

2020-04-15 Thread Kelvin Qin
as I know, the official documentation states that the checkpoint of the spark streaming application will continue to increase over time. Whereas data or RDD checkpointing is necessary even for basic functioning if stateful transformations are used. So,for applications that require long-term aggregation

[Structured Streaming] Checkpoint file compact file grows big

2020-04-15 Thread Ahn, Daniel
Are Spark Structured Streaming checkpoint files expected to grow over time indefinitely? Is there a recommended way to safely age-off old checkpoint data? Currently we have a Spark Structured Streaming process reading from Kafka and writing to an HDFS sink, with checkpointing enabled

[Spark MicroBatchExecution] Error fetching kafka/checkpoint/state/0/0/1.delta does not exist

2020-03-12 Thread Miguel Silvestre
Hi community, I'm having this error in some kafka streams: Caused by: java.io.FileNotFoundException: File file:/efs/.../kafka/checkpoint/state/0/0/1.delta does not exist Because of this I have some streams down. How can I fix this? Thank you. -- Miguel Silvestre

Spark checkpoint problem for python api

2019-07-29 Thread zenglong chen
Hi, My code is below: from pyspark.conf import SparkConf from pyspark.context import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils def test(record_list): print(list(record_list)) return record_list def

Checkpointing and accessing the checkpoint data

2019-06-27 Thread Jean-Georges Perrin
Hi Sparkians, Few questions around checkpointing. 1. Checkpointing “dump” file / persisting to disk Is the file encrypted or is it a standard parquet file? 2. If the file is not encrypted, can I use it with another app (I know it’s kind of of a weird stretch case) 3. Have you/do you know of

Spark Streaming: Checkpoint, Recovery and Idempotency

2019-05-29 Thread sheelstera
Hello, I am trying to understand the content of a checkpoint and corresponding recovery. *My understanding of Spark Checkpointing: * If you have really long DAGs and your spark cluster fails, checkpointing helps by persisting intermediate state e.g. to HDFS. So, a DAG of 50 transformations can

PySpark Streaming “PicklingError: Could not serialize object” when use transform operator and checkpoint enabled

2019-05-23 Thread Xilang Yan
In PySpark streaming, if checkpoint enabled, and if use a stream.transform operator to join with another rdd, “PicklingError: Could not serialize object” will be thrown. I have asked the same question at stackoverflow: https://stackoverflow.com/questions/56267591/pyspark-streaming-picklingerror

spark checkpoint between 2 jobs and HDFS ramfs with storage policy

2019-05-21 Thread Julien Laurenceau
Hi, I am looking for a setup that would be to be able to split a single spark processing into 2 jobs (operational constraints) without wasting too much time persisting the data between the two jobs during spark checkpoint/writes. I have a config with a lot of ram and I'm willing to configure

Redeploying spark streaming application aborts because of checkpoint issue

2018-10-14 Thread Kuttaiah Robin
"failOnDataLoss", false) .load() .filter(strFilter) .select(functions.from_json(functions.col("value").cast("string"), oSchema).alias("events")) .select("events.*"); Checkpoint is used as shown below; DataSt

Re: Structured Streaming doesn't write checkpoint log when I use coalesce

2018-08-09 Thread Jungtaek Lim
gt; .writeStream > .outputMode("append") > .trigger(Trigger.ProcessingTime(s"$triggerSeconds seconds")) > .option("checkpointLocation",s"$checkpointDir/$appName/tsdb") > .foreach { >

Structured Streaming doesn't write checkpoint log when I use coalesce

2018-08-09 Thread WangXiaolong
assword, mongoDatabase, mongoCollection,mongoAuthenticationDatabase) )(createMetricBuilder(tsdbMetricPrefix)) } .start() And when I check the checkpoint dir, I discover that the "/checkpoint/state" dir is empty. I looked into the executor's log and found that the

Re: [Structured Streaming] Reading Checkpoint data

2018-07-09 Thread subramgr
thanks -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [Structured Streaming] Reading Checkpoint data

2018-07-09 Thread Tathagata Das
Only the stream metadata (e.g., streamid, offsets) are stored as json. The stream state data is stored in an internal binary format. On Mon, Jul 9, 2018 at 4:07 PM, subramgr wrote: > Hi, > > I read somewhere that with Structured Streaming all the checkpoint data is > more readable

[Structured Streaming] Reading Checkpoint data

2018-07-09 Thread subramgr
Hi, I read somewhere that with Structured Streaming all the checkpoint data is more readable (Json) like. Is there any documentation on how to read the checkpoint data. If I do `hadoop fs -ls` on the `state` directory I get some encoded data. Thanks Girish -- Sent from: http://apache-spark

making query state checkpoint compatible in structured streaming

2018-06-17 Thread puneetloya
Consider there is a spark query(A) which is dependent on Kafka topics t1 and t2. After running this query in the streaming mode, a checkpoint(C1) directory for the query gets created with offsets and sources directories. Now I add a third topic(t3) on which the query is dependent. Now if I

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread amihay gonen
If you are using kafka direct connect api it might be committing offset back to kafka itself בתאריך יום ה׳, 7 ביוני 2018, 4:10, מאת licl ‏: > I met the same issue and I have try to delete the checkpoint dir before the > job , > > But spark seems can read the correct offset even

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread licl
I met the same issue and I have try to delete the checkpoint dir before the job , But spark seems can read the correct offset even though after the checkpoint dir is deleted , I don't know how spark do this without checkpoint's metadata. -- Sent from: http://apache-spark-user-list.1001560.n3

Using checkpoint much, much faster than cache. Why?

2018-06-05 Thread Phillip Henry
: 100%" everywhere). I have called cache() on all my DataFrames with no effect. However, calling checkpoint() on the DF fed to Spark's ML code solved the problem. So, although the problem is fixed, I'd like to know why cache() did not work when checkpoint() did. Can anybody explain? Thanks, Phill

Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi: I am working on a realtime application using spark structured streaming (v 2.2.1). The application reads data from kafka and if there is a failure, I would like to ignore the checkpoint.  Is there any configuration to just read from last kafka offset after a failure and ignore any offset

The last successful batch before stop re-execute after restart the DStreams with checkpoint

2018-03-11 Thread Terry Hoo
Experts, I see the last batch before stop (graceful shutdown) always re-execute after restart the DStream from a checkpoint, is this a expected behavior? I see a bug in JIRA: https://issues.apache.org/jira/browse/SPARK-20050, whic reports duplicates on Kafka, I also see this with HDFS file

Issue with EFS checkpoint

2018-02-07 Thread Khan, Obaidur Rehman
Hello, We have a Spark cluster with 3 worker nodes available as EC2 on AWS. Spark application is running in cluster mode and the checkpoints are stored in EFS. Spark version used is 2.2.0. We noticed the below error coming up – our understanding was that this intermittent checkpoint issue

Spark Streaming checkpoint

2018-01-29 Thread KhajaAsmath Mohammed
Hi, I have written spark streaming job to use the checkpoint. I have stopped the streaming job for 5 days and then restart it today. I have encountered weird issue where it shows as zero records for all cycles till date. is it causing data loss? [image: Inline image 1] Thanks, Asmath

Null pointer exception in checkpoint directory

2018-01-16 Thread KhajaAsmath Mohammed
Hi, I keep getting null pointer exception in the spark streaming job with checkpointing. any suggestions to resolve this. Exception in thread "pool-22-thread-9" java.lang.NullPointerException at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233) at

Re: Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread Shixiong(Ryan) Zhu
ompaction (AWS support directed us to do this, as > Namenode edit logs were filling the disk). > > Occasionally, the Structured Streaming query will not restart because the > most recent file in the "commits" or "offsets" checkpoint subdirectory is > empty. This seems l

Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread William Briggs
will not restart because the most recent file in the "commits" or "offsets" checkpoint subdirectory is empty. This seems like an undesirable behavior, as it requires manual intervention to remove the empty files in order to force the job to fall back onto the last good va

Spark 2.1.2 Spark Streaming checkpoint interval not respected

2017-11-18 Thread Shing Hing Man
Hi, In the following example using mapWithState, I set checkpoint interval to 1 minute. From the log, Spark stills write to the checkpoint directory every second. Would be appreciated if someone can point out what I have done wrong. object MapWithStateDemo { def main(args: Array[String

Re: Kafka 010 Spark 2.2.0 Streaming / Custom checkpoint strategy

2017-10-13 Thread Jörn Franke
gt; wrote: > > Greetings! > > I would like to accomplish a custom kafka checkpoint strategy (instead of > hdfs, i would like to use redis). is there a strategy I can use to change > this behavior; any advise will help. Th

Kafka 010 Spark 2.2.0 Streaming / Custom checkpoint strategy

2017-10-13 Thread Anand Chandrashekar
Greetings! I would like to accomplish a custom kafka checkpoint strategy (instead of hdfs, i would like to use redis). is there a strategy I can use to change this behavior; any advise will help. Thanks! Regards, Anand.

Re: Cases when to clear the checkpoint directories.

2017-10-09 Thread Tathagata Das
de validation/de-serialisation in case of DStreams? > > We are using mapWithState in our application and it builds its state from > checkpointed RDDs. I would like understand the cases where we can avoid > clearing the checkpoint directories. > > > thanks in advance, > Vish

Cases when to clear the checkpoint directories.

2017-10-07 Thread John, Vishal (Agoda)
the cases where we can avoid clearing the checkpoint directories. thanks in advance, Vishal This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you

ReduceByKeyAndWindow checkpoint recovery issues in Spark Streaming

2017-08-23 Thread SRK
Hi, ReduceByKeyAndWindow checkpoint recovery has issues when trying to recover for the second time. Basically it is losing the reduced value of the previous window but is present in the old values that needs to be inverse reduced resulting in the following error. Does anyone has any idea

Fwd: Issues when trying to recover a textFileStream from checkpoint in Spark streaming

2017-08-11 Thread swetha kasireddy
Hi, I am facing issues while trying to recover a textFileStream from checkpoint. Basically it is trying to load the files from the begining of the job start whereas I am deleting the files after processing them. I have the following configs set so was thinking that it should not look for files

Issues when trying to recover a textFileStream from checkpoint in Spark streaming

2017-08-10 Thread SRK
Hi, I am facing issues while trying to recover a textFileStream from checkpoint. Basically it is trying to load the files from the begining of the job start whereas I am deleting the files after processing them. I have the following configs set so was thinking that it should not look for files

Re: Spark streaming: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration ... on restart from checkpoint

2017-08-08 Thread dcam
Considering the @transient annotations and the work done in the instance initializer, not much state is really be broadcast to the executors. It might be simpler to just create these instances on the executors, rather than trying to broadcast them? -- View this message in context:

Re: [SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-08 Thread Jacek Laskowski
iteStream. >> | format("console"). >> | option("truncate", false). >> | option("checkpointLocation", "/tmp/checkpoint"). // <-- >> checkpoint directory >> | trigger(Trigger.ProcessingTime(10.seconds))

Re: [SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-07 Thread Michael Armbrust
r > scala> spark.version > res8: String = 2.3.0-SNAPSHOT > > scala> val q = records. > | writeStream. > | format("console"). > | option("truncate", false). > | option("checkpointLocation", "/tmp/checkpo

[SS] Console sink not supporting recovering from checkpoint location? Why?

2017-08-07 Thread Jacek Laskowski
quot;truncate", false). | option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory | trigger(Trigger.ProcessingTime(10.seconds)). | outputMode(OutputMode.Update). | start org.apache.spark.sql.AnalysisException: This query does not su

RE: underlying checkpoint

2017-07-16 Thread Mendelson, Assaf
does). Thanks, Assaf. From: Bernard Jesop [mailto:bernard.je...@gmail.com] Sent: Thursday, July 13, 2017 6:58 PM To: Vadim Semenov Cc: user Subject: Re: underlying checkpoint Thank you, one of my mistakes was to think that show() was an action. 2017-07-13 17:52 GMT+02:00 Vadim

Re: underlying checkpoint

2017-07-13 Thread Bernard Jesop
Thank you, one of my mistakes was to think that show() was an action. 2017-07-13 17:52 GMT+02:00 Vadim Semenov <vadim.seme...@datadoghq.com>: > You need to trigger an action on that rdd to checkpoint it. > > ``` > scala>spark.sparkContext.setCheckpointDir(&quo

Re: underlying checkpoint

2017-07-13 Thread Vadim Semenov
You need to trigger an action on that rdd to checkpoint it. ``` scala>spark.sparkContext.setCheckpointDir(".") scala>val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20))) df: org.a

underlying checkpoint

2017-07-13 Thread Bernard Jesop
park.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20))) df.show() df.rdd.checkpoint() println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed") }* But the result is still *"not checkpoi

Re: How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-07-03 Thread Yuval.Itzchakov
-to-reduce-the-amount-of-data-that-is-getting-written-to-the-checkpoint-from-Spark-Streaming-tp28798p28820.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr

Re: How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-07-02 Thread Yuval.Itzchakov
You can't. Spark doesn't let you fiddle with the data being checkpoint, as it's an internal implementation detail. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-reduce-the-amount-of-data-that-is-getting-written-to-the-checkpoint-from-Spark

How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-06-27 Thread SRK
Hi, I have checkpoints enabled in Spark streaming and I use updateStateByKey and reduceByKeyAndWindow with inverse functions. How do I reduce the amount of data that I am writing to the checkpoint or clear out the data that I dont care? Thanks! -- View this message in context: http://apache

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
ng, i highly recommend >> learning Structured Streaming >> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> >> instead. >> >> On Mon, Jun 5, 2017 at 11:16 AM, anbucheeralan <alunarbe...@gmail.com> >> wrote: &g

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread ALunar Beach
y with Spark Streaming, i highly recommend > learning Structured Streaming > <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> > instead. > > On Mon, Jun 5, 2017 at 11:16 AM, anbucheeralan <alunarbe...@gmail.com> > wrote: > >> I a

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
:16 AM, anbucheeralan <alunarbe...@gmail.com> wrote: > I am using Spark Streaming Checkpoint and Kafka Direct Stream. > It uses a 30 sec batch duration and normally the job is successful in > 15-20 sec. > > If the spark application fails after the successful completion > (14

Fwd: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread anbucheeralan
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch again

Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread ALunar Beach
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch again

Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Asher Krim
checkpointDirectory); sparkContext.setCheckpointDir(checkpointPath); Asher Krim Senior Software Engineer On Tue, May 30, 2017 at 12:37 PM, Everett Anderson <ever...@nuna.com.invalid > wrote: > Still haven't found a --conf option. > > Regarding a temporary HDFS checkpoint directory, it looks lik

Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Everett Anderson
Still haven't found a --conf option. Regarding a temporary HDFS checkpoint directory, it looks like when using --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment variable. Thus, one could do the following when creating a SparkSession: val checkpointPath = new Path

Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-26 Thread Everett Anderson
Hi, I need to set a checkpoint directory as I'm starting to use GraphFrames. (Also, occasionally my regular DataFrame lineages get too long so it'd be nice to use checkpointing to squash the lineage.) I don't actually need this checkpointed data to live beyond the life of the job, however. I'm

Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Jörn Franke
Just load it as from any other directory. > On 26. May 2017, at 17:26, Priya PM <pmpr...@gmail.com> wrote: > > > -- Forwarded message -- > From: Priya PM <pmpr...@gmail.com> > Date: Fri, May 26, 2017 at 8:54 PM > Subject: Re: Spark checkpoint

Fwd: Spark checkpoint - nonstreaming

2017-05-26 Thread Priya PM
-- Forwarded message -- From: Priya PM <pmpr...@gmail.com> Date: Fri, May 26, 2017 at 8:54 PM Subject: Re: Spark checkpoint - nonstreaming To: Jörn Franke <jornfra...@gmail.com> Oh, how do i do it. I dont see it mentioned anywhere in the documentation. I have follow

Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Jörn Franke
Do you have some source code? Did you set the checkpoint directory ? > On 26. May 2017, at 16:06, Priya <pmpr...@gmail.com> wrote: > > Hi, > > With nonstreaming spark application, did checkpoint the RDD and I could see > the RDD getting checkpointed. I have kill

Re: Spark checkpoint - nonstreaming

2017-05-26 Thread Holden Karau
ark application, did checkpoint the RDD and I could see > the RDD getting checkpointed. I have killed the application after > checkpointing the RDD and restarted the same application again immediately, > but it doesn't seem to pick from checkpoint and it again checkpoints the > RDD. Could a

Spark checkpoint - nonstreaming

2017-05-26 Thread Priya
Hi, With nonstreaming spark application, did checkpoint the RDD and I could see the RDD getting checkpointed. I have killed the application after checkpointing the RDD and restarted the same application again immediately, but it doesn't seem to pick from checkpoint and it again checkpoints

Spark Streaming: NullPointerException when restoring Spark Streaming job from hdfs/s3 checkpoint

2017-05-16 Thread Richard Moorhead
Im having some difficulty reliably restoring a streaming job from a checkpoint. When restoring a streaming job constructed from the following snippet, I receive NullPointerException's when `map` is called on the the restored RDD. lazy val ssc = StreamingContext.getOrCreate(checkpointDir

checkpoint on spark standalone

2017-04-20 Thread Vivek Mishra
(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) I understand that it could be due to RDD linage and tried to resolve it via checkpoint, but no luck. Any help

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
ld be specified by >> spark.sql.streaming.checkpointLocation on SparkSession level and thus >> automatically checkpoint dirs will be created per foreach query? >> >> > Sure, please open a pull request. > > >> 2) Do we really need to specify the checkpoint dir per query? what the >&

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Michael Armbrust
> > 1) could we update documentation for Structured Streaming and describe > that checkpointing could be specified by > spark.sql.streaming.checkpointLocation > on SparkSession level and thus automatically checkpoint dirs will be > created per foreach query? > > Sure, pl

Re: checkpoint

2017-04-14 Thread Jean Georges Perrin
Sorry - can't help with PySpark, but here is some Java code which you may be able to transform to Python? http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/ jg > On Apr 14, 2017, at 07:18, issues solution wrote: > > Hi > somone can give me an

checkpoint

2017-04-14 Thread issues solution
Hi somone can give me an complete example to work with chekpoint under Pyspark 1.6 ? thx regards

SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
s you are attempting to restart a query from checkpoint that is already active. It is caused by that *StreamingQueryManager.scala* get the checkpoint dir from stream’s configuration, and because my streams have equal checkpointDirs, the second stream tries to recover instead of creating of new one.Fo

checkpoint how to use correctly checkpoint with udf

2017-04-13 Thread issues solution
Hi , somone can explain me how i can use inPYSPAK not in scala chekpoint , Because i have lot of udf to apply on large data frame and i dont understand how i can use checkpoint to break lineag to prevent from java.lang.stackoverflow regrads

Re: checkpoint

2017-04-13 Thread ayan guha
Looks like your udf expects numeric data but you are sending string type. Suggest to cast to numeric. On Thu, 13 Apr 2017 at 7:03 pm, issues solution <issues.solut...@gmail.com> wrote: > Hi > I am newer in spark and i want ask you what wrang with checkpoint On > pyspark 1

checkpoint

2017-04-13 Thread issues solution
Hi I am newer in spark and i want ask you what wrang with checkpoint On pyspark 1.6.0 i dont unertsand what happen after i try to use it under datframe : dfTotaleNormalize24 = dfTotaleNormalize23.select([i if i not in listrapcot else udf_Grappra(F.col(i)).alias(i) for i

[Spark Streaming] Checkpoint backup (.bk) file purpose

2017-03-16 Thread Bartosz Konieczny
Hello, Actually I'm studying metadata checkpoint implementation in Spark Streaming and I was wondering the purpose of so called "backup files": CheckpointWriter snippet: > // We will do checkpoint when generating a batch and completing a batch. > When the processing >

Re: Spark standalone cluster on EC2 error .. Checkpoint..

2017-02-17 Thread shyla deshpande
.RemoteException(java.io.IOException): >> File >> /checkpoint/11ea8862-122c-4614-bc7e-f761bb57ba23/rdd-347/.part-1-attempt-3 >> could only be replicated to 0 nodes instead of minReplication (=1). There >> are 0 datanode(s) running and no node(s) are excluded in this

  1   2   3   4   >