Hi all,
Wondering if anyone has run into this, as I can't find any similar issues in
JIRA, mailing list archives, Stack Overflow, etc. I had a query that was
running successfully, but the query planning time was extremely long (4+
hours). To fix this I added `checkpoint()` calls earlier in the pipeline to
truncate the lineage.
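To illustrate the technique (a minimal sketch, not the original job; sizes and paths are made up): checkpoint() materializes the data and replaces the accumulated plan with a scan of the saved files, which keeps planning time bounded:

    import org.apache.spark.sql.SparkSession

    object CheckpointLineageDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("lineage-demo").getOrCreate()
        spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // must be set before checkpoint()

        var df = spark.range(0, 10).toDF("id")
        // Repeated self-unions double the logical plan each iteration.
        for (i <- 1 to 12) {
          df = df.union(df)
          if (i % 4 == 0) df = df.checkpoint() // eager: materializes and cuts the lineage
        }
        println(df.count()) // 10 * 2^12 = 40960
        spark.stop()
      }
    }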
Hi all,
I am running a stateful structured streaming job, and configured
spark.cleaner.referenceTracking.cleanCheckpoints to true
spark.cleaner.periodicGC.interval to 1min
in the config. But the checkpoint files are not deleted and their number
keeps growing. Did I miss something?
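For reference, a minimal sketch of how those settings are applied (names as above; values illustrative). Note that, as far as I understand, the ContextCleaner these flags control removes checkpoint data of garbage-collected RDDs; structured streaming's own checkpoint files (offsets/commits/state) are managed separately, which may be why they keep growing:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true") // clean RDD checkpoints of GC'd RDDs
      .set("spark.cleaner.periodicGC.interval", "1min")                // trigger driver GC periodically
    val spark = SparkSession.builder().config(conf).getOrCreate()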
Kind regards/ Pozdrawiam,
Wojciech Indyk
On Sat, 30 Apr 2022 at 12:35, Wojciech Indyk wrote:
Hi Gourav!
I use stateless processing, no watermarking, no aggregations.
I don't want any data loss, so changing the checkpoint location is not an
option for me.
--
Kind regards/ Pozdrawiam,
Wojciech Indyk
On Fri, 29 Apr 2022 at 11:07, Gourav Sengupta wrote:
Hi,
this may not solve the problem, but have you tried stopping the job
gracefully, and then restarting without much delay by pointing to a new
checkpoint location? The approach will have certain uncertainties for
scenarios where the source system can lose data, or where we do not expect
duplicates.
Update for the scenario of deleting compact files: it recovers from the
recent (not compacted) checkpoint file, but when the next compaction of the
checkpoint is due, it fails with a missing recent compaction file. I use
Spark 3.1.2.
--
Kind regards/ Pozdrawiam,
Wojciech Indyk
On Fri, 29 Apr 2022 at 07:00:
Hello!
I use Spark Structured Streaming. I need to use S3 for storing checkpoint
metadata (I know, it's not an optimal storage for checkpoint metadata).
The compaction interval is 10 (default) and I set
"spark.sql.streaming.minBatchesToRetain"=5. When the job had been running
for a few weeks, the checkpoint compact files had grown very large.
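For context, a sketch of the knobs involved (values as described above; showing the file source/sink metadata-log compact intervals is an assumption about which log is compacting here):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.sql.streaming.minBatchesToRetain", "5")              // keep fewer log entries (default 100)
      .config("spark.sql.streaming.fileSource.log.compactInterval", "10") // compact every 10 batches (default)
      .config("spark.sql.streaming.fileSink.log.compactInterval", "10")   // likewise for the sink log
      .getOrCreate()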
1. Is your join and aggregation based on the same keys?
You might want to look at the execution plan. It is possible that without
checkpointing, Spark puts the join and aggregation into the same stage to
eliminate shuffling. With a checkpoint, you might have forced Spark to
introduce a shuffle.
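A quick way to check (sketch; `df`, `other`, and the key name "k" are placeholders):

    val joined = df.join(other, "k")
    val agg = joined.groupBy("k").count()
    agg.explain()
    // If the plan shows a single Exchange on "k" feeding both the join and
    // the aggregation, they share partitioning; an extra Exchange after the
    // join means the aggregation introduced another shuffle.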
*Configuration:* 10 executor instances with 4 CPU cores
and 8GB of memory each

*Assumptions I want to verify:*

1. Spark application A will take less time than Spark application B, because
Spark application B needs time to write a checkpoint to reliable storage (HDFS).
2. The mean runtime difference of Spark application A and Spark application
B will be the mean of the time it takes to write the checkpoint.

Experiment sequence
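A minimal sketch of such an experiment (all names, sizes, and paths are made up; the real applications would replace `pipeline`):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object CheckpointCostExperiment {
      def timeMs[T](body: => T): Long = {
        val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1000000
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ckpt-cost").getOrCreate()
        spark.sparkContext.setCheckpointDir("hdfs:///tmp/ckpt-experiment") // reliable storage

        val df = spark.range(0, 50000000L).toDF("id")
        def pipeline(d: DataFrame): DataFrame =
          d.selectExpr("id % 1000 as k", "id as v").groupBy("k").count()

        val a = timeMs { pipeline(df).count() }              // application A: no checkpoint
        val b = timeMs { pipeline(df.checkpoint()).count() } // application B: checkpoint first
        println(s"A=$a ms, B=$b ms, difference ~ checkpoint write cost")
        spark.stop()
      }
    }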
Thanks Jungtaek!
It makes sense. We are currently changing to an HDFS-compatible FS; I was
wondering how this change would impact the checkpoint, but after what you
said it is clearer now.
On Thu, 3 Dec 2020 at 00:23, Jungtaek Lim wrote:
In theory it would work, but checkpointing becomes very inefficient. If
I understand correctly, it will write the content to a temp file on S3,
and the rename then actually reads the temp file back from S3 and writes
its content to the final path on S3. Compared to checkpointing on HDFS,
where a rename is a cheap metadata operation, that is very expensive.
Sorry, the mail title is a little problematic. "How to disable or
replace ..."
On Wed, Oct 14, 2020 at 9:27 AM, lec ssmi wrote:
I have written a demo using spark 3.0.0, and the location where the
checkpoint file is saved has been explicitly specified, like:

stream.option("checkpointLocation","file:///C:\\Users\\Administrator\\Desktop\\test")

But the app still throws an exception.
Hi Gabor,
Makes sense, thanks a lot!
On Thu, 17 Sep 2020 at 11:51, Gabor Somogyi
wrote:
Hi,
Structured Streaming is simply not working when the checkpoint location is
on S3, due to its lack of read-after-write consistency.
Please choose an HDFS-compliant filesystem and it will work like a charm.
BR,
G
On Wed, Sep 16, 2020 at 4:12 PM German Schiavon
wrote:
Hi!
I have a Structured Streaming application that reads from Kafka, performs
some aggregations and writes to S3 in Parquet format.
Everything seems to work great except that from time to time I get a
checkpoint error. At the beginning I thought it was a random error, but it
has happened more than 3 times now.
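A minimal sketch of the shape of such a job (broker, topic, bucket paths, and window sizes are all made-up placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .load()

    val counts = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()

    counts.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/output")             // placeholder
      .option("checkpointLocation", "s3a://bucket/ckpt") // the checkpoint that intermittently errors
      .outputMode("append")
      .start()
      .awaitTermination()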
Guys, any inputs explaining the rationale behind the below question will
really help. Requesting some expert opinion.
Regards,
Sheel
On Sat, 15 Aug, 2020, 1:47 PM Sheel Pancholi, wrote:
Hello,
I am trying to figure out an appropriate checkpoint interval for my Spark
Streaming application. It's Spark-Kafka integration based on Direct Streams.
If my *micro batch interval is 2 mins*, and let's say *each microbatch
takes only 15 secs to process*, then shouldn't my checkpoint interval be a
multiple of the batch interval?
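For reference, a sketch of setting an explicit checkpoint interval on a DStream (a socket source stands in for the Kafka direct stream; the documented rule of thumb is a checkpoint interval of 5-10x the sliding interval):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("ckpt-interval")
    val ssc = new StreamingContext(conf, Minutes(2))    // 2-minute micro batches
    ssc.checkpoint("hdfs:///tmp/ckpt")                  // checkpoint directory

    val lines = ssc.socketTextStream("localhost", 9999) // stand-in source
    lines.checkpoint(Minutes(10))                       // 5x the batch interval
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()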
Hi,
We have a Spark 2.4 job that fails on checkpoint recovery every few hours,
with the following errors (from the driver log):
driver spark-kubernetes-driver ERROR 20:38:51 ERROR MicroBatchExecution:
Query impressionUpdate [id = 54614900-4145-4d60-8156-9746ffc13d1f, runId =
3637c2f3-49b6-40c2-b6d0
On Sat, May 16, 2020 at 6:06 AM, wzhan wrote:
Hi guys,
I'm running spark applications on kubernetes. According to the spark
documentation
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
Spark needs a distributed file system to store its checkpoint data so that
in case of failure, it can recover from the checkpoint.
Hi Rishi,

That is exactly why Trigger.Once was created for Structured Streaming.
The way we look at streaming is that it doesn't have to be always real
time, or 24-7 always on. We see streaming as a workflow that you have to
repeat indefinitely. See this blog post for more details!

https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html

Best,
Burak

On Fri, May 1, 2020 at 2:55 PM Rishi Shah wrote:
Hi All,
I recently started playing with spark streaming, and the checkpoint location
feature looks very promising. I wonder if anyone has an opinion about using
spark streaming with the checkpoint location option as a slow batch
processing solution. What would be the pros and cons of utilizing streaming
this way?
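A sketch of the pattern Burak describes (source, sink, and paths are placeholders): the checkpoint carries offsets and state between runs, so each scheduled run processes only what is new and then stops:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("once-a-day").getOrCreate()

    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .load()
      .writeStream
      .format("parquet")
      .option("path", "s3a://bucket/out")               // placeholder
      .option("checkpointLocation", "s3a://bucket/ckpt")
      .trigger(Trigger.Once())                          // run as a batch, then stop
      .start()
      .awaitTermination()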
https://issues.apache.org/jira/browse/SPARK-30866
https://issues.apache.org/jira/browse/SPARK-30900
https://issues.apache.org/jira/browse/SPARK-30915
https://issues.apache.org/jira/browse/SPARK-30946
SPARK-30946 is closely related to the issue - it will help make the
checkpoint file much smaller and also greatly shorten the elapsed time to
compact it.
As far as I know, the official documentation states that the checkpoint of
a spark streaming application will continue to increase over time,
whereas data or RDD checkpointing is necessary even for basic functioning
if stateful transformations are used.
So, for applications that require long-term aggregation, how is this growth
kept in check?
Are Spark Structured Streaming checkpoint files expected to grow over time
indefinitely? Is there a recommended way to safely age-off old checkpoint data?
Currently we have a Spark Structured Streaming process reading from Kafka
and writing to an HDFS sink, with checkpointing enabled.
Hi community,
I'm having this error in some kafka streams:
Caused by: java.io.FileNotFoundException: File
file:/efs/.../kafka/checkpoint/state/0/0/1.delta does not exist
Because of this I have some streams down. How can I fix this?
Thank you.
--
Miguel Silvestre
Hi,
My code is below:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def test(record_list):
    print(list(record_list))
    return record_list
Hi Sparkians,
Few questions around checkpointing.
1. Checkpointing "dump" file / persisting to disk:
is the file encrypted or is it a standard parquet file?
2. If the file is not encrypted, can I use it with another app? (I know it's
kind of a weird stretch case.)
3. Have you/do you know of
Hello,
I am trying to understand the content of a checkpoint and corresponding
recovery.
My understanding of Spark checkpointing:
If you have really long DAGs and your spark cluster fails, checkpointing
helps by persisting intermediate state, e.g. to HDFS. So, a DAG of 50
transformations can be cut short, and recovery can resume from the persisted
intermediate data instead of recomputing from the beginning.
In PySpark streaming, if checkpoint enabled, and if use a stream.transform
operator to join with another rdd, “PicklingError: Could not serialize
object” will be thrown. I have asked the same question at stackoverflow:
https://stackoverflow.com/questions/56267591/pyspark-streaming-picklingerror
Hi,
I am looking for a setup that would allow me to split a single spark
processing job into 2 jobs (operational constraints) without wasting too
much time persisting the data between the two jobs during spark
checkpoint/writes.
I have a config with a lot of RAM and I'm willing to configure
"failOnDataLoss", false)
.load()
.filter(strFilter)
.select(functions.from_json(functions.col("value").cast("string"),
oSchema).alias("events"))
.select("events.*");
Checkpoint is used as shown below;
DataSt
gt; .writeStream
> .outputMode("append")
> .trigger(Trigger.ProcessingTime(s"$triggerSeconds seconds"))
> .option("checkpointLocation",s"$checkpointDir/$appName/tsdb")
> .foreach {
>
assword,
mongoDatabase, mongoCollection,mongoAuthenticationDatabase)
)(createMetricBuilder(tsdbMetricPrefix))
}
.start()
And when I check the checkpoint dir, I discover that the
"/checkpoint/state" dir is empty. I looked into the executor's log and found
that the
thanks
Only the stream metadata (e.g., stream id, offsets) is stored as JSON. The
stream state data is stored in an internal binary format.
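For example (sketch; the path is a placeholder), the JSON side can be inspected directly as text, while the state/ directory cannot:

    // each offsets file holds a version header plus one JSON document per source
    spark.read.textFile("hdfs:///ckpt/my-query/offsets/*").show(false)
    spark.read.textFile("hdfs:///ckpt/my-query/commits/*").show(false)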
On Mon, Jul 9, 2018 at 4:07 PM, subramgr
wrote:
Hi,
I read somewhere that with Structured Streaming all the checkpoint data is
more readable (JSON-like). Is there any documentation on how to read the
checkpoint data?
If I do `hadoop fs -ls` on the `state` directory I get some encoded data.
Thanks
Girish
Consider there is a spark query (A) which is dependent on Kafka topics t1
and t2.
After running this query in streaming mode, a checkpoint (C1) directory for
the query gets created, with offsets and sources directories. Now I add a
third topic (t3) on which the query is dependent.
Now if I restart the query from the same checkpoint (C1), how will offsets
for t3 be handled?
If you are using the Kafka direct API, it might be committing offsets
back to Kafka itself.
On Thu, Jun 7, 2018 at 4:10, licl wrote:
I met the same issue, and I have tried deleting the checkpoint dir before
the job, but spark still seems to read the correct offset even after the
checkpoint dir is deleted.
I don't know how spark does this without the checkpoint's metadata.
I have called cache() on all my DataFrames with no effect. However, calling
checkpoint() on the DF fed to Spark's ML code solved the problem.
So, although the problem is fixed, I'd like to know why cache() did not
work when checkpoint() did.
Can anybody explain?
Thanks,
Phill
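One way to see the difference (a sketch; `bigDf` is a placeholder): cache() keeps the computed data but preserves the full lineage and plan, while checkpoint() materializes the data and replaces the plan with a scan of the saved files, which is what plan-sensitive code paths benefit from:

    val cached = bigDf.cache()
    cached.count()
    cached.explain()   // plan still contains the full original lineage
                       // (wrapped in an InMemoryRelation)

    spark.sparkContext.setCheckpointDir("/tmp/ckpt")
    val ckpted = bigDf.checkpoint()
    ckpted.explain()   // plan collapses to a scan of the checkpointed data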
Hi:
I am working on a realtime application using spark structured streaming (v
2.2.1). The application reads data from kafka, and if there is a failure, I
would like to ignore the checkpoint. Is there any configuration to just
read from the last kafka offset after a failure and ignore any offsets
stored in the checkpoint?
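For reference (sketch; broker and topic are placeholders): with an existing checkpoint, the stored offsets always win, so reading from the latest offsets generally means pointing the restarted query at a fresh checkpoint location:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .option("startingOffsets", "latest")              // honored only for a brand-new checkpoint
      .load()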
Experts,
I see the last batch before a stop (graceful shutdown) always re-executes
after restarting the DStream from a checkpoint. Is this expected behavior?
I see a bug in JIRA: https://issues.apache.org/jira/browse/SPARK-20050,
which reports duplicates on Kafka; I also see this with HDFS file streams.
Hello,
We have a Spark cluster with 3 worker nodes available as EC2 instances on
AWS. The Spark application is running in cluster mode and the checkpoints
are stored in EFS. The Spark version used is 2.2.0.
We noticed the below error coming up – our understanding was that this
intermittent checkpoint issue
Hi,
I have written a spark streaming job to use the checkpoint. I stopped
the streaming job for 5 days and then restarted it today.
I have encountered a weird issue where it shows zero records for all
cycles to date. Is it causing data loss?
Thanks,
Asmath
Hi,
I keep getting a null pointer exception in my spark streaming job with
checkpointing. Any suggestions to resolve this?
Exception in thread "pool-22-thread-9" java.lang.NullPointerException
at
org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233)
...compaction (AWS support directed us to do this, as
Namenode edit logs were filling the disk).
Occasionally, the Structured Streaming query will not restart because the
most recent file in the "commits" or "offsets" checkpoint subdirectory is
empty. This seems like an undesirable behavior, as it requires manual
intervention to remove the empty files in order to force the job to fall
back onto the last good values.
Hi,
In the following example using mapWithState, I set the checkpoint interval
to 1 minute. From the log, Spark still writes to the checkpoint directory
every second. I would appreciate it if someone could point out what I have
done wrong.
object MapWithStateDemo {
  def main(args: Array[String
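A runnable sketch of the pattern (the socket source and names are placeholders). One thing worth noting: the interval set via checkpoint() governs the data (state) checkpoints, while metadata checkpoints are still written every batch, which can explain per-second writes to the directory:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming._

    val conf = new SparkConf().setAppName("MapWithStateDemo")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/ckpt") // metadata checkpoints happen every batch

    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" ")).map((_, 1))

    val spec = StateSpec.function { (word: String, one: Option[Int], state: State[Int]) =>
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val stateStream = words.mapWithState(spec)
    stateStream.checkpoint(Minutes(1)) // data/state checkpoint interval
    stateStream.print()

    ssc.start()
    ssc.awaitTermination()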
Greetings!
I would like to implement a custom kafka checkpoint strategy (instead of
hdfs, I would like to use redis). Is there a strategy I can use to change
this behavior? Any advice will help. Thanks!
Regards,
Anand.
We are using mapWithState in our application and it builds its state from
checkpointed RDDs. I would like to understand the cases where we can avoid
clearing the checkpoint directories.
thanks in advance,
Vishal
Hi,
ReduceByKeyAndWindow checkpoint recovery has issues when trying to recover
for the second time. Basically, it loses the reduced value of the previous
window, even though that value is present in the old values that need to be
inverse-reduced, resulting in the following error. Does anyone have any
idea?
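For reference, the pattern in question looks roughly like this (sketch; `pairs` stands for a DStream of key/count pairs, and the window sizes are placeholders):

    import org.apache.spark.streaming.{Minutes, Seconds}

    val windowedCounts = pairs.reduceByKeyAndWindow(
      _ + _,       // reduce: add counts entering the window
      _ - _,       // inverse reduce: subtract counts leaving the window
      Minutes(10), // window length
      Seconds(30)  // slide interval
    )
    // the inverse-reduce path relies on checkpointed per-window values,
    // which is what recovery replays on restart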
Hi,
I am facing issues while trying to recover a textFileStream from a
checkpoint. Basically it is trying to load the files from the beginning of
the job start, whereas I am deleting the files after processing them. I
have the following configs set, so I was thinking that it should not look
for files
Considering the @transient annotations and the work done in the instance
initializer, not much state is really being broadcast to the executors. It
might be simpler to just create these instances on the executors, rather
than trying to broadcast them.
scala> spark.version
res8: String = 2.3.0-SNAPSHOT

scala> val q = records.
     | writeStream.
     | format("console").
     | option("truncate", false).
     | option("checkpointLocation", "/tmp/checkpoint"). // <-- checkpoint directory
     | trigger(Trigger.ProcessingTime(10.seconds)).
     | outputMode(OutputMode.Update).
     | start
org.apache.spark.sql.AnalysisException: This query does not support
recovering from checkpoint location. Delete /tmp/checkpoint/offsets to
start over.
Thanks,
Assaf.
From: Bernard Jesop [mailto:bernard.je...@gmail.com]
Sent: Thursday, July 13, 2017 6:58 PM
To: Vadim Semenov
Cc: user
Subject: Re: underlying checkpoint
Thank you, one of my mistakes was to think that show() was an action.
On 2017-07-13 at 17:52 GMT+02:00, Vadim Semenov <vadim.seme...@datadoghq.com> wrote:
You need to trigger an action on that rdd to checkpoint it.
```
scala> spark.sparkContext.setCheckpointDir(".")
scala> val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R", 15), ("Java", 20)))
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
```
val df = spark.createDataFrame(List(("Scala", 35), ("Python", 30), ("R",
15), ("Java", 20)))
df.show()
df.rdd.checkpoint()
println(if (df.rdd.isCheckpointed) "checkpointed" else "not checkpointed")

But the result is still "not checkpointed".
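Putting Vadim's advice together, a minimal sketch of the fix: checkpoint() on an RDD only marks it, and the data is written when an action runs afterwards (note df.rdd returns a new RDD each call, so hold on to one reference):

    spark.sparkContext.setCheckpointDir("/tmp/ckpt")
    val rdd = df.rdd
    rdd.checkpoint()   // marks the RDD for checkpointing
    rdd.count()        // the action actually materializes the checkpoint
    println(if (rdd.isCheckpointed) "checkpointed" else "not checkpointed") // now "checkpointed"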
You can't. Spark doesn't let you fiddle with the data being checkpointed,
as it's an internal implementation detail.
Hi,
I have checkpoints enabled in Spark streaming, and I use updateStateByKey
and reduceByKeyAndWindow with inverse functions. How do I reduce the amount
of data that I am writing to the checkpoint, or clear out the data that I
don't care about?
Thanks!
If you are just starting to play with Spark Streaming, I highly recommend
learning Structured Streaming
<http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
instead.

On Mon, Jun 5, 2017 at 11:16 AM, anbucheeralan <alunarbe...@gmail.com> wrote:
I am using Spark Streaming Checkpoint and Kafka Direct Stream.
It uses a 30 sec batch duration and normally the job is successful in 15-20
sec.
If the spark application fails after the successful completion
(149668428ms in the log below) and restarts, it's duplicating the last
batch again.
checkpointDirectory);
sparkContext.setCheckpointDir(checkpointPath);
Asher Krim
Senior Software Engineer
On Tue, May 30, 2017 at 12:37 PM, Everett Anderson <ever...@nuna.com.invalid
> wrote:
Still haven't found a --conf option.
Regarding a temporary HDFS checkpoint directory, it looks like when using
--master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment
variable. Thus, one could do the following when creating a SparkSession:
val checkpointPath = new Path
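A plausible completion of that snippet (a sketch assuming the Hadoop Path API; the subdirectory name is illustrative):

    import org.apache.hadoop.fs.Path

    val checkpointPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"), "checkpoints").toString
    spark.sparkContext.setCheckpointDir(checkpointPath)
    // YARN removes the staging dir when the application finishes,
    // so these checkpoints do not outlive the job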
Hi,
I need to set a checkpoint directory as I'm starting to use GraphFrames.
(Also, occasionally my regular DataFrame lineages get too long so it'd be
nice to use checkpointing to squash the lineage.)
I don't actually need this checkpointed data to live beyond the life of the
job, however. I'm
Just load it as you would from any other directory.
On 26. May 2017, at 17:26, Priya PM <pmpr...@gmail.com> wrote:
-- Forwarded message --
From: Priya PM <pmpr...@gmail.com>
Date: Fri, May 26, 2017 at 8:54 PM
Subject: Re: Spark checkpoint - nonstreaming
To: Jörn Franke <jornfra...@gmail.com>
Oh, how do I do it? I don't see it mentioned anywhere in the documentation.
I have followed
Do you have some source code?
Did you set the checkpoint directory?
On 26. May 2017, at 16:06, Priya <pmpr...@gmail.com> wrote:
Hi,
With a non-streaming spark application, I checkpointed an RDD and could see
the RDD getting checkpointed. I killed the application after checkpointing
the RDD and restarted the same application again immediately, but it
doesn't seem to pick up from the checkpoint, and it checkpoints the RDD
again.
I'm having some difficulty reliably restoring a streaming job from a
checkpoint. When restoring a streaming job constructed from the following
snippet, I receive NullPointerExceptions when `map` is called on the
restored RDD.
lazy val ssc = StreamingContext.getOrCreate(checkpointDir
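For comparison, a sketch of the canonical recovery pattern: all stream construction happens inside the factory function passed to getOrCreate; referencing externally initialized state (e.g. via lazy vals) from restored closures is a common source of NPEs after restore:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/ckpt" // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("restorable")
      val ssc = new StreamingContext(conf, Seconds(30))
      ssc.checkpoint(checkpointDir)
      ssc.socketTextStream("localhost", 9999).map(_.length).print() // define ALL dstreams here
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()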
(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
I understand that it could be due to RDD lineage, and I tried to resolve it
via checkpoint, but no luck.
Any help?
1) Could we update the documentation for Structured Streaming and describe
that checkpointing can be specified by spark.sql.streaming.checkpointLocation
on the SparkSession level, and thus checkpoint dirs will be created
automatically per foreach query?

Sure, please open a pull request.

2) Do we really need to specify the checkpoint dir per query? What the...
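For reference, a sketch of the session-level setting being discussed (the path is a placeholder): queries started without an explicit per-query checkpointLocation then get their own subdirectory under this root:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.sql.streaming.checkpointLocation", "hdfs:///streaming-ckpt") // root for all queries
      .getOrCreate()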
Sorry - can't help with PySpark, but here is some Java code which you may be
able to transform to Python?
http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/
jg
On Apr 14, 2017, at 07:18, issues solution wrote:
Hi,
can someone give me a complete example of working with checkpoints under
PySpark 1.6?
Thanks.
Regards
...the error says you are attempting to restart a query from a checkpoint
that is already active.
It is caused by the fact that StreamingQueryManager.scala gets the
checkpoint dir from the stream's configuration, and because my streams have
equal checkpointDirs, the second stream tries to recover instead of
creating a new one.
Hi,
can someone explain how I can use checkpoints in PySpark (not in Scala)?
I have a lot of UDFs to apply on a large data frame, and I don't understand
how I can use checkpointing to break the lineage to prevent
java.lang.StackOverflowError.
Regards
Looks like your udf expects numeric data, but you are sending a string
type. I suggest casting to numeric.
On Thu, 13 Apr 2017 at 7:03 pm, issues solution <issues.solut...@gmail.com>
wrote:
Hi,
I am new to spark and I want to ask you what is wrong with checkpointing on
pyspark 1.6.0. I don't understand what happens after I try to use it on a
dataframe:
dfTotaleNormalize24 = dfTotaleNormalize23.select([i if i not in
listrapcot else udf_Grappra(F.col(i)).alias(i) for i
Hello,
Actually I'm studying the metadata checkpoint implementation in Spark
Streaming, and I was wondering about the purpose of the so-called "backup
files":
CheckpointWriter snippet:
// We will do checkpoint when generating a batch and completing a batch.
// When the processing
...RemoteException(java.io.IOException):
File /checkpoint/11ea8862-122c-4614-bc7e-f761bb57ba23/rdd-347/.part-1-attempt-3
could only be replicated to 0 nodes instead of minReplication (=1). There
are 0 datanode(s) running and no node(s) are excluded in this operation.