Thank you, Imran.
I will check whether there is memory waste or not.
On Tue, Nov 26, 2019 at 1:30 AM Imran Rashid wrote:
> I think Chang is right, but I also think this only comes up in limited
> scenarios. I initially thought it wasn't a bug, but after some more
> thought I have some concerns in light of the
Very well put, Imran. This is a variant of executor failure after an RDD has
been computed (including caching). In general, non-determinism in Spark is
going to lead to inconsistency.
The only reasonable solution for us, at that time, was to make
pseudo-randomness repeatable and checkpoint after so
I think Chang is right, but I also think this only comes up in limited
scenarios. I initially thought it wasn't a bug, but after some more
thought I have some concerns in light of the issues we've had with
nondeterministic RDDs, e.g. repartition().
Say I have code like this:
val cachedRDD =
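The failure mode being hinted at can be simulated without Spark: cache the output of a nondeterministic computation, drop the cached block (as an executor loss would), and recompute it on the next access. A minimal plain-Scala sketch — the cache, the block id, and `compute` are hypothetical stand-ins, not Spark APIs:

```scala
import scala.collection.mutable
import scala.util.Random

object NondeterministicRecompute {
  // Stand-in for a nondeterministic computation (think of repartition()'s
  // random-start round-robin shuffle): each run may order rows differently.
  def compute(): Seq[Int] = new Random().shuffle((1 to 100).toList)

  // Simulates: compute and cache a block, lose it, recompute on the next action.
  def runScenario(): (Seq[Int], Seq[Int]) = {
    val cache = mutable.Map.empty[String, Seq[Int]]
    val first = cache.getOrElseUpdate("rdd_0_part_0", compute())
    cache.remove("rdd_0_part_0") // simulated executor loss drops the block
    val second = cache.getOrElseUpdate("rdd_0_part_0", compute())
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    val (first, second) = runScenario()
    println(first.sorted == second.sorted) // prints true: same multiset of rows
    println(first == second)               // almost certainly false: different order
  }
}
```

The recomputed data is the same multiset of rows but, with near certainty, in a different order — which in the real repartition() case means rows can land in different partitions than the ones a downstream consumer already observed.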
Hmm, I haven't checked the code, but I think if an RDD is referenced in several
places, the correct behavior should be: when this RDD's data is needed, it
will be computed and then cached only once; otherwise it should be treated
as a bug. If you suspect there's a race condition, you could create
Sorry, I didn't describe it clearly. The RDD id itself is thread-safe, but how
about the cached data?
See this code from BlockManager:
def getOrElseUpdate(...) = {
  get[T](blockId)(classTag) match {
    case ...
    case _ => // 1. no data is cached.
      // Need to compute the
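The once-only semantics Chang describes can be sketched outside Spark with `java.util.concurrent`: `computeIfAbsent` runs the value function at most once per key, even when many threads miss on it concurrently. This is only an analogy for the per-block locking the real `BlockManager.getOrElseUpdate` does — the names below are hypothetical:

```scala
import java.util.concurrent.{ConcurrentHashMap, CountDownLatch}
import java.util.concurrent.atomic.AtomicInteger

object OnceOnlyCache {
  // Race nThreads threads on a cold cache miss for the same block id and
  // report how many times the "computation" actually ran.
  def runRace(nThreads: Int): Int = {
    val cache = new ConcurrentHashMap[String, Seq[Int]]()
    val computeCount = new AtomicInteger(0)
    val start = new CountDownLatch(1)

    // Analogous to getOrElseUpdate: return cached data, or compute it exactly
    // once and cache it; computeIfAbsent serializes racing missers per key.
    def getOrElseUpdate(blockId: String): Seq[Int] =
      cache.computeIfAbsent(blockId, _ => {
        computeCount.incrementAndGet() // count how many times we "compute"
        (1 to 10).toList
      })

    val threads = (1 to nThreads).map { _ =>
      new Thread(() => { start.await(); getOrElseUpdate("rdd_42_part_0"); () })
    }
    threads.foreach(_.start())
    start.countDown() // release all threads at once to maximize the race
    threads.foreach(_.join())
    computeCount.get()
  }

  def main(args: Array[String]): Unit =
    println(runRace(8)) // prints 1: the block was computed exactly once
}
```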
The RDD id is immutable; it is generated when the RDD object is created.
So why would there be a race condition on the "RDD id"?
On Mon, Nov 25, 2019 at 11:31 AM Chang Chen wrote:
> I wonder about the concurrency semantics, in order to reason about
> correctness. If two queries simultaneously run DAGs which
I wonder about the concurrency semantics, in order to reason about
correctness. If two queries simultaneously run DAGs which use the same cached
DF/RDD, but before the cached data has actually materialized, what will happen?
By looking into the code a little, I suspect they have different BlockIds for
the same Dataset, which
Hi Chang,
RDDs/DataFrames are immutable and lazily computed, so they are thread-safe.
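As a plain-Scala analogy for "immutable and lazily computed" (this is not a Spark API): a `lazy val` is initialized at most once, under synchronization, no matter how many threads read it concurrently — every reader then sees the same immutable value.

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicInteger

object LazyOnce {
  val initCount = new AtomicInteger(0)

  // Scala guarantees the initializer of a lazy val runs at most once,
  // with synchronization, even under concurrent first access.
  lazy val data: Seq[Int] = { initCount.incrementAndGet(); (1 to 10).toList }

  // Have n threads read `data` simultaneously; return how often it initialized.
  def readFromThreads(n: Int): Int = {
    val start = new CountDownLatch(1)
    val threads = (1 to n).map { _ =>
      new Thread(() => { start.await(); require(data.sum == 55) })
    }
    threads.foreach(_.start())
    start.countDown() // release all readers at once
    threads.foreach(_.join())
    initCount.get()
  }

  def main(args: Array[String]): Unit =
    println(readFromThreads(8)) // prints 1: initialized exactly once
}
```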
Thanks!
On Tue, Nov 12, 2019 at 12:31 PM Chang Chen wrote:
> Hi all
>
> I have a case where I need to cache a source RDD and then create different
> DataFrames from it in different threads to accelerate queries.
>
> I
Hi all
I have a case where I need to cache a source RDD and then create different
DataFrames from it in different threads to accelerate queries.
I know that SparkSession is thread-safe (
https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
whether RDD is thread-safe or not.
Thanks