Hello All,
I am using Spark to process some files in parallel.
While most files can be processed within 3 seconds, we occasionally get
stuck on 1 or 2 files that never finish (or take more than 48 hours).
Since it is a 3rd-party file conversion tool, we are not able to
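A common workaround for hung third-party calls is to wrap each call in a Future with a hard deadline, so a stuck file yields an error record instead of blocking the task forever. This is only a sketch in plain Scala: `convertWithTimeout` and its placeholder body are hypothetical names standing in for the real conversion call.

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical wrapper: run the 3rd-party conversion with a hard time limit.
// Right(result) on success, Left(message) when the deadline is exceeded.
def convertWithTimeout(path: String, limit: Duration): Either[String, String] = {
  val work = Future {
    // Placeholder for the real 3rd-party call, e.g. converter.convert(path)
    s"converted:$path"
  }
  try Right(Await.result(work, limit))
  catch { case _: TimeoutException => Left(s"timed out: $path") }
}
```

Note that `Await.result` only abandons the wait; the underlying thread keeps running, so for calls that are truly stuck you may also want Spark's task reaper (`spark.task.reaper.enabled=true`) so that executors with unkillable tasks can be terminated.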
This is my first email to this mailing list, so I apologize if I made any
errors.
My team is going to build an application, and I'm investigating some
options for distributed compute systems. We want to perform computations on
large matrices.
The requirements are as follows:
1.
How can I achieve the following by passing a row to a UDF?
val df1 = df.withColumn("col_Z",
when($"col_x" === "a", $"col_A")
.when($"col_x" === "b", $"col_B")
.when($"col_x" === "c", $"col_C")
.when($"col_x" === "d",
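One common pattern for this is to pack the whole row into a `struct` and accept it as a `Row` inside the UDF, which replaces the long `when` chain with ordinary Scala logic. This is a sketch: the column names follow the snippet above, but the sample data is made up.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

val spark = SparkSession.builder().master("local[1]").appName("row-udf").getOrCreate()
import spark.implicits._

// Made-up data matching the column names in the question.
val df = Seq(("a", "A1", "B1", "C1"), ("b", "A2", "B2", "C2"))
  .toDF("col_x", "col_A", "col_B", "col_C")

// The UDF receives the entire row as a struct and picks a column by name.
val pick = udf((r: Row) =>
  r.getAs[String]("col_x") match {
    case "a" => r.getAs[String]("col_A")
    case "b" => r.getAs[String]("col_B")
    case "c" => r.getAs[String]("col_C")
    case _   => null
  }
)

// struct over all columns passes the whole row as one argument.
val df1 = df.withColumn("col_Z", pick(struct(df.columns.map(col): _*)))
```

Only the return type of the UDF needs a supported schema here; the `Row`-typed input works because Spark skips input type coercion when it cannot derive a schema for the argument.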
Thank you for the reply, Sean. Sure, 2.4.x should be an LTS version.
The main reason for a 2.4.4 release (before 3.0.0) is to have a better basis
for comparison with 3.0.0.
For example, SPARK-27798 is an old bug, but its correctness issue was only
exposed in Spark 2.4.3.
It would be great if we can
We will certainly want a 2.4.4 release eventually. In fact, I'd expect
2.4.x to be maintained for longer than the usual 18 months, as it's the
last 2.x branch.
It doesn't need to happen before 3.0, but could. Usually maintenance
releases happen 3-4 months apart and the last one was 2 months ago. If
Hi, All.
Spark 2.4.3 was released two months ago (8th May).
As of today (9th July), there are 45 fixes in `branch-2.4`, including the
following correctness or blocker issues.
- SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for decimals not fitting in long
- SPARK-26045
Hello,
I have the below Spark Structured Streaming code, and I was expecting the results
to be printed to the console every 10 seconds. But I notice the console sink
fires only every ~2 minutes or more.
What could be the issue?
def streaming(): Unit = {
System.setProperty("hadoop.home.dir",
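One likely cause is the trigger: without an explicit trigger, Structured Streaming starts the next micro-batch as soon as the previous one finishes, so if each batch takes ~2 minutes (slow source listing, expensive sink writes), output appears every ~2 minutes regardless of your expectations. A sketch of setting a 10-second processing-time trigger, using a `rate` source as a stand-in for the query's actual input:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[1]").appName("trigger-demo").getOrCreate()

// Stand-in source; replace with the query's real input stream.
val df = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

val query = df.writeStream
  .format("console")
  .outputMode("append")
  // Attempt a micro-batch every 10 seconds rather than back-to-back.
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```

Note that even with this trigger, a batch that takes longer than 10 seconds delays the next one, so if batches really are taking ~2 minutes the fix is to find out why (check the batch duration in the Spark UI's Structured Streaming tab) rather than just setting a trigger.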