Set TimeOut and continue with other tasks

2019-07-09 Thread Wei Chen
Hello All, I am using Spark to process some files in parallel. While most files can be processed within 3 seconds, we may get stuck on 1 or 2 files that never finish (or take more than 48 hours). Since it is a 3rd party file conversion tool, we are not able to

[Beginner] Run compute on large matrices and return the result in seconds?

2019-07-09 Thread Gautham Acharya
This is my first email to this mailing list, so I apologize if I made any errors. My team is going to be building an application and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices. The requirements are as follows: 1.

Pass row to UDF and select column based on pattern match

2019-07-09 Thread Femi Anthony
How can I achieve the following by passing a row to a UDF? val df1 = df.withColumn("col_Z", when($"col_x" === "a", $"col_A") .when($"col_x" === "b", $"col_B") .when($"col_x" === "c", $"col_C") .when($"col_x" === "d",

Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Dongjoon Hyun
Thank you for the reply, Sean. Sure. 2.4.x should be an LTS version. The main reason for a 2.4.4 release (before 3.0.0) is to have a better basis for comparison to 3.0.0. For example, SPARK-27798 had an old bug, but its correctness issue is only exposed in Spark 2.4.3. It would be great if we can

Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Sean Owen
We will certainly want a 2.4.4 release eventually. In fact I'd expect 2.4.x to be maintained for longer than the usual 18 months, as it's the last 2.x branch. It doesn't need to happen before 3.0, but could. Usually maintenance releases happen 3-4 months apart and the last one was 2 months ago. If

Release Apache Spark 2.4.4 before 3.0.0

2019-07-09 Thread Dongjoon Hyun
Hi, All. Spark 2.4.3 was released two months ago (8th May). As of today (9th July), there exist 45 fixes in `branch-2.4` including the following correctness or blocker issues. - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for decimals not fitting in long - SPARK-26045

Spark structural streaming sinks output late

2019-07-09 Thread Kamalanathan Venkatesan
Hello, I have the Spark Structured Streaming code below, and I was expecting the results to be printed on the console every 10 seconds. But I notice the sink to the console happening only every ~2 minutes or more. What could be the issue? def streaming(): Unit = { System.setProperty("hadoop.home.dir",