Fwd: Spark 2.4.4, RPC encryption and Python

2020-01-19 Thread Luca Toscano
Hi everybody,

I'm asking the same question on dev@ since some info about how to debug
this would be helpful :)

Thanks in advance,

Luca

-- Forwarded message -
From: Luca Toscano 
Date: Thu, 16 Jan 2020 at 09:16
Subject: Spark 2.4.4, RPC encryption and Python
To: 


Hi everybody,

I am currently testing Spark 2.4.4 with the following new settings:

spark.authenticate   true
spark.io.encryption.enabled   true
spark.io.encryption.keySizeBits   256
spark.io.encryption.keygen.algorithm   HmacSHA256
spark.network.crypto.enabled   true
spark.network.crypto.keyFactoryAlgorithm   PBKDF2WithHmacSHA256
spark.network.crypto.keyLength   256
spark.network.crypto.saslFallback   false

I use dynamic allocation and the Spark shuffle service is configured
correctly in YARN. I added the following two properties to yarn-site.xml:

  <property>
    <name>spark.authenticate</name>
    <value>true</value>
  </property>

  <property>
    <name>spark.network.crypto.enabled</name>
    <value>true</value>
  </property>

This works very well for all the Scala-based tools (spark2-shell,
spark-submit, etc.), but it doesn't for PySpark, where I see the
following warnings repeating over and over:

20/01/14 10:23:50 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
Attempted to request executors before the AM has registered!
20/01/14 10:23:50 WARN ExecutorAllocationManager: Unable to reach the
cluster manager to request 1 total executors!

The culprit seems to be the option "spark.io.encryption.enabled=true",
since without it everything works fine.
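
(As a quick way to confirm this, assuming the rest of the defaults stay
in spark-defaults.conf, overriding just that key at launch time is
enough to make the problem disappear, for example:

  pyspark --master yarn --conf spark.io.encryption.enabled=false

The flags above are only an illustration of how I toggled the setting,
not a recommendation to disable I/O encryption.)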

At first I thought it was a YARN resource allocation problem, but I
checked and the cluster has plenty of capacity. After digging a bit
more into YARN's container logs, I discovered that the problem seems to
be the Application Master not being able to contact the Driver in time
(this is client mode, of course):

20/01/14 09:45:21 INFO ApplicationMaster: ApplicationAttemptId:
appattempt_1576771377404_19608_01
20/01/14 09:45:21 INFO YarnRMClient: Registering the ApplicationMaster
20/01/14 09:45:52 ERROR TransportClientFactory: Exception while
bootstrapping client after 30120 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException:
Timeout waiting for task.
at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:263)
at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:257)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:259)
... 11 more

The strange part is that the timeout occurs only sometimes. I checked
the code related to the above stack trace and ended up at:

https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java#L106
https://github.com/apache/spark/blob/branch-2.4/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L129-L133

The "spark.network.auth.rpcTimeout" option seems to help, even if it
is not advertised in the docs as far as I can see (the 30s mentioned
in the exception are definitely trigger by this setting though). What
I am wondering is where/what I should check to debug this further,
since it seems a Python only problem that doesn't affect Scala. I
didn't find any outstanding bugs, so given the fact that 2.4.4 is very
recent I thought to report it in here to ask for an advice :)
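
For reference, this is the kind of override I mean (the key name comes
from the TransportConf code linked above; the 120s value below is just
an arbitrary example I picked, not a tuned recommendation):

spark.network.auth.rpcTimeout   120s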

Thanks in advance!

Luca




Correctness and data loss issues

2020-01-19 Thread Dongjoon Hyun
Hi, All.

According to our policy, "Correctness and data loss issues should be
considered Blockers".

- http://spark.apache.org/contributing.html

Since we are close to the branch-3.0 cut,
I want to ask your opinions on the following correctness and data loss
issues.

SPARK-30218 Columns used in inequality conditions for joins not
resolved correctly in case of common lineage
SPARK-29701 Different answers when empty input given in GROUPING SETS
SPARK-29699 Different answers in nested aggregates with window functions
SPARK-29419 Seq.toDS / spark.createDataset(Seq) is not thread-safe
SPARK-28125 dataframes created by randomSplit have overlapping rows
SPARK-28067 Incorrect results in decimal aggregation with whole-stage
code gen enabled
SPARK-28024 Incorrect numeric values when out of range
SPARK-27784 Alias ID reuse can break correctness when substituting
foldable expressions
SPARK-27619 MapType should be prohibited in hash expressions
SPARK-27298 Dataset except operation gives different results(dataset
count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
SPARK-27282 Spark incorrect results when using UNION with GROUP BY
clause
SPARK-27213 Unexpected results when filter is used after distinct
SPARK-26836 Columns get switched in Spark SQL using Avro backed Hive
table if schema evolves
SPARK-25150 Joining DataFrames derived from the same source yields
confusing/incorrect results
SPARK-21774 The rule PromoteStrings cast string to a wrong data type
SPARK-19248 Regex_replace works in 1.6 but not in 2.0

Some of them are targeted at 3.0.0, but the others are not.
Although we will work on them until 3.0.0,
I'm not sure we can reach a state with no known correctness or data
loss issues.

What do you think about the above issues?

Bests,
Dongjoon.


Re: [Discuss] Metrics Support for DS V2

2020-01-19 Thread Sandeep Katta
Please send me the patch, I will apply and test it.

On Fri, 17 Jan 2020 at 10:33 PM, Ryan Blue  wrote:

> We've implemented these metrics in the RDD (for input metrics) and in the
> v2 DataWritingSparkTask. That approach gives you the same metrics in the
> stage views that you get with v1 sources, regardless of the v2
> implementation.
>
> I'm not sure why they weren't included from the start. It looks like the
> way metrics are collected is changing. There are a couple of metrics for
> number of rows; looks like one that goes to the Spark SQL tab and one that
> is used for the stages view.
>
> If you'd like, I can send you a patch.
>
> rb
>
> On Fri, Jan 17, 2020 at 5:09 AM Wenchen Fan  wrote:
>
>> I think there are a few details we need to discuss.
>>
>> How frequently should a source update its metrics? For example, if the
>> file source needs to report size metrics per row, it'll be super slow.
>>
>> What metrics should a source report? Data size? numFiles? Read time?
>>
>> Shall we show metrics in the SQL web UI as well?
>>
>> On Fri, Jan 17, 2020 at 3:07 PM Sandeep Katta <
>> sandeep0102.opensou...@gmail.com> wrote:
>>
>>> Hi Devs,
>>>
>>> Currently DS V2 does not update any input metrics. SPARK-30362 aims at
>>> solving this problem.
>>>
>>> We can take the approach below: have a marker interface, let's say
>>> "ReportMetrics".
>>>
>>> If a DataSource implements this interface, then it will be easy to
>>> collect the metrics.
>>>
>>> For example, FilePartitionReaderFactory could support metrics.
>>>
>>> So it will be easy to collect the metrics if FilePartitionReaderFactory
>>> implements ReportMetrics.
>>>
>>>
>>> Please let me know your views, or whether we want a new solution or
>>> design altogether.
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
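
For concreteness, here is a rough Scala sketch of what a marker
interface along the lines of the proposed "ReportMetrics" could look
like. All names, method signatures, and metric keys below are
illustrative assumptions, not part of any existing Spark API or of the
patch mentioned above:

  // Hypothetical opt-in trait: a reader (or reader factory) that mixes
  // this in exposes its current metric values to the framework.
  trait ReportMetrics {
    // Returns metric name -> current value, e.g. "bytesRead" -> 12345L.
    // How often the framework polls this is the frequency question
    // raised earlier in the thread.
    def currentMetrics(): Map[String, Long]
  }

  // Illustrative usage: a file-based reader opting in.
  class ExampleFileReader extends ReportMetrics {
    private var bytesRead: Long = 0L
    private var filesOpened: Long = 0L

    override def currentMetrics(): Map[String, Long] =
      Map("bytesRead" -> bytesRead, "numFiles" -> filesOpened)
  }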