Re: [K8S] ExecutorPodsWatchSnapshotSource with no spark-exec-inactive label in 3.1?

2021-03-08 Thread attilapiros
Hi,

I do not think this could cause any problem.

The polled pod snapshots would contain the pods which were inactivated earlier,
every time the polling is triggered, so it makes sense to handle them like
deleted pods and skip them.

On the other hand, the pod watcher is only informed about state changes
(https://kubernetes.io/docs/reference/kubernetes-api/common-parameters/common-parameters/#watch),
and here we are talking about inactivated pods, which I do not expect to have
many state changes.
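
As a minimal sketch in Scala (the `Pod` type below is a simplified stand-in, not
Spark's actual model class, and the label key is only assumed from the subject
line), skipping the inactivated pods from a polled snapshot could look like this:

```
object InactivePodFilter {
  // Simplified stand-in for a pod: just a name and its labels.
  final case class Pod(name: String, labels: Map[String, String])

  // Label key assumed from the subject line of this thread.
  val InactiveLabel = "spark-exec-inactive"

  // Treat inactivated pods the same way as deleted ones: drop them from the snapshot.
  def activePods(polledSnapshot: Seq[Pod]): Seq[Pod] =
    polledSnapshot.filterNot(_.labels.get(InactiveLabel).contains("true"))
}
```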

Best regards,
Attila







Re: minikube and kubernetes cluster versions for integration testing

2021-03-04 Thread attilapiros
Thanks Shane!

I can do the documentation task, and the Minikube version check can be
incorporated into my PR.
When my PR is finalized (probably next week) I will create a Jira for you,
so you can set up the test systems and even test my PR before it is merged.
Would that work for you?
 
 






Re: SPARK-34600. Support user-defined types in Pandas UDF

2021-03-03 Thread attilapiros
Hi!

First of all thanks for your contribution!

PySpark is not an area I am familiar with, but I can answer your question
regarding Jira.

The issue will be assigned to you when your change is in:
>  The JIRA will be Assigned to the primary contributor to the change as a
> way of giving credit. If the JIRA isn’t closed and/or Assigned promptly,
> comment on the JIRA.

You can check the contributing page: https://spark.apache.org/contributing.html

Best regards,
Attila






Re: Is there any inplict RDD cache operation for query optimizations?

2021-02-15 Thread attilapiros
Hi,

There is a good reason why the decision about caching is left to the user:
Spark does not know about the future of the DataFrames and RDDs.

Think about how your program runs: at any moment there is an exact point
where the execution is, and when Spark reaches an action it evaluates that
Spark job, but it does not know anything about the future jobs. Cached data
would only be useful for a future job which reuses it.

On the other hand, this information is available to the user, who writes
all the jobs.
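
As a minimal illustration in Scala (assuming an already running `SparkSession`
named `spark`), only the user knows that the dataset below feeds two separate
actions, so only the user can decide to cache it:

```
import spark.implicits._

// An arbitrary "expensive" dataset that will be reused by two actions below.
val expensive = spark.range(0, 1000000).map(i => i * i).filter(_ % 7 == 0)

expensive.cache()                // explicit hint from the user: the result is reused

println(expensive.count())       // first action: computes the data and fills the cache
println(expensive.reduce(_ + _)) // second action: served from the cached data
```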

Attila






Re: Using bundler for Jekyll?

2021-02-12 Thread attilapiros
I managed to improve the site building a bit more: with a Gemfile we can pin
Jekyll to an exact version. For this we just have to call Jekyll via
`bundle exec jekyll`.

The PR [1] is open.

[1] https://github.com/apache/spark-website/pull/303






Re: Using bundler for Jekyll?

2021-02-12 Thread attilapiros
Sure, I will do that, too.






Re: Using bundler for Jekyll?

2021-02-12 Thread attilapiros
I ran into the same problem today and tried to find the version where the
diff is minimal, so I wrote a script:

```
#!/bin/zsh

# Jekyll versions to try, newest first.
versions=('3.7.3' '3.7.2' '3.7.0' '3.6.3' '3.6.2' '3.6.1' '3.6.0' '3.5.2'
'3.5.1' '3.5.0' '3.4.5' '3.4.4' '3.4.3' '3.4.2' '3.4.1' '3.4.0')

for i in $versions; do
  # Remove every previously installed jekyll/rouge version, then install the one under test.
  gem uninstall -a -x jekyll rouge
  gem install jekyll --version $i
  # Rebuild the site and show how much the generated output differs from what is committed.
  echo "=== jekyll $i ==="
  jekyll build
  git diff --stat
  # Throw away the generated changes before trying the next version.
  git reset --hard HEAD
done
```

Based on this, the best version is jekyll 3.6.3:

```
 site/community.html |  2 +-
 site/sitemap.xml    | 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)
```

What about changing the README.md [1] and specifying this exact version? 

Moreover, we could change the install command to:
 
```
 gem install jekyll --version 3.6.3
``` 
 
This installs the right rouge version as well, since it is a dependency.

Finally, I would also list this command as a prerequisite:

```
  gem uninstall -a -x jekyll rouge
```

This is because gem keeps all the installed versions while only one of them is used.

 
[1]
https://github.com/apache/spark-website/blob/6a5fc2ccaa5ad648dc0b25575ff816c10e648bdf/README.md#L5






Re: [DISCUSS] Spark cannot identify the problem executor

2021-02-11 Thread attilapiros
Hi,

There is an existing mechanism to handle this situation. Those tasks will become
zombie tasks [1] and they should not be counted towards the task failures
[2]. Even the shuffle blocks should be unregistered for the lost executor,
although the map output statuses referring to the lost executor might already
be cached on the other executors [3], which might generate new fetch failures.

Check the mentioned code parts and run Spark with debug logging enabled for
these classes to investigate further. Reading the logs and the code together
will help you a lot. Also consider using a recent Spark version, as there have
been changes in this area.

Important: You can avoid this problem altogether by using the external
shuffle service. 
If you happen to be on YARN, please check this link [4].

When the external shuffle service is enabled, shuffle blocks are not lost with
a dying executor, as the blocks can be served by the shuffle service running
on the same host where the executor was.
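
For reference, here is a minimal sketch in Scala of the application-side setting
(the cluster-side setup, i.e. running the auxiliary service on the YARN
NodeManagers, is described in [4]):

```
import org.apache.spark.SparkConf

// Application-side switch only; the spark_shuffle auxiliary service must also be
// configured on the YARN NodeManagers as described in [4].
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
```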

Best Regards,
Attila 

[1]
https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L812

[2]
https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L871

[3]
https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L664

[4]
https://spark.apache.org/docs/2.3.0/running-on-yarn.html#configuring-the-external-shuffle-service

 







Re: [K8S] KUBERNETES_EXECUTOR_REQUEST_CORES

2021-02-10 Thread attilapiros
Hi,

This is just an extra, unnecessary direct use of the `sparkConf` member val
(those two lines were added by two different PRs).

Actually both use the same `sparkConf` to return the config value, as
`KubernetesExecutorConf` extends `KubernetesConf` [1], which uses the passed
`sparkConf` to get the value back in its get method [2].

So technically this does not cause any problem, but it is better to harmonize
it and call the `contains` method directly on `kubernetesConf` (see the
method [3]) to avoid confusing future readers.

[1]
https://github.com/apache/spark/blob/9b875ceada60732899053fbd90728b4944d1c03d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala#L132-L138

[2]
https://github.com/apache/spark/blob/9b875ceada60732899053fbd90728b4944d1c03d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala#L67

[3]
https://github.com/apache/spark/blob/9b875ceada60732899053fbd90728b4944d1c03d/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala#L65
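
To illustrate the delegation with simplified stand-ins in Scala (these are not
Spark's actual classes, just a sketch of the relationship described above):

```
// Stand-in for SparkConf: only the contains lookup matters here.
class SparkConfLike(entries: Map[String, String]) {
  def contains(key: String): Boolean = entries.contains(key)
}

// Stand-in for KubernetesConf: contains simply delegates to the wrapped sparkConf,
// as the real class does in [2]/[3].
abstract class KubernetesConfLike {
  val sparkConf: SparkConfLike
  def contains(key: String): Boolean = sparkConf.contains(key)
}

// Stand-in for KubernetesExecutorConf.
class KubernetesExecutorConfLike(val sparkConf: SparkConfLike) extends KubernetesConfLike

// Both call sites below return the same answer; the second, harmonized form is preferred:
//   executorConf.sparkConf.contains("spark.kubernetes.executor.request.cores")
//   executorConf.contains("spark.kubernetes.executor.request.cores")
```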

Best Regards,
Attila



