[DISCUSS] Spark cannot identify the problem executor

2020-09-10 Thread 陈晓宇
Hello all,

We've been using Spark 2.3 with blacklisting enabled and often hit the
following problem: when executor A has an issue (such as a connection
problem), tasks on executors B and C fail saying they cannot read from
executor A. Eventually the job fails because a task on executor B has
failed 4 times.

I wonder whether there is an existing fix, or any discussion of how to
identify executor A as the problem node.
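For reference, here is a minimal sketch of the blacklist settings in this
area (config names are from the Spark 2.3 docs; whether they actually
cover this fetch-failure case is exactly my question):

    from pyspark import SparkConf

    # Sketch of the Spark 2.3-era blacklist settings in play.
    conf = (
        SparkConf()
        .set("spark.blacklist.enabled", "true")
        # Blacklists the executor/node that fetch failures point at,
        # which is the mechanism closest to flagging "executor A":
        .set("spark.blacklist.application.fetchFailure.enabled", "true")
        # Per-task attempt limit behind the "failed 4 times" job failure:
        .set("spark.task.maxFailures", "4")
    )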

Thanks


Re: Contributing to JIRA Maintenance

2020-09-10 Thread EveLiao
Sure, that's wonderful. My JIRA account is @EveLiao.

I will make a list of issues that I consider okay to close, and let you
know so we can make a call.

Thanks!






Re: Contributing to JIRA Maintenance

2020-09-10 Thread Sean Owen
What's your JIRA handle? You may need to be in the Contributor role to
change JIRA status, and I can add you.

I think that triaging JIRAs is a good activity. When you start, I'd
suggest identifying and commenting on potentially closeable issues
rather than closing them directly.
If you highlight them we can make a call and close them, and after a
few rounds of that, I'm sure you'd get a good feel for when something
should be closed.

That JIRA is not entirely obvious. I tend to agree that it's proposing
much freer type casts, perhaps to match PostgreSQL, but that has a
lot of other implications. Knowing only a little, I'd also tend not to
accept the proposed behavior change. I can close that one -- mostly on
the grounds that it's 3 years old.

On Thu, Sep 10, 2020 at 4:54 PM EveLiao  wrote:
>
> Hi,
>
> I am a novice in Spark and would love to help with JIRA Maintenance as a
> start to contributing.
>
> However, how can I edit the status of an issue in JIRA? It looks like
> I don't have access to it.
> Should I comment below the issue and @ an administrator?
>
> For example, I think the issue below is actually intended behavior,
> decided via a discussion among contributors; instead, the user should
> change the way they use it. So I would like to mark it as "invalid" if
> you think that makes sense.
>
> https://issues.apache.org/jira/browse/SPARK-21774
>
> However, I could not find a button to edit its status. (A screenshot of
> the page was attached.)
>
> Can you help with that?
>
> Thanks!
>
>
>



Re: Contributing to JIRA Maintenance

2020-09-10 Thread EveLiao
Hi,

I am a novice in Spark and would love to help with JIRA Maintenance as a
start to contributing.

However, how can I edit the status of an issue in JIRA? It looks like
I don't have access to it.
Should I comment below the issue and @ an administrator?

For example, I think the issue below is actually intended behavior,
decided via a discussion among contributors; instead, the user should
change the way they use it. So I would like to mark it as "invalid" if
you think that makes sense.

https://issues.apache.org/jira/browse/SPARK-21774

However, I could not find a button to edit its status. (A screenshot of
the page was attached.)

Can you help with that?

Thanks!






Migrating BinaryFileFormat to DSv2?

2020-09-10 Thread Chao Sun
Hi all,

As we are moving all data sources to v2, I'm wondering whether it makes
sense to do the same for `BinaryFileFormat`, which only has a v1
implementation at the moment.

Also curious to know which other data sources haven't been migrated yet.

Thanks,
Chao


Notification of analysis on publicly available project data

2020-09-10 Thread Griselda Cuevas
Dear PMC,


I’m contacting you because your project has been selected by the ASF D&I
committee, which is leading a research project to evaluate and understand
the current state of diversity in our community [1]. As part of this
research, we will analyze publicly available data about your project such
as Git logs, Jira boards and mailing lists, to better understand the state
of diversity in Apache projects and to complement the findings we obtained
from the Community Survey that was run this year [2].


This analysis will be performed by Bitergia [3], a vendor specializing in
researching open source projects and foundations. The results will be
published in a report similar to the OpenStack Foundation Analysis
published in 2018 [4].


The analysis will be done only on aggregated data at the project level during
and after processing, ensuring we do not report anything that could
identify a single individual. The data we analyze will be deleted right
after the research is done and won’t be retained by either the researcher
or the ASF.


If you have any concerns or questions, please raise them to the diversity
committee (d...@diversity.apache.org) and/or to the data privacy committee (
priv...@apache.org).


Regards,

Griselda Cuevas

V.P. of Diversity and Inclusion

Apache Software Foundation


[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=127405614

[2] https://youtu.be/4Mr1CRtKqUI

[3] https://bitergia.com/bitergia-analytics/

[4] https://superuser.openstack.org/articles/2018-gender-diversity-report/


[DISCUSS] Reducing memory usage of toPandas with Arrow "self_destruct" option

2020-09-10 Thread David Li
Hello all,

We've been working with PySpark and Pandas, and have found that to
convert a dataset using N bytes of memory to Pandas, we need to have
2N bytes free, even with the Arrow optimization enabled. The
fundamental reason is ARROW-3789[1]: Arrow does not free the Arrow
table until conversion finishes, so there are 2 copies of the dataset
in memory.

We'd like to improve this by taking advantage of the Arrow
"self_destruct" option available in Arrow >= 0.16. When converting a
suitable[*] Arrow table to a Pandas dataframe, it avoids the
worst-case 2x memory usage, bringing the overhead down to something
more like ~25%, by freeing each column of the Arrow table as soon as
it has been converted, instead of only at the end of the conversion.
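
For reference, here is a minimal PyArrow-level sketch of the two
conversion paths (the option names are from the pyarrow >= 0.16
Table.to_pandas API; the toy table just stands in for the one Spark
assembles from the collected record batches):

    import pyarrow as pa

    table = pa.table({
        "id": list(range(1_000)),
        "value": [float(i) for i in range(1_000)],
    })

    # Default path: the Arrow table stays fully allocated until
    # to_pandas() returns, so peak memory is roughly 2x the dataset.
    pdf = table.to_pandas()

    # self_destruct path: each column's buffers are released as soon as
    # that column has been converted, avoiding the worst-case 2x peak.
    pdf = table.to_pandas(
        self_destruct=True,   # free Arrow memory column by column
        split_blocks=True,    # one Pandas block per column
        use_threads=False,    # convert sequentially to cap peak memory
    )
    del table  # the table must not be used again after self_destruct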

Does this sound like a desirable optimization to have in Spark? If so,
how should it be exposed to users? As discussed below, there are cases
where a user may or may not want it enabled.

Here's a proof-of-concept patch, along with a demonstration, and a
comparison of memory usage (via memory_profiler[2]) with and without
the flag enabled:
https://gist.github.com/lidavidm/289229caa022358432f7deebe26a9bd3
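
(If you're unfamiliar with memory_profiler, this is roughly how such a
measurement is set up -- a minimal sketch with an arbitrary workload:)

    from memory_profiler import profile   # pip install memory_profiler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    @profile  # reports line-by-line memory usage when the function runs
    def convert():
        df = spark.range(10_000_000).selectExpr("id", "rand() AS value")
        return df.toPandas()

    pdf = convert()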

There are some cases where you may _not_ want this optimization,
however, so the patch leaves it as a toggle. Is this the API we'd
want, or would we prefer a different API (e.g. a configuration flag)?

The reason we may not want this enabled by default is that the related
split_blocks option is more likely to find zero-copy opportunities,
which will result in the Pandas dataframe being backed by immutable
buffers. Some Pandas operations will error in these cases, e.g. [3].
Also, to minimize memory usage, we set use_threads=False to convert
each column sequentially rather than in parallel, but this slows down
the conversion somewhat. One option here may be to set self_destruct
by default, but relegate the other two options (which further save
memory) to a toggle, and I can measure the impact of this if desired.
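
If it were exposed as a configuration flag, usage might look like this
(the conf key name below is purely illustrative, not a committed API):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical key, for illustration only; the actual spelling and
    # default would be decided in review.
    spark.conf.set(
        "spark.sql.execution.arrow.pyspark.selfDestruct.enabled", "true")

    df = spark.range(10_000_000).selectExpr("id", "rand() AS value")
    pdf = df.toPandas()  # would take the self_destruct conversion path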

[1]: https://issues.apache.org/jira/browse/ARROW-3789
[2]: https://github.com/pythonprofilers/memory_profiler
[3]: https://github.com/pandas-dev/pandas/issues/35530
[*] See my comment in https://issues.apache.org/jira/browse/ARROW-9878.

Thanks,
David




Re: leveldbjni dependency

2020-09-10 Thread Sean Owen
I think you could also fork it and try to compile it for s390x - I think
that's what happened for ARM.
Performance probably isn't a big deal there, but that might be easier than
rewriting or replacing it.

On Thu, Sep 10, 2020 at 1:29 AM mundaym  wrote:
>
> Hi all,
>
> I am currently building Spark from source and also have to build leveldbjni
> from source because the binary release (which is platform dependent) in
> mvnrepository does not support my target platform (s390x). People have run
> into similar problems when building for other platforms too (notably Spark
> builds on arm64 pull in an alternative binary release).
>
> The last binary release of leveldbjni was in 2013 and it does not appear to
> be actively maintained. I suspect the lack of new binary releases will cause
> more issues as time goes on. I am therefore curious whether anyone has any
> thoughts on substituting an alternative library for it. In particular, I
> am interested in whether there are any constraints an alternative would need
> to adhere to:
>
> 1. Would an alternative need to be leveldb compatible?
> 2. How performance sensitive are the uses of leveldbjni in Spark? Could a
> pure Java library be used instead?
>
> Specifically, I am thinking possible actively maintained alternatives
> might be something like https://github.com/dain/leveldb or even
> RocksDB. Or perhaps there is another key-value store used in the Apache
> ecosystem that could be adopted.
>
> Thanks,
> Michael
>
>
>



Re: [VOTE] Release Spark 2.4.7 (RC3)

2020-09-10 Thread Wenchen Fan
+1

On Thu, Sep 10, 2020 at 2:26 PM Prashant Sharma 
wrote:

> Thanks again, looks like it works now. Please take a look.
>
> On Thu, Sep 10, 2020 at 11:42 AM Prashant Sharma 
> wrote:
>
>> Hi Wenchen and Sean,
>>
>> Thanks for looking into this and all the details.
>>
>> I have now updated the key on those keyservers. Now, how do I refresh
>> Nexus?
>>
>> Thanks,
>>
>> On Thu, Sep 10, 2020 at 9:13 AM Sean Owen  wrote:
>>
>>> Yes I can do that and I am sure it's fine, but why has it been visible
>>> in the past and not now? Minor thing to fix.
>>>
>>> On Wed, Sep 9, 2020, 9:09 PM Wenchen Fan  wrote:
>>>
 Sean, you need to log in to https://repository.apache.org/ and pick the
 staging repo 1361, then check its status; you will see this:
 [image: image.png]

 On Thu, Sep 10, 2020 at 9:26 AM Mridul Muralidharan 
 wrote:

>
> I imported our KEYS file locally [1] to validate ... did not use
> external keyserver.
>
> Regards,
> Mridul
>
> [1] wget https://dist.apache.org/repos/dist/dev/spark/KEYS -O - | gpg
> --import
>
> On Wed, Sep 9, 2020 at 8:03 PM Wenchen Fan 
> wrote:
>
>> I checked
>> https://repository.apache.org/content/repositories/orgapachespark-1361/ ,
>> it says the Signature Validation failed.
>>
>> Prashant, can you double-check your gpg key and make sure it's
>> uploaded to public key servers like the following?
>> http://pool.sks-keyservers.net:11371
>> http://keyserver.ubuntu.com:11371
>>
>>
>> On Wed, Sep 9, 2020 at 6:12 AM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and built/tested with -Pyarn -Phadoop-2.7 -Phive
>>> -Phive-thriftserver -Pmesos -Pkubernetes
>>>
>>> Thanks,
>>> Mridul
>>>
>>>
>>> On Tue, Sep 8, 2020 at 8:55 AM Prashant Sharma 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.4.7.

 The vote is open until Sep 11th at 9AM PST and passes if a majority
 +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 2.4.7
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 There are currently no issues targeting 2.4.7 (try project = SPARK
 AND "Target Version/s" = "2.4.7" AND status in (Open, Reopened, "In
 Progress"))

 The tag to be voted on is v2.4.7-rc3 (commit
 14211a19f53bd0f413396582c8970e3e0a74281d):
 https://github.com/apache/spark/tree/v2.4.7-rc3

 The release files, including signatures, digests, etc. can be found
 at:
 https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1361/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.4.7-rc3-docs/

 The list of bug fixes going into 2.4.7 can be found at the
 following URL:
 https://s.apache.org/spark-v2.4.7-rc3

 This release is using the release script of the tag v2.4.7-rc3.

 FAQ


 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running it on this release candidate,
 then reporting any regressions.

 If you're working in PySpark, you can set up a virtual env, install
 the current RC, and see if anything important breaks. In Java/Scala,
 you can add the staging repository to your project's resolvers and
 test with the RC (make sure to clean up the artifact cache
 before/after so you don't end up building with an out-of-date RC
 going forward).

 ===
 What should happen to JIRA tickets still targeting 2.4.7?
 ===

 The current list of open tickets targeted at 2.4.7 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for
 "Target Version/s" = 2.4.7

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on