ASF board report for May

2019-05-06 Thread Matei Zaharia
It’s time to submit Spark's quarterly ASF board report, which is due May 15th, so 
I wanted to run it by everyone first. Let me know if I missed anything:



Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics. 

Project status:

- We released Apache Spark 2.4.1, 2.4.2 and 2.3.3 in the past three months to 
fix issues in the 2.3 and 2.4 branches.

- Discussions are under way on our dev and user mailing lists about the next 
feature release, which will likely be Spark 3.0. Some key questions include 
whether to remove various deprecated APIs, and which minimum versions of Java, 
Python, Scala, etc. to support. There are also a number of new features 
targeting this release. We encourage everyone in the community to give feedback 
on these discussions through our mailing lists or issue tracker.

- Several Spark Project Improvement Proposals (SPIPs) for major additions to 
Spark were discussed on the dev list in the past three months. These include 
support for passing columnar data efficiently into external engines (e.g. 
GPU-based libraries), accelerator-aware scheduling, new data source APIs, and 
.NET support. Some of these have been accepted (e.g. the table metadata and 
accelerator-aware scheduling proposals) while others are still being discussed.

Trademarks:

- We are continuing engagement with various organizations.

Latest releases:

- April 23rd, 2019: Spark 2.4.2
- March 31st, 2019: Spark 2.4.1
- February 15th, 2019: Spark 2.3.3

Committers and PMC:

- The latest committer was added on Jan 29th, 2019 (Jose Torres).
- The latest PMC member was added on Jan 12th, 2018 (Xiao Li).





Re: [VOTE] Release Apache Spark 2.4.3

2019-05-06 Thread Xiao Li
This vote passes! I'll follow up with a formal release announcement soon.

+1:
Michael Heuer (non-binding)
Gengliang Wang (non-binding)
Sean Owen (binding)
Felix Cheung (binding)
Wenchen Fan (binding)
Herman van Hovell (binding)
Xiao Li (binding)

Cheers,

Xiao

antonkulaga wrote on Mon, May 6, 2019 at 2:36 PM:

> >Hadoop 3 has not been supported in 2.4.x. Scala 2.12 has been supported since 2.4.0,
>
> I see. I thought it was, since I saw many posts about configuring Spark
> for Hadoop 3, as well as Hadoop 3-based Spark Docker containers.
>


Re: [VOTE] Release Apache Spark 2.4.3

2019-05-06 Thread antonkulaga
>Hadoop 3 has not been supported in 2.4.x. Scala 2.12 has been supported since 2.4.0,

I see. I thought it was, since I saw many posts about configuring Spark
for Hadoop 3, as well as Hadoop 3-based Spark Docker containers.






[METRICS] Metrics names inconsistent between executions

2019-05-06 Thread Anton Kirillov
Hi everyone!

We are currently working on building a unified monitoring/alerting solution
for Spark and would like to rely on Spark's own metrics to avoid divergence
from the upstream. One of the challenges is to support metrics coming from
multiple Spark applications running on a cluster: scheduled jobs,
long-running streaming applications etc.

Original problem:
Spark assigns metric names using *spark.app.id* and *spark.executor.id* as
part of them. Thus the number of metrics is continuously growing, because
those IDs are unique between executions whereas the metrics themselves report
the same thing. Another issue that arises here is how to use constantly
changing metric names in dashboards.

For example, *jvm_heap_used* reported by all Spark instances (components):
- <app_id>_driver_jvm_heap_used (Driver)
- <app_id>_<executor_id>_jvm_heap_used (Executors)

While *spark.app.id* can be overridden with *spark.metrics.namespace*, there
is no such option for *spark.executor.id*, which makes it impossible to build
a reusable dashboard because (given the uniqueness of IDs) differently named
metrics are emitted for each execution.
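
As a minimal sketch of that partial workaround (the "my_job" namespace and app
name below are arbitrary examples, and a sink still has to be configured
separately in metrics.properties):

    import org.apache.spark.sql.SparkSession

    // Use a stable namespace instead of the auto-generated spark.app.id, so the
    // driver reports the same metric names across executions, e.g. something
    // like my_job.driver.jvm.heap.used. Executor metric names still embed the
    // executor ID, which is exactly the problem described above.
    val spark = SparkSession.builder()
      .appName("metrics-namespace-demo")
      .master("local[*]")  // only so the snippet runs standalone; normally set via spark-submit
      .config("spark.metrics.namespace", "my_job")
      .getOrCreate()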

One of the possible solutions would be to make executor metric names follow
the driver's naming pattern, e.g.:
- <namespace>_driver_jvm_heap_used (Driver)
- <namespace>_executor_jvm_heap_used (Executors)

and distinguish executors based on tags (tags would have to be configured in
the metric reporters in this case). I'm not sure whether this could break the
Driver UI, though.

I'd really appreciate any feedback on this issue and would be happy to
create a JIRA issue/PR if this change looks sane to the community.

Thanks in advance.

-- 
*Anton Kirillov*
Senior Software Engineer, Mesosphere


DataSourceV2 community sync notes - 1 May 2019

2019-05-06 Thread Ryan Blue
Here are my notes for the latest DSv2 community sync. As usual, if you have
comments or corrections, please reply. If you’d like to be invited to the
next sync, email me directly. Everyone is welcome to attend.

*Attendees*:
Ryan Blue
John Zhuge
Andrew Long
Bruce Robbins
Dilip Biswal
Gengliang Wang
Kevin Yu
Michael Artz
Russel Spitzer
Yifei Huang
Zhilmil Dhillon

*Topics*:

Introductions
Open pull requests
V2 organization
Bucketing and sort order from v2 sources

*Discussion*:

   - Introductions: we stopped doing introductions when we had a large
   group. Attendance has gone down from the first few syncs, so we decided to
   resume.
   - V2 organization / PR #24416: https://github.com/apache/spark/pull/24416
  - Ryan: There’s an open PR to move v2 into catalyst
  - Andrew: What is the distinction between catalyst and sql? How do we
  know what goes where?
  - Ryan: IIUC, the catalyst module is supposed to be a stand-alone
  query planner that doesn’t get into concrete physical plans. The catalyst
  package is the private implementation. Anything that is generic catalyst,
  including APIs like DataType, should be in the catalyst module. Anything
  public, like an API, should not be in the catalyst package.
  - Ryan: v2 meets those requirements now and I don’t have a strong
  opinion on organization. We just need to choose one.
   - No one had a strong opinion, so we tabled this. In #24416 or shortly
   after, let's decide on organization and do the move at once.
  - Next steps: someone with an opinion on organization should suggest
  a structure.
   - TableCatalog API / PR #24246:
   https://github.com/apache/spark/pull/24246
  - Ryan: Wenchen’s last comment was that he was waiting for tests to
  pass and they are. Maybe this will be merged soon?
   - Bucketing and sort order from v2 sources
  - Andrew: interested in using existing data layout and sorting to
  remove expensive tasks in joins
  - Ryan: in v2, bucketing is unified with other partitioning
  functions. I plan to build a way for Spark to get partition function
  implementations from a source so it can use that function to prepare the
  other side of a join. From there, I have been thinking about a way to
  check compatibility between functions, so we could validate that table A
  has the same bucketing as table B.
  - Dilip: is bucketing Hive-specific?
  - Russel: Cassandra also buckets
  - Matt: what is the difference between bucketing and other partition
  functions for this?
  - Ryan: probably no difference. If you’re partitioning by hour, you
  could probably use that, too.
  - Dilip: how can Spark compare functions?
  - Andrew: should we introduce a standard? it would be easy to switch
  for their use case
  - Ryan: it is difficult to introduce a standard because so much data
  already exists in tables. I think it is easier to support multiple
  functions.
  - Russel: Cassandra uses dynamic bucketing, which wouldn’t be able to
  use a standard.
  - Dilip: sources could push down joins
  - Russel: that’s a harder problem
  - Andrew: does anyone else limit bucket size?
  - Ryan: we don’t, because we assume the sort can spill. Probably a
  good optimization for later.
  - Matt: what are the follow-up items for this?
  - Andrew: will look into the current state of bucketing in Spark
  - Ryan: it would be great if someone thought about what the
  FunctionCatalog interface will look like
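
To make that last item concrete, here is a rough, hypothetical sketch of the
kind of interface a FunctionCatalog might expose. None of the names or
signatures below are settled; they are only meant to frame the discussion:

    import org.apache.spark.sql.types.DataType

    // Hypothetical sketch only -- not an agreed-upon API. The idea is that a
    // source exposes its partition/bucketing functions so Spark can apply the
    // same function to the other side of a join and compare functions across
    // tables (e.g. check that table A uses the same bucketing as table B).
    trait V2Function {
      def name: String                          // e.g. "bucket" or "hours"
      def canonicalName: String                 // used to test compatibility between tables
      def inputTypes: Array[DataType]
      def resultType: DataType
      def produceResult(args: Array[Any]): Any  // evaluate the function for one row
    }

    trait FunctionCatalog {
      // Identifiers are simplified to plain strings in this sketch.
      def loadFunction(name: String): V2Function
    }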

-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Release Apache Spark 2.4.3

2019-05-06 Thread Wenchen Fan
+1.

The Scala version problem has been resolved, which was the main motivation
for 2.4.3.

On Mon, May 6, 2019 at 12:38 AM Felix Cheung wrote:

> I ran basic tests on R, r-hub etc. LGTM.
>
> +1 (limited - I didn’t get to run other usual tests)
>
> --
> *From:* Sean Owen 
> *Sent:* Wednesday, May 1, 2019 2:21 PM
> *To:* Xiao Li
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [VOTE] Release Apache Spark 2.4.3
>
> +1 from me. There is little change from 2.4.2 anyway, except for the
> important change to the build script that should build pyspark with
> Scala 2.11 jars. I verified that the package contains the _2.11 Spark
> jars, but have a look!
>
> I'm still getting this weird error from the Kafka module when testing,
> but it's a long-standing weird known issue:
>
> [error]
> /home/ubuntu/spark-2.4.3/external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala:85:
> Symbol 'term org.eclipse' is missing from the classpath.
> [error] This symbol is required by 'method
> org.apache.spark.metrics.MetricsSystem.getServletHandlers'.
> [error] Make sure that term eclipse is in your classpath and check for
> conflicting dependencies with `-Ylog-classpath`.
> [error] A full rebuild may help if 'MetricsSystem.class' was compiled
> against an incompatible version of org.
> [error] testUtils.sendMessages(topic, data.toArray)
>
> Killing zinc and rebuilding didn't help.
> But this isn't happening in Jenkins for example, so it should be
> env-specific.
>
> On Wed, May 1, 2019 at 9:39 AM Xiao Li  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.4.3.
> >
> > The vote is open until May 5th PST and passes if a majority +1 PMC votes
> > are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.3-rc1 (commit
> > c3e32bf06c35ba2580d46150923abfa795b4446a):
> > https://github.com/apache/spark/tree/v2.4.3-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1324/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.3-rc1-docs/
> >
> > The list of bug fixes going into 2.4.3 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12345410
> >
> > The release is using the release script of the branch 2.4.3-rc1 with the
> > following commit:
> > https://github.com/apache/spark/commit/e417168ed012190db66a21e626b2b8d2332d6c01
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running it on this release candidate,
> > then reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks. In Java/Scala, you
> > can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
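> >
> > For Java/Scala, a minimal build.sbt sketch (illustrative only, not an
> > official template) pointing a project at the staging repository listed
> > above:
> >
> >   // Resolve the 2.4.3 RC artifacts from the staging repository above.
> >   resolvers += "Apache Spark 2.4.3 RC1 staging" at
> >     "https://repository.apache.org/content/repositories/orgapachespark-1324/"
> >
> >   // The staged artifacts use the final version number, 2.4.3.
> >   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % Provided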
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.3?
> > ===
> >
> > The current list of open tickets targeted at 2.4.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 2.4.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>