Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Nicholas Chammas
I had trouble starting up a shell with the AWS package loaded
(specifically, org.apache.hadoop:hadoop-aws:2.7.3):


[NOT FOUND  ] com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle) (0ms)
 local-m2-cache: tried
  file:/home/ec2-user/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar

[NOT FOUND  ] org.codehaus.jettison#jettison;1.1!jettison.jar(bundle) (1ms)
 local-m2-cache: tried
  file:/home/ec2-user/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar

[NOT FOUND  ] com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar (0ms)
 local-m2-cache: tried
  file:/home/ec2-user/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar

::
::          FAILED DOWNLOADS          ::
:: ^ see resolution messages for details ^ ::
::
:: com.sun.jersey#jersey-json;1.9!jersey-json.jar(bundle)
:: org.codehaus.jettison#jettison;1.1!jettison.jar(bundle)
:: com.sun.xml.bind#jaxb-impl;2.2.3-1!jaxb-impl.jar
:: com.sun.jersey#jersey-server;1.9!jersey-server.jar(bundle)
::

Anyone know anything about this? I made sure to build Spark against the
appropriate version of Hadoop.
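
(For context, the shell was started via the standard packages mechanism,
roughly as below; the exact invocation is a reconstruction rather than a
quote from the session:)

    ./bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3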

Nick

On Tue, Apr 18, 2017 at 2:59 PM Michael Armbrust wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc3 (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1230/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> *What happened to RC1?*
>
> There were issues with the release packaging and, as a result, it was skipped.
>


[DStream][Kinesis] Requesting review for spark-kinesis retries

2017-04-18 Thread Yash Sharma
Hi Fellow Devs,
Please share your thoughts on the pull request that allows Spark to retry
Kinesis streaming operations more gracefully.

The patch removes hard-coded retry values and lets users pass them in via
configuration. This will help users cope with Kinesis throttling errors and
make Spark's recovery faster. I would be happy to work on any suggested
changes.

https://github.com/apache/spark/pull/17467
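
For illustration, the kind of tuning this enables might be configured roughly
as follows (the config key names below are placeholders for this sketch, not
necessarily the keys the PR introduces; see the PR diff for the actual ones):

    # Sketch only: the retry-related keys below are illustrative assumptions.
    from pyspark import SparkConf

    conf = (SparkConf()
            .setAppName("kinesis-retry-demo")
            # hypothetical keys standing in for the PR's configurable retry settings
            .set("spark.streaming.kinesis.retry.waitTime", "500ms")
            .set("spark.streaming.kinesis.retry.maxAttempts", "5"))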

Best Regards,
Yash


Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-18 Thread Michael Armbrust
In case it wasn't obvious from the appearance of RC3, this vote failed.

On Thu, Mar 30, 2017 at 4:09 PM, Michael Armbrust wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc2 (02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1227/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.1?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> *What happened to RC1?*
>
> There were issues with the release packaging and, as a result, it was skipped.
>


[VOTE] Apache Spark 2.1.1 (RC3)

2017-04-18 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version
2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.1-rc3 (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1230/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/


*FAQ*

*How can I help test this release?*

If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.

*What should happen to JIRA tickets still targeting 2.1.1?*

Committers should look at those and triage. Extremely important bug fixes,
documentation, and API tweaks that impact compatibility should be worked on
immediately. Everything else please retarget to 2.1.2 or 2.2.0.

*But my bug isn't fixed!??!*

In order to make timely releases, we will typically not hold the release
unless the bug in question is a regression from 2.1.0.

*What happened to RC1?*

There were issues with the release packaging and, as a result, it was skipped.


Re: RDD functions using GUI

2017-04-18 Thread Reynold Xin
This is not really a dev list question ... I'm sure some tools exist out
there, e.g. Talend, Alteryx.


On Tue, Apr 18, 2017 at 10:35 AM, Ke Yang (Conan) wrote:

> Ping… I wonder why there aren't any such drag-and-drop GUI tools for
> creating batch query scripts?
>
> Thanks
>
>
>
> *From:* Ke Yang (Conan)
> *Sent:* Monday, April 17, 2017 5:31 PM
> *To:* 'dev@spark.apache.org' 
> *Subject:* RDD functions using GUI
>
>
>
> Hi,
>
>   Are there any drag-and-drop (code-free) GUIs available for RDD functions,
> i.e. a GUI that generates code based on drag-and-drop actions?
>
> http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
>
>
>
> thanks for brainstorming
>
>
>


CfP - VHPC at ISC extension - Papers due May 2

2017-04-18 Thread VHPC 17


CALL FOR PAPERS


12th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '17)

held in conjunction with the International Supercomputing Conference - High
Performance, June 18-22, 2017, Frankfurt, Germany.

(Springer LNCS Proceedings)



Date: June 22, 2017

Workshop URL: http://vhpc.org

Paper Submission Deadline: May 2, 2017 (extended), Springer LNCS, rolling
abstract submission

Abstract/Paper Submission Link: https://edas.info/newPaper.php?c=23179

Keynotes:

Satoshi Matsuoka, Professor of Computer Science, Tokyo Institute of
Technology, and

John Goodacre, Professor in Computer Architectures, University of
Manchester; Director of Technology and Systems, ARM Ltd. Research Group; and
Chief Scientific Officer, Kaleao Ltd.



Call for Papers

Virtualization technologies constitute a key enabling factor for flexible
resource management in modern data centers, and particularly in cloud
environments. Cloud providers need to manage complex infrastructures in a
seamless fashion to support the highly dynamic and heterogeneous workloads
and hosted applications customers deploy. Similarly, HPC environments have
been increasingly adopting techniques that enable flexible management of
vast computing and networking resources, close to marginal provisioning
cost, which is unprecedented in the history of scientific and commercial
computing.

Various virtualization technologies contribute to the overall picture in
different ways: machine virtualization, with its capability to enable
consolidation of multiple under-utilized servers with heterogeneous software
and operating systems (OSes), and its capability to live-migrate a fully
operating virtual machine (VM) with a very short downtime, enables novel and
dynamic ways to manage physical servers; OS-level virtualization (i.e.,
containerization), with its capability to isolate multiple user-space
environments and to allow for their co-existence within the same OS kernel,
promises to provide many of the advantages of machine virtualization with
high levels of responsiveness and performance; I/O virtualization allows
physical NICs/HBAs to take traffic from multiple VMs or containers; network
virtualization, with its capability to create logical network overlays that
are independent of the underlying physical topology and IP addressing,
provides the fundamental ground on top of which evolved network services can
be realized with an unprecedented level of dynamicity and flexibility; the
increasingly adopted paradigm of Software-Defined Networking (SDN) promises
to extend this flexibility to the control and data planes of network paths.

Publication

Accepted papers will be published in a Springer LNCS proceedings volume.


Topics of Interest

The VHPC program committee solicits original, high-quality submissions
related to virtualization across the entire software stack with a special
focus on the intersection of HPC and the cloud.

Major Topics

- Virtualization in supercomputing environments, HPC clusters, HPC in the
cloud and grids
- OS-level virtualization and containers (Docker, rkt, Singularity,
Shifter, i.a.)
- Lightweight/specialized operating systems, unikernels
- Optimizations of virtual machine monitor platforms and hypervisors
- Hypervisor support for heterogeneous resources (GPUs, co-processors,
FPGAs, etc.)
- Virtualization support for emerging memory technologies
- Virtualization in enterprise HPC and microvisors
- Software-defined networks and network virtualization
- Management and deployment of virtualized environments and orchestration
(Kubernetes i.a.)
- Workflow-pipeline container-based composability
- Performance measurement, modelling and monitoring of virtualized/cloud
workloads
- Virtualization in data intensive computing and Big Data processing - HPC
convergence
- Adaptation of HPC technologies in the cloud (high performance networks,
RDMA, etc.)
- ARM-based hypervisors, ARM virtualization extensions
- I/O virtualization and cloud-based storage systems
- GPU, FPGA and many-core accelerator virtualization
- Job scheduling/control/policy and container placement in virtualized
environments
- Cloud reliability, fault-tolerance and high-availability
- QoS and SLA in virtualized environments
- IaaS platforms, cloud frameworks and APIs
- Large-scale virtualization in domains such as finance and government
- Energy-efficient and power-aware virtualization
- Container security
- Configuration management tools for containers (including CFEngine,
Puppet, i.a.)
- Emerging topics, including multi-kernel approaches and NUMA in hypervisors




The Workshop on Virtualization in High-Performance Cloud Computing (VHPC)
aims to bring together researchers and industrial practitioners facing the
challenges posed by virtualization in order to foster discussion,
collaboration, mutual 

RE: RDD functions using GUI

2017-04-18 Thread Ke Yang (Conan)
Ping... I wonder why there aren't any such drag-and-drop GUI tools for
creating batch query scripts?
Thanks

From: Ke Yang (Conan)
Sent: Monday, April 17, 2017 5:31 PM
To: 'dev@spark.apache.org' 
Subject: RDD functions using GUI

Hi,
  Are there any drag-and-drop (code-free) GUIs available for RDD functions,
i.e. a GUI that generates code based on drag-and-drop actions?
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds

thanks for brainstorming



branch-2.2 has been cut

2017-04-18 Thread Michael Armbrust
I just cut the release branch for Spark 2.2.  If you are merging important
bug fixes, please backport as appropriate.  If you have doubts about whether
something should be backported, please ping me.  I'll follow with an RC
later this week.
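
For anyone newer to the process, a typical backport is sketched below
(generic git commands, not the official workflow; committers may prefer the
dev/merge_spark_pr.py helper, and the remote name here is an assumption):

    # Cherry-pick a fix from master onto the new release branch.
    git checkout branch-2.2
    git cherry-pick -x <commit-sha-from-master>   # -x records the original commit
    git push apache branch-2.2                    # assumes a remote named 'apache'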


Re: distributed computation of median

2017-04-18 Thread pavan adukuri

Do you know of any Python implementation of the same?

thanks
pavan
On 4/17/17, 9:54 AM, svjk24 wrote:

Hello,
  Is there any interest in an efficient distributed computation of the median?
A Google search pulls up some Stack Overflow discussion, but it would be
good to have one provided.


I have an implementation (that could be improved)
from the paper "Fast Computation of the Median by Successive Binning":

https://github.com/4d55397500/medianbinning

Thanks-