Update Public Documentation - SparkSession instead of SparkContext

2017-02-14 Thread Chetan Khatri
Hello Spark Dev Team,

While working with my team, a common source of confusion was why the
public documentation has not been updated to use SparkSession, given that
SparkSession is the recommended entry point and best practice instead of
creating a SparkContext directly.
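
For context, a minimal sketch of the SparkSession-based entry point the docs
could show (assuming PySpark 2.x; the app name is only a placeholder):

    # Minimal sketch, assuming PySpark 2.x.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("docs-example") \
        .getOrCreate()

    df = spark.range(10)      # DataFrame API through the session
    print(df.count())

    # The older entry point is still reachable when the RDD API is needed:
    sc = spark.sparkContext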

Thanks.


Re: Spark Improvement Proposals

2017-02-14 Thread Cody Koeninger
Thanks for doing that.

Given that there are at least 4 different Apache voting processes, "typical
Apache vote process" isn't meaningful to me.

I think the intention is that in order to pass, it needs at least 3 +1
votes from PMC members *and no -1 votes from PMC members*.  But the
document doesn't explicitly say that second part.

There's also no mention of the duration a vote should remain open.  There's
a mention of a month for finding a shepherd, but that's different.

Other than that, LGTM.

On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin  wrote:

> Here's a new draft that incorporated most of the feedback:
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>
> I added a specific role for SPIP Author and another one for SPIP Shepherd.
>
> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li  wrote:
>
>> During the summit, I also had a lot of discussions over similar topics
>> with multiple Committers and active users. I heard many fantastic ideas. I
>> believe Spark improvement proposals are good channels to collect the
>> requirements/designs.
>>
>>
>> IMO, we also need to consider the priority when working on these items.
>> Even if the proposal is accepted, it does not mean it will be implemented
>> and merged immediately. It is not a FIFO queue.
>>
>>
>> Even after some PRs are merged, we sometimes still have to revert them
>> if the design and implementation were not reviewed carefully. We have
>> to ensure our quality. Spark is not application software; it is
>> infrastructure software that is used by many, many companies. We have
>> to be very careful in the design and implementation, especially when
>> adding/changing the external APIs.
>>
>>
>> When I developed mainframe infrastructure/middleware software over the
>> past 6 years, I was involved in discussions with external/internal
>> customers. The to-do feature list was always above 100 items. Sometimes the
>> customers felt frustrated when we were unable to deliver on time
>> due to resource limits and other constraints. Even when they paid us billions, we
>> still needed to do it phase by phase, or sometimes they had to accept
>> workarounds. That is the reality everyone has to face, I think.
>>
>>
>> Thanks,
>>
>>
>> Xiao Li
>>
>> 2017-02-11 7:57 GMT-08:00 Cody Koeninger :
>>
>>> At the Spark summit this week, everyone from PMC members to users I had
>>> never met before was asking me about the Spark improvement proposals
>>> idea.  It's clear that it's a real community need.
>>>
>>> But it's been almost half a year, and nothing visible has been done.
>>>
>>> Reynold, are you going to do this?
>>>
>>> If so, when?
>>>
>>> If not, why?
>>>
>>> You already did the right thing by including long-deserving committers.
>>> Please keep doing the right thing for the community.
>>>
>>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin 
>>> wrote:
>>>
 +1 on all counts (consensus, time bound, define roles)

 I can update the doc in the next few days and share back. Then maybe we
 can just officially vote on this. As Tim suggested, we might not get it
 100% right the first time and would need to iterate. But that's fine.


 On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter 
 wrote:

> Hi Cody,
> thank you for bringing up this topic. I agree it is very important to
> keep a cohesive community around some common, fluid goals. Here are a few
> comments about the current document:
>
> 1. name: it should not overlap with an existing one such as SIP. Can
> you imagine someone trying to discuss a scala spore proposal for spark?
> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
> sounds great.
>
> 2. roles: at a high level, SPIPs are meant to reach consensus for
> technical decisions with a lasting impact. As such, the template should
> emphasize the role of the various parties during this process:
>
>  - the SPIP author is responsible for building consensus. She is the
> champion driving the process forward and is responsible for ensuring that
> the SPIP follows the general guidelines. The author should be identified in
> the SPIP. The authorship of a SPIP can be transferred if the current author
> is not interested and someone else wants to move the SPIP forward. There
> should probably be 2-3 authors at most for each SPIP.
>
>  - someone with voting power should probably shepherd the SPIP (and be
> recorded as such): ensuring that the final decision over the SPIP is
> recorded (rejected, accepted, etc.), and advising about the technical
> quality of the SPIP: this person need not be a champion for the SPIP or
> contribute to it, but rather makes sure it stands a chance of being
> approved when the vote happens. Also, if the 

Re: Request for comments: Java 7 removal

2017-02-14 Thread Yuming Wang
There is a way to have only Spark use Java 8 while Hadoop still uses Java 7:
(attachment: spark-conf.jpg, 58K)
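
The attachment isn't reproduced here, so as a hedged guess at the kind of
setting involved: on YARN you can point only the executor and application
master JVMs at a Java 8 install via environment configs, while the driver
JVM is still governed by whatever JAVA_HOME spark-submit runs under. The
path and app name below are placeholders, not the attachment's contents:

    # Hedged sketch, assuming a YARN deployment; not the attachment's contents.
    # spark.executorEnv.<VAR> and spark.yarn.appMasterEnv.<VAR> are standard
    # Spark configs for setting environment variables on executors and the AM.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    java8_home = "/usr/java/jdk1.8.0"   # placeholder path to a Java 8 install

    conf = (SparkConf()
            .set("spark.executorEnv.JAVA_HOME", java8_home)
            .set("spark.yarn.appMasterEnv.JAVA_HOME", java8_home))

    spark = (SparkSession.builder
             .appName("java8-only-spark")
             .config(conf=conf)
             .getOrCreate())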




By the way, I have a way to install any Spark version on CM5.4 - CM5.7 via a
custom CSD and a custom Spark parcel.

On Wed, Feb 15, 2017 at 6:46 AM, Koert Kuipers  wrote:

> what about the conversation about dropping scala 2.10?
>
> On Fri, Feb 10, 2017 at 11:47 AM, Sean Owen  wrote:
>
>> As you have seen, there's a WIP PR to implement removal of Java 7
>> support: https://github.com/apache/spark/pull/16871
>>
>> I have heard several +1s at https://issues.apache.org/jira/browse/SPARK-19493
>> but am asking for concerns too, now that there's
>> a concrete change to review.
>>
>> If this goes in for 2.2 it can be followed by more extensive update of
>> the Java code to take advantage of Java 8; this is more or less the
>> baseline change.
>>
>> We also just removed Hadoop 2.5 support. I know there was talk about
>> removing Python 2.6. I have no opinion on that myself, but, might be time
>> to revive that conversation too.
>>
>
>


Re: Request for comments: Java 7 removal

2017-02-14 Thread Koert Kuipers
what about the conversation about dropping scala 2.10?

On Fri, Feb 10, 2017 at 11:47 AM, Sean Owen  wrote:

> As you have seen, there's a WIP PR to implement removal of Java 7 support:
> https://github.com/apache/spark/pull/16871
>
> I have heard several +1s at https://issues.apache.org/jira/browse/SPARK-19493
> but am asking for concerns too, now that there's
> a concrete change to review.
>
> If this goes in for 2.2 it can be followed by more extensive update of the
> Java code to take advantage of Java 8; this is more or less the baseline
> change.
>
> We also just removed Hadoop 2.5 support. I know there was talk about
> removing Python 2.6. I have no opinion on that myself, but, might be time
> to revive that conversation too.
>


Re: Request for comments: Java 7 removal

2017-02-14 Thread Sean Owen
Yes, that's a key concern about the Java dependency, that its update is a
function of the OS packages and those who control them, which is often not
the end user. I think that's why this has been delayed a while. My general
position is that, of course, someone in that boat can use Spark 2.1.x. It's
likely going to see maintenance releases through the end of the year, even.
On the flip side, no (non-paid) support has been available for Java 7 for a
while. It wouldn't surprise me if some people are still stuck on Java
7; it would surprise me if they expect to use the latest of any package at
this stage. Taking your CDH example, yes it's been a couple years since
people have been able to deploy it on Java 8. Spark 2 isn't supported
before 5.7 anyway. The default is Java 8.

Scala 2.10 is a good point that we are dealing with now. It's not really a
question of whether it will run -- it's all libraries and bytecode to the
JVM and it will happily deal with a mix of 7 and 8 bytecode. It's a
question of whether the build for 2.10 will succeed. I believe it's 'yes'
but am following up on some tests there.

On Tue, Feb 14, 2017 at 1:15 AM Charles Allen 
wrote:

> I think the biggest concern is enterprise users/operators who do not have
> the authority or access to upgrade hadoop/yarn clusters to java8. As a
> reference point, apparently CDH 5.3 shipped with java 8 in December 2014.
> I would be surprised if such users
> were active consumers of the dev mailing list, though. Unfortunately
> there's a bit of a selection bias in this list.
>
> The other concern is whether there is guaranteed compatibility between scala
> and java8 for all versions you want to use (which is somewhat touched upon
> in the PR). Are you thinking about supporting scala 2.10 against java 8
> byte code?
>
> See https://groups.google.com/d/msg/druid-user/aTGQlnF1KLk/NvBPfmigAAAJ for
> the similar discussion that went forward in the Druid community.
>
>
> On Fri, Feb 10, 2017 at 8:47 AM Sean Owen  wrote:
>
> As you have seen, there's a WIP PR to implement removal of Java 7 support:
> https://github.com/apache/spark/pull/16871
>
> I have heard several +1s at
> https://issues.apache.org/jira/browse/SPARK-19493 but am asking for
> concerns too, now that there's a concrete change to review.
>
> If this goes in for 2.2 it can be followed by more extensive update of the
> Java code to take advantage of Java 8; this is more or less the baseline
> change.
>
> We also just removed Hadoop 2.5 support. I know there was talk about
> removing Python 2.6. I have no opinion on that myself, but, might be time
> to revive that conversation too.
>
>


Re: [PYTHON][DISCUSS] Moving to cloudpickle and or Py4J as a dependencies?

2017-02-14 Thread Maciej Szymkiewicz
I don't have any strong views, so just to highlight possible issues:

  * Based on different issues I've seen, there is a substantial number of
users who depend on system-wide Python installations. As far as I
am aware, neither Py4j nor cloudpickle is present in the standard
system repositories of Debian or Red Hat derivatives.
  * Assuming that Spark is committed to supporting Python 2 beyond its
end of life, we have to be sure that any external dependency has the
same policy.
  * Py4j is missing from the default Anaconda channel. Not a big issue,
just a small annoyance.
  * External dependencies with pinned versions add some overhead to
development across versions (effectively we may need a separate env
for each major Spark release). I've seen small inconsistencies in
PySpark behavior with different Py4j versions, so this is not
completely hypothetical.
  * Adding possible version conflicts. It is probably not a big risk, but
something to consider (for example in the combination Blaze + Dask +
PySpark).
  * Adding another party the user has to trust.
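
For concreteness, a rough sketch of what the pinned-dependency approach
proposed below (setup.py plus a requirements.txt) might look like - the
package versions here are purely illustrative placeholders, not a proposal:

    # Illustrative setup.py fragment only; the pinned packages are real
    # projects, but the exact versions are placeholders.
    from setuptools import setup

    setup(
        name="pyspark",
        version="2.2.0.dev0",            # placeholder
        install_requires=[
            "py4j==0.10.4",              # pinned because internal APIs are used
            "cloudpickle==0.2.2",        # pinned to keep serialization stable
        ],
    )

    # A matching requirements.txt would simply repeat the same two pins for
    # users who prefer not to do a system installation of PySpark.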


On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in 2016
> and is a bit involved (from a review point of view verifying the zip
> file contents is somewhat tedious).
>
> cloudpickle is a bit difficult to tell since we can have changes to
> cloudpickle which aren't correctly tagged as backporting changes from
> the fork (and this can take a while to review since we don't always
> catch them right away as being backports).
>
> Another difficulty with looking at backports is that since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
>
> I think the key benefits are better test coverage of the forked
> version of cloudpickle, a more standardized packaging of
> dependencies, and simpler dependency updates, which reduce the friction in
> gaining benefits from other related projects' work - Python
> serialization really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs I'd love to know :)
>
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin  > wrote:
>
> With any dependency update (or refactoring of existing code), I
> always ask this question: what's the benefit? In this case it
> looks like the benefit is to reduce efforts in backports. Do you
> know how often we needed to do those?
>
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau
> > wrote:
>
> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, and was originally
> copied from (and improved upon) picloud. Since then other
> projects have found cloudpickle useful, and a fork of
> cloudpickle was
> created and is now maintained as its own library
> (with better test
> coverage and resulting bug fixes, I understand). We've had a
> few PRs backporting fixes from the cloudpickle project into
> Spark's local copy of cloudpickle - how would people feel
> about moving to an explicit (pinned) dependency on
> cloudpickle?
>
> We could add cloudpickle to the setup.py and a
> requirements.txt file for users who prefer not to do a system
> installation of PySpark.
>
> Py4J is maybe an even simpler case: we currently have a zip of
> py4j in our repo but could instead require a pinned version.
> While we do depend on a lot of py4j internal APIs,
> version pinning should be sufficient to ensure functionality
> (and simplify the update process).
>
> Cheers,
>
> Holden :)
>
> -- 
> Twitter: https://twitter.com/holdenkarau
> 
>
>
>
>
>
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

-- 
Maciej Szymkiewicz



Fwd: tylerchap...@yahoo-inc.com is no longer with Yahoo! (was: Dealing with missing columns in SPARK SQL in JSON)

2017-02-14 Thread Aseem Bansal
Can someone please remove tylerchap...@yahoo-inc.com from the mailing list?
I was told in a Spark JIRA that the dev mailing list is the right place to ask
for this.

-- Forwarded message --
From: Yahoo! No Reply 
Date: Tue, Feb 14, 2017 at 8:00 PM
Subject: tylerchap...@yahoo-inc.com is no longer with Yahoo! (was: Dealing
with missing columns in SPARK SQL in JSON)
To: asmbans...@gmail.com



This is an automatically generated message.

tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc.

Your message will not be forwarded.

If you have a sales inquiry, please email yahoosa...@yahoo-inc.com and
someone will follow up with you shortly.

If you require assistance with a legal matter, please send a message to
legal-noti...@yahoo-inc.com

Thank you!


Fwd: Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Dear all,

Can you please comment on the use case mentioned below?

Thanking you in advance

Regards,
Anis


-- Forwarded message -
From: Anis Nasir 
Date: Tue, 14 Feb 2017 at 17:01
Subject: Handling Skewness and Heterogeneity
To: 


Dear All,

I have a few use cases for Spark Streaming where the Spark cluster consists of
heterogeneous machines.

Additionally, there is skew present in both the input distribution (e.g.,
each tuple is drawn from a zipf distribution) and the service time (e.g.,
service time required for each tuple comes from a zipf distribution).
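
To make the skew concrete, a small sketch of how such a workload could be
simulated (numpy's Zipf generator; the exponents and sizes are arbitrary
placeholders, not values from my actual use case):

    # Sketch of the described skew, not a solution: keys and per-tuple service
    # times both follow heavy-tailed Zipf distributions.
    import numpy as np

    np.random.seed(42)
    n_tuples = 100000
    keys = np.random.zipf(1.5, n_tuples)        # skewed key distribution
    service_ms = np.random.zipf(2.0, n_tuples)  # skewed per-tuple service time

    # A handful of keys dominate, which is what stresses static partitioning.
    _, counts = np.unique(keys, return_counts=True)
    print("most frequent key carries %.1f%% of tuples" % (100.0 * counts.max() / n_tuples))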

I want to know how Spark will handle such use cases.

Any help will be highly appreciated!


Regards,
Anis


Re: Cannot find checkstyle.xml

2017-02-14 Thread Jakub Dubovsky
Somebody is able to help with this? I am stuck on this in my attempt to
help solve issues:

SPARK-16599 
sparkNB-807 

Thanks

On Thu, Feb 9, 2017 at 10:18 AM, Jakub Dubovsky <
spark.dubovsky.ja...@gmail.com> wrote:

> Thanks Ted for trying (see below for Ted's reply). Can somebody confirm
> that this is not expected behaviour? Is somebody else having the same
> issue?
>
> Thanks!
>
>
> On Wed, Feb 8, 2017 at 11:42 PM, Ted Yu  wrote:
>
>> Using your command, I got:
>>
>> Caused by: org.apache.maven.project.DependencyResolutionException: Could
>> not resolve dependencies for project 
>> org.apache.spark:spark-launcher_2.11:jar:2.1.0:
>> Could not find artifact org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.7.1
>> in central (https://repo1.maven.org/maven2)
>> at org.apache.maven.project.DefaultProjectDependenciesResolver.
>> resolve(DefaultProjectDependenciesResolver.java:211)
>> at org.apache.maven.lifecycle.internal.LifecycleDependencyResol
>> ver.getDependencies(LifecycleDependencyResolver.java:195)
>> ... 23 more
>>
>> So I switched to:
>>
>> ./build/mvn -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn -DskipTests -e
>> clean install
>>
>> With your change to pom.xml, I got the same error.
>>
>> Without change to pom.xml, the build continues.
>>
>> On Wed, Feb 8, 2017 at 8:19 AM, Jakub Dubovsky <
>> spark.dubovsky.ja...@gmail.com> wrote:
>>
>>> Sorry, correct links set in text below.
>>>
>>>
>>> Hello there,

 I am trying to build spark locally so I can test something to help
 resolve this ticket.

 git checkout v2.1.0
 ./build/mvn -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn
 -Dhadoop.version=2.6.0-cdh5.7.1 -DskipTests -e clean install

 This starts the build successfully. Then I changed one source file and
 a version in pom.xml (exact diff).
 After this change, when I run the same build command as above, I get a failure:

 Could not find resource 'dev/checkstyle.xml'

 whole build log
 

 How can this one-commit change cause this error? checkstyle.xml is
 still there. I run Maven from the project root in both cases. What should I
 change to make this build?

 Thanks for your help

 Jakub


>>>
>>
>