Re: Question on Spark code

2017-07-23 Thread Reynold Xin
This is a standard practice used for chaining, to support

a.setStepSize(..)
  .setRegParam(...)
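
A minimal sketch of the pattern (a simplified, hypothetical optimizer class, not the actual GradientDescent code):

class Optimizer {
  private var stepSize: Double = 1.0
  private var regParam: Double = 0.0

  // Returning this.type instead of Optimizer (or Unit) lets callers chain
  // setters, and subclasses keep returning their own concrete type.
  def setStepSize(step: Double): this.type = {
    require(step > 0, s"Initial step size must be positive but got ${step}")
    this.stepSize = step
    this
  }

  def setRegParam(reg: Double): this.type = {
    this.regParam = reg
    this
  }
}

class MyOptimizer extends Optimizer

// Chained calls compile, and the result is still typed as MyOptimizer:
val opt: MyOptimizer = new MyOptimizer().setStepSize(0.1).setRegParam(0.01)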


On Sun, Jul 23, 2017 at 8:47 PM, tao zhan <zhanta...@gmail.com> wrote:

> Thank you for replying.
> But I do not get it completely. Why is the "this.type" necessary?
> Why could it not be like:
>
> def setStepSize(step: Double): Unit = {
>   require(step > 0,
>     s"Initial step size must be positive but got ${step}")
>   this.stepSize = step
> }
>
> On Mon, Jul 24, 2017 at 11:29 AM, M. Muvaffak ONUŞ <
> onus.muvaf...@gmail.com> wrote:
>
>> Doesn't it mean the return type will be the type of "this" class? So it
>> doesn't have to be this instance of the class, but it has to be the type of
>> this instance of the class. When you have a stack of inheritance and call
>> that function, it will return the same type as the level at which you called it.
>>
>> On Sun, Jul 23, 2017 at 8:20 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> It means the same object ("this") is returned.
>>>
>>> On Sun, Jul 23, 2017 at 8:16 PM, tao zhan <zhanta...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am new to scala and spark.
>>>> What is the "this.type" in the set function for?
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/481f0792944d9a77f0fe8b5e2596da1d600b9d0a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L48
>>>>
>>>> Thanks!
>>>>
>>>> Zhan
>>>>
>>>
>>>
>


Re: Question on Spark code

2017-07-23 Thread Reynold Xin
It means the same object ("this") is returned.

On Sun, Jul 23, 2017 at 8:16 PM, tao zhan  wrote:

> Hello,
>
> I am new to scala and spark.
> What is the "this.type" in the set function for?
>
>
> https://github.com/apache/spark/blob/481f0792944d9a77f0fe8b5e2596da1d600b9d0a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L48
>
> Thanks!
>
> Zhan
>


Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-27 Thread Reynold Xin
+1
On Thu, Apr 27, 2017 at 11:59 AM Michael Armbrust 
wrote:

> I'll also +1
>
> On Thu, Apr 27, 2017 at 4:20 AM, Sean Owen  wrote:
>
>> +1 , same result as with the last RC. All checks out for me.
>>
>> On Thu, Apr 27, 2017 at 1:29 AM Michael Armbrust 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.1. The vote is open until Sat, April 29th, 2017 at 18:00
>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.1-rc4
>>> (267aca5bd5042303a718d10635bc0d1a1596853f)
>>>
>>> List of JIRA tickets resolved can be found with this filter.
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1232/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.0.
>>>
>>> *What happened to RC1?*
>>>
>>> There were issues with the release packaging and as a result it was skipped.
>>>
>>
>


Re: Thoughts on release cadence?

2017-07-30 Thread Reynold Xin
This is reasonable ... +1


On Sun, Jul 30, 2017 at 2:19 AM, Sean Owen  wrote:

> The project had traditionally posted some guidance about upcoming
> releases. The last release cycle was about 6 months. What about penciling
> in December 2017 for 2.3.0? http://spark.apache.org/versioning-policy.html
>


Re: Interested in contributing to spark eco

2017-07-28 Thread Reynold Xin
Shashi,

Welcome! There are a lot of ways you can help contribute. There is a page
documenting some of them: http://spark.apache.org/contributing.html

On Fri, Jul 28, 2017 at 1:35 PM, Shashi Dongur 
wrote:

> Hello All,
>
> I am looking for ways to contribute to Spark repo. I want to start with
> helping on running tests and improving documentation where needed.
>
> Please let me know how I can find avenues to help. How can I spot users
> who require assistance with testing? Or gathering documentation for any new
> PRs.
>
> Thanks,
> Shashi
>


Re: Increase Timeout or optimize Spark UT?

2017-08-20 Thread Reynold Xin
It seems like it's time to look into how to cut down some of the test
runtimes. Test runtimes will slowly go up given the way development
happens. 3 hr is already a very long time for tests to run.


On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun 
wrote:

> Hi, All.
>
>
>
> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6) has
> been hitting the build timeout.
>
>
>
> Please see the build time trend.
>
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>
>
>
> All recent 22 builds fail due to timeout directly/indirectly. The last
> success (SBT with Hadoop-2.7) is 15th August.
>
>
>
> We may do the followings.
>
>
>
>1. Increase Build Timeout (3 hr 30 min)
>2. Optimize UTs (Scala/Java/Python/UT)
>
>
>
> But, Option 1 will be the immediate solution for now. Could you update
> the Jenkins setup?
>
>
>
> Bests,
>
> Dongjoon.
>


Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Reynold Xin
+1


On Tue, Aug 22, 2017 at 12:25 PM, Maciej Szymkiewicz <mszymkiew...@gmail.com
> wrote:

> Hi,
>
> From my experience it is possible to cut quite a lot by reducing
> spark.sql.shuffle.partitions to some reasonable value (let's say
> comparable to the number of cores). 200 is a serious overkill for most of
> the test cases anyway.
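
A minimal sketch of this suggestion for a test suite's SparkSession (spark.sql.shuffle.partitions is an existing Spark SQL setting; the builder values here are only illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-tests")
  // The default of 200 shuffle partitions is far more than tiny test datasets need.
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()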
>
>
> Best,
> Maciej
>
>
>
> On 21 August 2017 at 03:00, Dong Joon Hyun <dh...@hortonworks.com> wrote:
>
>> +1 for any efforts to recover Jenkins!
>>
>>
>>
>> Thank you for the direction.
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>> *From: *Reynold Xin <r...@databricks.com>
>> *Date: *Sunday, August 20, 2017 at 5:53 PM
>> *To: *Dong Joon Hyun <dh...@hortonworks.com>
>> *Cc: *"dev@spark.apache.org" <dev@spark.apache.org>
>> *Subject: *Re: Increase Timeout or optimize Spark UT?
>>
>>
>>
>> It seems like it's time to look into how to cut down some of the test
>> runtimes. Test runtimes will slowly go up given the way development
>> happens. 3 hr is already a very long time for tests to run.
>>
>>
>>
>>
>>
>> On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun <dh...@hortonworks.com>
>> wrote:
>>
>> Hi, All.
>>
>>
>>
>> Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6) has
>> been hitting the build timeout.
>>
>>
>>
>> Please see the build time trend.
>>
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend
>>
>>
>>
>> All recent 22 builds fail due to timeout directly/indirectly. The last
>> success (SBT with Hadoop-2.7) is 15th August.
>>
>>
>>
>> We may do the followings.
>>
>>
>>
>>1. Increase Build Timeout (3 hr 30 min)
>>2. Optimize UTs (Scala/Java/Python/UT)
>>
>>
>>
>> But, Option 1 will be the immediate solution for now. Could you update
>> the Jenkins setup?
>>
>>
>>
>> Bests,
>>
>> Dongjoon.
>>
>>
>>
>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Reynold Xin
Yea I don't think it's a good idea to upload a doc and then call for a vote
immediately. People need time to digest ...


On Thu, Aug 17, 2017 at 6:22 AM, Wenchen Fan  wrote:

> Sorry let's remove the VOTE tag as I just wanna bring this up for
> discussion.
>
> I'll restart the voting process after we have enough discussion on the
> JIRA ticket or here in this email thread.
>
> On Thu, Aug 17, 2017 at 9:12 PM, Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> -1, I don't think there has really been any discussion of this api change
>> yet or at least it hasn't occurred on the jira ticket
>>
>> On Thu, Aug 17, 2017 at 8:05 AM Wenchen Fan  wrote:
>>
>>> adding my own +1 (binding)
>>>
>>> On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan 
>>> wrote:
>>>
 Hi all,

 Following the SPIP process, I'm putting this SPIP up for a vote.

 The current data source API doesn't work well because of some
 limitations like: no partitioning/bucketing support, no columnar read, hard
 to support more operator push down, etc.

 I'm proposing a Data Source API V2 to address these problems, please
 read the full document at https://issues.apache.org/jira/secure/attachment/12882332/SPIP%20Data%20Source%20API%20V2.pdf

 Since this SPIP is mostly about APIs, I also created a prototype and
 put java docs on these interfaces, so that it's easier to review these
 interfaces and discuss: https://github.com/cloud-fan/spark/pull/10/files

 The vote will be up for the next 72 hours. Please reply with your vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following
 technical reasons.

 Thanks!

>>>
>>>
>


Re: spark.sql.codegen.comments not in SQLConf?

2017-05-10 Thread Reynold Xin
It's probably because it is annoying to propagate that using SQL conf.
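
For reference, a rough sketch of what registering the property in SQLConf could look like, following the buildConf pattern already used in that file (the constant name, doc string and default below are assumptions, not actual Spark code):

val CODEGEN_COMMENTS = buildConf("spark.sql.codegen.comments")
  .internal()
  .doc("When true, include comments in the generated code.")
  .booleanConf
  .createWithDefault(false)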
On Wed, May 10, 2017 at 3:38 AM Jacek Laskowski  wrote:

> Hi,
>
> It seems that spark.sql.codegen.comments property [1] didn't find its
> place in SQLConf [2] that appears to be the place for all Spark
> SQL-related properties (for codegen surely).
>
> Don't think it merits a JIRA issue so just asking here.
>
> If agreed, I'd like to propose a PR. Thanks.
>
> [1]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L822
> [2]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Question: why is Externalizable used?

2017-06-19 Thread Reynold Xin
I responded on the ticket.


On Mon, Jun 19, 2017 at 2:36 AM, Sean Owen  wrote:

> Just wanted to call attention to this question, mostly because I'm curious:
> https://github.com/apache/spark/pull/18343#issuecomment-309388668
>
> Why is Externalizable (+ KryoSerializable) used instead of Serializable?
> and should the first two always go together?
>
>


Re: An Update on Spark on Kubernetes [Jun 23]

2017-06-23 Thread Reynold Xin
Thanks, Anirudh. This is super helpful!

On Fri, Jun 23, 2017 at 9:50 AM, Anirudh Ramanathan 
wrote:

> *Project Description: *Kubernetes cluster manager integration that
> enables native support for submitting Spark applications to a kubernetes
> cluster. The submitted applications can make use of Kubernetes native
> constructs.
>
> *JIRA*: 18278 
> *Upstream Kubernetes Issue*: 34377
> 
>
> *Summary of activity in the last 2 weeks*:
>
>    - PySpark support being added by ifilonenko (Bloomberg)
>- HDFS locality fixes by kimoonkim (PepperData)
>- Minor bug fixes and performance improvement (Tencent)
>- Planning of work around Spark and HDFS on Kubernetes for Q3 2017
>    - Weekly meeting notes and discussions (link)
>
> *Recent talks and updates*:
>
>- Spark on Kubernetes (Spark Summit 2017) - link
>
>- HDFS on Kubernetes - Lessons Learned (Spark Summit 2017) - link
>
>
>
> For more details and to contribute to our discussions/efforts, please
> attend our weekly public meeting, or join our slack/mailing list. (link)
>


[SPARK-21190] SPIP: Vectorized UDFs in Python

2017-06-23 Thread Reynold Xin
Welcome to the first real SPIP.

SPIP: Vectorized UDFs for Python



https://issues.apache.org/jira/browse/SPARK-21190



Background and Motivation:



Python is one of the most popular programming languages among Spark users.
Spark currently exposes a row-at-a-time interface for defining and
executing user-defined functions (UDFs). This introduces high overhead in
serialization and deserialization, and also makes it difficult to leverage
Python libraries (e.g. numpy, Pandas) that are written in native code.



This proposal advocates introducing new APIs to support vectorized UDFs in
Python, in which a block of data is transferred over to Python in some
columnar format for execution.





Target Personas:

Data scientists, data engineers, library developers.



Goals:

   -

   Support vectorized UDFs that apply on chunks of the data frame
   -

   Low system overhead: Substantially reduce serialization and
   deserialization overhead when compared with row-at-a-time interface
   -

   UDF performance: Enable users to leverage native libraries in Python
   (e.g. numpy, Pandas) for data manipulation in these UDFs





Non-Goals:



The following are explicitly out of scope for the current SPIP, and should
be done in future SPIPs. Nonetheless, it would be good to consider these
future use cases during API design, so we can achieve some consistency when
rolling out new APIs.



- Define block oriented UDFs in other languages (that are not Python).

- Define aggregate UDFs

- Tight integration with machine learning frameworks





Proposed API Changes:



The following sketches some possibilities. I haven’t spent a lot of time
thinking about the API (wrote it down in 5 mins) and I am not attached to
this design at all. The main purpose of the SPIP is to get feedback on use
cases and see how they can impact API design.



Some things to consider are:



1. Python is dynamically typed, whereas DataFrames/SQL requires static,
analysis time typing. This means users would need to specify the return
type of their UDFs.



2. Ratio of input rows to output rows. We propose initially we require
number of output rows to be the same as the number of input rows. In the
future, we can consider relaxing this constraint with support for
vectorized aggregate UDFs.



3. On the Python side, what do we expose? The common libraries are numpy
(more for numeric / machine learning), and Pandas (more for structured data
analysis).





Proposed API sketch (using examples):



Use case 1. A function that defines all the columns of a DataFrame (similar
to a “map” function):



@spark_udf(some way to describe the return schema)

def my_func_on_entire_df(input):

 """ Some user-defined function.



 :param input: A Pandas DataFrame with two columns, a and b.

 :return: :class: A Pandas data frame.

 """

 input["c"] = input["a"] + input["b"]

 input["d"] = input["a"] - input["b"]

 return input



spark.range(1000).selectExpr("id a", "id / 2 b")

 .mapBatches(my_func_on_entire_df)





Use case 2. A function that defines only one column (similar to existing
UDFs):



@spark_udf(some way to describe the return schema)

def my_func_that_returns_one_column(input):

 """ Some user-defined function.



 :param input: A Pandas DataFrame with two columns, a and b.

 :return: :class: A numpy array

 """

 return input["a"] + input["b"]



my_func = udf(my_func_that_returns_one_column)



df = spark.range(1000).selectExpr("id a", "id / 2 b")

df.withColumn("c", my_func(df.a, df.b))





Optional Design Sketch:



I’m more concerned about getting proper feedback for API design. The
implementation should be pretty straightforward and is not a huge concern
at this point. We can leverage the same implementation for faster toPandas
(using Arrow).





Optional Rejected Designs:



See above.


Re: New metrics for WindowExec with number of partitions and frames?

2017-05-26 Thread Reynold Xin
That would be useful (number of partitions).
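
For context, a rough sketch of how a physical operator typically declares and updates such a metric via the existing SQLMetrics helper (the metric name and its placement inside WindowExec are assumptions for illustration):

import org.apache.spark.sql.execution.metric.SQLMetrics

// Inside a SparkPlan implementation such as WindowExec:
override lazy val metrics = Map(
  "numPartitions" -> SQLMetrics.createMetric(sparkContext, "number of partitions"))

// ...and in doExecute(), bump it once per window partition processed:
val numPartitions = longMetric("numPartitions")
numPartitions += 1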

On Fri, May 26, 2017 at 3:24 PM Jacek Laskowski  wrote:

> Hi,
>
> Currently WindowExec gives no metrics in the web UI's Details for Query
> page.
>
> What do you think about adding the number of partitions and frames?
> That could certainly be super useful, but am unsure if that's the kind
> of metrics Spark SQL shows in the details.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-25 Thread Reynold Xin
Zoltan,

Thanks for raising this again, although I'm a bit confused since I've
communicated with you a few times on JIRA and on private emails to explain
that you have some misunderstanding of the timestamp type in Spark and some
of your statements are wrong (e.g. the except text file part). Not sure why
you didn't get any of those.


Here's another try:


1. I think you guys misunderstood the semantics of timestamp in Spark
before session local timezone change. IIUC, Spark has always assumed
timestamps to be with timezone, since it parses timestamps with timezone
and does all the datetime conversions with timezone in mind (it doesn't
ignore timezone if a timestamp string has timezone specified). The session
local timezone change further pushes Spark to that direction, but the
semantics has been with timezone before that change. Just run Spark on
machines with different timezone and you will know what I'm talking about.

2. CSV/Text is not different. The data type has always been "with
timezone". If you put a timezone in the timestamp string, it parses the
timezone.

3. We can't change semantics now, because it'd break all existing Spark
apps.

4. We can however introduce a new timestamp without timezone type, and have
a config flag to specify which one (with tz or without tz) is the default
behavior.
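
A small illustration of points 1 and 2, as a hypothetical spark-shell session (spark.sql.session.timeZone is the session-level setting introduced by SPARK-18350; the output shown is what I would expect, not a captured run):

spark.conf.set("spark.sql.session.timeZone", "UTC")
// The explicit +02:00 offset in the string is parsed, not ignored, and the
// result is rendered in the session time zone:
spark.sql("SELECT cast('2017-05-24 10:00:00+02:00' as timestamp) AS ts").show(false)
// Expected: 2017-05-24 08:00:00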



On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi  wrote:

> Hi,
>
> Sorry if you receive this mail twice, it seems that my first attempt did
> not make it to the list for some reason.
>
> I would like to start a discussion about SPARK-18350
>  before it gets
> released because it seems to be going in a different direction than what
> other SQL engines of the Hadoop stack do.
>
> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT TIME
> ZONE) to have timezone-agnostic semantics - basically a type that expresses
> readings from calendars and clocks and is unaffected by time zone. In the
> Hadoop stack, Impala has always worked like this and recently Presto also
> took steps  to become
> standards compliant. (Presto's design doc
> 
> also contains a great summary of the different semantics.) Hive has a
> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
> source of incompatibility that is already being addressed
> ). A TIMESTAMP in
> SparkSQL, however, has UTC-normalized local time semantics (except for
> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
> type.
>
> Given that timezone-agnostic TIMESTAMP semantics provide standards
> compliance and consistency with most SQL engines, I was wondering whether
> SparkSQL should also consider it in order to become ANSI SQL compliant and
> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
> adapt this semantics in the future, SPARK-18350
>  may turn out to be a
> source of problems. Please correct me if I'm wrong, but this change seems
> to explicitly assign TIMESTAMP WITH TIME ZONE semantics to the TIMESTAMP
> type. I think SPARK-18350 would be a great feature for a separate TIMESTAMP
> WITH TIME ZONE type, but the plain unqualified TIMESTAMP type would be
> better becoming timezone-agnostic instead of gaining further timezone-aware
> capabilities. (Of course becoming timezone-agnostic would be a behavior
> change, so it must be optional and configurable by the user, as in Presto.)
>
> I would like to hear your opinions about this concern and about TIMESTAMP
> semantics in general. Does the community agree that a standards-compliant
> and interoperable TIMESTAMP type is desired? Do you perceive SPARK-18350 as
> a potential problem in achieving this or do I misunderstand the effects of
> this change?
>
> Thanks,
>
> Zoltan
>
> ---
>
> List of links in case in-line links do not work:
>
>-
>
>SPARK-18350: https://issues.apache.org/jira/browse/SPARK-18350
>-
>
>Presto's change: https://github.com/prestodb/presto/issues/7122
>-
>
>Presto's design doc: https://docs.google.com/document/d/
>1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit
>
> 
>
>
>


Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-05-26 Thread Reynold Xin
That's just my point 4, isn't it?


On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:

> Reynold,
> my point is that Spark should aim to follow the SQL standard instead of
> rolling its own type system.
> If I understand correctly, the existing implementation is similar to
> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle..
> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH TIMEZONE
> data types which are missing from Spark.
> So, it is better (for me) if instead of extending the existing types,
> Spark would just implement the additional well-defined types properly.
> Just trying to copy-paste CREATE TABLE between SQL engines should not be
> an exercise of flags and incompatibilities.
>
> Regarding the current behaviour, if I remember correctly I had to force
> our spark O/S user into UTC so Spark wont change my timestamps.
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Zoltan,
>>
>> Thanks for raising this again, although I'm a bit confused since I've
>> communicated with you a few times on JIRA and on private emails to explain
>> that you have some misunderstanding of the timestamp type in Spark and some
>> of your statements are wrong (e.g. the except text file part). Not sure why
>> you didn't get any of those.
>>
>>
>> Here's another try:
>>
>>
>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>> before session local timezone change. IIUC, Spark has always assumed
>> timestamps to be with timezone, since it parses timestamps with timezone
>> and does all the datetime conversions with timezone in mind (it doesn't
>> ignore timezone if a timestamp string has timezone specified). The session
>> local timezone change further pushes Spark to that direction, but the
>> semantics has been with timezone before that change. Just run Spark on
>> machines with different timezone and you will know what I'm talking about.
>>
>> 2. CSV/Text is not different. The data type has always been "with
>> timezone". If you put a timezone in the timestamp string, it parses the
>> timezone.
>>
>> 3. We can't change semantics now, because it'd break all existing Spark
>> apps.
>>
>> 4. We can however introduce a new timestamp without timezone type, and
>> have a config flag to specify which one (with tz or without tz) is the
>> default behavior.
>>
>>
>>
>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>>
>>> Hi,
>>>
>>> Sorry if you receive this mail twice, it seems that my first attempt did
>>> not make it to the list for some reason.
>>>
>>> I would like to start a discussion about SPARK-18350
>>> <https://issues.apache.org/jira/browse/SPARK-18350> before it gets
>>> released because it seems to be going in a different direction than what
>>> other SQL engines of the Hadoop stack do.
>>>
>>> ANSI SQL defines the TIMESTAMP type (also known as TIMESTAMP WITHOUT
>>> TIME ZONE) to have timezone-agnostic semantics - basically a type that
>>> expresses readings from calendars and clocks and is unaffected by time
>>> zone. In the Hadoop stack, Impala has always worked like this and recently
>>> Presto also took steps <https://github.com/prestodb/presto/issues/7122>
>>> to become standards compliant. (Presto's design doc
>>> <https://docs.google.com/document/d/1UUDktZDx8fGwHZV4VyaEDQURorFbbg6ioeZ5KMHwoCk/edit>
>>> also contains a great summary of the different semantics.) Hive has a
>>> timezone-agnostic TIMESTAMP type as well (except for Parquet, a major
>>> source of incompatibility that is already being addressed
>>> <https://issues.apache.org/jira/browse/HIVE-12767>). A TIMESTAMP in
>>> SparkSQL, however, has UTC-normalized local time semantics (except for
>>> textfile), which is generally the semantics of the TIMESTAMP WITH TIME ZONE
>>> type.
>>>
>>> Given that timezone-agnostic TIMESTAMP semantics provide standards
>>> compliance and consistency with most SQL engines, I was wondering whether
>>> SparkSQL should also consider it in order to become ANSI SQL compliant and
>>> interoperable with other SQL engines of the Hadoop stack. Should SparkSQL
>>> adapt this semantics in the future, SPARK-18350
>>> <https://issues.apache.org/jira/browse/SPARK-18350> may turn out to be
>>

Re: [PYTHON] PySpark typing hints

2017-05-23 Thread Reynold Xin
Seems useful to do. Is there a way to do this so it doesn't break Python
2.x?


On Sun, May 14, 2017 at 11:44 PM, Maciej Szymkiewicz  wrote:

> Hi everyone,
>
> For the last few months I've been working on static type annotations for
> PySpark. For those of you, who are not familiar with the idea, typing hints
> have been introduced by PEP 484 (https://www.python.org/dev/peps/pep-0484/)
> and further extended with PEP 526 (https://www.python.org/dev/pe
> ps/pep-0526/) with the main goal of providing information required for
> static analysis. Right now there a few tools which support typing hints,
> including Mypy (https://github.com/python/mypy) and PyCharm (
> https://www.jetbrains.com/help/pycharm/2017.1/type-hinting-in-pycharm.html).
> Type hints can be added using function annotations (
> https://www.python.org/dev/peps/pep-3107/, Python 3 only), docstrings, or
> source independent stub files (https://www.python.org/dev/pe
> ps/pep-0484/#stub-files). Typing is optional, gradual and has no runtime
> impact.
>
> At this moment I've annotated the majority of the API, including most of
> pyspark.sql and pyspark.ml. The project is still rough around the edges,
> and may result in both false positives and false negatives, but I think it
> has become mature enough to be useful in practice.
> The current version is compatible only with Python 3, but it is possible,
> with some limitations, to backport it to Python 2 (though it is not on my
> todo list).
>
> There is a number of possible benefits for PySpark users and developers:
>
>- Static analysis can detect a number of common mistakes to prevent
>    runtime failures. Generic self is still fairly limited, so it is more
>    useful with DataFrames, SS and ML than with RDDs or DStreams.
>- Annotations can be used for documenting complex signatures (
>https://git.io/v95JN) including dependencies on arguments and value (
>https://git.io/v95JA).
>- Detecting possible bugs in Spark (SPARK-20631) .
>- Showing API inconsistencies.
>
> Roadmap
>
>- Update the project to reflect Spark 2.2.
>- Refine existing annotations.
>
> If there will be enough interest I am happy to contribute this back to
> Spark or submit to Typeshed (https://github.com/python/typeshed -  this
> would require a formal ASF approval, and since Typeshed doesn't provide
> versioning, is probably not the best option in our case).
>
> Further inforamtion:
>
>- https://github.com/zero323/pyspark-stubs - GitHub repository
>
>
>- https://speakerdeck.com/marcobonzanini/static-type-analysis-
>for-robust-data-products-at-pydata-london-2017
>
> 
>- interesting presentation by Marco Bonzanini
>
> --
> Best,
> Maciej
>
>


Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-06-01 Thread Reynold Xin
Again (I've probably said this more than 10 times already in different
threads), SPARK-18350 has no impact on whether the timestamp type is with
timezone or without timezone. It simply allows a session specific timezone
setting rather than having Spark always rely on the machine timezone.

On Wed, May 31, 2017 at 11:58 AM, Kostas Sakellis 
wrote:

> Hey Michael,
>
> There is a discussion on TIMESTAMP semantics going on the thread "SQL
> TIMESTAMP semantics vs. SPARK-18350" which might impact Spark 2.2. Should
> we make a decision there before voting on the next RC for Spark 2.2?
>
> Thanks,
> Kostas
>
> On Tue, May 30, 2017 at 12:09 PM, Michael Armbrust  > wrote:
>
>> Last call, anything else important in-flight for 2.2?
>>
>> On Thu, May 25, 2017 at 10:56 AM, Michael Allman 
>> wrote:
>>
>>> PR is here: https://github.com/apache/spark/pull/18112
>>>
>>>
>>> On May 25, 2017, at 10:28 AM, Michael Allman 
>>> wrote:
>>>
>>> Michael,
>>>
>>> If you haven't started cutting the new RC, I'm working on a
>>> documentation PR right now I'm hoping we can get into Spark 2.2 as a
>>> migration note, even if it's just a mention: https://issues.apache.org/jira/browse/SPARK-20888.
>>>
>>> Michael
>>>
>>>
>>> On May 22, 2017, at 11:39 AM, Michael Armbrust 
>>> wrote:
>>>
>>> I'm waiting for SPARK-20814 at Marcelo's request and I'd also like to
>>> include SPARK-20844. I think we should
>>> be able to cut another RC midweek.
>>>
>>> On Fri, May 19, 2017 at 11:53 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 All the outstanding ML QA doc and user guide items are done for 2.2 so
 from that side we should be good to cut another RC :)


 On Thu, 18 May 2017 at 00:18 Russell Spitzer 
 wrote:

> Seeing an issue with the DataScanExec and some of our integration
> tests for the SCC. Running dataframe read and writes from the shell seems
> fine but the Redaction code seems to get a "None" when doing
> SparkSession.getActiveSession.get in our integration tests. I'm not
> sure why but i'll dig into this later if I get a chance.
>
> Example Failed Test
> https://github.com/datastax/spark-cassandra-connector/blob/v
> 2.0.1/spark-cassandra-connector/src/it/scala/com/datastax/sp
> ark/connector/sql/CassandraSQLSpec.scala#L311
>
> ```[info]   org.apache.spark.SparkException: Job aborted due to stage
> failure: Task serialization failed: java.util.NoSuchElementException:
> None.get
> [info] java.util.NoSuchElementException: None.get
> [info] at scala.None$.get(Option.scala:347)
> [info] at scala.None$.get(Option.scala:345)
> [info] at org.apache.spark.sql.execution.DataSourceScanExec$class.org$
> apache$spark$sql$execution$DataSourceScanExec$$redact(DataSo
> urceScanExec.scala:70)
> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4
> .apply(DataSourceScanExec.scala:54)
> [info] at org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4
> .apply(DataSourceScanExec.scala:52)
> ```
>
> Again this only seems to repo in our IT suite so i'm not sure if this
> is a real issue.
>
>
> On Tue, May 16, 2017 at 1:40 PM Joseph Bradley 
> wrote:
>
>> All of the ML/Graph/SparkR QA blocker JIRAs have been resolved.
>> Thanks everyone who helped out on those!
>>
>> We still have open ML/Graph/SparkR JIRAs targeted at 2.2, but they
>> are essentially all for documentation.
>>
>> Joseph
>>
>> On Thu, May 11, 2017 at 3:08 PM, Marcelo Vanzin 
>> wrote:
>>
>>> Since you'll be creating a new RC, I'd wait until SPARK-20666 is
>>> fixed, since the change that caused it is in branch-2.2. Probably a
>>> good idea to raise it to blocker at this point.
>>>
>>> On Thu, May 11, 2017 at 2:59 PM, Michael Armbrust
>>>  wrote:
>>> > I'm going to -1 given the outstanding issues and lack of +1s.
>>> I'll create
>>> > another RC once ML has had time to take care of the more critical
>>> problems.
>>> > In the meantime please keep testing this release!
>>> >
>>> > On Tue, May 9, 2017 at 2:00 AM, Kazuaki Ishizaki <
>>> ishiz...@jp.ibm.com>
>>> > wrote:
>>> >>
>>> >> +1 (non-binding)
>>> >>
>>> >> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the
>>> tests for
>>> >> core have passed.
>>> >>
>>> >> $ java -version
>>> >> openjdk version "1.8.0_111"
>>> >> OpenJDK Runtime Environment (build
>>> >> 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
>>> >> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed 

Re: SQL TIMESTAMP semantics vs. SPARK-18350

2017-06-01 Thread Reynold Xin
>> I'm far less knowledgeable about this, so I mostly just have questions:
>>
>> * does the standard dictate what the parsing behavior should be for
>> timestamp (without time zone) when a time zone is present?
>>
>> * if it does and spark violates this standard is it worth trying to
>> retain the *other* semantics of timestamp without time zone, even if we
>> violate the parsing part?
>>
>> I did look at what postgres does for comparison:
>>
>> https://gist.github.com/squito/cb81a1bb07e8f67e9d27eaef44cc522c
>>
>> spark's timestamp certainly does not match postgres's timestamp for
>> parsing, it seems closer to postgres's "timestamp with timezone" -- though
>> I dunno if that is standard behavior at all.
>>
>> thanks,
>> Imran
>>
>> On Fri, May 26, 2017 at 1:27 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> That's just my point 4, isn't it?
>>>
>>>
>>> On Fri, May 26, 2017 at 1:07 AM, Ofir Manor <ofir.ma...@equalum.io>
>>> wrote:
>>>
>>>> Reynold,
>>>> my point is that Spark should aim to follow the SQL standard instead of
>>>> rolling its own type system.
>>>> If I understand correctly, the existing implementation is similar to
>>>> TIMESTAMP WITH LOCAL TIMEZONE data type in Oracle..
>>>> In addition, there are the standard TIMESTAMP and TIMESTAMP WITH
>>>> TIMEZONE data types which are missing from Spark.
>>>> So, it is better (for me) if instead of extending the existing types,
>>>> Spark would just implement the additional well-defined types properly.
>>>> Just trying to copy-paste CREATE TABLE between SQL engines should not
>>>> be an exercise of flags and incompatibilities.
>>>>
>>>> Regarding the current behaviour, if I remember correctly I had to force
>>>> our spark O/S user into UTC so Spark wont change my timestamps.
>>>>
>>>> Ofir Manor
>>>>
>>>> Co-Founder & CTO | Equalum
>>>>
>>>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>>>
>>>> On Thu, May 25, 2017 at 1:33 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Zoltan,
>>>>>
>>>>> Thanks for raising this again, although I'm a bit confused since I've
>>>>> communicated with you a few times on JIRA and on private emails to explain
>>>>> that you have some misunderstanding of the timestamp type in Spark and 
>>>>> some
>>>>> of your statements are wrong (e.g. the except text file part). Not sure 
>>>>> why
>>>>> you didn't get any of those.
>>>>>
>>>>>
>>>>> Here's another try:
>>>>>
>>>>>
>>>>> 1. I think you guys misunderstood the semantics of timestamp in Spark
>>>>> before session local timezone change. IIUC, Spark has always assumed
>>>>> timestamps to be with timezone, since it parses timestamps with timezone
>>>>> and does all the datetime conversions with timezone in mind (it doesn't
>>>>> ignore timezone if a timestamp string has timezone specified). The session
>>>>> local timezone change further pushes Spark to that direction, but the
>>>>> semantics has been with timezone before that change. Just run Spark on
>>>>> machines with different timezone and you will know what I'm talking about.
>>>>>
>>>>> 2. CSV/Text is not different. The data type has always been "with
>>>>> timezone". If you put a timezone in the timestamp string, it parses the
>>>>> timezone.
>>>>>
>>>>> 3. We can't change semantics now, because it'd break all existing
>>>>> Spark apps.
>>>>>
>>>>> 4. We can however introduce a new timestamp without timezone type, and
>>>>> have a config flag to specify which one (with tz or without tz) is the
>>>>> default behavior.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 24, 2017 at 5:46 PM, Zoltan Ivanfi <z...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry if you receive this mail twice, it seems that my first attempt
>>>>>> did not make it to the list for some reason.
>>>>>>
>>>>>> I would like to start a discu

Re: [Spark SQL] Nanoseconds in Timestamps are set as Microseconds

2017-06-02 Thread Reynold Xin
Seems like a bug we should fix? I agree some form of truncation makes more
sense.


On Thu, Jun 1, 2017 at 1:17 AM, Anton Okolnychyi  wrote:

> Hi all,
>
> I would like to ask what the community thinks regarding the way how Spark
> handles nanoseconds in the Timestamp type.
>
> As far as I see in the code, Spark assumes microseconds precision.
> Therefore, I expect to have a truncated to microseconds timestamp or an
> exception if I specify a timestamp with nanoseconds. However, the current
> implementation just silently sets nanoseconds as microseconds in [1], which
> results in a wrong timestamp. Consider the example below:
>
> spark.sql("SELECT cast('2015-01-02 00:00:00.1' as TIMESTAMP)").show(false)
> +----------------------------------------+
> |CAST(2015-01-02 00:00:00.1 AS TIMESTAMP)|
> +----------------------------------------+
> |2015-01-02 00:00:00.01                  |
> +----------------------------------------+
>
> This issue was already raised in SPARK-17914 but I do not see any decision
> there.
>
> [1] - org.apache.spark.sql.catalyst.util.DateTimeUtils, toJavaTimestamp,
> line 204
>
> Best regards,
> Anton
>


Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Reynold Xin
A join?
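
A minimal spark-shell example where the dependencies sequence has more than one entry (a union is shown here; a join's underlying cogroup stage likewise carries one dependency per parent RDD):

val a = sc.parallelize(1 to 10)
val b = sc.parallelize(11 to 20)
val u = a.union(b)             // UnionRDD over two parents
// One RangeDependency per parent, so this prints 2:
println(u.dependencies.size)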

On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> The RDD code keeps a member as below:
> dependencies_ : Seq[Dependency[_]]
>
> It is a Seq, which means it can hold more than one dependency.
>
> I have a question about this.
> Is it possible for its size to be greater than one?
> If yes, how can I produce that? Could you show me some code, please?
>
> thanks
> Robin Shao
>


Re: Custom Partitioning in Catalyst

2017-06-16 Thread Reynold Xin
Perhaps we should extend the data source API to support that.


On Fri, Jun 16, 2017 at 11:37 AM, Russell Spitzer  wrote:

> I've been trying to work with making Catalyst Cassandra partitioning
> aware. There seem to be two major blocks on this.
>
> The first is that DataSourceScanExec is unable to learn what the
> underlying partitioning should be from the BaseRelation it comes from. I'm
> currently able to get around this by using the DataSourceStrategy plan and
> then transforming the resultant DataSourceScanExec.
>
> The second is that the Partitioning trait is sealed. I want to define a
> new partitioning which is Clustered but is not hashed based on certain
> columns. It would look almost identical to the HashPartitioning class
> except the
> expression which returns a valid PartitionID given expressions would be
> different.
>
> Anyone have any ideas on how to get around the second issue? Would it be
> worth while to make changes to allow BaseRelations to advertise a
> particular Partitioner?
>


Re: Custom Partitioning in Catalyst

2017-06-16 Thread Reynold Xin
Seems like a great idea to do?


On Fri, Jun 16, 2017 at 12:03 PM, Russell Spitzer <russell.spit...@gmail.com
> wrote:

> I considered adding this to DataSource APIV2 ticket but I didn't want to
> be first :P Do you think there will be any issues with opening up the
> partitioning as well?
>
> On Fri, Jun 16, 2017 at 11:58 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Perhaps we should extend the data source API to support that.
>>
>>
>> On Fri, Jun 16, 2017 at 11:37 AM, Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I've been trying to work with making Catalyst Cassandra partitioning
>>> aware. There seem to be two major blocks on this.
>>>
>>> The first is that DataSourceScanExec is unable to learn what the
>>> underlying partitioning should be from the BaseRelation it comes from. I'm
>>> currently able to get around this by using the DataSourceStrategy plan and
>>> then transforming the resultant DataSourceScanExec.
>>>
>>> The second is that the Partitioning trait is sealed. I want to define a
>>> new partitioning which is Clustered but is not hashed based on certain
>>> columns. It would look almost identical to the HashPartitioning class
>>> except the
>>> expression which returns a valid PartitionID given expressions would be
>>> different.
>>>
>>> Anyone have any ideas on how to get around the second issue? Would it be
>>> worth while to make changes to allow BaseRelations to advertise a
>>> particular Partitioner?
>>>
>>
>>


Re: Hoping contribute code-Spark 2.1.1 Documentation

2017-05-02 Thread Reynold Xin
Liucht,

Thanks for the interest. You are more than welcomed to contribute a pull
request to fix the issue, at https://github.com/apache/spark



On Tue, May 2, 2017 at 7:44 PM, cht liu  wrote:

> Hello Spark project leads,
> This is my first time contributing to the Spark code and I do not know much
> about the process. I filed a bug in JIRA: SPARK-20570. On the
> spark.apache.org home page, when I click the Latest Release (Spark
> 2.1.1) link under the Documentation menu, the page that opens still
> displays a 2.1.0 label in the upper left corner of the page.
> I would like to contribute a fix for this and hope I can be given the chance.
> Thanks very much!
>
> liucht
>


Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Reynold Xin
Thanks for the email. The process is to create a JIRA ticket and then post
a design doc for discussion. You will of course need to update your code to
work with the latest master branch, but you should wait oj that until the
community has a chance to comment on the design.

Cheers.


On Fri, May 5, 2017 at 8:01 AM Nipun Arora  wrote:

> Hi All,
>
> To support our Spark Streaming based anomaly detection tool, we have made
> a patch in Spark 1.6.2 to dynamically update broadcast variables.
>
> I'll first explain our use-case, which I believe should be common to
> several people using Spark Streaming applications. Broadcast variables are
> often used to store values "machine learning models", which can then be
> used on streaming data to "test" and get the desired results (for our case
> anomalies). Unfortunately, in the current spark, broadcast variables are
> final and can only be initialized once before the initialization of the
> streaming context. Hence, if a new model is learned the streaming system
> cannot be updated without shutting down the application, broadcasting
> again, and restarting the application. Our goal was to re-broadcast
> variables without requiring a downtime of the streaming service.
>
> The key to this implementation is a live re-broadcastVariable() interface,
> which can be triggered in between micro-batch executions, without any
> re-boot required for the streaming application. At a high level the task is
> done by re-fetching broadcast variable information from the spark driver,
> and then re-distribute it to the workers. The micro-batch execution is
> blocked while the update is made, by taking a lock on the execution. We
> have already tested this in our prototype deployment of our anomaly
> detection service and can successfully re-broadcast the broadcast variables
> with no downtime.
>
> We would like to integrate these changes in spark, can anyone please let
> me know the process of submitting patches/ new features to spark. Also. I
> understand that the current version of Spark is 2.1. However, our changes
> have been done and tested on Spark 1.6.2, will this be a problem?
>
> Thanks
> Nipun
>


Re: PR permission to kick Jenkins?

2017-05-05 Thread Reynold Xin
I suspect the list is getting too big for Jenkins to function well. It
stopped working for me a while ago.


On Fri, May 5, 2017 at 12:06 PM, Tom Graves 
wrote:

> Does  anyone know how to configure Jenkins to allow committers to tell it
> to test prs?  I used to have this access but lately it is either not
> working or only intermittently working.
>
> The commands like "ok to test", "test this please", etc..
>
> Thanks,
> Tom
>


Re: Total memory tracking: request for comments

2017-09-20 Thread Reynold Xin
Thanks. This is an important direction to explore and my apologies for the
late reply.

One thing that is really hard about this is that with different layers of
abstractions, we often use other libraries that might allocate large amount
of memory (e.g. snappy library, Parquet itself), which makes it very
difficult to track. That's where I see how most of the OOMs or crashes
happen. How do you propose solving those?



On Tue, Jun 20, 2017 at 4:15 PM, Jose Soltren  wrote:

> https://issues.apache.org/jira/browse/SPARK-21157
>
> Hi - often times, Spark applications are killed for overrunning available
> memory by YARN, Mesos, or the OS. In SPARK-21157, I propose a design for
> grabbing and reporting "total memory" usage for Spark executors - that is,
> memory usage as visible from the OS, including on-heap and off-heap memory
> used by Spark and third party libraries. This builds on many ideas from
> SPARK-9103.
>
> I'd really welcome some review and some feedback of this design proposal.
> I think this could be a helpful feature for Spark users who are trying to
> triage memory usage issues. In the future I'd like to think about reporting
> memory usage from third party libraries like Netty, as was originally
> proposed in SPARK-9103.
>
> Cheers,
> --José
>


Re: [discuss] Data Source V2 write path

2017-09-20 Thread Reynold Xin
On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan  wrote:

> Hi all,
>
> I want to have some discussion about Data Source V2 write path before
> starting a vote.
>
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1. Exposing upper-level API like DataFrame to Data Source API is not good
> for maintenance.
> 2. Data sources may need to preprocess the input data before writing, like
> cluster/sort the input by some columns. It's better to do the preprocessing
> in Spark instead of in the data source.
> 3. Data sources need to take care of transaction themselves, which is
> hard. And different data sources may come up with a very similar approach
> for the transaction, which leads to many duplicated codes.
>
>
> To solve these pain points, I'm proposing a data source writing framework
> which is very similar to the reading framework, i.e., WriteSupport ->
> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my
> prototype to see what it looks like: https://github.com/apache/spark/pull/19269
>
> There are some other details that need further discussion:
> 1. *partitioning/bucketing*
> Currently only the built-in file-based data sources support them, but
> there is nothing stopping us from exposing them to all data sources. One
> question is, shall we make them as mix-in interfaces for data source v2
> reader/writer, or just encode them into data source options(a
> string-to-string map)? Ideally it's more like options, Spark just transfers
> these user-given informations to data sources, and doesn't do anything for
> it.
>


I'd just pass them as options, until there are clear (and strong) use cases
to do them otherwise.


+1 on the rest.



>
> 2. *input data requirement*
> Data sources should be able to ask Spark to preprocess the input data, and
> this can be a mix-in interface for DataSourceV2Writer. I think we need to
> add clustering request and sorting within partitions request, any more?
>
> 3. *transaction*
> I think we can just follow `FileCommitProtocol`, which is the internal
> framework Spark uses to guarantee transaction for built-in file-based data
> sources. Generally speaking, we need task level and job level commit/abort.
> Again you can see more details in my prototype about it:
> https://github.com/apache/spark/pull/19269
>
> 4. *data source table*
> This is the trickiest one. In Spark you can create a table which points to
> a data source, so you can read/write this data source easily by referencing
> the table name. Ideally data source table is just a pointer which points to
> a data source with a list of predefined options, to save users from typing
> these options again and again for each query.
> If that's all, then everything is good, we don't need to add more
> interfaces to Data Source V2. However, data source tables provide special
> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which requires data
> sources to have some extra ability.
> Currently these special operators only work for built-in file-based data
> sources, and I don't think we will extend it in the near future, I propose
> to mark them as out of the scope.
>
>
> Any comments are welcome!
> Thanks,
> Wenchen
>
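
For readers following along, a rough sketch of the writer-side chain described above. The trait names follow the email (WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter); the method signatures are illustrative guesses, not the actual prototype in the linked PR:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

trait WriteSupport {
  def createWriter(schema: StructType, options: Map[String, String]): DataSourceV2Writer
}

trait DataSourceV2Writer {
  def createWriteTask(): WriteTask                      // serialized and sent to executors
  def commit(messages: Seq[WriterCommitMessage]): Unit  // job-level commit
  def abort(messages: Seq[WriterCommitMessage]): Unit   // job-level abort
}

trait WriteTask extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
}

trait DataWriter {
  def write(row: Row): Unit
  def commit(): WriterCommitMessage                     // task-level commit
  def abort(): Unit                                     // task-level abort
}

trait WriterCommitMessage extends Serializable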


Re: [discuss] Data Source V2 write path

2017-09-21 Thread Reynold Xin
Ah yes I agree. I was just saying it should be options (rather than
specific constructs). Having them at creation time makes a lot of sense.
Although one tricky thing is what if they need to change, but we can
probably just special case that.

On Thu, Sep 21, 2017 at 6:28 PM Ryan Blue <rb...@netflix.com> wrote:

> I’d just pass them [partitioning/bucketing] as options, until there are
> clear (and strong) use cases to do them otherwise.
>
> I don’t think it makes sense to pass partitioning and bucketing
> information *into* this API. The writer should already know the table
> structure and should pass relevant information back out to Spark so it can
> sort and group data for storage.
>
> I think the idea of passing the table structure into the writer comes from
> the current implementation, where the table may not exist before a data
> frame is written. But that isn’t something that should be carried forward.
> I think the writer should be responsible for writing into an
> already-configured table. That’s the normal case we should design for.
> Creating a table at the same time (CTAS) is a convenience, but should be
> implemented by creating an empty table and then running the same writer
> that would have been used for an insert into an existing table.
>
> Otherwise, there’s confusion about how to handle the options. What should
> the writer do when partitioning passed in doesn’t match the table’s
> partitioning? We already have this situation in the DataFrameWriter API,
> where calling partitionBy and then insertInto throws an exception. I’d
> like to keep that case out of this API by setting the expectation that
> tables this writes to already exist.
>
> rb
> ​
>
> On Wed, Sep 20, 2017 at 9:52 AM, Reynold Xin <r...@databricks.com> wrote:
>
>>
>>
>> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I want to have some discussion about Data Source V2 write path before
>>> starting a vote.
>>>
>>> The Data Source V1 write path asks implementations to write a DataFrame
>>> directly, which is painful:
>>> 1. Exposing upper-level API like DataFrame to Data Source API is not
>>> good for maintenance.
>>> 2. Data sources may need to preprocess the input data before writing,
>>> like cluster/sort the input by some columns. It's better to do the
>>> preprocessing in Spark instead of in the data source.
>>> 3. Data sources need to take care of transaction themselves, which is
>>> hard. And different data sources may come up with a very similar approach
>>> for the transaction, which leads to many duplicated codes.
>>>
>>>
>>> To solve these pain points, I'm proposing a data source writing
>>> framework which is very similar to the reading framework, i.e.,
>>> WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take
>>> a look at my prototype to see what it looks like:
>>> https://github.com/apache/spark/pull/19269
>>>
>>> There are some other details that need further discussion:
>>> 1. *partitioning/bucketing*
>>> Currently only the built-in file-based data sources support them, but
>>> there is nothing stopping us from exposing them to all data sources. One
>>> question is, shall we make them as mix-in interfaces for data source v2
>>> reader/writer, or just encode them into data source options(a
>>> string-to-string map)? Ideally it's more like options, Spark just transfers
>>> these user-given informations to data sources, and doesn't do anything for
>>> it.
>>>
>>
>>
>> I'd just pass them as options, until there are clear (and strong) use
>> cases to do them otherwise.
>>
>>
>> +1 on the rest.
>>
>>
>>
>>>
>>> 2. *input data requirement*
>>> Data sources should be able to ask Spark to preprocess the input data,
>>> and this can be a mix-in interface for DataSourceV2Writer. I think we need
>>> to add clustering request and sorting within partitions request, any more?
>>>
>>> 3. *transaction*
>>> I think we can just follow `FileCommitProtocol`, which is the internal
>>> framework Spark uses to guarantee transaction for built-in file-based data
>>> sources. Generally speaking, we need task level and job level commit/abort.
>>> Again you can see more details in my prototype about it:
>>> https://github.com/apache/spark/pull/19269
>>>
>>> 4. *data source table*
>>> This is the trickiest one. In Spark you can create a table which po

Re: Signing releases with pwendell or release manager's key?

2017-09-18 Thread Reynold Xin
Does anybody know whether this is a hard blocker? If it is not, we should
probably push 2.1.2 forward quickly and do the infrastructure improvement
in parallel.

On Mon, Sep 18, 2017 at 7:49 PM, Holden Karau  wrote:

> I'm more than willing to help migrate the scripts as part of either this
> release or the next.
>
> It sounds like there is a consensus developing around changing the process
> -- should we hold off on the 2.1.2 release or roll this into the next one?
>
> On Mon, Sep 18, 2017 at 7:37 PM, Marcelo Vanzin 
> wrote:
>
>> +1 to this. There should be a script in the Spark repo that has all
>> the logic needed for a release. That script should take the RM's key
>> as a parameter.
>>
>> if there's a desire to keep the current Jenkins job to create the
>> release, it should be based on that script. But from what I'm seeing
>> there are currently too many unknowns in the release process.
>>
>> On Mon, Sep 18, 2017 at 4:55 PM, Ryan Blue 
>> wrote:
>> > I don't understand why it is necessary to share a release key. If this
>> is
>> > something that can be automated in a Jenkins job, then can it be a
>> script
>> > with a reasonable set of build requirements for Mac and Ubuntu? That's
>> the
>> > approach I've seen the most in other projects.
>> >
>> > I'm also not just concerned about release managers. Having a key stored
>> > persistently on outside infrastructure adds the most risk, as Luciano
>> noted
>> > as well. We should also start publishing checksums in the Spark VOTE
>> thread,
>> > which are currently missing. The risk I'm concerned about is that if
>> the key
>> > were compromised, it would be possible to replace binaries with
>> perfectly
>> > valid ones, at least on some mirrors. If the Apache copy were replaced,
>> then
>> > we wouldn't even be able to catch that it had happened. Given the high
>> > profile of Spark and the number of companies that run it, I think we
>> need to
>> > take extra care to make sure that can't happen, even if it is an
>> annoyance
>> > for the release managers.
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-05 Thread Reynold Xin
+1


On Mon, Oct 2, 2017 at 11:24 PM, Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version 2
> .1.2. The vote is open until Saturday October 7th at 9:00 PST and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc4
> (2abaea9e40fce81cd4626498e0f5c28a70917499)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>
> Release artifacts are signed with a key from:
> https://people.apache.org/~holden/holdens_keys.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1252
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks; in Java/Scala you
> can add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
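
For the Java/Scala check, a resolver entry along these lines is usually all that is
needed (a build.sbt sketch; the artifact coordinates here are an assumption, adjust
them to your project, and staged artifacts are normally published under the final
version number):

  resolvers += "Apache Spark 2.1.2 RC4 staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1252"

  // Depend on the staged artifacts as if they were a normal release.
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.2" % "provided"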
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1. That being said if
> there is something which is a regression from 2.1.1 that has not been
> correctly targeted please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> 
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> 
> ?
>
> At this time there are no open unresolved issues.
>
> *Is there anything different about this release?*
>
> This is the first release in a while not built on the AMPLAB Jenkins. This
> is good because it means future releases can more easily be built and
> signed securely (and I've been updating the documentation in
> https://github.com/apache/spark-website/pull/66 as I progress), however
> the chances of a mistake are higher with any change like this. If there
> something you normally take for granted as correct when checking a release,
> please double check this time :)
>
> *Should I be committing code to branch-2.1?*
>
> Thanks for asking! Please treat this stage in the RC process as "code
> freeze" so bug fixes only. If you're uncertain if something should be back
> ported please reach out. If you do commit to branch-2.1 please tag your
> JIRA issue fix version for 2.1.3, and if we cut another RC I'll move the 2.1.3
> fixed issues into 2.1.2 as appropriate.
>
> *What happened to RC3?*
>
> Some R+zinc interactions kept it from getting out the door.
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: SparkR is now available on CRAN

2017-10-12 Thread Reynold Xin
This is huge!


On Thu, Oct 12, 2017 at 11:21 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Hi all
>
> I'm happy to announce that the most recent release of Spark, 2.1.2 is now
> available for download as an R package from CRAN at
> https://cran.r-project.org/web/packages/SparkR/ . This makes it easy to
> get started with SparkR for new R users and the package includes code to
> download the corresponding Spark binaries. https://issues.
> apache.org/jira/browse/SPARK-15799 has more details on this.
>
> Many thanks to everyone who helped put this together -- especially Felix
> Cheung for making a number of fixes to meet the CRAN requirements and
> Holden Karau for the 2.1.2 release.
>
> Thanks
> Shivaram
>


Re: 2.1.2 maintenance release?

2017-09-08 Thread Reynold Xin
+1 as well. We should make a few maintenance releases.

On Fri, Sep 8, 2017 at 6:46 PM Felix Cheung 
wrote:

> +1 on both 2.1.2 and 2.2.1
>
> And would try to help and/or wrangle the release if needed.
>
> (Note: trying to backport a few changes to branch-2.1 right now)
>
> --
> *From:* Sean Owen 
> *Sent:* Friday, September 8, 2017 12:05:28 AM
> *To:* Holden Karau; dev
> *Subject:* Re: 2.1.2 maintenance release?
>
> Let's look at the standard ASF guidance, which actually surprised me when
> I first read it:
>
> https://www.apache.org/foundation/voting.html
>
> VOTES ON PACKAGE RELEASES
> Votes on whether a package is ready to be released use majority approval
> -- i.e. at least three PMC members must vote affirmatively for release, and
> there must be more positive than negative votes. Releases may not be
> vetoed. Generally the community will cancel the release vote if anyone
> identifies serious problems, but in most cases the ultimate decision lies
> with the individual serving as release manager. The specifics of the
> process may vary from project to project, but the 'minimum quorum of three
> +1 votes' rule is universal.
>
>
> PMC votes on it, but no vetoes allowed, and the release manager makes the
> final call. Not your usual vote! It doesn't say the release manager has to be
> part of the PMC, though it's the role with the most decision power. In practice
> I can't imagine it's a problem, but we could also just have someone on the
> PMC technically be the release manager even as someone else is really
> operating the release.
>
> The goal is, really, to be able to put out maintenance releases with
> important fixes. Secondly, to ramp up one or more additional people to
> perform the release steps. Maintenance releases ought to be the least
> controversial releases to decide.
>
> Thoughts on kicking off a release for 2.1.2 to see how it goes?
>
> Although someone can just start following the steps, I think it will
> certainly require some help from Michael, who's run the last release, to
> clarify parts of the process or possibly provide an essential credential to
> upload artifacts.
>
>
> On Thu, Sep 7, 2017 at 11:59 PM Holden Karau  wrote:
>
>> I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after
>> that) if people are ok with a committer / me running the release process
>> rather than a full PMC member.
>>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
>> most of the discussion I’ve seen is on the read
>> side. I think it would be better to separate the read and write APIs so we
>> can focus on them individually.
>>
>> An example of why we should focus on the write path separately is that
>> the proposal says this:
>>
>> Ideally partitioning/bucketing concept should not be exposed in the Data
>> Source API V2, because they are just techniques for data skipping and
>> pre-partitioning. However, these 2 concepts are already widely used in
>> Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION.
>> To be consistent, we need to add partitioning/bucketing to Data Source V2 .
>> . .
>>
>> Essentially, some of the APIs mix DDL and DML operations. I’d like to
>> consider ways to fix that problem instead of carrying the problem forward
>> to Data Source V2. We can solve this by adding a high-level API for DDL and
>> a better write/insert API that works well with it. Clearly, that discussion
>> is independent of the read path, which is why I think separating the two
>> proposals would be a win.
>>
>> rb
>> ​
>>
>> On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> That might be good to do, but seems like orthogonal to this effort
>>> itself. It would be a completely different interface.
>>>
>>> On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> OK I agree with it, how about we add a new interface to push down the
>>>> query plan, based on the current framework? We can mark the
>>>> query-plan-push-down interface as unstable, to save the effort of designing
>>>> a stable representation of query plan and maintaining forward 
>>>> compatibility.
>>>>
>>>> On Wed, Aug 30, 2017 at 10:53 AM, James Baker <j.ba...@outlook.com>
>>>> wrote:
>>>>
>>>>> I'll just focus on the one-by-one thing for now - it's the thing that
>>>>> blocks me the most.
>>>>>
>>>>> I think the place where we're most confused here is on the cost of
>>>>> determining whether I can push down a filter. For me, in order to work out
>>>>> whether I can push down a filter or satisfy a sort, I might have to read
>>>>> plenty of data. That said, it's worth me doing this because I can use this
>>>>> information to avoid reading >>that much data.
>>>>>
>>>>> If you give me all the orderings, I will have to read that data many
>>>>> times (we stream it to avoid keeping it in memory).
>>>>>
>>>>> There's also a thing where our typical use cases have many filters
>>>>> (20+ is common). So, it's likely not going to work to pass us all the
>>>>> combinations. That said, if I can tell you a cost, I know what optimal
>>>>> looks like, why can't I just pick that myself?
>>>>>
>>>>> The current design is friendly to simple datasources, but does not
>>>>> have the potential to support this.
>>>>>
>>>>> So the main problem we have with datasources v1 is that it's
>>>>> essentially impossible to leverage a bunch of Spark features - I don't get
>>>>> to use bucketing or row batches or all the nice things that I really want
>>>>> to use to get decent performance. Provided I can leverage these in a
>>>>> moderately supported way which won't break in any given commit, I'll be
>>>>> pretty happy with anything that lets me opt out of the restrictions.
>>>>>
>>>>> My suggestion here is that if you make a mode which works well for
>>>>> complicated use cases, you end up being able to write simple mode in terms
>>>>> of it very easily. So we could actually provide two APIs, one that lets
>>>>> people who have more interesting datasources leverage the cool Spark
>>>>> features, and one that lets people who just want to implement basic
>>>>> features do that - I'd try to include some kind of layering here. I could
>>>>> probably sketch out something here if that'd be useful?
>>>>>
>>>>> James
>>>>>
>>>>> On Tue, 29 Aug 2017 at 18:59 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Hi James,
>>>>>>
>>>>>> Thanks for your feedback! I think your concerns are all valid, but we
>>>>>> need to make a tradeoff

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
So we seem to be getting into a cycle of discussing more about the details
of APIs than the high level proposal. The details of APIs are important to
debate, but those belong more in code reviews.

One other important thing is that we should avoid API design by committee.
While it is extremely useful to get feedback, understand the use cases, we
cannot do API design by incorporating verbatim the union of everybody's
feedback. API design is largely a tradeoff game. The most expressive API
would also be harder to use, or sacrifice backward/forward compatibility.
It is as important to decide what to exclude as what to include.

Unlike the v1 API, the way Wenchen's high level V2 framework is proposed
makes it very easy to add new features (e.g. clustering properties) in the
future without breaking any APIs. I'd rather us shipping something useful
that might not be the most comprehensive set, than debating about every
single feature we should add and then creating something super complicated
that has unclear value.



On Wed, Aug 30, 2017 at 6:37 PM, Ryan Blue <rb...@netflix.com> wrote:

> -1 (non-binding)
>
> Sometimes it takes a VOTE thread to get people to actually read and
> comment, so thanks for starting this one… but there’s still discussion
> happening on the prototype API, which it hasn’t been updated. I’d like to
> see the proposal shaped by the ongoing discussion so that we have a better,
> more concrete plan. I think that’s going to produce a better SPIP.
>
> The second reason for -1 is that I think the read- and write-side
> proposals should be separated. The PR
> <https://github.com/cloud-fan/spark/pull/10> currently has “write path”
> listed as a TODO item and most of the discussion I’ve seen is on the read
> side. I think it would be better to separate the read and write APIs so we
> can focus on them individually.
>
> An example of why we should focus on the write path separately is that the
> proposal says this:
>
> Ideally partitioning/bucketing concept should not be exposed in the Data
> Source API V2, because they are just techniques for data skipping and
> pre-partitioning. However, these 2 concepts are already widely used in
> Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION.
> To be consistent, we need to add partitioning/bucketing to Data Source V2 .
> . .
>
> Essentially, some of the APIs mix DDL and DML operations. I’d like to
> consider ways to fix that problem instead of carrying the problem forward
> to Data Source V2. We can solve this by adding a high-level API for DDL and
> a better write/insert API that works well with it. Clearly, that discussion
> is independent of the read path, which is why I think separating the two
> proposals would be a win.
>
> rb
> ​
>
> On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> That might be good to do, but seems like orthogonal to this effort
>> itself. It would be a completely different interface.
>>
>> On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> OK I agree with it, how about we add a new interface to push down the
>>> query plan, based on the current framework? We can mark the
>>> query-plan-push-down interface as unstable, to save the effort of designing
>>> a stable representation of query plan and maintaining forward compatibility.
>>>
>>> On Wed, Aug 30, 2017 at 10:53 AM, James Baker <j.ba...@outlook.com>
>>> wrote:
>>>
>>>> I'll just focus on the one-by-one thing for now - it's the thing that
>>>> blocks me the most.
>>>>
>>>> I think the place where we're most confused here is on the cost of
>>>> determining whether I can push down a filter. For me, in order to work out
>>>> whether I can push down a filter or satisfy a sort, I might have to read
>>>> plenty of data. That said, it's worth me doing this because I can use this
>>>> information to avoid reading >>that much data.
>>>>
>>>> If you give me all the orderings, I will have to read that data many
>>>> times (we stream it to avoid keeping it in memory).
>>>>
>>>> There's also a thing where our typical use cases have many filters (20+
>>>> is common). So, it's likely not going to work to pass us all the
>>>> combinations. That said, if I can tell you a cost, I know what optimal
>>>> looks like, why can't I just pick that myself?
>>>>
>>>> The current design is friendly to simple datasources, but does not have
>>>> the potential to support this.
>>>>
>>>> So the main problem we have with datasources v1

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Reynold Xin
+1 as well

On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust 
wrote:

> +1
>
> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
> wrote:
>
>> +1 (non-binding)
>>
>> Thanks for making the updates reflected in the current PR. It would be
>> great to see the doc updated before it is finally published though.
>>
>> Right now it feels like this SPIP is focused more on getting the basics
>> right for what many datasources are already doing in API V1 combined with
>> other private APIs, vs pushing forward state of the art for performance.
>>
>> I think that’s the right approach for this SPIP. We can add the support
>> you’re talking about later with a more specific plan that doesn’t block
>> fixing the problems that this addresses.
>> ​
>>
>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>>> +1 (binding)
>>>
>>> I personally believe that there is quite a big difference between having
>>> a generic data source interface with a low surface area and pushing down a
>>> significant part of query processing into a datasource. The latter has a much
>>> wider surface area and will require us to stabilize most of the
>>> internal catalyst API's which will be a significant burden on the community
>>> to maintain and has the potential to slow development velocity
>>> significantly. If you want to write such integrations then you should be
>>> prepared to work with catalyst internals and own up to the fact that things
>>> might change across minor versions (and in some cases even maintenance
>>> releases). If you are willing to go down that road, then your best bet is
>>> to use the already existing spark session extensions which will allow you
>>> to write such integrations and can be used as an `escape hatch`.
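
As a rough illustration of that escape hatch (a sketch only: MyPushdownRule is a
made-up placeholder, while withExtensions and injectOptimizerRule are the existing
extension points, and Rule/LogicalPlan are the catalyst internals you would be
opting into):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  // Placeholder rule standing in for a source-specific rewrite; a real one would
  // match and replace the parts of the plan it can push into the external system.
  case class MyPushdownRule(session: SparkSession) extends Rule[LogicalPlan] {
    override def apply(plan: LogicalPlan): LogicalPlan = plan
  }

  val spark = SparkSession.builder()
    .withExtensions(extensions =>
      extensions.injectOptimizerRule(session => MyPushdownRule(session)))
    .getOrCreate()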
>>>
>>>
>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
>>> wrote:
>>>
 +0 (non-binding)

 I think there are benefits to unifying all the Spark-internal
 datasources into a common public API for sure.  It will serve as a forcing
 function to ensure that those internal datasources aren't advantaged vs
 datasources developed externally as plugins to Spark, and that all Spark
 features are available to all datasources.

 But I also think this read-path proposal avoids the more difficult
 questions around how to continue pushing datasource performance forwards.
 James Baker (my colleague) had a number of questions about advanced
 pushdowns (combined sorting and filtering), and Reynold also noted that
 pushdown of aggregates and joins are desirable on longer timeframes as
 well.  The Spark community saw similar requests, for aggregate pushdown in
 SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
 in SPARK-12449.  Clearly a number of people are interested in this kind of
 performance work for datasources.

 To leave enough space for datasource developers to continue
 experimenting with advanced interactions between Spark and their
 datasources, I'd propose we leave some sort of escape valve that enables
 these datasources to keep pushing the boundaries without forking Spark.
 Possibly that looks like an additional unsupported/unstable interface that
 pushes down an entire (unstable API) logical plan, which is expected to
 break API on every release.   (Spark attempts this full-plan pushdown, and
 if that fails Spark ignores it and continues on with the rest of the V2 API
 for compatibility).  Or maybe it looks like something else that we don't
 know of yet.  Possibly this falls outside of the desired goals for the V2
 API and instead should be a separate SPIP.

 If we had a plan for this kind of escape valve for advanced datasource
 developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
 focused more on getting the basics right for what many datasources are
 already doing in API V1 combined with other private APIs, vs pushing
 forward state of the art for performance.

 Andrew

 On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
 suresh.thalam...@gmail.com> wrote:

> +1 (non-binding)
>
>
> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>
> Hi all,
>
> In the previous discussion, we decided to split the read and write
> path of data source v2 into 2 SPIPs, and I'm sending this email to call a
> vote for Data Source V2 read path only.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
> -Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for
> the read path is:
> https://github.com/apache/spark/pull/19136
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>

Re: [discuss] Data Source V2 write path

2017-09-24 Thread Reynold Xin
Can there be an explicit create function?


On Sun, Sep 24, 2017 at 7:17 PM, Wenchen Fan  wrote:

> I agree it would be a clean approach if data source is only responsible to
> write into an already-configured table. However, without catalog
> federation, Spark doesn't have an API to ask an external system(like
> Cassandra) to create a table. Currently it's all done by data source write
> API. Data source implementations are responsible to create or insert a
> table according to the save mode.
>
> As a workaround, I think it's acceptable to pass partitioning/bucketing
> information via data source options, and data sources should decide to take
> this information and create the table, or throw an exception if it
> doesn't match the already-configured table.
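
From the user side that workaround would look something like the following (the
format name and the option keys are invented for illustration; each source would
define and document its own):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._

  val df = Seq(("2017-09-24", "US", 1L)).toDF("date", "country", "user_id")

  df.write
    .format("com.example.mysource")              // hypothetical data source
    .option("partitionColumns", "date,country")  // source-defined, not a Spark API
    .option("numBuckets", "8")
    .mode("append")
    .save()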
>
>
> On Fri, Sep 22, 2017 at 9:35 AM, Ryan Blue  wrote:
>
>> > input data requirement
>>
>> Clustering and sorting within partitions are a good start. We can always
>> add more later when they are needed.
>>
>> The primary use case I'm thinking of for this is partitioning and
>> bucketing. If I'm implementing a partitioned table format, I need to tell
>> Spark to cluster by my partition columns. Should there also be a way to
>> pass those columns separately, since they may not be stored in the same way
>> like partitions are in the current format?
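
A mix-in for declaring such requirements might look roughly like this (the trait
and method names below are invented for illustration; the prototype may well
differ):

  // Hypothetical "input data requirement" mix-ins for the V2 writer: the source
  // declares how its input should be pre-arranged, and Spark does the shuffle
  // and sort before handing rows to the writer.
  trait RequiresClustering {
    def requiredClusteringColumns: Seq[String]   // e.g. the partition columns
  }

  trait RequiresOrderingWithinPartitions {
    def requiredOrderingColumns: Seq[String]     // ascending assumed in this sketch
  }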
>>
>> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan  wrote:
>>
>>> Hi all,
>>>
>>> I want to have some discussion about Data Source V2 write path before
>>> starting a voting.
>>>
>>> The Data Source V1 write path asks implementations to write a DataFrame
>>> directly, which is painful:
>>> 1. Exposing upper-level API like DataFrame to Data Source API is not
>>> good for maintenance.
>>> 2. Data sources may need to preprocess the input data before writing,
>>> like clustering/sorting the input by some columns. It's better to do the
>>> preprocessing in Spark instead of in the data source.
>>> 3. Data sources need to take care of transaction themselves, which is
>>> hard. And different data sources may come up with a very similar approach
>>> for the transaction, which leads to many duplicated codes.
>>>
>>>
>>> To solve these pain points, I'm proposing a data source writing
>>> framework which is very similar to the reading framework, i.e.,
>>> WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take
>>> a look at my prototype to see what it looks like:
>>> https://github.com/apache/spark/pull/19269
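
As a rough picture of that chain (the trait shapes and signatures below are a
sketch of the idea only, not the code in the PR):

  import org.apache.spark.sql.Row

  trait WriterCommitMessage extends Serializable

  // Sketch: WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter
  trait WriteSupport {
    def createWriter(options: Map[String, String]): DataSourceV2Writer
  }

  trait DataSourceV2Writer {
    def createWriteTask(): WriteTask                      // serialized to executors
    def commit(messages: Seq[WriterCommitMessage]): Unit  // job-level commit
    def abort(messages: Seq[WriterCommitMessage]): Unit   // job-level abort
  }

  trait WriteTask extends Serializable {
    def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
  }

  trait DataWriter {
    def write(row: Row): Unit
    def commit(): WriterCommitMessage                     // task-level commit
    def abort(): Unit                                     // task-level abort
  }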
>>>
>>> There are some other details that need further discussion:
>>> 1. *partitioning/bucketing*
>>> Currently only the built-in file-based data sources support them, but
>>> there is nothing stopping us from exposing them to all data sources. One
>>> question is, shall we make them as mix-in interfaces for data source v2
>>> reader/writer, or just encode them into data source options(a
>>> string-to-string map)? Ideally it's more like options, Spark just transfers
>>> this user-given information to data sources, and doesn't do anything with
>>> it.
>>>
>>> 2. *input data requirement*
>>> Data sources should be able to ask Spark to preprocess the input data,
>>> and this can be a mix-in interface for DataSourceV2Writer. I think we need
>>> to add clustering request and sorting within partitions request, any more?
>>>
>>> 3. *transaction*
>>> I think we can just follow `FileCommitProtocol`, which is the internal
>>> framework Spark uses to guarantee transaction for built-in file-based data
>>> sources. Generally speaking, we need task level and job level commit/abort.
>>> Again you can see more details in my prototype about it:
>>> https://github.com/apache/spark/pull/19269
>>>
>>> 4. *data source table*
>>> This is the trickiest one. In Spark you can create a table which points
>>> to a data source, so you can read/write this data source easily by
>>> referencing the table name. Ideally data source table is just a pointer
>>> which points to a data source with a list of predefined options, to save
>>> users from typing these options again and again for each query.
>>> If that's all, then everything is good, we don't need to add more
>>> interfaces to Data Source V2. However, data source tables provide special
>>> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which requires data
>>> sources to have some extra ability.
>>> Currently these special operators only work for built-in file-based data
>>> sources, and I don't think we will extend it in the near future, I propose
>>> to mark them as out of the scope.
>>>
>>>
>>> Any comments are welcome!
>>> Thanks,
>>> Wenchen
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


Re: Should Flume integration be behind a profile?

2017-10-01 Thread Reynold Xin
Probably should do 1, and then it is an easier transition in 3.0.

On Sun, Oct 1, 2017 at 1:28 AM Sean Owen  wrote:

> I tried and failed to do this in
> https://issues.apache.org/jira/browse/SPARK-22142 because it became clear
> that the Flume examples would have to be removed to make this work, too.
> (Well, you can imagine other solutions with extra source dirs or modules
> for flume examples enabled by a profile, but that doesn't help the docs and
> is nontrivial complexity for little gain.)
>
> It kind of suggests Flume support should be deprecated if it's put behind
> a profile. Like with Kafka 0.8. (This is why I'm raising it again to the
> whole list.)
>
> Any preferences among:
> 1. Put Flume behind a profile, remove examples, deprecate
> 2. Put Flume behind a profile, remove examples, but don't deprecate
> 3. Punt until Spark 3.0, when this integration would probably be removed
> entirely (?)
>
> On Tue, Sep 26, 2017 at 10:36 AM Sean Owen  wrote:
>
>> Not a big deal, but I'm wondering whether Flume integration should at
>> least be opt-in and behind a profile? It still sees some use (at least on
>> our end) but not applicable to the majority of users. Most other
>> third-party framework integrations are behind a profile, like YARN, Mesos,
>> Kinesis, Kafka 0.8, Docker. Just soliciting comments, not arguing for it.
>>
>> (Well, actually it annoys me that the Flume integration always fails to
>> compile in IntelliJ unless you generate the sources manually)
>>
>


Re: Configuration docs pages are broken

2017-10-03 Thread Reynold Xin
Interested in submitting a patch to fix them?

On Tue, Oct 3, 2017 at 9:53 AM Nick Dimiduk  wrote:

> Heya,
>
> Looks like the Configuration sections of your docs, both latest [0], and
> 2.1 [1] are broken. The last couple sections are smashed into a single
> unrendered paragraph of markdown at the bottom.
>
> Thanks,
> Nick
>
> [0]: https://spark.apache.org/docs/latest/configuration.html
> [1]: https://spark.apache.org/docs/2.1.0/configuration.html
>


Re: [Spark Core] Custom Catalog. Integration between Apache Ignite and Apache Spark

2017-09-25 Thread Reynold Xin
It's probably just an indication of lack of interest (or at least there
isn't a substantial overlap between Ignite users and Spark users). A new
catalog implementation is also pretty fundamental to Spark and the bar for
that would be pretty high. See my comment in SPARK-17767.

Guys - while I think this is very useful to do, I'm going to mark this as
"later" for now. The reason is that there are a lot of things to consider
before making this switch, including:

   - The ExternalCatalog API is currently internal, and we can't just make
   it public without thinking about the consequences and whether this API is
   maintainable in the long run.
   - SPARK-15777: We need to design this in the context of catalog federation
   and persistence.
   - SPARK-15691: Refactoring of how we integrate with Hive.

This is not as simple as just submitting a PR to make it pluggable.
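
For context, the catalog choice today is a fixed internal switch rather than a
plugin point; roughly speaking, this is all a user can do:

  import org.apache.spark.sql.SparkSession

  // The only catalog options exposed today are the in-memory catalog (default)
  // and Hive; enableHiveSupport() flips spark.sql.catalogImplementation to
  // "hive", and there is no supported way to plug in a third-party ExternalCatalog.
  val spark = SparkSession.builder()
    .appName("catalog-example")
    .enableHiveSupport()
    .getOrCreate()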

On Mon, Sep 25, 2017 at 10:50 AM, Николай Ижиков 
wrote:

> Guys.
>
> Am I miss something and wrote a fully wrong mail?
> Can you give me some feedback?
> What I have missed with my propositions?
>
> 2017-09-19 15:39 GMT+03:00 Nikolay Izhikov :
>
>> Guys,
>>
>> Anyone had a chance to look at my message?
>>
>> 15.09.2017 15:50, Nikolay Izhikov wrote:
>>
>> Hello, guys.
>>>
>>> I’m a contributor to the Apache Ignite project, which is self-described as an
>>> in-memory computing platform.
>>>
>>> It has Data Grid features: a distributed, transactional key-value store
>>> [1], Distributed SQL support [2], etc…[3]
>>>
>>> Currently, I’m working on integration between Ignite and Spark [4]
>>> I want to add support for the Spark DataFrame API for Ignite.
>>>
>>> Since Ignite is a distributed store, it would be useful to create an
>>> implementation of Catalog [5] for Apache Ignite.
>>>
>>> I see two ways to implement this feature:
>>>
>>>  1. Spark can provide an API for any custom catalog implementation. As
>>> far as I can see there is a ticket for it [6]. It is closed with resolution
>>> “Later”. Is it a suitable time to continue working on the ticket? How can I
>>> help with it?
>>>
>>>  2. I can provide an implementation of Catalog and the other required
>>> APIs in the form of a pull request to Spark, as it was implemented for Hive
>>> [7]. Would such a pull request be acceptable?
>>>
>>> Which way is more convenient for the Spark community?
>>>
>>> [1] https://ignite.apache.org/features/datagrid.html
>>> [2] https://ignite.apache.org/features/sql.html
>>> [3] https://ignite.apache.org/features.html
>>> [4] https://issues.apache.org/jira/browse/IGNITE-3084
>>> [5] https://github.com/apache/spark/blob/master/sql/catalyst/src
>>> /main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
>>> [6] https://issues.apache.org/jira/browse/SPARK-17767
>>> [7] https://github.com/apache/spark/blob/master/sql/hive/src/mai
>>> n/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala
>>>
>>
>
>
> --
> Nikolay Izhikov
> nizhikov@gmail.com
>


Re: [VOTE] Spark 2.1.2 (RC2)

2017-09-27 Thread Reynold Xin
+1


On Tue, Sep 26, 2017 at 9:47 PM, Holden Karau  wrote:

> Please vote on releasing the following candidate as Apache Spark version 2
> .1.2. The vote is open until Wednesday October 4th at 23:59 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.2
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.1.2-rc2
> (fabbb7f59e47590114366d14e15fbbff8c88593c)
>
> List of JIRA tickets resolved in this release can be found with this
> filter.
> 
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~holden/spark-2.1.2-rc2-bin/
>
> Release artifacts are signed with a key from:
> https://people.apache.org/~holden/holdens_keys.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1251
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~holden/spark-2.1.2-rc2-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks; in Java/Scala you
> can add the staging repository to your project's resolvers and test with the
> RC (make sure to clean up the artifact cache before/after so you don't
> end up building with an out-of-date RC going forward).
>
> *What should happen to JIRA tickets still targeting 2.1.2?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.3.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1. That being said if
> there is something which is a regression from 2.1.1 that has not been
> correctly targeted please ping a committer to help target the issue (you
> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
> 
> )
>
> *What are the unresolved* issues targeted for 2.1.2
> 
> ?
>
> At this time there are no open unresolved issues.
>
> *Is there anything different about this release?*
>
> This is the first release in a while not built on the AMPLAB Jenkins. This
> is good because it means future releases can more easily be built and
> signed securely (and I've been updating the documentation in
> https://github.com/apache/spark-website/pull/66 as I progress), however
> the chances of a mistake are higher with any change like this. If there
> something you normally take for granted as correct when checking a release,
> please double check this time :)
>
> *Should I be committing code to branch-2.1?*
>
> Thanks for asking! Please treat this stage in the RC process as "code
> freeze" so bug fixes only. If you're uncertain if something should be back
> ported please reach out. If you do commit to branch-2.1 please tag your
> JIRA issue fix version for 2.1.3, and if we cut another RC I'll move the
> 2.1.3 fixed issues into 2.1.2 as appropriate.
>
> *Why the longer voting window?*
>
> Since there is a large industry big data conference this week I figured
> I'd add a little bit of extra buffer time just to make sure everyone has a
> chance to take a look.
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Are there multiple processes out there running JIRA <-> Github maintenance tasks?

2017-08-28 Thread Reynold Xin
The process for doing that was down before, and might've come back up and
is going through the huge backlog.


On Mon, Aug 28, 2017 at 6:56 PM, Sean Owen  wrote:

> Like whatever reassigns JIRAs after a PR is closed?
>
> It seems to be going crazy, or maybe there are many running. Not sure who
> owns that, but can he/she take a look?
>
>


Re: SPIP: Spark on Kubernetes

2017-08-17 Thread Reynold Xin
+1 on adding Kubernetes support in Spark (as a separate module similar to
how YARN is done)

I talk with a lot of developers and teams that operate cloud services, and
k8s in the last year has definitely become one of the key projects, if not
the one with the strongest momentum in this space. I'm not 100% sure we can
make it into 2.3 but IMO based on the activities in the forked repo and
claims that certain deployments are already running in production, this
could already be a solid project and will have a lasting positive impact.



On Wed, Aug 16, 2017 at 10:24 AM, Alexander Bezzubov  wrote:

> +1 (non-binding)
>
>
> Looking forward using it as part of Apache Spark release, instead of
> Standalone cluster deployed on top of k8s.
>
>
> --
> Alex
>
> On Wed, Aug 16, 2017 at 11:11 AM, Ismaël Mejía  wrote:
>
>> +1 (non-binding)
>>
>> This is something really great to have. More schedulers and runtime
>> environments are a HUGE win for the Spark ecosystem.
>> Amazing work, Big kudos for the guys who created and continue working on
>> this.
>>
>> On Wed, Aug 16, 2017 at 2:07 AM, lucas.g...@gmail.com
>>  wrote:
>> > From our perspective, we have invested heavily in Kubernetes as our
>> cluster
>> > manager of choice.
>> >
>> > We also make quite heavy use of spark.  We've been experimenting with
>> using
>> > these builds (2.1 with pyspark enabled) quite heavily.  Given that we've
>> > already 'paid the price' to operate Kubernetes in AWS it seems rational
>> to
>> > move our jobs over to spark on k8s.  Having this project merged into the
>> > master will significantly ease keeping our Data Munging toolchain
>> primarily
>> > on Spark.
>> >
>> >
>> > Gary Lucas
>> > Data Ops Team Lead
>> > Unbounce
>> >
>> > On 15 August 2017 at 15:52, Andrew Ash  wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >> We're moving large amounts of infrastructure from a combination of open
>> >> source and homegrown cluster management systems to unify on Kubernetes
>> and
>> >> want to bring Spark workloads along with us.
>> >>
>> >> On Tue, Aug 15, 2017 at 2:29 PM, liyinan926 
>> wrote:
>> >>>
>> >>> +1 (non-binding)
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>> SPIP-Spark-on-Kubernetes-tp22147p22164.html
>> >>> Sent from the Apache Spark Developers List mailing list archive at
>> >>> Nabble.com.
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread Reynold Xin
James,

Thanks for the comment. I think you just pointed out a trade-off between
expressiveness and API simplicity, compatibility and evolvability. For the
max expressiveness, we'd want the ability to expose full query plans, and
let the data source decide which part of the query plan can be pushed down.

The downside to that (full query plan push down) are:

1. It is extremely difficult to design a stable representation for logical
/ physical plan. It is doable, but we'd be the first to do it. I'm not sure
of any mainstream databases being able to do that in the past. The design
of that API itself, to make sure we have a good story for backward and
forward compatibility, would probably take months if not years. It might
still be good to do, or offer an experimental trait without compatibility
guarantee that uses the current Catalyst internal logical plan.

2. Most data source developers simply want a way to offer some data,
without any pushdown. Having to understand query plans is a burden rather
than a gift.
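
That "just offer some data" case is, in V1 terms, nothing more than a BaseRelation
with TableScan; a minimal sketch:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, TableScan}
  import org.apache.spark.sql.types.{LongType, StructField, StructType}

  // The simplest possible source: fixed schema, full scan, no pushdown of any kind.
  class TrivialRelation(override val sqlContext: SQLContext)
      extends BaseRelation with TableScan {

    override def schema: StructType = StructType(Seq(StructField("id", LongType)))

    override def buildScan(): RDD[Row] =
      sqlContext.sparkContext.range(0, 10).map(Row(_))
  }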


Re: your point about the proposed v2 being worse than v1 for your use case.

Can you say more? You used the argument that in v2 there is more support
for broader pushdown and as a result it is harder to implement. That's how
it is supposed to be. If a data source simply implements one of the trait,
it'd be logically identical to v1. I don't see why it would be worse or
better, other than v2 provides much stronger forward compatibility
guarantees than v1.


On Tue, Aug 29, 2017 at 4:54 AM, James Baker  wrote:

> Copying from the code review comments I just submitted on the draft API (
> https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):
>
> Context here is that I've spent some time implementing a Spark datasource
> and have had some issues with the current API which are made worse in V2.
>
> The general conclusion I’ve come to here is that this is very hard to
> actually implement (in a similar but more aggressive way than DataSource
> V1, because of the extra methods and dimensions we get in V2).
>
> In DataSources V1 PrunedFilteredScan, the issue is that you are passed in
> the filters with the buildScan method, and then passed in again with the
> unhandledFilters method.
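
For readers who have not implemented one, the two V1 hooks being discussed have
roughly these shapes (paraphrased from the org.apache.spark.sql.sources API):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.sources.Filter

  // V1: the filters arrive twice, through two independent calls.
  trait PrunedFilteredScan {
    def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
  }
  // ...plus, on BaseRelation:
  //   def unhandledFilters(filters: Array[Filter]): Array[Filter]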
>
> However, the filters that you can’t handle might be data dependent, which
> the current API does not handle well. Suppose I can handle filter A some of
> the time, and filter B some of the time. If I’m passed in both, then either
> A and B are unhandled, or A, or B, or neither. The work I have to do to
> work this out is essentially the same as I have to do while actually
> generating my RDD (essentially I have to generate my partitions), so I end
> up doing some weird caching work.
>
> This V2 API proposal has the same issues, but perhaps more so. In
> PrunedFilteredScan, there is essentially one degree of freedom for pruning
> (filters), so you just have to implement caching between unhandledFilters
> and buildScan. However, here we have many degrees of freedom; sorts,
> individual filters, clustering, sampling, maybe aggregations eventually -
> and these operations are not all commutative, and computing my support
> one-by-one can easily end up being more expensive than computing all in one
> go.
>
> For some trivial examples:
>
> - After filtering, I might be sorted, whilst before filtering I might not
> be.
>
> - Filtering with certain filters might affect my ability to push down
> others.
>
> - Filtering with aggregations (as mooted) might not be possible to push
> down.
>
> And with the API as currently mooted, I need to be able to go back and
> change my results because they might change later.
>
> Really what would be good here is to pass all of the filters and sorts etc
> all at once, and then I return the parts I can’t handle.
>
> I’d prefer in general that this be implemented by passing some kind of
> query plan to the datasource which enables this kind of replacement.
> Explicitly don’t want to give the whole query plan - that sounds painful -
> would prefer we push down only the parts of the query plan we deem to be
> stable. With the mix-in approach, I don’t think we can guarantee the
> properties we want without a two-phase thing - I’d really love to be able
> to just define a straightforward union type which is our supported pushdown
> stuff, and then the user can transform and return it.
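
One possible shape for that union-type, all-at-once contract (purely illustrative;
none of these names exist in Spark):

  import org.apache.spark.sql.sources.Filter

  // Everything Spark would like to push, handed over in a single call.
  case class PushdownRequest(
      filters: Seq[Filter],
      requiredSortColumns: Seq[String],
      limit: Option[Int])

  // The source hands back whatever it could not satisfy.
  case class PushdownResult(
      unhandledFilters: Seq[Filter],
      sortSatisfied: Boolean,
      limitSatisfied: Boolean)

  trait SupportsFullPushdown {
    def pushdown(request: PushdownRequest): PushdownResult
  }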
>
> I think this ends up being a more elegant API for consumers, and also far
> more intuitive.
>
> James
>
> On Mon, 28 Aug 2017 at 18:00 蒋星博  wrote:
>
>> +1 (Non-binding)
>>
>> Xiao Li wrote on Mon, Aug 28, 2017 at 5:38 PM:
>>
>>> +1
>>>
>>> 2017-08-28 12:45 GMT-07:00 Cody Koeninger :
>>>
 Just wanted to point out that because the jira isn't labeled SPIP, it
 won't have shown up linked from

 http://spark.apache.org/improvement-proposals.html

 On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan 

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
Ok, thanks.

+1 on the SPIP for scope etc


On API details (will deal with in code reviews as well but leaving a note
here in case I forget)

1. I would suggest having the API also accept data type specification in
string form. It is usually simpler to say "long" then "LongType()".

2. Think about what error message to show when the row counts don't match
at runtime.


On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ues...@happy-camper.st>
wrote:

> Yes, the aggregation is out of scope for now.
> I think we should continue discussing the aggregation at JIRA and we will
> be adding those later separately.
>
> Thanks.
>
>
> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Is the idea aggregate is out of scope for the current effort and we will
>> be adding those later?
>>
>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ues...@happy-camper.st>
>> wrote:
>>
>>> Hi all,
>>>
>>> We've been discussing to support vectorized UDFs in Python and we almost
>>> got a consensus about the APIs, so I'd like to summarize and call for a
>>> vote.
>>>
>>> Note that this vote should focus on APIs for vectorized UDFs, not APIs
>>> for vectorized UDAFs or Window operations.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>
>>>
>>> *Proposed API*
>>>
>>> We introduce a @pandas_udf decorator (or annotation) to define
>>> vectorized UDFs which takes one or more pandas.Series or one integer
>>> value meaning the length of the input value for 0-parameter UDFs. The
>>> return value should be pandas.Series of the specified type and the
>>> length of the returned value should be the same as input value.
>>>
>>> We can define vectorized UDFs as:
>>>
>>>   @pandas_udf(DoubleType())
>>>   def plus(v1, v2):
>>>   return v1 + v2
>>>
>>> or we can define as:
>>>
>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>
>>> We can use it similar to row-by-row UDFs:
>>>
>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>
>>> As for 0-parameter UDFs, we can define and use as:
>>>
>>>   @pandas_udf(LongType())
>>>   def f0(size):
>>>   return pd.Series(1).repeat(size)
>>>
>>>   df.select(f0())
>>>
>>>
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


Re: SPIP: Spark on Kubernetes

2017-09-01 Thread Reynold Xin
Anirudh (or somebody else familiar with spark-on-k8s),

Can you create a short plan on how we would integrate and do code review to
merge the project? If the diff is too large it'd be difficult to review and
merge in one shot. Once we have a plan we can create subtickets to track
the progress.



On Thu, Aug 31, 2017 at 5:21 PM, Anirudh Ramanathan <ramanath...@google.com>
wrote:

> The proposal is in the process of being updated to include the details on
> testing that we have, that Imran pointed out.
> Please expect an update on the SPARK-18278
> <https://issues.apache.org/jira/browse/SPARK-18278>.
>
> Mridul had a couple of points as well, about exposing an SPI and we've
> been exploring that, to ascertain the effort involved.
> That effort is separate, fairly long-term and we should have a working
> group of representatives from all cluster managers to make progress on it.
> A proposal regarding this will be in SPARK-19700
> <https://issues.apache.org/jira/browse/SPARK-19700>.
>
> This vote has passed.
> So far, there have been 4 binding +1 votes, ~25 non-binding votes, and no
> -1 votes.
>
> Thanks all!
>
> +1 votes (binding):
> Reynold Xin
> Matei Zahari
> Marcelo Vanzin
> Mark Hamstra
>
> +1 votes (non-binding):
> Anirudh Ramanathan
> Erik Erlandson
> Ilan Filonenko
> Sean Suchter
> Kimoon Kim
> Timothy Chen
> Will Benton
> Holden Karau
> Seshu Adunuthula
> Daniel Imberman
> Shubham Chopra
> Jiri Kremser
> Yinan Li
> Andrew Ash
> 李书明
> Gary Lucas
> Ismael Mejia
> Jean-Baptiste Onofré
> Alexander Bezzubov
> duyanghao
> elmiko
> Sudarshan Kadambi
> Varun Katta
> Matt Cheah
> Edward Zhang
> Vaquar Khan
>
>
>
>
>
> On Wed, Aug 30, 2017 at 10:42 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> This has passed, hasn't it?
>>
>> On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan <fox...@google.com>
>> wrote:
>>
>>> Spark on Kubernetes effort has been developed separately in a fork, and
>>> linked back from the Apache Spark project as an experimental backend
>>> <http://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types>.
>>> We're ~6 months in, have had 5 releases
>>> <https://github.com/apache-spark-on-k8s/spark/releases>.
>>>
>>>- 2 Spark versions maintained (2.1, and 2.2)
>>>- Extensive integration testing and refactoring efforts to maintain
>>>code quality
>>>- Developer
>>><https://github.com/apache-spark-on-k8s/spark#getting-started> and
>>>user-facing <https://apache-spark-on-k8s.github.io/userdocs/> docu
>>>mentation
>>>- 10+ consistent code contributors from different organizations
>>>
>>> <https://apache-spark-on-k8s.github.io/userdocs/contribute.html#project-contributions>
>>>  involved
>>>in actively maintaining and using the project, with several more members
>>>involved in testing and providing feedback.
>>>- The community has delivered several talks on Spark-on-Kubernetes
>>>generating lots of feedback from users.
>>>- In addition to these, we've seen efforts spawn off such as:
>>>- HDFS on Kubernetes
>>>   <https://github.com/apache-spark-on-k8s/kubernetes-HDFS> with
>>>   Locality and Performance Experiments
>>>   - Kerberized access
>>>   
>>> <https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg/edit>
>>>  to
>>>   HDFS from Spark running on Kubernetes
>>>
>>> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>>>
>>>- +1: Yeah, let's go forward and implement the SPIP.
>>>- +0: Don't really care.
>>>- -1: I don't think this is a good idea because of the following
>>>technical reasons.
>>>
>>> If there is any further clarification desired, on the design or the
>>> implementation, please feel free to ask questions or provide feedback.
>>>
>>>
>>> SPIP: Kubernetes as A Native Cluster Manager
>>>
>>> Full Design Doc: link
>>> <https://issues.apache.org/jira/secure/attachment/12881586/SPARK-18278%20Spark%20on%20Kubernetes%20Design%20Proposal%20Revision%202%20%281%29.pdf>
>>>
>>> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>>>
>>> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>>>
>>> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
>>>> private Supplier<Bar> barSupplier = newSupplier(currentFilters);
>>>>
>>>> public CachingFoo(Foo delegate) {
>>>> this.delegate = delegate;
>>>> }
>>>>
>>>> private Supplier<Bar> newSupplier(List<Filter> filters) {
>>>> return Suppliers.memoize(() -> delegate.computeBar(filters));
>>>> }
>>>>
>>>> @Override
>>>> public Bar computeBar(List<Filter> filters) {
>>>> if (!filters.equals(currentFilters)) {
>>>> currentFilters = filters;
>>>> barSupplier = newSupplier(filters);
>>>> }
>>>>
>>>> return barSupplier.get();
>>>> }
>>>> }
>>>>
>>>> which caches the result required in unhandledFilters on the expectation
>>>> that Spark will call buildScan afterwards and get to use the result.
>>>>
>>>> This kind of cache becomes more prominent, but harder to deal with in
>>>> the new APIs. As one example here, the state I will need in order to
>>>> compute accurate column stats internally will likely be a subset of the
>>>> work required in order to get the read tasks, tell you if I can handle
>>>> filters, etc, so I'll want to cache them for reuse. However, the cached
>>>> information needs to be appropriately invalidated when I add a new filter
>>>> or sort order or limit, and this makes implementing the APIs harder and
>>>> more error-prone.
>>>>
>>>> One thing that'd be great is a defined contract of the order in which
>>>> Spark calls the methods on your datasource (ideally this contract could be
>>>> implied by the way the Java class structure works, but otherwise I can just
>>>> throw).
>>>>
>>>> James
>>>>
>>>> On Tue, 29 Aug 2017 at 02:56 Reynold Xin <r...@databricks.com> wrote:
>>>>
>>>>> James,
>>>>>
>>>>> Thanks for the comment. I think you just pointed out a trade-off
>>>>> between expressiveness and API simplicity, compatibility and evolvability.
>>>>> For the max expressiveness, we'd want the ability to expose full query
>>>>> plans, and let the data source decide which part of the query plan can be
>>>>> pushed down.
>>>>>
>>>>> The downside to that (full query plan push down) are:
>>>>>
>>>>> 1. It is extremely difficult to design a stable representation for
>>>>> logical / physical plan. It is doable, but we'd be the first to do it. I'm
>>>>> not sure of any mainstream databases being able to do that in the past. 
>>>>> The
>>>>> design of that API itself, to make sure we have a good story for backward
>>>>> and forward compatibility, would probably take months if not years. It
>>>>> might still be good to do, or offer an experimental trait without
>>>>> compatibility guarantee that uses the current Catalyst internal logical
>>>>> plan.
>>>>>
>>>>> 2. Most data source developers simply want a way to offer some data,
>>>>> without any pushdown. Having to understand query plans is a burden rather
>>>>> than a gift.
>>>>>
>>>>>
>>>>> Re: your point about the proposed v2 being worse than v1 for your use
>>>>> case.
>>>>>
>>>>> Can you say more? You used the argument that in v2 there is more
>>>>> support for broader pushdown and as a result it is harder to implement.
>>>>> That's how it is supposed to be. If a data source simply implements one of
>>>>> the trait, it'd be logically identical to v1. I don't see why it would be
>>>>> worse or better, other than v2 provides much stronger forward 
>>>>> compatibility
>>>>> guarantees than v1.
>>>>>
>>>>>
>>>>> On Tue, Aug 29, 2017 at 4:54 AM, James Baker <j.ba...@outlook.com>
>>>>> wrote:
>>>>>
>>>>>> Copying from the code review comments I just submitted on the draft
>>>>>> API (
>>>>>> https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):
>>>>>>
>>>>>>
>>>>>> Context here is that I've spent some time implementing a Spark
>>>>>> datasource and have had some issues with the current API which are

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-01 Thread Reynold Xin
Why does ordering matter here for sort vs filter? The source should be able
to handle it in whatever way it wants (which is almost always filter
beneath sort I'd imagine).

The only ordering that'd matter in the current set of pushdowns is limit -
it should always mean the root of the pushed-down tree.
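
A concrete query makes the point (the data and column names are made up):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._

  val df = Seq((1L, "x"), (5L, "y"), (3L, "z")).toDF("a", "b")
  val q = df.filter($"a" > 1).sort($"b").limit(10)
  // However the source reorders the work internally, the pushed limit has to
  // apply to the sorted, filtered result (the root of the pushed-down tree),
  // while the filter can always be evaluated beneath the sort without changing
  // the answer.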


On Fri, Sep 1, 2017 at 3:22 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

> > Ideally also getting sort orders _after_ getting filters.
>
> Yea we should have a deterministic order when applying various push downs,
> and I think filter should definitely go before sort. This is one of the
> details we can discuss during PR review :)
>
> On Fri, Sep 1, 2017 at 9:19 AM, James Baker <j.ba...@outlook.com> wrote:
>
>> I think that makes sense. I didn't understand backcompat was the primary
>> driver. I actually don't care right now about aggregations on the
>> datasource I'm integrating with - I just care about receiving all the
>> filters (and ideally also the desired sort order) at the same time. I am
>> mostly fine with anything else; but getting filters at the same time is
>> important for me, and doesn't seem overly contentious? (e.g. it's
>> compatible with datasources v1). Ideally also getting sort orders _after_
>> getting filters.
>>
>> That said, an unstable api that gets me the query plan would be
>> appreciated by plenty I'm sure :) (and would make my implementation more
>> straightforward - the state management is painful atm).
>>
>> James
>>
>> On Wed, 30 Aug 2017 at 14:56 Reynold Xin <r...@databricks.com> wrote:
>>
>>> Sure that's good to do (and as discussed earlier a good compromise might
>>> be to expose an interface for the source to decide which part of the
>>> logical plan they want to accept).
>>>
>>> To me everything is about cost vs benefit.
>>>
>>> In my mind, the biggest issue with the existing data source API is
>>> backward and forward compatibility. All the data sources written for Spark
>>> 1.x broke in Spark 2.x. And that's one of the biggest value v2 can bring.
>>> To me it's far more important to have data sources implemented in 2017 to
>>> be able to work in 2027, in Spark 10.x.
>>>
>>> You are basically arguing for creating a new API that is capable of
>>> doing arbitrary expression, aggregation, and join pushdowns (you only
>>> mentioned aggregation so far, but I've talked to enough database people
>>> that I know once Spark gives them aggregation pushdown, they will come back
>>> for join pushdown). We can do that using unstable APIs, and creating stable
>>> APIs would be extremely difficult (still doable, just would take a long
>>> time to design and implement). As mentioned earlier, it basically involves
>>> creating a stable representation for all of logical plan, which is a lot of
>>> work. I think we should still work towards that (for other reasons as
>>> well), but I'd consider that out of scope for the current one. Otherwise
>>> we'd not release something probably for the next 2 or 3 years.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 30, 2017 at 11:50 PM, James Baker <j.ba...@outlook.com>
>>> wrote:
>>>
>>>> I guess I was more suggesting that by coding up the powerful mode as
>>>> the API, it becomes easy for someone to layer an easy mode beneath it to
>>>> enable simpler datasources to be integrated (and that simple mode should be
>>>> the out of scope thing).
>>>>
>>>> Taking a small step back here, one of the places where I think I'm
>>>> missing some context is in understanding the target consumers of these
>>>> interfaces. I've done some amount (though likely not enough) of research
>>>> about the places where people have had issues of API surface in the past -
>>>> the concrete tickets I've seen have been based on Cassandra integration
>>>> where you want to indicate clustering, and SAP HANA where they want to push
>>>> down more complicated queries through Spark. This proposal supports the
>>>> former, but the amount of change required to support clustering in the
>>>> current API is not obviously high - whilst the current proposal for V2
>>>> seems to make it very difficult to add support for pushing down plenty of
>>>> aggregations in the future (I've found the question of how to add GROUP BY
>>>> to be pretty tricky to answer for the current proposal).
>>>>
>>>> Googling around for implementations of the current Prun

Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-01 Thread Reynold Xin
Is the idea aggregate is out of scope for the current effort and we will be
adding those later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN  wrote:

> Hi all,
>
> We've been discussing to support vectorized UDFs in Python and we almost
> got a consensus about the APIs, so I'd like to summarize and call for a
> vote.
>
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for
> vectorized UDAFs or Window operations.
>
> https://issues.apache.org/jira/browse/SPARK-21190
>
>
> *Proposed API*
>
> We introduce a @pandas_udf decorator (or annotation) to define vectorized
> UDFs which take one or more pandas.Series or one integer value meaning
> the length of the input value for 0-parameter UDFs. The return value should
> be pandas.Series of the specified type and the length of the returned
> value should be the same as input value.
>
> We can define vectorized UDFs as:
>
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>   return v1 + v2
>
> or we can define as:
>
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>
> We can use it similar to row-by-row UDFs:
>
>   df.withColumn('sum', plus(df.v1, df.v2))
>
> As for 0-parameter UDFs, we can define and use as:
>
>   @pandas_udf(LongType())
>   def f0(size):
>   return pd.Series(1).repeat(size)
>
>   df.select(f0())
>
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


Re: SPIP: Spark on Kubernetes

2017-08-30 Thread Reynold Xin
This has passed, hasn't it?

On Tue, Aug 15, 2017 at 5:33 PM Anirudh Ramanathan 
wrote:

> Spark on Kubernetes effort has been developed separately in a fork, and
> linked back from the Apache Spark project as an experimental backend
> .
> We're ~6 months in, have had 5 releases
> .
>
>- 2 Spark versions maintained (2.1, and 2.2)
>- Extensive integration testing and refactoring efforts to maintain
>code quality
>- Developer
> and
>user-facing 
> documentation
>- 10+ consistent code contributors from different organizations
>
> 
>  involved
>in actively maintaining and using the project, with several more members
>involved in testing and providing feedback.
>- The community has delivered several talks on Spark-on-Kubernetes
>generating lots of feedback from users.
>- In addition to these, we've seen efforts spawn off such as:
>- HDFS on Kubernetes
>    with
>   Locality and Performance Experiments
>   - Kerberized access
>   
> 
>  to
>   HDFS from Spark running on Kubernetes
>
> *Following the SPIP process, I'm putting this SPIP up for a vote.*
>
>- +1: Yeah, let's go forward and implement the SPIP.
>- +0: Don't really care.
>- -1: I don't think this is a good idea because of the following
>technical reasons.
>
> If there is any further clarification desired, on the design or the
> implementation, please feel free to ask questions or provide feedback.
>
>
> SPIP: Kubernetes as A Native Cluster Manager
>
> Full Design Doc: link
> 
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-18278
>
> Kubernetes Issue: https://github.com/kubernetes/kubernetes/issues/34377
>
> Authors: Yinan Li, Anirudh Ramanathan, Erik Erlandson, Andrew Ash, Matt
> Cheah,
>
> Ilan Filonenko, Sean Suchter, Kimoon Kim
> Background and Motivation
>
> Containerization and cluster management technologies are constantly
> evolving in the cluster computing world. Apache Spark currently implements
> support for Apache Hadoop YARN and Apache Mesos, in addition to providing
> its own standalone cluster manager. In 2014, Google announced development
> of Kubernetes  which has its own unique feature
> set and differentiates itself from YARN and Mesos. Since its debut, it has
> seen contributions from over 1300 contributors with over 5 commits.
> Kubernetes has cemented itself as a core player in the cluster computing
> world, and cloud-computing providers such as Google Container Engine,
> Google Compute Engine, Amazon Web Services, and Microsoft Azure support
> running Kubernetes clusters.
>
> This document outlines a proposal for integrating Apache Spark with
> Kubernetes in a first class way, adding Kubernetes to the list of cluster
> managers that Spark can be used with. Doing so would allow users to share
> their computing resources and containerization framework between their
> existing applications on Kubernetes and their computational Spark
> applications. Although there is existing support for running a Spark
> standalone cluster on Kubernetes
> ,
> there are still major advantages and significant interest in having native
> execution support. For example, this integration provides better support
> for multi-tenancy and dynamic resource allocation. It also allows users to
> run applications of different Spark versions of their choices in the same
> cluster.
>
> The feature is being developed in a separate fork
>  in order to minimize risk
> to the main project during development. Since the start of the development
> in November of 2016, it has received over 100 commits from over 20
> contributors and supports two releases based on Spark 2.1 and 2.2
> respectively. Documentation is also being actively worked on both in the
> main project repository and also in the repository
> https://github.com/apache-spark-on-k8s/userdocs. Regarding real-world use
> cases, we have seen cluster setup that uses 1000+ cores. We are also seeing
> growing interests on this project from more and more organizations.
>
> While it is easy to bootstrap the project in a forked repository, it is
> hard to maintain it in the long run because of the tricky process of
> rebasing onto the 

Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-09 Thread Reynold Xin
+1

One thing with MetadataSupport - It's a bad idea to call it that unless
adding new functions in that trait wouldn't break source/binary
compatibility in the future.
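
Roughly, the mix-in under discussion would be shaped like the sketch below; the options type and exact signature are assumptions, they were not pinned down in the thread.

// Illustrative only: a metadata-creation mix-in for v2 data sources.
trait MetadataSupport {
  // Called by DataFrameWriter.save before any data is written,
  // e.g. so a JDBC source can create the target table if it does not exist.
  def create(options: Map[String, String]): Unit
}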


On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan  wrote:

> I'm adding my own +1 (binding).
>
> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan  wrote:
>
>> I'm going to update the proposal: for the last point, although the
>> user-facing API (`df.write.format(...).option(...).mode(...).save()`)
>> mixes data and metadata operations, we are still able to separate them in
>> the data source write API. We can have a mix-in trait `MetadataSupport`
>> which has a method `create(options)`, so that data sources can mix in this
>> trait and provide metadata creation support. Spark will call this `create`
>> method inside `DataFrameWriter.save` if the specified data source has it.
>>
>> Note that file format data sources can ignore this new trait and still
>> write data without metadata(it doesn't have metadata anyway).
>>
>> With this updated proposal, I'm calling a new vote for the data source v2
>> write path.
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>> On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan  wrote:
>>
>>> Hi all,
>>>
>>> After we merge the infrastructure of data source v2 read path, and have
>>> some discussion for the write path, now I'm sending this email to call a
>>> vote for Data Source v2 write path.
>>>
>>> The full document of the Data Source API V2 is:
>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
>>> -Z8qU5Frf6WMQZ6jJVM/edit
>>>
>>> The ready-for-review PR that implements the basic infrastructure for the
>>> write path:
>>> https://github.com/apache/spark/pull/19269
>>>
>>>
>>> The Data Source V1 write path asks implementations to write a DataFrame
>>> directly, which is painful:
>>> 1. Exposing upper-level API like DataFrame to Data Source API is not
>>> good for maintenance.
>>> 2. Data sources may need to preprocess the input data before writing,
>>> like clustering/sorting the input by some columns. It's better to do the
>>> preprocessing in Spark instead of in the data source.
>>> 3. Data sources need to take care of transactions themselves, which is
>>> hard. And different data sources may come up with very similar approaches
>>> to transactions, which leads to a lot of duplicated code.
>>>
>>> To solve these pain points, I'm proposing the data source v2 writing
>>> framework which is very similar to the reading framework, i.e.,
>>> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>>>
>>> Data Source V2 write path follows the existing FileCommitProtocol, and
>>> have task/job level commit/abort, so that data sources can implement
>>> transaction easier.
>>>
>>> We can create a mix-in trait for DataSourceV2Writer to specify the
>>> requirement for input data, like clustering and ordering.
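
A simplified sketch of that chain, using the names from the proposal; the commit-message type, the SaveMode/options parameters, and the exact signatures are illustrative assumptions, not the final API.

import org.apache.spark.sql.{Row, SaveMode}

// Produced by task-level commits, collected by the driver for the job-level commit.
trait WriterCommitMessage extends Serializable

// Runs on executors: writes the records of one task, then commits or aborts.
trait DataWriter[T] {
  def write(record: T): Unit
  def commit(): WriterCommitMessage
  def abort(): Unit
}

// Serializable factory shipped to executors to create one writer per task.
trait DataWriterFactory[T] extends Serializable {
  def createWriter(partitionId: Int, attemptNumber: Int): DataWriter[T]
}

// Lives on the driver: hands out the factory and does job-level commit/abort.
trait DataSourceV2Writer {
  def createWriterFactory(): DataWriterFactory[Row]
  def commit(messages: Array[WriterCommitMessage]): Unit
  def abort(messages: Array[WriterCommitMessage]): Unit
}

// Mixed into the data source to expose writing; a requirements mix-in for
// clustering/ordering of the input would sit alongside this.
trait WriteSupport {
  def createWriter(options: Map[String, String], mode: SaveMode): DataSourceV2Writer
}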
>>>
>>> Spark provides a very simple protocol for users to connect to data
>>> sources. A common way to write a dataframe to data sources:
>>> `df.write.format(...).option(...).mode(...).save()`.
>>> Spark passes the options and save mode to data sources, and schedules
>>> the write job on the input data. And the data source should take care of
>>> the metadata, e.g., the JDBC data source can create the table if it doesn't
>>> exist, or fail the job and ask users to create the table in the
>>> corresponding database first. Data sources can define some options for
>>> users to carry some metadata information like partitioning/bucketing.
>>>
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>
>>
>


Re: Interested to Contribute in Spark Development

2017-10-04 Thread Reynold Xin
Kumar,

This is a good start: http://spark.apache.org/contributing.html


On Wed, Oct 4, 2017 at 10:00 AM, vaquar khan  wrote:

> Hi Nishant,
>
> 1) Start with helping spark users on mailing list and stack .
>
> 2) Start helping build and testing.
>
> 3) Once comfortable with code start working on Spark Jira.
>
>
> Regards,
> Vaquar khan
>
> On Oct 4, 2017 11:29 AM, "Kumar Nishant"  wrote:
>
>> Hi Team,
>> I am new to Apache community and I would love to contribute effort in
>> Spark development. Can anyone mentor & guide me how to proceed and start
>> contributing? I am beginner here so I am not sure what process is be
>> followed.
>>
>> Thanks
>> Nishant
>>
>>


[discuss] SPIP: Continuous Processing Mode for Structured Streaming

2017-10-23 Thread Reynold Xin
Please take a look at the attached PDF for the SPIP: Continuous Processing
Mode for Structured Streaming

https://issues.apache.org/jira/browse/SPARK-20928

It is meant to be a very small, surgical change to Structured Streaming to
enable ultra-low latency. This is great timing because we are also
designing and implementing data source API v2. If designed properly, we can
have the same data source API working for both streaming and batch.


Re: [discuss][SQL] Partitioned column type inference proposal

2017-11-14 Thread Reynold Xin
Most of those thoughts from Wenchen make sense to me.


Rather than a list, can we create a table? X-axis is data type, and Y-axis
is also data type, and the intersection explains what the coerced type is?
Can we also look at what Hive, standard SQL (Postgres?) do?


Also, this shouldn't be isolated to partition column inference. We should
make sure most of the type coercions are consistent across different
functionalities, with the caveat that we need to preserve backward
compatibility.



On Tue, Nov 14, 2017 at 8:33 AM, Wenchen Fan  wrote:

> My 2 cents:
>
> 1. when merging NullType with another type, the result should always be
> that type.
> 2. when merging StringType with another type, the result should always be
> StringType.
> 3. when merging integral types, the priority from high to low:
> DecimalType, LongType, IntegerType. This is because DecimalType is used as
> a big integer when parsing partition column values.
> 4. DoubleType can't be merged with other types, except DoubleType itself.
> 5. when merging TimestampType with DateType, return TimestampType.
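
Transcribed into code, those five rules would look roughly like the sketch below. This is illustrative only, not the actual TypeCoercion code; the fallback to StringType for combinations not covered is an assumption that matches the proposed table later in the thread.

import org.apache.spark.sql.types._

def mergePartitionType(a: DataType, b: DataType): DataType = (a, b) match {
  case (NullType, t) => t                                             // rule 1
  case (t, NullType) => t
  case (StringType, _) | (_, StringType) => StringType                // rule 2
  case (d: DecimalType, LongType | IntegerType) => d                  // rule 3
  case (LongType | IntegerType, d: DecimalType) => d
  case (LongType, IntegerType) | (IntegerType, LongType) => LongType
  case (DoubleType, DoubleType) => DoubleType                         // rule 4
  case (TimestampType, DateType) | (DateType, TimestampType) => TimestampType  // rule 5
  case (t1, t2) if t1 == t2 => t1
  case _ => StringType  // assumed fallback for the remaining combinations
}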
>
>
> On Tue, Nov 14, 2017 at 3:54 PM, Hyukjin Kwon  wrote:
>
>> Hi dev,
>>
>> I would like to post a proposal about partitioned column type inference
>> (related with 'spark.sql.sources.partitionColumnTypeInference.enabled'
>> configuration).
>>
>> This thread focuses on the type coercion (finding the common type) in
>> partitioned columns, in particular, when different forms of data are
>> inserted for the partition column and then read back with type
>> inference.
>>
>>
>> *Problem:*
>>
>>
>> val df = Seq((1, "2015-01-01"), (2, "2016-01-01 00:00:00")).toDF("i", "ts")
>> df.write.format("parquet").partitionBy("ts").save("/tmp/foo")
>> spark.read.load("/tmp/foo").printSchema()
>> val df = Seq((1, "1"), (2, "1" * 30)).toDF("i", "decimal")
>> df.write.format("parquet").partitionBy("decimal").save("/tmp/bar")
>> spark.read.load("/tmp/bar").printSchema()
>>
>>
>>
>> It currently returns:
>>
>>
>> root
>>  |-- i: integer (nullable = true)
>>  |-- ts: date (nullable = true)
>>
>> root
>>  |-- i: integer (nullable = true)
>>  |-- decimal: integer (nullable = true)
>>
>>
>> The type coercion is not well designed yet, and currently there are a
>> few holes, which is not ideal:
>>
>>
>> private val upCastingOrder: Seq[DataType] =
>>   Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
>> ...
>> literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
>>
>>
>>
>> The current logic does not handle types outside of the upCastingOrder;
>> it just returns the first type as the coerced one.
>>
>> This has been discussed in https://github.com/apache/s
>> park/pull/19389#discussion_r150426911, but I would like to have more
>> feedback from community as it possibly is a breaking change.
>>
>> For the current releases of Spark (<= 2.2.0), we support the types below
>> for partitioned column schema inference, given my investigation -
>> https://github.com/apache/spark/pull/19389#discussion_r150528207:
>>
>>   NullType
>>   IntegerType
>>   LongType
>>   DoubleType,
>>   *DecimalType(...)
>>   DateType
>>   TimestampType
>>   StringType
>>
>>   *DecimalType only when it's bigger than LongType:
>>
>> I believe this is something we should definitely fix.
>>
>>
>>
>> *Proposal:*
>>
>> I propose the change - https://github.com/apache/spark/pull/19389
>>
>> Simply, it reuses the case 2 specified in https://github.com/apache/s
>> park/blob/6412ea1759d39a2380c572ec24cfd8ae4f2d81f7/sql/catal
>> yst/src/main/scala/org/apache/spark/sql/catalyst/analysis/
>> TypeCoercion.scala#L40-L43
>>
>> Please refer to the chart I produced here - https://github.com/apache/sp
>> ark/pull/19389/files#r150528361. The current proposal will bring the
>> following type coercion behaviour changes:
>>
>>
>> Input types                          Old output type     New output type
>> [NullType, DecimalType(38,0)]        StringType          DecimalType(38,0)
>> [NullType, DateType]                 StringType          DateType
>> [NullType, TimestampType]            StringType          TimestampType
>> [IntegerType, DecimalType(38,0)]     IntegerType         DecimalType(38,0)
>> [IntegerType, DateType]              IntegerType         StringType
>> [IntegerType, TimestampType]         IntegerType         StringType
>> [LongType, DecimalType(38,0)]        LongType            DecimalType(38,0)
>> [LongType, DateType]                 LongType            StringType
>> [LongType, TimestampType]            LongType            StringType
>> [DoubleType, DateType]               DoubleType          StringType
>> [DoubleType, TimestampType]          DoubleType          StringType
>> [DecimalType(38,0), NullType]        StringType          DecimalType(38,0)
>> [DecimalType(38,0), IntegerType]     IntegerType         DecimalType(38,0)
>> [DecimalType(38,0), LongType]        LongType            DecimalType(38,0)
>> [DecimalType(38,0), DateType]        DecimalType(38,0)   StringType
>> [DecimalType(38,0), TimestampType]   DecimalType(38,0)   StringType
>> [DateType, NullType]                 StringType          DateType
>> [DateType, IntegerType]              IntegerType         StringType
>> [DateType, LongType]                 LongType            StringType
>> [DateType, 

Re: OutputMetrics empty for DF writes - any hints?

2017-11-27 Thread Reynold Xin
Is this due to the insert command not having metrics? It's a problem we
should fix.


On Mon, Nov 27, 2017 at 10:45 AM, Jason White 
wrote:

> I'd like to use the SparkListenerInterface to listen for some metrics for
> monitoring/logging/metadata purposes. The first ones I'm interested in
> hooking into are recordsWritten and bytesWritten as a measure of
> throughput.
> I'm using PySpark to write Parquet files from DataFrames.
>
> I'm able to extract a rich set of metrics this way, but for some reason the
> two that I want are always 0. This mirrors what I see in the Spark
> Application Master - the # records written field is always missing.
>
> I've filed a JIRA already for this issue:
> https://issues.apache.org/jira/browse/SPARK-22605
>
> I _think_ how this works is that inside the ResultTask.runTask method, the
> rdd.iterator call is incrementing the bytes read & records read via
> RDD.getOrCompute. Where would the equivalent be for the records written
> metrics?
>
> These metrics are populated properly if I save the data as an RDD via
> df.rdd.saveAsTextFile, so the code path exists somewhere. Any hints as to
> where I should be looking?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [RESULT][VOTE] Spark 2.2.1 (RC2)

2017-12-01 Thread Reynold Xin
Congrats.


On Fri, Dec 1, 2017 at 12:10 AM, Felix Cheung 
wrote:

> This vote passes. Thanks everyone for testing this release.
>
>
> +1:
>
> Sean Owen (binding)
>
> Herman van Hövell tot Westerflier (binding)
>
> Wenchen Fan (binding)
>
> Shivaram Venkataraman (binding)
>
> Felix Cheung
>
> Henry Robinson
>
> Hyukjin Kwon
>
> Dongjoon Hyun
>
> Kazuaki Ishizaki
>
> Holden Karau
>
> Weichen Xu
>
>
> 0: None
>
> -1: None
>
>
>
>
> On Wed, Nov 29, 2017 at 3:21 PM Weichen Xu 
> wrote:
>
>> +1
>>
>> On Thu, Nov 30, 2017 at 6:27 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1
>>>
>>> SHA, MD5 and signatures look fine. Built and ran Maven tests on my
>>> Macbook.
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Wed, Nov 29, 2017 at 10:43 AM, Holden Karau 
>>> wrote:
>>>
 +1 (non-binding)

 PySpark install into a virtualenv works, PKG-INFO looks correctly
 populated (mostly checking for the pypandoc conversion there).

 Thanks for your hard work Felix (and all of the testers :)) :)

 On Wed, Nov 29, 2017 at 9:33 AM, Wenchen Fan 
 wrote:

> +1
>
> On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki  > wrote:
>
>> +1 (non-binding)
>>
>> I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
>> for core/sql-core/sql-catalyst/mllib/mllib-local have passed.
>>
>> $ java -version
>> openjdk version "1.8.0_131"
>> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.
>> 16.04.3-b11)
>> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>>
>> % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn
>> -Phadoop-2.7 -T 24 clean package install
>> % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
>> core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
>> ...
>> Run completed in 13 minutes, 54 seconds.
>> Total number of tests run: 1118
>> Suites: completed 170, aborted 0
>> Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0
>> All tests passed.
>> [INFO] 
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Core . SUCCESS
>> [17:13 min]
>> [INFO] Spark Project ML Local Library . SUCCESS [
>>  6.065 s]
>> [INFO] Spark Project Catalyst . SUCCESS
>> [11:51 min]
>> [INFO] Spark Project SQL .. SUCCESS
>> [17:55 min]
>> [INFO] Spark Project ML Library ... SUCCESS
>> [17:05 min]
>> [INFO] 
>> 
>> [INFO] BUILD SUCCESS
>> [INFO] 
>> 
>> [INFO] Total time: 01:04 h
>> [INFO] Finished at: 2017-11-30T01:48:15+09:00
>> [INFO] Final Memory: 128M/329M
>> [INFO] 
>> 
>> [WARNING] The requested profile "hive" could not be activated because
>> it does not exist.
>>
>> Kazuaki Ishizaki
>>
>>
>>
>> From:Dongjoon Hyun 
>> To:Hyukjin Kwon 
>> Cc:Spark dev list , Felix Cheung <
>> felixche...@apache.org>, Sean Owen 
>> Date:2017/11/29 12:56
>> Subject:Re: [VOTE] Spark 2.2.1 (RC2)
>> --
>>
>>
>>
>> +1 (non-binding)
>>
>> RC2 is tested on CentOS, too.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 28, 2017 at 4:35 PM, Hyukjin Kwon <*gurwls...@gmail.com*
>> > wrote:
>> +1
>>
>> 2017-11-29 8:18 GMT+09:00 Henry Robinson <*he...@apache.org*
>> >:
>> (My vote is non-binding, of course).
>>
>> On 28 November 2017 at 14:53, Henry Robinson <*he...@apache.org*
>> > wrote:
>> +1, tests all pass for me on Ubuntu 16.04.
>>
>> On 28 November 2017 at 10:36, Herman van Hövell tot Westerflier <
>> *hvanhov...@databricks.com* > wrote:
>> +1
>>
>> On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung <
>> *felixche...@apache.org* > wrote:
>> +1
>>
>> Thanks Sean. Please vote!
>>
>> Tested various scenarios with R package. Ubuntu, Debian, Windows
>> r-devel and release and on r-hub. Verified CRAN checks are clean (only 1
>> NOTE!) and no leaked files (.cache removed, /tmp clean)
>>
>>
>> On Sun, Nov 26, 2017 at 11:55 AM Sean Owen 

Re: Decimals

2017-12-13 Thread Reynold Xin
Responses inline

On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido  wrote:

> Hi all,
>
> In recent weeks I have seen a lot of problems related to decimal
> values (SPARK-22036, SPARK-22755, for instance). Some are related to
> historical choices, which I don't know, thus please excuse me if I am
> saying dumb things:
>
>  - why are we interpreting literal constants in queries as Decimal and not
> as Double? I think it is very unlikely that a user can enter a number which
> is beyond Double precision.
>

Probably just to be consistent with some popular databases.



>  - why are we returning null in case of precision loss? Is this approach
> better than just giving a result which might lose some accuracy?
>

The contract with decimal is that it should never lose precision (it is
created for financial reports, accounting, etc). Returning null is at least
telling the user the data type can no longer support the precision required.
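
Both points can be seen from spark-shell; the sketch below is illustrative and the printed schema may differ slightly across versions.

// A fractional literal is typed as a decimal, not a double:
spark.sql("SELECT 1.5 AS x").schema
// roughly: StructType(StructField(x, DecimalType(2,1), false))

// A value that does not fit the requested precision yields null rather than
// a silently rounded result:
spark.sql("SELECT CAST('12345678901234567890' AS DECIMAL(10,0)) AS y").show()
// +----+
// |   y|
// +----+
// |null|
// +----+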



>
> Thanks,
> Marco
>


Re: [01/51] [partial] spark-website git commit: 2.2.1 generated doc

2017-12-17 Thread Reynold Xin
There is an additional step that's needed to update the symlink, and that
step hasn't been done yet.


On Sun, Dec 17, 2017 at 12:32 PM, Jacek Laskowski  wrote:

> Hi Sean,
>
> What does "Not all the pieces are released yet" mean if you don't mind me
> asking? 2.2.1 has already been announced, hasn't it? [1]
>
> [1] http://spark.apache.org/news/spark-2-2-1-released.html
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Sun, Dec 17, 2017 at 4:19 PM, Sean Owen  wrote:
>
>> /latest does not point to 2.2.1 yet. Not all the pieces are released yet,
>> as I understand?
>>
>> On Sun, Dec 17, 2017 at 8:12 AM Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> I saw the following commit, but I can't seem to see 2.2.1 as the version
>>> in the header of the documentation pages under http://spark.apache.org/
>>> docs/latest/ (that is still 2.2.0). Is this being worked on?
>>>
>>> http://spark.apache.org/docs/2.2.1 is available and shows the proper
>>> version, but not http://spark.apache.org/docs/latest :(
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>>
>>>
>


Re: Faster and Lower memory implementation toPandas

2017-11-16 Thread Reynold Xin
Please send a PR. Thanks for looking at this.

On Thu, Nov 16, 2017 at 7:27 AM Andrew Andrade 
wrote:

> Hello devs,
>
> I know a lot of great work has been done recently with pandas to spark
> dataframes and vice versa using Apache Arrow, but I faced a specific pain
> point on a low memory setup without Arrow.
>
> Specifically I was finding a driver OOM running a toPandas on a small
> dataset (<100 MB compressed).  There was discussion about toPandas being
> slow
> 
> in March 2016 due to a self.collect().  A solution was found to create Pandas
> DataFrames or Numpy Arrays using MapPartitions for each partition
> , but it was never
> implemented back into dataframe.py
>
> I understand that using Apache arrow will solve this, but in a setup
> without Arrow (like the one where I faced the pain point), I investigated
> memory usage of toPandas and to_pandas (dataframe per partition) and played
> with the number of partitions.  The findings are here
> 
> .
>
> The summary of the findings is that on a 147MB dataset, toPandas memory
> usage was about 784MB, while doing it partition by partition (with 100
> partitions) had an overhead of 76.30 MB and took almost half of the time to
> run.  I realize that Arrow solves this but the modification is quite small
> and would greatly assist anyone who isn't able to use Arrow.
>
> Would a PR [1] from me to address this issue be welcome?
>
> Thanks,
>
> Andrew
>
> [1] From Josh's Gist
>
> import pandas as pd
>
> def _map_to_pandas(rdds):
>     """ Needs to be here due to pickling issues """
>     return [pd.DataFrame(list(rdds))]
>
> def toPandas(df, n_partitions=None):
>     """
>     Returns the contents of `df` as a local `pandas.DataFrame` in a speedy
>     fashion. The DataFrame is repartitioned if `n_partitions` is passed.
>     :param df:            pyspark.sql.DataFrame
>     :param n_partitions:  int or None
>     :return:              pandas.DataFrame
>     """
>     if n_partitions is not None:
>         df = df.repartition(n_partitions)
>     df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
>     df_pand = pd.concat(df_pand)
>     df_pand.columns = df.columns
>     return df_pand
>


Re: how to replace hdfs with a custom distributed fs ?

2017-11-11 Thread Reynold Xin
You can implement the Hadoop FileSystem API for your distributed java fs
and just plug into Spark using the Hadoop API.
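
A minimal sketch of the wiring: the class name my.pkg.MyFileSystem and the scheme "myfs" are placeholders for your own implementation of org.apache.hadoop.fs.FileSystem (open, create, listStatus, delete, etc.).

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("custom-fs-example")
  // spark.hadoop.* entries are copied into the Hadoop Configuration, and
  // fs.<scheme>.impl tells Hadoop which class serves that scheme.
  .config("spark.hadoop.fs.myfs.impl", "my.pkg.MyFileSystem")
  .getOrCreate()

spark.read.textFile("myfs://host/path/data.txt").show()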


On Sat, Nov 11, 2017 at 9:37 AM, Cristian Lorenzetto <
cristian.lorenze...@gmail.com> wrote:

> hi, i have my distributed java fs and i would like to implement my class
> for storing data in spark.
> How to do it? Is there an example of how to do it?
>


Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Reynold Xin
Why tie a maintenance release to a feature release? They are supposed to be
independent and we should be able to make a lot of maintenance releases as
needed.

On Thu, Nov 2, 2017 at 7:13 PM Sean Owen  wrote:

> The feature freeze is "mid November" :
> http://spark.apache.org/versioning-policy.html
> Let's say... Nov 15? any body have a better date?
>
> Although it'd be nice to get 2.2.1 out sooner than later in all events,
> and kind of makes sense to get out first, they need not go in order. It
> just might be distracting to deal with 2 at once.
>
> (BTW there was still one outstanding issue from the last release:
> https://issues.apache.org/jira/browse/SPARK-22401 )
>
> On Thu, Nov 2, 2017 at 6:06 PM Felix Cheung 
> wrote:
>
>> I think it will be great to set a feature freeze date for 2.3.0 first, as
>> a minor release. There are a few new stuff that would be good to have and
>> then we will likely need time to stabilize, before cutting RCs.
>>
>>


Re: [SS] Custom Sinks

2017-11-01 Thread Reynold Xin
They will probably both change, but I wouldn't block on the change if you
have an immediate need.


On Wed, Nov 1, 2017 at 10:41 AM, Anton Okolnychyi <
anton.okolnyc...@gmail.com> wrote:

> Hi all,
>
> I have a question about the future of custom data sinks in Structured
> Streaming. In particular, I want to know how continuous processing and the
> Datasource API V2 will impact them.
>
> Right now, it is possible to have custom data sinks via the current
> Datasource API (V1) by implementing StreamSinkProvider and Sink traits. I
> am wondering if this approach is planned to be the right way or it is
> better to wait for the Datasource API V2 and the final implementation of
> continuous processing.
>
> As far as I understand SPARK-20928, there will be changes in the Source
> API in SS. But what about the Sink API? Is it safe to implement it now?
>
> Best regards,
> Anton
>


Re: Jenkins upgrade/Test Parallelization & Containerization

2017-11-07 Thread Reynold Xin
My understanding is that AMP actually can provide more resources or adapt
changes, while ASF needs to manage 200+ projects and it's hard to
accommodate much. I could be wrong though.


On Tue, Nov 7, 2017 at 2:14 PM, Holden Karau  wrote:

> True, I think we've seen that the Amp Lab Jenkins needs to be more focused
> on running AMP Lab projects, and while I don't know how difficult the ASF
> Jenkins is I assume it might be an easier place to make changes going
> forward? (Of course this could be the grass is greener on the other side
> and I don't mean to say it's been hard to make changes on the AMP lab
> hardware, folks have been amazingly helpful - its just the projects on each
> have different needs).
>
> On Tue, Nov 7, 2017 at 12:52 PM, Sean Owen  wrote:
>
>> Faster tests would be great. I recall that the straightforward ways to
>> parallelize via Maven haven't worked because many tests collide with one
>> another. Is this about running each module's tests in a container? that
>> should work.
>>
>> I can see how this is becoming essential for repeatable and reliable
>> Python/R builds, which depend on the environment to a much greater extent
>> than the JVM does.
>>
>> I don't have a strong preference for AMPLab vs ASF builds. I suppose
>> using the ASF machinery is a little tidier. If it's got a later Jenkins
>> that's required, also a plus, but I assume updating AMPLab isn't so hard
>> here either. I think the key issue is which environment is easier to
>> control and customize over time.
>>
>>
>> On Wed, Nov 1, 2017 at 6:05 AM Xin Lu  wrote:
>>
>>> Hi everyone,
>>>
>>> I tried sending emails to this list and I'm not sure if it went through
>>> so I'm trying again.  Anyway, a couple months ago before I left Databricks
>>> I was working on a proof of concept that parallelized Spark tests on
>>> jenkins.  The way it worked was basically it build the spark jars and then
>>> ran all the tests in a docker container on a bunch of slaves in parallel.
>>> This cut the testing time down from 4 hours to approximately 1.5 hours.
>>> This required a newer version of jenkins and the Jenkins Pipeline plugin.
>>> I am wondering if it is possible to do this on amplab jenkins.  It looks
>>> like https://builds.apache.org/ has upgraded so Amplabs jenkins is a
>>> year or so behind.  I am happy to help with this project if it is something
>>> that people think is worthwhile.
>>>
>>> Thanks
>>>
>>> Xin
>>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Reynold Xin
The vote has passed with the following +1s:

Reynold Xin*
Debasish Das
Noman Khan
Wenchen Fan*
Matei Zaharia*
Weichen Xu
Vaquar Khan
Burak Yavuz
Xiao Li
Tom Graves*
Michael Armbrust*
Joseph Bradley*
Shixiong Zhu*


And the following +0s:

Sean Owen*


Thanks for the feedback!


On Wed, Nov 1, 2017 at 8:37 AM, Reynold Xin <r...@databricks.com> wrote:

> Earlier I sent out a discussion thread for CP in Structured Streaming:
>
> https://issues.apache.org/jira/browse/SPARK-20928
>
> It is meant to be a very small, surgical change to Structured Streaming to
> enable ultra-low latency. This is great timing because we are also
> designing and implementing data source API v2. If designed properly, we can
> have the same data source API working for both streaming and batch.
>
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Let's go ahead and design / implement the SPIP.
> +0: Don't really care.
> -1: I do not think this is a good idea for the following reasons.
>
>
>


[Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Reynold Xin
Earlier I sent out a discussion thread for CP in Structured Streaming:

https://issues.apache.org/jira/browse/SPARK-20928

It is meant to be a very small, surgical change to Structured Streaming to
enable ultra-low latency. This is great timing because we are also
designing and implementing data source API v2. If designed properly, we can
have the same data source API working for both streaming and batch.


Following the SPIP process, I'm putting this SPIP up for a vote.

+1: Let's go ahead and design / implement the SPIP.
+0: Don't really care.
-1: I do not think this is a good idea for the following reasons.


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-01 Thread Reynold Xin
I just replied.


On Wed, Nov 1, 2017 at 5:50 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Was there any answer to my question around the effect of changes to
> the sink api regarding access to underlying offsets?
>
> On Wed, Nov 1, 2017 at 11:32 AM, Reynold Xin <r...@databricks.com> wrote:
> > Most of those should be answered by the attached design sketch in the
> JIRA
> > ticket.
> >
> > On Wed, Nov 1, 2017 at 5:29 PM Debasish Das <debasish.da...@gmail.com>
> > wrote:
> >>
> >> +1
> >>
> >> Is there any design doc related to API/internal changes ? Will CP be the
> >> default in structured streaming or it's a mode in conjunction with
> exisiting
> >> behavior.
> >>
> >> Thanks.
> >> Deb
> >>
> >> On Nov 1, 2017 8:37 AM, "Reynold Xin" <r...@databricks.com> wrote:
> >>
> >> Earlier I sent out a discussion thread for CP in Structured Streaming:
> >>
> >> https://issues.apache.org/jira/browse/SPARK-20928
> >>
> >> It is meant to be a very small, surgical change to Structured Streaming
> to
> >> enable ultra-low latency. This is great timing because we are also
> designing
> >> and implementing data source API v2. If designed properly, we can have
> the
> >> same data source API working for both streaming and batch.
> >>
> >>
> >> Following the SPIP process, I'm putting this SPIP up for a vote.
> >>
> >> +1: Let's go ahead and design / implement the SPIP.
> >> +0: Don't really care.
> >> -1: I do not think this is a good idea for the following reasons.
> >>
> >>
> >>
> >
>


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Reynold Xin
Thanks Tom. I'd imagine more details belong either in a full design doc, or
a PR description. Might make sense to do an additional design doc, if there
is enough delta from the current sketch doc.


On Mon, Nov 6, 2017 at 7:29 AM, Tom Graves <tgraves...@yahoo.com> wrote:

> +1 for the idea and feature, but I think the design is definitely lacking
> detail on the internal changes needed and how the execution pieces work and
> the communication.  Are you planning on posting more of those details or
> were you just planning on discussing in PR?
>
> Tom
>
> On Wednesday, November 1, 2017, 11:29:21 AM CDT, Debasish Das <
> debasish.da...@gmail.com> wrote:
>
>
> +1
>
> Is there any design doc related to API/internal changes ? Will CP be the
> default in structured streaming or it's a mode in conjunction with
> exisiting behavior.
>
> Thanks.
> Deb
>
> On Nov 1, 2017 8:37 AM, "Reynold Xin" <r...@databricks.com> wrote:
>
> Earlier I sent out a discussion thread for CP in Structured Streaming:
>
> https://issues.apache.org/ jira/browse/SPARK-20928
> <https://issues.apache.org/jira/browse/SPARK-20928>
>
> It is meant to be a very small, surgical change to Structured Streaming to
> enable ultra-low latency. This is great timing because we are also
> designing and implementing data source API v2. If designed properly, we can
> have the same data source API working for both streaming and batch.
>
>
> Following the SPIP process, I'm putting this SPIP up for a vote.
>
> +1: Let's go ahead and design / implement the SPIP.
> +0: Don't really care.
> -1: I do not think this is a good idea for the following reasons.
>
>
>
>


Re: Dataset API Question

2017-10-25 Thread Reynold Xin
It is a bit more than syntactic sugar, but not much more:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533

BTW this is basically writing all the data out, and then creating a new
Dataset to load it back in.
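
In other words (illustrative spark-shell sketch; the checkpoint directory path is arbitrary):

spark.sparkContext.setCheckpointDir("/tmp/checkpoint-demo")
val df = spark.range(10).toDF("id")

// checkpoint() eagerly writes the data out (it triggers a job) and returns a
// *new* Dataset that reads back from the checkpoint; `df` itself keeps its
// original lineage, which is why df.rdd.isCheckpointed still reports false.
val checkpointed = df.checkpoint()
checkpointed.count()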


On Wed, Oct 25, 2017 at 6:51 AM, Bernard Jesop 
wrote:

> Hello everyone,
>
> I have a question about checkpointing on dataset.
>
> It seems in 2.1.0 that there is a Dataset.checkpoint(), however unlike RDD
> there is no Dataset.isCheckpointed().
>
> I wonder if Dataset.checkpoint is syntactic sugar for
> Dataset.rdd.checkpoint.
> When I do :
>
> Dataset.checkpoint; Dataset.count
> Dataset.rdd.isCheckpointed // result: false
>
> However, when I explicitly do:
> Dataset.rdd.checkpoint; Dataset.rdd.count
> Dataset.rdd.isCheckpointed // result: true
>
> Could someone explain this behavior to me, or provide some references?
>
> Best regards,
> Bernard
>


Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yup. Sounds great. This is something simple Spark can do and provide huge
value to the end users.


On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rb...@netflix.com> wrote:

> Would be great if it is something more turn-key.
>
> We can easily add the __repr__ and _repr_html_ methods and behavior to
> PySpark classes. We could also add a configuration property to determine
> whether the dataset evaluation is eager or not. That would make it turn-key
> for anyone running PySpark in Jupyter.
>
> For JVM languages, we could also add a dependency on jvm-repr and do the
> same thing.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> s/underestimated/overestimated/
>>
>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Marco,
>>>
>>> There is understanding how Spark works, and there is finding bugs early
>>> in their own program. One can perfectly understand how Spark works and
>>> still find it valuable to get feedback asap, and that's why we built eager
>>> analysis in the first place.
>>>
>>> Also I'm afraid you've significantly underestimated the level of
>>> technical sophistication of users. In many cases they struggle to get
>>> anything to work, and performance optimization of their programs is
>>> secondary to getting things working. As John Ousterhout says, "the greatest
>>> performance improvement of all is when a system goes from not-working to
>>> working".
>>>
>>> I really like Ryan's approach. Would be great if it is something more
>>> turn-key.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com>
>>> wrote:
>>>
>>>> I am not sure how this is useful. For students, it is important to
>>>> understand how Spark works. This can be critical in many decisions they have
>>>> to make (whether and what to cache, for instance) in order to have
>>>> performant Spark applications. Creating an eager execution mode probably can
>>>> help them get something running more easily, but it also lets them use Spark
>>>> while knowing less about how it works; thus they are likely to write worse
>>>> applications and to have more problems debugging any kind of problem
>>>> which may later (in production) occur (therefore affecting their experience
>>>> with the tool).
>>>>
>>>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>>>> execution, helping in the debugging phase. So they can achieve without a
>>>> big effort the same result, but with a big difference: they are aware of
>>>> what is really happening, which may help them later.
>>>>
>>>> Thanks,
>>>> Marco
>>>>
>>>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>>>
>>>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>>>> sessions. For anyone interested, this mode of interaction is really easy 
>>>>> to
>>>>> add in Jupyter and PySpark. You would just define a different
>>>>> *repr_html* or *repr* method for Dataset that runs a take(10) or
>>>>> take(100) and formats the result.
>>>>>
>>>>> That way, the output of a cell or console execution always causes the
>>>>> dataframe to run and get displayed for that immediate feedback. But, there
>>>>> is no change to Spark’s behavior because the action is run by the REPL, 
>>>>> and
>>>>> only when a dataframe is a result of an execution in order to display it.
>>>>> Intermediate results wouldn’t be run, but that gives users a way to avoid
>>>>> too many executions and would still support method chaining in the
>>>>> dataframe API (which would be horrible with an aggressive execution 
>>>>> model).
>>>>>
>>>>> There are ways to do this in JVM languages as well if you are using a
>>>>> Scala or Java interpreter (see jvm-repr
>>>>> <https://github.com/jupyter/jvm-repr>). This is actually what we do
>>>>> in our Spark-based SQL interpreter to display results.
>>>>>
>>>>> rb
>>>>> ​
>>>>>
>>>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
&

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
IIRC we switched all internals to UnsafeRow for simplicity. It is easier to
serialize UnsafeRows, compute hash codes, etc. At some point we had a bug
with unioning two plans producing different types of rows, so we forced the
conversion at input.

Can't your "wish" be satisfied by having the public API producing the
internals of UnsafeRow (without actually exposing UnsafeRow)?
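
For context, the UnsafeProjection copy referred to in the quoted message below looks roughly like the following sketch; this uses unstable catalyst internals and is illustrative only, not taken from any particular operator.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.StructType

def toUnsafe(schema: StructType, rows: Iterator[InternalRow]): Iterator[UnsafeRow] = {
  val proj = UnsafeProjection.create(schema)
  // The projection reuses its output buffer, so each row must be consumed
  // (or .copy()'d) before the next one is produced; this per-row conversion
  // is the kind of extra copy being discussed.
  rows.map(row => proj(row))
}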


On Tue, May 8, 2018 at 4:16 PM Ryan Blue <rb...@netflix.com> wrote:

> Is the goal to design an API so the consumers of the API can directly
> produce what Spark expects internally, to cut down perf cost?
>
> No. That has already been done. The problem on the API side is that it
> makes little sense to force implementers to create UnsafeRow when it almost
> certainly causes them to simply use UnsafeProjection and copy it. If
> we’re just making a copy and we can defer that copy to get better
> performance, why would we make implementations handle it? Instead, I think
> we should accept InternalRow from v2 data sources and copy to unsafe when
> it makes sense to do so: after filters are run and only if there isn’t
> another projection that will do it already.
>
> But I don’t want to focus on the v2 API for this. What I’m asking in this
> thread is what the intent is for the SQL engine. Is this an accident that
> nearly everything works with InternalRow? If we were to make a choice
> here, should we mandate that rows passed into the SQL engine must be
> UnsafeRow?
>
> Personally, I think it makes sense to say that everything should accept
> InternalRow, but produce UnsafeRow, with the understanding that UnsafeRow
> will usually perform better.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 4:09 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> What the internal operators do are strictly internal. To take one step
>> back, is the goal to design an API so the consumers of the API can directly
>> produce what Spark expects internally, to cut down perf cost?
>>
>>
>> On Tue, May 8, 2018 at 1:22 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> While moving the new data source API to InternalRow, I noticed a few odd
>>> things:
>>>
>>>- Spark scans always produce UnsafeRow, but that data is passed
>>>around as InternalRow with explicit casts.
>>>- Operators expect InternalRow and nearly all codegen works with
>>>InternalRow (I’ve tested this with quite a few queries.)
>>>- Spark uses unchecked casts
>>>
>>> <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L254>
>>>from InternalRow to UnsafeRow in places, assuming that data will be
>>>unsafe, even though that isn’t what the type system guarantees.
>>>
>>> To me, it looks like the idea was to code SQL operators to the abstract
>>> InternalRow so we can swap out the implementation, but ended up with a
>>> general assumption that rows will always be unsafe. This is the worst of
>>> both options: we can’t actually rely on everything working with
>>> InternalRow but code must still use it, until it is inconvenient and an
>>> unchecked cast gets inserted.
>>>
>>> The main question I want to answer is this: *what data format should
>>> SQL use internally?* What was the intent when building catalyst?
>>>
>>> The v2 data source API depends on the answer, but I also found that this
>>> introduces a significant performance penalty in Parquet (and probably other
>>> formats). A quick check on one of our tables showed a 6% performance hit
>>> caused by unnecessary copies from InternalRow to UnsafeRow. So if we
>>> can guarantee that all operators should support InternalRow, then there
>>> is an easy performance win that also simplifies the v2 data source API.
>>>
>>> rb
>>> ​
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
s/underestimated/overestimated/

On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com> wrote:

> Marco,
>
> There is understanding how Spark works, and there is finding bugs early in
> their own program. One can perfectly understand how Spark works and still
> find it valuable to get feedback asap, and that's why we built eager
> analysis in the first place.
>
> Also I'm afraid you've significantly underestimated the level of technical
> sophistication of users. In many cases they struggle to get anything to
> work, and performance optimization of their programs is secondary to
> getting things working. As John Ousterhout says, "the greatest performance
> improvement of all is when a system goes from not-working to working".
>
> I really like Ryan's approach. Would be great if it is something more
> turn-key.
>
>
>
>
>
>
> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com> wrote:
>
>> I am not sure how this is useful. For students, it is important to
>> understand how Spark works. This can be critical in many decisions they have
>> to make (whether and what to cache, for instance) in order to have
>> performant Spark applications. Creating an eager execution mode probably can
>> help them get something running more easily, but it also lets them use Spark
>> while knowing less about how it works; thus they are likely to write worse
>> applications and to have more problems debugging any kind of problem
>> which may later (in production) occur (therefore affecting their experience
>> with the tool).
>>
>> Moreover, as Ryan also mentioned, there are tools/ways to force the
>> execution, helping in the debugging phase. So they can achieve without a
>> big effort the same result, but with a big difference: they are aware of
>> what is really happening, which may help them later.
>>
>> Thanks,
>> Marco
>>
>> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>>
>>> At Netflix, we use Jupyter notebooks and consoles for interactive
>>> sessions. For anyone interested, this mode of interaction is really easy to
>>> add in Jupyter and PySpark. You would just define a different
>>> *repr_html* or *repr* method for Dataset that runs a take(10) or
>>> take(100) and formats the result.
>>>
>>> That way, the output of a cell or console execution always causes the
>>> dataframe to run and get displayed for that immediate feedback. But, there
>>> is no change to Spark’s behavior because the action is run by the REPL, and
>>> only when a dataframe is a result of an execution in order to display it.
>>> Intermediate results wouldn’t be run, but that gives users a way to avoid
>>> too many executions and would still support method chaining in the
>>> dataframe API (which would be horrible with an aggressive execution model).
>>>
>>> There are ways to do this in JVM languages as well if you are using a
>>> Scala or Java interpreter (see jvm-repr
>>> <https://github.com/jupyter/jvm-repr>). This is actually what we do in
>>> our Spark-based SQL interpreter to display results.
>>>
>>> rb
>>> ​
>>>
>>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> yeah we run into this all the time with new hires. they will send
>>>> emails explaining there is an error in the .write operation and they are
>>>> debugging the writing to disk, focusing on that piece of code :)
>>>>
>>>> unrelated, but another frequent cause for confusion is cascading
>>>> errors. like the FetchFailedException. they will be debugging the reducer
>>>> task not realizing the error happened before that, and the
>>>> FetchFailedException is not the root cause.
>>>>
>>>>
>>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Similar to the thread yesterday about improving ML/DL integration, I'm
>>>>> sending another email on what I've learned recently from Spark users. I
>>>>> recently talked to some educators that have been teaching Spark in their
>>>>> (top-tier) university classes. They are some of the most important users
>>>>> for adoption because of the multiplicative effect they have on the future
>>>>> generation.
>>>>>
>>>>> To my surprise the single biggest ask they want is to enable eager
>>>>> execution mode on 

eager execution and debuggability

2018-05-08 Thread Reynold Xin
Similar to the thread yesterday about improving ML/DL integration, I'm
sending another email on what I've learned recently from Spark users. I
recently talked to some educators that have been teaching Spark in their
(top-tier) university classes. They are some of the most important users
for adoption because of the multiplicative effect they have on the future
generation.

To my surprise the single biggest ask they want is to enable eager
execution mode on all operations for teaching and debuggability:

(1) Most of the students are relatively new to programming, and they need
multiple iterations to even get the most basic operation right. In these
cases, in order to trigger an error, they would need to explicitly add
actions, which is non-intuitive.

(2) If they don't add explicit actions to every operation and there is a
mistake, the error pops up somewhere later where an action is triggered.
This is in a different position from the code that causes the problem, and
difficult for students to correlate the two.

I suspect in the real world a lot of Spark users also struggle in similar
ways as these students. While eager execution is really not practical in
big data, in learning environments or in development against small, sampled
datasets it can be pretty helpful.


Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Marco,

There is understanding how Spark works, and there is finding bugs early in
their own program. One can perfectly understand how Spark works and still
find it valuable to get feedback asap, and that's why we built eager
analysis in the first place.

Also I'm afraid you've significantly underestimated the level of technical
sophistication of users. In many cases they struggle to get anything to
work, and performance optimization of their programs is secondary to
getting things working. As John Ousterhout says, "the greatest performance
improvement of all is when a system goes from not-working to working".

I really like Ryan's approach. Would be great if it is something more
turn-key.






On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com> wrote:

> I am not sure how this is useful. For students, it is important to
> understand how Spark works. This can be critical in many decisions they have
> to make (whether and what to cache, for instance) in order to have
> performant Spark applications. Creating an eager execution mode probably can
> help them get something running more easily, but it also lets them use Spark
> while knowing less about how it works; thus they are likely to write worse
> applications and to have more problems debugging any kind of problem
> which may later (in production) occur (therefore affecting their experience
> with the tool).
>
> Moreover, as Ryan also mentioned, there are tools/ways to force the
> execution, helping in the debugging phase. So they can achieve without a
> big effort the same result, but with a big difference: they are aware of
> what is really happening, which may help them later.
>
> Thanks,
> Marco
>
> 2018-05-08 21:37 GMT+02:00 Ryan Blue <rb...@netflix.com.invalid>:
>
>> At Netflix, we use Jupyter notebooks and consoles for interactive
>> sessions. For anyone interested, this mode of interaction is really easy to
>> add in Jupyter and PySpark. You would just define a different *repr_html*
>> or *repr* method for Dataset that runs a take(10) or take(100) and
>> formats the result.
>>
>> That way, the output of a cell or console execution always causes the
>> dataframe to run and get displayed for that immediate feedback. But, there
>> is no change to Spark’s behavior because the action is run by the REPL, and
>> only when a dataframe is a result of an execution in order to display it.
>> Intermediate results wouldn’t be run, but that gives users a way to avoid
>> too many executions and would still support method chaining in the
>> dataframe API (which would be horrible with an aggressive execution model).
>>
>> There are ways to do this in JVM languages as well if you are using a
>> Scala or Java interpreter (see jvm-repr
>> <https://github.com/jupyter/jvm-repr>). This is actually what we do in
>> our Spark-based SQL interpreter to display results.
>>
>> rb
>> ​
>>
>> On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> yeah we run into this all the time with new hires. they will send emails
>>> explaining there is an error in the .write operation and they are debugging
>>> the writing to disk, focusing on that piece of code :)
>>>
>>> unrelated, but another frequent cause for confusion is cascading errors.
>>> like the FetchFailedException. they will be debugging the reducer task not
>>> realizing the error happened before that, and the FetchFailedException is
>>> not the root cause.
>>>
>>>
>>> On Tue, May 8, 2018 at 2:52 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Similar to the thread yesterday about improving ML/DL integration, I'm
>>>> sending another email on what I've learned recently from Spark users. I
>>>> recently talked to some educators that have been teaching Spark in their
>>>> (top-tier) university classes. They are some of the most important users
>>>> for adoption because of the multiplicative effect they have on the future
>>>> generation.
>>>>
>>>> To my surprise the single biggest ask they want is to enable eager
>>>> execution mode on all operations for teaching and debuggability:
>>>>
>>>> (1) Most of the students are relatively new to programming, and they
>>>> need multiple iterations to even get the most basic operation right. In
>>>> these cases, in order to trigger an error, they would need to explicitly
>>>> add actions, which is non-intuitive.
>>>>
>>>> (2) If they don't add explicit actions to every operation and there is
>>>> a mis

Re: [DISCUSS] Spark SQL internal data: InternalRow or UnsafeRow?

2018-05-08 Thread Reynold Xin
What the internal operators do are strictly internal. To take one step
back, is the goal to design an API so the consumers of the API can directly
produce what Spark expects internally, to cut down perf cost?


On Tue, May 8, 2018 at 1:22 PM Ryan Blue  wrote:

> While moving the new data source API to InternalRow, I noticed a few odd
> things:
>
>- Spark scans always produce UnsafeRow, but that data is passed around
>as InternalRow with explicit casts.
>- Operators expect InternalRow and nearly all codegen works with
>InternalRow (I’ve tested this with quite a few queries.)
>- Spark uses unchecked casts
>
> 
>from InternalRow to UnsafeRow in places, assuming that data will be
>unsafe, even though that isn’t what the type system guarantees.
>
> To me, it looks like the idea was to code SQL operators to the abstract
> InternalRow so we can swap out the implementation, but ended up with a
> general assumption that rows will always be unsafe. This is the worst of
> both options: we can’t actually rely on everything working with
> InternalRow but code must still use it, until it is inconvenient and an
> unchecked cast gets inserted.
>
> The main question I want to answer is this: *what data format should SQL
> use internally?* What was the intent when building catalyst?
>
> The v2 data source API depends on the answer, but I also found that this
> introduces a significant performance penalty in Parquet (and probably other
> formats). A quick check on one of our tables showed a 6% performance hit
> caused by unnecessary copies from InternalRow to UnsafeRow. So if we can
> guarantee that all operators should support InternalRow, then there is an
> easy performance win that also simplifies the v2 data source API.
>
> rb
> ​
> --
> Ryan Blue
> Software Engineer
> Netflix
>


parser error?

2018-05-13 Thread Reynold Xin
Just saw this in one of my PR that's doc only:

[error] warning(154): SqlBase.g4:400:0: rule fromClause contains an
optional block with at least one alternative that can match an empty
string


Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
Hi all,

Xiangrui and I were discussing with a heavy Apache Spark user last week on
their experiences integrating machine learning (and deep learning)
frameworks with Spark and some of their pain points. Couple things were
obvious and I wanted to share our learnings with the list.

(1) Most organizations already use Spark for data plumbing and want to be
able to run their ML part of the stack on Spark as well (not necessarily
re-implementing all the algorithms but by integrating various frameworks
like tensorflow, mxnet with Spark).

(2) The integration is however painful, from the systems perspective:


   - Performance: data exchange between Spark and other frameworks is
   slow, because UDFs across process boundaries (with native code) are slow.
   This works much better now with Pandas UDFs (given a lot of the ML/DL
   frameworks are in Python). However, there might be some low hanging fruit
   gaps here.


   - Fault tolerance and execution model: Spark assumes fine-grained task
   recovery, i.e. if something fails, only that task is rerun. This doesn’t
   match the execution model of distributed ML/DL frameworks that are
   typically MPI-based, and rerunning a single task would lead to the entire
   system hanging. A whole stage needs to be re-run.


   - Accelerator-aware scheduling: The DL frameworks leverage GPUs and
   sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
   aware of those resources, leading to either over-utilizing the accelerators
   or under-utilizing the CPUs.


The good thing is that none of these seem very difficult to address (and we
have already made progress on one of them). Xiangrui has graciously
accepted the challenge to come up with solutions and SPIP to these.

Xiangrui - please also chime in if I didn’t capture everything.


Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
I don't think it's sufficient to have them in YARN (or any other services)
without Spark aware of them. If Spark is not aware of them, then there is
no way to really efficiently utilize these accelerators when you run
anything that require non-accelerators (which is almost 100% of the cases
in real world workloads).

For the other two, the point is not to implement all the ML/DL algorithms
in Spark, but make Spark integrate well with ML/DL frameworks. Otherwise
you will have the problems I described (super low performance when
exchanging data between Spark and ML/DL frameworks, and hanging issues with
MPI-based programs).


On Mon, May 7, 2018 at 10:05 PM Jörn Franke <jornfra...@gmail.com> wrote:

> Hadoop / Yarn 3.1 added GPU scheduling. 3.2 is planned to add FPGA
> scheduling, so it might be worth to have the last point generic that not
> only the Spark scheduler, but all supported schedulers can use GPU.
>
> For the other 2 points I just wonder if it makes sense to address this in
> the ml frameworks themselves or in Spark.
>
> On 8. May 2018, at 06:59, Xiangrui Meng <m...@databricks.com> wrote:
>
> Thanks Reynold for summarizing the offline discussion! I added a few
> comments inline. -Xiangrui
>
> On Mon, May 7, 2018 at 5:37 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Hi all,
>>
>> Xiangrui and I were discussing with a heavy Apache Spark user last week
>> on their experiences integrating machine learning (and deep learning)
>> frameworks with Spark and some of their pain points. Couple things were
>> obvious and I wanted to share our learnings with the list.
>>
>> (1) Most organizations already use Spark for data plumbing and want to be
>> able to run their ML part of the stack on Spark as well (not necessarily
>> re-implementing all the algorithms but by integrating various frameworks
>> like tensorflow, mxnet with Spark).
>>
>> (2) The integration is however painful, from the systems perspective:
>>
>>
>>- Performance: data exchange between Spark and other frameworks are
>>slow, because UDFs across process boundaries (with native code) are slow.
>>This works much better now with Pandas UDFs (given a lot of the ML/DL
>>frameworks are in Python). However, there might be some low hanging fruit
>>gaps here.
>>
>> The Arrow support behind Pands UDFs can be reused to exchange data with
> other frameworks. And one possibly performance improvement is to support
> pipelining when supplying data to other frameworks. For example, while
> Spark is pumping data from external sources into TensorFlow, TensorFlow
> starts the computation on GPUs. This would significant improve speed and
> resource utilization.
>
>>
>>- Fault tolerance and execution model: Spark assumes fine-grained
>>task recovery, i.e. if something fails, only that task is rerun. This
>>doesn’t match the execution model of distributed ML/DL frameworks that are
>>typically MPI-based, and rerunning a single task would lead to the entire
>>system hanging. A whole stage needs to be re-run.
>>
>> This is not only useful for integrating with 3rd-party frameworks, but
> also useful for scaling MLlib algorithms. One of my earliest attempts in
> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up with
> some compromised solutions. With the new execution model, we can set up a
> hybrid cluster and do all-reduce properly.
>
>
>>
>>- Accelerator-aware scheduling: The DL frameworks leverage GPUs and
>>sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
>>aware of those resources, leading to either over-utilizing the 
>> accelerators
>>or under-utilizing the CPUs.
>>
>>
>> The good thing is that none of these seem very difficult to address (and
>> we have already made progress on one of them). Xiangrui has graciously
>> accepted the challenge to come up with solutions and SPIP to these.
>>
>>
> I will do more home work, exploring existing JIRAs or creating new JIRAs
> for the proposal. We'd like to hear your feedback and past efforts along
> those directions if they were not fully captured by our JIRA.
>
>
>> Xiangrui - please also chime in if I didn’t capture everything.
>>
>>
>> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>
>


Re: eager execution and debuggability

2018-05-08 Thread Reynold Xin
Yes, it would be great if possible, but it’s non-trivial (might be impossible to
do in general; we already have stacktraces that point to line numbers when
an error occurs in UDFs, but clearly that’s not sufficient). Also, in
environments like the REPL it’s still more useful to show the error as soon as it
occurs, rather than showing it potentially 30 lines later.

On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> This may be technically impractical, but it would be fantastic if we could
> make it easier to debug Spark programs without needing to rely on eager
> execution. Sprinkling .count() and .checkpoint() at various points in my
> code is still a debugging technique I use, but it always makes me wish
> Spark could point more directly to the offending transformation when
> something goes wrong.
>
> Is it somehow possible to have each individual operator (is that the
> correct term?) in a DAG include metadata pointing back to the line of code
> that generated the operator? That way when an action triggers an error, the
> failing operation can point to the relevant line of code — even if it’s a
> transformation — and not just the action on the tail end that triggered the
> error.
>
> I don’t know how feasible this is, but addressing it would directly solve
> the issue of linking failures to the responsible transformation, as opposed
> to leaving the user to break up a chain of transformations with several
> debug actions. And this would benefit new and experienced users alike.
>
> Nick
>
> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue rb...@netflix.com.invalid
> <http://mailto:rb...@netflix.com.invalid> wrote:
>
> I've opened SPARK-24215 to track this.
>>
>> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yup. Sounds great. This is something simple Spark can do and provide
>>> huge value to the end users.
>>>
>>>
>>> On Tue, May 8, 2018 at 3:53 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Would be great if it is something more turn-key.
>>>>
>>>> We can easily add the __repr__ and _repr_html_ methods and behavior to
>>>> PySpark classes. We could also add a configuration property to determine
>>>> whether the dataset evaluation is eager or not. That would make it turn-key
>>>> for anyone running PySpark in Jupyter.
>>>>
>>>> For JVM languages, we could also add a dependency on jvm-repr and do
>>>> the same thing.
>>>>
>>>> rb
>>>> ​
>>>>
>>>> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> s/underestimated/overestimated/
>>>>>
>>>>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Marco,
>>>>>>
>>>>>> There is understanding how Spark works, and there is finding bugs
>>>>>> early in their own program. One can perfectly understand how Spark works
>>>>>> and still find it valuable to get feedback asap, and that's why we built
>>>>>> eager analysis in the first place.
>>>>>>
>>>>>> Also I'm afraid you've significantly underestimated the level of
>>>>>> technical sophistication of users. In many cases they struggle to get
>>>>>> anything to work, and performance optimization of their programs is
>>>>>> secondary to getting things working. As John Ousterhout says, "the 
>>>>>> greatest
>>>>>> performance improvement of all is when a system goes from not-working to
>>>>>> working".
>>>>>>
>>>>>> I really like Ryan's approach. Would be great if it is something more
>>>>>> turn-key.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido <marcogaid...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am not sure how this is useful. For students, it is important to
>>>>>>> understand how Spark works. This can be critical in many decisions they
>>>>>>> have to make (whether and what to cache, for instance) in order to have a
>>>>>>> performant Spark application. Creating an eager execution mode probably can
>>>>>>> help

Re: Documenting the various DataFrame/SQL join types

2018-05-08 Thread Reynold Xin
Would be great to document. Probably best with examples.

On Tue, May 8, 2018 at 6:13 AM Nicholas Chammas 
wrote:

> The documentation for DataFrame.join()
> 
> lists all the join types we support:
>
>- inner
>- cross
>- outer
>- full
>- full_outer
>- left
>- left_outer
>- right
>- right_outer
>- left_semi
>- left_anti
>
> Some of these join types are also listed on the SQL Programming Guide
> 
> .
>
> Is it obvious to everyone what all these different join types are? For
> example, I had never heard of a LEFT ANTI join until stumbling on it in the
> PySpark docs. It’s quite handy! But I had to experiment with it a bit just
> to understand what it does.
>
> I think it would be a good service to our users if we either documented
> these join types ourselves clearly, or provided a link to an external
> resource that documented them sufficiently. I’m happy to file a JIRA about
> this and do the work itself. It would be great if the documentation could
> be expressed as a series of simple doc tests, but brief prose describing
> how each join works would still be valuable.
>
> Does this seem worthwhile to folks here? And does anyone want to offer
> guidance on how best to provide this kind of documentation so that it’s
> easy to find by users, regardless of the language they’re using?
>
> Nick
> ​
>
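
As a starting point for such documentation, here is a small, hedged sketch (data and column names made up) of what the two less familiar join types do, assuming a spark-shell session where spark.implicits._ is in scope:

    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

    // left_semi: keep left rows that have a match on the right; right columns are dropped.
    left.join(right, Seq("id"), "left_semi").show()
    // +---+---+
    // | id|  l|
    // +---+---+
    // |  1|  a|
    // |  3|  c|
    // +---+---+

    // left_anti: keep left rows that have NO match on the right.
    left.join(right, Seq("id"), "left_anti").show()
    // +---+---+
    // | id|  l|
    // +---+---+
    // |  2|  b|
    // +---+---+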


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
I think that's what Xiangrui was referring to. Instead of retrying a single
task, retry the entire stage, and the entire stage of tasks need to be
scheduled all at once.


On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

>
>>
>>>- Fault tolerance and execution model: Spark assumes fine-grained
>>>task recovery, i.e. if something fails, only that task is rerun. This
>>>doesn’t match the execution model of distributed ML/DL frameworks that 
>>> are
>>>typically MPI-based, and rerunning a single task would lead to the entire
>>>system hanging. A whole stage needs to be re-run.
>>>
>>> This is not only useful for integrating with 3rd-party frameworks, but
>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>> ). But we ended up
>> with some compromised solutions. With the new execution model, we can set
>> up a hybrid cluster and do all-reduce properly.
>>
>>
> Is there a particular new execution model you are referring to or do we
> plan to investigate a new execution model ?  For the MPI-like model, we
> also need gang scheduling (i.e. schedule all tasks at once or none of them)
> and I dont think we have support for that in the scheduler right now.
>
>>
>>> --
>>
>> Xiangrui Meng
>>
>> Software Engineer
>>
>> Databricks Inc. [image: http://databricks.com] 
>>
>
>


Re: Integrating ML/DL frameworks with Spark

2018-05-08 Thread Reynold Xin
Yes, Nan, totally agree. To be on the same page, that's exactly what I
wrote wasn't it?

On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:

> besides that, one of the things which is needed by multiple frameworks is
> to schedule tasks in a single wave
>
> i.e.
>
> if some frameworks like xgboost/mxnet requires 50 parallel workers, Spark
> is desired to provide a capability to ensure that either we run 50 tasks at
> once, or we should quit the complete application/job after some timeout
> period
>
> Best,
>
> Nan
>
> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> I think that's what Xiangrui was referring to. Instead of retrying a
>> single task, retry the entire stage, and the entire stage of tasks need to
>> be scheduled all at once.
>>
>>
>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>>
>>>>
>>>>>- Fault tolerance and execution model: Spark assumes fine-grained
>>>>>task recovery, i.e. if something fails, only that task is rerun. This
>>>>>doesn’t match the execution model of distributed ML/DL frameworks that 
>>>>> are
>>>>>typically MPI-based, and rerunning a single task would lead to the 
>>>>> entire
>>>>>system hanging. A whole stage needs to be re-run.
>>>>>
>>>>> This is not only useful for integrating with 3rd-party frameworks, but
>>>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>>>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>>>> with some compromised solutions. With the new execution model, we can set
>>>> up a hybrid cluster and do all-reduce properly.
>>>>
>>>>
>>> Is there a particular new execution model you are referring to or do we
>>> plan to investigate a new execution model ?  For the MPI-like model, we
>>> also need gang scheduling (i.e. schedule all tasks at once or none of them)
>>> and I dont think we have support for that in the scheduler right now.
>>>
>>>>
>>>>> --
>>>>
>>>> Xiangrui Meng
>>>>
>>>> Software Engineer
>>>>
>>>> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>
>>>
>


Re: Running lint-java during PR builds?

2018-05-21 Thread Reynold Xin
Can we look into whether there is an sbt plugin that works, so that we can
put everything into one single builder?

On Mon, May 21, 2018 at 11:17 AM Dongjoon Hyun 
wrote:

> Thank you for reconsidering this, Hyukjin. :)
>
> Bests,
> Dongjoon.
>
>
> On Mon, May 21, 2018 at 9:20 AM, Marcelo Vanzin 
> wrote:
>
>> Is there a way to trigger it conditionally? e.g. only if the diff
>> touches java files.
>>
>> On Mon, May 21, 2018 at 9:17 AM, Felix Cheung 
>> wrote:
>> > One concern is with the volume of test runs on Travis.
>> >
>> > In ASF projects Travis could get significantly
>> > backed up since - if I recall - all of ASF shares one queue.
>> >
>> > At the number of PRs Spark has this could be a big issue.
>> >
>> >
>> > 
>> > From: Marcelo Vanzin 
>> > Sent: Monday, May 21, 2018 9:08:28 AM
>> > To: Hyukjin Kwon
>> > Cc: Dongjoon Hyun; dev
>> > Subject: Re: Running lint-java during PR builds?
>> >
>> > I'm fine with it. I tried to use the existing checkstyle sbt plugin
>> > (trying to fix SPARK-22269), but it depends on an ancient version of
>> > checkstyle, and I don't know sbt enough to figure out how to hack
>> > classpaths and class loaders when applying rules, so gave up.
>> >
>> > On Mon, May 21, 2018 at 1:47 AM, Hyukjin Kwon 
>> wrote:
>> >> I am going to open an INFRA JIRA if there's no explicit objection in
>> few
>> >> days.
>> >>
>> >> 2018-05-21 13:09 GMT+08:00 Hyukjin Kwon :
>> >>>
>> >>> I would like to revive this proposal. Travis CI. Shall we give this
>> try?
>> >>> I
>> >>> think it's worth trying it.
>> >>>
>> >>> 2016-11-17 3:50 GMT+08:00 Dongjoon Hyun :
>> 
>>  Hi, Marcelo and Ryan.
>> 
>>  That was the main purpose of my proposal about Travis.CI.
>>  IMO, that is the only way to achieve that without any harmful
>>  side-effect
>>  on Jenkins infra.
>> 
>>  Spark is already ready for that. Like AppVoyer, if one of you files
>> an
>>  INFRA jira issue to enable that, they will turn on that. Then, we can
>>  try it
>>  and see the result. Also, you can turn off easily again if you don't
>>  want.
>> 
>>  Without this, we will consume more community efforts. For example, we
>>  merged lint-java error fix PR seven hours ago, but the master branch
>>  still
>>  has one lint-java error.
>> 
>>  https://travis-ci.org/dongjoon-hyun/spark/jobs/176351319
>> 
>>  Actually, I've been monitoring the history here. (It's synced every
>> 30
>>  minutes.)
>> 
>>  https://travis-ci.org/dongjoon-hyun/spark/builds
>> 
>>  Could we give a change to this?
>> 
>>  Bests,
>>  Dongjoon.
>> 
>>  On 2016-11-15 13:40 (-0800), "Shixiong(Ryan) Zhu"
>>   wrote:
>>  > I remember it's because you need to run `mvn install` before
>> running
>>  > lint-java if the maven cache is empty, and `mvn install` is pretty
>>  > heavy.
>>  >
>>  > On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin <
>> van...@cloudera.com>
>>  > wrote:
>>  >
>>  > > Hey all,
>>  > >
>>  > > Is there a reason why lint-java is not run during PR builds? I
>> see
>>  > > it
>>  > > seems to be maven-only, is it really expensive to run after an
>> sbt
>>  > > build?
>>  > >
>>  > > I see a lot of PRs coming in to fix Java style issues, and those
>> all
>>  > > seem a little unnecessary. Either we're enforcing style checks or
>>  > > we're not, and right now it seems we aren't.
>>  > >
>>  > > --
>>  > > Marcelo
>>  > >
>>  > >
>>  > >
>> -
>>  > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>  > >
>>  > >
>>  >
>> 
>>  -
>>  To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> 
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > Marcelo
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>
>


Re: Optimizer rule ConvertToLocalRelation causes expressions to be eager-evaluated in Planning phase

2018-06-08 Thread Reynold Xin
But from the user's perspective, optimization is not run right? So it is
still lazy.


On Fri, Jun 8, 2018 at 12:35 PM Li Jin  wrote:

> Hi All,
>
> Sorry for the long email title. I am a bit surprised to find that the
> current optimizer rule "ConvertToLocalRelation" causes expressions to be
> eager-evaluated in planning phase, this can be demonstrated with the
> following code:
>
> scala> val myUDF = udf((x: String) => { println("UDF evaled"); "result" })
>
> myUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
> UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
>
>
> scala> val df = Seq(("test", 1)).toDF("s", "i").select(myUDF($"s"))
>
> df: org.apache.spark.sql.DataFrame = [UDF(s): string]
>
>
> scala> println(df.queryExecution.optimizedPlan)
>
> UDF evaled
>
> LocalRelation [UDF(s)#9]
>
>  This is somewhat unexpected to me because of Spark's lazy execution model.
>
> I am wondering if this behavior is by design?
>
> Thanks!
> Li
>
>
>
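
To illustrate the point about user-facing laziness, a minimal hedged sketch (spark-shell assumed): nothing is evaluated when the DataFrame is built; the UDF only runs when the plan is actually forced, either by an action or by explicitly asking for the optimized plan as in the message above.

    import org.apache.spark.sql.functions.udf
    import spark.implicits._   // already in scope in spark-shell

    val myUDF = udf((x: String) => { println("UDF evaled"); "result" })
    val df = Seq(("test", 1)).toDF("s", "i").select(myUDF($"s"))
    // nothing printed so far: building the DataFrame does not run the optimizer

    df.collect()
    // "UDF evaled" is printed here, when the query is planned and executed.
    // ConvertToLocalRelation folds the UDF during that planning step, but from
    // the user's point of view it still happens lazily, at action time.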


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Reynold Xin
The behavior change is not good...

On Thu, Jun 14, 2018 at 9:05 AM Li Jin  wrote:

> Ah, looks like it's this change:
>
> https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5
>
> It seems strange that by default Spark doesn't build with Hive but by
> default PySpark requires it...
>
> This might also be a behavior change to PySpark users that build Spark
> without Hive. The old behavior is "fall back to non-hive support" and the
> new behavior is "program won't start".
>
> On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:
>
>> I think you would have to build with the 'hive' profile? but if so that
>> would have been true for a while now.
>>
>>
>> On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:
>>
>>> Hey all,
>>>
>>> I just did a clean checkout of github.com/apache/spark but failed to
>>> start PySpark, this is what I did:
>>>
>>> git clone g...@github.com:apache/spark.git; cd spark; build/sbt package;
>>> bin/pyspark
>>>
>>> And got this exception:
>>>
>>> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>>>
>>> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>>>
>>> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>>>
>>> Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>>
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>>
>>> Setting default log level to "WARN".
>>>
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>>
>>> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
>>> UserWarning: Failed to initialize Spark session.
>>>
>>>   warnings.warn("Failed to initialize Spark session.")
>>>
>>> Traceback (most recent call last):
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py", line
>>> 41, in <module>
>>>
>>> spark = SparkSession._create_shell_session()
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
>>> line 564, in _create_shell_session
>>>
>>> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>>>
>>> TypeError: 'JavaPackage' object is not callable
>>>
>>> I also tried to delete hadoop deps from my ivy2 cache and reinstall them
>>> but no luck. I wonder:
>>>
>>>
>>>1. I have not seen this before, could this be caused by recent
>>>change to head?
>>>2. Am I doing something wrong in the build process?
>>>
>>>
>>> Thanks much!
>>> Li
>>>
>>>
>


Re: [VOTE] SPIP ML Pipelines in R

2018-06-14 Thread Reynold Xin
+1 on the proposal.


On Fri, Jun 1, 2018 at 8:17 PM Hossein  wrote:

> Hi Shivaram,
>
> We converged on a CRAN release process that seems identical to current
> SparkR.
>
> --Hossein
>
> On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Hossein -- Can you clarify what the resolution on the repository /
>> release issue discussed on SPIP ?
>>
>> Shivaram
>>
>> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
>> wrote:
>> > +1
>> > With my concerns in the SPIP discussion.
>> >
>> > 
>> > From: Hossein 
>> > Sent: Wednesday, May 30, 2018 2:03:03 PM
>> > To: dev@spark.apache.org
>> > Subject: [VOTE] SPIP ML Pipelines in R
>> >
>> > Hi,
>> >
>> > I started discussion thread for a new R package to expose MLlib
>> pipelines in
>> > R.
>> >
>> > To summarize we will work on utilities to generate R wrappers for MLlib
>> > pipeline API for a new R package. This will lower the burden for
>> exposing
>> > new API in future.
>> >
>> > Following the SPIP process, I am proposing the SPIP for a vote.
>> >
>> > +1: Let's go ahead and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I do not think this is a good idea for the following reasons.
>> >
>> > Thanks,
>> > --Hossein
>>
>
>


Re: time for Apache Spark 3.0?

2018-06-15 Thread Reynold Xin
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.


On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
wrote:

> I agree, I dont see pressing need for major version bump as well.
>
>
> Regards,
> Mridul
> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
> wrote:
> >
> > Changing major version numbers is not about new features or a vague
> notion that it is time to do something that will be seen to be a
> significant release. It is about breaking stable public APIs.
> >
> > I still remain unconvinced that the next version can't be 2.4.0.
> >
> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
> >>
> >> Dear all:
> >>
> >> It have been 2 months since this topic being proposed. Any progress
> now? 2018 has been passed about 1/2.
> >>
> >> I agree with that the new version should be some exciting new feature.
> How about this one:
> >>
> >> 6. ML/DL framework to be integrated as core component and feature.
> (Such as Angel / BigDL / ……)
> >>
> >> 3.0 is a very important version for an good open source project. It
> should be better to drift away the historical burden and focus in new area.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
> >>
> >> Andy
> >>
> >>
> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
> >>>
> >>> There was a discussion thread on scala-contributors about Apache Spark
> not yet supporting Scala 2.12, and that got me to think perhaps it is about
> time for Spark to work towards the 3.0 release. By the time it comes out,
> it will be more than 2 years since Spark 2.0.
> >>>
> >>> For contributors less familiar with Spark’s history, I want to give
> more context on Spark releases:
> >>>
> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016.
> If we were to maintain the ~ 2 year cadence, it is time to work on Spark
> 3.0 in 2018.
> >>>
> >>> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
> >>>
> >>> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
> >>>
> >>> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
> >>>
> >>>
> >>> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
> >>>
> >>>
> >>> The primary motivating factor IMO for a major version bump is to
> support Scala 2.12, which requires minor API breaking changes to Spark’s
> APIs. Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
> >>>
> >>> 1. Support Scala 2.12.
> >>>
> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
> >>>
> >>> 3. Shade all dependencies.
> >>>
> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
> >>>
> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
> >>>
> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temp

Re: [SQL] Purpose of RuntimeReplaceable unevaluable unary expressions?

2018-05-30 Thread Reynold Xin
SQL expressions?

On Wed, May 30, 2018 at 11:09 AM Jacek Laskowski  wrote:

> Hi,
>
> I've been exploring RuntimeReplaceable expressions [1] and have been
> wondering what their purpose is.
>
> Quoting the scaladoc [2]:
>
> > An expression that gets replaced at runtime (currently by the optimizer)
> into a different expression for evaluation. This is mainly used to provide
> compatibility with other databases.
>
> For example, ParseToTimestamp expression is a RuntimeReplaceable
> expression and it is replaced by Cast(left, TimestampType)
> or Cast(UnixTimestamp(left, format), TimestampType) per to_timestamp
> function (there are two variants).
>
> My question is why is this RuntimeReplaceable better than simply using the
> Casts as the implementation of to_timestamp functions?
>
> def to_timestamp(s: Column, fmt: String): Column = withExpr {
>   // pseudocode
>   Cast(UnixTimestamp(s.expr, Literal(fmt)), TimestampType)
> }
>
> What's wrong with the above implementation compared to the current one?
>
> [1]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L275
>
> [2]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L266-L267
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Reynold Xin
+1

On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.1.
>
> Given that I expect at least a few people to be busy with Spark Summit next
> week, I'm taking the liberty of setting an extended voting period. The vote
> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>
> It passes with a majority of +1 votes, which must include at least 3 +1
> votes
> from the PMC.
>
> [ ] +1 Release this package as Apache Spark 2.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
> https://github.com/apache/spark/tree/v2.3.1-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1272/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>
> The list of bug fixes going into 2.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.1?
> ===
>
> The current list of open tickets targeted at 2.3.1 can be found at:
> https://s.apache.org/Q3Uo
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.1 (RC3)

2018-06-01 Thread Reynold Xin
Yes everybody please cc the release manager on changes that merit -1. It's
high overhead and let's make this smoother.


On Fri, Jun 1, 2018 at 1:28 PM Marcelo Vanzin  wrote:

> Xiao,
>
> This is the third time in this release cycle that this is happening.
> Sorry to single out you guys, but can you please do two things:
>
> - do not merge things in 2.3 you're not absolutely sure about
> - make sure that things you backport to 2.3 are not causing problems
> - let the RM know about these things as soon as you discover them, not
> when they send the next RC for voting.
>
> Even though I was in the middle of preparing the rc, I could have
> easily aborted that and skipped this whole thread.
>
> This vote is canceled. I'll prepare a new RC right away. I hope this
> does not happen again.
>
>
> On Fri, Jun 1, 2018 at 1:20 PM, Xiao Li  wrote:
> > Sorry, I need to say -1
> >
> > This morning, just found a regression in 2.3.1 and reverted
> > https://github.com/apache/spark/pull/21443
> >
> > Xiao
> >
> > 2018-06-01 13:09 GMT-07:00 Marcelo Vanzin :
> >>
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 2.3.1.
> >>
> >> Given that I expect at least a few people to be busy with Spark Summit
> >> next
> >> week, I'm taking the liberty of setting an extended voting period. The
> >> vote
> >> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
> >>
> >> It passes with a majority of +1 votes, which must include at least 3 +1
> >> votes
> >> from the PMC.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.3.1
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.3.1-rc3 (commit 1cc5f68b):
> >> https://github.com/apache/spark/tree/v2.3.1-rc3
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc3-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1271/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc3-docs/
> >>
> >> The list of bug fixes going into 2.3.1 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks; in Java/Scala,
> >> you can add the staging repository to your project's resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with an out-of-date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.3.1?
> >> ===
> >>
> >> The current list of open tickets targeted at 2.3.1 can be found at:
> >> https://s.apache.org/Q3Uo
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >>
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >> --
> >> Marcelo
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: What about additional support on deeply nested data?

2018-06-20 Thread Reynold Xin
Seems like you are also looking for transform and reduce for arrays?

https://issues.apache.org/jira/browse/SPARK-23908

https://issues.apache.org/jira/browse/SPARK-23911
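
For concreteness, a hedged sketch of what those would enable on the schema described below, assuming the array higher-order functions tracked in the linked JIRAs (transform, aggregate/reduce) become available; the syntax and column names are illustrative, not final:

    // Project nested fields and aggregate over the points array per row,
    // without explode + groupBy; assumes the traj table from the message below.
    val perTraj = spark.table("traj").selectExpr(
      "id",
      // keep only the needed fields of each point, preserving the array structure
      "transform(points, p -> struct(p.lat, p.lng)) AS coords",
      // fold over the nested array to get a per-trajectory average speed
      "aggregate(points, CAST(0.0 AS DOUBLE), (acc, p) -> acc + p.speed) / size(points) AS avg_speed"
    )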

On Wed, Jun 20, 2018 at 10:43 AM bobotu  wrote:

> I store some trajectory data in Parquet with this schema:
>
> create table traj (
>   id   string,
>   points array<struct<
> lat:  double,
> lng: double,
> time:   bigint,
> speed: double,
> ... lots of attributes here
> candidate_roads: array<struct<linestring: string, score: double>>
>   >>
> )
>
> It contains a lot of attributes coming from sensors. It also has a nested
> array which contains information generated during the map-matching algorithm.
>
> All of the algorithms I run on this dataset are trajectory-oriented, which means
> they often iterate over points and use a subset of each point's attributes
> to do some computation. With this schema I can get the points of a trajectory
> without doing `group by` and `collect_list`.
>
> Because Parquet works very well on deeply nested data, I directly store
> it in Parquet format with no flattening.
> It works very well with Impala, because Impala has some special support on
> nested data:
>
> select
>   id,
>   avg_speed
> from
>   traj t,
>   (select avg(speed) avg_speed from t.points where time < '2018-06-19')
>
> As you can see, Impala treats an array of structs as a table nested in each
> row, and can do some computation on array elements at a per-row level. And Impala
> will use Parquet's features to prune unused attributes in the point struct.
>
> I use Spark for some complex algorithms which cannot be written in pure SQL.
> But I ran into some trouble with the Spark DataFrame API:
> 1. Spark cannot do schema pruning and filter push-down on nested columns.
> And it seems like there is no handy syntax to play with deeply nested data.
> 2. `explode` does not help in my scenario, because I need to preserve the
> trajectory-points hierarchy. If I use `explode` here, I need to do an extra
> `group by` on `id`.
> 3. Although I can directly select `points.lat`, it loses the structure.
> If I need an array of (lat, lng) pairs, I need to zip two arrays. And
> it does not work at deeper nested levels, such as selecting
> `points.candidate_road.score`.
> 4. Maybe I can use parquet-mr to read the file as an RDD, set the read schema and
> push down filters. But this manner loses the Hive integration and the table
> abstraction.
>
> So, I think it would be nice to have some additional support for nested data.
> Maybe an Impala-style subquery syntax on complex data, or something like a
> schema projection function on nested data like:
>
> select id, extract(points, lat, lng, extract(candidate_road, score)) from
> traj
>
> which produces a schema like:
>
> |- id string
> |- points array of struct
>    |- lat double
>    |- lng double
>    |- candidate_road array of struct
>        |- score double
>
> And users can play with the points using the desired schema, with data pruning
> done in Parquet.
>
> Or is there some existing syntax that can already do this?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Beam's recent community development work

2018-07-02 Thread Reynold Xin
That's fair, and it's great to find high quality contributors. But I also
feel the two projects have very different backgrounds and maturity phases.
There are 1300+ contributors to Spark, and only 300 to Beam, with the vast
majority of contributions coming from a single company for Beam (based on
my cursory look at the two pages of commits on github). With the recent
security and correctness storms, I actually worry about more quality (which
requires more infrastructure) than just people adding more code to the
project.



On Mon, Jul 2, 2018 at 5:25 PM Holden Karau  wrote:

> As someone who floats a bit between both projects (as a contributor) I'd
> love to see us adopt some of these techniques to be pro-active about
> growing our committer-ship (I think perhaps we could do this by also moving
> some of the newer committers into the PMC faster so there are more eyes out
> looking for people to bring forward)?
>
> On Mon, Jul 2, 2018 at 4:54 PM, Sean Owen  wrote:
>
>> Worth, I think, a read and consideration from Spark folks. I'd be
>> interested in comments; I have a few reactions too.
>>
>>
>> -- Forwarded message -
>> From: Kenneth Knowles 
>> Date: Sat, Jun 30, 2018 at 1:15 AM
>> Subject: Beam's recent community development work
>> To: , , Griselda Cuevas <
>> g...@apache.org>, dev 
>>
>>
>> Hi all,
>>
>> The ASF board suggested that we (Beam) share some of what we've been
>> doing for community development with d...@community.apache.org and
>> memb...@apache.org. So here is a long description. I have included
>> d...@beam.apache.org because it is the subject, really, and this is &
>> should be all public knowledge.
>>
>> We would love feedback! We based a lot of this on reading the community
>> project site, and probably could have learned even more with more study.
>>
>> # Background
>>
>> We face two problems in our contributor/committer-base:
>>
>> 1. Not enough committers to review all the code being contributed, in
>> part due to recent departure of a few committers
>> 2. We want our contributor-base (hence committer-base) to be more spread
>> across companies and backgrounds, for the usual Apache reasons. Our user
>> base is not active and varied enough to make this automatic. One solution
>> is to make the right software to get a varied user base, but that is a
>> different thread :-) so instead we have to work hard to build our community
>> around the software we have.
>>
>> # What we did
>>
>> ## Committer guidelines
>>
>> We published committer guidelines [1] for transparency and as an
>> invitation. We start by emphasizing that there are many kinds of
>> contributions, not just code (we have committers from community
>> development, tech writing, training, etc). Then we have three aspects:
>>
>> 1. ASF code of conduct
>> 2. ASF committer responsibilities
>> 3. Beam-specific committer responsibilities
>>
>> The best way to understand is to follow the link at the bottom of this
>> email. The important part is that you shouldn't be proposing a committer
>> for other reasons, and you shouldn't be blocking a committer for other
>> reasons.
>>
>> ## Instead of just "[DISCUSS] Potential committer XYZ" we discuss every
>> layer
>>
>> Gris (CC'd) outlined this: people go through these phases of relationship
>> with our project:
>>
>> 1. aware of it
>> 2. interested in it / checking it out
>> 3. using it for real
>> 4. first-time contributor
>> 5. repeat contributor
>> 6. committer
>> 7. PMC
>>
>> As soon as we notice someone, like a user asking really deep questions,
>> we invite discussion on private@ on how we can move them to the next
>> level of engagement.
>>
>> ## Monthly cadence
>>
>> Every ~month, we call for new discussions and revisit ~all prior
>> discussions. This way we do not forget to keep up this effort.
>>
>> ## Individual discussions
>>
>> For each person we have a separate thread on private@. This ensures we
>> have quality focused discussions that lead to feedback. In collective
>> discussions that we used to do, we often didn't really come up with
>> actionable feedback and ended up not even contacting potential committers
>> to encourage them. And consensus was much less clear.
>>
>> ## Feedback!
>>
>> If someone is brought up for a discussion, that means they got enough
>> attention that we hope to engage them more. But unsolicited feedback is
>> never a good idea. For a potential committer, we did this:
>>
>> 1. Send an email saying something like "you were discussed as a potential
>> committer - do you want to become one? do you want feedback?"
>> 2. If they say yes (so far everyone) we send a few bullet points from the
>> discussion and *most important* tie each bullet to the committer
>> guidelines. If we have no feedback about which guidelines were a concern,
>> that is a red flag that we are being driven by bias.
>>
>> We saw a *very* significant increase in engagement from those we sent
>> feedback to, and the trend is that they almost all will become 

Re: Feature request: Java-specific transform method in Dataset

2018-07-01 Thread Reynold Xin
This wouldn’t be a problem with Scala 2.12 right?

On Sun, Jul 1, 2018 at 12:23 PM Sean Owen  wrote:

> I see, transform() doesn't have the same overload that other methods do in
> order to support Java 8 lambdas as you'd expect. One option is to introduce
> something like MapFunction for transform and introduce an overload.
>
> I think transform() isn't used much at all, so maybe why it wasn't
> Java-fied. Before Java 8 it wouldn't have made much sense in Java. Now it
> might. I think it could be OK to add the overload to match how map works.
>
> On Sun, Jul 1, 2018 at 1:33 PM Ismael Carnales 
> wrote:
>
>> No, because Function1 from Scala is not a functional interface.
>> You can see a simple example of what I'm trying to accomplish In the unit
>> test here:
>>
>> https://github.com/void/spark/blob/java-transform/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java#L73
>>
>>
>> On Sun, Jul 1, 2018 at 2:48 PM Sean Owen  wrote:
>>
>>> Don't Java 8 lambdas let you do this pretty immediately? Can you give an
>>> example here of what you want to do and how you are trying to do it?
>>>
>>> On Sun, Jul 1, 2018, 12:42 PM Ismael Carnales 
>>> wrote:
>>>
 Hi,
  it would be nice to have an easier way to use the Dataset transform
 method from Java than implementing a Function1 from Scala.

 I've made a simple implentation here:

 https://github.com/void/spark/tree/java-transform

 Should I open a JIRA?

 Ismael Carnales

>>>
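
For reference, a hedged sketch of the shape such an overload could take, written here as an external helper rather than a change to Dataset itself (names are made up). Note that on Scala 2.12, as Reynold points out, a Java lambda can target scala.Function1 directly, so the existing signature already works from Java there.

    import org.apache.spark.sql.Dataset

    object JavaTransformHelper {
      // A Java 8 lambda can implement java.util.function.Function, which
      // scala.Function1 (on Scala 2.11) cannot accept directly from Java code.
      def transform[T, U](ds: Dataset[T],
                          f: java.util.function.Function[Dataset[T], Dataset[U]]): Dataset[U] =
        f.apply(ds)
    }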


Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-03 Thread Reynold Xin
Why do you need the underlying RDDs? Can't you just unpersist the
dataframes that you don't need?


On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas 
wrote:

> This seems to be an underexposed part of the API. My use case is this: I
> want to unpersist all DataFrames except a specific few. I want to do this
> because I know at a specific point in my pipeline that I have a handful of
> DataFrames that I need, and everything else is no longer needed.
>
> The problem is that there doesn’t appear to be a way to identify specific
> DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
> which is the only way I’m aware of to ask Spark for all currently persisted
> RDDs:
>
> >>> a = spark.range(10).persist()
> >>> a.rdd.id()
> 8
> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
> [(3, JavaObject id=o36)]
>
> As you can see, the id of the persisted RDD, 8, doesn’t match the id
> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
> returned by getPersistentRDDs() and know which ones I want to keep.
>
> id() itself appears to be an undocumented method of the RDD API, and in
> PySpark getPersistentRDDs() is buried behind the Java sub-objects, so I know I’m
> reaching here. But is there a way to do what I want in PySpark without
> manually tracking everything I’ve persisted myself?
>
> And more broadly speaking, do we want to add additional APIs, or formalize
> currently undocumented APIs like id(), to make this use case possible?
>
> Nick
> ​
>
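
In the absence of a public API for this, a minimal hedged sketch (Scala, names made up) of the bookkeeping Reynold suggests: track what you persist yourself, then unpersist everything outside a keep-set.

    import org.apache.spark.sql.DataFrame
    import scala.collection.mutable

    object CacheRegistry {
      // identity-based membership is fine here: Dataset does not define value equality
      private val persisted = mutable.Set.empty[DataFrame]

      def persist(df: DataFrame): DataFrame = {
        persisted += df
        df.persist()
      }

      def unpersistAllExcept(keep: Set[DataFrame]): Unit = {
        val toDrop = persisted.toSet -- keep
        toDrop.foreach(_.unpersist())
        persisted --= toDrop
      }
    }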


Re: Anyone knows how to build and spark on jdk9?

2017-10-26 Thread Reynold Xin
It probably depends on the Scala version we use in Spark supporting Java 9
first.

On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun  wrote:

> Hi all:
>
> 1.   I want to build Spark on JDK 9 and test it with Hadoop in a JDK 9
> env. I searched for JIRAs related to JDK 9 and only found SPARK-13278.
> Does this mean that Spark can now build and run successfully on JDK 9?
>
>
>
>
>
> Best Regards
>
> Kelly Zhang/Zhang,Liyun
>
>
>


Re: Spark-XML maintenance

2017-10-26 Thread Reynold Xin
Adding Hyukjin who has been maintaining it.

The easiest is probably to leave comments in the repo.

On Thu, Oct 26, 2017 at 9:44 AM Jörn Franke  wrote:

> I would address databricks with this issue - it is their repository
>
> > On 26. Oct 2017, at 18:43, comtef  wrote:
> >
> > I've used spark for a couple of years and I found a way to contribute to
> the
> > cause :).
> > I've found a blocker in Spark XML extension
> > (https://github.com/databricks/spark-xml). I'd like to know if this is
> the
> > right place to discuss issues about this extension?
> >
> > I've opened a PR to adress this problem but it's been open for a few
> months
> > now without any review...
> >
> >
> >
> > --
> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Result obtained before the completion of Stages

2017-12-27 Thread Reynold Xin
Is it possible there is a bug in the UI? If you can run jstack on the
executor process to see whether anything is actually running, that can help
narrow down the issue.

On Tue, Dec 26, 2017 at 10:28 PM ckhari4u  wrote:

> Hi Reynold,
>
> I am running a Spark SQL query.
>
> val df = spark.sql("select * from table1 t1 join table2 t2 on
> t1.col1=t2.col1")
> df.count()
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Integration testing and Scheduler Backends

2018-01-09 Thread Reynold Xin
If we can actually get our acts together and have integration tests in
Jenkins (perhaps not run on every commit but can be run weekly or
pre-release smoke tests), that'd be great. Then it relies less on
contributors manually testing.


On Tue, Jan 9, 2018 at 8:09 AM, Timothy Chen  wrote:

> 2) would be ideal, but given the velocity of the main branch, what Mesos
> ended up doing was simply having a separate repo since it would take
> too long to merge back to main.
>
> We ended up running it pre-release (or when a major PR was merged) and not on
> every PR. I will also comment on asking users to run it.
>
> We did have conversations with Reynold about potentially have the
> ability to run the CI on every [Mesos] tagged PR but we never got
> there.
>
> Tim
>
> On Mon, Jan 8, 2018 at 10:16 PM, Anirudh Ramanathan
>  wrote:
> > This is with regard to the Kubernetes Scheduler Backend and scaling the
> > process to accept contributions. Given we're moving past upstreaming
> changes
> > from our fork, and into getting new patches, I wanted to start this
> > discussion sooner than later. This is more of a post-2.3 question - not
> > something we're looking to solve right away.
> >
> > While unit tests are handy, they're not nearly as good at giving us
> > confidence as a successful run of our integration tests against
> > single/multi-node k8s clusters. Currently, we have integration testing
> setup
> > at https://github.com/apache-spark-on-k8s/spark-integration and it's
> running
> > continuously against apache/spark:master in pepperdata-jenkins (on
> minikube)
> > & k8s-testgrid (in GKE clusters). Now, the question is - how do we make
> > integration-tests part of the PR author's workflow?
> >
> > 1. Keep the integration tests in the separate repo and require that
> > contributors run them, add new tests prior to accepting their PRs as a
> > policy. Given minikube is easy to setup and can run on a single-node, it
> > would certainly be possible. Friction however, stems from contributors
> > potentially having to modify the integration test code hosted in that
> > separate repository when adding/changing functionality in the scheduler
> > backend. Also, it's certainly going to lead to at least brief
> > inconsistencies between the two repositories.
> >
> > 2. Alternatively, we check in the integration tests alongside the actual
> > scheduler backend code. This would work really well and is what we did in
> > our fork. It would have to be a separate package which would take certain
> > parameters (like cluster endpoint) and run integration test code against
> a
> > local or remote cluster. It would include least some code dealing with
> > accessing the cluster, reading results from K8s containers, test
> fixtures,
> > etc.
> >
> > I see value in adopting (2), given it's a clearer path for contributors
> and
> > lets us keep the two pieces consistent, but it seems uncommon elsewhere.
> How
> > do the other backends, i.e. YARN, Mesos and Standalone deal with
> accepting
> > patches and ensuring that they do not break existing clusters? Is there
> > automation employed for this thus far? Would love to get opinions on (1)
> v/s
> > (2).
> >
> > Thanks,
> > Anirudh
> >
> >
>


Re: Kryo 4 serialized form changes -- a problem?

2018-01-19 Thread Reynold Xin
I don’t think Spark relies on Kryo or Java for persistence. User programs
might though so it would be great if we can shade it.

On Fri, Jan 19, 2018 at 5:55 AM Sean Owen  wrote:

> See:
>
> https://issues.apache.org/jira/browse/SPARK-23131
> https://github.com/apache/spark/pull/20301#issuecomment-358473199
>
> I expected a major Kryo upgrade to be problematic, but it worked fine. It
> picks up a number of fixes:
> https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
>
> It might be good for Spark 2.4.
>
> Its serialized format isn't entirely compatible though. I'm trying to
> recall whether this is a problem in practice. We don't guarantee wire
> compatibility across mismatched Spark versions, right?
>
> But does the Kryo serialized form show up in any persistent stored form? I
> don't believe any normal output, even that of saveAsObjectFile, uses it.
>
> I'm wondering if I am not recalling why this would be a problem to update?
>
> Sean
>


Re: Spark 3

2018-01-19 Thread Reynold Xin
We can certainly provide a build for Scala 2.12, even in 2.x.


On Fri, Jan 19, 2018 at 10:17 AM, Justin Miller <
justin.mil...@protectwise.com> wrote:

> Would that mean supporting both 2.12 and 2.11? Could be a while before
> some of our libraries are off of 2.11.
>
> Thanks,
> Justin
>
>
> On Jan 19, 2018, at 10:53 AM, Koert Kuipers  wrote:
>
> i was expecting to be able to move to scala 2.12 sometime this year
>
> if this cannot be done in spark 2.x then that could be a compelling reason
> to move spark 3 up to 2018 i think
>
> hadoop 3 sounds great but personally i have no use case for it yet
>
> On Fri, Jan 19, 2018 at 12:31 PM, Sean Owen  wrote:
>
>> Forking this thread to muse about Spark 3. Like Spark 2, I assume it
>> would be more about making all those accumulated breaking changes and
>> updating lots of dependencies. Hadoop 3 looms large in that list as well as
>> Scala 2.12.
>>
>> Spark 1 was release in May 2014, and Spark 2 in July 2016. If Spark 2.3
>> is out in Feb 2018 and it takes the now-usual 6 months until a next
>> release, Spark 3 could reasonably be next.
>>
>> However the release cycles are naturally slowing down, and it could also
>> be said that 2019 would be more on schedule for Spark 3.
>>
>> Nothing particularly urgent about deciding, but I'm curious if anyone had
>> an opinion on whether to move on to Spark 3 next or just continue with 2.4
>> later this year.
>>
>> On Fri, Jan 19, 2018 at 11:13 AM Sean Owen  wrote:
>>
>>> Yeah, if users are using Kryo directly, they should be insulated from a
>>> Spark-side change because of shading.
>>> However this also entails updating (unshaded) Chill from 0.8.x to 0.9.x.
>>> I am not sure if that causes problems for apps.
>>>
>>> Normally I'd avoid any major-version change in a minor release. This one
>>> looked potentially entirely internal.
>>> I think if there are any doubts, we can leave it for Spark 3. There was
>>> a bug report that needed a fix from Kryo 4, but it might be minor after all.
>>>


>
>


Re: ***UNCHECKED*** [jira] [Resolved] (SPARK-23218) simplify ColumnVector.getArray

2018-01-26 Thread Reynold Xin
I have no idea. Some JIRA update? Might want to file an INFRA ticket.


On Fri, Jan 26, 2018 at 10:04 AM, Sean Owen  wrote:

> This is an example of the "*** UNCHECKED ***" message I was talking about
> -- it's part of the email subject rather than JIRA.
>
> -- Forwarded message -
> From: Xiao Li (JIRA) 
> Date: Fri, Jan 26, 2018 at 11:18 AM
> Subject: ***UNCHECKED*** [jira] [Resolved] (SPARK-23218) simplify
> ColumnVector.getArray
> To: 
>
>
>
>  [ https://issues.apache.org/jira/browse/SPARK-23218?page=
> com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Xiao Li resolved SPARK-23218.
> -
>Resolution: Fixed
> Fix Version/s: 2.3.0
>
> > simplify ColumnVector.getArray
> > --
> >
> > Key: SPARK-23218
> > URL: https://issues.apache.org/jira/browse/SPARK-23218
> > Project: Spark
> >  Issue Type: Sub-task
> >  Components: SQL
> >Affects Versions: 2.3.0
> >Reporter: Wenchen Fan
> >Assignee: Wenchen Fan
> >Priority: Major
> > Fix For: 2.3.0
> >
> >
>
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
> -
> To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
> For additional commands, e-mail: issues-h...@spark.apache.org
>
>


Re: What is "*** UNCHECKED ***"?

2018-01-26 Thread Reynold Xin
Examples?


On Fri, Jan 26, 2018 at 9:56 AM, Sean Owen  wrote:

> I probably missed this, but what is the new "*** UNCHECKED ***" message in
> the subject line of some JIRAs?
>

