Re: Back to SQL

2018-10-03 Thread Reynold Xin
No, we used to have that (for views), but it wasn’t working well enough, so we
removed it.
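
(An illustrative sketch, not from the original thread: there is no supported
plan-to-SQL conversion, but the optimized logical plan can still be inspected
as a plan object or as text. Assumes an active SparkSession named "spark";
illustration only.)

    // Build a small query and look at its optimized logical plan.
    val df = spark.range(100).filter("id % 2 = 0").groupBy().count()

    // The optimized plan is available programmatically, but only as a plan
    // tree / textual representation, not as a regenerated SQL string.
    println(df.queryExecution.optimizedPlan.treeString)
    df.explain(extended = true)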

On Wed, Oct 3, 2018 at 6:41 PM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> Is there any known way to go from a Spark SQL Logical Plan (optimised?)
> back to a SQL query?
>
> Regards,
>
> Olivier.
>
-- 
--
excuse the brevity and lower case due to wrist injury


Re: Random sampling in tests

2018-10-08 Thread Reynold Xin
I'm personally not a big fan of doing it that way in the PR. It is
perfectly fine to employ randomized tests, and in this case it might even
be fine to just pick a couple of different timezones, like the way it was
done in the PR, but we should:

1. Document in a code comment why we did it that way.

2. Use a seed and log the seed, so any test failures can be reproduced
deterministically. For this one, it'd be better to pick the seed from a
seed environment variable; if the env variable is not set, fall back to a
random seed.
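
A minimal sketch of point 2 (TEST_SEED follows the example used later in this
thread; the exact wiring into a test suite is an assumption for illustration):

    import scala.util.Random

    // Read the seed from an environment variable if present, otherwise pick a
    // random one, and always log it so a failure can be reproduced with e.g.
    // TEST_SEED=12345 build/sbt "testOnly ...".
    val seed: Long = sys.env.get("TEST_SEED").map(_.toLong)
      .getOrElse(new Random().nextLong())
    println(s"Using random seed $seed for timezone sampling")

    val rng = new Random(seed)
    val sampledTimeZones =
      rng.shuffle(java.util.TimeZone.getAvailableIDs.toSeq).take(2)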



On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:

> Recently, I've seen 3 pull requests that try to speed up a test suite
> that tests a bunch of cases by randomly choosing different subsets of
> cases to test on each Jenkins run.
>
> There's disagreement about whether this is good approach to improving
> test runtime. Here's a discussion on one that was committed:
> https://github.com/apache/spark/pull/22631/files#r223190476
>
> I'm flagging it for more input.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Random sampling in tests

2018-10-08 Thread Reynold Xin
Marco - the issue is reproducibility. It is much more annoying for somebody
else who might not have touched this test case to reproduce the error given
just a timezone. It is much easier to just follow some documentation saying
"please run TEST_SEED=5 build/sbt ~ ".


On Mon, Oct 8, 2018 at 4:33 PM Marco Gaido  wrote:

> Hi all,
>
> thanks for bringing up the topic, Sean. I agree with Reynold's idea too,
> but in this specific case, if there is an error, the timezone is part of the
> error message.
> So we know exactly which timezone caused the failure. Hence I thought that
> logging the seed is not necessary, as we can directly use the failing
> timezone.
>
> Thanks,
> Marco
>
> Il giorno lun 8 ott 2018 alle ore 16:24 Xiao Li 
> ha scritto:
>
>> For this specific case, I do not think we should test all the timezones.
>> If this is fast, I am fine with leaving it unchanged. However, this is very
>> slow. Thus, I would even prefer reducing the tested timezones to a smaller
>> number or just hardcoding some specific time zones.
>>
>> In general, I like Reynold’s idea of including the seed value and adding
>> the seed to the test case name. This can help us reproduce it.
>>
>> Xiao
>>
>> On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin  wrote:
>>
>>> I'm personally not a big fan of doing it that way in the PR. It is
>>> perfectly fine to employ randomized tests, and in this case it might even
>>> be fine to just pick a couple of different timezones, like the way it was
>>> done in the PR, but we should:
>>>
>>> 1. Document in a code comment why we did it that way.
>>>
>>> 2. Use a seed and log the seed, so any test failures can be reproduced
>>> deterministically. For this one, it'd be better to pick the seed from a
>>> seed environment variable; if the env variable is not set, fall back to a
>>> random seed.
>>>
>>>
>>>
>>> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen  wrote:
>>>
>>>> Recently, I've seen 3 pull requests that try to speed up a test suite
>>>> that tests a bunch of cases by randomly choosing different subsets of
>>>> cases to test on each Jenkins run.
>>>>
>>>> There's disagreement about whether this is good approach to improving
>>>> test runtime. Here's a discussion on one that was committed:
>>>> https://github.com/apache/spark/pull/22631/files#r223190476
>>>>
>>>> I'm flagging it for more input.
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


Re: Remove Flume support in 3.0.0?

2018-10-11 Thread Reynold Xin
Sounds like a good idea...

> On Oct 11, 2018, at 6:40 PM, Sean Owen  wrote:
> 
> Yep, that already exists as Bahir.
> Also, would anyone object to declaring Flume support at least
> deprecated in 2.4.0?
>> On Wed, Oct 10, 2018 at 2:29 PM Jörn Franke  wrote:
>> 
>> I think it makes sense to remove it.
>> If it is not too much effort, and the architecture of the Flume source is not
>> considered too strange, one may extract it as a separate project and put
>> it on GitHub in a dedicated non-supported repository. This would enable
>> distributors and other companies to continue to use it with minor adaptations
>> in case their architecture depends on it. Furthermore, if there is growing
>> interest, then one could pick it up and create a clean connector based on the
>> current Spark architecture, to be available as a dedicated connector or again
>> in later Spark versions.
>> 
>> That being said, there are also „indirect“ ways to use Flume with Spark (e.g.
>> via Kafka), so I believe people would not be affected so much by a removal.
>> 
>> (Non-Voting just my opinion)
>> 
>>> On 10.10.2018 at 22:31, Sean Owen  wrote:
>>> 
>>> Marcelo makes an argument that Flume support should be removed in
>>> 3.0.0 at https://issues.apache.org/jira/browse/SPARK-25598
>>> 
>>> I tend to agree. Is there an argument that it needs to be supported,
>>> and can this move to Bahir if so?
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> 
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Reynold Xin
We shouldn’t merge new features into release branches anymore.

On Tue, Oct 16, 2018 at 6:32 PM Rob Vesse  wrote:

> Right now the Kerberos support for Spark on K8S is only on master AFAICT
> i.e. the feature is not present on branch-2.4
>
>
>
> Therefore I don’t see any point in adding the tests into branch-2.4 unless
> the plan is to also merge the Kerberos support to branch-2.4
>
>
>
> Rob
>
>
>
> *From: *Erik Erlandson 
> *Date: *Tuesday, 16 October 2018 at 16:47
> *To: *dev 
> *Subject: *[DISCUSS][K8S][TESTS] Include Kerberos integration tests for
> Spark 2.4
>
>
>
> I'd like to propose including integration testing for Kerberos on the
> Spark 2.4 release:
>
> https://github.com/apache/spark/pull/22608
>
>
>
> Arguments in favor:
>
> 1) it improves testing coverage on a feature important for integrating
> with HDFS deployments
>
> 2) its intersection with existing code is small - it consists primarily of
> new testing code, with a bit of refactoring into 'main' and 'test'
> sub-trees. These new tests appear stable.
>
> 3) Spark 2.4 is still in RC, with outstanding correctness issues.
>
>
>
> The argument 'against' that I'm aware of would be the relatively large
> size of the PR. I believe this is considered above, but am soliciting
> community feedback before committing.
>
> Cheers,
>
> Erik
>
>
>


Re: some doubt on code understanding

2018-10-17 Thread Reynold Xin
Rounding.
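
(An illustrative sketch, not from the original message: integer division
truncates, so now / intervalMs * intervalMs floors "now" to the previous
multiple of intervalMs, and adding intervalMs then gives the next aligned
batch boundary, rather than simply now + intervalMs, which would generally
not be aligned.)

    val intervalMs = 1000L

    def nextBatchTime(now: Long): Long =
      if (intervalMs == 0) now else now / intervalMs * intervalMs + intervalMs

    // 1234 / 1000 * 1000 == 1000, so the next aligned boundary is 2000.
    assert(nextBatchTime(1234L) == 2000L)
    // now + intervalMs would instead give 2234, which is not on the interval grid.
    assert(1234L + intervalMs == 2234L)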

On Wed, Oct 17, 2018 at 6:25 PM Sandeep Katta <
sandeep0102.opensou...@gmail.com> wrote:

> Hi Guys,
>
> I am trying to understand structured streaming code flow by doing so I
> came across below code flow
>
> def nextBatchTime(now: Long): Long = {
>   if (intervalMs == 0) now else now / intervalMs * intervalMs + intervalMs
> }
>
>  else part could also have been written as
>
> now + intervalMs
>
> is there any specific reason why that code is written like that, or is it a
> mistake?
>
>
> Apologies upfront if this is really silly/basic question.
>
>
> Regards
>
> Sandeep Katta
>
>


Re: [discuss] replacing SPIP template with Heilmeier's Catechism?

2018-10-25 Thread Reynold Xin
I incorporated the feedbacks here and updated the SPIP page:
https://github.com/apache/spark-website/pull/156

The new version is live now:
https://spark.apache.org/improvement-proposals.html


On Fri, Aug 31, 2018 at 4:35 PM Ryan Blue  wrote:

> +1
>
> I think this is a great suggestion. I agree a bit with Sean, but I think
> it is really about mapping these questions into some of the existing
> structure. These are a great way to think about projects, but they're
> general and it would help to rephrase them for a software project, like
> Matei's comment on considering cost. Similarly, we might rephrase
> objectives to be goals/non-goals and add something to highlight that we
> expect absolutely no jargon. A design sketch is needed to argue how long it
> will take, what is new, and why it would be successful; adding these
> questions will help people understand how to go from that design sketch to
> an argument for that design. I think these will guide people to write
> proposals that are persuasive and well-formed.
>
> rb
>
> On Fri, Aug 31, 2018 at 4:17 PM Jules Damji  wrote:
>
>> +1
>>
>> One could argue that the litany of questions is really a
>> double-click on the essence: why, what, how. The three interrogatives ought
>> to be the essence and distillation of any proposal or technical exposition.
>>
>> Cheers
>> Jules
>>
>> Sent from my iPhone
>> Pardon the dumb thumb typos :)
>>
>> On Aug 31, 2018, at 11:23 AM, Reynold Xin  wrote:
>>
>> I helped craft the current SPIP template
>> <https://spark.apache.org/improvement-proposals.html> last year. I was
>> recently (re-)introduced to the Heilmeier Catechism, a set of questions
>> DARPA developed to evaluate proposals. The set of questions are:
>>
>> - What are you trying to do? Articulate your objectives using absolutely
>> no jargon.
>> - How is it done today, and what are the limits of current practice?
>> - What is new in your approach and why do you think it will be successful?
>> - Who cares? If you are successful, what difference will it make?
>> - What are the risks?
>> - How much will it cost?
>> - How long will it take?
>> - What are the mid-term and final “exams” to check for success?
>>
>> When I read the above list, it resonates really well because they are
>> almost always the same set of questions I ask myself and others before I
>> decide whether something is worth doing. In some ways, our SPIP template
>> tries to capture some of these (e.g. target persona), but are not as
>> explicit and well articulated.
>>
>> What do people think about replacing the current SPIP template with the
>> above?
>>
>> At a high level, I think the Heilmeier's Catechism emphasizes less about
>> the "how", and more the "why" and "what", which is what I'd argue SPIPs
>> should be about. The hows should be left in design docs for larger projects.
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-25 Thread Reynold Xin
I have some pretty serious concerns over this proposal. I agree that there
are many things that can be improved, but at the same time I also think the
cost of introducing a new IR in the middle is extremely high. Having
participated in designing some of the IRs in other systems, I've seen more
failures than successes. The failures typically come from two sources: (1)
in general it is extremely difficult to design IRs that are both expressive
enough and are simple enough; (2) typically another layer of indirection
increases the complexity a lot more, beyond the level of understanding and
expertise that most contributors can obtain without spending years in the
code base and learning about all the gotchas.

In either case, I'm not saying "no please don't do this". This is one of
those cases in which the devils are in the details that cannot be captured
by a high level document, and I want to explicitly express my concern here.




On Thu, Oct 25, 2018 at 12:10 AM Kazuaki Ishizaki 
wrote:

> Hi Xiao,
> Thank you very much for becoming a shepherd.
> If you feel the discussion settles, we would appreciate it if you would
> start a voting.
>
> Regards,
> Kazuaki Ishizaki
>
>
>
> From: Xiao Li 
> To: Kazuaki Ishizaki 
> Cc: dev , Takeshi Yamamuro <linguin@gmail.com>
> Date: 2018/10/22 16:31
> Subject: Re: SPIP: SPARK-25728 Structured Intermediate Representation
> (Tungsten IR) for generating Java code
> --
>
>
>
> Hi, Kazuaki,
>
> Thanks for your great SPIP! I am willing to be the shepherd of this SPIP.
>
> Cheers,
>
> Xiao
>
>
> On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> Hi Yamamuro-san,
> Thank you for your comments. This SPIP has received several valuable comments
> and feedback on the Google Doc:
> https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
> I hope that this SPIP could go forward based on these feedback.
>
> Based on this SPIP procedure
> (http://spark.apache.org/improvement-proposals.html), can I ask one or
> more PMCs to become a shepherd of this SPIP?
> I would appreciate your kindness and cooperation.
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From: Takeshi Yamamuro <linguin@gmail.com>
> To: Spark dev list <dev@spark.apache.org>
> Cc: ishiz...@jp.ibm.com
> Date: 2018/10/15 12:12
> Subject: Re: SPIP: SPARK-25728 Structured Intermediate Representation
> (Tungsten IR) for generating Java code
> --
>
>
>
> Hi, ishizaki-san,
>
> Cool activity, I left some comments on the doc.
>
> best,
> takeshi
>
>
> On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> Hello community,
>
> I am writing this e-mail in order to start a discussion about adding a
> structured intermediate representation for generating Java code from a
> program using the DataFrame or Dataset API, in addition to the current
> String-based representation.
> This addition is based on the discussions in a thread at
> https://github.com/apache/spark/pull/21537#issuecomment-413268196
>
> Please feel free to comment on the JIRA ticket or Google Doc.
>
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
> Google Doc:
> https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
>
> Looking forward to hearing your feedback
>
> Best Regards,
> Kazuaki Ishizaki
>
>
> --
> ---
> Takeshi Yamamuro
>
>
>
> --
>
> 
>
>


Re: DataSourceV2 hangouts sync

2018-10-25 Thread Reynold Xin
+1



On Thu, Oct 25, 2018 at 4:12 PM Li Jin  wrote:

> Although I am not specifically involved in DSv2, I think having this kind
> of meeting is definitely helpful to discuss, move certain effort forward
> and keep people on the same page. Glad to see this kind of working group
> happening.
>
> On Thu, Oct 25, 2018 at 5:58 PM John Zhuge  wrote:
>
>> Great idea!
>>
>> On Thu, Oct 25, 2018 at 1:10 PM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> There's been some great discussion for DataSourceV2 in the last few
>>> months, but it has been difficult to resolve some of the discussions and I
>>> don't think that we have a very clear roadmap for getting the work done.
>>>
>>> To coordinate better as a community, I'd like to start a regular sync-up
>>> over google hangouts. We use this in the Parquet community to have more
>>> effective community discussions about thorny technical issues and to get
>>> aligned on an overall roadmap. It is really helpful in that community and I
>>> think it would help us get DSv2 done more quickly.
>>>
>>> Here's how it works: people join the hangout, we go around the list to
>>> gather topics, have about an hour-long discussion, and then send a summary
>>> of the discussion to the dev list for anyone that couldn't participate.
>>> That way we can move topics along, but we keep the broader community in the
>>> loop as well for further discussion on the mailing list.
>>>
>>> I'll volunteer to set up the sync and send invites to anyone that wants
>>> to attend. If you're interested, please reply with the email address you'd
>>> like to put on the invite list (if there's a way to do this without
>>> specific invites, let me know). Also for the first sync, please note what
>>> times would work for you so we can try to account for people in different
>>> time zones.
>>>
>>> For the first one, I was thinking some day next week (time TBD by those
>>> interested) and starting off with a general roadmap discussion before
>>> diving into specific technical topics.
>>>
>>> Thanks,
>>>
>>> rb
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> John Zhuge
>>
>


Re: What if anything to fix about k8s for the 2.4.0 RC5?

2018-10-25 Thread Reynold Xin
I also think we should get this in:
https://github.com/apache/spark/pull/22841

It's to deprecate a confusing & broken window function API, so we can
remove it in 3.0 and redesign a better one. See
https://issues.apache.org/jira/browse/SPARK-25841 for more information.


On Thu, Oct 25, 2018 at 4:55 PM Sean Owen  wrote:

> Yep, we're going to merge a change to separate the k8s tests into a
> separate profile, and fix up the Scala 2.12 thing. While non-critical those
> are pretty nice to have for 2.4. I think that's doable within the next 12
> hours even.
>
> @skonto I think there's one last minor thing needed on this PR?
> https://github.com/apache/spark/pull/22838/files#r228363727
>
> On Thu, Oct 25, 2018 at 6:42 PM Wenchen Fan  wrote:
>
>> Any updates on this topic? https://github.com/apache/spark/pull/22827 is
>> merged and 2.4 is unblocked.
>>
>> I'll cut RC5 shortly after the weekend, and it will be great to include
>> the change proposed here.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Oct 26, 2018 at 12:55 AM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> I think it's worth getting in a change to just not enable this module,
 which ought to be entirely safe, and avoid two of the issues we
 identified.

>>>
>>> Besides disabling it, when someone wants to run the tests with 2.12 he
>>> should be able to do so. So propagating the Scala profile still makes sense
>>> but it is not related to the release other than making sure things work
>>> fine.
>>>
>>> On Thu, Oct 25, 2018 at 7:02 PM, Sean Owen  wrote:
>>>
 I think it's worth getting in a change to just not enable this module,
 which ought to be entirely safe, and avoid two of the issues we
 identified.
 that said it didn't block RC4 so need not block RC5.
 But should happen today if we're doing it.
 On Thu, Oct 25, 2018 at 10:47 AM Xiao Li  wrote:
 >
 > Hopefully, this will not delay RC5. Since this is not a blocker
 ticket, RC5 will start if all the blocker tickets are resolved.
 >
 > Thanks,
 >
 > Xiao
 >
 > Sean Owen  于2018年10月25日周四 上午8:44写道:
 >>
 >> Yes, I agree, and perhaps you are best placed to do that for 2.4.0
 RC5 :)
 >>
 >> On Thu, Oct 25, 2018 at 10:41 AM Stavros Kontopoulos
 >>  wrote:
 >> >
 >> > I agree these tests should be manual for now but should be run
 somehow before a release to make sure things are working right?
 >> >
 >> > For the other issue:
 https://issues.apache.org/jira/browse/SPARK-25835 .
 >> >
 >> >
 >> > On Thu, Oct 25, 2018 at 6:29 PM, Stavros Kontopoulos <
 stavros.kontopou...@lightbend.com> wrote:
 >> >>
 >> >> I will open a jira for the profile propagation issue and have a
 look to fix it.
 >> >>
 >> >> Stavros
 >> >>
 >> >> On Thu, Oct 25, 2018 at 6:16 PM, Erik Erlandson <
 eerla...@redhat.com> wrote:
 >> >>>
 >> >>>
 >> >>> I would be comfortable making the integration testing manual for
 now.  A JIRA for ironing out how to make it reliable for automatic as a
 goal for 3.0 seems like a good idea.
 >> >>>
 >> >>> On Thu, Oct 25, 2018 at 8:11 AM Sean Owen 
 wrote:
 >> 
 >>  Forking this thread.
 >> 
 >>  Because we'll have another RC, we could possibly address these
 two
 >>  issues. Only if we have a reliable change of course.
 >> 
 >>  Is it easy enough to propagate the -Pscala-2.12 profile? can't
 hurt.
 >> 
 >>  And is it reasonable to essentially 'disable'
 >>  kubernetes/integration-tests by removing it from the kubernetes
 >>  profile? it doesn't mean it goes away, just means it's run
 manually,
 >>  not automatically. Is that actually how it's meant to be used
 anyway?
 >>  in the short term? given the discussion around its requirements
 and
 >>  minikube and all that?
 >> 
 >>  (Actually, this would also 'solve' the Scala 2.12 build problem
 too)
 >> 
 >>  On Tue, Oct 23, 2018 at 2:45 PM Sean Owen 
 wrote:
 >>  >
 >>  > To be clear I'm currently +1 on this release, with much
 commentary.
 >>  >
 >>  > OK, the explanation for kubernetes tests makes sense. Yes I
 think we need to propagate the scala-2.12 build profile to make it work. Go
 for it, if you have a lead on what the change is.
 >>  > This doesn't block the release as it's an issue for tests,
 and only affects 2.12. However if we had a clean fix for this and there
 were another RC, I'd include it.
 >>  >
 >>  > Dongjoon has a good point about the
 spark-kubernetes-integration-tests artifact. That doesn't sound like it
 should be published in this way, though, of course, we publish the test
 artifacts from every module already. This is only a bit odd in being a
 non-t

Re: Drop support for old Hive in Spark 3.0?

2018-10-26 Thread Reynold Xin
People do use it, and the maintenance cost is pretty low, so I don't think
we should just drop it. We can be explicit that there is not a lot of
development going on and that we are unlikely to add a lot of new features to
it, and that users are also welcome to use other JDBC/ODBC endpoint
implementations built by the ecosystem, so the Spark project itself is not
pressured to continue adding a lot of features.


On Fri, Oct 26, 2018 at 10:48 AM Sean Owen  wrote:

> Maybe that's what I really mean (you can tell I don't follow the Hive part
> closely)
> In my travels, indeed the thrift server has been viewed as an older
> solution to a problem probably better met by others.
> From my perspective it's worth dropping, but, that's just anecdotal.
> Any other arguments for or against the thrift server?
>
> On Fri, Oct 26, 2018 at 12:30 PM Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> one big problem with getting rid of the Hive fork is the thriftserver,
>> which relies on the HiveServer from the Hive fork.
>> We might migrate to an apache/hive dependency, but I am not sure this would
>> help that much.
>> A broader topic would be whether it makes sense to have a
>> thriftserver directly in Spark at all. It has many well-known limitations (not
>> fault tolerant, no security/impersonation, etc.) and there are other
>> projects which aim to provide a thrift/JDBC interface to Spark. Just to
>> be clear, I am not proposing to remove the thriftserver in 3.0, but maybe it
>> is something we could evaluate in the long term.
>>
>> Thanks,
>> Marco
>>
>>
>> Il giorno ven 26 ott 2018 alle ore 19:07 Sean Owen  ha
>> scritto:
>>
>>> OK let's keep this about Hive.
>>>
>>> Right, good point, this is really about supporting metastore versions,
>>> and there is a good argument for retaining backwards-compatibility with
>>> older metastores. I don't know how far, but I guess, as far as is practical?
>>>
>>> Isn't there still a lot of Hive 0.x test code? is that something that's
>>> safe to drop for 3.0?
>>>
>>> And, basically, what must we do to get rid of the Hive fork? that seems
>>> like a must-do.
>>>
>>>
>>>
>>> On Fri, Oct 26, 2018 at 11:51 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, Sean and All.

 For the first question, we currently support only Hive Metastore 1.x ~ 2.x,
 and we can support Hive Metastore 3.0 simultaneously; Spark is designed
 that way.

 I don't think we need to drop old Hive Metastore support. Is it
 for avoiding Hive Metastore sharing between Spark 2 and Spark 3 clusters?

 I think we should allow those use cases, especially for new Spark 3
 clusters. What do you think?


 For the second question, Apache Spark 2.x doesn't support Hive
 officially. It's only a best-effort approach within the boundaries of Spark.


 http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality

 http://spark.apache.org/docs/latest/sql-programming-guide.html#incompatible-hive-udf


 Beyond the documented cases, decimal literals (HIVE-17186) cause query
 result differences even in a well-known benchmark like TPC-H.

 Bests,
 Dongjoon.

 PS. For Hadoop, let's have another thread if needed. I expect another
 long story. :)


 On Fri, Oct 26, 2018 at 7:11 AM Sean Owen  wrote:

> Here's another thread to start considering, and I know it's been
> raised before.
> What version(s) of Hive should Spark 3 support?
>
> If at least we know it won't include Hive 0.x, could we go ahead and
> remove those tests from master? It might significantly reduce the run time
> and flakiness.
>
> It seems that maintaining even the Hive 1.x fork is untenable going
> forward, right? does that also imply this support is almost certainly not
> maintained in 3.0?
>
> Per below, it seems like it might even be hard to both support Hive 3
> and Hadoop 2 at the same time?
>
> And while we're at it, what's the + and - for simply only supporting
> Hadoop 3 in Spark 3? Is the difference in client / HDFS API even that big?
> Or what about focusing only on Hadoop 2.9.x support + 3.x support?
>
> Lots of questions, just interested now in informal reactions, not a
> binding decision.
>
> On Thu, Oct 25, 2018 at 11:49 PM Dagang Wei 
> wrote:
>
>> Do we really want to switch to Hive 2.3? From this page
>> https://hive.apache.org/downloads.html, Hive 2.3 works with Hadoop
>> 2.x (Hive 3.x works with Hadoop 3.x).
>>
>> —
>> You are receiving this because you were mentioned.
>> Reply to this email directly, view it on GitHub
>> ,
>> or mute the thread
>> 
>> .
>>
>


Re: Helper methods for PySpark discussion

2018-10-28 Thread Reynold Xin
I agree - it is very easy for users to shoot themselves in the foot if we
don't put in the safeguards, or mislead them by giving them the impression
that operations are cheap. DataFrame in Spark isn't like a single node
in-memory data structure.

Note that the repr string work is very different: there it is off by
default and requires opt-in, and it is designed for a specific use case; see
my original email that proposed adding it.



On Sat, Oct 27, 2018 at 8:40 AM Leif Walsh  wrote:

> In the case of len, I think we should examine how python does iterators
> and generators. https://docs.python.org/3/library/collections.abc.html
>
> Iterators have __iter__ and __next__ but are not Sized so they don’t have
> __len__. If you ask for the len() of a generator (like len(x for x in
> range(10) if x % 2 == 0)) you get a reasonable error message and might
> respond by calling len(list(g)) if you know you can afford to materialize
> g’s contents. Of course, with a DataFrame materializing all the columns for
> all rows back on the python side is way more expensive than df.count(), so
> we don’t want to ever steer people to call len(list(df)), but I think
> len(df) having an expensive side effect would definitely surprise me.
>
> Perhaps we can consider the abstract base classes that DataFrames and RDDs
> should implement. I actually think it’s not many of them, we don’t have
> random access, or sizes, or even a cheap way to do set membership.
>
> For the case of len(), I think the best option is to show an error message
> that tells you to call count instead.
> On Fri, Oct 26, 2018 at 21:06 Holden Karau  wrote:
>
>> Ok so let's say you made a spark dataframe, you call length -- what do
>> you expect to happen?
>>
>> Personally, I expect Spark to evaluate the dataframe; this is what happens
>> with collections and even iterables.
>>
>> The interplay with cache is a bit strange, but presumably if you've
>> marked your Dataframe for caching you want to cache it (we don't
>> automatically mark Dataframes for caching outside of some cases inside ML
>> pipelines where this would not apply).
>>
>> On Fri, Oct 26, 2018, 10:56 AM Li Jin  wrote:
>>> > (2) If the method forces evaluation and this matches the most obvious way it
>>> would be implemented, then we should add it with a note in the docstring
>>>
>>> I am not sure about this, because forcing evaluation could be something
>>> that has side effects. For example, df.count() can realize a cache, and if we
>>> implement __len__ to call df.count(), then len(df) would end up populating
>>> some cache, which can be unintuitive.
>>>
>>> On Fri, Oct 26, 2018 at 1:21 PM Leif Walsh  wrote:
>>>
 That all sounds reasonable but I think in the case of 4 and maybe also
 3 I would rather see it implemented to raise an error message that explains
 what’s going on and suggests the explicit operation that would do the most
 equivalent thing. And perhaps raise a warning (using the warnings module)
 for things that might be unintuitively expensive.
 On Fri, Oct 26, 2018 at 12:15 Holden Karau 
 wrote:

> Coming out of https://github.com/apache/spark/pull/21654 it was
> agreed the helper methods in question made sense but there was some desire
> for a plan as to which helper methods we should use.
>
> I'd like to purpose a light weight solution to start with for helper
> methods that match either Pandas or general Python collection helper
> methods:
> 1) If the helper method doesn't collect the DataFrame back to the driver or
> force evaluation, then we should add it without discussion
> 2) If the method forces evaluation and this matches the most obvious way it
> would be implemented, then we should add it with a note in the docstring
> 3) If the method does collect the DataFrame back to the driver and
> that is the most obvious way it would be implemented (e.g. calling list to
> get back a list would have to collect the DataFrame), then we should add it
> with a warning in the docstring
> 4) If the method collects the DataFrame but a reasonable Python
> developer wouldn't expect that behaviour, then not implementing the helper
> method would be better
>
> What do folks think?
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
 --
 --
 Cheers,
 Leif

>>> --
> --
> Cheers,
> Leif
>


Re: [VOTE] SPARK 2.4.0 (RC5)

2018-10-31 Thread Reynold Xin
+1

Look forward to the release!



On Mon, Oct 29, 2018 at 3:22 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.0.
>
> The vote is open until November 1 PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.0-rc5 (commit
> 0a4c03f7d084f1d2aa48673b99f3b9496893ce8d):
> https://github.com/apache/spark/tree/v2.4.0-rc5
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1291
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc5-docs/
>
> The list of bug fixes going into 2.4.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12342385
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.4.0?
> ===
>
> The current list of open tickets targeted at 2.4.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: Removing non-deprecated R methods that were deprecated in Python, Scala?

2018-11-06 Thread Reynold Xin
Maybe deprecate now and remove in the next version? It is bad to just remove a
method without a deprecation notice.

On Tue, Nov 6, 2018 at 5:44 AM Sean Owen  wrote:

> See https://github.com/apache/spark/pull/22921#discussion_r230568058
>
> Methods like toDegrees, toRadians, approxCountDistinct were 'renamed'
> in Spark 2.1: deprecated, and replaced with an identical method with
> different name. However, these weren't actually deprecated in SparkR.
>
> Is it an oversight that we should just correct anyway by removing, to
> stay synced?
> Or deprecate and retain these in Spark 3.0.0?
>
> Sean
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Test and support only LTS JDK release?

2018-11-06 Thread Reynold Xin
What do OpenJDK and other non-Oracle VMs do? I know there were a lot of
discussions from Red Hat etc. about providing support.


On Tue, Nov 6, 2018 at 11:24 AM DB Tsai  wrote:

> Given Oracle's new 6-month release model, I feel the only realistic option
> is to only test and support JDK such as JDK 11 LTS and future LTS release.
> I would like to have a discussion on this in Spark community.
>
> Thanks,
>
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
>
>


Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-06 Thread Reynold Xin
Have we deprecated Scala 2.11 already in an existing release?

On Tue, Nov 6, 2018 at 4:43 PM DB Tsai  wrote:

> Ideally, supporting only Scala 2.12 in Spark 3 will be ideal.
>
> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
>
> > On Nov 6, 2018, at 2:55 PM, Felix Cheung 
> wrote:
> >
> > So to clarify, only scala 2.12 is supported in Spark 3?
> >
> >
> > From: Ryan Blue 
> > Sent: Tuesday, November 6, 2018 1:24 PM
> > To: d_t...@apple.com
> > Cc: Sean Owen; Spark Dev List; cdelg...@apple.com
> > Subject: Re: Make Scala 2.12 as default Scala version in Spark 3.0
> >
> > +1 to Scala 2.12 as the default in Spark 3.0.
> >
> > On Tue, Nov 6, 2018 at 11:50 AM DB Tsai  wrote:
> > +1 on dropping Scala 2.11 in Spark 3.0 to simplify the build.
> >
> > As Scala 2.11 will not support Java 11 unless we make a significant
> investment, if we decide not to drop Scala 2.11 in Spark 3.0, what we can
> do is have only the Scala 2.12 build support Java 11 while the Scala 2.11 build
> supports Java 8. But I agree with Sean that this can make the dependencies really
> complicated; hence I support dropping Scala 2.11 in Spark 3.0 directly.
> >
> > DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
> >
> >> On Nov 6, 2018, at 11:38 AM, Sean Owen  wrote:
> >>
> >> I think we should make Scala 2.12 the default in Spark 3.0. I would
> >> also prefer to drop Scala 2.11 support in 3.0. In theory, not dropping
> >> 2.11 support means we'd support Scala 2.11 for years, the lifetime
> >> of Spark 3.x. In practice, we could drop 2.11 support in a 3.1.0 or
> >> 3.2.0 release, kind of like what happened with 2.10 in 2.x.
> >>
> >> Java (9-)11 support also complicates this. I think getting it to work
> >> will need some significant dependency updates, and I worry not all
> >> will be available for 2.11 or will present some knotty problems. We'll
> >> find out soon if that forces the issue.
> >>
> >> Also note that Scala 2.13 is pretty close to release, and we'll want
> >> to support it soon after release, perhaps sooner than the long delay
> >> before 2.12 was supported (because it was hard!). It will probably be
> >> out well before Spark 3.0. Cross-compiling for 3 Scala versions sounds
> >> like too much. 3.0 could support 2.11 and 2.12, and 3.1 support 2.12
> >> and 2.13, or something. But if 2.13 support is otherwise attainable at
> >> the release of Spark 3.0, I wonder if that too argues for dropping
> >> 2.11 support.
> >>
> >> Finally I'll say that Spark itself isn't dropping 2.11 support for a
> >> while, no matter what; it still exists in the 2.4.x branch of course.
> >> People who can't update off Scala 2.11 can stay on Spark 2.x, note.
> >>
> >> Sean
> >>
> >>
> >> On Tue, Nov 6, 2018 at 1:13 PM DB Tsai  wrote:
> >>>
> >>> We made Scala 2.11 as default Scala version in Spark 2.0. Now, the
> next Spark version will be 3.0, so it's a great time to discuss should we
> make Scala 2.12 as default Scala version in Spark 3.0.
> >>>
> >>> Scala 2.11 is EOL, and it came out 4.5 years ago; as a result, it's unlikely
> to support JDK 11 in Scala 2.11 unless we're willing to sponsor the needed
> work, per the discussion in the Scala community:
> https://github.com/scala/scala-dev/issues/559#issuecomment-436160166
> >>>
> >>> We have initial support of Scala 2.12 in Spark 2.4. If we decide to
> make Scala 2.12 as default for Spark 3.0 now, we will have ample time to
> work on bugs and issues that we may run into.
> >>>
> >>> What do you think?
> >>>
> >>> Thanks,
> >>>
> >>> DB Tsai  |  Siri Open Source Technologies [not a contribution]  |  
> Apple, Inc
> >>>
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Did the 2.4 release email go out?

2018-11-08 Thread Reynold Xin
The website is already up, but I haven’t seen any email announcement yet.


Re: DataSourceV2 capability API

2018-11-08 Thread Reynold Xin
This is currently accomplished by having traits that data sources can
extend, as well as runtime exceptions right? It's hard to argue one way vs
another without knowing how things will evolve (e.g. how many different
capabilities there will be).


On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue  wrote:

> Hi everyone,
>
> I’d like to propose an addition to DataSourceV2 tables, a capability API.
> This API would allow Spark to query a table to determine whether it
> supports a capability or not:
>
> val table = catalog.load(identifier)
> val supportsContinuous = table.isSupported("continuous-streaming")
>
> There are a couple of use cases for this. First, we want to be able to
> fail fast when a user tries to stream a table that doesn’t support it. The
> design of our read implementation doesn’t necessarily support this. If we
> want to share the same “scan” across streaming and batch, then we need to
> “branch” in the API after that point, but that is at odds with failing
> fast. We could use capabilities to fail fast and not worry about that
> concern in the read design.
>
> I also want to use capabilities to change the behavior of some validation
> rules. The rule that validates appends, for example, doesn’t allow a write
> that is missing an optional column. That’s because the current v1 sources
> don’t support reading when columns are missing. But Iceberg does support
> reading a missing column as nulls, so that users can add a column to a
> table without breaking a scheduled job that populates the table. To fix
> this problem, I would use a table capability, like
> read-missing-columns-as-null.
>
> Any comments on this approach?
>
> rb
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Reynold Xin
Do you have a cached copy? I see it here

http://spark.apache.org/downloads.html



On Thu, Nov 8, 2018 at 4:12 PM Li Gao  wrote:

> this is wonderful !
> I noticed the official spark download site does not have 2.4 download
> links yet.
>
> On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde  wrote:
>
>> Great news.. thank you very much!
>>
>> On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com wrote:
>>
>>> Awesome!
>>>
>>> On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:
>>>
 Indeed!

 Sent from my iPhone
 Pardon the dumb thumb typos :)

 On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
 wrote:

 Finally, thank you all. Especially, thanks to the release manager,
 Wenchen!

 Bests,
 Dongjoon.


 On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan 
 wrote:

> + user list
>
> On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan 
> wrote:
>
>> resend
>>
>> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
>> wrote:
>>
>>>
>>>
>>> -- Forwarded message -
>>> From: Wenchen Fan 
>>> Date: Thu, Nov 8, 2018 at 10:55 PM
>>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>>> To: Spark dev list 
>>>
>>>
>>> Hi all,
>>>
>>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This
>>> release adds Barrier Execution Mode for better integration with deep
>>> learning frameworks, introduces 30+ built-in and higher-order functions 
>>> to
>>> deal with complex data type easier, improves the K8s integration, along
>>> with experimental Scala 2.12 support. Other major updates include the
>>> built-in Avro data source, Image data source, flexible streaming sinks,
>>> elimination of the 2GB block size limitation during transfer, Pandas UDF
>>> improvements. In addition, this release continues to focus on usability,
>>> stability, and polish while resolving around 1100 tickets.
>>>
>>> We'd like to thank our contributors and users for their
>>> contributions and early feedback to this release. This release would not
>>> have been possible without you.
>>>
>>> To download Spark 2.4.0, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> PS: If you see any issues with the release notes, webpage or
>>> published artifacts, please contact me directly off-list.
>>>
>>
>>>
>>>
>>>
>>>


Re: DataSourceV2 capability API

2018-11-09 Thread Reynold Xin
How do we deal with forward compatibility? Consider, Spark adds a new
"property". In the past the data source supports that property, but since
it was not explicitly defined, in the new version of Spark that data source
would be considered not supporting that property, and thus throwing an
exception.


On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue  wrote:

> I'd have two places. First, a class that defines properties supported and
> identified by Spark, like the SQLConf definitions. Second, in documentation
> for the v2 table API.
>
> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung 
> wrote:
>
>> One question is where will the list of capability strings be defined?
>>
>>
>> --
>> *From:* Ryan Blue 
>> *Sent:* Thursday, November 8, 2018 2:09 PM
>> *To:* Reynold Xin
>> *Cc:* Spark Dev List
>> *Subject:* Re: DataSourceV2 capability API
>>
>>
>> Yes, we currently use traits that have methods. Something like “supports
>> reading missing columns” doesn’t need to deliver methods. The other example
>> is where we don’t have an object to test for a trait (
>> scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown
>> done. That could be expensive so we can use a capability to fail faster.
>>
>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin  wrote:
>>
>>> This is currently accomplished by having traits that data sources can
>>> extend, as well as runtime exceptions right? It's hard to argue one way vs
>>> another without knowing how things will evolve (e.g. how many different
>>> capabilities there will be).
>>>
>>>
>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue 
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I’d like to propose an addition to DataSourceV2 tables, a capability
>>>> API. This API would allow Spark to query a table to determine whether it
>>>> supports a capability or not:
>>>>
>>>> val table = catalog.load(identifier)
>>>> val supportsContinuous = table.isSupported("continuous-streaming")
>>>>
>>>> There are a couple of use cases for this. First, we want to be able to
>>>> fail fast when a user tries to stream a table that doesn’t support it. The
>>>> design of our read implementation doesn’t necessarily support this. If we
>>>> want to share the same “scan” across streaming and batch, then we need to
>>>> “branch” in the API after that point, but that is at odds with failing
>>>> fast. We could use capabilities to fail fast and not worry about that
>>>> concern in the read design.
>>>>
>>>> I also want to use capabilities to change the behavior of some
>>>> validation rules. The rule that validates appends, for example, doesn’t
>>>> allow a write that is missing an optional column. That’s because the
>>>> current v1 sources don’t support reading when columns are missing. But
>>>> Iceberg does support reading a missing column as nulls, so that users can
>>>> add a column to a table without breaking a scheduled job that populates the
>>>> table. To fix this problem, I would use a table capability, like
>>>> read-missing-columns-as-null.
>>>>
>>>> Any comments on this approach?
>>>>
>>>> rb
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: DataSourceV2 capability API

2018-11-09 Thread Reynold Xin
"If there is no way to report a feature (e.g., able to read missing as
null) then there is no way for Spark to take advantage of it in the first
place"

Consider this (just a hypothetical scenario): We added "supports-decimal"
in the future, because we see a lot of data sources don't support decimal
and we want a more graceful error handling. That'd break all existing data
sources.

You can say we would never add any "existing" features to the feature list
in the future, as a requirement for the feature list. But then I'm
wondering how much it really gives you, beyond telling data sources to
throw exceptions when they don't support a specific operation.


On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue  wrote:

> Do you have an example in mind where we might add a capability and break
> old versions of data sources?
>
> These are really for being able to tell what features a data source has.
> If there is no way to report a feature (e.g., able to read missing as null)
> then there is no way for Spark to take advantage of it in the first place.
> For the uses I've proposed, forward compatibility isn't a concern. When we
> add a capability, we add handling for it that old versions wouldn't be able
> to use anyway. The advantage is that we don't have to treat all sources the
> same.
>
> On Fri, Nov 9, 2018 at 11:32 AM Reynold Xin  wrote:
>
>> How do we deal with forward compatibility? Consider, Spark adds a new
>> "property". In the past the data source supports that property, but since
>> it was not explicitly defined, in the new version of Spark that data source
>> would be considered not supporting that property, and thus throwing an
>> exception.
>>
>>
>> On Fri, Nov 9, 2018 at 9:11 AM Ryan Blue  wrote:
>>
>>> I'd have two places. First, a class that defines properties supported
>>> and identified by Spark, like the SQLConf definitions. Second, in
>>> documentation for the v2 table API.
>>>
>>> On Fri, Nov 9, 2018 at 9:00 AM Felix Cheung 
>>> wrote:
>>>
>>>> One question is where will the list of capability strings be defined?
>>>>
>>>>
>>>> --
>>>> *From:* Ryan Blue 
>>>> *Sent:* Thursday, November 8, 2018 2:09 PM
>>>> *To:* Reynold Xin
>>>> *Cc:* Spark Dev List
>>>> *Subject:* Re: DataSourceV2 capability API
>>>>
>>>>
>>>> Yes, we currently use traits that have methods. Something like
>>>> “supports reading missing columns” doesn’t need to deliver methods. The
>>>> other example is where we don’t have an object to test for a trait (
>>>> scan.isInstanceOf[SupportsBatch]) until we have a Scan with pushdown
>>>> done. That could be expensive so we can use a capability to fail faster.
>>>>
>>>> On Thu, Nov 8, 2018 at 1:54 PM Reynold Xin  wrote:
>>>>
>>>>> This is currently accomplished by having traits that data sources can
>>>>> extend, as well as runtime exceptions right? It's hard to argue one way vs
>>>>> another without knowing how things will evolve (e.g. how many different
>>>>> capabilities there will be).
>>>>>
>>>>>
>>>>> On Thu, Nov 8, 2018 at 12:50 PM Ryan Blue 
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I’d like to propose an addition to DataSourceV2 tables, a capability
>>>>>> API. This API would allow Spark to query a table to determine whether it
>>>>>> supports a capability or not:
>>>>>>
>>>>>> val table = catalog.load(identifier)
>>>>>> val supportsContinuous = table.isSupported("continuous-streaming")
>>>>>>
>>>>>> There are a couple of use cases for this. First, we want to be able
>>>>>> to fail fast when a user tries to stream a table that doesn’t support it.
>>>>>> The design of our read implementation doesn’t necessarily support this. 
>>>>>> If
>>>>>> we want to share the same “scan” across streaming and batch, then we need
>>>>>> to “branch” in the API after that point, but that is at odds with failing
>>>>>> fast. We could use capabilities to fail fast and not worry about that
>>>>>> concern in the read design.
>>>>>>
>>>>>> I also want to use capabilities to change the behavior of some
>>>>>> validation rules. The rule that validates appends, for example, doesn’t
>>>>>> allow a write that is missing an optional column. That’s because the
>>>>>> current v1 sources don’t support reading when columns are missing. But
>>>>>> Iceberg does support reading a missing column as nulls, so that users can
>>>>>> add a column to a table without breaking a scheduled job that populates 
>>>>>> the
>>>>>> table. To fix this problem, I would use a table capability, like
>>>>>> read-missing-columns-as-null.
>>>>>>
>>>>>> Any comments on this approach?
>>>>>>
>>>>>> rb
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
The master branch now tracks the 3.0.0-SNAPSHOT version, so the next release
will be 3.0. In terms of timeline, unless we change anything specifically, Spark
feature releases are on a 6-month cadence. Spark 2.4 was just released last
week, so 3.0 will be roughly 6 months from now.

On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:

> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
>
>
> *From: *Xiao Li 
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold Xin 
> *Cc: *Matei Zaharia , Ryan Blue <
> rb...@netflix.com>, Mark Hamstra , "
> u...@spark.apache.org" 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Yes. We should create a SPIP for each major breaking change.
>
>
>
> Reynold Xin  于2018年9月28日周五 下午11:05写道:
>
> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
>
> --
>
> excuse the brevity and lower case due to wrist injury
>
>
>
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:
>
> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> Matei Zaharia  于2018年9月6日周四 下午2:21写道:
>
> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> >
> > As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release
> something that we come to regret for us to create new API in a fully frozen
> state.
> >
> >
> > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
> wrote:
> > It would be great to get more features out incrementally. For
> experimental features, do we have more relaxed constraints?
> >
> > On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> > +1 on 3.0
> >
> > Dsv2 stable can still evolve in across major releases. DataFrame,
> Dataset, dsv1 and a lot of other major features all were developed
> throughout the 1.x and 2.x lines.
> >
> > I do want to explore ways for us to get dsv2 incremental changes out
> there more frequently, to get feedback. Maybe that means we apply additive
> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
> will start a separate thread about it.
> >
> >
> >
> > On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> > I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
> >
> > I generally 

Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
All API removal and deprecation JIRAs should be tagged "releasenotes", so
we can reference them when we build release notes. I don't know if
everybody is still following that practice, but it'd be great to do that.
Since we don't have that many PRs, we should still be able to retroactively
tag.

We can also add a new tag for API changes, but I feel at this stage it
might be easier to just use "releasenotes".


On Mon, Nov 12, 2018 at 3:49 PM Matt Cheah  wrote:

> I wanted to clarify what categories of APIs are eligible to be broken in
> Spark 3.0. Specifically:
>
>
>
>- Are we removing all deprecated methods? If we’re only removing some
>subset of deprecated methods, what is that subset? I see a bunch were
>removed in https://github.com/apache/spark/pull/22921 for example. Are
>we only committed to removing methods that were deprecated in some Spark
>version and earlier?
>- Aside from removing support for Scala 2.11, what other kinds of
>(non-experimental and non-evolving) APIs are eligible to be broken?
>- Is there going to be a way to track the current list of all proposed
>breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
>ticket that can be filtered down to somehow?
>
>
>
> Thanks,
>
>
>
> -Matt Cheah
>
> *From: *Vinoo Ganesh 
> *Date: *Monday, November 12, 2018 at 2:48 PM
> *To: *Reynold Xin 
> *Cc: *Xiao Li , Matei Zaharia <
> matei.zaha...@gmail.com>, Ryan Blue , Mark Hamstra <
> m...@clearstorydata.com>, dev 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Makes sense, thanks Reynold.
>
>
>
> *From: *Reynold Xin 
> *Date: *Monday, November 12, 2018 at 16:57
> *To: *Vinoo Ganesh 
> *Cc: *Xiao Li , Matei Zaharia <
> matei.zaha...@gmail.com>, Ryan Blue , Mark Hamstra <
> m...@clearstorydata.com>, dev 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Master branch now tracks the 3.0.0-SNAPSHOT version, so the next one will be
> 3.0. In terms of timeline, unless we change anything specifically, Spark
> feature releases are on a 6-mo cadence. Spark 2.4 was just released last
> week, so 3.0 will be roughly 6 months from now.
>
>
>
> On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:
>
> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
>
>
> *From: *Xiao Li 
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold Xin 
> *Cc: *Matei Zaharia , Ryan Blue <
> rb...@netflix.com>, Mark Hamstra , "
> u...@spark.apache.org" 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Yes. We should create a SPIP for each major breaking change.
>
>
>
> On Fri, Sep 28, 2018 at 11:05 PM Reynold Xin  wrote:
>
> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
>
> --
>
> excuse the brevity and lower case due to wrist injury
>
>
>
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:
>
> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> On Thu, Sep 6, 2018 at 2:21 PM Matei Zaharia  wrote:
>
> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java

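For concreteness, applying one of these stability levels is just an annotation on
the class or trait, using the InterfaceStability annotations linked above. A minimal
sketch; the trait below is made up purely for illustration:

  import org.apache.spark.annotation.InterfaceStability

  // Unstable: the API may change or be removed in any release, which is what
  // allows a new API to be iterated on quickly while feedback comes in.
  @InterfaceStability.Unstable
  trait ExperimentalV2Api {
    def doSomething(): Unit
  }
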
Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Reynold Xin
I used to, before each release during the RC phase, go through every single doc 
page to make sure we don't unintentionally leave things public. I no longer 
have time to do that, unfortunately. I found it very useful because I would always 
catch some mistakes that crept in through organic development.

> On Nov 13, 2018, at 8:00 PM, Wenchen Fan  wrote:
> 
> > Could you clarify what you mean here? Mima has some known limitations such 
> > as not handling "private[blah]" very well
> 
> Yes that's what I mean.
> 
> What I want to know here is which classes/methods we expect to be 
> private. I think things marked as "private[blabla]" are expected to be 
> private for sure; it's just that MiMa and the doc generator can't handle them well. 
> We can fix them later, probably by using the @Private annotation.
> 
> > seems like it's tracked by a bunch of exclusions in the Unidoc object
> 
> That's good. At least we have a clear definition about which packages are 
> meant to be private. We should make it consistent between MiMa and doc 
> generator though.
> 
>> On Wed, Nov 14, 2018 at 10:41 AM Marcelo Vanzin  wrote:
>> On Tue, Nov 13, 2018 at 6:26 PM Wenchen Fan  wrote:
>> > Recently I updated the MiMa exclusion rules, and found MiMa tracks some 
>> > private classes/methods unexpectedly.
>> 
>> Could you clarify what you mean here? Mima has some known limitations
>> such as not handling "private[blah]" very well (because that means
>> public in Java). Spark has (had?) this tool to generate an exclusions
>> file for Mima, but not sure how up-to-date it is.
>> 
>> > AFAIK, we have several rules:
>> > 1. everything which is really private that end users can't access, e.g. 
>> > package private classes, private methods, etc.
>> > 2. classes under certain packages. I don't know if we have a list, the 
>> > catalyst package is considered as a private package.
>> > 3. everything which has a @Private annotation.
>> 
>> That's my understanding of the scope of the rules.
>> 
>> (2) to me means "things that show up in the public API docs". That's,
>> AFAIK, tracked in SparkBuild.scala; seems like it's tracked by a bunch
>> of exclusions in the Unidoc object (I remember that being different in
>> the past).
>> 
>> (3) might be a limitation of the doc generation tool? Not sure if it's
>> easy to say "do not document classes that have @Private". At the very
>> least, that annotation seems to be missing the "@Documented"
>> annotation, which would make that info present in the javadoc. I do
>> not know if the scala doc tool handles that.
>> 
>> -- 
>> Marcelo
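
A quick illustration of the "private[blah]" limitation mentioned above: Scala's
package-private modifier has no JVM equivalent, so it compiles to public bytecode,
which is why MiMa and the Java-based doc tooling still see such members. The class
and method names below are made up for illustration:

  package org.apache.spark.sql

  // Visible only within org.apache.spark.sql at the Scala level, but emitted as a
  // public class in bytecode -- so binary-compatibility tools treat it as public API
  // unless it is excluded or annotated (e.g. with @Private).
  private[sql] class InternalHelper {
    private[sql] def doWork(): Unit = ()
  }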


Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Reynold Xin
Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise, so I want 
to put some pressure on it myself too.

On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen < sro...@apache.org > wrote:

> 
> 
> 
> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra
> noise.
> 
> 
> 
> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin < vanzin@ cloudera. com (
> van...@cloudera.com ) > wrote:
> 
> 
>> 
>> 
>> Hmm, it also seems that github comments are being sync'ed to jira. That's
>> gonna get old very quickly, we should probably ask infra to disable that
>> (if we can't do it ourselves).
>> On Mon, Dec 10, 2018 at 9:13 AM Sean Owen < srowen@ apache. org (
>> sro...@apache.org ) > wrote:
>> 
>> 
>>> 
>>> 
>>> Update for committers: now that my user ID is synced, I can successfully
>>> push to remote https:/ / github. com/ apache/ spark (
>>> https://github.com/apache/spark ) directly. Use that as the 'apache' remote
>>> (if you like; gitbox also works). I confirmed the sync works both ways.
>>> 
>>> 
>>> 
>>> As a bonus you can directly close pull requests when needed instead of
>>> using "Close Stale PRs" pull requests.
>>> 
>>> 
>>> 
>>> On Mon, Dec 10, 2018 at 10:30 AM Sean Owen < srowen@ apache. org (
>>> sro...@apache.org ) > wrote:
>>> 
>>> 
 
 
 Per the thread last week, the Apache Spark repos have migrated from https:/
 / git-wip-us. apache. org/ repos/ asf (
 https://git-wip-us.apache.org/repos/asf ) to
 https:/ / gitbox. apache. org/ repos/ asf (
 https://gitbox.apache.org/repos/asf )
 
 
 
 Non-committers:
 
 
 
 This just means repointing any references to the old repository to the new
 one. It won't affect you if you were already referencing https:/ / github.
 com/ apache/ spark ( https://github.com/apache/spark ).
 
 
 
 Committers:
 
 
 
 Follow the steps at https:/ / reference. apache. org/ committer/ github (
 https://reference.apache.org/committer/github ) to fully sync your ASF and
 Github accounts, and then wait up to an hour for it to finish.
 
 
 
 Then repoint your git-wip-us remotes to gitbox in your git checkouts. For
 our standard setup that works with the merge script, that should be your
 'apache' remote. For example here are my current remotes:
 
 
 
 $ git remote -v
 apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
 apache https://gitbox.apache.org/repos/asf/spark.git (push)
 apache-github git://github.com/apache/spark (fetch)
 apache-github git://github.com/apache/spark (push)
 origin https://github.com/srowen/spark (fetch)
 origin https://github.com/srowen/spark (push)
 upstream https://github.com/apache/spark (fetch)
 upstream https://github.com/apache/spark (push)
 
 
 
 In theory we also have read/write access to github. com (
 http://github.com/ ) now too, but right now it hadn't yet worked for me. It
 may need to sync. This note just makes sure anyone knows how to keep
 pushing commits right now to the new ASF repo.
 
 
 
 Report any problems here!
 
 
 
 Sean
 
 
>>> 
>>> 
>>> 
>>> - To
>>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>>> dev-unsubscr...@spark.apache.org )
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Marcelo
>> 
>> 
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Apache Spark git repo moved to gitbox.apache.org

2018-12-11 Thread Reynold Xin
I filed a ticket: https://issues.apache.org/jira/browse/INFRA-17403

Please add your support there.

On Tue, Dec 11, 2018 at 4:58 PM, Sean Owen < sro...@apache.org > wrote:

> 
> I asked on the original ticket at https:/ / issues. apache. org/ jira/ browse/
> INFRA-17385 ( https://issues.apache.org/jira/browse/INFRA-17385 ) but no
> follow-up. Go ahead and open a new INFRA ticket.
> 
> On Tue, Dec 11, 2018 at 6:20 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> Thanks, Sean. Which INFRA ticket is it? It's creating a lot of noise so I
>> want to put some pressure myself there too.
>> 
>> 
>> 
>> On Mon, Dec 10, 2018 at 9:51 AM, Sean Owen < srowen@ apache. org (
>> sro...@apache.org ) > wrote:
>> 
>>> 
>>> 
>>> Agree, I'll ask on the INFRA ticket and follow up. That's a lot of extra
>>> noise.
>>> 
>>> 
>>> 
>>> On Mon, Dec 10, 2018 at 11:37 AM Marcelo Vanzin < vanzin@ cloudera. com (
>>> van...@cloudera.com ) > wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> Hmm, it also seems that github comments are being sync'ed to jira. That's
>>>> gonna get old very quickly, we should probably ask infra to disable that
>>>> (if we can't do it ourselves).
>>>> On Mon, Dec 10, 2018 at 9:13 AM Sean Owen < srowen@ apache. org (
>>>> sro...@apache.org ) > wrote:
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Update for committers: now that my user ID is synced, I can successfully
>>>>> push to remote https:/ / github. com/ apache/ spark (
>>>>> https://github.com/apache/spark ) directly. Use that as the 'apache' 
>>>>> remote
>>>>> (if you like; gitbox also works). I confirmed the sync works both ways.
>>>>> 
>>>>> 
>>>>> 
>>>>> As a bonus you can directly close pull requests when needed instead of
>>>>> using "Close Stale PRs" pull requests.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Dec 10, 2018 at 10:30 AM Sean Owen < srowen@ apache. org (
>>>>> sro...@apache.org ) > wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Per the thread last week, the Apache Spark repos have migrated from 
>>>>>> https:/
>>>>>> / git-wip-us. apache. org/ repos/ asf (
>>>>>> https://git-wip-us.apache.org/repos/asf ) to
>>>>>> https:/ / gitbox. apache. org/ repos/ asf (
>>>>>> https://gitbox.apache.org/repos/asf )
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Non-committers:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> This just means repointing any references to the old repository to the 
>>>>>> new
>>>>>> one. It won't affect you if you were already referencing https:/ / 
>>>>>> github.
>>>>>> com/ apache/ spark ( https://github.com/apache/spark ).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Committers:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Follow the steps at https:/ / reference. apache. org/ committer/ github (
>>>>>> https://reference.apache.org/committer/github ) to fully sync your ASF 
>>>>>> and
>>>>>> Github accounts, and then wait up to an hour for it to finish.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Then repoint your git-wip-us remotes to gitbox in your git checkouts. For
>>>>>> our standard setup that works with the merge script, that should be your
>>>>>> 'apache' remote. For example here are my current remotes:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> $ git remote -v
>>>>>> apache https://gitbox.apache.org/repos/asf/spark.git (fetch)
>>>>>> apache https://gitbox.apache.org/repos/asf/spark.git (push)
>>>>>> apache-github git://github.com/apache/spark (fetch)
>>>>>> apache-github git://github.com/apache/spark (push)
>>>>>> origin https://github.com/srowen/spark (fetch)
>>>>>> origin https://github.com/srowen/spark (push)
>>>>>> upstream https://github.com/apache/spark (fetch)
>>>>>> upstream https://github.com/apache/spark (push)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> In theory we also have read/write access to github. com (
>>>>>> http://github.com/ ) now too, but right now it hadn't yet worked for me. 
>>>>>> It
>>>>>> may need to sync. This note just makes sure anyone knows how to keep
>>>>>> pushing commits right now to the new ASF repo.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Report any problems here!
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Sean
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> - To
>>>>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>>>>> dev-unsubscr...@spark.apache.org )
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Marcelo
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> - To
>>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>>> dev-unsubscr...@spark.apache.org )
>>> 
>>> 
>>> 
>> 
>> 
> 
>

dsv2 remaining work

2018-12-12 Thread Reynold Xin
Unfortunately I can't make it to the DSv2 sync today. Sending an email with my 
thoughts instead. I spent a few hours thinking about this. It's evident that 
progress has been slow, because this is an important API and people from 
different perspectives have very different requirements, and the priorities are 
weighted very differently (e.g. issues that are super important to one person might 
not be as important to another, and people just talk past each other arguing about why 
one ignored a broader issue in a PR or proposal).

I think the only real way to make progress is to decouple the efforts into 
major areas, and make progress somewhat independently. Of course, some care is 
needed to take care of

Here's one attempt at listing some of the remaining big rocks:

1. Basic write API -- with the current SaveMode.

2. Add Overwrite (or Replace) logical plan, and the associated API in Table.

3. Add APIs for per-table metadata operations (note that I'm not calling it a 
catalog API here). Create/drop/alter table goes here. We also need to figure 
out how to do this for the file system sources in which there is no underlying 
catalog. One idea is to treat the file system as a catalog (with arbitrary 
levels of databases). To do that, it'd be great if the identifier for a table 
is not a fixed 2 or 3 part name, but just a string array.

4. Remove SaveMode. This is blocked on at least 1 + 2, and potentially 3.

5. Design a stable, fast, smaller surface row format to replace the existing 
InternalRow (and all the internal data types), which is internal and unstable. 
This can be further decoupled into the design for each data type.

The above are the big ones I can think of. I probably missed some, but a lot of 
other smaller things can be improved on later.

removing most of the config functions in SQLConf?

2018-12-13 Thread Reynold Xin
In SQLConf, for each config option, we declare it in two places:

First in the SQLConf object, e.g.:

  val CSV_PARSER_COLUMN_PRUNING = buildConf("spark.sql.csv.parser.columnPruning.enabled")
    .internal()
    .doc("If it is set to true, column names of the requested schema are passed to CSV parser. " +
      "Other column values can be ignored during parsing even if they are malformed.")
    .booleanConf
    .createWithDefault(true)

Second, in the SQLConf class:

  def csvColumnPruning: Boolean = getConf(SQLConf.CSV_PARSER_COLUMN_PRUNING)

As the person that introduced both, I'm now thinking we should remove almost 
all of the latter, unless it is used more than 5 times. The vast majority of 
config options are read only in one place, so the functions are pretty 
redundant ...
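
Concretely, removing an accessor just means its call sites read the config entry
directly. A rough before/after sketch, assuming a SQLConf instance named conf is
in scope:

  // Before: through the accessor defined on the SQLConf class.
  if (conf.csvColumnPruning) { /* prune columns while parsing */ }

  // After: read the entry declared on the SQLConf object directly.
  if (conf.getConf(SQLConf.CSV_PARSER_COLUMN_PRUNING)) { /* prune columns while parsing */ }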

Re: [DISCUSS] Function plugins

2018-12-14 Thread Reynold Xin
Having a way to register UDFs that are not using Hive APIs would be great!

On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> 
> 
> Hi everyone,
> I’ve been looking into improving how users of our Spark platform register
> and use UDFs and I’d like to discuss a few ideas for making this easier.
> 
> 
> 
> The motivation for this is the use case of defining a UDF from SparkSQL or
> PySpark. We want to make it easy to write JVM UDFs and use them from both
> SQL and Python. Python UDFs work great in most cases, but we occasionally
> don’t want to pay the cost of shipping data to python and processing it
> there so we want to make it easy to register UDFs that will run in the
> JVM.
> 
> 
> 
> There is already syntax to create a function from a JVM class (
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html
> ) in SQL that would work, but this option requires using the Hive UDF API
> instead of Spark’s simpler Scala API. It also requires argument
> translation and doesn’t support codegen. Beyond the problem of the API and
> performance, it is annoying to require registering every function
> individually with a CREATE FUNCTION statement.
> 
> 
> 
> The alternative that I’d like to propose is to add a way to register a
> named group of functions using the proposed catalog plugin API.
> 
> 
> 
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is
> to load and configure plugins using a simple property-based scheme. Those
> plugins expose functionality through mix-in interfaces, like TableCatalog to
> create/drop/load/alter tables. Another interface could be UDFCatalog that
> can load UDFs.
> 
> interface UDFCatalog extends CatalogPlugin { UserDefinedFunction
> loadUDF(String name) }
> 
> To use this, I would create a UDFCatalog class that returns my Scala
> functions as UDFs. To look up functions, we would use both the catalog
> name and the function name.
> 
> 
> 
> This would allow my users to write Scala UDF instances, package them using
> a UDFCatalog class (provided by me), and easily use them in Spark with a
> few configuration options to add the catalog in their environment.
> 
> 
> 
> This would also allow me to expose UDF libraries easily in my
> configuration, like brickhouse (
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943
> ) , without users needing to ensure the Jar is loaded and register
> individual functions.
> 
> 
> 
> Any thoughts on this high-level approach? I know that this ignores things
> like creating and storing functions in a FunctionCatalog , and we’d have to
> solve challenges with function naming (whether there is a db component).
> Right now I’d like to think through the overall idea and not get too
> focused on those details.
> 
> 
> 
> Thanks,
> 
> 
> 
> rb
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>
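
To make the idea above a bit more concrete, here is a minimal sketch of what an
implementation of the proposed UDFCatalog could look like. Note that UDFCatalog is
only a proposal at this point, so the trait below (and everything in it) is
hypothetical rather than an existing Spark API:

  import org.apache.spark.sql.expressions.UserDefinedFunction
  import org.apache.spark.sql.functions.udf

  // Hypothetical stand-in for the proposed UDFCatalog mix-in.
  trait UDFCatalog {
    def loadUDF(name: String): UserDefinedFunction
  }

  // A platform team could ship a class like this, packaging plain Scala UDFs.
  class MyCompanyUDFs extends UDFCatalog {
    private val udfs: Map[String, UserDefinedFunction] = Map(
      "normalize_email" -> udf((s: String) => if (s == null) null else s.trim.toLowerCase),
      "str_len"         -> udf((s: String) => if (s == null) 0 else s.length)
    )

    override def loadUDF(name: String): UserDefinedFunction =
      udfs.getOrElse(name, throw new NoSuchElementException(s"No such UDF: $name"))
  }

Users would then only need the configuration that registers the catalog, and could
call the functions by catalog and name from SQL or PySpark.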

Re: [DISCUSS] Function plugins

2018-12-14 Thread Reynold Xin
I don't think it is realistic to support codegen for UDFs. It's hooked deep
into the internals.

On Fri, Dec 14, 2018 at 6:52 PM Matt Cheah  wrote:

> How would this work with:
>
>1. Codegen – how does one generate code given a user’s UDF? Would the
>user be able to specify the code that is generated that represents their
>function? In practice that’s pretty hard to get right.
>2. Row serialization and representation – Will the UDF receive
>catalyst rows with optimized internal representations, or will Spark have
>to convert to something more easily consumed by a UDF?
>
>
>
> Otherwise +1 for trying to get this to work without Hive. I think even
> having something without codegen and optimized row formats is worthwhile if
> only because it’s easier to use than Hive UDFs.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Reynold Xin 
> *Date: *Friday, December 14, 2018 at 1:49 PM
> *To: *"rb...@netflix.com" 
> *Cc: *Spark Dev List 
> *Subject: *Re: [DISCUSS] Function plugins
>
>
>
>
> Having a way to register UDFs that are not using Hive APIs would be great!
>
>
>
>
>
>
>
> On Fri, Dec 14, 2018 at 1:30 PM, Ryan Blue 
> wrote:
>
> Hi everyone,
> I’ve been looking into improving how users of our Spark platform register
> and use UDFs and I’d like to discuss a few ideas for making this easier.
>
> The motivation for this is the use case of defining a UDF from SparkSQL or
> PySpark. We want to make it easy to write JVM UDFs and use them from both
> SQL and Python. Python UDFs work great in most cases, but we occasionally
> don’t want to pay the cost of shipping data to python and processing it
> there so we want to make it easy to register UDFs that will run in the JVM.
>
> There is already syntax to create a function from a JVM class
> <https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html>
> in SQL that would work, but this option requires using the Hive UDF API
> instead of Spark’s simpler Scala API. It also requires argument translation
> and doesn’t support codegen. Beyond the problem of the API and performance,
> it is annoying to require registering every function individually with a 
> CREATE
> FUNCTION statement.
>
> The alternative that I’d like to propose is to add a way to register a
> named group of functions using the proposed catalog plugin API.
>
> For anyone unfamiliar with the proposed catalog plugins, the basic idea is
> to load and configure plugins using a simple property-based scheme. Those
> plugins expose functionality through mix-in interfaces, like TableCatalog
> to create/drop/load/alter tables. Another interface could be UDFCatalog
> that can load UDFs.
>
> interface UDFCatalog extends CatalogPlugin {
>
>   UserDefinedFunction loadUDF(String name)
>
> }
>
> To use this, I would create a UDFCatalog class that returns my Scala
> functions as UDFs. To look up functions, we would use both the catalog name
> and the function name.
>
> This would allow my users to write Scala UDF instances, package them using
> a UDFCatalog class (provided by me), and easily use them in Spark with a
> few configuration options to add the catalog in their environment.
>
> This would also allow me to expose UDF libraries easily in my
> configuration, like brickhouse
> <https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Leveraging-Brickhouse-in-Spark2-pivot/m-p/59943>,
> without users needing to ensure the Jar is loaded and register individual
> functions.
>
> Any thoughts on this high-level approach? I know that this ignores things
> like creating and storing functions in a FunctionCatalog, and we’d have
> to solve challenges with function naming (whether there is a db component).
> Right now I’d like to think through the overall idea and not get too
> focused on those details.
>
> Thanks,
>
> rb
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>


Re: Decimals with negative scale

2018-12-18 Thread Reynold Xin
Is this an analysis time thing or a runtime thing?

On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido  wrote:

> Hi all,
>
> as you may remember, there was a design doc to support operations
> involving decimals with negative scales. After the discussion in the design
> doc, now the related PR is blocked because for 3.0 we have another option
> which we can explore, ie. forbidding negative scales. This is probably a
> cleaner solution, as most likely we didn't want negative scales, but it is
> a breaking change: so we wanted to check the opinion of the community.
>
> Getting to the topic, here there are the 2 options:
> * - Forbidding negative scales*
>   Pros: many sources do not support negative scales (so they can create
> issues); they were something which was not considered as possible in the
> initial implementation, so we get to a more stable situation.
>   Cons: some operations which were supported earlier, won't be working
> anymore. Eg. since our max precision is 38, if the scale cannot be negative
> 1e36 * 1e36 would cause an overflow, while now works fine (producing a
> decimal with negative scale); basically impossible to create a config which
> controls the behavior.
>
>  *- Handling negative scales in operations*
>   Pros: no regressions; we support all the operations we supported on 2.x.
>   Cons: negative scales can cause issues in other moments, eg. when saving
> to a data source which doesn't support them.
>
> Looking forward to hearing your thoughts,
> Thanks.
> Marco
>
>
>
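
For anyone unfamiliar with the terminology: a decimal value is a pair of an unscaled
integer and a scale, with value = unscaledValue * 10^(-scale), so a negative scale
represents a number with more magnitude than digits. A quick illustration with
java.math.BigDecimal:

  import java.math.{BigDecimal => JBigDecimal}
  import java.math.BigInteger

  // Unscaled value 12 with scale -2 represents 12 * 10^2 = 1200,
  // using only 2 digits of precision.
  val d = new JBigDecimal(BigInteger.valueOf(12), -2)
  println(d.toPlainString)  // 1200
  println(d.precision)      // 2
  println(d.scale)          // -2

This is why the 1e36 * 1e36 example above matters: the result (1e72) cannot fit in
38 digits of precision unless it is allowed to carry a negative scale.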


Re: Decimals with negative scale

2018-12-18 Thread Reynold Xin
So why can't we just add validation that fails for sources that don't support negative 
scale? This way, we don't need to break backward 
compatibility in any way, and it becomes a strict improvement.

On Tue, Dec 18, 2018 at 8:43 AM, Marco Gaido < marcogaid...@gmail.com > wrote:

> 
> This is at analysis time.
> 
> On Tue, 18 Dec 2018, 17:32 Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) wrote:
> 
> 
>> Is this an analysis time thing or a runtime thing?
>> 
>> On Tue, Dec 18, 2018 at 7:45 AM Marco Gaido < marcogaido91@ gmail. com (
>> marcogaid...@gmail.com ) > wrote:
>> 
>> 
>>> Hi all,
>>> 
>>> 
>>> as you may remember, there was a design doc to support operations
>>> involving decimals with negative scales. After the discussion in the
>>> design doc, now the related PR is blocked because for 3.0 we have another
>>> option which we can explore, ie. forbidding negative scales. This is
>>> probably a cleaner solution, as most likely we didn't want negative
>>> scales, but it is a breaking change: so we wanted to check the opinion of
>>> the community.
>>> 
>>> 
>>> Getting to the topic, here there are the 2 options:
>>> * - Forbidding negative scales*
>>>   Pros: many sources do not support negative scales (so they can create
>>> issues); they were something which was not considered as possible in the
>>> initial implementation, so we get to a more stable situation.
>>>   Cons: some operations which were supported earlier, won't be working
>>> anymore. Eg. since our max precision is 38, if the scale cannot be
>>> negative 1e36 * 1e36 would cause an overflow, while now works fine
>>> (producing a decimal with negative scale); basically impossible to create
>>> a config which controls the behavior.
>>> 
>>> 
>>> 
>>>  *- Handling negative scales in operations*
>>>   Pros: no regressions; we support all the operations we supported on 2.x.
>>> 
>>>   Cons: negative scales can cause issues in other moments, eg. when saving
>>> to a data source which doesn't support them.
>>> 
>>> 
>>> 
>>> Looking forward to hear your thoughts,
>>> Thanks.
>>> Marco
>>> 
>> 
>> 
> 
>

Re: [build system] jenkins master needs reboot, temporary downtime

2018-12-19 Thread Reynold Xin
Thanks for taking care of this, Shane!

On Wed, Dec 19, 2018 at 9:45 AM, shane knapp < skn...@berkeley.edu > wrote:

> 
> master is back up and building.
> 
> On Wed, Dec 19, 2018 at 9:31 AM shane knapp < sknapp@ berkeley. edu (
> skn...@berkeley.edu ) > wrote:
> 
> 
>> the jenkins process seems to be wedged again, and i think we're going to
>> hit it w/the reboot hammer, rather than just killing/restarting the
>> master.
>> 
>> 
>> this should take at most 30 mins, and i'll send an all-clear when it's
>> done.
>> 
>> 
>> --
>> Shane Knapp
>> 
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> 
>> https:/ / rise. cs. berkeley. edu ( https://rise.cs.berkeley.edu )
>> 
>> 
> 
> 
> 
> 
> --
> Shane Knapp
> 
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> 
> https:/ / rise. cs. berkeley. edu ( https://rise.cs.berkeley.edu )
>

Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I think there is an infra ticket open for it right now.

On Wed, Dec 19, 2018 at 6:58 PM Nicholas Chammas 
wrote:

> Can we somehow disable these new email alerts coming through for the Spark
> website repo?
>
> On Wed, Dec 19, 2018 at 8:25 PM GitBox  wrote:
>
>> ueshin commented on a change in pull request #163: Announce the schedule
>> of 2019 Spark+AI summit at SF
>> URL:
>> https://github.com/apache/spark-website/pull/163#discussion_r243130975
>>
>>
>>
>>  ##
>>  File path: site/sitemap.xml
>>  ##
>>  @@ -139,657 +139,661 @@
>>  
>>  
>>  
>> -  https://spark.apache.org/releases/spark-release-2-4-0.html
>> +  
>> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html
>> 
>>
>>  Review comment:
>>Still remaining `localhost:4000` in this file.
>>
>> 
>> This is an automated message from the Apache Git Service.
>> To respond to the message, please log on GitHub and use the
>> URL above to go to the specific comment.
>>
>> For queries about this service, please contact Infrastructure at:
>> us...@infra.apache.org
>>
>>
>> With regards,
>> Apache Git Services
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Noisy spark-website notifications

2018-12-19 Thread Reynold Xin
I added my comment there too!

On Wed, Dec 19, 2018 at 7:26 PM, Hyukjin Kwon < gurwls...@gmail.com > wrote:

> 
> Yea, that's a bit noisy .. I would just completely disable it to be
> honest. I filed https:/ / issues. apache. org/ jira/ browse/ INFRA-17469 (
> https://issues.apache.org/jira/browse/INFRA-17469 ) before. I would
> appreciate it if there were more input there :-)
> 
> 
> On Thu, Dec 20, 2018 at 11:22 AM, Nicholas Chammas < nicholas. chammas@ gmail. com
> ( nicholas.cham...@gmail.com ) > wrote:
> 
> 
>> I'd prefer it if we disabled all git notifications for spark-website.
>> Folks who want to stay on top of what's happening with the site can simply
>> watch the repo on GitHub ( https://github.com/apache/spark-website ) , no?
>> 
>> 
>> On Wed, Dec 19, 2018 at 10:00 PM Wenchen Fan < cloud0fan@ gmail. com (
>> cloud0...@gmail.com ) > wrote:
>> 
>> 
>>> +1, at least it should only send one email when a PR is merged.
>>> 
>>> 
>>> On Thu, Dec 20, 2018 at 10:58 AM Nicholas Chammas < nicholas. chammas@ 
>>> gmail.
>>> com ( nicholas.cham...@gmail.com ) > wrote:
>>> 
>>> 
 Can we somehow disable these new email alerts coming through for the Spark
 website repo?
 
 On Wed, Dec 19, 2018 at 8:25 PM GitBox < git@ apache. org ( g...@apache.org
 ) > wrote:
 
 
> ueshin commented on a change in pull request #163: Announce the schedule
> of 2019 Spark+AI summit at SF
> URL: https:/ / github. com/ apache/ spark-website/ pull/ 
> 163#discussion_r243130975
> ( https://github.com/apache/spark-website/pull/163#discussion_r243130975 )
> 
> 
> 
> 
>  ##
>  File path: site/sitemap.xml
>  ##
>  @@ -139,657 +139,661 @@
>  
>  
>  
> -   https:/ / spark. apache. org/ releases/ spark-release-2-4-0. html
> ( https://spark.apache.org/releases/spark-release-2-4-0.html ) 
> +   http:/ / localhost:4000/ news/ 
> spark-ai-summit-apr-2019-agenda-posted.
> html (
> http://localhost:4000/news/spark-ai-summit-apr-2019-agenda-posted.html ) 
> 
> 
> 
>  Review comment:
>    Still remaining `localhost:4000` in this file.
> 
> 
> This is an automated message from the Apache Git Service.
> To respond to the message, please log on GitHub and use the
> URL above to go to the specific comment.
> 
> For queries about this service, please contact Infrastructure at:
> users@ infra. apache. org ( us...@infra.apache.org )
> 
> 
> With regards,
> Apache Git Services
> 
> -
> To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
 
 
 
>>> 
>>> 
>> 
>> 
> 
>

Re: [DISCUSS] Default values and data sources

2018-12-21 Thread Reynold Xin
I'd only do any of the schema evolution things as an add-on on top. This is an 
extremely complicated area, and we could risk never shipping anything because 
there would be a lot of different requirements.

On Fri, Dec 21, 2018 at 9:46 AM, Russell Spitzer < russell.spit...@gmail.com > 
wrote:

> 
> I definitely would like to have a "column can be missing" capability,
> allowing the underlying datasource to fill in a default if it wants to (or
> not).
> 
> On Fri, Dec 21, 2018 at 1:40 AM Alessandro Solimando < alessandro. solimando@
> gmail. com ( alessandro.solima...@gmail.com ) > wrote:
> 
> 
>> Hello,
>> I agree that Spark should check whether the underlying datasource
>> support default values or not, and adjust its behavior accordingly.
>> 
>> If we follow this direction, do you see the default-values capability
>> in scope of the "DataSourceV2 capability API"?
>> 
>> Best regards,
>> Alessandro
>> 
>> On Fri, 21 Dec 2018 at 03:31, Wenchen Fan < cloud0fan@ gmail. com (
>> cloud0...@gmail.com ) > wrote:
>> >
>> > Hi Ryan,
>> >
>> > That's a good point. Since in this case Spark is just a channel to pass
>> user's action to the data source, we should think of what actions the data
>> source supports.
>> >
>> > Following this direction, it makes more sense to delegate everything to
>> data sources.
>> >
>> > As the first step, maybe we should not add DDL commands to change schema
>> of data source, but just use the capability API to let data source decide
>> what to do when input schema doesn't match the table schema during
>> writing. Users can use native client of data source to change schema.
>> >
>> > On Fri, Dec 21, 2018 at 8:03 AM Ryan Blue < rblue@ netflix. com (
>> rb...@netflix.com ) > wrote:
>> >>
>> >> I think it is good to know that not all sources support default values.
>> That makes me think that we should delegate this behavior to the source
>> and have a way for sources to signal that they accept default values in
>> DDL (a capability) and assume that they do not in most cases.
>> >>
>> >> On Thu, Dec 20, 2018 at 1:32 PM Russell Spitzer < russell. spitzer@ gmail.
>> com ( russell.spit...@gmail.com ) > wrote:
>> >>>
>> >>> I guess my question is why is this a Spark level behavior? Say the
>> user has an underlying source where they have a different behavior at the
>> source level. In Spark they set a new default behavior and it's added to
>> the catalogue, is the Source expected to propagate this? Or does the user
>> have to be aware that their own Source settings may be different for a
>> client connecting via Spark or via a native driver.
>> >>>
>> >>> For example say i'm using C* (sorry but obviously I'm always thinking
>> about C*), and I add a new column to the database. When i connect to the
>> database with a non-spark application I expect to be able to insert to the
>> table given that I satisfy the required columns. In Spark someone sets the
>> columns as having a default value (there is no such feature in C*), now
>> depending on how I connect to the source I have two different behaviors.
>> If I insert from the native app I get empty cells, if I insert from spark
>> i get a default value inserted. That sounds more confusing to an end-user
>> to than having a consistent behavior between native clients and Spark
>> clients. This is why I asked if the goal was to just have a common "Spark"
>> behavior because I don't think it makes sense if you consider multiple
>> interaction points for a source.
>> >>>
>> >>> On Wed, Dec 19, 2018 at 9:28 PM Wenchen Fan < cloud0fan@ gmail. com (
>> cloud0...@gmail.com ) > wrote:
>> 
>>  So you agree with my proposal that we should follow RDBMS/SQL
>> standard regarding the behavior?
>> 
>>  > pass the default through to the underlying data source
>> 
>>  This is one way to implement the behavior.
>> 
>>  On Thu, Dec 20, 2018 at 11:12 AM Ryan Blue < rblue@ netflix. com (
>> rb...@netflix.com ) > wrote:
>> >
>> > I don't think we have to change the syntax. Isn't the right thing
>> (for option 1) to pass the default through to the underlying data source?
>> Sources that don't support defaults would throw an exception.
>> >
>> > On Wed, Dec 19, 2018 at 6:29 PM Wenchen Fan < cloud0fan@ gmail. com (
>> cloud0...@gmail.com ) > wrote:
>> >>
>> >> The standard ADD COLUMN SQL syntax is: ALTER TABLE table_name ADD
>> COLUMN column_name datatype [DEFAULT value];
>> >>
>> >> If the DEFAULT statement is not specified, then the default value
>> is null. If we are going to change the behavior and say the default value
>> is decided by the underlying data source, we should use a new SQL syntax(I
>> don't have a proposal in mind), instead of reusing the existing syntax, to
>> be SQL compatible.
>> >>
>> >> Personally I don't like re-invent wheels. It's better to just
>> implement the SQL standard ADD COLUMN command, which means the default
>> value is decided by the end-users.
>> >>
>> >> O
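
For reference, the capability-based direction discussed above would amount to
something like the following check on the DDL/write path. This is purely a sketch:
the trait name and method below are hypothetical, since no such API existed at this
point:

  // Hypothetical capability mix-in a source could implement to opt in to defaults.
  trait SupportsDefaultValues

  // Sketch of the validation Spark could perform for ALTER TABLE ... ADD COLUMN.
  def validateAddColumn(source: AnyRef, hasDefaultClause: Boolean): Unit = {
    if (hasDefaultClause && !source.isInstanceOf[SupportsDefaultValues]) {
      throw new UnsupportedOperationException(
        "This data source does not support DEFAULT values in ADD COLUMN")
    }
  }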

Re: Trigger full GC during executor idle time?

2018-12-31 Thread Reynold Xin
Not sure how reputable or representative that paper is...

On Mon, Dec 31, 2018 at 10:57 AM Sean Owen  wrote:

> https://github.com/apache/spark/pull/23401
>
> Interesting PR; I thought it was not worthwhile until I saw a paper
> claiming this can speed things up to the tune of 2-6%. Has anyone
> considered this before?
>
> Sean
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Remove non-Tungsten mode in Spark 3?

2019-01-03 Thread Reynold Xin
The issue with the offheap mode is that it is a pretty big behavior change and does 
require additional setup (also, for users that run UDFs that allocate a lot 
of heap memory, it might not be as good).

I can see us removing the legacy mode since it's been legacy for a long time 
and perhaps very few users need it. How much code does it remove though?

On Thu, Jan 03, 2019 at 2:55 PM, Sean Owen < sro...@apache.org > wrote:

> 
> 
> 
> Just wondering if there is a good reason to keep around the pre-tungsten
> on-heap memory mode for Spark 3, and make spark.memory.offHeap.enabled
> always true? It would simplify the code somewhat, but I don't feel I'm so
> aware of the tradeoffs.
> 
> 
> 
> I know we didn't deprecate it, but it's been off by default for a long
> time. It could be deprecated, too.
> 
> 
> 
> Same question for spark.memory.useLegacyMode and all its various
> associated settings? Seems like these should go away at some point, and
> Spark 3 is a good point. Same issue about deprecation though.
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>
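
For context, the "additional setup" for off-heap mode mentioned above is roughly the
following: enabling the flag is not enough on its own, because an off-heap size must
also be configured. A small sketch (the size value is only illustrative):

  import org.apache.spark.SparkConf

  // Off-heap execution memory is only used when a non-zero size is configured
  // alongside the enable flag.
  val conf = new SparkConf()
    .set("spark.memory.offHeap.enabled", "true")
    .set("spark.memory.offHeap.size", "2g")  // illustrative; size it for your workload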

Re: [DISCUSS] Handling correctness/data loss jiras

2019-01-04 Thread Reynold Xin
Committers,

When you merge tickets fixing correctness bugs, please make sure you tag the 
tickets with the "correctness" label. I found multiple tickets today that were not 
tagged.

On Fri, Aug 17, 2018 at 7:11 AM, Tom Graves < tgraves...@yahoo.com.invalid > 
wrote:

> 
> Since we haven't heard any objections to this, the documentation has been
> updated (Thanks to Sean).
> 
> 
> All devs please make sure to re-read: http:/ / spark. apache. org/ 
> contributing.
> html ( http://spark.apache.org/contributing.html ) .
> 
> 
> Note the set of labels used in Jira has been documented and correctness or
> data loss issues should be marked as blocker by default.  There is also a
> label to mark the jira as having something needing to go into the
> release-notes.
> 
> 
> 
> 
> Tom
> 
> 
> On Tuesday, August 14, 2018, 3:32:27 PM CDT, Imran Rashid < irashid@ cloudera.
> com. INVALID ( iras...@cloudera.com.INVALID ) > wrote:
> 
> 
> 
> 
> +1 on what we should do.
> 
> 
> On Mon, Aug 13, 2018 at 3:06 PM, Tom Graves < tgraves_cs@ yahoo. com. invalid
> ( tgraves...@yahoo.com.invalid ) > wrote:
> 
>> 
>> 
>> 
>> > I mean, what are concrete steps beyond saying this is a problem? That's
>> the important thing to discuss.
>> 
>> 
>> Sorry I'm a bit confused by your statement but also think I agree.  I
>> started this thread for this reason. I pointed out that I thought it was a
>> problem and also brought up things I thought we could do to help fix it.  
>> 
>> 
>> 
>> Maybe I wasn't clear in the first email, the list of things I had were
>> proposals on what we do for a jira that is for a correctness/data loss
>> issue. Its the committers and developers that are involved in this though
>> so if people don't agree or aren't going to do them, then it doesn't work.
>> 
>> 
>> 
>> Just to restate what I think we should do:
>> 
>> 
>> - label any correctness/data loss jira with "correctness"
>> - jira should be marked as a blocker by default if someone suspects a
>> corruption/loss issue
>> - Make sure the description is clear about when it occurs and impact to
>> the user.   
>> - ensure its back ported to all active branches
>> - See if we can have a separate section in the release notes for these
>> 
>> 
>> The last one I guess is more a one time thing that i can file a jira for. 
>> The first 4 would be done for each jira filed.
>> 
>> 
>> I'm proposing we do these things and as such if people agree we would also
>> document those things in the committers or developers guide and send email
>> to the list. 
>> 
>> 
>>  
>> 
>> 
>> Tom
>> On Monday, August 13, 2018, 11:17:22 AM CDT, Sean Owen < srowen@ apache. org
>> ( sro...@apache.org ) > wrote:
>> 
>> 
>> 
>> 
>> Generally: if someone thinks correctness fix X should be backported
>> further, I'd say just do it, if it's to an active release branch (see
>> below). Anything that important has to outweigh most any other concern,
>> like behavior changes.
>> 
>> 
>> On Mon, Aug 13, 2018 at 11:08 AM Tom Graves < tgraves_cs@ yahoo. com (
>> tgraves...@yahoo.com ) > wrote:
>> 
>>> I'm not really sure what you mean by this, this proposal is to introduce a
>>> process for this type of issue so its at least brought to peoples
>>> attention. We can't do anything to make people work on certain things.  If
>>> they aren't raised as important issues then its really easy to miss these
>>> things.  If its a blocker we should also not be doing any new releases
>>> without a fix for it which may motivate people to look at it.
>>> 
>> 
>> 
>> 
>> I mean, what are concrete steps beyond saying this is a problem? That's
>> the important thing to discuss.
>> 
>> 
>> There's a good one here: let's say anything that's likely to be a
>> correctness or data loss issue should automatically be labeled
>> 'correctness' as such and set to Blocker. 
>> 
>> 
>> That can go into the how-to-contribute manual in the docs and in a note to
>> dev@.
>>  
>>  
>> 
>>> 
>>> I agree it would be good for us to make it more official about which
>>> branches are being maintained.  I think at this point its still 2.1.x,
>>> 2.2.x, and 2.3.x since we recently did releases of all of these.  Since
>>> 2.4 will be coming out we should definitely think about stop maintaining
>>> 2.1.x.  Perhaps we need a table on our release page about this.  But this
>>> should be a separate thread.
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> I propose writing something like this in the 'versioning' doc page, to at
>> least establish a policy:
>> 
>> 
>> Minor release branches will, generally, be maintained with bug fixes
>> releases for a period of 18 months. For example, branch 2.1.x is no longer
>> considered maintained as of July 2018, 18 months after the release of
>> 2.1.0 in December 2016.
>> 
>> 
>> This gives us -- and more importantly users -- some understanding of what
>> to expect for backporting and fixes.
>> 
>> 
>> 
>> 
>> I am going to revive the thread about adding PMC / committers as it's
>> overdue. That may not do much, but, more han

Re: [DISCUSS] Identifiers with multi-catalog support

2019-01-13 Thread Reynold Xin
Thanks for writing this up. Just to show why option 1 is not sufficient: MySQL 
and Postgres are the two most popular open source database systems, and both 
support database → schema → table 3-part identification, so Spark passing only a 
2-part name to the data source (option 1) isn't enough.

For the issues you brought up w.r.t. nesting - what's the challenge in 
supporting it? I can also see us not supporting it for now (no nesting allowed, 
leaf - 1 level can only contain leaf tables), and adding support for nesting in 
the future.

On Sun, Jan 13, 2019 at 1:38 PM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> 
> 
> In the DSv2 sync up, we tried to discuss the Table metadata proposal but
> were side-tracked on its use of TableIdentifier. There were good points
> about how Spark should identify tables, views, functions, etc, and I want
> to start a discussion here.
> 
> 
> 
> Identifiers are orthogonal to the TableCatalog proposal that can be
> updated to use whatever identifier class we choose. That proposal is
> concerned with what information should be passed to define a table, and
> how to pass that information.
> 
> 
> 
> The main question for this discussion is: *how should Spark identify
> tables, views, and functions when it supports multiple catalogs?*
> 
> 
> 
> There are two main approaches:
> 
> * Use a 3-part identifier, catalog.database.table
> * Use an identifier with an arbitrary number of parts
> 
> 
> *Option 1: use 3-part identifiers*
> 
> 
> 
> The argument for option #1 is that it is simple. If an external data store
> has additional logical hierarchy layers, then that hierarchy would be
> mapped to multiple catalogs in Spark. Spark can support show tables and
> show databases without much trouble. This is the approach used by Presto,
> so there is some precedent for it.
> 
> 
> 
> The drawback is that mapping a more complex hierarchy into Spark requires
> more configuration. If an external DB has a 3-level hierarchy — say, for
> example, schema.database.table — then option #1 requires users to configure
> a catalog for each top-level structure, each schema. When a new schema is
> added, it is not automatically accessible.
> 
> 
> 
> Catalog implementations could use session options could provide a rough
> work-around by changing the plugin’s “current” schema. I think this is an
> anti-pattern, so another strike against this option is that it encourages
> bad practices.
> 
> 
> 
> *Option 2: use n-part identifiers*
> 
> 
> 
> That drawback for option #1 is the main argument for option #2: Spark
> should allow users to easily interact with the native structure of an
> external store. For option #2, a full identifier would be an
> arbitrary-length list of identifiers. For the example above, using 
> catalog.schema.database.table
> is allowed. An identifier would be something like this:
> 
> case class CatalogIdentifier(parts: Seq[String])
> 
> The problem with option #2 is how to implement a listing and discovery
> API, for operations like SHOW TABLES. If the catalog API requires a 
> list(ident:
> CatalogIdentifier) , what does it return? There is no guarantee that the
> listed objects are tables and not nested namespaces. How would Spark
> handle arbitrary nesting that differs across catalogs?
> 
> 
> 
> Hopefully, I’ve captured the design question well enough for a productive
> discussion. Thanks!
> 
> 
> 
> rb
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>
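
For comparison, the two identifier shapes side by side, as a rough sketch (the
3-part class name and the example values are made up; the n-part version is the
CatalogIdentifier from the message above):

  // Option 1: a fixed catalog.database.table identifier.
  case class ThreePartIdentifier(catalog: String, database: String, table: String)

  // Option 2: an arbitrary-length identifier; the last element names the leaf
  // object and everything before it is namespace, however deeply the external
  // store nests.
  case class CatalogIdentifier(parts: Seq[String])

  val simple = ThreePartIdentifier("mysql_prod", "sales", "orders")
  val nested = CatalogIdentifier(Seq("jdbc_prod", "sales_schema", "sales_db", "orders"))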

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Reynold Xin
There are a few things to keep in mind:

1. Structured Streaming isn't an independent project. It actually (by design) 
depends on all the rest of Spark SQL, and virtually all improvements to Spark 
SQL benefit Structured Streaming.

2. The project as far as I can tell is relatively mature for core ETL and 
incremental processing purposes. I interact with a lot of users using it 
every day. We can always expand the use cases and add more, but that also adds 
maintenance burden. In any case, it'd be good to get some activity here.

On Mon, Jan 14, 2019 at 5:11 PM, Nicholas Chammas < nicholas.cham...@gmail.com 
> wrote:

> 
> As an observer, this thread is interesting and concerning. Is there an
> emerging consensus that Structured Streaming is somehow not relevant
> anymore? Or is it just that folks consider it "complete enough"?
> 
> 
> Structured Streaming was billed as the replacement to DStreams. If
> committers, generally speaking, have lost interest in Structured
> Streaming, does that mean the Apache Spark project is somehow no longer
> aiming to provide a "first-class" solution to the problem of stream
> processing?
> 
> On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim < kabhwan@ gmail. com (
> kabh...@gmail.com ) > wrote:
> 
> 
>> Cody, I guess I already addressed your comments in the PR (#22138). The
>> approach was changed to address your concern, and after that Gabor helped
>> to review the PR. Please take a look again when you have time to get into.
>> 
>> 
>> 
>> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger < cody@ koeninger. org (
>> c...@koeninger.org ) > wrote:
>> 
>> 
>>> I feel like I've already said my piece on
>>> https:/ / github. com/ apache/ spark/ pull/ 22138 (
>>> https://github.com/apache/spark/pull/22138 ) let me know if you have
>>> more questions.
>>> 
>>> As for SS in general, I don't have a production SS deployment, so I'm
>>> less comfortable with reviewing large changes to it.  But if no other
>>> committers are working on it...
>>> 
>>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen < srowen@ gmail. com (
>>> sro...@gmail.com ) > wrote:
>>> >
>>> > Yes you're preaching to the choir here. SS does seem somewhat
>>> > abandoned by those that have worked on it. I have also been at times
>>> > frustrated that some areas fall into this pattern.
>>> >
>>> > There isn't a way to make people work on it, and I personally am not
>>> > interested in it nor have a background in SS.
>>> >
>>> > I did leave some comments on your PR and will see if we can get
>>> > comfortable with merging it, as I presume you are pretty knowledgeable
>>> > about the change.
>>> >
>>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim < kabhwan@ gmail. com (
>>> kabh...@gmail.com ) > wrote:
>>> > >
>>> > > Sean, this is actually a fail-back on pinging committers. I know who
>>> can review and merge in SS area, and pinged to them, didn't work. Even
>>> there's a PR which approach was encouraged by committer and reviewed the
>>> first phase, and no review.
>>> > >
>>> > > That's not the first time I have faced the situation, and I used the
>>> fail-back approach at that time. (You can see there was no response even
>>> in the mail thread.) Not sure which approach worked.
>>> > > https:/ / lists. apache. org/ thread. html/ 
>>> > > c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@
>>> %3Cdev. spark. apache. org%3E (
>>> https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>>> )
>>> > >
>>> > > I've observed that only (critical) bugfixes are being reviewed and
>>> merged in time for SS area. For other stuffs like new features and
>>> improvements, both discussions and PRs were pretty less popular from
>>> committers: though there was even participation/approve from non-committer
>>> community. I don't think SS is the thing to be turned into maintenance.
>>> > >
>>> > > I guess PMC members should try to resolve such situation, as it will
>>> (slowly and quietly) make some issues like contributors leaving, module
>>> stopped growing up, etc.. The problem will grow up like a snowball:
>>> getting bigger and bigger. I don't mind if there's no interest on both
>>> contributors and committers for such module, but SS is not. Maybe either
>>> other committers who weren't familiar with should try to get familiar and
>>> cover the area, or the area needs more committers.
>>> > >
>>> > > -Jungtaek Lim (HeartSaVioR)
>>> > >
>>> > > On Sun, Jan 13, 2019 at 11:37 PM, Sean Owen < srowen@ gmail. com (
>>> sro...@gmail.com ) > wrote:
>>> > >>
>>> > >> Jungtaek, the best strategy is to find who wrote the code you are
>>> > >> modifying (use Github history or git blame) and ping them directly on
>>> 
>>> > >> the PR. I don't know this code well myself.
>>> > >> It also helps if you can address why the functionality is important,
>>> > >> and describe compatibility implications.
>>> > >>
>>> > >> Most PRs are not merged, note. Not commenting on this particular one,
>>> 
>>> > >> bu

Re: Ask for reviewing on Structured Streaming PRs

2019-01-14 Thread Reynold Xin
BTW the largest change to SS right now is probably the entire data source API 
v2 effort, which aims to unify streaming and batch from the data source 
perspective and provide a reliable, expressive source/sink API.

On Mon, Jan 14, 2019 at 5:34 PM, Reynold Xin < r...@databricks.com > wrote:

> 
> There are a few things to keep in mind:
> 
> 
> 
> 1. Structured Streaming isn't an independent project. It actually (by
> design) depends on all the rest of Spark SQL, and virtually all
> improvements to Spark SQL benefit Structured Streaming.
> 
> 
> 
> 2. The project as far as I can tell is relatively mature for core ETL and
> incremental processing purpose. I interact with a lot of users using it
> everyday. We can always expand the use cases and add more, but that also
> adds maintenance burden. In any case, it'd be good to get some activity
> here.
> 
> 
> 
> 
> 
> 
> 
> 
> On Mon, Jan 14, 2019 at 5:11 PM, Nicholas Chammas < nicholas. chammas@ gmail.
> com ( nicholas.cham...@gmail.com ) > wrote:
> 
>> As an observer, this thread is interesting and concerning. Is there an
>> emerging consensus that Structured Streaming is somehow not relevant
>> anymore? Or is it just that folks consider it "complete enough"?
>> 
>> 
>> Structured Streaming was billed as the replacement to DStreams. If
>> committers, generally speaking, have lost interest in Structured
>> Streaming, does that mean the Apache Spark project is somehow no longer
>> aiming to provide a "first-class" solution to the problem of stream
>> processing?
>> 
>> On Mon, Jan 14, 2019 at 3:43 PM Jungtaek Lim < kabhwan@ gmail. com (
>> kabh...@gmail.com ) > wrote:
>> 
>> 
>>> Cody, I guess I already addressed your comments in the PR (#22138). The
>>> approach was changed to address your concern, and after that Gabor helped
>>> to review the PR. Please take a look again when you have time to get into.
>>> 
>>> 
>>> 
>>> On Tue, Jan 15, 2019 at 1:01 AM, Cody Koeninger < cody@ koeninger. org (
>>> c...@koeninger.org ) > wrote:
>>> 
>>> 
>>>> I feel like I've already said my piece on
>>>> https:/ / github. com/ apache/ spark/ pull/ 22138 (
>>>> https://github.com/apache/spark/pull/22138 ) let me know if you have
>>>> more questions.
>>>> 
>>>> As for SS in general, I don't have a production SS deployment, so I'm
>>>> less comfortable with reviewing large changes to it.  But if no other
>>>> committers are working on it...
>>>> 
>>>> On Sun, Jan 13, 2019 at 5:19 PM Sean Owen < srowen@ gmail. com (
>>>> sro...@gmail.com ) > wrote:
>>>> >
>>>> > Yes you're preaching to the choir here. SS does seem somewhat
>>>> > abandoned by those that have worked on it. I have also been at times
>>>> > frustrated that some areas fall into this pattern.
>>>> >
>>>> > There isn't a way to make people work on it, and I personally am not
>>>> > interested in it nor have a background in SS.
>>>> >
>>>> > I did leave some comments on your PR and will see if we can get
>>>> > comfortable with merging it, as I presume you are pretty knowledgeable
>>>> > about the change.
>>>> >
>>>> > On Sun, Jan 13, 2019 at 4:55 PM Jungtaek Lim < kabhwan@ gmail. com (
>>>> kabh...@gmail.com ) > wrote:
>>>> > >
>>>> > > Sean, this is actually a fall-back after pinging committers. I know who
>>>> can review and merge in the SS area, and I pinged them, but it didn't work. There is
>>>> even a PR whose approach was encouraged by a committer who reviewed the
>>>> first phase, and then no further review.
>>>> > >
>>>> > > That's not the first time I have faced this situation, and I used the
>>>> fall-back approach at that time. (You can see there was no response even
>>>> in the mail thread.) Not sure which approach worked.
>>>> > > https://lists.apache.org/thread.html/c61f32249949b1ff1b265c1a7148c2ea7eda08891e3016fb24008561@%3Cdev.spark.apache.org%3E
>>>> > >
>>>> > > I've observed that only (critical) bugfixes are being reviewed and

Re: Make proactive check for closure serializability optional?

2019-01-21 Thread Reynold Xin
Did you actually observe a perf issue?

On Mon, Jan 21, 2019 at 10:04 AM Sean Owen  wrote:

> The ClosureCleaner proactively checks that closures passed to
> transformations like RDD.map() are serializable, before they're
> executed. It does this by just serializing it with the JavaSerializer.
>
> That's a nice feature, although there's overhead in always trying to
> serialize the closure ahead of time, especially if the closure is
> large. It shouldn't be large, usually. But I noticed it when coming up
> with this fix: https://github.com/apache/spark/pull/23600
>
> It made me wonder, should this be optional, or even not the default?
> Closures that don't serialize still fail, just later when an action is
> invoked. I don't feel strongly about it, just checking if anyone had
> pondered this before.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Make proactive check for closure serializability optional?

2019-01-22 Thread Reynold Xin
Typically very large closures include some array, and the serialization itself
should be much more expensive than the closure check. Does anybody have actual
data showing this could be a problem? We don't want to add a config flag if in
virtually every case it doesn't make sense to change it.

On Mon, Jan 21, 2019 at 12:37 PM, Felix Cheung < felixcheun...@hotmail.com > 
wrote:

> 
> Agreed on the pros / cons, esp driver could be the data science notebook.
> Is it worthwhile making it configurable?
> 
> 
> 
>  
> *From:* Sean Owen < srowen@ gmail. com ( sro...@gmail.com ) >
> *Sent:* Monday, January 21, 2019 10:42 AM
> *To:* Reynold Xin
> *Cc:* dev
> *Subject:* Re: Make proactive check for closure serializability optional?
>  
> None except the bug / PR I linked to, which is really just a bug in
> the RowMatrix implementation; a 2GB closure isn't reasonable.
> I doubt it's much overhead in the common case, because closures are
> small and this extra check happens once per execution of the closure.
> 
> I can also imagine middle-ground cases where people are dragging along
> largeish 10MB closures (like, a model or some data) and this could add
> non-trivial memory pressure on the driver. They should be broadcasting
> those things, sure.
> 
> Given just that I'd leave it alone, but was wondering if anyone had
> ever had the same thought or more arguments that it should be
> disable-able. In 'production' one would imagine all the closures do
> serialize correctly and so this is just a bit overhead that could be
> skipped.
> 
> On Mon, Jan 21, 2019 at 12:17 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> >
> > Did you actually observe a perf issue?
> >
> > On Mon, Jan 21, 2019 at 10:04 AM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> >>
> >> The ClosureCleaner proactively checks that closures passed to
> >> transformations like RDD.map() are serializable, before they're
> >> executed. It does this by just serializing it with the JavaSerializer.
> >>
> >> That's a nice feature, although there's overhead in always trying to
> >> serialize the closure ahead of time, especially if the closure is
> >> large. It shouldn't be large, usually. But I noticed it when coming up
> >> with this fix: https://github.com/apache/spark/pull/23600
> >>
> >> It made me wonder, should this be optional, or even not the default?
> >> Closures that don't serialize still fail, just later when an action is
> >> invoked. I don't feel strongly about it, just checking if anyone had
> >> pondered this before.
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> >>
> 
> -
> To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
>
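
For anyone trying to picture the behaviour being discussed, a minimal sketch (not from the thread; it assumes a local 2.x-era spark-shell, where `sc` is already defined) of how the proactive check surfaces: the non-serializable closure fails at the map() call itself rather than at the later action.

```
import org.apache.spark.SparkException

// Deliberately not Serializable; capturing an instance breaks the closure.
class Handle { val id = 1 }
val handle = new Handle

try {
  // ClosureCleaner eagerly serializes the closure inside map(), so the
  // "Task not serializable" SparkException is thrown right here...
  val rdd = sc.parallelize(1 to 10).map(i => i + handle.id)
  // ...and the action below is never reached with the bad closure.
  println(rdd.count())
} catch {
  case e: SparkException => println(s"Failed at transformation time: ${e.getMessage}")
}
```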

Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-22 Thread Reynold Xin
Actually a non trivial fraction of users / customers I interact with still use 
very old Hive metastores. Because it’s very difficult to upgrade Hive metastore 
wholesale (it’d require all the production jobs that access the same metastore 
be upgraded at once). This is even harder than JVM upgrade which can be done on 
a per job basis, or OS upgrade that can be done on a per machine basis.

Is there high maintenance cost with keeping these? My understanding is that 
Michael did a good job initially with classloader isolation and modular design 
that they are very easy to maintain.

> On Jan 22, 2019, at 11:13 PM, Hyukjin Kwon  wrote:
> 
> Yea, I was thinking about that too. They are too old to keep. +1 for removing 
> them out.
> 
> 2019년 1월 23일 (수) 오전 11:30, Dongjoon Hyun 님이 작성:
>> Hi, All.
>> 
>> Currently, Apache Spark supports Hive Metastore(HMS) 0.12 ~ 2.3.
>> Among them, HMS 0.x releases look very old since we are in 2019.
>> If these are not used in the production any more, can we drop HMS 0.x 
>> supports in 3.0.0?
>> 
>> hive-0.12.0 2013-10-10
>> hive-0.13.0 2014-04-15
>> hive-0.13.1 2014-11-16
>> hive-0.14.0 2014-11-16
>> ( https://archive.apache.org/dist/hive/ )
>> 
>> In addition, if there is someone who is still using these HMS versions and 
>> has a plan to install and use Spark 3.0.0 with these HMS versions, could you 
>> reply this email thread? If there is a reason, that would be very helpful 
>> for me.
>> 
>> Thanks,
>> Dongjoon.


Re: Removing old HiveMetastore(0.12~0.14) from Spark 3.0.0?

2019-01-23 Thread Reynold Xin
It is not even an old “cluster”. It is a central metastore shared by
multiple clusters.

On Wed, Jan 23, 2019 at 10:04 AM Dongjoon Hyun 
wrote:

> Got it. Thank you for sharing that, Reynold.
>
> So, you mean they will use `Apache Spark 3.0.0` on the old clusters with
> Hive 0.x, right?
>
> If that happens actually, no problem to keep them.
>
> Bests,
> Dongjoon.
>
>
> On Tue, Jan 22, 2019 at 11:49 PM Xiao Li  wrote:
>
>> Based on my experience in development of Spark SQL, the maintenance cost
>> is very small for supporting different versions of Hive metastore. Feel
>> free to ping me if we hit any issue about it.
>>
>> Cheers,
>>
>> Xiao
>>
>> Reynold Xin  于2019年1月22日周二 下午11:18写道:
>>
>>> Actually a non trivial fraction of users / customers I interact with
>>> still use very old Hive metastores. Because it’s very difficult to upgrade
>>> Hive metastore wholesale (it’d require all the production jobs that access
>>> the same metastore be upgraded at once). This is even harder than JVM
>>> upgrade which can be done on a per job basis, or OS upgrade that can be
>>> done on a per machine basis.
>>>
>>> Is there high maintenance cost with keeping these? My understanding is
>>> that Michael did a good job initially with classloader isolation and
>>> modular design that they are very easy to maintain.
>>>
>>> On Jan 22, 2019, at 11:13 PM, Hyukjin Kwon  wrote:
>>>
>>> Yea, I was thinking about that too. They are too old to keep. +1 for
>>> removing them out.
>>>
>>> 2019년 1월 23일 (수) 오전 11:30, Dongjoon Hyun 님이 작성:
>>>
>>>> Hi, All.
>>>>
>>>> Currently, Apache Spark supports Hive Metastore(HMS) 0.12 ~ 2.3.
>>>> Among them, HMS 0.x releases look very old since we are in 2019.
>>>> If these are not used in the production any more, can we drop HMS 0.x
>>>> supports in 3.0.0?
>>>>
>>>> hive-0.12.0 2013-10-10
>>>> hive-0.13.0 2014-04-15
>>>> hive-0.13.1 2014-11-16
>>>> hive-0.14.0 2014-11-16
>>>> ( https://archive.apache.org/dist/hive/ )
>>>>
>>>> In addition, if there is someone who is still using these HMS versions
>>>> and has a plan to install and use Spark 3.0.0 with these HMS versions,
>>>> could you reply this email thread? If there is a reason, that would be very
>>>> helpful for me.
>>>>
>>>> Thanks,
>>>> Dongjoon.
>>>>
>>>
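
For reference, reaching an old metastore is purely a configuration matter thanks to the isolated-classloader design mentioned above. A sketch follows; the version and jar source shown are illustrative, and the supported values are listed in the SQL programming guide.

```
import org.apache.spark.sql.SparkSession

// Point a session at an older Hive metastore; Spark loads the matching Hive
// client classes in an isolated classloader, separate from the built-in one.
val spark = SparkSession.builder()
  .appName("old-metastore")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "0.13.1")  // illustrative version
  .config("spark.sql.hive.metastore.jars", "maven")      // or a classpath of matching Hive jars
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```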


Re: [PySpark] Revisiting PySpark type annotations

2019-01-25 Thread Reynold Xin
If we can make the annotations compatible with Python 2, why don’t we add
type annotations to make life easier for users of Python 3 (with typing)?

On Fri, Jan 25, 2019 at 7:53 AM Maciej Szymkiewicz 
wrote:

>
> Hello everyone,
>
> I'd like to revisit the topic of adding PySpark type annotations in 3.0.
> It has been discussed before (
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-friendly-API-for-Spark-3-0-td25016.html
> and
> http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-PySpark-typing-hints-td21560.html)
> and is tracked by SPARK-17333 (
> https://issues.apache.org/jira/browse/SPARK-17333). Is there any
> consensus here?
>
> In the spirit of full disclosure, I am trying to decide whether, and if so to
> what extent, to migrate my stub package (
> https://github.com/zero323/pyspark-stubs) to 3.0 and beyond. Maintaining
> such a package is relatively time consuming (not being an active PySpark user
> anymore, it is the lowest priority for me at the moment), and if there are any
> official plans to make it obsolete, that would be valuable information for
> me.
>
> If there are no plans to add native annotations to PySpark, I'd like to
> use this opportunity to ask PySpark committers to drop by and open an issue (
> https://github.com/zero323/pyspark-stubs/issues) when new methods are
> introduced, or there are changes in the existing API (PRs are of course
> welcome as well). Thanks in advance.
>
> --
> Best,
> Maciej
>
>
>


Re: Make .unpersist() non-blocking by default?

2019-01-28 Thread Reynold Xin
Seems to make sense to have it false by default.

(I agree this deserves a dev list mention though even if there is easy
consensus). We should make sure we mark the Jira with release notes so we
can add it to the upgrade guide.

On Mon, Jan 28, 2019 at 8:47 AM Sean Owen  wrote:

> Interesting notion at https://github.com/apache/spark/pull/23650 :
>
> .unpersist() takes an optional 'blocking' argument. If true, the call
> waits until the resource is freed. Otherwise it doesn't.
>
> The default looks pretty inconsistent:
> - RDD: true
> - Broadcast: true
> - Dataset / DataFrame: false
> - Graph (in GraphX): false
> - Pyspark RDD: (no option)
> - Pyspark Broadcast: false
> - Pyspark DataFrame: false
>
> I think false is a better default, as I'd expect it's much more likely
> that the caller doesn't want to wait around while resources are freed,
> especially as this happens on the driver. The possible downside is
> that if the resources don't free up quickly, other operations might
> not have as much memory available as they otherwise might have.
>
> What about making the default false everywhere for Spark 3?
> I raised it to dev@ just because that seems like a nontrivial behavior
> change, but maybe it isn't controversial.
>
> Sean
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
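
For reference, a sketch of the API surface under discussion, with the 2.4-era defaults noted in comments (assumes a spark-shell, where `sc` and `spark` are already defined):

```
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)
rdd.count()
rdd.unpersist(blocking = false)  // RDD default is blocking = true

val df = spark.range(1000).cache()
df.count()
df.unpersist()                   // Dataset default is already non-blocking
```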


Re: [build system] speeding up maven build building only changed modules compared to master branch

2019-01-28 Thread Reynold Xin
This might be useful to do.

BTW, based on my experience with different build systems in the past few years 
(extensively SBT/Maven/Bazel, and to a less extent Gradle/Cargo), I think the 
longer term solution is to move to Bazel. It is so much easier to understand 
and use, and also much more feature rich with great support for multiple 
languages. It also supports distributed/local build cache so builds can be much 
faster.

On Mon, Jan 28, 2019 at 1:28 AM, Gabor Somogyi < gabor.g.somo...@gmail.com > 
wrote:

> 
> Do you have some numbers on how much faster this is? I'm asking because
> previously I've evaluated another plugin and found the following:
> - Incremental build didn't bring too much even in bigger than spark
> projects
> - Incremental test was buggy and sometimes the required tests were not
> executed which caused several issues
> All in all a single tiny little bug in the incremental test could cause
> horror for developers so it must be rock solid.
> Is this project used somewhere in production?
> 
> On Sat, Jan 26, 2019 at 4:03 PM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> Sounds interesting; would it be able to handle R and Python modules built
>> by this project ? The home grown solution here does I think and that is
>> helpful. 
>> 
>> On Sat, Jan 26, 2019, 6:57 AM vaclavkosar < admin@ vaclavkosar. com (
>> ad...@vaclavkosar.com ) wrote:
>> 
>> 
>>> 
>>> 
>>> I think it would be a good idea to use the gitflow-incremental-builder Maven
>>> plugin for Spark builds. It saves resources by building only the modules that
>>> are impacted by changes relative to the git master branch. For example, if a
>>> change only touches one of the files of spark-avro_2.11, then only that Maven
>>> module and its Maven dependencies and dependents would be built or tested.
>>> If there are no disagreements, I can submit a pull request for that.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Project URL: https://github.com/vackosar/gitflow-incremental-builder
>>> 
>>> 
>>> 
>>> Disclaimer: I originally created the project, but most of the recent
>>> improvements and maintenance were developed by Falko.
>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: merge script stopped working; Python 2/3 input() issue?

2019-02-15 Thread Reynold Xin
lol

On Fri, Feb 15, 2019 at 4:02 PM, Marcelo Vanzin < van...@cloudera.com.invalid > 
wrote:

> 
> 
> 
> You're talking about the spark-website script, right? The main repo's
> script has been working for me, the website one is broken.
> 
> 
> 
> I think it was caused by this dude changing raw_input to input recently:
> 
> 
> 
> commit 8b6e7dceaf5d73de3f92907ceeab8925a2586685
> Author: Sean Owen < sean.o...@databricks.com >
> Date:   Sat Jan 19 19:02:30 2019 -0600
> 
> 
> 
> More minor style fixes for merge script
> 
> 
> 
> On Fri, Feb 15, 2019 at 3:55 PM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> 
>> 
>> I'm seriously confused on this one. The spark-website merge script just
>> stopped working for me. It fails on the call to input() that expects a y/n
>> response, saying 'y' isn't defined.
>> 
>> 
>> 
>> Indeed, it seems like Python 2's input() tries to evaluate the input,
>> rather than return a string. Python 3 input() returns as a string, as does
>> Python 2's raw_input().
>> 
>> 
>> 
>> But the script clearly requires Python 2 as it imports urllib2, and my
>> local "python" is Python 2.
>> 
>> 
>> 
>> And nothing has changed recently and this has worked for a long time. The
>> main spark merge script does the same.
>> 
>> 
>> 
>> How on earth has this worked?
>> 
>> 
>> 
>> I could replace input() with raw_input(), or just go ahead and fix the
>> merge scripts to work with / require Python 3. But am I missing something
>> basic?
>> 
>> 
>> 
>> If not, which change would people be OK with?
>> 
>> 
>> 
>> Sean
>> 
>> 
>> 
>> - To
>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>> dev-unsubscr...@spark.apache.org )
>> 
>> 
> 
> 
> 
> --
> Marcelo
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: [DISCUSS] SPIP: Relational Cache

2019-02-24 Thread Reynold Xin
How is this different from materialized views?

On Sun, Feb 24, 2019 at 3:44 PM Daoyuan Wang  wrote:

> Hi everyone,
>
> We'd like to discuss our proposal of Spark relational cache in this
> thread. Spark has a native command for RDD caching, but the use of the CACHE
> command in Spark SQL is limited: we cannot use the cache across sessions,
> and we have to rewrite queries ourselves to make use of an existing
> cache.
> To resolve this, we have done some initial work on the following:
>
>  1. allow users to persist the cache on HDFS in Parquet format.
>  2. rewrite user queries in Catalyst to utilize any existing cache (on
> HDFS, or defined as in-memory in the current session) if possible.
>
> I have created a jira ticket(
> https://issues.apache.org/jira/browse/SPARK-26764) for this and attached
> an official SPIP document.
>
> Thanks for taking a look at the proposal.
>
> Best Regards,
> Daoyuan
>


Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-02-24 Thread Reynold Xin
The challenge with the Scala/Java API in the past is that when there are 
multipe parameters, it'd lead to an explosion of function overloads. 

On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung < felixcheun...@hotmail.com > 
wrote:

> 
> I hear three topics in this thread
> 
> 
> 1. I don’t think we should remove string. Column and string can both be
> “type safe”. And I would agree we don’t *need* to break API compatibility
> here.
> 
> 
> 2. Gaps in python API. Extending on #1, definitely we should be consistent
> and add string as param where it is missed.
> 
> 
> 3. Scala API for string - hard to say but make sense if nothing but for
> consistency. Though I can also see the argument of Column only in Scala.
> String might be more natural in python and much less significant in Scala
> because of $”foo” notation.
> 
> 
> (My 2 c)
> 
> 
> 
>  
> *From:* Sean Owen < srowen@ gmail. com ( sro...@gmail.com ) >
> *Sent:* Sunday, February 24, 2019 6:59 AM
> *To:* André Mello
> *Cc:* dev
> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
> functions
>  
> I just commented on the PR -- I personally don't think it's worth
> removing support for, say, max("foo") over max(col("foo")) or
> max($"foo") in Scala. We can make breaking changes in Spark 3 but this
> seems like it would unnecessarily break a lot of code. The string arg
> is more concise in Python and I can't think of cases where it's
> particularly ambiguous or confusing; on the contrary it's more natural
> coming from SQL.
> 
> What we do have are inconsistencies and errors in support of string vs
> Column as fixed in the PR. I was surprised to see that
> df.select(abs("col")) throws an error while df.select(sqrt("col"))
> doesn't. I think that's easy to fix on the Python side. Really I think
> the question is: do we need to add methods like "def abs(String)" and
> more in Scala? that would remain inconsistent even if the Pyspark side
> is fixed.
> 
> On Sun, Feb 24, 2019 at 8:54 AM André Mello < asmello. br@ gmail. com (
> asmello...@gmail.com ) > wrote:
> >
> > # Context
> >
> > This comes from [SPARK-26979], which became PR #23879 and then PR
> > #23882. The following reflects all the findings made so far.
> >
> > # Description
> >
> > Currently, in the Scala API, some SQL functions have two overloads,
> > one taking a string that names the column to be operated on, the other
> > taking a proper Column object. This allows for two patterns of calling
> > these functions, which is a source of inconsistency and generates
> > confusion for new users, since it is hard to predict which functions
> > will take a column name or not.
> >
> > The PySpark API partially solves this problem by internally converting
> > the argument to a Column object prior to passing it through to the
> > underlying JVM implementation. This allows for a consistent use of
> > name literals across the API, except for a few violations:
> >
> > - lower()
> > - upper()
> > - abs()
> > - bitwiseNOT()
> > - ltrim()
> > - rtrim()
> > - trim()
> > - ascii()
> > - base64()
> > - unbase64()
> >
> > These violations happen because for a subset of the SQL functions,
> > PySpark uses a functional mechanism (`_create_function`) to directly
> > call the underlying JVM equivalent by name, thus skipping the
> > conversion step. In most cases the column name pattern still works
> > because the Scala API has its own support for string arguments, but
> > the aforementioned functions are also exceptions there.
> >
> > My proposal was to solve this problem by adding the string support
> > where it was missing in the PySpark API. Since this is a purely
> > additive change, it doesn't break past code. Additionally, I find the
> > API sugar to be a positive feature, since code like `max("foo")` is
> > more concise and readable than `max(col("foo"))`. It adheres to the
> > DRY philosophy and is consistent with Python's preference for
> > readability over type protection.
> >
> > However, upon submission of the PR, a discussion was started about
> > whether it wouldn't be better to entirely deprecate string support
> > instead - in particular with major release 3.0 in mind. The reasoning,
> > as I understood it, was that this approach is more explicit and type
> > safe, which is preferred in Java/Scala, plus it reduces the API
> > surface area - and the Python API should be consistent with the others
> > as well.
> >
> > Upon request by @HyukjinKwon I'm submitting this matter for discussion
> > by this mailing list.
> >
> > # Summary
> >
> > There is a problem with inconsistency in the Scala/Python SQL API,
> > where sometimes you can use a column name string as a proxy, and
> > sometimes you have to use a proper Column object. To solve it there
> > are two approaches - to remove the string support entirely, or to add
> > it where it is missing. Which approach is best?
> >
> > Hope this is clear.
> >
> > -- André.
> >
> > ---
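
To make the inconsistency concrete on the Scala side, a small sketch against the 2.4-era functions object (paste into spark-shell):

```
import org.apache.spark.sql.functions.{col, lower, max}
import spark.implicits._

val df = Seq(("a", 1), ("B", 2)).toDF("s", "n")

df.select(max("n")).show()         // compiles: max has a column-name overload
df.select(max(col("n"))).show()    // compiles: max(e: Column)
df.select(lower(col("s"))).show()  // compiles: lower(e: Column)
// df.select(lower("s")).show()    // does NOT compile: lower has no String overload
```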

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-26 Thread Reynold Xin
We will have to fix that before we declare dev2 is stable, because
InternalRow is not a stable API. We don’t necessarily need to do it in 3.0.

On Tue, Feb 26, 2019 at 5:10 PM Matt Cheah  wrote:

> Will that then require an API break down the line? Do we save that for
> Spark 4?
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Tuesday, February 26, 2019 at 4:53 PM
> *To: *Matt Cheah 
> *Cc: *Sean Owen , Wenchen Fan ,
> Xiao Li , Matei Zaharia ,
> Spark Dev List 
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> That's a good question.
>
>
>
> While I'd love to have a solution for that, I don't think it is a good
> idea to delay DSv2 until we have one. That is going to require a lot of
> internal changes and I don't see how we could make the release date if we
> are including an InternalRow replacement.
>
>
>
> On Tue, Feb 26, 2019 at 4:41 PM Matt Cheah  wrote:
>
> Reynold made a note earlier about a proper Row API that isn’t InternalRow
> – is that still on the table?
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"rb...@netflix.com" 
> *Date: *Tuesday, February 26, 2019 at 4:40 PM
> *To: *Matt Cheah 
> *Cc: *Sean Owen , Wenchen Fan ,
> Xiao Li , Matei Zaharia ,
> Spark Dev List 
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> Thanks for bumping this, Matt. I think we can have the discussion here to
> clarify exactly what we’re committing to and then have a vote thread once
> we’re agreed.
>
> Getting back to the DSv2 discussion, I think we have a good handle on what
> would be added:
>
> · Plugin system for catalogs
>
> · TableCatalog interface (I’ll start a vote thread for this SPIP
> shortly)
>
> · TableCatalog implementation backed by SessionCatalog that can
> load v2 tables
>
> · Resolution rule to load v2 tables using the new catalog
>
> · CTAS logical and physical plan nodes
>
> · Conversions from SQL parsed logical plans to v2 logical plans
>
> Initially, this will always use the v2 catalog backed by SessionCatalog to
> avoid dependence on the multi-catalog work. All of those are already
> implemented and working, so I think it is reasonable that we can get them
> in.
>
> Then we can consider a few stretch goals:
>
> · Get in as much DDL as we can. I think create and drop table
> should be easy.
>
> · Multi-catalog identifier parsing and multi-catalog support
>
> If we get those last two in, it would be great. We can make the call
> closer to release time. Does anyone want to change this set of work?
>
>
>
> On Tue, Feb 26, 2019 at 4:23 PM Matt Cheah  wrote:
>
> What would then be the next steps we'd take to collectively decide on
> plans and timelines moving forward? Might I suggest scheduling a conference
> call with appropriate PMCs to put our ideas together? Maybe such a
> discussion can take place at next week's meeting? Or do we need to have a
> separate formalized voting thread which is guided by a PMC?
>
> My suggestion is to try to make concrete steps forward and to avoid
> letting this slip through the cracks.
>
> I also think there would be merits to having a project plan and estimates
> around how long each of the features we want to complete is going to take
> to implement and review.
>
> -Matt Cheah
>
> On 2/24/19, 3:05 PM, "Sean Owen"  wrote:
>
> Sure, I don't read anyone making these statements though? Let's assume
> good intent, that "foo should happen" as "my opinion as a member of
> the community, which is not solely up to me, is that foo should
> happen". I understand it's possible for a person to make their opinion
> over-weighted; this whole style of decision making assumes good actors
> and doesn't optimize against bad ones. Not that it can't happen, just
> not seeing it here.
>
> I have never seen any vote on a feature list, by a PMC or otherwise.
> We can do that if really needed I guess. But that also isn't the
> authoritative process in play here, in contrast.
>
> If there's not a more specific subtext or issue here, which is fine to
> say (on private@ if it's sensitive or something), yes, let's move on
> in good faith.
>
> On Sun, Feb 24, 2019 at 3:45 PM Mark Hamstra 
> wrote:
> > There is nothing wrong with individuals advocating for what they
> think should or should not be in Spark 3.0, nor should anyone shy away from
> explaining why they think delaying the release for some reason is or isn't
> a good idea. What is a problem, or is at least something that I have a
> problem with, are declarative, pseudo-authoritative statements that 3.0 (or
> some other release) will or won't contain some feature, API, etc. or that
> some issue is or is not blocker or worth delaying for. When the PMC has not
> voted on such issues, I'm often left thinking, "Wait... what? Who decided
> that, or where did that decision come from?"
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>

Re: CombinePerKey and GroupByKey

2019-02-28 Thread Reynold Xin
This should be fine. Dataset.groupByKey is a logical operation, not a
physical one (as in Spark wouldn’t always materialize all the groups in
memory).

On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot 
wrote:

> Hi all,
>
> I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no
> longer available in the Dataset API. So, I translated it to:
>
>
> KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
>     keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());
>
> Dataset<Tuple2<K, OutputT>> combinedDataset =
>     groupedDataset.agg(
>         new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());
>
>
> I have a question regarding performance: as GroupByKey is generally
> less performant (it entails a shuffle and possible OOM if a given key has a lot
> of data associated with it), I was wondering if the new Spark optimizer
> translates such a DAG into a combinePerKey behind the scenes. In other
> words, is such a DAG going to be translated to a local (or partial, whichever
> terminology you use) combine and then a global combine to avoid
> shuffle?
>
> Thanks
>
> Etienne
>
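
A self-contained sketch of that pattern, with simplified, hypothetical types, makes it easy to check: explain() shows a partial aggregate below the Exchange and a final aggregate above it, so the combine is not done purely after the shuffle (paste into spark-shell):

```
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import spark.implicits._

case class Event(key: String, value: Long)

// A typed sum aggregator, playing the role of the combineFn above.
val sumAgg = new Aggregator[Event, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, e: Event): Long = acc + e.value
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val ds = Seq(Event("a", 1L), Event("a", 2L), Event("b", 3L)).toDS()
val combined = ds.groupByKey(_.key).agg(sumAgg.toColumn)
combined.explain()  // partial and final aggregate nodes around the Exchange
combined.show()
```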


Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Reynold Xin
Ryan - can you take the public user facing API part out of that SPIP?

In general it'd be better to have the SPIPs be higher level, and put the 
detailed APIs in a separate doc. Alternatively, put them in the SPIP but 
explicitly vote on the high level stuff and not the detailed APIs. 

I don't want to get to a situation in which two months later the identical APIs 
were committed with the justification that they were voted on a while ago. In 
this case, it's even more serious because while I think we all have consensus 
on the higher level internal API, not much discussion has happened with the 
user-facing API and we should just leave that out explicitly.

On Fri, Mar 01, 2019 at 1:00 PM, Anthony Young-Garner < 
anthony.young-gar...@cloudera.com.invalid > wrote:

> 
> +1 (non-binding)
> 
> 
> On Thu, Feb 28, 2019 at 5:54 PM John Zhuge < jzhuge@ apache. org (
> jzh...@apache.org ) > wrote:
> 
> 
>> +1 (non-binding)
>> 
>> 
>> On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah < mcheah@ palantir. com (
>> mch...@palantir.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> +1 (non-binding)
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> *From:* Jamison Bennett < jamison. bennett@ cloudera. com. INVALID (
>>> jamison.benn...@cloudera.com.INVALID ) >
>>> *Date:* Thursday, February 28, 2019 at 8:28 AM
>>> *To:* Ryan Blue < rblue@ netflix. com ( rb...@netflix.com ) >, Spark Dev
>>> List < dev@ spark. apache. org ( dev@spark.apache.org ) >
>>> *Subject:* Re: [VOTE] SPIP: Spark API for Table Metadata
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> +1 (non-binding)
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> *Jamison Bennett*
>>> 
>>> 
>>> 
>>> Cloudera Software Engineer
>>> 
>>> 
>>> 
>>> jamison. bennett@ cloudera. com ( jamison.benn...@cloudera.com )
>>> 
>>> 
>>> 
>>> 515 Congress Ave, Suite 1212  |   Austin, TX  |   78701
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> On Thu, Feb 28, 2019 at 10:20 AM Ryan Blue < rblue@ netflix. com. invalid (
>>> rb...@netflix.com.invalid ) > wrote:
>>> 
>>> 
>>> 
 
 
 +1 (non-binding)
 
 
 
 
  
 
 
 
 On Wed, Feb 27, 2019 at 8:34 PM Russell Spitzer < russell. spitzer@ gmail.
 com ( russell.spit...@gmail.com ) > wrote:
 
 
 
> 
> 
> +1 (non-binding) 
> 
> 
> 
> On Wed, Feb 27, 2019, 6:28 PM Ryan Blue < rblue@ netflix. com. invalid (
> rb...@netflix.com.invalid ) > wrote:
> 
> 
> 
>> 
>> 
>> Hi everyone,
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> In the last DSv2 sync, the consensus was that the table metadata SPIP was
>> ready to bring up for a vote. Now that the multi-catalog identifier SPIP
>> vote has passed, I'd like to start one for the table metadata API,
>> TableCatalog.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The proposal is for adding a TableCatalog interface that will be used by
>> v2 plans. That interface has methods to load, create, drop, alter,
>> refresh, rename, and check existence for tables. It also specifies the 
>> set
>> of metadata used to configure tables: schema, partitioning, and key-value
>> properties. For more information, please read the SPIP proposal doc 
>> [docs.
>> google. com] (
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.google.com_document_d_1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo_edit-23heading-3Dh.m45webtwxf2d&d=DwMFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=hzwIMNQ9E99EMYGuqHI0kXhVbvX3nU3OSDadUnJxjAs&m=JmgvL6ffL9tyoLWWZtWujDe9FNiSguMApA53YK9NTP8&s=eSx5nMZvdB5hS9VepuvvFZFXjTCrdde-AdzkHC5jRYk&e=
>> ).
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Please vote in the next 3 days.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> [ ] +1: Accept the proposal as an official SPIP
>> 
>> 
>> 
>> 
>> [ ] +0
>> 
>> 
>> 
>> 
>> [ ] -1: I don't think this is a good idea because ...
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> --
>> 
>> 
>> 
>> Ryan Blue
>> 
>> 
>> 
>> Software Engineer
>> 
>> 
>> 
>> 
>> Netflix
>> 
>> 
>> 
> 
> 
 
 
 
 
 
 
 
 
  
 
 
 
 
 --
 
 
 
 Ryan Blue
 
 
 
 Software Engineer
 
 
 
 
 Netflix
 
 
 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> --
>> John Zhuge
>> 
> 
>

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-03-01 Thread Reynold Xin
Thanks Ryan. +1.

On Fri, Mar 01, 2019 at 5:33 PM, Ryan Blue < rb...@netflix.com > wrote:

> 
> Actually, I went ahead and removed the confusing section. There is no
> public API in the doc now, so that it is clear that it isn't a relevant
> part of this vote.
> 
> On Fri, Mar 1, 2019 at 4:58 PM Ryan Blue < rblue@ netflix. com (
> rb...@netflix.com ) > wrote:
> 
> 
>> I moved the public API to the "Implementation Sketch" section. That API is
>> not an important part of this, as that section notes.
>> 
>> 
>> I completely agree that SPIPs should be high-level and that the specifics,
>> like method names, are not hard requirements. The proposal was more of a
>> sketch, but I was asked by Xiao in the DSv2 sync to make sure the list of
>> methods was complete. I think as long as we have agreement that the intent
>> is not to make the exact names binding, we should be okay.
>> 
>> 
>> I can remove the user-facing API sketch, but I'd prefer to leave it in the
>> sketch section so we have it documented somewhere.
>> 
>> On Fri, Mar 1, 2019 at 4:51 PM Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> Ryan - can you take the public user facing API part out of that SPIP?
>>> 
>>> 
>>> 
>>> In general it'd be better to have the SPIPs be higher level, and put the
>>> detailed APIs in a separate doc. Alternatively, put them in the SPIP but
>>> explicitly vote on the high level stuff and not the detailed APIs. 
>>> 
>>> 
>>> 
>>> I don't want to get to a situation in which two months later the identical
>>> APIs were committed with the justification that they were voted on a while
>>> ago. In this case, it's even more serious because while I think we all
>>> have consensus on the higher level internal API, not much discussion has
>>> happened with the user-facing API and we should just leave that out
>>> explicitly.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Mar 01, 2019 at 1:00 PM, Anthony Young-Garner < anthony. 
>>> young-garner@
>>> cloudera. com. invalid ( anthony.young-gar...@cloudera.com.invalid ) > 
>>> wrote:
>>> 
>>> 
>>>> +1 (non-binding)
>>>> 
>>>> 
>>>> On Thu, Feb 28, 2019 at 5:54 PM John Zhuge < jzhuge@ apache. org (
>>>> jzh...@apache.org ) > wrote:
>>>> 
>>>> 
>>>>> +1 (non-binding)
>>>>> 
>>>>> 
>>>>> On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah < mcheah@ palantir. com (
>>>>> mch...@palantir.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> +1 (non-binding)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> *From:* Jamison Bennett < jamison. bennett@ cloudera. com. INVALID (
>>>>>> jamison.benn...@cloudera.com.INVALID ) >
>>>>>> *Date:* Thursday, February 28, 2019 at 8:28 AM
>>>>>> *To:* Ryan Blue < rblue@ netflix. com ( rb...@netflix.com ) >, Spark Dev
>>>>>> List < dev@ spark. apache. org ( dev@spark.apache.org ) >
>>>>>> *Subject:* Re: [VOTE] SPIP: Spark API for Table Metadata
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>  
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> +1 (non-binding)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> *Jamison Bennett*
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Cloudera Software Engineer
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> jamison. bennett@ cloudera. com ( jamison.benn...@cloudera.com )
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 515 Congress Ave, Suite 1212  |   Austin, TX  |   78701
>>>>>> 
>>>>>> 
>>>>>> 
>>>

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

2019-03-06 Thread Reynold Xin
I think the general philosophy here should be that Python is the most liberal
and supports a Column object or a literal value. It's also super useful to
support column names, but we need to decide what happens for a string argument: is
a string that is passed in a literal string value, or a column name?

On Mon, Mar 04, 2019 at 6:00 AM, André Mello < ame...@palantir.com > wrote:

> 
> 
> 
> Hey everyone,
> 
> 
> 
>  
> 
> 
> 
> Progress has been made with PR #23882 (
> https://github.com/apache/spark/pull/23882 ) , and it is now in a state
> where it could be merged with master.
> 
> 
> 
>  
> 
> 
> 
> This is what we’re doing for now:
> 
> * PySpark *will* support strings consistently throughout its API.
>   * Arguably string support makes syntax closer to SQL and Scala, where you
>     can use similar shorthands to specify columns, and the general direction
>     of the PySpark API has been to be consistent with those other two;
>   * This is a small, additive change that will not break anything;
>   * The reason support was not there in the first place was because the code
>     that generated functions was originally designed for aggregators, which
>     all support column names, but it was being used for other functions (e.g.
>     lower, abs) that did not, so it seems like it was not intentional.
> 
> 
> 
> 
> 
> 
> We are NOT going to:
> 
> * Make any code changes in Scala;
>   * This requires first deciding if string support is desirable or not;
> * Decide whether or not strings should be supported in the Scala API;
>   * This requires a larger discussion and the above changes are independent
>     of this;
> * Make PySpark support Column objects where it currently only supports
>   strings (e.g. the multi-argument version of drop());
>   * Converting from Column to column name is not something the API does
>     right now, so this is a stronger change;
>   * This can be considered separately.
> * Do anything with R for now.
>   * Anyone is free to take on this, but I have no experience with R.
> 
> 
>  
> 
> 
> 
> If you folks agree with this, let us know, so we can move forward with the
> merge.
> 
> 
> 
>  
> 
> 
> 
> Best.
> 
> 
> 
>  
> 
> 
> 
> -- André.
> 
> 
> 
>  
> 
> 
> 
> *From:* Reynold Xin < r...@databricks.com >
> *Date:* Monday, 25 February 2019 at 00:49
> *To:* Felix Cheung < felixcheun...@hotmail.com >
> *Cc:* dev < dev@spark.apache.org >, Sean Owen < sro...@gmail.com >, André
> Mello < asmello...@gmail.com >
> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
> functions
> 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> The challenge with the Scala/Java API in the past is that when there are
> multipe parameters, it'd lead to an explosion of function overloads. 
> 
> 
> 
> 
>  
> 
> 
> 
> 
>  
> 
> 
> 
> On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung < felixcheun...@hotmail.com >
> wrote:
> 
> 
>> 
>> 
>> I hear three topics in this thread
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> 1. I don’t think we should remove string. Column and string can both be
>> “type safe”. And I would agree we don’t *need* to break API compatibility
>> here.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> 2. Gaps in python API. Extending on #1, definitely we should be consistent
>> and add string as param where it is missed.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> 3. Scala API for string - hard to say but make sense if nothing but for
>> consistency. Though I can also see the argument of Column only in Scala.
>> String might be more natural in python and much less significant in Scala
>> because of $”foo” notation.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> (My 2 c)
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> 
>> *From:* Sean Owen < sro...@gmail.com >
>> *Sent:* Sunday, February 24, 2019 6:59 AM
>> *To:* André Mello
>> *Cc:* dev
>> *Subject:* Re: [DISCUSS][SQL][PySpark] Column name support for SQL
>> functions
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> I just commented on the PR -- I personally don't think it's worth
>> removing support for, say, max("foo") over max(col("foo")) or
>> max($"foo"

Re: Hive Hash in Spark

2019-03-06 Thread Reynold Xin
I think they might be used in bucketing? Not 100% sure.

On Wed, Mar 06, 2019 at 1:40 PM, < tcon...@gmail.com > wrote:

> 
> 
> 
> Hi,
> 
> 
> 
>  
> 
> 
> 
> I noticed the existence of a Hive Hash partitioning implementation in
> Spark, but also noticed that it’s not being used, and that the Spark hash
> partitioning function is presently hardcoded to Murmur3. My question is
> whether Hive Hash is dead code or whether there are future plans to support
> reading and understanding data that has been partitioned using Hive Hash?
> By understanding, I mean that I’m able to avoid a full shuffle join on
> Table A (partitioned by Hive Hash) when joining with a Table B that I can
> shuffle via Hive Hash to Table A.
> 
> 
> 
>  
> 
> 
> 
> Thank you,
> 
> 
> 
> Tyson
> 
> 
>

Re: [SQL] hash: 64-bits and seeding

2019-03-06 Thread Reynold Xin
Rather than calling it hash64, it'd be better to just call it xxhash64. The
reason being that ten years from now, we would probably look back and laugh at a
specific hash implementation. It'd be better to just name the expression what
it is.

On Wed, Mar 06, 2019 at 7:59 PM, < huon.wil...@data61.csiro.au > wrote:

> 
> 
> 
> Hi,
> 
> 
> 
> I’m working on something that requires deterministic randomness, i.e. a
> row gets the same “random” value no matter the order of the DataFrame. A
> seeded hash seems to be the perfect way to do this, but the existing
> hashes have various limitations:
> 
> 
> 
> - hash: 32-bit output (only 4 billion possibilities will result in a lot
> of collisions for many tables: the birthday paradox implies >50% chance of
> at least one for tables larger than 77000 rows, and likely ~1.6 billion
> collisions in a table of size 4 billion)
> - sha1/sha2/md5: single binary column input, string output
> 
> 
> 
> It seems there’s already support for a 64-bit hash function that can work
> with an arbitrary number of arbitrary-typed columns (XxHash64), and
> exposing this for DataFrames seems like it’s essentially one line in
> sql/functions.scala to match `hash` (plus docs, tests, function registry
> etc.):
> 
> 
> 
> def hash64(cols: Column*): Column = withExpr { new
> XxHash64(cols.map(_.expr)) }
> 
> 
> 
> For my use case, this can then be used to get a 64-bit “random” column
> like
> 
> 
> 
> val seed = rng.nextLong()
> hash64(lit(seed), col1, col2)
> 
> 
> 
> I’ve created a (hopefully) complete patch by mimicking ‘hash’ at
> https://github.com/apache/spark/compare/master...huonw:hash64 ; should I
> open a JIRA and submit it as a pull request?
> 
> 
> 
> Additionally, both hash and the new hash64 already have support for being
> seeded, but this isn’t exposed directly and instead requires something
> like the `lit` above. Would it make sense to add overloads like the
> following?
> 
> 
> 
> def hash(seed: Int, cols: Columns*) = …
> def hash64(seed: Long, cols: Columns*) = …
> 
> 
> 
> Though, it does seem a bit unfortunate to be forced to pass the seed
> first.
> 
> 
> 
> (I sent this email to u...@spark.apache.org a
> few days ago, but didn't get any discussion about the Spark aspects of
> this, so I'm resending it here; I apologise in advance if I'm breaking
> protocol!)
> 
> 
> 
> - Huon Wilson
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>
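
For the record, the expression discussed here was later exposed under exactly that name, xxhash64, in org.apache.spark.sql.functions (Spark 3.0+). A sketch of the seeded usage described above, with an illustrative seed value (paste into a Spark 3.x spark-shell):

```
import org.apache.spark.sql.functions.{col, lit, xxhash64}
import spark.implicits._

val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "id")

// Mixing a per-run seed in as an extra column gives a deterministic 64-bit
// "random" value per row, independent of row order.
val seed = 42L
df.withColumn("rand64", xxhash64(lit(seed), col("name"), col("id"))).show(false)
```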

Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Reynold Xin
If you use UDFs in Python, you would want to use Pandas UDF for better
performance.

On Mon, Mar 11, 2019 at 7:50 PM Jonathan Winandy 
wrote:

> Thanks, I didn't know!
>
> That being said, any UDF use seems to badly affect code generation (and
> performance).
>
>
> On Mon, 11 Mar 2019, 15:13 Dylan Guedes,  wrote:
>
>> Btw, even if you are using Python you can register your UDFs in Scala and
>> use them in Python.
>>
>> On Mon, Mar 11, 2019 at 6:55 AM Jonathan Winandy <
>> jonathan.wina...@gmail.com> wrote:
>>
>>> Hello Snehasish
>>>
>>> If you are not using UDFs, you will have very similar performance with
>>> those languages on SQL.
>>>
>>> So it go down to :
>>> * if you know python, go for python.
>>> * if you are used to the JVM, and are ready for a bit of paradigm shift,
>>> go for Scala.
>>>
>>> Our team is using Scala, however we help other data engs that are using
>>> python.
>>>
>>> I would say go for pure functional programming, however that is biased
>>> and python gets the job done anyway.
>>>
>>> Cheers,
>>> Jonathan
>>>
>>> On Mon, 11 Mar 2019, 10:34 SNEHASISH DUTTA, 
>>> wrote:
>>>
 Hi

 Is there a way to get performance benchmarks for development of
 application using either Java/Scala/Python

 Use case mostly involve SQL pipeline/data ingested from various sources
 including Kafka

 What should be the most preferred language and it would be great if the
 preference for language can be justified from the perspective of
 application development

 Thanks and Regards
 Snehasish

>>>
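
A sketch of the registration pattern mentioned above (the function name and logic are made up): once a Scala UDF is registered on the shared SparkSession, it is callable from PySpark through SQL or expr() (paste into spark-shell):

```
// Register a Scala UDF under a SQL-visible name.
spark.udf.register("triple", (x: Double) => x * 3.0)

// Callable from any language bound to the same session; in PySpark, e.g.
//   spark.sql("SELECT triple(value) FROM t")  or  df.selectExpr("triple(value)")
spark.sql("SELECT triple(2.0) AS result").show()
```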


Re: understanding the plans of spark sql

2019-03-18 Thread Reynold Xin
This is more of a question for the connector. It depends on how the connector 
is implemented. Some implement aggregate pushdown, but most don't.

On Mon, Mar 18, 2019 at 10:05 AM, asma zgolli < zgollia...@gmail.com > wrote:

> 
> Hello,
> 
> 
> I'm executing an SQL workload, using Spark SQL, on data stored in MongoDB.
> 
> 
> I have a question about the locality of execution of the aggregation. I'm
> wondering whether the aggregation is pushed down to MongoDB (like the pushdown of
> filters and projections) or executed in Spark. When I display the physical
> plan in Spark, it includes HashAggregate operators, but in the log
> of my MongoDB server the execution plan has pipelines for aggregation.
> 
> 
> I am really confused. thank you very much for your answers. 
> yours sincerely 
> Asma ZGOLLI
> 
> 
> PhD student in data engineering - computer science
> 
> Email : zgolliasma@ gmail. com ( zgollia...@gmail.com )
> email alt: asma. zgolli@ univ-grenoble-alpes. fr ( zgollia...@gmail.com )
>
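
A general way to answer this kind of question from the Spark side is to read the physical plan: pushed filters are listed on the scan node, while HashAggregate nodes above the scan are computed by Spark itself. A sketch with a Parquet source purely for illustration (a MongoDB scan would appear as the connector's own relation node instead); paste into spark-shell:

```
import spark.implicits._

// Write a small illustrative dataset, then inspect the plan of a filtered aggregation.
val path = java.nio.file.Files.createTempDirectory("pushdown").resolve("data").toString
spark.range(0, 1000).selectExpr("id", "id % 10 AS user").write.parquet(path)

val df = spark.read.parquet(path)
df.filter($"user" === 3).groupBy($"user").count().explain()
// The FileScan node lists PushedFilters such as [IsNotNull(user), EqualTo(user,3)],
// while the HashAggregate nodes above it are executed by Spark.
```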

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Reynold Xin
+1 on doing this in 3.0.

On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung < felixcheun...@hotmail.com > 
wrote:

> 
> I’m +1 if 3.0
> 
> 
> 
>  
> *From:* Sean Owen < srowen@ gmail. com ( sro...@gmail.com ) >
> *Sent:* Monday, March 25, 2019 6:48 PM
> *To:* Hyukjin Kwon
> *Cc:* dev; Bryan Cutler; Takuya UESHIN; shane knapp
> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]
>  
> I don't know a lot about Arrow here, but seems reasonable. Is this for
> Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
> seems right.
> 
> On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon < gurwls223@ gmail. com (
> gurwls...@gmail.com ) > wrote:
> >
> > Hi all,
> >
> > We really need to upgrade the minimal version soon. It's actually
> slowing down PySpark development, for instance through the overhead of
> sometimes having to test the full matrix of Arrow and Pandas versions. Also,
> it currently requires adding some weird hacks or ugly code. Some bugs
> exist in lower versions, and some features are not supported in older
> PyArrow, for instance.
> >
> > Per Bryan's recommendation (he is an Apache Arrow + Spark committer, FWIW),
> and my opinion as well, we had better increase the minimal version to
> 0.12.x. (Also, note that Pandas <> Arrow is an experimental feature).
> >
> > So, Bryan and I will proceed with this in roughly a few days if there are no
> objections, assuming we're fine with increasing it to 0.12.x. Please let me
> know if there are some concerns.
> >
> > For clarification, this requires some jobs in Jenkins to upgrade the
> minimal version of PyArrow (I cc'ed Shane as well).
> >
> > PS: I roughly heard that Shane's busy for some work stuff .. but it's
> kind of important in my perspective.
> >
> 
> -
> To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
>

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Reynold Xin
At some point we should celebrate having the largest RC number ever in Spark ...

On Mon, Mar 25, 2019 at 9:44 PM, DB Tsai < dbt...@dbtsai.com.invalid > wrote:

> 
> 
> 
> RC9 was just cut. Will send out another thread once the build is finished.
> 
> 
> 
> 
> Sincerely,
> 
> 
> 
> DB Tsai
> -- Web: https:/ / www.
> dbtsai. com ( https://www.dbtsai.com/ )
> PGP Key ID: 42E5B25A8F7A82C1
> 
> 
> 
> On Mon, Mar 25, 2019 at 5:10 PM Sean Owen < srowen@ apache. org (
> sro...@apache.org ) > wrote:
> 
> 
>> 
>> 
>> That's all merged now. I think you're clear to start an RC.
>> 
>> 
>> 
>> On Mon, Mar 25, 2019 at 4:06 PM DB Tsai < dbtsai@ dbtsai. com. invalid (
>> dbt...@dbtsai.com.invalid ) > wrote:
>> 
>> 
>>> 
>>> 
>>> I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961 https:/ / 
>>> github.
>>> com/ apache/ spark/ pull/ 24126 (
>>> https://github.com/apache/spark/pull/24126 ) , anything critical that we
>>> have to wait for 2.4.1 release? Thanks!
>>> 
>>> 
>>> 
>>> Sincerely,
>>> 
>>> 
>>> 
>>> DB Tsai
>>> -- Web: https:/ / 
>>> www.
>>> dbtsai. com ( https://www.dbtsai.com/ )
>>> PGP Key ID: 42E5B25A8F7A82C1
>>> 
>>> 
>>> 
>>> On Sun, Mar 24, 2019 at 8:19 PM Sean Owen < srowen@ apache. org (
>>> sro...@apache.org ) > wrote:
>>> 
>>> 
 
 
 Still waiting on a successful test - hope this one works.
 
 
 
 On Sun, Mar 24, 2019, 10:13 PM DB Tsai < dbtsai@ dbtsai. com (
 dbt...@dbtsai.com ) > wrote:
 
 
> 
> 
> Hello Sean,
> 
> 
> 
> By looking at SPARK-26961 PR, seems it's ready to go. Do you think we can
> merge it into 2.4 branch soon?
> 
> 
> 
> Sincerely,
> 
> 
> 
> DB Tsai
> -- Web: https:/ / 
> www.
> dbtsai. com ( https://www.dbtsai.com/ )
> PGP Key ID: 42E5B25A8F7A82C1
> 
> 
> 
> On Sat, Mar 23, 2019 at 12:04 PM Sean Owen < srowen@ apache. org (
> sro...@apache.org ) > wrote:
> 
> 
>> 
>> 
>> I think we can/should get in SPARK-26961 too; it's all but ready to
>> commit.
>> 
>> 
>> 
>> On Sat, Mar 23, 2019 at 2:02 PM DB Tsai < dbtsai@ dbtsai. com (
>> dbt...@dbtsai.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> -1
>>> 
>>> 
>>> 
>>> I will fail RC8, and cut another RC9 on Monday to include SPARK-27160,
>>> SPARK-27178, SPARK-27112. Please let me know if there is any critical PR
>>> that has to be back-ported into branch-2.4.
>>> 
>>> 
>>> 
>>> Thanks.
>>> 
>>> 
>>> 
>>> Sincerely,
>>> 
>>> 
>>> 
>>> DB Tsai
>>> -- Web: https:/ 
>>> / www.
>>> dbtsai. com ( https://www.dbtsai.com/ )
>>> PGP Key ID: 42E5B25A8F7A82C1
>>> 
>>> 
>>> 
>>> On Fri, Mar 22, 2019 at 12:28 AM DB Tsai < dbtsai@ dbtsai. com (
>>> dbt...@dbtsai.com ) > wrote:
>>> 
>>> 
 
 
 Since we have couple concerns and hesitations to release rc8, how 
 about we
 give it couple days, and have another vote on March 25, Monday? In this
 case, I will cut another rc9 in the Monday morning.
 
 
 
 Darcy, as Dongjoon mentioned,
 https:/ / github. com/ apache/ spark/ pull/ 24092 (
 https://github.com/apache/spark/pull/24092 ) is conflict against
 branch-2.4, can you make anther PR against branch-2.4 so we can include
 the ORC fix in 2.4.1?
 
 
 
 Thanks.
 
 
 
 Sincerely,
 
 
 
 DB Tsai
 -- Web: 
 https:/ / www.
 dbtsai. com ( https://www.dbtsai.com/ )
 PGP Key ID: 42E5B25A8F7A82C1
 
 
 
 On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung < felixcheung_m@ hotmail. 
 com
 ( felixcheun...@hotmail.com ) > wrote:
 
 
> 
> 
> Reposting for shane here
> 
> 
> 
> [SPARK-27178]
> https:/ / github. com/ apache/ spark/ commit/ 
> 342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> (
> https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> )
> 
> 
> 
> Is problematic too and it’s not in the rc8 cut
> 
> 
> 
> https:/ / github. com/ apache/ spark/ commits/ branch-2. 4 (
> https://github.com/apache/spark/commits/branch-2.4 )
> 
> 
> 
> (Personally I don’t want to delay 2.4.1 either..)
> 
> 
> 
> 
> From

Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
We have been thinking about some of these issues. Some of them are harder
to do, e.g. Spark DataFrames are fundamentally immutable, and making the
logical plan mutable is a significant deviation from the current paradigm
that might confuse the hell out of some users. We are considering building
a shim layer as a separate project on top of Spark (so we can make rapid
releases based on feedback) just to test this out and see how well it could
work in practice.

On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari 
wrote:

> Hi,
> I was doing some spark to pandas (and vice versa) conversion because some
> of the pandas codes we have don't work on huge data. And some spark codes
> work very slow on small data.
>
> It was nice to see that pyspark had some similar syntax for the common
> pandas operations that the python community is used to.
>
> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
> Column selects: df[['col1', 'col2']]
> Row Filters: df[df['col1'] < 3.0]
>
> I was wondering about a bunch of other functions in pandas which seemed
> common. And thought there must've been a discussion about it in the
> community - hence started this thread.
>
> I was wondering whether there has been discussion on adding the following
> functions:
>
> *Column setters*:
> In Pandas:
> df['col3'] = df['col1'] * 3.0
> While I do the following in PySpark:
> df = df.withColumn('col3', df['col1'] * 3.0)
>
> *Column apply()*:
> In Pandas:
> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
> While I do the following in PySpark:
> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>
> I understand that this one cannot be as simple as in pandas due to the
> output-type that's needed here. But could be done like:
> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>
> Multi column in pandas is:
> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0)
> Maybe this can be done in pyspark as or if we can send a pyspark.sql.Row
> directly it would be similar (?):
> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
> 'float')
>
> *Rename*:
> In Pandas:
> df.rename(columns={...})
> While I do the following in PySpark:
> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>
> *To Dictionary*:
> In Pandas:
> df.to_dict(orient='list')
> While I do the following in PySpark:
> {f.name: [row[i] for row in df.collect()] for i, f in
> enumerate(df.schema.fields)}
>
> I thought I'd start the discussion with these and come back to some of the
> others I see that could be helpful.
>
> *Note*: (with the column functions in mind) I understand the concept of
> the DataFrame cannot be modified. And I am not suggesting we change that
> nor any underlying principle. Just trying to add syntactic sugar here.
>
>


Re: PySpark syntax vs Pandas syntax

2019-03-25 Thread Reynold Xin
We have some early stuff there but not quite ready to talk about it in
public yet (I hope soon though). Will shoot you a separate email on it.

On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari 
wrote:

> Thanks for the reply Reynold - Has this shim project started ?
> I'd love to contribute to it - as it looks like I have started making a
> bunch of helper functions to do something similar for my current task and
> would prefer not doing it in isolation.
> Was considering making a git repo and pushing stuff there just today
> morning. But if there's already folks working on it - I'd prefer
> collaborating.
>
> Note - I'm not recommending we make the logical plan mutable (as I am
> scared of that too!). I think there are other ways of handling that - but
> we can go into details later.
>
> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin  wrote:
>
>> We have been thinking about some of these issues. Some of them are harder
>> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
>> logical plan mutable is a significant deviation from the current paradigm
>> that might confuse the hell out of some users. We are considering building
>> a shim layer as a separate project on top of Spark (so we can make rapid
>> releases based on feedback) just to test this out and see how well it could
>> work in practice.
>>
>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <
>> abdealikoth...@gmail.com> wrote:
>>
>>> Hi,
>>> I was doing some spark to pandas (and vice versa) conversion because
>>> some of the pandas codes we have don't work on huge data. And some spark
>>> codes work very slow on small data.
>>>
>>> It was nice to see that pyspark had some similar syntax for the common
>>> pandas operations that the python community is used to.
>>>
>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>> Column selects: df[['col1', 'col2']]
>>> Row Filters: df[df['col1'] < 3.0]
>>>
>>> I was wondering about a bunch of other functions in pandas which seemed
>>> common. And thought there must've been a discussion about it in the
>>> community - hence started this thread.
>>>
>>> I was wondering whether there has been discussion on adding the
>>> following functions:
>>>
>>> *Column setters*:
>>> In Pandas:
>>> df['col3'] = df['col1'] * 3.0
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', df['col1'] * 3.0)
>>>
>>> *Column apply()*:
>>> In Pandas:
>>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>>> While I do the following in PySpark:
>>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(
>>> df['col1']))
>>>
>>> I understand that this one cannot be as simple as in pandas due to the
>>> output-type that's needed here. But could be done like:
>>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>>>
>>> Multi column in pandas is:
>>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>>> Maybe this can be done in pyspark as well, or if we can send a pyspark.sql.Row
>>> directly it would be similar (?):
>>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
>>> 'float')
>>>
>>> *Rename*:
>>> In Pandas:
>>> df.rename(columns={...})
>>> While I do the following in PySpark:
>>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>>>
>>> *To Dictionary*:
>>> In Pandas:
>>> df.to_dict(orient='list')
>>> While I do the following in PySpark:
>>> {f.name: [row[i] for row in df.collect()] for i, f in
>>> enumerate(df.schema.fields)}
>>>
>>> I thought I'd start the discussion with these and come back to some of
>>> the others I see that could be helpful.
>>>
>>> *Note*: (with the column functions in mind) I understand the concept that
>>> the DataFrame cannot be modified, and I am not suggesting we change that
>>> nor any underlying principle. Just trying to add syntactic sugar here.
>>>
>>>
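For the multi-column apply case quoted above, a minimal sketch of what already
works in today's PySpark using a plain multi-argument UDF with an explicit
return type (column names follow the thread's example; the function body and
data are made up):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['col1', 'col2'])

# A multi-argument Python UDF; the return type still has to be declared.
combine = F.udf(lambda col1, col2: col1 * 3.0 + col2, 'float')
df = df.withColumn('col3', combine(df['col1'], df['col2']))
df.show()
```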


Re: PySpark syntax vs Pandas syntax

2019-03-26 Thread Reynold Xin
We just made the repo public: https://github.com/databricks/spark-pandas

On Tue, Mar 26, 2019 at 1:20 AM, Timothee Hunter < timhun...@databricks.com > 
wrote:

> 
> To add more details to what Reynold mentioned: as you said, there are going
> to be some slight differences between Pandas and Spark in any case, simply
> because Spark needs to know the return types of the functions. In your case,
> you would need to slightly refactor your apply method to the following (in
> python 3) to add type hints:
> 
> 
> ```
> def f(x) -> float: return x * 3.0
> df['col3'] = df['col1'].apply(f)
> ```
> 
> 
> This has the benefit of keeping your code fully compliant with both pandas
> and pyspark. We will share more information in the future.
> 
> 
> Tim
> 
> On Tue, Mar 26, 2019 at 8:08 AM Hyukjin Kwon < gurwls223@ gmail. com (
> gurwls...@gmail.com ) > wrote:
> 
> 
>> BTW, I am working on the documentation related to this subject at https:/
>> / issues. apache. org/ jira/ browse/ SPARK-26022 (
>> https://issues.apache.org/jira/browse/SPARK-26022 ) to describe the
>> differences
>> 
>> On Tue, Mar 26, 2019 at 3:34 PM, Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> We have some early stuff there but not quite ready to talk about it in
>>> public yet (I hope soon though). Will shoot you a separate email on it.
>>> 
>>> On Mon, Mar 25, 2019 at 11:32 PM Abdeali Kothari < abdealikothari@ gmail. 
>>> com
>>> ( abdealikoth...@gmail.com ) > wrote:
>>> 
>>> 
>>>> Thanks for the reply Reynold - Has this shim project started ?
>>>> I'd love to contribute to it - as it looks like I have started making a
>>>> bunch of helper functions to do something similar for my current task and
>>>> would prefer not doing it in isolation.
>>>> Was considering making a git repo and pushing stuff there just today
>>>> morning. But if there's already folks working on it - I'd prefer
>>>> collaborating.
>>>> 
>>>> 
>>>> Note - I'm not recommending we make the logical plan mutable (as I am
>>>> scared of that too!). I think there are other ways of handling that - but
>>>> we can go into details later.
>>>> 
>>>> On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> We have been thinking about some of these issues. Some of them are harder
>>>>> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
>>>>> logical plan mutable is a significant deviation from the current paradigm
>>>>> that might confuse the hell out of some users. We are considering building
>>>>> a shim layer as a separate project on top of Spark (so we can make rapid
>>>>> releases based on feedback) just to test this out and see how well it
>>>>> could work in practice.
>>>>> 
>>>>> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari < abdealikothari@ gmail. 
>>>>> com
>>>>> ( abdealikoth...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> Hi,
>>>>>> I was doing some Spark to pandas (and vice versa) conversion because some
>>>>>> of the pandas code we have doesn't work on huge data, and some Spark code
>>>>>> runs very slowly on small data.
>>>>>> 
>>>>>> It was nice to see that pyspark had some similar syntax for the common
>>>>>> pandas operations that the python community is used to.
>>>>>> 
>>>>>> 
>>>>>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>>>>>> 
>>>>>> Column selects: df[['col1', 'col2']]
>>>>>> 
>>>>>> Row Filters: df[df['col1'] < 3.0]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I was wondering about a bunch of other functions in pandas which seemed
>>>>>> common. And thought there must've been a discussion about it in the
>>>>>> community - hence started this thread.
>>>>>> 
>>>>>> 
>>>>>> I was wondering whether there has been discussion on adding the following
>>>>>> functions

Re: [DISCUSS] Spark Columnar Processing

2019-03-26 Thread Reynold Xin
26% improvement is underwhelming if it requires massive refactoring of the 
codebase. Also you can't just add the benefits up this way, because:

- Both vectorization and codegen reduces the overhead in virtual function calls

- Vectorization code is more friendly to compilers / CPUs, but requires 
materializing a lot of data in memory (or cache)

- Codegen reduces the amount of data that flows through memory, but for complex 
queries the generated code might not be very compiler / CPU friendly

I see massive benefits in leveraging GPUs (and other accelerators) for numeric 
workloads (e.g. machine learning), so I think it makes a lot of sense to be 
able to get data out of Spark quickly into UDFs for such workloads.

I don't see as much benefits for general data processing, for a few reasons:

1. GPU machines are much more expensive & difficult to get (e.g. in the cloud 
they are 3 to 5x more expensive, with limited availability based on my 
experience), so it is difficult to build a farm

2. Bandwidth from system to GPUs is usually small, so if you could fit the 
working set in GPU memory and repeatedly work on it (e.g. machine learning), 
it's great, but otherwise it's not great.

3. It's a massive effort.

In general it's a cost-benefit trade-off. I'm not aware of any general 
framework that allows us to write code once and have it work against both GPUs 
and CPUs reliably. If such a framework exists, it will change the equation.
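On the "get data out of Spark quickly" point, a minimal sketch of the
Arrow-based transfer that PySpark already supports (the conf name is the Spark
2.3/2.4 one; the DataFrame contents here are made up):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
# Enables Arrow for DataFrame <-> pandas conversion (conf name as of Spark 2.4).
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')

df = spark.range(0, 100000).selectExpr('id', 'id * 2.0 AS feature')
pdf = df.toPandas()  # transferred as Arrow record batches rather than pickled rows
```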

On Tue, Mar 26, 2019 at 6:57 AM, Bobby Evans < bo...@apache.org > wrote:

> 
> Cloudera reports a 26% improvement in hive query runtimes by enabling
> vectorization. I would expect to see similar improvements but at the cost
> of keeping more data in memory.  But remember this also enables a number
> of different hardware acceleration techniques.  If the data format is
> arrow compatible and off-heap someone could offload the processing to
> native code which typically results in a 2x improvement over java (and the
> cost of a JNI call would be amortized over processing an entire batch at
> once).  Also, we plan on adding in GPU acceleration and ideally making it
> a standard part of Spark.  In our initial prototype, we saw queries which
> we could make fully columnar/GPU enabled being 5-6x faster.  But that
> really was just a proof of concept and we expect to be able to do quite a
> bit better when we are completely done.  Many commercial GPU enabled SQL
> engines claim to be 20x to 200x faster than Spark, depending on the use
> case. Digging deeply you see that they are not apples to apples
> comparisons, i.e. reading from cached GPU memory and having spark read
> from a file, or using parquet as input but asking spark to read CSV.  That
> being said I would expect that we can achieve something close to the 20x
> range for most queries and possibly more if they are computationally
> intensive.
> 
> 
> Also as a side note, we initially thought that the conversion would not be
> too expensive and that we could just move computationally intensive
> processing onto the GPU piecemeal with conversions on both ends.  In
> practice, we found that the cost of conversion quickly starts to dominate
> the queries we were testing.
> 
> On Mon, Mar 25, 2019 at 11:53 PM Wenchen Fan < cloud0fan@ gmail. com (
> cloud0...@gmail.com ) > wrote:
> 
> 
>> Do you have some initial perf numbers? It seems fine to me to remain
>> row-based inside Spark with whole-stage-codegen, and convert rows to
>> columnar batches when communicating with external systems.
>> 
>> On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans < bobby@ apache. org (
>> bo...@apache.org ) > wrote:
>> 
>> 
>>> 
>>> 
>>> This thread is to discuss adding in support for data frame processing
>>> using an in-memory columnar format compatible with Apache Arrow.  My main
>>> goal in this is to lay the groundwork so we can add in support for GPU
>>> accelerated processing of data frames, but this feature has a number of
>>> other benefits.  Spark currently supports Apache Arrow formatted data as
>>> an option to exchange data with python for pandas UDF processing. There
>>> has also been discussion around extending this to allow for exchanging
>>> data with other tools like pytorch, tensorflow, xgboost,... If Spark
>>> supports processing on Arrow compatible data it could eliminate the
>>> serialization/deserialization overhead when going between these systems. 
>>> It also would allow for doing optimizations on a CPU with SIMD
>>> instructions similar to what Hive currently supports. Accelerated
>>> processing using a GPU is something that we will start a separate
>>> discussion thread on, but I wanted to set the context a bit.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Jason Lowe, Tom Graves, and I created a prototype over the past few months
>>> to try and understand how to make this work.  What we are proposing is
>>> based off of lessons learned when building this prototype, but we really
>>> wanted to get feedback early on from the community. 

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Yes this is known and an issue for performance. Do you have any thoughts on
how to fix this?

On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson  wrote:

> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> used by UDAFs are serialized to a Row object, and de-serialized, for every
> row in a data frame.
> Cheers,
> Erik
>
>


Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
They are unfortunately all pretty substantial (which is why this problem 
exists) ...

On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote:

> 
> At a high level, some candidate strategies are:
> 
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
> trait itself) so that the update method can do the right thing.
> 2. Expose TypedImperativeAggregate to users for defining their own, since
> it already does the right thing.
> 
> 3. As a workaround, allow users to define their own sub-classes of
> DataType.  It would essentially allow one to define the sqlType of the UDT
> to be the aggregating object itself and make ser/de a no-op.  I tried
> doing this and it will compile, but spark's internals only consider a
> predefined universe of DataType classes.
> 
> 
> All of these options are likely to have implications for the catalyst
> systems. I'm not sure if they are minor or more substantial.
> 
> 
> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> Yes this is known and an issue for performance. Do you have any thoughts
>> on how to fix this?
>> 
>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson < eerlands@ redhat. com (
>> eerla...@redhat.com ) > wrote:
>> 
>> 
>>> I describe some of the details here:
>>> 
>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-27296 (
>>> https://issues.apache.org/jira/browse/SPARK-27296 )
>>> 
>>> 
>>> 
>>> The short version of the story is that aggregating data structures (UDTs)
>>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>>> row in a data frame.
>>> Cheers,
>>> Erik
>>> 
>> 
>> 
> 
>

Re: UDAFs have an inefficiency problem

2019-03-27 Thread Reynold Xin
Not that I know of. We did do some work to make it work faster in the case of 
lower cardinality: https://issues.apache.org/jira/browse/SPARK-17949

On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson < eerla...@redhat.com > wrote:

> 
> BTW, if this is known, is there an existing JIRA I should link to?
> 
> 
> On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson < eerlands@ redhat. com (
> eerla...@redhat.com ) > wrote:
> 
> 
>> 
>> 
>> At a high level, some candidate strategies are:
>> 
>> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
>> trait itself) so that the update method can do the right thing.
>> 2. Expose TypedImperativeAggregate to users for defining their own, since
>> it already does the right thing.
>> 
>> 3. As a workaround, allow users to define their own sub-classes of
>> DataType.  It would essentially allow one to define the sqlType of the UDT
>> to be the aggregating object itself and make ser/de a no-op.  I tried
>> doing this and it will compile, but spark's internals only consider a
>> predefined universe of DataType classes.
>> 
>> 
>> All of these options are likely to have implications for the catalyst
>> systems. I'm not sure if they are minor or more substantial.
>> 
>> 
>> On Wed, Mar 27, 2019 at 4:20 PM Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> Yes this is known and an issue for performance. Do you have any thoughts
>>> on how to fix this?
>>> 
>>> On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson < eerlands@ redhat. com (
>>> eerla...@redhat.com ) > wrote:
>>> 
>>> 
>>>> I describe some of the details here:
>>>> 
>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-27296 (
>>>> https://issues.apache.org/jira/browse/SPARK-27296 )
>>>> 
>>>> 
>>>> 
>>>> The short version of the story is that aggregating data structures (UDTs)
>>>> used by UDAFs are serialized to a Row object, and de-serialized, for every
>>>> row in a data frame.
>>>> Cheers,
>>>> Erik
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: [DISCUSS] Enable blacklisting feature by default in 3.0

2019-03-29 Thread Reynold Xin
We tried enabling blacklisting for some customers in the cloud, and very
quickly they ended up having 0 executors due to various transient errors. So
unfortunately I think the current implementation is terrible for cloud 
deployments, and shouldn't be on by default. The heart of the issue is that the 
current implementation is not great at dealing with transient errors vs 
catastrophic errors.

+Chris who was involved with those tests.

On Thu, Mar 28, 2019 at 3:32 PM, Ankur Gupta < ankur.gu...@cloudera.com.invalid 
> wrote:

> 
> Hi all,
> 
> 
> This is a follow-on to my PR: https:/ / github. com/ apache/ spark/ pull/ 
> 24208
> ( https://github.com/apache/spark/pull/24208 ) , where I aimed to enable
> blacklisting for fetch failure by default. From the comments, there is
> interest in the community to enable overall blacklisting feature by
> default. I have listed down 3 different things that we can do and would
> like to gather feedback and see if anyone has objections with regards to
> this. Otherwise, I will just create a PR for the same.
> 
> 
> 1. *Enable blacklisting feature by default*. The blacklisting feature was
> added as part of SPARK-8425 and has been available since 2.2.0. This feature
> was deemed experimental and was disabled by default. The feature blacklists an
> executor/node from running a particular task, any task in a particular stage,
> or all tasks in the application, based on the number of failures. There are
> various configurations available which control those thresholds.
> Additionally, the executor/node is only blacklisted for a configurable
> time period. The idea is to enable blacklisting feature with existing
> defaults, which are following:
> * spark.blacklist.task.maxTaskAttemptsPerExecutor = 1
> 
> * spark.blacklist.task.maxTaskAttemptsPerNode = 2
> 
> * spark.blacklist.stage.maxFailedTasksPerExecutor = 2
> 
> * spark.blacklist.stage.maxFailedExecutorsPerNode = 2
> 
> * spark.blacklist.application.maxFailedTasksPerExecutor = 2
> 
> * spark.blacklist.application.maxFailedExecutorsPerNode = 2
> 
> * spark.blacklist.timeout = 1 hour
> 
> 2. *Kill blacklisted executors/nodes by default*. This feature was added
> as part of SPARK-16554 and is available since 2.2.0. This is a follow-on
> feature to blacklisting, such that if an executor/node is blacklisted for
> the application, then it also terminates all running tasks on that
> executor for faster failure recovery.
> 
> 
> 3. *Remove legacy blacklisting timeout config* :
> spark.scheduler.executorTaskBlacklistTime
> 
> 
> Thanks,
> Ankur
>
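For anyone who wants to try the feature while it remains opt-in, a minimal
sketch of turning it on today from PySpark (the keys are existing configs,
including the top-level enable flag; the values are illustrative only):

```
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set('spark.blacklist.enabled', 'true')                   # the feature itself (item 1)
        .set('spark.blacklist.killBlacklistedExecutors', 'true')  # SPARK-16554 follow-on (item 2)
        .set('spark.blacklist.timeout', '1h'))                    # how long an executor/node stays blacklisted

spark = SparkSession.builder.config(conf=conf).master('local[*]').getOrCreate()
```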

Do you use single-quote syntax for the DataFrame API?

2019-03-30 Thread Reynold Xin
As part of evolving the Scala language, the Scala team is considering removing 
single-quote syntax for representing symbols. Single-quote syntax is one of the 
ways to represent a column in Spark's DataFrame API. While I personally don't 
use them (I prefer just using strings for column names, or using expr 
function), I see them used quite a lot by other people's code, e.g.

df.select('id, 'name).show()

I want to bring this to more people's attention, in case they are depending on 
this. The discussion thread is: 
https://contributors.scala-lang.org/t/proposal-to-deprecate-and-remove-symbol-literals/2953

Re: [DISCUSS] Spark Columnar Processing

2019-04-01 Thread Reynold Xin
I just realized I didn't make it very clear my stance here ... here's another 
try:

I think it's a no brainer to have a good columnar UDF interface. This would 
facilitate a lot of high performance applications, e.g. GPU-based accelerations 
for machine learning algorithms.

On rewriting the entire internals of Spark SQL to leverage columnar processing, 
I don't see enough evidence to suggest that's a good idea yet.
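For context on the columnar UDF point: PySpark's existing scalar Pandas UDFs
already give a columnar (Arrow-backed) UDF surface, where the function sees a
whole pandas.Series per batch rather than one row at a time. A minimal sketch
(the column name and function are made up):

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.range(0, 1000).selectExpr('id AS x')

@pandas_udf('double', PandasUDFType.SCALAR)
def times_three(x):
    # x is a pandas.Series covering a whole Arrow batch, not a single value
    return x * 3.0

df.select(times_three('x')).show(5)
```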

On Wed, Mar 27, 2019 at 8:10 AM, Bobby Evans < bo...@apache.org > wrote:

> 
> Kazuaki Ishizaki,
> 
> 
> Yes, ColumnarBatchScan does provide a framework for doing code generation
> for the processing of columnar data.  I have to admit that I don't have a
> deep understanding of the code generation piece, so if I get something
> wrong please correct me.  From what I had seen only input formats
> currently inherent from ColumnarBatchScan, and from comments in the trait
> 
> 
>   /**
>    * Generate [[ColumnVector]] expressions for our parent to consume as
> rows.
>    * This is called once per [[ColumnarBatch]].
>    */
> https:/ / github. com/ apache/ spark/ blob/ 
> 956b52b1670985a67e49b938ac1499ae65c79f6e/
> sql/ core/ src/ main/ scala/ org/ apache/ spark/ sql/ execution/ 
> ColumnarBatchScan.
> scala#L42-L43 (
> https://github.com/apache/spark/blob/956b52b1670985a67e49b938ac1499ae65c79f6e/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L42-L43
> )
> 
> 
> 
> It appears that ColumnarBatchScan is really only intended to pull out the
> data from the batch, and not to process that data in a columnar fashion. 
> The Loading stage that you mentioned.
> 
> 
> > The SIMDzation or GPUization capability depends on a compiler that
> translates native code from the code generated by the whole-stage codegen.
> 
> To be able to support vectorized processing Hive stayed with pure java and
> let the JVM detect and do the SIMDzation of the code.  To make that happen
> they created loops to go through each element in a column and remove all
> conditionals from the body of the loops.  To the best of my knowledge that
> would still require a separate code path like I am proposing to make the
> different processing phases generate code that the JVM can compile down to
> SIMD instructions.  The generated code is full of null checks for each
> element which would prevent the operations we want.  Also, the
> intermediate results are often stored in UnsafeRow instances.  This is
> really fast for row-based processing, but the complexity of how they work
> I believe would prevent the JVM from being able to vectorize the
> processing.  If you have a better way to take java code and vectorize it
> we should put it into OpenJDK instead of spark so everyone can benefit
> from it.
> 
> 
> Trying to compile directly from generated java code to something a GPU can
> process is something we are tackling but we decided to go a different
> route from what you proposed.  From talking with several compiler experts
> here at NVIDIA my understanding is that IBM in partnership with NVIDIA
> attempted in the past to extend the JVM to run at least partially on GPUs,
> but it was really difficult to get right, especially with how java does
> memory management and memory layout.
> 
> 
> To avoid that complexity we decided to split the JITing up into two
> separate pieces.  I didn't mention any of this before because this
> discussion was intended to just be around the memory layout support, and
> not GPU processing.  The first part would be to take the Catalyst AST and
> produce CUDA code directly from it.  If properly done we should be able to
> do the selection and projection phases within a single kernel.  The
> biggest issue comes with UDFs as they cannot easily be vectorized for the
> CPU or GPU.  So to deal with that we have a prototype written by the
> compiler team that is trying to tackle SPARK-14083 which can translate
> basic UDFs into catalyst expressions.  If the UDF is too complicated or
> covers operations not yet supported it will fall back to the original UDF
> processing.  I don't know how close the team is to submit a SPIP or a
> patch for it, but I do know that they have some very basic operations
> working.  The big issue is that it requires java 11+ so it can use
> standard APIs to get the byte code of scala UDFs.  
> 
> 
> We split it this way because we thought it would be simplest to implement,
> and because it would provide a benefit to more than just GPU accelerated
> queries.
> 
> 
> Thanks,
> 
> 
> Bobby
> 
> On Tue, Mar 26, 2019 at 11:59 PM Kazuaki Ishizaki < ISHIZAKI@ jp. ibm. com
> ( ishiz...@jp.ibm.com ) > wrote:
> 
> 
>> Looks interesting discussion.
>> Let me describe the current structure and remaining issues. This is
>> orthogonal to cost-benefit trade-off discussion.
>> 
>> The code generation basically consists of three parts.
>> 1. Loading
>> 2. Selection (map, filter, ...)
>> 3. Projection
>> 
>> 1. Columnar storage (e.g. Parquet, Orc, Arrow , and table cache) is 

Re: [DISCUSS] Spark Columnar Processing

2019-04-11 Thread Reynold Xin
I just realized we had an earlier SPIP on a similar topic: 
https://issues.apache.org/jira/browse/SPARK-24579

Perhaps we should tie the two together. IIUC, you'd want to expose the existing
ColumnarBatch API, but also provide utilities to directly convert from/to Arrow.

On Thu, Apr 11, 2019 at 7:13 AM, Bobby Evans < bo...@apache.org > wrote:

> 
> The SPIP has been up for almost 6 days now with really no discussion on
> it.  I am hopeful that means it's okay and we are good to call a vote on
> it, but I want to give everyone one last chance to take a look and
> comment.  If there are no comments by tomorrow I think we will start a vote
> for this.
> 
> 
> Thanks,
> 
> 
> Bobby
> 
> On Fri, Apr 5, 2019 at 2:24 PM Bobby Evans < bobby@ apache. org (
> bo...@apache.org ) > wrote:
> 
> 
>> I just filed SPARK-27396 as the SPIP for this proposal.  Please use that
>> JIRA for further discussions.
>> 
>> 
>> Thanks for all of the feedback,
>> 
>> 
>> Bobby
>> 
>> On Wed, Apr 3, 2019 at 7:15 PM Bobby Evans < bobby@ apache. org (
>> bo...@apache.org ) > wrote:
>> 
>> 
>>> I am still working on the SPIP and should get it up in the next few days. 
>>> I have the basic text more or less ready, but I want to get a high-level
>>> API concept ready too just to have something more concrete.  I have not
>>> really done much with contributing new features to spark so I am not sure
>>> where a design document really fits in here because from http:/ / spark. 
>>> apache.
>>> org/ improvement-proposals. html (
>>> http://spark.apache.org/improvement-proposals.html ) and http:/ / spark. 
>>> apache.
>>> org/ contributing. html ( http://spark.apache.org/contributing.html ) it
>>> does not mention a design anywhere.  I am happy to put one up, but I was
>>> hoping the API concept would cover most of that.
>>> 
>>> 
>>> Thanks,
>>> 
>>> 
>>> Bobby
>>> 
>>> On Tue, Apr 2, 2019 at 9:16 PM Renjie Liu < liurenjie2008@ gmail. com (
>>> liurenjie2...@gmail.com ) > wrote:
>>> 
>>> 
>>>> Hi, Bobby:
>>>> Do you have design doc? I'm also interested in this topic and want to help
>>>> contribute.
>>>> 
>>>> On Tue, Apr 2, 2019 at 10:00 PM Bobby Evans < bobby@ apache. org (
>>>> bo...@apache.org ) > wrote:
>>>> 
>>>> 
>>>>> Thanks to everyone for the feedback.
>>>>> 
>>>>> 
>>>>> Overall the feedback has been really positive for exposing columnar as a
>>>>> processing option to users.  I'll write up a SPIP on the proposed changes
>>>>> to support columnar processing (not necessarily implement it) and then
>>>>> ping the list again for more feedback and discussion.
>>>>> 
>>>>> 
>>>>> Thanks again,
>>>>> 
>>>>> 
>>>>> Bobby
>>>>> 
>>>>> On Mon, Apr 1, 2019 at 5:09 PM Reynold Xin < rxin@ databricks. com (
>>>>> r...@databricks.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> I just realized I didn't make it very clear my stance here ... here's
>>>>>> another try:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I think it's a no brainer to have a good columnar UDF interface. This
>>>>>> would facilitate a lot of high performance applications, e.g. GPU-based
>>>>>> accelerations for machine learning algorithms.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On rewriting the entire internals of Spark SQL to leverage columnar
>>>>>> processing, I don't see enough evidence to suggest that's a good idea 
>>>>>> yet.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wed, Mar 27, 2019 at 8:10 AM, Bobby Evans < bobby@ apache. org (
>>>>>> bo...@apache.org ) > wrote:
>>>>>> 
>>>>>>> Kazuaki Ishizaki,
>>>>>>> 
>>>>>>> 
>>>>>>> Yes, ColumnarBatchScan does provide a framework for doing code 
>>>>>>> generation
>>>>>>> for the p

Re: pyspark.sql.functions ide friendly

2019-04-17 Thread Reynold Xin
Are you talking about the ones that are defined in a dictionary? If yes, that 
was actually not that great in hindsight (makes it harder to read & change), so 
I'm OK changing it.

E.g.

_functions = {
    'lit': _lit_doc,
    'col': 'Returns a :class:`Column` based on the given column name.',
    'column': 'Returns a :class:`Column` based on the given column name.',
    'asc': 'Returns a sort expression based on the ascending order of the given column name.',
    'desc': 'Returns a sort expression based on the descending order of the given column name.',
}
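For concreteness, a rough sketch (not an actual patch) of the statically
defined form an IDE could resolve, assuming we keep the current pattern of
delegating to the JVM-side function of the same name:

```
from pyspark import SparkContext
from pyspark.sql.column import Column


def col(name):
    """Returns a :class:`Column` based on the given column name."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.col(name))


def asc(name):
    """Returns a sort expression based on the ascending order of the given column name."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.asc(name))
```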

On Wed, Apr 17, 2019 at 4:35 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> I use IntelliJ and have never seen an issue parsing the pyspark
> functions... you're just saying the linter has an optional inspection to
> flag it? just disable that?
> I don't think we want to complicate the Spark code just for this. They are
> declared at runtime for a reason.
> 
> 
> 
> On Wed, Apr 17, 2019 at 6:27 AM educhana@ gmail. com ( educh...@gmail.com )
> < educhana@ gmail. com ( educh...@gmail.com ) > wrote:
> 
> 
>> 
>> 
>> Hi,
>> 
>> 
>> 
>> I'm aware of various workarounds to make this work smoothly in various
>> IDEs, but wouldn't better to solve the root cause?
>> 
>> 
>> 
>> I've seen the code and don't see anything that requires such level of
>> dynamic code, the translation is 99% trivial.
>> 
>> 
>> 
>> On 2019/04/16 12:16:41, 880f0464 < 880f0464@ protonmail. com. INVALID (
>> 880f0...@protonmail.com.INVALID ) > wrote:
>> 
>> 
>>> 
>>> 
>>> Hi.
>>> 
>>> 
>>> 
>>> That's a problem with Spark as such and in general can be addressed on IDE
>>> to IDE basis - see for example https:/ / stackoverflow. com/ q/ 40163106 (
>>> https://stackoverflow.com/q/40163106 ) for some hints.
>>> 
>>> 
>>> 
>>> Sent with ProtonMail Secure Email.
>>> 
>>> 
>>> 
>>> ‐‐‐ Original Message ‐‐‐
>>> On Tuesday, April 16, 2019 2:10 PM, educhana < educhana@ gmail. com (
>>> educh...@gmail.com ) > wrote:
>>> 
>>> 
 
 
 Hi,
 
 
 
 Currently using pyspark.sql.functions from an IDE like PyCharm is causing
 the linters to complain because the functions are declared at runtime.
 
 
 
 Would a PR fixing this be welcomed? Are there any problems/difficulties I'm
 unaware of?
 
 
 
 --
 
 
 
 
 Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
 ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
 
 
 
 --
 
 
 
 To unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
 dev-unsubscr...@spark.apache.org )
 
 
>>> 
>>> 
>>> 
>>> - To
>>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>>> dev-unsubscr...@spark.apache.org )
>>> 
>>> 
>> 
>> 
>> 
>> - To
>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>> dev-unsubscr...@spark.apache.org )
>> 
>> 
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Spark 2.4.2

2019-04-17 Thread Reynold Xin
For Jackson - are you worrying about JSON parsing for users or internal
Spark functionality breaking?

On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:

> There's only one other item on my radar, which is considering updating
> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
> a few times now that there are a number of CVEs open for 2.6.7. Cons:
> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
> behavior non-trivially. That said back-porting the update PR to 2.4
> worked out OK locally. Any strong opinions on this one?
>
> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan  wrote:
> >
> > I volunteer to be the release manager for 2.4.2, as I was also going to
> propose 2.4.2 because of the reverting of SPARK-25250. Is there any other
> ongoing bug fixes we want to include in 2.4.2? If no I'd like to start the
> release process today (CST).
> >
> > Thanks,
> > Wenchen
> >
> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
> >>
> >> I think the 'only backport bug fixes to branches' principle remains
> sound. But what's a bug fix? Something that changes behavior to match what
> is explicitly supposed to happen, or implicitly supposed to happen --
> implied by what other similar things do, by reasonable user expectations,
> or simply how it worked previously.
> >>
> >> Is this a bug fix? I guess the criteria that matches is that behavior
> doesn't match reasonable user expectations? I don't know enough to have a
> strong opinion. I also don't think there is currently an objection to
> backporting it, whatever it's called.
> >>
> >>
> >> Is the question whether this needs a new release? There's no harm in
> another point release, other than needing a volunteer release manager. One
> could say, wait a bit longer to see what more info comes in about 2.4.1.
> But given that 2.4.1 took like 2 months, it's reasonable to move towards a
> release cycle again. I don't see objection to that either (?)
> >>
> >>
> >> The meta question remains: is a 'bug fix' definition even agreed, and
> being consistently applied? There aren't correct answers, only best guesses
> from each person's own experience, judgment and priorities. These can
> differ even when applied in good faith.
> >>
> >> Sometimes the variance of opinion comes because people have different
> info that needs to be surfaced. Here, maybe it's best to share what about
> that offline conversation was convincing, for example.
> >>
> >> I'd say it's also important to separate what one would prefer from what
> one can't live with(out). Assuming one trusts the intent and experience of
> the handful of others with an opinion, I'd defer to someone who wants X and
> will own it, even if I'm moderately against it. Otherwise we'd get little
> done.
> >>
> >> In that light, it seems like both of the PRs at issue here are not
> _wrong_ to backport. This is a good pair that highlights why, when there
> isn't a clear reason to do / not do something (e.g. obvious errors,
> breaking public APIs) we give benefit-of-the-doubt in order to get it later.
> >>
> >>
> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue 
> wrote:
> >>>
> >>> Sorry, I should be more clear about what I'm trying to say here.
> >>>
> >>> In the past, Xiao has taken the opposite stance. A good example is PR
> #21060 that was a very similar situation: behavior didn't match what was
> expected and there was low risk. There was a long argument and the patch
> didn't make it into 2.3 (to my knowledge).
> >>>
> >>> What we call these low-risk behavior fixes doesn't matter. I called it
> a bug on #21060 but I'm applying Xiao's previous definition here to make a
> point. Whatever term we use, we clearly have times when we want to allow a
> patch because it is low risk and helps someone. Let's just be clear that
> that's perfectly fine.
> >>>
> >>> On Wed, Apr 17, 2019 at 9:34 AM Ryan Blue  wrote:
> 
>  How is this a bug fix?
> 
>  On Wed, Apr 17, 2019 at 9:30 AM Xiao Li 
> wrote:
> >
> > Michael and I had an offline discussion about this PR
> https://github.com/apache/spark/pull/24365. He convinced me that this is
> a bug fix. The code changes of this bug fix are very tiny and the risk is
> very low. To avoid any behavior change in the patch releases, this PR also
> added a legacy flag whose default value is off.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark 2.4.2

2019-04-17 Thread Reynold Xin
We should have shaded all Spark’s dependencies :(

On Wed, Apr 17, 2019 at 11:47 PM Sean Owen  wrote:

> For users that would inherit Jackson and use it directly, or whose
> dependencies do. Spark itself (with modifications) should be OK with
> the change.
> It's risky and normally wouldn't backport, except that I've heard a
> few times about concerns about CVEs affecting Databind, so wondering
> who else out there might have an opinion. I'm not pushing for it
> necessarily.
>
> On Wed, Apr 17, 2019 at 6:18 PM Reynold Xin  wrote:
> >
> > For Jackson - are you worrying about JSON parsing for users or internal
> Spark functionality breaking?
> >
> > On Wed, Apr 17, 2019 at 6:02 PM Sean Owen  wrote:
> >>
> >> There's only one other item on my radar, which is considering updating
> >> Jackson to 2.9 in branch-2.4 to get security fixes. Pros: it's come up
> >> a few times now that there are a number of CVEs open for 2.6.7. Cons:
> >> not clear they affect Spark, and Jackson 2.6->2.9 does change Jackson
> >> behavior non-trivially. That said back-porting the update PR to 2.4
> >> worked out OK locally. Any strong opinions on this one?
> >>
> >> On Wed, Apr 17, 2019 at 7:49 PM Wenchen Fan 
> wrote:
> >> >
> >> > I volunteer to be the release manager for 2.4.2, as I was also going
> to propose 2.4.2 because of the reverting of SPARK-25250. Is there any
> other ongoing bug fixes we want to include in 2.4.2? If no I'd like to
> start the release process today (CST).
> >> >
> >> > Thanks,
> >> > Wenchen
> >> >
> >> > On Thu, Apr 18, 2019 at 3:44 AM Sean Owen  wrote:
> >> >>
> >> >> I think the 'only backport bug fixes to branches' principle remains
> sound. But what's a bug fix? Something that changes behavior to match what
> is explicitly supposed to happen, or implicitly supposed to happen --
> implied by what other similar things do, by reasonable user expectations,
> or simply how it worked previously.
> >> >>
> >> >> Is this a bug fix? I guess the criteria that matches is that
> behavior doesn't match reasonable user expectations? I don't know enough to
> have a strong opinion. I also don't think there is currently an objection
> to backporting it, whatever it's called.
> >> >>
> >> >>
> >> >> Is the question whether this needs a new release? There's no harm in
> another point release, other than needing a volunteer release manager. One
> could say, wait a bit longer to see what more info comes in about 2.4.1.
> But given that 2.4.1 took like 2 months, it's reasonable to move towards a
> release cycle again. I don't see objection to that either (?)
> >> >>
> >> >>
> >> >> The meta question remains: is a 'bug fix' definition even agreed,
> and being consistently applied? There aren't correct answers, only best
> guesses from each person's own experience, judgment and priorities. These
> can differ even when applied in good faith.
> >> >>
> >> >> Sometimes the variance of opinion comes because people have
> different info that needs to be surfaced. Here, maybe it's best to share
> what about that offline conversation was convincing, for example.
> >> >>
> >> >> I'd say it's also important to separate what one would prefer from
> what one can't live with(out). Assuming one trusts the intent and
> experience of the handful of others with an opinion, I'd defer to someone
> who wants X and will own it, even if I'm moderately against it. Otherwise
> we'd get little done.
> >> >>
> >> >> In that light, it seems like both of the PRs at issue here are not
> _wrong_ to backport. This is a good pair that highlights why, when there
> isn't a clear reason to do / not do something (e.g. obvious errors,
> breaking public APIs) we give benefit-of-the-doubt in order to get it later.
> >> >>
> >> >>
> >> >> On Wed, Apr 17, 2019 at 12:09 PM Ryan Blue 
> wrote:
> >> >>>
> >> >>> Sorry, I should be more clear about what I'm trying to say here.
> >> >>>
> >> >>> In the past, Xiao has taken the opposite stance. A good example is
> PR #21060 that was a very similar situation: behavior didn't match what was
> expected and there was low risk. There was a long argument and the patch
> didn't make it into 2.3 (to my know

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Reynold Xin
"if others think it would be helpful, we can cancel this vote, update the SPIP 
to clarify exactly what I am proposing, and then restart the vote after we have 
gotten more agreement on what APIs should be exposed"

That'd be very useful. At least I was confused by what the SPIP was about. No 
point voting on something when there is still a lot of confusion about what it 
is.

On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans < reva...@gmail.com > wrote:

> 
> 
> 
> Xiangrui Meng,
> 
> 
> 
> I provided some examples in the original discussion thread.
> 
> 
> 
> https:/ / lists. apache. org/ thread. html/ 
> f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@
> %3Cdev. spark. apache. org%3E (
> https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
> )
> 
> 
> 
> But the concrete use case that we have is GPU accelerated ETL on Spark.
> Primarily as data preparation and feature engineering for ML tools like
> XGBoost, which by the way exposes a Spark specific scala API, not just a
> python one. We built a proof of concept and saw decent performance gains.
> Enough gains to more than pay for the added cost of a GPU, with the
> potential for even better performance in the future. With that proof of
> concept, we were able to make all of the processing columnar end-to-end
> for many queries, so there really weren't any data conversion costs to
> overcome, but we did want the design flexible enough to include a
> cost-based optimizer.
> 
> 
> 
> It looks like there is some confusion around this SPIP especially in how
> it relates to features in other SPIPs around data exchange between
> different systems. I didn't want to update the text of this SPIP while it
> was under an active vote, but if others think it would be helpful, we can
> cancel this vote, update the SPIP to clarify exactly what I am proposing,
> and then restart the vote after we have gotten more agreement on what APIs
> should be exposed.
> 
> 
> 
> Thanks,
> 
> 
> 
> Bobby
> 
> 
> 
> On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng < mengxr@ gmail. com (
> men...@gmail.com ) > wrote:
> 
> 
>> 
>> 
>> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP. I
>> think the SPIP should list a concrete ETL use case (from POC?) that can
>> benefit from this *public Java/Scala API, *does *vectorization*, and
>> significantly *boosts the performance *even with data conversion overhead.
>> 
>> 
>> 
>> 
>> The current mid-term success (Pandas UDF) doesn't match the purpose of
>> SPIP and it can be done without exposing any public APIs.
>> 
>> 
>> 
>> Depending on how much benefit it brings, we might agree that a public
>> Java/Scala API is needed. Then we might want to step slightly into how. I
>> saw four options mentioned in the JIRA and discussion threads:
>> 
>> 
>> 
>> 1. Expose `Array[Byte]` in Arrow format. Let user decode it using an Arrow
>> library.
>> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
>> 3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also used
>> by Spark internals. It makes us hard to change Spark internals in the
>> future.
>> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
>> maintain conversion between internal `ColumnarBatch` and
>> `SparkRecordBatch`. It might cause conversion overhead in the future if
>> our internal becomes different from Arrow.
>> 
>> 
>> 
>> Note that both 3 and 4 will make many APIs public to be Arrow compatible.
>> So we should really give concrete ETL cases to prove that it is important
>> for us to do so.
>> 
>> 
>> 
>> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves < tgraves_cs@ yahoo. com (
>> tgraves...@yahoo.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> Based on there is still discussion and Spark Summit is this week, I'm
>>> going to extend the vote til Friday the 26th.
>>> 
>>> 
>>> 
>>> Tom
>>> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans < revans2@ gmail. com
>>> ( reva...@gmail.com ) > wrote:
>>> 
>>> 
>>> 
>>> Yes, it is technically possible for the layout to change. No, it is not
>>> going to happen. It is already baked into several different official
>>> libraries which are widely used, not just for holding and processing the
>>> data, but also for transfer of the data between the various
>>> implementations. There would have to be a really serious reason to force
>>> an incompatible change at this point. So in the worst case, we can version
>>> the layout and bake that into the API that exposes the internal layout of
>>> the data. That way code that wants to program against a JAVA API can do so
>>> using the API that Spark provides, those who want to interface with
>>> something that expects the data in arrow format will already have to know
>>> what version of the format it was programmed against and in the worst case
>>> if the layout does change we can support the new layout if needed.
>>> 
>>> 
>>> 
>>> On Sun, Apr 21, 2019 at 12:45 AM Bryan C

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-26 Thread Reynold Xin
I do feel it'd be better to not switch default Scala versions in a minor 
release. I don't know how much downstream this impacts. Dotnet is a good data 
point. Anybody else hit this issue?

On Thu, Apr 25, 2019 at 11:36 PM, Terry Kim < yumin...@gmail.com > wrote:

> 
> 
> 
> Very much interested in hearing what you folks decide. We currently have a
> couple asking us questions at https:/ / github. com/ dotnet/ spark/ issues
> ( https://github.com/dotnet/spark/issues ).
> 
> 
> 
> Thanks,
> Terry
> 
> 
> 
> --
> Sent from: http:/ / apache-spark-developers-list. 1001551. n3. nabble. com/
> ( http://apache-spark-developers-list.1001551.n3.nabble.com/ )
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
Looks like a great idea to make changes in Spark 3.0 to prepare for Scala 2.13 
upgrade.

Are there breaking changes that would require us to have two different source
trees for 2.12 vs 2.13?

On Fri, May 10, 2019 at 11:41 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> While that's not happening soon (2.13 isn't out), note that some of the
> changes to collections will be fairly breaking changes.
> 
> 
> 
> https:/ / issues. apache. org/ jira/ browse/ SPARK-25075 (
> https://issues.apache.org/jira/browse/SPARK-25075 )
> https:/ / docs. scala-lang. org/ overviews/ core/ collections-migration-213.
> html (
> https://docs.scala-lang.org/overviews/core/collections-migration-213.html
> )
> 
> 
> 
> Some of this may impact a public API, so may need to start proactively
> fixing stuff for 2.13 before 3.0 comes out where possible.
> 
> 
> 
> Here's an example: Traversable goes away. We have a method
> SparkConf.setAll(Traversable). We can't support 2.13 while that still
> exists. Of course, we can decide to deprecate it with replacement (use
> Iterable) and remove it in the version that supports 2.13. But that would
> mean a little breaking change, and we either have to accept that for a
> future 3.x release, or it waits until 4.x.
> 
> 
> 
> I wanted to put that on the radar now to gather opinions about whether
> this will probably be acceptable, or whether we really need to get methods
> like that changed before 3.0.
> 
> 
> 
> Also: there's plenty of straightforward but medium-sized changes we can
> make now in anticipation of 2.13 support, like, make the type of Seq we
> use everywhere explicit (will be good for a like 1000 file change I'm
> sure). Or see if we can swap out Traversable everywhere. Remove
> MutableList, etc.
> 
> 
> 
> I was going to start fiddling with that unless it just sounds too
> disruptive.
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-10 Thread Reynold Xin
Yea my main point is when we do support 2.13, it'd be great if we don't have to 
break APIs. That's why doing the prep work in 3.0 would be great.

On Fri, May 10, 2019 at 1:30 PM, Imran Rashid < iras...@cloudera.com > wrote:

> 
> +1 on making whatever api changes we can now for 3.0.
> 
> 
> I don't think that is making any commitments to supporting scala 2.13 in
> any specific version.  We'll have to deal with all the other points you
> raised when we do cross that bridge, but hopefully those are things we can
> cover in a minor release.
> 
> On Fri, May 10, 2019 at 2:31 PM Sean Owen < srowen@ gmail. com (
> sro...@gmail.com ) > wrote:
> 
> 
>> I really hope we don't have to have separate source trees for some files,
>> but yeah it's an option too. OK, will start looking into changes we can
>> make now that don't break things now, and deprecations we need to make now
>> proactively.
>> 
>> 
>> I should also say that supporting Scala 2.13 will mean dependencies have
>> to support Scala 2.13, and that could take a while, because there are a
>> lot.
>> In particular, I think we'll find our SBT 0.13 build won't make it,
>> perhaps just because of the plugins it needs. I tried updating to SBT 1.x
>> and it seemed to need quite a lot of rewrite, again in part due to how
>> newer plugin versions changed. I failed and gave up.
>> 
>> 
>> At some point maybe we figure out whether we can remove the SBT-based
>> build if it's super painful, but only if there's not much other choice.
>> That is for a future thread.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 1:51 PM Reynold Xin < rxin@ databricks. com (
>> r...@databricks.com ) > wrote:
>> 
>> 
>>> Looks like a great idea to make changes in Spark 3.0 to prepare for Scala
>>> 2.13 upgrade.
>>> 
>>> 
>>> 
>>> Are there breaking changes that would require us to have two different
>>> source trees for 2.12 vs 2.13?
>>> 
>>> 
>>> 
>>> On Fri, May 10, 2019 at 11:41 AM, Sean Owen < srowen@ gmail. com (
>>> sro...@gmail.com ) > wrote:
>>> 
>>>> 
>>>> 
>>>> While that's not happening soon (2.13 isn't out), note that some of the
>>>> changes to collections will be fairly breaking changes.
>>>> 
>>>> 
>>>> 
>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-25075 (
>>>> https://issues.apache.org/jira/browse/SPARK-25075 )
>>>> https:/ / docs. scala-lang. org/ overviews/ core/ 
>>>> collections-migration-213.
>>>> html (
>>>> https://docs.scala-lang.org/overviews/core/collections-migration-213.html
>>>> )
>>>> 
>>>> 
>>>> 
>>>> Some of this may impact a public API, so may need to start proactively
>>>> fixing stuff for 2.13 before 3.0 comes out where possible.
>>>> 
>>>> 
>>>> 
>>>> Here's an example: Traversable goes away. We have a method
>>>> SparkConf.setAll(Traversable). We can't support 2.13 while that still
>>>> exists. Of course, we can decide to deprecate it with replacement (use
>>>> Iterable) and remove it in the version that supports 2.13. But that would
>>>> mean a little breaking change, and we either have to accept that for a
>>>> future 3.x release, or it waits until 4.x.
>>>> 
>>>> 
>>>> 
>>>> I wanted to put that on the radar now to gather opinions about whether
>>>> this will probably be acceptable, or whether we really need to get methods
>>>> like that changed before 3.0.
>>>> 
>>>> 
>>>> 
>>>> Also: there's plenty of straightforward but medium-sized changes we can
>>>> make now in anticipation of 2.13 support, like, make the type of Seq we
>>>> use everywhere explicit (will be good for a like 1000 file change I'm
>>>> sure). Or see if we can swap out Traversable everywhere. Remove
>>>> MutableList, etc.
>>>> 
>>>> 
>>>> 
>>>> I was going to start fiddling with that unless it just sounds too
>>>> disruptive.
>>>> 
>>>> 
>>>> 
>>>> - To
>>>> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
>>>> dev-unsubscr...@spark.apache.org )
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: Interesting implications of supporting Scala 2.13

2019-05-11 Thread Reynold Xin
If the number of changes that would require two source trees is small, another
thing we can do is to reach out to the Scala team and kindly ask them whether 
they could change Scala 2.13 itself so it'd be easier to maintain compatibility 
with Scala 2.12.

On Sat, May 11, 2019 at 4:25 PM, Sean Owen < sro...@gmail.com > wrote:

> 
> 
> 
> For those interested, here's the first significant problem I see that will
> require separate source trees or a breaking change: https:/ / issues. apache.
> org/ jira/ browse/ SPARK-27683?focusedCommentId=16837967&page=com. atlassian.
> jira. plugin. system. issuetabpanels%3Acomment-tabpanel#comment-16837967 (
> https://issues.apache.org/jira/browse/SPARK-27683?focusedCommentId=16837967&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16837967
> )
> 
> 
> 
> Interested in thoughts on how to proceed on something like this, as there
> will probably be a few more similar issues.
> 
> 
> 
> On Fri, May 10, 2019 at 3:32 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> 
>> 
>> Yea my main point is when we do support 2.13, it'd be great if we don't
>> have to break APIs. That's why doing the prep work in 3.0 would be great.
>> 
>> 
>> 
>> On Fri, May 10, 2019 at 1:30 PM, Imran Rashid < irashid@ cloudera. com (
>> iras...@cloudera.com ) > wrote:
>> 
>> 
>>> 
>>> 
>>> +1 on making whatever api changes we can now for 3.0.
>>> 
>>> 
>>> 
>>> I don't think that is making any commitments to supporting scala 2.13 in
>>> any specific version. We'll have to deal with all the other points you
>>> raised when we do cross that bridge, but hopefully those are things we can
>>> cover in a minor release.
>>> 
>>> 
>>> 
>>> On Fri, May 10, 2019 at 2:31 PM Sean Owen < srowen@ gmail. com (
>>> sro...@gmail.com ) > wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> I really hope we don't have to have separate source trees for some files,
>>>> but yeah it's an option too. OK, will start looking into changes we can
>>>> make now that don't break things now, and deprecations we need to make now
>>>> proactively.
>>>> 
>>>> 
>>>> 
>>>> I should also say that supporting Scala 2.13 will mean dependencies have
>>>> to support Scala 2.13, and that could take a while, because there are a
>>>> lot. In particular, I think we'll find our SBT 0.13 build won't make it,
>>>> perhaps just because of the plugins it needs. I tried updating to SBT 1.x
>>>> and it seemed to need quite a lot of rewrite, again in part due to how
>>>> newer plugin versions changed. I failed and gave up.
>>>> 
>>>> 
>>>> 
>>>> At some point maybe we figure out whether we can remove the SBT-based
>>>> build if it's super painful, but only if there's not much other choice.
>>>> That is for a future thread.
>>>> 
>>>> 
>>>> 
>>>> On Fri, May 10, 2019 at 1:51 PM Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Looks like a great idea to make changes in Spark 3.0 to prepare for Scala
>>>>> 2.13 upgrade.
>>>>> 
>>>>> 
>>>>> 
>>>>> Are there breaking changes that would require us to have two different
>>>>> source trees for 2.12 vs 2.13?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, May 10, 2019 at 11:41 AM, Sean Owen < srowen@ gmail. com (
>>>>> sro...@gmail.com ) > wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> While that's not happening soon (2.13 isn't out), note that some of the
>>>>>> changes to collections will be fairly breaking changes.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-25075 (
>>>>>> https://issues.apache.org/jira/browse/SPARK-25075 )
>>>>>> https:/ / docs. scala-lang. org/ overviews/ core/ 
>>>>>> collections-migration-213.
>>>>>> html (
>>>>>> https://docs.scala-lang.org/overviews/core/coll

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-25 Thread Reynold Xin
Can we push this to June 1st? I have been meaning to read it but
unfortunately I keep traveling...

On Sat, May 25, 2019 at 8:31 PM Dongjoon Hyun 
wrote:

> +1
>
> Thanks,
> Dongjoon.
>
> On Fri, May 24, 2019 at 17:03 DB Tsai  wrote:
>
>> +1 on exposing the APIs for columnar processing support.
>>
>> I understand that the scope of this SPIP doesn't cover AI / ML
>> use-cases. But I saw a good performance gain when I converted data
>> from rows to columns to leverage on SIMD architectures in a POC ML
>> application.
>>
>> With the exposed columnar processing support, I can imagine that the
>> heavy lifting parts of ML applications (such as computing the
>> objective functions) can be written as columnar expressions that
>> leverage on SIMD architectures to get a good speedup.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>> On Wed, May 15, 2019 at 2:59 PM Bobby Evans  wrote:
>> >
>> > It would allow for the columnar processing to be extended through the
>> shuffle.  So if I were doing say an FPGA accelerated extension it could
>> replace the ShuffleExchangeExec with one that can take a ColumnarBatch as
>> input instead of a Row. The extended version of the ShuffleExchangeExec
>> could then do the partitioning on the incoming batch and instead of
>> producing a ShuffleRowRDD for the exchange they could produce something
>> like a ShuffleBatchRDD that would let the serializing and deserializing
>> happen in a column based format for a faster exchange, assuming that
>> columnar processing is also happening after the exchange. This is just like
>> providing a columnar version of any other catalyst operator, except in this
>> case it is a bit more complex of an operator.
>> >
>> > On Wed, May 15, 2019 at 12:15 PM Imran Rashid
>>  wrote:
>> >>
>> >> sorry I am late to the discussion here -- the jira mentions using this
>> extensions for dealing with shuffles, can you explain that part?  I don't
>> see how you would use this to change shuffle behavior at all.
>> >>
>> >> On Tue, May 14, 2019 at 10:59 AM Thomas graves 
>> wrote:
>> >>>
>> >>> Thanks for replying, I'll extend the vote til May 26th to allow your
>> >>> and other people feedback who haven't had time to look at it.
>> >>>
>> >>> Tom
>> >>>
>> >>> On Mon, May 13, 2019 at 4:43 PM Holden Karau 
>> wrote:
>> >>> >
>> >>> > I’d like to ask this vote period to be extended, I’m interested but
>> I don’t have the cycles to review it in detail and make an informed vote
>> until the 25th.
>> >>> >
>> >>> > On Tue, May 14, 2019 at 1:49 AM Xiangrui Meng 
>> wrote:
>> >>> >>
>> >>> >> My vote is 0. Since the updated SPIP focuses on ETL use cases, I
>> don't feel strongly about it. I would still suggest doing the following:
>> >>> >>
>> >>> >> 1. Link the POC mentioned in Q4. So people can verify the POC
>> result.
>> >>> >> 2. List public APIs we plan to expose in Appendix A. I did a quick
>> check. Besides ColumnarBatch and ColumnVector, we also need to make the
>> following public. People who are familiar with SQL internals should help
>> assess the risk.
>> >>> >> * ColumnarArray
>> >>> >> * ColumnarMap
>> >>> >> * unsafe.types.CalendarInterval
>> >>> >> * ColumnarRow
>> >>> >> * UTF8String
>> >>> >> * ArrayData
>> >>> >> * ...
>> >>> >> 3. I still feel using Pandas UDF as the mid-term success doesn't
>> match the purpose of this SPIP. It does make some code cleaner. But I guess
>> for ETL use cases, it won't bring much value.
>> >>> >>
>> >>> > --
>> >>> > Twitter: https://twitter.com/holdenkarau
>> >>> > Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> >>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [RESULT][VOTE] SPIP: Public APIs for extended Columnar Processing Support

2019-05-29 Thread Reynold Xin
Thanks Tom.

I finally had time to look at the updated SPIP 10 mins ago. I support the high 
level idea and +1 on the SPIP.

That said, I think the proposed API is too complicated and invasive change to 
the existing internals. A much simpler API would be to expose a columnar batch 
iterator interface, i.e. an uber column oriented UDF with ability to manage 
life cycle. Once we have that, we can also refactor the existing Python UDFs to 
use that interface.
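For concreteness, one possible shape of such an interface, as a purely hypothetical sketch (these names are not an existing Spark API): a batch-in, batch-out function with explicit lifecycle hooks so implementations can own GPU/FPGA/native resources.

    import java.io.Closeable
    import org.apache.spark.sql.vectorized.ColumnarBatch

    // Hypothetical "uber column oriented UDF": consumes an iterator of batches,
    // produces an iterator of batches, and manages its own resource life cycle.
    trait ColumnarBatchFunction extends Closeable {
      def open(): Unit                                              // acquire resources
      def process(input: Iterator[ColumnarBatch]): Iterator[ColumnarBatch]
      override def close(): Unit                                    // release resources
    }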

As I said earlier (couple months ago when this was first surfaced?), I support 
the idea to enable *external* column oriented processing logic, but not 
changing Spark itself to have two processing modes, which is simply very 
complicated and would create very high maintenance burden for the project.

On Wed, May 29, 2019 at 9:49 PM, Thomas graves < tgra...@apache.org > wrote:

> 
> 
> 
> Hi all,
> 
> 
> 
> The vote passed with 9 +1's (4 binding) and 1 +0 and no -1's.
> 
> 
> 
> +1s (* = binding) :
> Bobby Evans*
> Thomas Graves*
> DB Tsai*
> Felix Cheung*
> Bryan Cutler
> Kazuaki Ishizaki
> Tyson Condie
> Dongjoon Hyun
> Jason Lowe
> 
> 
> 
> +0s:
> Xiangrui Meng
> 
> 
> 
> Thanks,
> Tom Graves
> 
> 
> 
> - To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Reynold Xin
+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:

>> I don't have a good sense of the overhead of continuing to support
>> Python 2; is it large enough to consider dropping it in Spark 3.0?
>>
> from the build/test side, it will actually be pretty easy to continue
> support for python2.7 for spark 2.x as the feature sets won't be expanding.
>
> that being said, i will be cracking a bottle of champagne when i can
> delete all of the ansible and anaconda configs for python2.x.  :)
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Reynold Xin
Seems like a good idea. Can we test this with a component first?

On Thu, Jun 13, 2019 at 6:17 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we use both Apache JIRA and GitHub actively for Apache Spark
> contributions, we have lots of JIRAs and PRs consequently. One specific
> thing I've been longing to see is `Jira Issue Type` in GitHub.
>
> How about exposing JIRA issue types at GitHub PRs as GitHub `Labels`?
> There are two main benefits:
> 1. It helps the communication between the contributors and reviewers with
> more information.
> (In some cases, some people only visit GitHub to see the PR and
> commits)
> 2. `Labels` is searchable. We don't need to visit Apache Jira to search
> PRs to see a specific type.
> (For example, the reviewers can see and review 'BUG' PRs first by
> using `is:open is:pr label:BUG`.)
>
> Of course, this can be done automatically without human intervention.
> Since we already have GitHub Jenkins job to access JIRA/GitHub, that job
> can add the labels from the beginning. If needed, I can volunteer to update
> the script.
>
> To show the demo, I labeled several PRs manually. You can see the result
> right now in Apache Spark PR page.
>
>   - https://github.com/apache/spark/pulls
>
> If you're surprised due to those manual activities, I want to apologize
> for that. I hope we can take advantage of the existing GitHub features to
> serve Apache Spark community in a way better than yesterday.
>
> How do you think about this specific suggestion?
>
> Bests,
> Dongjoon
>
> PS. I saw that `Request Review` and `Assign` features are already used for
> some purposes, but these features are out of the scope of this email.
>


Re: Disabling `Merge Commits` from GitHub Merge Button

2019-07-01 Thread Reynold Xin
That's a good idea. We should only be using squash.

On Mon, Jul 01, 2019 at 1:52 PM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Hi, Apache Spark PMC members and committers.
> 
> 
> We are using GitHub `Merge Button` in `spark-website` repository
> because it's very convenient.
> 
> 
>     1. https:/ / github. com/ apache/ spark-website/ commits/ asf-site (
> https://github.com/apache/spark-website/commits/asf-site )
> 
>     2. https:/ / github. com/ apache/ spark/ commits/ master (
> https://github.com/apache/spark/commits/master )
> 
> 
> In order to be consistent with our previous behavior,
> can we disable `Allow Merge Commits` from GitHub `Merge Button` setting
> explicitly?
> 
> 
> I hope we can enforce it in both `spark-website` and `spark` repository
> consistently.
> 
> 
> Bests,
> Dongjoon.
>

Revisiting Python / pandas UDF

2019-07-05 Thread Reynold Xin
Hi all,

In the past two years, the pandas UDFs are perhaps the most important changes 
to Spark for Python data science. However, these functionalities have evolved 
organically, leading to some inconsistencies and confusions among users. I 
created a ticket and a document summarizing the issues, and a concrete proposal 
to fix them (the changes are pretty small). Thanks Xiangrui for initially 
bringing this to my attention, and Li Jin, Hyukjin, for offline discussions.

Please take a look: 

https://issues.apache.org/jira/browse/SPARK-28264

https://docs.google.com/document/u/1/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit but a JVM string cannot be bigger than 2G. It will 
also at some point run out of memory with too big of a query plan tree or 
become incredibly slow due to query planning complexity. I've seen queries that 
are tens of MBs in size.
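A rough way to see the planning-complexity side of this, assuming a spark-shell session where `spark` is the active SparkSession (the branch count is arbitrary):

    // Generate a query with many UNION ALL branches and time the planning step.
    val sql = (1 to 2000).map(i => s"SELECT $i AS id").mkString(" UNION ALL ")
    println(s"query length: ${sql.length} characters")

    val start = System.nanoTime()
    val df = spark.sql(sql)
    df.queryExecution.optimizedPlan   // forces analysis and optimization
    println(s"planning took ${(System.nanoTime() - start) / 1e9} seconds")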

On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmont...@126.com > wrote:

> 
> I have a question about the limit(biggest) of SQL's length that is
> supported in SparkSQL. I can't find the answer in the documents of Spark.
> 
> 
> Maybe Interger.MAX_VALUE or not ?
> 
> 
> 
>

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
No sorry I'm not at liberty to share other people's code.

On Fri, Jul 12, 2019 at 9:33 AM, Gourav Sengupta < gourav.sengu...@gmail.com > 
wrote:

> 
> Hi Reynold,
> 
> 
> I am genuinely curious about queries which are more than 1 MB and am
> stunned by tens of MB's. Any samples to share :) 
> 
> 
> Regards,
> Gourav
> 
> On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> There is no explicit limit but a JVM string cannot be bigger than 2G. It
>> will also at some point run out of memory with too big of a query plan
>> tree or become incredibly slow due to query planning complexity. I've seen
>> queries that are tens of MBs in size.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Thu, Jul 11, 2019 at 5:01 AM, 李书明 < alemmontree@ 126. com (
>> alemmont...@126.com ) > wrote:
>> 
>>> I have a question about the limit(biggest) of SQL's length that is
>>> supported in SparkSQL. I can't find the answer in the documents of Spark.
>>> 
>>> 
>>> Maybe Interger.MAX_VALUE or not ?
>>> 
>> 
>> 
> 
>

Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-23 Thread Reynold Xin
I like the spirit, but not sure about the exact proposal. Take a look at
k8s':
https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md



On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon  wrote:

> (Plus, it helps to track history too. Spark's commit logs are growing and
> now it's pretty difficult to track the history and see what change
> introduced a specific behaviour)
>
> On Wed, Jul 24, 2019 at 12:20 PM, Hyukjin Kwon wrote:
>
> Hi all,
>
> I would like to discuss about some new sections under "## What changes
> were proposed in this pull request?":
>
> ### Do the changes affect _any_ user/dev-facing input or output?
>
> (Please answer yes or no. If yes, answer the questions below)
>
> ### What was the previous behavior?
>
> (Please provide the console output, description and/or reproducer about the 
> previous behavior)
>
> ### What is the behavior the changes propose?
>
> (Please provide the console output, description and/or reproducer about the 
> behavior the changes propose)
>
> See
> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>  .
>
> From my experience so far in Spark community, and assuming from the
> interaction with other
> committers and contributors, It is pretty critical to know before/after
> behaviour changes even if it
> was a bug. In addition, I think this is requested by reviewers often.
>
> The new sections will make review process much easier, and we're able to
> quickly judge how serious the changes are.
> Given that the Spark community still suffers from open PRs just queueing up
> without review, I think this can help
> both reviewers and PR authors.
>
> I do describe them often when I think it's useful and possible.
> For instance see https://github.com/apache/spark/pull/24927 - I am sure
> you guys have a clear idea what the
> PR fixes.
>
> I cc'ed some guys I can currently think of for now FYI. Please let me know
> if you guys have any thought on this!
>
>


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Reynold Xin
Matt what do you mean by maximizing 3, while allowing not throwing errors when 
any operations overflow? Those two seem contradicting.

On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:

> 
> 
> 
> I’m -1, simply from disagreeing with the premise that we can afford to not
> be maximal on standard 3. The correctness of the data is non-negotiable,
> and whatever solution we settle on cannot silently adjust the user’s data
> under any circumstances.
> 
> 
> 
>  
> 
> 
> 
> I think the existing behavior is fine, or perhaps the behavior can be
> flagged by the destination writer at write time.
> 
> 
> 
>  
> 
> 
> 
> -Matt Cheah
> 
> 
> 
>  
> 
> 
> 
> *From:* Hyukjin Kwon < gurwls...@gmail.com >
> *Date:* Monday, July 29, 2019 at 11:33 PM
> *To:* Wenchen Fan < cloud0...@gmail.com >
> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
> linguin@gmail.com
> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
> >rb...@netflix.com
> >, Spark dev list < dev@spark.apache.org >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> 
> 
> 
>  
> 
> 
> 
> 
> From my look, +1 on the proposal, considering ANSI and other DBMSes in
> general.
> 
> 
> 
> 
>  
> 
> 
> 
> On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan < cloud0...@gmail.com > wrote:
> 
> 
> 
> 
>> 
>> 
>> We can add a config for a certain behavior if it makes sense, but the most
>> important thing we want to reach an agreement here is: what should be the
>> default behavior?
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Let's explore the solution space of table insertion behavior first:
>> 
>> 
>> 
>> 
>> At compile time,
>> 
>> 
>> 
>> 
>> 1. always add cast
>> 
>> 
>> 
>> 
>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>> int is forbidden but long to int is allowed)
>> 
>> 
>> 
>> 
>> 3. only add cast if it's 100% safe
>> 
>> 
>> 
>> 
>> At runtime,
>> 
>> 
>> 
>> 
>> 1. return null for invalid operations
>> 
>> 
>> 
>> 
>> 2. throw exceptions at runtime for invalid operations
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The standards to evaluate a solution:
>> 
>> 
>> 
>> 
>> 1. How robust the query execution is. For example, users usually don't
>> want to see the query fail midway.
>> 
>> 
>> 
>> 
>> 2. how tolerant to user queries. For example, a user would like to write
>> long values to an int column as he knows all the long values won't exceed
>> int range.
>> 
>> 
>> 
>> 
>> 3. How clean the result is. For example, users usually don't want to see
>> silently corrupted data (null values).
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The current Spark behavior for Data Source V1 tables: always add cast and
>> return null for invalid operations. This maximizes standard 1 and 2, but
>> the result is least clean and users are very likely to see silently
>> corrupted data (null values).
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The current Spark behavior for Data Source V2 tables (new in Spark 3.0):
>> only add cast if it's 100% safe. This maximizes standard 1 and 3, but many
>> queries may fail to compile, even if these queries can run on other SQL
>> systems. Note that, people can still see silently corrupted data because
>> cast is not the only one that can return corrupted data. Simple operations
>> like ADD can also return corrupted data if overflow happens. e.g. INSERT
>> INTO t1 (intCol) SELECT anotherIntCol + 100 FROM t2 
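A small illustration of that runtime overflow point, assuming a spark-shell session with the default (non-ANSI) behavior of the time:

    // 2147483647 is Int.MaxValue, so the addition overflows and silently wraps.
    spark.sql("SELECT 2147483647 + 1 AS v").collect()
    // -> Array([-2147483648])   no error and no null, just a corrupted value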
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> The proposal here: add cast following ANSI SQL store assignment rule, and
>> return null for invalid operations. This maximizes standard 1, and also
>> fits standard 2 well: if a query can't compile in Spark, it usually can't
>> compile in other mainstream databases as well. I think that's tolerant
>> enough. For standard 3, this proposal doesn't maximize it but can avoid
>> many invalid operations already.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> Technically we can't make the result 100% clean at compile-time, we have
>> to handle things like overflow at runtime. I think the new proposal makes
>> more sense as the default behavior.
>> 
>> 
>> 
>> 
>>   
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> On Mon, Jul 29, 2019 at 8:31 PM Russell Spitzer < russell.spit...@gmail.com
>> > wrote:
>> 
>> 
>> 
>>> 
>>> 
>>> I understand spark is making the decisions, i'm say the actual final
>>> effect of the null decision would be different depending on the insertion
>>> target if the target has different behaviors for null.
>>> 
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> On Mon, Jul 29, 2019 at 5:26 AM Wenchen Fan < cloud0...@gmail.com > wrote:
>>> 
>>> 
>>> 
>>> 
 
 
 > I'm a big -1 on null values for invalid casts.
 
 
 
  
 
 
 
 
 This is why we want to introduce the ANSI mode, so that invalid cast fails
 at runtime. But we have to keep the null behavior for a while, to keep
 backward compatibility. Spark returns null for invalid cast since the
 first day of Spark SQL, we c

Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-31 Thread Reynold Xin
OK to push back: "disagreeing with the premise that we can afford to not be 
maximal on standard 3. The correctness of the data is non-negotiable, and 
whatever solution we settle on cannot silently adjust the user’s data under any 
circumstances."

This blanket statement sounds great on surface, but there are a lot of 
subtleties. "Correctness" is absolutely important, but engineering/prod 
development are often about tradeoffs, and the industry has consistently traded 
correctness for performance or convenience, e.g. overflow checks, null 
pointers, consistency in databases ...

It all depends on the use cases and to what degree use cases can tolerate. For 
example, while I want my data engineering production pipeline to throw any 
error when the data doesn't match my expectations (e.g. type widening, 
overflow), if I'm doing some quick and dirty data science, I don't want the job 
to just fail due to outliers.

On Wed, Jul 31, 2019 at 10:13 AM, Matt Cheah < mch...@palantir.com > wrote:

> 
> 
> 
> Sorry I meant the current behavior for V2, which fails the query
> compilation if the cast is not safe.
> 
> 
> 
>  
> 
> 
> 
> Agreed that a separate discussion about overflow might be warranted. I’m
> surprised we don’t throw an error now, but it might be warranted to do so.
> 
> 
> 
> 
>  
> 
> 
> 
> -Matt Cheah
> 
> 
> 
>  
> 
> 
> 
> *From:* Reynold Xin < r...@databricks.com >
> *Date:* Wednesday, July 31, 2019 at 9:58 AM
> *To:* Matt Cheah < mch...@palantir.com >
> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
> linguin@gmail.com
> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
> >rb...@netflix.com
> >, Spark dev list < dev@spark.apache.org >, Hyukjin Kwon < gurwls...@gmail.com
> >, Wenchen Fan < cloud0...@gmail.com >
> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
> 
> 
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Matt what do you mean by maximizing 3, while allowing not throwing errors
> when any operations overflow? Those two seem contradicting.
> 
> 
> 
> 
>  
> 
> 
> 
> 
>  
> 
> 
> 
> On Wed, Jul 31, 2019 at 9:55 AM, Matt Cheah < mch...@palantir.com > wrote:
> 
> 
> 
>> 
>> 
>> I’m -1, simply from disagreeing with the premise that we can afford to not
>> be maximal on standard 3. The correctness of the data is non-negotiable,
>> and whatever solution we settle on cannot silently adjust the user’s data
>> under any circumstances.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> I think the existing behavior is fine, or perhaps the behavior can be
>> flagged by the destination writer at write time.
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> -Matt Cheah
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> *From:* Hyukjin Kwon < gurwls...@gmail.com >
>> *Date:* Monday, July 29, 2019 at 11:33 PM
>> *To:* Wenchen Fan < cloud0...@gmail.com >
>> *Cc:* Russell Spitzer < russell.spit...@gmail.com >, Takeshi Yamamuro < 
>> linguin@gmail.com
>> >, Gengliang Wang < gengliang.w...@databricks.com >, Ryan Blue < 
>> >rb...@netflix.com
>> >, Spark dev list < dev@spark.apache.org >
>> *Subject:* Re: [Discuss] Follow ANSI SQL on table insertion
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> 
>> From my look, +1 on the proposal, considering ANSI and other DBMSes in
>> general.
>> 
>> 
>> 
>> 
>>  
>> 
>> 
>> 
>> On Tue, Jul 30, 2019 at 3:21 PM, Wenchen Fan < cloud0...@gmail.com > wrote:
>> 
>> 
>> 
>> 
>>> 
>>> 
>>> We can add a config for a certain behavior if it makes sense, but the most
>>> important thing we want to reach an agreement here is: what should be the
>>> default behavior?
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>>> 
>>> 
>>> Let's explore the solution space of table insertion behavior first:
>>> 
>>> 
>>> 
>>> 
>>> At compile time,
>>> 
>>> 
>>> 
>>> 
>>> 1. always add cast
>>> 
>>> 
>>> 
>>> 
>>> 2. add cast following the ANSI SQL store assignment rule (e.g. string to
>>> int is forbidden but long to int is allowed)
>>> 
>>> 
>>> 
>>> 
>>&

Re: DataSourceV2 : Transactional Write support

2019-08-05 Thread Reynold Xin
We can also just write using one partition, which will be sufficient for
most use cases.

On Mon, Aug 5, 2019 at 7:48 PM Matt Cheah  wrote:

> There might be some help from the staging table catalog as well.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Wenchen Fan 
> *Date: *Monday, August 5, 2019 at 7:40 PM
> *To: *Shiv Prashant Sood 
> *Cc: *Ryan Blue , Jungtaek Lim ,
> Spark Dev List 
> *Subject: *Re: DataSourceV2 : Transactional Write support
>
>
>
> I agree with the temp table approach. One idea is: maybe we only need one
> temp table, and each task writes to this temp table. At the end we read the
> data from the temp table and write it to the target table. AFAIK JDBC can
> handle concurrent table writing very well, and it's better than creating
> thousands of temp tables for one write job(assume the input RDD has
> thousands of partitions).
>
>
>
> On Tue, Aug 6, 2019 at 7:57 AM Shiv Prashant Sood 
> wrote:
>
> Thanks all for the clarification.
>
>
>
> Regards,
>
> Shiv
>
>
>
> On Sat, Aug 3, 2019 at 12:49 PM Ryan Blue 
> wrote:
>
> > What you could try instead is intermediate output: inserting into
> temporal table in executors, and move inserted records to the final table
> in driver (must be atomic)
>
>
>
> I think that this is the approach that other systems (maybe sqoop?) have
> taken. Insert into independent temporary tables, which can be done quickly.
> Then for the final commit operation, union and insert into the final table.
> In a lot of cases, JDBC databases can do that quickly as well because the
> data is already on disk and just needs to be added to the final table.
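A rough sketch of that staging-table pattern in plain JDBC terms; every name here (jdbcUrl, finalTable, the staging-table naming, the single-int-column schema) is made up for illustration, and whether the final move is atomic depends on the target database:

    import java.sql.DriverManager

    // Executor side: each write task loads its partition into its own staging
    // table and reports the staging table's name back to the driver as its
    // "commit message".
    def writeTask(jdbcUrl: String, finalTable: String, taskId: Long,
                  rows: Iterator[Int]): String = {
      val staging = s"${finalTable}_staging_$taskId"
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val stmt = conn.createStatement()
        stmt.execute(s"CREATE TABLE $staging AS SELECT * FROM $finalTable WHERE 1 = 0")
        rows.foreach(v => stmt.execute(s"INSERT INTO $staging VALUES ($v)"))  // batched in practice
        staging
      } finally conn.close()
    }

    // Driver side: only after every task has reported success, move the staged
    // rows into the final table in one transaction, then drop the staging tables.
    def commitJob(jdbcUrl: String, finalTable: String, stagingTables: Seq[String]): Unit = {
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        conn.setAutoCommit(false)
        val stmt = conn.createStatement()
        stagingTables.foreach(s => stmt.execute(s"INSERT INTO $finalTable SELECT * FROM $s"))
        conn.commit()
        stagingTables.foreach(s => stmt.execute(s"DROP TABLE $s"))
      } finally conn.close()
    }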
>
>
>
> On Fri, Aug 2, 2019 at 7:25 PM Jungtaek Lim  wrote:
>
> I asked similar question for end-to-end exactly-once with Kafka, and
> you're correct distributed transaction is not supported. Introducing
> distributed transaction like "two-phase commit" requires huge change on
> Spark codebase and the feedback was not positive.
>
>
>
> What you could try instead is intermediate output: inserting into temporal
> table in executors, and move inserted records to the final table in driver
> (must be atomic).
>
>
>
> Thanks,
>
> Jungtaek Lim (HeartSaVioR)
>
>
>
> On Sat, Aug 3, 2019 at 4:56 AM Shiv Prashant Sood 
> wrote:
>
> All,
>
>
>
> I understood that DataSourceV2 supports Transactional write and wanted to
> implement that in JDBC DataSource V2 connector (PR #25211).
>
>
>
> Don't see how this is feasible for a JDBC-based connector. The FW suggests
> that EXECUTORs send a commit message to the DRIVER, and the actual commit should
> only be done by the DRIVER after receiving all commit confirmations. This will
> not work for JDBC as commits have to happen on the JDBC Connection, which
> is maintained by the EXECUTORS, and the JDBC Connection is not serializable, so
> it cannot be sent to the DRIVER.
>
>
>
> Am i right in thinking that this cannot be supported for JDBC? My goal is
> to either fully write or roll back the dataframe write operation.
>
>
>
> Thanks in advance for your help.
>
>
>
> Regards,
>
> Shiv
>
>
>
>
> --
>
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>


Re: JDK11 Support in Apache Spark

2019-08-26 Thread Reynold Xin
Would it be possible to have one build that works for both?

On Mon, Aug 26, 2019 at 10:22 AM Dongjoon Hyun 
wrote:

> Thank you all!
>
> Let me add more explanation on the current status.
>
> - If you want to run on JDK8, you need to build on JDK8
> - If you want to run on JDK11, you need to build on JDK11.
>
> The other combinations will not work.
>
> Currently, we have two Jenkins jobs. (1) is the one I pointed, and (2) is
> the one for the remaining community work.
>
> 1) Build and test on JDK11 (spark-master-test-maven-hadoop-3.2-jdk-11)
> 2) Build on JDK8 and test on JDK11
> (spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing)
>
> To keep JDK11 compatibility, the following is merged today.
>
> [SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11
> support for pull request builds
>
> But, we still have many stuffs to do for Jenkins/Release and we need your
> support about JDK11. :)
>
> Bests,
> Dongjoon.
>
>
> On Sun, Aug 25, 2019 at 10:30 PM Takeshi Yamamuro 
> wrote:
>
>> Cool, congrats!
>>
>> Bests,
>> Takeshi
>>
>> On Mon, Aug 26, 2019 at 1:01 PM Hichame El Khalfi 
>> wrote:
>>
>>> That's Awesome !!!
>>>
>>> Thanks to everyone that made this possible :cheers:
>>>
>>> Hichame
>>>
>>> *From:* cloud0...@gmail.com
>>> *Sent:* August 25, 2019 10:43 PM
>>> *To:* lix...@databricks.com
>>> *Cc:* felixcheun...@hotmail.com; ravishankar.n...@gmail.com;
>>> dongjoon.h...@gmail.com; dev@spark.apache.org; u...@spark.apache.org
>>> *Subject:* Re: JDK11 Support in Apache Spark
>>>
>>> Great work!
>>>
>>> On Sun, Aug 25, 2019 at 6:03 AM Xiao Li  wrote:
>>>
 Thank you for your contributions! This is a great feature for Spark
 3.0! We finally achieve it!

 Xiao

 On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> That’s great!
>
> --
> *From:* ☼ R Nair 
> *Sent:* Saturday, August 24, 2019 10:57:31 AM
> *To:* Dongjoon Hyun 
> *Cc:* dev@spark.apache.org ; user @spark/'user
> @spark'/spark users/user@spark 
> *Subject:* Re: JDK11 Support in Apache Spark
>
> Finally!!! Congrats
>
> On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Thanks to your many many contributions,
>> Apache Spark master branch starts to pass on JDK11 as of today.
>> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
>> (JDK11 is used for building and testing.)
>>
>> We already verified all UTs (including PySpark/SparkR) before.
>>
>> Please feel free to use JDK11 in order to build/test/run `master`
>> branch and
>> share your experience including any issues. It will help Apache Spark
>> 3.0.0 release.
>>
>> For the follow-ups, please follow
>> https://issues.apache.org/jira/browse/SPARK-24417 .
>> The next step is `how to support JDK8/JDK11 together in a single
>> artifact`.
>>
>> Bests,
>> Dongjoon.
>>
>

 --

>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: JDK11 Support in Apache Spark

2019-08-26 Thread Reynold Xin
Exactly - I think it's important to be able to create a single binary build. 
Otherwise downstream users (the 99.99% won't be building their own Spark but 
just pull it from Maven) will have to deal with the mess, and it's even worse 
for libraries.

On Mon, Aug 26, 2019 at 10:51 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > 
wrote:

> 
> Oh, right. If you want to publish something to Maven, it will inherit the
> situation.
> Thank you for feedback. :)
> 
> On Mon, Aug 26, 2019 at 10:37 AM Michael Heuer < heuermh@ gmail. com (
> heue...@gmail.com ) > wrote:
> 
> 
>> That is not true for any downstream users who also provide a library. 
>> Whatever build mess you create in Apache Spark, we'll have to inherit it. 
>> ;)
>> 
>> 
>>    michael
>> 
>> 
>> 
>> 
>>> On Aug 26, 2019, at 12:32 PM, Dongjoon Hyun < dongjoon. hyun@ gmail. com (
>>> dongjoon.h...@gmail.com ) > wrote:
>>> 
>>> As Shane wrote, not yet.
>>> 
>>> 
>>> `one build for works for both` is our aspiration and the next step
>>> mentioned in the first email.
>>> 
>>> 
>>> 
>>> > The next step is `how to support JDK8/JDK11 together in a single
>>> artifact`.
>>> 
>>> 
>>> For the downstream users who build from the Apache Spark source, that will
>>> not be a blocker because they will prefer a single JDK.
>>> 
>>> 
>>> 
>>> Bests,
>>> Dongjoon.
>>> 
>>> On Mon, Aug 26, 2019 at 10:28 AM Shane Knapp < sknapp@ berkeley. edu (
>>> skn...@berkeley.edu ) > wrote:
>>> 
>>> 
>>>> maybe in the future, but not right now as the hadoop 2.7 build is broken.
>>>> 
>>>> 
>>>> also, i busted dev/run-tests.py in my changes
>>>> to support java11 in PRBs:
>>>> https://github.com/apache/spark/pull/25585
>>>> 
>>>> 
>>>> 
>>>> quick fix, testing now.
>>>> 
>>>> On Mon, Aug 26, 2019 at 10:23 AM Reynold Xin < rxin@ databricks. com (
>>>> r...@databricks.com ) > wrote:
>>>> 
>>>> 
>>>>> Would it be possible to have one build that works for both?
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-09-05 Thread Reynold Xin
Having three modes is a lot. Why not just use ansi mode as default, and legacy 
for backward compatibility? Then over time there's only the ANSI mode, which is 
standard compliant and easy to understand. We also don't need to invent a 
standard just for Spark.

On Thu, Sep 05, 2019 at 12:27 AM, Wenchen Fan < cloud0...@gmail.com > wrote:

> 
> +1
> 
> 
> To be honest I don't like the legacy policy. It's too loose and easy for
> users to make mistakes, especially when Spark returns null if a function
> hits errors like overflow.
> 
> 
> The strict policy is not good either. It's too strict and stops valid use
> cases like writing timestamp values to a date type column. Users do expect
> truncation to happen without adding cast manually in this case. It's also
> weird to use a spark specific policy that no other database is using.
> 
> 
> The ANSI policy is better. It stops invalid use cases like writing string
> values to an int type column, while keeping valid use cases like timestamp
> -> date.
> 
> 
> I think it's no doubt that we should use ANSI policy instead of legacy
> policy for v1 tables. Except for backward compatibility, ANSI policy is
> literally better than the legacy policy.
> 
> 
> The v2 table is arguable here. Although the ANSI policy is better than
> strict policy to me, this is just the store assignment policy, which only
> partially controls the table insertion behavior. With Spark's "return null
> on error" behavior, the table insertion is more likely to insert invalid
> null values with the ANSI policy compared to the strict policy.
> 
> 
> I think we should use ANSI policy by default for both v1 and v2 tables,
> because
> 1. End-users don't care how the table is implemented. Spark should provide
> consistent table insertion behavior between v1 and v2 tables.
> 2. Data Source V2 is unstable in Spark 2.x so there is no backward
> compatibility issue. That said, the baseline to judge which policy is
> better should be the table insertion behavior in Spark 2.x, which is the
> legacy policy + "return null on error". ANSI policy is better than the
> baseline.
> 3. We expect more and more users to migrate their data sources to the V2
> API. The strict policy can be a stopper as it's a too big breaking change,
> which may break many existing queries.
> 
> 
> Thanks,
> Wenchen 
> 
> 
> 
> 
> On Wed, Sep 4, 2019 at 1:59 PM Gengliang Wang < gengliang. wang@ databricks.
> com ( gengliang.w...@databricks.com ) > wrote:
> 
> 
>> Hi everyone,
>> 
>> I'd like to call for a vote on SPARK-28885 (
>> https://issues.apache.org/jira/browse/SPARK-28885 ) "Follow ANSI store
>> assignment rules in table insertion by default".  
>> When inserting a value into a column with the different data type, Spark
>> performs type coercion. Currently, we support 3 policies for the type
>> coercion rules: ANSI, legacy and strict, which can be set via the option
>> "spark.sql.storeAssignmentPolicy":
>> 1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice,
>> the behavior is mostly the same as PostgreSQL. It disallows certain
>> unreasonable type conversions such as converting `string` to `int` and
>> `double` to `boolean`.
>> 2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`,
>> which is very loose. E.g., converting either `string` to `int` or `double`
>> to `boolean` is allowed. It is the current behavior in Spark 2.x for
>> compatibility with Hive.
>> 3. Strict: Spark doesn't allow any possible precision loss or data
>> truncation in type coercion, e.g., converting either `double` to `int` or
>> `decimal` to `double` is not allowed. The rules are originally for the Dataset
>> encoder. As far as I know, no mainstream DBMS is using this policy by
>> default.
>> 
>> Currently, the V1 data source uses "Legacy" policy by default, while V2
>> uses "Strict". This proposal is to use "ANSI" policy by default for both
>> V1 and V2 in Spark 3.0.
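For illustration, a minimal sketch of how the three policies differ in practice, assuming the spark.sql.storeAssignmentPolicy conf described above and a spark-shell session (table name and literals are arbitrary):

    spark.sql("CREATE TABLE t (i INT) USING parquet")

    spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    spark.sql("INSERT INTO t SELECT 'abc'")              // accepted; silently inserts NULL

    spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
    // spark.sql("INSERT INTO t SELECT 'abc'")           // rejected at analysis time
    spark.sql("INSERT INTO t SELECT CAST(1 AS BIGINT)")  // long -> int is allowed

    spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
    // spark.sql("INSERT INTO t SELECT CAST(1 AS BIGINT)")  // rejected: possible truncation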
>> 
>> There was also a DISCUSS thread "Follow ANSI SQL on table insertion" in
>> the dev mailing list.
>> 
>> This vote is open until next Thurs (Sept. 12nd).
>> 
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ... Thank you! Gengliang
>> 
>> 
> 
>

Re: Thoughts on Spark 3 release, or a preview release

2019-09-12 Thread Reynold Xin
+1! Long due for a preview release.

On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau < hol...@pigscanfly.ca > wrote:

> 
> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to make calls around this.
> 
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge < jzhuge@ apache. org (
> jzh...@apache.org ) > wrote:
> 
> 
>> +1  Like the idea as a user and a DSv2 contributor.
>> 
>> 
>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim < kabhwan@ gmail. com (
>> kabh...@gmail.com ) > wrote:
>> 
>> 
>>> +1 (as a contributor) from me to have preview release on Spark 3 as it
>>> would help to test the feature. When to cut preview release is
>>> questionable, as major works are ideally to be done before that - if we
>>> are intended to introduce new features before official release, that
>>> should work regardless of this, but if we are intended to have opportunity
>>> to test earlier, ideally it should.
>>> 
>>> 
>>> As a one of contributors in structured streaming area, I'd like to add
>>> some items for Spark 3.0, both "must be done" and "better to have". For
>>> "better to have", I pick some items for new features which committers
>>> reviewed couple of rounds and dropped off without soft-reject (No valid
>>> reason to stop). For Spark 2.4 users, only added feature for structured
>>> streaming is Kafka delegation token. (given we assume revising Kafka
>>> consumer pool as improvement) I hope we provide some gifts for structured
>>> streaming users in Spark 3.0 envelope.
>>> 
>>> 
>>> > must be done
>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>>> output
>>> 
>>> It's a correctness issue with multiple users reported, being reported at
>>> Nov. 2018. There's a way to reproduce it consistently, and we have a patch
>>> submitted at Jan. 2019 to fix it.
>>> 
>>> 
>>> > better to have
>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to
>>> start and end offset
>>> * SPARK-20568 Delete files after processing in structured streaming
>>> 
>>> 
>>> There're some more new features/improvements items in SS, but given we're
>>> talking about ramping-down, above list might be realistic one.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin < jgp@ jgp. net (
>>> j...@jgp.net ) > wrote:
>>> 
>>> 
 As a user/non committer, +1
 
 
 I love the idea of an early 3.0.0 so we can test current dev against it, I
 know the final 3.x will probably need another round of testing when it
 gets out, but less for sure... I know I could checkout and compile, but
 having a “packaged” preversion is great if it does not take too much time
 to the team...
 
 jg
 
 
 
 On Sep 11, 2019, at 20:40, Hyukjin Kwon < gurwls223@ gmail. com (
 gurwls...@gmail.com ) > wrote:
 
 
 
> +1 from me too but I would like to know what other people think too.
> 
> 
> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun < dongjoon.h...@gmail.com > wrote:
> 
> 
>> Thank you, Sean.
>> 
>> 
>> I'm also +1 for the following three.
>> 
>> 
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>> 
>> 
>> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview` helps 
>> it
>> a lot.
>> 
>> 
>> After this discussion, can we have some timeline for `Spark 3.0 Release
>> Window` in our versioning-policy page?
>> 
>> 
>> - https:/ / spark. apache. org/ versioning-policy. html (
>> https://spark.apache.org/versioning-policy.html )
>> 
>> 
>> Bests,
>> Dongjoon.
>> 
>> 
>> 
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer < heuermh@ gmail. com (
>> heue...@gmail.com ) > wrote:
>> 
>> 
>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility 
>>> problems
>>> resolved, e.g.
>>> 
>>> 
>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-25588 (
>>> https://issues.apache.org/jira/browse/SPARK-25588 )
>>> https:/ / issues. apache. org/ jira/ browse/ SPARK-27781 (
>>> https://issues.apache.org/jira/browse/SPARK-27781 )
>>> 
>>> 
>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far 
>>> as
>>> I know, Parquet has not cut a release based on this new version.
>>> 
>>> 
>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>> 
>>> 
>>> https:/ / github. com/ apache/ spark/ pull/ 24851 (
>>> https://github.com/apache/spark/pull/24851 )
>>> https:/ / github. com/ apache/ spark/ pull/ 24297 (
>>> https://github.com/apache/spark/pull/24297 )
>

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
DSv2 is far from stable right? All the actual data types are unstable and you 
guys have completely ignored that. We'd need to work on that and that will be a 
breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that 
seems too invasive of a change to backport once you consider the parts needed 
to make dsv2 stable.

On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue < rb...@netflix.com.invalid > wrote:

> 
> Hi everyone,
> 
> 
> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
> 
> 
> A Spark 2.5 release with these two additions will help people migrate to
> Spark 3.0 when it is released because they will be able to use a single
> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
> upgrading to 3.0 won't also require updating to Java 11 because users
> could update to Java 11 with the 2.5 release and have fewer major changes.
> 
> 
> 
> Another reason to consider a 2.5 release is that many people are
> interested in a release with the latest DSv2 API and support for DSv2 SQL.
> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
> it makes sense to share this work with the community.
> 
> 
> This release line would just consist of backports like DSv2 and Java 11
> that assist compatibility, to keep the scope of the release small. The
> purpose is to assist people moving to 3.0 and not distract from the 3.0
> release.
> 
> 
> Would a Spark 2.5 release help anyone else? Are there any concerns about
> this plan?
> 
> 
> 
> 
> rb
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [DISCUSS] Spark 2.5 release

2019-09-20 Thread Reynold Xin
To push back, while I agree we should not drastically change "InternalRow", 
there are a lot of changes that need to happen to make it stable. For example, 
none of the publicly exposed interfaces should be in the Catalyst package or 
the unsafe package. External implementations should be decoupled from the 
internal implementations, with cheap ways to convert back and forth.
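As a purely hypothetical sketch of that decoupling (none of these names are real Spark APIs): a small, stable row interface that connectors code against, plus converters that wrap rather than copy so crossing the boundary stays cheap.

    // Hypothetical stable surface for connector authors.
    trait ConnectorRow {
      def numFields: Int
      def isNullAt(ordinal: Int): Boolean
      def getInt(ordinal: Int): Int
      def getLong(ordinal: Int): Long
      def getString(ordinal: Int): String
    }

    // Hypothetical converter between the engine's internal row (left as AnyRef here
    // to avoid naming internals) and the stable surface, in both directions.
    trait RowConverter {
      def toConnectorRow(internal: AnyRef): ConnectorRow
      def fromConnectorRow(row: ConnectorRow): AnyRef
    }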

When you created the PR to make InternalRow public, the understanding was to 
work towards making it stable in the future, assuming we will start with an 
unstable API temporarily. You can't just make a bunch of internal APIs tightly 
coupled with other internal pieces public and stable and call it a day, just 
because it happen to satisfy some use cases temporarily assuming the rest of 
Spark doesn't change.

On Fri, Sep 20, 2019 at 11:19 AM, Ryan Blue < rb...@netflix.com > wrote:

> 
> > DSv2 is far from stable right?
> 
> 
> No, I think it is reasonably stable and very close to being ready for a
> release.
> 
> 
> > All the actual data types are unstable and you guys have completely
> ignored that.
> 
> 
> I think what you're referring to is the use of `InternalRow`. That's a
> stable API and there has been no work to avoid using it. In any case, I
> don't think that anyone is suggesting that we delay 3.0 until a
> replacement for `InternalRow` is added, right?
> 
> 
> While I understand the motivation for a better solution here, I think the
> pragmatic solution is to continue using `InternalRow`.
> 
> 
> > If the goal is to make DSv2 work across 3.x and 2.x, that seems too
> invasive of a change to backport once you consider the parts needed to
> make dsv2 stable.
> 
> 
> I believe that those of us working on DSv2 are confident about the current
> stability. We set goals for what to get into the 3.0 release months ago
> and have very nearly reached the point where we are ready for that
> release.
> 
> 
> I don't think instability would be a problem in maintaining compatibility
> between the 2.5 version and the 3.0 version. If we find that we need to
> make API changes (other than additions) then we can make those in the 3.1
> release. Because the goals we set for the 3.0 release have been reached
> with the current API and if we are ready to release 3.0, we can release a
> 2.5 with the same API.
> 
> On Fri, Sep 20, 2019 at 11:05 AM Reynold Xin < rxin@ databricks. com (
> r...@databricks.com ) > wrote:
> 
> 
>> DSv2 is far from stable right? All the actual data types are unstable and
>> you guys have completely ignored that. We'd need to work on that and that
>> will be a breaking change. If the goal is to make DSv2 work across 3.x and
>> 2.x, that seems too invasive of a change to backport once you consider the
>> parts needed to make dsv2 stable.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 20, 2019 at 10:47 AM, Ryan Blue < rblue@ netflix. com. invalid
>> ( rb...@netflix.com.invalid ) > wrote:
>> 
>>> Hi everyone,
>>> 
>>> 
>>> In the DSv2 sync this week, we talked about a possible Spark 2.5 release
>>> based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
>>> 
>>> 
>>> A Spark 2.5 release with these two additions will help people migrate to
>>> Spark 3.0 when it is released because they will be able to use a single
>>> implementation for DSv2 sources that works in both 2.5 and 3.0. Similarly,
>>> upgrading to 3.0 won't also require updating to Java 11 because users
>>> could update to Java 11 with the 2.5 release and have fewer major changes.
>>> 
>>> 
>>> 
>>> Another reason to consider a 2.5 release is that many people are
>>> interested in a release with the latest DSv2 API and support for DSv2 SQL.
>>> I'm already going to be backporting DSv2 support to the Spark 2.4 line, so
>>> it makes sense to share this work with the community.
>>> 
>>> 
>>> This release line would just consist of backports like DSv2 and Java 11
>>> that assist compatibility, to keep the scope of the release small. The
>>> purpose is to assist people moving to 3.0 and not distract from the 3.0
>>> release.
>>> 
>>> 
>>> Would a Spark 2.5 release help anyone else? Are there any concerns about
>>> this plan?
>>> 
>>> 
>>> 
>>> 
>>> rb
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 
> --
> Ryan Blue
> Software Engineer
> Netflix
>

<    1   2   3   4   5   6   7   8   9   10   >