Kubernetes Big-Data-SIG notes, September 19

2018-09-19 Thread Erik Erlandson
Meta: Following this week's regular meeting, we will be meeting biweekly. The next meeting will be October 3. I will be in London for Spark Summit, so Yinan Li will chair that meeting. Spark K8s backend development for 2.4 is complete. There is some renewed discussion about how much

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
What does partition management look like in those systems and what are the options we would standardize in an API? On Wed, Sep 19, 2018 at 2:16 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote: > I think a partition management feature would be very useful in RDBMSes that > support it –

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Thakrar, Jayesh
I think a partition management feature would be very useful in RDBMSes that support it – e.g. Oracle, PostgreSQL, and DB2. In some cases, adding partitions can be explicit and may be done outside of data loads. But in other cases, it may need to be done implicitly when supported by the

Re: data source api v2 refactoring

2018-09-19 Thread Thakrar, Jayesh
Thanks for the info Ryan – very helpful! From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Wednesday, September 19, 2018 at 3:17 PM To: "Thakrar, Jayesh" Cc: Wenchen Fan , Hyukjin Kwon , Spark Dev List Subject: Re: data source api v2 refactoring Hi Jayesh, The existing sources haven't

Re: from_csv

2018-09-19 Thread John Zhuge
+1 On Wed, Sep 19, 2018 at 8:07 AM Ted Yu wrote: > +1 > > Original message > From: Dongjin Lee > Date: 9/19/18 7:20 AM (GMT-08:00) > To: dev > Subject: Re: from_csv > > Another +1. > > I already experienced this case several times. > > On Mon, Sep 17, 2018 at 11:03 AM

Re: data source api v2 refactoring

2018-09-19 Thread Ryan Blue
Hi Jayesh, The existing sources haven't been ported to v2 yet. That is going to be tricky because the existing sources implement behaviors that we need to keep for now. I wrote up an SPIP to standardize logical plans while moving to the v2 sources. The reason why we need this is that too much is

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread John Zhuge
+1 (non-binding) Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) On

Re: Datasource v2 Select Into support

2018-09-19 Thread Ryan Blue
Ross, The problem you're hitting is that there aren't many logical plans that work with the v2 source API yet. Here, you're creating an InsertIntoTable logical plan from SQL, which can't be converted to a physical plan because there is no rule to convert it either to the right logical plan for

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
I'm open to exploring the idea of adding partition management as a catalog API. The approach we're taking is to have an interface for each concern a catalog might implement, like TableCatalog (proposed in SPARK-24252), but also FunctionCatalog for stored functions and possibly
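
The "one interface per concern" idea can be sketched in a few lines. This is only an illustration in Python, not the API proposed in SPARK-24252: the method names `load_table` and `load_function`, the two example catalog classes, and the `supports` helper are all hypothetical. The point is that a catalog implements only the concerns it can serve, and callers check for a concern before using it.

```python
from abc import ABC, abstractmethod


class TableCatalog(ABC):
    """Concern: table lookup. The method name is hypothetical."""

    @abstractmethod
    def load_table(self, name):
        ...


class FunctionCatalog(ABC):
    """Concern: stored functions. The method name is hypothetical."""

    @abstractmethod
    def load_function(self, name):
        ...


class InMemoryCatalog(TableCatalog, FunctionCatalog):
    """A catalog that implements both concerns."""

    def __init__(self):
        self._tables = {"events": {"columns": ["id", "ts"]}}
        self._functions = {"double": lambda x: 2 * x}

    def load_table(self, name):
        return self._tables[name]

    def load_function(self, name):
        return self._functions[name]


class TablesOnlyCatalog(TableCatalog):
    """A catalog that implements only the table concern."""

    def load_table(self, name):
        return {"columns": []}


def supports(catalog, concern):
    """Return True if the catalog implements the given concern interface."""
    return isinstance(catalog, concern)
```

Under this sketch, a partition-management concern would be one more small interface that only partition-aware catalogs implement.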

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-19 Thread Ryan Blue
I’m not a huge fan of special cases for configuration values like this. Is there something that we can do to pass a set of values to all sources (and catalogs for #21306)? I would prefer adding a special prefix for options that are passed to all sources, like this:

DirectFileOutputCommitter in Spark 2.3.1

2018-09-19 Thread Priya Ch
Hello Team, I am trying to write a Dataset as a Parquet file in Append mode, partitioned by a few columns. However, since the job is time-consuming, I would like to enable DirectFileOutputCommitter (i.e., bypassing the writes to a temporary folder). The version of Spark I am using is 2.3.1. Can someone

Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Stavros Kontopoulos
There is a design document that covers a lot of concerns: https://docs.google.com/document/d/1pcyH5f610X2jyJW9WbWHnj8jktQPLlbbmmUwdeK4fJk, validation included. We had a discussion about validation (validating before we hit the API server) and it was considered too much. In general regarding Rob's

Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Erik Erlandson
I can speak somewhat to the current design. Two of the goals for the design of this feature are that (1) its behavior is easy to reason about and (2) its implementation in the back-end is lightweight. Option 1 was chosen partly because its behavior is relatively simple to describe to a user: "Your

Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Yinan Li
Thanks for bringing this up. My opinion is that this feature is really targeting advanced use cases that need more customization than what the basic k8s-related Spark config properties offer. So I think it's fair to assume that users who would like to use this feature know the risks and are

[DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Rob Vesse
Hey all, for those following the K8S backend, you are probably aware of SPARK-24434 [1] (and PR 22416 [2]), which proposes a mechanism to allow for advanced pod customisation via pod templates. This is motivated by the fact that introducing additional Spark configuration properties for each

Re: ***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Mark Hamstra
That's overstated. We will also block for a data correctness issue -- and that is, arguably, what this is. On Wed, Sep 19, 2018 at 12:21 AM Reynold Xin wrote: > We also only block if it is a new regression. > > On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao > wrote: > >> Hi Marco, >> >> From my

Re: from_csv

2018-09-19 Thread Ted Yu
+1 Original message From: Dongjin Lee Date: 9/19/18 7:20 AM (GMT-08:00) To: dev Subject: Re: from_csv Another +1. I already experienced this case several times. On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon wrote: +1 for this idea since text parsing in CSV/JSON is quite

Re: from_csv

2018-09-19 Thread Dongjin Lee
Another +1. I already experienced this case several times. On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon wrote: > +1 for this idea since text parsing in CSV/JSON is quite common. > > One thing is about schema inference likewise with JSON functionality. In > case of JSON, we added

Re: [DISCUSS] upper/lower of special characters

2018-09-19 Thread Sean Owen
I don't have the details in front of me, but I recall we explicitly overhauled locale-sensitive toUpper and toLower in the code for this exact situation. The current behavior should be on purpose. I believe user data strings are handled in a case sensitive way but things like reserved words in SQL
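
As a small illustration of why upper/lower casing of special characters is awkward, the sketch below uses Python's built-in `str.upper()` and `str.lower()`, which apply Unicode's default, locale-independent case mappings. It does not reproduce Spark's implementation, and the sample characters are chosen here for illustration rather than taken from the thread.

```python
def lower_roundtrip(text):
    """Upper-case then lower-case a string using Unicode default mappings."""
    return text.upper().lower()


# German sharp s has no single-character upper-case form in the default mapping.
sharp_s_upper = "ß".upper()

# Turkish dotless i upper-cases to plain "I", so the round trip loses the original.
dotless_roundtrip = lower_roundtrip("ı")

# Capital I with dot above lower-cases to "i" plus a combining dot (two code points).
dotted_lower = "İ".lower()

print(sharp_s_upper, dotless_roundtrip, len(dotted_lower))
```

Locale-sensitive mappings (for example, Turkish rules for `i`/`İ`) would give different results for the same input, which is why a system usually has to pick one behavior on purpose.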

Re: ***UNCHECKED*** Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Sean Owen
If the issue also existed in 2.3.1, then it should not block this release. It can of course be fixed for a future 2.3.3. On Wed, Sep 19, 2018 at 2:18 AM Saisai Shao wrote: > > Hi Marco, > > From my understanding of SPARK-25454, I don't think it is a blocking issue; it > might be a corner case, so

***UNCHECKED*** [STREAMING] Improving the Checkpointing architecture

2018-09-19 Thread ssaavedra
Hi, I've been working on SPARK-23200 and a key point was raised during the discussion, available at https://github.com/apache/spark/pull/22392 Namely, whether the checkpointing process should allow for changing values of variables as they have been previously submitted. For example, when you

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Takeshi Yamamuro
+1 I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver` on the JDK below / macOS v10.12.6 $ java -version java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) On Wed, Sep 19,

Permanent UDF support across session

2018-09-19 Thread Ajith shetty
I have a question about permanent UDFs in Spark with Hive support enabled. When we run create function, the function is registered with Hive via spark-sql> create function customfun as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDay' using jar 'hdfs:///tmp/hive-exec.jar'; call stack:

***UNCHECKED*** Re: Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Marco Gaido
It is not new; it has been there since 2.3.0, so in that case this is not a blocker. Thanks. On Wed, Sep 19, 2018 at 09:21, Reynold Xin wrote: > We also only block if it is a new regression. > > On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao > wrote: > >> Hi Marco, >> >> From my

***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Reynold Xin
We also only block if it is a new regression. On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao wrote: > Hi Marco, > > From my understanding of SPARK-25454, I don't think it is a blocking issue; > it might be a corner case, so personally I don't want to block the release > of 2.3.2 because of this

***UNCHECKED*** Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Saisai Shao
Hi Marco, From my understanding of SPARK-25454, I don't think it is a blocking issue; it might be a corner case, so personally I don't want to block the release of 2.3.2 because of this issue. The release has been delayed for a long time. Thanks, Saisai. On Wed, Sep 19, 2018 at 2:58 PM, Marco Gaido wrote: >

***UNCHECKED*** Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Saisai Shao
Hi Marco, From my understanding of SPARK-25454, I don't think it is a blocking issue; it might be a corner case, so personally I don't want to block the release of 2.3.2 because of this issue. The release has been delayed for a long time. On Wed, Sep 19, 2018 at 2:58 PM, Marco Gaido wrote: > Sorry, I am -1

***UNCHECKED*** Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Marco Gaido
Sorry, I am -1 because of SPARK-25454, which is a regression from 2.2. On Wed, Sep 19, 2018 at 03:45, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote: > +1. > > I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive > -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5. > > I hit

Re: [DISCUSS] upper/lower of special characters

2018-09-19 Thread Reynold Xin
I'd just document it as a known limitation and move on for now, until there are enough end users that need this. Spark is also very powerful with UDFs and end users can easily work around this using UDFs. -- excuse the brevity and lower case due to wrist injury On Tue, Sep 18, 2018 at 11:14 PM