Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-09-19 Thread Felix Cheung
Hi +baibing3 +huangtao6 Came across your presentation on Alluxio - including shuffling - would you be interested in this? From: Matt Cheah Sent: Tuesday, September 4, 2018 2:54 PM To: Yuanjian Li Cc: Spark dev list Subject: Re: [Feedback Requested] SPARK-25299:

Re: [DISCUSS] PySpark Window UDF

2018-09-19 Thread Felix Cheung
Definitely! numba numbers are amazing From: Wes McKinney Sent: Saturday, September 8, 2018 7:46 AM To: Li Jin Cc: dev@spark.apache.org Subject: Re: [DISCUSS] PySpark Window UDF hi Li, These results are very cool. I'm excited to see you continuing to push this ef

Kubernetes Big-Data-SIG notes, September 19

2018-09-19 Thread Erik Erlandson
Meta Following this week's regular meeting we will be meeting biweekly. The next meeting will be October 3. I will be in London for Spark Summit and so Yinan Li will chair that meeting. Spark K8s backend development for 2.4 is complete. There is some renewed discussion about how much verification

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
What does partition management look like in those systems and what are the options we would standardize in an API? On Wed, Sep 19, 2018 at 2:16 PM Thakrar, Jayesh < jthak...@conversantmedia.com> wrote: > I think partition management feature would be very useful in RDBMSes that > support it – e.g.

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Thakrar, Jayesh
I think partition management feature would be very useful in RDBMSes that support it – e.g. Oracle, PostgreSQL, and DB2. In some cases add partitions can be explicit and can/may be done outside of data loads. But in some other cases, it may/can need to be done implicitly when supported by the p

Re: data source api v2 refactoring

2018-09-19 Thread Thakrar, Jayesh
Thanks for the info Ryan – very helpful! From: Ryan Blue Reply-To: "rb...@netflix.com" Date: Wednesday, September 19, 2018 at 3:17 PM To: "Thakrar, Jayesh" Cc: Wenchen Fan , Hyukjin Kwon , Spark Dev List Subject: Re: data source api v2 refactoring Hi Jayesh, The existing sources haven't bee

Re: from_csv

2018-09-19 Thread John Zhuge
+1 On Wed, Sep 19, 2018 at 8:07 AM Ted Yu wrote: > +1 > > Original message > From: Dongjin Lee > Date: 9/19/18 7:20 AM (GMT-08:00) > To: dev > Subject: Re: from_csv > > Another +1. > > I already experienced this case several times. > > On Mon, Sep 17, 2018 at 11:03 AM Hyukjin

Re: data source api v2 refactoring

2018-09-19 Thread Ryan Blue
Hi Jayesh, The existing sources haven't been ported to v2 yet. That is going to be tricky because the existing sources implement behaviors that we need to keep for now. I wrote up an SPIP to standardize logical plans while moving to the v2 sources. The reason why we need this is that too much is

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread John Zhuge
+1 (non-binding) Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) On We

Re: Datasource v2 Select Into support

2018-09-19 Thread Ryan Blue
Ross, The problem you're hitting is that there aren't many logical plans that work with the v2 source API yet. Here, you're creating an InsertIntoTable logical plan from SQL, which can't be converted to a physical plan because there is no rule to convert it either to the right logical plan for v2,

Re: [Discuss] Datasource v2 support for manipulating partitions

2018-09-19 Thread Ryan Blue
I'm open to exploring the idea of adding partition management as a catalog API. The approach we're taking is to have an interface for each concern a catalog might implement, like TableCatalog (proposed in SPARK-24252), but also FunctionCatalog for stored functions and possibly PartitionedTableCatal

Re: [Discuss] Datasource v2 support for Kerberos

2018-09-19 Thread Ryan Blue
I’m not a huge fan of special cases for configuration values like this. Is there something that we can do to pass a set of values to all sources (and catalogs for #21306)? I would prefer adding a special prefix for options that are passed to all sources, like this: spark.sql.catalog.shared.shared

DirectFileOutputCommitter in Spark 2.3.1

2018-09-19 Thread Priya Ch
Hello Team, I am trying to write a Dataset as a Parquet file in Append mode, partitioned by a few columns. However, since the job is time-consuming, I would like to enable DirectFileOutputCommitter (i.e., bypassing the writes to the temporary folder). The version of Spark I am using is 2.3.1. Can someone p

Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Stavros Kontopoulos
There is a design document that covers a lot of concerns: https://docs.google.com/document/d/1pcyH5f610X2jyJW9WbWHnj8jktQPLlbbmmUwdeK4fJk, validation included. We had a discussion about validation (validate before we hit the API server) and it was considered too much. In general regarding Rob's optio

Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Erik Erlandson
I can speak somewhat to the current design. Two of the goals for the design of this feature are that (1) its behavior is easy to reason about and (2) its implementation in the back-end is lightweight. Option 1 was chosen partly because its behavior is relatively simple to describe to a user: "Your te

Re: [DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Yinan Li
Thanks for bringing this up. My opinion is that this feature is really targeting advanced use cases that need more customization than what the basic k8s-related Spark config properties offer. So I think it's fair to assume that users who would like to use this feature know the risks and are respons

[DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Rob Vesse
Hey all For those following the K8S backend you are probably aware of SPARK-24434 [1] (and PR 22416 [2]) which proposes a mechanism to allow for advanced pod customisation via pod templates.  This is motivated by the fact that introducing additional Spark configuration properties for each as

Re: ***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Mark Hamstra
That's overstated. We will also block for a data correctness issue -- and that is, arguably, what this is. On Wed, Sep 19, 2018 at 12:21 AM Reynold Xin wrote: > We also only block if it is a new regression. > > On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao > wrote: > >> Hi Marco, >> >> From my u

Re: from_csv

2018-09-19 Thread Ted Yu
+1 Original message From: Dongjin Lee Date: 9/19/18 7:20 AM (GMT-08:00) To: dev Subject: Re: from_csv Another +1. I already experienced this case several times. On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon wrote: +1 for this idea since text parsing in CSV/JSON is quite co

Re: from_csv

2018-09-19 Thread Dongjin Lee
Another +1. I already experienced this case several times. On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon wrote: > +1 for this idea since text parsing in CSV/JSON is quite common. > > One thing is about schema inference likewise with JSON functionality. In > case of JSON, we added schema_of_json

Re: [DISCUSS] upper/lower of special characters

2018-09-19 Thread Sean Owen
I don't have the details in front of me, but I recall we explicitly overhauled locale-sensitive toUpper and toLower in the code for this exact situation. The current behavior should be on purpose. I believe user data strings are handled in a case sensitive way but things like reserved words in SQL

Re: ***UNCHECKED*** Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Sean Owen
If the issue also existed in 2.3.1, then it should not block this release. It can of course be fixed for a future 2.3.3. On Wed, Sep 19, 2018 at 2:18 AM Saisai Shao wrote: > > Hi Marco, > > From my understanding of SPARK-25454, I don't think it is a block issue, it > might be a corner case, so

***UNCHECKED*** [STREAMING] Improving the Checkpointing architecture

2018-09-19 Thread ssaavedra
Hi, I've been working on SPARK-23200 and a key point was raised during the discussion, available at https://github.com/apache/spark/pull/22392 Namely, whether the checkpointing process should allow for changing values of variables as they have been previously submitted. For example, when you res

Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Takeshi Yamamuro
+1 I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver` on the openjdk below/macOSv10.12.6 $ java -version java version "1.8.0_181" Java(TM) SE Runtime Environment (build 1.8.0_181-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode) On Wed, Sep 19, 2018

Permanent UDF support across session

2018-09-19 Thread Ajith shetty
I have a question related to Permanent UDF for spark enabled hive support. When we do create function, this is registered with hive via spark-sql>create function customfun as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDay' using jar 'hdfs:///tmp/hive-exec.jar'; call stack: org.apach

***UNCHECKED*** Re: Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Marco Gaido
It is not new, it has been there since 2.3.0, so in that case this is not a blocker. Thanks. On Wed, Sep 19, 2018 at 09:21, Reynold Xin wrote: > We also only block if it is a new regression. > > On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao > wrote: > >> Hi Marco, >> >> From my un

***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Reynold Xin
We also only block if it is a new regression. On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao wrote: > Hi Marco, > > From my understanding of SPARK-25454, I don't think it is a block issue, > it might be a corner case, so personally I don't want to block the release > of 2.3.2 because of this issu

***UNCHECKED*** Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Saisai Shao
Hi Marco, From my understanding of SPARK-25454, I don't think it is a block issue, it might be a corner case, so personally I don't want to block the release of 2.3.2 because of this issue. The release has been delayed for a long time. Thanks Saisai On Wed, Sep 19, 2018 at 2:58 PM, Marco Gaido wrote: > Sor

***UNCHECKED*** Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Saisai Shao
Hi Marco, From my understanding of SPARK-25454, I don't think it is a block issue, it might be a corner case, so personally I don't want to block the release of 2.3.2 because of this issue. The release has been delayed for a long time. On Wed, Sep 19, 2018 at 2:58 PM, Marco Gaido wrote: > Sorry, I am -1 beca