Meta
Following this week's regular meeting, we will be meeting bi-weekly. The
next meeting will be October 3. I will be in London for Spark Summit, so
Yinan Li will chair that meeting.
Spark
K8s backend development for 2.4 is complete. There is some renewed
discussion about how much
What does partition management look like in those systems, and what are the
options we would standardize in an API?
On Wed, Sep 19, 2018 at 2:16 PM Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:
> I think the partition management feature would be very useful in RDBMSes
> that support it –
I think the partition management feature would be very useful in RDBMSes that
support it – e.g. Oracle, PostgreSQL, and DB2.
In some cases, adding partitions can be explicit and may be done outside of
data loads.
But in other cases, it may need to be done implicitly when supported by the
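A sketch of the two modes in Spark SQL terms, assuming a Hive-backed
partitioned table named `events` (the table name and partition column are
placeholders):

    // `spark` is a SparkSession with Hive support; `df` matches the table schema
    // explicit: add a partition outside of any data load
    spark.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt = '2018-09-19')")

    // implicit: let a dynamic-partition insert create partitions during the load
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
    df.write.insertInto("events")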
Thanks for the info, Ryan – very helpful!
From: Ryan Blue
Reply-To: "rb...@netflix.com"
Date: Wednesday, September 19, 2018 at 3:17 PM
To: "Thakrar, Jayesh"
Cc: Wenchen Fan, Hyukjin Kwon,
Spark Dev List
Subject: Re: data source api v2 refactoring
Hi Jayesh,
The existing sources haven't
+1
On Wed, Sep 19, 2018 at 8:07 AM Ted Yu wrote:
> +1
>
> Original message
> From: Dongjin Lee
> Date: 9/19/18 7:20 AM (GMT-08:00)
> To: dev
> Subject: Re: from_csv
>
> Another +1.
>
> I already experienced this case several times.
>
> On Mon, Sep 17, 2018 at 11:03 AM
Hi Jayesh,
The existing sources haven't been ported to v2 yet. That is going to be
tricky because the existing sources implement behaviors that we need to
keep for now.
I wrote up an SPIP to standardize logical plans while moving to the v2
sources. The reason why we need this is that too much is
+1 (non-binding)
Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
-Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
On
Ross,
The problem you're hitting is that there aren't many logical plans that
work with the v2 source API yet. Here, you're creating an InsertIntoTable
logical plan from SQL, which can't be converted to a physical plan because
there is no rule to convert it either to the right logical plan for
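A sketch of the usual workaround while that conversion rule is missing: skip
the SQL INSERT path and write through the DataFrameWriter API instead (the
source name and option below are made up):

    // hypothetical v2 source name and option; mode("append") asks for the
    // same append semantics the SQL INSERT INTO statement was expressing
    df.write
      .format("com.example.myv2source")
      .option("table", "target_table")
      .mode("append")
      .save()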
I'm open to exploring the idea of adding partition management as a catalog
API. The approach we're taking is to have an interface for each concern a
catalog might implement, like TableCatalog (proposed in SPARK-24252), but
also FunctionCatalog for stored functions and possibly
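A rough sketch of the interface-per-concern shape being described; only
TableCatalog is actually proposed in SPARK-24252, and the names and
signatures below are illustrative, not the proposed API:

    import org.apache.spark.sql.types.StructType

    trait Table                          // placeholder for a table handle

    trait CatalogPlugin {
      def initialize(name: String, options: Map[String, String]): Unit
    }

    trait TableCatalog extends CatalogPlugin {
      def loadTable(ident: String): Table
      def createTable(ident: String, schema: StructType): Table
    }

    trait FunctionCatalog extends CatalogPlugin {
      def loadFunction(ident: String): AnyRef   // function type elided
    }

    // the partition-management concern discussed above, as its own interface
    trait PartitionCatalog extends CatalogPlugin {
      def addPartition(table: String, spec: Map[String, String]): Unit
      def dropPartition(table: String, spec: Map[String, String]): Unit
    }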
I’m not a huge fan of special cases for configuration values like this. Is
there something that we can do to pass a set of values to all sources (and
catalogs for #21306)?
I would prefer adding a special prefix for options that are passed to all
sources, like this:
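A sketch of the shape, with a hypothetical reserved prefix (the actual name
and mechanics are not settled here):

    // made-up prefix; session confs under it would be forwarded to every source
    val prefix = "spark.datasource.all."
    val shared: Map[String, String] = spark.conf.getAll
      .filter { case (k, _) => k.startsWith(prefix) }
      .map { case (k, v) => (k.stripPrefix(prefix), v) }
    // userOptions: the per-source options supplied at the call site; they win
    val resolved = shared ++ userOptions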
Hello Team,
I am trying to write a DataSet as a parquet file in Append mode, partitioned
by a few columns. However, since the job is time consuming, I would like to
enable DirectFileOutputCommitter (i.e., bypassing the writes to a temporary
folder).
The version of Spark I am using is 2.3.1.
Can someone
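For reference, a sketch of the write being described (paths and partition
columns are placeholders); note that DirectFileOutputCommitter itself was
removed from Spark, so this only illustrates the job, not the committer:

    // the job as described: append-mode parquet, partitioned by a few columns
    ds.write
      .mode("append")
      .partitionBy("year", "month")          // column names assumed
      .parquet("hdfs:///path/to/output")     // placeholder path
    // a commonly used alternative knob is the Hadoop committer algorithm:
    // spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2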
There is a design document that covers a lot of concerns:
https://docs.google.com/document/d/1pcyH5f610X2jyJW9WbWHnj8jktQPLlbbmmUwdeK4fJk,
validation included.
We had a discussion about validation (validating before we hit the API
server) and it was considered too much. In general, regarding Rob's
I can speak somewhat to the current design. Two of the goals for the design
of this feature are that
(1) its behavior is easy to reason about
(2) its implementation in the back-end is lightweight
Option 1 was chosen partly because its behavior is relatively simple to
describe to a user: "Your
Thanks for bringing this up. My opinion is that this feature is really
targeting advanced use cases that need more customization than what the
basic k8s-related Spark config properties offer. So I think it's fair to
assume that users who would like to use this feature know the risks and are
Hey all
For those following the K8S backend you are probably aware of SPARK-24434 [1]
(and PR 22416 [2]) which proposes a mechanism to allow for advanced pod
customisation via pod templates. This is motivated by the fact that
introducing additional Spark configuration properties for each
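As a sketch of how the proposed mechanism would surface to users (the
property names are those in the PR and may still change):

    // point the driver and executor pods at user-supplied template files
    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.kubernetes.driver.podTemplateFile", "/opt/templates/driver.yaml")
      .config("spark.kubernetes.executor.podTemplateFile", "/opt/templates/executor.yaml")
      .getOrCreate()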
That's overstated. We will also block for a data correctness issue -- and
that is, arguably, what this is.
On Wed, Sep 19, 2018 at 12:21 AM Reynold Xin wrote:
> We also only block if it is a new regression.
>
> On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao
> wrote:
>
>> Hi Marco,
>>
>> From my
+1
Original message From: Dongjin Lee Date:
9/19/18 7:20 AM (GMT-08:00) To: dev Subject: Re:
from_csv
Another +1.
I already experienced this case several times.
On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon wrote:
+1 for this idea since text parsing in CSV/JSON is quite
Another +1.
I already experienced this case several times.
On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon wrote:
> +1 for this idea since text parsing in CSV/JSON is quite common.
>
> One thing is about schema inference likewise with JSON functionality. In
> case of JSON, we added
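For comparison, a sketch of today's JSON function next to what a CSV
analogue might look like (the from_csv signature is speculative):

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    val schema = new StructType().add("a", IntegerType).add("b", StringType)

    // exists today: parse a JSON string column against the schema
    df.select(from_json(df("payload"), schema))

    // the proposal, by analogy (speculative signature):
    // df.select(from_csv(df("payload"), schema, Map("sep" -> ",")))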
I don't have the details in front of me, but I recall we explicitly
overhauled locale-sensitive toUpper and toLower in the code for this exact
situation. The current behavior should be on purpose. I believe user data
strings are handled in a case-sensitive way, but things like reserved words
in SQL
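The classic example of the exact situation being described, for reference:

    import java.util.Locale

    // in the Turkish locale, "i" upper-cases to a dotted capital I, which
    // breaks comparisons against ASCII keywords like "IN"
    "in".toUpperCase(new Locale("tr"))  // "\u0130N", not "IN"
    "in".toUpperCase(Locale.ROOT)       // "IN", locale-insensitive, as wanted
                                        // for SQL reserved words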
If the issue also existed in 2.3.1, then it should not block this
release. It can of course be fixed for a future 2.3.3.
On Wed, Sep 19, 2018 at 2:18 AM Saisai Shao wrote:
>
> Hi Marco,
>
> From my understanding of SPARK-25454, I don't think it is a blocker; it
> might be a corner case, so
Hi,
I've been working on SPARK-23200 and a key point was raised during the
discussion, available at https://github.com/apache/spark/pull/22392
Namely, whether the checkpointing process should allow the values of
variables to change from what was previously submitted.
For example, when you
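A sketch of the scenario in question, assuming `df` is a streaming DataFrame
(paths are placeholders):

    // first run; the query persists its state under checkpointLocation
    val query = df.writeStream
      .format("parquet")
      .option("path", "/data/out")                    // placeholder
      .option("checkpointLocation", "/chk/my-query")  // placeholder
      .start()
    // on restart against the same checkpointLocation: should values recorded
    // in the checkpoint win, or the newly submitted ones?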
+1
I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
-Phive-thriftserver` on the OpenJDK below / macOS v10.12.6
$ java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
On Wed, Sep 19,
I have a question related to permanent UDFs for Spark with Hive support
enabled. When we run CREATE FUNCTION, it is registered with Hive via:
spark-sql>create function customfun as
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDay' using jar
'hdfs:///tmp/hive-exec.jar';
call stack:
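The same registration, as a sketch from a SparkSession with Hive support
(the usage line at the end is illustrative):

    spark.sql(
      """CREATE FUNCTION customfun
        |AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDay'
        |USING JAR 'hdfs:///tmp/hive-exec.jar'""".stripMargin)
    spark.sql("SELECT customfun('2018-09-19')").show()   // illustrative call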
It is not new; it has been there since 2.3.0, so in that case this is not a
blocker. Thanks.
On Wed, Sep 19, 2018 at 9:21 AM, Reynold Xin
wrote:
> We also only block if it is a new regression.
>
> On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao
> wrote:
>
>> Hi Marco,
>>
>> From my
We also only block if it is a new regression.
On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao wrote:
> Hi Marco,
>
> From my understanding of SPARK-25454, I don't think it is a blocker;
> it might be a corner case, so personally I don't want to block the release
> of 2.3.2 because of this
Hi Marco,
From my understanding of SPARK-25454, I don't think it is a blocker; it
might be a corner case, so personally I don't want to block the release of
2.3.2 because of this issue. The release has been delayed for a long time.
Thanks
Saisai
Marco Gaido wrote on Wed, Sep 19, 2018 at 2:58 PM:
>
Sorry, I am -1 because of SPARK-25454 which is a regression from 2.2.
On Wed, Sep 19, 2018 at 3:45 AM, Dongjoon Hyun <
dongjoon.h...@gmail.com> wrote:
> +1.
>
> I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver` on OpenJDK(1.8.0_181)/CentOS 7.5.
>
> I hit
I'd just document it as a known limitation and move on for now, until there
are enough end users that need this. Spark is also very powerful with UDFs
and end users can easily work around this using UDFs.
--
excuse the brevity and lower case due to wrist injury
On Tue, Sep 18, 2018 at 11:14 PM
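A minimal sketch of the kind of UDF workaround meant here (the function body
and column name are placeholders):

    import org.apache.spark.sql.functions.udf

    // wrap the unsupported operation in a user-defined function
    val fixup = udf((s: String) => if (s == null) null else s.trim.toLowerCase)
    df.select(fixup(df("value")))   // `df` and "value" are placeholders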