Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Wenchen Fan
Sean thanks for checking them!

I made one pass and re-targeted/closed some of them. Most of them are
documentation and auditing, do we need to block the release for them?

On Fri, Sep 21, 2018 at 6:01 AM Sean Owen  wrote:

> Because we're into 2.4 release candidates, I thought I'd look at
> what's still open and targeted at 2.4.0. I presume the Blockers are
> the usual umbrellas that don't themselves block anything, but,
> confirming, there is nothing left to do there?
>
> I think that's mostly a question for Joseph and Weichen.
>
> As ever, anyone who knows these items are a) done or b) not going to
> be in 2.4, go ahead and update them.
>
>
> Blocker:
>
> SPARK-25321 ML, Graph 2.4 QA: API: New Scala APIs, docs
> SPARK-25324 ML 2.4 QA: API: Java compatibility, docs
> SPARK-25323 ML 2.4 QA: API: Python API coverage
> SPARK-25320 ML, Graph 2.4 QA: API: Binary incompatible changes
>
> Critical:
>
> SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
> SPARK-25327 Update MLlib, GraphX websites for 2.4
> SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
> SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration guide
>
> Other:
>
> SPARK-25346 Document Spark builtin data sources
> SPARK-25347 Document image data source in doc site
> SPARK-12978 Skip unnecessary final group-by when input data already
> clustered with group-by keys
> SPARK-20184 performance regression for complex/long sql when enable
> whole stage codegen
> SPARK-16196 Optimize in-memory scan performance using ColumnarBatches
> SPARK-15693 Write schema definition out for file-based data sources to
> avoid schema inference
> SPARK-23597 Audit Spark SQL code base for non-interpreted expressions
> SPARK-25179 Document the features that require Pyarrow 0.10
> SPARK-25110 make sure Flume streaming connector works with Spark 2.4
> SPARK-21318 The exception message thrown by `lookupFunction` is ambiguous.
> SPARK-24464 Unit tests for MLlib's Instrumentation
> SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
> SPARK-22809 pyspark is sensitive to imports with dots
> SPARK-22739 Additional Expression Support for Objects
> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
> list of structures
> SPARK-21030 extend hint syntax to support any expression for Python and R
> SPARK-22386 Data Source V2 improvements
> SPARK-15117 Generate code that get a value in each compressed column
> from CachedBatch when DataFrame.cache() is called
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-21 Thread Wenchen Fan
Thanks! If both versions are specified, yes we can just remove 3.0.0

On Fri, Sep 21, 2018 at 1:38 PM Jungtaek Lim  wrote:

> OK got it. Thanks for clarifying.
>
> I can help checking and modifying versions, but I'm not sure about the case
> where both versions are specified, like "2.4.0/3.0.0". Would removing 3.0.0
> work in this case?
>
> On Fri, Sep 21, 2018 at 2:29 PM, Wenchen Fan wrote:
>
>> There is an issue in the merge script: when resolving a ticket, the
>> default fix version is 3.0.0. I guess someone forgot to type the fix
>> version, which led to this mistake.
>>
>> On Fri, Sep 21, 2018 at 1:15 PM Jungtaek Lim  wrote:
>>
>>> Ah, these issues were resolved before branch-2.4 was cut, like SPARK-24441
>>>
>>>
>>> https://github.com/apache/spark/blob/v2.4.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala
>>>
>>> SPARK-24441 is included in Spark 2.4.0 RC1 but its fix version is set to
>>> 3.0.0. I heard there's a step in which issue versions are aligned with the
>>> new release when the branch/RC is cut, but that doesn't seem to have
>>> happened for some issues.
>>>
>>> On Fri, Sep 21, 2018 at 2:10 PM, Holden Karau wrote:
>>>
 So normally during the release process, if it's in branch-2.4 but not
 part of the current RC, we set the resolved version to 2.4.1, and then if
 we roll a new RC we switch the 2.4.1 issues to 2.4.0.

 On Thu, Sep 20, 2018 at 9:55 PM Jungtaek Lim  wrote:

> I also noticed there are some fixed issues which are included in
> branch-2.4 but whose versions are still 3.0.0. Would we want to update the
> versions to 2.4.0? If we are not planning to run some automation to
> correct them, I'm happy to fix them.
>
> On Thu, Sep 20, 2018 at 9:22 PM, Weichen Xu wrote:
>
>> We need to merge this.
>> https://github.com/apache/spark/pull/22492
>> Otherwise mleap cannot build against spark 2.4.0
>> Thanks!
>>
>> On Wed, Sep 19, 2018 at 1:16 PM Yinan Li 
>> wrote:
>>
>>> FYI: SPARK-23200 has been resolved.
>>>
>>> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung <
>>> felixcheun...@hotmail.com> wrote:
>>>
 If we could work on this quickly - it might get on to future RCs.



 --
 *From:* Stavros Kontopoulos 
 *Sent:* Monday, September 17, 2018 2:35 PM
 *To:* Yinan Li
 *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid;
 Sean Owen; Wenchen Fan; dev
 *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)

 Hi Xiao,

 I just tested it, it seems ok. There are some questions about which
 properties we should keep when restoring the config. Otherwise it 
 looks ok
 to me.
 The reason this should go in 2.4 is that streaming on k8s is
 something people want to try day one (or at least it is cool to try) 
 and
 since 2.4 comes with k8s support being refactored a lot,
 it would be disappointing not to have it in...IMHO.

 Best,
 Stavros

 On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li 
 wrote:

> We can merge the PR and get SPARK-23200 resolved if the whole
> point is to make streaming on k8s work first. But given that this is 
> not a
> blocker for 2.4, I think we can take a bit more time here and get it 
> right.
> With that being said, I would expect it to be resolved soon.
>
> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li 
> wrote:
>
>> Hi, Erik and Stavros,
>>
>> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It
>> sounds important for the Streaming on K8S. Could the K8S oriented
>> committers speed up the reviews?
>>
>> Thanks,
>>
>> Xiao
>>
>> Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:
>>
>>>
>>> I have no binding vote but I second Stavros’ recommendation for
>>> spark-23200
>>>
>>> Per parallel threads on Py2 support I would also like to propose
>>> deprecating Py2 starting with this 2.4 release
>>>
>>> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>>>  wrote:
>>>
 You can log in to https://repository.apache.org and see what's
 wrong.
 Just find that staging repo and look at the messages. In your
 case it
 seems related to your signature.

 failureMessage: No public key: Key with id: () was not able to be
 located on http://gpg-keyserver.de/. Upload your public key and try
 the operation again.
 On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan <
 cloud0...@gmail.com> wrote:
 >
 > I confirmed that
 https://rep

Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-21 Thread Jungtaek Lim
Got it. I just removed 3.0.0 where there were multiple versions, except
SPARK-25431, which keeps the (2.4.1, 3.0.0) pair since the other version
targets a bugfix release.

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%203.0.0


On Fri, Sep 21, 2018 at 4:05 PM, Wenchen Fan wrote:

> Thanks! If both versions are specified, yes we can just remove 3.0.0
>
> On Fri, Sep 21, 2018 at 1:38 PM Jungtaek Lim  wrote:
>
>> OK got it. Thanks for clarifying.
>>
>> I can help checking and modifying versions, but I'm not sure about the case
>> where both versions are specified, like "2.4.0/3.0.0". Would removing 3.0.0
>> work in this case?
>>
>> On Fri, Sep 21, 2018 at 2:29 PM, Wenchen Fan wrote:
>>
>>> There is an issue in the merge script: when resolving a ticket, the
>>> default fix version is 3.0.0. I guess someone forgot to type the fix
>>> version, which led to this mistake.
>>>
>>> On Fri, Sep 21, 2018 at 1:15 PM Jungtaek Lim  wrote:
>>>
 Ah, these issues were resolved before branch-2.4 was cut, like SPARK-24441


 https://github.com/apache/spark/blob/v2.4.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala

 SPARK-24441 is included in Spark 2.4.0 RC1 but its fix version is set to
 3.0.0. I heard there's a step in which issue versions are aligned with the
 new release when the branch/RC is cut, but that doesn't seem to have
 happened for some issues.

 On Fri, Sep 21, 2018 at 2:10 PM, Holden Karau wrote:

> So normally during the release process, if it's in branch-2.4 but not
> part of the current RC, we set the resolved version to 2.4.1, and then if
> we roll a new RC we switch the 2.4.1 issues to 2.4.0.
>
> On Thu, Sep 20, 2018 at 9:55 PM Jungtaek Lim 
> wrote:
>
>> I also noticed there are some fixed issues which are included in
>> branch-2.4 but whose versions are still 3.0.0. Would we want to update the
>> versions to 2.4.0? If we are not planning to run some automation to
>> correct them, I'm happy to fix them.
>>
>> On Thu, Sep 20, 2018 at 9:22 PM, Weichen Xu wrote:
>>
>>> We need to merge this.
>>> https://github.com/apache/spark/pull/22492
>>> Otherwise mleap cannot build against spark 2.4.0
>>> Thanks!
>>>
>>> On Wed, Sep 19, 2018 at 1:16 PM Yinan Li 
>>> wrote:
>>>
 FYI: SPARK-23200 has been resolved.

 On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung <
 felixcheun...@hotmail.com> wrote:

> If we could work on this quickly - it might get on to future RCs.
>
>
>
> --
> *From:* Stavros Kontopoulos 
> *Sent:* Monday, September 17, 2018 2:35 PM
> *To:* Yinan Li
> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid;
> Sean Owen; Wenchen Fan; dev
> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>
> Hi Xiao,
>
> I just tested it, it seems ok. There are some questions about
> which properties we should keep when restoring the config. Otherwise 
> it
> looks ok to me.
> The reason this should go in 2.4 is that streaming on k8s is
> something people want to try day one (or at least it is cool to try) 
> and
> since 2.4 comes with k8s support being refactored a lot,
> it would be disappointing not to have it in...IMHO.
>
> Best,
> Stavros
>
> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li 
> wrote:
>
>> We can merge the PR and get SPARK-23200 resolved if the whole
>> point is to make streaming on k8s work first. But given that this is 
>> not a
>> blocker for 2.4, I think we can take a bit more time here and get it 
>> right.
>> With that being said, I would expect it to be resolved soon.
>>
>> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li 
>> wrote:
>>
>>> Hi, Erik and Stavros,
>>>
>>> This bug fix SPARK-23200 is not a blocker of the 2.4 release. It
>>> sounds important for the Streaming on K8S. Could the K8S oriented
>>> committers speed up the reviews?
>>>
>>> Thanks,
>>>
>>> Xiao
>>>
>>> Erik Erlandson wrote on Mon, Sep 17, 2018 at 11:04 AM:
>>>

 I have no binding vote but I second Stavros’ recommendation for
 spark-23200

 Per parallel threads on Py2 support I would also like to
 propose deprecating Py2 starting with this 2.4 release

 On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
  wrote:

> You can log in to https://repository.apache.org and see
> what's wrong.
> Just find that staging repo and look at the messages. In your
> case it
>>>

Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Felix Cheung
I think the point is we actually need to do these validations before completing
the release...



From: Wenchen Fan 
Sent: Friday, September 21, 2018 12:02 AM
To: Sean Owen
Cc: Spark dev list
Subject: Re: 2.4.0 Blockers, Critical, etc

Sean thanks for checking them!

I made one pass and re-targeted/closed some of them. Most of them are 
documentation and auditing, do we need to block the release for them?

On Fri, Sep 21, 2018 at 6:01 AM Sean Owen <sro...@apache.org> wrote:
Because we're into 2.4 release candidates, I thought I'd look at
what's still open and targeted at 2.4.0. I presume the Blockers are
the usual umbrellas that don't themselves block anything, but,
confirming, there is nothing left to do there?

I think that's mostly a question for Joseph and Weichen.

As ever, anyone who knows these items are a) done or b) not going to
be in 2.4, go ahead and update them.


Blocker:

SPARK-25321 ML, Graph 2.4 QA: API: New Scala APIs, docs
SPARK-25324 ML 2.4 QA: API: Java compatibility, docs
SPARK-25323 ML 2.4 QA: API: Python API coverage
SPARK-25320 ML, Graph 2.4 QA: API: Binary incompatible changes

Critical:

SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
SPARK-25327 Update MLlib, GraphX websites for 2.4
SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration guide

Other:

SPARK-25346 Document Spark builtin data sources
SPARK-25347 Document image data source in doc site
SPARK-12978 Skip unnecessary final group-by when input data already
clustered with group-by keys
SPARK-20184 performance regression for complex/long sql when enable
whole stage codegen
SPARK-16196 Optimize in-memory scan performance using ColumnarBatches
SPARK-15693 Write schema definition out for file-based data sources to
avoid schema inference
SPARK-23597 Audit Spark SQL code base for non-interpreted expressions
SPARK-25179 Document the features that require Pyarrow 0.10
SPARK-25110 make sure Flume streaming connector works with Spark 2.4
SPARK-21318 The exception message thrown by `lookupFunction` is ambiguous.
SPARK-24464 Unit tests for MLlib's Instrumentation
SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
SPARK-22809 pyspark is sensitive to imports with dots
SPARK-22739 Additional Expression Support for Objects
SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
list of structures
SPARK-21030 extend hint syntax to support any expression for Python and R
SPARK-22386 Data Source V2 improvements
SPARK-15117 Generate code that get a value in each compressed column
from CachedBatch when DataFrame.cache() is called

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org



SPIP: support decimals with negative scale in decimal operation

2018-09-21 Thread Marco Gaido
Hi all,

I am writing this e-mail in order to discuss the issue which is reported in
SPARK-25454 and according to Wenchen's suggestion I prepared a design doc
for it.

The problem we are facing here is that our rules for decimal operations
are taken from Hive and MS SQL Server, and they explicitly don't support
decimals with negative scales. So the rules we have currently are not meant
to deal with negative scales. The issue is that Spark, instead, doesn't
forbid negative scales and - indeed - there are cases in which we
produce them (e.g. a SQL constant like 1e8 is turned into a decimal(1, -8)).

Having negative scales most likely wasn't really intended. But
unfortunately getting rid of them would be a breaking change, as many
operations that work fine currently would not be allowed anymore and would
overflow (e.g. select 1e36 * 1). As such, this is something I'd
definitely agree on doing, but I think we can target it only for 3.0.

What we can start doing now, instead, is updating our rules in order to
also handle properly the case when decimal scales are negative. From my
investigation, it turns out that the only operation which has problems
with them is Divide.
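
For illustration, a minimal Scala sketch of the behavior described above
(assuming Spark 2.x literal parsing and a local SparkSession; the schemas
mentioned in the comments are what the discussion above implies, not a
prescription):

import org.apache.spark.sql.SparkSession

object NegativeScaleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("negative-scale-example")
      .getOrCreate()

    // As described above, a SQL constant in scientific notation is parsed as a
    // decimal with a negative scale, e.g. 1e8 becomes decimal(1, -8): one digit
    // of precision, shifted left by 8 places.
    spark.sql("SELECT 1e8 AS c").printSchema()

    // Forbidding negative scales would turn 1e36 into something like
    // decimal(37, 0); multiplying it would then exceed the maximum decimal
    // precision of 38 and overflow, whereas with negative scales the result
    // stays within bounds.
    spark.sql("SELECT 1e36 * 1 AS c").printSchema()

    spark.stop()
  }
}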

Here you can find the design doc with all the details:
https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit?usp=sharing.
The document is also linked in SPARK-25454. There is also already a PR with
the change: https://github.com/apache/spark/pull/22450.

Looking forward to hearing your feedback,
Thanks.
Marco


Re: SPIP: support decimals with negative scale in decimal operation

2018-09-21 Thread Wenchen Fan
Hi Marco,

Thanks for sending it! The problem is clearly explained in this email, but
I would not treat it as a SPIP. It proposes a fix for a very tricky bug,
and SPIP is usually for new features. Others please correct me if I was
wrong.

Thanks,
Wenchen

On Fri, Sep 21, 2018 at 5:47 PM Marco Gaido  wrote:

> Hi all,
>
> I am writing this e-mail in order to discuss the issue which is reported
> in SPARK-25454 and according to Wenchen's suggestion I prepared a design
> doc for it.
>
> The problem we are facing here is that our rules for decimal operations
> are taken from Hive and MS SQL Server, and they explicitly don't support
> decimals with negative scales. So the rules we have currently are not meant
> to deal with negative scales. The issue is that Spark, instead, doesn't
> forbid negative scales and - indeed - there are cases in which we
> produce them (e.g. a SQL constant like 1e8 is turned into a decimal(1, -8)).
>
> Having negative scales most likely wasn't really intended. But
> unfortunately getting rid of them would be a breaking change, as many
> operations that work fine currently would not be allowed anymore and would
> overflow (e.g. select 1e36 * 1). As such, this is something I'd
> definitely agree on doing, but I think we can target it only for 3.0.
>
> What we can start doing now, instead, is updating our rules in order to
> also handle properly the case when decimal scales are negative. From my
> investigation, it turns out that the only operation which has problems
> with them is Divide.
>
> Here you can find the design doc with all the details:
> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit?usp=sharing.
> The document is also linked in SPARK-25454. There is also already a PR with
> the change: https://github.com/apache/spark/pull/22450.
>
> Looking forward to hearing your feedback,
> Thanks.
> Marco
>


Re: SPIP: support decimals with negative scale in decimal operation

2018-09-21 Thread Marco Gaido
Hi Wenchen,
Thank you for the clarification. I agree that this is more a bug fix than
an improvement. I apologize for the error. Please consider this as a
design doc.

Thanks,
Marco

On Fri, Sep 21, 2018 at 12:04 PM Wenchen Fan wrote:

> Hi Marco,
>
> Thanks for sending it! The problem is clearly explained in this email, but
> I would not treat it as a SPIP. It proposes a fix for a very tricky bug,
> and SPIP is usually for new features. Others please correct me if I was
> wrong.
>
> Thanks,
> Wenchen
>
> On Fri, Sep 21, 2018 at 5:47 PM Marco Gaido 
> wrote:
>
>> Hi all,
>>
>> I am writing this e-mail in order to discuss the issue which is reported
>> in SPARK-25454 and according to Wenchen's suggestion I prepared a design
>> doc for it.
>>
>> The problem we are facing here is that our rules for decimal operations
>> are taken from Hive and MS SQL Server, and they explicitly don't support
>> decimals with negative scales. So the rules we have currently are not meant
>> to deal with negative scales. The issue is that Spark, instead, doesn't
>> forbid negative scales and - indeed - there are cases in which we
>> produce them (e.g. a SQL constant like 1e8 is turned into a decimal(1, -8)).
>>
>> Having negative scales most likely wasn't really intended. But
>> unfortunately getting rid of them would be a breaking change, as many
>> operations that work fine currently would not be allowed anymore and would
>> overflow (e.g. select 1e36 * 1). As such, this is something I'd
>> definitely agree on doing, but I think we can target it only for 3.0.
>>
>> What we can start doing now, instead, is updating our rules in order to
>> also handle properly the case when decimal scales are negative. From my
>> investigation, it turns out that the only operation which has problems
>> with them is Divide.
>>
>> Here you can find the design doc with all the details:
>> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit?usp=sharing.
>> The document is also linked in SPARK-25454. There is also already a PR with
>> the change: https://github.com/apache/spark/pull/22450.
>>
>> Looking forward to hearing your feedback,
>> Thanks.
>> Marco
>>
>


Re: [DISCUSS] upper/lower of special characters

2018-09-21 Thread seancxmao






Hi, Reynold

Sorry for the slow response. Thanks for your suggestion. I'd like to document this in the API docs - SQL built-in functions.

BTW, this is a real case we met in production; the Turkish data is from other systems through ETL. As you mentioned, we use UDFs to avoid issues. E.g. for the special Turkish character "İ" (U+0130), we first process by regexp_replace(c, 'İ', 'I') before further processing.

Thanks,
Chenxiao Mao (Sean)
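
For illustration, a minimal Spark sketch of that pre-normalization (the
DataFrame and the column name "c" below are made up for the example; only the
regexp_replace call itself comes from the message above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

object PreNormalizeTurkishI {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pre-normalize-turkish-i")
      .getOrCreate()
    import spark.implicits._

    // Illustrative input: a string column "c" containing the Turkish dotted
    // capital I (U+0130).
    val df = Seq("\u0130STANBUL").toDF("c")

    // Replace "İ" with a plain "I" before any further locale-insensitive
    // lower/upper processing, as described above.
    val normalized = df.withColumn("c", regexp_replace(col("c"), "\u0130", "I"))
    normalized.show(truncate = false)

    spark.stop()
  }
}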

On 09/19/2018 14:18, Reynold Xin wrote:


I'd just document it as a known limitation and move on for now, until there are enough end users that need this. Spark is also very powerful with UDFs and end users can easily work around this using UDFs.

--
excuse the brevity and lower case due to wrist injury

On Tue, Sep 18, 2018 at 11:14 PM seancxmao wrote:







Hi, all

We found that there are some differences in the case handling of special characters between Spark and other database systems. You may see the list below for an example (you may also check the attached pictures):

select upper("i"), lower("İ"), upper("ı"), lower("I");
--
Spark      I, i with dot, I, i
Hive       I, i with dot, I, i
Teradata   I, i,          I, i
Oracle     I, i,          I, i
SQLServer  I, i,          I, i
MySQL      I, i,          I, i

"İ" and "ı" are Turkish characters. If locale-sensitive case handling is used, the expected results of the above upper/lower functions should be:

select upper("i"), lower("İ"), upper("ı"), lower("I");
--
İ, i, I, ı

But it seems that these systems all do locale-insensitive mapping. Presto explicitly describes this as a known issue in their docs (https://prestodb.io/docs/current/functions/string.html):

> The lower() and upper() functions do not perform locale-sensitive, context-sensitive, or one-to-many mappings required for some languages. Specifically, this will return incorrect results for Lithuanian, Turkish and Azeri.

Java based systems have the same behavior since they all depend on the same JDK String methods. Teradata/Oracle/SQLServer/MySQL also have the same behavior. However, Java based systems return a different result for lower("İ"): Java based systems (Spark/Hive) return "i with dot" while the other database systems (Teradata/Oracle/SQLServer/MySQL) return "i".

My questions:
(1) Should we let Spark return "i" for lower("İ"), which is the same as other database systems?
(2) Should Spark support locale-sensitive upper/lower functions? Because rows of a table may need different locales, we cannot even set the locale at table level. What we might do is to provide upper(string, locale)/lower(string, locale), and let users decide what locale they want to use.

Some references below. Just FYI.
* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toLowerCase-java.util.Locale-
* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toUpperCase-java.util.Locale-
* http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/
* https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette

Your comments and advice are highly appreciated.

Many thanks!
Chenxiao Mao (Sean)
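
For reference, a minimal JVM-level sketch of the locale mappings discussed
above (plain java.lang.String with java.util.Locale on Java 8, which is what
the Java based systems mentioned above ultimately rely on):

import java.util.Locale

object TurkishCaseMapping {
  def main(args: Array[String]): Unit = {
    val dottedCapitalI = "\u0130" // "İ": Latin capital letter I with dot above
    val dotlessSmallI  = "\u0131" // "ı": Latin small letter dotless i
    val turkish = Locale.forLanguageTag("tr")

    // Locale-insensitive (ROOT) mapping: "İ" lowercases to "i" followed by a
    // combining dot above (U+0069 U+0307), i.e. the "i with dot" result that
    // Spark/Hive produce.
    println(dottedCapitalI.toLowerCase(Locale.ROOT)
      .codePoints().toArray.map(cp => f"U+$cp%04X").mkString(" "))

    // Locale-sensitive mapping under the Turkish locale:
    println(dottedCapitalI.toLowerCase(turkish)) // "i"
    println("I".toLowerCase(turkish))            // "ı" (dotless)
    println("i".toUpperCase(turkish))            // "İ" (dotted)
    println(dotlessSmallI.toUpperCase(turkish))  // "I"
  }
}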




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: [DISCUSS] upper/lower of special characters

2018-09-21 Thread seancxmao






Hi, Sean

After a brief investigation, I found there are some tickets/PRs about this issue. I just didn't know that.

https://issues.apache.org/jira/browse/SPARK-20156
https://github.com/apache/spark/pull/17527
https://github.com/apache/spark/pull/17655

I have carefully read the discussions and really learned a lot. I totally agree with the idea:

> I could see backing out changes that affect user application strings, to be conservative. We could decide to change that later. The issue here really stems from lowercasing of purely internal strings.

As for the lower/upper functions, Spark has the same behavior as Hive:

select upper("i"), lower("İ"), upper("ı"), lower("I");
--
Spark      I, i with dot, I, i
Hive       I, i with dot, I, i

Thanks,
Chenxiao Mao
On 09/19/2018 18:35, Sean Owen wrote:


I don't have the details in front of me, but I recall we explicitly overhauled locale-sensitive toUpper and toLower in the code for this exact situation. The current behavior should be on purpose. I believe user data strings are handled in a case-sensitive way, but things like reserved words in SQL are not, of course. The Spark behavior is most correct and consistent with Hive, right?

On Wed, Sep 19, 2018, 1:14 AM seancxmao wrote:







Hi, all

We found that there are some differences in the case handling of special characters between Spark and other database systems. You may see the list below for an example (you may also check the attached pictures):

select upper("i"), lower("İ"), upper("ı"), lower("I");
--
Spark      I, i with dot, I, i
Hive       I, i with dot, I, i
Teradata   I, i,          I, i
Oracle     I, i,          I, i
SQLServer  I, i,          I, i
MySQL      I, i,          I, i

"İ" and "ı" are Turkish characters. If locale-sensitive case handling is used, the expected results of the above upper/lower functions should be:

select upper("i"), lower("İ"), upper("ı"), lower("I");
--
İ, i, I, ı

But it seems that these systems all do locale-insensitive mapping. Presto explicitly describes this as a known issue in their docs (https://prestodb.io/docs/current/functions/string.html):

> The lower() and upper() functions do not perform locale-sensitive, context-sensitive, or one-to-many mappings required for some languages. Specifically, this will return incorrect results for Lithuanian, Turkish and Azeri.

Java based systems have the same behavior since they all depend on the same JDK String methods. Teradata/Oracle/SQLServer/MySQL also have the same behavior. However, Java based systems return a different result for lower("İ"): Java based systems (Spark/Hive) return "i with dot" while the other database systems (Teradata/Oracle/SQLServer/MySQL) return "i".

My questions:
(1) Should we let Spark return "i" for lower("İ"), which is the same as other database systems?
(2) Should Spark support locale-sensitive upper/lower functions? Because rows of a table may need different locales, we cannot even set the locale at table level. What we might do is to provide upper(string, locale)/lower(string, locale), and let users decide what locale they want to use.

Some references below. Just FYI.
* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toLowerCase-java.util.Locale-
* https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toUpperCase-java.util.Locale-
* http://grepalex.com/2013/02/14/java-7-and-the-dotted--and-dotless-i/
* https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette

Your comments and advice are highly appreciated.

Many thanks!
Chenxiao Mao (Sean)




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Sean Owen
Yes, documentation for 2.4 has to be done before the 2.4 release. Or
else it's not for 2.4. Likewise auditing that must happen before 2.4,
must happen before 2.4 is released.
"Foo for 2.4" as Blocker for 2.4 needs to be resolved for 2.4, by
definition. Or else it's not a Blocker, not for 2.4.

 I know we've had this discussion before and agree to disagree about
the semantics. But we won't, say, release 2.4.0 and then go
retroactively patch the 2.4.0 released docs with docs for 2.4.

Really, I'm just asking if all the things those items mean to cover
are done, even if for whatever reason the JIRA is not resolved.

We have a new blocker though, FWIW:
https://issues.apache.org/jira/browse/SPARK-25495
On Fri, Sep 21, 2018 at 3:02 AM Felix Cheung  wrote:
>
> I think the point is we actually need to do these validations before
> completing the release...
>
>
> 
> From: Wenchen Fan 
> Sent: Friday, September 21, 2018 12:02 AM
> To: Sean Owen
> Cc: Spark dev list
> Subject: Re: 2.4.0 Blockers, Critical, etc
>
> Sean thanks for checking them!
>
> I made one pass and re-targeted/closed some of them. Most of them are 
> documentation and auditing, do we need to block the release for them?
>
> On Fri, Sep 21, 2018 at 6:01 AM Sean Owen  wrote:
>>
>> Because we're into 2.4 release candidates, I thought I'd look at
>> what's still open and targeted at 2.4.0. I presume the Blockers are
>> the usual umbrellas that don't themselves block anything, but,
>> confirming, there is nothing left to do there?
>>
>> I think that's mostly a question for Joseph and Weichen.
>>
>> As ever, anyone who knows these items are a) done or b) not going to
>> be in 2.4, go ahead and update them.
>>
>>
>> Blocker:
>>
>> SPARK-25321 ML, Graph 2.4 QA: API: New Scala APIs, docs
>> SPARK-25324 ML 2.4 QA: API: Java compatibility, docs
>> SPARK-25323 ML 2.4 QA: API: Python API coverage
>> SPARK-25320 ML, Graph 2.4 QA: API: Binary incompatible changes
>>
>> Critical:
>>
>> SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
>> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
>> SPARK-25327 Update MLlib, GraphX websites for 2.4
>> SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
>> SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration guide
>>
>> Other:
>>
>> SPARK-25346 Document Spark builtin data sources
>> SPARK-25347 Document image data source in doc site
>> SPARK-12978 Skip unnecessary final group-by when input data already
>> clustered with group-by keys
>> SPARK-20184 performance regression for complex/long sql when enable
>> whole stage codegen
>> SPARK-16196 Optimize in-memory scan performance using ColumnarBatches
>> SPARK-15693 Write schema definition out for file-based data sources to
>> avoid schema inference
>> SPARK-23597 Audit Spark SQL code base for non-interpreted expressions
>> SPARK-25179 Document the features that require Pyarrow 0.10
>> SPARK-25110 make sure Flume streaming connector works with Spark 2.4
>> SPARK-21318 The exception message thrown by `lookupFunction` is ambiguous.
>> SPARK-24464 Unit tests for MLlib's Instrumentation
>> SPARK-23197 Flaky test: spark.streaming.ReceiverSuite."receiver_life_cycle"
>> SPARK-22809 pyspark is sensitive to imports with dots
>> SPARK-22739 Additional Expression Support for Objects
>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>> list of structures
>> SPARK-21030 extend hint syntax to support any expression for Python and R
>> SPARK-22386 Data Source V2 improvements
>> SPARK-15117 Generate code that get a value in each compressed column
>> from CachedBatch when DataFrame.cache() is called
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-21 Thread Michael Heuer
+1 (non-binding)

Bumping our build to 2.3.2 rc6 and Avro to 1.8.2 and Parquet to 1.8.3 works
for us, running on version 2.3.2 rc6 and older Spark versions.

https://github.com/bigdatagenomics/adam/pull/2055

   michael


On Thu, Sep 20, 2018 at 7:09 PM, Ryan Blue 
wrote:

> Changing my vote to +1 with this fixed.
>
> Here's what was going on -- and thanks to Owen O'Malley for debugging:
>
> The problem was that Iceberg contained a fix for a JVM bug for timestamps
> before the unix epoch where the timestamp was off by 1s. Owen moved this
> code into ORC as well and using the new version of Spark pulled in the
> newer version of ORC. That meant that the values were "fixed" twice and
> were wrong.
>
> Updating the Iceberg code to rely on the fix in the version of ORC that
> Spark includes fixes the problem.
>
> On Thu, Sep 20, 2018 at 2:38 PM Dongjoon Hyun 
> wrote:
>
>> Hi, Ryan.
>>
>> Could you share the result on 2.3.1 since this is 2.3.2 RC? That would be
>> helpful to narrow down the scope.
>>
>> Bests,
>> Dongjoon.
>>
>> On Thu, Sep 20, 2018 at 11:56 Ryan Blue 
>> wrote:
>>
>>> -0
>>>
>>> My DataSourceV2 implementation for Iceberg is failing ORC tests when I
>>> run with the 2.3.2 RC that pass when I run with 2.3.0. I'm tracking down
>>> the cause and will report back, but I'm -0 on the release because there may
>>> be a behavior change.
>>>
>>> On Thu, Sep 20, 2018 at 10:37 AM Denny Lee 
>>> wrote:
>>>
 +1

 On Thu, Sep 20, 2018 at 9:55 AM Xiao Li  wrote:

> +1
>
>
> John Zhuge wrote on Wed, Sep 19, 2018 at 1:17 PM:
>
>> +1 (non-binding)
>>
>> Built on Ubuntu 16.04 with Maven flags: -Phadoop-2.7 -Pmesos -Pyarn
>> -Phive-thriftserver -Psparkr -Pkinesis-asl -Phadoop-provided
>>
>> java version "1.8.0_181"
>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>
>>
>> On Wed, Sep 19, 2018 at 2:31 AM Takeshi Yamamuro <
>> linguin@gmail.com> wrote:
>>
>>> +1
>>>
>>> I also checked `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
>>> -Phive-thriftserver` on the OpenJDK below / macOS v10.12.6
>>>
>>> $ java -version
>>> java version "1.8.0_181"
>>> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>>>
>>>
>>> On Wed, Sep 19, 2018 at 10:45 AM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 +1.

 I tested with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
 -Phive-thriftserver` on OpenJDK (1.8.0_181)/CentOS 7.5.

 I hit the following test case failure once during testing, but it's
 not persistent.

 KafkaContinuousSourceSuite
 ...
 subscribing topic by name from earliest offsets
 (failOnDataLoss: false) *** FAILED ***

 Thank you, Saisai.

 Bests,
 Dongjoon.

 On Mon, Sep 17, 2018 at 6:48 PM Saisai Shao 
 wrote:

> +1 from my own side.
>
> Thanks
> Saisai
>
> Wenchen Fan wrote on Tue, Sep 18, 2018 at 9:34 AM:
>
>> +1. All the blocker issues are all resolved in 2.3.2 AFAIK.
>>
>> On Tue, Sep 18, 2018 at 9:23 AM Sean Owen 
>> wrote:
>>
>>> +1 . Licenses and sigs check out as in previous 2.3.x releases. A
>>> build from source with most profiles passed for me.
>>> On Mon, Sep 17, 2018 at 8:17 AM Saisai Shao <
>>> sai.sai.s...@gmail.com> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache
>>> Spark version 2.3.2.
>>> >
>>> > The vote is open until September 21 PST and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 2.3.2
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v2.3.2-rc6 (commit
>>> 02b510728c31b70e6035ad541bfcdc2b59dcd79a):
>>> > https://github.com/apache/spark/tree/v2.3.2-rc6
>>> >
>>> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc6-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> > https://repository.apache.org/content/repositories/orgapachespark-1286/
>>> >
>>> > The documentation corresponding to this release can be found
>>> at:
>

Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Xiangrui Meng
Sean, thanks for checking! The MLlib blockers were resolved today by
reverting breaking API changes. We still have some documentation work to
wrap up. -Xiangrui

+Weichen Xu 

On Fri, Sep 21, 2018 at 6:54 AM Sean Owen  wrote:

> Yes, documentation for 2.4 has to be done before the 2.4 release. Or
> else it's not for 2.4. Likewise auditing that must happen before 2.4,
> must happen before 2.4 is released.
> "Foo for 2.4" as Blocker for 2.4 needs to be resolved for 2.4, by
> definition. Or else it's not a Blocker, not for 2.4.
>
>  I know we've had this discussion before and agree to disagree about
> the semantics. But we won't, say, release 2.4.0 and then go
> retroactively patch the 2.4.0 released docs with docs for 2.4.
>
> Really, I'm just asking if all the things those items mean to cover
> are done, even if for whatever reason the JIRA is not resolved.
>
> We have a new blocker though, FWIW:
> https://issues.apache.org/jira/browse/SPARK-25495
> On Fri, Sep 21, 2018 at 3:02 AM Felix Cheung 
> wrote:
> >
> > I think the point is we actually need to do these validations before
> > completing the release...
> >
> >
> > 
> > From: Wenchen Fan 
> > Sent: Friday, September 21, 2018 12:02 AM
> > To: Sean Owen
> > Cc: Spark dev list
> > Subject: Re: 2.4.0 Blockers, Critical, etc
> >
> > Sean thanks for checking them!
> >
> > I made one pass and re-targeted/closed some of them. Most of them are
> documentation and auditing, do we need to block the release for them?
> >
> > On Fri, Sep 21, 2018 at 6:01 AM Sean Owen  wrote:
> >>
> >> Because we're into 2.4 release candidates, I thought I'd look at
> >> what's still open and targeted at 2.4.0. I presume the Blockers are
> >> the usual umbrellas that don't themselves block anything, but,
> >> confirming, there is nothing left to do there?
> >>
> >> I think that's mostly a question for Joseph and Weichen.
> >>
> >> As ever, anyone who knows these items are a) done or b) not going to
> >> be in 2.4, go ahead and update them.
> >>
> >>
> >> Blocker:
> >>
> >> SPARK-25321 ML, Graph 2.4 QA: API: New Scala APIs, docs
> >> SPARK-25324 ML 2.4 QA: API: Java compatibility, docs
> >> SPARK-25323 ML 2.4 QA: API: Python API coverage
> >> SPARK-25320 ML, Graph 2.4 QA: API: Binary incompatible changes
> >>
> >> Critical:
> >>
> >> SPARK-25319 Spark MLlib, GraphX 2.4 QA umbrella
> >> SPARK-25378 ArrayData.toArray(StringType) assume UTF8String in 2.4
> >> SPARK-25327 Update MLlib, GraphX websites for 2.4
> >> SPARK-25325 ML, Graph 2.4 QA: Update user guide for new features & APIs
> >> SPARK-25326 ML, Graph 2.4 QA: Programming guide update and migration
> guide
> >>
> >> Other:
> >>
> >> SPARK-25346 Document Spark builtin data sources
> >> SPARK-25347 Document image data source in doc site
> >> SPARK-12978 Skip unnecessary final group-by when input data already
> >> clustered with group-by keys
> >> SPARK-20184 performance regression for complex/long sql when enable
> >> whole stage codegen
> >> SPARK-16196 Optimize in-memory scan performance using ColumnarBatches
> >> SPARK-15693 Write schema definition out for file-based data sources to
> >> avoid schema inference
> >> SPARK-23597 Audit Spark SQL code base for non-interpreted expressions
> >> SPARK-25179 Document the features that require Pyarrow 0.10
> >> SPARK-25110 make sure Flume streaming connector works with Spark 2.4
> >> SPARK-21318 The exception message thrown by `lookupFunction` is
> ambiguous.
> >> SPARK-24464 Unit tests for MLlib's Instrumentation
> >> SPARK-23197 Flaky test:
> spark.streaming.ReceiverSuite."receiver_life_cycle"
> >> SPARK-22809 pyspark is sensitive to imports with dots
> >> SPARK-22739 Additional Expression Support for Objects
> >> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
> >> list of structures
> >> SPARK-21030 extend hint syntax to support any expression for Python and
> R
> >> SPARK-22386 Data Source V2 improvements
> >> SPARK-15117 Generate code that get a value in each compressed column
> >> from CachedBatch when DataFrame.cache() is called
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. (http://databricks.com)


Kafka Connector version support

2018-09-21 Thread Basil Hariri
Hi all,

Are there any plans to backport the recent (2.4) updates to the Spark-Kafka 
adapter for use with Spark v2.3, or will the updates just be for v2.4+?

Thanks,
Basil