Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-20 Thread Tom Graves
FYI, I merged a couple of JIRAs that were critical (and that I thought would be 
good to include in the next release) and that will get included if we spin 
another RC; we should update the JIRAs SPARK-24755 and SPARK-24677. If anyone 
disagrees we could back those out, but I think they are good to include.
Tom
On Thursday, July 19, 2018, 8:13:23 PM CDT, Saisai Shao wrote:

Sure, I can wait for this and create another RC then.

Thanks,
Saisai
Xiao Li wrote on Fri, Jul 20, 2018 at 9:11 AM:

Yes, https://issues.apache.org/jira/browse/SPARK-24867 is the one I created, 
and the PR has been created. Since this is not a rare case, let's merge it into 
2.3.2. Reynold's PR gets rid of AnalysisBarrier entirely, which is better than 
the multiple patches we added for AnalysisBarrier after the 2.3.0 release; we 
can target it for 2.4.
Thanks, 
Xiao
2018-07-19 17:48 GMT-07:00 Saisai Shao :

I see, thanks Reynold.
Reynold Xin wrote on Fri, Jul 20, 2018 at 8:46 AM:

Looking at the list of pull requests it looks like this is the ticket: 
https://issues.apache.org/jira/browse/SPARK-24867


On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin  wrote:

I don't think my ticket should block this release; it's a big general 
refactoring.
Xiao, do you have a ticket for the bug you found?

On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao  wrote:

Hi Xiao,
Are you referring to this JIRA 
(https://issues.apache.org/jira/browse/SPARK-24865)?
Xiao Li wrote on Fri, Jul 20, 2018 at 2:41 AM:

dfWithUDF.cache()
dfWithUDF.write.saveAsTable("t")   // recomputes dfWithUDF instead of using the cache
dfWithUDF.write.saveAsTable("t1")  // recomputes again
Cached data is not being used, which causes a big performance regression.
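
A minimal sketch of how one might confirm the cache is bypassed (dfWithUDF
stands for any DataFrame with a UDF column; names here are illustrative):

dfWithUDF.cache()
dfWithUDF.count()   // materializes the cache
dfWithUDF.explain() // a plain query still shows InMemoryRelation in the plan,
                    // while the saveAsTable calls above re-run the UDF over
                    // the source data instead of reading from the cache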



2018-07-19 11:32 GMT-07:00 Sean Owen :

What regression are you referring to here? A -1 vote really needs a rationale.

On Thu, Jul 19, 2018 at 1:27 PM Xiao Li  wrote:

I would first vote -1. 
I might have found another regression caused by the analysis barrier. Will keep 
you posted.


[RESULT] [VOTE] SPIP: Standardize SQL logical plans

2018-07-20 Thread Ryan Blue
This vote passes with 4 binding +1s and 9 community +1s.

Thanks for taking the time to vote, everyone!

Binding votes:
Wenchen Fan
Xiao Li
Reynold Xin
Felix Cheung

Non-binding votes:
Ryan Blue
John Zhuge
Takeshi Yamamuro
Marco Gaido
Russel Spitzer
Alessandro Solimando
Henry Robinson
Dongjoon Hyun
Bruce Robbins


On Wed, Jul 18, 2018 at 4:43 PM Felix Cheung wrote:

> +1
>
>
> --
> *From:* Bruce Robbins 
> *Sent:* Wednesday, July 18, 2018 3:02 PM
> *To:* Ryan Blue
> *Cc:* Spark Dev List
> *Subject:* Re: [VOTE] SPIP: Standardize SQL logical plans
>
> +1 (non-binding)
>
> On Tue, Jul 17, 2018 at 10:59 AM, Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> From discussion on the proposal doc and the discussion thread, I think we
>> have consensus around the plan to standardize logical write operations for
>> DataSourceV2. I would like to call a vote on the proposal.
>>
>> The proposal doc is here: SPIP: Standardize SQL logical plans.
>>
>> This vote is for the plan in that doc. The related SPIP with APIs to
>> create/alter/drop tables will be a separate vote.
>>
>> Please vote in the next 72 hours:
>>
>> [+1]: Spark should adopt the SPIP
>> [-1]: Spark should not adopt the SPIP because . . .
>>
>> Thanks for voting, everyone!
>>
>> --
>> Ryan Blue
>>
>
>

-- 
Ryan Blue


Re: Query on Spark Hive with kerberos Enabled on Kubernetes

2018-07-20 Thread Sandeep Katta
Can you please tell us what exception you got, and share any logs for the same?

On Fri, 20 Jul 2018 at 8:36 PM, Garlapati, Suryanarayana (Nokia - IN/Bangalore) wrote:

> Hi All,
>
> I am trying to use the Spark 2.2.0 Kubernetes code
> (https://github.com/apache-spark-on-k8s/spark/tree/v2.2.0-kubernetes-0.5.0)
> to run Hive queries on a Kerberos-enabled cluster. Spark-submit fails for
> the Hive queries, but passes when I am trying to access HDFS. Is this a
> known limitation, or am I doing something wrong? Please let me know. If
> this works for you, can you please share an example of running Hive
> queries?
>
>
>
> Thanks.
>
>
>
> Regards
>
> Surya
>


Query on Spark Hive with kerberos Enabled on Kubernetes

2018-07-20 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi All,
I am trying to use the Spark 2.2.0 Kubernetes code 
(https://github.com/apache-spark-on-k8s/spark/tree/v2.2.0-kubernetes-0.5.0)
to run Hive queries on a Kerberos-enabled cluster. Spark-submit fails for the 
Hive queries, but passes when I am trying to access HDFS. Is this a known 
limitation, or am I doing something wrong? Please let me know. If this works 
for you, can you please share an example of running Hive queries?

Thanks.

Regards
Surya


Re: JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?

2018-07-20 Thread Jacek Laskowski
Hi Joseph,

Thanks for your explanation. It makes a lot of sense, and I found
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
which gives more detail.

With that, and after reviewing the code, I see that the customSchema option
simply overrides the data types of the fields in a relation schema [1][2]. I
think the option's name should include the word "override" to convey its exact
meaning, shouldn't it?

With that said, I think the description of the customSchema option may be
slightly incorrect. For example, it says:

"The custom schema to use for reading data from JDBC connectors"

Although the option is used for reading, it merely overrides the data types,
and the fields it lists do not even have to match the table's fields, which
makes no difference to the read. Is that correct?

It is only in the following sentence that the word "type" appears:

"You can also specify partial fields, and the others use the default type
mapping."

But that begs another question: what is the default type mapping? That was one
of my questions when I first found the option.

What do you think about the following description of the customSchema option?
You're welcome to make further changes if needed.


customSchema - Specifies the custom data types of the read schema (that is
used at load time).

customSchema is a comma-separated list of field definitions with column
names and their data types in a canonical SQL representation, e.g. id
DECIMAL(38, 0), name STRING.

customSchema defines the data types of columns, overriding the data types
inferred from the table schema, and follows this pattern:

colTypeList
    : colType (',' colType)*
    ;

colType
    : identifier dataType (COMMENT STRING)?
    ;

dataType
    : complex=ARRAY '<' dataType '>'                            #complexDataType
    | complex=MAP '<' dataType ',' dataType '>'                 #complexDataType
    | complex=STRUCT ('<' complexColTypeList? '>' | NEQ)        #complexDataType
    | identifier ('(' INTEGER_VALUE (',' INTEGER_VALUE)* ')')?  #primitiveDataType
    ;
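
For illustration, a minimal sketch of the option in use (the URL, table, and
column names below are placeholders, not from this thread):

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/dbname")
  .option("dbtable", "people")
  // id is read back as DECIMAL(38, 0) and name as STRING; any column not
  // listed falls back to the default type mapping
  .option("customSchema", "id DECIMAL(38, 0), name STRING")
  .load()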


Should I file a JIRA task for this?

[1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala?utf8=%E2%9C%93#L116-L118
[2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L785-L788

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jul 16, 2018 at 4:27 PM, Joseph Torres  wrote:

> I guess the question is partly about the semantics of
> DataFrameReader.schema. If it's supposed to mean "the loaded dataframe will
> definitely have exactly this schema", that doesn't quite match the behavior
> of the customSchema option. If it's only meant to be an arbitrary schema
> input which the source can interpret however it wants, it'd be fine.
>
> The second semantic is IMO more useful, so I'm in favor here.
>
> On Mon, Jul 16, 2018 at 3:43 AM, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> I think there is a sort of inconsistency in how DataFrameReader.jdbc
>> deals with a user-defined schema: it makes sure that there is no
>> user-specified schema [1][2], yet allows setting one using the customSchema
>> option [3]. Why is that? Has this been merely overlooked, or is it
>> intentional?
>>
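>> As a minimal sketch of the inconsistency (url, someSchema, and props are
>> hypothetical placeholders):
>>
>> // throws AnalysisException via assertNoSpecifiedSchema:
>> spark.read.schema(someSchema).jdbc(url, "people", props)
>> // accepted, even though it also specifies (part of) a schema:
>> spark.read.option("customSchema", "id DECIMAL(38, 0)")
>>   .jdbc(url, "people", props)
>>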
>> I think assertNoSpecifiedSchema should be removed from
>> DataFrameReader.jdbc and support for DataFrameReader.schema for jdbc should
>> be added (with the customSchema option marked as deprecated to be removed
>> in 2.4 or 3.0).
>>
>> Should I file an issue in Spark JIRA and do the changes? WDYT?
>>
>> [1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
>> [2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
>> [3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
>>
>
>


Re: Live Streamed Code Review today at 11am Pacific

2018-07-20 Thread Holden Karau
Heads up: tomorrow's Friday review is going to be at 8:30 am instead of 9:30
am because I had to move some flights around.

On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau  wrote:

> This afternoon @ 3pm pacific I'll be looking at review tooling for Spark &
> Beam https://www.youtube.com/watch?v=ff8_jbzC8JI.
>
> Next week's regular Friday code review (this time July 20th @ 9:30am
> pacific) will once again probably have more of an ML focus for folks
> interested in watching Spark ML PRs be reviewed -
> https://www.youtube.com/watch?v=aG5h99yb6XE
>
> Next week I'll have a live coding session with more of a Beam focus if you
> want to see something a bit different (but still related since Beam runs on
> Spark) with a focus on Python dependency management (which is a thing we
> are also exploring in Spark at the same time) -
> https://www.youtube.com/watch?v=Sv0XhS2pYqA on July 19th at 2pm pacific.
>
> P.S.
>
> More generally, you can follow me (holdenkarau) on YouTube and on Twitch to
> be notified even when I forget to send out the emails (which is pretty
> often).
>
> This morning I did another live review session I forgot to ping the list
> about (https://www.youtube.com/watch?v=M_lRFptcGTI&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw&index=31)
> and yesterday I did some live coding using PySpark and working on
> Sparkling ML -
> https://www.youtube.com/watch?v=kCnBDpNce9A&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw&index=32
>
> On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau 
> wrote:
>
>> Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and
>> seeing how we validate Spark releases -
>> https://www.twitch.tv/events/VAg-5PKURQeH15UAawhBtw /
>> https://www.youtube.com/watch?v=1_XLrlKS26o .
>> Tomorrow @ 12:30 live PR reviews & Monday live coding -
>> https://youtube.com/user/holdenkarau &
>> https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage
>> more folks to help with RC
>> validation & PR reviews :)
>>
>> On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau 
>> wrote:
>>
>>> Next week is Pride in San Francisco, but I'm still going to do two quick
>>> sessions. One will be live coding with Apache Spark to collect ASF diversity
>>> information ( https://www.youtube.com/watch?v=OirnFnsU37A /
>>> https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
>>> pacific, and the other will be the regular Friday code review (
>>> https://www.youtube.com/watch?v=IAWm4OLRoyY /
>>> https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am.
>>>
>>> On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau 
>>> wrote:
>>>
 I'll be doing another one tomorrow morning at 9am pacific focused on
 Python + K8s support & improved JSON support -
 https://www.youtube.com/watch?v=Z7ZEkvNwneU &
 https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)

 On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau 
 wrote:

> If anyone wants to watch the recording:
> https://www.youtube.com/watch?v=lugG_2QU6YU
>
> I'll do one next week as well - March 16th @ 11am -
> https://www.youtube.com/watch?v=pXzVtEUjrLc
>
> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
> wrote:
>
>> Hi folks,
>>
>> If you're curious about learning more about how Spark is developed, I'm
>> going to experiment with doing a live code review where folks can watch and
>> see how that part of our process works. I have two volunteers already for
>> having their PRs looked at live, and if you have a Spark PR you're working
>> on that you'd like me to livestream a review of, please ping me.
>>
>> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>>
>> Cheers,
>>
>> Holden :)
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



 --
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau