Unsubscribe

2018-06-14 Thread Kumar S, Sajive
Unsubscribe


Unsubscribe

2018-06-14 Thread Congxian Qiu
Unsubscribe

-- 
Blog:http://www.klion26.com
GTalk:qcx978132955
一切随心 (Everything follows the heart)


Support SqlStreaming in spark

2018-06-14 Thread JackyLee
Hello 

Nowadays, more and more streaming products support SQL on streams, such as
Kafka SQL, Flink SQL and Storm SQL. Supporting SQL streaming not only lowers
the barrier to entry for streaming, but also makes streaming easier for
everyone to adopt.

At present, Structured Streaming is relatively mature, and since it is based
on the Dataset API, it is possible to provide a SQL entry point for Structured
Streaming and to run Structured Streaming jobs in SQL.
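
For comparison, a minimal Structured Streaming job written against the Dataset
API today looks like the following (a sketch only; the SparkSession `spark`
and the Kafka server/topic names are placeholders):

// Today: the pipeline is expressed through the Dataset API.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

val query = events.writeStream
  .format("console")
  .outputMode("append")
  .start()

// The proposal, roughly, is to let users express the same pipeline purely in
// SQL, e.g. something like (hypothetical syntax, not implemented):
//   INSERT INTO console_sink SELECT CAST(value AS STRING) FROM kafka_events
query.awaitTermination()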

To support SQL streaming, there are two key points:
1. The parser should be able to parse streaming SQL.
2. The analyzer should be able to map metadata information to the
corresponding Relation.

Running Structured Streaming in SQL can bring several benefits:
1. It lowers the entry barrier for Structured Streaming and attracts users
more easily.
2. It encapsulates the metadata of sources and sinks into tables, so they can
be maintained and managed uniformly and are easier for users to work with.
3. Metadata permission management, based on Hive, gives tighter control over
Structured Streaming's overall authorization scheme.

We have found some ways to approach this, and it would be a pleasure to
discuss them with you.

Thanks,  

Jackey Lee



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: array_contains in package org.apache.spark.sql.functions

2018-06-14 Thread 刘崇光
Hello Takuya,

Thanks for your message. I will do the JIRA and PR.

Best regards,
Chongguang

On Thu, Jun 14, 2018 at 11:25 PM, Takuya UESHIN 
wrote:

> Hi Chongguang,
>
> Thanks for the report!
>
> That makes sense and the proposition should work, or we could add an
> overload like `def array_contains(column: Column, value: Column)`.
> Other functions, such as `array_position` and `element_at`, may be in the
> same situation.
>
> Could you file a JIRA, and submit a PR if possible?
> We can discuss the issue further there.
>
> Btw, I guess you can use `expr("array_contains(columnA, columnB)")` as a
> workaround.
>
> Thanks.
>
>
> On Thu, Jun 14, 2018 at 2:15 AM, 刘崇光  wrote:
>
>>
>> -- Forwarded message --
>> From: 刘崇光 
>> Date: Thu, Jun 14, 2018 at 11:08 AM
>> Subject: array_contains in package org.apache.spark.sql.functions
>> To: u...@spark.apache.org
>>
>>
>> Hello all,
>>
>> I ran into a use case in a project with Spark SQL and want to share with
>> you some thoughts about the function array_contains.
>>
>> Say I have a DataFrame containing 2 columns. Column A of type "Array of
>> String" and Column B of type "String". I want to determine if the value of
>> column B is contained in the value of column A, without using a udf of
>> course.
>> The function array_contains came into my mind naturally:
>>
>> def array_contains(column: Column, value: Any): Column = withExpr {
>>   ArrayContains(column.expr, Literal(value))
>> }
>>
>> However the function takes the column B and does a "Literal" of column B,
>> which yields a runtime exception: RuntimeException("Unsupported literal
>> type " + v.getClass + " " + v).
>>
>> Then after discussion with my friends, we found a solution without using
>> udf:
>>
>> new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr))
>>
>>
>> With this solution, I think we could make the function a bit more
>> powerful, like this:
>>
>> def array_contains(column: Column, value: Any): Column = withExpr {
>>   value match {
>> case c: Column => ArrayContains(column.expr, c.expr)
>> case _ => ArrayContains(column.expr, Literal(value))
>>   }
>> }
>>
>>
>> It does pattern matching to detect whether value is of type Column. If so,
>> it uses the .expr of the column; otherwise it works as it used to.
>>
>> Any suggestion or opinion on the proposition?
>>
>>
>> Kind regards,
>> Chongguang LIU
>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


Re: array_contains in package org.apache.spark.sql.functions

2018-06-14 Thread Takuya UESHIN
Hi Chongguang,

Thanks for the report!

That makes sense and the proposition should work, or we could add an
overload like `def array_contains(column: Column, value: Column)`.
Other functions, such as `array_position` and `element_at`, may be in the same
situation.

Could you file a JIRA, and submit a PR if possible?
We can discuss the issue further there.

Btw, I guess you can use `expr("array_contains(columnA, columnB)")` as a
workaround.
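
A minimal sketch of that workaround on a toy DataFrame (it assumes a
SparkSession named `spark`, and the column names are just for illustration):

import org.apache.spark.sql.functions.expr
import spark.implicits._  // assumes a SparkSession named `spark`

val df = Seq(
  (Seq("a", "b", "c"), "b"),
  (Seq("x", "y"), "z")
).toDF("columnA", "columnB")

// Build the predicate from SQL text instead of the typed array_contains API.
df.filter(expr("array_contains(columnA, columnB)")).show()
// Only the first row remains, since "b" is an element of ["a", "b", "c"].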

Thanks.


On Thu, Jun 14, 2018 at 2:15 AM, 刘崇光  wrote:

>
> -- Forwarded message --
> From: 刘崇光 
> Date: Thu, Jun 14, 2018 at 11:08 AM
> Subject: array_contains in package org.apache.spark.sql.functions
> To: u...@spark.apache.org
>
>
> Hello all,
>
> I ran into a use case in a project with Spark SQL and want to share with you
> some thoughts about the function array_contains.
>
> Say I have a DataFrame containing 2 columns. Column A of type "Array of
> String" and Column B of type "String". I want to determine if the value of
> column B is contained in the value of column A, without using a udf of
> course.
> The function array_contains came into my mind naturally:
>
> def array_contains(column: Column, value: Any): Column = withExpr {
>   ArrayContains(column.expr, Literal(value))
> }
>
> However the function takes the column B and does a "Literal" of column B,
> which yields a runtime exception: RuntimeException("Unsupported literal
> type " + v.getClass + " " + v).
>
> Then after discussion with my friends, we found a solution without using
> udf:
>
> new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr))
>
>
> With this solution, I think we could make the function a bit more
> powerful, like this:
>
> def array_contains(column: Column, value: Any): Column = withExpr {
>   value match {
> case c: Column => ArrayContains(column.expr, c.expr)
> case _ => ArrayContains(column.expr, Literal(value))
>   }
> }
>
>
> It does pattern matching to detect whether value is of type Column. If so,
> it uses the .expr of the column; otherwise it works as it used to.
>
> Any suggestion or opinion on the proposition?
>
>
> Kind regards,
> Chongguang LIU
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: [VOTE] SPIP ML Pipelines in R

2018-06-14 Thread Hossein
The vote passed with the following +1 votes:

- Felix
- Joseph
- Xiangrui
- Reynold

Joseph has kindly volunteered to shepherd this.

Thanks,
--Hossein


On Thu, Jun 14, 2018 at 1:32 PM Reynold Xin  wrote:

> +1 on the proposal.
>
>
> On Fri, Jun 1, 2018 at 8:17 PM Hossein  wrote:
>
>> Hi Shivaram,
>>
>> We converged on a CRAN release process that seems identical to current
>> SparkR.
>>
>> --Hossein
>>
>> On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Hossein -- Can you clarify what the resolution was on the repository /
>>> release issue discussed in the SPIP?
>>>
>>> Shivaram
>>>
>>> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
>>> wrote:
>>> > +1
>>> > With my concerns in the SPIP discussion.
>>> >
>>> > 
>>> > From: Hossein 
>>> > Sent: Wednesday, May 30, 2018 2:03:03 PM
>>> > To: dev@spark.apache.org
>>> > Subject: [VOTE] SPIP ML Pipelines in R
>>> >
>>> > Hi,
>>> >
>>> > I started a discussion thread for a new R package to expose MLlib
>>> pipelines in
>>> > R.
>>> >
>>> > To summarize we will work on utilities to generate R wrappers for MLlib
>>> > pipeline API for a new R package. This will lower the burden for
>>> exposing
>>> > new APIs in the future.
>>> >
>>> > Following the SPIP process, I am proposing the SPIP for a vote.
>>> >
>>> > +1: Let's go ahead and implement the SPIP.
>>> > +0: Don't really care.
>>> > -1: I do not think this is a good idea for the following reasons.
>>> >
>>> > Thanks,
>>> > --Hossein
>>>
>>
>>


Re: [VOTE] SPIP ML Pipelines in R

2018-06-14 Thread Reynold Xin
+1 on the proposal.


On Fri, Jun 1, 2018 at 8:17 PM Hossein  wrote:

> Hi Shivaram,
>
> We converged on a CRAN release process that seems identical to current
> SparkR.
>
> --Hossein
>
> On Thu, May 31, 2018 at 9:10 AM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Hossein -- Can you clarify what the resolution was on the repository /
>> release issue discussed in the SPIP?
>>
>> Shivaram
>>
>> On Thu, May 31, 2018 at 9:06 AM, Felix Cheung 
>> wrote:
>> > +1
>> > With my concerns in the SPIP discussion.
>> >
>> > 
>> > From: Hossein 
>> > Sent: Wednesday, May 30, 2018 2:03:03 PM
>> > To: dev@spark.apache.org
>> > Subject: [VOTE] SPIP ML Pipelines in R
>> >
>> > Hi,
>> >
>> > I started a discussion thread for a new R package to expose MLlib
>> pipelines in
>> > R.
>> >
>> > To summarize we will work on utilities to generate R wrappers for MLlib
>> > pipeline API for a new R package. This will lower the burden for
>> exposing
>> > new APIs in the future.
>> >
>> > Following the SPIP process, I am proposing the SPIP for a vote.
>> >
>> > +1: Let's go ahead and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I do not think this is a good idea for the following reasons.
>> >
>> > Thanks,
>> > --Hossein
>>
>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-14 Thread Hadrien Chicault
Unsubscribe

Le jeu. 14 juin 2018 à 20:59, Jules Damji  a écrit :

>
>
> Matei & I own it. I normally tweet or handle Spark related PSAs
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> > On Jun 14, 2018, at 11:45 AM, Marcelo Vanzin 
> wrote:
> >
> > Hi Jacek,
> >
> > I seriously have no idea... I don't even know who owns that account (I
> > hope they have some connection with the PMC?).
> >
> > But it seems whoever owns it already sent something.
> >
> >> On Thu, Jun 14, 2018 at 12:31 AM, Jacek Laskowski 
> wrote:
> >> Hi Marcelo,
> >>
> >> How to announce it on twitter @ https://twitter.com/apachespark? How
> to make
> >> it part of the release process?
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://about.me/JacekLaskowski
> >> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> >> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> >> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>> On Mon, Jun 11, 2018 at 9:47 PM, Marcelo Vanzin 
> wrote:
> >>>
> >>> We are happy to announce the availability of Spark 2.3.1!
> >>>
> >>> Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
> >>> maintenance branch of Spark. We strongly recommend all 2.3.x users to
> >>> upgrade to this stable release.
> >>>
> >>> To download Spark 2.3.1, head over to the download page:
> >>> http://spark.apache.org/downloads.html
> >>>
> >>> To view the release notes:
> >>> https://spark.apache.org/releases/spark-release-2-3-1.html
> >>>
> >>> We would like to acknowledge all community members for contributing to
> >>> this release. This release would not have been possible without you.
> >>>
> >>>
> >>> --
> >>> Marcelo
> >>>
> >>> -
> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Shared variable in executor level

2018-06-14 Thread Nikodimos Nikolaidis
Thanks, that's what I was looking for.


On 06/14/2018 04:41 PM, Sean Owen wrote:
> Just use a singleton or static variable. It will be a simple per-JVM
> value that is therefore per-executor.
>
> On Thu, Jun 14, 2018 at 6:59 AM Nikodimos Nikolaidis
> mailto:niknik...@csd.auth.gr>> wrote:
>
> Hello community,
>
> I am working on a project in which statistics (like predicate
> selectivity) are collected during execution. I think that it's a good
> idea to keep these statistics in executor level. So, all tasks in
> same
> executor share the same variable and no extra network traffic is
> needed.
> Also, I am not especially interested in thread safety, it's not a big
> deal if some updates are lost - we are trying to see the general
> trend.
>
> This could be done, for example, with an in-memory data structure
> store
> server like Redis in each worker machine. But, could it be done in
> Spark
> natively?
>
> thanks,
> nik
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
>



Re: [ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-14 Thread Jules Damji



Matei & I own it. I normally tweet or handle Spark related PSAs

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Jun 14, 2018, at 11:45 AM, Marcelo Vanzin  
> wrote:
> 
> Hi Jacek,
> 
> I seriously have no idea... I don't even know who owns that account (I
> hope they have some connection with the PMC?).
> 
> But it seems whoever owns it already sent something.
> 
>> On Thu, Jun 14, 2018 at 12:31 AM, Jacek Laskowski  wrote:
>> Hi Marcelo,
>> 
>> How to announce it on twitter @ https://twitter.com/apachespark? How to make
>> it part of the release process?
>> 
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
>> 
>>> On Mon, Jun 11, 2018 at 9:47 PM, Marcelo Vanzin  wrote:
>>> 
>>> We are happy to announce the availability of Spark 2.3.1!
>>> 
>>> Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
>>> maintenance branch of Spark. We strongly recommend all 2.3.x users to
>>> upgrade to this stable release.
>>> 
>>> To download Spark 2.3.1, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>> 
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-2-3-1.html
>>> 
>>> We would like to acknowledge all community members for contributing to
>>> this release. This release would not have been possible without you.
>>> 
>>> 
>>> --
>>> Marcelo
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-14 Thread Marcelo Vanzin
Hi Jacek,

I seriously have no idea... I don't even know who owns that account (I
hope they have some connection with the PMC?).

But it seems whoever owns it already sent something.

On Thu, Jun 14, 2018 at 12:31 AM, Jacek Laskowski  wrote:
> Hi Marcelo,
>
> How to announce it on twitter @ https://twitter.com/apachespark? How to make
> it part of the release process?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Mon, Jun 11, 2018 at 9:47 PM, Marcelo Vanzin  wrote:
>>
>> We are happy to announce the availability of Spark 2.3.1!
>>
>> Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
>> maintenance branch of Spark. We strongly recommend all 2.3.x users to
>> upgrade to this stable release.
>>
>> To download Spark 2.3.1, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-3-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this release. This release would not have been possible without you.
>>
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unsubscribe

2018-06-14 Thread Mohamed Gabr
Unsubscribe

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unsubscribe

2018-06-14 Thread Thiago
Unsubscribe


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Marcelo Vanzin
Yes, my bad. The code in session.py needs to also catch TypeError like before.
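
A rough sketch of that fallback (simplified; this is not the actual
session.py patch, just the shape of it):

# When Hive classes are not on the classpath, py4j exposes HiveConf as a
# plain JavaPackage, so calling it raises TypeError rather than Py4JError;
# catching both lets the shell fall back to the in-memory catalog.
from py4j.protocol import Py4JError
from pyspark import SparkConf, SparkContext

conf = SparkConf()
SparkContext._ensure_initialized()
try:
    SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
    conf.set("spark.sql.catalogImplementation", "hive")
except (Py4JError, TypeError):
    conf.set("spark.sql.catalogImplementation", "in-memory")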

On Thu, Jun 14, 2018 at 11:03 AM, Li Jin  wrote:
> Sounds good. Thanks all for the quick reply.
>
> https://issues.apache.org/jira/browse/SPARK-24563
>
>
> On Thu, Jun 14, 2018 at 12:19 PM, Xiao Li  wrote:
>>
>> Thanks for catching this. Please feel free to submit a PR. I do not think
>> Vanzin wants to introduce the behavior changes in that PR. We should do the
>> code review more carefully.
>>
>> Xiao
>>
>> 2018-06-14 9:18 GMT-07:00 Li Jin :
>>>
>>> Are there objection to restore the behavior for PySpark users? I am happy
>>> to submit a patch.
>>>
>>> On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin  wrote:

 The behavior change is not good...

 On Thu, Jun 14, 2018 at 9:05 AM Li Jin  wrote:
>
> Ah, looks like it's this change:
>
> https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5
>
> It seems strange that by default Spark doesn't build with Hive but by
> default PySpark requires it...
>
> This might also be a behavior change to PySpark users that build Spark
> without Hive. The old behavior is "fall back to non-hive support" and the
> new behavior is "program won't start".
>
> On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:
>>
>> I think you would have to build with the 'hive' profile? but if so
>> that would have been true for a while now.
>>
>>
>> On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:
>>>
>>> Hey all,
>>>
>>> I just did a clean checkout of github.com/apache/spark but failed to
>>> start PySpark, this is what I did:
>>>
>>> git clone g...@github.com:apache/spark.git; cd spark; build/sbt
>>> package; bin/pyspark
>>>
>>>
>>> And got this exception:
>>>
>>> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>>>
>>> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>>>
>>> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>>>
>>> Type "help", "copyright", "credits" or "license" for more
>>> information.
>>>
>>> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>>
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>>
>>> Setting default log level to "WARN".
>>>
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>>
>>>
>>> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
>>> UserWarning: Failed to initialize Spark session.
>>>
>>>   warnings.warn("Failed to initialize Spark session.")
>>>
>>> Traceback (most recent call last):
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py", 
>>> line
>>> 41, in 
>>>
>>> spark = SparkSession._create_shell_session()
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
>>> line 564, in _create_shell_session
>>>
>>> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>>>
>>> TypeError: 'JavaPackage' object is not callable
>>>
>>>
>>> I also tried to delete hadoop deps from my ivy2 cache and reinstall
>>> them but no luck. I wonder:
>>>
>>> I have not seen this before, could this be caused by recent change to
>>> head?
>>> Am I doing something wrong in the build process?
>>>
>>>
>>> Thanks much!
>>> Li
>>>
>
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
Sounds good. Thanks all for the quick reply.

https://issues.apache.org/jira/browse/SPARK-24563


On Thu, Jun 14, 2018 at 12:19 PM, Xiao Li  wrote:

> Thanks for catching this. Please feel free to submit a PR. I do not think
> Vanzin wants to introduce the behavior changes in that PR. We should do the
> code review more carefully.
>
> Xiao
>
> 2018-06-14 9:18 GMT-07:00 Li Jin :
>
>> Are there objection to restore the behavior for PySpark users? I am happy
>> to submit a patch.
>>
>> On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin  wrote:
>>
>>> The behavior change is not good...
>>>
>>> On Thu, Jun 14, 2018 at 9:05 AM Li Jin  wrote:
>>>
 Ah, looks like it's this change:
 https://github.com/apache/spark/commit/b3417b731d4e323398a0d
 7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5

 It seems strange that by default Spark doesn't build with Hive but by
 default PySpark requires it...

 This might also be a behavior change to PySpark users that build Spark
 without Hive. The old behavior is "fall back to non-hive support" and the
 new behavior is "program won't start".

 On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:

> I think you would have to build with the 'hive' profile? but if so
> that would have been true for a while now.
>
>
> On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:
>
>> Hey all,
>>
>> I just did a clean checkout of github.com/apache/spark but failed to
>> start PySpark, this is what I did:
>>
>> git clone g...@github.com:apache/spark.git; cd spark; build/sbt
>> package; bin/pyspark
>>
>> And got this exception:
>>
>> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>>
>> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>>
>> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>>
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>>
>> Using Spark's default log4j profile: org/apache/spark/log4j-default
>> s.properties
>>
>> Setting default log level to "WARN".
>>
>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>> setLogLevel(newLevel).
>>
>> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
>> UserWarning: Failed to initialize Spark session.
>>
>>   warnings.warn("Failed to initialize Spark session.")
>>
>> Traceback (most recent call last):
>>
>>   File 
>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py",
>> line 41, in 
>>
>> spark = SparkSession._create_shell_session()
>>
>>   File 
>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
>> line 564, in _create_shell_session
>>
>> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>>
>> TypeError: 'JavaPackage' object is not callable
>>
>> I also tried to delete hadoop deps from my ivy2 cache and reinstall
>> them but no luck. I wonder:
>>
>>
>>1. I have not seen this before, could this be caused by recent
>>change to head?
>>2. Am I doing something wrong in the build process?
>>
>>
>> Thanks much!
>> Li
>>
>>

>


unsubscribe

2018-06-14 Thread Huamin Li
unsubscribe


unsubscribe

2018-06-14 Thread Vasilis Hadjipanos
unsubscribe


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Xiao Li
Thanks for catching this. Please feel free to submit a PR. I do not think
Vanzin intended to introduce this behavior change in that PR. We should do
code reviews more carefully.

Xiao

2018-06-14 9:18 GMT-07:00 Li Jin :

> Are there objection to restore the behavior for PySpark users? I am happy
> to submit a patch.
>
> On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin  wrote:
>
>> The behavior change is not good...
>>
>> On Thu, Jun 14, 2018 at 9:05 AM Li Jin  wrote:
>>
>>> Ah, looks like it's this change:
>>> https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e8640
>>> 5f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5
>>>
>>> It seems strange that by default Spark doesn't build with Hive but by
>>> default PySpark requires it...
>>>
>>> This might also be a behavior change to PySpark users that build Spark
>>> without Hive. The old behavior is "fall back to non-hive support" and the
>>> new behavior is "program won't start".
>>>
>>> On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:
>>>
 I think you would have to build with the 'hive' profile? but if so that
 would have been true for a while now.


 On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:

> Hey all,
>
> I just did a clean checkout of github.com/apache/spark but failed to
> start PySpark, this is what I did:
>
> git clone g...@github.com:apache/spark.git; cd spark; build/sbt
> package; bin/pyspark
>
> And got this exception:
>
> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>
> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>
> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>
> Type "help", "copyright", "credits" or "license" for more information.
>
> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> Using Spark's default log4j profile: org/apache/spark/log4j-
> defaults.properties
>
> Setting default log level to "WARN".
>
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
>
> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
> UserWarning: Failed to initialize Spark session.
>
>   warnings.warn("Failed to initialize Spark session.")
>
> Traceback (most recent call last):
>
>   File 
> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py",
> line 41, in 
>
> spark = SparkSession._create_shell_session()
>
>   File 
> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
> line 564, in _create_shell_session
>
> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>
> TypeError: 'JavaPackage' object is not callable
>
> I also tried to delete hadoop deps from my ivy2 cache and reinstall
> them but no luck. I wonder:
>
>
>1. I have not seen this before, could this be caused by recent
>change to head?
>2. Am I doing something wrong in the build process?
>
>
> Thanks much!
> Li
>
>
>>>


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
Are there objections to restoring the behavior for PySpark users? I am happy
to submit a patch.
On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin  wrote:

> The behavior change is not good...
>
> On Thu, Jun 14, 2018 at 9:05 AM Li Jin  wrote:
>
>> Ah, looks like it's this change:
>>
>> https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5
>>
>> It seems strange that by default Spark doesn't build with Hive but by
>> default PySpark requires it...
>>
>> This might also be a behavior change to PySpark users that build Spark
>> without Hive. The old behavior is "fall back to non-hive support" and the
>> new behavior is "program won't start".
>>
>> On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:
>>
>>> I think you would have to build with the 'hive' profile? but if so that
>>> would have been true for a while now.
>>>
>>>
>>> On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:
>>>
 Hey all,

 I just did a clean checkout of github.com/apache/spark but failed to
 start PySpark, this is what I did:

 git clone g...@github.com:apache/spark.git; cd spark; build/sbt
 package; bin/pyspark

 And got this exception:

 (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark

 Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)

 [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin

 Type "help", "copyright", "credits" or "license" for more information.

 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable

 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties

 Setting default log level to "WARN".

 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
 setLogLevel(newLevel).

 /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
 UserWarning: Failed to initialize Spark session.

   warnings.warn("Failed to initialize Spark session.")

 Traceback (most recent call last):

   File
 "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py", line
 41, in 

 spark = SparkSession._create_shell_session()

   File
 "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
 line 564, in _create_shell_session

 SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()

 TypeError: 'JavaPackage' object is not callable

 I also tried to delete hadoop deps from my ivy2 cache and reinstall
 them but no luck. I wonder:


1. I have not seen this before, could this be caused by recent
change to head?
2. Am I doing something wrong in the build process?


 Thanks much!
 Li


>>


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Reynold Xin
The behavior change is not good...

On Thu, Jun 14, 2018 at 9:05 AM Li Jin  wrote:

> Ah, looks like it's this change:
>
> https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5
>
> It seems strange that by default Spark doesn't build with Hive but by
> default PySpark requires it...
>
> This might also be a behavior change to PySpark users that build Spark
> without Hive. The old behavior is "fall back to non-hive support" and the
> new behavior is "program won't start".
>
> On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:
>
>> I think you would have to build with the 'hive' profile? but if so that
>> would have been true for a while now.
>>
>>
>> On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:
>>
>>> Hey all,
>>>
>>> I just did a clean checkout of github.com/apache/spark but failed to
>>> start PySpark, this is what I did:
>>>
>>> git clone g...@github.com:apache/spark.git; cd spark; build/sbt package;
>>> bin/pyspark
>>>
>>> And got this exception:
>>>
>>> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>>>
>>> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>>>
>>> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>>>
>>> Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>>
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>>
>>> Setting default log level to "WARN".
>>>
>>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>>> setLogLevel(newLevel).
>>>
>>> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
>>> UserWarning: Failed to initialize Spark session.
>>>
>>>   warnings.warn("Failed to initialize Spark session.")
>>>
>>> Traceback (most recent call last):
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py", line
>>> 41, in 
>>>
>>> spark = SparkSession._create_shell_session()
>>>
>>>   File
>>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
>>> line 564, in _create_shell_session
>>>
>>> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>>>
>>> TypeError: 'JavaPackage' object is not callable
>>>
>>> I also tried to delete hadoop deps from my ivy2 cache and reinstall them
>>> but no luck. I wonder:
>>>
>>>
>>>1. I have not seen this before, could this be caused by recent
>>>change to head?
>>>2. Am I doing something wrong in the build process?
>>>
>>>
>>> Thanks much!
>>> Li
>>>
>>>
>


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
Ah, looks like it's this change:
https://github.com/apache/spark/commit/b3417b731d4e323398a0d7ec6e86405f4464f4f9#diff-3b5463566251d5b09fd328738a9e9bc5

It seems strange that by default Spark doesn't build with Hive but by
default PySpark requires it...

This might also be a behavior change for PySpark users who build Spark
without Hive. The old behavior was "fall back to non-Hive support" and the
new behavior is "the program won't start".

On Thu, Jun 14, 2018 at 11:51 AM, Sean Owen  wrote:

> I think you would have to build with the 'hive' profile? but if so that
> would have been true for a while now.
>
>
> On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:
>
>> Hey all,
>>
>> I just did a clean checkout of github.com/apache/spark but failed to
>> start PySpark, this is what I did:
>>
>> git clone g...@github.com:apache/spark.git; cd spark; build/sbt package;
>> bin/pyspark
>>
>> And got this exception:
>>
>> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>>
>> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>>
>> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>>
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>>
>> Using Spark's default log4j profile: org/apache/spark/log4j-
>> defaults.properties
>>
>> Setting default log level to "WARN".
>>
>> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
>> setLogLevel(newLevel).
>>
>> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
>> UserWarning: Failed to initialize Spark session.
>>
>>   warnings.warn("Failed to initialize Spark session.")
>>
>> Traceback (most recent call last):
>>
>>   File "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py",
>> line 41, in 
>>
>> spark = SparkSession._create_shell_session()
>>
>>   File 
>> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
>> line 564, in _create_shell_session
>>
>> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>>
>> TypeError: 'JavaPackage' object is not callable
>>
>> I also tried to delete hadoop deps from my ivy2 cache and reinstall them
>> but no luck. I wonder:
>>
>>
>>1. I have not seen this before, could this be caused by recent change
>>to head?
>>2. Am I doing something wrong in the build process?
>>
>>
>> Thanks much!
>> Li
>>
>>


Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Sean Owen
I think you would have to build with the 'hive' profile? But if so, that
would have been true for a while now.

On Thu, Jun 14, 2018 at 10:38 AM Li Jin  wrote:

> Hey all,
>
> I just did a clean checkout of github.com/apache/spark but failed to
> start PySpark, this is what I did:
>
> git clone g...@github.com:apache/spark.git; cd spark; build/sbt package;
> bin/pyspark
>
> And got this exception:
>
> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>
> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>
> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>
> Type "help", "copyright", "credits" or "license" for more information.
>
> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
>
> Setting default log level to "WARN".
>
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
>
> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
> UserWarning: Failed to initialize Spark session.
>
>   warnings.warn("Failed to initialize Spark session.")
>
> Traceback (most recent call last):
>
>   File
> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py", line
> 41, in 
>
> spark = SparkSession._create_shell_session()
>
>   File
> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
> line 564, in _create_shell_session
>
> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>
> TypeError: 'JavaPackage' object is not callable
>
> I also tried to delete hadoop deps from my ivy2 cache and reinstall them
> but no luck. I wonder:
>
>
>1. I have not seen this before, could this be caused by recent change
>to head?
>2. Am I doing something wrong in the build process?
>
>
> Thanks much!
> Li
>
>


Dot file from execution plan

2018-06-14 Thread Leonardo Herrera
Hi,

We have an automatic report creation tool that creates Spark SQL jobs
based on user instructions (this is a web application). We'd like to
give users an opportunity to visualize the execution plan of their
handiwork before they inflict it on the world.

Currently, I'm just capturing the output of an `explain` statement and
displaying it, but it's still too cryptic for users, so we'd love to
have something similar to the SQL tab in the Spark UI. From my
research, the beautiful SVG displayed there is built from a dot file,
which in turn is built from aggregated metrics gathered while the
job is executing. That appears not to be what I need.
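
For what it's worth, a sketch of how the plan text can be captured
programmatically instead of redirecting explain() output (assuming `df` is
the DataFrame built from the generated query):

// Grab the plans as strings from the QueryExecution instead of capturing
// stdout from df.explain().
val qe = df.queryExecution
val optimizedPlanText = qe.optimizedPlan.toString
val physicalPlanText  = qe.executedPlan.toString
// These strings are what gets rendered to users today.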

Since I'm just getting familiar with Spark's extremely complex
codebase, do you have any pointers or ideas on how I can provide
a more user-friendly physical execution plan to my users?

Regards,
Leonardo Herrera

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
I can work around it for now by using:

bin/pyspark --conf spark.sql.catalogImplementation=in-memory

but I still wonder what's going on with HiveConf...

On Thu, Jun 14, 2018 at 11:37 AM, Li Jin  wrote:

> Hey all,
>
> I just did a clean checkout of github.com/apache/spark but failed to
> start PySpark, this is what I did:
>
> git clone g...@github.com:apache/spark.git; cd spark; build/sbt package;
> bin/pyspark
>
> And got this exception:
>
> (spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
>
> Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)
>
> [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
>
> Type "help", "copyright", "credits" or "license" for more information.
>
> 18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> Using Spark's default log4j profile: org/apache/spark/log4j-
> defaults.properties
>
> Setting default log level to "WARN".
>
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
> setLogLevel(newLevel).
>
> /Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
> UserWarning: Failed to initialize Spark session.
>
>   warnings.warn("Failed to initialize Spark session.")
>
> Traceback (most recent call last):
>
>   File "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py",
> line 41, in 
>
> spark = SparkSession._create_shell_session()
>
>   File 
> "/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
> line 564, in _create_shell_session
>
> SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
>
> TypeError: 'JavaPackage' object is not callable
>
> I also tried to delete hadoop deps from my ivy2 cache and reinstall them
> but no luck. I wonder:
>
>
>1. I have not seen this before, could this be caused by recent change
>to head?
>2. Am I doing something wrong in the build process?
>
>
> Thanks much!
> Li
>
>


Missing HiveConf when starting PySpark from head

2018-06-14 Thread Li Jin
Hey all,

I just did a clean checkout of github.com/apache/spark but failed to start
PySpark; this is what I did:

git clone g...@github.com:apache/spark.git; cd spark; build/sbt package;
bin/pyspark

And got this exception:

(spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark

Python 3.6.3 |Anaconda, Inc.| (default, Nov  8 2017, 18:10:31)

[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

18/06/14 11:34:14 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).

/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py:45:
UserWarning: Failed to initialize Spark session.

  warnings.warn("Failed to initialize Spark session.")

Traceback (most recent call last):

  File
"/Users/icexelloss/workspace/upstream2/spark/python/pyspark/shell.py", line
41, in 

spark = SparkSession._create_shell_session()

  File
"/Users/icexelloss/workspace/upstream2/spark/python/pyspark/sql/session.py",
line 564, in _create_shell_session

SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()

TypeError: 'JavaPackage' object is not callable

I also tried to delete hadoop deps from my ivy2 cache and reinstall them
but no luck. I wonder:


   1. I have not seen this before; could this be caused by a recent change to
   head?
   2. Am I doing something wrong in the build process?


Thanks much!
Li


Re: Spark issue 20236 - overwrite a partitioned data srouce

2018-06-14 Thread Marco Gaido
Hi Alessandro,


I'd recommend checking the UTs added in the commit which solved the
issue (i.e.
https://github.com/apache/spark/commit/a66fe36cee9363b01ee70e469f1c968f633c5713).
You can use them to try to reproduce the issue.
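
As a starting point, a minimal sketch of the kind of check those tests
perform (table and column names are made up; it assumes a SparkSession named
`spark`): seed two partitions, overwrite-insert data touching only one of
them, and see whether the other partition survives.

import org.apache.spark.sql.SaveMode
import spark.implicits._  // assumes a SparkSession named `spark`

spark.sql("CREATE TABLE t (a INT, p INT) USING parquet PARTITIONED BY (p)")

// Seed partitions p=1 and p=2.
Seq((1, 1), (2, 2)).toDF("a", "p").write.mode(SaveMode.Overwrite).insertInto("t")

// Overwrite-insert rows that only belong to partition p=1.
Seq((3, 1)).toDF("a", "p").write.mode(SaveMode.Overwrite).insertInto("t")

// The question behind SPARK-20236: does partition p=2 still contain (2, 2),
// or was the whole table overwritten?
spark.table("t").orderBy("p").show()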

Thanks,
Marco

2018-06-14 15:57 GMT+02:00 Alessandro Liparoti :

> Good morning,
>
> I am trying to see how this bug affects the write in spark 2.2.0, but I
> cannot reproduce it. Is it ok then using the code
> df.write.mode(SaveMode.Overwrite).insertInto("table_name")
> ?
>
> Thank you,
> *Alessandro Liparoti*
>


Spark issue 20236 - overwrite a partitioned data srouce

2018-06-14 Thread Alessandro Liparoti
Good morning,

I am trying to see how this bug affects writes in Spark 2.2.0, but I
cannot reproduce it. Is it OK then to use the code
df.write.mode(SaveMode.Overwrite).insertInto("table_name")
?

Thank you,
*Alessandro Liparoti*


Re: Shared variable in executor level

2018-06-14 Thread Sean Owen
Just use a singleton or static variable. It will be a simple per-JVM value
that is therefore per-executor.
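
A minimal sketch of what that can look like (the statistic names and the
usage inside a task are illustrative only):

import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

// A Scala `object` is instantiated once per JVM, i.e. once per executor, so
// every task running on that executor sees the same counters.
object ExecutorStats {
  private val counters = TrieMap.empty[String, AtomicLong]

  def add(name: String, delta: Long = 1L): Unit =
    counters.getOrElseUpdate(name, new AtomicLong(0L)).addAndGet(delta)

  def snapshot: Map[String, Long] =
    counters.map { case (k, v) => k -> v.get() }.toMap
}

// Inside a task, e.g.:
//   rdd.foreach { row => if (predicateMatches(row)) ExecutorStats.add("hits") }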

On Thu, Jun 14, 2018 at 6:59 AM Nikodimos Nikolaidis 
wrote:

> Hello community,
>
> I am working on a project in which statistics (like predicate
> selectivity) are collected during execution. I think that it's a good
> idea to keep these statistics in executor level. So, all tasks in same
> executor share the same variable and no extra network traffic is needed.
> Also, I am not especially interested in thread safety, it's not a big
> deal if some updates are lost - we are trying to see the general trend.
>
> This could be done, for example, with an in-memory data structure store
> server like Redis in each worker machine. But, could it be done in Spark
> natively?
>
> thanks,
> nik
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Live Streamed Code Review today at 11am Pacific

2018-06-14 Thread Holden Karau
Next week is Pride in San Francisco, but I'm still going to do two quick
sessions. One will be live coding with Apache Spark to collect ASF diversity
information ( https://www.youtube.com/watch?v=OirnFnsU37A /
https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am
Pacific, and the other will be the regular Friday code review (
https://www.youtube.com/watch?v=IAWm4OLRoyY /
https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ), also at 9am.

On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau  wrote:

> I'll be doing another one tomorrow morning at 9am pacific focused on
> Python + K8s support & improved JSON support -
> https://www.youtube.com/watch?v=Z7ZEkvNwneU &
> https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :)
>
> On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau  wrote:
>
>> If anyone wants to watch the recording: https://www.youtube
>> .com/watch?v=lugG_2QU6YU
>>
>> I'll do one next week as well - March 16th @ 11am -
>> https://www.youtube.com/watch?v=pXzVtEUjrLc
>>
>> On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau 
>> wrote:
>>
>>> Hi folks,
>>>
>>> If you're curious to learn more about how Spark is developed, I'm going
>>> to experiment with doing a live code review where folks can watch and see
>>> how that part of our process works. I have two volunteers already for having
>>> their PRs looked at live, and if you have a Spark PR you're working on that
>>> you'd like me to livestream a review of, please ping me.
>>>
>>> The livestream will be at https://www.youtube.com/watch?v=lugG_2QU6YU.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Shared variable in executor level

2018-06-14 Thread Nikodimos Nikolaidis

Hello community,

I am working on a project in which statistics (like predicate
selectivity) are collected during execution. I think it's a good
idea to keep these statistics at the executor level, so that all tasks in the
same executor share the same variable and no extra network traffic is needed.
Also, I am not especially interested in thread safety; it's not a big
deal if some updates are lost - we are trying to see the general trend.


This could be done, for example, with an in-memory data structure store
such as Redis on each worker machine. But could it be done natively in
Spark?


thanks,
nik


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Very slow complex type column reads from parquet

2018-06-14 Thread Jakub Wozniak
Dear Ryan,

Thanks a lot for your answer.
After sending the e-mail we investigated the data itself a bit more.
It turned out that for certain days it was very skewed and one of the row groups
had many more records than all the others.
This was related to the fact that we had sorted the data by our object
ids, and by chance the objects that came first were smaller (or compressed better).
So the Parquet file had 6 row groups, where the first one had 300k rows and
the others only 30k rows each.
The search for a given object fell into the first row group and took a very
long time.
The data itself was highly compressed as it contained a lot of zeros. To
give some numbers, the 600MB Parquet file expanded to 56GB in JSON.

What we did was to sort the data not by object id but by the record timestamp,
which resulted in a much more even data distribution among the row groups.
This in fact helped the query time a lot (querying by timestamp & object id).
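
Concretely, the change on the writing side was essentially just re-sorting
before the write, along these lines (a simplified sketch; the output path is
a placeholder and the column names are simplified from our setup):

// Before: sorted by object id, which clustered the "heavy" objects into the
// first row group.
// df.sort("entity_id").write.mode("overwrite").parquet("/data/day=2018-06-01")

// After: sort by the record timestamp so rows spread more evenly across the
// Parquet row groups.
df.sort("timestamp")
  .write
  .mode("overwrite")
  .parquet("/data/day=2018-06-01")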

I have to say that I haven't fully understood this phenomenon yet, as I'm not a
Parquet format & reader expert (at least not yet).
Maybe it is simply a function of how many records Spark has to scan and
the level of parallelism (searching for a given object id when the data is
sorted by time requires scanning more or all of the row groups for larger
time ranges).
One question here - does the Parquet reader read & decode the projection
columns even when the predicate columns should filter the record out?

Unfortunately we have to keep those big columns in the query, as people want
to do analysis on them.

We will continue to investigate…

Cheers,
Jakub



On 12 Jun 2018, at 22:51, Ryan Blue 
mailto:rb...@netflix.com>> wrote:

Jakub,

You're right that Spark currently doesn't use the vectorized read path for 
nested data, but I'm not sure that's the problem here. With 50k elements in the 
f1 array, it could easily be that you're getting the significant speed-up from 
not reading or materializing that column. The non-vectorized path is slower, 
but it is more likely that the problem is the data if it is that much slower.

I'd be happy to see vectorization for nested Parquet data move forward, but I 
think you might want to get an idea of how much it will help before you move 
forward with it. Can you use Impala to test whether vectorization would help 
here?

rb



On Mon, Jun 11, 2018 at 6:16 AM, Jakub Wozniak 
mailto:jakub.wozn...@cern.ch>> wrote:
Hello,

We have stumbled upon quite degraded performance when reading complex
(struct, array) type columns stored in Parquet.
A Parquet file is around 600MB (snappy) with ~400k rows, with a field of a
complex type { f1: array of ints, f2: array of ints } where the f1 array length
is 50k elements.
There are also other fields like entity_id: long, timestamp: long.

A simple query that selects rows using predicates entity_id = X and timestamp 
>= T1 and timestamp <= T2 plus ds.show() takes 17 minutes to execute.
If we remove the complex type columns from the query, it executes in
sub-second time.

Now, looking at the implementation of the Parquet datasource, the
Vectorized* classes are used only if the read types are primitives. In other
cases the code falls back to the default parquet-mr implementation.
In the VectorizedParquetRecordReader there is a TODO to handle complex types 
that "should be efficient & easy with codegen".

For our CERN Spark usage the current execution times are pretty much 
prohibitive as there is a lot of data stored as arrays / complex types…
The 600 MB file represents 1 day of measurements, and our data scientists
would sometimes like to process months or even years of those.

Could you please let me know if there is anybody currently working on it or 
maybe you have it in a roadmap for the future?
Or maybe you could give me some suggestions how to avoid / resolve this 
problem? I’m using Spark 2.2.1.

Best regards,
Jakub Wozniak




-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org




--
Ryan Blue
Software Engineer
Netflix



Fwd: array_contains in package org.apache.spark.sql.functions

2018-06-14 Thread 刘崇光
-- Forwarded message --
From: 刘崇光 
Date: Thu, Jun 14, 2018 at 11:08 AM
Subject: array_contains in package org.apache.spark.sql.functions
To: u...@spark.apache.org


Hello all,

I ran into a use case in a project with Spark SQL and want to share with you
some thoughts about the function array_contains.

Say I have a DataFrame containing 2 columns: Column A of type "Array of
String" and Column B of type "String". I want to determine whether the value
of column B is contained in the value of column A, without using a UDF of
course.
The function array_contains came into my mind naturally:

def array_contains(column: Column, value: Any): Column = withExpr {
  ArrayContains(column.expr, Literal(value))
}

However, the function takes column B and wraps it in a "Literal", which
yields a runtime exception: RuntimeException("Unsupported literal
type " + v.getClass + " " + v).

Then, after discussion with my friends, we found a solution without using a UDF:

new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr))


With this solution, I think we could make the function a bit more powerful,
like this:

def array_contains(column: Column, value: Any): Column = withExpr {
  value match {
case c: Column => ArrayContains(column.expr, c.expr)
case _ => ArrayContains(column.expr, Literal(value))
  }
}


It does pattern matching to detect whether value is of type Column. If so, it
uses the .expr of the column; otherwise it works as it used to.
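
For illustration, a small usage sketch on the DataFrame described above
(the first form assumes the proposed change were applied; the second is what
works today):

// With the proposed overload, a column-to-column containment check becomes:
//   df.filter(array_contains(df("ColumnA"), df("ColumnB")))
// Without it, the equivalent goes through the Catalyst expression directly:
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.ArrayContains

df.filter(new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr))).show()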

Any suggestion or opinion on the proposition?


Kind regards,
Chongguang LIU


Re: [ANNOUNCE] Announcing Apache Spark 2.3.1

2018-06-14 Thread Jacek Laskowski
Hi Marcelo,

How do we announce it on Twitter @ https://twitter.com/apachespark? How do we
make it part of the release process?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jun 11, 2018 at 9:47 PM, Marcelo Vanzin  wrote:

> We are happy to announce the availability of Spark 2.3.1!
>
> Apache Spark 2.3.1 is a maintenance release, based on the branch-2.3
> maintenance branch of Spark. We strongly recommend all 2.3.x users to
> upgrade to this stable release.
>
> To download Spark 2.3.1, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-3-1.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>