Re: DAG in Pipeline

2017-09-07 Thread Srikanth Sampath
Hi,
Pranay/Joseph, can you share an example of an ML DAG pipeline?
Thanks,
-Srikanth






Spark ML DAG Pipelines

2017-09-07 Thread Srikanth Sampath
Hi Spark Experts,

Can someone point me to some examples of non-linear (DAG) ML pipelines?
That would be of great help.
Thanks much in advance
-Srikanth
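
A minimal PySpark sketch of what a non-linear (DAG-shaped) pipeline can look
like: the stage list passed to Pipeline is linear, but the column-level
dataflow forms a DAG because two independent feature branches are merged by a
VectorAssembler before the classifier. The column names ("text",
"numeric_features", "label") are made up, and "numeric_features" is assumed to
already be a Vector column.

  from pyspark.ml import Pipeline
  from pyspark.ml.classification import LogisticRegression
  from pyspark.ml.feature import HashingTF, StandardScaler, Tokenizer, VectorAssembler

  # Branch 1: raw text -> tokens -> hashed term frequencies
  tokenizer = Tokenizer(inputCol="text", outputCol="words")
  tf = HashingTF(inputCol="words", outputCol="text_features")

  # Branch 2: scale an existing numeric feature vector
  scaler = StandardScaler(inputCol="numeric_features", outputCol="scaled_features")

  # The two branches meet here, which is what makes the column graph a DAG
  assembler = VectorAssembler(inputCols=["text_features", "scaled_features"],
                              outputCol="features")
  lr = LogisticRegression(featuresCol="features", labelCol="label")

  # Stages are listed in a topological order of that DAG
  pipeline = Pipeline(stages=[tokenizer, tf, scaler, assembler, lr])
  # model = pipeline.fit(training_df)  # training_df supplies the input columns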


Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-07 Thread Bryan Cutler
+1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
fine to work out the minor details of the API during review.

Bryan

On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN 
wrote:

> Hi all,
>
> Thank you for voting and suggestions.
>
> As Wenchen mentioned, and as we're also discussing on JIRA, we need to settle
> the size hint for the 0-parameter UDF.
> But since I believe we have reached a consensus on the basic APIs except for the
> size hint, I'd like to submit a PR based on the current proposal and continue
> the discussion in its review.
>
> https://github.com/apache/spark/pull/19147
>
> I'd keep this vote open to wait for more opinions.
>
> Thanks.
>
>
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
>
>> +1 on the design and proposed API.
>>
>> One detail I'd like to discuss is the 0-parameter UDF: how we can specify
>> the size hint. This can be done in the PR review, though.
>>
>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung 
>> wrote:
>>
>>> +1 on this and like the suggestion of type in string form.
>>>
>>> Would it be correct to assume there will be a data type check, for example
>>> that the returned pandas DataFrame column data types match what is specified?
>>> We have seen quite a few issues and confusion with that in R.
>>>
>>> Would it make sense to have a more generic decorator name so that it
>>> could also be usable for other efficient vectorized formats in the future?
>>> Or do we anticipate the decorator being format-specific, with more added
>>> in the future?
>>>
>>> --
>>> *From:* Reynold Xin 
>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>> *To:* Takuya UESHIN
>>> *Cc:* spark-dev
>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>
>>> Ok, thanks.
>>>
>>> +1 on the SPIP for scope etc
>>>
>>>
>>> On API details (will deal with in code reviews as well but leaving a
>>> note here in case I forget)
>>>
>>> 1. I would suggest having the API also accept data type specification in
>>> string form. It is usually simpler to say "long" than "LongType()".
>>>
>>> 2. Think about what error message to show when the row counts don't
>>> match at runtime.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
>>> wrote:
>>>
 Yes, the aggregation is out of scope for now.
 I think we should continue discussing aggregation on JIRA, and we
 will add it later separately.

 Thanks.


 On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
 wrote:

> Is the idea that aggregates are out of scope for the current effort, and we
> will be adding those later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
> wrote:
>
>> Hi all,
>>
>> We've been discussing support for vectorized UDFs in Python, and we have
>> almost reached a consensus on the APIs, so I'd like to summarize and
>> call for a vote.
>>
>> Note that this vote should focus on APIs for vectorized UDFs, not
>> APIs for vectorized UDAFs or Window operations.
>>
>> https://issues.apache.org/jira/browse/SPARK-21190
>>
>>
>> *Proposed API*
>>
>> We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs, which take one or more pandas.Series, or a single integer
>> value giving the length of the input for 0-parameter UDFs. The
>> return value should be a pandas.Series of the specified type, and the
>> length of the returned value should be the same as the input's.
>>
>> We can define vectorized UDFs as:
>>
>>   @pandas_udf(DoubleType())
>>   def plus(v1, v2):
>>       return v1 + v2
>>
>> or we can define it as:
>>
>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>
>> We can use it similarly to row-by-row UDFs:
>>
>>   df.withColumn('sum', plus(df.v1, df.v2))
>>
>> As for 0-parameter UDFs, we can define and use them as:
>>
>>   @pandas_udf(LongType())
>>   def f0(size):
>>       return pd.Series(1).repeat(size)
>>
>>   df.select(f0())
>>
>>
>>
>> The vote will be up for the next 72 hours. Please reply with your
>> vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>


 --
 Takuya UESHIN
 Tokyo, Japan

 http://twitter.com/ueshin

>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>
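
For reference, here is a self-contained sketch of the decorator as proposed in
this thread. It mirrors the proposal above rather than any final released API,
and it assumes a Spark build that already ships the pandas_udf feature (plus
pandas and pyarrow installed); the toy DataFrame is made up.

  import pandas as pd
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import pandas_udf
  from pyspark.sql.types import DoubleType

  spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()
  df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["v1", "v2"])

  # The function receives whole batches as pandas.Series instead of single rows,
  # and must return a pandas.Series of the declared type and the same length.
  @pandas_udf(DoubleType())
  def plus(v1, v2):
      return v1 + v2

  df.withColumn("sum", plus(df.v1, df.v2)).show()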


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread wangzhenhua (G)
+1 (non-binding). Great to see the data source API is going to be improved!

best regards,
-Zhenhua(Xander)

From: Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
Sent: September 8, 2017, 4:07
To: 蒋星博
Cc: Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
Subject: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

+1 (non-binding).

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 wrote:
+1


Reynold Xin wrote on Thursday, September 7, 2017 at 12:04 PM:
+1 as well

On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust wrote:
+1

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote:

+1 (non-binding)

Thanks for making the updates reflected in the current PR. It would be great to 
see the doc updated before it is finally published though.

Right now it feels like this SPIP is focused more on getting the basics right 
for what many datasources are already doing in API V1 combined with other 
private APIs, vs pushing forward state of the art for performance.

I think that’s the right approach for this SPIP. We can add the support you’re 
talking about later with a more specific plan that doesn’t block fixing the 
problems that this addresses.
​

On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier wrote:
+1 (binding)

I personally believe that there is quite a big difference between having a 
generic data source interface with a low surface area and pushing down a 
significant part of query processing into a datasource. The latter has a much
wider surface area and will require us to stabilize most of the internal
catalyst APIs, which will be a significant burden on the community to maintain
and has the potential to slow development velocity significantly. If you want 
to write such integrations then you should be prepared to work with catalyst 
internals and own up to the fact that things might change across minor versions 
(and in some cases even maintenance releases). If you are willing to go down 
that road, then your best bet is to use the already existing spark session 
extensions which will allow you to write such integrations and can be used as 
an `escape hatch`.


On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash wrote:
+0 (non-binding)

I think there are benefits to unifying all the Spark-internal datasources into 
a common public API for sure.  It will serve as a forcing function to ensure 
that those internal datasources aren't advantaged vs datasources developed 
externally as plugins to Spark, and that all Spark features are available to 
all datasources.

But I also think this read-path proposal avoids the more difficult questions 
around how to continue pushing datasource performance forwards.  James Baker 
(my colleague) had a number of questions about advanced pushdowns (combined 
sorting and filtering), and Reynold also noted that pushdown of aggregates and 
joins are desirable on longer timeframes as well.  The Spark community saw 
similar requests, for aggregate pushdown in SPARK-12686, join pushdown in 
SPARK-20259, and arbitrary plan pushdown in SPARK-12449.  Clearly a number of 
people are interested in this kind of performance work for datasources.

To leave enough space for datasource developers to continue experimenting with 
advanced interactions between Spark and their datasources, I'd propose we leave 
some sort of escape valve that enables these datasources to keep pushing the 
boundaries without forking Spark.  Possibly that looks like an additional 
unsupported/unstable interface that pushes down an entire (unstable API) 
logical plan, which is expected to break API on every release.   (Spark 
attempts this full-plan pushdown, and if that fails Spark ignores it and 
continues on with the rest of the V2 API for compatibility).  Or maybe it looks 
like something else that we don't know of yet.  Possibly this falls outside of 
the desired goals for the V2 API and instead should be a separate SPIP.

If we had a plan for this kind of escape valve for advanced datasource 
developers I'd be an unequivocal +1.  Right now it feels like this SPIP is 
focused more on getting the basics right for what many datasources are already 
doing in API V1 combined with other private APIs, vs pushing forward state of 
the art for performance.

Andrew

On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati wrote:
+1 (non-binding)


On Sep 6, 2017, at 7:29 PM, Wenchen Fan wrote:

Hi all,

In the previous discussion, we decided to split the read and write path of data 
source v2 into 2 SPIPs, and I'm sending this email to call a vote for Data
Source V2 read path only.

qualifier in AttributeReference

2017-09-07 Thread Ey-Chih Chow
Hi,

I am upgrading my Spark application from Spark 2.1 to 2.2. I found that in
many places the qualifiers of AttributeReferences for base tables no
longer exist. Is there any reason for removing qualifiers from
AttributeReferences? Thanks.

Best regards,

Ey-Chih Chow   






Re: 2.1.2 maintenance release?

2017-09-07 Thread Holden Karau
I'd be happy to manage the 2.1.2 maintenance release (and 2.2.1 after that)
if people are OK with a committer (me) running the release process rather
than a full PMC member.

On Thu, Sep 7, 2017 at 1:05 PM, Dongjoon Hyun 
wrote:

> +1!
>
> As of today,
>
> For 2.1.2, we have 87 commits. (2.1.1 was released 4 months ago)
> For 2.2.1, we have 95 commits. (2.2.0 was released 2 months ago)
>
> Can we have 2.2.1, too?
>
> Bests,
> Dongjoon.
>
>
> On Thu, Sep 7, 2017 at 2:14 AM, Sean Owen  wrote:
>
>> In a separate conversation about bugs and a security issue fixed in 2.1.x
>> and 2.0.x, Marcelo suggested it could be time for a maintenance release.
>> I'm not sure what our stance on 2.0.x is, but 2.1.2 seems like it could be
>> valuable to release.
>>
>> Thoughts? I believe Holden had expressed interest in even managing the
>> release process, but maybe others are interested as well. That is, this
>> could also be a chance to share that burden and spread release experience
>> around a bit.
>>
>> Sean
>>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Dongjoon Hyun
+1 (non-binding).

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博  wrote:

> +1
>
>
> Reynold Xin wrote on Thursday, September 7, 2017 at 12:04 PM:
>
>> +1 as well
>>
>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust 
>> wrote:
>>
>>> +1
>>>
>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
>>> wrote:
>>>
 +1 (non-binding)

 Thanks for making the updates reflected in the current PR. It would be
 great to see the doc updated before it is finally published though.

 Right now it feels like this SPIP is focused more on getting the basics
 right for what many datasources are already doing in API V1 combined with
 other private APIs, vs pushing forward state of the art for performance.

 I think that’s the right approach for this SPIP. We can add the support
 you’re talking about later with a more specific plan that doesn’t block
 fixing the problems that this addresses.
 ​

 On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
 hvanhov...@databricks.com> wrote:

> +1 (binding)
>
> I personally believe that there is quite a big difference between
> having a generic data source interface with a low surface area and pushing
> down a significant part of query processing into a datasource. The latter
> has a much wider surface area and will require us to stabilize most of
> the internal catalyst APIs, which will be a significant burden on the
> community to maintain and has the potential to slow development velocity
> significantly. If you want to write such integrations then you should be
> prepared to work with catalyst internals and own up to the fact that 
> things
> might change across minor versions (and in some cases even maintenance
> releases). If you are willing to go down that road, then your best bet is
> to use the already existing spark session extensions which will allow you
> to write such integrations and can be used as an `escape hatch`.
>
>
> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
> wrote:
>
>> +0 (non-binding)
>>
>> I think there are benefits to unifying all the Spark-internal
>> datasources into a common public API for sure.  It will serve as a 
>> forcing
>> function to ensure that those internal datasources aren't advantaged vs
>> datasources developed externally as plugins to Spark, and that all Spark
>> features are available to all datasources.
>>
>> But I also think this read-path proposal avoids the more difficult
>> questions around how to continue pushing datasource performance forwards.
>> James Baker (my colleague) had a number of questions about advanced
>> pushdowns (combined sorting and filtering), and Reynold also noted that
>> pushdown of aggregates and joins are desirable on longer timeframes as
>> well.  The Spark community saw similar requests, for aggregate pushdown 
>> in
>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>> in SPARK-12449.  Clearly a number of people are interested in this kind 
>> of
>> performance work for datasources.
>>
>> To leave enough space for datasource developers to continue
>> experimenting with advanced interactions between Spark and their
>> datasources, I'd propose we leave some sort of escape valve that enables
>> these datasources to keep pushing the boundaries without forking Spark.
>> Possibly that looks like an additional unsupported/unstable interface 
>> that
>> pushes down an entire (unstable API) logical plan, which is expected to
>> break API on every release.   (Spark attempts this full-plan pushdown, 
>> and
>> if that fails Spark ignores it and continues on with the rest of the V2 
>> API
>> for compatibility).  Or maybe it looks like something else that we don't
>> know of yet.  Possibly this falls outside of the desired goals for the V2
>> API and instead should be a separate SPIP.
>>
>> If we had a plan for this kind of escape valve for advanced
>> datasource developers I'd be an unequivocal +1.  Right now it feels like
>> this SPIP is focused more on getting the basics right for what many
>> datasources are already doing in API V1 combined with other private APIs,
>> vs pushing forward state of the art for performance.
>>
>> Andrew
>>
>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>>>
>>> Hi all,
>>>
>>> In the previous discussion, we decided to split the read and write
>>> path of data source v2 into 2 SPIPs, and I'm sending this email to call 
>>> a
>>> vote for Data Source V2 read path only.

Re: 2.1.2 maintenance release?

2017-09-07 Thread Dongjoon Hyun
+1!

As of today,

For 2.1.2, we have 87 commits. (2.1.1 was released 4 months ago)
For 2.2.1, we have 95 commits. (2.2.0 was released 2 months ago)

Can we have 2.2.1, too?

Bests,
Dongjoon.


On Thu, Sep 7, 2017 at 2:14 AM, Sean Owen  wrote:

> In a separate conversation about bugs and a security issue fixed in 2.1.x
> and 2.0.x, Marcelo suggested it could be time for a maintenance release.
> I'm not sure what our stance on 2.0.x is, but 2.1.2 seems like it could be
> valuable to release.
>
> Thoughts? I believe Holden had expressed interest in even managing the
> release process, but maybe others are interested as well. That is, this
> could also be a chance to share that burden and spread release experience
> around a bit.
>
> Sean
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread 蒋星博
+1


Reynold Xin wrote on Thursday, September 7, 2017 at 12:04 PM:

> +1 as well
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust 
> wrote:
>
>> +1
>>
>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Thanks for making the updates reflected in the current PR. It would be
>>> great to see the doc updated before it is finally published though.
>>>
>>> Right now it feels like this SPIP is focused more on getting the basics
>>> right for what many datasources are already doing in API V1 combined with
>>> other private APIs, vs pushing forward state of the art for performance.
>>>
>>> I think that’s the right approach for this SPIP. We can add the support
>>> you’re talking about later with a more specific plan that doesn’t block
>>> fixing the problems that this addresses.
>>> ​
>>>
>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@databricks.com> wrote:
>>>
 +1 (binding)

 I personally believe that there is quite a big difference between
 having a generic data source interface with a low surface area and pushing
 down a significant part of query processing into a datasource. The latter
 has a much wider surface area and will require us to stabilize most of
 the internal catalyst APIs, which will be a significant burden on the
 community to maintain and has the potential to slow development velocity
 significantly. If you want to write such integrations then you should be
 prepared to work with catalyst internals and own up to the fact that things
 might change across minor versions (and in some cases even maintenance
 releases). If you are willing to go down that road, then your best bet is
 to use the already existing spark session extensions which will allow you
 to write such integrations and can be used as an `escape hatch`.


 On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
 wrote:

> +0 (non-binding)
>
> I think there are benefits to unifying all the Spark-internal
> datasources into a common public API for sure.  It will serve as a forcing
> function to ensure that those internal datasources aren't advantaged vs
> datasources developed externally as plugins to Spark, and that all Spark
> features are available to all datasources.
>
> But I also think this read-path proposal avoids the more difficult
> questions around how to continue pushing datasource performance forwards.
> James Baker (my colleague) had a number of questions about advanced
> pushdowns (combined sorting and filtering), and Reynold also noted that
> pushdown of aggregates and joins are desirable on longer timeframes as
> well.  The Spark community saw similar requests, for aggregate pushdown in
> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
> in SPARK-12449.  Clearly a number of people are interested in this kind of
> performance work for datasources.
>
> To leave enough space for datasource developers to continue
> experimenting with advanced interactions between Spark and their
> datasources, I'd propose we leave some sort of escape valve that enables
> these datasources to keep pushing the boundaries without forking Spark.
> Possibly that looks like an additional unsupported/unstable interface that
> pushes down an entire (unstable API) logical plan, which is expected to
> break API on every release.   (Spark attempts this full-plan pushdown, and
> if that fails Spark ignores it and continues on with the rest of the V2 
> API
> for compatibility).  Or maybe it looks like something else that we don't
> know of yet.  Possibly this falls outside of the desired goals for the V2
> API and instead should be a separate SPIP.
>
> If we had a plan for this kind of escape valve for advanced datasource
> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
> focused more on getting the basics right for what many datasources are
> already doing in API V1 combined with other private APIs, vs pushing
> forward state of the art for performance.
>
> Andrew
>
> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>>
>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>>
>> Hi all,
>>
>> In the previous discussion, we decided to split the read and write
>> path of data source v2 into 2 SPIPs, and I'm sending this email to call a
>> vote for Data Source V2 read path only.
>>
>> The full document of the Data Source API V2 is:
>>
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>
>> The ready-for-review PR that implements the basic infrastructure for
>> the read path is: https://github.com/apache/spark/pull/19136

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Reynold Xin
+1 as well

On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust 
wrote:

> +1
>
> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue 
> wrote:
>
>> +1 (non-binding)
>>
>> Thanks for making the updates reflected in the current PR. It would be
>> great to see the doc updated before it is finally published though.
>>
>> Right now it feels like this SPIP is focused more on getting the basics
>> right for what many datasources are already doing in API V1 combined with
>> other private APIs, vs pushing forward state of the art for performance.
>>
>> I think that’s the right approach for this SPIP. We can add the support
>> you’re talking about later with a more specific plan that doesn’t block
>> fixing the problems that this addresses.
>> ​
>>
>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>>> +1 (binding)
>>>
>>> I personally believe that there is quite a big difference between having
>>> a generic data source interface with a low surface area and pushing down a
>>> significant part of query processing into a datasource. The latter has a much
>>> wider surface area and will require us to stabilize most of the
>>> internal catalyst APIs, which will be a significant burden on the community
>>> to maintain and has the potential to slow development velocity
>>> significantly. If you want to write such integrations then you should be
>>> prepared to work with catalyst internals and own up to the fact that things
>>> might change across minor versions (and in some cases even maintenance
>>> releases). If you are willing to go down that road, then your best bet is
>>> to use the already existing spark session extensions which will allow you
>>> to write such integrations and can be used as an `escape hatch`.
>>>
>>>
>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash 
>>> wrote:
>>>
 +0 (non-binding)

 I think there are benefits to unifying all the Spark-internal
 datasources into a common public API for sure.  It will serve as a forcing
 function to ensure that those internal datasources aren't advantaged vs
 datasources developed externally as plugins to Spark, and that all Spark
 features are available to all datasources.

 But I also think this read-path proposal avoids the more difficult
 questions around how to continue pushing datasource performance forwards.
 James Baker (my colleague) had a number of questions about advanced
 pushdowns (combined sorting and filtering), and Reynold also noted that
 pushdown of aggregates and joins are desirable on longer timeframes as
 well.  The Spark community saw similar requests, for aggregate pushdown in
 SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
 in SPARK-12449.  Clearly a number of people are interested in this kind of
 performance work for datasources.

 To leave enough space for datasource developers to continue
 experimenting with advanced interactions between Spark and their
 datasources, I'd propose we leave some sort of escape valve that enables
 these datasources to keep pushing the boundaries without forking Spark.
 Possibly that looks like an additional unsupported/unstable interface that
 pushes down an entire (unstable API) logical plan, which is expected to
 break API on every release.   (Spark attempts this full-plan pushdown, and
 if that fails Spark ignores it and continues on with the rest of the V2 API
 for compatibility).  Or maybe it looks like something else that we don't
 know of yet.  Possibly this falls outside of the desired goals for the V2
 API and instead should be a separate SPIP.

 If we had a plan for this kind of escape valve for advanced datasource
 developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
 focused more on getting the basics right for what many datasources are
 already doing in API V1 combined with other private APIs, vs pushing
 forward state of the art for performance.

 Andrew

 On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
 suresh.thalam...@gmail.com> wrote:

> +1 (non-binding)
>
>
> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>
> Hi all,
>
> In the previous discussion, we decided to split the read and write
> path of data source v2 into 2 SPIPs, and I'm sending this email to call a
> vote for Data Source V2 read path only.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
> -Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for
> the read path is:
> https://github.com/apache/spark/pull/19136
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Michael Armbrust
+1

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue  wrote:

> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be
> great to see the doc updated before it is finally published though.
>
> Right now it feels like this SPIP is focused more on getting the basics
> right for what many datasources are already doing in API V1 combined with
> other private APIs, vs pushing forward state of the art for performance.
>
> I think that’s the right approach for this SPIP. We can add the support
> you’re talking about later with a more specific plan that doesn’t block
> fixing the problems that this addresses.
> ​
>
> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1 (binding)
>>
>> I personally believe that there is quite a big difference between having
>> a generic data source interface with a low surface area and pushing down a
>> significant part of query processing into a datasource. The latter has a much
>> wider surface area and will require us to stabilize most of the
>> internal catalyst APIs, which will be a significant burden on the community
>> to maintain and has the potential to slow development velocity
>> significantly. If you want to write such integrations then you should be
>> prepared to work with catalyst internals and own up to the fact that things
>> might change across minor versions (and in some cases even maintenance
>> releases). If you are willing to go down that road, then your best bet is
>> to use the already existing spark session extensions which will allow you
>> to write such integrations and can be used as an `escape hatch`.
>>
>>
>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash  wrote:
>>
>>> +0 (non-binding)
>>>
>>> I think there are benefits to unifying all the Spark-internal
>>> datasources into a common public API for sure.  It will serve as a forcing
>>> function to ensure that those internal datasources aren't advantaged vs
>>> datasources developed externally as plugins to Spark, and that all Spark
>>> features are available to all datasources.
>>>
>>> But I also think this read-path proposal avoids the more difficult
>>> questions around how to continue pushing datasource performance forwards.
>>> James Baker (my colleague) had a number of questions about advanced
>>> pushdowns (combined sorting and filtering), and Reynold also noted that
>>> pushdown of aggregates and joins are desirable on longer timeframes as
>>> well.  The Spark community saw similar requests, for aggregate pushdown in
>>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>>> in SPARK-12449.  Clearly a number of people are interested in this kind of
>>> performance work for datasources.
>>>
>>> To leave enough space for datasource developers to continue
>>> experimenting with advanced interactions between Spark and their
>>> datasources, I'd propose we leave some sort of escape valve that enables
>>> these datasources to keep pushing the boundaries without forking Spark.
>>> Possibly that looks like an additional unsupported/unstable interface that
>>> pushes down an entire (unstable API) logical plan, which is expected to
>>> break API on every release.   (Spark attempts this full-plan pushdown, and
>>> if that fails Spark ignores it and continues on with the rest of the V2 API
>>> for compatibility).  Or maybe it looks like something else that we don't
>>> know of yet.  Possibly this falls outside of the desired goals for the V2
>>> API and instead should be a separate SPIP.
>>>
>>> If we had a plan for this kind of escape valve for advanced datasource
>>> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
>>> focused more on getting the basics right for what many datasources are
>>> already doing in API V1 combined with other private APIs, vs pushing
>>> forward state of the art for performance.
>>>
>>> Andrew
>>>
>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>>> suresh.thalam...@gmail.com> wrote:
>>>
 +1 (non-binding)


 On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:

 Hi all,

 In the previous discussion, we decided to split the read and write path
 of data source v2 into 2 SPIPs, and I'm sending this email to call a vote
 for Data Source V2 read path only.

 The full document of the Data Source API V2 is:
 https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
 -Z8qU5Frf6WMQZ6jJVM/edit

 The ready-for-review PR that implements the basic infrastructure for
 the read path is:
 https://github.com/apache/spark/pull/19136

 The vote will be up for the next 72 hours. Please reply with your vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following
 technical reasons.

 Thanks!



[spark][core] SPARK-21097 Dynamic Allocation Pull Request

2017-09-07 Thread Bradley Kaiser
Hi all,

I've written a new Spark feature and I would love to have a committer take a 
look at it. I want to increase Spark performance when using dynamic allocation 
by preserving cached data.

The PR and Jira ticket are here:

https://github.com/apache/spark/pull/19041
https://issues.apache.org/jira/browse/SPARK-21097

Notebook Spark users are the primary target for this change. Notebook users
generally have periods of inactivity where Spark executors could be used for
other jobs, but if the user has any cached data, then they will either lock up 
those executors or lose their cached data. This change remedies this problem by 
replicating data to surviving executors before shutting down idle ones.
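
For context, a sketch of the dynamic-allocation settings such a notebook
session typically runs with today (the values are illustrative, not
recommendations). The cachedExecutorIdleTimeout setting is the existing knob
behind the lock-up-or-lose-the-cache trade-off described above; the
replication-before-shutdown behavior itself is part of the proposed change
(SPARK-21097), not an existing configuration.

  from pyspark.sql import SparkSession

  spark = (SparkSession.builder
           .appName("notebook-session")
           .config("spark.dynamicAllocation.enabled", "true")
           .config("spark.shuffle.service.enabled", "true")
           .config("spark.dynamicAllocation.minExecutors", "1")
           .config("spark.dynamicAllocation.maxExecutors", "20")
           # Default is infinity: executors holding cached blocks are never
           # released. Lowering it frees them, but the cached data is lost.
           .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30min")
           .getOrCreate())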

I have conducted some benchmarks showing significant performance gains under 
the right usage patterns. See the benchmark data here:

https://docs.google.com/document/d/1E6_rhAAJB8Ww0n52-LYcFTO1zhJBWgfIXzNjLi29730/edit?usp=sharing

I tried to mitigate the risk of this code change by keeping the code
self-contained and falling back to regular dynamic allocation behavior if there are
any issues. The feature should work with any coarse grained backend and I have 
tested with YARN and standalone clusters.

I would love to discuss this change with anyone who is interested. Your 
attention is greatly appreciated.

Thanks
Brad Kaiser





Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Ryan Blue
+1 (non-binding)

Thanks for making the updates reflected in the current PR. It would be
great to see the doc updated before it is finally published though.

Right now it feels like this SPIP is focused more on getting the basics
right for what many datasources are already doing in API V1 combined with
other private APIs, vs pushing forward state of the art for performance.

I think that’s the right approach for this SPIP. We can add the support
you’re talking about later with a more specific plan that doesn’t block
fixing the problems that this addresses.
​

On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> +1 (binding)
>
> I personally believe that there is quite a big difference between having a
> generic data source interface with a low surface area and pushing down a
> significant part of query processing into a datasource. The latter has a much
> wider surface area and will require us to stabilize most of the
> internal catalyst APIs, which will be a significant burden on the community
> to maintain and has the potential to slow development velocity
> significantly. If you want to write such integrations then you should be
> prepared to work with catalyst internals and own up to the fact that things
> might change across minor versions (and in some cases even maintenance
> releases). If you are willing to go down that road, then your best bet is
> to use the already existing spark session extensions which will allow you
> to write such integrations and can be used as an `escape hatch`.
>
>
> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash  wrote:
>
>> +0 (non-binding)
>>
>> I think there are benefits to unifying all the Spark-internal datasources
>> into a common public API for sure.  It will serve as a forcing function to
>> ensure that those internal datasources aren't advantaged vs datasources
>> developed externally as plugins to Spark, and that all Spark features are
>> available to all datasources.
>>
>> But I also think this read-path proposal avoids the more difficult
>> questions around how to continue pushing datasource performance forwards.
>> James Baker (my colleague) had a number of questions about advanced
>> pushdowns (combined sorting and filtering), and Reynold also noted that
>> pushdown of aggregates and joins are desirable on longer timeframes as
>> well.  The Spark community saw similar requests, for aggregate pushdown in
>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>> in SPARK-12449.  Clearly a number of people are interested in this kind of
>> performance work for datasources.
>>
>> To leave enough space for datasource developers to continue experimenting
>> with advanced interactions between Spark and their datasources, I'd propose
>> we leave some sort of escape valve that enables these datasources to keep
>> pushing the boundaries without forking Spark.  Possibly that looks like an
>> additional unsupported/unstable interface that pushes down an entire
>> (unstable API) logical plan, which is expected to break API on every
>> release.   (Spark attempts this full-plan pushdown, and if that fails Spark
>> ignores it and continues on with the rest of the V2 API for
>> compatibility).  Or maybe it looks like something else that we don't know
>> of yet.  Possibly this falls outside of the desired goals for the V2 API
>> and instead should be a separate SPIP.
>>
>> If we had a plan for this kind of escape valve for advanced datasource
>> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
>> focused more on getting the basics right for what many datasources are
>> already doing in API V1 combined with other private APIs, vs pushing
>> forward state of the art for performance.
>>
>> Andrew
>>
>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>>>
>>> Hi all,
>>>
>>> In the previous discussion, we decided to split the read and write path
>>> of data source v2 into 2 SPIPs, and I'm sending this email to call a vote
>>> for Data Source V2 read path only.
>>>
>>> The full document of the Data Source API V2 is:
>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
>>> -Z8qU5Frf6WMQZ6jJVM/edit
>>>
>>> The ready-for-review PR that implements the basic infrastructure for the
>>> read path is:
>>> https://github.com/apache/spark/pull/19136
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>>
>>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>

Re: Putting Kafka 0.8 behind an (opt-in) profile

2017-09-07 Thread Sean Owen
For those following along, see discussions at
https://github.com/apache/spark/pull/19134

It's now also clear that we'd need to remove Kafka 0.8 examples if Kafka
0.8 becomes optional. I think that's all reasonable but the change is
growing beyond just putting it behind a profile.

On Wed, Sep 6, 2017 at 3:00 PM Cody Koeninger  wrote:

> I kind of doubt the kafka 0.10 integration is going to change much at
> all before the upgrade to 0.11
>
> On Wed, Sep 6, 2017 at 8:57 AM, Sean Owen  wrote:
> > Thanks, I can do that. We're then in the funny position of having one
> > deprecated Kafka API, and one experimental one.
> >
> > Is the Kafka 0.10 integration as stable as it is going to be, and worth
> > marking as such for 2.3.0?
> >
> >
> > On Tue, Sep 5, 2017 at 4:12 PM Cody Koeninger 
> wrote:
> >>
> >> +1 to going ahead and giving a deprecation warning now
> >>
> >> On Tue, Sep 5, 2017 at 6:39 AM, Sean Owen  wrote:
> >> > On the road to Scala 2.12, we'll need to make Kafka 0.8 support optional
> >> > in the build, because it is not available for Scala 2.12.
> >> >
> >> > https://github.com/apache/spark/pull/19134 adds that profile. I mention it
> >> > because this means that Kafka 0.8 becomes "opt-in" and has to be explicitly
> >> > enabled, and that may have implications for downstream builds.
> >> >
> >> > Yes, we can add <activeByDefault>true</activeByDefault>. It however only has
> >> > effect when no other profiles are set, which makes it more deceptive than
> >> > useful IMHO. (We don't use it otherwise.)
> >> >
> >> > Reviewers may want to check my work especially as regards the Python test
> >> > support and SBT build.
> >> >
> >> >
> >> > Another related question is: when is 0.8 support deprecated, removed? It
> >> > seems sudden to remove it in 2.3.0. Maybe deprecation is in order. The
> >> > driver is that Kafka 0.11 and 1.0 will possibly require yet another variant
> >> > of streaming support (not sure yet), and 3 versions is too many. Deprecating
> >> > now opens more options sooner.
>


2.1.2 maintenance release?

2017-09-07 Thread Sean Owen
In a separate conversation about bugs and a security issue fixed in 2.1.x
and 2.0.x, Marcelo suggested it could be time for a maintenance release.
I'm not sure what our stance on 2.0.x is, but 2.1.2 seems like it could be
valuable to release.

Thoughts? I believe Holden had expressed interest in even managing the
release process, but maybe others are interested as well. That is, this
could also be a chance to share that burden and spread release experience
around a bit.

Sean


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Herman van Hövell tot Westerflier
+1 (binding)

I personally believe that there is quite a big difference between having a
generic data source interface with a low surface area and pushing down a
significant part of query processing into a datasource. The latter has a much
wider surface area and will require us to stabilize most of the
internal catalyst APIs, which will be a significant burden on the community
to maintain and has the potential to slow development velocity
significantly. If you want to write such integrations then you should be
prepared to work with catalyst internals and own up to the fact that things
might change across minor versions (and in some cases even maintenance
releases). If you are willing to go down that road, then your best bet is
to use the already existing spark session extensions which will allow you
to write such integrations and can be used as an `escape hatch`.


On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash  wrote:

> +0 (non-binding)
>
> I think there are benefits to unifying all the Spark-internal datasources
> into a common public API for sure.  It will serve as a forcing function to
> ensure that those internal datasources aren't advantaged vs datasources
> developed externally as plugins to Spark, and that all Spark features are
> available to all datasources.
>
> But I also think this read-path proposal avoids the more difficult
> questions around how to continue pushing datasource performance forwards.
> James Baker (my colleague) had a number of questions about advanced
> pushdowns (combined sorting and filtering), and Reynold also noted that
> pushdown of aggregates and joins are desirable on longer timeframes as
> well.  The Spark community saw similar requests, for aggregate pushdown in
> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
> in SPARK-12449.  Clearly a number of people are interested in this kind of
> performance work for datasources.
>
> To leave enough space for datasource developers to continue experimenting
> with advanced interactions between Spark and their datasources, I'd propose
> we leave some sort of escape valve that enables these datasources to keep
> pushing the boundaries without forking Spark.  Possibly that looks like an
> additional unsupported/unstable interface that pushes down an entire
> (unstable API) logical plan, which is expected to break API on every
> release.   (Spark attempts this full-plan pushdown, and if that fails Spark
> ignores it and continues on with the rest of the V2 API for
> compatibility).  Or maybe it looks like something else that we don't know
> of yet.  Possibly this falls outside of the desired goals for the V2 API
> and instead should be a separate SPIP.
>
> If we had a plan for this kind of escape valve for advanced datasource
> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
> focused more on getting the basics right for what many datasources are
> already doing in API V1 combined with other private APIs, vs pushing
> forward state of the art for performance.
>
> Andrew
>
> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>>
>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>>
>> Hi all,
>>
>> In the previous discussion, we decided to split the read and write path
>> of data source v2 into 2 SPIPs, and I'm sending this email to call a vote
>> for Data Source V2 read path only.
>>
>> The full document of the Data Source API V2 is:
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
>> -Z8qU5Frf6WMQZ6jJVM/edit
>>
>> The ready-for-review PR that implements the basic infrastructure for the
>> read path is:
>> https://github.com/apache/spark/pull/19136
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>>
>>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com







Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Andrew Ash
+0 (non-binding)

I think there are benefits to unifying all the Spark-internal datasources
into a common public API for sure.  It will serve as a forcing function to
ensure that those internal datasources aren't advantaged vs datasources
developed externally as plugins to Spark, and that all Spark features are
available to all datasources.

But I also think this read-path proposal avoids the more difficult
questions around how to continue pushing datasource performance forwards.
James Baker (my colleague) had a number of questions about advanced
pushdowns (combined sorting and filtering), and Reynold also noted that
pushdown of aggregates and joins are desirable on longer timeframes as
well.  The Spark community saw similar requests, for aggregate pushdown in
SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
in SPARK-12449.  Clearly a number of people are interested in this kind of
performance work for datasources.

To leave enough space for datasource developers to continue experimenting
with advanced interactions between Spark and their datasources, I'd propose
we leave some sort of escape valve that enables these datasources to keep
pushing the boundaries without forking Spark.  Possibly that looks like an
additional unsupported/unstable interface that pushes down an entire
(unstable API) logical plan, which is expected to break API on every
release.   (Spark attempts this full-plan pushdown, and if that fails Spark
ignores it and continues on with the rest of the V2 API for
compatibility).  Or maybe it looks like something else that we don't know
of yet.  Possibly this falls outside of the desired goals for the V2 API
and instead should be a separate SPIP.

If we had a plan for this kind of escape valve for advanced datasource
developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
focused more on getting the basics right for what many datasources are
already doing in API V1 combined with other private APIs, vs pushing
forward state of the art for performance.

Andrew

On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
suresh.thalam...@gmail.com> wrote:

> +1 (non-binding)
>
>
> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>
> Hi all,
>
> In the previous discussion, we decided to split the read and write path of
> data source v2 into 2 SPIPs, and I'm sending this email to call a vote for
> Data Source V2 read path only.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-
> Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for the
> read path is:
> https://github.com/apache/spark/pull/19136
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
>
>
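
To make the pushdown discussion above slightly more concrete, here is a small
PySpark sketch (the path and column name are hypothetical) of the filter
pushdown that existing file sources already perform; the pushed predicates show
up as PushedFilters in the physical plan printed by explain().

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

  # Hypothetical Parquet dataset with an "event_date" column.
  df = spark.read.parquet("/tmp/events.parquet")

  # The comparison is pushed into the Parquet scan; look for "PushedFilters"
  # in the printed physical plan.
  df.filter(df["event_date"] >= "2017-09-01").explain()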


Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread ChenJun Zou
I examined the code and found that the lazy val was added recently, in 2.2.0.

2017-09-07 14:34 GMT+08:00 ChenJun Zou :

> thanks,
> my mistake
>
> 2017-09-07 14:21 GMT+08:00 sujith chacko :
>
>> If your intention is just to view the logical plan in spark-shell, then I
>> think you can use the query I mentioned in the previous mail. In
>> Spark 2.1.0, sessionState is a private member which you cannot access.
>>
>> Thanks.
>>
>>
>> On Thu, 7 Sep 2017 at 11:39 AM, ChenJun Zou 
>> wrote:
>>
>>> spark-2.1.1 I use
>>>
>>>
>>>
>>> 2017-09-07 14:00 GMT+08:00 sujith chacko :
>>>
 Hi,
 may I know which version of spark you are using, in 2.2 I tried
 with below query in spark-shell for viewing the logical plan and it's
 working fine

 spark.sql("explain extended select * from table1")

 The above query you can use for seeing logical plan.

 Thanks,
 Sujith

 On Thu, 7 Sep 2017 at 11:03 AM, ChenJun Zou 
 wrote:

> Hi,
>
> when I use spark-shell to get the logical plan of  sql, an error
> occurs
>
> scala> spark.sessionState
> :30: error: lazy value sessionState in class SparkSession
> cannot be accessed in org.apache.spark.sql.SparkSession
>spark.sessionState
>  ^
>
> But if I use spark-submit to access the "sessionState" variable, It's
> OK.
>
> Is there a way to access it in spark-shell?
>

>>>
>


Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread ChenJun Zou
thanks,
my mistake

2017-09-07 14:21 GMT+08:00 sujith chacko :

> If your intention is just to view the logical plan in spark-shell, then I
> think you can use the query I mentioned in the previous mail. In
> Spark 2.1.0, sessionState is a private member which you cannot access.
>
> Thanks.
>
>
> On Thu, 7 Sep 2017 at 11:39 AM, ChenJun Zou  wrote:
>
>> spark-2.1.1 I use
>>
>>
>>
>> 2017-09-07 14:00 GMT+08:00 sujith chacko :
>>
>>> Hi,
>>> may I know which version of spark you are using, in 2.2 I tried with
>>> below query in spark-shell for viewing the logical plan and it's working
>>> fine
>>>
>>> spark.sql("explain extended select * from table1")
>>>
>>> The above query you can use for seeing logical plan.
>>>
>>> Thanks,
>>> Sujith
>>>
>>> On Thu, 7 Sep 2017 at 11:03 AM, ChenJun Zou 
>>> wrote:
>>>
 Hi,

 when I use spark-shell to get the logical plan of  sql, an error occurs

 scala> spark.sessionState
 :30: error: lazy value sessionState in class SparkSession
 cannot be accessed in org.apache.spark.sql.SparkSession
spark.sessionState
  ^

 But if I use spark-submit to access the "sessionState" variable, It's
 OK.

 Is there a way to access it in spark-shell?

>>>
>>


Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread ChenJun Zou
spark-2.1.1 I use



2017-09-07 14:00 GMT+08:00 sujith chacko :

> Hi,
> may I know which version of spark you are using, in 2.2 I tried with
> below query in spark-shell for viewing the logical plan and it's working
> fine
>
> spark.sql("explain extended select * from table1")
>
> The above query you can use for seeing logical plan.
>
> Thanks,
> Sujith
>
> On Thu, 7 Sep 2017 at 11:03 AM, ChenJun Zou  wrote:
>
>> Hi,
>>
>> when I use spark-shell to get the logical plan of  sql, an error occurs
>>
>> scala> spark.sessionState
>> :30: error: lazy value sessionState in class SparkSession cannot
>> be accessed in org.apache.spark.sql.SparkSession
>>spark.sessionState
>>  ^
>>
>> But if I use spark-submit to access the "sessionState" variable, It's OK.
>>
>> Is there a way to access it in spark-shell?
>>
>


Re: sessionState could not be accessed in spark-shell command line

2017-09-07 Thread sujith chacko
Hi,
May I know which version of Spark you are using? In 2.2 I tried the
query below in spark-shell for viewing the logical plan, and it works
fine:

spark.sql("explain extended select * from table1")

You can use the above query to see the logical plan.

Thanks,
Sujith

On Thu, 7 Sep 2017 at 11:03 AM, ChenJun Zou  wrote:

> Hi,
>
> when I use spark-shell to get the logical plan of  sql, an error occurs
>
> scala> spark.sessionState
> :30: error: lazy value sessionState in class SparkSession cannot
> be accessed in org.apache.spark.sql.SparkSession
>spark.sessionState
>  ^
>
> But if I use spark-submit to access the "sessionState" variable, It's OK.
>
> Is there a way to access it in spark-shell?
>
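
For readers of this thread, the EXPLAIN-based workaround also works from
PySpark without touching the private sessionState; a minimal sketch, using a
throwaway temp view as a stand-in for table1:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("plan-inspection").getOrCreate()
  spark.range(10).createOrReplaceTempView("table1")  # stand-in for a real table

  # Shows the parsed, analyzed, optimized, and physical plans.
  spark.sql("explain extended select * from table1").show(truncate=False)

  # Equivalent from the DataFrame API.
  spark.sql("select * from table1").explain(extended=True)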