Re: separate spark and hive

2016-11-14 Thread Reynold Xin
If you just start a SparkSession without calling enableHiveSupport, it
actually won't use the Hive catalog support.
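For reference, a minimal sketch of the two start-up paths (the app names and
master URL are illustrative; pick one builder per application, since a second
getOrCreate() in the same JVM reuses the existing session):

    import org.apache.spark.sql.SparkSession

    // In-memory catalog only: no Hive metastore, no derby.log / metastore_db.
    val spark = SparkSession.builder()
      .appName("no-hive-catalog")
      .master("local[*]")
      .getOrCreate()

    // Hive catalog support is opt-in; only this variant touches the metastore.
    val hiveSpark = SparkSession.builder()
      .appName("with-hive-catalog")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()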


On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf 
wrote:

> The default generation of spark context is actually a hive context.
>
> I tried to find on the documentation what are the differences between hive
> context and sql context and couldn’t find it for spark 2.0 (I know for
> previous versions there were a couple of functions which required hive
> context as well as window functions but those seem to have all been fixed
> for spark 2.0).
>
> Furthermore, I can’t seem to find a way to configure spark not to use
> hive. I can only find how to compile it without hive (and having to build
> from source each time is not a good idea for a production system).
>
>
>
> I would suggest that working without hive should be either a simple
> configuration or even the default and that if there is any missing
> functionality it should be documented.
>
> Assaf.
>
>
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Tuesday, November 15, 2016 9:31 AM
> *To:* Mendelson, Assaf
> *Cc:* dev@spark.apache.org
> *Subject:* Re: separate spark and hive
>
>
>
> I agree with the high level idea, and thus SPARK-15691.
>
>
>
> In reality, it's a huge amount of work to create & maintain a custom
> catalog. It might actually make sense to do, but it just seems a lot of
> work to do right now and it'd take a toll on interoperability.
>
>
>
> If you don't need a persistent catalog, you can just run Spark without Hive
> mode, can't you?
>
>
>
>
>
>
>
>
>
> On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson 
> wrote:
>
> Hi,
>
> Today, we basically force people to use hive if they want to get the full
> use of spark SQL.
>
> When doing the default installation this means that a derby.log and
> metastore_db directory are created where we run from.
>
> The problem with this is that if we run multiple scripts from the same
> working directory we have a problem.
>
> The solution we employ locally is to always run from a different directory
> as we ignore hive in practice (this of course means we lose the ability to
> use some of the catalog options in spark session).
>
> The only other solution is to create a full blown hive installation with
> proper configuration (probably for a JDBC solution).
>
>
>
> I would propose that in most cases there shouldn’t be any hive use at all.
> Even for catalog elements such as saving a permanent table, we should be
> able to configure a target directory and simply write to it (doing
> everything file based to avoid the need for locking). Hive should be
> reserved for those who actually use it (probably for backward
> compatibility).
>
>
>
> Am I missing something here?
>
> Assaf.
>
>
> --
> View this message in context: separate spark and hive
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
>
>


RE: separate spark and hive

2016-11-14 Thread Mendelson, Assaf
The default generation of spark context is actually a hive context.
I tried to find on the documentation what are the differences between hive 
context and sql context and couldn’t find it for spark 2.0 (I know for previous 
versions there were a couple of functions which required hive context as well 
as window functions but those seem to have all been fixed for spark 2.0).
Furthermore, I can’t seem to find a way to configure spark not to use hive. I 
can only find how to compile it without hive (and having to build from source 
each time is not a good idea for a production system).

I would suggest that working without hive should be either a simple 
configuration or even the default and that if there is any missing 
functionality it should be documented.
Assaf.


From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, November 15, 2016 9:31 AM
To: Mendelson, Assaf
Cc: dev@spark.apache.org
Subject: Re: separate spark and hive

I agree with the high level idea, and thus 
SPARK-15691.

In reality, it's a huge amount of work to create & maintain a custom catalog. 
It might actually make sense to do, but it just seems like a lot of work to do right 
now and it'd take a toll on interoperability.

If you don't need a persistent catalog, you can just run Spark without Hive mode, 
can't you?




On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
Hi,
Today, we basically force people to use hive if they want to get the full use 
of spark SQL.
When doing the default installation this means that a derby.log and 
metastore_db directory are created where we run from.
The problem with this is that if we run multiple scripts from the same working 
directory we have a problem.
The solution we employ locally is to always run from a different directory as we 
ignore hive in practice (this of course means we lose the ability to use some 
of the catalog options in spark session).
The only other solution is to create a full blown hive installation with proper 
configuration (probably for a JDBC solution).

I would propose that in most cases there shouldn’t be any hive use at all. Even 
for catalog elements such as saving a permanent table, we should be able to 
configure a target directory and simply write to it (doing everything file 
based to avoid the need for locking). Hive should be reserved for those who 
actually use it (probably for backward compatibility).

Am I missing something here?
Assaf.


View this message in context: separate spark and hive
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.



Re: separate spark and hive

2016-11-14 Thread Reynold Xin
I agree with the high level idea, and thus SPARK-15691.

In reality, it's a huge amount of work to create & maintain a custom
catalog. It might actually make sense to do, but it just seems like a lot of
work to do right now and it'd take a toll on interoperability.

If you don't need a persistent catalog, you can just run Spark without Hive
mode, can't you?




On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson 
wrote:

> Hi,
>
> Today, we basically force people to use hive if they want to get the full
> use of spark SQL.
>
> When doing the default installation this means that a derby.log and
> metastore_db directory are created where we run from.
>
> The problem with this is that if we run multiple scripts from the same
> working directory we have a problem.
>
> The solution we employ locally is to always run from a different directory
> as we ignore hive in practice (this of course means we lose the ability to
> use some of the catalog options in spark session).
>
> The only other solution is to create a full blown hive installation with
> proper configuration (probably for a JDBC solution).
>
>
>
> I would propose that in most cases there shouldn’t be any hive use at all.
> Even for catalog elements such as saving a permanent table, we should be
> able to configure a target directory and simply write to it (doing
> everything file based to avoid the need for locking). Hive should be
> reserved for those who actually use it (probably for backward
> compatibility).
>
>
>
> Am I missing something here?
>
> Assaf.
>
> --
> View this message in context: separate spark and hive
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>


separate spark and hive

2016-11-14 Thread assaf.mendelson
Hi,
Today, we basically force people to use hive if they want to get the full use 
of spark SQL.
When doing the default installation this means that a derby.log and 
metastore_db directory are created where we run from.
The problem with this is that if we run multiple scripts from the same working 
directory we have a problem.
The solution we employ locally is to always run from a different directory as we 
ignore hive in practice (this of course means we lose the ability to use some 
of the catalog options in spark session).
The only other solution is to create a full blown hive installation with proper 
configuration (probably for a JDBC solution).

I would propose that in most cases there shouldn't be any hive use at all. Even 
for catalog elements such as saving a permanent table, we should be able to 
configure a target directory and simply write to it (doing everything file 
based to avoid the need for locking). Hive should be reserved for those who 
actually use it (probably for backward compatibility).

Am I missing something here?
Assaf.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/separate-spark-and-hive-tp19879.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
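For readers hitting the same derby.log / metastore_db clutter, a sketch of the
relevant knobs (the paths are placeholders; javax.jdo.option.ConnectionURL and
derby.system.home are Hive/Derby settings that normally live in hive-site.xml,
so passing them via the spark.hadoop. prefix is an assumption about the
deployment, and embedded Derby still allows only one live session at a time,
so this relocates the files but does not make the metastore shareable):

    import org.apache.spark.sql.SparkSession

    // Keep derby.log out of the current working directory (Derby system property).
    System.setProperty("derby.system.home", "/tmp/spark-derby")

    val spark = SparkSession.builder()
      .appName("fixed-metastore-locations")
      .master("local[*]")
      // Where managed tables are written.
      .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
      // Pin the embedded-Derby metastore to a fixed path instead of ./metastore_db.
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",
        "jdbc:derby:;databaseName=/tmp/spark-metastore_db;create=true")
      .enableHiveSupport()
      .getOrCreate()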

Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
Good catch. Updated!


On Mon, Nov 14, 2016 at 11:13 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> FWIW 2.0.1 is also used in the 'Link With Spark' and 'Spark Source
> Code Management' sections in that page.
>
> Shivaram
>
> On Mon, Nov 14, 2016 at 11:11 PM, Reynold Xin  wrote:
> > It's there on the page (both the release notes and the download
> version
> > dropdown).
> >
> > The one line text is outdated. I'm just going to delete that text as a
> > matter of fact so we don't run into this issue in the future.
> >
> >
> > On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson <
> assaf.mendel...@rsa.com>
> > wrote:
> >>
> >> While you can download spark 2.0.2, the description is still spark
> 2.0.1:
> >>
> >> Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016
> >> (release notes) (git tag)
> >>
> >>
> >>
> >>
> >>
> >> From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
> >> email]]
> >> Sent: Tuesday, November 15, 2016 7:15 AM
> >> To: Mendelson, Assaf
> >> Subject: [ANNOUNCE] Apache Spark 2.0.2
> >>
> >>
> >>
> >> We are happy to announce the availability of Spark 2.0.2!
> >>
> >>
> >>
> >> Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes
> along
> >> with Kafka 0.10 support and runtime metrics for Structured Streaming.
> This
> >> release is based on the branch-2.0 maintenance branch of Spark. We
> strongly
> >> recommend all 2.0.x users to upgrade to this stable release.
> >>
> >>
> >>
> >> To download Apache Spark 2.0.2, visit
> >> http://spark.apache.org/downloads.html
> >>
> >>
> >>
> >> We would like to acknowledge all community members for contributing
> >> patches to this release.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> 
> >>
> >> If you reply to this email, your message will be added to the discussion
> >> below:
> >>
> >>
> >> http://apache-spark-developers-list.1001551.n3.
> nabble.com/ANNOUNCE-Apache-Spark-2-0-2-tp19870.html
> >>
> >> To start a new topic under Apache Spark Developers List, email [hidden
> >> email]
> >> To unsubscribe from Apache Spark Developers List, click here.
> >> NAML
> >>
> >>
> >> 
> >> View this message in context: RE: [ANNOUNCE] Apache Spark 2.0.2
> >> Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com.
> >
> >
>


RE: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread assaf.mendelson
I am not sure I understand when the statistics would be calculated. Would they 
always be calculated or just when analyze is called?
Would it be possible to save analysis results as part of dataframe saving (e.g. 
when writing it to parquet) or do we have to have a consistent hive 
installation?
Would it be possible to provide the hints manually? For example for streaming 
if I know the data in the beginning is not a representative of the entire 
stream?

From: rxin [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19873...@n3.nabble.com]
Sent: Tuesday, November 15, 2016 8:48 AM
To: Mendelson, Assaf
Subject: Re: statistics collection and propagation for cost-based optimizer

They are not yet complete. The benchmark was done with an implementation of 
cost-based optimizer Huawei had internally for Spark 1.5 (or some even older 
version).

On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan <[hidden 
email]> wrote:
It looks like the Huawei team has run the TPC-H benchmark and some real-world test 
cases, and their results show a good performance gain of a 2X-5X speedup depending 
on data volume.
Can we share the numbers and the query-wise rationale behind the gain? Is there 
anything done on Spark master yet? Or is the implementation not yet complete?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog

On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan <[hidden 
email]> wrote:

Thanks Reynold for the detailed proposals. A few questions/clarifications -

1) How will the existing rule-based optimizer co-exist with CBO? The existing rules 
are heuristics/empirical based; I am assuming rules like predicate pushdown or 
project pruning will co-exist with CBO and we just want to accurately estimate 
the filter factor and cardinality to make it more accurate? With predicate 
pushdown, a filter is mostly executed at an early stage of a query plan and the 
cardinality estimate of a predicate can improve the precision of cardinality 
estimates.

2. Will the query transformations be now based on the cost calculation? If yes, 
then what happens when the cost of execution of the transformed statement is 
higher than the cost of untransformed query?

3. Is there any upper limit on space used for storing the frequency histogram? 
255? And in case of more distinct values, we can even consider height balanced 
histogram in Oracle.

4. The first three proposals are new and not mentioned in the CBO design spec. 
CMS is good but it's less accurate compared to the traditional histograms. This is 
a major trade-off we need to consider.

5. Are we going to consider system statistics, such as speed of CPU or disk 
access as a cost function? How about considering shuffle cost, output 
partitioning etc?

6. Like the current rule based optimizer, will this CBO also be an 'extensible 
optimizer'? If yes, what functionality users can extend?

7. Why will this CBO be disabled by default? "spark.sql.cbo" is false by 
default as it's just experimental?

8. ANALYZE TABLE, analyzeColumns etc ... all look good.

9. From the release point of view, how is this planned? Will all this be 
implemented in one go or in phases?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog

On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin <[hidden 
email]> wrote:
Historically tpcds and tpch. There is certainly a chance of overfitting one or 
two benchmarks. Note that those will probably be impacted more by the way we 
set the parameters for CBO rather than using x or y for summary statistics.


On Monday, November 14, 2016, Shivaram Venkataraman <[hidden 
email]> wrote:
Do we have any query workloads for which we can benchmark these
proposals in terms of performance ?

Thanks
Shivaram

On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin <[hidden 
email]> wrote:
> One additional note: in terms of size, the size of a count-min sketch with
> eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>
> To look up what that means, see
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
>
>
>
>
>
> On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin <[hidden 
> email]> wrote:
>>
>> I want to bring this discussion to the dev list to gather broader
>> feedback, as there have been some discussions that happened over multiple
>> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
>> statistics to collect and how to use them.
>>
>> There are some basic statistics on columns that are obvious to use and we
>> don't need to debate these: estimated size (in bytes), row count, min, max,
>> number of nulls, number of distinct values, average column length, max
>> column length.
>>
>> In addition, we want to be able to estimate selectivity for equality and
>> range predicates better, especially taking into account skewed values and
>> outliers.
>>
>> Before I dive into the different options, let me first explain count-min
>> sketch: Count-min sketch is a common sketch algorithm that tracks frequency

Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Shivaram Venkataraman
FWIW 2.0.1 is also used in the 'Link With Spark' and 'Spark Source
Code Management' sections in that page.

Shivaram

On Mon, Nov 14, 2016 at 11:11 PM, Reynold Xin  wrote:
> It's there on the page (both the release notes and the download version
> dropdown).
>
> The one line text is outdated. I'm just going to delete that text as a
> matter of fact so we don't run into this issue in the future.
>
>
> On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson 
> wrote:
>>
>> While you can download spark 2.0.2, the description is still spark 2.0.1:
>>
>> Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016
>> (release notes) (git tag)
>>
>>
>>
>>
>>
>> From: rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
>> email]]
>> Sent: Tuesday, November 15, 2016 7:15 AM
>> To: Mendelson, Assaf
>> Subject: [ANNOUNCE] Apache Spark 2.0.2
>>
>>
>>
>> We are happy to announce the availability of Spark 2.0.2!
>>
>>
>>
>> Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along
>> with Kafka 0.10 support and runtime metrics for Structured Streaming. This
>> release is based on the branch-2.0 maintenance branch of Spark. We strongly
>> recommend all 2.0.x users to upgrade to this stable release.
>>
>>
>>
>> To download Apache Spark 2.0.2, visit
>> http://spark.apache.org/downloads.html
>>
>>
>>
>> We would like to acknowledge all community members for contributing
>> patches to this release.
>>
>>
>>
>>
>>
>>
>>
>> 
>>
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Apache-Spark-2-0-2-tp19870.html
>>
>> To start a new topic under Apache Spark Developers List, email [hidden
>> email]
>> To unsubscribe from Apache Spark Developers List, click here.
>> NAML
>>
>>
>> 
>> View this message in context: RE: [ANNOUNCE] Apache Spark 2.0.2
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
It's there on the page (both the release notes and the download version
dropdown).

The one line text is outdated. I'm just going to delete that text as a
matter of fact so we don't run into this issue in the future.


On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson 
wrote:

> While you can download spark 2.0.2, the description is still spark 2.0.1:
>
> Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016
> (release notes) (git tag)
>
>
>
>
>
> *From:* rxin [via Apache Spark Developers List] [mailto:ml-node+[hidden
> email] ]
> *Sent:* Tuesday, November 15, 2016 7:15 AM
> *To:* Mendelson, Assaf
> *Subject:* [ANNOUNCE] Apache Spark 2.0.2
>
>
>
> We are happy to announce the availability of Spark 2.0.2!
>
>
>
> Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along
> with Kafka 0.10 support and runtime metrics for Structured Streaming. This
> release is based on the branch-2.0 maintenance branch of Spark. We strongly
> recommend all 2.0.x users to upgrade to this stable release.
>
>
>
> To download Apache Spark 2.0.2, visit http://spark.apache.org/
> downloads.html
>
>
>
> We would like to acknowledge all community members for contributing
> patches to this release.
>
>
>
>
>
>
> --
>
> *If you reply to this email, your message will be added to the discussion
> below:*
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Apache-
> Spark-2-0-2-tp19870.html
>
> To start a new topic under Apache Spark Developers List, email [hidden
> email] 
> To unsubscribe from Apache Spark Developers List, click here.
> NAML
> 
>
> --
> View this message in context: RE: [ANNOUNCE] Apache Spark 2.0.2
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>


RE: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread assaf.mendelson
While you can download spark 2.0.2, the description is still spark 2.0.1:
Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016 
(release notes) (git 
tag)


From: rxin [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19870...@n3.nabble.com]
Sent: Tuesday, November 15, 2016 7:15 AM
To: Mendelson, Assaf
Subject: [ANNOUNCE] Apache Spark 2.0.2

We are happy to announce the availability of Spark 2.0.2!

Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along with 
Kafka 0.10 support and runtime metrics for Structured Streaming. This release 
is based on the branch-2.0 maintenance branch of Spark. We strongly recommend 
all 2.0.x users to upgrade to this stable release.

To download Apache Spark 2.0.2, visit http://spark.apache.org/downloads.html

We would like to acknowledge all community members for contributing patches to 
this release.




If you reply to this email, your message will be added to the discussion below:
http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Apache-Spark-2-0-2-tp19870.html
To start a new topic under Apache Spark Developers List, email 
ml-node+s1001551n1...@n3.nabble.com
To unsubscribe from Apache Spark Developers List, click 
here.
NAML




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/RE-ANNOUNCE-Apache-Spark-2-0-2-tp19874.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
They are not yet complete. The benchmark was done with an implementation of
cost-based optimizer Huawei had internally for Spark 1.5 (or some even
older version).

On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan 
wrote:

> It looks like the Huawei team has run the TPC-H benchmark and some real-world
> test cases, and their results show a good performance gain of a 2X-5X speedup
> depending on data volume.
> Can we share the numbers and the query-wise rationale behind the gain? Is
> there anything done on Spark master yet? Or is the implementation not yet
> complete?
>
> Thanks,
> Yogesh Mahajan
> http://www.snappydata.io/blog 
>
> On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan 
> wrote:
>
>>
>> Thanks Reynold for the detailed proposals. A few questions/clarifications
>> -
>>
>> 1) How will the existing rule-based optimizer co-exist with CBO? The existing
>> rules are heuristics/empirical based; I am assuming rules like predicate
>> pushdown or project pruning will co-exist with CBO and we just want to
>> accurately estimate the filter factor and cardinality to make it more
>> accurate? With predicate pushdown, a filter is mostly executed at an early
>> stage of a query plan and the cardinality estimate of a predicate can
>> improve the precision of cardinality estimates.
>>
>> 2. Will the query transformations be now based on the cost calculation?
>> If yes, then what happens when the cost of execution of the transformed
>> statement is higher than the cost of untransformed query?
>>
>> 3. Is there any upper limit on space used for storing the frequency
>> histogram? 255? And in case of more distinct values, we can even consider
>> height balanced histogram in Oracle.
>>
>> 4. The first three proposals are new and not mentioned in the CBO design
>> spec. CMS is good but it's less accurate compared to the traditional
>> histograms. This is a major trade-off we need to consider.
>>
>> 5. Are we going to consider system statistics, such as speed of CPU or
>> disk access as a cost function? How about considering shuffle cost, output
>> partitioning etc?
>>
>> 6. Like the current rule based optimizer, will this CBO also be an
>> 'extensible optimizer'? If yes, what functionality users can extend?
>>
>> 7. Why will this CBO be disabled by default? "spark.sql.cbo" is false by
>> default as it's just experimental?
>>
>> 8. ANALYZE TABLE, analyzeColumns etc ... all look good.
>>
>> 9. From the release point of view, how is this planned? Will all this be
>> implemented in one go or in phases?
>>
>> Thanks,
>> Yogesh Mahajan
>> http://www.snappydata.io/blog 
>>
>> On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin 
>> wrote:
>>
>>> Historically tpcds and tpch. There is certainly a chance of overfitting
>>> one or two benchmarks. Note that those will probably be impacted more by
>>> the way we set the parameters for CBO rather than using x or y for summary
>>> statistics.
>>>
>>>
>>> On Monday, November 14, 2016, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 Do we have any query workloads for which we can benchmark these
 proposals in terms of performance ?

 Thanks
 Shivaram

 On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin 
 wrote:
 > One additional note: in terms of size, the size of a count-min sketch
 with
 > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
 >
 > To look up what that means, see
 > http://spark.apache.org/docs/latest/api/java/org/apache/spar
 k/util/sketch/CountMinSketch.html
 >
 >
 >
 >
 >
 > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin 
 wrote:
 >>
 >> I want to bring this discussion to the dev list to gather broader
 >> feedback, as there have been some discussions that happened over
 multiple
 >> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
 >> statistics to collect and how to use them.
 >>
 >> There are some basic statistics on columns that are obvious to use
 and we
 >> don't need to debate these: estimated size (in bytes), row count,
 min, max,
 >> number of nulls, number of distinct values, average column length,
 max
 >> column length.
 >>
 >> In addition, we want to be able to estimate selectivity for equality
 and
 >> range predicates better, especially taking into account skewed
 values and
 >> outliers.
 >>
 >> Before I dive into the different options, let me first explain
 count-min
 >> sketch: Count-min sketch is a common sketch algorithm that tracks
 frequency
 >> counts. It has the following nice properties:
 >> - sublinear space
 >> - can be generated in one-pass in a streaming fashion
 >> - can be incrementally maintained (i.e. for appending new data)
 >> - it's already implemented in Spark
 >> - more accurate for frequent values, and less accurate for
 less-frequent
 >> values, i.e. it tr

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Yogesh Mahajan
It looks like the Huawei team has run the TPC-H benchmark and some real-world test
cases, and their results show a good performance gain of a 2X-5X speedup
depending on data volume.
Can we share the numbers and the query-wise rationale behind the gain? Is there
anything done on Spark master yet? Or is the implementation not yet
complete?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog 

On Tue, Nov 15, 2016 at 12:03 PM, Yogesh Mahajan 
wrote:

>
> Thanks Reynold for the detailed proposals. A few questions/clarifications
> -
>
> 1) How will the existing rule-based optimizer co-exist with CBO? The existing
> rules are heuristics/empirical based; I am assuming rules like predicate
> pushdown or project pruning will co-exist with CBO and we just want to
> accurately estimate the filter factor and cardinality to make it more
> accurate? With predicate pushdown, a filter is mostly executed at an early
> stage of a query plan and the cardinality estimate of a predicate can
> improve the precision of cardinality estimates.
>
> 2. Will the query transformations be now based on the cost calculation? If
> yes, then what happens when the cost of execution of the transformed
> statement is higher than the cost of untransformed query?
>
> 3. Is there any upper limit on space used for storing the frequency
> histogram? 255? And in case of more distinct values, we can even consider
> height balanced histogram in Oracle.
>
> 4. The first three proposals are new and not mentioned in the CBO design
> spec. CMS is good but it's less accurate compared to the traditional
> histograms. This is a major trade-off we need to consider.
>
> 5. Are we going to consider system statistics, such as speed of CPU or
> disk access as a cost function? How about considering shuffle cost, output
> partitioning etc?
>
> 6. Like the current rule based optimizer, will this CBO also be an
> 'extensible optimizer'? If yes, what functionality users can extend?
>
> 7. Why will this CBO be disabled by default? "spark.sql.cbo" is false by
> default as it's just experimental?
>
> 8. ANALYZE TABLE, analyzeColumns etc ... all look good.
>
> 9. From the release point of view, how is this planned? Will all this be
> implemented in one go or in phases?
>
> Thanks,
> Yogesh Mahajan
> http://www.snappydata.io/blog 
>
> On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin  wrote:
>
>> Historically tpcds and tpch. There is certainly a chance of overfitting
>> one or two benchmarks. Note that those will probably be impacted more by
>> the way we set the parameters for CBO rather than using x or y for summary
>> statistics.
>>
>>
>> On Monday, November 14, 2016, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> Do we have any query workloads for which we can benchmark these
>>> proposals in terms of performance ?
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin 
>>> wrote:
>>> > One additional note: in terms of size, the size of a count-min sketch
>>> with
>>> > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>>> >
>>> > To look up what that means, see
>>> > http://spark.apache.org/docs/latest/api/java/org/apache/spar
>>> k/util/sketch/CountMinSketch.html
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin 
>>> wrote:
>>> >>
>>> >> I want to bring this discussion to the dev list to gather broader
>>> >> feedback, as there have been some discussions that happened over
>>> multiple
>>> >> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
>>> >> statistics to collect and how to use them.
>>> >>
>>> >> There are some basic statistics on columns that are obvious to use
>>> and we
>>> >> don't need to debate these: estimated size (in bytes), row count,
>>> min, max,
>>> >> number of nulls, number of distinct values, average column length, max
>>> >> column length.
>>> >>
>>> >> In addition, we want to be able to estimate selectivity for equality
>>> and
>>> >> range predicates better, especially taking into account skewed values
>>> and
>>> >> outliers.
>>> >>
>>> >> Before I dive into the different options, let me first explain
>>> count-min
>>> >> sketch: Count-min sketch is a common sketch algorithm that tracks
>>> frequency
>>> >> counts. It has the following nice properties:
>>> >> - sublinear space
>>> >> - can be generated in one-pass in a streaming fashion
>>> >> - can be incrementally maintained (i.e. for appending new data)
>>> >> - it's already implemented in Spark
>>> >> - more accurate for frequent values, and less accurate for
>>> less-frequent
>>> >> values, i.e. it tracks skewed values well.
>>> >> - easy to compute inner product, i.e. trivial to compute the count-min
>>> >> sketch of a join given two count-min sketches of the join tables
>>> >>
>>> >>
>>> >> Proposal 1 is to use a combination of count-min sketch and
>>> equi-height
>>> >> histograms. In this case, count-min sketch will be used for
>>> selectivity
>>> >>

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Yogesh Mahajan
Thanks Reynold for the detailed proposals. A few questions/clarifications -

1) How will the existing rule-based optimizer co-exist with CBO? The existing
rules are heuristics/empirical based; I am assuming rules like predicate
pushdown or project pruning will co-exist with CBO and we just want to
accurately estimate the filter factor and cardinality to make it more
accurate? With predicate pushdown, a filter is mostly executed at an early
stage of a query plan and the cardinality estimate of a predicate can
improve the precision of cardinality estimates.

2. Will the query transformations be now based on the cost calculation? If
yes, then what happens when the cost of execution of the transformed
statement is higher than the cost of untransformed query?

3. Is there any upper limit on space used for storing the frequency
histogram? 255? And in case of more distinct values, we can even consider
height balanced histogram in Oracle.

4. The first three proposals are new and not mentioned in the CBO design
spec. CMS is good but it's less accurate compared to the traditional
histograms. This is a major trade-off we need to consider.

5. Are we going to consider system statistics, such as speed of CPU or disk
access as a cost function? How about considering shuffle cost, output
partitioning etc?

6. Like the current rule based optimizer, will this CBO also be an
'extensible optimizer'? If yes, what functionality users can extend?

7. Why will this CBO be disabled by default? "spark.sql.cbo" is false by
default as it's just experimental?

8. ANALYZE TABLE, analyzeColumns etc ... all look good.

9. From the release point of view, how is this planned? Will all this be
implemented in one go or in phases?

Thanks,
Yogesh Mahajan
http://www.snappydata.io/blog 
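As a concrete anchor for point 8 above, a sketch of the statement-level
interface being discussed (assuming a table named `events` already registered
in the catalog; the FOR COLUMNS variant is part of the proposal rather than
something shipping today):

    // Table-level statistics (size on disk, row count).
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")
    // NOSCAN variant: record size only, skipping the full scan.
    spark.sql("ANALYZE TABLE events COMPUTE STATISTICS NOSCAN")
    // Column-level statistics as proposed in the CBO design doc (illustrative syntax):
    // spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS user_id, ts")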

On Mon, Nov 14, 2016 at 11:25 PM, Reynold Xin  wrote:

> Historically tpcds and tpch. There is certainly a chance of overfitting
> one or two benchmarks. Note that those will probably be impacted more by
> the way we set the parameters for CBO rather than using x or y for summary
> statistics.
>
>
> On Monday, November 14, 2016, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Do we have any query workloads for which we can benchmark these
>> proposals in terms of performance ?
>>
>> Thanks
>> Shivaram
>>
>> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin  wrote:
>> > One additional note: in terms of size, the size of a count-min sketch
>> with
>> > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>> >
>> > To look up what that means, see
>> > http://spark.apache.org/docs/latest/api/java/org/apache/spar
>> k/util/sketch/CountMinSketch.html
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin 
>> wrote:
>> >>
>> >> I want to bring this discussion to the dev list to gather broader
>> >> feedback, as there have been some discussions that happened over
>> multiple
>> >> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
>> >> statistics to collect and how to use them.
>> >>
>> >> There are some basic statistics on columns that are obvious to use and
>> we
>> >> don't need to debate these: estimated size (in bytes), row count, min,
>> max,
>> >> number of nulls, number of distinct values, average column length, max
>> >> column length.
>> >>
>> >> In addition, we want to be able to estimate selectivity for equality
>> and
>> >> range predicates better, especially taking into account skewed values
>> and
>> >> outliers.
>> >>
>> >> Before I dive into the different options, let me first explain
>> count-min
>> >> sketch: Count-min sketch is a common sketch algorithm that tracks
>> frequency
>> >> counts. It has the following nice properties:
>> >> - sublinear space
>> >> - can be generated in one-pass in a streaming fashion
>> >> - can be incrementally maintained (i.e. for appending new data)
>> >> - it's already implemented in Spark
>> >> - more accurate for frequent values, and less accurate for
>> less-frequent
>> >> values, i.e. it tracks skewed values well.
>> >> - easy to compute inner product, i.e. trivial to compute the count-min
>> >> sketch of a join given two count-min sketches of the join tables
>> >>
>> >>
>> >> Proposal 1 is to use a combination of count-min sketch and
>> equi-height
>> >> histograms. In this case, count-min sketch will be used for selectivity
>> >> estimation on equality predicates, and histogram will be used on range
>> >> predicates.
>> >>
>> >> Proposal 2 is to just use count-min sketch on equality predicates, and
>> >> then simple selected_range / (max - min) will be used for range
>> predicates.
>> >> This will be less accurate than using histogram, but simpler because we
>> >> don't need to collect histograms.
>> >>
>> >> Proposal 3 is a variant of proposal 2, and takes into account that
>> skewed
>> >> values can impact selectivity heavily. In 3, we track the list of heavy
>> >> hitters (HH, most frequent items) along with count-min sketch on the
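For concreteness, a short sketch of the count-min sketch utility already
shipping in Spark (org.apache.spark.util.sketch.CountMinSketch, linked earlier
in the thread); the eps/confidence/seed values are just illustrative:

    import org.apache.spark.util.sketch.CountMinSketch

    // eps = relative error, confidence, seed.
    val cms = CountMinSketch.create(0.001, 0.99, 42)
    Seq("a", "a", "a", "b", "c").foreach(v => cms.add(v))

    cms.estimateCount("a")   // ~3: frequent (skewed) values are tracked accurately
    cms.totalCount()         // 5

    // Sketches built with the same parameters merge cheaply, which is what makes
    // incremental maintenance (appends) and join estimation practical.
    val delta = CountMinSketch.create(0.001, 0.99, 42)
    delta.add("a")
    cms.mergeInPlace(delta)

    // The same structure can also be built directly from a DataFrame column:
    //   df.stat.countMinSketch("someColumn", 0.001, 0.99, 42)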

[ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.2!

Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along
with Kafka 0.10 support and runtime metrics for Structured Streaming. This
release is based on the branch-2.0 maintenance branch of Spark. We strongly
recommend all 2.0.x users to upgrade to this stable release.

To download Apache Spark 2.0.2, visit http://spark.apache.org/downloads.html

We would like to acknowledge all community members for contributing patches
to this release.
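For anyone wiring the release into a build, the artifacts are on Maven Central
under the spark-*_2.11 / 2.0.2 coordinates (a minimal sbt sketch; the provided
scope is just the usual choice for code submitted with spark-submit):

    // build.sbt
    scalaVersion := "2.11.8"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
    )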


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Reynold Xin
The issue is now resolved.


On Mon, Nov 14, 2016 at 3:08 PM, Sean Owen  wrote:

> Yes, it's on Maven. We have some problem syncing the web site changes at
> the moment though those are committed too. I think that's the only piece
> before a formal announcement.
>
>
> On Mon, Nov 14, 2016 at 9:49 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Has the release already been made? I didn't see any announcement, but
>> Homebrew has already updated to 2.0.2.
>> On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote:
>>
>> The vote has passed with the following +1s and no -1. I will work on
>> packaging the release.
>>
>> +1:
>>
>> Reynold Xin*
>> Herman van Hövell tot Westerflier
>> Ricardo Almeida
>> Shixiong (Ryan) Zhu
>> Sean Owen*
>> Michael Armbrust*
>> Dongjoon Hyun
>> Jagadeesan As
>> Liwei Lin
>> Weiqing Yang
>> Vaquar Khan
>> Denny Lee
>> Yin Huai*
>> Ryan Blue
>> Pratik Sharma
>> Kousuke Saruta
>> Tathagata Das*
>> Mingjie Tang
>> Adam Roberts
>>
>> * = binding
>>
>>
>> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367
>> ba694b0c34)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.
>> 0.2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>>
>>


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread Mark Hamstra
Take a look at spark.sql.adaptive.enabled and the ExchangeCoordinator. A
single, fixed-size sql.shuffle.partitions is not the only way to control
the number of partitions in an Exchange -- if you are willing to deal with
code that is still off by default.
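A minimal sketch of the two knobs mentioned here (property names as in the 2.x
SQLConf; the numeric values are illustrative, and a runtime change to
spark.sql.shuffle.partitions only affects plans produced after the change):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("exchange-partitions")
      // Opt in to the (off-by-default) adaptive ExchangeCoordinator.
      .config("spark.sql.adaptive.enabled", "true")
      // Target post-shuffle partition size the coordinator aims for (64 MB here).
      .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
      .getOrCreate()

    // The fixed setting can still be adjusted between actions of the same pipeline.
    spark.conf.set("spark.sql.shuffle.partitions", "400")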

On Mon, Nov 14, 2016 at 4:19 PM, leo9r  wrote:

> Hi Daniel,
>
> I completely agree with your request. As the amount of data being processed
> with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need
> to prevent OOM and performance degradation. The fact that
> sql.shuffle.partitions cannot be set several times in the same job/action,
> because of the reason you explain, is a big inconvenience for the
> development
> of ETL pipelines.
>
> Have you got any answer or feedback in this regard?
>
> Thanks,
> Leo Lezcano
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-
> partitions-should-be-stored-in-the-lineage-tp13240p19867.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread leo9r
Hi Daniel,

I completely agree with your request. As the amount of data being processed
with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need
to prevent OOM and performance degradation. The fact that
sql.shuffle.partitions cannot be set several times in the same job/action,
because of the reason you explain, is a big inconvenience for the development
of ETL pipelines.

Have you got any answer or feedback in this regard?

Thanks,
Leo Lezcano



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-partitions-should-be-stored-in-the-lineage-tp13240p19867.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Sean Owen
Yes, it's on Maven. We have some problem syncing the web site changes at
the moment though those are committed too. I think that's the only piece
before a formal announcement.


On Mon, Nov 14, 2016 at 9:49 PM Nicholas Chammas 
wrote:

> Has the release already been made? I didn't see any announcement, but
> Homebrew has already updated to 2.0.2.
> On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote:
>
> The vote has passed with the following +1s and no -1. I will work on
> packaging the release.
>
> +1:
>
> Reynold Xin*
> Herman van Hövell tot Westerflier
> Ricardo Almeida
> Shixiong (Ryan) Zhu
> Sean Owen*
> Michael Armbrust*
> Dongjoon Hyun
> Jagadeesan As
> Liwei Lin
> Weiqing Yang
> Vaquar Khan
> Denny Lee
> Yin Huai*
> Ryan Blue
> Pratik Sharma
> Kousuke Saruta
> Tathagata Das*
> Mingjie Tang
> Adam Roberts
>
> * = binding
>
>
> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc3
> (584354eaac02531c9584188b143367ba694b0c34)
>
> This release candidate resolves 84 issues:
> https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


Re: getting encoder implicits to be more accurate

2016-11-14 Thread Michael Armbrust
I would definitely like to open up APIs for people to write their own
encoders.  The challenge thus far has been that Encoders use internal APIs
that have not been stable for translating the data into the tungsten
format.  We also make use of the analyzer to figure out the mapping from
columns to fields (also not a stable API). This is the only "magic" that is
happening.

If someone wants to propose a stable / fast API here it would be great to
start the discussion. It's an often-requested feature.
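Until such an API exists, the manual fallback that already works today is to
pin a Kryo-based encoder for the offending type (a sketch; MyUnsupported is a
placeholder type with no built-in encoder):

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    case class MyUnsupported(values: Set[Int])  // no Product-derived encoder for Set

    val spark = SparkSession.builder().master("local[*]").appName("kryo-fallback").getOrCreate()

    // Serialize this one type with Kryo into a single binary column instead of
    // relying on the reflection-based ExpressionEncoder derivation.
    implicit val unsupportedEnc: Encoder[MyUnsupported] = Encoders.kryo[MyUnsupported]

    val ds = spark.createDataset(Seq(MyUnsupported(Set(1, 2, 3))))
    ds.count()  // round-trips, but the data is opaque to Catalyst (no columns to prune/filter)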

On Mon, Nov 14, 2016 at 1:32 PM, Sam Goodwin 
wrote:

> I wouldn't recommend using a Tuple as you end up with column names like
> "_1" and "_2", but it will still work :)
>
> ExpressionEncoder can do the same thing but it doesn't support custom
> types, and as far as I can tell, does not support custom implementations.
> I.e. is it possible for me to write my own Encoder logic and completely
> bypass ExpressionEncoder? The trait definition has no useful methods so it
> doesn't seem straight-forward. If the Encoder trait was opened up so
> people could provide their own implementations then I don't see this as an
> issue anymore. It would allow for external Encoder libraries like mine
> while not neglecting Java (non-scala) developers. Is there "magic" happening
> behind the scenes stopping us from doing this?
>
> On Mon, 14 Nov 2016 at 12:31 Koert Kuipers  wrote:
>
>> just taking it for a quick spin it looks great, with correct support for
>> nested rows and using option for nullability.
>>
>> scala> val format = implicitly[RowFormat[(String, Seq[(String,
>> Option[Int])])]]
>> format: com.github.upio.spark.sql.RowFormat[(String, Seq[(String,
>> Option[Int])])] = com.github.upio.spark.sql.
>> FamilyFormats$$anon$3@2c0961e2
>>
>> scala> format.schema
>> res12: org.apache.spark.sql.types.StructType = 
>> StructType(StructField(_1,StringType,false),
>> StructField(_2,ArrayType(StructType(StructField(_1,StringType,false),
>> StructField(_2,IntegerType,true)),true),false))
>>
>> scala> val x = format.read(Row("a", Seq(Row("a", 5))))
>> x: (String, Seq[(String, Option[Int])]) = (a,List((a,Some(5))))
>>
>> scala> format.write(x)
>> res13: org.apache.spark.sql.Row = [a,List([a,5])]
>>
>>
>>
>> On Mon, Nov 14, 2016 at 3:10 PM, Koert Kuipers  wrote:
>>
>> agreed on your point that this can be done without macros
>>
>> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin 
>> wrote:
>>
>> You don't need compile-time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust 
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers 
>> *Cc:* Ryan Blue , "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of things we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers 
>> wrote:
>>
>> yup, it doesn't really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we don't have a need for case classes
>>
>> if case classes extended ProductN (which they will, I think, in Scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue  wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Josh Rosen
He pushed the 2.0.2 release docs but there's a problem with Git mirroring
of the Spark website repo which is interfering with the publishing:
https://issues.apache.org/jira/browse/INFRA-12913


On Mon, Nov 14, 2016 at 1:15 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> The release is available on http://www.apache.org/dist/spark/ and its
> on Maven central
> http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.0.2/
>
> I guess Reynold hasn't yet put together the release notes / updates to
> the website.
>
> Thanks
> Shivaram
>
> On Mon, Nov 14, 2016 at 12:49 PM, Nicholas Chammas
>  wrote:
> > Has the release already been made? I didn't see any announcement, but
> > Homebrew has already updated to 2.0.2.
> > On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote:
> >>
> >> The vote has passed with the following +1s and no -1. I will work on
> >> packaging the release.
> >>
> >> +1:
> >>
> >> Reynold Xin*
> >> Herman van Hövell tot Westerflier
> >> Ricardo Almeida
> >> Shixiong (Ryan) Zhu
> >> Sean Owen*
> >> Michael Armbrust*
> >> Dongjoon Hyun
> >> Jagadeesan As
> >> Liwei Lin
> >> Weiqing Yang
> >> Vaquar Khan
> >> Denny Lee
> >> Yin Huai*
> >> Ryan Blue
> >> Pratik Sharma
> >> Kousuke Saruta
> >> Tathagata Das*
> >> Mingjie Tang
> >> Adam Roberts
> >>
> >> * = binding
> >>
> >>
> >> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin 
> wrote:
> >>>
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and
> passes if a
> >>> majority of at least 3 +1 PMC votes are cast.
> >>>
> >>> [ ] +1 Release this package as Apache Spark 2.0.2
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>>
> >>> The tag to be voted on is v2.0.2-rc3
> >>> (584354eaac02531c9584188b143367ba694b0c34)
> >>>
> >>> This release candidate resolves 84 issues:
> >>> https://s.apache.org/spark-2.0.2-jira
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://people.apache.org/keys/committer/pwendell.asc
> >>>
> >>> The staging repository for this release can be found at:
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1214/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
> >>>
> >>>
> >>> Q: How can I help test this release?
> >>> A: If you are a Spark user, you can help us test this release by taking
> >>> an existing Spark workload and running on this release candidate, then
> >>> reporting any regressions from 2.0.1.
> >>>
> >>> Q: What justifies a -1 vote for this release?
> >>> A: This is a maintenance release in the 2.0.x series. Bugs already
> >>> present in 2.0.1, missing features, or bugs related to new features
> will not
> >>> necessarily block this release.
> >>>
> >>> Q: What fix version should I use for patches merging into branch-2.0
> from
> >>> now on?
> >>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> >>> (i.e. RC4) is cut, I will change the fix version of those patches to
> 2.0.2.
> >>
> >>
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: getting encoder implicits to be more accurate

2016-11-14 Thread Sam Goodwin
I wouldn't recommend using a Tuple as you end up with column names like
"_1" and "_2", but it will still work :)

ExpressionEncoder can do the same thing but it doesn't support custom
types, and as far as I can tell, does not support custom implementations.
I.e. is it possible for me to write my own Encoder logic and completely
bypass ExpressionEncoder? The trait definition has no useful methods so it
doesn't seem straight-forward. If the Encoder trait was opened up so people
could provide their own implementations then I don't see this as an issue
anymore. It would allow for external Encoder libraries like mine while not
neglecting Java (non-scala) developers. Is there "magic" happening behind
the scenes stopping us from doing this?

On Mon, 14 Nov 2016 at 12:31 Koert Kuipers  wrote:

> just taking it for a quick spin it looks great, with correct support for
> nested rows and using option for nullability.
>
> scala> val format = implicitly[RowFormat[(String, Seq[(String,
> Option[Int])])]]
> format: com.github.upio.spark.sql.RowFormat[(String, Seq[(String,
> Option[Int])])] = com.github.upio.spark.sql.FamilyFormats$$anon$3@2c0961e2
>
> scala> format.schema
> res12: org.apache.spark.sql.types.StructType =
> StructType(StructField(_1,StringType,false),
> StructField(_2,ArrayType(StructType(StructField(_1,StringType,false),
> StructField(_2,IntegerType,true)),true),false))
>
> scala> val x = format.read(Row("a", Seq(Row("a", 5))))
> x: (String, Seq[(String, Option[Int])]) = (a,List((a,Some(5))))
>
> scala> format.write(x)
> res13: org.apache.spark.sql.Row = [a,List([a,5])]
>
>
>
> On Mon, Nov 14, 2016 at 3:10 PM, Koert Kuipers  wrote:
>
> agreed on your point that this can be done without macros
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin 
> wrote:
>
> You don't need compile-time macros for this, you can do it quite easily
> using shapeless. I've been playing with a project which borrows ideas from
> spray-json and spray-json-shapeless to implement Row marshalling for
> arbitrary case classes. It's checked and generated at compile time,
> supports arbitrary/nested case classes, and allows custom types. It is also
> entirely pluggable meaning you can bypass the default implementations and
> provide your own, just like any type class.
>
> https://github.com/upio/spark-sql-formats
>
>
> *From:* Michael Armbrust 
> *Date:* October 26, 2016 at 12:50:23 PM PDT
> *To:* Koert Kuipers 
> *Cc:* Ryan Blue , "dev@spark.apache.org" <
> dev@spark.apache.org>
> *Subject:* *Re: getting encoder implicits to be more accurate*
>
> Sorry, I realize that set is only one example here, but I don't think that
> making the type of the implicit more narrow to include only ProductN or
> something eliminates the issue.  Even with that change, we will fail to
> generate an encoder with the same error if you, for example, have a field
> of your case class that is an unsupported type.
>
>
>
> Short of changing this to compile-time macros, I think we are stuck with
> this class of errors at runtime.  The simplest solution seems to be to
> expand the set of things we can handle as much as possible and allow users
> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
> make this the default though, as behavior would change with each release
> that adds support for more types.  I would be very supportive of making
> this fallback a built-in option though.
>
>
>
> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers  wrote:
>
> yup, it doesnt really solve the underlying issue.
>
> we fixed it internally by having our own typeclass that produces encoders
> and that does check the contents of the products, but we did this by simply
> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
> Product, since we dont have a need for case classes
>
> if case classes extended ProductN (which they will i think in scala 2.12?)
> then we could drop Product and support Product1 - Product22 and Option
> explicitly while checking the classes they contain. that would be the
> cleanest.
>
>
>
> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue  wrote:
>
> Isn't the problem that Option is a Product and the class it contains isn't
> checked? Adding support for Set fixes the example, but the problem would
> happen with any class there isn't an encoder for, right?
>
>
>
> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust 
> wrote:
>
> Hmm, that is unfortunate.  Maybe the best solution is to add support for
> sets?  I don't think that would be super hard.
>
>
>
> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers  wrote:
>
> i am trying to use encoders as a typeclass where if it fails to find an
> ExpressionEncoder it falls back to KryoEncoder.
>
> the issue seems to be that ExpressionEncoder claims a little more than it
> can handle here:
>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
> Encoders.product[T]
>
> this "claims" to handle for example Option[Set[Int]], but it really cannot
> handle Set so it leads to a runtime exception.
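
The failure mode described at the bottom of this thread, and the Kryo
fallback discussed above, look roughly like this in a Spark 2.0 shell. This
is a sketch rather than an exact transcript: the session setup and the
exception wording are assumptions, and later releases may add Set support.

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._

// Implicit resolution happily picks newProductEncoder for Option[Set[Int]]
// (Option is a Product), but constructing the underlying ExpressionEncoder
// fails at runtime because Set is not a supported type in 2.0:
// implicitly[Encoder[Option[Set[Int]]]]
//   => throws an UnsupportedOperationException complaining about Set

// The fallback being discussed: an explicit Kryo-based encoder, which can
// serialize such types but stores them as a single binary column.
val viaKryo: Encoder[Option[Set[Int]]] = Encoders.kryo[Option[Set[Int]]]
val ds = spark.createDataset(Seq(Option(Set(1, 2, 3))))(viaKryo)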

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Shivaram Venkataraman
The release is available on http://www.apache.org/dist/spark/ and it's
on Maven central:
http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.0.2/

I guess Reynold hasn't yet put together the release notes / updates to
the website.

Thanks
Shivaram

On Mon, Nov 14, 2016 at 12:49 PM, Nicholas Chammas
 wrote:
> Has the release already been made? I didn't see any announcement, but
> Homebrew has already updated to 2.0.2.
> On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote:
>>
>> The vote has passed with the following +1s and no -1. I will work on
>> packaging the release.
>>
>> +1:
>>
>> Reynold Xin*
>> Herman van Hövell tot Westerflier
>> Ricardo Almeida
>> Shixiong (Ryan) Zhu
>> Sean Owen*
>> Michael Armbrust*
>> Dongjoon Hyun
>> Jagadeesan As
>> Liwei Lin
>> Weiqing Yang
>> Vaquar Khan
>> Denny Lee
>> Yin Huai*
>> Ryan Blue
>> Pratik Sharma
>> Kousuke Saruta
>> Tathagata Das*
>> Mingjie Tang
>> Adam Roberts
>>
>> * = binding
>>
>>
>> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if a
> >>> majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.2-rc3
>>> (584354eaac02531c9584188b143367ba694b0c34)
>>>
>>> This release candidate resolves 84 issues:
>>> https://s.apache.org/spark-2.0.2-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 2.0.1.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>> present in 2.0.1, missing features, or bugs related to new features will not
>>> necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0 from
>>> now on?
>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>
>>
>




Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
sorry, this message by me was confusing. i was frustrated about how hard it
is to use the Encoder machinery myself directly on Row objects; this is
unrelated to the question of whether a shapeless-based approach like sam
suggests would be a better way to do encoders in general.

On Mon, Nov 14, 2016 at 3:03 PM, Koert Kuipers  wrote:

> that makes sense. we have something like that inhouse as well, but not as
> nice and not using shapeless (we simply relied on sbt-boilerplate to handle
> all tuples and do not support case classes).
>
> however the frustrating part is that spark sql already has this more or
> less. look for example at ExpressionEncoder.fromRow and
> ExpressionEncoder.toRow. but these methods use InternalRow while the rows
> exposed to me as a user are not that.
>
> at this point i am more tempted to simply open up InternalRow at a few
> places strategically than to maintain another inhouse row marshalling
> class. once i have InternalRows looks of good stuff is available to me to
> use.
>
>
>
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin 
> wrote:
>
>> You don't need compiler time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust 
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers 
>> *Cc:* Ryan Blue , "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of things we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers 
>> wrote:
>>
>> yup, it doesnt really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we dont have a need for case classes
>>
>> if case classes extended ProductN (which they will i think in scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue  wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>> Hmm, that is unfortunate.  Maybe the best solution is to add support for
>> sets?  I don't think that would be super hard.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers  wrote:
>>
>> i am trying to use encoders as a typeclass where if it fails to find an
>> ExpressionEncoder it falls back to KryoEncoder.
>>
>> the issue seems to be that ExpressionEncoder claims a little more than it
>> can handle here:
>>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
>> Encoders.product[T]
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>
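
For reference, the internal machinery Koert is pointing at looks roughly
like the following in the 2.0 codebase. ExpressionEncoder is an internal,
unsupported API, so this is only a sketch of its shape; the details may
change between releases.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(name: String, age: Int)

// Derive the encoder for a concrete type and bind it to its own schema.
val enc = ExpressionEncoder[Person]().resolveAndBind()

// toRow produces an InternalRow, not the external Row handed to users,
// which is exactly the mismatch being complained about above.
val internal: InternalRow = enc.toRow(Person("ann", 30)).copy()
val back: Person = enc.fromRow(internal)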


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Nicholas Chammas
Has the release already been made? I didn't see any announcement, but
Homebrew has already updated to 2.0.2.
On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote:

> The vote has passed with the following +1s and no -1. I will work on
> packaging the release.
>
> +1:
>
> Reynold Xin*
> Herman van Hövell tot Westerflier
> Ricardo Almeida
> Shixiong (Ryan) Zhu
> Sean Owen*
> Michael Armbrust*
> Dongjoon Hyun
> Jagadeesan As
> Liwei Lin
> Weiqing Yang
> Vaquar Khan
> Denny Lee
> Yin Huai*
> Ryan Blue
> Pratik Sharma
> Kousuke Saruta
> Tathagata Das*
> Mingjie Tang
> Adam Roberts
>
> * = binding
>
>
> On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc3
> (584354eaac02531c9584188b143367ba694b0c34)
>
> This release candidate resolves 84 issues:
> https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
just taking it for a quick spin, it looks great, with correct support for
nested rows and using Option for nullability.

scala> val format = implicitly[RowFormat[(String, Seq[(String,
Option[Int])])]]
format: com.github.upio.spark.sql.RowFormat[(String, Seq[(String,
Option[Int])])] = com.github.upio.spark.sql.FamilyFormats$$anon$3@2c0961e2

scala> format.schema
res12: org.apache.spark.sql.types.StructType =
StructType(StructField(_1,StringType,false),
StructField(_2,ArrayType(StructType(StructField(_1,StringType,false),
StructField(_2,IntegerType,true)),true),false))

scala> val x = format.read(Row("a", Seq(Row("a", 5))))
x: (String, Seq[(String, Option[Int])]) = (a,List((a,Some(5))))

scala> format.write(x)
res13: org.apache.spark.sql.Row = [a,List([a,5])]



On Mon, Nov 14, 2016 at 3:10 PM, Koert Kuipers  wrote:

> agreed on your point that this can be done without macros
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin 
> wrote:
>
>> You don't need compiler time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust 
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers 
>> *Cc:* Ryan Blue , "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of things we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers 
>> wrote:
>>
>> yup, it doesnt really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we dont have a need for case classes
>>
>> if case classes extended ProductN (which they will i think in scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue  wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>> Hmm, that is unfortunate.  Maybe the best solution is to add support for
>> sets?  I don't think that would be super hard.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers  wrote:
>>
>> i am trying to use encoders as a typeclass where if it fails to find an
>> ExpressionEncoder it falls back to KryoEncoder.
>>
>> the issue seems to be that ExpressionEncoder claims a little more than it
>> can handle here:
>>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
>> Encoders.product[T]
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>
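
For readers who have not looked at the library, the typeclass interface
exercised in the transcript above looks roughly like this. The method names
(schema/read/write) are taken from the transcript; everything else here is
an illustrative sketch, not the actual spark-sql-formats source.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

trait RowFormat[T] {
  def schema: StructType   // Spark SQL schema derived for T
  def read(row: Row): T    // unmarshal a Row into a T
  def write(value: T): Row // marshal a T back into a Row
}

// A hand-written instance for (String, Int), to illustrate what the
// shapeless derivation produces automatically for nested case classes.
object ExampleFormats {
  implicit val pairFormat: RowFormat[(String, Int)] = new RowFormat[(String, Int)] {
    val schema = StructType(Seq(
      StructField("_1", StringType, nullable = false),
      StructField("_2", IntegerType, nullable = false)))
    def read(row: Row) = (row.getString(0), row.getInt(1))
    def write(v: (String, Int)) = Row(v._1, v._2)
  }
}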


Re: Spark Streaming: question on sticky session across batches ?

2016-11-14 Thread Manish Malhotra
Sending again.
Any help is appreciated!

Thanks in advance.

On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <
manish.malhotra.w...@gmail.com> wrote:

> Hello Spark Devs/Users,
>
> I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
> batch (say 2 mins), data needs to go to the same reducer node after
> grouping by key.
> The underlying storage is Cassandra and not HDFS.
>
> This is a map-reduce job, where we are also trying to use the partitions of
> the Cassandra table to batch the data for the same partition.
>
> The requirement for a sticky session/partition across batches is that the
> operations we need to do have to read the data for every key and then merge
> it with the current batch's aggregate values. So, currently, with no
> stickiness across batches, we have to read for every key, merge, and then
> write back, and reads are very expensive. If we had sticky sessions, we
> could avoid the read in every batch and keep a cache of the aggregates up to
> the last batch.
>
> So, there are a few options I can think of:
>
> 1. Change the TaskSchedulerImpl, as it uses Random to identify the
> node for the mapper/reducer before starting the batch/phase.
> Not sure if there is a custom-scheduler way of achieving this?
>
> 2. Use a custom RDD to map each key to a node;
> there is a getPreferredLocations() method.
> But I'm not sure whether this will be stable or can vary in some edge
> cases?
>
> Thanks in advance for your help and time!
>
> Regards,
> Manish
>
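
A minimal sketch of option 2 above: a custom RDD whose partitions carry a
preferred host derived from some key-to-host mapping. Everything here is
hypothetical scaffolding (the mapping, the dummy compute), and note that
preferred locations are only hints to the scheduler, not a guarantee of
stickiness, which is exactly the "can vary in some edge cases" concern.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Partition that remembers which host we would like its task to run on.
class KeyedPartition(val index: Int, val preferredHost: String) extends Partition

class StickyRDD(sc: SparkContext, hostForPartition: Int => String, numParts: Int)
  extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(i => new KeyedPartition(i, hostForPartition(i)))

  // Locality preference consulted by the scheduler for each partition.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[KeyedPartition].preferredHost)

  // Dummy compute; a real implementation would read this partition's keys
  // (e.g. from Cassandra) and merge them with the cached aggregates here.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}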


Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
agreed on your point that this can be done without macros

On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin 
wrote:

> You don't need compiler time macros for this, you can do it quite easily
> using shapeless. I've been playing with a project which borrows ideas from
> spray-json and spray-json-shapeless to implement Row marshalling for
> arbitrary case classes. It's checked and generated at compile time,
> supports arbitrary/nested case classes, and allows custom types. It is also
> entirely pluggable meaning you can bypass the default implementations and
> provide your own, just like any type class.
>
> https://github.com/upio/spark-sql-formats
>
>
> *From:* Michael Armbrust 
> *Date:* October 26, 2016 at 12:50:23 PM PDT
> *To:* Koert Kuipers 
> *Cc:* Ryan Blue , "dev@spark.apache.org" <
> dev@spark.apache.org>
> *Subject:* *Re: getting encoder implicits to be more accurate*
>
> Sorry, I realize that set is only one example here, but I don't think that
> making the type of the implicit more narrow to include only ProductN or
> something eliminates the issue.  Even with that change, we will fail to
> generate an encoder with the same error if you, for example, have a field
> of your case class that is an unsupported type.
>
>
>
> Short of changing this to compile-time macros, I think we are stuck with
> this class of errors at runtime.  The simplest solution seems to be to
> expand the set of things we can handle as much as possible and allow users
> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
> make this the default though, as behavior would change with each release
> that adds support for more types.  I would be very supportive of making
> this fallback a built-in option though.
>
>
>
> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers  wrote:
>
> yup, it doesnt really solve the underlying issue.
>
> we fixed it internally by having our own typeclass that produces encoders
> and that does check the contents of the products, but we did this by simply
> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
> Product, since we dont have a need for case classes
>
> if case classes extended ProductN (which they will i think in scala 2.12?)
> then we could drop Product and support Product1 - Product22 and Option
> explicitly while checking the classes they contain. that would be the
> cleanest.
>
>
>
> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue  wrote:
>
> Isn't the problem that Option is a Product and the class it contains isn't
> checked? Adding support for Set fixes the example, but the problem would
> happen with any class there isn't an encoder for, right?
>
>
>
> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust 
> wrote:
>
> Hmm, that is unfortunate.  Maybe the best solution is to add support for
> sets?  I don't think that would be super hard.
>
>
>
> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers  wrote:
>
> i am trying to use encoders as a typeclass where if it fails to find an
> ExpressionEncoder it falls back to KryoEncoder.
>
> the issue seems to be that ExpressionEncoder claims a little more than it
> can handle here:
>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
> Encoders.product[T]
>
> this "claims" to handle for example Option[Set[Int]], but it really cannot
> handle Set so it leads to a runtime exception.
>
> would it be useful to make this a little more specific? i guess the
> challenge is going to be case classes which unfortunately dont extend
> Product1, Product2, etc.
>
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>


Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
sorry last line should be:
once i have InternalRows lots of good stuff is available to me to use.

On Mon, Nov 14, 2016 at 3:03 PM, Koert Kuipers  wrote:

> that makes sense. we have something like that inhouse as well, but not as
> nice and not using shapeless (we simply relied on sbt-boilerplate to handle
> all tuples and do not support case classes).
>
> however the frustrating part is that spark sql already has this more or
> less. look for example at ExpressionEncoder.fromRow and
> ExpressionEncoder.toRow. but these methods use InternalRow while the rows
> exposed to me as a user are not that.
>
> at this point i am more tempted to simply open up InternalRow at a few
> places strategically than to maintain another inhouse row marshalling
> class. once i have InternalRows looks of good stuff is available to me to
> use.
>
>
>
>
> On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin 
> wrote:
>
>> You don't need compiler time macros for this, you can do it quite easily
>> using shapeless. I've been playing with a project which borrows ideas from
>> spray-json and spray-json-shapeless to implement Row marshalling for
>> arbitrary case classes. It's checked and generated at compile time,
>> supports arbitrary/nested case classes, and allows custom types. It is also
>> entirely pluggable meaning you can bypass the default implementations and
>> provide your own, just like any type class.
>>
>> https://github.com/upio/spark-sql-formats
>>
>>
>> *From:* Michael Armbrust 
>> *Date:* October 26, 2016 at 12:50:23 PM PDT
>> *To:* Koert Kuipers 
>> *Cc:* Ryan Blue , "dev@spark.apache.org" <
>> dev@spark.apache.org>
>> *Subject:* *Re: getting encoder implicits to be more accurate*
>>
>> Sorry, I realize that set is only one example here, but I don't think
>> that making the type of the implicit more narrow to include only ProductN
>> or something eliminates the issue.  Even with that change, we will fail to
>> generate an encoder with the same error if you, for example, have a field
>> of your case class that is an unsupported type.
>>
>>
>>
>> Short of changing this to compile-time macros, I think we are stuck with
>> this class of errors at runtime.  The simplest solution seems to be to
>> expand the set of things we can handle as much as possible and allow users
>> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
>> make this the default though, as behavior would change with each release
>> that adds support for more types.  I would be very supportive of making
>> this fallback a built-in option though.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers 
>> wrote:
>>
>> yup, it doesnt really solve the underlying issue.
>>
>> we fixed it internally by having our own typeclass that produces encoders
>> and that does check the contents of the products, but we did this by simply
>> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
>> Product, since we dont have a need for case classes
>>
>> if case classes extended ProductN (which they will i think in scala
>> 2.12?) then we could drop Product and support Product1 - Product22 and
>> Option explicitly while checking the classes they contain. that would be
>> the cleanest.
>>
>>
>>
>> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue  wrote:
>>
>> Isn't the problem that Option is a Product and the class it contains
>> isn't checked? Adding support for Set fixes the example, but the problem
>> would happen with any class there isn't an encoder for, right?
>>
>>
>>
>> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>> Hmm, that is unfortunate.  Maybe the best solution is to add support for
>> sets?  I don't think that would be super hard.
>>
>>
>>
>> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers  wrote:
>>
>> i am trying to use encoders as a typeclass where if it fails to find an
>> ExpressionEncoder it falls back to KryoEncoder.
>>
>> the issue seems to be that ExpressionEncoder claims a little more than it
>> can handle here:
>>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
>> Encoders.product[T]
>>
>> this "claims" to handle for example Option[Set[Int]], but it really
>> cannot handle Set so it leads to a runtime exception.
>>
>> would it be useful to make this a little more specific? i guess the
>> challenge is going to be case classes which unfortunately dont extend
>> Product1, Product2, etc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>


Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
that makes sense. we have something like that inhouse as well, but not as
nice and not using shapeless (we simply relied on sbt-boilerplate to handle
all tuples and do not support case classes).

however the frustrating part is that spark sql already has this more or
less. look for example at ExpressionEncoder.fromRow and
ExpressionEncoder.toRow. but these methods use InternalRow while the rows
exposed to me as a user are not that.

at this point i am more tempted to simply open up InternalRow at a few
places strategically than to maintain another inhouse row marshalling
class. once i have InternalRows looks of good stuff is available to me to
use.




On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin 
wrote:

> You don't need compiler time macros for this, you can do it quite easily
> using shapeless. I've been playing with a project which borrows ideas from
> spray-json and spray-json-shapeless to implement Row marshalling for
> arbitrary case classes. It's checked and generated at compile time,
> supports arbitrary/nested case classes, and allows custom types. It is also
> entirely pluggable meaning you can bypass the default implementations and
> provide your own, just like any type class.
>
> https://github.com/upio/spark-sql-formats
>
>
> *From:* Michael Armbrust 
> *Date:* October 26, 2016 at 12:50:23 PM PDT
> *To:* Koert Kuipers 
> *Cc:* Ryan Blue , "dev@spark.apache.org" <
> dev@spark.apache.org>
> *Subject:* *Re: getting encoder implicits to be more accurate*
>
> Sorry, I realize that set is only one example here, but I don't think that
> making the type of the implicit more narrow to include only ProductN or
> something eliminates the issue.  Even with that change, we will fail to
> generate an encoder with the same error if you, for example, have a field
> of your case class that is an unsupported type.
>
>
>
> Short of changing this to compile-time macros, I think we are stuck with
> this class of errors at runtime.  The simplest solution seems to be to
> expand the set of things we can handle as much as possible and allow users
> to turn on a kryo fallback for expression encoders.  I'd be hesitant to
> make this the default though, as behavior would change with each release
> that adds support for more types.  I would be very supportive of making
> this fallback a built-in option though.
>
>
>
> On Wed, Oct 26, 2016 at 11:47 AM, Koert Kuipers  wrote:
>
> yup, it doesnt really solve the underlying issue.
>
> we fixed it internally by having our own typeclass that produces encoders
> and that does check the contents of the products, but we did this by simply
> supporting Tuple1 - Tuple22 and Option explicitly, and not supporting
> Product, since we dont have a need for case classes
>
> if case classes extended ProductN (which they will i think in scala 2.12?)
> then we could drop Product and support Product1 - Product22 and Option
> explicitly while checking the classes they contain. that would be the
> cleanest.
>
>
>
> On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue  wrote:
>
> Isn't the problem that Option is a Product and the class it contains isn't
> checked? Adding support for Set fixes the example, but the problem would
> happen with any class there isn't an encoder for, right?
>
>
>
> On Wed, Oct 26, 2016 at 11:18 AM, Michael Armbrust 
> wrote:
>
> Hmm, that is unfortunate.  Maybe the best solution is to add support for
> sets?  I don't think that would be super hard.
>
>
>
> On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers  wrote:
>
> i am trying to use encoders as a typeclass where if it fails to find an
> ExpressionEncoder it falls back to KryoEncoder.
>
> the issue seems to be that ExpressionEncoder claims a little more than it
> can handle here:
>   implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
> Encoders.product[T]
>
> this "claims" to handle for example Option[Set[Int]], but it really cannot
> handle Set so it leads to a runtime exception.
>
> would it be useful to make this a little more specific? i guess the
> challenge is going to be case classes which unfortunately dont extend
> Product1, Product2, etc.
>
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>


Re: Two questions about running spark on mesos

2016-11-14 Thread Joseph Wu
1) You should read through this page:
https://spark.apache.org/docs/latest/running-on-mesos.html
I (Mesos person) can't answer any questions that aren't already answered on
that page :)

2) Your normal spark commands (whatever they are) should still work
regardless of the backend.

On Mon, Nov 14, 2016 at 2:58 AM, Yu Wei  wrote:

> Hi Guys,
>
>
> Two questions about running spark on mesos.
>
> 1, Does spark configuration of conf/slaves still work when running spark
> on mesos?
>
> According to my observations, it seemed that conf/slaves still took
> effect when running spark-shell.
>
> However, it doesn't take effect when deploying in cluster mode.
>
> Is this expected behavior?
>
>Or did I miss anything?
>
>
> 2, Could I kill submitted jobs when running spark on mesos in cluster mode?
>
> I launched spark on mesos in cluster mode, then submitted a long-running
> job, which succeeded.
>
> Then I want to kill the job.
>
> How could I do that? Is there any similar commands as launching spark
> on yarn?
>
>
>
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
>


Re: Two questions about running spark on mesos

2016-11-14 Thread Michael Gummelt
1. I had never even heard of conf/slaves until this email, and I only see
it referenced in the docs next to Spark Standalone, so I doubt that works.

2. Yes.  See the --kill option in spark-submit.

Also, we're considering dropping the Spark dispatcher in DC/OS in favor of
Metronome, which will be our consolidated method of running any one-off
jobs.  The dispatcher is really just a lesser-maintained and more
feature-sparse Metronome.  If I were you, I would look into running
Metronome rather than the dispatcher (or just run DC/OS).

On Mon, Nov 14, 2016 at 3:10 AM, Yu Wei  wrote:

> Hi Guys,
>
>
> Two questions about running spark on mesos.
>
> 1, Does spark configuration of conf/slaves still work when running spark
> on mesos?
>
> According to my observations, it seemed that conf/slaves still took
> effect when running spark-shell.
>
> However, it doesn't take effect when deploying in cluster mode.
>
> Is this expected behavior?
>
>Or did I miss anything?
>
>
> 2, Could I kill submitted jobs when running spark on mesos in cluster mode?
>
> I launched spark on mesos in cluster mode, then submitted a long-running
> job, which succeeded.
>
> Then I want to kill the job.
> How could I do that? Is there any similar commands as launching spark
> on yarn?
>
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
>



-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
Historically TPC-DS and TPC-H. There is certainly a chance of overfitting one
or two benchmarks. Note that those will probably be impacted more by the
way we set the parameters for CBO rather than using x or y for summary
statistics.

On Monday, November 14, 2016, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Do we have any query workloads for which we can benchmark these
> proposals in terms of performance ?
>
> Thanks
> Shivaram
>
> On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin wrote:
> > One additional note: in terms of size, the size of a count-min sketch
> with
> > eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
> >
> > To look up what that means, see
> > http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
> >
> >
> >
> >
> >
> > On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin wrote:
> >>
> >> I want to bring this discussion to the dev list to gather broader
> >> feedback, as there have been some discussions that happened over
> multiple
> >> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
> >> statistics to collect and how to use them.
> >>
> >> There are some basic statistics on columns that are obvious to use and
> we
> >> don't need to debate these: estimated size (in bytes), row count, min,
> max,
> >> number of nulls, number of distinct values, average column length, max
> >> column length.
> >>
> >> In addition, we want to be able to estimate selectivity for equality and
> >> range predicates better, especially taking into account skewed values
> and
> >> outliers.
> >>
> >> Before I dive into the different options, let me first explain count-min
> >> sketch: Count-min sketch is a common sketch algorithm that tracks
> frequency
> >> counts. It has the following nice properties:
> >> - sublinear space
> >> - can be generated in one-pass in a streaming fashion
> >> - can be incrementally maintained (i.e. for appending new data)
> >> - it's already implemented in Spark
> >> - more accurate for frequent values, and less accurate for less-frequent
> >> values, i.e. it tracks skewed values well.
> >> - easy to compute inner product, i.e. trivial to compute the count-min
> >> sketch of a join given two count-min sketches of the join tables
> >>
> >>
> >> Proposal 1 is to use a combination of count-min sketch and
> equi-height
> >> histograms. In this case, count-min sketch will be used for selectivity
> >> estimation on equality predicates, and histogram will be used on range
> >> predicates.
> >>
> >> Proposal 2 is to just use count-min sketch on equality predicates, and
> >> then simple selected_range / (max - min) will be used for range
> predicates.
> >> This will be less accurate than using histogram, but simpler because we
> >> don't need to collect histograms.
> >>
> >> Proposal 3 is a variant of proposal 2, and takes into account that
> skewed
> >> values can impact selectivity heavily. In 3, we track the list of heavy
> >> hitters (HH, most frequent items) along with count-min sketch on the
> column.
> >> Then:
> >> - use count-min sketch on equality predicates
> >> - for range predicates, estimatedFreq =  sum(freq(HHInRange)) + range /
> >> (max - min)
> >>
> >> Proposal 4 is to not use any sketch, and use histogram for high
> >> cardinality columns, and exact (value, frequency) pairs for low
> cardinality
> >> columns (e.g. num distinct value <= 255).
> >>
> >> Proposal 5 is a variant of proposal 4, and adapts it to track exact
> >> (value, frequency) pairs for the most frequent values only, so we can
> still
> >> have that for high cardinality columns. This is actually very similar to
> >> count-min sketch, but might use less space, although requiring two
> passes to
> >> compute the initial value, and more difficult to compute the inner
> product
> >> for joins.
> >>
> >>
> >>
> >
>
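
As a concrete reference point for the discussion above, the count-min
sketch is already exposed both as a standalone utility and through
DataFrameStatFunctions. Below is a rough sketch of how it could back an
equality-predicate selectivity estimate; the CBO wiring itself is an
assumption here, not something that exists in Spark today.

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.CountMinSketch

val spark = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._

val df = Seq(1, 1, 1, 2, 3, 3).toDF("x")

// Build a count-min sketch over column x.
// Arguments: column name, eps (relative error), confidence, seed.
val cms: CountMinSketch = df.stat.countMinSketch("x", 0.001, 0.87, 42)

// Estimated selectivity of the equality predicate `x = 1`. With probability
// at least `confidence`, estimateCount overestimates by at most eps * totalCount.
val estimatedRows = cms.estimateCount(1)
val selectivity = estimatedRows.toDouble / cms.totalCount()

// Sketches built per partition or per appended batch can be combined, which
// is what makes incremental maintenance cheap:
// cms.mergeInPlace(otherSketch)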


Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Shivaram Venkataraman
Do we have any query workloads for which we can benchmark these
proposals in terms of performance ?

Thanks
Shivaram

On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin  wrote:
> One additional note: in terms of size, the size of a count-min sketch with
> eps = 0.1% and confidence 0.87, uncompressed, is 48k bytes.
>
> To look up what that means, see
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/util/sketch/CountMinSketch.html
>
>
>
>
>
> On Sun, Nov 13, 2016 at 5:30 PM, Reynold Xin  wrote:
>>
>> I want to bring this discussion to the dev list to gather broader
>> feedback, as there have been some discussions that happened over multiple
>> JIRA tickets (SPARK-16026, etc) and GitHub pull requests about what
>> statistics to collect and how to use them.
>>
>> There are some basic statistics on columns that are obvious to use and we
>> don't need to debate these: estimated size (in bytes), row count, min, max,
>> number of nulls, number of distinct values, average column length, max
>> column length.
>>
>> In addition, we want to be able to estimate selectivity for equality and
>> range predicates better, especially taking into account skewed values and
>> outliers.
>>
>> Before I dive into the different options, let me first explain count-min
>> sketch: Count-min sketch is a common sketch algorithm that tracks frequency
>> counts. It has the following nice properties:
>> - sublinear space
>> - can be generated in one-pass in a streaming fashion
>> - can be incrementally maintained (i.e. for appending new data)
>> - it's already implemented in Spark
>> - more accurate for frequent values, and less accurate for less-frequent
>> values, i.e. it tracks skewed values well.
>> - easy to compute inner product, i.e. trivial to compute the count-min
>> sketch of a join given two count-min sketches of the join tables
>>
>>
>> Proposal 1 is to use a combination of count-min sketch and equi-height
>> histograms. In this case, count-min sketch will be used for selectivity
>> estimation on equality predicates, and histogram will be used on range
>> predicates.
>>
>> Proposal 2 is to just use count-min sketch on equality predicates, and
>> then simple selected_range / (max - min) will be used for range predicates.
>> This will be less accurate than using histogram, but simpler because we
>> don't need to collect histograms.
>>
>> Proposal 3 is a variant of proposal 2, and takes into account that skewed
>> values can impact selectivity heavily. In 3, we track the list of heavy
>> hitters (HH, most frequent items) along with count-min sketch on the column.
>> Then:
>> - use count-min sketch on equality predicates
>> - for range predicates, estimatedFreq =  sum(freq(HHInRange)) + range /
>> (max - min)
>>
>> Proposal 4 is to not use any sketch, and use histogram for high
>> cardinality columns, and exact (value, frequency) pairs for low cardinality
>> columns (e.g. num distinct value <= 255).
>>
>> Proposal 5 is a variant of proposal 4, and adapts it to track exact
>> (value, frequency) pairs for the most frequent values only, so we can still
>> have that for high cardinality columns. This is actually very similar to
>> count-min sketch, but might use less space, although requiring two passes to
>> compute the initial value, and more difficult to compute the inner product
>> for joins.
>>
>>
>>
>
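
Proposal 3's range estimate, as quoted above, combines exact counts for the
tracked heavy hitters with a uniform assumption for everything else. A
literal transcription in plain Scala, with entirely hypothetical inputs
(the heavy-hitter map, min/max, and the scaling of the uniform term are
assumptions about how the proposal would be wired up):

// Frequencies of tracked heavy hitters, e.g. harvested while building the
// count-min sketch for the column.
val heavyHitters: Map[Int, Long] = Map(0 -> 500000L, 42 -> 250000L)

val totalRows  = 1000000L
val (min, max) = (0, 1000)

// Estimated number of rows matching `col BETWEEN lo AND hi`.
def estimateRange(lo: Int, hi: Int): Double = {
  val hhInRange   = heavyHitters.collect { case (v, f) if v >= lo && v <= hi => f }.sum
  val nonHHRows   = totalRows - heavyHitters.values.sum
  val uniformPart = (hi - lo).toDouble / (max - min) * nonHHRows
  hhInRange + uniformPart
}

estimateRange(0, 100)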




subscribe

2016-11-14 Thread Yu Wei


Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux


Two questions about running spark on mesos

2016-11-14 Thread Yu Wei
Hi Guys,


Two questions about running spark on mesos.

1, Does spark configuration of conf/slaves still work when running spark on 
mesos?

According to my observations, it seemed that conf/slaves still took effect 
when running spark-shell.

However, it doesn't take effect when deploying in cluster mode.

Is this expected behavior?

   Or did I miss anything?


2, Could I kill submitted jobs when running spark on mesos in cluster mode?

I launched spark on mesos in cluster mode, then submitted a long-running
job, which succeeded.

Then I want to kill the job.

How could I do that? Is there any similar commands as launching spark on 
yarn?




Thanks,

Jared, (韦煜)
Software developer
Interested in open source software, big data, Linux