Re: separate spark and hive

2016-11-14 Thread Reynold Xin
If you just start a SparkSession without calling enableHiveSupport, it actually won't use the Hive catalog support. On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf wrote: > The default generation of spark context is actually a hive context. > > I tried to find on the
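A minimal sketch of the distinction Reynold describes (app name and master are illustrative placeholders):

    import org.apache.spark.sql.SparkSession

    // Plain session: backed by Spark's in-memory catalog, no Hive metastore.
    val spark = SparkSession.builder()
      .appName("no-hive")
      .master("local[*]")
      .getOrCreate()

    // Hive-backed session: only this explicit opt-in wires up the Hive catalog.
    val hiveSpark = SparkSession.builder()
      .appName("with-hive")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

Note that within one JVM the second getOrCreate() reuses the first session's context, so in practice an application picks one or the other.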

RE: separate spark and hive

2016-11-14 Thread Mendelson, Assaf
The default generation of a Spark context is actually a Hive context. I tried to find in the documentation what the differences are between a Hive context and an SQL context, and couldn't find it for Spark 2.0 (I know for previous versions there were a couple of functions which required a Hive context as

Re: separate spark and hive

2016-11-14 Thread Reynold Xin
I agree with the high-level idea, and thus SPARK-15691. In reality, it's a huge amount of work to create & maintain a custom catalog. It might actually make sense to do, but it just seems a lot of work to do right now and it'd take a toll on

separate spark and hive

2016-11-14 Thread assaf.mendelson
Hi, today we basically force people to use Hive if they want the full use of Spark SQL. With the default installation, this means that a derby.log and a metastore_db directory are created wherever we run from. The problem with this is that if we run multiple scripts from the same working
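One hedged workaround sketch for the problem described here is to pin the embedded metastore to a fixed location instead of the current working directory; javax.jdo.option.ConnectionURL is the standard Hive knob for the Derby path, and the /tmp paths are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("fixed-metastore")
      // Put metastore_db at a fixed path rather than wherever the script runs.
      .config("javax.jdo.option.ConnectionURL",
              "jdbc:derby:;databaseName=/tmp/spark/metastore_db;create=true")
      // Warehouse location for managed tables (Spark 2.0 setting).
      .config("spark.sql.warehouse.dir", "/tmp/spark/warehouse")
      .enableHiveSupport()
      .getOrCreate()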

Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
Good catch. Updated! On Mon, Nov 14, 2016 at 11:13 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > FWIW 2.0.1 is also used in the 'Link With Spark' and 'Spark Source > Code Management' sections in that page. > > Shivaram > > On Mon, Nov 14, 2016 at 11:11 PM, Reynold Xin

RE: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread assaf.mendelson
I am not sure I understand when the statistics would be calculated. Would they always be calculated, or just when ANALYZE is called? Would it be possible to save analysis results as part of DataFrame saving (e.g. when writing it to Parquet), or do we have to have a consistent Hive installation?
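For context, the statistics entry point that exists today is the ANALYZE TABLE command against a catalog table; a minimal sketch, assuming an existing SparkSession named spark and a hypothetical table sales:

    // Scans the table and stores table-level stats (e.g. size in bytes,
    // row count) in the catalog for the optimizer to use.
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

    // NOSCAN collects only size-based stats without reading the data.
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS NOSCAN")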

Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Shivaram Venkataraman
FWIW 2.0.1 is also used in the 'Link With Spark' and 'Spark Source Code Management' sections in that page. Shivaram On Mon, Nov 14, 2016 at 11:11 PM, Reynold Xin wrote: > It's on there on the page (both the release notes and the download version > dropdown). > > The one

Re: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
It's there on the page (both the release notes and the download version dropdown). The one-line text is outdated. I'm just going to delete that text, as a matter of fact, so we don't run into this issue in the future. On Mon, Nov 14, 2016 at 11:09 PM, assaf.mendelson

RE: [ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread assaf.mendelson
While you can download Spark 2.0.2, the description still says Spark 2.0.1: "Our latest stable version is Apache Spark 2.0.1, released on Oct 3, 2016 (release notes) (git tag)" From:

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
They are not yet complete. The benchmark was done with an implementation of a cost-based optimizer that Huawei had internally for Spark 1.5 (or some even older version). On Mon, Nov 14, 2016 at 10:46 PM, Yogesh Mahajan wrote: > It looks like the Huawei team has run the TPC-H benchmark

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Yogesh Mahajan
Thanks, Reynold, for the detailed proposals. A few questions/clarifications: 1) How will the existing rule-based optimizations co-exist with CBO? The existing rules are heuristic/empirical; I am assuming rules like predicate pushdown or project pruning will co-exist with CBO and we just want to

[ANNOUNCE] Apache Spark 2.0.2

2016-11-14 Thread Reynold Xin
We are happy to announce the availability of Spark 2.0.2! Apache Spark 2.0.2 is a maintenance release containing 90 bug fixes along with Kafka 0.10 support and runtime metrics for Structured Streaming. This release is based on the branch-2.0 maintenance branch of Spark. We strongly recommend all

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Reynold Xin
The issue is now resolved. On Mon, Nov 14, 2016 at 3:08 PM, Sean Owen wrote: > Yes, it's on Maven. We have some problem syncing the web site changes at > the moment though those are committed too. I think that's the only piece > before a formal announcement. > > > On Mon,

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread Mark Hamstra
Take a look at spark.sql.adaptive.enabled and the ExchangeCoordinator. A single, fixed-size sql.shuffle.partitions is not the only way to control the number of partitions in an Exchange -- if you are willing to deal with code that is still off by default. On Mon, Nov 14, 2016 at 4:19 PM,
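A hedged sketch of turning that path on, assuming an existing SparkSession named spark (the 64 MB target shown is an arbitrary choice):

    // Off by default; enables the ExchangeCoordinator-based adaptive shuffle.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    // Rough target bytes per post-shuffle partition; the coordinator merges
    // small shuffle partitions until each is about this size.
    spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")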

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread leo9r
Hi Daniel, I completely agree with your request. As the amount of data being processed with Spark SQL grows, tweaking sql.shuffle.partitions becomes a common need to prevent OOM and performance degradation. The fact that sql.shuffle.partitions cannot be set several times in the same job/action,
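To make the limitation concrete, a hedged sketch assuming an existing SparkSession named spark and placeholder DataFrames df and other: the setting is read at planning time, so it can vary between actions but not between stages of a single action:

    spark.conf.set("spark.sql.shuffle.partitions", "200")
    df.groupBy("k").count().collect()     // planned with 200 shuffle partitions

    spark.conf.set("spark.sql.shuffle.partitions", "2000")
    df.join(other, "k").collect()         // planned with 2000 shuffle partitions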

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Sean Owen
Yes, it's on Maven. We have some problems syncing the website changes at the moment, though those are committed too. I think that's the only piece before a formal announcement. On Mon, Nov 14, 2016 at 9:49 PM Nicholas Chammas wrote: > Has the release already been

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Michael Armbrust
I would definitely like to open up APIs for people to write their own encoders. The challenge thus far has been that Encoders use internal APIs, which have not been stable, for translating the data into the tungsten format. We also make use of the analyzer to figure out the mapping from columns to
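The escape hatch that exists today for unsupported types is a Kryo-backed encoder, which round-trips the object but stores it as an opaque binary column; a minimal sketch, assuming an existing SparkSession named spark and a hypothetical Point class:

    import org.apache.spark.sql.{Encoder, Encoders}

    class Point(val x: Double, val y: Double) extends Serializable

    // Serializes the whole object into one binary column: correct, but gives
    // up the columnar tungsten layout and column pruning.
    implicit val pointEncoder: Encoder[Point] = Encoders.kryo[Point]

    val ds = spark.createDataset(Seq(new Point(1.0, 2.0), new Point(3.0, 4.0)))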

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Josh Rosen
He pushed the 2.0.2 release docs, but there's a problem with Git mirroring of the Spark website repo that is interfering with publishing: https://issues.apache.org/jira/browse/INFRA-12913 On Mon, Nov 14, 2016 at 1:15 PM Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > The

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Sam Goodwin
I wouldn't recommend using a Tuple, as you end up with column names like "_1" and "_2", but it will still work :) ExpressionEncoder can do the same thing, but it doesn't support custom types and, as far as I can tell, does not support custom implementations. I.e., is it possible for me to write my
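To illustrate the column-name point, a quick sketch, assuming an existing SparkSession named spark:

    import spark.implicits._

    // Tuples encode fine, but the columns come out as _1 and _2 ...
    val ds = Seq(("a", 1), ("b", 2)).toDS()
    ds.printSchema()                 // _1: string, _2: int

    // ... so either rename them, or use a case class for named columns.
    val renamed = ds.toDF("word", "count")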

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Shivaram Venkataraman
The release is available on http://www.apache.org/dist/spark/ and it's on Maven Central: http://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.0.2/ I guess Reynold hasn't yet put together the release notes / updates to the website. Thanks Shivaram On Mon, Nov 14, 2016 at 12:49 PM,
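For anyone pulling the new artifacts, the coordinates on Maven Central follow the usual pattern; an sbt sketch (module list abbreviated to two of the artifacts):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.2",
      "org.apache.spark" %% "spark-sql"  % "2.0.2"
    )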

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
Sorry, my earlier message was confusing. I was frustrated about how hard it is to use the Encoder machinery directly on Row objects myself; this is unrelated to the question of whether a shapeless-based approach like Sam suggests would be a better way to do encoders in general. On Mon, Nov 14, 2016 at 3:03

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-14 Thread Nicholas Chammas
Has the release already been made? I didn't see any announcement, but Homebrew has already updated to 2.0.2. On Fri, Nov 11, 2016 at 2:59 PM, Reynold Xin wrote: > The vote has passed with the following +1s and no -1. I will work on > packaging the release. > > +1: > > Reynold Xin*

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
Just taking it for a quick spin, it looks great, with correct support for nested rows and using Option for nullability. scala> val format = implicitly[RowFormat[(String, Seq[(String, Option[Int])])]] format: com.github.upio.spark.sql.RowFormat[(String, Seq[(String, Option[Int])])] =

Re: Spark Streaming: question on sticky session across batches ?

2016-11-14 Thread Manish Malhotra
Sending again; any help is appreciated! Thanks in advance. On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra < manish.malhotra.w...@gmail.com> wrote: > Hello Spark Devs/Users, > > I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every > batch (say 2 mins), data needs to go
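One common approach to per-key state that survives across batches in Spark Streaming 1.6.x is mapWithState, which keeps state co-partitioned by key; a hedged sketch with placeholder source, paths, and batch interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    case class Session(count: Long)

    val conf = new SparkConf().setAppName("sticky-sessions").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(120))   // ~2 minute batches
    ssc.checkpoint("/tmp/sticky-sessions")               // required by mapWithState

    // Placeholder source, keyed by session id.
    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    // State for a key lives across batches and stays in the same partition,
    // so updates for one session keep landing together.
    val spec = StateSpec.function(
      (key: String, value: Option[Int], state: State[Session]) => {
        val updated = Session(state.getOption().map(_.count).getOrElse(0L) + value.getOrElse(0))
        state.update(updated)
        (key, updated)
      })

    events.mapWithState(spec).print()
    ssc.start()
    ssc.awaitTermination()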

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Koert Kuipers
Agreed on your point that this can be done without macros. On Wed, Nov 2, 2016 at 12:15 AM, Sam Goodwin wrote: > You don't need compile-time macros for this; you can do it quite easily > using shapeless. I've been playing with a project which borrows ideas from >

Re: Two questions about running spark on mesos

2016-11-14 Thread Joseph Wu
1) You should read through this page: https://spark.apache.org/docs/latest/running-on-mesos.html I (Mesos person) can't answer any questions that aren't already answered on that page :) 2) Your normal spark commands (whatever they are) should still work regardless of the backend. On Mon, Nov 14,

Re: Two questions about running spark on mesos

2016-11-14 Thread Michael Gummelt
1. I had never even heard of conf/slaves until this email, and I only see it referenced in the docs next to Spark Standalone, so I doubt that works. 2. Yes. See the --kill option in spark-submit. Also, we're considering dropping the Spark dispatcher in DC/OS in favor of Metronome, which will be

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Reynold Xin
Historically, TPC-DS and TPC-H. There is certainly a chance of overfitting one or two benchmarks. Note that those will probably be impacted more by the way we set the parameters for CBO than by using x or y for summary statistics. On Monday, November 14, 2016, Shivaram Venkataraman <

Re: statistics collection and propagation for cost-based optimizer

2016-11-14 Thread Shivaram Venkataraman
Do we have any query workloads against which we can benchmark these proposals in terms of performance? Thanks Shivaram On Sun, Nov 13, 2016 at 5:53 PM, Reynold Xin wrote: > One additional note: in terms of size, the size of a count-min sketch with > eps = 0.1% and confidence
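For reference, the structure whose size is being quoted ships with Spark 2.x; a minimal sketch using the eps/confidence from the quote (the seed is arbitrary):

    import org.apache.spark.util.sketch.CountMinSketch

    val cms = CountMinSketch.create(0.001, 0.99, 42)   // eps = 0.1%, confidence = 99%

    Seq("a", "b", "a", "c", "a").foreach(item => cms.add(item))
    // Overestimates by at most eps * total count, with the given confidence.
    println(cms.estimateCount("a"))                    // 3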

subscribe

2016-11-14 Thread Yu Wei
Thanks, Jared. Software developer interested in open source software, big data, Linux