Re: Join Strategies

2018-01-15 Thread Herman van Hövell tot Westerflier
Hey Marco, A Cartesian product is an inner join by definition :). The current cartesian product operator does not support outer joins, so we use the only operator that does: BroadcastNestedLoopJoinExec. This is far from great, and it does have the potential to OOM; there are some safety nets in

Re: Whole-stage codegen and SparkPlan.newPredicate

2018-01-01 Thread Herman van Hövell tot Westerflier
org/ > jira/browse/SPARK-22934. > > Best Regards, > Kazuaki Ishizaki > > > > From:Herman van Hövell tot Westerflier <hvanhov...@databricks.com> > To:Jacek Laskowski <ja...@japila.pl> > Cc:dev <dev@spark.apache.org> > Date:2017

Re: Whole-stage codegen and SparkPlan.newPredicate

2017-12-31 Thread Herman van Hövell tot Westerflier
Hi Jacek, In this case whole stage code generation is turned off. However, we still use code generation for a lot of other things: projections, predicates, orderings & encoders. You are currently seeing a compile time failure while generating a predicate. There is currently no easy way to turn

Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-28 Thread Herman van Hövell tot Westerflier
+1 On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung wrote: > +1 > > Thanks Sean. Please vote! > > Tested various scenarios with R package. Ubuntu, Debian, Windows r-devel > and release and on r-hub. Verified CRAN checks are clean (only 1 NOTE!) and > no leaked files (.cache

Re: SparkSQL not support CharType

2017-11-23 Thread Herman van Hövell tot Westerflier
You need to use a StringType. The CharType and VarCharType are there to ensure compatibility with Hive and ORC; they should not be used anywhere else. On Thu, Nov 23, 2017 at 4:09 AM, 163 wrote: > Hi, > when I use Dataframe with table schema, It goes wrong: > > val

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Herman van Hövell tot Westerflier
+1 On Tue, Oct 3, 2017 at 1:32 PM, Sean Owen wrote: > +1 same as last RC. Tests pass, sigs and hashes are OK. > > On Tue, Oct 3, 2017 at 7:24 AM Holden Karau wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.1.2.

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Herman van Hövell tot Westerflier
+1 (binding) I personally believe that there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and will require us to stabilize

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Herman van Hövell tot Westerflier
Just move the case expression into an underlying select clause. On Thu, Jul 13, 2017 at 3:10 PM, Chang Chen wrote: > Hi Wenchen > > Yes. We also find this error is caused by Rand. However, this is classic > way to solve data skew in Hive. Is there any equivalent way in
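A minimal sketch of what moving the case expression into an underlying select could look like, assuming Spark 2.x with a local SparkSession. The view names (left_t, right_t), the column names and the salting expression are hypothetical and are not taken from the original thread.

```scala
import org.apache.spark.sql.SparkSession

object CaseInSubselectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("case-in-subselect-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical tables: the left side contains a NULL join key.
    Seq(("a", 1), (null: String, 2)).toDF("key", "value").createOrReplaceTempView("left_t")
    Seq(("a", "label-a")).toDF("key", "label").createOrReplaceTempView("right_t")

    // The CASE expression (which calls rand()) is computed in the inner select,
    // so the join condition itself only compares two plain columns.
    spark.sql(
      """SELECT l.value, r.label
        |FROM (
        |  SELECT value,
        |         CASE WHEN key IS NULL
        |              THEN concat('salt_', cast(rand() AS string))
        |              ELSE key
        |         END AS join_key
        |  FROM left_t
        |) l
        |LEFT JOIN right_t r ON l.join_key = r.key""".stripMargin).show()

    spark.stop()
  }
}
```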

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-02 Thread Herman van Hövell tot Westerflier
+1 On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida < ricardo.alme...@actnowib.com> wrote: > +1 (non-binding) > > Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive > -Phive-thriftserver -Pscala-2.11 on > >- macOS 10.12.5 Java 8 (build 1.8.0_131) >- Ubuntu 17.04,

Re: Question on Spark code

2017-06-25 Thread Herman van Hövell tot Westerflier
I am not getting the question. The logging trait does exactly what it says on the box; I don't see what string concatenation has to do with it. On Sun, Jun 25, 2017 at 11:27 AM, kant kodali wrote: > Hi All, > > I came across this file

Re: [build system] jenkins got itself wedged...

2017-05-16 Thread Herman van Hövell tot Westerflier
Thanks Shane! On Tue, May 16, 2017 at 5:18 PM, shane knapp wrote: > ...so i kicked it and it's now back up and happily building. > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > --

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Herman van Hövell tot Westerflier
Maciej, this is definitely a bug. I have opened https://github.com/apache/spark/pull/17810 to fix this. I don't think this should be a blocker for the release of 2.2; if there is another RC we will include it. On Sat, Apr 29, 2017 at 10:17 AM, Maciej Szymkiewicz wrote:

Re: New Optimizer Hint

2017-04-20 Thread Herman van Hövell tot Westerflier
Hi Michael, This sounds like a good idea. Can you open a JIRA to track this? My initial feedback on your proposal would be that you might want to express the no_collapse at the expression level and not at the plan level. HTH On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles

Re: [SQL] Unresolved reference with chained window functions.

2017-03-24 Thread Herman van Hövell tot Westerflier
This is definitely a bug in the CollapseWindow optimizer rule. I think we can use SPARK-20086 to track this. On Fri, Mar 24, 2017 at 9:28 PM, Maciej Szymkiewicz wrote: > Forwarded from SO

Re: [SQL]Analysis failed when combining Window function and GROUP BY in Spark2.x

2017-03-08 Thread Herman van Hövell tot Westerflier
You are seeing a bug in the Hive parser. Hive drops the window clause when it encounters a count(distinct ...). See https://issues.apache.org/jira/browse/HIVE-10141 for more information. Spark 1.6 plans this as a regular distinct aggregate (dropping the window clause), which is wrong. Spark 2.x

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Herman van Hövell tot Westerflier
Congrats Takuya! On Mon, Feb 13, 2017 at 11:27 PM, Neelesh Salian wrote: > Congratulations, Takuya! > > On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin wrote: > >> Hi all, >> >> Takuya-san has recently been elected an Apache Spark committer. He's

Re: [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses in Spark 2.x

2017-02-06 Thread Herman van Hövell tot Westerflier
Hi Stan, I have opened https://github.com/apache/spark/pull/16821 to fix this. On Mon, Feb 6, 2017 at 1:41 PM, StanZhai wrote: > Hi all, > > SQLParser fails to resolve nested CASE WHEN statement like this: > > select case when > (1) + > case when 1>0 then 1 else 0 end =

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Herman van Hövell tot Westerflier
Congrats! On Tue, Jan 24, 2017 at 10:20 PM, Felix Cheung wrote: > Congrats and welcome!! > > > -- > *From:* Reynold Xin > *Sent:* Tuesday, January 24, 2017 10:13:16 AM > *To:* dev@spark.apache.org > *Cc:* Burak Yavuz;

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Herman van Hövell tot Westerflier
@Jacek The maximum output of 200 fields for whole stage code generation has been chosen to prevent the code generated method from exceeding the 64KB code limit. There is absolutely no relation between this value and the number of partitions after a shuffle (if there were they should have used the

Re: shapeless in spark 2.1.0

2016-12-29 Thread Herman van Hövell tot Westerflier
Which dependency pulls in shapeless? On Thu, Dec 29, 2016 at 5:49 PM, Koert Kuipers wrote: > i just noticed that spark 2.1.0 bring in a new transitive dependency on > shapeless 2.0.0 > > shapeless is a popular library for scala users, and shapeless 2.0.0 is old > (2014) and

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Herman van Hövell tot Westerflier
+1 On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li wrote: > +1 > > Xiao Li > > 2016-12-16 12:19 GMT-08:00 Felix Cheung : > >> For R we have a license field in the DESCRIPTION, and this is standard >> practice (and requirement) for R packages. >> >>

Re: Green dot in web UI DAG visualization

2016-11-17 Thread Herman van Hövell tot Westerflier
Should I be able to see something? On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Some questions about this DAG visualization: > > [image: Screen Shot 2016-11-17 at 11.57.14 AM.png] > > 1. What's the meaning of the green dot? > 2. Should this be

Re: structured streaming and window functions

2016-11-17 Thread Herman van Hövell tot Westerflier
What kind of window functions are we talking about? Structured streaming only supports time window aggregates, not the more general sql window function (sum(x) over (partition by ... order by ...)) aggregates. The basic idea is that you use incremental aggregation and store the aggregation buffer

Re: Another Interesting Question on SPARK SQL

2016-11-17 Thread Herman van Hövell tot Westerflier
The diagram you have included is a depiction of the steps Catalyst (the Spark optimizer) takes to create an executable plan. Tungsten mainly comes into play during code generation and the actual execution. A datasource is represented by a LogicalRelation during analysis & optimization. The spark

Re: separate spark and hive

2016-11-15 Thread Herman van Hövell tot Westerflier
You can start Spark without Hive support by setting the spark.sql.catalogImplementation configuration to in-memory, for example: > > ./bin/spark-shell --master local[*] --conf > spark.sql.catalogImplementation=in-memory I would not change the default from Hive to Spark-only just yet. On Tue,
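The same setting can also be passed programmatically when building a session. A minimal sketch, assuming Spark 2.x; the application name is hypothetical, and the key has to be set before the session is created because it is a static configuration.

```scala
import org.apache.spark.sql.SparkSession

object InMemoryCatalogSketch {
  def main(args: Array[String]): Unit = {
    // Select the in-memory catalog instead of the Hive catalog.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("in-memory-catalog-sketch")
      .config("spark.sql.catalogImplementation", "in-memory")
      .getOrCreate()

    // Read the value back to confirm which catalog implementation is active.
    println(spark.conf.get("spark.sql.catalogImplementation"))

    spark.stop()
  }
}
```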

Re: Would "alter table add column" be supported in the future?

2016-11-09 Thread Herman van Hövell tot Westerflier
This is currently not on any roadmap I know of. You can open a JIRA ticket for this if you want to. On Wed, Nov 9, 2016 at 6:02 PM, 汪洋 wrote: > Hi, > > I notice that “alter table add column” command is banned in spark 2.0. > > Any plans on supporting it in the future?

Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Herman van Hövell tot Westerflier
Replied in the ticket. On Tue, Nov 8, 2016 at 11:36 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > SPARK-18367 : limit() > makes the lame walk again > > On Tue, Nov 8, 2016 at 5:00 PM Nicholas Chammas < > nicholas.cham...@gmail.com>

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Herman van Hövell tot Westerflier
+1 On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if > a majority of at least 3+1 PMC votes are cast. > > [ ] +1 Release

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Herman van Hövell tot Westerflier
+1 On Fri, Nov 4, 2016 at 7:20 PM, Michael Armbrust wrote: > +1 > > On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.0.2. The vote is open until Fri, Nov 4, 2016 at

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Herman van Hövell tot Westerflier
+1 On Thu, Nov 3, 2016 at 6:58 PM, Michael Armbrust wrote: > +1 > > On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 1.6.3. The vote is open until Sat, Nov 5, 2016 at

Re: encoders for more complex types

2016-10-27 Thread Herman van Hövell tot Westerflier
What kind of difficulties are you experiencing? On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers wrote: > i have been pushing my luck a bit and started using ExpressionEncoder for > more complex types like sequences of case classes etc. (where the case > classes only had

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Herman van Hövell tot Westerflier
+1 On Thu, Oct 27, 2016 at 9:18 AM, Reynold Xin wrote: > Greetings from Spark Summit Europe at Brussels. > > Please vote on releasing the following candidate as Apache Spark version > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if > a majority of

Re: collect_list alternative for SQLContext?

2016-10-25 Thread Herman van Hövell tot Westerflier
What version of Spark are you using? We introduced a Spark native collect_list in 2.0. It still has the usual caveats, but it should be quite a bit faster. On Tue, Oct 25, 2016 at 6:16 AM, Matt Smith wrote: > Is there an alternative function or design pattern for the

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Herman van Hövell tot Westerflier
Congratulations Xiao! Very well deserved! On Mon, Oct 3, 2016 at 10:46 PM, Reynold Xin wrote: > Hi all, > > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark > committer. Xiao has been a super active contributor to Spark SQL. Congrats > and welcome,

Re: [question] Why Spark SQL grammar allows : ?

2016-09-29 Thread Herman van Hövell tot Westerflier
Tejas, This is because we use the same rule to parse top level and nested data fields. For example: create table tbl_x( id bigint, nested struct ) Shows both syntaxes. We should split this rule into a top-level and a nested rule. Could you open a ticket? Thanks,

Re: https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread Herman van Hövell tot Westerflier
Hi Asaf, The current collect_list/collect_set implementations have room for improvement. We did not implement partial aggregation for these, because the idea of a partial aggregation is that we can reduce network traffic (by shipping fewer partially aggregated buffers); this does not really apply

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Herman van Hövell tot Westerflier
+1 (non-binding) On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida < ricardo.alme...@actnowib.com> wrote: > +1 (non-binding) > > Built and tested on > - Ubuntu 16.04 / OpenJDK 1.8.0_91 > - CentOS / Oracle Java 1.7.0_55 > (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn) > >

Re: Why Expression.deterministic method and Nondeterministic trait?

2016-09-23 Thread Herman van Hövell tot Westerflier
Jacek, A non-deterministic expression usually holds some state. The Nondeterministic trait makes sure a user can initialize this state properly. Take a look at InterpretedProjection

Re: How to get 2 years prior date from currentdate using Spark Sql

2016-09-07 Thread Herman van Hövell tot Westerflier
This is more of a @user question. You can write the following in SQL: select date '2016-09-07' - interval 2 years HTH On Wed, Sep 7, 2016 at 3:14 PM, Yong Zhang wrote: > sorry, should be date_sub > > > https://issues.apache.org/jira/browse/SPARK-8187 > [SPARK-8187] date/time
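A minimal sketch of running the interval expression above from Scala, assuming Spark 2.x with a local SparkSession. The variants using current_date() and add_months are additions for illustration, not part of the original reply.

```scala
import org.apache.spark.sql.SparkSession

object TwoYearsPriorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("two-years-prior-sketch")
      .getOrCreate()

    // The literal form quoted in the reply.
    spark.sql("SELECT date '2016-09-07' - interval 2 years AS two_years_prior").show()

    // The same arithmetic relative to today's date.
    spark.sql("SELECT current_date() - interval 2 years AS two_years_prior").show()

    // An alternative using the built-in add_months function (24 months back).
    spark.sql("SELECT add_months(current_date(), -24) AS two_years_prior").show()

    spark.stop()
  }
}
```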

Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Herman van Hövell tot Westerflier
Congrats Felix! On Mon, Aug 8, 2016 at 11:57 PM, dhruve ashar wrote: > Congrats Felix! > > On Mon, Aug 8, 2016 at 2:28 PM, Tarun Kumar wrote: > >> Congrats Felix! >> >> Tarun >> >> On Tue, Aug 9, 2016 at 12:57 AM, Timothy Chen

Re: Result code of whole stage codegen

2016-08-05 Thread Herman van Hövell tot Westerflier
Do you want to see the code that whole stage codegen produces? You can prepend a SQL statement with EXPLAIN CODEGEN ... Or you can add the following code to a DataFrame/Dataset command: import org.apache.spark.sql.execution.debug._ and call the debugCodegen() command on a
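Both approaches mentioned above can be sketched as follows, assuming Spark 2.x with a local SparkSession. The query itself (doubling the values of a small range) is hypothetical and only serves to produce some generated code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

object CodegenDebugSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("codegen-debug-sketch")
      .getOrCreate()

    // Dataset route: the debug._ import adds debugCodegen() to Datasets,
    // which prints the generated Java source for each codegen subtree.
    val df = spark.range(10).selectExpr("id * 2 AS doubled")
    df.debugCodegen()

    // SQL route: prefix the statement with EXPLAIN CODEGEN.
    spark.sql("EXPLAIN CODEGEN SELECT id * 2 AS doubled FROM range(10)").show(false)

    spark.stop()
  }
}
```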

Re: Where is DataFrame.scala in 2.0?

2016-06-03 Thread Herman van Hövell tot Westerflier
, Herman van Hövell tot Westerflier 2016-06-03 17:01 GMT+02:00 Gerhard Fiedler <gfied...@algebraixdata.com>: > When I look at the sources in Github, I see DataFrame.scala at > https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Herman van Hövell tot Westerflier
+1 2016-05-19 18:20 GMT+02:00 Xiangrui Meng : > +1 > > On Thu, May 19, 2016 at 9:18 AM Joseph Bradley > wrote: > >> +1 >> >> On Wed, May 18, 2016 at 10:49 AM, Reynold Xin >> wrote: >> >>> Hi Ovidiu-Cristian , >>> >>> The best

Re: Query parsing error for the join query between different database

2016-05-18 Thread Herman van Hövell tot Westerflier
'User' is a SQL2003 keyword. This is normally not a problem, except when you use it as a table alias (which you are doing). Change the alias or place it between backticks and you should be fine. 2016-05-18 23:51 GMT+02:00 JaeSung Jun : > It's spark 1.6.1 and hive 1.2.1
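A minimal sketch of the two workarounds (quoting the alias with backticks, or renaming it), written against the Spark 2.x SparkSession API rather than the 1.6.1 setup from the thread. The view name accounts and its columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object BacktickAliasSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("backtick-alias-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical table to query against.
    Seq((1, "alice"), (2, "bob")).toDF("id", "name").createOrReplaceTempView("accounts")

    // Option 1: keep the alias but place it between backticks.
    spark.sql("SELECT `user`.id, `user`.name FROM accounts `user`").show()

    // Option 2: pick an alias that is not a keyword.
    spark.sql("SELECT u.id, u.name FROM accounts u").show()

    spark.stop()
  }
}
```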

Re: explain codegen

2016-04-04 Thread Herman van Hövell tot Westerflier
No, it can't. You only need implicits when you are using the catalyst DSL. The error you get is due to the fact that the parser does not recognize the CODEGEN keyword (which was the case before we introduced this in

Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Herman van Hövell tot Westerflier
Hi Jerry, This is not on any roadmap. I (briefly) browsed through this, and it looks like some sort of a window function with very awkward syntax. I think Spark provides better constructs for this using dataframes/datasets/nested data... Feel free to submit a PR. Kind regards, Herman van

Re: Aggregation + Adding static column + Union + Projection = Problem

2016-02-26 Thread Herman van Hövell tot Westerflier
Hi Jiří, Thanks for your mail. Could you create a JIRA ticket for this: https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

Re: Spark SQL performance: version 1.6 vs version 1.5

2016-02-12 Thread Herman van Hövell tot Westerflier
Hi Tien-Dung, 1.6 plans single distinct aggregates like multiple distinct aggregates; this inherently causes some overhead but is more stable in case of high cardinalities. You can revert to the old behavior by setting the spark.sql.specializeSingleDistinctAggPlanning option to false. See also:

Re: Path to resource added with SQL: ADD FILE

2016-02-04 Thread Herman van Hövell tot Westerflier
Hi Antonio, I am not sure you got the silent treatment on the user list. Stackoverflow is also a good place to ask questions. Could you use an absolute path to add the jar file? So instead of './my resource file' (which is a relative path; this depends on where you started Spark), use something

Re: build error: code too big: specialStateTransition(int, IntStream)

2016-01-28 Thread Herman van Hövell tot Westerflier
Hi, I have only encountered 'code too large' errors when changing grammars. I am using SBT/Idea, no Eclipse. The size of an ANTLR Parser/Lexer is dependent on the rules inside the source grammar and the rules it depends on. So we should take a look at the IdentifiersParser.g/ExpressionParser.g;

Are we running SparkR tests in Jenkins?

2016-01-15 Thread Herman van Hövell tot Westerflier
supported as of Spark 2.0. > Use ./bin/spark-submit Are we still running R tests? Or just saying that this will be deprecated? Kind regards, Herman van Hövell tot Westerflier

Re: Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
Thanks. I'll merge the most recent master... Still curious if we can stop a build. Kind regards, Herman van Hövell tot Westerflier 2015-12-29 18:59 GMT+01:00 Ted Yu <yuzhih...@gmail.com>: > HiveThriftBinaryServerSuite got stuck. > > I thought Josh has fixed this issue: > &

Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
My AMPLAB jenkins build has been stuck for a few hours now: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull Is there a way for me to stop the build? Kind regards, Herman van Hövell

Re: Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
SH access. > > I've gone ahead killed the build for you. It looks like someone had > configured the pull request builder timeout to be 300 minutes (5 hours), > but I think we should consider decreasing that to match the timeout used by > the Spark full test jobs. > > On Tue, Dec 29,

Re: Lead operator not working as aggregation operator

2015-11-02 Thread Herman van Hövell tot Westerflier
vriendelijke groet/Kind regards, Herman van Hövell tot Westerflier QuestTec B.V. Torenwacht 98 2353 DC Leiderdorp hvanhov...@questtec.nl +31 6 420 590 27 2015-11-02 11:33 GMT+01:00 Shagun Sodhani <sshagunsodh...@gmail.com>: > Hi! I was trying out window functions in SparkSql (using hiv

Re: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Herman van Hövell tot Westerflier
We could also fall back to approximate count distincts when the user requests multiple count distincts. This is less invasive than throwing an AnalysisException, but it could violate the principle of least surprise. Met vriendelijke groet/Kind regards, Herman van Hövell tot Westerflier

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
Distinct Aggregate operators. Are there any opinions on this? Kind regards, Herman van Hövell tot Westerflier QuestTec B.V. Torenwacht 98 2353 DC Leiderdorp hvanhov...@questtec.nl +599 9 521 4402 2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>: > Inspired by

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
results as the ClearSpring implementation. You could easily export the HLL++ register values to the current ClearSpring implementation and export those. Met vriendelijke groet/Kind regards, Herman van Hövell tot Westerflier QuestTec B.V. Torenwacht 98 2353 DC Leiderdorp hvanhov...@questtec.nl