Re: Join Strategies

2018-01-15 Thread Herman van Hövell tot Westerflier
Hey Marco,

A Cartesian product is an inner join by definition :). The current
Cartesian product operator does not support outer joins, so we use the only
operator that does: BroadcastNestedLoopJoinExec. This is far from great,
and it does have the potential to OOM; there are some safety nets in the
driver, though, that should start complaining before you actually OOM.
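
For illustration, here is a minimal sketch (assuming a spark-shell style
SparkSession named `spark` with implicits imported) of an outer join with a
non-equi condition; because there are no equi-join keys, Spark falls back to
BroadcastNestedLoopJoinExec:

import spark.implicits._

val left  = Seq((1, 10), (2, 20)).toDF("id", "a")
val right = Seq((1, 15), (3, 30)).toDF("id", "b")

// Full outer join on a range predicate only -- no equality condition.
val joined = left.join(right, $"a" < $"b", "full_outer")
joined.explain()  // the physical plan contains BroadcastNestedLoopJoin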

An outer non-equi join is pretty hard to do in a distributed setting. This
is caused by two things:

   - There is no way to partition the data in such a way that you can
   exploit some locality (know that all the same keys are in one partition),
   unless you use only one partition or use some clever index.
   - You need to keep track of records that do not match the join condition
   if you are doing a full join or a join in which the stream side does not
   match the join side. This is the number one source of complexity in the
   current join implementations. If you can partition your data then you can
   track and emit unmatched rows as part of processing the partition. If you
   cannot (and you have more than 1 partition) then you need to send the
   unmatched rows (in some form) back to the driver and figure out which
   records actually have not been matched (see BroadcastNestedLoopJoinExec for
   example).

It is definitely doable to implement such a join; however, I have not seen
many JIRAs or user requests for this.

HTH

Herman


On Sat, Jan 13, 2018 at 6:41 AM, Marco Gaido  wrote:

> Hi dev,
>
> I have a question about how join strategies are defined.
>
> I see that CartesianProductExec is used only for InnerJoin, while for
> other kinds of joins BroadcastNestedLoopJoinExec is used.
> For reference:
> https://github.com/apache/spark/blob/cd9f49a2aed3799964976ead06080a0f7044a0c3/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L260
>
> Could you kindly explain why this is done? It doesn't seem a great choice
> to me, since BroadcastNestedLoopJoinExec can fail with OOM.
>
> Thanks,
> Marco
>
>
>
>


Re: Whole-stage codegen and SparkPlan.newPredicate

2018-01-01 Thread Herman van Hövell tot Westerflier
Wrong ticket: https://issues.apache.org/jira/browse/SPARK-22935

Thanks for working on this :)

On Mon, Jan 1, 2018 at 2:22 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
wrote:

> I ran the program in URL of stackoverflow with Spark 2.2.1 and master. I
> cannot see the exception even when I disabled whole-stage codegen. Am I
> wrong?
> We would appreciate it if you could create a JIRA entry with a simple
> standalone repro.
>
> In addition to this report, I realized that this program produces
> incorrect results. I created a JIRA entry https://issues.apache.org/
> jira/browse/SPARK-22934.
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From:Herman van Hövell tot Westerflier <hvanhov...@databricks.com>
> To:Jacek Laskowski <ja...@japila.pl>
> Cc:dev <dev@spark.apache.org>
> Date:2017/12/31 21:44
> Subject:Re: Whole-stage codegen and SparkPlan.newPredicate
> --
>
>
>
> Hi Jacek,
>
> In this case whole stage code generation is turned off. However we still
> use code generation for a lot of other things: projections, predicates,
> orderings & encoders. You are currently seeing a compile time failure while
> generating a predicate. There is currently no easy way to turn code
> generation off entirely.
>
> The error itself is not great, but it still captures the problem in a
> relatively timely fashion. We should have caught this during analysis
> though. Can you file a ticket?
>
> - Herman
>
> On Sat, Dec 30, 2017 at 9:16 AM, Jacek Laskowski <*ja...@japila.pl*
> <ja...@japila.pl>> wrote:
> Hi,
>
> While working on an issue with Whole-stage codegen as reported @
> *https://stackoverflow.com/q/48026060/1305344* I
> found out that spark.sql.codegen.wholeStage=false does *not* turn
> whole-stage codegen off completely.
>
>
> It looks like SparkPlan.newPredicate [1] gets called regardless of the
> value of spark.sql.codegen.wholeStage property.
>
> $ ./bin/spark-shell --conf spark.sql.codegen.wholeStage=false
> ...
> scala> spark.sessionState.conf.wholeStageEnabled
> res7: Boolean = false
>
> That leads to an issue in the SO question with whole-stage codegen
> regardless of the value:
>
> ...
>   at org.apache.spark.sql.execution.SparkPlan.
> newPredicate(SparkPlan.scala:385)
>   at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(
> basicPhysicalOperators.scala:214)
>   at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(
> basicPhysicalOperators.scala:213)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal
> $1$$anonfun$apply$24.apply(RDD.scala:816)
> ...
>
> Is this a bug or does it work as intended? Why?
>
> [1]
> *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala?utf8=%E2%9C%93#L386*
>
> Pozdrawiam,
> Jacek Laskowski
> 
> *https://about.me/JacekLaskowski*
> Mastering Spark SQL *https://bit.ly/mastering-spark-sql*
> Spark Structured Streaming *https://bit.ly/spark-structured-streaming*
> Mastering Apache Spark 2 *https://bit.ly/mastering-apache-spark*
> Follow me at *https://twitter.com/jaceklaskowski*
>
>
>
>


Re: Whole-stage codegen and SparkPlan.newPredicate

2017-12-31 Thread Herman van Hövell tot Westerflier
Hi Jacek,

In this case whole stage code generation is turned off. However we still
use code generation for a lot of other things: projections, predicates,
orderings & encoders. You are currently seeing a compile time failure while
generating a predicate. There is currently no easy way to turn code
generation off entirely.
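
As a quick sketch of what you can check from spark-shell (values are just
for illustration): disabling spark.sql.codegen.wholeStage only turns off the
fused whole-stage codegen; expression-level code generation for predicates
and projections still runs.

spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.range(10).filter("id > 5").explain()
// No WholeStageCodegen node appears in the plan, but FilterExec still calls
// SparkPlan.newPredicate and compiles a generated predicate with Janino.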

The error itself is not great, but it still captures the problem in a
relatively timely fashion. We should have caught this during analysis
though. Can you file a ticket?

- Herman

On Sat, Dec 30, 2017 at 9:16 AM, Jacek Laskowski  wrote:

> Hi,
>
> While working on an issue with Whole-stage codegen as reported @
> https://stackoverflow.com/q/48026060/1305344 I found out
> that spark.sql.codegen.wholeStage=false does *not* turn whole-stage
> codegen off completely.
>
> It looks like SparkPlan.newPredicate [1] gets called regardless of the
> value of spark.sql.codegen.wholeStage property.
>
> $ ./bin/spark-shell --conf spark.sql.codegen.wholeStage=false
> ...
> scala> spark.sessionState.conf.wholeStageEnabled
> res7: Boolean = false
>
> That leads to an issue in the SO question with whole-stage codegen
> regardless of the value:
>
> ...
>   at org.apache.spark.sql.execution.SparkPlan.newPredicate(
> SparkPlan.scala:385)
>   at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(
> basicPhysicalOperators.scala:214)
>   at org.apache.spark.sql.execution.FilterExec$$anonfun$18.apply(
> basicPhysicalOperators.scala:213)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInte
> rnal$1$$anonfun$apply$24.apply(RDD.scala:816)
> ...
>
> Is this a bug or does it work as intended? Why?
>
> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala?utf8=%E2%9C%93#L386
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>


Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-28 Thread Herman van Hövell tot Westerflier
+1

On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung 
wrote:

> +1
>
> Thanks Sean. Please vote!
>
> Tested various scenarios with R package. Ubuntu, Debian, Windows r-devel
> and release and on r-hub. Verified CRAN checks are clean (only 1 NOTE!) and
> no leaked files (.cache removed, /tmp clean)
>
>
> On Sun, Nov 26, 2017 at 11:55 AM Sean Owen  wrote:
>
>> Yes it downloads recent releases. The test worked for me on a second try,
>> so I suspect a bad mirror. If this comes up frequently we can just add
>> retry logic, as the closer.lua script will return different mirrors each
>> time.
>>
>> The tests all pass for me on the latest Debian, so +1 for this release.
>>
>> (I committed the change to set -Xss4m for tests consistently, but this
>> shouldn't block a release.)
>>
>>
>> On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung 
>> wrote:
>>
>>> Ah sorry digging through the history it looks like this is changed
>>> relatively recently and should only download previous releases.
>>>
>>> Perhaps we are intermittently hitting a mirror that doesn’t have the
>>> files?
>>>
>>>
>>> https://github.com/apache/spark/commit/daa838b8886496e64700b55d1301d3
>>> 48f1d5c9ae
>>>
>>>
>>> On Sat, Nov 25, 2017 at 10:36 AM Felix Cheung 
>>> wrote:
>>>
 Thanks Sean.

 For the second one, it looks like the  HiveExternalCatalogVersionsSuite is
 trying to download the release tgz from the official Apache mirror, which
 won’t work unless the release is actually, released?

 val preferredMirror =
   Seq("wget", "https://www.apache.org/dyn/closer.lua?preferred=true", "-q", "-O", "-").!!.trim
 val url = s"$preferredMirror/spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"

 It’s probably getting an error page instead.


 On Sat, Nov 25, 2017 at 10:28 AM Sean Owen  wrote:

> I hit the same StackOverflowError as in the previous RC test, but,
> pretty sure this is just because the increased thread stack size JVM flag
> isn't applied consistently. This seems to resolve it:
>
> https://github.com/apache/spark/pull/19820
>
> This wouldn't block release IMHO.
>
>
> I am currently investigating this failure though -- seems like the
> mechanism that downloads Spark tarballs needs fixing, or updating, in the
> 2.2 branch?
>
> HiveExternalCatalogVersionsSuite:
>
> gzip: stdin: not in gzip format
>
> tar: Child returned status 1
>
> tar: Error is not recoverable: exiting now
>
> *** RUN ABORTED ***
>
>   java.io.IOException: Cannot run program "./bin/spark-submit" (in
> directory "/tmp/test-spark/spark-2.0.2"): error=2, No such file or
> directory
>
> On Sat, Nov 25, 2017 at 12:34 AM Felix Cheung 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.2.1. The vote is open until Friday December 1, 2017 at
>> 8:00:00 am UTC and passes if a majority of at least 3 PMC +1 votes
>> are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.2.1
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see
>> https://spark.apache.org/
>>
>>
>> The tag to be voted on is v2.2.1-rc2 https://github.com/
>> apache/spark/tree/v2.2.1-rc2  (e30e2698a2193f0bbdcd4edb88471
>> 0819ab6397c)
>>
>> List of JIRA tickets resolved in this release can be found here
>> https://issues.apache.org/jira/projects/SPARK/versions/12340470
>>
>>
>> The release files, including signatures, digests, etc. can be found
>> at:
>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/
>> orgapachespark-1257/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/spark-2.2.1-
>> rc2-docs/_site/index.html
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala 
>> you
>> can add the staging repository to your projects resolvers and test with 
>> the
>> RC (make sure to clean up the artifact cache before/after so you don't 
>> end
>> up building 

Re: SparkSQL not support CharType

2017-11-23 Thread Herman van Hövell tot Westerflier
You need to use a StringType. The CharType and VarCharType are there to
ensure compatibility with Hive and ORC; they should not be used anywhere
else.
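
A minimal sketch of the fix (not the poster's exact code; only the flag
column changes):

import org.apache.spark.sql.types._

val test_schema = StructType(Array(
  StructField("id", IntegerType, false),
  StructField("flag", StringType, false),   // was CharType(1)
  StructField("time", DateType, false)))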

On Thu, Nov 23, 2017 at 4:09 AM, 163  wrote:

> Hi,
>  when I use Dataframe with table schema, It goes wrong:
>
> val test_schema = StructType(Array(
>
>   StructField("id", IntegerType, false),
>   StructField("flag", CharType(1), false),
>   StructField("time", DateType, false)));
>
> val df = spark.read.format("com.databricks.spark.csv")
>   .schema(test_schema)
>   .option("header", "false")
>   .option("inferSchema", "false")
>   .option("delimiter", ",")
>   .load("file:///Users/name/b")
>
>
> The log is below:
> Exception in thread "main" scala.MatchError: CharType(1) (of class
> org.apache.spark.sql.types.CharType)
> at org.apache.spark.sql.catalyst.encoders.RowEncoder$.org$
> apache$spark$sql$catalyst$encoders$RowEncoder$$serializerFor(RowEncoder.
> scala:73)
> at org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$
> 2.apply(RowEncoder.scala:158)
> at org.apache.spark.sql.catalyst.encoders.RowEncoder$$anonfun$
> 2.apply(RowEncoder.scala:157)
>
> Why? Is this a bug?
>
> But I found spark will translate char type to string when using create
> table command:
>
>   create table test(flag char(1));
>   desc test:flag string;
>
>
>
>
> Regards
> Wendy He
>



-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com






Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Herman van Hövell tot Westerflier
+1

On Tue, Oct 3, 2017 at 1:32 PM, Sean Owen  wrote:

> +1 same as last RC. Tests pass, sigs and hashes are OK.
>
> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.2. The vote is open until Saturday October 7th at 9:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v2.1.2-rc4
>>  (2abaea9e40fce81
>> cd4626498e0f5c28a70917499)
>>
>> List of JIRA tickets resolved in this release can be found with this
>> filter.
>> 
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://home.apache.org/~holden/spark-2.1.2-rc4-bin/
>>
>> Release artifacts are signed with a key from:
>> https://people.apache.org/~holden/holdens_keys.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1252
>>
>> The documentation corresponding to this release can be found at:
>> https://people.apache.org/~holden/spark-2.1.2-rc4-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks, in the Java/Scala you
>> can add the staging repository to your projects resolvers and test with the
>> RC (make sure to clean up the artifact cache before/after so you don't
>> end up building with a out of date RC going forward).
>>
>> *What should happen to JIRA tickets still targeting 2.1.2?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.1.3.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1. That being said
>> if there is something which is a regression form 2.1.1 that has not been
>> correctly targeted please ping a committer to help target the issue (you
>> can see the open issues listed as impacting Spark 2.1.1 & 2.1.2
>> 
>> )
>>
>> *What are the unresolved* issues targeted for 2.1.2
>> 
>> ?
>>
>> At this time there are no open unresolved issues.
>>
>> *Is there anything different about this release?*
>>
>> This is the first release in awhile not built on the AMPLAB Jenkins. This
>> is good because it means future releases can more easily be built and
>> signed securely (and I've been updating the documentation in
>> https://github.com/apache/spark-website/pull/66 as I progress), however
>> the chances of a mistake are higher with any change like this. If there
>> something you normally take for granted as correct when checking a release,
>> please double check this time :)
>>
>> *Should I be committing code to branch-2.1?*
>>
>> Thanks for asking! Please treat this stage in the RC process as "code
>> freeze" so bug fixes only. If you're uncertain if something should be back
>> ported please reach out. If you do commit to branch-2.1 please tag your
>> JIRA issue fix version for 2.1.3 and if we cut another RC I'll move the 2.1.3
>> fixed into 2.1.2 as appropriate.
>>
>> *What happened to RC3?*
>>
>> Some R+zinc interactions kept it from getting out the door.
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>


Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Herman van Hövell tot Westerflier
+1 (binding)

I personally believe that there is quite a big difference between having a
generic data source interface with a low surface area and pushing down a
significant part of query processing into a datasource. The latter has a
much wider surface area and will require us to stabilize most of the
internal Catalyst APIs, which would be a significant maintenance burden on
the community and has the potential to slow development velocity
significantly. If you want to write such integrations, then you should be
prepared to work with catalyst internals and own up to the fact that things
might change across minor versions (and in some cases even maintenance
releases). If you are willing to go down that road, then your best bet is
to use the already existing spark session extensions which will allow you
to write such integrations and can be used as an `escape hatch`.
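
As a hedged sketch of that escape hatch: SparkSessionExtensions lets you
inject planner strategies or optimizer rules, at the cost of depending on
internal, unstable Catalyst APIs. MyPushDownStrategy below is a hypothetical
strategy class, not part of Spark.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .withExtensions { extensions =>
    // injectPlannerStrategy takes a SparkSession => Strategy builder.
    extensions.injectPlannerStrategy(session => MyPushDownStrategy(session))
  }
  .getOrCreate()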


On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash  wrote:

> +0 (non-binding)
>
> I think there are benefits to unifying all the Spark-internal datasources
> into a common public API for sure.  It will serve as a forcing function to
> ensure that those internal datasources aren't advantaged vs datasources
> developed externally as plugins to Spark, and that all Spark features are
> available to all datasources.
>
> But I also think this read-path proposal avoids the more difficult
> questions around how to continue pushing datasource performance forwards.
> James Baker (my colleague) had a number of questions about advanced
> pushdowns (combined sorting and filtering), and Reynold also noted that
> pushdown of aggregates and joins are desirable on longer timeframes as
> well.  The Spark community saw similar requests, for aggregate pushdown in
> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
> in SPARK-12449.  Clearly a number of people are interested in this kind of
> performance work for datasources.
>
> To leave enough space for datasource developers to continue experimenting
> with advanced interactions between Spark and their datasources, I'd propose
> we leave some sort of escape valve that enables these datasources to keep
> pushing the boundaries without forking Spark.  Possibly that looks like an
> additional unsupported/unstable interface that pushes down an entire
> (unstable API) logical plan, which is expected to break API on every
> release.   (Spark attempts this full-plan pushdown, and if that fails Spark
> ignores it and continues on with the rest of the V2 API for
> compatibility).  Or maybe it looks like something else that we don't know
> of yet.  Possibly this falls outside of the desired goals for the V2 API
> and instead should be a separate SPIP.
>
> If we had a plan for this kind of escape valve for advanced datasource
> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
> focused more on getting the basics right for what many datasources are
> already doing in API V1 combined with other private APIs, vs pushing
> forward state of the art for performance.
>
> Andrew
>
> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> +1 (non-binding)
>>
>>
>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan  wrote:
>>
>> Hi all,
>>
>> In the previous discussion, we decided to split the read and write path
>> of data source v2 into 2 SPIPs, and I'm sending this email to call a vote
>> for Data Source V2 read path only.
>>
>> The full document of the Data Source API V2 is:
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ
>> -Z8qU5Frf6WMQZ6jJVM/edit
>>
>> The ready-for-review PR that implements the basic infrastructure for the
>> read path is:
>> https://github.com/apache/spark/pull/19136
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>>
>>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com







Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread Herman van Hövell tot Westerflier
Just move the case expression into an underlying select clause (a subquery),
so the non-deterministic rand() is evaluated in a projection instead of in
the join condition.
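
A hedged sketch of the rewrite (reusing the table and column names from the
quoted query below): evaluate the CASE/rand() expression in a subquery's
select list, then join on the resulting column, so the non-deterministic
expression lives in a Project rather than in the join condition.

spark.sql("""
  SELECT a.col1
  FROM (
    SELECT col1,
           CASE WHEN col2 IS NULL
                THEN cast(rand(9) * 1000 - 99 as string)
                ELSE col2 END AS join_key
    FROM tbl1
  ) a
  LEFT OUTER JOIN tbl2 b
    ON a.join_key = b.col3
""")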

On Thu, Jul 13, 2017 at 3:10 PM, Chang Chen  wrote:

> Hi Wenchen
>
> Yes. We also find this error is caused by Rand. However, this is classic
> way to solve data skew in Hive.  Is there any equivalent way in Spark?
>
> Thanks
> Chang
>
>
> On Thu, Jul 13, 2017 at 8:25 PM, Wenchen Fan  wrote:
>
>> It’s not about case when, but about rand(). Non-deterministic expressions
>> are not allowed in join condition.
>>
>> > On 13 Jul 2017, at 6:43 PM, wangshuang  wrote:
>> >
>> > I'm trying to execute hive sql on spark sql (Also on spark
>> thriftserver), For
>> > optimizing data skew, we use "case when" to handle null.
>> > Simple sql as following:
>> >
>> >
>> > SELECT a.col1
>> > FROM tbl1 a
>> > LEFT OUTER JOIN tbl2 b
>> > ON
>> > * CASE
>> >   WHEN a.col2 IS NULL
>> >   THEN cast(rand(9)*1000 - 99 as string)
>> >   ELSE
>> >   a.col2 END *
>> >   = b.col3;
>> >
>> >
>> > But I get the error:
>> >
>> > == Physical Plan ==
>> > *org.apache.spark.sql.AnalysisException: nondeterministic expressions
>> are
>> > only allowed in
>> > Project, Filter, Aggregate or Window, found:*
>> > (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS
>> > DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt`
>> END =
>> > c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND
>> (c.`cur_flag` =
>> > 1))
>> > in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >   ;;
>> > GlobalLimit 10
>> > +- LocalLimit 10
>> >   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN
>> > (cast(19596 as string),cast(20134 as string),cast(10997 as string)) &&
>> > nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END],
>> > [date_id#7]
>> >  +- Filter (date_id#7 = 2017-07-12)
>> > +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN
>> > cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as
>> > string) ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as
>> int)
>> > = 9)) && (cur_flag#77 = 1))
>> >:- SubqueryAlias a
>> >:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
>> >: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7,
>> chanl_id#8L,
>> > pltfm_id#9, city_id#10, sessn_id#11, gu_id#12,
>> nav_refer_page_type_id#13,
>> > nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16,
>> > nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19,
>> nav_page_value#20,
>> > nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25,
>> nav_tcd#26,
>> > nav_tci#27, nav_tce#28, detl_refer_page_type_id#29,
>> > detl_refer_page_value#30, ... 33 more fields]
>> >+- SubqueryAlias c
>> >   +- SubqueryAlias dim_site_categ_ext
>> >  +- CatalogRelation `dw`.`dim_site_categ_ext`,
>> > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
>> [site_categ_skid#64L,
>> > site_categ_type#65, site_categ_code#66, site_categ_name#67,
>> > site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L,
>> sort_seq#71L,
>> > site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L,
>> > updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L,
>> > site_categ_id#80, site_categ_parnt_id#81]
>> >
>> > Does spark sql not support syntax "case when" in JOIN?  Additional, my
>> spark
>> > version is 2.2.0.
>> > Any help would be greatly appreciated.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-developers
>> -list.1001551.n3.nabble.com/SQL-Syntax-case-when-doesn-t-be-
>> supported-in-JOIN-tp21953.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com







Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-02 Thread Herman van Hövell tot Westerflier
+1

On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida <
ricardo.alme...@actnowib.com> wrote:

> +1 (non-binding)
>
> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive
> -Phive-thriftserver -Pscala-2.11 on
>
>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>
>
>
>
>
> On 1 Jul 2017 02:45, "Michael Armbrust"  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc6
>  (a2c7b2133cfee7f
> a9abfaa2bfbfb637155466783)
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
> The release files, including signatures, digests, etc. can be found at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1245/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>
>
>


Re: Question on Spark code

2017-06-25 Thread Herman van Hövell tot Westerflier
I am not getting the question. The logging trait does exactly what it says
on the box; I don't see what string concatenation has to do with it.

On Sun, Jun 25, 2017 at 11:27 AM, kant kodali  wrote:

> Hi All,
>
> I came across this file https://github.com/apache/spark/blob/master/core/
> src/main/scala/org/apache/spark/internal/Logging.scala and I am wondering
> what is the purpose of this? Especially it doesn't prevent any string
> concatenation and also the if checks are already done by the library itself
> right?
>
> Thanks!
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com







Re: [build system] jenkins got itself wedged...

2017-05-16 Thread Herman van Hövell tot Westerflier
Thanks Shane!

On Tue, May 16, 2017 at 5:18 PM, shane knapp  wrote:

> ...so i kicked it and it's now back up and happily building.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com





Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-29 Thread Herman van Hövell tot Westerflier
Maciej, this is definitely a bug. I have opened https://github.com/apache/
spark/pull/17810 to fix this. I don't think this should be a blocker for
the release of 2.2, if there is another RC we will include it.

On Sat, Apr 29, 2017 at 10:17 AM, Maciej Szymkiewicz  wrote:

> I am not sure if it is relevant but explode_outer and posexplode_outer
> seem to be broken: SPARK-20534
> 
>
> On 04/28/2017 12:49 AM, Sean Owen wrote:
>
> By the way the RC looks good. Sigs and license are OK, tests pass with
> -Phive -Pyarn -Phadoop-2.7. +1 from me.
>
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc1
>>  (8ccb4a57c82146c
>> 1a8f8966c7e64010cf5632cb6)
>>
>> List of JIRA tickets resolved can be found with this filter
>> 
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com





Re: New Optimizer Hint

2017-04-20 Thread Herman van Hövell tot Westerflier
Hi Michael,

This sounds like a good idea. Can you open a JIRA to track this?

My initial feedback on your proposal would be that you might want to
express the no_collapse at the expression level and not at the plan level.

HTH

On Thu, Apr 20, 2017 at 3:31 PM, Michael Styles 
wrote:

> Hello,
>
> I am in the process of putting together a PR that introduces a new hint
> called NO_COLLAPSE. This hint is essentially identical to Oracle's NO_MERGE
> hint.
>
> Let me first give an example of why I am proposing this.
>
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
> df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"]))
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
> df2["ua"].browser_version.alias("c2"))
> df3.explain(True)
>
> == Parsed Logical Plan ==
> 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS
> c2#91]
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>+- LogicalRDD [id#80L, user_agent#81]
>
> == Analyzed Logical Plan ==
> c1: string, c2: string
> Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS
> c2#91]
> +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85]
>+- LogicalRDD [id#80L, user_agent#81]
>
> == Optimized Logical Plan ==
> Project [UDF(user_agent#81).device_form_factor AS c1#90,
> UDF(user_agent#81).browser_version AS c2#91]
> +- LogicalRDD [id#80L, user_agent#81]
>
> == Physical Plan ==
> *Project [UDF(user_agent#81).device_form_factor AS c1#90,
> UDF(user_agent#81).browser_version AS c2#91]
> +- Scan ExistingRDD[id#80L,user_agent#81]
>
> user_agent_details is a user-defined function that returns a struct. As
> can be seen from the generated query plan, the function is being executed
> multiple times which could lead to performance issues. This is due to the
> CollapseProject optimizer rule that collapses adjacent projections.
>
> I'm proposing a hint that prevent the optimizer from collapsing adjacent
> projections. A new function called 'no_collapse' would be introduced for
> this purpose. Consider the following example and generated query plan.
>
> df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
> df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_
> agent"])))
> df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
> df2["ua"].browser_version.alias("c2"))
> df3.explain(True)
>
> == Parsed Logical Plan ==
> 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS
> c2#76]
> +- NoCollapseHint
>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>   +- LogicalRDD [id#64L, user_agent#65]
>
> == Analyzed Logical Plan ==
> c1: string, c2: string
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
> c2#76]
> +- NoCollapseHint
>+- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69]
>   +- LogicalRDD [id#64L, user_agent#65]
>
> == Optimized Logical Plan ==
> Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
> c2#76]
> +- NoCollapseHint
>+- Project [UDF(user_agent#65) AS ua#69]
>   +- LogicalRDD [id#64L, user_agent#65]
>
> == Physical Plan ==
> *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS
> c2#76]
> +- *Project [UDF(user_agent#65) AS ua#69]
>+- Scan ExistingRDD[id#64L,user_agent#65]
>
> As can be seen from the query plan, the user-defined function is now
> evaluated once per row.
>
> I would like to get some feedback on this proposal.
>
> Thanks.
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com





Re: [SQL] Unresolved reference with chained window functions.

2017-03-24 Thread Herman van Hövell tot Westerflier
This is definitely a bug in the CollapseWindow optimizer rule. I think we
can use SPARK-20086  to
track this.

On Fri, Mar 24, 2017 at 9:28 PM, Maciej Szymkiewicz 
wrote:

> Forwarded from SO (http://stackoverflow.com/q/43007433). Looks like
> regression compared to 2.0.2.
>
> scala> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.expressions.Window
>
> scala> val win_spec_max =
> Window.partitionBy("x").orderBy("AmtPaid").rowsBetween(Window.
> unboundedPreceding,
> 0)
> win_spec_max: org.apache.spark.sql.expressions.WindowSpec =
> org.apache.spark.sql.expressions.WindowSpec@3433e418
>
> scala> val df = Seq((1, 2.0), (1, 3.0), (1, 1.0), (1, -2.0), (1,
> -1.0)).toDF("x", "AmtPaid")
> df: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double]
>
> scala> val df_with_sum = df.withColumn("AmtPaidCumSum",
> sum(col("AmtPaid")).over(win_spec_max))
> df_with_sum: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double
> ... 1 more field]
>
> scala> val df_with_max = df_with_sum.withColumn("AmtPaidCumSumMax",
> max(col("AmtPaidCumSum")).over(win_spec_max))
> df_with_max: org.apache.spark.sql.DataFrame = [x: int, AmtPaid: double
> ... 2 more fields]
>
> scala> df_with_max.explain
> == Physical Plan ==
> !Window [sum(AmtPaid#361) windowspecdefinition(x#360, AmtPaid#361 ASC
> NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS
> AmtPaidCumSum#366, max(AmtPaidCumSum#366) windowspecdefinition(x#360,
> AmtPaid#361 ASC NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND
> CURRENT ROW) AS AmtPaidCumSumMax#372], [x#360], [AmtPaid#361 ASC NULLS
> FIRST]
> +- *Sort [x#360 ASC NULLS FIRST, AmtPaid#361 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(x#360, 200)
>   +- LocalTableScan [x#360, AmtPaid#361]
>
> scala> df_with_max.printSchema
> root
>  |-- x: integer (nullable = false)
>  |-- AmtPaid: double (nullable = false)
>  |-- AmtPaidCumSum: double (nullable = true)
>  |-- AmtPaidCumSumMax: double (nullable = true)
>
> scala> df_with_max.show
> 17/03/24 21:22:32 ERROR Executor: Exception in task 0.0 in stage 19.0
> (TID 234)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding
> attribute, tree: AmtPaidCumSum#366
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>...
> Caused by: java.lang.RuntimeException: Couldn't find AmtPaidCumSum#366
> in [sum#385,max#386,x#360,AmtPaid#361]
>...
>
> Is it a known issue or do we need a JIRA?
>
> --
> Best,
> Maciej Szymkiewicz
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com





Re: [SQL]Analysis failed when combining Window function and GROUP BY in Spark2.x

2017-03-08 Thread Herman van Hövell tot Westerflier
You are seeing a bug in the Hive parser. Hive drops the window clause when
it encounters a count(distinct ...). See
https://issues.apache.org/jira/browse/HIVE-10141 for more information.

Spark 1.6 plans this as a regular distinct aggregate (dropping the window
clause), which is wrong. Spark 2.x uses its own parser, and it does not
allow you to use 'distinct' aggregates in window functions. You are
getting this error because aggregates are planned before windows, and the
aggregate cannot find b among its grouping expressions.
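
A hedged workaround sketch (assuming the tb view from the quoted example
below): express the distinct count per partition as the size of a collected
set, which is allowed as a window aggregate in Spark 2.x.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byA = Window.partitionBy("a")
spark.table("tb")
  .withColumn("distinct_b", size(collect_set(col("b")).over(byA)))
  .show()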

On Wed, Mar 8, 2017 at 5:21 AM, StanZhai  wrote:

> We can reproduce this using the following code:
>
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
>
> val sql1 =
>   """
> |create temporary view tb as select * from values
> |(1, 0),
> |(1, 0),
> |(2, 0)
> |as grouping(a, b)
>   """.stripMargin
>
> val sql =
>   """
> |select count(distinct(b)) over (partition by a) from tb group by a
>   """.stripMargin
>
> spark.sql(sql1)
> spark.sql(sql).show()
>
> It will throw exception like this:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
> 'tb.`b`' is neither present in the group by, nor is it an aggregate function. 
> Add to group by or wrap in first() (or first_value) if you don't care which 
> value you get.;;
> Project [count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING)#4L]
> +- Project [b#1, a#0, count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN 
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L, count(DISTINCT b) OVER 
> (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L]
>+- Window [count(distinct b#1) windowspecdefinition(a#0, ROWS BETWEEN 
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS count(DISTINCT b) OVER 
> (PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
> FOLLOWING)#4L], [a#0]
>   +- Aggregate [a#0], [b#1, a#0]
>  +- SubqueryAlias tb
> +- Project [a#0, b#1]
>+- SubqueryAlias grouping
>   +- LocalRelation [a#0, b#1]
>
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:220)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:247)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>
> But, there is no exception in Spark 1.6.x.
>
> I think the sql select count(distinct(b)) over (partition by a) from tb
> group by a should be executed. I've no idea about the exception. Is this
> in line with expectations?
>
> Any help is appreciated!
>
> Best,
>
> Stan
>
>
> 

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Herman van Hövell tot Westerflier
Congrats Takuya!

On Mon, Feb 13, 2017 at 11:27 PM, Neelesh Salian 
wrote:

> Congratulations, Takuya!
>
> On Mon, Feb 13, 2017 at 11:16 AM, Reynold Xin  wrote:
>
>> Hi all,
>>
>> Takuya-san has recently been elected an Apache Spark committer. He's been
>> active in the SQL area and writes very small, surgical patches that are
>> high quality. Please join me in congratulating Takuya-san!
>>
>>
>>
>
>
> --
> Regards,
> Neelesh S. Salian
>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com



Re: [SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses in Spark 2.x

2017-02-06 Thread Herman van Hövell tot Westerflier
Hi Stan,

I have opened https://github.com/apache/spark/pull/16821 to fix this.

On Mon, Feb 6, 2017 at 1:41 PM, StanZhai  wrote:

> Hi all,
>
> SQLParser fails to resolve nested CASE WHEN statement like this:
>
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb
>
>  Exception 
> Exception in thread "main"
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT,
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE,
> '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0)
>
> == SQL ==
>
> select case when
>   (1) +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> ^^^
> from tb
>
> But,remove parentheses will be fine:
>
> select case when
>   1 +
>   case when 1>0 then 1 else 0 end = 2
> then 1 else 0 end
> from tb
>
> I've already filed a JIRA for this:
> https://issues.apache.org/jira/browse/SPARK-19472
> 
>
> Any help is greatly appreciated!
>
> Best,
> Stan
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/SQL-SQLParser-
> fails-to-resolve-nested-CASE-WHEN-statement-with-parentheses-in-Spark-2-x-
> tp20867.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 




Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com



Re: welcoming Burak and Holden as committers

2017-01-24 Thread Herman van Hövell tot Westerflier
Congrats!

On Tue, Jan 24, 2017 at 10:20 PM, Felix Cheung 
wrote:

> Congrats and welcome!!
>
>
> --
> *From:* Reynold Xin 
> *Sent:* Tuesday, January 24, 2017 10:13:16 AM
> *To:* dev@spark.apache.org
> *Cc:* Burak Yavuz; Holden Karau
> *Subject:* welcoming Burak and Holden as committers
>
> Hi all,
>
> Burak and Holden have recently been elected as Apache Spark committers.
>
> Burak has been very active in a large number of areas in Spark, including
> linear algebra, stats/maths functions in DataFrames, Python/R APIs for
> DataFrames, dstream, and most recently Structured Streaming.
>
> Holden has been a long time Spark contributor and evangelist. She has
> written a few books on Spark, as well as frequent contributions to the
> Python API to improve its usability and performance.
>
> Please join me in welcoming the two!
>
>
>


-- 




Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com



Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Herman van Hövell tot Westerflier
@Jacek The maximum of 200 output fields for whole-stage code generation was
chosen to prevent the generated method from exceeding the JVM's 64KB
bytecode limit. There is absolutely no relation between this value and the
number of partitions after a shuffle (if there were, they would have used
the same configuration).
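
To make the distinction concrete, a minimal sketch; the two settings are
entirely separate configurations (the first is an internal SQL conf):

// Caps the number of fields before whole-stage codegen is deactivated.
spark.conf.set("spark.sql.codegen.maxFields", "100")
// Controls the number of partitions after a shuffle; unrelated to codegen.
spark.conf.set("spark.sql.shuffle.partitions", "400")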

On Tue, Jan 3, 2017 at 1:55 PM, Jacek Laskowski  wrote:

> Hi Shuai,
>
> Disclaimer: I'm not a spark guru, and what's written below are some
> notes I took when reading spark source code, so I could be wrong, in
> which case I'd appreciate a lot if someone could correct me.
>
> (Yes, I did copy your disclaimer since it applies to me too. Sorry for
> duplication :))
>
> I'd say that the description is very well-written and clear. I'd only add
> that:
>
> 1. CodegenSupport allows custom implementations to optionally disable
> codegen using supportCodegen predicate (that is enabled by default,
> i.e. true)
> 2. CollapseCodegenStages is a Rule[SparkPlan], i.e. a transformation
> of SparkPlan into another SparkPlan, that searches for sub-plans (aka
> stages) that support codegen and collapse them together as a
> WholeStageCodegen for which supportCodegen is enabled.
> 3. It is assumed that all Expression instances except CodegenFallback
> support codegen.
> 4. CollapseCodegenStages uses the internal setting
> spark.sql.codegen.maxFields (default: 200) to control the number of
> fields in input and output schemas before deactivating whole-stage
> codegen. See https://issues.apache.org/jira/browse/SPARK-14554.
>
> NOTE: The magic number 200 (!) again. I asked about it few days ago
> and in http://stackoverflow.com/questions/41359344/why-is-the-
> number-of-partitions-after-groupby-200
>
> 5. There are side-effecting logical commands that are executed for
> their side-effects that are translated to ExecutedCommandExec in
> BasicOperators strategy and won't take part in codegen.
>
> Thanks for sharing your notes! Gonna merge yours with mine! Thanks.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Jan 2, 2017 at 6:30 PM, Shuai Lin  wrote:
> > Disclaimer: I'm not a spark guru, and what's written below are some
> notes I
> > took when reading spark source code, so I could be wrong, in which case
> I'd
> > appreciate a lot if someone could correct me.
> >
> >>
> >> > Let me rephrase this. How does the SparkSQL engine call the codegen
> APIs
> >> > to
> >> do the job of producing RDDs?
> >
> >
> > IIUC, physical operators like `ProjectExec` implements
> doProduce/doConsume
> > to support codegen, and when whole-stage codegen is enabled, a subtree
> would
> > be collapsed into a WholeStageCodegenExec wrapper tree, and the root
> node of
> > the wrapper tree would call the doProduce/doConsume method of each
> operator
> > to generate the java source code to be compiled into java byte code by
> > janino.
> >
> > In contrast, when whole stage code gen is disabled (e.g. by passing
> "--conf
> > spark.sql.codegen.wholeStage=false" to spark submit), the doExecute
> method
> > of the physical operators are called so no code generation would happen.
> >
> > The producing of the RDDs is some post-order SparkPlan tree evaluation.
> The
> > leaf node would be some data source: either some file-based
> > HadoopFsRelation, or some external data sources like JdbcRelation, or
> > in-memory LocalRelation created by "spark.range(100)". Above all, the
> leaf
> > nodes could produce rows on their own. Then the evaluation goes in a
> bottom
> > up manner, applying filter/limit/project etc. along the way. The
> generated
> > code or the various doExecute method would be called, depending on
> whether
> > codegen is enabled (the default) or not.
> >
> >> > What are those eval methods in Expressions for given there's already a
> >> > doGenCode next to it?
> >
> >
> > AFAIK the `eval` method of Expression is used to do static evaluation
> when
> > the expression is foldable, e.g.:
> >
> >select map('a', 1, 'b', 2, 'a', 3) as m
> >
> > Regards,
> > Shuai
> >
> >
> > On Wed, Dec 28, 2016 at 1:05 PM, dragonly  wrote:
> >>
> >> Thanks for your reply!
> >>
> >> Here's my *understanding*:
> >> basic types that ScalaReflection understands are encoded into tungsten
> >> binary format, while UDTs are encoded into GenericInternalRow, which
> >> stores
> >> the JVM objects in an Array[Any] under the hood, and thus lose those
> >> memory
> >> footprint efficiency and cpu cache efficiency stuff provided by tungsten
> >> encoding.
> >>
> >> If the above is correct, then here are my *further questions*:
> >> Are SparkPlan nodes (those ends with Exec) all codegenerated before
> >> actually
> >> running the toRdd logic? I know there are some non-codegenable nodes
> which
> >> implement trait CodegenFallback, but there's also 

Re: shapeless in spark 2.1.0

2016-12-29 Thread Herman van Hövell tot Westerflier
Which dependency pulls in shapeless?

On Thu, Dec 29, 2016 at 5:49 PM, Koert Kuipers  wrote:

> i just noticed that spark 2.1.0 bring in a new transitive dependency on
> shapeless 2.0.0
>
> shapeless is a popular library for scala users, and shapeless 2.0.0 is old
> (2014) and not compatible with more current versions.
>
> so this means a spark user that uses shapeless in his own development
> cannot upgrade safely from 2.0.0 to 2.1.0, i think.
>
> wish i had noticed this sooner
>



-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com



Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Herman van Hövell tot Westerflier
+1

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li  wrote:

> +1
>
> Xiao Li
>
> 2016-12-16 12:19 GMT-08:00 Felix Cheung :
>
>> For R we have a license field in the DESCRIPTION, and this is standard
>> practice (and requirement) for R packages.
>>
>> https://cran.r-project.org/doc/manuals/R-exts.html#Licensing
>>
>> --
>> *From:* Sean Owen 
>> *Sent:* Friday, December 16, 2016 9:57:15 AM
>> *To:* Reynold Xin; dev@spark.apache.org
>> *Subject:* Re: [VOTE] Apache Spark 2.1.0 (RC5)
>>
>> (If you have a template for these emails, maybe update it to use https
>> links. They work for apache.org domains. After all we are asking people
>> to verify the integrity of release artifacts, so it might as well be
>> secure.)
>>
>> (Also the new archives use .tar.gz instead of .tgz like the others. No
>> big deal, my OCD eye just noticed it.)
>>
>> I don't see an Apache license / notice for the Pyspark or SparkR
>> artifacts. It would be good practice to include this in a convenience
>> binary. I'm not sure if it's strictly mandatory, but something to adjust in
>> any event. I think that's all there is to do for SparkR. For Pyspark, which
>> packages a bunch of dependencies, it does include the licenses (good) but I
>> think it should include the NOTICE file.
>>
>> This is the first time I recall getting 0 test failures off the bat!
>> I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.
>>
>> I think I'd +1 this therefore unless someone knows that the license issue
>> above is real and a blocker.
>>
>> On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.0
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.0-rc5 (cd0a08361e2526519e7c131c42116
>>> bf56fa62c76)
>>>
>>> List of JIRA tickets resolved are:  https://issues.apache.org/jir
>>> a/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1223/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.0?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>>>
>>> *What happened to RC3/RC5?*
>>>
>>> They had issues with the release packaging and as a result were skipped.
>>>
>>>
>


-- 

Herman van Hövell

Software Engineer

Databricks Inc.

hvanhov...@databricks.com

+31 6 420 590 27

databricks.com

[image: http://databricks.com] 


Re: Green dot in web UI DAG visualization

2016-11-17 Thread Herman van Hövell tot Westerflier
Should I be able to see something?

On Thu, Nov 17, 2016 at 9:10 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Some questions about this DAG visualization:
>
> [image: Screen Shot 2016-11-17 at 11.57.14 AM.png]
>
> 1. What's the meaning of the green dot?
> 2. Should this be documented anywhere (if it isn't already)? Preferably a
> tooltip or something directly in the UI would explain the significance.
>
> Nick
>
>


Re: structured streaming and window functions

2016-11-17 Thread Herman van Hövell tot Westerflier
What kind of window functions are we talking about? Structured streaming
only supports time window aggregates, not the more general sql window
function (sum(x) over (partition by ... order by ...)) aggregates.

The basic idea is that you use incremental aggregation and store the
aggregation buffer (not the end result) in a state store after each
increment. When a new batch comes in, you perform aggregation on that
batch, merge the result of that aggregation with the buffer in the state
store, update the state store and return the new result.

This is much harder than it sounds, because you need to maintain state in a
fault tolerant way and you need to have some eviction policy (watermarks
for instance) for aggregation buffers to prevent the state store from
reaching an infinite size.
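As a simplified sketch of that incremental pattern, assuming a streaming DataFrame `events` with an event-time column `ts` and a `word` column, and watermark support as added in later releases (none of this is from the original mail):

import spark.implicits._   // `spark` is an assumed SparkSession
import org.apache.spark.sql.functions._

// The per-(window, word) aggregation buffers live in the state store and are
// merged with each new micro-batch; the watermark is the eviction policy.
val counts = events
  .withWatermark("ts", "10 minutes")
  .groupBy(window($"ts", "5 minutes"), $"word")
  .count()

val query = counts.writeStream
  .outputMode("update")    // emit only the groups whose buffer changed in this batch
  .format("console")
  .start()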

On Thu, Nov 17, 2016 at 12:19 AM, assaf.mendelson 
wrote:

> Hi,
>
> I have been trying to figure out how structured streaming handles window
> functions efficiently.
>
> The portion I understand is that whenever new data arrived, it is grouped
> by the time and the aggregated data is added to the state.
>
> However, unlike operations like sum etc. window functions need the
> original data and can change when data arrives late.
>
> So if I understand correctly, this would mean that we would have to save
> the original data and rerun on it to calculate the window function every
> time new data arrives.
>
> Is this correct? Are there ways to go around this issue?
>
>
>
> Assaf.
>
> --
> View this message in context: structured streaming and window functions
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: Another Interesting Question on SPARK SQL

2016-11-17 Thread Herman van Hövell tot Westerflier
The diagram you have included is a depiction of the steps Catalyst (the
spark optimizer) takes to create an executable plan. Tungsten mainly comes
into play during code generation and the actual execution.

A datasource is represented by a LogicalRelation during analysis &
optimization. The spark planner takes such a LogicalRelation and plans it
as either a RowDataSourceScanExec or a BatchedDataSourceScanExec, depending
on the datasource. Both scan nodes support whole stage code generation.
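A quick way to see where the data source ends up in those plans, if that helps (the path and the `spark` session are illustrative only):

val df = spark.read.parquet("/tmp/some_table")   // any file-based or external source
println(df.queryExecution.optimizedPlan)         // logical view: the LogicalRelation
println(df.queryExecution.executedPlan)          // physical view: the scan node, possibly
                                                 // wrapped in WholeStageCodegen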

HTH


On Thu, Nov 17, 2016 at 1:28 AM, kant kodali  wrote:

>
> ​
> Which parts in the diagram above are executed by DataSource connectors and
> which parts are executed by Tungsten? or to put it in another way which
> phase in the diagram above does Tungsten leverages the Datasource
> connectors (such as say cassandra connector ) ?
>
> My understanding so far is that connectors come in during Physical
> planning phase but I am not sure if the connectors take logical plan as an
> input?
>
> Thanks,
> kant
>


Re: separate spark and hive

2016-11-15 Thread Herman van Hövell tot Westerflier
You can start Spark without Hive support by setting the
spark.sql.catalogImplementation configuration to in-memory, for example:
>
> ./bin/spark-shell --master local[*] --conf
> spark.sql.catalogImplementation=in-memory


I would not change the default from Hive to Spark-only just yet.
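For completeness, a sketch of the same thing when building a session programmatically; simply not calling enableHiveSupport() on the builder has the same effect:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalogImplementation", "in-memory")   // use the in-memory catalog, no Hive
  .getOrCreate()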

On Tue, Nov 15, 2016 at 9:38 AM, assaf.mendelson 
wrote:

> After looking at the code, I found that spark.sql.catalogImplementation
> is set to “hive”. I would propose that it should be set to “in-memory” by
> default (or at least have this in the documentation, the configuration
> documentation at http://spark.apache.org/docs/latest/configuration.html
> has no mentioning of hive at all)
>
> Assaf.
>
>
>
> *From:* Mendelson, Assaf
> *Sent:* Tuesday, November 15, 2016 10:11 AM
> *To:* 'rxin [via Apache Spark Developers List]'
> *Subject:* RE: separate spark and hive
>
>
>
> Spark shell (and pyspark) by default create the spark session with hive
> support (also true when the session is created using getOrCreate, at least
> in pyspark)
>
> At a minimum there should be a way to configure it using
> spark-defaults.conf
>
> Assaf.
>
>
>
> *From:* rxin [via Apache Spark Developers List] [[hidden email]
> ]
> *Sent:* Tuesday, November 15, 2016 9:46 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: separate spark and hive
>
>
>
> If you just start a SparkSession without calling enableHiveSupport it
> actually won't use the Hive catalog support.
>
>
>
>
>
> On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf <[hidden email]
> > wrote:
>
> The default generation of spark context is actually a hive context.
>
> I tried to find on the documentation what are the differences between hive
> context and sql context and couldn’t find it for spark 2.0 (I know for
> previous versions there were a couple of functions which required hive
> context as well as window functions but those seem to have all been fixed
> for spark 2.0).
>
> Furthermore, I can’t seem to find a way to configure spark not to use
> hive. I can only find how to compile it without hive (and having to build
> from source each time is not a good idea for a production system).
>
>
>
> I would suggest that working without hive should be either a simple
> configuration or even the default and that if there is any missing
> functionality it should be documented.
>
> Assaf.
>
>
>
>
>
> *From:* Reynold Xin [mailto:[hidden email]
> ]
> *Sent:* Tuesday, November 15, 2016 9:31 AM
> *To:* Mendelson, Assaf
> *Cc:* [hidden email] 
> *Subject:* Re: separate spark and hive
>
>
>
> I agree with the high level idea, and thus SPARK-15691.
>
>
>
> In reality, it's a huge amount of work to create & maintain a custom
> catalog. It might actually make sense to do, but it just seems a lot of
> work to do right now and it'd take a toll on interoperability.
>
>
>
> If you don't need persistent catalog, you can just run Spark without Hive
> mode, can't you?
>
>
>
>
>
>
>
>
>
> On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]
> > wrote:
>
> Hi,
>
> Today, we basically force people to use hive if they want to get the full
> use of spark SQL.
>
> When doing the default installation this means that a derby.log and
> metastore_db directory are created where we run from.
>
> The problem with this is that if we run multiple scripts from the same
> working directory we have a problem.
>
> The solution we employ locally is to always run from different directory
> as we ignore hive in practice (this of course means we lose the ability to
> use some of the catalog options in spark session).
>
> The only other solution is to create a full blown hive installation with
> proper configuration (probably for a JDBC solution).
>
>
>
> I would propose that in most cases there shouldn’t be any hive use at all.
> Even for catalog elements such as saving a permanent table, we should be
> able to configure a target directory and simply write to it (doing
> everything file based to avoid the need for locking). Hive should be
> reserved for those who actually use it (probably for backward
> compatibility).
>
>
>
> Am I missing something here?
>
> Assaf.
>
>
> --
>
> View this message in context: separate spark and hive
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>
>
>
>
>
>
> --
>
> *If you reply to this email, your message will be added to the discussion
> below:*
>
> http://apache-spark-developers-list.1001551.n3.
> 

Re: Would "alter table add column" be supported in the future?

2016-11-09 Thread Herman van Hövell tot Westerflier
This is currently not on any roadmap I know of. You can open a JIRA ticket for
this if you want to.

On Wed, Nov 9, 2016 at 6:02 PM, 汪洋  wrote:

> Hi,
>
> I notice that “alter table add column” command is banned in spark 2.0.
>
> Any plans on supporting it in the future? (After all it was supported in
> spark 1.6.x)
>
> Thanks.
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Diffing execution plans to understand an optimizer bug

2016-11-08 Thread Herman van Hövell tot Westerflier
Replied in the ticket.

On Tue, Nov 8, 2016 at 11:36 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> SPARK-18367 : limit()
> makes the lame walk again
>
> On Tue, Nov 8, 2016 at 5:00 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Hmm, it doesn’t seem like I can access the output of
>> df._jdf.queryExecution().hiveResultString() from Python, and until I can
>> boil the issue down a bit, I’m stuck with using Python.
>>
>> I’ll have a go at using regexes to strip some stuff from the printed
>> plans. The one that’s working for me to strip the IDs is #\d+L?.
>>
>> Nick
>> ​
>>
>> On Tue, Nov 8, 2016 at 4:47 PM Reynold Xin  wrote:
>>
>> If you want to peek into the internals and do crazy things, it is much
>> easier to do it in Scala with df.queryExecution.
>>
>> For explain string output, you can work around the comparison simply by
>> doing replaceAll("#\\d+", "#x")
>>
>> similar to the patch here: https://github.com/apache/spark/commit/
>> fd90541c35af2bccf0155467bec8cea7c8865046#diff-
>> 432455394ca50800d5de508861984ca5R217
>>
>>
>>
>> On Tue, Nov 8, 2016 at 1:42 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> I’m trying to understand what I think is an optimizer bug. To do that,
>> I’d like to compare the execution plans for a certain query with and
>> without a certain change, to understand how that change is impacting the
>> plan.
>>
>> How would I do that in PySpark? I’m working with 2.0.1, but I can use
>> master if it helps.
>>
>> explain()
>> 
>> is helpful but is limited in two important ways:
>>
>>1. It prints to screen and doesn’t offer another way to access the
>>plan or capture it.
>>2.
>>
>>The printed plan includes auto-generated IDs that make diffing
>>impossible. e.g.
>>
>> == Physical Plan ==
>> *Project [struct(primary_key#722, person#550, dataset_name#671)
>>
>>
>> Any suggestions on what to do? Any relevant JIRAs I should follow?
>>
>> Nick
>> ​
>>
>>
>>
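A minimal Scala sketch of the normalization approach suggested in this thread; the regex and the two DataFrames are illustrative:

import org.apache.spark.sql.DataFrame

// Capture each physical plan as a string and strip the auto-generated
// expression IDs (e.g. "#722" or "#722L") so the two plans can be diffed.
def normalizedPlan(df: DataFrame): String =
  df.queryExecution.executedPlan.toString.replaceAll("#\\d+L?", "#x")

// val before = normalizedPlan(dfWithoutChange)   // hypothetical DataFrames
// val after  = normalizedPlan(dfWithChange)
// println(before == after)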


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Herman van Hövell tot Westerflier
+1

On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367
> ba694b0c34)
>
> This release candidate resolves 84 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Herman van Hövell tot Westerflier
+1

On Fri, Nov 4, 2016 at 7:20 PM, Michael Armbrust 
wrote:

> +1
>
> On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a
>> majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.2-rc2 (a6abe1ee22141931614bf27a4f371
>> c46d8379e33)
>>
>> This release candidate resolves 84 issues: https://s.apache.org/spark-2.0
>> .2-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1210/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc2-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.1.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series. Bugs already
>> present in 2.0.1, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>> (i.e. RC3) is cut, I will change the fix version of those patches to 2.0.2.
>>
>
>


Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Herman van Hövell tot Westerflier
+1

On Thu, Nov 3, 2016 at 6:58 PM, Michael Armbrust 
wrote:

> +1
>
> On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
>> majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.3
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v1.6.3-rc2 (1e860747458d74a4ccbd081103a05
>> 42a2367b14b)
>>
>> This release candidate addresses 52 JIRA tickets:
>> https://s.apache.org/spark-1.6.3-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1212/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc2-docs/
>>
>>
>> ===
>> == How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.6.2.
>>
>> 
>> == What justifies a -1 vote for this release?
>> 
>> This is a maintenance release in the 1.6.x series.  Bugs already present
>> in 1.6.2, missing features, or bugs related to new features will not
>> necessarily block this release.
>>
>>
>


Re: encoders for more complex types

2016-10-27 Thread Herman van Hövell tot Westerflier
What kind of difficulties are you experiencing?

On Thu, Oct 27, 2016 at 9:57 PM, Koert Kuipers  wrote:

> i have been pushing my luck a bit and started using ExpressionEncoder for
> more complex types like sequences of case classes etc. (where the case
> classes only had primitives and Strings).
>
> it all seems to work but i think the wheels come off in certain cases in
> the code generation. i guess this is not unexpected, after all what i am
> doing is not yet supported.
>
> is there a planned path forward to support more complex types with
> encoders? it would be nice if we can at least support all types that
> spark-sql supports in general for DataFrame.
>
> best, koert
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Herman van Hövell tot Westerflier
+1

On Thu, Oct 27, 2016 at 9:18 AM, Reynold Xin  wrote:

> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if
> a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc1 (1c2908eeb8890fdc91413a3f5bad2b
> b3d114db6c)
>
> This release candidate resolves 75 issues: https://s.apache.org/spark-2.
> 0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1208/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC2) is cut, I will change the fix version of those patches to 2.0.2.
>
>
>


Re: collect_list alternative for SQLContext?

2016-10-25 Thread Herman van Hövell tot Westerflier
What version of Spark are you using? We introduced a Spark native
collect_list in 2.0.

It still has the usual caveats, but it should be quite a bit faster.
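For example, a minimal sketch (column names and the `df` DataFrame are illustrative); toJSON is one common way to emit the rolled-up array as JSON:

import org.apache.spark.sql.functions.{col, collect_list}

// Spark-native collect_list in 2.0+, no HiveContext required.
val grouped = df.groupBy(col("userId"))
  .agg(collect_list(col("event")).as("events"))

grouped.toJSON.show(truncate = false)   // each group becomes a JSON object with an array field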

On Tue, Oct 25, 2016 at 6:16 AM, Matt Smith  wrote:

> Is there an alternative function or design pattern for the collect_list
> UDAF that can used without taking a dependency on HiveContext?  How does
> one typically roll things up into an array when outputting JSON?
>


Re: welcoming Xiao Li as a committer

2016-10-04 Thread Herman van Hövell tot Westerflier
Congratulations Xiao! Very well deserved!

On Mon, Oct 3, 2016 at 10:46 PM, Reynold Xin  wrote:

> Hi all,
>
> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
> committer. Xiao has been a super active contributor to Spark SQL. Congrats
> and welcome, Xiao!
>
> - Reynold
>
>


Re: [question] Why Spark SQL grammar allows : ?

2016-09-29 Thread Herman van Hövell tot Westerflier
Tejas,

This is because we use the same rule to parse top level and nested data
fields. For example:

create table tbl_x(
  id bigint,
  nested struct<a: int, b: string>
)

This shows both syntaxes. We should split this rule into a top-level and a
nested rule.

Could you open a ticket?

Thanks,
Herman



On Thu, Sep 29, 2016 at 6:54 PM, Tejas Patil 
wrote:

> Is there any reason why Spark SQL supports "<column name>" ":" "<column
> type>" while specifying columns? e.g. sql("CREATE TABLE t1 (column1:INT)")
> works fine.
>
> Here is relevant snippet in the grammar [0]:
>
> ```
> colType
> : identifier ':'? dataType (COMMENT STRING)?
> ;
> ```
>
> I do not see MySQL[1], Hive[2], Presto[3] and PostgreSQL [4] supporting
> ":" while specifying columns. They all use space as a delimiter.
>
> [0] : https://github.com/apache/spark/blob/master/sql/
> catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/
> parser/SqlBase.g4#L596
> [1] : http://dev.mysql.com/doc/refman/5.7/en/create-table.html
> [2] : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#
> LanguageManualDDL-CreateTable
> [3] : https://prestodb.io/docs/current/sql/create-table.html
> [4] : https://www.postgresql.org/docs/9.1/static/sql-createtable.html
>
> Thanks,
> Tejas
>
>


Re: https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread Herman van Hövell tot Westerflier
Hi Asaf,

The current collect_list/collect_set implementations have room for
improvement. We did not implement partial aggregation for these, because
the idea of a partial aggregation is that we can reduce network traffic (by
shipping fewer partially aggregated buffers); this does not really apply to
a collect_list where the typical use case is to change the shape of the
data.

I think you have two simple options here:

   1. In the latest branch we added a TypedImperativeAggregate. This allows
   you to use any object as an aggregation buffer. You will need to do some
   serialization though. The ApproximatePercentile aggregate function uses
   this technique.
   2. Exploit the fact that you want to collect a limited number of elements.
   You can use a row as the buffer. This is much easier to work with. See
   HyperLogLogPlusPlus for an example of this.


HTH
-Herman
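As a rough illustration of the underlying idea (a buffer that never grows past N), here is a sketch using the public typed Aggregator API rather than the internal routes Herman lists; it is not the Catalyst-level implementation being discussed and has its own trade-offs:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Keep only the N largest values per group; the buffer is trimmed on every
// update/merge, so it never grows beyond N elements.
class TopN(n: Int) extends Aggregator[Int, List[Int], Seq[Int]] {
  def zero: List[Int] = Nil
  def reduce(buf: List[Int], x: Int): List[Int] =
    (x :: buf).sorted(Ordering[Int].reverse).take(n)
  def merge(a: List[Int], b: List[Int]): List[Int] =
    (a ++ b).sorted(Ordering[Int].reverse).take(n)
  def finish(buf: List[Int]): Seq[Int] = buf
  def bufferEncoder: Encoder[List[Int]] = Encoders.kryo[List[Int]]
  def outputEncoder: Encoder[Seq[Int]] = Encoders.kryo[Seq[Int]]
}

// usage on a Dataset[Int]:
// ds.select(new TopN(10).toColumn)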

On Tue, Sep 27, 2016 at 7:02 AM, assaf.mendelson 
wrote:

> Hi,
>
>
>
> I wanted to try to implement https://issues.apache.org/
> jira/browse/SPARK-17691.
>
> So I started by looking at the implementation of collect_list. My idea
> was, do the same as they but when adding a new element, if there are
> already more than the threshold, remove one instead.
>
> The problem with this is that since collect_list has no partial
> aggregation we would end up shuffling all the data anyway. So while it
> would mean the actual resulting column might be smaller, the whole process
> would be as expensive as collect_list.
>
> So I thought of adding partial aggregation. The problem is that the merge
> function receives a buffer which is in a row format. Collect_list doesn’t
> use the buffer and uses its own data structure for collecting the data.
>
> I can change the implementation to use a spark ArrayType instead, however,
> since ArrayType is immutable it would mean that I would need to copy it
> whenever I do anything.
>
> Consider the simplest implementation of the update function:
>
> If there are few elements => add an element to the array (if I use regular
> Array this would mean copy as I grow it which is fine for this stage)
>
> If there are enough elements => we do not grow the array. Instead we need
> to decide what to replace. If we want to have the top 10 for example and
> there are 10 elements, we need to drop the lowest and put the new one.
>
> This means that if we simply loop across the array we would create a new
> copy and pay the copy + loop. If we keep it sorted then adding, sorting and
> removing the low one means 3 copies.
>
> If I would have been able to use scala’s array then I would basically copy
> whenever I grow and then when we grown to the max, all I would need to do
> is REPLACE the relevant element which is much cheaper.
>
>
>
> The only other solution I see is to simply provide “take first N” agg
> function and have the user sort beforehand but this seems a bad solution to
> me both because sort is expensive and because if we do multiple
> aggregations we can’t sort in two different ways.
>
>
>
>
>
> I can’t find a way to convert an internal buffer the way collect_list does
> it to an internal buffer before the merge.
>
> I also can’t find any way to use an array in the internal buffer as a
> mutable array. If I look at GenericMutableRow implementation then updating
> an array means creating a new one. I thought maybe of adding a function
> update_array_element which would change the relevant element (and similarly
> get_array_element to get an array element) which would allow to easily make
> the array mutable but if I look at the documentation it states this is not
> allowed.
>
>
>
> Can anyone give me a tip on where to try to go from here?
>
> --
> View this message in context: https://issues.apache.org/
> jira/browse/SPARK-17691
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-25 Thread Herman van Hövell tot Westerflier
+1 (non-binding)

On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida <
ricardo.alme...@actnowib.com> wrote:

> +1 (non-binding)
>
> Built and tested on
> - Ubuntu 16.04 / OpenJDK 1.8.0_91
> - CentOS / Oracle Java 1.7.0_55
> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn)
>
>
> On 25 September 2016 at 22:35, Matei Zaharia 
> wrote:
>
>> +1
>>
>> Matei
>>
>> On Sep 25, 2016, at 1:25 PM, Josh Rosen  wrote:
>>
>> +1
>>
>> On Sun, Sep 25, 2016 at 1:16 PM Yin Huai  wrote:
>>
>>> +1
>>>
>>> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun 
>>> wrote:
>>>
 +1 (non binding)

 RC3 is compiled and tested on the following two systems, too. All tests
 passed.

 * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
 -Dsparkr
 * CentOS 7.2 / Open JDK 1.8.0_102
with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver

 Cheers,
 Dongjoon



 On Saturday, September 24, 2016, Reynold Xin 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and
> passes if a majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.1
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5
> c2cc2730d17)
>
> This release candidate resolves 290 issues:
> https://s.apache.org/spark-2.0.1-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapache
> spark-1201/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
> 1-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by
> taking an existing Spark workload and running on this release candidate,
> then reporting any regressions from 2.0.0.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series.  Bugs already
> present in 2.0.0, missing features, or bugs related to new features will
> not necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0
> from now on?
> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new
> RC (i.e. RC4) is cut, I will change the fix version of those patches to
> 2.0.1.
>
>
>
>>>
>>
>


Re: Why Expression.deterministic method and Nondeterministic trait?

2016-09-23 Thread Herman van Hövell tot Westerflier
Jacek,

A non-deterministic expression usually holds some state. The
Nondeterministic trait makes sure a user can initialize this state
properly. Take a look at InterpretedProjection, for instance.

HTH

-Herman
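A quick way to see the flag Jacek is asking about from the public API (Column.expr exposes the underlying Catalyst expression; the two examples are illustrative):

import org.apache.spark.sql.functions.{lit, rand}

println(lit(1).expr.deterministic)   // true  - a literal carries no state
println(rand().expr.deterministic)   // false - Rand holds per-partition RNG state
                                     //         that must be initialized before eval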

On Fri, Sep 23, 2016 at 8:28 AM, Jacek Laskowski  wrote:

> Hi,
>
> Just came across the Expression trait [1] that can be check for
> determinism by the method deterministic [2] and trait Nondeterministic
> [3]. Why both?
>
> [1] https://github.com/apache/spark/blob/master/sql/
> catalyst/src/main/scala/org/apache/spark/sql/catalyst/
> expressions/Expression.scala#L53
> [2] https://github.com/apache/spark/blob/master/sql/
> catalyst/src/main/scala/org/apache/spark/sql/catalyst/
> expressions/Expression.scala#L80
> [3] https://github.com/apache/spark/blob/master/sql/
> catalyst/src/main/scala/org/apache/spark/sql/catalyst/
> expressions/Expression.scala#L271
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: How to get 2 years prior date from currentdate using Spark Sql

2016-09-07 Thread Herman van Hövell tot Westerflier
This is more a question for the user@ list.

You can write the following in SQL: select date '2016-09-07' - interval 2 years
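And relative to the current date, a small sketch using built-ins available since Spark 1.5; add_months with a negative offset gives "exactly two years ago" in yyyy-MM-dd form:

sqlContext.sql("SELECT add_months(current_date(), -24) AS two_years_ago").show()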

HTH

On Wed, Sep 7, 2016 at 3:14 PM, Yong Zhang  wrote:

> sorry, should be date_sub
>
>
> https://issues.apache.org/jira/browse/SPARK-8187
> [SPARK-8187] date/time function: date_sub - ASF JIRA
> 
> issues.apache.org
> Apache Spark added a comment - 12/Jun/15 06:56 User 'adrian-wang' has
> created a pull request for this issue: https://github.com/apache/
> spark/pull/6782
>
>
>
> --
> *From:* Yong Zhang 
> *Sent:* Wednesday, September 7, 2016 9:13 AM
> *To:* farman.bsse1855; dev@spark.apache.org
> *Subject:* Re: How to get 2 years prior date from currentdate using Spark
> Sql
>
>
> https://issues.apache.org/jira/browse/SPARK-8185
> [SPARK-8185] date/time function: datediff - ASF JIRA
> 
> issues.apache.org
> Spark; SPARK-8159 Improve expression function coverage (Spark 1.5)
> SPARK-8185; date/time function: datediff
>
>
>
> --
> *From:* farman.bsse1855 
> *Sent:* Wednesday, September 7, 2016 7:27 AM
> *To:* dev@spark.apache.org
> *Subject:* How to get 2 years prior date from currentdate using Spark Sql
>
> I need to derive 2 years prior date of current date using a query in Spark
> Sql. For ex : today's date is 2016-09-07. I need to get the date exactly 2
> years before this date in the above format (-MM-DD).
>
> Please let me know if there are multiple approaches and which one would be
> better.
>
> Thanks
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/How-to-get-2-years-
> prior-date-from-currentdate-using-Spark-Sql-tp18875.html
> Apache Spark Developers List - How to get 2 years prior date from
> currentdate using Spark Sql
> 
> apache-spark-developers-list.1001551.n3.nabble.com
> How to get 2 years prior date from currentdate using Spark Sql. I need to
> derive 2 years prior date of current date using a query in Spark Sql. For
> ex : today's date is 2016-09-07. I need to get the...
>
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Herman van Hövell tot Westerflier
Congrats Felix!

On Mon, Aug 8, 2016 at 11:57 PM, dhruve ashar  wrote:

> Congrats Felix!
>
> On Mon, Aug 8, 2016 at 2:28 PM, Tarun Kumar  wrote:
>
>> Congrats Felix!
>>
>> Tarun
>>
>> On Tue, Aug 9, 2016 at 12:57 AM, Timothy Chen  wrote:
>>
>>> Congrats Felix!
>>>
>>> Tim
>>>
>>> On Mon, Aug 8, 2016 at 11:15 AM, Matei Zaharia 
>>> wrote:
>>> > Hi all,
>>> >
>>> > The PMC recently voted to add Felix Cheung as a committer. Felix has
>>> been a major contributor to SparkR and we're excited to have him join
>>> officially. Congrats and welcome, Felix!
>>> >
>>> > Matei
>>> > -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>
> --
> -Dhruve Ashar
>
>


Re: Result code of whole stage codegen

2016-08-05 Thread Herman van Hövell tot Westerflier
Do you want to see the code that whole stage codegen produces?

You can prepend a SQL statement with EXPLAIN CODEGEN ...

Or you can add the following code to a DataFrame/Dataset command:

import org.apache.spark.sql.execution.debug._

and call the debugCodegen() command on a DataFrame/Dataset, for example:

range(0, 100).debugCodegen

...

Found 1 WholeStageCodegen subtrees.

== Subtree 1 / 1 ==

*Range (0, 100, splits=8)


Generated code:

/* 001 */ public Object generate(Object[] references) {

/* 002 */   return new GeneratedIterator(references);

/* 003 */ }

/* 004 */

/* 005 */ final class GeneratedIterator extends
org.apache.spark.sql.execution.BufferedRowIterator {

/* 006 */   private Object[] references;

/* 007 */   private org.apache.spark.sql.execution.metric.SQLMetric
range_numOutputRows;

/* 008 */   private boolean range_initRange;

/* 009 */   private long range_partitionEnd;

...

On Fri, Aug 5, 2016 at 9:55 AM, Maciej Bryński  wrote:

> Hi,
> I have some operation on DataFrame / Dataset.
> How can I see source code for whole stage codegen ?
> Is there any API for this ? Or maybe I should configure log4j in specific
> way ?
>
> Regards,
> --
> Maciek Bryński
>


Re: Where is DataFrame.scala in 2.0?

2016-06-03 Thread Herman van Hövell tot Westerflier
Hi Gerhard,

DataFrame and Dataset have been merged in Spark 2.0. A DataFrame is now a
Dataset that contains Row objects. We still maintain a type alias for
DataFrame:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/package.scala#L45
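In other words, the package object at that link declares `type DataFrame = Dataset[Row]`, so the two are interchangeable; a small illustrative sketch:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// Any code written against DataFrame still compiles; it now denotes Dataset[Row].
def describe(df: DataFrame): Unit = println(df.schema)
val sameThing: Dataset[Row] => Unit = describe   // identical types, no conversion needed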

HTH

Kind regards,

Herman van Hövell tot Westerflier

2016-06-03 17:01 GMT+02:00 Gerhard Fiedler <gfied...@algebraixdata.com>:

> When I look at the sources in Github, I see DataFrame.scala at
> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala
> in the 1.6 branch. But when I change the branch to branch-2.0 or master, I
> get a 404 error. I also can’t find the file in the directory listings, for
> example
> https://github.com/apache/spark/tree/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql
> (for branch-2.0).
>
>
>
> It seems that quite a few APIs use the DataFrame class, even in 2.0. Can
> someone please point me to its location, or otherwise explain why it is not
> there?
>
>
>
> Thanks,
>
> Gerhard
>
>
>


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Herman van Hövell tot Westerflier
+1


2016-05-19 18:20 GMT+02:00 Xiangrui Meng :

> +1
>
> On Thu, May 19, 2016 at 9:18 AM Joseph Bradley 
> wrote:
>
>> +1
>>
>> On Wed, May 18, 2016 at 10:49 AM, Reynold Xin 
>> wrote:
>>
>>> Hi Ovidiu-Cristian ,
>>>
>>> The best source of truth is change the filter with target version to
>>> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
>>> get closer to 2.0 release, more will be retargeted at 2.1.0.
>>>
>>>
>>>
>>> On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.ma...@inria.fr> wrote:
>>>
 Yes, I can filter..
 Did that and for example:

 https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
 

 To rephrase: for 2.0 do you have specific issues that are not a
 priority and will released maybe with 2.1 for example?

 Keep up the good work!

 On 18 May 2016, at 18:19, Reynold Xin  wrote:

 You can find that by changing the filter to target version = 2.0.0.
 Cheers.

 On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
 ovidiu-cristian.ma...@inria.fr> wrote:

> +1 Great, I see the list of resolved issues, do you have a list of
> known issue you plan to stay with this release?
>
> with
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
> -Phive-thriftserver -DskipTests clean package
>
> mvn -version
> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
> 2015-11-10T17:41:47+01:00)
> Maven home: /Users/omarcu/tools/apache-maven-3.3.9
> Java version: 1.7.0_80, vendor: Oracle Corporation
> Java home:
> /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: “mac"
>
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
> 2.635 s]
> [INFO] Spark Project Tags . SUCCESS [
> 1.896 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 2.560 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 6.533 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 4.176 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 4.809 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 6.242 s]
> [INFO] Spark Project Core . SUCCESS
> [01:20 min]
> [INFO] Spark Project GraphX ... SUCCESS [
> 9.148 s]
> [INFO] Spark Project Streaming  SUCCESS [
> 22.760 s]
> [INFO] Spark Project Catalyst . SUCCESS [
> 50.783 s]
> [INFO] Spark Project SQL .. SUCCESS
> [01:05 min]
> [INFO] Spark Project ML Local Library . SUCCESS [
> 4.281 s]
> [INFO] Spark Project ML Library ... SUCCESS [
> 54.537 s]
> [INFO] Spark Project Tools  SUCCESS [
> 0.747 s]
> [INFO] Spark Project Hive . SUCCESS [
> 33.032 s]
> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
> 3.198 s]
> [INFO] Spark Project REPL . SUCCESS [
> 3.573 s]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
> 4.617 s]
> [INFO] Spark Project YARN . SUCCESS [
> 7.321 s]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
> 16.496 s]
> [INFO] Spark Project Assembly . SUCCESS [
> 2.300 s]
> [INFO] Spark Project External Flume Sink .. SUCCESS [
> 4.219 s]
> [INFO] Spark Project External Flume ... SUCCESS [
> 6.987 s]
> [INFO] Spark Project External Flume Assembly .. SUCCESS [
> 1.465 s]
> [INFO] Spark Integration for Kafka 0.8  SUCCESS [
> 6.891 s]
> [INFO] Spark Project Examples . SUCCESS [
> 13.465 s]
> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
> 2.815 s]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] 

Re: Query parsing error for the join query between different database

2016-05-18 Thread Herman van Hövell tot Westerflier
'User' is a SQL2003 keyword. This is normally not a problem, except when
you use it as a table alias (which you are doing). Change the alias or
place it between backticks and you should be fine.
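Concretely, either of the following should parse (a sketch against the query from the original mail; `sqlContext` is the 1.6-era entry point):

// Option 1: use a non-reserved alias
sqlContext.sql("""
  SELECT u.uid, dept.name
  FROM userdb.user u, deptdb.dept dept
  WHERE u.dept_id = dept.id
""")

// Option 2: keep the alias but quote it with backticks
sqlContext.sql("""
  SELECT `user`.uid, dept.name
  FROM userdb.user `user`, deptdb.dept dept
  WHERE `user`.dept_id = dept.id
""")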


2016-05-18 23:51 GMT+02:00 JaeSung Jun :

> It's spark 1.6.1 and hive 1.2.1 (spark-sql saying "SET
> spark.sql.hive.version=1.2.1").
>
> Thanks
>
> On 18 May 2016 at 23:31, Ted Yu  wrote:
>
>> Which release of Spark / Hive are you using ?
>>
>> Cheers
>>
>> On May 18, 2016, at 6:12 AM, JaeSung Jun  wrote:
>>
>> Hi,
>>
>> I'm working on custom data source provider, and i'm using fully qualified
>> table name in FROM clause like following :
>>
>> SELECT user. uid, dept.name
>> FROM userdb.user user, deptdb.dept
>> WHERE user.dept_id = dept.id
>>
>> and i've got the following error :
>>
>> MismatchedTokenException(279!=26)
>> at
>> org.antlr.runtime.BaseRecognizer.recoverFromMismatchedToken(BaseRecognizer.java:617)
>> at org.antlr.runtime.BaseRecognizer.match(BaseRecognizer.java:115)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableSource(HiveParser_FromClauseParser.java:4608)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:3729)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1873)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1518)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:45861)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:41516)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:41402)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:40413)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:40283)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1590)
>> at
>> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
>> at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
>> at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
>> at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
>> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
>> at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
>> at
>> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
>> at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>> at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>> at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>> at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> at
>> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>> at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>> at
>> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>> at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>> at
>> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>> at
>> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
>> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
>>
>> Any idea?
>>
>> Thanks
>> Jason
>>
>>
>


Re: explain codegen

2016-04-04 Thread Herman van Hövell tot Westerflier
No, it can't. You only need implicits when you are using the Catalyst DSL.

The error you get is due to the fact that the parser does not recognize the
CODEGEN keyword (which was the case before we introduced this in
https://github.com/apache/spark/commit/fa1af0aff7bde9bbf7bfa6a3ac74699734c2fd8a).
That suggests to me that you are not on the latest master.

Kind regards,

Herman van Hövell

2016-04-04 12:15 GMT+02:00 Ted Yu :

> Could the error I encountered be due to missing import(s) of implicit ?
>
> Thanks
>
> On Sun, Apr 3, 2016 at 9:42 PM, Reynold Xin  wrote:
>
>> Works for me on latest master.
>>
>>
>>
>> scala> sql("explain codegen select 'a' as a group by 1").head
>> res3: org.apache.spark.sql.Row =
>> [Found 2 WholeStageCodegen subtrees.
>> == Subtree 1 / 2 ==
>> WholeStageCodegen
>> :  +- TungstenAggregate(key=[], functions=[], output=[a#10])
>> : +- INPUT
>> +- Exchange SinglePartition, None
>>+- WholeStageCodegen
>>   :  +- TungstenAggregate(key=[], functions=[], output=[])
>>   : +- INPUT
>>   +- Scan OneRowRelation[]
>>
>> Generated code:
>> /* 001 */ public Object generate(Object[] references) {
>> /* 002 */   return new GeneratedIterator(references);
>> /* 003 */ }
>> /* 004 */
>> /* 005 */ /** Codegened pipeline for:
>> /* 006 */ * TungstenAggregate(key=[], functions=[], output=[a#10])
>> /* 007 */ +- INPUT
>> /* 008 */ */
>> /* 009 */ final class GeneratedIterator extends
>> org.apache.spark.sql.execution.BufferedRowIterator {
>> /* 010 */   private Object[] references;
>> /* 011 */   ...
>>
>>
>> On Sun, Apr 3, 2016 at 9:38 PM, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> Looks related to the recent commit...
>>>
>>> Repository: spark
>>> Updated Branches:
>>>   refs/heads/master 2262a9335 -> 1f0c5dceb
>>>
>>> [SPARK-14350][SQL] EXPLAIN output should be in a single cell
>>>
>>> Jacek
>>> 03.04.2016 7:00 PM "Ted Yu"  napisał(a):
>>>
 Hi,
 Based on master branch refreshed today, I issued 'git clean -fdx'
 first.

 Then this command:
 build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6
 -Dhadoop.version=2.7.0 package -DskipTests

 I got the following error:

 scala>  sql("explain codegen select 'a' as a group by 1").head
 org.apache.spark.sql.catalyst.parser.ParseException:
 extraneous input 'codegen' expecting {'(', 'SELECT', 'FROM', 'ADD',
 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE',
 'DESCRIBE', 'EXPLAIN', 'LOGICAL', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP',
 'SET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'EXTENDED', 'REFRESH',
 'CLEAR', 'CACHE', 'UNCACHE', 'FORMATTED', 'DFS', 'TRUNCATE', 'ANALYZE',
 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT',
 'LOAD'}(line 1, pos 8)

 == SQL ==
 explain codegen select 'a' as a group by 1
 ^^^

 Can someone shed light ?

 Thanks

>>>
>>
>


Re: SPARK-SQL: Pattern Detection on Live Event or Archived Event Data

2016-03-01 Thread Herman van Hövell tot Westerflier
Hi Jerry,

This is not on any roadmap I know of. I briefly browsed through this; it
looks like some sort of window function with very awkward syntax. I think
Spark provides better constructs for this using DataFrames/Datasets/nested
data...
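For instance, a simple two-step pattern ("search" immediately followed by "purchase" for the same user) can be sketched with lag over an ordered window; this is nowhere near the full MATCH_RECOGNIZE semantics of the paper, and `events(userId, ts, event)` is an assumed DataFrame:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

val w = Window.partitionBy(col("userId")).orderBy(col("ts"))
val matches = events
  .withColumn("prevEvent", lag(col("event"), 1).over(w))
  .filter(col("prevEvent") === "search" && col("event") === "purchase")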

Feel free to submit a PR.

Kind regards,

Herman van Hövell

2016-03-01 15:16 GMT+01:00 Jerry Lam :

> Hi Spark developers,
>
> Will you consider to add support for implementing "Pattern matching in
> sequences of rows"? More specifically, I'm referring to this:
> http://web.cs.ucla.edu/classes/fall15/cs240A/notes/temporal/row-pattern-recogniton-11.pdf
>
> This is a very cool/useful feature to pattern matching over live
> stream/archived data. It is sorted of related to machine learning because
> this is usually used in clickstream analysis or path analysis. Also it is
> related to streaming because of the nature of the processing (time series
> data mostly). It is SQL because there is a good way to express and optimize
> the query.
>
> Best Regards,
>
> Jerry
>


Re: Aggregation + Adding static column + Union + Projection = Problem

2016-02-26 Thread Herman van Hövell tot Westerflier
Hi Jiří,

Thanks for your mail.

Could you create a JIRA ticket for this:
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel ?

Kind regards,

Herman van Hövell


2016-02-26 15:11 GMT+01:00 Jiří Syrový :

> Hi,
>
> I've recently noticed a bug in Spark (branch 1.6) that appears if you do
> the following
>
> Let's have some DataFrame called df.
>
> 1) Aggregation of multiple columns on the Dataframe df and store result as
> result_agg_1
> 2) Do another aggregation of multiple columns, but on one less grouping
> columns and store the result as result_agg_2
> 3) Align the result of second aggregation by adding missing grouping
> column with value empty lit("")
> 4) Union result_agg_1 and result_agg_2
> 5) Do the projection from "sum(count_column)" to "count_column" for all
> aggregated columns.
>
> The result is structurally inconsistent DataFrame that has all the data
> coming from result_agg_1 shifted.
>
> An example of stripped down code and example result can be seen here:
>
> https://gist.github.com/xjrk58/e0c7171287ee9bdc8df8
> https://gist.github.com/xjrk58/7a297a42ebb94f300d96
>
> Best,
> Jiri Syrovy
>
>


Re: Spark SQL performance: version 1.6 vs version 1.5

2016-02-12 Thread Herman van Hövell tot Westerflier
Hi Tien-Dung,

1.6 plans single distinct aggregates like multiple distinct aggregates;
this inherently causes some overhead but is more stable in case of high
cardinalities. You can revert to the old behavior by setting the
spark.sql.specializeSingleDistinctAggPlanning option to false. See also:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462
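If you want to try the old planning, something like the following should do it (option name taken from the linked SQLConf; treat it as a sketch):

sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "false")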

HTH

Kind regards,

Herman van Hövell


2016-02-12 16:23 GMT+01:00 Le Tien Dung :

> Hi folks,
>
> I have compared the performance of Spark SQL version 1.6.0 and version
> 1.5.2. In a simple case, Spark 1.6.0 is quite faster than Spark 1.5.2.
> However in a more complex query - in our case it is an aggregation query
> with grouping sets, Spark SQL version 1.6.0 is very much slower than Spark
> SQL version 1.5. Could any of you kindly let us know a workaround for this
> performance regression ?
>
> Here is our test scenario:
>
> case class Toto(
>  a: String = f"${(math.random*1e6).toLong}%06.0f",
>  b: String = f"${(math.random*1e6).toLong}%06.0f",
>  c: String = f"${(math.random*1e6).toLong}%06.0f",
>  n: Int = (math.random*1e3).toInt,
>  m: Double = (math.random*1e3))
>
> val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
> val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )
>
> df.registerTempTable( "toto" )
> val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2,
> SUM(m) AS k3"
> val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
> val sqlText = s"$sqlSelect $sqlGroupBy"
>
> val rs1 = sqlContext.sql( sqlText )
> rs1.saveAsParquetFile( "rs1" )
>
> The query is executed from a spark-shell in local mode with
> --driver-memory=1G. Screenshots from Spark UI are accessible at
> http://i.stack.imgur.com/VujQY.png (Spark 1.5.2) and
> http://i.stack.imgur.com/Hlg95.png (Spark 1.6.0). The DAG on Spark 1.6.0
> can be viewed at http://i.stack.imgur.com/u3HrG.png.
>
> Many thanks and looking forward to hearing from you,
> Tien-Dung Le
>
>


Re: Path to resource added with SQL: ADD FILE

2016-02-04 Thread Herman van Hövell tot Westerflier
Hi Antonio,

I am not sure you got the silent treatment on the user list. Stackoverflow
is also a good place to ask questions.

Could you use an absolute path to add the jar file. So instead of './my
resource file' (which is a relative path; this depends on where you started
Spark), use something like this '/some/path/my resource file' or use an URI.

Kind regards,

Herman van Hövell


2016-02-03 19:17 GMT+01:00 Antonio Piccolboni :

> Sorry if this is more appropriate for user list, I asked there on 12/17
> and got the silence treatment. I am writing a UDF that needs some
> additional info to perform its task. This information is in a file that I
> reference in a SQL ADD FILE statement. I expect that file to be accessible
> in the working directory for the UDF, but it doesn't seem to work (aka,
> failure on open("./my resource file"). What is the correct way to access
> the added resource? Thanks
>
>
> Antonio
>


Re: build error: code too big: specialStateTransition(int, IntStream)

2016-01-28 Thread Herman van Hövell tot Westerflier
Hi,

I have only encountered 'code too large' errors when changing grammars. I
am using SBT/Idea, no Eclipse.

The size of an ANTLR Parser/Lexer is dependent on the rules inside the
source grammar and the rules it depends on. So we should take a look at the
IdentifiersParser.g/ExpressionParser.g; the problem is probably caused by
the nonReserved rule.

HTH

Kind regards,

Herman van Hövell


2016-01-28 16:43 GMT+01:00 Ted Yu :

> After this change:
> [SPARK-12681] [SQL] split IdentifiersParser.g into two files
>
> the biggest file under
> sql/catalyst/src/main/antlr3/org/apache/spark/sql/catalyst/parser is 
> SparkSqlParser.g
>
> Maybe split SparkSqlParser.g up as well ?
>
> On Thu, Jan 28, 2016 at 5:21 AM, Iulian Dragoș  > wrote:
>
>> Hi,
>>
>> Has anyone seen this error?
>>
>> The code of method specialStateTransition(int, IntStream) is exceeding the 
>> 65535 bytes limitSparkSqlParser_IdentifiersParser.java:39907
>>
>> The error is in ANTLR generated files and it’s (according to Stack
>> Overflow) due to state explosion in parser (or lexer). That seems
>> plausible, given that one file has >5 lines of code. Some suggest that
>> refactoring the grammar would help.
>>
>> I’m seeing this error only sometimes on the command line (building with
>> Sbt), but every time when building with Eclipse (which has its own Java
>> compiler, so it’s not surprising that it has a different behavior). Same
>> behavior with both Java 1.7 and 1.8.
>>
>> Any ideas?
>>
>> iulian
>> ​
>> --
>>
>> --
>> Iulian Dragos
>>
>> --
>> Reactive Apps on the JVM
>> www.typesafe.com
>>
>>
>


Are we running SparkR tests in Jenkins?

2016-01-15 Thread Herman van Hövell tot Westerflier
Hi all,

I just noticed the following log entry in Jenkins:


> Running SparkR tests
> 
> Running R applications through 'sparkR' is not supported as of Spark 2.0.
> Use ./bin/spark-submit 


Are we still running R tests? Or just saying that this will be deprecated?

Kind regards,

Herman van Hövell tot Westerflier


Re: Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
Thanks. I'll merge the most recent master...

Still curious if we can stop a build.

Kind regards,

Herman van Hövell tot Westerflier

2015-12-29 18:59 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:

> HiveThriftBinaryServerSuite got stuck.
>
> I thought Josh has fixed this issue:
>
> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
> HiveThriftBinaryServerSuite
>
> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> My AMPLAB jenkins build has been stuck for a few hours now:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>
>> Is there a way for me to stop the build?
>>
>> Kind regards,
>>
>> Herman van Hövell
>>
>>
>


Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
My AMPLAB jenkins build has been stuck for a few hours now:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull

Is there a way for me to stop the build?

Kind regards,

Herman van Hövell


Re: Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
Hi Josh,

Your HiveThriftBinaryServerSuite fix wasn't in the build I was running (I
forgot to merge the latest master). So it might actually work.

As for stopping the build, it is understandable that you cannot do that
without the proper permissions. It would still be cool to be able to issue
a 'stop build' command from github though.

Kind regards,

Herman

2015-12-29 19:19 GMT+01:00 Josh Rosen <joshro...@databricks.com>:

> Yeah, I thought that my quick fix might address the
> HiveThriftBinaryServerSuite hanging issue, but it looks like it didn't work
> so I'll now have to do the more principled fix of using a UDF which sleeps
> for some amount of time.
>
> In order to stop builds, you need to have a Jenkins account with the
> proper permissions. I believe that it's generally only Spark committers and
> AMPLab members who have accounts + Jenkins SSH access.
>
> I've gone ahead and killed the build for you. It looks like someone had
> configured the pull request builder timeout to be 300 minutes (5 hours),
> but I think we should consider decreasing that to match the timeout used by
> the Spark full test jobs.
>
> On Tue, Dec 29, 2015 at 10:04 AM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> Thanks. I'll merge the most recent master...
>>
>> Still curious if we can stop a build.
>>
>> Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>> 2015-12-29 18:59 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>>
>>> HiveThriftBinaryServerSuite got stuck.
>>>
>>> I thought Josh has fixed this issue:
>>>
>>> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
>>> HiveThriftBinaryServerSuite
>>>
>>> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@questtec.nl> wrote:
>>>
>>>> My AMPLAB jenkins build has been stuck for a few hours now:
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>>>
>>>> Is there a way for me to stop the build?
>>>>
>>>> Kind regards,
>>>>
>>>> Herman van Hövell
>>>>
>>>>
>>>
>>
>


Re: Lead operator not working as aggregation operator

2015-11-02 Thread Herman van Hövell tot Westerflier
Hi,

This is more a question for the User list.

Lead and Lag imply an ordering of the whole dataset, which is not
supported. You can use Lead/Lag in an ordered window function and you'll be
fine:

select lead(max(expenses)) over (order by customerId)
from tbl group by customerId
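
For reference, a rough DataFrame equivalent of that query (just a sketch; the
df variable, column names and aliases are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Aggregate first, then apply lead over an ordered window on the result.
val byCustomer = Window.orderBy("customerId")

val result = df
  .groupBy("customerId")
  .agg(max("expenses").as("maxExpenses"))
  .select(col("customerId"), lead("maxExpenses", 1).over(byCustomer).as("nextMax"))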

HTH

Met vriendelijke groet/Kind regards,

Herman van Hövell tot Westerflier

QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
hvanhov...@questtec.nl
+31 6 420 590 27


2015-11-02 11:33 GMT+01:00 Shagun Sodhani <sshagunsodh...@gmail.com>:

> Hi! I was trying out window functions in SparkSql (using hive context)
> and I noticed that while this
> <https://issues.apache.org/jira/browse/TAJO-919?jql=text%20~%20%22lag%20window%22>
> mentions that *lead* is implemented as an aggregate operator, it seems
> not to be the case.
>
> I am using the following configuration:
>
> Query : SELECT lead(max(`expenses`)) FROM `table` GROUP BY `customerId`
> Spark Version: 10.4
> SparkSql Version: 1.5.1
>
> I am using the standard example of (`customerId`, `expenses`) scheme where
> each customer has multiple values for expenses (though I am setting age as
> Double and not Int as I am trying out maths functions).
>
>
> *java.lang.NullPointerException at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFLeadLag.evaluate(GenericUDFLeadLag.java:57)*
>
> The entire error stack can be found here <http://pastebin.com/jTRR4Ubx>.
>
> Can someone confirm if this is an actual issue or some oversight on my
> part?
>
> Thanks!
>


Re: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Herman van Hövell tot Westerflier
We could also fall back to approximate count distincts when the user
requests multiple count distincts. This is less invasive than throwing an
AnalysisException, but it could violate the principle of least surprise.
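
For what it's worth, the approximate variant is already exposed through the
DataFrame API, so users could ask for it explicitly; a minimal sketch against
the foo/colA/colB example quoted below:

import org.apache.spark.sql.functions._

// Approximate distinct counts for two columns in a single aggregation.
val approx = sqlContext.table("foo").agg(
  approxCountDistinct("colA").as("approx_distinct_a"),
  approxCountDistinct("colB").as("approx_distinct_b"))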



Met vriendelijke groet/Kind regards,

Herman van Hövell tot Westerflier

QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
hvanhov...@questtec.nl
+599 9 521 4402


2015-10-07 22:43 GMT+02:00 Reynold Xin <r...@databricks.com>:

> Adding user list too.
>
>
>
> -- Forwarded message --
> From: Reynold Xin <r...@databricks.com>
> Date: Tue, Oct 6, 2015 at 5:54 PM
> Subject: Re: multiple count distinct in SQL/DataFrame?
> To: "dev@spark.apache.org" <dev@spark.apache.org>
>
>
> To provide more context, if we do remove this feature, the following SQL
> query would throw an AnalysisException:
>
> select count(distinct colA), count(distinct colB) from foo;
>
> The following should still work:
>
> select count(distinct colA) from foo;
>
> The following should also work:
>
> select count(distinct colA, colB) from foo;
>
>
> On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> The current implementation of multiple count distinct in a single query
>> is very inferior in terms of performance and robustness, and it is also
>> hard to guarantee correctness of the implementation in some of the
>> refactorings for Tungsten. Supporting a better version of it is possible in
>> the future, but will take a lot of engineering efforts. Most other
>> Hadoop-based SQL systems (e.g. Hive, Impala) don't support this feature.
>>
>> As a result, we are considering removing support for multiple count
>> distinct in a single query in the next Spark release (1.6). If you use this
>> feature, please reply to this email. Thanks.
>>
>> Note that if you don't care about null values, it is relatively easy to
>> reconstruct a query using joins to support multiple distincts.
>>
>>
>>
>
>


Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
Hello Nick,

I have been working on a (UDT-less) implementation of HLL++. You can find
the PR here: https://github.com/apache/spark/pull/8362. This currently
implements the dense version of HLL++, which is a further development of
HLL. It returns a Long, but it shouldn't be too hard to return a Row
containing the cardinality and/or the HLL registers (the binary data).

I am curious what the stance is on using UDTs in the new UDAF interface. Is
this still viable? It wouldn't work with UnsafeRow, for instance. The
OpenHashSetUDT, for example, would be a nice building block for CollectSet
and all Distinct Aggregate operators. Are there any opinions on this?

Kind regards,

Herman van Hövell tot Westerflier

QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
hvanhov...@questtec.nl
+599 9 521 4402


2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:

> Inspired by this post:
> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
> I've started putting together something based on the Spark 1.5 UDAF
> interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>
> Some questions -
>
> 1. How do I get the UDAF to accept input arguments of different type? We
> can hash anything basically for HLL - Int, Long, String, Object, raw bytes
> etc. Right now it seems we'd need to build a new UDAF for each input type,
> which seems strange - I should be able to use one UDAF that can handle raw
> input of different types, as well as handle existing HLLs that can be
> merged/aggregated (e.g. for grouped data)
> 2. @Reynold, how would I ensure this works for Tungsten (ie against raw
> bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
> Where should I look for examples on how this works internally?
> 3. I've based this on the Sum and Avg examples for the new UDAF interface
> - any suggestions or issue please advise. Is the intermediate buffer
> efficient?
> 4. The current HyperLogLogUDT is private - so I've had to make my own one
> which is a bit pointless as it's copy-pasted. Any thoughts on exposing that
> type? Or I need to make the package spark.sql ...
>
> Nick
>
> On Thu, Jul 2, 2015 at 8:06 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> Yes - it's very interesting. However, ideally we should have a version of
>> hyperloglog that can work directly against some raw bytes in memory (rather
>> than java objects), in order for this to fit the Tungsten execution model
>> where everything is operating directly against some memory address.
>>
>> On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <nick.pentre...@gmail.com
>> > wrote:
>>
>>> Sure I can copy the code but my aim was more to understand:
>>>
>>> (A) if this is broadly interesting enough to folks to think about
>>> updating / extending the existing UDAF within Spark
>>> (b) how to register ones own custom UDAF - in which case it could be a
>>> Spark package for example
>>>
>>> All examples deal with registering a UDF but nothing about UDAFs
>>>
>>> —
>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>
>>>
>>> On Wed, Jul 1, 2015 at 6:32 PM, Daniel Darabos <
>>> daniel.dara...@lynxanalytics.com> wrote:
>>>
>>>> It's already possible to just copy the code from countApproxDistinct
>>>> <https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1153>
>>>>  and
>>>> access the HLL directly, or do anything you like.
>>>>
>>>> On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Any thoughts?
>>>>>
>>>>> —
>>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>>
>>>>>
>>>>> On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath <
>>>>> nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> Hey Spark devs
>>>>>>
>>>>>> I've been looking at DF UDFs and UDAFs. The approx distinct is using
>>>>>> hyperloglog,
>>>>>> but there is only an option to return the count as a Long.
>>>>>>
>>>>>> It can be useful to be able to return and store the actual data
>>>>>> structure (ie serialized HLL). This effectively allows one to do
>>>>>> aggregation / rollups over columns while still preserving the ability to
>>>>>> get distinct counts.
>>>>>>
>>>>>> For exa

Re: HyperLogLogUDT

2015-09-12 Thread Herman van Hövell tot Westerflier
I am typically all for code re-use. The reason for writing this is to
avoid the indirection of a UDT and work directly against memory. A UDT
will work fine at the moment because we still use
GenericMutableRow/SpecificMutableRow as aggregation buffers. However, if you
were to use an UnsafeRow as an AggregationBuffer (which is attractive when
you have a lot of groups during aggregation), using a UDT is either
impossible or it becomes very slow, because it would require us to
deserialize/serialize the UDT on every update.

As for compatibility, the implementation produces exactly the same results
as the ClearSpring implementation. You could easily convert the HLL++
register values into the ClearSpring representation and export those.
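
For anyone who wants the library-based route from the user side today, below
is a rough sketch of a Spark 1.5 UserDefinedAggregateFunction that keeps a
serialized ClearSpring HLL+ in a BinaryType buffer. It pays exactly the
per-update serialize/deserialize cost described above, and all names are
illustrative rather than part of the PR:

import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class HllCount(p: Int = 14) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("hll", BinaryType) :: Nil)
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  private def fromBytes(bytes: Array[Byte]): HyperLogLogPlus =
    HyperLogLogPlus.Builder.build(bytes)

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer.update(0, new HyperLogLogPlus(p).getBytes())

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      // Deserialize, offer the value, serialize back; this is the per-update
      // cost that a buffer-native implementation avoids.
      val hll = fromBytes(buffer.getAs[Array[Byte]](0))
      hll.offer(input.getString(0))
      buffer.update(0, hll.getBytes())
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val left = fromBytes(buffer1.getAs[Array[Byte]](0))
    left.addAll(fromBytes(buffer2.getAs[Array[Byte]](0)))
    buffer1.update(0, left.getBytes())
  }

  override def evaluate(buffer: Row): Any =
    fromBytes(buffer.getAs[Array[Byte]](0)).cardinality()
}

Registered via sqlContext.udf.register("hll_count", new HllCount()), it can be
used in SQL and on grouped DataFrames; returning the raw registers instead of
the count (as Nick suggests) would just mean switching dataType to BinaryType
and returning the bytes from evaluate.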

Met vriendelijke groet/Kind regards,

Herman van Hövell tot Westerflier

QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
hvanhov...@questtec.nl
+599 9 521 4402


2015-09-12 11:06 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:

> I should add that surely the idea behind UDT is exactly that it can (a)
> fit automatically into DFs and Tungsten and (b) that it can be used
> efficiently in writing ones own UDTs and UDAFs?
>
>
> On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <nick.pentre...@gmail.com
> > wrote:
>
>> Can I ask why you've done this as a custom implementation rather than
>> using StreamLib, which is already implemented and widely used? It seems
>> more portable to me to use a library - for example, I'd like to export the
>> grouped data with raw HLLs to say Elasticsearch, and then do further
>> on-demand aggregation in ES and visualization in Kibana etc.
>>
>> Others may want to do something similar into Hive, Cassandra, HBase or
>> whatever they are using. In this case they'd need to use this particular
>> implementation from Spark which may be tricky to include in a dependency
>> etc.
>>
>> If there are enhancements, does it not make sense to do a PR to
>> StreamLib? Or does this interact in some better way with Tungsten?
>>
>> I am unclear on how the interop with Tungsten raw memory works - some
>> pointers on that and where to look in the Spark code would be helpful.
>>
>> On Sat, Sep 12, 2015 at 10:45 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> Hello Nick,
>>>
>>> I have been working on a (UDT-less) implementation of HLL++. You can
>>> find the PR here: https://github.com/apache/spark/pull/8362. This
>>> current implements the dense version of HLL++, which is a further
>>> development of HLL. It returns a Long, but it shouldn't be to hard to
>>> return a Row containing the cardinality and/or the HLL registers (the
>>> binary data).
>>>
>>> I am curious what the stance is on using UDTs in the new UDAF interface.
>>> Is this still viable? This wouldn't work with UnsafeRow for instance. The
>>> OpenHashSetUDT for instance would be a nice building block for CollectSet
>>> and all Distinct Aggregate operators. Are there any opinions on this?
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell tot Westerflier
>>>
>>> QuestTec B.V.
>>> Torenwacht 98
>>> 2353 DC Leiderdorp
>>> hvanhov...@questtec.nl
>>> +599 9 521 4402
>>>
>>>
>>> 2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>:
>>>
>>>> Inspired by this post:
>>>> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/,
>>>> I've started putting together something based on the Spark 1.5 UDAF
>>>> interface: https://gist.github.com/MLnick/eca566604f2e4e3c6141
>>>>
>>>> Some questions -
>>>>
>>>> 1. How do I get the UDAF to accept input arguments of different type?
>>>> We can hash anything basically for HLL - Int, Long, String, Object, raw
>>>> bytes etc. Right now it seems we'd need to build a new UDAF for each input
>>>> type, which seems strange - I should be able to use one UDAF that can
>>>> handle raw input of different types, as well as handle existing HLLs that
>>>> can be merged/aggregated (e.g. for grouped data)
>>>> 2. @Reynold, how would I ensure this works for Tungsten (ie against raw
>>>> bytes in memory)? Or does the new Aggregate2 stuff automatically do that?
>>>> Where should I look for examples on how this works internally?
>>>> 3. I've based this on the Sum and Avg examples for the new UDAF
>>>> interface - any suggestions or issue please advise. Is the intermediate
>