[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...

2018-12-04 Thread TomaszGaweda
Github user TomaszGaweda closed the pull request at:

https://github.com/apache/spark/pull/22249


---




[GitHub] spark issue #22399: [SPARK-25408] Move to more idiomatic Java 8

2018-09-11 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/22399
  
I like this idea :) +1 as it's only a refactor, without logic changes.


---




[GitHub] spark pull request #22399: [SPARK-25408] Move to more idiomatic Java 8

2018-09-11 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22399#discussion_r216824283
  
--- Diff: common/kvstore/src/test/java/org/apache/spark/util/kvstore/DBIteratorSuite.java ---
@@ -383,7 +383,7 @@ public void testRefWithIntNaturalKey() throws Exception {
     LevelDBSuite.IntKeyType i = new LevelDBSuite.IntKeyType();
     i.key = 1;
     i.id = "1";
-    i.values = Arrays.asList("1");
+    i.values = Collections.singletonList("1");
--- End diff --

If you are already changing this, I would also suggest adding static imports to shorten the code.


---




[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...

2018-08-28 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22249#discussion_r213232338
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2459,6 +2459,26 @@ object functions {
 StringTrimLeft(e.expr, Literal(trimString))
   }
 
+  /**
+* Extracts a part from a URL.
+*
+* @group string_funcs
+* @since 2.4.0
+*/
+  def parse_url(url: Column, partToExtract: String): Column = withExpr {
--- End diff --

@HyukjinKwon I see now. Yeah, wrapping in the `Column` will be necessary; at least no string concatenation will be required.


---




[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...

2018-08-28 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22249#discussion_r213213173
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2459,6 +2459,26 @@ object functions {
 StringTrimLeft(e.expr, Literal(trimString))
   }
 
+  /**
+* Extracts a part from a URL.
+*
+* @group string_funcs
+* @since 2.4.0
+*/
+  def parse_url(url: Column, partToExtract: String): Column = withExpr {
--- End diff --

@HyukjinKwon Thanks for the suggestion; however, users are now complaining about the stringly-typed parts of Spark's API, and there are libraries like Frameless from Typelevel to achieve a bit more type safety. `expr` is stringly-typed, while functions in the `functions` object or accessed via a `UserDefinedFunction` are a bit more type safe.
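
For illustration, a minimal sketch of the difference being discussed. The wrapper mirrors the signature in the diff above but is not part of the released `functions` object; it uses `ParseUrl` directly because `withExpr` is private, and the column name and `'HOST'` part are illustrative:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.{Literal, ParseUrl}
import org.apache.spark.sql.functions._

// Stringly-typed: the column reference and the function call live inside a
// string, so typos are only caught at analysis time.
val viaExpr: Column = expr("parse_url(url, 'HOST')")

// The wrapper proposed in this PR: the URL column is a typed Column value;
// only the part to extract remains a string.
def parse_url(url: Column, partToExtract: String): Column =
  new Column(ParseUrl(Seq(url.expr, Literal(partToExtract))))

val viaFunction: Column = parse_url(col("url"), "HOST")
```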


---




[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...

2018-08-27 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22249#discussion_r213126158
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2459,6 +2459,26 @@ object functions {
 StringTrimLeft(e.expr, Literal(trimString))
   }
 
+  /**
+* Extracts a part from a URL.
+*
+* @group string_funcs
+* @since 2.4.0
+*/
+  def parse_url(url: Column, partToExtract: String): Column = withExpr {
--- End diff --

Ok, tomorrow I will create a Jira and start working on it. Thanks for your 
comments! :)


---




[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...

2018-08-27 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22249#discussion_r213112726
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2459,6 +2459,26 @@ object functions {
 StringTrimLeft(e.expr, Literal(trimString))
   }
 
+  /**
+* Extracts a part from a URL.
+*
+* @group string_funcs
+* @since 2.4.0
+*/
+  def parse_url(url: Column, partToExtract: String): Column = withExpr {
--- End diff --

In the long term, I would suggest a method that returns a handler for any registered function, so that you could write: `SqlFunction something = spark.(...?).getFunction("parse_url")`. Currently, `spark.udf.register` returns a handler for a UDF; something similar for retrieving any kind of registered function may be helpful. However, it's a lot more work, so for now I've just proposed adding this function :)
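
For reference, a minimal sketch of the existing UDF handle mentioned above; the `getFunction`-style API for built-in functions remains hypothetical and does not exist in Spark, and the UDF name and sample data are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("udf-handle").getOrCreate()
import spark.implicits._

// spark.udf.register returns a UserDefinedFunction handle that can be applied
// to Columns directly, which is the behaviour the comment above refers to.
val toUpper = spark.udf.register("toUpper", (s: String) => s.toUpperCase)

Seq("spark", "sql").toDF("word")
  .select(toUpper($"word").as("upper"))
  .show()
```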


---




[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...

2018-08-27 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22249#discussion_r213110408
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2459,6 +2459,26 @@ object functions {
 StringTrimLeft(e.expr, Literal(trimString))
   }
 
+  /**
+* Extracts a part from a URL.
+*
+* @group string_funcs
+* @since 2.4.0
+*/
+  def parse_url(url: Column, partToExtract: String): Column = withExpr {
--- End diff --

This PR was created after this StackOverflow question: 
https://stackoverflow.com/questions/52041342/how-to-parse-url-in-spark-sqlscala/52043771

I'm not sure how often it is used; however, most functions are available in the `functions` object to keep the Scala and SQL interfaces similar. If you think it's useless, please let me know and I'll just close this PR.


---




[GitHub] spark issue #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to functions...

2018-08-27 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/22249
  
@gatorsmile @cloud-fan @HyukjinKwon @mgaido91 Could you please review this 
PR and start tests? 


---




[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...

2018-08-27 Thread TomaszGaweda
GitHub user TomaszGaweda opened a pull request:

https://github.com/apache/spark/pull/22249

[SPARK-16281][SQL][FOLLOW-UP] Add parse_url to functions.scala

## What changes were proposed in this pull request?

Add the `parse_url` function to `functions.scala`. This will allow users to use this function without calling `selectExpr` or `spark.sql`.
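
For context, a minimal sketch of how `parse_url` has to be called today versus with the proposed function, assuming an active `SparkSession` named `spark`; the sample URL is illustrative:

```scala
import spark.implicits._

val urls = Seq("https://spark.apache.org/docs/latest/index.html?lang=en#intro").toDF("url")

// Today: the built-in SQL function is only reachable through SQL strings.
urls.selectExpr(
  "parse_url(url, 'HOST') AS host",
  "parse_url(url, 'QUERY', 'lang') AS lang"
).show(false)

// With this PR, the same call could be expressed through the typed functions API:
// urls.select(parse_url($"url", "HOST").as("host")).show(false)
```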

## How was this patch tested?

The `testUrl` test function was extended to also cover this change.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/TomaszGaweda/spark parseUrl

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22249.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22249


commit 4002cb9c48b608123845b89af9f77e137626f83f
Author: Tomasz Gaweda 
Date:   2018-08-27T20:05:50Z

SPARK-16281 Follow-up: parse_url in functions.scala




---




[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...

2018-08-13 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/22063#discussion_r209713521
  
--- Diff: project/MimaExcludes.scala ---
@@ -36,6 +36,11 @@ object MimaExcludes {
 
   // Exclude rules for 2.4.x
   lazy val v24excludes = v23excludes ++ Seq(
+// [SPARK-25044] Address translation of LMF closure primitive args to Object in Scala 2.1
--- End diff --

Nit: Wrong Scala version


---




[GitHub] spark issue #21979: [SPARK-25009][CORE]Standalone Cluster mode application s...

2018-08-02 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21979
  
@vanzin @squito, it was probably deleted by mistake in this commit: 
https://github.com/devaraj-kavali/spark/commit/3cb82047f2f51af553df09b9323796af507d36f8


---




[GitHub] spark pull request #21875: [SPARK-24288][SQL] Add a JDBC Option to enable pr...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21875#discussion_r205269695
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala ---
@@ -172,7 +172,11 @@ private[sql] case class JDBCRelation(
 
   // Check if JDBCRDD.compileFilter can accept input filters
   override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
-    filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty)
+    if (jdbcOptions.pushDownPredicate) {
--- End diff --

Indeed, it's confusing. The `buildScan` argument could be named `pushedFilters` (and the variables as well); then the code would be self-describing.


---




[GitHub] spark pull request #21875: [SPARK-24288][SQL] Add a JDBC Option to enable pr...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21875#discussion_r205269684
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala ---
@@ -172,7 +172,11 @@ private[sql] case class JDBCRelation(
 
   // Check if JDBCRDD.compileFilter can accept input filters
   override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
-    filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty)
+    if (jdbcOptions.pushDownPredicate) {
--- End diff --

Indeed, it's confusing. The `buildScan` argument could be named `pushedFilters` (and the variables as well); then the code would be self-describing.


---




[GitHub] spark pull request #21875: [SPARK-24288][SQL] Add a JDBC Option to enable pr...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21875#discussion_r205268370
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---
@@ -183,6 +183,9 @@ class JDBCOptions(
 }
   // An option to execute custom SQL before fetching data from the remote DB
   val sessionInitStatement = parameters.get(JDBC_SESSION_INIT_STATEMENT)
+
+  // An option to allow/disallow pushing down predicate into JDBC data source
+  val pushDownPredicate = parameters.getOrElse(JDBC_PUSHDOWN_PREDICATE, "true").toBoolean
--- End diff --

Yeah, consistency is a very good argument :) Indeed, it will be plural or not depending on which side we are looking at it from.


---




[GitHub] spark pull request #21875: [SPARK-24288][SQL] Add a JDBC Option to enable pr...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21875#discussion_r205267937
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala ---
@@ -172,7 +172,11 @@ private[sql] case class JDBCRelation(
 
   // Check if JDBCRDD.compileFilter can accept input filters
   override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
-    filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty)
+    if (jdbcOptions.pushDownPredicate) {
--- End diff --

I see now, my mistake. Thanks for the clarification :)


---




[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21858#discussion_r205243526
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -1150,16 +1150,48 @@ object functions {
   /**
    * A column expression that generates monotonically increasing 64-bit integers.
    *
-   * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
+   * The generated IDs are guaranteed to be monotonically increasing and unique, but not
+   * consecutive (unless all rows are in the same single partition which you rarely want due to
+   * the volume of the data).
    * The current implementation puts the partition ID in the upper 31 bits, and the record number
    * within each partition in the lower 33 bits. The assumption is that the data frame has
    * less than 1 billion partitions, and each partition has less than 8 billion records.
    *
-   * As an example, consider a `DataFrame` with two partitions, each with 3 records.
-   * This expression would return the following IDs:
-   *
    * {{{
-   * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
+   * // Create a dataset with four partitions, each with two rows.
+   * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4)
+   *
+   * // Make sure that every partition has the same number of rows
+   * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2))
+   * q.select(monotonically_increasing_id).show
--- End diff --

IMHO it's enough to state that IDs are consecutive within each partition but not across partitions, and that the partition ID is shifted left by 33 bits; written in words rather than code, this will be much shorter and more concise.
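
To make the wording above concrete, a small sketch of the ID layout described in the Scaladoc (pure Scala, no Spark session needed; `expectedId` is just an illustrative helper):

```scala
// Partition ID in the upper 31 bits, per-partition record number in the lower 33 bits:
// IDs are consecutive within a partition but jump by 1L << 33 between partitions.
def expectedId(partitionId: Int, recordNumber: Long): Long =
  (partitionId.toLong << 33) | recordNumber

assert(expectedId(0, 0) == 0L)
assert(expectedId(0, 1) == 1L)
assert(expectedId(1, 0) == 8589934592L) // 1L << 33
```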


---




[GitHub] spark pull request #21875: [SPARK-24288][SQL] Enable preventing predicate pu...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21875#discussion_r205224686
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala ---
@@ -183,6 +183,9 @@ class JDBCOptions(
 }
   // An option to execute custom SQL before fetching data from the remote DB
   val sessionInitStatement = parameters.get(JDBC_SESSION_INIT_STATEMENT)
+
+  // An option to allow/disallow pushing down predicate into JDBC data source
+  val pushDownPredicate = parameters.getOrElse(JDBC_PUSHDOWN_PREDICATE, "true").toBoolean
--- End diff --

Super nit: shouldn't it be plural, `pushDownPredicates`? There may be many predicates.


---




[GitHub] spark pull request #21875: [SPARK-24288][SQL] Enable preventing predicate pu...

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on a diff in the pull request:

https://github.com/apache/spark/pull/21875#discussion_r205227335
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala ---
@@ -172,7 +172,11 @@ private[sql] case class JDBCRelation(
 
   // Check if JDBCRDD.compileFilter can accept input filters
   override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {
-    filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty)
+    if (jdbcOptions.pushDownPredicate) {
--- End diff --

Are you sure that this is the only place? `JDBCRDD.scanTable` defines `filters` as all filters that may be pushed down. We should probably use `filters -- unhandledFilters` in `JDBCRelation.buildScan`.
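
A toy illustration of the `filters -- unhandledFilters` idea, outside of `JDBCRelation` (the concrete filters are made up; `EqualTo` and `StringContains` come from `org.apache.spark.sql.sources`):

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter, StringContains}

val filters: Array[Filter] = Array(EqualTo("status", "OPEN"), StringContains("name", "abc"))

// Suppose the dialect cannot compile the StringContains filter:
val unhandled: Array[Filter] = Array(StringContains("name", "abc"))

// What buildScan would hand to JDBCRDD.scanTable instead of the full filter list.
val pushed: Array[Filter] = filters.filterNot(f => unhandled.contains(f))
```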


---




[GitHub] spark issue #21875: [SPARK-24288][SQL] Enable preventing predicate pushdown

2018-07-25 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21875
  
Thanks! :) LGTM


---




[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown

2018-07-09 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21360
  
@gatorsmile That makes sense :) Simple predicates can be placed in the `dbtable` option. The current approach is still more powerful, but if you think the risk is too big, we can switch to a reader option.
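
A sketch of the two alternatives mentioned above, assuming an active `SparkSession` named `spark`; the connection string, table, and predicate are placeholders, and `pushDownPredicate` is the JDBC option shown in the `JDBCOptions` diff earlier in this thread:

```scala
// 1) Disable pushdown entirely for this relation via the JDBC option.
val noPushdown = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/SAMPLE")
  .option("dbtable", "APP.EVENTS")
  .option("pushDownPredicate", "false")
  .load()

// 2) Push a simple predicate yourself by embedding it in a dbtable subquery.
val manualPredicate = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/SAMPLE")
  .option("dbtable", "(SELECT * FROM APP.EVENTS WHERE status = 'OPEN') t")
  .load()
```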


---




[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown

2018-07-09 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21360
  
@gatorsmile This will reduce usability a lot. With the current approach you can still push down the filters that may speed up reading, while a global option will affect every other Dataset. To be honest, the new JDBC option won't fulfill users' requirements, at least for those users who asked me for workarounds ;)


---




[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown

2018-07-09 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21360
  
Hi @maryannxue, can you please rebase this PR? Then a review by others may be possible. It would be great to include this in Spark 2.4 :) Thanks :)


---




[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation

2018-06-20 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21594
  
IMHO it is good, but it may confuse users. Could you please add some Javadoc to explain the difference?


---




[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown

2018-06-08 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21360
  
I've tested it with my application that had problems with predicate pushdown to the database. Looks good; performance is degraded a bit, but the previous run was on Spark 2.3, not 2.4. However, memory consumption is much better, as I don't have to cache the input Datasets.

LGTM from the functional side. @gatorsmile @cloud-fan Could you please review it? Thanks!


---




[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore

2018-06-06 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21500
  
@HeartSaVioR IMHO we should consider a new state store provider such as RocksDB, like Flink and Databricks Delta did. It is not a direct fix, but it would improve latency and memory consumption; maybe additional management on the Spark side won't be required then.
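
For what it's worth, Spark already exposes a configuration hook for choosing the state store provider, so a RocksDB-backed implementation could in principle be plugged in that way; the provider class name below is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// spark.sql.streaming.stateStore.providerClass selects the StateStoreProvider
// implementation; the default is the HDFS-backed provider discussed in this PR.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.streaming.stateStore.providerClass",
    "com.example.state.RocksDBStateStoreProvider") // hypothetical class name
  .getOrCreate()
```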


---




[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown

2018-05-29 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21360
  
Hi @maryannxue, thanks for the PR! Could you please rebase it? 


---




[GitHub] spark issue #21444: Branch 2.3

2018-05-28 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21444
  
Please close this PR


---




[GitHub] spark issue #21432: [SPARK-24373][SQL] Add AnalysisBarrier to RelationalGrou...

2018-05-25 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21432
  
Probably a similar change should also be made in `KeyValueGroupedDataset`; it also uses `logicalPlan` without an `AnalysisBarrier`.


---




[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown

2018-05-18 Thread TomaszGaweda
Github user TomaszGaweda commented on the issue:

https://github.com/apache/spark/pull/21360
  
@viirya I've described it in the ticket. In my case, pushing down ORs with non-equality predicates caused DB2 to slow down; the workaround was to cache the data before filtering, which was approx. 10 times faster. This PR makes it possible to decide that you don't want to push down predicates, without having to cache.
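
The caching workaround described above, as a sketch (assuming an active `SparkSession` named `spark`; connection details, table, and predicate are placeholders):

```scala
val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:db2://dbhost:50000/SAMPLE")
  .option("dbtable", "APP.EVENTS")
  .load()

// Caching materializes the table on the Spark side first, so the OR with a
// non-equality predicate below is evaluated by Spark instead of being pushed
// down to DB2.
val filtered = events.cache()
  .filter("status <> 'CLOSED' OR retries > 3")
```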


---
