[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user TomaszGaweda closed the pull request at: https://github.com/apache/spark/pull/22249
[GitHub] spark issue #22399: [SPARK-25408] Move to more idiomatic Java8
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/22399 I like this idea :) +1 as it's only a refactor, without logic changes
[GitHub] spark pull request #22399: [SPARK-25408] Move to more idiomatic Java8
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22399#discussion_r216824283 --- Diff: common/kvstore/src/test/java/org/apache/spark/util/kvstore/DBIteratorSuite.java --- @@ -383,7 +383,7 @@ public void testRefWithIntNaturalKey() throws Exception { LevelDBSuite.IntKeyType i = new LevelDBSuite.IntKeyType(); i.key = 1; i.id = "1"; -i.values = Arrays.asList("1"); +i.values = Collections.singletonList("1"); --- End diff -- If you are already changing this, I would also suggest adding static imports to shorten the code
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213232338 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- @HyukjinKwon I see now. Yeah, wrapping in the `Column` will be necessary, at least no string concatenation will be required
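For reference, a minimal sketch of how the wrapper body could be completed (an assumption, not the merged code; it presumes the surrounding `functions` object with `withExpr` in scope and the Catalyst `ParseUrl(Seq[Expression])` constructor that backs the SQL function):

    def parse_url(url: Column, partToExtract: String): Column = withExpr {
      // Build the same Catalyst expression that backs the SQL parse_url function
      ParseUrl(Seq(url.expr, Literal(partToExtract)))
    }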
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213213173 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- @HyukjinKwon Thanks for the suggestion; however, users are now complaining about the stringly-typed parts of Spark, and there are libraries like Frameless from Typelevel to achieve a bit more type safety. `expr` is stringly-typed, while functions in the `functions` object or accessed via `UserDefinedFunction` are a bit more type safe.
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213126158 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- Ok, tomorrow I will create a Jira and start working on it. Thanks for your comments! :)
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213112726 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- In the long term, I would suggest a method that returns a handler for any registered function, so that you can write: SqlFunction something = spark.(...?).getFunction("parse_url"). Currently `spark.udf.register` returns a handler for the UDF; something similar for getting any kind of registered function may be helpful. However, it's a lot more work, so for now I've just proposed to add this function :)
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22249#discussion_r213110408 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2459,6 +2459,26 @@ object functions { StringTrimLeft(e.expr, Literal(trimString)) } + /** +* Extracts a part from a URL. +* +* @group string_funcs +* @since 2.4.0 +*/ + def parse_url(url: Column, partToExtract: String): Column = withExpr { --- End diff -- This PR was created after this StackOverflow question: https://stackoverflow.com/questions/52041342/how-to-parse-url-in-spark-sqlscala/52043771 I'm not sure how often it is used; however, most functions are available in the functions object to keep the Scala and SQL interfaces similar. If you think it's useless - please let me know and I'll just close this PR
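For context, the workaround discussed in that StackOverflow question goes through the SQL function directly; a sketch, assuming a DataFrame `df` with a string column `url`:

    import org.apache.spark.sql.functions.expr

    // Without a typed wrapper, the SQL function has to be called via expr/selectExpr
    val hosts = df.select(expr("parse_url(url, 'HOST')").as("host"))
    // equivalently
    val hosts2 = df.selectExpr("parse_url(url, 'HOST') AS host")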
[GitHub] spark issue #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to functions...
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/22249 @gatorsmile @cloud-fan @HyukjinKwon @mgaido91 Could you please review this PR and start tests?
[GitHub] spark pull request #22249: [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to fu...
GitHub user TomaszGaweda opened a pull request: https://github.com/apache/spark/pull/22249 [SPARK-16281][SQL][FOLLOW-UP] Add parse_url to functions.scala ## What changes were proposed in this pull request? Add the `parse_url` function to `functions.scala`. This will allow users to use this function without calling `selectExpr` or `spark.sql` ## How was this patch tested? The `testUrl` function was extended to also cover this change Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/TomaszGaweda/spark parseUrl Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22249.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22249 commit 4002cb9c48b608123845b89af9f77e137626f83f Author: Tomasz Gaweda Date: 2018-08-27T20:05:50Z SPARK-16281 Follow-up: parse_url in functions.scala
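For illustration, the typed call this patch proposes would look roughly like this (a sketch of the proposed API, assuming a DataFrame `df` with a string column `url`):

    import org.apache.spark.sql.functions.{col, parse_url}

    // No selectExpr or SQL string building needed once the wrapper exists
    val hosts = df.select(parse_url(col("url"), "HOST").as("host"))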
[GitHub] spark pull request #22063: [WIP][SPARK-25044][SQL] Address translation of LM...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/22063#discussion_r209713521 --- Diff: project/MimaExcludes.scala --- @@ -36,6 +36,11 @@ object MimaExcludes { // Exclude rules for 2.4.x lazy val v24excludes = v23excludes ++ Seq( +// [SPARK-25044] Address translation of LMF closure primitive args to Object in Scala 2.1 --- End diff -- Nit: Wrong Scala version
[GitHub] spark issue #21979: [SPARK-25009][CORE]Standalone Cluster mode application s...
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21979 @vanzin @squito, it was probably deleted by mistake in this commit: https://github.com/devaraj-kavali/spark/commit/3cb82047f2f51af553df09b9323796af507d36f8
[GitHub] spark pull request #21875: [SPARK-24288][SQL] Add a JDBC Option to enable pr...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/21875#discussion_r205269695 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala --- @@ -172,7 +172,11 @@ private[sql] case class JDBCRelation( // Check if JDBCRDD.compileFilter can accept input filters override def unhandledFilters(filters: Array[Filter]): Array[Filter] = { -filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty) +if (jdbcOptions.pushDownPredicate) { --- End diff -- Indeed, it's confusing. The `buildScan` argument could be named `pushedFilters` (and the variables too); then the code would be self-describing
[GitHub] spark pull request #21875: [SPARK-24288][SQL] Add a JDBC Option to enable pr...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/21875#discussion_r205268370 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -183,6 +183,9 @@ class JDBCOptions( } // An option to execute custom SQL before fetching data from the remote DB val sessionInitStatement = parameters.get(JDBC_SESSION_INIT_STATEMENT) + + // An option to allow/disallow pushing down predicate into JDBC data source + val pushDownPredicate = parameters.getOrElse(JDBC_PUSHDOWN_PREDICATE, "true").toBoolean --- End diff -- Yeah, consistency is a very good argument :) Indeed it will be plural or not, depending on which side we are looking at it from
[GitHub] spark pull request #21875: [SPARK-24288][SQL] Add a JDBC Option to enable pr...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/21875#discussion_r205267937 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala --- @@ -172,7 +172,11 @@ private[sql] case class JDBCRelation( // Check if JDBCRDD.compileFilter can accept input filters override def unhandledFilters(filters: Array[Filter]): Array[Filter] = { -filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty) +if (jdbcOptions.pushDownPredicate) { --- End diff -- I see now, my mistake. Thanks for the clarification :)
[GitHub] spark pull request #21858: [SPARK-24899][SQL][DOC] Add example of monotonica...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/21858#discussion_r205243526 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -1150,16 +1150,48 @@ object functions { /** * A column expression that generates monotonically increasing 64-bit integers. * - * The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. + * The generated IDs are guaranteed to be monotonically increasing and unique, but not + * consecutive (unless all rows are in the same single partition which you rarely want due to + * the volume of the data). * The current implementation puts the partition ID in the upper 31 bits, and the record number * within each partition in the lower 33 bits. The assumption is that the data frame has * less than 1 billion partitions, and each partition has less than 8 billion records. * - * As an example, consider a `DataFrame` with two partitions, each with 3 records. - * This expression would return the following IDs: - * * {{{ - * 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594. + * // Create a dataset with four partitions, each with two rows. + * val q = spark.range(start = 0, end = 8, step = 1, numPartitions = 4) + * + * // Make sure that every partition has the same number of rows + * q.mapPartitions(rows => Iterator(rows.size)).foreachPartition(rows => assert(rows.next == 2)) + * q.select(monotonically_increasing_id).show --- End diff -- IMHO it's enough to add that the IDs are consecutive within each partition, but not between partitions, and that the values are shifted left by 33 bits - written in words, not code, this will be much shorter and more concise
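A small sketch of the layout being discussed (an illustration only; it mirrors the documented scheme of partition ID in the upper 31 bits and the per-partition record number in the lower 33 bits):

    // How an ID is composed according to the documented layout
    def generatedId(partitionId: Int, recordNumber: Long): Long =
      (partitionId.toLong << 33) + recordNumber

    // partition 0 yields 0, 1, 2, ...; partition 1 starts at 1L << 33
    assert(generatedId(1, 0) == 8589934592L)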
[GitHub] spark pull request #21875: [SPARK-24288][SQL] Enable preventing predicate pu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/21875#discussion_r205224686 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala --- @@ -183,6 +183,9 @@ class JDBCOptions( } // An option to execute custom SQL before fetching data from the remote DB val sessionInitStatement = parameters.get(JDBC_SESSION_INIT_STATEMENT) + + // An option to allow/disallow pushing down predicate into JDBC data source + val pushDownPredicate = parameters.getOrElse(JDBC_PUSHDOWN_PREDICATE, "true").toBoolean --- End diff -- Super Nit: Shouldn't it be plural, pushDownPredicates? There may be many predicates
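For context, a sketch of how the option would be used from the reader side (assuming the singular option name `pushDownPredicate` as defined in the diff above; URL and table names are placeholders):

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "schema.orders")
      .option("pushDownPredicate", "false") // keep all filtering on the Spark side
      .load()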
[GitHub] spark pull request #21875: [SPARK-24288][SQL] Enable preventing predicate pu...
Github user TomaszGaweda commented on a diff in the pull request: https://github.com/apache/spark/pull/21875#discussion_r205227335 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala --- @@ -172,7 +172,11 @@ private[sql] case class JDBCRelation( // Check if JDBCRDD.compileFilter can accept input filters override def unhandledFilters(filters: Array[Filter]): Array[Filter] = { -filters.filter(JDBCRDD.compileFilter(_, JdbcDialects.get(jdbcOptions.url)).isEmpty) +if (jdbcOptions.pushDownPredicate) { --- End diff -- Are you sure that this is the only place? JDBCRDD.scanTable defines filters as all filters that may be pushed down. Probably we should use `filters -- unhandledFilters` in JDBCRelation.buildScan
[GitHub] spark issue #21875: [SPARK-24288][SQL] Enable preventing predicate pushdown
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21875 Thanks! :) LGTM
[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21360 @gatorsmile That makes sense :) Simple predicates can be placed in the dbtable option. The current approach is still more powerful, but if you think the risk is too big, we can switch to a reader option
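The reader-option workaround mentioned here, sketched with a placeholder URL, table, and predicate:

    // A simple predicate evaluated by the database via a subquery in the dbtable option
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:db2://host:50000/SAMPLE")
      .option("dbtable", "(SELECT * FROM orders WHERE order_date > '2018-01-01') AS t")
      .load()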
[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21360 @gatorsmile This will reduce usability a lot. With the current approach you can still push down the filters that may speed up reading; a global option will affect every other Dataset. To be honest, a new JDBC option won't fulfill users' requirements, at least not for those users who asked me for workarounds ;)
[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21360 Hi @maryannxue, can you please rebase this PR? Then a review by others may be possible. It would be great to include this in Spark 2.4 :) Thanks :)
[GitHub] spark issue #21594: [SPARK-24596][SQL] Non-cascading Cache Invalidation
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21594 IMHO it is good, but may confuse users. Could you please add some JavaDocs to explain the difference?
[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21360 I've tested it with my application that had problems with predicate pushdown to the database. Looks good; performance is degraded a bit, but it was previously run on Spark 2.3, not 2.4. However, memory consumption is much better, as I don't have to cache the input Datasets. LGTM from the functional side. @gatorsmile @cloud-fan Could you please review it? Thanks!
[GitHub] spark issue #21500: Scalable Memory option for HDFSBackedStateStore
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21500 @HeartSaVioR IMHO we should consider a new state store provider such as RocksDB, as Flink and Databricks Delta did. It is not a direct fix, but it will improve latency and memory consumption; maybe additional management on the Spark side won't be required
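For context, Spark already exposes a hook to swap the state store implementation, so an alternative provider could be plugged in via configuration (a sketch; the provider class name below is hypothetical):

    // Hypothetical provider class, shown only to illustrate the existing configuration hook
    spark.conf.set(
      "spark.sql.streaming.stateStore.providerClass",
      "com.example.RocksDBStateStoreProvider")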
[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21360 Hi @maryannxue, thanks for the PR! Could you please rebase it?
[GitHub] spark issue #21444: Branch 2.3
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21444 Please close this PR
[GitHub] spark issue #21432: [SPARK-24373][SQL] Add AnalysisBarrier to RelationalGrou...
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21432 Probably a similar change should also be made in KeyValueGroupedDataset. It also uses logicalPlan without AnalysisBarrier
[GitHub] spark issue #21360: [SPARK-24288] Enable preventing predicate pushdown
Github user TomaszGaweda commented on the issue: https://github.com/apache/spark/pull/21360 @viirya I've described it in the ticket. In my case, pushing down ORs with non-equality predicates caused DB2 to slow down; the workaround was to cache the data before filtering, which was approx. 10 times faster. This PR adds the possibility to decide that you don't want to push down predicates, without having to cache
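The workaround described here, roughly (a sketch; the DataFrame `jdbcDF` and the predicate are placeholders):

    import spark.implicits._

    // Materialize the JDBC read first so the OR / non-equality predicate
    // is evaluated by Spark instead of being pushed down to DB2
    val cached = jdbcDF.cache()
    val filtered = cached.filter($"status" =!= "CLOSED" || $"amount" > 1000)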