[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 This was superseded by #19643 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk closed the pull request at: https://github.com/apache/spark/pull/15666
[GitHub] spark pull request #19643: [SPARK-11421][CORE][PYTHON][R] Added ability for ...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/19643#discussion_r149085525
    --- Diff: R/pkg/R/context.R ---
    @@ -319,6 +319,27 @@ spark.addFile <- function(path, recursive = FALSE) {
       invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive))
     }
    +#' Adds a JAR dependency for Spark tasks to be executed in the future.
    +#'
    +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
    +#' If \code{addToCurrentClassLoader} is true, add the jar to the current driver.
    --- End diff --
maybe something like `underlying/backing java process` ?
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r122578282
    --- Diff: R/pkg/R/context.R ---
    @@ -319,6 +319,34 @@ spark.addFile <- function(path, recursive = FALSE) {
       invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive))
     }
    +
    +#' Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
    --- End diff --
In that case do we want to bother having this method for R?
---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r122578275
    --- Diff: R/pkg/R/context.R ---
    @@ -319,6 +319,34 @@ spark.addFile <- function(path, recursive = FALSE) {
       invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive))
     }
    +
    +#' Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
    +#'
    +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
    +#' If \code{addToCurrentClassLoader} is true, add the jar to the current threads' classloader. In
    +#' general adding to the current threads' class loader will impact all other application threads
    +#' unless they have explicitly changed their class loader.
    +#'
    +#' @rdname spark.addJar
    +#' @param path The path of the jar to be added
    +#' @param addToCurrentClassLoader Whether to add the jar to the current driver classloader.
    +#' Default is FALSE.
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' spark.addJar("/path/to/something.jar", TRUE)
    +#'}
    +#' @note spark.addJar since 2.2.0
    +spark.addJar <- function(path, addToCurrentClassLoader = FALSE) {
    --- End diff --
Done
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 @HyukjinKwon Any hints on what's needed to get the R tests passing? I don't really have a Windows testbed that I can use.
[GitHub] spark issue #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/16766 Let me rebase this. I don't currently have a clean way of testing this on Windows
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r106781948
    --- Diff: R/pkg/R/context.R ---
    @@ -319,6 +319,34 @@ spark.addFile <- function(path, recursive = FALSE) {
       invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive))
     }
    +
    +#' Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
    +#'
    +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
    +#' If \code{addToCurrentClassLoader} is true, add the jar to the current threads' classloader. In
    +#' general adding to the current threads' class loader will impact all other application threads
    +#' unless they have explicitly changed their class loader.
    +#'
    +#' @rdname spark.addJar
    +#' @param path The path of the jar to be added
    +#' @param addToCurrentClassLoader Whether to add the jar to the current driver classloader.
    +#' Default is FALSE.
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' spark.addJar("/path/to/something.jar", TRUE)
    +#'}
    +#' @note spark.addJar since 2.2.0
    +spark.addJar <- function(path, addToCurrentClassLoader = FALSE) {
    +  sc <- getSparkContext()
    +  normalizedPath <- suppressWarnings(normalizePath(path))
    +  scala_sc <- callJMethod(sc, "sc")
    +  invisible(callJMethod(scala_sc, "addJar", normalizedPath, addToCurrentClassLoader))
    --- End diff --
Why is normalizePath doing that to the URL?
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104908469
    --- Diff: R/pkg/R/context.R ---
    @@ -319,6 +319,34 @@ spark.addFile <- function(path, recursive = FALSE) {
       invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path)), recursive))
     }
    +
    +#' Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
    +#'
    +#' The \code{path} passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
    +#' If \code{addToCurrentClassLoader} is true, add the jar to the current threads' classloader. In
    +#' general adding to the current threads' class loader will impact all other application threads
    +#' unless they have explicitly changed their class loader.
    +#'
    +#' @rdname spark.addJar
    +#' @param path The path of the jar to be added
    +#' @param addToCurrentClassLoader Whether to add the jar to the current driver classloader.
    +#' Default is FALSE.
    +#' @export
    +#' @examples
    +#'\dontrun{
    +#' spark.addJar("/path/to/something.jar", TRUE)
    +#'}
    +#' @note spark.addJar since 2.2.0
    +spark.addJar <- function(path, addToCurrentClassLoader = FALSE) {
    --- End diff --
Mostly for backwards compatibility.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r104486665
    --- Diff: R/pkg/inst/tests/testthat/test_context.R ---
    @@ -167,6 +167,18 @@ test_that("spark.lapply should perform simple transforms", {
       sparkR.session.stop()
     })
    +test_that("add jar should work and allow usage of the jar on the driver node", {
    +  sparkR.sparkContext()
    +
    +  destDir <- paste0(tempdir(), "/", "testjar")
    +  jarName <- callJStatic("org.apache.spark.TestUtils", "createDummyJar",
    +                         destDir, "sparkrTests", "DummyClassForAddJarTest")
    +
    +  spark.addJar(jarName, addToCurrentClassLoader = TRUE)
    +  testClass <- newJObject("sparkrTests.DummyClassForAddJarTest")
    --- End diff --
Yeah, I suspect that the Windows path didn't make it properly into the classloader.
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 Ah thanks
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 whoops.
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 The failure seems to be this one: pyspark.SparkContext.addJar:10: ERROR: Unexpected indentation. What exactly does it want in that docstring?
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 @holdenk done
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 I'll see if I can rebase it tomorrow
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r100177099
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1802,19 +1802,34 @@ class SparkContext(config: SparkConf) extends Logging {
        * Adds a JAR dependency for all tasks to be executed on this `SparkContext` in the future.
        * @param path can be either a local file, a file in HDFS (or other Hadoop-supported filesystems),
        * an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
    +   * If addToCurrentClassLoader is true, attempt to add the new class to the current threads' class
    --- End diff --
Add to the doc that adding a URL has no effect if that URL is already present.
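The "no effect if a URL is already present" behaviour mentioned in the comment above can be illustrated with a minimal sketch. `AppendableClassLoader` and `addIfAbsent` are hypothetical names invented for this illustration; they are not taken from the pull request or from Spark, and the sketch only assumes the standard `java.net.URLClassLoader` API.

```scala
import java.net.{URL, URLClassLoader}

// Hypothetical helper, not Spark's class loader: exposes the protected
// URLClassLoader.addURL and reports whether the search path changed.
class AppendableClassLoader(parent: ClassLoader)
    extends URLClassLoader(Array.empty[URL], parent) {

  // Returns true if the URL was appended, false if it was already present.
  // URLs are compared by their string form to avoid host name resolution.
  def addIfAbsent(url: URL): Boolean =
    if (getURLs.exists(_.toString == url.toString)) {
      false
    } else {
      addURL(url)
      true
    }
}
```

Adding the same jar twice leaves the search path unchanged, which is the situation the review comment asks the documentation to mention.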
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r100176188
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1802,19 +1802,34 @@ class SparkContext(config: SparkConf) extends Logging {
        * Adds a JAR dependency for all tasks to be executed on this `SparkContext` in the future.
        * @param path can be either a local file, a file in HDFS (or other Hadoop-supported filesystems),
        * an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
    +   * If addToCurrentClassLoader is true, attempt to add the new class to the current threads' class
    +   * loader. In general adding to the current threads' class loader will impact all other
    +   * application threads unless they have explicitly changed their class loader.
        */
       def addJar(path: String) {
    +    addJar(path, false)
    +  }
    +
    +  def addJar(path: String, addToCurrentClassLoader: Boolean) {
         if (path == null) {
           logWarning("null specified as parameter to addJar")
         } else {
           var key = ""
    -      if (path.contains("\\")) {
    +
    +      val uri = if (path.contains("\\")) {
             // For local paths with backslashes on Windows, URI throws an exception
    -        key = env.rpcEnv.fileServer.addJar(new File(path))
    +        new File(path).toURI
           } else {
             val uri = new URI(path)
             // SPARK-17650: Make sure this is a valid URL before adding it to the list of dependencies
             Utils.validateURL(uri)
    +        uri
    +      }
    +
    +      if (path.contains("\\")) {
    +        // For local paths with backslashes on Windows, URI throws an exception
    +        key = env.rpcEnv.fileServer.addJar(new File(uri))
    --- End diff --
If we have backslashes, we are in a local path on Windows.
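The backslash branch discussed above can be reproduced outside Spark. The following is a minimal sketch that uses only `java.io.File` and `java.net.URI`; the object and method names (`JarPathToUri`, `toUri`, `rejectedByUri`) are hypothetical and exist only for this illustration.

```scala
import java.io.File
import java.net.{URI, URISyntaxException}

object JarPathToUri {
  // Mirrors the branch in the diff: a backslash is treated as the sign of a
  // local Windows path, which java.net.URI cannot parse, so the conversion
  // goes through java.io.File instead.
  def toUri(path: String): URI =
    if (path.contains("\\")) new File(path).toURI
    else new URI(path)

  // True when the single-argument URI constructor refuses the string.
  def rejectedByUri(path: String): Boolean =
    try {
      new URI(path)
      false
    } catch {
      case _: URISyntaxException => true
    }
}
```

A path such as `C:\jars\dummy.jar` is rejected by `new URI(...)` because the backslash is not a legal URI character, while `File.toURI` always yields a `file:` URI; paths without backslashes, such as `hdfs://...`, keep their own scheme.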
[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15821 Probably a good thing to look at is the R pieces since that is effectively constrained to InternalRow
[GitHub] spark issue #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/16766 @felixcheung This does not touch any of the coalesce internals. It only allows setting a partitionCoalescer, similar to what is already available in rdd.coalesce.
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99369813
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -497,7 +496,9 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
      * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
      * the 100 new partitions will claim 10 of the current partitions.
      */
    -case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
    +case class CoalesceExec(numPartitions: Int, child: SparkPlan,
    +    partitionCoalescer: Option[PartitionCoalescer]
    +  ) extends UnaryExecNode {
    --- End diff --
Do you guys have a .scalafmt.conf that applies all of this? That should make things cleaner.
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99366754
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -117,6 +134,34 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
           data: _*)
       }
    +  test("coalesce, custom") {
    +
    +    val maxSplitSize = 512
    +    // Similar to the implementation of `test("custom RDD coalescer")` from [[RDDSuite]] we first
    +    // write out to disk, to ensure that our splits are in fact [[FileSplit]] instances.
    +    val data = (1 to 1000).map(i => ClassData(i.toString, i))
    +    data.toDS().repartition(10).write.format("csv").save(path.toString)
    +
    +    val ds = spark.read.format("csv").load(path.toString).as[ClassData]
    --- End diff --
Oh right csv doesn't do headers.
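The test above relies on a size-based coalescer with a `maxSplitSize` limit. As a rough illustration of the idea only, the sketch below greedily groups partitions by size in plain Scala. It is not the `SizeBasedCoalescer` from Spark's test suite and does not use the `PartitionCoalescer` API; `SizeBasedGrouping` and `group` are hypothetical names.

```scala
object SizeBasedGrouping {
  // Greedily packs consecutive partition indices into groups whose total size
  // stays at or below maxSplitSize. A partition larger than the limit ends up
  // in a group of its own.
  def group(sizes: Seq[Long], maxSplitSize: Long): Vector[Vector[Int]] = {
    val groups = Vector.newBuilder[Vector[Int]]
    var current = Vector.empty[Int]
    var currentSize = 0L
    for ((size, index) <- sizes.zipWithIndex) {
      // Close the current group when adding this partition would exceed the limit.
      if (current.nonEmpty && currentSize + size > maxSplitSize) {
        groups += current
        current = Vector.empty[Int]
        currentSize = 0L
      }
      current = current :+ index
      currentSize += size
    }
    if (current.nonEmpty) groups += current
    groups.result()
  }
}
```

Each resulting group would correspond to one coalesced partition; the number of output partitions then follows from the data sizes rather than from a fixed `numPartitions`.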
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99366143
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
    @@ -17,24 +17,41 @@
     package org.apache.spark.sql
    -import java.io.{Externalizable, ObjectInput, ObjectOutput}
    +import java.io.{Externalizable, File, ObjectInput, ObjectOutput}
     import java.sql.{Date, Timestamp}
    +import org.apache.hadoop.mapred.FileSplit
    +import org.scalatest.BeforeAndAfter
    +
    +import org.apache.spark.rdd.{CoalescedRDDPartition, HadoopPartition, SizeBasedCoalescer}
     import org.apache.spark.sql.catalyst.encoders.{OuterScopes, RowEncoder}
     import org.apache.spark.sql.catalyst.util.sideBySide
    -import org.apache.spark.sql.execution.{LogicalRDD, RDDScanExec, SortExec}
    +import org.apache.spark.sql.execution.{LogicalRDD, RDDScanExec}
     import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ShuffleExchange}
     import org.apache.spark.sql.execution.streaming.MemoryStream
     import org.apache.spark.sql.functions._
     import org.apache.spark.sql.test.SharedSQLContext
     import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
     case class TestDataPoint(x: Int, y: Double, s: String, t: TestDataPoint2)
     case class TestDataPoint2(x: Int, s: String)
    -class DatasetSuite extends QueryTest with SharedSQLContext {
    +class DatasetSuite extends QueryTest with SharedSQLContext with BeforeAndAfter {
       import testImplicits._
    +  private var path: File = null
    +
    +  override def beforeAll(): Unit = {
    +    super.beforeAll()
    +    path = Utils.createTempDir()
    +    path.delete()
    +  }
    +
    +  after {
    +    Utils.deleteRecursively(path)
    +  }
    --- End diff --
ah thanks. I looked at the writer tests
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99363149
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     }
     /**
    + * Returns a new RDD that has exactly `numPartitions` partitions.
    + */
    +case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],
    --- End diff --
that sounds good
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/16766#discussion_r99132600
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -823,6 +825,17 @@ case class Repartition(numPartitions: Int, shuffle: Boolean, child: LogicalPlan)
     }
     /**
    + * Returns a new RDD that has exactly `numPartitions` partitions.
    + */
    +case class CoalesceLogical(numPartitions: Int, partitionCoalescer: Option[PartitionCoalescer],
    --- End diff --
The main reason is that there was already a Coalesce expression class.
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 Yeah I'll be there
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 @holdenk Is there anything I can do from my side to help this one along?
[GitHub] spark pull request #16766: [SPARK-19426][SQL] Custom coalesce for Dataset
GitHub user mariusvniekerk opened a pull request: https://github.com/apache/spark/pull/16766
    [SPARK-19426][SQL] Custom coalesce for Dataset
    ## What changes were proposed in this pull request?
    This adds support for using the PartitionCoalescer features added in #11865 (SPARK-14042) to the Dataset API
    ## How was this patch tested?
    Manual tests
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/mariusvniekerk/spark wip_customCoalesce
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/16766.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #16766
commit 15b34dd88f81b20c1be1ef42e6b647d42ef5f462
Author: Marius van Niekerk
Date: 2016-11-07T22:06:38Z
    custom coalesce
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r93730314
    --- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
    @@ -164,6 +164,27 @@ private[spark] object TestUtils {
         createCompiledClass(className, destDir, sourceFile, classpathUrls)
       }
    +  /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */
    +  def createDummyJar(destDir: String, packageName: String, className: String): String = {
    --- End diff --
The R tests do indeed verify that they can call the internal functions. I can revert that part of the changes.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r93729928
    --- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
    @@ -164,6 +164,27 @@ private[spark] object TestUtils {
         createCompiledClass(className, destDir, sourceFile, classpathUrls)
       }
    +  /** Create a dummy compile jar for a given package, classname. Jar will be placed in destDir */
    +  def createDummyJar(destDir: String, packageName: String, className: String): String = {
    --- End diff --
Yeah, when I wrote this that didn't exist yet. Changing.
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 Rebased.
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk closed the pull request at: https://github.com/apache/spark/pull/15666
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
GitHub user mariusvniekerk reopened a pull request: https://github.com/apache/spark/pull/15666 [SPARK-11421] [Core][Python][R] Added ability for addJar to augment the current classloader ## What changes were proposed in this pull request? Adds a flag to sc.addJar to add the jar to the current classloader ## How was this patch tested? Unit tests, manual tests This is a continuation of the pull request in https://github.com/apache/spark/pull/9313 and is mostly a rebase of that moved to master with SparkR additions. cc @holdenk You can merge this pull request into a Git repository by running: $ git pull https://github.com/mariusvniekerk/spark SPARK-11421 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15666.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15666 commit 6fb5d66e7669ebe0e8a515e02b1276e1bab652a2 Author: Marius van Niekerk Date: 2016-10-28T00:26:17Z Squashed content from pull request #9313 commit 6a6e98a0fcc7f388009f36b8a31664bda2ccf5d9 Author: Marius van Niekerk Date: 2016-10-28T00:26:29Z Remove _loadClass method since we dont need it anymore under py4j 0.10 commit 2b1e98e50feb7180b94f7b9e304634566f163718 Author: Marius van Niekerk Date: 2016-10-28T00:26:36Z Expose addJar to sparkR as well commit 7f37d3a060d574bd6c38539ec896fbc4c94060f3 Author: mariusvniekerk Date: 2016-10-29T13:24:40Z Style fixes commit 9d838b35b53b1e4fdcf39721b4f638ead9e40fcd Author: Marius van Niekerk Date: 2016-10-29T20:15:32Z Adjust test suite to test add jar in scala as well. commit d4416d92610affd363701fd08dc53eb720566130 Author: Marius van Niekerk Date: 2016-10-29T21:19:16Z Fixed scala test not working due to incorrect classloader being used. commit fccb141dd9e6d36db242997f1c6f3e007caa514f Author: Marius van Niekerk Date: 2016-10-30T00:15:27Z Fixed typo with test. 
commit 26b39de51f9a76b121ebcb70079072dfcc9972bd Author: Marius van Niekerk Date: 2016-11-01T01:46:07Z Fixed documentation.
[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15821 So this is very cool stuff. Would it be reasonable to add some API pieces so that, on the Python side, things like DataFrame.mapPartitions make use of Apache Arrow to lower the serialization costs? Or is that more of a follow-on piece of work?
[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/15666 @HyukjinKwon there seems to be something weird with the appveyor checks?
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r85865112 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1700,19 +1700,34 @@ class SparkContext(config: SparkConf) extends Logging { * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future. * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported * filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. + * If addToCurrentClassLoader is true, attempt to add the new class to the current threads' class + * loader. In general adding to the current threads' class loader will impact all other + * application threads unless they have explicitly changed their class loader. */ def addJar(path: String) { +addJar(path, false) + } + + def addJar(path: String, addToCurrentClassLoader: Boolean) { if (path == null) { logWarning("null specified as parameter to addJar") } else { var key = "" - if (path.contains("\\")) { + + val uri = if (path.contains("\\")) { // For local paths with backslashes on Windows, URI throws an exception -key = env.rpcEnv.fileServer.addJar(new File(path)) --- End diff -- So this change gets the URI for the Windows path, which is used later on to construct a File instance. That should allow the Windows special case to work.
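A minimal sketch of the Windows special case described in this comment, assuming the only goal is to obtain a `java.net.URI` for any accepted path. The names `JarPathSketch`, `toUri` and `isLocalFile` are hypothetical and are not taken from the PR; only `java.io.File` and `java.net.URI` are real APIs.

```scala
import java.io.File
import java.net.URI

// Hypothetical helper for illustration; not Spark's actual code.
object JarPathSketch {
  // new URI(...) rejects backslashes, so such paths go through File.toURI,
  // which escapes or normalizes them instead of throwing.
  def toUri(path: String): URI =
    if (path.contains("\\")) new File(path).toURI
    else new URI(path)

  // A scheme-less or file: URI refers to a file on the local filesystem.
  def isLocalFile(uri: URI): Boolean =
    uri.getScheme == null || uri.getScheme == "file"
}
```

Once a URI is available for both kinds of input, the later code can build a `File` from it without a second branch for Windows paths.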
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/15666#discussion_r85833766 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -1700,19 +1700,34 @@ class SparkContext(config: SparkConf) extends Logging { * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future. * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported * filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node. + * If addToCurrentClassLoader is true, attempt to add the new class to the current threads' class + * loader. In general adding to the current threads' class loader will impact all other + * application threads unless they have explicitly changed their class loader. */ def addJar(path: String) { +addJar(path, false) + } + + def addJar(path: String, addToCurrentClassLoader: Boolean) { --- End diff -- Keeping it in Scala makes it simpler for other Spark Scala interpreters (e.g. Toree, Zeppelin) to make use of this.
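To make the two-overload shape in this diff concrete, here is a rough, self-contained sketch of what "add the jar to the current thread's class loader" could look like. `AppendableClassLoader` and `AddJarSketch` are hypothetical names, the sketch ignores distribution of the jar to executors, and it assumes the context class loader is one that permits appending URLs.

```scala
import java.io.File
import java.net.{URL, URLClassLoader}

// Hypothetical loader that exposes URLClassLoader's protected addURL.
class AppendableClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {
  override def addURL(url: URL): Unit = super.addURL(url)
}

// Hypothetical stand-in for the two addJar overloads in the diff.
object AddJarSketch {
  // The one-argument form keeps the old behavior: no class loader change.
  def addJar(path: String): Boolean = addJar(path, false)

  // Returns true only if the jar was appended to the context class loader.
  def addJar(path: String, addToCurrentClassLoader: Boolean): Boolean = {
    if (path == null || !addToCurrentClassLoader) {
      false
    } else {
      Thread.currentThread().getContextClassLoader match {
        case loader: AppendableClassLoader =>
          loader.addURL(new File(path).toURI.toURL)
          true
        case _ =>
          false // this loader cannot be extended at runtime
      }
    }
  }
}
```

Because the flag lives on the JVM-side method, any front end that can call into the JVM (a Scala notebook kernel, for example) could reuse it without its own class loader handling.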
[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...
GitHub user mariusvniekerk opened a pull request: https://github.com/apache/spark/pull/15666 [SPARK-11421] [Core][Python][R] Added ability for addJar to augment the current classloader ## What changes were proposed in this pull request? Adds a flag to sc.addJar to add the jar to the current classloader ## How was this patch tested? Unit tests, manual tests This is a continuation of the pull request in https://github.com/apache/spark/pull/9313 and is mostly a rebase of that moved to master with SparkR additions. cc @holdenk You can merge this pull request into a Git repository by running: $ git pull https://github.com/mariusvniekerk/spark SPARK-11421 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15666.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15666 commit 6fb5d66e7669ebe0e8a515e02b1276e1bab652a2 Author: Marius van Niekerk Date: 2016-10-28T00:26:17Z Squashed content from pull request #9313 commit 6a6e98a0fcc7f388009f36b8a31664bda2ccf5d9 Author: Marius van Niekerk Date: 2016-10-28T00:26:29Z Remove _loadClass method since we dont need it anymore under py4j 0.10 commit 2b1e98e50feb7180b94f7b9e304634566f163718 Author: Marius van Niekerk Date: 2016-10-28T00:26:36Z Expose addJar to sparkR as well
[GitHub] spark issue #9313: [SPARK-10658][SPARK-11421][PYSPARK][CORE] Provide add jar...
Github user mariusvniekerk commented on the issue: https://github.com/apache/spark/pull/9313 So since py4j now uses the context classloader, we can remove the Python pieces about loading a class by name. @holdenk If you want I can revisit this PR. This case occurs for me specifically because I have Python modules that bundle their jars with them, and when using spark-submit it is rather tedious to have to manually muck around with the classloader under Python. We can probably also add it to SparkR since I assume they have similar requirements to the PySpark side.
[GitHub] spark pull request: [SPARK-11881][SQL] Fix for postgresql fetchsiz...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/9861#discussion_r45620973 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala --- @@ -489,6 +492,13 @@ private[sql] class JDBCRDD( } try { if (null != conn) { + if (!conn.getAutoCommit && !conn.isClosed) { +try { + conn.commit() +} catch { + case e: Exception => logWarning("Exception committing transaction", e) --- End diff -- Want to do anything special for Throwable vs Exception, or just change it?
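On the Throwable vs Exception question, one common Scala idiom is `scala.util.control.NonFatal`, which matches ordinary exceptions but lets fatal errors such as `OutOfMemoryError` propagate. The sketch below is only an illustration of that idiom under stated assumptions: `JdbcCloseSketch` and `commitQuietly` are hypothetical names, `System.err` stands in for `logWarning`, and the `isClosed` check is placed first here (unlike the diff) so that `getAutoCommit` is not called on a closed connection.

```scala
import java.sql.Connection
import scala.util.control.NonFatal

// Hypothetical cleanup helper; not the code that was merged.
object JdbcCloseSketch {
  // Returns true if a commit was attempted and succeeded.
  def commitQuietly(conn: Connection): Boolean =
    try {
      if (conn != null && !conn.isClosed && !conn.getAutoCommit) {
        conn.commit()
        true
      } else {
        false
      }
    } catch {
      // NonFatal skips VirtualMachineError, InterruptedException and similar.
      case NonFatal(e) =>
        System.err.println(s"Exception committing transaction: $e")
        false
    }
}
```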
[GitHub] spark pull request: [SPARK-11881][SQL] Fix for postgresql fetchsiz...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9861#issuecomment-158701845 Not entirely sure why this causes NPEs in some of the unit tests...
[GitHub] spark pull request: [SPARK-11881][SQL] Fix for postgresql fetchsiz...
GitHub user mariusvniekerk opened a pull request: https://github.com/apache/spark/pull/9861 [SPARK-11881][SQL] Fix for postgresql fetchsize > 0 Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor In order for PostgreSQL to honor a non-zero fetchSize setting, its Connection.autoCommit needs to be set to false. Otherwise, it will just quietly ignore the fetchSize setting. This adds a new side-effecting, dialect-specific beforeFetch method that will fire before a select query is run. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mariusvniekerk/spark SPARK-11881 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9861.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9861 commit 464976670c2857a972b38aa4d32396915d7e0c0a Author: mariusvniekerk Date: 2015-11-20T14:34:31Z [SPARK-11881][SQL] Fix for postgresql fetchsize > 0 Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor In order for PostgreSQL to honor the fetchSize non-zero setting, its Connection.autoCommit needs to be set to false. Otherwise, it will just quietly ignore the fetchSize setting. This adds a new side-effecting dialect specific beforeFetch method that will fire before a select query is ran.
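The hook described in this pull request could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the trait and object names and the `fetchSize: Int` parameter are hypothetical, and only `java.sql.Connection.setAutoCommit` is a real API.

```scala
import java.sql.Connection

// Hypothetical dialect hook; the name follows the PR description only.
trait FetchDialectSketch {
  // Called before a SELECT is issued; the default does nothing.
  def beforeFetch(connection: Connection, fetchSize: Int): Unit = ()
}

object PostgresFetchSketch extends FetchDialectSketch {
  // With autoCommit on, the PostgreSQL driver ignores a non-zero fetch size.
  def needsManualCommit(fetchSize: Int): Boolean = fetchSize > 0

  override def beforeFetch(connection: Connection, fetchSize: Int): Unit =
    if (needsManualCommit(fetchSize)) connection.setAutoCommit(false)
}
```

A caller would invoke `beforeFetch` on the connection before preparing the statement and then set the fetch size on the statement itself.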
[GitHub] spark pull request: [SPARK-10186][SQL] support postgre array type ...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9662#issuecomment-156285098 These test failures don't seem to be related?
[GitHub] spark pull request: [SPARK-10186][SQL] support postgre array type ...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/9662#discussion_r44664911 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -121,6 +145,12 @@ object JdbcUtils extends Logging { case TimestampType => stmt.setTimestamp(i + 1, row.getAs[java.sql.Timestamp](i)) case DateType => stmt.setDate(i + 1, row.getAs[java.sql.Date](i)) case t: DecimalType => stmt.setBigDecimal(i + 1, row.getDecimal(i)) +case ArrayType(et, _) => + assert(jdbcTypes(i).databaseTypeDefinition.endsWith("[]")) --- End diff -- Is that the same in all backends that support arrays (Oracle etc)?
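For context on why the `[]` suffix matters, here is a hedged sketch of binding an array parameter through plain JDBC. It assumes PostgreSQL-style type definitions such as `integer[]`; other backends may name array types differently, which is exactly the concern raised above. `ArrayBindSketch` and its methods are hypothetical; `Connection.createArrayOf` and `PreparedStatement.setArray` are standard JDBC calls.

```scala
import java.sql.{Connection, PreparedStatement}

// Hypothetical helper; assumes "elem[]" style array type definitions.
object ArrayBindSketch {
  // Strips the "[]" suffix to get the element type name the driver expects.
  def elementTypeName(databaseTypeDefinition: String): String = {
    require(databaseTypeDefinition.endsWith("[]"),
      s"Not an array type definition: $databaseTypeDefinition")
    databaseTypeDefinition.stripSuffix("[]").toLowerCase
  }

  // Binds a sequence of values to the 1-based JDBC parameter `index`.
  def bindArray(conn: Connection, stmt: PreparedStatement, index: Int,
      databaseTypeDefinition: String, values: Seq[AnyRef]): Unit = {
    val sqlArray = conn.createArrayOf(elementTypeName(databaseTypeDefinition), values.toArray)
    stmt.setArray(index, sqlArray)
  }
}
```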
[GitHub] spark pull request: [SPARK-10186][SQL] support postgre array type ...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9662#issuecomment-156121867 I've added write support in #9137 as well if you want to just use it from there.
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9137#issuecomment-154415116 @JoshRosen Guess it's refactor time due to SPARK-11541.
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9137#issuecomment-152561158 Is the best approach to rebase or just merge master into this and resolve conflicts?
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9137#issuecomment-152309023 I also need to rebase this thing against master again, it seems.
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/9137#discussion_r43439253 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -121,6 +122,21 @@ object JdbcUtils extends Logging { case TimestampType => stmt.setTimestamp(i + 1, row.getAs[java.sql.Timestamp](i)) case DateType => stmt.setDate(i + 1, row.getAs[java.sql.Date](i)) case t: DecimalType => stmt.setBigDecimal(i + 1, row.getDecimal(i)) --- End diff -- If the particular dialect does not support these types, saveTable should throw an exception when building the nullTypes array.
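The "fail while building nullTypes" behavior can be illustrated with a small sketch. The types below are simplified, hypothetical stand-ins for Spark SQL data types, and `NullTypesSketch` is not Spark code; the point is only that the lookup happens once per schema, before any row is written, so an unsupported type fails early.

```scala
// Hypothetical, simplified stand-ins for Spark SQL data types.
sealed trait SketchType
case object SketchInt extends SketchType
case object SketchString extends SketchType
case object SketchUnsupported extends SketchType

object NullTypesSketch {
  // java.sql.Types code per supported type; None means "unsupported".
  def jdbcTypeCode(dt: SketchType): Option[Int] = dt match {
    case SketchInt => Some(java.sql.Types.INTEGER)
    case SketchString => Some(java.sql.Types.VARCHAR)
    case SketchUnsupported => None
  }

  // Built once per save, before any rows are written, so failure is early.
  def nullTypes(schema: Seq[SketchType]): Array[Int] =
    schema.map { dt =>
      jdbcTypeCode(dt).getOrElse(
        throw new IllegalArgumentException(s"Can't get JDBC type for $dt"))
    }.toArray
}
```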
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/9137#discussion_r43437880 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala --- @@ -171,21 +187,9 @@ object JdbcUtils extends Logging { val name = field.name val typ: String = dialect.getJDBCType(field.dataType).map(_.databaseTypeDefinition).getOrElse( - field.dataType match { --- End diff -- Moved this one so that I could get access to it in PostgresDialect.getJDBCType in order to build representations for array fields.
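A rough sketch of why a shared fallback mapping is useful to a dialect: the array definition can be derived from the element's definition. Everything here is hypothetical (`ColType`, `ArrayDefinitionSketch`, and the `elem[]` naming, which is assumed to follow PostgreSQL conventions); it is not the refactoring made in the PR.

```scala
// Hypothetical, simplified column types.
sealed trait ColType
case object IntCol extends ColType
case object TextCol extends ColType
final case class ArrayCol(element: ColType) extends ColType

object ArrayDefinitionSketch {
  // Stand-in for the shared fallback mapping that was moved out of saveTable.
  def commonDefinition(dt: ColType): Option[String] = dt match {
    case IntCol => Some("INTEGER")
    case TextCol => Some("TEXT")
    case _ => None
  }

  // Stand-in for a dialect's getJDBCType: arrays reuse the element definition.
  def postgresDefinition(dt: ColType): Option[String] = dt match {
    case ArrayCol(element) =>
      postgresDefinition(element).orElse(commonDefinition(element)).map(_ + "[]")
    case _ => None
  }

  // Dialect-specific answer first, shared mapping as the fallback.
  def definition(dt: ColType): Option[String] =
    postgresDefinition(dt).orElse(commonDefinition(dt))
}
```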
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on a diff in the pull request: https://github.com/apache/spark/pull/9137#discussion_r43437555 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala --- @@ -207,6 +225,25 @@ case object PostgresDialect extends JdbcDialect { Some(StringType) } else if (sqlType == Types.OTHER && typeName.equals("jsonb")) { Some(StringType) +} else if (sqlType == Types.OTHER && typeName.equals("uuid")) { +Some(StringType) +} else if (sqlType == Types.ARRAY) { + typeName match { --- End diff -- The underscores are specifically for the array types. Postgres prepends them to all array type names here: https://github.com/pgjdbc/pgjdbc/blob/REL9_4_1204/org/postgresql/jdbc2/TypeInfoCache.java#L159
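The underscore convention can be shown with a tiny lookup. This is a sketch only: `PgArrayNameSketch` is hypothetical, the map covers a small illustrative subset of element type names, and the result strings are placeholders rather than Spark data types.

```scala
// Hypothetical lookup; the element names are an illustrative subset only.
object PgArrayNameSketch {
  private val elementKinds: Map[String, String] = Map(
    "int4" -> "integer",
    "int8" -> "long",
    "float8" -> "double",
    "bool" -> "boolean",
    "text" -> "string")

  // "_int4" names an array of int4; names without the prefix are not arrays.
  def elementKind(arrayTypeName: String): Option[String] =
    if (arrayTypeName.startsWith("_")) elementKinds.get(arrayTypeName.drop(1))
    else None
}
```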
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9137#issuecomment-149907705 I'll add tests once #8101 is merged in
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9137#issuecomment-149857945 Sure. Had to refactor a little to work around type erasure warnings
[GitHub] spark pull request: [SPARK-5753] [SQL] add JDBCRDD support for pos...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/4549#issuecomment-149619987 I've given this a shot in https://github.com/apache/spark/pull/9137
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
Github user mariusvniekerk commented on the pull request: https://github.com/apache/spark/pull/9137#issuecomment-149393173 Still need to add some additional types from https://github.com/pgjdbc/pgjdbc/blob/master/org/postgresql/jdbc2/TypeInfoCache.java#L70
[GitHub] spark pull request: [SPARK-10186][SQL] Array types using JDBCRDD a...
GitHub user mariusvniekerk opened a pull request: https://github.com/apache/spark/pull/9137 [SPARK-10186][SQL] Array types using JDBCRDD and postgres This change allows reading from jdbc array column types for the postgresql dialect. This also opens up some implementation for array types using other jdbc backends. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mariusvniekerk/spark SPARK-10186 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/9137.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #9137 commit cf6a22b9e043671f0b0a70867a9c121f25ca6eca Author: mariusvniekerk Date: 2015-10-15T17:37:32Z [SPARK-10186] [SQL] Add support for array types using JDBCRDD and postgres This change allows reading from jdbc array column types for the postgresql dialect. This also opens up some implementation for array types using other jdbc backends.