[jira] [Created] (SPARK-40155) Optionally use a serialized storage level for DataFrame.localCheckpoint()
Paul Staab created SPARK-40155:
--
Summary: Optionally use a serialized storage level for DataFrame.localCheckpoint()
Key: SPARK-40155
URL: https://issues.apache.org/jira/browse/SPARK-40155
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.3.0
Reporter: Paul Staab

In PySpark 3.3.0, `DataFrame.localCheckpoint()` stores the RDD checkpoints using the "Disk Memory *Deserialized* 1x Replicated" storage level. Looking through the Python code and the documentation, I have not found any way to change this.

As serialized RDDs are often much smaller than deserialized ones (I have seen examples where a 40GB deserialized RDD shrank to 200MB when serialized), I would usually like to create local checkpoints that are stored in a serialized rather than a deserialized format.

To make this possible, we could, for example, add an optional `storage_level` argument to `DataFrame.localCheckpoint()` similar to `DataFrame.persist()`, or add a global configuration option analogous to `spark.checkpoint.compress`.
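A purely illustrative sketch of the first option; the `storage_level` argument does not exist in Spark 3.3.0, and its name and placement here are hypothetical:

{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(1000)

# Hypothetical API: today localCheckpoint() accepts only `eager` and always
# checkpoints with the deserialized MEMORY_AND_DISK_DESER level.
# In PySpark 3.3, StorageLevel.MEMORY_AND_DISK is a serialized level,
# so it would be a natural value to pass here.
df = df.localCheckpoint(eager=True, storage_level=StorageLevel.MEMORY_AND_DISK)
{code}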
[jira] [Updated] (SPARK-40154) PySpark: DataFrame.cache docstring gives wrong storage level
[ https://issues.apache.org/jira/browse/SPARK-40154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Staab updated SPARK-40154:
---
Description:

The docstring of the `DataFrame.cache()` method currently states that it uses a serialized storage level

{code}
Persists the :class:`DataFrame` with the default storage level (`MEMORY_AND_DISK`).
[...]
The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
{code}

while `DataFrame.persist()` states that it uses a deserialized storage level

{code}
If no storage level is specified defaults to (`MEMORY_AND_DISK_DESER`)
[...]
The default storage level has changed to `MEMORY_AND_DISK_DESER` to match Scala in 3.0.
{code}

However, in practice both `.cache()` and `.persist()` use deserialized storage levels:

{code:python}
import pyspark
from pyspark import StorageLevel
from pyspark.sql import SparkSession

print(pyspark.__version__)
# 3.3.0

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", "col_b"])
df = df.cache()
df.count()
# Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"

df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", "col_b"])
df = df.persist()
df.count()
# Storage level in Spark UI: "Disk Memory Deserialized 1x Replicated"

df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", "col_b"])
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()
# Storage level in Spark UI: "Disk Memory Serialized 1x Replicated"
{code}
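The effective level can also be checked programmatically instead of in the Spark UI. A small sketch; `DataFrame.storageLevel` reports the level actually used, and the fourth field of `StorageLevel` is the `deserialized` flag:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame(zip(["A"] * 1000, ["B"] * 1000), ["col_a", "col_b"])
df = df.cache()
df.count()

# Prints StorageLevel(True, True, False, True, 1):
# disk=True, memory=True, off-heap=False, deserialized=True, replication=1
print(df.storageLevel)
{code}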
[jira] [Updated] (SPARK-40154) PySpark: DataFrame.cache docstring gives wrong storage level
[ https://issues.apache.org/jira/browse/SPARK-40154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Staab updated SPARK-40154:
---
Priority: Minor (was: Major)
[jira] [Created] (SPARK-40154) PySpark: DataFrame.cache docstring gives wrong storage level
Paul Staab created SPARK-40154:
--
Summary: PySpark: DataFrame.cache docstring gives wrong storage level
Key: SPARK-40154
URL: https://issues.apache.org/jira/browse/SPARK-40154
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.3.0
Reporter: Paul Staab
[jira] [Comment Edited] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516990#comment-16516990 ]

Paul Staab edited comment on SPARK-21063 at 6/19/18 12:01 PM:
--

I was able to find a workaround for this problem on Spark 2.1.0:

1. Create a Hive dialect which uses the correct quotes (backticks) for escaping the column names:

{code:scala}
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}
{code}

This is taken from [https://github.com/apache/spark/pull/19238].

2. Register it before making the call with spark.read.jdbc:

{code:scala}
JdbcDialects.registerDialect(HiveDialect)
{code}

3. Execute spark.read.jdbc with the fetchsize option (shown here in PySpark syntax):

{code:python}
spark.read.jdbc("jdbc:hive2://localhost:1/default", "test1",
                properties={"driver": "org.apache.hive.jdbc.HiveDriver",
                            "fetchsize": "10"}).show()
{code}

It only works when both registering the dialect and setting fetchsize. There was a pull request to add this dialect to Spark by default ([https://github.com/apache/spark/pull/19238]), but unfortunately it was not merged.

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.1.0, 2.1.1
> Reporter: Peter Bykov
> Priority: Major
>
> Spark returns an empty result when querying a remote hadoop cluster.
> All firewall settings removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:scala}
> val spark = SparkSession.builder
>   .appName("RemoteSparkTest")
>   .master("local")
>   .getOrCreate()
>
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like:
> {code:scala}
> df.select("*").show()
> {code}
> return an empty result too.
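For reference, a minimal end-to-end sketch of the same workaround entirely in Scala; the connection URL and table name are the placeholders from the examples above, not a real endpoint:

{code:scala}
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Dialect that quotes identifiers with backticks so that the SQL
// Spark generates is accepted by HiveServer2
object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

val spark = SparkSession.builder
  .appName("HiveJdbcWorkaround")
  .master("local")
  .getOrCreate()

// Must happen before the spark.read.jdbc call
JdbcDialects.registerDialect(HiveDialect)

val props = new Properties()
props.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
props.setProperty("fetchsize", "10")

// Both the registered dialect and a non-zero fetchsize are needed
spark.read.jdbc("jdbc:hive2://localhost:1/default", "test1", props).show()
{code}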