[jira] [Commented] (SPARK-25527) Job stuck waiting for last stage to start
[ https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649990#comment-16649990 ]

Ran Haim commented on SPARK-25527:
----------------------------------

Any update?

> Job stuck waiting for last stage to start
> -----------------------------------------
>
>                 Key: SPARK-25527
>                 URL: https://issues.apache.org/jira/browse/SPARK-25527
[jira] [Commented] (SPARK-25527) Job stuck waiting for last stage to start
[ https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628392#comment-16628392 ]

Ran Haim commented on SPARK-25527:
----------------------------------

Hi, actually, as I said, there is one task (in the last stage) that is waiting to be started. I do not know exactly what to look for in the thread dump; I uploaded it: [^threaddumpjob.txt]

> Job stuck waiting for last stage to start
> -----------------------------------------
>
>                 Key: SPARK-25527
>                 URL: https://issues.apache.org/jira/browse/SPARK-25527
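For reference, a dump like the attached [^threaddumpjob.txt] can also be captured programmatically from inside the driver or an executor; a minimal sketch (the helper object is illustrative, not part of Spark):

{code:java}
import scala.collection.JavaConverters._

// Prints a stack trace for every live JVM thread, similar to what
// `jstack <pid>` or the Spark UI's "Thread Dump" tab produces.
object ThreadDump {
  def dump(): Unit = {
    for ((thread, frames) <- Thread.getAllStackTraces.asScala) {
      println(s""""${thread.getName}" state=${thread.getState}""")
      frames.foreach(frame => println(s"    at $frame"))
    }
  }
}
{code}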
[jira] [Updated] (SPARK-25527) Job stuck waiting for last stage to start
[ https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-25527:
-----------------------------
    Attachment: threaddumpjob.txt

> Job stuck waiting for last stage to start
> -----------------------------------------
>
>                 Key: SPARK-25527
>                 URL: https://issues.apache.org/jira/browse/SPARK-25527
[jira] [Created] (SPARK-25527) Job stuck waiting for last stage to start
Ran Haim created SPARK-25527:
--------------------------------

             Summary: Job stuck waiting for last stage to start
                 Key: SPARK-25527
                 URL: https://issues.apache.org/jira/browse/SPARK-25527
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Ran Haim


Sometimes a job can somehow get stuck waiting for the last stage to start. There are no tasks waiting for completion, and the job just hangs, even though there are executors available for it to run on.

I do not know how to reproduce this; all I know is that it happens randomly after a couple of days of heavy load.

Another thing that might help: it seems to happen when some tasks fail because one or more executors were killed (due to memory issues or something similar). Those tasks eventually do get finished by other executors because of retries, but the next stage hangs.
[jira] [Comment Edited] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248837#comment-16248837 ]

Ran Haim edited comment on SPARK-22481 at 11/12/17 12:31 PM:
-------------------------------------------------------------

Hi,
I created this simple test that will show how slow refreshing a table can be. I create a Parquet table with 10K files just to make it obvious how slow it can get, but a table with a lot of partitions will have the same effect.

{code:java}
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import org.scalatest.{BeforeAndAfter, FunSuite}

@RunWith(classOf[JUnitRunner])
class TestRefreshTable extends FunSuite with BeforeAndAfter {

  // On Windows, point Hadoop at the bundled helper resources.
  {
    if (System.getProperty("os.name").toLowerCase.startsWith("windows")) {
      val hadoopDirURI = ClassLoader.getSystemResource("hadoop").toURI.getPath
      System.setProperty("hadoop.home.dir", hadoopDirURI)
    }
  }

  var fs: FileSystem = _

  before {
    fs = FileSystem.get(new Configuration())
    fs.setWriteChecksum(false)
  }

  private val path = "/tmp/test"

  test("big table refresh test") {
    val config = new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      .set("spark.ui.enabled", "false")

    val sparkSession = SparkSession
      .builder()
      .config(config)
      .enableHiveSupport()
      .getOrCreate()

    try {
      // Create 10K empty files under the table location.
      for (i <- 1 to 10000) {
        val id = UUID.randomUUID().toString
        fs.create(new Path(path, id))
      }

      sparkSession.sql("drop TABLE if exists test")
      sparkSession.sql(
        s"""
           CREATE EXTERNAL TABLE test(`rowId` bigint)
           ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
           STORED AS
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
           LOCATION '$path'
         """)

      val start = System.currentTimeMillis()
      sparkSession.catalog.refreshTable("test")
      val end = System.currentTimeMillis()
      println(s"refresh took ${end - start}")
    } finally {
      sparkSession.close()
    }
  }
}
{code}

was (Author: ran.h...@optimalplus.com):
Hi,
I create this simple test that will show how slow refreshing a table can be. I create a Parquet table with 10K files just to make it obvious how slow it can get, but a table with a lot of partitions will have the same effect.

{code:java}
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import org.scalatest.{BeforeAndAfter, FunSuite}

@RunWith(classOf[JUnitRunner])
class TestRefreshTable extends FunSuite with BeforeAndAfter {

  {
    if (System.getProperty("os.name").toLowerCase.startsWith("windows")) {
      val hadoopDirURI = ClassLoader.getSystemResource("hadoop").toURI.getPath
      System.setProperty("hadoop.home.dir", hadoopDirURI)
    }
  }

  var fs: FileSystem = _

  before {
    fs = FileSystem.get(new Configuration())
    fs.setWriteChecksum(false)
  }

  private val path = "/tmp/test"

  test("big table refresh test") {
    val config = new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      .set("spark.ui.enabled", "false")

    val sparkSession = SparkSession
      .builder()
      .config(config)
      .enableHiveSupport()
      .getOrCreate()

    try {
      for (i <- 1 to 10000) {
        val id = UUID.randomUUID().toString
        fs.create(new Path(path, id))
      }

      sparkSession.sql("drop TABLE if exists test")
      sparkSession.sql(
        s"""
           CREATE EXTERNAL TABLE test(`rowId` bigint)
           ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
           STORED AS
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
           LOCATION '$path'
         """)

      val start = System.currentTimeMillis()
      sparkSession.catalog.refreshTable("test")
      val end = System.currentTimeMillis()
      println(s"refresh took ${end - start}")
    } finally {
      sparkSession.close()
    }
  }
}
{code}

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248837#comment-16248837 ]

Ran Haim commented on SPARK-22481:
----------------------------------

Hi,
I create this simple test that will show how slow refreshing a table can be. I create a Parquet table with 10K files just to make it obvious how slow it can get, but a table with a lot of partitions will have the same effect.

{code:java}
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import org.scalatest.{BeforeAndAfter, FunSuite}

@RunWith(classOf[JUnitRunner])
class TestRefreshTable extends FunSuite with BeforeAndAfter {

  {
    if (System.getProperty("os.name").toLowerCase.startsWith("windows")) {
      val hadoopDirURI = ClassLoader.getSystemResource("hadoop").toURI.getPath
      System.setProperty("hadoop.home.dir", hadoopDirURI)
    }
  }

  var fs: FileSystem = _

  before {
    fs = FileSystem.get(new Configuration())
    fs.setWriteChecksum(false)
  }

  private val path = "/tmp/test"

  test("big table refresh test") {
    val config = new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      .set("spark.ui.enabled", "false")

    val sparkSession = SparkSession
      .builder()
      .config(config)
      .enableHiveSupport()
      .getOrCreate()

    try {
      // Create 10K empty files under the table location.
      for (i <- 1 to 10000) {
        val id = UUID.randomUUID().toString
        fs.create(new Path(path, id))
      }

      sparkSession.sql("drop TABLE if exists test")
      sparkSession.sql(
        s"""
           CREATE EXTERNAL TABLE test(`rowId` bigint)
           ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
           STORED AS
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
           LOCATION '$path'
         """)

      val start = System.currentTimeMillis()
      sparkSession.catalog.refreshTable("test")
      val end = System.currentTimeMillis()
      println(s"refresh took ${end - start}")
    } finally {
      sparkSession.close()
    }
  }
}
{code}

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247913#comment-16247913 ]

Ran Haim commented on SPARK-22481:
----------------------------------

No, it is not.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247858#comment-16247858 ]

Ran Haim commented on SPARK-22481:
----------------------------------

Ok, I'll create a simple app to reproduce this later next week.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247852#comment-16247852 ]

Ran Haim commented on SPARK-22481:
----------------------------------

I will double-check on Sunday, but sparkSession.table in the end does sparkSession.sessionState.executePlan(logicalPlan), and because my tables are Parquet tables, if I am not mistaken, it will go and fetch the list of files from the filesystem.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
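To make the claim above concrete: for a file-based table, resolving the plan behind sparkSession.table is what pays for the file listing. A minimal timing sketch (assumes a live SparkSession named spark, as in spark-shell, and the external Parquet table "test" from the reproduction test quoted above):

{code:java}
// spark.table analyzes the plan; for an external Parquet-backed Hive table
// this resolves the relation over the table location, which involves
// listing the files underneath it - the expensive part being discussed here.
val t0 = System.nanoTime()
val table = spark.table("test")
println(s"plan resolution took ${(System.nanoTime() - t0) / 1e6} ms")
{code}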
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247838#comment-16247838 ]

Ran Haim commented on SPARK-22481:
----------------------------------

I can check it again on Sunday. I don't know why it is surprising, as it has to go and fetch a list of files from the filesystem.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Comment Edited] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247828#comment-16247828 ]

Ran Haim edited comment on SPARK-22481 at 11/10/17 5:47 PM:
------------------------------------------------------------

It is as I wrote above. Refreshing 30 tables takes 1 minute in 2.1.1 and 2 seconds in 2.1.0.
It is unnecessary to create the dataset unless you know it is actually cached; this is why it is so slow now.

was (Author: ran.h...@optimalplus.com):
It is as I wrote above. It takes 1 minute in 2.1.1 and 2 seconds in 2.1.0.
It is unnecessary to create the dataset unless you know it is actually cached; this is why it is so slow now.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247828#comment-16247828 ]

Ran Haim commented on SPARK-22481:
----------------------------------

It is as I wrote above. It takes 1 minute in 2.1.1 and 2 seconds in 2.1.0.
It is unnecessary to create the dataset unless you know it is actually cached; this is why it is so slow now.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246176#comment-16246176 ]

Ran Haim commented on SPARK-22481:
----------------------------------

It takes about 2 seconds to create the dataset. I need to refresh 30 tables, and that takes a minute now, where it used to take 2 seconds.
The old code does the same check cheaply, because isCached actually uses the plan and not the dataset.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
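A minimal sketch of the fix being argued for in this thread (an assumption reconstructed from the pre-2.1.1 code quoted in the issue description, not a committed patch): check the cache against the logical plan first, and only materialize a Dataset when the table is actually cached.

{code:java}
// Sketch only: mirrors the pre-2.1.1 structure quoted in the description.
// Dataset.ofRows is private[sql], so this would live inside CatalogImpl.
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  sessionCatalog.refreshTable(tableIdent)
  val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
  // Cheap check against the plan; no Dataset (and no file listing) unless cached.
  if (sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty) {
    val df = Dataset.ofRows(sparkSession, logicalPlan)
    sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
    sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
  }
}
{code}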
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-22481:
-----------------------------
    Description: 

CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it now *always* creates a dataset, and this is redundant most of the time; we only need the dataset if the table is cached.

before 2.1.1:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after 2.1.1:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

  was:
CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it now *always* creates a dataset, and this is redundant most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-22481:
-----------------------------
    Description: 

CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it now *always* creates a dataset, and this is redundant most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

  was:
CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Created] (SPARK-22481) CatalogImpl.refreshTable is slow
Ran Haim created SPARK-22481:
--------------------------------

             Summary: CatalogImpl.refreshTable is slow
                 Key: SPARK-22481
                 URL: https://issues.apache.org/jira/browse/SPARK-22481
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0, 2.1.2, 2.1.1
            Reporter: Ran Haim
            Priority: Critical


CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      *val df = Dataset.ofRows(sparkSession, logicalPlan)*
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    *val table = sparkSession.table(tableIdent)*
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-22481:
-----------------------------
    Description: 

CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

  was:
CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      *val df = Dataset.ofRows(sparkSession, logicalPlan)*
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    *val table = sparkSession.table(tableIdent)*
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931619#comment-15931619 ]

Ran Haim commented on SPARK-17436:
----------------------------------

Hi,
I think we need to reopen this.
It seems that in org.apache.spark.sql.execution.datasources.FileFormatWriter the sortColumns are the columns used by the bucketing. There is no way of telling the writer how to sort the data when not using bucketing.
I suggest separating the sortBy and the bucketing data in the DataFrameWriter, and allowing sortBy to be used even without bucketing. The sketch after this comment illustrates the limitation.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
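For context, the DataFrameWriter limitation described above looks like this in practice (a sketch with made-up table and column names; in Spark 2.x the second write is rejected because sortBy is only honored together with bucketBy):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sortBy-demo").enableHiveSupport().getOrCreate()
import spark.implicits._

val df = Seq((1L, "2017-03-19", 100L), (1L, "2017-03-19", 50L)).toDF("userId", "dt", "eventTime")

// Works: the sort columns are carried along with the bucketing spec.
df.write.bucketBy(8, "userId").sortBy("eventTime").saveAsTable("events_bucketed")

// Throws AnalysisException ("sortBy must be used together with bucketBy"):
// there is no way to ask for a per-file sort when only partitioning.
df.write.sortBy("eventTime").partitionBy("dt").saveAsTable("events_partitioned")
{code}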
[jira] [Commented] (SPARK-19360) Spark 2.X does not support stored by clause
[ https://issues.apache.org/jira/browse/SPARK-19360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837503#comment-15837503 ]

Ran Haim commented on SPARK-19360:
----------------------------------

Hi, I actually added my own storage handler, and now I cannot use it.

> Spark 2.X does not support stored by clause
> -------------------------------------------
>
>                 Key: SPARK-19360
>                 URL: https://issues.apache.org/jira/browse/SPARK-19360
[jira] [Created] (SPARK-19360) Spark 2.X does not support stored by clause
Ran Haim created SPARK-19360:
--------------------------------

             Summary: Spark 2.X does not support stored by clause
                 Key: SPARK-19360
                 URL: https://issues.apache.org/jira/browse/SPARK-19360
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
            Reporter: Ran Haim


Spark 1.6 and earlier support HiveContext, which supports Hive storage handlers via the "stored by" clause. Spark 2.x, however, does not support "stored by".
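For illustration, DDL of this shape parsed under HiveContext in Spark 1.6 but is rejected by the Spark 2.x parser (the storage handler class name here is hypothetical):

{code:java}
// Spark 1.6 (HiveContext): accepted and routed to the custom storage handler.
// Spark 2.x: fails with a parse error ("Operation not allowed") on STORED BY.
spark.sql(
  """CREATE TABLE events (id BIGINT, payload STRING)
    |STORED BY 'com.example.MyStorageHandler'
  """.stripMargin)
{code}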
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833367#comment-15833367 ]

Ran Haim commented on SPARK-17436:
----------------------------------

Hi,
I have not actually used 2.1 yet, so I cannot be 100% sure.
I can see that in org.apache.spark.sql.execution.datasources.FileFormatWriter a sorter is created whose key includes the inner sort columns:

val sortingExpressions: Seq[Expression] = description.partitionColumns ++ bucketIdExpression ++ sortColumns

The best way to test it is with orcdump/parquet-tools, querying the individual files and making sure they are sorted.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
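An alternative to orcdump/parquet-tools for the check suggested above is to read each output file individually from Spark and assert that the sort column is non-decreasing (a sketch; the output path and the "ts" column are made up):

{code:java}
// Each output file should be internally sorted by the secondary sort column.
val outputFiles = spark.read.parquet("/tmp/out").inputFiles
val allSorted = outputFiles.forall { file =>
  val values = spark.read.parquet(file).select("ts").collect().map(_.getLong(0))
  values.sameElements(values.sorted)
}
println(s"every file internally sorted: $allSorted")
{code}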
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim edited comment on SPARK-17436 at 11/20/16 12:06 PM: - Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it. It seems that in the Spark 2.1 code, the sorting issue is resolved. The sorter does take the inner sort into account in the sort key - but I think it would be faster to just insert the rows into a list in a hash map. In any case, I suggest changing this issue to Minor. was (Author: ran.h...@optimalplus.com): Sure , I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > update > *** > It seems that in spark 2.1 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
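As a rough illustration of that proposal (this is not Spark's actual writer code - the row and writer types are simplified stand-ins), buffering rows per partition key preserves each file's incoming row order without any re-sort:

import scala.collection.mutable

object BufferedPartitionWriter {
  type Row = Seq[Any]

  // Buffer rows under their partition key, then flush each buffer in turn.
  // Each ArrayBuffer keeps its rows in arrival order, so a pre-sorted
  // stream stays sorted inside every output file.
  def writePartitioned(rows: Iterator[(String, Row)])
                      (write: (String, Seq[Row]) => Unit): Unit = {
    val buffers = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Row]]
    rows.foreach { case (key, row) =>
      buffers.getOrElseUpdate(key, mutable.ArrayBuffer.empty[Row]) += row
    }
    buffers.foreach { case (key, buf) => write(key, buf.toSeq) }
  }
}

The obvious trade-off, and presumably why the writer uses UnsafeKVExternalSorter, is that a plain in-memory map cannot spill to disk when the buffered rows do not fit in memory.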
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: update *** It seems that in spark 2.1 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: update *** It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > update > *** > It seems that in spark 2.1 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). 
> I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: *** update It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > *** update > It seems that in spark 2.0 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: update *** It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: *** update It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > update > *** > It seems that in spark 2.0 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). 
> I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Priority: Minor (was: Major) > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim edited comment on SPARK-17436 at 11/20/16 11:30 AM: - Sure , I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. was (Author: ran.h...@optimalplus.com): Sure - Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim edited comment on SPARK-17436 at 11/20/16 11:29 AM: - Sure - Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. was (Author: ran.h...@optimalplus.com): Sure. Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim commented on SPARK-17436: -- Sure. Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679204#comment-15679204 ] Ran Haim edited comment on SPARK-17436 at 11/19/16 12:44 PM: - Hi, When you want to write your data to ORC or Parquet files, even if the dataframe is partitioned correctly, you have to tell the writer how to partition the data. This means that when you want to write your data into a partitioned folder you lose the sorting, and this is unacceptable when considering read performance and on-disk data size. I already changed the code locally, and it works as expected - but I have no permission to create a PR, and I do not know how to get it. was (Author: ran.h...@optimalplus.com): Hi, When you want to write your data to orc files or perquet files, even if the dataframe is partitioned correctly, you have to tell the writer how to partition the data. This means that when you want to write your data partitioned you lose sorting, and this is unacceptable when thinking on read performance and data on disk size. I already changed the code locally, and it works as excpeted - but I have no permissions to create a PR, and I do not know how to get it. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
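For context, a minimal sketch of the write pattern under discussion, assuming hypothetical "pkey" and "ts" columns and illustrative paths - the rows are ordered within each shuffle partition before the write, and the expectation is that every output file keeps that order:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partitioned-sorted-write").getOrCreate()

spark.read.parquet("/data/raw/events")
  .repartition(col("pkey"))             // group each key's rows together
  .sortWithinPartitions("pkey", "ts")   // secondary sort inside each partition
  .write
  .partitionBy("pkey")                  // one directory per key on disk
  .orc("/data/out/events")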
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679204#comment-15679204 ] Ran Haim commented on SPARK-17436: -- Hi, When you want to write your data to orc files or perquet files, even if the dataframe is partitioned correctly, you have to tell the writer how to partition the data. This means that when you want to write your data partitioned you lose sorting, and this is unacceptable when thinking on read performance and data on disk size. I already changed the code locally, and it works as excpeted - but I have no permissions to create a PR, and I do not know how to get it. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim commented on SPARK-17436: -- I have basiaclly cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test This always fails for mecan you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim edited comment on SPARK-17436 at 11/17/16 3:48 PM: I basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 clean install". This always fails for me. Can you point me to someone who can help me? was (Author: ran.h...@optimalplus.com): I have basiaclly cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test This always fails for mecan you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673230#comment-15673230 ] Ran Haim commented on SPARK-17436: -- I am running the build on Linux, so that is not it. It has something to do with spark-version-info.properties missing. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664233#comment-15664233 ] Ran Haim edited comment on SPARK-17436 at 11/14/16 3:45 PM: So... Can you help? Or at least point me to someone who can? BTW On spark-core I get test errors like these: org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite) Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache Run 2: JavaAPISuite.tearDown:92 NullPointer I think it has something to do with this error that I also see: Error while locating file spark-version-info.properties was (Author: ran.h...@optimalplus.com): So... Can you help? Or at least point me to someone who can? BTW On spark-core I get test errors like these: org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite) Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache Run 2: JavaAPISuite.tearDown:92 NullPointer > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664233#comment-15664233 ] Ran Haim commented on SPARK-17436: -- So... Can you help? Or at least point me to someone who can? BTW On spark-core I get test errors like these: org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite) Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache Run 2: JavaAPISuite.tearDown:92 NullPointer > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663358#comment-15663358 ] Ran Haim commented on SPARK-17436: -- Hey, I'll do a pull and try again. How do I gain access to create a pull request? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663345#comment-15663345 ] Ran Haim commented on SPARK-17436: -- Look, I think this is a serious bug - it makes Parquet and ORC files bigger than they have to be, and really inefficient to read. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663343#comment-15663343 ] Ran Haim commented on SPARK-17436: -- but the whole point is that the partitioning here actually happens just before writing to the file itself - so no, you cannot reorder it after that - it is already in the file. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661640#comment-15661640 ] Ran Haim edited comment on SPARK-17436 at 11/13/16 3:37 PM: - Hi, I only got a chance to work on it now. I saw that the whole class tree has changed - I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter. The problem is that I cannot seem to run a mvn clean install... A lot of tests fail (not related to my change; they fail without it too) - and I do want to make sure there are relevant tests (though I did not find any). Any ideas? Also, I cannot create a pull request - I get a 403. Ran was (Author: ran.h...@optimalplus.com): Hi, I only got a chance to work on it now. I saw that the whole class tree got changed - I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter. The problem is I cannot seem to run a mvn clean install...A lot of tests fail (not relevant to my change, and happen without it) - And I do want to make sure there are relevant tests (though I did not find any). Any Ideas? Ran, > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661640#comment-15661640 ] Ran Haim commented on SPARK-17436: -- Hi, I only got a chance to work on it now. I saw that the whole class tree got changed - I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter. The problem is I cannot seem to run a mvn clean install...A lot of tests fail (not relevant to my change, and happen without it) - And I do want to make sure there are relevant tests (though I did not find any). Any Ideas? Ran, > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611483#comment-15611483 ] Ran Haim edited comment on SPARK-17436 at 10/27/16 10:35 AM: - Usually you partition the data and then you order it - this way you preserve the ordering. The problem here occurs in the writer itself; the DataFrame itself is partitioned and ordered correctly. I should have some time to work on it next week or so - can I just create a pull request and put it here? was (Author: ran.h...@optimalplus.com): usually you partition the data, and then you order it - this way you preserve ordering. The problem here occurs in the writer itself, the DataFrame itself is partitioned and ordered correctly. I would have some time to work on it next week or something like that, can I just do a pull request and put it here? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611483#comment-15611483 ] Ran Haim commented on SPARK-17436: -- usually you partition the data, and then you order it - this way you preserve ordering. The problem here occurs in the writer itself, the DataFrame itself is partitioned and ordered correctly. I would have some time to work on it next week or something like that, can I just do a pull request and put it here? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ] Ran Haim edited comment on SPARK-17436 at 10/27/16 10:32 AM: - Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide a good solution for queries. The fix is pretty small, I can work on it myself - how can I do that? was (Author: ran.h...@optimalplus.com): Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide a good solutions for queries. The fix is pretty small, I can work on it myself - how can I do that? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ] Ran Haim edited comment on SPARK-17436 at 10/27/16 10:22 AM: - Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide a good solutions for queries. The fix is pretty small, I can work on it myself - how can I do that? was (Author: ran.h...@optimalplus.com): Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide good solutions for queries. The fix is pretty small, I can work on it myself - how can I do that? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ] Ran Haim commented on SPARK-17436: -- Of course it does: every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide good solutions for queries. The fix is pretty small; I can work on it myself - how can I do that? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partitionBy, the data writer can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method, when too many files are opened (configurable), it > starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows > back from the sorter and writes them to the corresponding files. > The problem is that the sorter sorts the rows by the partition > key, and that can sometimes break the original sort (or secondary sort, if > you will). > I think the best way to fix it is to stop using a sorter and instead put the > rows in a map, with the partition key as the key and an ArrayList as the value, and then > walk through all the keys and write each list in the original order - this will > probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611331#comment-15611331 ] Ran Haim commented on SPARK-17436: -- Anyone? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partitionBy, the data writer can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method, when too many files are opened (configurable), it > starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows > back from the sorter and writes them to the corresponding files. > The problem is that the sorter sorts the rows by the partition > key, and that can sometimes break the original sort (or secondary sort, if > you will). > I think the best way to fix it is to stop using a sorter and instead put the > rows in a map, with the partition key as the key and an ArrayList as the value, and then > walk through all the keys and write each list in the original order - this will > probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471291#comment-15471291 ] Ran Haim commented on SPARK-17436: -- I tried it under 1.6.1, but I did not see anything that fixes it in 2.0. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partitionBy, the data writer can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method, when too many files are opened (configurable), it > starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows > back from the sorter and writes them to the corresponding files. > The problem is that the sorter sorts the rows by the partition > key, and that can sometimes break the original sort (or secondary sort, if > you will). > I think the best way to fix it is to stop using a sorter and instead put the > rows in a map, with the partition key as the key and an ArrayList as the value, and then > walk through all the keys and write each list in the original order - this will > probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17436) dataframe.write sometimes does not keep sorting
Ran Haim created SPARK-17436: Summary: dataframe.write sometimes does not keep sorting Key: SPARK-17436 URL: https://issues.apache.org/jira/browse/SPARK-17436 Project: Spark Issue Type: Bug Affects Versions: 2.0.0, 1.6.2, 1.6.1 Reporter: Ran Haim When using partitionBy, the data writer can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method, when too many files are opened (configurable), it starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows back from the sorter and writes them to the corresponding files. The problem is that the sorter sorts the rows by the partition key, and that can sometimes break the original sort (or secondary sort, if you will). I think the best way to fix it is to stop using a sorter and instead put the rows in a map, with the partition key as the key and an ArrayList as the value, and then walk through all the keys and write each list in the original order - this will probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
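Editor's note: a minimal sketch of the map-based fix proposed in this report, under simplifying assumptions. `Row`, `partitionKey`, and `writeToPartitionFile` here are illustrative stand-ins, not the actual Spark internals; the point is only the order-preserving grouping.

{code:scala}
import scala.collection.mutable

// Sketch of the proposed fix: group rows by partition key while keeping
// their original arrival order, instead of running them through a sorter
// that orders by partition key only (which drops the intra-partition order).
object OrderPreservingGrouping {
  type Row = Seq[Any]

  // Assumption for illustration: the first column is the partition column.
  def partitionKey(row: Row): String = row.head.toString

  def writeRowsGrouped(rows: Iterator[Row])
                      (writeToPartitionFile: (String, Seq[Row]) => Unit): Unit = {
    // LinkedHashMap keeps partition keys in first-seen order; the buffer
    // per key keeps that partition's rows in arrival order.
    val buffered = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Row]]
    rows.foreach { row =>
      buffered.getOrElseUpdate(partitionKey(row), mutable.ArrayBuffer.empty[Row]) += row
    }
    // Walk the partitions and write each one's rows in their original order.
    buffered.foreach { case (key, partRows) => writeToPartitionFile(key, partRows.toSeq) }
  }
}
{code}

One caveat with this sketch: unlike UnsafeKVExternalSorter, a plain in-memory map cannot spill to disk, which is presumably why a sorter is used once too many files are open in the first place.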
[jira] [Comment Edited] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354851#comment-15354851 ] Ran Haim edited comment on SPARK-15731 at 6/29/16 9:14 AM: --- I have tested it under 1.6.1. When writing the data to the local file system the directory permissions are OK, but when writing the data to HDFS the problem occurs. I don't see a point in testing it under 2.x. Thanks, Ran. was (Author: ran.h...@optimalplus.com): I have tested it under 1.6.1, and writing the data under the local file system the permissions are OK, but when writing the data under HDFS the problem occurs. I don't see a point on testing it under 2.x. thanks. Ran. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354851#comment-15354851 ] Ran Haim commented on SPARK-15731: -- I have tested it under 1.6.1; when writing the data to the local file system the permissions are OK, but when writing the data to HDFS the problem occurs. I don't see a point in testing it under 2.x. Thanks, Ran. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351026#comment-15351026 ] Ran Haim commented on SPARK-15731: -- Oh, OK - I will try to do this on 1.6 locally and then on 2.x locally. Thanks. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350996#comment-15350996 ] Ran Haim commented on SPARK-15731: -- Yes, I will try it locally. But this might be the problem: I am saving the ORC files on HDFS. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351003#comment-15351003 ] Ran Haim commented on SPARK-15731: -- Wait a minute - you are not addressing any issues on 1.6.1, which is the latest stable release? > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350983#comment-15350983 ] Ran Haim commented on SPARK-15731: -- You did not answer any of my questions; the answers would help me build a scenario for reproducing this. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim reopened SPARK-15731: -- Hi, MapR upgraded Spark to 1.6.1 and I still see the behavior described. Did you test it on standalone Spark or under YARN? Did you write the data to HDFS or to the Linux file system? > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315887#comment-15315887 ] Ran Haim commented on SPARK-15731: -- Fine, you can close it - if it works. I will use a workaround. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
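Editor's note: the workaround mentioned above is not spelled out in the thread. One plausible shape for it, an assumption rather than the reporter's actual code (the root path and the 775 permission are illustrative), is to re-apply directory permissions with the Hadoop FileSystem API after the write completes:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Hedged workaround sketch: walk the output directory tree and re-apply
// permissions that include the execute bit, since the ORC partition
// directories reportedly come out without it.
object OrcPermissionFix {
  def fixPartitionDirPermissions(root: String, conf: Configuration): Unit = {
    val fs = FileSystem.get(conf)
    val perm = new FsPermission("775") // rwxrwxr-x, what a 002 umask should yield
    def walk(p: Path): Unit = {
      fs.setPermission(p, perm)
      fs.listStatus(p).filter(_.isDirectory).foreach(s => walk(s.getPath))
    }
    walk(new Path(root))
  }
}

// Usage, e.g. right after the write from the report:
// OrcPermissionFix.fixPartitionDirPermissions("/path", sc.hadoopConfiguration)
{code}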
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315884#comment-15315884 ] Ran Haim commented on SPARK-15731: -- I am using the MapR distribution and they provide Spark 1.5.1 :/ - so not really. I did specify the version, but you removed it a few days ago :) > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315782#comment-15315782 ] Ran Haim commented on SPARK-15731: -- Hi Kevin. It is pretty much the same; the only difference I see is that I created a dataframe with a given schema using createDataFrame(Rdd, StructType). I am using Spark 1.5.1 - what version did you use? > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15731) orc writer directory permissions
Ran Haim created SPARK-15731: Summary: orc writer directory permissions Key: SPARK-15731 URL: https://issues.apache.org/jira/browse/SPARK-15731 Project: Spark Issue Type: Bug Reporter: Ran Haim When saving ORC files with partitions, the partition directories created do not have the x (execute) permission (even though umask is 002), so no other users can get inside those directories to read the ORC files. When writing Parquet files there is no such issue. Code example: dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307775#comment-15307775 ] Ran Haim commented on SPARK-10872: -- Is there a workaround for this? I need it for unit tests. > Derby error (XSDB6) when creating new HiveContext after restarting > SparkContext > --- > > Key: SPARK-10872 > URL: https://issues.apache.org/jira/browse/SPARK-10872 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Dmytro Bielievtsov > > Starting from Spark 1.4.0 (works well on 1.3.1), the following code fails > with "XSDB6: Another instance of Derby may have already booted the database > ~/metastore_db": > {code:python} > from pyspark import SparkContext > from pyspark.sql import HiveContext > sc = SparkContext("local[*]", "app1") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() > sc.stop() > sc = SparkContext("local[*]", "app2") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() # Py4J error > {code} > This is related to [#SPARK-9539], and I intend to restart the Spark context > several times for isolated jobs to prevent cache cluttering and GC errors. > Here's a larger part of the full error trace: > {noformat} > Failed to start database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. > org.datanucleus.exceptions.NucleusDataStoreException: Failed to start > database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) > at java.security.AccessController.doPrivileged(Native Method) > at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) > at > javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) > at 
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) > at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) > at > org.a
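Editor's note: the workaround question above is left open in this thread. A pattern commonly used in test suites, offered here as an assumption rather than a fix confirmed in this issue, is to boot one SparkContext/HiveContext for the whole suite, so the Derby-backed metastore starts exactly once, and to reset state between tests instead of restarting the context:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hedged sketch (assumption, not from this thread): share one context
// across the suite so Derby only boots once; never call sc.stop() between
// tests, and clear cached state instead.
object SharedTestContexts {
  lazy val sc: SparkContext =
    new SparkContext(new SparkConf().setMaster("local[*]").setAppName("tests"))
  lazy val sql: HiveContext = new HiveContext(sc)

  def resetBetweenTests(): Unit = {
    sql.clearCache() // drops cached tables/plans without restarting the context
  }
}
{code}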
[jira] [Comment Edited] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284698#comment-15284698 ] Ran Haim edited comment on SPARK-15348 at 5/16/16 3:09 PM: --- This means that if I have a transactional table in Hive, I cannot use a Spark job to update it or even read it in a coherent way. was (Author: ran.h...@optimalplus.com): If I have a transnational table in hive, I cannot use spark job to update it or even read it in a coherent way. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > It also seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284698#comment-15284698 ] Ran Haim commented on SPARK-15348: -- If I have a transactional table in Hive, I cannot use a Spark job to update it or even read it in a coherent way. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > It also seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15349) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim closed SPARK-15349. Resolution: Duplicate > Hive ACID > - > > Key: SPARK-15349 > URL: https://issues.apache.org/jira/browse/SPARK-15349 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > It also seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15349) Hive ACID
Ran Haim created SPARK-15349: Summary: Hive ACID Key: SPARK-15349 URL: https://issues.apache.org/jira/browse/SPARK-15349 Project: Spark Issue Type: New Feature Reporter: Ran Haim Spark does not support any feature of Hive's transactional tables: you cannot use Spark to delete/update a table, and it also has problems reading the aggregated data when no compaction was done. It also seems that compaction is not supported - alter table ... partition COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15348) Hive ACID
Ran Haim created SPARK-15348: Summary: Hive ACID Key: SPARK-15348 URL: https://issues.apache.org/jira/browse/SPARK-15348 Project: Spark Issue Type: New Feature Reporter: Ran Haim Spark does not support any feature of Hive's transactional tables: you cannot use Spark to delete/update a table, and it also has problems reading the aggregated data when no compaction was done. It also seems that compaction is not supported - alter table ... partition COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
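Editor's note: to make the feature request concrete, these are the kinds of statements the issue says cannot be run through Spark on the affected versions. The table and partition names are hypothetical, and the calls are shown as expected failures, not working code:

{code:scala}
import org.apache.spark.sql.hive.HiveContext

// Illustration of the unsupported Hive ACID operations named in the issue.
// On the affected versions these are expected to fail rather than succeed.
object HiveAcidGap {
  def examples(sql: HiveContext): Unit = {
    // Compaction of a transactional table's partition:
    sql.sql("ALTER TABLE events PARTITION (dt = '2016-05-16') COMPACT 'major'")
    // Row-level delete/update on a transactional table:
    sql.sql("DELETE FROM events WHERE event_id = 42")
  }
}
{code}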