[jira] [Commented] (SPARK-25527) Job stuck waiting for last stage to start
[ https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649990#comment-16649990 ]

Ran Haim commented on SPARK-25527:
----------------------------------

Any update?

> Job stuck waiting for last stage to start
> -----------------------------------------
>
>                 Key: SPARK-25527
>                 URL: https://issues.apache.org/jira/browse/SPARK-25527
[jira] [Commented] (SPARK-25527) Job stuck waiting for last stage to start
[ https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628392#comment-16628392 ]

Ran Haim commented on SPARK-25527:
----------------------------------

Hi, actually, as I said, there is one task (in the last stage) that is waiting to be started. I do not know exactly what to look for in the thread dump; I uploaded it: [^threaddumpjob.txt]

> Job stuck waiting for last stage to start
> -----------------------------------------
>
>                 Key: SPARK-25527
>                 URL: https://issues.apache.org/jira/browse/SPARK-25527
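For reference, a dump like the attached [^threaddumpjob.txt] can also be captured programmatically from inside the driver or an executor; a minimal sketch (the helper object is illustrative, not part of Spark):

{code:java}
import scala.collection.JavaConverters._

// Prints a stack trace for every live JVM thread, similar to what
// `jstack <pid>` or the Spark UI's "Thread Dump" tab produces.
object ThreadDump {
  def dump(): Unit = {
    for ((thread, frames) <- Thread.getAllStackTraces.asScala) {
      println(s""""${thread.getName}" state=${thread.getState}""")
      frames.foreach(frame => println(s"    at $frame"))
    }
  }
}
{code}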
[jira] [Updated] (SPARK-25527) Job stuck waiting for last stage to start
[ https://issues.apache.org/jira/browse/SPARK-25527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-25527:
-----------------------------
    Attachment: threaddumpjob.txt

> Job stuck waiting for last stage to start
> -----------------------------------------
>
>                 Key: SPARK-25527
>                 URL: https://issues.apache.org/jira/browse/SPARK-25527
[jira] [Created] (SPARK-25527) Job stuck waiting for last stage to start
Ran Haim created SPARK-25527:
--------------------------------

             Summary: Job stuck waiting for last stage to start
                 Key: SPARK-25527
                 URL: https://issues.apache.org/jira/browse/SPARK-25527
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Ran Haim


Sometimes a job can somehow get stuck waiting for the last stage to start. There are no tasks waiting for completion, and the job just hangs, even though there are executors available for it to run on.

I do not know how to reproduce this; all I know is that it happens randomly after a couple of days of heavy load.

Another thing that might help: it seems to happen when some tasks fail because one or more executors were killed (due to memory issues or something similar). Those tasks eventually do get finished by other executors because of retries, but the next stage hangs.
[jira] [Comment Edited] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248837#comment-16248837 ]

Ran Haim edited comment on SPARK-22481 at 11/12/17 12:31 PM:
-------------------------------------------------------------

Hi,
I created this simple test that will show how slow refreshing a table can be. I create a Parquet table with 10K files just to make it obvious how slow it can get, but a table with a lot of partitions will have the same effect.

{code:java}
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import org.scalatest.{BeforeAndAfter, FunSuite}

@RunWith(classOf[JUnitRunner])
class TestRefreshTable extends FunSuite with BeforeAndAfter {

  // On Windows, point Hadoop at the bundled helper resources.
  {
    if (System.getProperty("os.name").toLowerCase.startsWith("windows")) {
      val hadoopDirURI = ClassLoader.getSystemResource("hadoop").toURI.getPath
      System.setProperty("hadoop.home.dir", hadoopDirURI)
    }
  }

  var fs: FileSystem = _

  before {
    fs = FileSystem.get(new Configuration())
    fs.setWriteChecksum(false)
  }

  private val path = "/tmp/test"

  test("big table refresh test") {
    val config = new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      .set("spark.ui.enabled", "false")

    val sparkSession = SparkSession
      .builder()
      .config(config)
      .enableHiveSupport()
      .getOrCreate()

    try {
      // Create 10K empty files under the table location.
      for (i <- 1 to 10000) {
        val id = UUID.randomUUID().toString
        fs.create(new Path(path, id))
      }

      sparkSession.sql("drop TABLE if exists test")
      sparkSession.sql(
        s"""
           CREATE EXTERNAL TABLE test(`rowId` bigint)
           ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
           STORED AS
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
           LOCATION '$path'
         """)

      val start = System.currentTimeMillis()
      sparkSession.catalog.refreshTable("test")
      val end = System.currentTimeMillis()
      println(s"refresh took ${end - start}")
    } finally {
      sparkSession.close()
    }
  }
}
{code}

was (Author: ran.h...@optimalplus.com):
Hi,
I create this simple test that will show how slow refreshing a table can be. I create a Parquet table with 10K files just to make it obvious how slow it can get, but a table with a lot of partitions will have the same effect.

{code:java}
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import org.scalatest.{BeforeAndAfter, FunSuite}

@RunWith(classOf[JUnitRunner])
class TestRefreshTable extends FunSuite with BeforeAndAfter {

  {
    if (System.getProperty("os.name").toLowerCase.startsWith("windows")) {
      val hadoopDirURI = ClassLoader.getSystemResource("hadoop").toURI.getPath
      System.setProperty("hadoop.home.dir", hadoopDirURI)
    }
  }

  var fs: FileSystem = _

  before {
    fs = FileSystem.get(new Configuration())
    fs.setWriteChecksum(false)
  }

  private val path = "/tmp/test"

  test("big table refresh test") {
    val config = new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      .set("spark.ui.enabled", "false")

    val sparkSession = SparkSession
      .builder()
      .config(config)
      .enableHiveSupport()
      .getOrCreate()

    try {
      for (i <- 1 to 10000) {
        val id = UUID.randomUUID().toString
        fs.create(new Path(path, id))
      }

      sparkSession.sql("drop TABLE if exists test")
      sparkSession.sql(
        s"""
           CREATE EXTERNAL TABLE test(`rowId` bigint)
           ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
           STORED AS
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
           LOCATION '$path'
         """)

      val start = System.currentTimeMillis()
      sparkSession.catalog.refreshTable("test")
      val end = System.currentTimeMillis()
      println(s"refresh took ${end - start}")
    } finally {
      sparkSession.close()
    }
  }
}
{code}

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248837#comment-16248837 ]

Ran Haim commented on SPARK-22481:
----------------------------------

Hi,
I create this simple test that will show how slow refreshing a table can be. I create a Parquet table with 10K files just to make it obvious how slow it can get, but a table with a lot of partitions will have the same effect.

{code:java}
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner
import org.scalatest.{BeforeAndAfter, FunSuite}

@RunWith(classOf[JUnitRunner])
class TestRefreshTable extends FunSuite with BeforeAndAfter {

  {
    if (System.getProperty("os.name").toLowerCase.startsWith("windows")) {
      val hadoopDirURI = ClassLoader.getSystemResource("hadoop").toURI.getPath
      System.setProperty("hadoop.home.dir", hadoopDirURI)
    }
  }

  var fs: FileSystem = _

  before {
    fs = FileSystem.get(new Configuration())
    fs.setWriteChecksum(false)
  }

  private val path = "/tmp/test"

  test("big table refresh test") {
    val config = new SparkConf()
      .setMaster("local[*]")
      .setAppName("test")
      .set("spark.ui.enabled", "false")

    val sparkSession = SparkSession
      .builder()
      .config(config)
      .enableHiveSupport()
      .getOrCreate()

    try {
      // Create 10K empty files under the table location.
      for (i <- 1 to 10000) {
        val id = UUID.randomUUID().toString
        fs.create(new Path(path, id))
      }

      sparkSession.sql("drop TABLE if exists test")
      sparkSession.sql(
        s"""
           CREATE EXTERNAL TABLE test(`rowId` bigint)
           ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
           STORED AS
             INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
             OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
           LOCATION '$path'
         """)

      val start = System.currentTimeMillis()
      sparkSession.catalog.refreshTable("test")
      val end = System.currentTimeMillis()
      println(s"refresh took ${end - start}")
    } finally {
      sparkSession.close()
    }
  }
}
{code}

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247913#comment-16247913 ]

Ran Haim commented on SPARK-22481:
----------------------------------

No, it is not.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247858#comment-16247858 ]

Ran Haim commented on SPARK-22481:
----------------------------------

Ok, I'll create a simple app to reproduce this later next week.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247852#comment-16247852 ]

Ran Haim commented on SPARK-22481:
----------------------------------

I will double-check on Sunday, but sparkSession.table in the end does sparkSession.sessionState.executePlan(logicalPlan), and because my tables are Parquet tables, if I am not mistaken, it will go and fetch the list of files from the filesystem.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
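To make the claim above concrete: for a file-based table, resolving the plan behind sparkSession.table is what pays for the file listing. A minimal timing sketch (assumes a live SparkSession named spark, as in spark-shell, and the external Parquet table "test" from the reproduction test quoted above):

{code:java}
// spark.table analyzes the plan; for an external Parquet-backed Hive table
// this resolves the relation over the table location, which involves
// listing the files underneath it - the expensive part being discussed here.
val t0 = System.nanoTime()
val table = spark.table("test")
println(s"plan resolution took ${(System.nanoTime() - t0) / 1e6} ms")
{code}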
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247838#comment-16247838 ]

Ran Haim commented on SPARK-22481:
----------------------------------

I can check it again on Sunday. I don't know why it is surprising, as it has to go and fetch a list of files from the filesystem.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Comment Edited] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247828#comment-16247828 ]

Ran Haim edited comment on SPARK-22481 at 11/10/17 5:47 PM:
------------------------------------------------------------

It is as I wrote above. Refreshing 30 tables takes 1 minute in 2.1.1 and 2 seconds in 2.1.0.
It is unnecessary to create the dataset unless you know it is actually cached; this is why it is so slow now.

was (Author: ran.h...@optimalplus.com):
It is as I wrote above. It takes 1 minute in 2.1.1 and 2 seconds in 2.1.0.
It is unnecessary to create the dataset unless you know it is actually cached; this is why it is so slow now.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247828#comment-16247828 ]

Ran Haim commented on SPARK-22481:
----------------------------------

It is as I wrote above. It takes 1 minute in 2.1.1 and 2 seconds in 2.1.0.
It is unnecessary to create the dataset unless you know it is actually cached; this is why it is so slow now.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246176#comment-16246176 ]

Ran Haim commented on SPARK-22481:
----------------------------------

It takes about 2 seconds to create the dataset. I need to refresh 30 tables, and that takes a minute now, where it used to take 2 seconds.
The old code does the same check cheaply, because isCached actually uses the plan and not the dataset.

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
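A minimal sketch of the fix being argued for in this thread (an assumption reconstructed from the pre-2.1.1 code quoted in the issue description, not a committed patch): check the cache against the logical plan first, and only materialize a Dataset when the table is actually cached.

{code:java}
// Sketch only: mirrors the pre-2.1.1 structure quoted in the description.
// Dataset.ofRows is private[sql], so this would live inside CatalogImpl.
override def refreshTable(tableName: String): Unit = {
  val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
  sessionCatalog.refreshTable(tableIdent)
  val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
  // Cheap check against the plan; no Dataset (and no file listing) unless cached.
  if (sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty) {
    val df = Dataset.ofRows(sparkSession, logicalPlan)
    sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
    sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
  }
}
{code}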
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-22481:
-----------------------------
    Description: 

CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it now *always* creates a dataset, and this is redundant most of the time; we only need the dataset if the table is cached.

before 2.1.1:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after 2.1.1:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

  was:
CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it now *always* creates a dataset, and this is redundant most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-22481:
-----------------------------
    Description: 

CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it now *always* creates a dataset, and this is redundant most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

  was:
CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Created] (SPARK-22481) CatalogImpl.refreshTable is slow
Ran Haim created SPARK-22481:
--------------------------------

             Summary: CatalogImpl.refreshTable is slow
                 Key: SPARK-22481
                 URL: https://issues.apache.org/jira/browse/SPARK-22481
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0, 2.1.2, 2.1.1
            Reporter: Ran Haim
            Priority: Critical


CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      *val df = Dataset.ofRows(sparkSession, logicalPlan)*
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    *val table = sparkSession.table(tableIdent)*
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }
[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow
[ https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ran Haim updated SPARK-22481:
-----------------------------
    Description: 

CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      {color:red}val df = Dataset.ofRows(sparkSession, logicalPlan){color}
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    {color:red}val table = sparkSession.table(tableIdent){color}
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

  was:
CatalogImpl.refreshTable was updated in 2.1.1, and since then it has become really slow.
The cause of the issue is that it is now *always* create a dataset, and this is redundent most of the time; we only need the dataset if the table is cached.

code before the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    val logicalPlan = sparkSession.sessionState.catalog.lookupRelation(tableIdent)
    // Use lookupCachedData directly since RefreshTable also takes databaseName.
    val isCached = sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
    if (isCached) {
      // Create a data frame to represent the table.
      // TODO: Use uncacheTable once it supports database name.
      *val df = Dataset.ofRows(sparkSession, logicalPlan)*
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(df, Some(tableIdent.table))
    }
  }

after the change:

  override def refreshTable(tableName: String): Unit = {
    val tableIdent = sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
    // Temp tables: refresh (or invalidate) any metadata/data cached in the plan recursively.
    // Non-temp tables: refresh the metadata cache.
    sessionCatalog.refreshTable(tableIdent)

    // If this table is cached as an InMemoryRelation, drop the original
    // cached version and make the new version cached lazily.
    *val table = sparkSession.table(tableIdent)*
    if (isCached(table)) {
      // Uncache the logicalPlan.
      sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = true)
      // Cache it again.
      sparkSession.sharedState.cacheManager.cacheQuery(table, Some(tableIdent.table))
    }
  }

> CatalogImpl.refreshTable is slow
> --------------------------------
>
>                 Key: SPARK-22481
>                 URL: https://issues.apache.org/jira/browse/SPARK-22481
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931619#comment-15931619 ]

Ran Haim commented on SPARK-17436:
----------------------------------

Hi,
I think we need to reopen this.
It seems that in org.apache.spark.sql.execution.datasources.FileFormatWriter the sortColumns are the columns used by the bucketing. There is no way of telling the writer how to sort the data when not using bucketing.
I suggest separating the sortBy and the bucketing data in the DataFrameWriter, and allowing sortBy to be used even without bucketing. The sketch after this comment illustrates the limitation.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
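For context, the DataFrameWriter limitation described above looks like this in practice (a sketch with made-up table and column names; in Spark 2.x the second write is rejected because sortBy is only honored together with bucketBy):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sortBy-demo").enableHiveSupport().getOrCreate()
import spark.implicits._

val df = Seq((1L, "2017-03-19", 100L), (1L, "2017-03-19", 50L)).toDF("userId", "dt", "eventTime")

// Works: the sort columns are carried along with the bucketing spec.
df.write.bucketBy(8, "userId").sortBy("eventTime").saveAsTable("events_bucketed")

// Throws AnalysisException ("sortBy must be used together with bucketBy"):
// there is no way to ask for a per-file sort when only partitioning.
df.write.sortBy("eventTime").partitionBy("dt").saveAsTable("events_partitioned")
{code}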
[jira] [Commented] (SPARK-19360) Spark 2.X does not support stored by clause
[ https://issues.apache.org/jira/browse/SPARK-19360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837503#comment-15837503 ]

Ran Haim commented on SPARK-19360:
----------------------------------

Hi, I actually added my own storage handler, and now I cannot use it.

> Spark 2.X does not support stored by clause
> -------------------------------------------
>
>                 Key: SPARK-19360
>                 URL: https://issues.apache.org/jira/browse/SPARK-19360
[jira] [Created] (SPARK-19360) Spark 2.X does not support stored by clause
Ran Haim created SPARK-19360:
--------------------------------

             Summary: Spark 2.X does not support stored by clause
                 Key: SPARK-19360
                 URL: https://issues.apache.org/jira/browse/SPARK-19360
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0, 2.0.2, 2.0.1, 2.0.0
            Reporter: Ran Haim


Spark 1.6 and earlier support HiveContext, which supports Hive storage handlers via the "stored by" clause. Spark 2.x, however, does not support "stored by".
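For illustration, DDL of this shape parsed under HiveContext in Spark 1.6 but is rejected by the Spark 2.x parser (the storage handler class name here is hypothetical):

{code:java}
// Spark 1.6 (HiveContext): accepted and routed to the custom storage handler.
// Spark 2.x: fails with a parse error ("Operation not allowed") on STORED BY.
spark.sql(
  """CREATE TABLE events (id BIGINT, payload STRING)
    |STORED BY 'com.example.MyStorageHandler'
  """.stripMargin)
{code}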
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833367#comment-15833367 ]

Ran Haim commented on SPARK-17436:
----------------------------------

Hi,
I have not actually used 2.1 yet, so I cannot be 100% sure.
I can see that in org.apache.spark.sql.execution.datasources.FileFormatWriter a sorter is created whose key includes the inner sort columns:

val sortingExpressions: Seq[Expression] = description.partitionColumns ++ bucketIdExpression ++ sortColumns

The best way to test it is with orcdump/parquet-tools, querying the individual files and making sure they are sorted.

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
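An alternative to orcdump/parquet-tools for the check suggested above is to read each output file individually from Spark and assert that the sort column is non-decreasing (a sketch; the output path and the "ts" column are made up):

{code:java}
// Each output file should be internally sorted by the secondary sort column.
val outputFiles = spark.read.parquet("/tmp/out").inputFiles
val allSorted = outputFiles.forall { file =>
  val values = spark.read.parquet(file).select("ts").collect().map(_.getLong(0))
  values.sameElements(values.sorted)
}
println(s"every file internally sorted: $allSorted")
{code}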
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim edited comment on SPARK-17436 at 11/20/16 12:06 PM: - Sure, I propose to stop using UnsafeKVExternalSorter and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is basically it. It seems that in the Spark 2.1 code, the sorting issue is resolved. The sorter does take the inner sort into account in the sort key - but I think it would be faster to just insert the rows into a list in a hash map. In any case, I suggest changing this issue to Minor. was (Author: ran.h...@optimalplus.com): Sure , I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > update > *** > It seems that in spark 2.1 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
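As a rough illustration of that proposal (this is not Spark's actual writer code - the row and writer types are simplified stand-ins), buffering rows per partition key preserves each file's incoming row order without any re-sort:

import scala.collection.mutable

object BufferedPartitionWriter {
  type Row = Seq[Any]

  // Buffer rows under their partition key, then flush each buffer in turn.
  // Each ArrayBuffer keeps its rows in arrival order, so a pre-sorted
  // stream stays sorted inside every output file.
  def writePartitioned(rows: Iterator[(String, Row)])
                      (write: (String, Seq[Row]) => Unit): Unit = {
    val buffers = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Row]]
    rows.foreach { case (key, row) =>
      buffers.getOrElseUpdate(key, mutable.ArrayBuffer.empty[Row]) += row
    }
    buffers.foreach { case (key, buf) => write(key, buf.toSeq) }
  }
}

The obvious trade-off, and presumably why the writer uses UnsafeKVExternalSorter, is that a plain in-memory map cannot spill to disk when the buffered rows do not fit in memory.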
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: update *** It seems that in spark 2.1 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: update *** It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > update > *** > It seems that in spark 2.1 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). 
> I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: *** update It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > *** update > It seems that in spark 2.0 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: update *** It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: *** update It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. *** When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > update > *** > It seems that in spark 2.0 code, the sorting issue is resolved. > The sorter does consider inner sorting in the sorting key - but I think it > will be faster to just insert the rows to a list in a hash map. > *** > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). 
> I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Priority: Minor (was: Major) > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim updated SPARK-17436: - Description: When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. was: When using partition by, datawriter can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method when too many files are opened (configurable), it starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows again from the sorter and writes them to the corresponding files. The problem is that the sorter actually sorts the rows using the partition key, and that can sometimes mess up the original sort (or secondary sort if you will). I think the best way to fix it is to stop using a sorter, and just put the rows in a map using key as partition key and value as an arraylist, and then just walk through all the keys and write it in the original order - this will probably be faster as there no need for ordering. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim >Priority: Minor > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim edited comment on SPARK-17436 at 11/20/16 11:30 AM: - Sure , I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. was (Author: ran.h...@optimalplus.com): Sure - Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim edited comment on SPARK-17436 at 11/20/16 11:29 AM: - Sure - Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. It seems that in spark 2.0 code, the sorting issue is resolved. The sorter does consider inner sorting in the sorting key - but I think it will be faster to just insert the rows to a list in a hash map. In any case I suggest to change this issue to minor. was (Author: ran.h...@optimalplus.com): Sure. Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680994#comment-15680994 ] Ran Haim commented on SPARK-17436: -- Sure. Basically I propose to stop using UnsafeKVExternalSorter, and just use a HashMap[String, ArrayBuffer[UnsafeRow]] - that is it basically. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679204#comment-15679204 ] Ran Haim edited comment on SPARK-17436 at 11/19/16 12:44 PM: - Hi, When you want to write your data to ORC or Parquet files, even if the dataframe is partitioned correctly, you have to tell the writer how to partition the data. This means that when you want to write your data into a partitioned folder you lose the sorting, and this is unacceptable when considering read performance and on-disk data size. I already changed the code locally, and it works as expected - but I have no permission to create a PR, and I do not know how to get it. was (Author: ran.h...@optimalplus.com): Hi, When you want to write your data to orc files or perquet files, even if the dataframe is partitioned correctly, you have to tell the writer how to partition the data. This means that when you want to write your data partitioned you lose sorting, and this is unacceptable when thinking on read performance and data on disk size. I already changed the code locally, and it works as excpeted - but I have no permissions to create a PR, and I do not know how to get it. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
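For context, a minimal sketch of the write pattern under discussion, assuming hypothetical "pkey" and "ts" columns and illustrative paths - the rows are ordered within each shuffle partition before the write, and the expectation is that every output file keeps that order:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("partitioned-sorted-write").getOrCreate()

spark.read.parquet("/data/raw/events")
  .repartition(col("pkey"))             // group each key's rows together
  .sortWithinPartitions("pkey", "ts")   // secondary sort inside each partition
  .write
  .partitionBy("pkey")                  // one directory per key on disk
  .orc("/data/out/events")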
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679204#comment-15679204 ] Ran Haim commented on SPARK-17436: -- Hi, When you want to write your data to orc files or perquet files, even if the dataframe is partitioned correctly, you have to tell the writer how to partition the data. This means that when you want to write your data partitioned you lose sorting, and this is unacceptable when thinking on read performance and data on disk size. I already changed the code locally, and it works as excpeted - but I have no permissions to create a PR, and I do not know how to get it. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim commented on SPARK-17436: -- I have basiaclly cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test This always fails for mecan you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim edited comment on SPARK-17436 at 11/17/16 3:48 PM: I basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 clean install". This always fails for me. Can you point me to someone who can help me? was (Author: ran.h...@optimalplus.com): I have basiaclly cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test This always fails for mecan you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673230#comment-15673230 ] Ran Haim commented on SPARK-17436: -- I am running the build on Linux, so that is not it. It has something to do with spark-version-info.properties missing. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664233#comment-15664233 ] Ran Haim edited comment on SPARK-17436 at 11/14/16 3:45 PM: So... Can you help? Or at least point me to someone who can? BTW On spark-core I get test errors like these: org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite) Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache Run 2: JavaAPISuite.tearDown:92 NullPointer I think it has something to do with this error that I also see: Error while locating file spark-version-info.properties was (Author: ran.h...@optimalplus.com): So... Can you help? Or at least point me to someone who can? BTW On spark-core I get test errors like these: org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite) Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache Run 2: JavaAPISuite.tearDown:92 NullPointer > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664233#comment-15664233 ] Ran Haim commented on SPARK-17436: -- So... Can you help? Or at least point me to someone who can? BTW On spark-core I get test errors like these: org.apache.spark.JavaAPISuite.map(org.apache.spark.JavaAPISuite) Run 1: JavaAPISuite.setUp:85 » NoClassDefFound Could not initialize class org.apache Run 2: JavaAPISuite.tearDown:92 NullPointer > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663358#comment-15663358 ] Ran Haim commented on SPARK-17436: -- Hey, I'll do a pull and try again. How do I gain access to create a pull request? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663345#comment-15663345 ] Ran Haim commented on SPARK-17436: -- Look, I think this is a serious bug - it makes Parquet and ORC files bigger than they have to be, and really inefficient to read. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663343#comment-15663343 ] Ran Haim commented on SPARK-17436: -- but the whole point is that the partitioning here actually happens just before writing to the file itself - so no, you cannot reorder it after that - it is already in the file. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661640#comment-15661640 ] Ran Haim edited comment on SPARK-17436 at 11/13/16 3:37 PM: - Hi, I only got a chance to work on it now. I saw that the whole class tree has changed - I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter. The problem is that I cannot seem to run a mvn clean install... A lot of tests fail (not related to my change; they fail without it too) - and I do want to make sure there are relevant tests (though I did not find any). Any ideas? Also, I cannot create a pull request - I get a 403. Ran was (Author: ran.h...@optimalplus.com): Hi, I only got a chance to work on it now. I saw that the whole class tree got changed - I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter. The problem is I cannot seem to run a mvn clean install...A lot of tests fail (not relevant to my change, and happen without it) - And I do want to make sure there are relevant tests (though I did not find any). Any Ideas? Ran, > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15661640#comment-15661640 ] Ran Haim commented on SPARK-17436: -- Hi, I only got a chance to work on it now. I saw that the whole class tree got changed - I changed the code in org.apache.spark.sql.execution.datasources.FileFormatWriter. The problem is I cannot seem to run a mvn clean install...A lot of tests fail (not relevant to my change, and happen without it) - And I do want to make sure there are relevant tests (though I did not find any). Any Ideas? Ran, > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611483#comment-15611483 ] Ran Haim edited comment on SPARK-17436 at 10/27/16 10:35 AM: - Usually you partition the data and then you order it - this way you preserve the ordering. The problem here occurs in the writer itself; the DataFrame itself is partitioned and ordered correctly. I should have some time to work on it next week or so - can I just create a pull request and put it here? was (Author: ran.h...@optimalplus.com): usually you partition the data, and then you order it - this way you preserve ordering. The problem here occurs in the writer itself, the DataFrame itself is partitioned and ordered correctly. I would have some time to work on it next week or something like that, can I just do a pull request and put it here? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611483#comment-15611483 ] Ran Haim commented on SPARK-17436: -- usually you partition the data, and then you order it - this way you preserve ordering. The problem here occurs in the writer itself, the DataFrame itself is partitioned and ordered correctly. I would have some time to work on it next week or something like that, can I just do a pull request and put it here? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ] Ran Haim edited comment on SPARK-17436 at 10/27/16 10:32 AM: - Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide a good solution for queries. The fix is pretty small, I can work on it myself - how can I do that? was (Author: ran.h...@optimalplus.com): Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide a good solutions for queries. The fix is pretty small, I can work on it myself - how can I do that? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ] Ran Haim edited comment on SPARK-17436 at 10/27/16 10:22 AM: - Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide a good solutions for queries. The fix is pretty small, I can work on it myself - how can I do that? was (Author: ran.h...@optimalplus.com): Of course it does, every technology that supports partitioning supports ordering in the files themselves Otherwise you just don't provide good solutions for queries. The fix is pretty small, I can work on it myself - how can I do that? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611457#comment-15611457 ] Ran Haim commented on SPARK-17436: -- Of course it does: every technology that supports partitioning supports ordering within the files themselves. Otherwise you just don't provide good solutions for queries. The fix is pretty small; I can work on it myself - how can I do that? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partitionBy, the data writer can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method, when too many files are opened (configurable), it > starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows > back from the sorter and writes them to the corresponding files. > The problem is that the sorter sorts the rows by the partition > key, and that can sometimes break the original sort (or secondary sort, if > you will). > I think the best way to fix it is to stop using a sorter and instead put the > rows in a map, with the partition key as the key and an ArrayList as the value, and then > walk through all the keys and write each list in the original order - this will > probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611331#comment-15611331 ] Ran Haim commented on SPARK-17436: -- Anyone? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partitionBy, the data writer can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method, when too many files are opened (configurable), it > starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows > back from the sorter and writes them to the corresponding files. > The problem is that the sorter sorts the rows by the partition > key, and that can sometimes break the original sort (or secondary sort, if > you will). > I think the best way to fix it is to stop using a sorter and instead put the > rows in a map, with the partition key as the key and an ArrayList as the value, and then > walk through all the keys and write each list in the original order - this will > probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471291#comment-15471291 ] Ran Haim commented on SPARK-17436: -- I tried it under 1.6.1, but I did not see anything that fixes it in 2.0. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partitionBy, the data writer can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method, when too many files are opened (configurable), it > starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows > back from the sorter and writes them to the corresponding files. > The problem is that the sorter sorts the rows by the partition > key, and that can sometimes break the original sort (or secondary sort, if > you will). > I think the best way to fix it is to stop using a sorter and instead put the > rows in a map, with the partition key as the key and an ArrayList as the value, and then > walk through all the keys and write each list in the original order - this will > probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17436) dataframe.write sometimes does not keep sorting
Ran Haim created SPARK-17436: Summary: dataframe.write sometimes does not keep sorting Key: SPARK-17436 URL: https://issues.apache.org/jira/browse/SPARK-17436 Project: Spark Issue Type: Bug Affects Versions: 2.0.0, 1.6.2, 1.6.1 Reporter: Ran Haim When using partitionBy, the data writer can sometimes mess up an ordered dataframe. The problem originates in org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. In the writeRows method, when too many files are opened (configurable), it starts inserting rows into UnsafeKVExternalSorter, then it reads all the rows back from the sorter and writes them to the corresponding files. The problem is that the sorter sorts the rows by the partition key, and that can sometimes break the original sort (or secondary sort, if you will). I think the best way to fix it is to stop using a sorter and instead put the rows in a map, with the partition key as the key and an ArrayList as the value, and then walk through all the keys and write each list in the original order - this will probably be faster, as there is no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
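Editor's note: a minimal sketch of the map-based fix proposed in this report, under simplifying assumptions. `Row`, `partitionKey`, and `writeToPartitionFile` here are illustrative stand-ins, not the actual Spark internals; the point is only the order-preserving grouping.

{code:scala}
import scala.collection.mutable

// Sketch of the proposed fix: group rows by partition key while keeping
// their original arrival order, instead of running them through a sorter
// that orders by partition key only (which drops the intra-partition order).
object OrderPreservingGrouping {
  type Row = Seq[Any]

  // Assumption for illustration: the first column is the partition column.
  def partitionKey(row: Row): String = row.head.toString

  def writeRowsGrouped(rows: Iterator[Row])
                      (writeToPartitionFile: (String, Seq[Row]) => Unit): Unit = {
    // LinkedHashMap keeps partition keys in first-seen order; the buffer
    // per key keeps that partition's rows in arrival order.
    val buffered = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Row]]
    rows.foreach { row =>
      buffered.getOrElseUpdate(partitionKey(row), mutable.ArrayBuffer.empty[Row]) += row
    }
    // Walk the partitions and write each one's rows in their original order.
    buffered.foreach { case (key, partRows) => writeToPartitionFile(key, partRows.toSeq) }
  }
}
{code}

One caveat with this sketch: unlike UnsafeKVExternalSorter, a plain in-memory map cannot spill to disk, which is presumably why a sorter is used once too many files are open in the first place.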
[jira] [Comment Edited] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354851#comment-15354851 ] Ran Haim edited comment on SPARK-15731 at 6/29/16 9:14 AM: --- I have tested it under 1.6.1. When writing the data to the local file system the directory permissions are OK, but when writing the data to HDFS the problem occurs. I don't see a point in testing it under 2.x. Thanks, Ran. was (Author: ran.h...@optimalplus.com): I have tested it under 1.6.1, and writing the data under the local file system the permissions are OK, but when writing the data under HDFS the problem occurs. I don't see a point on testing it under 2.x. thanks. Ran. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354851#comment-15354851 ] Ran Haim commented on SPARK-15731: -- I have tested it under 1.6.1; when writing the data to the local file system the permissions are OK, but when writing the data to HDFS the problem occurs. I don't see a point in testing it under 2.x. Thanks, Ran. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351026#comment-15351026 ] Ran Haim commented on SPARK-15731: -- Oh, OK - I will try to do this on 1.6 locally and then on 2.x locally. Thanks. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350996#comment-15350996 ] Ran Haim commented on SPARK-15731: -- Yes, I will try it locally. But this might be the problem: I am saving the ORC files on HDFS. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351003#comment-15351003 ] Ran Haim commented on SPARK-15731: -- Wait a minute - you are not addressing any issues on 1.6.1, which is the latest stable release? > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350983#comment-15350983 ] Ran Haim commented on SPARK-15731: -- You did not answer any of my questions; the answers would help me build a scenario for reproducing this. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim reopened SPARK-15731: -- Hi, MapR upgraded Spark to 1.6.1 and I still see the behavior described. Did you test it on standalone Spark or under YARN? Did you write the data to HDFS or to the Linux file system? > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315887#comment-15315887 ] Ran Haim commented on SPARK-15731: -- Fine, you can close it - if it works. I will use a workaround. > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
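Editor's note: the workaround mentioned above is not spelled out in the thread. One plausible shape for it, an assumption rather than the reporter's actual code (the root path and the 775 permission are illustrative), is to re-apply directory permissions with the Hadoop FileSystem API after the write completes:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Hedged workaround sketch: walk the output directory tree and re-apply
// permissions that include the execute bit, since the ORC partition
// directories reportedly come out without it.
object OrcPermissionFix {
  def fixPartitionDirPermissions(root: String, conf: Configuration): Unit = {
    val fs = FileSystem.get(conf)
    val perm = new FsPermission("775") // rwxrwxr-x, what a 002 umask should yield
    def walk(p: Path): Unit = {
      fs.setPermission(p, perm)
      fs.listStatus(p).filter(_.isDirectory).foreach(s => walk(s.getPath))
    }
    walk(new Path(root))
  }
}

// Usage, e.g. right after the write from the report:
// OrcPermissionFix.fixPartitionDirPermissions("/path", sc.hadoopConfiguration)
{code}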
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315884#comment-15315884 ] Ran Haim commented on SPARK-15731: -- I am using the MapR distribution and they provide Spark 1.5.1 :/ - so not really. I did specify the version, but you removed it a few days ago :) > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15731) orc writer directory permissions
[ https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315782#comment-15315782 ] Ran Haim commented on SPARK-15731: -- Hi Kevin. It is pretty much the same; the only difference I see is that I created a dataframe with a given schema using createDataFrame(Rdd, StructType). I am using Spark 1.5.1 - what version did you use? > orc writer directory permissions > > > Key: SPARK-15731 > URL: https://issues.apache.org/jira/browse/SPARK-15731 > Project: Spark > Issue Type: Bug >Reporter: Ran Haim > > When saving ORC files with partitions, the partition directories created do > not have the x (execute) permission (even though umask is 002), so no other users can get > inside those directories to read the ORC files. > When writing Parquet files there is no such issue. > Code example: > dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15731) orc writer directory permissions
Ran Haim created SPARK-15731: Summary: orc writer directory permissions Key: SPARK-15731 URL: https://issues.apache.org/jira/browse/SPARK-15731 Project: Spark Issue Type: Bug Reporter: Ran Haim When saving ORC files with partitions, the partition directories created do not have the x (execute) permission (even though umask is 002), so no other users can get inside those directories to read the ORC files. When writing Parquet files there is no such issue. Code example: dataframe.write.format("orc").mode("append").partitionBy("date").save("/path") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307775#comment-15307775 ] Ran Haim commented on SPARK-10872: -- Is there a workaround for this? I need it for unit tests. > Derby error (XSDB6) when creating new HiveContext after restarting > SparkContext > --- > > Key: SPARK-10872 > URL: https://issues.apache.org/jira/browse/SPARK-10872 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Dmytro Bielievtsov > > Starting from Spark 1.4.0 (works well on 1.3.1), the following code fails > with "XSDB6: Another instance of Derby may have already booted the database > ~/metastore_db": > {code:python} > from pyspark import SparkContext > from pyspark.sql import HiveContext > sc = SparkContext("local[*]", "app1") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() > sc.stop() > sc = SparkContext("local[*]", "app2") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() # Py4J error > {code} > This is related to [#SPARK-9539], and I intend to restart the Spark context > several times for isolated jobs to prevent cache cluttering and GC errors. > Here's a larger part of the full error trace: > {noformat} > Failed to start database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. > org.datanucleus.exceptions.NucleusDataStoreException: Failed to start > database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) > at java.security.AccessController.doPrivileged(Native Method) > at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) > at > javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) > at 
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) > at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571) > at > org.a
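Editor's note: the workaround question above is left open in this thread. A pattern commonly used in test suites, offered here as an assumption rather than a fix confirmed in this issue, is to boot one SparkContext/HiveContext for the whole suite, so the Derby-backed metastore starts exactly once, and to reset state between tests instead of restarting the context:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hedged sketch (assumption, not from this thread): share one context
// across the suite so Derby only boots once; never call sc.stop() between
// tests, and clear cached state instead.
object SharedTestContexts {
  lazy val sc: SparkContext =
    new SparkContext(new SparkConf().setMaster("local[*]").setAppName("tests"))
  lazy val sql: HiveContext = new HiveContext(sc)

  def resetBetweenTests(): Unit = {
    sql.clearCache() // drops cached tables/plans without restarting the context
  }
}
{code}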
[jira] [Comment Edited] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284698#comment-15284698 ] Ran Haim edited comment on SPARK-15348 at 5/16/16 3:09 PM: --- This means that if I have a transactional table in Hive, I cannot use a Spark job to update it or even read it in a coherent way. was (Author: ran.h...@optimalplus.com): If I have a transnational table in hive, I cannot use spark job to update it or even read it in a coherent way. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > It also seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284698#comment-15284698 ] Ran Haim commented on SPARK-15348: -- If I have a transactional table in Hive, I cannot use a Spark job to update it or even read it in a coherent way. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > It also seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15349) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim closed SPARK-15349. Resolution: Duplicate > Hive ACID > - > > Key: SPARK-15349 > URL: https://issues.apache.org/jira/browse/SPARK-15349 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > It also seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15349) Hive ACID
Ran Haim created SPARK-15349: Summary: Hive ACID Key: SPARK-15349 URL: https://issues.apache.org/jira/browse/SPARK-15349 Project: Spark Issue Type: New Feature Reporter: Ran Haim Spark does not support any feature of Hive's transactional tables: you cannot use Spark to delete/update a table, and it also has problems reading the aggregated data when no compaction was done. It also seems that compaction is not supported - alter table ... partition COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15348) Hive ACID
Ran Haim created SPARK-15348: Summary: Hive ACID Key: SPARK-15348 URL: https://issues.apache.org/jira/browse/SPARK-15348 Project: Spark Issue Type: New Feature Reporter: Ran Haim Spark does not support any feature of Hive's transactional tables: you cannot use Spark to delete/update a table, and it also has problems reading the aggregated data when no compaction was done. It also seems that compaction is not supported - alter table ... partition COMPACT 'major' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
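Editor's note: to make the feature request concrete, these are the kinds of statements the issue says cannot be run through Spark on the affected versions. The table and partition names are hypothetical, and the calls are shown as expected failures, not working code:

{code:scala}
import org.apache.spark.sql.hive.HiveContext

// Illustration of the unsupported Hive ACID operations named in the issue.
// On the affected versions these are expected to fail rather than succeed.
object HiveAcidGap {
  def examples(sql: HiveContext): Unit = {
    // Compaction of a transactional table's partition:
    sql.sql("ALTER TABLE events PARTITION (dt = '2016-05-16') COMPACT 'major'")
    // Row-level delete/update on a transactional table:
    sql.sql("DELETE FROM events WHERE event_id = 42")
  }
}
{code}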