[jira] [Created] (SPARK-23460) PySpark concurrency python egg cache directory

2018-02-17 Thread Dmitiry (JIRA)
Dmitiry created SPARK-23460:
---

 Summary: PySpark concurrency python egg cache directory
 Key: SPARK-23460
 URL: https://issues.apache.org/jira/browse/SPARK-23460
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.1.2
 Environment: YARN last
Reporter: Dmitiry


We are experiencing intermittent failures when running tasks on PySpark while 
installing dependencies through --py-files with a Python egg. We set the 
following (otherwise we get a permission-denied error on the egg cache):
{noformat}
--conf "spark.executorEnv.PYTHON_EGG_CACHE=./.python-eggs"{noformat}
 

Error:
{noformat}
INFO - File "build/bdist.linux-x86_64/egg/ua_parser/user_agent_parser.py", line 
409, in 
INFO - File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 904, in 
resource_filename
INFO - self, resource_name
INFO - File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1380, in 
get_resource_filename
INFO - return self._extract_resource(manager, zip_path)
INFO - File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1405, in 
_extract_resource
INFO - self.egg_name, self._parts(zip_path)
INFO - File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 984, in 
get_cache_path
INFO - self.extraction_error()
INFO - File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 950, in 
extraction_error
INFO - raise err
INFO - ExtractionError: Can't extract file(s) to egg cache
INFO - 
INFO - The following error occurred while trying to extract file(s) to the 
Python egg
INFO - cache:
INFO - 
INFO - [Errno 17] File exists: './.python-eggs'
INFO - 
INFO - The Python egg cache directory is currently set to:
INFO - 
INFO - ./.python-eggs/
INFO - 
INFO - Perhaps your account does not have write access to this directory? You 
can
INFO - change the cache directory by setting the PYTHON_EGG_CACHE environment
INFO - variable to point to an accessible directory.{noformat}
 

We build the package with the option `zip_safe=False`, but PySpark still uses 
the egg cache directory.

Is there any way around this?
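
A possible direction (a hedged sketch of mine, not something verified in this thread): the {{[Errno 17] File exists}} failure typically comes from several Python worker processes racing to create the same {{./.python-eggs}} directory when pkg_resources extracts resources from the zipped egg. Giving each worker process its own cache directory before anything is extracted sidesteps the race. A minimal sketch, assuming it runs at the very top of the module shipped via --py-files (i.e. before the egg's resources are first touched):
{code:python}
# Hedged workaround sketch: give every Python worker process a private egg
# cache so that concurrent workers never race on creating ./.python-eggs.
# This must run before pkg_resources extracts anything from the egg.
import os
import tempfile

os.environ["PYTHON_EGG_CACHE"] = tempfile.mkdtemp(prefix="python-eggs-")
{code}
Pre-creating the cache directory on each container before the workers start should avoid the race as well.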






[jira] [Commented] (SPARK-18778) Fix the Scala classpath in the spark-shell

2018-02-17 Thread Shaun Jackman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368442#comment-16368442
 ] 

Shaun Jackman commented on SPARK-18778:
---

I have this issue as well. spark-shell fails with Java 9. Is there a workaround?

{{❯❯❯ spark-shell 'sc.parallelize(1 to 1000).count()'}}
{{Failed to initialize compiler: object java.lang.Object in compiler mirror not found.}}
{{** Note that as of 2.8 scala does not assume use of the java classpath.}}
{{** For the old behavior pass -usejavacp to scala, or if using a Settings}}
{{** object programmatically, settings.usejavacp.value = true.}}

> Fix the Scala classpath in the spark-shell
> --
>
> Key: SPARK-18778
> URL: https://issues.apache.org/jira/browse/SPARK-18778
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: DjvuLee
>Priority: Major
>
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.AssertionError: assertion failed: null
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Comment Edited] (SPARK-18778) Fix the Scala classpath in the spark-shell

2018-02-17 Thread Shaun Jackman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368442#comment-16368442
 ] 

Shaun Jackman edited comment on SPARK-18778 at 2/18/18 5:25 AM:


I have this issue as well. spark-shell fails with Java 9. Is there a workaround?

{{❯❯❯ spark-shell 'sc.parallelize(1 to 1000).count()'}}

{{Failed to initialize compiler: object java.lang.Object in compiler mirror not found.}}
{{** Note that as of 2.8 scala does not assume use of the java classpath.}}
{{** For the old behavior pass -usejavacp to scala, or if using a Settings}}
{{** object programmatically, settings.usejavacp.value = true.}}


was (Author: sjackman):
I have this issue as well. spark-shell fails with Java 9. Is there a work 
around?

{{❯❯❯ spark-shell 'sc.parallelize(1 to 1000).count()'}}{{Failed to initialize 
compiler: object java.lang.Object in compiler mirror not found.}}{{** Note that 
as of 2.8 scala does not assume use of the java classpath.}}{{** For the old 
behavior pass -usejavacp to scala, or if using a Settings}}{{** object 
programmatically, settings.usejavacp.value = true.}}

> Fix the Scala classpath in the spark-shell
> --
>
> Key: SPARK-18778
> URL: https://issues.apache.org/jira/browse/SPARK-18778
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: DjvuLee
>Priority: Major
>
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.AssertionError: assertion failed: null
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Updated] (SPARK-23459) Improve the error message when unknown column is specified in partition columns

2018-02-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23459:

Description: 
{noformat}
  test("save with an unknown partition column") {
withTempDir { dir =>
  val path = dir.getCanonicalPath
Seq(1L -> "a").toDF("i", "j").write
  .format("parquet")
  .partitionBy("unknownColumn")
  .save(path)
}
  }
{noformat}

We got the following error message:
{noformat}
Partition column unknownColumn not found in schema 
StructType(StructField(i,LongType,false), StructField(j,StringType,true));
{noformat}
We should not call toString, but catalogString in the function 
`partitionColumnsSchema` of `PartitioningUtils.scala`




  was:
{noformat}
  test("save with an unknown partition column") {
withTempDir { dir =>
  val path = dir.getCanonicalPath
Seq(1L -> "a").toDF("i", "j").write
  .format("parquet")
  .partitionBy("unknownColumn")
  .save(path)
}
  }
{noformat}

We got the following error message:
Partition column unknownColumn not found in schema 
StructType(StructField(i,LongType,false), StructField(j,StringType,true));

We should not call toString, but catalogString in the function 
`partitionColumnsSchema` of `PartitioningUtils.scala`






> Improve the error message when unknown column is specified in partition 
> columns
> ---
>
> Key: SPARK-23459
> URL: https://issues.apache.org/jira/browse/SPARK-23459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
>   test("save with an unknown partition column") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
> Seq(1L -> "a").toDF("i", "j").write
>   .format("parquet")
>   .partitionBy("unknownColumn")
>   .save(path)
> }
>   }
> {noformat}
> We got the following error message:
> {noformat}
> Partition column unknownColumn not found in schema 
> StructType(StructField(i,LongType,false), StructField(j,StringType,true));
> {noformat}
> We should not call toString, but catalogString in the function 
> `partitionColumnsSchema` of `PartitioningUtils.scala`






[jira] [Created] (SPARK-23459) Improve the error message when unknown column is specified in partition columns

2018-02-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23459:
---

 Summary: Improve the error message when unknown column is 
specified in partition columns
 Key: SPARK-23459
 URL: https://issues.apache.org/jira/browse/SPARK-23459
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li


{noformat}
  test("save with an unknown partition column") {
withTempDir { dir =>
  val path = dir.getCanonicalPath
Seq(1L -> "a").toDF("i", "j").write
  .format("parquet")
  .partitionBy("unknownColumn")
  .save(path)
}
  }
{noformat}

We got the following error message:
Partition column unknownColumn not found in schema 
StructType(StructField(i,LongType,false), StructField(j,StringType,true));

We should not call toString, but catalogString in the function 
`partitionColumnsSchema` of `PartitioningUtils.scala`
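
For illustration, a minimal PySpark sketch of the same repro and of the compact rendering the proposal is about (assumptions of mine, not from the issue: a local SparkSession and a fresh temp path; {{simpleString()}} is used as the Python-side analogue of the Scala {{catalogString}}):
{code:python}
import os
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["i", "j"])

# Today the error message embeds the verbose StructType(StructField(...), ...) form.
path = os.path.join(tempfile.mkdtemp(), "out")
try:
    df.write.format("parquet").partitionBy("unknownColumn").save(path)
except Exception as e:
    print(e)  # Partition column unknownColumn not found in schema StructType(...)

# The compact, catalog-style rendering of the same schema:
print(df.schema.simpleString())  # struct<i:bigint,j:string>
{code}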










[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-02-17 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368351#comment-16368351
 ] 

Felix Cheung commented on SPARK-23435:
--

Working on this. Debugging a problem.

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its methods have changed.
> In order for our tests to keep working, we need to detect the version and 
> call a different method.
> Jenkins is running 1.0.1, though, so we need to check whether this is going to work.






[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-17 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368331#comment-16368331
 ] 

Susan X. Huynh commented on SPARK-23423:


[~skonto] [~igor.berman] I got some info from a coworker:
{noformat}
The agent will generate a terminal update for each task still in a non-terminal 
state when the executor terminates. These are forwarded through the master (as 
are all agent-generated messages for schedulers) and will be delivered 
"reliably" with an acknowledgement needed from the scheduler.
{noformat}
So, to investigate the missing status updates, I would first look in the agent 
logs around the time the executor was killed, and then check if the master 
received the update.

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation enabled and the maximum number 
> of executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource 
> consumption (with some idle times in between); due to dynamic allocation it 
> receives offers and then releases the executors after the current chunk of 
> work is processed.
> At 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit, where 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see the comment at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (in fact I don't see 'is now TASK_KILLED' in 
> the logs), so this number of executors might be wrong.
>  
> I've created a test that "reproduces" this behavior; I'm not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: don't set maxExecutors with dynamic allocation on.
>  
> Please advise.
> Igor
> I'm tagging you since you were the last to touch this piece of code and can 
> probably advise ([~vanzin], [~skonto], [~susanxhuynh])






[jira] [Commented] (SPARK-23458) OrcSuite flaky test

2018-02-17 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368295#comment-16368295
 ] 

Marco Gaido commented on SPARK-23458:
-

cc [~dongjoon]

> OrcSuite flaky test
> ---
>
> Key: SPARK-23458
> URL: https://issues.apache.org/jira/browse/SPARK-23458
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: AMPLab Jenkins
>Reporter: Marco Gaido
>Priority: Major
>
> Sometimes we have UT failures with the following stacktrace:
> {code:java}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01396221801 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.afterEach(OrcQuerySuite.scala:583)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.runTest(OrcQuerySuite.scala:583)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 
> 1 possibly leaked file streams.
>   at 
> org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply$mcV$sp(SharedSparkSession.scala:115)
>   at 
> 

[jira] [Created] (SPARK-23458) OrcSuite flaky test

2018-02-17 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23458:
---

 Summary: OrcSuite flaky test
 Key: SPARK-23458
 URL: https://issues.apache.org/jira/browse/SPARK-23458
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.4.0
 Environment: AMPLab Jenkins
Reporter: Marco Gaido


Sometimes we have UT failures with the following stacktrace:


{code:java}
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
eventually never returned normally. Attempted 15 times over 10.01396221801 
seconds. Last failure message: There are 1 possibly leaked file streams..
at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
at 
org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
at 
org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
at 
org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
at 
org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.afterEach(OrcQuerySuite.scala:583)
at 
org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
at 
org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
at 
org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
at 
org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.runTest(OrcQuerySuite.scala:583)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite$class.run(Suite.scala:1147)
at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 1 
possibly leaked file streams.
at 
org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54)
at 
org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply$mcV$sp(SharedSparkSession.scala:115)
at 
org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply(SharedSparkSession.scala:115)
at 
org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply(SharedSparkSession.scala:115)
at 
org.scalatest.concurrent.Eventually$class.makeAValiantAttempt$1(Eventually.scala:395)
at 

[jira] [Commented] (SPARK-23457) Register task completion listeners first for ParquetFileFormat

2018-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368291#comment-16368291
 ] 

Apache Spark commented on SPARK-23457:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20619

> Register task completion listeners first for ParquetFileFormat
> --
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in 
> ParquetFileFormat") {
> withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
> s"${Int.MaxValue}") {
>   withTempDir { dir =>
> val basePath = dir.getCanonicalPath
> Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
> "first").toString)
> Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
> "second").toString)
> val df = spark.read.parquet(
>   new Path(basePath, "first").toString,
>   new Path(basePath, "second").toString)
> val e = intercept[SparkException] {
>   df.collect()
> }
> assert(e.getCause.isInstanceOf[OutOfMemoryError])
>   }
> }
>   }
> {code}






[jira] [Assigned] (SPARK-23457) Register task completion listeners first for ParquetFileFormat

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23457:


Assignee: (was: Apache Spark)

> Register task completion listeners first for ParquetFileFormat
> --
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in 
> ParquetFileFormat") {
> withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
> s"${Int.MaxValue}") {
>   withTempDir { dir =>
> val basePath = dir.getCanonicalPath
> Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
> "first").toString)
> Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
> "second").toString)
> val df = spark.read.parquet(
>   new Path(basePath, "first").toString,
>   new Path(basePath, "second").toString)
> val e = intercept[SparkException] {
>   df.collect()
> }
> assert(e.getCause.isInstanceOf[OutOfMemoryError])
>   }
> }
>   }
> {code}






[jira] [Assigned] (SPARK-23457) Register task completion listeners first for ParquetFileFormat

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23457:


Assignee: Apache Spark

> Register task completion listeners first for ParquetFileFormat
> --
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in 
> ParquetFileFormat") {
> withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
> s"${Int.MaxValue}") {
>   withTempDir { dir =>
> val basePath = dir.getCanonicalPath
> Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
> "first").toString)
> Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
> "second").toString)
> val df = spark.read.parquet(
>   new Path(basePath, "first").toString,
>   new Path(basePath, "second").toString)
> val e = intercept[SparkException] {
>   df.collect()
> }
> assert(e.getCause.isInstanceOf[OutOfMemoryError])
>   }
> }
>   }
> {code}






[jira] [Updated] (SPARK-23457) Register task completion listeners first for Parquet

2018-02-17 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23457:
--
Description: 
ParquetFileFormat leaks open files in some cases. This issue aims to register 
task completion listener first.

{code}
  test("SPARK-23390 Register task completion listeners first in 
ParquetFileFormat") {
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
s"${Int.MaxValue}") {
  withTempDir { dir =>
val basePath = dir.getCanonicalPath
Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
"first").toString)
Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
"second").toString)
val df = spark.read.parquet(
  new Path(basePath, "first").toString,
  new Path(basePath, "second").toString)
val e = intercept[SparkException] {
  df.collect()
}
assert(e.getCause.isInstanceOf[OutOfMemoryError])
  }
}
  }
{code}

  was:
ParquetFileFormat leaks open files in some cases.

{code}
  test("SPARK-23390 Register task completion listeners first in 
ParquetFileFormat") {
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
s"${Int.MaxValue}") {
  withTempDir { dir =>
val basePath = dir.getCanonicalPath
Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
"first").toString)
Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
"second").toString)
val df = spark.read.parquet(
  new Path(basePath, "first").toString,
  new Path(basePath, "second").toString)
val e = intercept[SparkException] {
  df.collect()
}
assert(e.getCause.isInstanceOf[OutOfMemoryError])
  }
}
  }
{code}


> Register task completion listeners first for Parquet
> 
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in 
> ParquetFileFormat") {
> withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
> s"${Int.MaxValue}") {
>   withTempDir { dir =>
> val basePath = dir.getCanonicalPath
> Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
> "first").toString)
> Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
> "second").toString)
> val df = spark.read.parquet(
>   new Path(basePath, "first").toString,
>   new Path(basePath, "second").toString)
> val e = intercept[SparkException] {
>   df.collect()
> }
> assert(e.getCause.isInstanceOf[OutOfMemoryError])
>   }
> }
>   }
> {code}






[jira] [Updated] (SPARK-23457) Register task completion listeners first for ParquetFileFormat

2018-02-17 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23457:
--
Summary: Register task completion listeners first for ParquetFileFormat  
(was: Register task completion listeners first for Parquet)

> Register task completion listeners first for ParquetFileFormat
> --
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in 
> ParquetFileFormat") {
> withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
> s"${Int.MaxValue}") {
>   withTempDir { dir =>
> val basePath = dir.getCanonicalPath
> Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
> "first").toString)
> Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
> "second").toString)
> val df = spark.read.parquet(
>   new Path(basePath, "first").toString,
>   new Path(basePath, "second").toString)
> val e = intercept[SparkException] {
>   df.collect()
> }
> assert(e.getCause.isInstanceOf[OutOfMemoryError])
>   }
> }
>   }
> {code}






[jira] [Created] (SPARK-23457) Register task completion listeners first for Parquet

2018-02-17 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23457:
-

 Summary: Register task completion listeners first for Parquet
 Key: SPARK-23457
 URL: https://issues.apache.org/jira/browse/SPARK-23457
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


ParquetFileFormat leaks open files in some cases.

{code}
  test("SPARK-23390 Register task completion listeners first in 
ParquetFileFormat") {
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> 
s"${Int.MaxValue}") {
  withTempDir { dir =>
val basePath = dir.getCanonicalPath
Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, 
"first").toString)
Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, 
"second").toString)
val df = spark.read.parquet(
  new Path(basePath, "first").toString,
  new Path(basePath, "second").toString)
val e = intercept[SparkException] {
  df.collect()
}
assert(e.getCause.isInstanceOf[OutOfMemoryError])
  }
}
  }
{code}






[jira] [Assigned] (SPARK-21783) Turn on ORC filter push-down by default

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21783:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option was turned off by default from the beginning, SPARK-2883
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149
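
For context, a small PySpark sketch (my own illustration, not from the issue) of what the option controls today; the proposal is simply to flip its default to {{true}}:
{code:python}
import os
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Currently off by default; this issue proposes enabling it out of the box.
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# Tiny round-trip so the predicate below can be pushed into the ORC reader
# (assumes an ORC-capable Spark build).
path = os.path.join(tempfile.mkdtemp(), "orc_demo")
spark.range(1000).write.orc(path)
spark.read.orc(path).where("id > 990").show()
{code}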






[jira] [Assigned] (SPARK-23456) Turn on `native` ORC implementation by default

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23456:


Assignee: (was: Apache Spark)

> Turn on `native` ORC implementation by default
> --
>
> Key: SPARK-23456
> URL: https://issues.apache.org/jira/browse/SPARK-23456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-21783) Turn on ORC filter push-down by default

2018-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368289#comment-16368289
 ] 

Apache Spark commented on SPARK-21783:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20634

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option was turned off by default from the beginning, SPARK-2883
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149






[jira] [Assigned] (SPARK-21783) Turn on ORC filter push-down by default

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21783:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> Turn on ORC filter push-down by default
> ---
>
> Key: SPARK-21783
> URL: https://issues.apache.org/jira/browse/SPARK-21783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Like Parquet (SPARK-9207), it would be great to turn on the ORC option, too.
> This option was turned off by default from the beginning, SPARK-2883
> - 
> https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149






[jira] [Assigned] (SPARK-23456) Turn on `native` ORC implementation by default

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23456:


Assignee: Apache Spark

> Turn on `native` ORC implementation by default
> --
>
> Key: SPARK-23456
> URL: https://issues.apache.org/jira/browse/SPARK-23456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-23456) Turn on `native` ORC implementation by default

2018-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368288#comment-16368288
 ] 

Apache Spark commented on SPARK-23456:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20634

> Turn on `native` ORC implementation by default
> --
>
> Key: SPARK-23456
> URL: https://issues.apache.org/jira/browse/SPARK-23456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-23456) Turn on `native` ORC implementation by default

2018-02-17 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23456:
-

 Summary: Turn on `native` ORC implementation by default
 Key: SPARK-23456
 URL: https://issues.apache.org/jira/browse/SPARK-23456
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-17 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368261#comment-16368261
 ] 

Susan X. Huynh commented on SPARK-23423:


I'll check if the task updates might be dropped under heavy load. 
[~igor.berman] Normally, you should see the TASK_KILLED updates in the logs, 
something like:

 
{noformat}
15:38:47 INFO TaskSchedulerImpl: Executor 2 on 10.0.1.201 killed by driver.
15:38:47 INFO DAGScheduler: Executor lost: 2 (epoch 0)
15:38:47 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from 
BlockManagerMaster.
15:38:47 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(2, 10.0.1.201, 42805, None)
15:38:47 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
15:38:47 INFO ExecutorAllocationManager: Existing executor 2 has been removed 
(new total is 1)
15:38:48 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now 
TASK_KILLED
15:38:48 INFO BlockManagerMaster: Removal of executor 2 requested
15:38:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove 
non-existent executor 2
{noformat}
 

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation enabled and the maximum number 
> of executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource 
> consumption (with some idle times in between); due to dynamic allocation it 
> receives offers and then releases the executors after the current chunk of 
> work is processed.
> At 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit, where 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see the comment at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (in fact I don't see 'is now TASK_KILLED' in 
> the logs), so this number of executors might be wrong.
>  
> I've created a test that "reproduces" this behavior; I'm not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: don't set maxExecutors with dynamic allocation on.
>  
> Please advise.
> Igor
> I'm tagging you since you were the last to touch this piece of code and can 
> probably advise ([~vanzin], [~skonto], [~susanxhuynh])






[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-17 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368198#comment-16368198
 ] 

Valeriy Avanesov commented on SPARK-23437:
--

[~sethah], thanks for your input.

I believe GPflow implements linear-time GP. However, it is not distributed.

Regarding investigation of user demand: can't we just hold a vote among the 
users? 

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) require only linear complexity. The field 
> continues to attract the interest of researchers – several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques that come with MLlib are 
> restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and investigated 
> in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  






[jira] [Assigned] (SPARK-23455) Default Params in ML should be saved separately

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23455:


Assignee: Apache Spark

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading saved models, we set all the loaded params on the newly created 
> ML model instances as user-supplied params.
> This causes problems: e.g., if we strictly disallow some params to be set 
> at the same time, a default param can fail the param check because it is 
> treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.
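
For context, a minimal PySpark sketch (my own illustration, not from the issue text) of the user-supplied vs. default distinction that gets collapsed on save/load today:
{code:python}
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()

# regParam has a default value but has not been supplied by the user.
print(lr.hasDefault(lr.regParam))    # True
print(lr.isSet(lr.regParam))         # False
print(lr.getOrDefault(lr.regParam))  # 0.0

# After an explicit set it becomes a user-supplied param; per the issue,
# saving and reloading currently makes default-only params look like this one.
lr.setRegParam(0.1)
print(lr.isSet(lr.regParam))         # True
{code}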






[jira] [Assigned] (SPARK-23455) Default Params in ML should be saved separately

2018-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23455:


Assignee: (was: Apache Spark)

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading saved models, we set all the loaded params on the newly created 
> ML model instances as user-supplied params.
> This causes problems: e.g., if we strictly disallow some params to be set 
> at the same time, a default param can fail the param check because it is 
> treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.






[jira] [Commented] (SPARK-23455) Default Params in ML should be saved separately

2018-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368154#comment-16368154
 ] 

Apache Spark commented on SPARK-23455:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20633

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading saved models, we set all the loaded params on the newly created 
> ML model instances as user-supplied params.
> This causes problems: e.g., if we strictly disallow some params to be set 
> at the same time, a default param can fail the param check because it is 
> treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.






[jira] [Commented] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection

2018-02-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368147#comment-16368147
 ] 

Xiao Li commented on SPARK-23368:
-

[~maryannxue] Thank you for your work! Happy New Year!

> Avoid unnecessary Exchange or Sort after projection
> ---
>
> Key: SPARK-23368
> URL: https://issues.apache.org/jira/browse/SPARK-23368
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Minor
>
> After column rename projection, the ProjectExec's outputOrdering and 
> outputPartitioning should reflect the projected columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
> SELECT a a1, b b1
> FROM testData2
> ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate Sort 
> on sorted result, together with this fix, the order-by in the outer query 
> could have been optimized out.
>  
> Similarly, the below query
> {code:java}
> SELECT *
> FROM (
> SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> , so the unnecessary sorting and hash-partitioning that have been optimized 
> out for the second query should have been eliminated in the first query as well.






[jira] [Updated] (SPARK-23340) Upgrade Apache ORC to 1.4.3

2018-02-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23340:

Fix Version/s: (was: 2.3.0)
   2.4.0

> Upgrade Apache ORC to 1.4.3
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th.
> The Apache ORC 1.4.2 release removes unnecessary dependencies, and 1.4.3 has 5 
> more patches including bug fixes (https://s.apache.org/Fll8).
> In particular, the following ORC-285 issue is fixed in 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array<float>]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}


