[jira] [Commented] (SPARK-2633) support register spark listener to listener bus with Java API
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071402#comment-14071402 ] Chengxiang Li commented on SPARK-2633: -- For Hive job status monitoring, the Spark listener reports fewer job states than expected:
current:
* monitored events: onJobStart, onJobEnd(JobSucceeded/JobFailed)
* available states: Started/Succeeded/Failed
expected:
* monitored events: +onJobSubmitted, onJobStart, onJobEnd(JobSucceeded/JobFailed/+JobKilled)
* available states: Submitted/Started/Succeeded/Failed/Killed
support register spark listener to listener bus with Java API - Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Currently a user can only register a Spark listener with the Scala API; we should add this capability to the Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
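For reference, a minimal Scala sketch of how a listener is registered on the listener bus through the existing Scala API (the listener body and the job it observes are illustrative only); the request is to expose an equivalent entry point to Java callers:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

object ListenerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("listener-demo"))
    // Register a listener on the listener bus; this is the Scala-only API the issue wants mirrored in Java.
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"job ${jobStart.jobId} started")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
    })
    sc.parallelize(1 to 100).count()
    sc.stop()
  }
}
{code}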
[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-2630: -- Summary: Input data size of CoalescedRDD is incorrect (was: Input data size overflows when the size is larger than 4G in one task) Input data size of CoalescedRDD is incorrect Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Priority: Critical Attachments: overflow.tiff Given one big file, such as text.4.3G, and coalesce it into one task: sc.textFile("text.4.3G").coalesce(1).count(). In the Spark Web UI you will see that the input size is reported as 5.4M. -- This message was sent by Atlassian JIRA (v6.2#6252)
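One plausible explanation (an assumption, not confirmed in the ticket) is a 32-bit counter for the task's input bytes, which wraps once the single coalesced task reads more than 4 GB:
{code}
// Assumption for illustration only: the per-task bytes-read metric is held in an Int.
val bytesRead: Long = 4616L * 1024 * 1024   // ~4.5 GB read by the one coalesced task
val wrapped: Int = bytesRead.toInt          // wraps modulo 2^32
println(wrapped)                            // 545259520 (~520 MB) instead of ~4.5 GB
{code}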
[jira] [Created] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
woshilaiceshide created SPARK-2640: -- Summary: In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071448#comment-14071448 ] Apache Spark commented on SPARK-2640: - User 'woshilaiceshide' has created a pull request for this issue: https://github.com/apache/spark/pull/1544 In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1516) Yarn Client should not call System.exit, should throw exception instead.
[ https://issues.apache.org/jira/browse/SPARK-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1516: - Assignee: John Zhao Yarn Client should not call System.exit, should throw exception instead. Key: SPARK-1516 URL: https://issues.apache.org/jira/browse/SPARK-1516 Project: Spark Issue Type: Improvement Components: Deploy Reporter: DB Tsai Assignee: John Zhao Fix For: 0.9.2, 1.0.1 People submit Spark jobs to a YARN cluster from inside their own applications using the Spark YARN client, and it is not desirable for the client to call System.exit, since that terminates the parent application as well. We should throw an exception instead, so callers can decide what action to take given the exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
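A hedged sketch of the direction the ticket asks for (the exception class name below is hypothetical, not the actual patch): surface the failure to the embedding application instead of killing its JVM.
{code}
// Hypothetical exception type, for illustration only.
class YarnClientLaunchException(msg: String) extends RuntimeException(msg)

def failLaunch(diagnostics: String): Nothing = {
  // Before: System.exit(1) -- which would also terminate the application embedding the client.
  throw new YarnClientLaunchException(s"YARN application failed to launch: $diagnostics")
}

// The embedding application can then decide how to react:
try failLaunch("queue capacity exceeded") catch {
  case e: YarnClientLaunchException => println(s"launch failed, will retry later: ${e.getMessage}")
}
{code}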
[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1935: - Fix Version/s: 0.9.2 Explicitly add commons-codec 1.5 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Minor Fix For: 0.9.2, 1.0.1, 1.1.0 Right now, commons-codec is a transitive dependency. When Spark is built by maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
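For a downstream build hitting the same conflict, one way (a hedged example, not part of the Spark patch itself) to pin the codec version in an sbt build and keep the older transitive copy out:
{code}
// build.sbt -- pin commons-codec explicitly so an older transitive 1.3 cannot win dependency resolution.
libraryDependencies += "commons-codec" % "commons-codec" % "1.5"

// Optionally exclude the copy pulled in transitively by jets3t.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.7.1" exclude("commons-codec", "commons-codec")
{code}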
[jira] [Created] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
Kanwaljit Singh created SPARK-2641: -- Summary: Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kanwaljit Singh updated SPARK-2641: --- Description: When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option:
spark.executor.instances=5
spark.executor.memory=2120m
spark.executor.cores=3
The running job picks up the cores and memory, but not the configured number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:
// Use properties file as fallback for values which have a direct analog to
// arguments in this script.
master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
executorMemory = Option(executorMemory)
  .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
executorCores = Option(executorCores)
  .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
totalExecutorCores = Option(totalExecutorCores)
  .getOrElse(defaultProperties.get("spark.cores.max").orNull)
name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)
Along with these defaults, we should also set a default for instances:
numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull)
Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option (spark.executor.instances=5, spark.executor.memory=2120m, spark.executor.cores=3). The running job picks up the cores and memory, but not the configured number of instances. The fallback block in org.apache.spark.deploy.SparkSubmitArguments quoted above should also set a default for instances from "spark.executor.instances". -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kanwaljit Singh updated SPARK-2641: --- Description: When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option:
spark.executor.instances=5
spark.executor.memory=2120m
spark.executor.cores=3
The submitted job picks up the cores and memory, but not the configured number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:
// Use properties file as fallback for values which have a direct analog to
// arguments in this script.
master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
executorMemory = Option(executorMemory)
  .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
executorCores = Option(executorCores)
  .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
totalExecutorCores = Option(totalExecutorCores)
  .getOrElse(defaultProperties.get("spark.cores.max").orNull)
name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)
Along with these defaults, we should also set a default for instances:
numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull)
PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html
was: (the previous description, identical apart from reading "The running job" and lacking the PS about the documentation page)
Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option (spark.executor.instances=5, spark.executor.memory=2120m, spark.executor.cores=3). The submitted job picks up the cores and memory, but not the configured number of instances. The fallback block in org.apache.spark.deploy.SparkSubmitArguments quoted above should also set a default for instances from "spark.executor.instances". PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.2#6252)
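A self-contained model of the properties-file fallback described above, including the spark.executor.instances line the ticket proposes (simplified from SparkSubmitArguments; variable values are illustrative):
{code}
object FallbackDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the parsed --properties-file contents.
    val defaultProperties = Map(
      "spark.executor.instances" -> "5",
      "spark.executor.memory"    -> "2120m",
      "spark.executor.cores"     -> "3")

    var executorMemory: String = null   // would come from --executor-memory
    var numExecutors: String   = null   // would come from --num-executors

    executorMemory = Option(executorMemory)
      .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
    // Proposed addition: without this line the properties-file value is silently ignored.
    numExecutors = Option(numExecutors)
      .getOrElse(defaultProperties.get("spark.executor.instances").orNull)

    println(s"memory=$executorMemory, instances=$numExecutors")  // memory=2120m, instances=5
  }
}
{code}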
[jira] [Commented] (SPARK-2638) Improve concurrency of fetching Map outputs
[ https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071476#comment-14071476 ] Stephen Boesch commented on SPARK-2638: --- Upon examining the codebase further, I see that the above is actually a pattern followed in various places. As another example, consider the logic used for fetching RDD partitions in CacheManager.scala:
private val loading = new HashSet[RDDBlockId]()
..
loading.synchronized {
  ..
  while (loading.contains(id)) {
    try {
      loading.wait()
So, what am I missing here? Why is it deemed necessary to block ALL partition loaders instead of just blocking on the particular RDDBlockId? In this case, the suggested update would be:
id.toString.intern.synchronized {
  while (loading.contains(id)) {
    try {
      id.toString.intern.wait()
Please explain why it is necessary to block in the former way, i.e. on the loading collection rather than on individual RDDBlockIds. Improve concurrency of fetching Map outputs --- Key: SPARK-2638 URL: https://issues.apache.org/jira/browse/SPARK-2638 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: All Reporter: Stephen Boesch Priority: Minor Labels: MapOutput, concurrency Fix For: 1.1.0 Original Estimate: 0h Remaining Estimate: 0h This issue was noticed while perusing the MapOutputTracker source code. Notice that the synchronization is on the containing fetching collection, which makes ALL fetches wait if any fetch is occurring. The fix is to synchronize instead on the shuffleId (interned as a string to ensure JVM-wide visibility).
def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
  val statuses = mapStatuses.get(shuffleId).orNull
  if (statuses == null) {
    logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
    var fetchedStatuses: Array[MapStatus] = null
    fetching.synchronized {                       // This is existing code
    // shuffleId.toString.intern.synchronized {   // New Code
      if (fetching.contains(shuffleId)) {
        // Someone else is fetching it; wait for them to be done
        while (fetching.contains(shuffleId)) {
          try {
            fetching.wait()
          } catch {
            case e: InterruptedException =>
          }
        }
This is only a small code change, but the test cases to prove (a) proper functionality and (b) proper performance improvement are not so trivial. For (b) it is not worthwhile to add a test case to the codebase. Instead I have added a git project that demonstrates the concurrency/performance improvement using the fine-grained approach. The github project is at https://github.com/javadba/scalatesting.git . Simply run "sbt test". Note: it is unclear how/where to include this ancillary testing/verification information that will not be included in the git PR; I am open to any suggestions, even as far as simply removing references to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
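A minimal, self-contained sketch of the locking granularity being discussed (illustration only, not the Spark patch; the real code also waits and notifies on the chosen monitor). Interning the id string yields one JVM-wide monitor per id, so callers working on different ids no longer serialize behind a single collection-wide lock:
{code}
import scala.collection.mutable

val fetching = new mutable.HashSet[Int]()

// Coarse-grained: every caller contends on the same monitor, whatever id it is interested in.
def coarse(shuffleId: Int)(body: => Unit): Unit =
  fetching.synchronized { body }

// Fine-grained: only callers interested in the same id contend with each other.
def fine(shuffleId: Int)(body: => Unit): Unit =
  shuffleId.toString.intern().synchronized { body }
{code}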
[jira] [Commented] (SPARK-2633) support register spark listener to listener bus with Java API
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071478#comment-14071478 ] Chengxiang Li commented on SPARK-2633: -- add 2 more:
# The StageInfo class is not well designed for getting the stage state; it should contain something like a StageEndReason.
# It would be lovely if the Spark Java API could convert Scala collections into Java collections instead of leaving that to the user, for example stageIds: Seq\[Int] in SparkListenerJobStart.
support register spark listener to listener bus with Java API - Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Currently a user can only register a Spark listener with the Scala API; we should add this capability to the Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2638) Improve concurrency of fetching Map outputs
[ https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071490#comment-14071490 ] Stephen Boesch commented on SPARK-2638: --- Other examples: HttpBroadcast.synchronized { SparkEnv.get.blockManager.putSingle( blockId, value_, StorageLevel.MEMORY_AND_DISK, tellMaster = false) } TorrentBroadcast.synchronized { SparkEnv.get.blockManager.putSingle( broadcastId, value_, StorageLevel.MEMORY_AND_DISK, tellMaster = false) } Note: there are well over one hundred other uses of synchronized - but the others seem to be properly scoped - ie synchronized on sufficiently confined objects, or encompassing short-lived operations. Improve concurrency of fetching Map outputs --- Key: SPARK-2638 URL: https://issues.apache.org/jira/browse/SPARK-2638 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: All Reporter: Stephen Boesch Priority: Minor Labels: MapOutput, concurrency Fix For: 1.1.0 Original Estimate: 0h Remaining Estimate: 0h This issue was noticed while perusing the MapOutputTracker source code. Notice that the synchronization is on the containing fetching collection - which makes ALL fetches wait if any fetch were occurring. The fix is to synchronize instead on the shuffleId (interned as a string to ensure JVM wide visibility). def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = { val statuses = mapStatuses.get(shuffleId).orNull if (statuses == null) { logInfo(Don't have map outputs for shuffle + shuffleId + , fetching them) var fetchedStatuses: Array[MapStatus] = null fetching.synchronized { // This is existing code // shuffleId.toString.intern.synchronized { // New Code if (fetching.contains(shuffleId)) { // Someone else is fetching it; wait for them to be done while (fetching.contains(shuffleId)) { try { fetching.wait() } catch { case e: InterruptedException = } } This is only a small code change, but the testcases to prove (a) proper functionality and (b) proper performance improvement are not so trivial. For (b) it is not worthwhile to add a testcase to the codebase. Instead I have added a git project that demonstrates the concurrency/performance improvement using the fine-grained approach . The github project is at https://github.com/javadba/scalatesting.git . Simply run sbt test. Note: it is unclear how/where to include this ancillary testing/verification information that will not be included in the git PR: i am open for any suggestions - even as far as simply removing references to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed
[ https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071492#comment-14071492 ] navanee commented on SPARK-2575: yes Xiangrui Meng.. But if I use MLUtils.loadLibSVMFile function to get LabeledPoint and pass it to train method, i wont get the issue. SVMWithSGD throwing Input Validation failed Key: SPARK-2575 URL: https://issues.apache.org/jira/browse/SPARK-2575 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.1 Reporter: navanee SVMWithSGD throwing Input Validation failed while using Sparse Array as Input. Though SVMWihtSGD accepts LibSVM format. Exception trace : org.apache.spark.SparkException: Input validation failed. at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145) at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188) at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala) at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) -- This message was sent by Atlassian JIRA (v6.2#6252)
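For reference, a minimal sketch of the path the commenter reports as working: loading LibSVM-formatted data through MLUtils into LabeledPoints and passing those to SVMWithSGD.train. The file path and parameters are illustrative; a common cause of "Input validation failed" for SVMWithSGD is labels that are not 0/1.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext("local[2]", "svm-demo")
// RDD[LabeledPoint]; the LibSVM file is assumed to use binary 0/1 labels.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val model = SVMWithSGD.train(data, numIterations = 100)
println(s"weights: ${model.weights}")
sc.stop()
{code}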
[jira] [Created] (SPARK-2642) Add jobId in web UI
YanTang Zhai created SPARK-2642: --- Summary: Add jobId in web UI Key: SPARK-2642 URL: https://issues.apache.org/jira/browse/SPARK-2642 Project: Spark Issue Type: Improvement Components: Web UI Reporter: YanTang Zhai Priority: Minor The Web UI shows only the stage id at present, so multiple stages belonging to the same job cannot be explicitly grouped together. A job id will be added to the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2618) use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-2618: Description: We use the Shark server for interactive queries; every SQL statement runs as a job. Sometimes we want a query submitted to the Shark server later to run immediately, so we need to let the user define a job's priority and ensure that high-priority jobs are launched first. I have created a pull request: https://github.com/apache/spark/pull/1528 was: We use the Shark server for interactive queries; every SQL statement runs as a job. Sometimes we want a query submitted to the Shark server later to run immediately, so we need to let the user define a job's priority and ensure that high-priority jobs are launched first. use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler - Key: SPARK-2618 URL: https://issues.apache.org/jira/browse/SPARK-2618 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang We use the Shark server for interactive queries; every SQL statement runs as a job. Sometimes we want a query submitted to the Shark server later to run immediately, so we need to let the user define a job's priority and ensure that high-priority jobs are launched first. I have created a pull request: https://github.com/apache/spark/pull/1528 -- This message was sent by Atlassian JIRA (v6.2#6252)
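A hedged sketch of how such a per-job priority might be set from the submitting thread, based on the ticket's intent (the property name comes from the proposal, not from released Spark, and the exact mechanism may differ in the pull request); Spark already uses per-thread local properties this way, e.g. for fair-scheduler pools. Assumes a spark-shell-style session where sc is the SparkContext:
{code}
// Proposed property (hypothetical usage): a higher value would be launched earlier by the DAGScheduler.
sc.setLocalProperty("spark.scheduler.priority", "10")
val urgent = sc.textFile("hdfs:///queries/q1").count()   // path is illustrative
sc.setLocalProperty("spark.scheduler.priority", null)    // clear it for subsequent jobs on this thread
{code}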
[jira] [Updated] (SPARK-2643) Stages web ui has ERROR when pool name is None
[ https://issues.apache.org/jira/browse/SPARK-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YanTang Zhai updated SPARK-2643: Component/s: Web UI Stages web ui has ERROR when pool name is None -- Key: SPARK-2643 URL: https://issues.apache.org/jira/browse/SPARK-2643 Project: Spark Issue Type: Bug Components: Web UI Reporter: YanTang Zhai Priority: Minor 14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132) at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969) at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969) at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40) at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60) at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52) at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:744) 14/07/23 16:01:44 WARN
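A hedged sketch of the defensive pattern that avoids the None.get above when a stage has no pool name assigned (illustrative only; the actual fix in StageTable.scala may differ):
{code}
val poolName: Option[String] = None   // stand-in for a stage whose scheduler pool was never set

// Before (throws java.util.NoSuchElementException: None.get):
// val name = poolName.get

// After: fall back to the default pool when the scheduler did not assign one.
val name = poolName.getOrElse("default")
println(name)
{code}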
[jira] [Commented] (SPARK-2298) Show stage attempt in UI
[ https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071542#comment-14071542 ] Apache Spark commented on SPARK-2298: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1545 Show stage attempt in UI Key: SPARK-2298 URL: https://issues.apache.org/jira/browse/SPARK-2298 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Reynold Xin Assignee: Masayoshi TSUZUKI Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png We should add a column to the web ui to show stage attempt id. Then tasks should be grouped by (stageId, stageAttempt) tuple. When a stage is resubmitted (e.g. due to fetch failures), we should get a different entry in the web ui and tasks for the resubmission go there. See the attached screenshot for the confusing status quo. We currently show the same stage entry twice, and then tasks appear in both. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2644) Hive should not be enabled by default in the build.
[ https://issues.apache.org/jira/browse/SPARK-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-2644: -- Assignee: Prashant Sharma Hive should not be enabled by default in the build. --- Key: SPARK-2644 URL: https://issues.apache.org/jira/browse/SPARK-2644 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2644) Hive should not be enabled by default in the build.
[ https://issues.apache.org/jira/browse/SPARK-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071556#comment-14071556 ] Apache Spark commented on SPARK-2644: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/1546 Hive should not be enabled by default in the build. --- Key: SPARK-2644 URL: https://issues.apache.org/jira/browse/SPARK-2644 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071571#comment-14071571 ] Sean Owen commented on SPARK-2420: -- I'd say it's already just as broken for those apps, since they will not work once they tangle with Hadoop libs on a classpath, unless the classpaths and deployments are set up a certain way. An app that wants to use Guava 17 *should* be able to safely bundle it, even in the presence of code working with Guava 11, since it has been pretty completely backwards compatible. Pretty completely. It's less bad at least. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
Vlad Komarov created SPARK-2645: --- Summary: Spark driver calls System.exit(50) after calling SparkContext.stop() the second time Key: SPARK-2645 URL: https://issues.apache.org/jira/browse/SPARK-2645 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Vlad Komarov In some cases my application calls SparkContext.stop() after it has already stopped, and this leads to stopping the JVM that runs the Spark driver. E.g. this program should run forever
{code}
JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", "DummyApp");
try {
    JavaRDD<Integer> rdd = context.parallelize(Arrays.asList(1, 2, 3));
    rdd.count();
} catch (Throwable e) {
    e.printStackTrace();
}
try {
    context.cancelAllJobs();
    context.stop();
    // call stop second time
    context.stop();
} catch (Throwable e) {
    e.printStackTrace();
}
Thread.currentThread().join();
{code}
but it finishes with exit code 50 after calling SparkContext.stop() the second time. Also it throws an exception like this
{code}
org.apache.spark.ServerStateException: Server is already stopped
 at org.apache.spark.HttpServer.stop(HttpServer.scala:122) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.SparkContext.stop(SparkContext.scala:984) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) [spark-core_2.10-1.0.0.jar:1.0.0]
 at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.10.4.jar:na]
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.10.4.jar:na]
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.10.4.jar:na]
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.10.4.jar:na]
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
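A hedged sketch of an idempotent stop, roughly the behavior the reporter expects (illustrative only, not the actual SparkContext change):
{code}
import java.util.concurrent.atomic.AtomicBoolean

class StoppableService {
  private val stopped = new AtomicBoolean(false)

  def stop(): Unit = {
    // Second and later calls are silent no-ops instead of throwing or exiting the JVM.
    if (!stopped.compareAndSet(false, true)) return
    // ... release resources exactly once ...
    println("stopped")
  }
}

val svc = new StoppableService
svc.stop()   // prints "stopped"
svc.stop()   // no-op
{code}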
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071669#comment-14071669 ] Ken Carlile commented on SPARK-2282: Well, something didn't work quite right.. our copy of 1.0.1 is the prebuilt copy for Hadoop 1/CDH3. So I did a git init in that directory, then did a git pull https://github.com/apache/spark/ +refs/pull/1503/head Well, that didn't work... I don't expect you to solve my git noob problems, so I'll work with someone here to figure it out. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2646) log4j initialization not quite compatible with log4j 2.x
Sean Owen created SPARK-2646: Summary: log4j initialization not quite compatible with log4j 2.x Key: SPARK-2646 URL: https://issues.apache.org/jira/browse/SPARK-2646 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Sean Owen Priority: Minor The logging code that handles log4j initialization leads to a stack overflow error when used with log4j 2.x, which has just been released. This occurs even when a downstream project has correctly adjusted its SLF4J bindings, and that is the right thing to do for log4j 2.x, since it is effectively a separate project from 1.x. Here is the relevant bit of Logging.scala:
{code}
private def initializeLogging() {
  // If Log4j is being used, but is not initialized, load a default properties file
  val binder = StaticLoggerBinder.getSingleton
  val usingLog4j = binder.getLoggerFactoryClassStr.endsWith("Log4jLoggerFactory")
  val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
  if (!log4jInitialized && usingLog4j) {
    val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
    Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
      case Some(url) =>
        PropertyConfigurator.configure(url)
        log.info(s"Using Spark's default log4j profile: $defaultLogProps")
      case None =>
        System.err.println(s"Spark was unable to load $defaultLogProps")
    }
  }
  Logging.initialized = true

  // Force a call into slf4j to initialize it. Avoids this happening from mutliple threads
  // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
  log
}
{code}
The first minor issue is that there is a call to a logger inside this method, which is itself initializing logging. In this situation it ends up causing the initialization to be called recursively until the stack overflows. It would be slightly tidier to log this only after Logging.initialized = true, or not at all. But it's not the root problem, or else it would not work at all now. The calls to log4j classes here always reference log4j 1.2, no matter what. For example, there is no getAllAppenders in log4j 2.x. That's fine. Really, usingLog4j means "using log4j 1.2" and log4jInitialized means "log4j 1.2 is initialized". usingLog4j should be false for log4j 2.x, because the initialization only matters for log4j 1.2. But it's true, and that's the real issue. And log4jInitialized is always false, since calls to the log4j 1.2 API are stubs and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence the loop. This is fixed, I believe, if usingLog4j can be false for log4j 2.x. The SLF4J static binding class has the same name for both versions, unfortunately, which causes the issue. However, they are in different packages. For example, if the test included "... and begins with org.slf4j", it should work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the moment, and is in package org.apache.logging.slf4j. Of course, I assume that SLF4J will eventually offer its own binding. I hope to goodness they at least name the binding class differently, or else this will again not work. But then some other check can probably be made. (Credit to Agust Egilsson for finding this; at his request I'm opening a JIRA for him. I'll propose a PR too.) -- This message was sent by Atlassian JIRA (v6.2#6252)
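A small sketch of the check the report suggests: treat "using log4j" as "using the SLF4J binding for log4j 1.2", which lives under org.slf4j, while the log4j 2.x bridge's factory lives under org.apache.logging.slf4j (the exact predicate in the eventual patch may differ):
{code}
import org.slf4j.impl.StaticLoggerBinder

val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
// org.slf4j.impl.Log4jLoggerFactory      -> log4j 1.2 binding: safe to run the default PropertyConfigurator setup
// org.apache.logging.slf4j.Log4jLoggerFactory -> log4j 2.x bridge: skip the log4j 1.2 initialization entirely
val usingLog4j12 = binderClass.startsWith("org.slf4j") && binderClass.endsWith("Log4jLoggerFactory")
println(s"log4j 1.2 binding detected: $usingLog4j12")
{code}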
[jira] [Created] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event
YanTang Zhai created SPARK-2647: --- Summary: DAGScheduler plugs others when processing one JobSubmitted event Key: SPARK-2647 URL: https://issues.apache.org/jira/browse/SPARK-2647 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai If a few of jobs are submitted, DAGScheduler plugs others when processing one JobSubmitted event. For example ont JobSubmitted event is processed as follows and costs much time spark-akka.actor.default-dispatcher-67 daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130) - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call) at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472) at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233) at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at 
scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at
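A hedged workaround sketch (not the proposed patch): since the expensive part of the trace above is computing partitions (HDFS listing / getSplits), forcing that computation on the submitting thread means the DAGScheduler reuses the cached result when it later processes the JobSubmitted event. Assumes a spark-shell-style session where sc is the SparkContext; the path is illustrative:
{code}
val rdd = sc.textFile("hdfs:///big/dir/*")
// Triggers and caches getPartitions on the caller thread, outside the DAGScheduler event loop.
rdd.partitions
// The action's JobSubmitted handling now reuses the cached partitions instead of hitting HDFS.
rdd.count()
{code}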
[jira] [Commented] (SPARK-2646) log4j initialization not quite compatible with log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071745#comment-14071745 ] Apache Spark commented on SPARK-2646: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1547 log4j initialization not quite compatible with log4j 2.x Key: SPARK-2646 URL: https://issues.apache.org/jira/browse/SPARK-2646 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Sean Owen Priority: Minor The logging code that handles log4j initialization leads to an stack overflow error when used with log4j 2.x, which has just been released. This occurs even a downstream project has correctly adjusted SLF4J bindings, and that is the right thing to do for log4j 2.x, since it is effectively a separate project from 1.x. Here is the relevant bit of Logging.scala: {code} private def initializeLogging() { // If Log4j is being used, but is not initialized, load a default properties file val binder = StaticLoggerBinder.getSingleton val usingLog4j = binder.getLoggerFactoryClassStr.endsWith(Log4jLoggerFactory) val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements if (!log4jInitialized usingLog4j) { val defaultLogProps = org/apache/spark/log4j-defaults.properties Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match { case Some(url) = PropertyConfigurator.configure(url) log.info(sUsing Spark's default log4j profile: $defaultLogProps) case None = System.err.println(sSpark was unable to load $defaultLogProps) } } Logging.initialized = true // Force a call into slf4j to initialize it. Avoids this happening from mutliple threads // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html log } {code} The first minor issue is that there is a call to a logger inside this method, which is initializing logging. In this situation, it ends up causing the initialization to be called recursively until the stack overflow. It would be slightly tidier to log this only after Logging.initialized = true. Or not at all. But it's not the root problem, or else, it would not work at all now. The calls to log4j classes here always reference log4j 1.2 no matter what. For example, there is not getAllAppenders in log4j 2.x. That's fine. Really, usingLog4j means using log4j 1.2 and log4jInitialized means log4j 1.2 is initialized. usingLog4j should be false for log4j 2.x, because the initialization only matters for log4j 1.2. But, it's true, and that's the real issue. And log4jInitialized is always false, since calls to the log4j 1.2 API are stubs and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence the loop. This is fixed, I believe, if usingLog4j can be false for log4j 2.x. The SLF4J static binding class has the same name for both versions, unfortunately, which causes the issue. However they're in different packages. For example, if the test included ... and begins with org.slf4j, it should work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the moment, and is in package org.apache.logging.slf4j. Of course, I assume that SLF4J will eventually offer its own binding. I hope to goodness they at least name the binding class differently, or else this will again not work. But then some other check can probably be made. (Credit to Agust Egilsson for finding this; at his request I'm opening a JIRA for him. I'll propose a PR too.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071755#comment-14071755 ] YanTang Zhai commented on SPARK-2647: - I've created PR: https://github.com/apache/spark/pull/1548. Please help to review. Thanks. DAGScheduler plugs others when processing one JobSubmitted event Key: SPARK-2647 URL: https://issues.apache.org/jira/browse/SPARK-2647 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai If a few of jobs are submitted, DAGScheduler plugs others when processing one JobSubmitted event. For example ont JobSubmitted event is processed as follows and costs much time spark-akka.actor.default-dispatcher-67 daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130) - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call) at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472) at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233) at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071754#comment-14071754 ] Apache Spark commented on SPARK-2647: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/1548 DAGScheduler plugs others when processing one JobSubmitted event Key: SPARK-2647 URL: https://issues.apache.org/jira/browse/SPARK-2647 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai If a few of jobs are submitted, DAGScheduler plugs others when processing one JobSubmitted event. For example ont JobSubmitted event is processed as follows and costs much time spark-akka.actor.default-dispatcher-67 daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130) - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call) at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472) at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233) at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071767#comment-14071767 ] Ken Carlile commented on SPARK-2282: Merging just the two files also did not work. I received a bunch of these errors during the test: {code}Exception happened during processing of request from ('127.0.0.1', 33116) Traceback (most recent call last): File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock self.process_request(request, client_address) File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 321, in process_request self.finish_request(request, client_address) File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 334, in finish_request self.RequestHandlerClass(request, client_address, self) File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 649, in __init__ self.handle() File "/usr/local/spark-current/python/pyspark/accumulators.py", line 224, in handle num_updates = read_int(self.rfile) File "/usr/local/spark-current/python/pyspark/serializers.py", line 337, in read_int raise EOFError EOFError {code} And then it errored out with the usual Java error. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse and echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle, or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
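A minimal sketch of the second workaround mentioned in the description, assuming a hypothetical helper (not Spark's actual accumulator code); setReuseAddress is the standard JDK call that sets SO_REUSEADDR:
{code}
import java.net.{InetSocketAddress, Socket}

// Hedged sketch: enable SO_REUSEADDR before connecting, as the description
// suggests. openAccumulatorSocket is an illustrative helper, not Spark's API.
def openAccumulatorSocket(host: String, port: Int): Socket = {
  val socket = new Socket()
  socket.setReuseAddress(true) // sets SO_REUSEADDR
  socket.connect(new InetSocketAddress(host, port))
  socket
}
{code}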
[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071786#comment-14071786 ] Thomas Graves commented on SPARK-2604: -- Yes, we should be adding the overhead in at the check. Spark Application hangs on yarn in edge case scenario of executor memory requirement Key: SPARK-2604 URL: https://issues.apache.org/jira/browse/SPARK-2604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Twinkle Sachdeva In a YARN environment, let's say: MaxAM = maximum allocatable memory, ExecMem = executor's memory. If (MaxAM > ExecMem) and ((MaxAM - ExecMem) < 384m), then maximum-resource validation fails w.r.t. executor memory: the application master gets launched, but when the resource is allocated and validated again, the containers are returned and the application appears to hang. A typical use case is to ask for executor memory equal to the maximum allowed memory as per the YARN config. -- This message was sent by Atlassian JIRA (v6.2#6252)
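A hedged sketch of the kind of up-front check the comment refers to, assuming hypothetical inputs and using the 384 MB minimum overhead mentioned in the description:
{code}
// Minimal sketch, not Spark's actual implementation: reject the request up
// front if executor memory plus the YARN memory overhead cannot fit in the
// largest allocatable container.
val minOverheadMB = 384
def fitsOnYarn(executorMemoryMB: Int, maxAllocatableMB: Int): Boolean =
  executorMemoryMB + minOverheadMB <= maxAllocatableMB
{code}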
[jira] [Updated] (SPARK-2484) By default does not run hive compatibility tests
[ https://issues.apache.org/jira/browse/SPARK-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-2484: --- Assignee: Guoqiang Li By default does not run hive compatibility tests Key: SPARK-2484 URL: https://issues.apache.org/jira/browse/SPARK-2484 Project: Spark Issue Type: Improvement Reporter: Guoqiang Li Assignee: Guoqiang Li The Hive compatibility tests take a long time; in some cases we don't need to run them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2644) Hive should not be enabled by default in the build.
[ https://issues.apache.org/jira/browse/SPARK-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma closed SPARK-2644. -- Resolution: Duplicate Hive should not be enabled by default in the build. --- Key: SPARK-2644 URL: https://issues.apache.org/jira/browse/SPARK-2644 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma Enabling Hive by default slows the build down, which is not desirable. We try to build and run the Hive tests optionally in our Jenkins, i.e. only if there is a change in the Hive codebase. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2484) Build should not run hive compatibility tests by default.
[ https://issues.apache.org/jira/browse/SPARK-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-2484: --- Summary: Build should not run hive compatibility tests by default. (was: By default does not run hive compatibility tests) Build should not run hive compatibility tests by default. - Key: SPARK-2484 URL: https://issues.apache.org/jira/browse/SPARK-2484 Project: Spark Issue Type: Improvement Reporter: Guoqiang Li Assignee: Guoqiang Li The Hive compatibility tests take a long time; in some cases we don't need to run them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2648) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time
Lianhui Wang created SPARK-2648: --- Summary: through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to the same executor at the same time: when a map output has many partitions, too many reducers may connect to that executor at once, which can cause network connection timeouts. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
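A minimal sketch of the idea behind the linked PR, with hypothetical type aliases standing in for Spark's (BlockManagerId, Seq[(BlockId, size)]) pairs:
{code}
import scala.util.Random

// Hedged sketch: randomize the order in which this reducer walks the map
// outputs, so reducers do not all contact the same executor first.
type Address = String
type BlockInfo = (String, Long) // (blockId, size in bytes)

def shuffleFetchOrder(blocksByAddress: Seq[(Address, Seq[BlockInfo])]): Seq[(Address, Seq[BlockInfo])] =
  Random.shuffle(blocksByAddress)
{code}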
[jira] [Created] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
npanj created SPARK-2649: Summary: EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 the httpd daemon doesn't start (so Ganglia is not accessible) on HVM machines like r3.4xlarge deployed by the spark-ec2 script. Here is an example error (it may be an issue with the default AMI spark.ami.hvm.v14 (ami-35b1885c)). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2609) Log thread ID when spilling ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2609. -- Resolution: Fixed Log thread ID when spilling ExternalAppendOnlyMap - Key: SPARK-2609 URL: https://issues.apache.org/jira/browse/SPARK-2609 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 It may be useful to track down whether one thread is constantly spilling, versus multiple threads are spilling relatively infrequently. -- This message was sent by Atlassian JIRA (v6.2#6252)
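A hedged illustration of the kind of log line the issue asks for; the message wording is invented for the example and is not Spark's actual log format:
{code}
// Include the current thread ID so logs can distinguish one thread spilling
// constantly from many threads spilling occasionally.
def logSpill(spillNumber: Int, mapSizeMB: Long): Unit =
  println(s"Thread ${Thread.currentThread().getId} spilling in-memory map of " +
    s"$mapSizeMB MB to disk (spill number $spillNumber)")
{code}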
[jira] [Updated] (SPARK-2609) Log thread ID when spilling ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2609: - Assignee: Andrew Or Log thread ID when spilling ExternalAppendOnlyMap - Key: SPARK-2609 URL: https://issues.apache.org/jira/browse/SPARK-2609 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Fix For: 1.1.0 It may be useful to track down whether one thread is constantly spilling, versus multiple threads are spilling relatively infrequently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2632) Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff.
[ https://issues.apache.org/jira/browse/SPARK-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072044#comment-14072044 ] Yin Huai commented on SPARK-2632: - Seems the exception triggered by importing a method of a non-serializable class is not a new bug. So, I guess we do not want to block our release on it. Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff. Key: SPARK-2632 URL: https://issues.apache.org/jira/browse/SPARK-2632 Project: Spark Issue Type: Bug Affects Versions: 1.0.0, 1.0.1 Reporter: Yin Huai Master is affected by this bug. To reproduce the exception, you can start a local cluster (sbin/start-all.sh) then open a spark shell. {code} class X() { println("What!"); def y = 3 } val x = new X import x.y case class Person(name: String, age: Int) sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).collect {code} Then you will find the exception. I am attaching the stack trace below... {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2632) Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff.
[ https://issues.apache.org/jira/browse/SPARK-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2632: Priority: Major (was: Blocker) Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff. Key: SPARK-2632 URL: https://issues.apache.org/jira/browse/SPARK-2632 Project: Spark Issue Type: Bug Affects Versions: 1.0.0, 1.0.1 Reporter: Yin Huai Master is affected by this bug. To reproduce the exception, you can start a local cluster (sbin/start-all.sh) then open a spark shell. {code} class X() { println("What!"); def y = 3 } val x = new X import x.y case class Person(name: String, age: Int) sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).collect {code} Then you will find the exception. I am attaching the stack trace below... {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2640. -- Resolution: Fixed Fix Version/s: 1.1.0 In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide Assignee: woshilaiceshide Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2640: - Priority: Minor (was: Major) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide Assignee: woshilaiceshide Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2640: - Assignee: woshilaiceshide In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide Assignee: woshilaiceshide Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2277: - Fix Version/s: 1.1.0 Make TaskScheduler track whether there's host on a rack --- Key: SPARK-2277 URL: https://issues.apache.org/jira/browse/SPARK-2277 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Fix For: 1.1.0 When TaskSetManager adds a pending task, it checks whether the task's preferred location is available. For RACK_LOCAL tasks, we consider the preferred rack available if such a rack is defined for the preferred host. This is incorrect, as there may be no alive hosts on that rack at all. Therefore, TaskScheduler should track the hosts on each rack and provide an API for TaskSetManager to check whether there's a host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
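A minimal sketch of the tracking structure and query described above; the names are illustrative and not the actual TaskScheduler API:
{code}
import scala.collection.mutable

// rack -> set of alive hosts on that rack
val hostsByRack = new mutable.HashMap[String, mutable.HashSet[String]]

def registerHost(rack: String, host: String): Unit =
  hostsByRack.getOrElseUpdate(rack, new mutable.HashSet[String]) += host

def removeHost(rack: String, host: String): Unit =
  hostsByRack.get(rack).foreach(_ -= host)

// What TaskSetManager would call before treating a pending task as rack-local.
def hasHostAliveOnRack(rack: String): Boolean =
  hostsByRack.get(rack).exists(_.nonEmpty)
{code}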
[jira] [Updated] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2277: - Assignee: Rui Li Make TaskScheduler track whether there's host on a rack --- Key: SPARK-2277 URL: https://issues.apache.org/jira/browse/SPARK-2277 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 When TaskSetManager adds a pending task, it checks whether the tasks's preferred location is available. Regarding RACK_LOCAL task, we consider the preferred rack available if such a rack is defined for the preferred host. This is incorrect as there may be no alive hosts on that rack at all. Therefore, TaskScheduler should track the hosts on each rack, and provides an API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2277. -- Resolution: Fixed Make TaskScheduler track whether there's host on a rack --- Key: SPARK-2277 URL: https://issues.apache.org/jira/browse/SPARK-2277 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 When TaskSetManager adds a pending task, it checks whether the tasks's preferred location is available. Regarding RACK_LOCAL task, we consider the preferred rack available if such a rack is defined for the preferred host. This is incorrect as there may be no alive hosts on that rack at all. Therefore, TaskScheduler should track the hosts on each rack, and provides an API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2650) Wrong initial sizes for in-memory column buffers
Michael Armbrust created SPARK-2650: --- Summary: Wrong initial sizes for in-memory column buffers Key: SPARK-2650 URL: https://issues.apache.org/jira/browse/SPARK-2650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical The logic for setting up the initial column buffers is different for Spark SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger than available memory (where Shark was okay). Two suspicious things: the initialSize is always set to 0, so we always go with the default. The default looks like it was copied from code like 10 * 1024 * 1024... but in Spark SQL it's 10 * 102 * 1024. -- This message was sent by Atlassian JIRA (v6.2#6252)
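For reference, the two constants mentioned above differ by roughly an order of magnitude (assuming the first was the intended default):
{code}
val sharkStyleDefault = 10 * 1024 * 1024 // 10485760 bytes, i.e. 10 MB
val sparkSqlDefault   = 10 * 102 * 1024  // 1044480 bytes, roughly 1 MB
{code}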
[jira] [Resolved] (SPARK-2561) Repartitioning a SchemaRDD breaks resolution
[ https://issues.apache.org/jira/browse/SPARK-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2561. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Repartitioning a SchemaRDD breaks resolution Key: SPARK-2561 URL: https://issues.apache.org/jira/browse/SPARK-2561 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.1.0, 1.0.2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1630: Assignee: Davies Liu PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Assignee: Davies Liu Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently NPE. It would be better do log a DEBUG message and skip the write of NULL elements. Here are the 2 stack traces : 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
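A hedged sketch of the behavior suggested in the description (skip nulls instead of throwing); the helper below is illustrative, not the actual PythonRDD code:
{code}
import java.io.DataOutputStream

// Write one element to the Python worker stream, skipping nulls; in the real
// code the skip would be logged at DEBUG level rather than printed.
def writeElement(out: DataOutputStream, bytes: Array[Byte]): Unit = {
  if (bytes == null) {
    println("Skipping null element instead of throwing NullPointerException")
  } else {
    out.writeInt(bytes.length)
    out.write(bytes)
  }
}
{code}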
[jira] [Updated] (SPARK-2650) Wrong initial sizes for in-memory column buffers
[ https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2650: Target Version/s: 1.1.0 Wrong initial sizes for in-memory column buffers Key: SPARK-2650 URL: https://issues.apache.org/jira/browse/SPARK-2650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical The logic for setting up the initial column buffers is different for Spark SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger than available memory (where Shark was okay). Two suspicious things: the initialSize is always set to 0, so we always go with the default. The default looks like it was copied from code like 10 * 1024 * 1024... but in Spark SQL it's 10 * 102 * 1024. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-2569: --- Assignee: Michael Armbrust Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust start spark-shell, init (like create hiveContext, import ._ ect, make sure the jar including the UDFs is in classpath) hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. then i tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with NullPointException we had discussion about it in the mail list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2569: Priority: Critical (was: Major) Target Version/s: 1.1.0 Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust Priority: Critical start spark-shell, init (like create hiveContext, import ._ ect, make sure the jar including the UDFs is in classpath) hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. then i tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with NullPointException we had discussion about it in the mail list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072076#comment-14072076 ] Yin Huai commented on SPARK-2576: - [~prashant_] I have also created a [REPL test|https://github.com/yhuai/spark/commit/d77c515f741c65d807cdc147fb32d92e2039bc97] for importing SQLContext.createSchemaRDD. SQLContext is a serializable class. Seems the issue is slightly different from SPARK-2632. The exception is caused by the following code in SparkImports.scala {code} case x: ClassHandler => // I am trying to guess if the import is a defined class // This is an ugly hack, I am not 100% sure of the consequences. // Here we, let everything but defined classes use the import with val. // The reason for this is, otherwise the remote executor tries to pull the // classes involved and may fail. for (imv <- x.definedNames) { val objName = req.lineRep.readPath code.append("import " + objName + ".INSTANCE" + req.accessPath + ".`" + imv + "`\n") } {code} Can you take a look at it as well? Thank you :) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file -- Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both with the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and by packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Vanderveken Assignee: Yin Huai Priority: Blocker Execution of a SQL query against HDFS systematically throws a class not found exception on slave nodes when executing.
(this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (run from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0") val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable("mcars") val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true") allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI:
[jira] [Created] (SPARK-2651) Add maven scalastyle plugin
Rahul Singhal created SPARK-2651: Summary: Add maven scalastyle plugin Key: SPARK-2651 URL: https://issues.apache.org/jira/browse/SPARK-2651 Project: Spark Issue Type: Improvement Components: Build Reporter: Rahul Singhal Priority: Minor SBT has a scalastyle plugin which can be executed to check for coding conventions. It would be nice to add the same for maven builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed
[ https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072140#comment-14072140 ] Xiangrui Meng commented on SPARK-2575: -- loadLibSVMFile converts labels to binary by default. It maps any label value greater than 0.5 to 1.0, or 0.0 otherwise. This is because both 1/0 and +1/-1 are widely used in popular LIBSVM datasets. In MLlib, to use SVMWithSGD you need to use binary labels, 1.0 or 0.0. SVMWithSGD throwing Input Validation failed Key: SPARK-2575 URL: https://issues.apache.org/jira/browse/SPARK-2575 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.1 Reporter: navanee SVMWithSGD throwing Input Validation failed while using Sparse Array as Input. Though SVMWihtSGD accepts LibSVM format. Exception trace : org.apache.spark.SparkException: Input validation failed. at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145) at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188) at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala) at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) -- This message was sent by Atlassian JIRA (v6.2#6252)
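Based on the comment above, a hedged sketch of normalizing labels to 0.0/1.0 before calling SVMWithSGD.train; the helper name is illustrative:
{code}
import org.apache.spark.mllib.regression.LabeledPoint

// Map any label greater than 0.5 to 1.0 and everything else to 0.0, mirroring
// what loadLibSVMFile does by default according to the comment.
def toBinaryLabel(p: LabeledPoint): LabeledPoint =
  LabeledPoint(if (p.label > 0.5) 1.0 else 0.0, p.features)
{code}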
[jira] [Commented] (SPARK-2651) Add maven scalastyle plugin
[ https://issues.apache.org/jira/browse/SPARK-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072156#comment-14072156 ] Rahul Singhal commented on SPARK-2651: -- PR: https://github.com/apache/spark/pull/1550 Add maven scalastyle plugin --- Key: SPARK-2651 URL: https://issues.apache.org/jira/browse/SPARK-2651 Project: Spark Issue Type: Improvement Components: Build Reporter: Rahul Singhal Priority: Minor SBT has a scalastyle plugin which can be executed to check for coding conventions. It would be nice to add the same for maven builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2651) Add maven scalastyle plugin
[ https://issues.apache.org/jira/browse/SPARK-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072164#comment-14072164 ] Apache Spark commented on SPARK-2651: - User 'rahulsinghaliitd' has created a pull request for this issue: https://github.com/apache/spark/pull/1550 Add maven scalastyle plugin --- Key: SPARK-2651 URL: https://issues.apache.org/jira/browse/SPARK-2651 Project: Spark Issue Type: Improvement Components: Build Reporter: Rahul Singhal Priority: Minor SBT has a scalastyle plugin which can be executed to check for coding conventions. It would be nice to add the same for maven builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI
[ https://issues.apache.org/jira/browse/SPARK-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072161#comment-14072161 ] Masayoshi TSUZUKI commented on SPARK-2567: -- I just noticed that the cause of [SPARK-1726] seems to be the same. Resubmitted stage sometimes remains as active stage in the web UI - Key: SPARK-2567 URL: https://issues.apache.org/jira/browse/SPARK-2567 Project: Spark Issue Type: Bug Reporter: Masayoshi TSUZUKI Attachments: SPARK-2567.png When a stage is resubmitted (because an executor was lost, for example), sometimes more than one resubmitted task appears in the web UI, and one stage remains active even after the job has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2642) Add jobId in web UI
[ https://issues.apache.org/jira/browse/SPARK-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072171#comment-14072171 ] Masayoshi TSUZUKI commented on SPARK-2642: -- Is this the same as [SPARK-1362] ? Add jobId in web UI --- Key: SPARK-2642 URL: https://issues.apache.org/jira/browse/SPARK-2642 Project: Spark Issue Type: Improvement Components: Web UI Reporter: YanTang Zhai Priority: Minor At present the web UI has only stage IDs, so multiple stages cannot be shown explicitly as belonging to the same job. The job ID will be added to the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1362) Web UI should provide page of showing statistics and stage list for a given job
[ https://issues.apache.org/jira/browse/SPARK-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI updated SPARK-1362: - Component/s: Web UI Web UI should provide page of showing statistics and stage list for a given job --- Key: SPARK-1362 URL: https://issues.apache.org/jira/browse/SPARK-1362 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 0.9.0 Reporter: wangfei Spark currently provides a page per stage, but Spark's conceptual hierarchy is app > job > stage > task. A job page would be better for monitoring the jobs within one app; with only the stage pages we sometimes cannot distinguish jobs easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
[ https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] npanj updated SPARK-2649: - Description: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). Here is error message: -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- was: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge --- Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). Here is error message: -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
[ https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] npanj updated SPARK-2649: - Description: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- was: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default amit spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge --- Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
[ https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] npanj updated SPARK-2649: - Description: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- was: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge --- Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072235#comment-14072235 ] Apache Spark commented on SPARK-1630: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1551 PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Assignee: Davies Liu Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently NPE. It would be better do log a DEBUG message and skip the write of NULL elements. Here are the 2 stack traces : 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072259#comment-14072259 ] Marcelo Vanzin commented on SPARK-2420: --- Hi Sean, I agree in part about the brokenness of such apps. In part, because I think they're broken because Spark makes it easy for them to be. Also, I think you slightly misread what I wrote, so let me explain. For applications that, let's say, depend on Guava 17, nothing will change with your patch. Such applications already need an explicit dependency on that particular version to build, and need runtime options like {{spark.files.userClassPathFirst}} to be set for things to work. But for applications that depend on the same version of Guava that Spark bundles, none of that is true. They get the dependency transitively, and the class files are available at runtime in the Spark jar, and it's the right version (since Spark needs to make sure they come before Hadoop's version in the classpath, otherwise Spark itself might not work). So if you downgrade the Guava version bundled with Spark, you might break those applications. So yes, technically, they're already broken because Guava is not Spark, but it's very easy to make that mistake. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072258#comment-14072258 ] Apache Spark commented on SPARK-2569: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1552 Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust Priority: Critical start spark-shell, init (like create hiveContext, import ._ ect, make sure the jar including the UDFs is in classpath) hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. then i tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with NullPointException we had discussion about it in the mail list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2652) Tuning default configurations for PySpark
Davies Liu created SPARK-2652: - Summary: Tuning default configurations for PySpark Key: SPARK-2652 URL: https://issues.apache.org/jira/browse/SPARK-2652 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Davies Liu Fix For: 1.1.0 Some default configuration values do not make sense for PySpark; change them to reasonable ones, such as spark.serializer and spark.kryo.referenceTracking -- This message was sent by Atlassian JIRA (v6.2#6252)
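For illustration only, a sketch of overriding the two settings named above; the concrete values are assumptions, not the defaults this issue decides on.
{code}
// Sketch (Scala shown for consistency with this digest's other examples): the
// same keys can also be set in spark-defaults.conf or from PySpark's SparkConf.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // illustrative choice
  .set("spark.kryo.referenceTracking", "false")                          // illustrative choice
{code}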
[jira] [Created] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode
Davies Liu created SPARK-2653: - Summary: Heap size should be the sum of driver.memory and executor.memory in local mode Key: SPARK-2653 URL: https://issues.apache.org/jira/browse/SPARK-2653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu In local mode, the driver and executor run in the same JVM, so the heap size of the JVM should be the sum of spark.driver.memory and spark.executor.memory. -- This message was sent by Atlassian JIRA (v6.2#6252)
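A one-line sketch of the proposed sizing rule; the helper below is hypothetical and only restates the sentence above.
{code}
// In local mode the driver and executor share one JVM, so size its heap as the sum.
def localModeHeapMb(driverMemoryMb: Int, executorMemoryMb: Int): Int =
  driverMemoryMb + executorMemoryMb

// e.g. spark.driver.memory=512m and spark.executor.memory=1024m -> a 1536m heap
val heapMb = localModeHeapMb(512, 1024)
{code}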
[jira] [Created] (SPARK-2654) Leveled logging in PySpark
Davies Liu created SPARK-2654: - Summary: Leveled logging in PySpark Key: SPARK-2654 URL: https://issues.apache.org/jira/browse/SPARK-2654 Project: Spark Issue Type: Improvement Reporter: Davies Liu Add more leveled logging in PySpark; the logging level should be easily controlled by configuration and command-line arguments. -- This message was sent by Atlassian JIRA (v6.2#6252)
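A sketch of the idea in Scala (the issue targets PySpark, but this digest's examples are in Scala); the configuration key below is hypothetical.
{code}
// Read a desired level from configuration and apply it to the root logger.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf

val conf = new SparkConf()
val requested = conf.get("spark.logging.level", "INFO")   // hypothetical key, default INFO
Logger.getRootLogger.setLevel(Level.toLevel(requested))
{code}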
[jira] [Created] (SPARK-2655) Change the default logging level to WARN
Davies Liu created SPARK-2655: - Summary: Change the default logging level to WARN Key: SPARK-2655 URL: https://issues.apache.org/jira/browse/SPARK-2655 Project: Spark Issue Type: Improvement Reporter: Davies Liu The current default logging level, INFO, is pretty noisy; reducing this unnecessary logging will provide a better experience for users. Spark is much more stable and mature than before, so users will not need that much logging in normal cases. But some high-level information is still helpful, such as messages about job and task progress; we could change this important logging to the WARN level as a hack, otherwise we would need to change all other logging to the DEBUG level. PS: it would be better to have one-line progress logging in the terminal (and in the title). -- This message was sent by Atlassian JIRA (v6.2#6252)
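For illustration, a hedged sketch of what the proposal amounts to, assuming Spark's log4j 1.x setup; the scheduler logger singled out below is an assumption, not a decision from this issue.
{code}
// Quieten the root logger while keeping job/task progress messages visible.
import org.apache.log4j.{Level, Logger}

Logger.getRootLogger.setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.scheduler").setLevel(Level.INFO) // illustrative exception
{code}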
[jira] [Updated] (SPARK-2648) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time
[ https://issues.apache.org/jira/browse/SPARK-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2648: --- Priority: Critical (was: Major) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time - Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Priority: Critical Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to one executor at the same time. When a map has many partitions, so many reducers connect to that map at once that the network connections may time out. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2648) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time
[ https://issues.apache.org/jira/browse/SPARK-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2648: --- Target Version/s: 1.1.0 Assignee: Lianhui Wang through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time - Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Assignee: Lianhui Wang Priority: Critical Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to one executor at the same time. When a map has many partitions, so many reducers connect to that map at once that the network connections may time out. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2588) Add some more DSLs.
[ https://issues.apache.org/jira/browse/SPARK-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2588. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin Add some more DSLs. --- Key: SPARK-2588 URL: https://issues.apache.org/jira/browse/SPARK-2588 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1726) Tasks that fail to serialize remain in active stages forever.
[ https://issues.apache.org/jira/browse/SPARK-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1726. - Resolution: Fixed Fix Version/s: 1.1.0 [~kayousterhout] reports this is fixed in master. Tasks that fail to serialize remain in active stages forever. - Key: SPARK-1726 URL: https://issues.apache.org/jira/browse/SPARK-1726 Project: Spark Issue Type: Bug Components: Web UI Reporter: Michael Armbrust Assignee: Andrew Or Fix For: 1.1.0 Attachments: ZombieTask.tiff In the spark shell. {code} scala class Adder(x: Int) { def apply(a: Int) = a + x } defined class Adder scala val add = new Adder(10) scala sc.parallelize(1 to 10).map(add(_)).collect() {code} You get: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$Adder at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} However, the web ui is messed up. See attached screen shot. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1726) Tasks that fail to serialize remain in active stages forever.
[ https://issues.apache.org/jira/browse/SPARK-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout reopened SPARK-1726: --- Tasks that fail to serialize remain in active stages forever. - Key: SPARK-1726 URL: https://issues.apache.org/jira/browse/SPARK-1726 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1 Reporter: Michael Armbrust Assignee: Andrew Or Attachments: ZombieTask.tiff In the spark shell. {code} scala class Adder(x: Int) { def apply(a: Int) = a + x } defined class Adder scala val add = new Adder(10) scala sc.parallelize(1 to 10).map(add(_)).collect() {code} You get: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$Adder at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} However, the web ui is messed up. See attached screen shot. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2226. - Resolution: Fixed Fix Version/s: 1.1.0 HAVING should be able to contain aggregate expressions that don't appear in the aggregation list. -- Key: SPARK-2226 URL: https://issues.apache.org/jira/browse/SPARK-2226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: William Benton Fix For: 1.1.0 https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q This test file contains the following query: {code} SELECT key FROM src GROUP BY key HAVING max(value) val_255; {code} Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2569. - Resolution: Fixed Fix Version/s: 1.1.0 Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0 Start spark-shell and initialize (create the hiveContext, import ._, etc., and make sure the jar including the UDFs is on the classpath). hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. Then I tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with a NullPointerException. We had a discussion about it on the mailing list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2102) Caching with GENERIC column type causes query execution to slow down significantly
[ https://issues.apache.org/jira/browse/SPARK-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2102. - Resolution: Fixed Fix Version/s: 1.1.0 Caching with GENERIC column type causes query execution to slow down significantly -- Key: SPARK-2102 URL: https://issues.apache.org/jira/browse/SPARK-2102 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Fix For: 1.1.0 It is likely that we are doing something wrong with Kryo. [~adav] has a twitter dataset that reproduces the issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2656) Python version without support for exact sample size
Doris Xin created SPARK-2656: Summary: Python version without support for exact sample size Key: SPARK-2656 URL: https://issues.apache.org/jira/browse/SPARK-2656 Project: Spark Issue Type: Sub-task Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2633) support register spark listener to listener bus with Java API
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072579#comment-14072579 ] Marcelo Vanzin commented on SPARK-2633: --- So, being able to register listeners is probably not that complicated, but things may get tricky when actually looking at the events. e.g., the {{StageInfo}} objects exposed through {{SparkListenerStageSubmitted}} contain Scala APIs that may not be Java-friendly. I guess my question is: is the goal to have full API equivalence in Java-land? If so, it might be easier to clean up the types so that they're all Java-friendly. Otherwise, the Java listener API will always be a second-class citizen. support register spark listener to listener bus with Java API - Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Currently user can only register spark listener with Scala API, we should add this feature to Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
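For context on the comment above, a sketch of what the existing Scala-side listener looks like; a Java API would need Java-friendly equivalents of these event and {{StageInfo}} types. The listener class below is illustrative, not part of the issue.
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// A minimal Scala listener; the StageInfo it receives is the Scala type whose
// Java-friendliness the comment above questions.
class StageLoggingListener extends SparkListener {
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    println(s"Stage submitted: ${stageSubmitted.stageInfo.name}")
  }
}

// Registration today is Scala-only, e.g.: sc.addSparkListener(new StageLoggingListener())
{code}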
[jira] [Commented] (SPARK-2656) Python version without support for exact sample size
[ https://issues.apache.org/jira/browse/SPARK-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072585#comment-14072585 ] Apache Spark commented on SPARK-2656: - User 'dorx' has created a pull request for this issue: https://github.com/apache/spark/pull/1554 Python version without support for exact sample size Key: SPARK-2656 URL: https://issues.apache.org/jira/browse/SPARK-2656 Project: Spark Issue Type: Sub-task Reporter: Doris Xin Assignee: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2657) Use more compact data structures than ArrayBuffer in groupBy and cogroup
Matei Zaharia created SPARK-2657: Summary: Use more compact data structures than ArrayBuffer in groupBy and cogroup Key: SPARK-2657 URL: https://issues.apache.org/jira/browse/SPARK-2657 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
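The issue carries no description, but the idea in the title can be sketched briefly; this is only an illustration of "more compact than ArrayBuffer", not the data structure from the linked pull request.
{code}
// Sketch: most groups are small, so keep the first element inline and only pay
// for a collection when a key actually accumulates more values.
class SmallGroup[T](first: T) {
  private var overflow: List[T] = Nil            // a real version would use a growable array
  def +=(elem: T): this.type = { overflow = elem :: overflow; this }
  def iterator: Iterator[T] = Iterator(first) ++ overflow.reverseIterator
}
{code}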
[jira] [Commented] (SPARK-2657) Use more compact data structures than ArrayBuffer in groupBy and cogroup
[ https://issues.apache.org/jira/browse/SPARK-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072608#comment-14072608 ] Apache Spark commented on SPARK-2657: - User 'mateiz' has created a pull request for this issue: https://github.com/apache/spark/pull/1555 Use more compact data structures than ArrayBuffer in groupBy and cogroup Key: SPARK-2657 URL: https://issues.apache.org/jira/browse/SPARK-2657 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
[ https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-2574: Assignee: Matei Zaharia Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner -- Key: SPARK-2574 URL: https://issues.apache.org/jira/browse/SPARK-2574 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
[ https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072605#comment-14072605 ] Matei Zaharia commented on SPARK-2574: -- I implemented this as part of https://github.com/apache/spark/pull/1555 Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner -- Key: SPARK-2574 URL: https://issues.apache.org/jira/browse/SPARK-2574 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2549) Functions defined inside of other functions trigger failures
[ https://issues.apache.org/jira/browse/SPARK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2549. Resolution: Fixed Issue resolved by pull request 1510 [https://github.com/apache/spark/pull/1510] Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} I dug into the byte code for inner functions a bit more. I noticed that they tend to use `$$` before the function name. There is more information on that string sequence here: https://github.com/scala/scala/blob/2.10.x/src/reflect/scala/reflect/internal/StdNames.scala#L286 I did a cursory look and it appears that symbol is mostly (exclusively?) used for anonymous or inner functions: {code} # in RDD package classes $ ls *.class | xargs -I {} javap {} |grep \\$\\$ public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$1(scala.reflect.ClassTag, byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$2(byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final scala.collection.Iterator org$apache$spark$rdd$PairRDDFunctions$$reducePartition$1(scala.collection.Iterator, scala.Function2); public final java.util.HashMap org$apache$spark$rdd$PairRDDFunctions$$mergeMaps$1(java.util.HashMap, java.util.HashMap, scala.Function2); ... public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1 extends scala.runtime.AbstractFunction0$mcJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1(org.apache.spark.rdd.AsyncRDDActionsT); public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2 extends scala.runtime.AbstractFunction2$mcVIJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2(org.apache.spark.rdd.AsyncRDDActionsT); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
[ https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2574: - Priority: Trivial (was: Major) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner -- Key: SPARK-2574 URL: https://issues.apache.org/jira/browse/SPARK-2574 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Matei Zaharia Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2658) HiveQL: 1 = true should evaluate to true
Michael Armbrust created SPARK-2658: --- Summary: HiveQL: 1 = true should evaluate to true Key: SPARK-2658 URL: https://issues.apache.org/jira/browse/SPARK-2658 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2658) HiveQL: 1 = true should evaluate to true
[ https://issues.apache.org/jira/browse/SPARK-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072629#comment-14072629 ] Apache Spark commented on SPARK-2658: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1556 HiveQL: 1 = true should evaluate to true Key: SPARK-2658 URL: https://issues.apache.org/jira/browse/SPARK-2658 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2659) HiveQL: Division operator should always perform fractional division
Michael Armbrust created SPARK-2659: --- Summary: HiveQL: Division operator should always perform fractional division Key: SPARK-2659 URL: https://issues.apache.org/jira/browse/SPARK-2659 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
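For illustration of the intended semantics only (no example appears in the issue itself), assuming a HiveContext's hql method and Hive's src test table:
{code}
// Under HiveQL semantics, '/' always performs fractional division,
// so integer operands should not silently truncate.
hql("SELECT 3 / 2 FROM src LIMIT 1").collect().foreach(println)   // expected: 1.5, not 1
{code}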
[jira] [Commented] (SPARK-2659) HiveQL: Division operator should always perform fractional division
[ https://issues.apache.org/jira/browse/SPARK-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072655#comment-14072655 ] Apache Spark commented on SPARK-2659: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1557 HiveQL: Division operator should always perform fractional division --- Key: SPARK-2659 URL: https://issues.apache.org/jira/browse/SPARK-2659 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2316: --- Target Version/s: 1.1.0 (was: 1.0.2) StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or In the case where jobs are frequently causing dropped blocks, the storage status listener can become a bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we perform operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2316: --- Priority: Critical (was: Major) StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical In the case where jobs are frequently causing dropped blocks, the storage status listener can become a bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we perform operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
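A sketch of the "few indices" idea from the description above, purely for illustration; the actual fix may be structured differently.
{code}
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus, RDDBlockId}

// Index blocks by RDD id so an update touches only one map entry instead of
// re-grouping every known block on each task end.
val blocksByRdd = mutable.HashMap.empty[Int, mutable.HashMap[BlockId, BlockStatus]]

def updateBlock(blockId: BlockId, status: BlockStatus): Unit = blockId match {
  case RDDBlockId(rddId, _) =>
    blocksByRdd.getOrElseUpdate(rddId, mutable.HashMap.empty).update(blockId, status)
  case _ => // non-RDD blocks are out of scope for this sketch
}
{code}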
[jira] [Commented] (SPARK-2458) Make failed application log visible on History Server
[ https://issues.apache.org/jira/browse/SPARK-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072665#comment-14072665 ] Apache Spark commented on SPARK-2458: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/1558 Make failed application log visible on History Server - Key: SPARK-2458 URL: https://issues.apache.org/jira/browse/SPARK-2458 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Masayoshi TSUZUKI The history server is very helpful for debugging application correctness and performance after the application has finished. However, when the application failed, the link is not listed on the history server UI and the history can't be viewed. It would be very useful if we could check the history of a failed application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2660) Enable pretty-printing SchemaRDD Rows
Aaron Davidson created SPARK-2660: - Summary: Enable pretty-printing SchemaRDD Rows Key: SPARK-2660 URL: https://issues.apache.org/jira/browse/SPARK-2660 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, printing a Row results in something like [a,b,c]. It would be cool if there were a way to pretty-print them, similar to JSON, into something like {code} { Col1 : a, Col2 : b, Col3 : c } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
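A minimal sketch of the desired formatting; the helper and its parameters are hypothetical, and how the column names would be obtained from the SchemaRDD is left open.
{code}
// Join column names with the row's values into a JSON-like string.
def prettyRow(columnNames: Seq[String], values: Seq[Any]): String =
  columnNames.zip(values)
    .map { case (name, value) => "  \"" + name + "\": \"" + value + "\"" }
    .mkString("{\n", ",\n", "\n}")

// prettyRow(Seq("Col1", "Col2", "Col3"), Seq("a", "b", "c")) produces:
// {
//   "Col1": "a",
//   "Col2": "b",
//   "Col3": "c"
// }
{code}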
[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL
[ https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072681#comment-14072681 ] Apache Spark commented on SPARK-2010: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1559 Support for nested data in PySpark SQL -- Key: SPARK-2010 URL: https://issues.apache.org/jira/browse/SPARK-2010 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Kan Zhang Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2648) Randomize order of executors when fetching shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2648: --- Summary: Randomize order of executors when fetching shuffle blocks (was: through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time) Randomize order of executors when fetching shuffle blocks - Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Assignee: Lianhui Wang Priority: Critical Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to one executor at the same time. When a map has many partitions, so many reducers connect to that map at once that the network connections may time out. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
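The essence of the renamed proposal, as a hedged sketch (the real change is in the linked pull request and may be structured differently):
{code}
import scala.util.Random
import org.apache.spark.storage.{BlockId, BlockManagerId}

// Randomize the per-executor fetch order so that reducers do not all hit the
// same executor first when fetching shuffle blocks.
def randomizedFetchOrder(
    blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])]
  ): Seq[(BlockManagerId, Seq[(BlockId, Long)])] =
  Random.shuffle(blocksByAddress)
{code}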
[jira] [Commented] (SPARK-2660) Enable pretty-printing SchemaRDD Rows
[ https://issues.apache.org/jira/browse/SPARK-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072687#comment-14072687 ] Larry Xiao commented on SPARK-2660: --- I think this one is suitable for a newbie like me, though I think it's alright the way it is now :) I'm not familiar with the code, and I found printRDDElement in pipe for RDD. Is it the correct place? Enable pretty-printing SchemaRDD Rows - Key: SPARK-2660 URL: https://issues.apache.org/jira/browse/SPARK-2660 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, printing a Row results in something like [a,b,c]. It would be cool if there was a way to pretty-print them similar to JSON, into something like {code} { Col1 : a, Col2 : b, Col3 : c } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2661) Unpersist last RDD in bagel iteration
Adrian Wang created SPARK-2661: -- Summary: Unpersist last RDD in bagel iteration Key: SPARK-2661 URL: https://issues.apache.org/jira/browse/SPARK-2661 Project: Spark Issue Type: Improvement Affects Versions: 1.0.1 Reporter: Adrian Wang In bagel iteration, we only depend on RDD[n] to get RDD[n+1], so we can unpersist RDD[n-1] after we get RDD[n]. -- This message was sent by Atlassian JIRA (v6.2#6252)
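A generic sketch of the pattern described above (variable and method names are illustrative; this is not the Bagel code itself):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Each iteration only needs the previous RDD, so unpersist it once the new one
// has been materialized.
def iterate[T](initial: RDD[T], steps: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
  var prev = initial.persist(StorageLevel.MEMORY_AND_DISK)
  for (_ <- 1 to steps) {
    val next = step(prev).persist(StorageLevel.MEMORY_AND_DISK)
    next.foreachPartition(_ => ())      // force materialization before dropping prev
    prev.unpersist(blocking = false)
    prev = next
  }
  prev
}
{code}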
[jira] [Commented] (SPARK-2662) Fix NPE for JsonProtocol
[ https://issues.apache.org/jira/browse/SPARK-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072689#comment-14072689 ] Guoqiang Li commented on SPARK-2662: PR: https://github.com/apache/spark/pull/1511 Fix NPE for JsonProtocol Key: SPARK-2662 URL: https://issues.apache.org/jira/browse/SPARK-2662 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2662) Fix NPE for JsonProtocol
Guoqiang Li created SPARK-2662: -- Summary: Fix NPE for JsonProtocol Key: SPARK-2662 URL: https://issues.apache.org/jira/browse/SPARK-2662 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072729#comment-14072729 ] Apache Spark commented on SPARK-2568: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/1562 RangePartitioner should go through the data only once - Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Xiangrui Meng As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
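One single-pass approach that fits the description above is reservoir sampling, sketched here for illustration; it is not necessarily the exact technique used in the linked pull request.
{code}
import scala.reflect.ClassTag
import scala.util.Random

// Collect a fixed-size uniform sample in one scan, without knowing the total
// count up front (which is what removes the separate count pass).
def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, seed: Long = 42L): Array[T] = {
  val rng = new Random(seed)
  val reservoir = new Array[T](k)
  var i = 0
  while (input.hasNext) {
    val item = input.next()
    if (i < k) reservoir(i) = item
    else {
      val j = rng.nextInt(i + 1)
      if (j < k) reservoir(j) = item
    }
    i += 1
  }
  if (i < k) reservoir.take(i) else reservoir
}
{code}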