[jira] [Commented] (SPARK-2633) support register spark listener to listener bus with Java API
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071402#comment-14071402 ] Chengxiang Li commented on SPARK-2633: -- For Hive job status monitoring, the Spark listener reports fewer job states than expected:
current:
* monitored events: onJobStart, onJobEnd(JobSucceeded/JobFailed)
* available states: Started/Succeeded/Failed
expected:
* monitored events: +onJobSubmitted, onJobStart, onJobEnd(JobSucceeded/JobFailed/+JobKilled)
* available states: Submitted/Started/Succeeded/Failed/Killed
support register spark listener to listener bus with Java API - Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Currently a user can only register a Spark listener with the Scala API; we should add this capability to the Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
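For reference, a minimal Scala sketch of how a listener is registered on the listener bus through the existing Scala API (the listener body and the job it observes are illustrative only); the request is to expose an equivalent entry point to Java callers:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

object ListenerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("listener-demo"))
    // Register a listener on the listener bus; this is the Scala-only API the issue wants mirrored in Java.
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"job ${jobStart.jobId} started")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
    })
    sc.parallelize(1 to 100).count()
    sc.stop()
  }
}
{code}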
[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-2630: -- Summary: Input data size of CoalescedRDD is incorrect (was: Input data size overflows when the size is larger than 4G in one task) Input data size of CoalescedRDD is incorrect Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Priority: Critical Attachments: overflow.tiff Given one big file, such as text.4.3G, and coalesce it into one task: sc.textFile("text.4.3G").coalesce(1).count(). In the Spark Web UI you will see that the input size is reported as 5.4M. -- This message was sent by Atlassian JIRA (v6.2#6252)
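One plausible explanation (an assumption, not confirmed in the ticket) is a 32-bit counter for the task's input bytes, which wraps once the single coalesced task reads more than 4 GB:
{code}
// Assumption for illustration only: the per-task bytes-read metric is held in an Int.
val bytesRead: Long = 4616L * 1024 * 1024   // ~4.5 GB read by the one coalesced task
val wrapped: Int = bytesRead.toInt          // wraps modulo 2^32
println(wrapped)                            // 545259520 (~520 MB) instead of ~4.5 GB
{code}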
[jira] [Created] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
woshilaiceshide created SPARK-2640: -- Summary: In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071448#comment-14071448 ] Apache Spark commented on SPARK-2640: - User 'woshilaiceshide' has created a pull request for this issue: https://github.com/apache/spark/pull/1544 In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1516) Yarn Client should not call System.exit, should throw exception instead.
[ https://issues.apache.org/jira/browse/SPARK-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1516: - Assignee: John Zhao Yarn Client should not call System.exit, should throw exception instead. Key: SPARK-1516 URL: https://issues.apache.org/jira/browse/SPARK-1516 Project: Spark Issue Type: Improvement Components: Deploy Reporter: DB Tsai Assignee: John Zhao Fix For: 0.9.2, 1.0.1 People submit Spark jobs to a YARN cluster from inside their own applications using the Spark YARN client, and it is not desirable for the client to call System.exit, since that terminates the parent application as well. We should throw an exception instead, so callers can decide what action to take given the exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
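A hedged sketch of the direction the ticket asks for (the exception class name below is hypothetical, not the actual patch): surface the failure to the embedding application instead of killing its JVM.
{code}
// Hypothetical exception type, for illustration only.
class YarnClientLaunchException(msg: String) extends RuntimeException(msg)

def failLaunch(diagnostics: String): Nothing = {
  // Before: System.exit(1) -- which would also terminate the application embedding the client.
  throw new YarnClientLaunchException(s"YARN application failed to launch: $diagnostics")
}

// The embedding application can then decide how to react:
try failLaunch("queue capacity exceeded") catch {
  case e: YarnClientLaunchException => println(s"launch failed, will retry later: ${e.getMessage}")
}
{code}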
[jira] [Updated] (SPARK-1935) Explicitly add commons-codec 1.5 as a dependency
[ https://issues.apache.org/jira/browse/SPARK-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1935: - Fix Version/s: 0.9.2 Explicitly add commons-codec 1.5 as a dependency Key: SPARK-1935 URL: https://issues.apache.org/jira/browse/SPARK-1935 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Yin Huai Assignee: Yin Huai Priority: Minor Fix For: 0.9.2, 1.0.1, 1.1.0 Right now, commons-codec is a transitive dependency. When Spark is built by maven for Hadoop 1, jets3t 0.7.1 will pull in commons-codec 1.3 which is an older version (Hadoop 1.0.4 depends on 1.4). This older version can cause problems because 1.4 introduces incompatible changes and new methods. -- This message was sent by Atlassian JIRA (v6.2#6252)
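For a downstream build hitting the same conflict, one way (a hedged example, not part of the Spark patch itself) to pin the codec version in an sbt build and keep the older transitive copy out:
{code}
// build.sbt -- pin commons-codec explicitly so an older transitive 1.3 cannot win dependency resolution.
libraryDependencies += "commons-codec" % "commons-codec" % "1.5"

// Optionally exclude the copy pulled in transitively by jets3t.
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.7.1" exclude("commons-codec", "commons-codec")
{code}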
[jira] [Created] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
Kanwaljit Singh created SPARK-2641: -- Summary: Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kanwaljit Singh updated SPARK-2641: --- Description: When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option:
spark.executor.instances=5
spark.executor.memory=2120m
spark.executor.cores=3
The running job picks up the cores and memory, but not the configured number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:
// Use properties file as fallback for values which have a direct analog to
// arguments in this script.
master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
executorMemory = Option(executorMemory)
  .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
executorCores = Option(executorCores)
  .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
totalExecutorCores = Option(totalExecutorCores)
  .getOrElse(defaultProperties.get("spark.cores.max").orNull)
name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)
Along with these defaults, we should also set a default for instances:
numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull)
Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option (spark.executor.instances=5, spark.executor.memory=2120m, spark.executor.cores=3). The running job picks up the cores and memory, but not the configured number of instances. The fallback block in org.apache.spark.deploy.SparkSubmitArguments quoted above should also set a default for instances from "spark.executor.instances". -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kanwaljit Singh updated SPARK-2641: --- Description: When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option:
spark.executor.instances=5
spark.executor.memory=2120m
spark.executor.cores=3
The submitted job picks up the cores and memory, but not the configured number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments:
// Use properties file as fallback for values which have a direct analog to
// arguments in this script.
master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull)
executorMemory = Option(executorMemory)
  .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
executorCores = Option(executorCores)
  .getOrElse(defaultProperties.get("spark.executor.cores").orNull)
totalExecutorCores = Option(totalExecutorCores)
  .getOrElse(defaultProperties.get("spark.cores.max").orNull)
name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull)
jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull)
Along with these defaults, we should also set a default for instances:
numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull)
PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html
was: (the previous description, identical apart from reading "The running job" and lacking the PS about the documentation page)
Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh When running spark-submit in YARN cluster mode, we provide a properties file using the --properties-file option (spark.executor.instances=5, spark.executor.memory=2120m, spark.executor.cores=3). The submitted job picks up the cores and memory, but not the configured number of instances. The fallback block in org.apache.spark.deploy.SparkSubmitArguments quoted above should also set a default for instances from "spark.executor.instances". PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.2#6252)
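A self-contained model of the properties-file fallback described above, including the spark.executor.instances line the ticket proposes (simplified from SparkSubmitArguments; variable values are illustrative):
{code}
object FallbackDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the parsed --properties-file contents.
    val defaultProperties = Map(
      "spark.executor.instances" -> "5",
      "spark.executor.memory"    -> "2120m",
      "spark.executor.cores"     -> "3")

    var executorMemory: String = null   // would come from --executor-memory
    var numExecutors: String   = null   // would come from --num-executors

    executorMemory = Option(executorMemory)
      .getOrElse(defaultProperties.get("spark.executor.memory").orNull)
    // Proposed addition: without this line the properties-file value is silently ignored.
    numExecutors = Option(numExecutors)
      .getOrElse(defaultProperties.get("spark.executor.instances").orNull)

    println(s"memory=$executorMemory, instances=$numExecutors")  // memory=2120m, instances=5
  }
}
{code}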
[jira] [Commented] (SPARK-2638) Improve concurrency of fetching Map outputs
[ https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071476#comment-14071476 ] Stephen Boesch commented on SPARK-2638: --- Upon examining the codebase further, I see that the above is actually a pattern followed in various places. As another example, consider the logic used for fetching RDD partitions in CacheManager.scala:
private val loading = new HashSet[RDDBlockId]()
..
loading.synchronized {
  ..
  while (loading.contains(id)) {
    try {
      loading.wait()
So, what am I missing here? Why is it deemed necessary to block ALL partition loaders instead of just blocking on the particular RDDBlockId? In this case, the suggested update would be:
id.toString.intern.synchronized {
  while (loading.contains(id)) {
    try {
      id.toString.intern.wait()
Please explain why it is necessary to block in the former way, i.e. on the loading collection rather than on individual RDDBlockIds. Improve concurrency of fetching Map outputs --- Key: SPARK-2638 URL: https://issues.apache.org/jira/browse/SPARK-2638 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: All Reporter: Stephen Boesch Priority: Minor Labels: MapOutput, concurrency Fix For: 1.1.0 Original Estimate: 0h Remaining Estimate: 0h This issue was noticed while perusing the MapOutputTracker source code. Notice that the synchronization is on the containing fetching collection, which makes ALL fetches wait if any fetch is occurring. The fix is to synchronize instead on the shuffleId (interned as a string to ensure JVM-wide visibility).
def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = {
  val statuses = mapStatuses.get(shuffleId).orNull
  if (statuses == null) {
    logInfo("Don't have map outputs for shuffle " + shuffleId + ", fetching them")
    var fetchedStatuses: Array[MapStatus] = null
    fetching.synchronized {                       // This is existing code
    // shuffleId.toString.intern.synchronized {   // New Code
      if (fetching.contains(shuffleId)) {
        // Someone else is fetching it; wait for them to be done
        while (fetching.contains(shuffleId)) {
          try {
            fetching.wait()
          } catch {
            case e: InterruptedException =>
          }
        }
This is only a small code change, but the test cases to prove (a) proper functionality and (b) proper performance improvement are not so trivial. For (b) it is not worthwhile to add a test case to the codebase. Instead I have added a git project that demonstrates the concurrency/performance improvement using the fine-grained approach. The github project is at https://github.com/javadba/scalatesting.git . Simply run "sbt test". Note: it is unclear how/where to include this ancillary testing/verification information that will not be included in the git PR; I am open to any suggestions, even as far as simply removing references to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
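A minimal, self-contained sketch of the locking granularity being discussed (illustration only, not the Spark patch; the real code also waits and notifies on the chosen monitor). Interning the id string yields one JVM-wide monitor per id, so callers working on different ids no longer serialize behind a single collection-wide lock:
{code}
import scala.collection.mutable

val fetching = new mutable.HashSet[Int]()

// Coarse-grained: every caller contends on the same monitor, whatever id it is interested in.
def coarse(shuffleId: Int)(body: => Unit): Unit =
  fetching.synchronized { body }

// Fine-grained: only callers interested in the same id contend with each other.
def fine(shuffleId: Int)(body: => Unit): Unit =
  shuffleId.toString.intern().synchronized { body }
{code}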
[jira] [Commented] (SPARK-2633) support register spark listener to listener bus with Java API
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071478#comment-14071478 ] Chengxiang Li commented on SPARK-2633: -- add 2 more:
# The StageInfo class is not well designed for getting the stage state; it should contain something like a StageEndReason.
# It would be lovely if the Spark Java API could convert Scala collections into Java collections instead of leaving that to the user, for example stageIds: Seq\[Int] in SparkListenerJobStart.
support register spark listener to listener bus with Java API - Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Currently a user can only register a Spark listener with the Scala API; we should add this capability to the Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2638) Improve concurrency of fetching Map outputs
[ https://issues.apache.org/jira/browse/SPARK-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071490#comment-14071490 ] Stephen Boesch commented on SPARK-2638: --- Other examples: HttpBroadcast.synchronized { SparkEnv.get.blockManager.putSingle( blockId, value_, StorageLevel.MEMORY_AND_DISK, tellMaster = false) } TorrentBroadcast.synchronized { SparkEnv.get.blockManager.putSingle( broadcastId, value_, StorageLevel.MEMORY_AND_DISK, tellMaster = false) } Note: there are well over one hundred other uses of synchronized - but the others seem to be properly scoped - ie synchronized on sufficiently confined objects, or encompassing short-lived operations. Improve concurrency of fetching Map outputs --- Key: SPARK-2638 URL: https://issues.apache.org/jira/browse/SPARK-2638 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Environment: All Reporter: Stephen Boesch Priority: Minor Labels: MapOutput, concurrency Fix For: 1.1.0 Original Estimate: 0h Remaining Estimate: 0h This issue was noticed while perusing the MapOutputTracker source code. Notice that the synchronization is on the containing fetching collection - which makes ALL fetches wait if any fetch were occurring. The fix is to synchronize instead on the shuffleId (interned as a string to ensure JVM wide visibility). def getServerStatuses(shuffleId: Int, reduceId: Int): Array[(BlockManagerId, Long)] = { val statuses = mapStatuses.get(shuffleId).orNull if (statuses == null) { logInfo(Don't have map outputs for shuffle + shuffleId + , fetching them) var fetchedStatuses: Array[MapStatus] = null fetching.synchronized { // This is existing code // shuffleId.toString.intern.synchronized { // New Code if (fetching.contains(shuffleId)) { // Someone else is fetching it; wait for them to be done while (fetching.contains(shuffleId)) { try { fetching.wait() } catch { case e: InterruptedException = } } This is only a small code change, but the testcases to prove (a) proper functionality and (b) proper performance improvement are not so trivial. For (b) it is not worthwhile to add a testcase to the codebase. Instead I have added a git project that demonstrates the concurrency/performance improvement using the fine-grained approach . The github project is at https://github.com/javadba/scalatesting.git . Simply run sbt test. Note: it is unclear how/where to include this ancillary testing/verification information that will not be included in the git PR: i am open for any suggestions - even as far as simply removing references to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed
[ https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071492#comment-14071492 ] navanee commented on SPARK-2575: yes Xiangrui Meng.. But if I use MLUtils.loadLibSVMFile function to get LabeledPoint and pass it to train method, i wont get the issue. SVMWithSGD throwing Input Validation failed Key: SPARK-2575 URL: https://issues.apache.org/jira/browse/SPARK-2575 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.1 Reporter: navanee SVMWithSGD throwing Input Validation failed while using Sparse Array as Input. Though SVMWihtSGD accepts LibSVM format. Exception trace : org.apache.spark.SparkException: Input validation failed. at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145) at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188) at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala) at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) -- This message was sent by Atlassian JIRA (v6.2#6252)
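For reference, a minimal sketch of the path the commenter reports as working: loading LibSVM-formatted data through MLUtils into LabeledPoints and passing those to SVMWithSGD.train. The file path and parameters are illustrative; a common cause of "Input validation failed" for SVMWithSGD is labels that are not 0/1.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext("local[2]", "svm-demo")
// RDD[LabeledPoint]; the LibSVM file is assumed to use binary 0/1 labels.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val model = SVMWithSGD.train(data, numIterations = 100)
println(s"weights: ${model.weights}")
sc.stop()
{code}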
[jira] [Created] (SPARK-2642) Add jobId in web UI
YanTang Zhai created SPARK-2642: --- Summary: Add jobId in web UI Key: SPARK-2642 URL: https://issues.apache.org/jira/browse/SPARK-2642 Project: Spark Issue Type: Improvement Components: Web UI Reporter: YanTang Zhai Priority: Minor The Web UI shows only the stage id at present, so multiple stages belonging to the same job cannot be explicitly grouped together. A job id will be added to the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2618) use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-2618: Description: We use the Shark server for interactive queries; every SQL statement runs as a job. Sometimes we want a query submitted to the Shark server later to run immediately, so we need to let the user define a job's priority and ensure that high-priority jobs are launched first. I have created a pull request: https://github.com/apache/spark/pull/1528 was: We use the Shark server for interactive queries; every SQL statement runs as a job. Sometimes we want a query submitted to the Shark server later to run immediately, so we need to let the user define a job's priority and ensure that high-priority jobs are launched first. use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler - Key: SPARK-2618 URL: https://issues.apache.org/jira/browse/SPARK-2618 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang We use the Shark server for interactive queries; every SQL statement runs as a job. Sometimes we want a query submitted to the Shark server later to run immediately, so we need to let the user define a job's priority and ensure that high-priority jobs are launched first. I have created a pull request: https://github.com/apache/spark/pull/1528 -- This message was sent by Atlassian JIRA (v6.2#6252)
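A hedged sketch of how such a per-job priority might be set from the submitting thread, based on the ticket's intent (the property name comes from the proposal, not from released Spark, and the exact mechanism may differ in the pull request); Spark already uses per-thread local properties this way, e.g. for fair-scheduler pools. Assumes a spark-shell-style session where sc is the SparkContext:
{code}
// Proposed property (hypothetical usage): a higher value would be launched earlier by the DAGScheduler.
sc.setLocalProperty("spark.scheduler.priority", "10")
val urgent = sc.textFile("hdfs:///queries/q1").count()   // path is illustrative
sc.setLocalProperty("spark.scheduler.priority", null)    // clear it for subsequent jobs on this thread
{code}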
[jira] [Updated] (SPARK-2643) Stages web ui has ERROR when pool name is None
[ https://issues.apache.org/jira/browse/SPARK-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YanTang Zhai updated SPARK-2643: Component/s: Web UI Stages web ui has ERROR when pool name is None -- Key: SPARK-2643 URL: https://issues.apache.org/jira/browse/SPARK-2643 Project: Spark Issue Type: Bug Components: Web UI Reporter: YanTang Zhai Priority: Minor 14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132) at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969) at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969) at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40) at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60) at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52) at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:744) 14/07/23 16:01:44 WARN
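A hedged sketch of the defensive pattern that avoids the None.get above when a stage has no pool name assigned (illustrative only; the actual fix in StageTable.scala may differ):
{code}
val poolName: Option[String] = None   // stand-in for a stage whose scheduler pool was never set

// Before (throws java.util.NoSuchElementException: None.get):
// val name = poolName.get

// After: fall back to the default pool when the scheduler did not assign one.
val name = poolName.getOrElse("default")
println(name)
{code}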
[jira] [Commented] (SPARK-2298) Show stage attempt in UI
[ https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071542#comment-14071542 ] Apache Spark commented on SPARK-2298: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1545 Show stage attempt in UI Key: SPARK-2298 URL: https://issues.apache.org/jira/browse/SPARK-2298 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Reynold Xin Assignee: Masayoshi TSUZUKI Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png We should add a column to the web ui to show stage attempt id. Then tasks should be grouped by (stageId, stageAttempt) tuple. When a stage is resubmitted (e.g. due to fetch failures), we should get a different entry in the web ui and tasks for the resubmission go there. See the attached screenshot for the confusing status quo. We currently show the same stage entry twice, and then tasks appear in both. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2644) Hive should not be enabled by default in the build.
[ https://issues.apache.org/jira/browse/SPARK-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reassigned SPARK-2644: -- Assignee: Prashant Sharma Hive should not be enabled by default in the build. --- Key: SPARK-2644 URL: https://issues.apache.org/jira/browse/SPARK-2644 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2644) Hive should not be enabled by default in the build.
[ https://issues.apache.org/jira/browse/SPARK-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071556#comment-14071556 ] Apache Spark commented on SPARK-2644: - User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/1546 Hive should not be enabled by default in the build. --- Key: SPARK-2644 URL: https://issues.apache.org/jira/browse/SPARK-2644 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071571#comment-14071571 ] Sean Owen commented on SPARK-2420: -- I'd say it's already just as broken for those apps, since they will not work once they tangle with Hadoop libs on a classpath, unless the classpaths and deployments are set up a certain way. An app that wants to use Guava 17 *should* be able to safely bundle it, even in the presence of code working with Guava 11, since it has been pretty completely backwards compatible. Pretty completely. It's less bad at least. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2645) Spark driver calls System.exit(50) after calling SparkContext.stop() the second time
Vlad Komarov created SPARK-2645: --- Summary: Spark driver calls System.exit(50) after calling SparkContext.stop() the second time Key: SPARK-2645 URL: https://issues.apache.org/jira/browse/SPARK-2645 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Vlad Komarov In some cases my application calls SparkContext.stop() after it has already stopped, and this leads to stopping the JVM that runs the Spark driver. E.g. this program should run forever
{code}
JavaSparkContext context = new JavaSparkContext("spark://12.34.21.44:7077", "DummyApp");
try {
    JavaRDD<Integer> rdd = context.parallelize(Arrays.asList(1, 2, 3));
    rdd.count();
} catch (Throwable e) {
    e.printStackTrace();
}
try {
    context.cancelAllJobs();
    context.stop();
    // call stop second time
    context.stop();
} catch (Throwable e) {
    e.printStackTrace();
}
Thread.currentThread().join();
{code}
but it finishes with exit code 50 after calling SparkContext.stop() the second time. Also it throws an exception like this
{code}
org.apache.spark.ServerStateException: Server is already stopped
 at org.apache.spark.HttpServer.stop(HttpServer.scala:122) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.HttpFileServer.stop(HttpFileServer.scala:48) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.SparkEnv.stop(SparkEnv.scala:81) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.SparkContext.stop(SparkContext.scala:984) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:92) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.deploy.client.AppClient$ClientActor.markDead(AppClient.scala:178) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AppClient.scala:96) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790) ~[spark-core_2.10-1.0.0.jar:1.0.0]
 at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$registerWithMaster$1.apply$mcV$sp(AppClient.scala:91) [spark-core_2.10-1.0.0.jar:1.0.0]
 at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) [akka-actor_2.10-2.2.3-shaded-protobuf.jar:na]
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.10.4.jar:na]
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.10.4.jar:na]
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.10.4.jar:na]
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.10.4.jar:na]
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
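A hedged sketch of an idempotent stop, roughly the behavior the reporter expects (illustrative only, not the actual SparkContext change):
{code}
import java.util.concurrent.atomic.AtomicBoolean

class StoppableService {
  private val stopped = new AtomicBoolean(false)

  def stop(): Unit = {
    // Second and later calls are silent no-ops instead of throwing or exiting the JVM.
    if (!stopped.compareAndSet(false, true)) return
    // ... release resources exactly once ...
    println("stopped")
  }
}

val svc = new StoppableService
svc.stop()   // prints "stopped"
svc.stop()   // no-op
{code}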
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071669#comment-14071669 ] Ken Carlile commented on SPARK-2282: Well, something didn't work quite right.. our copy of 1.0.1 is the prebuilt copy for Hadoop 1/CDH3. So I did a git init in that directory, then did a git pull https://github.com/apache/spark/ +refs/pull/1503/head Well, that didn't work... I don't expect you to solve my git noob problems, so I'll work with someone here to figure it out. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo 1 /proc/sys/net/ipv4/tcp_tw_reuse echo 1 /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2646) log4j initialization not quite compatible with log4j 2.x
Sean Owen created SPARK-2646: Summary: log4j initialization not quite compatible with log4j 2.x Key: SPARK-2646 URL: https://issues.apache.org/jira/browse/SPARK-2646 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Sean Owen Priority: Minor The logging code that handles log4j initialization leads to a stack overflow error when used with log4j 2.x, which has just been released. This occurs even when a downstream project has correctly adjusted its SLF4J bindings, and that is the right thing to do for log4j 2.x, since it is effectively a separate project from 1.x. Here is the relevant bit of Logging.scala:
{code}
private def initializeLogging() {
  // If Log4j is being used, but is not initialized, load a default properties file
  val binder = StaticLoggerBinder.getSingleton
  val usingLog4j = binder.getLoggerFactoryClassStr.endsWith("Log4jLoggerFactory")
  val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements
  if (!log4jInitialized && usingLog4j) {
    val defaultLogProps = "org/apache/spark/log4j-defaults.properties"
    Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
      case Some(url) =>
        PropertyConfigurator.configure(url)
        log.info(s"Using Spark's default log4j profile: $defaultLogProps")
      case None =>
        System.err.println(s"Spark was unable to load $defaultLogProps")
    }
  }
  Logging.initialized = true

  // Force a call into slf4j to initialize it. Avoids this happening from mutliple threads
  // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html
  log
}
{code}
The first minor issue is that there is a call to a logger inside this method, which is itself initializing logging. In this situation it ends up causing the initialization to be called recursively until the stack overflows. It would be slightly tidier to log this only after Logging.initialized = true, or not at all. But it's not the root problem, or else it would not work at all now. The calls to log4j classes here always reference log4j 1.2, no matter what. For example, there is no getAllAppenders in log4j 2.x. That's fine. Really, usingLog4j means "using log4j 1.2" and log4jInitialized means "log4j 1.2 is initialized". usingLog4j should be false for log4j 2.x, because the initialization only matters for log4j 1.2. But it's true, and that's the real issue. And log4jInitialized is always false, since calls to the log4j 1.2 API are stubs and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence the loop. This is fixed, I believe, if usingLog4j can be false for log4j 2.x. The SLF4J static binding class has the same name for both versions, unfortunately, which causes the issue. However, they are in different packages. For example, if the test included "... and begins with org.slf4j", it should work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the moment, and is in package org.apache.logging.slf4j. Of course, I assume that SLF4J will eventually offer its own binding. I hope to goodness they at least name the binding class differently, or else this will again not work. But then some other check can probably be made. (Credit to Agust Egilsson for finding this; at his request I'm opening a JIRA for him. I'll propose a PR too.) -- This message was sent by Atlassian JIRA (v6.2#6252)
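A small sketch of the check the report suggests: treat "using log4j" as "using the SLF4J binding for log4j 1.2", which lives under org.slf4j, while the log4j 2.x bridge's factory lives under org.apache.logging.slf4j (the exact predicate in the eventual patch may differ):
{code}
import org.slf4j.impl.StaticLoggerBinder

val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
// org.slf4j.impl.Log4jLoggerFactory      -> log4j 1.2 binding: safe to run the default PropertyConfigurator setup
// org.apache.logging.slf4j.Log4jLoggerFactory -> log4j 2.x bridge: skip the log4j 1.2 initialization entirely
val usingLog4j12 = binderClass.startsWith("org.slf4j") && binderClass.endsWith("Log4jLoggerFactory")
println(s"log4j 1.2 binding detected: $usingLog4j12")
{code}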
[jira] [Created] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event
YanTang Zhai created SPARK-2647: --- Summary: DAGScheduler plugs others when processing one JobSubmitted event Key: SPARK-2647 URL: https://issues.apache.org/jira/browse/SPARK-2647 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai If a few of jobs are submitted, DAGScheduler plugs others when processing one JobSubmitted event. For example ont JobSubmitted event is processed as follows and costs much time spark-akka.actor.default-dispatcher-67 daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130) - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call) at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472) at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233) at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at 
scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at
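A hedged workaround sketch (not the proposed patch): since the expensive part of the trace above is computing partitions (HDFS listing / getSplits), forcing that computation on the submitting thread means the DAGScheduler reuses the cached result when it later processes the JobSubmitted event. Assumes a spark-shell-style session where sc is the SparkContext; the path is illustrative:
{code}
val rdd = sc.textFile("hdfs:///big/dir/*")
// Triggers and caches getPartitions on the caller thread, outside the DAGScheduler event loop.
rdd.partitions
// The action's JobSubmitted handling now reuses the cached partitions instead of hitting HDFS.
rdd.count()
{code}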
[jira] [Commented] (SPARK-2646) log4j initialization not quite compatible with log4j 2.x
[ https://issues.apache.org/jira/browse/SPARK-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071745#comment-14071745 ] Apache Spark commented on SPARK-2646: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1547 log4j initialization not quite compatible with log4j 2.x Key: SPARK-2646 URL: https://issues.apache.org/jira/browse/SPARK-2646 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.1 Reporter: Sean Owen Priority: Minor The logging code that handles log4j initialization leads to an stack overflow error when used with log4j 2.x, which has just been released. This occurs even a downstream project has correctly adjusted SLF4J bindings, and that is the right thing to do for log4j 2.x, since it is effectively a separate project from 1.x. Here is the relevant bit of Logging.scala: {code} private def initializeLogging() { // If Log4j is being used, but is not initialized, load a default properties file val binder = StaticLoggerBinder.getSingleton val usingLog4j = binder.getLoggerFactoryClassStr.endsWith(Log4jLoggerFactory) val log4jInitialized = LogManager.getRootLogger.getAllAppenders.hasMoreElements if (!log4jInitialized usingLog4j) { val defaultLogProps = org/apache/spark/log4j-defaults.properties Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match { case Some(url) = PropertyConfigurator.configure(url) log.info(sUsing Spark's default log4j profile: $defaultLogProps) case None = System.err.println(sSpark was unable to load $defaultLogProps) } } Logging.initialized = true // Force a call into slf4j to initialize it. Avoids this happening from mutliple threads // and triggering this: http://mailman.qos.ch/pipermail/slf4j-dev/2010-April/002956.html log } {code} The first minor issue is that there is a call to a logger inside this method, which is initializing logging. In this situation, it ends up causing the initialization to be called recursively until the stack overflow. It would be slightly tidier to log this only after Logging.initialized = true. Or not at all. But it's not the root problem, or else, it would not work at all now. The calls to log4j classes here always reference log4j 1.2 no matter what. For example, there is not getAllAppenders in log4j 2.x. That's fine. Really, usingLog4j means using log4j 1.2 and log4jInitialized means log4j 1.2 is initialized. usingLog4j should be false for log4j 2.x, because the initialization only matters for log4j 1.2. But, it's true, and that's the real issue. And log4jInitialized is always false, since calls to the log4j 1.2 API are stubs and no-ops in this setup, where the caller has swapped in log4j 2.x. Hence the loop. This is fixed, I believe, if usingLog4j can be false for log4j 2.x. The SLF4J static binding class has the same name for both versions, unfortunately, which causes the issue. However they're in different packages. For example, if the test included ... and begins with org.slf4j, it should work, as the SLF4J binding for log4j 2.x is provided by log4j 2.x at the moment, and is in package org.apache.logging.slf4j. Of course, I assume that SLF4J will eventually offer its own binding. I hope to goodness they at least name the binding class differently, or else this will again not work. But then some other check can probably be made. (Credit to Agust Egilsson for finding this; at his request I'm opening a JIRA for him. I'll propose a PR too.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071755#comment-14071755 ] YanTang Zhai commented on SPARK-2647: - I've created PR: https://github.com/apache/spark/pull/1548. Please help to review. Thanks. DAGScheduler plugs others when processing one JobSubmitted event Key: SPARK-2647 URL: https://issues.apache.org/jira/browse/SPARK-2647 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai If a few of jobs are submitted, DAGScheduler plugs others when processing one JobSubmitted event. For example ont JobSubmitted event is processed as follows and costs much time spark-akka.actor.default-dispatcher-67 daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130) - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call) at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472) at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233) at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-2647) DAGScheduler plugs others when processing one JobSubmitted event
[ https://issues.apache.org/jira/browse/SPARK-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071754#comment-14071754 ] Apache Spark commented on SPARK-2647: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/1548 DAGScheduler plugs others when processing one JobSubmitted event Key: SPARK-2647 URL: https://issues.apache.org/jira/browse/SPARK-2647 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai If a few of jobs are submitted, DAGScheduler plugs others when processing one JobSubmitted event. For example ont JobSubmitted event is processed as follows and costs much time spark-akka.actor.default-dispatcher-67 daemon prio=10 tid=0x7f75ec001000 nid=0x7dd6 in Object.wait() [0x7f76063e1000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoopcdh3.ipc.Client.call(Client.java:1130) - locked 0x000783b17330 (a org.apache.hadoopcdh3.ipc.Client$Call) at org.apache.hadoopcdh3.ipc.RPC$Invoker.invoke(RPC.java:241) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at sun.reflect.GeneratedMethodAccessor86.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83) at org.apache.hadoopcdh3.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:60) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoopcdh3.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1472) at org.apache.hadoopcdh3.hdfs.DFSClient.getBlockLocations(DFSClient.java:1498) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:208) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$1.doCall(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getFileBlockLocations(Cdh3DistributedFileSystem.java:204) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1812) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1797) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:233) at StorageEngineClient.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:141) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at 
org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:54) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:54) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071767#comment-14071767 ] Ken Carlile commented on SPARK-2282: Merging just the two files also did not work. I received a bunch of these errors during the test: {code}Exception happened during processing of request from ('127.0.0.1', 33116) Traceback (most recent call last): File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock self.process_request(request, client_address) File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 321, in process_request self.finish_request(request, client_address) File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 334, in finish_request self.RequestHandlerClass(request, client_address, self) File "/usr/local/python-2.7.6/lib/python2.7/SocketServer.py", line 649, in __init__ self.handle() File "/usr/local/spark-current/python/pyspark/accumulators.py", line 224, in handle num_updates = read_int(self.rfile) File "/usr/local/spark-current/python/pyspark/serializers.py", line 337, in read_int raise EOFError EOFError {code} And then it errored out with the usual Java error. PySpark crashes if too many tasks complete quickly -- Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.1, 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 0.9.2, 1.0.0, 1.0.1 Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a Linux server): echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse and echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle, or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
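A minimal sketch of the second workaround mentioned in the description, assuming a hypothetical helper (not Spark's actual accumulator code); setReuseAddress is the standard JDK call that sets SO_REUSEADDR:
{code}
import java.net.{InetSocketAddress, Socket}

// Hedged sketch: enable SO_REUSEADDR before connecting, as the description
// suggests. openAccumulatorSocket is an illustrative helper, not Spark's API.
def openAccumulatorSocket(host: String, port: Int): Socket = {
  val socket = new Socket()
  socket.setReuseAddress(true) // sets SO_REUSEADDR
  socket.connect(new InetSocketAddress(host, port))
  socket
}
{code}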
[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14071786#comment-14071786 ] Thomas Graves commented on SPARK-2604: -- Yes, we should be adding the overhead in at the check. Spark Application hangs on yarn in edge case scenario of executor memory requirement Key: SPARK-2604 URL: https://issues.apache.org/jira/browse/SPARK-2604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Twinkle Sachdeva In a YARN environment, let's say: MaxAM = maximum allocatable memory, ExecMem = executor's memory. If (MaxAM > ExecMem) and ((MaxAM - ExecMem) < 384m), then maximum-resource validation fails w.r.t. executor memory: the application master gets launched, but when the resource is allocated and validated again, the containers are returned and the application appears to hang. A typical use case is to ask for executor memory equal to the maximum allowed memory as per the YARN config. -- This message was sent by Atlassian JIRA (v6.2#6252)
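A hedged sketch of the kind of up-front check the comment refers to, assuming hypothetical inputs and using the 384 MB minimum overhead mentioned in the description:
{code}
// Minimal sketch, not Spark's actual implementation: reject the request up
// front if executor memory plus the YARN memory overhead cannot fit in the
// largest allocatable container.
val minOverheadMB = 384
def fitsOnYarn(executorMemoryMB: Int, maxAllocatableMB: Int): Boolean =
  executorMemoryMB + minOverheadMB <= maxAllocatableMB
{code}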
[jira] [Updated] (SPARK-2484) By default does not run hive compatibility tests
[ https://issues.apache.org/jira/browse/SPARK-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-2484: --- Assignee: Guoqiang Li By default does not run hive compatibility tests Key: SPARK-2484 URL: https://issues.apache.org/jira/browse/SPARK-2484 Project: Spark Issue Type: Improvement Reporter: Guoqiang Li Assignee: Guoqiang Li The Hive compatibility tests take a long time; in some cases we don't need to run them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2644) Hive should not be enabled by default in the build.
[ https://issues.apache.org/jira/browse/SPARK-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma closed SPARK-2644. -- Resolution: Duplicate Hive should not be enabled by default in the build. --- Key: SPARK-2644 URL: https://issues.apache.org/jira/browse/SPARK-2644 Project: Spark Issue Type: Bug Reporter: Prashant Sharma Assignee: Prashant Sharma Enabling Hive by default slows the build down, which is not desirable. We try to build and run the Hive tests optionally in our Jenkins, i.e. only if there is a change in the Hive codebase. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2484) Build should not run hive compatibility tests by default.
[ https://issues.apache.org/jira/browse/SPARK-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-2484: --- Summary: Build should not run hive compatibility tests by default. (was: By default does not run hive compatibility tests) Build should not run hive compatibility tests by default. - Key: SPARK-2484 URL: https://issues.apache.org/jira/browse/SPARK-2484 Project: Spark Issue Type: Improvement Reporter: Guoqiang Li Assignee: Guoqiang Li The Hive compatibility tests take a long time; in some cases we don't need to run them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2648) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time
Lianhui Wang created SPARK-2648: --- Summary: through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to the same executor at the same time: when a map output has many partitions, too many reducers may connect to that executor at once, which can cause network connection timeouts. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
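A minimal sketch of the idea behind the linked PR, with hypothetical type aliases standing in for Spark's (BlockManagerId, Seq[(BlockId, size)]) pairs:
{code}
import scala.util.Random

// Hedged sketch: randomize the order in which this reducer walks the map
// outputs, so reducers do not all contact the same executor first.
type Address = String
type BlockInfo = (String, Long) // (blockId, size in bytes)

def shuffleFetchOrder(blocksByAddress: Seq[(Address, Seq[BlockInfo])]): Seq[(Address, Seq[BlockInfo])] =
  Random.shuffle(blocksByAddress)
{code}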
[jira] [Created] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
npanj created SPARK-2649: Summary: EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 the httpd daemon doesn't start (so Ganglia is not accessible) on HVM machines like r3.4xlarge deployed by the spark-ec2 script. Here is an example error (it may be an issue with the default AMI spark.ami.hvm.v14 (ami-35b1885c)). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2609) Log thread ID when spilling ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2609. -- Resolution: Fixed Log thread ID when spilling ExternalAppendOnlyMap - Key: SPARK-2609 URL: https://issues.apache.org/jira/browse/SPARK-2609 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 It may be useful to track down whether one thread is constantly spilling, versus multiple threads are spilling relatively infrequently. -- This message was sent by Atlassian JIRA (v6.2#6252)
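A hedged illustration of the kind of log line the issue asks for; the message wording is invented for the example and is not Spark's actual log format:
{code}
// Include the current thread ID so logs can distinguish one thread spilling
// constantly from many threads spilling occasionally.
def logSpill(spillNumber: Int, mapSizeMB: Long): Unit =
  println(s"Thread ${Thread.currentThread().getId} spilling in-memory map of " +
    s"$mapSizeMB MB to disk (spill number $spillNumber)")
{code}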
[jira] [Updated] (SPARK-2609) Log thread ID when spilling ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2609: - Assignee: Andrew Or Log thread ID when spilling ExternalAppendOnlyMap - Key: SPARK-2609 URL: https://issues.apache.org/jira/browse/SPARK-2609 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Fix For: 1.1.0 It may be useful to track down whether one thread is constantly spilling, versus multiple threads are spilling relatively infrequently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2632) Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff.
[ https://issues.apache.org/jira/browse/SPARK-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072044#comment-14072044 ] Yin Huai commented on SPARK-2632: - Seems the exception triggered by importing a method of a non-serializable class is not a new bug. So, I guess we do not want to block our release on it. Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff. Key: SPARK-2632 URL: https://issues.apache.org/jira/browse/SPARK-2632 Project: Spark Issue Type: Bug Affects Versions: 1.0.0, 1.0.1 Reporter: Yin Huai Master is affected by this bug. To reproduce the exception, you can start a local cluster (sbin/start-all.sh) then open a spark shell. {code} class X() { println("What!"); def y = 3 } val x = new X import x.y case class Person(name: String, age: Int) sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).collect {code} Then you will find the exception. I am attaching the stack trace below... {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2632) Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff.
[ https://issues.apache.org/jira/browse/SPARK-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2632: Priority: Major (was: Blocker) Importing a method of class in Spark REPL causes the REPL to pulls in unnecessary stuff. Key: SPARK-2632 URL: https://issues.apache.org/jira/browse/SPARK-2632 Project: Spark Issue Type: Bug Affects Versions: 1.0.0, 1.0.1 Reporter: Yin Huai Master is affected by this bug. To reproduce the exception, you can start a local cluster (sbin/start-all.sh) then open a spark shell. {code} class X() { println("What!"); def y = 3 } val x = new X import x.y case class Person(name: String, age: Int) sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).collect {code} Then you will find the exception. I am attaching the stack trace below... {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: $iwC$$iwC$$iwC$$iwC$X at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1029) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1027) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1027) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:632) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:632) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1230) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2640. -- Resolution: Fixed Fix Version/s: 1.1.0 In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide Assignee: woshilaiceshide Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2640: - Priority: Minor (was: Major) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide Assignee: woshilaiceshide Priority: Minor Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2640) In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks.
[ https://issues.apache.org/jira/browse/SPARK-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2640: - Assignee: woshilaiceshide In local[N], free cores of the only executor should be touched by spark.task.cpus for every finish/start-up of tasks. - Key: SPARK-2640 URL: https://issues.apache.org/jira/browse/SPARK-2640 Project: Spark Issue Type: Bug Components: Spark Core Reporter: woshilaiceshide Assignee: woshilaiceshide Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2277: - Fix Version/s: 1.1.0 Make TaskScheduler track whether there's host on a rack --- Key: SPARK-2277 URL: https://issues.apache.org/jira/browse/SPARK-2277 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Fix For: 1.1.0 When TaskSetManager adds a pending task, it checks whether the task's preferred location is available. For RACK_LOCAL tasks, we consider the preferred rack available if such a rack is defined for the preferred host. This is incorrect, as there may be no alive hosts on that rack at all. Therefore, TaskScheduler should track the hosts on each rack and provide an API for TaskSetManager to check whether there's a host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
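A minimal sketch of the tracking structure and query described above; the names are illustrative and not the actual TaskScheduler API:
{code}
import scala.collection.mutable

// rack -> set of alive hosts on that rack
val hostsByRack = new mutable.HashMap[String, mutable.HashSet[String]]

def registerHost(rack: String, host: String): Unit =
  hostsByRack.getOrElseUpdate(rack, new mutable.HashSet[String]) += host

def removeHost(rack: String, host: String): Unit =
  hostsByRack.get(rack).foreach(_ -= host)

// What TaskSetManager would call before treating a pending task as rack-local.
def hasHostAliveOnRack(rack: String): Boolean =
  hostsByRack.get(rack).exists(_.nonEmpty)
{code}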
[jira] [Updated] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2277: - Assignee: Rui Li Make TaskScheduler track whether there's host on a rack --- Key: SPARK-2277 URL: https://issues.apache.org/jira/browse/SPARK-2277 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 When TaskSetManager adds a pending task, it checks whether the tasks's preferred location is available. Regarding RACK_LOCAL task, we consider the preferred rack available if such a rack is defined for the preferred host. This is incorrect as there may be no alive hosts on that rack at all. Therefore, TaskScheduler should track the hosts on each rack, and provides an API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2277. -- Resolution: Fixed Make TaskScheduler track whether there's host on a rack --- Key: SPARK-2277 URL: https://issues.apache.org/jira/browse/SPARK-2277 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rui Li Assignee: Rui Li Fix For: 1.1.0 When TaskSetManager adds a pending task, it checks whether the tasks's preferred location is available. Regarding RACK_LOCAL task, we consider the preferred rack available if such a rack is defined for the preferred host. This is incorrect as there may be no alive hosts on that rack at all. Therefore, TaskScheduler should track the hosts on each rack, and provides an API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2650) Wrong initial sizes for in-memory column buffers
Michael Armbrust created SPARK-2650: --- Summary: Wrong initial sizes for in-memory column buffers Key: SPARK-2650 URL: https://issues.apache.org/jira/browse/SPARK-2650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical The logic for setting up the initial column buffers is different for Spark SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger than available memory (where Shark was okay). Two suspicious things: the initialSize is always set to 0, so we always go with the default. The default looks like it was copied from code like 10 * 1024 * 1024... but in Spark SQL it's 10 * 102 * 1024. -- This message was sent by Atlassian JIRA (v6.2#6252)
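For reference, the two constants mentioned above differ by roughly an order of magnitude (assuming the first was the intended default):
{code}
val sharkStyleDefault = 10 * 1024 * 1024 // 10485760 bytes, i.e. 10 MB
val sparkSqlDefault   = 10 * 102 * 1024  // 1044480 bytes, roughly 1 MB
{code}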
[jira] [Resolved] (SPARK-2561) Repartitioning a SchemaRDD breaks resolution
[ https://issues.apache.org/jira/browse/SPARK-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2561. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Repartitioning a SchemaRDD breaks resolution Key: SPARK-2561 URL: https://issues.apache.org/jira/browse/SPARK-2561 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Fix For: 1.1.0, 1.0.2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1630: Assignee: Davies Liu PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Assignee: Davies Liu Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently NPE. It would be better do log a DEBUG message and skip the write of NULL elements. Here are the 2 stack traces : 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
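A hedged sketch of the behavior suggested in the description (skip nulls instead of throwing); the helper below is illustrative, not the actual PythonRDD code:
{code}
import java.io.DataOutputStream

// Write one element to the Python worker stream, skipping nulls; in the real
// code the skip would be logged at DEBUG level rather than printed.
def writeElement(out: DataOutputStream, bytes: Array[Byte]): Unit = {
  if (bytes == null) {
    println("Skipping null element instead of throwing NullPointerException")
  } else {
    out.writeInt(bytes.length)
    out.write(bytes)
  }
}
{code}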
[jira] [Updated] (SPARK-2650) Wrong initial sizes for in-memory column buffers
[ https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2650: Target Version/s: 1.1.0 Wrong initial sizes for in-memory column buffers Key: SPARK-2650 URL: https://issues.apache.org/jira/browse/SPARK-2650 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical The logic for setting up the initial column buffers is different for Spark SQL compared to Shark, and I'm seeing OOMs when caching tables that are larger than available memory (where Shark was okay). Two suspicious things: the initialSize is always set to 0, so we always go with the default. The default looks like it was copied from code like 10 * 1024 * 1024... but in Spark SQL it's 10 * 102 * 1024. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-2569: --- Assignee: Michael Armbrust Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust start spark-shell, init (like create hiveContext, import ._ ect, make sure the jar including the UDFs is in classpath) hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. then i tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with NullPointException we had discussion about it in the mail list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2569: Priority: Critical (was: Major) Target Version/s: 1.1.0 Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust Priority: Critical start spark-shell, init (like create hiveContext, import ._ ect, make sure the jar including the UDFs is in classpath) hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. then i tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with NullPointException we had discussion about it in the mail list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072076#comment-14072076 ] Yin Huai commented on SPARK-2576: - [~prashant_] I have also created a [REPL test|https://github.com/yhuai/spark/commit/d77c515f741c65d807cdc147fb32d92e2039bc97] for importing SQLContext.createSchemaRDD. SQLContext is a serializable class. Seems the issue is slightly different from SPARK-2632. The exception is caused by the following code in SparkImports.scala {code} case x: ClassHandler => // I am trying to guess if the import is a defined class // This is an ugly hack, I am not 100% sure of the consequences. // Here we, let everything but defined classes use the import with val. // The reason for this is, otherwise the remote executor tries to pull the // classes involved and may fail. for (imv <- x.definedNames) { val objName = req.lineRep.readPath code.append("import " + objName + ".INSTANCE" + req.accessPath + ".`" + imv + "`\n") } {code} Can you take a look at it as well? Thank you :) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file -- Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both with the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and by packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Vanderveken Assignee: Yin Huai Priority: Blocker Execution of a SQL query against HDFS systematically throws a class not found exception on slave nodes when executing.
(this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (run from spark-shell): {code} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class Car(timestamp: Long, objectid: String, isGreen: Boolean) // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0") val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean)) cars.registerAsTable("mcars") val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true") allgreens.collect.take(10).foreach(println) {code} Stack trace on the slave nodes: {code} I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop) 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started 14/07/16 13:01:17 INFO Remoting: Starting remoting 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@vm23:38230] 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28:41632/user/MapOutputTracker 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28:41632/user/BlockManagerMaster 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB. 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501) 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server 14/07/16 13:01:18 INFO Executor: Using REPL class URI:
[jira] [Created] (SPARK-2651) Add maven scalastyle plugin
Rahul Singhal created SPARK-2651: Summary: Add maven scalastyle plugin Key: SPARK-2651 URL: https://issues.apache.org/jira/browse/SPARK-2651 Project: Spark Issue Type: Improvement Components: Build Reporter: Rahul Singhal Priority: Minor SBT has a scalastyle plugin which can be executed to check for coding conventions. It would be nice to add the same for maven builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed
[ https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072140#comment-14072140 ] Xiangrui Meng commented on SPARK-2575: -- loadLibSVMFile converts labels to binary by default. It maps any label value greater than 0.5 to 1.0, or 0.0 otherwise. This is because both 1/0 and +1/-1 are widely used in popular LIBSVM datasets. In MLlib, to use SVMWithSGD you need to use binary labels, 1.0 or 0.0. SVMWithSGD throwing Input Validation failed Key: SPARK-2575 URL: https://issues.apache.org/jira/browse/SPARK-2575 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.1 Reporter: navanee SVMWithSGD throwing Input Validation failed while using Sparse Array as Input. Though SVMWihtSGD accepts LibSVM format. Exception trace : org.apache.spark.SparkException: Input validation failed. at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145) at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154) at org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188) at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala) at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172) at com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28) at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184) at org.junit.runners.ParentRunner.run(ParentRunner.java:236) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) -- This message was sent by Atlassian JIRA (v6.2#6252)
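Based on the comment above, a hedged sketch of normalizing labels to 0.0/1.0 before calling SVMWithSGD.train; the helper name is illustrative:
{code}
import org.apache.spark.mllib.regression.LabeledPoint

// Map any label greater than 0.5 to 1.0 and everything else to 0.0, mirroring
// what loadLibSVMFile does by default according to the comment.
def toBinaryLabel(p: LabeledPoint): LabeledPoint =
  LabeledPoint(if (p.label > 0.5) 1.0 else 0.0, p.features)
{code}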
[jira] [Commented] (SPARK-2651) Add maven scalastyle plugin
[ https://issues.apache.org/jira/browse/SPARK-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072156#comment-14072156 ] Rahul Singhal commented on SPARK-2651: -- PR: https://github.com/apache/spark/pull/1550 Add maven scalastyle plugin --- Key: SPARK-2651 URL: https://issues.apache.org/jira/browse/SPARK-2651 Project: Spark Issue Type: Improvement Components: Build Reporter: Rahul Singhal Priority: Minor SBT has a scalastyle plugin which can be executed to check for coding conventions. It would be nice to add the same for maven builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2651) Add maven scalastyle plugin
[ https://issues.apache.org/jira/browse/SPARK-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072164#comment-14072164 ] Apache Spark commented on SPARK-2651: - User 'rahulsinghaliitd' has created a pull request for this issue: https://github.com/apache/spark/pull/1550 Add maven scalastyle plugin --- Key: SPARK-2651 URL: https://issues.apache.org/jira/browse/SPARK-2651 Project: Spark Issue Type: Improvement Components: Build Reporter: Rahul Singhal Priority: Minor SBT has a scalastyle plugin which can be executed to check for coding conventions. It would be nice to add the same for maven builds. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2567) Resubmitted stage sometimes remains as active stage in the web UI
[ https://issues.apache.org/jira/browse/SPARK-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072161#comment-14072161 ] Masayoshi TSUZUKI commented on SPARK-2567: -- I just noticed that the cause of [SPARK-1726] seems to be the same. Resubmitted stage sometimes remains as active stage in the web UI - Key: SPARK-2567 URL: https://issues.apache.org/jira/browse/SPARK-2567 Project: Spark Issue Type: Bug Reporter: Masayoshi TSUZUKI Attachments: SPARK-2567.png When a stage is resubmitted (because an executor was lost, for example), sometimes more than one resubmitted task appears in the web UI, and one stage remains active even after the job has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2642) Add jobId in web UI
[ https://issues.apache.org/jira/browse/SPARK-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072171#comment-14072171 ] Masayoshi TSUZUKI commented on SPARK-2642: -- Is this the same as [SPARK-1362] ? Add jobId in web UI --- Key: SPARK-2642 URL: https://issues.apache.org/jira/browse/SPARK-2642 Project: Spark Issue Type: Improvement Components: Web UI Reporter: YanTang Zhai Priority: Minor At present the web UI has only stage IDs, so multiple stages cannot be shown explicitly as belonging to the same job. The job ID will be added to the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1362) Web UI should provide page of showing statistics and stage list for a given job
[ https://issues.apache.org/jira/browse/SPARK-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI updated SPARK-1362: - Component/s: Web UI Web UI should provide page of showing statistics and stage list for a given job --- Key: SPARK-1362 URL: https://issues.apache.org/jira/browse/SPARK-1362 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 0.9.0 Reporter: wangfei Spark currently provides a page per stage, but Spark's conceptual hierarchy is app > job > stage > task. A job page would be better for monitoring the jobs within one app; with only the stage pages we sometimes cannot distinguish jobs easily. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
[ https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] npanj updated SPARK-2649: - Description: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). Here is error message: -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- was: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge --- Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). Here is error message: -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
[ https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] npanj updated SPARK-2649: - Description: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- was: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default amit spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge --- Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2649) EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge
[ https://issues.apache.org/jira/browse/SPARK-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] npanj updated SPARK-2649: - Description: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- was: On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it may be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- EC2: Ganglia-httpd broken on hvm based machines like r3.4xlarge --- Key: SPARK-2649 URL: https://issues.apache.org/jira/browse/SPARK-2649 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: npanj Priority: Minor On EC2 httpd daemon doesn't start (so ganglia is not accessble) on Hvm machines like r3.4xlarge( deployed by spark-ec2 script,), Here is an example error (it seems to be an issue with default ami spark.ami.hvm.v14 (ami-35b1885c) ). -- Starting httpd: httpd: Syntax error on line 153 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_authn_alias.so into server: /etc/httpd/modules/mod_authn_alias.so: cannot open shared object file: No such file or directory -- -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072235#comment-14072235 ] Apache Spark commented on SPARK-1630: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1551 PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Assignee: Davies Liu Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently NPE. It would be better do log a DEBUG message and skip the write of NULL elements. Here are the 2 stack traces : 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072259#comment-14072259 ] Marcelo Vanzin commented on SPARK-2420: --- Hi Sean, I agree in part about the brokenness of such apps. In part, because I think they're broken because Spark makes it easy for them to be. Also, I think you slightly misread what I wrote, so let me explain. For applications that, let's say, depend on Guava 17, nothing will change with your patch. Such applications already need an explicit dependency on that particular version to build, and need runtime options like {{spark.files.userClassPathFirst}} to be set for things to work. But for applications that depend on the same version of Guava that Spark bundles, none of that is true. They get the dependency transitively, and the class files are available at runtime in the Spark jar, and it's the right version (since Spark needs to make sure they come before Hadoop's version in the classpath, otherwise Spark itself might not work). So if you downgrade the Guava version bundled with Spark, you might break those applications. So yes, technically, they're already broken because Guava is not Spark, but it's very easy to make that mistake. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072258#comment-14072258 ] Apache Spark commented on SPARK-2569: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1552 Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust Priority: Critical start spark-shell, init (like create hiveContext, import ._ ect, make sure the jar including the UDFs is in classpath) hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. then i tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with NullPointException we had discussion about it in the mail list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2652) Tuning default configurations for PySpark
Davies Liu created SPARK-2652: - Summary: Tuning default configurations for PySpark Key: SPARK-2652 URL: https://issues.apache.org/jira/browse/SPARK-2652 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Davies Liu Fix For: 1.1.0 Some default configuration values do not make sense for PySpark; change them to reasonable ones, such as spark.serializer and spark.kryo.referenceTracking -- This message was sent by Atlassian JIRA (v6.2#6252)
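For illustration only, a sketch of overriding the two settings named above; the concrete values are assumptions, not the defaults this issue decides on.
{code}
// Sketch (Scala shown for consistency with this digest's other examples): the
// same keys can also be set in spark-defaults.conf or from PySpark's SparkConf.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // illustrative choice
  .set("spark.kryo.referenceTracking", "false")                          // illustrative choice
{code}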
[jira] [Created] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode
Davies Liu created SPARK-2653: - Summary: Heap size should be the sum of driver.memory and executor.memory in local mode Key: SPARK-2653 URL: https://issues.apache.org/jira/browse/SPARK-2653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu In local mode, the driver and executor run in the same JVM, so the heap size of the JVM should be the sum of spark.driver.memory and spark.executor.memory. -- This message was sent by Atlassian JIRA (v6.2#6252)
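A one-line sketch of the proposed sizing rule; the helper below is hypothetical and only restates the sentence above.
{code}
// In local mode the driver and executor share one JVM, so size its heap as the sum.
def localModeHeapMb(driverMemoryMb: Int, executorMemoryMb: Int): Int =
  driverMemoryMb + executorMemoryMb

// e.g. spark.driver.memory=512m and spark.executor.memory=1024m -> a 1536m heap
val heapMb = localModeHeapMb(512, 1024)
{code}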
[jira] [Created] (SPARK-2654) Leveled logging in PySpark
Davies Liu created SPARK-2654: - Summary: Leveled logging in PySpark Key: SPARK-2654 URL: https://issues.apache.org/jira/browse/SPARK-2654 Project: Spark Issue Type: Improvement Reporter: Davies Liu Add more leveled logging in PySpark; the logging level should be easily controlled by configuration and command-line arguments. -- This message was sent by Atlassian JIRA (v6.2#6252)
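A sketch of the idea in Scala (the issue targets PySpark, but this digest's examples are in Scala); the configuration key below is hypothetical.
{code}
// Read a desired level from configuration and apply it to the root logger.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf

val conf = new SparkConf()
val requested = conf.get("spark.logging.level", "INFO")   // hypothetical key, default INFO
Logger.getRootLogger.setLevel(Level.toLevel(requested))
{code}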
[jira] [Created] (SPARK-2655) Change the default logging level to WARN
Davies Liu created SPARK-2655: - Summary: Change the default logging level to WARN Key: SPARK-2655 URL: https://issues.apache.org/jira/browse/SPARK-2655 Project: Spark Issue Type: Improvement Reporter: Davies Liu The current default logging level, INFO, is pretty noisy; reducing this unnecessary logging will provide a better experience for users. Spark is much more stable and mature than before, so users will not need that much logging in normal cases. But some high-level information is still helpful, such as messages about job and task progress; we could change this important logging to the WARN level as a hack, otherwise we would need to change all other logging to the DEBUG level. PS: it would be better to have one-line progress logging in the terminal (and in the title). -- This message was sent by Atlassian JIRA (v6.2#6252)
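For illustration, a hedged sketch of what the proposal amounts to, assuming Spark's log4j 1.x setup; the scheduler logger singled out below is an assumption, not a decision from this issue.
{code}
// Quieten the root logger while keeping job/task progress messages visible.
import org.apache.log4j.{Level, Logger}

Logger.getRootLogger.setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.scheduler").setLevel(Level.INFO) // illustrative exception
{code}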
[jira] [Updated] (SPARK-2648) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time
[ https://issues.apache.org/jira/browse/SPARK-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2648: --- Priority: Critical (was: Major) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time - Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Priority: Critical Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to one executor at the same time. When a map has many partitions, so many reducers connect to that map at once that the network connections may time out. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2648) through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time
[ https://issues.apache.org/jira/browse/SPARK-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2648: --- Target Version/s: 1.1.0 Assignee: Lianhui Wang through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time - Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Assignee: Lianhui Wang Priority: Critical Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to one executor at the same time. When a map has many partitions, so many reducers connect to that map at once that the network connections may time out. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2588) Add some more DSLs.
[ https://issues.apache.org/jira/browse/SPARK-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2588. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Takuya Ueshin Add some more DSLs. --- Key: SPARK-2588 URL: https://issues.apache.org/jira/browse/SPARK-2588 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1726) Tasks that fail to serialize remain in active stages forever.
[ https://issues.apache.org/jira/browse/SPARK-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1726. - Resolution: Fixed Fix Version/s: 1.1.0 [~kayousterhout] reports this is fixed in master. Tasks that fail to serialize remain in active stages forever. - Key: SPARK-1726 URL: https://issues.apache.org/jira/browse/SPARK-1726 Project: Spark Issue Type: Bug Components: Web UI Reporter: Michael Armbrust Assignee: Andrew Or Fix For: 1.1.0 Attachments: ZombieTask.tiff In the spark shell. {code} scala class Adder(x: Int) { def apply(a: Int) = a + x } defined class Adder scala val add = new Adder(10) scala sc.parallelize(1 to 10).map(add(_)).collect() {code} You get: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$Adder at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} However, the web ui is messed up. See attached screen shot. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-1726) Tasks that fail to serialize remain in active stages forever.
[ https://issues.apache.org/jira/browse/SPARK-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout reopened SPARK-1726: --- Tasks that fail to serialize remain in active stages forever. - Key: SPARK-1726 URL: https://issues.apache.org/jira/browse/SPARK-1726 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.1 Reporter: Michael Armbrust Assignee: Andrew Or Attachments: ZombieTask.tiff In the spark shell. {code} scala class Adder(x: Int) { def apply(a: Int) = a + x } defined class Adder scala val add = new Adder(10) scala sc.parallelize(1 to 10).map(add(_)).collect() {code} You get: {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$Adder at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} However, the web ui is messed up. See attached screen shot. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2226. - Resolution: Fixed Fix Version/s: 1.1.0 HAVING should be able to contain aggregate expressions that don't appear in the aggregation list. -- Key: SPARK-2226 URL: https://issues.apache.org/jira/browse/SPARK-2226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: William Benton Fix For: 1.1.0 https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q This test file contains the following query: {code} SELECT key FROM src GROUP BY key HAVING max(value) val_255; {code} Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2569) Customized UDFs in hive not running with Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2569. - Resolution: Fixed Fix Version/s: 1.1.0 Customized UDFs in hive not running with Spark SQL -- Key: SPARK-2569 URL: https://issues.apache.org/jira/browse/SPARK-2569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux or mac, hive 0.9.0 and hive 0.13.0 with hadoop 1.0.4, scala 2.10.3, spark 1.0.0 Reporter: jacky hung Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0 Start spark-shell and initialize (create the hiveContext, import ._, etc., and make sure the jar including the UDFs is on the classpath). hql(CREATE TEMPORARY FUNCTION t_ts AS 'udf.Timestamp'), which is successful. Then I tried hql(select t_ts(time) from data_common where limit 1).collect().foreach(println), which failed with a NullPointerException. We had a discussion about it on the mailing list. http://apache-spark-user-list.1001560.n3.nabble.com/run-sparksql-hiveudf-error-throw-NPE-td.html#a9006 java.lang.NullPointerException org.apache.spark.sql.hive.HiveFunctionFactory$class.getFunctionClass(hiveUdfs.scala:117) org.apache.spark.sql.hive.HiveUdf.getFunctionClass(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:119) org.apache.spark.sql.hive.HiveUdf.createFunction(hiveUdfs.scala:157) org.apache.spark.sql.hive.HiveUdf.function$lzycompute(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveUdf.function(hiveUdfs.scala:170) org.apache.spark.sql.hive.HiveSimpleUdf.method$lzycompute(hiveUdfs.scala:181) org.apache.spark.sql.hive.HiveSimpleUdf.method(hiveUdfs.scala:180) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers$lzycompute(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.wrappers(hiveUdfs.scala:186) org.apache.spark.sql.hive.HiveSimpleUdf.eval(hiveUdfs.scala:220) org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:64) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:160) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:153) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:580) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261) org.apache.spark.rdd.RDD.iterator(RDD.scala:228) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2102) Caching with GENERIC column type causes query execution to slow down significantly
[ https://issues.apache.org/jira/browse/SPARK-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2102. - Resolution: Fixed Fix Version/s: 1.1.0 Caching with GENERIC column type causes query execution to slow down significantly -- Key: SPARK-2102 URL: https://issues.apache.org/jira/browse/SPARK-2102 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Fix For: 1.1.0 It is likely that we are doing something wrong with Kryo. [~adav] has a twitter dataset that reproduces the issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2656) Python version without support for exact sample size
Doris Xin created SPARK-2656: Summary: Python version without support for exact sample size Key: SPARK-2656 URL: https://issues.apache.org/jira/browse/SPARK-2656 Project: Spark Issue Type: Sub-task Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2633) support register spark listener to listener bus with Java API
[ https://issues.apache.org/jira/browse/SPARK-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072579#comment-14072579 ] Marcelo Vanzin commented on SPARK-2633: --- So, being able to register listeners is probably not that complicated, but things may get tricky when actually looking at the events. e.g., the {{StageInfo}} objects exposed through {{SparkListenerStageSubmitted}} contain Scala APIs that may not be Java-friendly. I guess my question is: is the goal to have full API equivalence in Java-land? If so, it might be easier to clean up the types so that they're all Java-friendly. Otherwise, the Java listener API will always be a second-class citizen. support register spark listener to listener bus with Java API - Key: SPARK-2633 URL: https://issues.apache.org/jira/browse/SPARK-2633 Project: Spark Issue Type: New Feature Components: Java API Reporter: Chengxiang Li Currently user can only register spark listener with Scala API, we should add this feature to Java API as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
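For context on the comment above, a sketch of what the existing Scala-side listener looks like; a Java API would need Java-friendly equivalents of these event and {{StageInfo}} types. The listener class below is illustrative, not part of the issue.
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// A minimal Scala listener; the StageInfo it receives is the Scala type whose
// Java-friendliness the comment above questions.
class StageLoggingListener extends SparkListener {
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    println(s"Stage submitted: ${stageSubmitted.stageInfo.name}")
  }
}

// Registration today is Scala-only, e.g.: sc.addSparkListener(new StageLoggingListener())
{code}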
[jira] [Commented] (SPARK-2656) Python version without support for exact sample size
[ https://issues.apache.org/jira/browse/SPARK-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072585#comment-14072585 ] Apache Spark commented on SPARK-2656: - User 'dorx' has created a pull request for this issue: https://github.com/apache/spark/pull/1554 Python version without support for exact sample size Key: SPARK-2656 URL: https://issues.apache.org/jira/browse/SPARK-2656 Project: Spark Issue Type: Sub-task Reporter: Doris Xin Assignee: Doris Xin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2657) Use more compact data structures than ArrayBuffer in groupBy and cogroup
Matei Zaharia created SPARK-2657: Summary: Use more compact data structures than ArrayBuffer in groupBy and cogroup Key: SPARK-2657 URL: https://issues.apache.org/jira/browse/SPARK-2657 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
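The issue carries no description, but the idea in the title can be sketched briefly; this is only an illustration of "more compact than ArrayBuffer", not the data structure from the linked pull request.
{code}
// Sketch: most groups are small, so keep the first element inline and only pay
// for a collection when a key actually accumulates more values.
class SmallGroup[T](first: T) {
  private var overflow: List[T] = Nil            // a real version would use a growable array
  def +=(elem: T): this.type = { overflow = elem :: overflow; this }
  def iterator: Iterator[T] = Iterator(first) ++ overflow.reverseIterator
}
{code}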
[jira] [Commented] (SPARK-2657) Use more compact data structures than ArrayBuffer in groupBy and cogroup
[ https://issues.apache.org/jira/browse/SPARK-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072608#comment-14072608 ] Apache Spark commented on SPARK-2657: - User 'mateiz' has created a pull request for this issue: https://github.com/apache/spark/pull/1555 Use more compact data structures than ArrayBuffer in groupBy and cogroup Key: SPARK-2657 URL: https://issues.apache.org/jira/browse/SPARK-2657 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
[ https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-2574: Assignee: Matei Zaharia Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner -- Key: SPARK-2574 URL: https://issues.apache.org/jira/browse/SPARK-2574 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
[ https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072605#comment-14072605 ] Matei Zaharia commented on SPARK-2574: -- I implemented this as part of https://github.com/apache/spark/pull/1555 Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner -- Key: SPARK-2574 URL: https://issues.apache.org/jira/browse/SPARK-2574 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2549) Functions defined inside of other functions trigger failures
[ https://issues.apache.org/jira/browse/SPARK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2549. Resolution: Fixed Issue resolved by pull request 1510 [https://github.com/apache/spark/pull/1510] Functions defined inside of other functions trigger failures Key: SPARK-2549 URL: https://issues.apache.org/jira/browse/SPARK-2549 Project: Spark Issue Type: Sub-task Components: Build Reporter: Patrick Wendell Assignee: Prashant Sharma Fix For: 1.1.0 If we have a function declaration inside of another function, it still triggers mima failures. We should look at how that is implemented in byte code and just always exclude functions like that. {code} def a() = { /* Changing b() should not trigger failures, but it does. */ def b() = {} } {code} I dug into the byte code for inner functions a bit more. I noticed that they tend to use `$$` before the function name. There is more information on that string sequence here: https://github.com/scala/scala/blob/2.10.x/src/reflect/scala/reflect/internal/StdNames.scala#L286 I did a cursory look and it appears that symbol is mostly (exclusively?) used for anonymous or inner functions: {code} # in RDD package classes $ ls *.class | xargs -I {} javap {} |grep \\$\\$ public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$1(scala.reflect.ClassTag, byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final java.lang.Object org$apache$spark$rdd$PairRDDFunctions$$createZero$2(byte[], scala.runtime.ObjectRef, scala.runtime.VolatileByteRef); public final scala.collection.Iterator org$apache$spark$rdd$PairRDDFunctions$$reducePartition$1(scala.collection.Iterator, scala.Function2); public final java.util.HashMap org$apache$spark$rdd$PairRDDFunctions$$mergeMaps$1(java.util.HashMap, java.util.HashMap, scala.Function2); ... public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1 extends scala.runtime.AbstractFunction0$mcJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$1(org.apache.spark.rdd.AsyncRDDActionsT); public final class org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2 extends scala.runtime.AbstractFunction2$mcVIJ$sp implements scala.Serializable { public org.apache.spark.rdd.AsyncRDDActions$$anonfun$countAsync$2(org.apache.spark.rdd.AsyncRDDActionsT); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
[ https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2574: - Priority: Trivial (was: Major) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner -- Key: SPARK-2574 URL: https://issues.apache.org/jira/browse/SPARK-2574 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Matei Zaharia Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2658) HiveQL: 1 = true should evaluate to true
Michael Armbrust created SPARK-2658: --- Summary: HiveQL: 1 = true should evaluate to true Key: SPARK-2658 URL: https://issues.apache.org/jira/browse/SPARK-2658 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2658) HiveQL: 1 = true should evaluate to true
[ https://issues.apache.org/jira/browse/SPARK-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072629#comment-14072629 ] Apache Spark commented on SPARK-2658: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1556 HiveQL: 1 = true should evaluate to true Key: SPARK-2658 URL: https://issues.apache.org/jira/browse/SPARK-2658 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2659) HiveQL: Division operator should always perform fractional division
Michael Armbrust created SPARK-2659: --- Summary: HiveQL: Division operator should always perform fractional division Key: SPARK-2659 URL: https://issues.apache.org/jira/browse/SPARK-2659 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
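For illustration of the intended semantics only (no example appears in the issue itself), assuming a HiveContext's hql method and Hive's src test table:
{code}
// Under HiveQL semantics, '/' always performs fractional division,
// so integer operands should not silently truncate.
hql("SELECT 3 / 2 FROM src LIMIT 1").collect().foreach(println)   // expected: 1.5, not 1
{code}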
[jira] [Commented] (SPARK-2659) HiveQL: Division operator should always perform fractional division
[ https://issues.apache.org/jira/browse/SPARK-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072655#comment-14072655 ] Apache Spark commented on SPARK-2659: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/1557 HiveQL: Division operator should always perform fractional division --- Key: SPARK-2659 URL: https://issues.apache.org/jira/browse/SPARK-2659 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2316: --- Target Version/s: 1.1.0 (was: 1.0.2) StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or In the case where jobs are frequently causing dropped blocks, the storage status listener can become a bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we perform operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2316: --- Priority: Critical (was: Major) StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical In the case where jobs are frequently causing dropped blocks, the storage status listener can become a bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we perform operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
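A sketch of the "few indices" idea from the description above, purely for illustration; the actual fix may be structured differently.
{code}
import scala.collection.mutable
import org.apache.spark.storage.{BlockId, BlockStatus, RDDBlockId}

// Index blocks by RDD id so an update touches only one map entry instead of
// re-grouping every known block on each task end.
val blocksByRdd = mutable.HashMap.empty[Int, mutable.HashMap[BlockId, BlockStatus]]

def updateBlock(blockId: BlockId, status: BlockStatus): Unit = blockId match {
  case RDDBlockId(rddId, _) =>
    blocksByRdd.getOrElseUpdate(rddId, mutable.HashMap.empty).update(blockId, status)
  case _ => // non-RDD blocks are out of scope for this sketch
}
{code}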
[jira] [Commented] (SPARK-2458) Make failed application log visible on History Server
[ https://issues.apache.org/jira/browse/SPARK-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072665#comment-14072665 ] Apache Spark commented on SPARK-2458: - User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/1558 Make failed application log visible on History Server - Key: SPARK-2458 URL: https://issues.apache.org/jira/browse/SPARK-2458 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Masayoshi TSUZUKI The history server is very helpful for debugging application correctness and performance after the application has finished. However, when the application failed, the link is not listed on the history server UI and the history can't be viewed. It would be very useful if we could check the history of a failed application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2660) Enable pretty-printing SchemaRDD Rows
Aaron Davidson created SPARK-2660: - Summary: Enable pretty-printing SchemaRDD Rows Key: SPARK-2660 URL: https://issues.apache.org/jira/browse/SPARK-2660 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, printing a Row results in something like [a,b,c]. It would be cool if there were a way to pretty-print them, similar to JSON, into something like {code} { Col1 : a, Col2 : b, Col3 : c } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
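A minimal sketch of the desired formatting; the helper and its parameters are hypothetical, and how the column names would be obtained from the SchemaRDD is left open.
{code}
// Join column names with the row's values into a JSON-like string.
def prettyRow(columnNames: Seq[String], values: Seq[Any]): String =
  columnNames.zip(values)
    .map { case (name, value) => "  \"" + name + "\": \"" + value + "\"" }
    .mkString("{\n", ",\n", "\n}")

// prettyRow(Seq("Col1", "Col2", "Col3"), Seq("a", "b", "c")) produces:
// {
//   "Col1": "a",
//   "Col2": "b",
//   "Col3": "c"
// }
{code}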
[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL
[ https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072681#comment-14072681 ] Apache Spark commented on SPARK-2010: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1559 Support for nested data in PySpark SQL -- Key: SPARK-2010 URL: https://issues.apache.org/jira/browse/SPARK-2010 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Kan Zhang Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2648) Randomize order of executors when fetching shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2648: --- Summary: Randomize order of executors when fetching shuffle blocks (was: through shuffling blocksByAddress avoid much reducers to fetch data from a executor at a time) Randomize order of executors when fetching shuffle blocks - Key: SPARK-2648 URL: https://issues.apache.org/jira/browse/SPARK-2648 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang Assignee: Lianhui Wang Priority: Critical Like MapReduce, we need to shuffle blocksByAddress. This avoids many reducers connecting to one executor at the same time. When a map has many partitions, so many reducers connect to that map at once that the network connections may time out. I created a PR for this issue: https://github.com/apache/spark/pull/1549 -- This message was sent by Atlassian JIRA (v6.2#6252)
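The essence of the renamed proposal, as a hedged sketch (the real change is in the linked pull request and may be structured differently):
{code}
import scala.util.Random
import org.apache.spark.storage.{BlockId, BlockManagerId}

// Randomize the per-executor fetch order so that reducers do not all hit the
// same executor first when fetching shuffle blocks.
def randomizedFetchOrder(
    blocksByAddress: Seq[(BlockManagerId, Seq[(BlockId, Long)])]
  ): Seq[(BlockManagerId, Seq[(BlockId, Long)])] =
  Random.shuffle(blocksByAddress)
{code}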
[jira] [Commented] (SPARK-2660) Enable pretty-printing SchemaRDD Rows
[ https://issues.apache.org/jira/browse/SPARK-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072687#comment-14072687 ] Larry Xiao commented on SPARK-2660: --- I think this one is suitable for a newbie like me, though I think it's alright the way it is now :) I'm not familiar with the code, and I found printRDDElement in pipe for RDD. Is it the correct place? Enable pretty-printing SchemaRDD Rows - Key: SPARK-2660 URL: https://issues.apache.org/jira/browse/SPARK-2660 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, printing a Row results in something like [a,b,c]. It would be cool if there was a way to pretty-print them similar to JSON, into something like {code} { Col1 : a, Col2 : b, Col3 : c } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2661) Unpersist last RDD in bagel iteration
Adrian Wang created SPARK-2661: -- Summary: Unpersist last RDD in bagel iteration Key: SPARK-2661 URL: https://issues.apache.org/jira/browse/SPARK-2661 Project: Spark Issue Type: Improvement Affects Versions: 1.0.1 Reporter: Adrian Wang In bagel iteration, we only depend on RDD[n] to get RDD[n+1], so we can unpersist RDD[n-1] after we get RDD[n]. -- This message was sent by Atlassian JIRA (v6.2#6252)
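A generic sketch of the pattern described above (variable and method names are illustrative; this is not the Bagel code itself):
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Each iteration only needs the previous RDD, so unpersist it once the new one
// has been materialized.
def iterate[T](initial: RDD[T], steps: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
  var prev = initial.persist(StorageLevel.MEMORY_AND_DISK)
  for (_ <- 1 to steps) {
    val next = step(prev).persist(StorageLevel.MEMORY_AND_DISK)
    next.foreachPartition(_ => ())      // force materialization before dropping prev
    prev.unpersist(blocking = false)
    prev = next
  }
  prev
}
{code}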
[jira] [Commented] (SPARK-2662) Fix NPE for JsonProtocol
[ https://issues.apache.org/jira/browse/SPARK-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072689#comment-14072689 ] Guoqiang Li commented on SPARK-2662: PR: https://github.com/apache/spark/pull/1511 Fix NPE for JsonProtocol Key: SPARK-2662 URL: https://issues.apache.org/jira/browse/SPARK-2662 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2662) Fix NPE for JsonProtocol
Guoqiang Li created SPARK-2662: -- Summary: Fix NPE for JsonProtocol Key: SPARK-2662 URL: https://issues.apache.org/jira/browse/SPARK-2662 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2568) RangePartitioner should go through the data only once
[ https://issues.apache.org/jira/browse/SPARK-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072729#comment-14072729 ] Apache Spark commented on SPARK-2568: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/1562 RangePartitioner should go through the data only once - Key: SPARK-2568 URL: https://issues.apache.org/jira/browse/SPARK-2568 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Xiangrui Meng As of Spark 1.0, RangePartitioner goes through data twice: once to compute the count and once to do sampling. As a result, to do sortByKey, Spark goes through data 3 times (once to count, once to sample, and once to sort). RangePartitioner should go through data only once (remove the count step). -- This message was sent by Atlassian JIRA (v6.2#6252)
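One single-pass approach that fits the description above is reservoir sampling, sketched here for illustration; it is not necessarily the exact technique used in the linked pull request.
{code}
import scala.reflect.ClassTag
import scala.util.Random

// Collect a fixed-size uniform sample in one scan, without knowing the total
// count up front (which is what removes the separate count pass).
def reservoirSample[T: ClassTag](input: Iterator[T], k: Int, seed: Long = 42L): Array[T] = {
  val rng = new Random(seed)
  val reservoir = new Array[T](k)
  var i = 0
  while (input.hasNext) {
    val item = input.next()
    if (i < k) reservoir(i) = item
    else {
      val j = rng.nextInt(i + 1)
      if (j < k) reservoir(j) = item
    }
    i += 1
  }
  if (i < k) reservoir.take(i) else reservoir
}
{code}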