[jira] [Reopened] (SPARK-21141) spark-update --version is hard to parse

2017-06-19 Thread michael procopio (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

michael procopio reopened SPARK-21141:
--

My apologies, I meant spark-submit --version.


> spark-update --version is hard to parse
> ---
>
> Key: SPARK-21141
> URL: https://issues.apache.org/jira/browse/SPARK-21141
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Debian 8 using hadoop 2.7.2
>Reporter: michael procopio
>
> We need to be able to determine the Spark version in order to reference our 
> jars: one set built for 1.x using Scala 2.10 and the other built for 2.x 
> using Scala 2.11. spark-update --version returns a lot of extraneous output. 
> It would be preferable if an option were available that returned only the 
> version number.
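
A workaround in the meantime (a sketch on my side, not part of this issue) is to parse the version out of the banner that spark-submit --version prints; the regex below and the assumption that the version appears on a line containing "version x.y.z" are mine, and the banner may be written to stderr depending on the release, so both streams are captured:

import scala.sys.process._

object SparkVersionProbe {
  // Matches the "version X.Y.Z" fragment printed in the spark-submit banner.
  private val VersionPattern = """version\s+(\d+\.\d+\.\d+)""".r

  def main(args: Array[String]): Unit = {
    val output = new StringBuilder
    // Capture stdout and stderr, since the banner has been seen on either.
    val logger = ProcessLogger(line => output.append(line).append('\n'),
                               line => output.append(line).append('\n'))
    Seq("spark-submit", "--version").!(logger)
    VersionPattern.findFirstMatchIn(output.toString) match {
      case Some(m) => println(m.group(1))   // e.g. "2.1.1"
      case None    => sys.error("could not find a version line in the banner")
    }
  }
}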






[jira] [Reopened] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread michael procopio (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

michael procopio reopened SPARK-21140:
--

I am not sure what detail you are looking for; I provided the test code I was 
using. It seems to me that multiple copies of the data must be generated when 
collecting a partition. Having to set spark.executor.memory to 3 GB to collect 
a partition of 512 MB seems high to me.
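
One way to keep the per-task result small (my own sketch, not something from this report) is to spread the same data over more, smaller partitions before collecting, so no single task has to materialize and serialize 512 MB at once; the partition count of 64 below is purely illustrative:

import org.apache.spark.rdd.RDD

// Sketch, assuming `sizedRdd` is the RDD of Array[Byte] built by the test
// program quoted later in this thread.
def collectInSmallerPieces(sizedRdd: RDD[Array[Byte]]): Long = {
  // Spread the single large partition over many smaller ones so that no
  // single task result approaches 512 MB.
  val smallPartitions = sizedRdd.repartition(64)

  // toLocalIterator fetches one partition at a time to the driver instead
  // of all of them at once (trades memory for extra jobs).
  smallPartitions.toLocalIterator.map(_.length.toLong).sum
}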


> Reduce collect high memory requirements
> --
>
> Key: SPARK-21140
> URL: https://issues.apache.org/jira/browse/SPARK-21140
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Linux Debian 8 using hadoop 2.7.2.
>Reporter: michael procopio
>
> I wrote a very simple Scala application that used flatMap to create an RDD 
> containing a 512 MB partition of 256-byte arrays. Experimentally, I 
> determined that spark.executor.memory had to be set to 3 GB in order to 
> collect the data. This seems extremely high.






[jira] [Commented] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread michael procopio (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054043#comment-16054043 ]

michael procopio commented on SPARK-21140:
--

I disagree: executor memory does depend on the size of the partition being 
collected. A 6-to-1 ratio just to collect the data seems onerous to me. It 
seems that multiple copies must be created in the executor.

I found that collecting an RDD containing a single partition of 512 MB took 
3 GB of executor memory.
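
One way to quantify how much larger the materialized partition is than its raw byte count (my suggestion, not something from this issue) is Spark's SizeEstimator; it is a developer API whose numbers are approximate, and the record size and count below simply mirror the test parameters:

import org.apache.spark.util.SizeEstimator

object EstimatePartitionFootprint {
  def main(args: Array[String]): Unit = {
    // Approximate the on-heap footprint of the collected partition: ~2 million
    // 256-byte arrays (~512 MB of payload) plus per-object overhead. This does
    // not count the serialized copy of the task result that must also exist
    // while the partition is shipped to the driver. Running this needs a heap
    // comfortably larger than 512 MB.
    val recordSize = 256
    val recordCount = (512L * 1024 * 1024 / recordSize).toInt
    val materialized = Array.fill(recordCount)(new Array[Byte](recordSize))

    println(s"raw payload bytes: ${recordCount.toLong * recordSize}")
    println(s"estimated on-heap: ${SizeEstimator.estimate(materialized)}")
  }
}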

Here's the code I was using:

package com.test

import org.apache.spark._
import org.apache.spark.SparkContext._

object SparkRdd2 {
  def main(args: Array[String]) {
    try {
      //
      // Process any arguments.
      //
      def parseOptions(map: Map[String, Any], listArgs: List[String]): Map[String, Any] = {
        listArgs match {
          case Nil =>
            map
          case "-master" :: value :: tail =>
            parseOptions(map + ("master" -> value), tail)
          case "-recordSize" :: value :: tail =>
            parseOptions(map + ("recordSize" -> value.toInt), tail)
          case "-partitionSize" :: value :: tail =>
            parseOptions(map + ("partitionSize" -> value.toLong), tail)
          case "-executorMemory" :: value :: tail =>
            parseOptions(map + ("executorMemory" -> value), tail)
          case option :: tail =>
            println("unknown option " + option)
            sys.exit(1)
        }
      }
      val listArgs = args.toList
      val optionmap = parseOptions(Map[String, Any](), listArgs)
      val master = optionmap.getOrElse("master", "local").asInstanceOf[String]
      val recordSize = optionmap.getOrElse("recordSize", 128).asInstanceOf[Int]
      val partitionSize = optionmap.getOrElse("partitionSize", 1024L * 1024 * 1024).asInstanceOf[Long]
      val executorMemory = optionmap.getOrElse("executorMemory", "6g").asInstanceOf[String]
      println(f"Creating single partition of $partitionSize%d with records of length $recordSize%d")
      println(f"Setting spark.executor.memory to $executorMemory")
      //
      // Create SparkConf.
      //
      val sparkConf = new SparkConf()
      sparkConf.setAppName("MyEnvVar").setMaster(master).setExecutorEnv("myenvvar", "good")
      sparkConf.set("spark.executor.cores", "1")
      sparkConf.set("spark.executor.instances", "1")
      sparkConf.set("spark.executor.memory", executorMemory)
      sparkConf.set("spark.eventLog.enabled", "true")
      sparkConf.set("spark.eventLog.dir", "hdfs://hadoop01glnxa64:54310/user/mprocopi/spark-events")
      sparkConf.set("spark.driver.maxResultSize", "0")
      /*
      sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      sparkConf.set("spark.kryoserializer.buffer.max", "768m")
      sparkConf.set("spark.kryoserializer.buffer", "64k")
      */
      //
      // Create SparkContext.
      //
      val sc = new SparkContext(sparkConf)
      //
      // Build an iterator that yields recordSize-byte arrays until
      // partitionSize bytes have been produced.
      //
      def createdSizedPartition(recordSize: Int, partitionSize: Long): Iterator[Array[Byte]] = {
        var sizeReturned: Long = 0
        new Iterator[Array[Byte]] {
          override def hasNext: Boolean = sizeReturned < partitionSize
          override def next(): Array[Byte] = {
            sizeReturned = sizeReturned + recordSize
            new Array[Byte](recordSize)
          }
        }
      }
      // NOTE: the listing in the original mail was truncated at this point;
      // the two statements below are a reconstruction of how the RDD was
      // presumably built from the (recordSize, partitionSize) pair.
      val sizedRdd = sc.parallelize(Seq((recordSize, partitionSize)), 1)
        .flatMap(rddInfo => createdSizedPartition(rddInfo._1, rddInfo._2))
      val results = sizedRdd.collect
      var countLines: Int = 0
      var countBytes: Long = 0
      var maxRecord: Int = 0
      for (line <- results) {
        countLines = countLines + 1
        countBytes = countBytes + line.length
        if (line.length > maxRecord) {
          maxRecord = line.length
        }
      }
      println(f"Collected $countLines%d lines")
      println(f"          $countBytes%d bytes")
      println(f"Max record $maxRecord%d bytes")
    } catch {
      case e: Exception =>
        println("Error in executing application: " + e.getMessage)
        throw e
    }
  }
}

After building, it can be invoked as:

spark-submit --class com.test.SparkRdd2 --driver-memory 10g \
  ./target/scala-2.11/envtest_2.11-0.0.1.jar -recordSize 256 -partitionSize 536870912

The -recordSize and -partitionSize options let you vary the record size and the 
size of the partition being collected.

> Reduce collect high memory requirements
> --
>
>   

[jira] [Created] (SPARK-21141) spark-update --version is hard to parse

2017-06-19 Thread michael procopio (JIRA)
michael procopio created SPARK-21141:


 Summary: spark-update --version is hard to parse
 Key: SPARK-21141
 URL: https://issues.apache.org/jira/browse/SPARK-21141
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.1.1
 Environment: Debian 8 using hadoop 2.7.2
Reporter: michael procopio


We need to be able to determine the Spark version in order to reference our 
jars: one set built for 1.x using Scala 2.10 and the other built for 2.x using 
Scala 2.11. spark-update --version returns a lot of extraneous output. It would 
be preferable if an option were available that returned only the version number.
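
For code that is already running inside a Spark application (as opposed to a build or packaging script), the version can also be read programmatically; this is a general observation on my part rather than something proposed in this issue:

import org.apache.spark.{SparkConf, SparkContext}

object PrintSparkVersion {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PrintSparkVersion"))
    // SparkContext.version returns just the version string, e.g. "2.1.1",
    // without the rest of the banner that spark-submit --version prints.
    println(sc.version)
    sc.stop()
  }
}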







[jira] [Created] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread michael procopio (JIRA)
michael procopio created SPARK-21140:


 Summary: Reduce collect high memory requirements
 Key: SPARK-21140
 URL: https://issues.apache.org/jira/browse/SPARK-21140
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.1.1
 Environment: Linux Debian 8 using hadoop 2.7.2.

Reporter: michael procopio


I wrote a very simple Scala application that used flatMap to create an RDD 
containing a 512 MB partition of 256-byte arrays. Experimentally, I determined 
that spark.executor.memory had to be set to 3 GB in order to collect the data. 
This seems extremely high.







[jira] [Created] (SPARK-19030) Dropped event errors being reported after SparkContext has been stopped

2016-12-29 Thread michael procopio (JIRA)
michael procopio created SPARK-19030:


 Summary: Dropped event errors being reported after SparkContext 
has been stopped
 Key: SPARK-19030
 URL: https://issues.apache.org/jira/browse/SPARK-19030
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
 Environment: Debian 8 using spark-submit with MATLAB integration; the Spark 
code is written using Java.
Reporter: michael procopio
Priority: Minor


After stop has been called on the SparkContext, errors like the following are 
being reported:

16/12/29 15:54:04 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray())

The stack in the heartbeat thread at the point where the error is thrown is:

Daemon Thread [heartbeat-receiver-event-loop-thread] (Suspended (breakpoint at line 124 in LiveListenerBus))
    LiveListenerBus.post(SparkListenerEvent) line: 124
    DAGScheduler.executorHeartbeatReceived(String, Tuple4[], BlockManagerId) line: 228
    YarnScheduler(TaskSchedulerImpl).executorHeartbeatReceived(String, Tuple2>[], BlockManagerId) line: 402
    HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp() line: 128
    Utils$.tryLogNonFatalError(Function0) line: 1290
    HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run() line: 127
    Executors$RunnableAdapter.call() line: 511
    ScheduledThreadPoolExecutor$ScheduledFutureTask(FutureTask).run() line: 266
    ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor$ScheduledFutureTask) line: 180
    ScheduledThreadPoolExecutor$ScheduledFutureTask.run() line: 293
    ScheduledThreadPoolExecutor(ThreadPoolExecutor).runWorker(ThreadPoolExecutor$Worker) line: 1142
    ThreadPoolExecutor$Worker.run() line: 617
    Thread.run() line: 745
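
For context, a minimal sketch of the sequence that can surface these messages (my own illustration, not taken from this report; the behaviour is timing-dependent, and the dropped event is harmless since the listener bus has already been stopped deliberately):

import org.apache.spark.{SparkConf, SparkContext}

object StopRace {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StopRace").setMaster("yarn"))

    // Run some work so executors are registered and heartbeating.
    sc.parallelize(1 to 1000000, 8).map(_ * 2).count()

    // Stopping the context shuts down the listener bus; an executor heartbeat
    // already in flight may still be delivered afterwards, producing the
    // "SparkListenerBus has already stopped! Dropping event ..." error.
    sc.stop()
  }
}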







[jira] [Created] (SPARK-10452) Pyspark worker security issue

2015-09-04 Thread Michael Procopio (JIRA)
Michael Procopio created SPARK-10452:


 Summary: Pyspark worker security issue
 Key: SPARK-10452
 URL: https://issues.apache.org/jira/browse/SPARK-10452
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 running on hadoop 2.5.2.
Reporter: Michael Procopio
Priority: Critical


The Python worker launched by the executor is given the credentials that were 
used to launch YARN.
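
A way to check which identity executor-side code actually runs under (a diagnostic sketch of mine, not from this report; the Python worker is forked by the executor, so it starts with the same OS user and environment as the executor JVM):

import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

object WhoAmIOnExecutors {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WhoAmIOnExecutors"))
    // Report, from inside each executor, the OS user and the Hadoop UGI the
    // JVM is running as; any worker process forked by the executor (such as
    // the PySpark worker) starts out with this same identity.
    val identities = sc.parallelize(1 to 4, 4).map { _ =>
      (System.getProperty("user.name"),
       UserGroupInformation.getCurrentUser.getUserName)
    }.distinct().collect()
    identities.foreach { case (osUser, ugiUser) =>
      println(s"os user = $osUser, hadoop ugi = $ugiUser")
    }
    sc.stop()
  }
}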






[jira] [Created] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark

2015-09-04 Thread Michael Procopio (JIRA)
Michael Procopio created SPARK-10453:


 Summary: There's no way to use spark.dynamicAllocation.enabled with pyspark
 Key: SPARK-10453
 URL: https://issues.apache.org/jira/browse/SPARK-10453
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: When using spark.dynamicAllocation.enabled, the assumption is 
that memory/core resources will be mediated by the YARN resource manager. 
Unfortunately, whatever value is used for spark.executor.memory is consumed as 
JVM heap space by the executor. There is no way to account for the memory 
requirements of the PySpark worker. Executor JVM heap space should be decoupled 
from spark.executor.memory.
Reporter: Michael Procopio
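
One partial mitigation available in this Spark version (a sketch from me, not something proposed in the issue) is to reserve headroom outside the executor heap with spark.yarn.executor.memoryOverhead, which enlarges the YARN container without enlarging the JVM heap; the sizes below are purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object PySparkHeadroomExample {
  def main(args: Array[String]): Unit = {
    // Sketch: leave room in the YARN container for the PySpark worker
    // processes by growing the off-heap overhead instead of the JVM heap.
    val conf = new SparkConf()
      .setAppName("PySparkHeadroomExample")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // dynamic allocation on YARN needs the external shuffle service
      .set("spark.executor.memory", "4g")                // JVM heap only
      .set("spark.yarn.executor.memoryOverhead", "2048") // MB outside the heap, e.g. for Python workers
    val sc = new SparkContext(conf)
    try {
      println(s"overhead = ${conf.get("spark.yarn.executor.memoryOverhead")} MB per executor")
    } finally {
      sc.stop()
    }
  }
}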





