Re: Preconditions on RDDs for creating a Graph?

2014-04-08 Thread Ankur Dave
On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak  wrote:

> What, exactly, needs to be true about the RDDs that you pass to Graph() to
> be sure of constructing a valid graph? (Do they need to have the same
> number of partitions? The same number of partitions and no empty
> partitions? Do you need to repartition them with their default partitioners
> beforehand?)
>

In theory, there should be no preconditions on the vertex and edge RDDs
passed to Graph(). They may each have any number of empty or nonempty
partitions and can be partitioned arbitrarily. GraphX will repartition the
vertex RDD as described below, and it should not repartition the edge RDD
unless you do this explicitly by calling Graph#partitionBy.

Unfortunately, there was a bug (SPARK-1329) that caused an
ArrayIndexOutOfBoundsException when the edge RDD had more partitions than
the vertex RDD. I just submitted apache/spark#368, which should fix this.
Could you try applying it and see if that helps?

Why does GraphImpl repartition the vertices RDD?


This is for two reasons:
1. To remove duplicate vertices by colocating them on the same partition:
VertexPartition.apply.
2. To enable aggregating messages to a vertex by hashing the target vertex
ID using the vertex partitioner and sending the message to the resulting
partition: VertexRDD#aggregateUsingIndex.

Ideally we would skip repartitioning the vertex RDD if it's already
partitioned, though.
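
For illustration, here is a minimal sketch of the case described above --
vertex and edge RDDs with different, arbitrary partition counts passed
straight to Graph() -- assuming the GraphX API discussed in this thread;
the example data and the buildGraph name are made up:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Hypothetical example: one vertex partition, three edge partitions,
// no manual alignment of partition counts.
def buildGraph(sc: SparkContext): Graph[String, Int] = {
  val vertices: RDD[(Long, String)] =
    sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")), 1)
  val edges: RDD[Edge[Int]] =
    sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)), 3)
  // GraphX hash-partitions the vertices internally (for the two reasons
  // above); the edges keep their partitioning unless Graph#partitionBy
  // is called explicitly.
  Graph(vertices, edges)
}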

Ankur 


Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Xiangrui Meng
After sbt/sbt gen-idea, do not import it as an SBT project; choose
"Open Project" and point it to the Spark folder. -Xiangrui

On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen  wrote:
> I let IntelliJ read the Maven build directly and that works fine.
> --
> Sean Owen | Director, Data Science | London
>
>
> On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo  wrote:
>> Dear list,
>>
>> SBT compiles fine, but when I do the following:
>> sbt/sbt gen-idea
>> import project as SBT project to IDEA 13.1
>> Make Project
>> and these errors show up:
>>
>> Error:(28, 8) object FileContext is not a member of package
>> org.apache.hadoop.fs
>> import org.apache.hadoop.fs.{FileContext, FileStatus, FileSystem, Path,
>> FileUtil}
>>^
>> Error:(31, 8) object Master is not a member of package
>> org.apache.hadoop.mapred
>> import org.apache.hadoop.mapred.Master
>>^
>> Error:(34, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.api._
>>  ^
>> Error:(35, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.api.ApplicationConstants.Environment
>>  ^
>> Error:(36, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.api.protocolrecords._
>>  ^
>> Error:(37, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.api.records._
>>  ^
>> Error:(38, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.client.YarnClientImpl
>>  ^
>> Error:(39, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.conf.YarnConfiguration
>>  ^
>> Error:(40, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.ipc.YarnRPC
>>  ^
>> Error:(41, 26) object yarn is not a member of package org.apache.hadoop
>> import org.apache.hadoop.yarn.util.{Apps, Records}
>>  ^
>> Error:(49, 11) not found: type YarnClientImpl
>>   extends YarnClientImpl with Logging {
>>   ^
>> Error:(48, 20) not found: type ClientArguments
>> class Client(args: ClientArguments, conf: Configuration, sparkConf:
>> SparkConf)
>>^
>> Error:(51, 18) not found: type ClientArguments
>>   def this(args: ClientArguments, sparkConf: SparkConf) =
>>  ^
>> Error:(54, 18) not found: type ClientArguments
>>   def this(args: ClientArguments) = this(args, new SparkConf())
>>  ^
>> Error:(56, 12) not found: type YarnRPC
>>   var rpc: YarnRPC = YarnRPC.create(conf)
>>^
>> Error:(56, 22) not found: value YarnRPC
>>   var rpc: YarnRPC = YarnRPC.create(conf)
>>  ^
>> Error:(57, 17) not found: type YarnConfiguration
>>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
>> ^
>> Error:(57, 41) not found: type YarnConfiguration
>>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
>> ^
>> Error:(58, 59) value getCredentials is not a member of
>> org.apache.hadoop.security.UserGroupInformation
>>   val credentials = UserGroupInformation.getCurrentUser().getCredentials()
>>   ^
>> Error:(60, 34) not found: type ClientDistributedCacheManager
>>   private val distCacheMgr = new ClientDistributedCacheManager()
>>  ^
>> Error:(72, 5) not found: value init
>> init(yarnConf)
>> ^
>> Error:(73, 5) not found: value start
>> start()
>> ^
>> Error:(76, 24) value getNewApplication is not a member of
>> org.apache.spark.Logging
>> val newApp = super.getNewApplication()
>>^
>> Error:(137, 35) not found: type GetNewApplicationResponse
>>   def verifyClusterResources(app: GetNewApplicationResponse) = {
>>   ^
>> Error:(156, 65) not found: type ApplicationSubmissionContext
>>   def createApplicationSubmissionContext(appId: ApplicationId):
>> ApplicationSubmissionContext = {
>> ^
>> Error:(156, 49) not found: type ApplicationId
>>   def createApplicationSubmissionContext(appId: ApplicationId):
>> ApplicationSubmissionContext = {
>> ^
>> Error:(118, 31) not found: type ApplicationId
>>   def getAppStagingDir(appId: ApplicationId): String = {
>>   ^
>> Error:(224, 69) not found: type LocalResource
>>   def prepareLocalResources(appStagingDir: String): HashMap[String,
>> LocalResource] = {
>> ^
>> Error:(307, 39) not found: type LocalResource
>>   localResources: HashMap[String, LocalResource],
>>  

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Sean Owen
I let IntelliJ read the Maven build directly and that works fine.
--
Sean Owen | Director, Data Science | London


On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo  wrote:
> Dear list,
>
> SBT compiles fine, but when I do the following:
> sbt/sbt gen-idea
> import project as SBT project to IDEA 13.1
> Make Project
> and these errors show up:
>
> Error:(28, 8) object FileContext is not a member of package
> org.apache.hadoop.fs
> import org.apache.hadoop.fs.{FileContext, FileStatus, FileSystem, Path,
> FileUtil}
>^
> Error:(31, 8) object Master is not a member of package
> org.apache.hadoop.mapred
> import org.apache.hadoop.mapred.Master
>^
> Error:(34, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api._
>  ^
> Error:(35, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api.ApplicationConstants.Environment
>  ^
> Error:(36, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api.protocolrecords._
>  ^
> Error:(37, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api.records._
>  ^
> Error:(38, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.client.YarnClientImpl
>  ^
> Error:(39, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.conf.YarnConfiguration
>  ^
> Error:(40, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.ipc.YarnRPC
>  ^
> Error:(41, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.util.{Apps, Records}
>  ^
> Error:(49, 11) not found: type YarnClientImpl
>   extends YarnClientImpl with Logging {
>   ^
> Error:(48, 20) not found: type ClientArguments
> class Client(args: ClientArguments, conf: Configuration, sparkConf:
> SparkConf)
>^
> Error:(51, 18) not found: type ClientArguments
>   def this(args: ClientArguments, sparkConf: SparkConf) =
>  ^
> Error:(54, 18) not found: type ClientArguments
>   def this(args: ClientArguments) = this(args, new SparkConf())
>  ^
> Error:(56, 12) not found: type YarnRPC
>   var rpc: YarnRPC = YarnRPC.create(conf)
>^
> Error:(56, 22) not found: value YarnRPC
>   var rpc: YarnRPC = YarnRPC.create(conf)
>  ^
> Error:(57, 17) not found: type YarnConfiguration
>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
> ^
> Error:(57, 41) not found: type YarnConfiguration
>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
> ^
> Error:(58, 59) value getCredentials is not a member of
> org.apache.hadoop.security.UserGroupInformation
>   val credentials = UserGroupInformation.getCurrentUser().getCredentials()
>   ^
> Error:(60, 34) not found: type ClientDistributedCacheManager
>   private val distCacheMgr = new ClientDistributedCacheManager()
>  ^
> Error:(72, 5) not found: value init
> init(yarnConf)
> ^
> Error:(73, 5) not found: value start
> start()
> ^
> Error:(76, 24) value getNewApplication is not a member of
> org.apache.spark.Logging
> val newApp = super.getNewApplication()
>^
> Error:(137, 35) not found: type GetNewApplicationResponse
>   def verifyClusterResources(app: GetNewApplicationResponse) = {
>   ^
> Error:(156, 65) not found: type ApplicationSubmissionContext
>   def createApplicationSubmissionContext(appId: ApplicationId):
> ApplicationSubmissionContext = {
> ^
> Error:(156, 49) not found: type ApplicationId
>   def createApplicationSubmissionContext(appId: ApplicationId):
> ApplicationSubmissionContext = {
> ^
> Error:(118, 31) not found: type ApplicationId
>   def getAppStagingDir(appId: ApplicationId): String = {
>   ^
> Error:(224, 69) not found: type LocalResource
>   def prepareLocalResources(appStagingDir: String): HashMap[String,
> LocalResource] = {
> ^
> Error:(307, 39) not found: type LocalResource
>   localResources: HashMap[String, LocalResource],
>   ^
> Error:(343, 38) not found: type ContainerLaunchContext
>   env: HashMap[String, String]): ContainerLaunchContext = {
>  ^
> Error:(341, 15) not found: type GetNewApplicationResponse
>   newApp: GetNewApplicationResponse,
> 

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread DB Tsai
Hi Dong,

This is pretty much what I did; I ran into the same issue you have.
Since I'm not developing YARN-related stuff, I just excluded those two
YARN-related projects from IntelliJ, and it works. PS: you may need to
exclude the java8 project as well now.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Apr 8, 2014 at 10:14 PM, Dong Mo  wrote:
> Dear list,
>
> SBT compiles fine, but when I do the following:
> sbt/sbt gen-idea
> import project as SBT project to IDEA 13.1
> Make Project
> and these errors show up:
>
> Error:(28, 8) object FileContext is not a member of package
> org.apache.hadoop.fs
> import org.apache.hadoop.fs.{FileContext, FileStatus, FileSystem, Path,
> FileUtil}
>^
> Error:(31, 8) object Master is not a member of package
> org.apache.hadoop.mapred
> import org.apache.hadoop.mapred.Master
>^
> Error:(34, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api._
>  ^
> Error:(35, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api.ApplicationConstants.Environment
>  ^
> Error:(36, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api.protocolrecords._
>  ^
> Error:(37, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.api.records._
>  ^
> Error:(38, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.client.YarnClientImpl
>  ^
> Error:(39, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.conf.YarnConfiguration
>  ^
> Error:(40, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.ipc.YarnRPC
>  ^
> Error:(41, 26) object yarn is not a member of package org.apache.hadoop
> import org.apache.hadoop.yarn.util.{Apps, Records}
>  ^
> Error:(49, 11) not found: type YarnClientImpl
>   extends YarnClientImpl with Logging {
>   ^
> Error:(48, 20) not found: type ClientArguments
> class Client(args: ClientArguments, conf: Configuration, sparkConf:
> SparkConf)
>^
> Error:(51, 18) not found: type ClientArguments
>   def this(args: ClientArguments, sparkConf: SparkConf) =
>  ^
> Error:(54, 18) not found: type ClientArguments
>   def this(args: ClientArguments) = this(args, new SparkConf())
>  ^
> Error:(56, 12) not found: type YarnRPC
>   var rpc: YarnRPC = YarnRPC.create(conf)
>^
> Error:(56, 22) not found: value YarnRPC
>   var rpc: YarnRPC = YarnRPC.create(conf)
>  ^
> Error:(57, 17) not found: type YarnConfiguration
>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
> ^
> Error:(57, 41) not found: type YarnConfiguration
>   val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
> ^
> Error:(58, 59) value getCredentials is not a member of
> org.apache.hadoop.security.UserGroupInformation
>   val credentials = UserGroupInformation.getCurrentUser().getCredentials()
>   ^
> Error:(60, 34) not found: type ClientDistributedCacheManager
>   private val distCacheMgr = new ClientDistributedCacheManager()
>  ^
> Error:(72, 5) not found: value init
> init(yarnConf)
> ^
> Error:(73, 5) not found: value start
> start()
> ^
> Error:(76, 24) value getNewApplication is not a member of
> org.apache.spark.Logging
> val newApp = super.getNewApplication()
>^
> Error:(137, 35) not found: type GetNewApplicationResponse
>   def verifyClusterResources(app: GetNewApplicationResponse) = {
>   ^
> Error:(156, 65) not found: type ApplicationSubmissionContext
>   def createApplicationSubmissionContext(appId: ApplicationId):
> ApplicationSubmissionContext = {
> ^
> Error:(156, 49) not found: type ApplicationId
>   def createApplicationSubmissionContext(appId: ApplicationId):
> ApplicationSubmissionContext = {
> ^
> Error:(118, 31) not found: type ApplicationId
>   def getAppStagingDir(appId: ApplicationId): String = {
>   ^
> Error:(224, 69) not found: type LocalResource
>   def prepareLocalResources(appStagingDir: String): HashMap[String,
> LocalResource] = {
> ^
> Error:(307, 39) not found: type LocalResource
>   localResources: HashMap[String, LocalResource],
>  

Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Dong Mo
Dear list,

SBT compiles fine, but when I do the following:
sbt/sbt gen-idea
import project as SBT project to IDEA 13.1
Make Project
and these errors show up:

Error:(28, 8) object FileContext is not a member of package
org.apache.hadoop.fs
import org.apache.hadoop.fs.{FileContext, FileStatus, FileSystem, Path,
FileUtil}
   ^
Error:(31, 8) object Master is not a member of package
org.apache.hadoop.mapred
import org.apache.hadoop.mapred.Master
   ^
Error:(34, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.api._
 ^
Error:(35, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment
 ^
Error:(36, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.api.protocolrecords._
 ^
Error:(37, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.api.records._
 ^
Error:(38, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.client.YarnClientImpl
 ^
Error:(39, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.conf.YarnConfiguration
 ^
Error:(40, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.ipc.YarnRPC
 ^
Error:(41, 26) object yarn is not a member of package org.apache.hadoop
import org.apache.hadoop.yarn.util.{Apps, Records}
 ^
Error:(49, 11) not found: type YarnClientImpl
  extends YarnClientImpl with Logging {
  ^
Error:(48, 20) not found: type ClientArguments
class Client(args: ClientArguments, conf: Configuration, sparkConf:
SparkConf)
   ^
Error:(51, 18) not found: type ClientArguments
  def this(args: ClientArguments, sparkConf: SparkConf) =
 ^
Error:(54, 18) not found: type ClientArguments
  def this(args: ClientArguments) = this(args, new SparkConf())
 ^
Error:(56, 12) not found: type YarnRPC
  var rpc: YarnRPC = YarnRPC.create(conf)
   ^
Error:(56, 22) not found: value YarnRPC
  var rpc: YarnRPC = YarnRPC.create(conf)
 ^
Error:(57, 17) not found: type YarnConfiguration
  val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
^
Error:(57, 41) not found: type YarnConfiguration
  val yarnConf: YarnConfiguration = new YarnConfiguration(conf)
^
Error:(58, 59) value getCredentials is not a member of
org.apache.hadoop.security.UserGroupInformation
  val credentials = UserGroupInformation.getCurrentUser().getCredentials()
  ^
Error:(60, 34) not found: type ClientDistributedCacheManager
  private val distCacheMgr = new ClientDistributedCacheManager()
 ^
Error:(72, 5) not found: value init
init(yarnConf)
^
Error:(73, 5) not found: value start
start()
^
Error:(76, 24) value getNewApplication is not a member of
org.apache.spark.Logging
val newApp = super.getNewApplication()
   ^
Error:(137, 35) not found: type GetNewApplicationResponse
  def verifyClusterResources(app: GetNewApplicationResponse) = {
  ^
Error:(156, 65) not found: type ApplicationSubmissionContext
  def createApplicationSubmissionContext(appId: ApplicationId):
ApplicationSubmissionContext = {
^
Error:(156, 49) not found: type ApplicationId
  def createApplicationSubmissionContext(appId: ApplicationId):
ApplicationSubmissionContext = {
^
Error:(118, 31) not found: type ApplicationId
  def getAppStagingDir(appId: ApplicationId): String = {
  ^
Error:(224, 69) not found: type LocalResource
  def prepareLocalResources(appStagingDir: String): HashMap[String,
LocalResource] = {
^
Error:(307, 39) not found: type LocalResource
  localResources: HashMap[String, LocalResource],
  ^
Error:(343, 38) not found: type ContainerLaunchContext
  env: HashMap[String, String]): ContainerLaunchContext = {
 ^
Error:(341, 15) not found: type GetNewApplicationResponse
  newApp: GetNewApplicationResponse,
  ^
Error:(342, 39) not found: type LocalResource
  localResources: HashMap[String, LocalResource],
  ^
Error:(426, 11) value submitApplication is not a member of
org.apache.spark.Logging
super.submitApplication(appContext)
  ^
Error:(423, 29) not found: type ApplicationSubmissionContext
  def submitApp(appContext: Applicat

java.io.NotSerializableException exception - custom Accumulator

2014-04-08 Thread Dhimant
Hi ,

I am getting java.io.NotSerializableException exception while executing
following program.

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam
object App {
  class Vector (val data: Array[Double]) {}
  implicit object VectorAP extends AccumulatorParam[Vector]  { 
def zero(v: Vector) : Vector = new Vector(new Array(v.data.size)) 
def addInPlace(v1: Vector, v2: Vector) : Vector = { 
  for (i <- 0 to v1.data.size-1) v1.data(i) += v2.data(i) 
  return v1 
} 
  } 
  def main(sc:SparkContext) {
val vectorAcc = sc.accumulator(new Vector(Array(0, 0)))
val accum = sc.accumulator(0) 
val file = sc.textFile("/user/root/data/SourceFiles/a.txt", 10)
file.foreach(line => {println(line); accum+=1; vectorAcc.add(new
Vector(Array(1,1 ))) ;})
println(accum.value)
println(vectorAcc.value.data)
println("=" )
  }
}

--

scala> App.main(sc)
14/04/09 01:02:05 INFO storage.MemoryStore: ensureFreeSpace(130760) called
with curMem=0, maxMem=308713881
14/04/09 01:02:05 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 127.7 KB, free 294.3 MB)
14/04/09 01:02:07 INFO mapred.FileInputFormat: Total input paths to process
: 1
14/04/09 01:02:07 INFO spark.SparkContext: Starting job: foreach at
:30
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Got job 0 (foreach at
:30) with 11 output partitions (allowLocal=false)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Final stage: Stage 0 (foreach
at :30)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Submitting Stage 0
(MappedRDD[1] at textFile at :29), which has no missing parents
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Failed to run foreach at
:30
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$App$
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-exception-custom-Accumulator-tp3971.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


java.io.NotSerializableException exception - custom Accumulator

2014-04-08 Thread Dhimant Jayswal
Hi ,

I am getting java.io.NotSerializableException exception while executing
following program.

import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam
object App {
  class Vector (val data: Array[Double]) {}
  implicit object VectorAP extends AccumulatorParam[Vector]  {
def zero(v: Vector) : Vector = new Vector(new Array(v.data.size))
def addInPlace(v1: Vector, v2: Vector) : Vector = {
  for (i <- 0 to v1.data.size-1) v1.data(i) += v2.data(i)
  return v1
}
  }
  def main(sc:SparkContext) {
val vectorAcc = sc.accumulator(new Vector(Array(0, 0)))
val accum = sc.accumulator(0)
val file = sc.textFile("/user/root/data/SourceFiles/a.txt", 10)
file.foreach(line => {println(line); accum+=1; vectorAcc.add(new
Vector(Array(1,1 ))) ;})
println(accum.value)
println(vectorAcc.value.data)
println("=" )
  }
}

--

scala> App.main(sc)
14/04/09 01:02:05 INFO storage.MemoryStore: ensureFreeSpace(130760) called
with curMem=0, maxMem=308713881
14/04/09 01:02:05 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 127.7 KB, free 294.3 MB)
14/04/09 01:02:07 INFO mapred.FileInputFormat: Total input paths to process
: 1
14/04/09 01:02:07 INFO spark.SparkContext: Starting job: foreach at
:30
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Got job 0 (foreach at
:30) with 11 output partitions (allowLocal=false)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Final stage: Stage 0
(foreach at :30)
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Missing parents: List()
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Submitting Stage 0
(MappedRDD[1] at textFile at :29), which has no missing parents
14/04/09 01:02:07 INFO scheduler.DAGScheduler: Failed to run foreach at
:30
org.apache.spark.SparkException: Job aborted: Task not serializable:
java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$App$
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:794)
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:737)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:569)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
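
A common cause of this kind of NotSerializableException in the shell is that
the task closure drags in the REPL-wrapped enclosing object (the
$iwC$...$App$ in the stack trace). Under that assumption, here is a rough
sketch of a workaround -- the DoubleVector, DoubleVectorAP, and run names are
hypothetical -- that makes the accumulator's value type serializable and
keeps it out of the enclosing object so the foreach closure no longer needs
to capture App:

import org.apache.spark.{AccumulatorParam, SparkContext}

// Hypothetical sketch: a top-level, serializable value type, so tasks only
// need to serialize this small object rather than the enclosing App object.
class DoubleVector(val data: Array[Double]) extends Serializable

object DoubleVectorAP extends AccumulatorParam[DoubleVector] with Serializable {
  def zero(v: DoubleVector): DoubleVector =
    new DoubleVector(new Array[Double](v.data.length))
  def addInPlace(v1: DoubleVector, v2: DoubleVector): DoubleVector = {
    for (i <- v1.data.indices) v1.data(i) += v2.data(i)
    v1
  }
}

def run(sc: SparkContext): Unit = {
  implicit val param = DoubleVectorAP
  val vectorAcc = sc.accumulator(new DoubleVector(Array(0.0, 0.0)))
  sc.textFile("/user/root/data/SourceFiles/a.txt", 10)
    .foreach(_ => vectorAcc.add(new DoubleVector(Array(1.0, 1.0))))
  println(vectorAcc.value.data.mkString(", "))
}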


issue of driver's HA

2014-04-08 Thread 林武康
Hi all,
We have run into some trouble with high availability (HA) for the driver. We run
a long-lived driver in Spark standalone mode that acts as a server, submitting
jobs as requests arrive. This raises the question of HA for the driver process,
e.g. how to resume jobs after the driver process fails.
Can anyone share any ideas or practices on this issue?

Thank you for any help.

Sincerely,
lin wukang 

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-08 Thread abhietc31
Could anybody please help with the above query?
It's challenging, but it will open a new horizon for in-memory analysis.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p3968.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark with SSL?

2014-04-08 Thread Benjamin Black
Although connection setup is expensive, the overhead of AES on any recent
Intel processor is almost zero. AES-NI is good stuff.

On Tuesday, April 8, 2014, Evan R. Sparks  wrote:

> A bandaid might be to set up ssh tunneling between slaves and master - has
> anyone tried deploying this way? I would expect it to pretty negatively
> impact performance on communication-heavy jobs.
>
>
> On Tue, Apr 8, 2014 at 3:23 PM, Benjamin Black wrote:
>
>> Only if you trust the provider networks and everyone who might have
>> access to them. I don't.
>>
>>
>> On Tuesday, April 8, 2014, Ognen Duzlevski wrote:
>>
>>>  Ideally, you just run it in Amazon's VPC or whatever other providers'
>>> equivalent is. In this case running things over SSL would be an overkill.
>>>
>>> On 4/8/14, 3:31 PM, Andrew Ash wrote:
>>>
>>> Not that I know of, but it would be great if that was supported.  The
>>> way I typically handle security now is to put the Spark servers in their
>>> own subnet with strict inbound/outbound firewalls.
>>>
>>>
>>> On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka  wrote:
>>>
 Can Spark be configured to use SSL for all its network communication?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

>>>
>>>
>


Re: Spark with SSL?

2014-04-08 Thread Evan R. Sparks
A bandaid might be to set up ssh tunneling between slaves and master - has
anyone tried deploying this way? I would expect it to pretty negatively
impact performance on communication-heavy jobs.


On Tue, Apr 8, 2014 at 3:23 PM, Benjamin Black  wrote:

> Only if you trust the provider networks and everyone who might have access
> to them. I don't.
>
>
> On Tuesday, April 8, 2014, Ognen Duzlevski 
> wrote:
>
>>  Ideally, you just run it in Amazon's VPC or whatever other providers'
>> equivalent is. In this case running things over SSL would be an overkill.
>>
>> On 4/8/14, 3:31 PM, Andrew Ash wrote:
>>
>> Not that I know of, but it would be great if that was supported.  The way
>> I typically handle security now is to put the Spark servers in their own
>> subnet with strict inbound/outbound firewalls.
>>
>>
>> On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka  wrote:
>>
>>> Can Spark be configured to use SSL for all its network communication?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>


Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Thanks Andrew, I will take a look at it.

On Tue, Apr 8, 2014 at 3:35 PM, Andrew Ash [via Apache Spark User List] <
ml-node+s1001560n3920...@n3.nabble.com> wrote:

> If you set up Spark's metrics reporting to write to the Ganglia backend
> that will give you a good idea of how much network/disk/CPU is being used
> and on what machines.
>
> https://spark.apache.org/docs/0.9.0/monitoring.html
>
>
> On Tue, Apr 8, 2014 at 12:57 PM, yxzhao <[hidden email]> wrote:
>
>> Hi All,
>>
>>   I want to measure the total network traffic for a Spark Job. But I
>> did
>> not see related information from the log. Does anybody know how to measure
>> it? Thanks very much in advance.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Network-Traffic-for-Spark-Job-tp3912.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Network-Traffic-for-Spark-Job-tp3912p3920.html
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Network-Traffic-for-Spark-Job-tp3912p3963.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: [BLOG] For Beginners

2014-04-08 Thread weida xu
Dear all,

I'm very interested in this. However, the links mentioned above are not
accessible from China. Is there any other way to read the two blog pages?
Thanks a lot.


2014-04-08 12:54 GMT+08:00 prabeesh k :

> Hi all,
>
> Here I am sharing a blog for beginners about creating a Spark Streaming
> standalone application and bundling the app as a single runnable jar. Take a
> look and drop your comments on the blog page.
>
>
> http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html
>
>
> http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html
>
> prabeesh
>


A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
Hi guys,

We're going to hold a series of meetups about machine learning with
Spark in San Francisco.

The first one will be on April 24. Xiangrui Meng from Databricks will
talk about Spark, Spark/Python, feature engineering, and MLlib.

See http://www.meetup.com/sfmachinelearning/events/174560212/ for details.

The next one, on May 1, will be a joint event with Cloudera covering
unsupervised learning and multinomial logistic regression with L-BFGS
in Spark.

See http://www.meetup.com/sfmachinelearning/events/176105932/

If you would like to share anything related to machine learning with
Spark in SF Machine Learning Meetup, please let me know.

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


Re: Spark with SSL?

2014-04-08 Thread Benjamin Black
Only if you trust the provider networks and everyone who might have access
to them. I don't.

On Tuesday, April 8, 2014, Ognen Duzlevski 
wrote:

>  Ideally, you just run it in Amazon's VPC or whatever other providers'
> equivalent is. In this case running things over SSL would be an overkill.
>
> On 4/8/14, 3:31 PM, Andrew Ash wrote:
>
> Not that I know of, but it would be great if that was supported.  The way
> I typically handle security now is to put the Spark servers in their own
> subnet with strict inbound/outbound firewalls.
>
>
> On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka wrote:
>
>> Can Spark be configured to use SSL for all its network communication?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Spark with SSL?

2014-04-08 Thread Ognen Duzlevski
Ideally, you just run it in Amazon's VPC or whatever other providers' 
equivalent is. In this case running things over SSL would be an overkill.


On 4/8/14, 3:31 PM, Andrew Ash wrote:
Not that I know of, but it would be great if that was supported.  The 
way I typically handle security now is to put the Spark servers in 
their own subnet with strict inbound/outbound firewalls.



On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka wrote:


Can Spark be configured to use SSL for all its network communication?



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.





Preconditions on RDDs for creating a Graph?

2014-04-08 Thread Adam Novak
Hello,

I'm trying to create a GraphX Graph by calling Graph(vertices:
RDD[(VertexId, VD)], edges: RDD[Edge[ED]]): Graph[VD, ED]. I'm passing in
two RDDs: one with vertices keyed by ID, and one with edges. I make sure to
coalesce both these RDDs down to the same number of partitions beforehand;
it seems to be an unwritten precondition that the two RDDs need to have
equal numbers of partitions.

That was working fine until today, when I ran across a particular
combination of RDDs that, when passed in, cause Graph() to produce a graph
where the vertices RDD and edges RDD have different numbers of partitions.
I'm not sure what's special about these particular RDDs; they both have 3
partitions going in, but internally GraphImpl apparently does this to the
vertices RDD:

val partitioner = Partitioner.defaultPartitioner(vertices)
val vPartitioned = vertices.partitionBy(partitioner)

This seems to result in the vertices RDD being condensed down to 1
partition while the edges RDD still has 3, leading to an error.
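
For concreteness, a small sketch of that repartitioning step in isolation --
the showVertexRepartition helper is hypothetical; the two lines inside it are
the ones quoted from GraphImpl above:

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Partitioner.defaultPartitioner builds a HashPartitioner sized from
// spark.default.parallelism (or from the largest input RDD), which is how
// a 3-partition vertex RDD can come back with a different partition count.
def showVertexRepartition(vertices: RDD[(Long, String)]): Unit = {
  val partitioner = Partitioner.defaultPartitioner(vertices)
  val vPartitioned = vertices.partitionBy(partitioner)
  println(s"before: ${vertices.partitions.length} partitions, " +
    s"after: ${vPartitioned.partitions.length} partitions")
}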

So, I have two questions:

What, exactly, needs to be true about the RDDs that you pass to Graph() to
be sure of constructing a valid graph? (Do they need to have the same
number of partitions? The same number of partitions and no empty
partitions? Do you need to repartition them with their default partitioners
beforehand?)

Why does GraphImpl repartition the vertices RDD?

I'm using Spark 1.0.0-incubating-SNAPSHOT, if it helps.

Thanks,
-Adam Novak
UCSC Bioinformatics Ph.D. Student


Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nan Zhu
This may be unrelated to the question itself, just FYI:

you can run your driver program on a worker node with Spark 0.9

http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster

Best, 

-- 
Nan Zhu



On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas wrote:

> Alright, so I guess I understand now why spark-ec2 allows you to select 
> different instance types for the driver node and worker nodes. If the driver 
> node is just driving and not doing any large collect()s or heavy processing, 
> it can be much smaller than the worker nodes.
> 
> With regards to data locality, that may not be an issue in my usage pattern 
> if, in theory, I wanted to make the driver node also do work. I launch 
> clusters using spark-ec2 and source data from S3, so I'm missing out on that 
> data locality benefit from the get-go. The firewall may be an issue if 
> spark-ec2 doesn't punch open the appropriate holes. And it may well not, 
> since it doesn't seem to have an option to configure the driver node to also 
> do work. 
> 
> Anyway, I'll definitely leave things the way they are. If I want a beefier 
> cluster, it's probably much easier to just launch a cluster with more slaves 
> using spark-ec2 than it is to set the driver node to a non-default 
> configuration. 
> 
> 
> On Tue, Apr 8, 2014 at 4:48 PM, Sean Owen wrote:
> > If you want the machine that hosts the driver to also do work, you can
> > designate it as a worker too, if I'm not mistaken. I don't think the
> > driver should do work, logically, but, that's not to say that the
> > machine it's on shouldn't do work.
> > --
> > Sean Owen | Director, Data Science | London
> > 
> > 
> > On Tue, Apr 8, 2014 at 8:24 PM, Nicholas Chammas wrote:
> > > So I have a cluster in EC2 doing some work, and when I take a look here
> > >
> > > http://driver-node:4040/executors/
> > >
> > > I see that my driver node is snoozing on the job: No tasks, no memory 
> > > used,
> > > and no RDD blocks cached.
> > >
> > > I'm assuming that it was a conscious design choice not to have the driver
> > > node partake in the cluster's workload.
> > >
> > > Why is that? It seems like a wasted resource.
> > >
> > > What's more, the slaves may rise up one day and overthrow the driver out 
> > > of
> > > resentment.
> > >
> > > Nick
> > >
> > >
> > > 
> > > View this message in context: Why doesn't the driver node do any work?
> > > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 



Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
Alright, so I guess I understand now why spark-ec2 allows you to select
different instance types for the driver node and worker nodes. If the
driver node is just driving and not doing any large collect()s or heavy
processing, it can be much smaller than the worker nodes.

With regards to data locality, that may not be an issue in my usage pattern
if, in theory, I wanted to make the driver node also do work. I launch
clusters using spark-ec2 and source data from S3, so I'm missing out on
that data locality benefit from the get-go. The firewall may be an issue if
spark-ec2 doesn't punch open the appropriate holes. And it may well not,
since it doesn't seem to have an option to configure the driver node to
also do work.

Anyway, I'll definitely leave things the way they are. If I want a beefier
cluster, it's probably much easier to just launch a cluster with more
slaves using spark-ec2 than it is to set the driver node to a non-default
configuration.


On Tue, Apr 8, 2014 at 4:48 PM, Sean Owen  wrote:

> If you want the machine that hosts the driver to also do work, you can
> designate it as a worker too, if I'm not mistaken. I don't think the
> driver should do work, logically, but, that's not to say that the
> machine it's on shouldn't do work.
> --
> Sean Owen | Director, Data Science | London
>
>
> On Tue, Apr 8, 2014 at 8:24 PM, Nicholas Chammas
>  wrote:
> > So I have a cluster in EC2 doing some work, and when I take a look here
> >
> > http://driver-node:4040/executors/
> >
> > I see that my driver node is snoozing on the job: No tasks, no memory
> used,
> > and no RDD blocks cached.
> >
> > I'm assuming that it was a conscious design choice not to have the driver
> > node partake in the cluster's workload.
> >
> > Why is that? It seems like a wasted resource.
> >
> > What's more, the slaves may rise up one day and overthrow the driver out
> of
> > resentment.
> >
> > Nick
> >
> >
> > 
> > View this message in context: Why doesn't the driver node do any work?
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Andrew Ash
One thing you could do is create an RDD of [1,2,3] and set a partitioner
that puts all three values on their own nodes.  Then .foreach() over the
RDD and call your function that will run on each node.

Why do you need to run the function on every node?  Is it some sort of
setup code that needs to be run before other RDD operations?
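
A rough sketch of that one-element-per-partition idea (names here are
hypothetical, and Spark does not strictly guarantee one partition per
machine without locality preferences):

import org.apache.spark.SparkContext

// Create as many single-element partitions as expected workers and run the
// setup function once per partition; with a partition per worker, each
// worker will typically run it once.
def runOnEachWorker(sc: SparkContext, numWorkers: Int)(setup: () => Unit): Unit = {
  sc.parallelize(1 to numWorkers, numWorkers).foreachPartition(_ => setup())
}

// e.g. runOnEachWorker(sc, 3)(() => loadNativeLibraries())  // hypothetical setup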


On Tue, Apr 8, 2014 at 3:05 AM, Adnan  wrote:

> Hello,
>
> I am running Cloudera 4 node cluster with 1 Master and 3 Slaves. I am
> connecting with Spark Master from scala using SparkContext. I am trying to
> execute a simple java function from the distributed jar on every Spark
> Worker but haven't found a way to communicate with each worker or a Spark
> API function to do it.
>
> Can somebody help me with it or point me in the right direction?
>
> Regards,
> Adnan
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-execute-a-function-from-class-in-distributed-jar-on-each-worker-node-tp3870.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Why doesn't the driver node do any work?

2014-04-08 Thread Sean Owen
If you want the machine that hosts the driver to also do work, you can
designate it as a worker too, if I'm not mistaken. I don't think the
driver should do work, logically, but, that's not to say that the
machine it's on shouldn't do work.
--
Sean Owen | Director, Data Science | London


On Tue, Apr 8, 2014 at 8:24 PM, Nicholas Chammas
 wrote:
> So I have a cluster in EC2 doing some work, and when I take a look here
>
> http://driver-node:4040/executors/
>
> I see that my driver node is snoozing on the job: No tasks, no memory used,
> and no RDD blocks cached.
>
> I'm assuming that it was a conscious design choice not to have the driver
> node partake in the cluster's workload.
>
> Why is that? It seems like a wasted resource.
>
> What's more, the slaves may rise up one day and overthrow the driver out of
> resentment.
>
> Nick
>
>
> 
> View this message in context: Why doesn't the driver node do any work?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Shivaram Venkataraman
Is there any reason why you want to start with a vanilla amazon AMI rather
than the ones we build and provide as a part of Spark EC2 scripts ? The
AMIs we provide are close to the vanilla AMI but have the root account
setup properly and install packages like java that are used by Spark.

If you wish to customize the AMI, you could always start with our AMI and
add more packages you like -- I have definitely done this recently and it
works with HVM and PVM as far as I can tell.

Shivaram


On Tue, Apr 8, 2014 at 8:50 AM, Marco Costantini <
silvio.costant...@granatads.com> wrote:

> I was able to keep the "workaround" ...around... by overwriting the
> generated '/root/.ssh/authorized_keys' file with a known good one, in the
> '/etc/rc.local' file
>
>
> On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini <
> silvio.costant...@granatads.com> wrote:
>
>> Another thing I didn't mention. The AMI and user used: naturally I've
>> created several of my own AMIs with the following characteristics. None of
>> which worked.
>>
>> 1) Enabling ssh as root as per this guide (
>> http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
>> When doing this, I do not specify a user for the spark-ec2 script. What
>> happens is that, it works! But only while it's alive. If I stop the
>> instance, create an AMI, and launch a new instance based from the new AMI,
>> the change I made in the '/root/.ssh/authorized_keys' file is overwritten
>>
>> 2) adding the 'ec2-user' to the 'root' group. This means that the
>> ec2-user does not have to use sudo to perform any operations needing root
>> privileges. When doing this, I specify the user 'ec2-user' for the
>> spark-ec2 script. An error occurs: rsync fails with exit code 23.
>>
>> I believe HVMs still work. But it would be valuable to the community to
>> know that the root user work-around does/doesn't work any more for
>> paravirtual instances.
>>
>> Thanks,
>> Marco.
>>
>>
>> On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini <
>> silvio.costant...@granatads.com> wrote:
>>
>>> As requested, here is the script I am running. It is a simple shell
>>> script which calls spark-ec2 wrapper script. I execute it from the 'ec2'
>>> directory of spark, as usual. The AMI used is the raw one from the AWS
>>> Quick Start section. It is the first option (an Amazon Linux paravirtual
>>> image). Any ideas or confirmation would be GREATLY appreciated. Please and
>>> thank you.
>>>
>>>
>>> #!/bin/sh
>>>
>>> export AWS_ACCESS_KEY_ID=MyCensoredKey
>>> export AWS_SECRET_ACCESS_KEY=MyCensoredKey
>>>
>>> AMI_ID=ami-2f726546
>>>
>>> ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10
>>> -v 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge
>>> launch marcotest
>>>
>>>
>>>
>>> On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman <
>>> shivaram.venkatara...@gmail.com> wrote:
>>>
 Hmm -- That is strange. Can you paste the command you are using to
 launch the instances ? The typical workflow is to use the spark-ec2 wrapper
 script using the guidelines at
 http://spark.apache.org/docs/latest/ec2-scripts.html

 Shivaram


 On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini <
 silvio.costant...@granatads.com> wrote:

> Hi Shivaram,
>
> OK so let's assume the script CANNOT take a different user and that it
> must be 'root'. The typical workaround is as you said, allow the ssh with
> the root user. Now, don't laugh, but, this worked last Friday, but today
> (Monday) it no longer works. :D Why? ...
>
> ...It seems that NOW, when you launch a 'paravirtual' ami, the root
> user's 'authorized_keys' file is always overwritten. This means the
> workaround doesn't work anymore! I would LOVE for someone to verify this.
>
> Just to point out, I am trying to make this work with a paravirtual
> instance and not an HVM instance.
>
> Please and thanks,
> Marco.
>
>
> On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman <
> shivaram.venkatara...@gmail.com> wrote:
>
>> Right now the spark-ec2 scripts assume that you have root access, and
>> a lot of internal scripts have the user's home directory hard-coded
>> as /root. However, all the Spark AMIs we build should have root ssh
>> access -- do you find this not to be the case?
>>
>> You can also enable root ssh access in a vanilla AMI by editing
>> /etc/ssh/sshd_config and setting "PermitRootLogin" to yes
>>
>> Thanks
>> Shivaram
>>
>>
>>
>> On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini <
>> silvio.costant...@granatads.com> wrote:
>>
>>> Hi all,
>>> On the old Amazon Linux EC2 images, the user 'root' was enabled for
>>> ssh. Also, it is the default user for the Spark-EC2 script.
>>>
>>> Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
>>> instead of 'root'.
>>>
>

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Andrew Ash
One downside I can think of to having the driver node act as a temporary
member of the cluster would be that you may have firewalls between the
workers and the driver machine that would prevent shuffles from working
properly.  Now you'd need to poke holes in the firewalls to get the cluster
to properly run jobs.

Additionally, if you rely on data locality for your Spark jobs (e.g. having
Spark and HDFS services co-located on every machine) to get decent
performance then having the additional temporary member might actually slow
the overall job down.  Since it may not be co-located with data, you might
observe the straggler effect on that machine.

I for one prefer the current delegation of responsibilities between the
driver and the workers.


On Tue, Apr 8, 2014 at 12:24 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> So I have a cluster in EC2 doing some work, and when I take a look here
>
> http://driver-node:4040/executors/
>
> I see that my driver node is snoozing on the job: No tasks, no memory
> used, and no RDD blocks cached.
>
> I'm assuming that it was a conscious design choice not to have the driver
> node partake in the cluster's workload.
>
> Why is that? It seems like a wasted resource.
>
> What's more, the slaves may rise up one day and overthrow the driver out
> of resentment.
>
> Nick
>
>
> --
> View this message in context: Why doesn't the driver node do any work?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread Andrew Ash
If you set up Spark's metrics reporting to write to the Ganglia backend
that will give you a good idea of how much network/disk/CPU is being used
and on what machines.

https://spark.apache.org/docs/0.9.0/monitoring.html


On Tue, Apr 8, 2014 at 12:57 PM, yxzhao  wrote:

> Hi All,
>
>   I want to measure the total network traffic for a Spark Job. But I
> did
> not see related information from the log. Does anybody know how to measure
> it? Thanks very much in advance.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Network-Traffic-for-Spark-Job-tp3912.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark with SSL?

2014-04-08 Thread Andrew Ash
Not that I know of, but it would be great if that was supported.  The way I
typically handle security now is to put the Spark servers in their own
subnet with strict inbound/outbound firewalls.


On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka  wrote:

> Can Spark be configured to use SSL for all its network communication?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
our one cached RDD in this run has id 3



*** onStageSubmitted **
rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


*** onTaskEnd **
_rddInfoMap: Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
storageStatusList: List(StorageStatus(BlockManagerId(,
192.168.3.169, 34330, 0),579325132,Map()))


*** onStageCompleted **
_rddInfoMap: Map()


*** onStageSubmitted **
rddInfo: RDD "7" (7) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


*** updateRDDInfo **


*** onTaskEnd **
_rddInfoMap: Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
storageStatusList: List(StorageStatus(BlockManagerId(,
192.168.3.169, 34330, 0),579325132,Map(rdd_3_0 ->
BlockStatus(StorageLevel(false, true, false, true, 1),19944,0,0


*** onStageCompleted **
_rddInfoMap: Map()
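
(For reference, the same three callbacks can be traced without patching the
UI classes -- a hedged sketch, assuming the SparkListener API on current
master; the StorageDebugListener name is made up and the listener is
registered with sc.addSparkListener:)

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Debug listener that logs the same lifecycle events printed above.
class StorageDebugListener extends SparkListener {
  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit =
    println(s"*** onStageSubmitted ** $event")
  override def onTaskEnd(event: SparkListenerTaskEnd): Unit =
    println(s"*** onTaskEnd ** $event")
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit =
    println(s"*** onStageCompleted ** $event")
}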



On Tue, Apr 8, 2014 at 4:20 PM, Koert Kuipers  wrote:

> 1) at the end of the callback
>
> 2) yes we simply expose sc.getRDDStorageInfo to the user via REST
>
> 3) yes exactly. we define the RDDs at startup, all of them are cached.
> from that point on we only do calculations on these cached RDDs.
>
> i will add some more println statements for storageStatusList
>
>
>
> On Tue, Apr 8, 2014 at 4:01 PM, Andrew Or  wrote:
>
>> Hi Koert,
>>
>> Thanks for pointing this out. However, I am unable to reproduce this
>> locally. It seems that there is a discrepancy between what the
>> BlockManagerUI and the SparkContext think is persisted. This is strange
>> because both sources ultimately derive this information from the same place
>> - by doing sc.getExecutorStorageStatus. I have a couple of questions for
>> you:
>>
>> 1) In your print statements, do you print them in the beginning or at the
>> end of each callback? It would be good to keep them at the end, since in
>> the beginning the data structures have not been processed yet.
>> 2) You mention that you get the RDD info through your own API. How do you
>> get this information? Is it through sc.getRDDStorageInfo?
>> 3) What did your application do to produce this behavior? Did you make an
>> RDD, persist it once, and then use it many times afterwards or something
>> similar?
>>
>> It would be super helpful if you could also print out what
>> StorageStatusListener's storageStatusList looks like by the end of each
>> onTaskEnd. I will continue to look into this on my side, but do let me know
>> once you have any updates.
>>
>> Andrew
>>
>>
>> On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers  wrote:
>>
>>> yet at same time i can see via our own api:
>>>
>>> "storageInfo": {
>>> "diskSize": 0,
>>> "memSize": 19944,
>>> "numCachedPartitions": 1,
>>> "numPartitions": 1
>>> }
>>>
>>>
>>>
>>> On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers  wrote:
>>>
 i put some println statements in BlockManagerUI

 i have RDDs that are cached in memory. I see this:


 *** onStageSubmitted **
 rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false,
 false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
 B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


 *** onTaskEnd **
 Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B)


 *** onStageCompleted **
 Map()

 *** onStageSubmitted **
 rddInfo: RDD "7" (7) Storage: StorageLevel(false, false, false, false,
 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
 0.0 B; DiskSize: 0.0 B
 _rddInfoMap: Map(7 -> 

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
1) at the end of the callback

2) yes we simply expose sc.getRDDStorageInfo to the user via REST

3) yes exactly. we define the RDDs at startup, all of them are cached. from
that point on we only do calculations on these cached RDDs.

i will add some more println statements for storageStatusList
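
For context, a minimal sketch of the kind of read-out mentioned in (2) -- the
cachedRddSummary helper is hypothetical; the fields match what
sc.getRDDStorageInfo reports and what our REST endpoint exposes:

import org.apache.spark.SparkContext

// Summarize cached RDDs straight from the driver.
def cachedRddSummary(sc: SparkContext): Array[String] =
  sc.getRDDStorageInfo.map { info =>
    s"rdd ${info.id} '${info.name}': ${info.numCachedPartitions}/${info.numPartitions} " +
      s"partitions cached, memSize=${info.memSize}, diskSize=${info.diskSize}"
  }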



On Tue, Apr 8, 2014 at 4:01 PM, Andrew Or  wrote:

> Hi Koert,
>
> Thanks for pointing this out. However, I am unable to reproduce this
> locally. It seems that there is a discrepancy between what the
> BlockManagerUI and the SparkContext think is persisted. This is strange
> because both sources ultimately derive this information from the same place
> - by doing sc.getExecutorStorageStatus. I have a couple of questions for
> you:
>
> 1) In your print statements, do you print them in the beginning or at the
> end of each callback? It would be good to keep them at the end, since in
> the beginning the data structures have not been processed yet.
> 2) You mention that you get the RDD info through your own API. How do you
> get this information? Is it through sc.getRDDStorageInfo?
> 3) What did your application do to produce this behavior? Did you make an
> RDD, persist it once, and then use it many times afterwards or something
> similar?
>
> It would be super helpful if you could also print out what
> StorageStatusListener's storageStatusList looks like by the end of each
> onTaskEnd. I will continue to look into this on my side, but do let me know
> once you have any updates.
>
> Andrew
>
>
> On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers  wrote:
>
>> yet at same time i can see via our own api:
>>
>> "storageInfo": {
>> "diskSize": 0,
>> "memSize": 19944,
>> "numCachedPartitions": 1,
>> "numPartitions": 1
>> }
>>
>>
>>
>> On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers  wrote:
>>
>>> i put some println statements in BlockManagerUI
>>>
>>> i have RDDs that are cached in memory. I see this:
>>>
>>>
>>> *** onStageSubmitted **
>>> rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false,
>>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>>> 0.0 B; DiskSize: 0.0 B
>>> _rddInfoMap: Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false,
>>> false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
>>> B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
>>>
>>>
>>> *** onTaskEnd **
>>> Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false, false, false,
>>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>>> 0.0 B; DiskSize: 0.0 B)
>>>
>>>
>>> *** onStageCompleted **
>>> Map()
>>>
>>> *** onStageSubmitted **
>>> rddInfo: RDD "7" (7) Storage: StorageLevel(false, false, false, false,
>>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>>> 0.0 B; DiskSize: 0.0 B
>>> _rddInfoMap: Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false,
>>> false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
>>> B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
>>>
>>> *** onTaskEnd **
>>> Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false, false, false,
>>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>>> 0.0 B; DiskSize: 0.0 B)
>>>
>>> *** onStageCompleted **
>>> Map()
>>>
>>>
>>> The storagelevels you see here are never the ones of my RDDs. and
>>> apparently updateRDDInfo never gets called (i had println in there too).
>>>
>>>
>>> On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers  wrote:
>>>
 yes i am definitely using latest


 On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng  wrote:

> That commit fixed the exact problem you described. That is why I want
> to confirm that you switched to the master branch. bin/spark-shell doesn't
> detect code changes, so you need to run ./make-distribution.sh to
> re-compile Spark first. -Xiangrui
>
>
> On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers wrote:
>
>> sorry, i meant to say: note that for a cached rdd in the spark shell
>> it all works fine. but something is going wrong with the
>> SPARK-APPLICATION-UI in our applications that extensively cache and 
>> re-use
>> RDDs
>>
>>
>> On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers wrote:
>>
>>> note that for a cached rdd in the spark shell it all works fine. but
>>> something is going wrong with the spark-shell in our applications that
>>> extensively cache and re-use RDDs
>>>
>>>
>>> On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers wrote:
>>>
 i tried again with latest master, which includes commit below, but
 ui page still shows nothing on storage tab.
  koert



 commit ada310a9d3d5419e10

Spark with SSL?

2014-04-08 Thread kamatsuoka
Can Spark be configured to use SSL for all its network communication?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: ETL for postgres to hadoop

2014-04-08 Thread andy petrella
Hello Manas,

I don't know Sqoop that well, but my best guess is that you're probably
using PostGIS, which has its own structures for Geometry and so on. And if
you need spatial operators, my gut feeling is that things will be harder ^^
(but a raw import won't need that...).

So I did a quick check in the Sqoop documentation and it looks like
implementing a connector for this kind of structure should do the trick
(check this: http://sqoop.apache.org/docs/1.99.3/ConnectorDevelopment.html).

In any case, I'll be very interested in this kind of stuff! More than
that, having such an import tool for the Oracle Spatial cartridge would be
great as well :-P.

my2c,


Andy Petrella
Belgium (Liège)

*   *
 Data Engineer in *NextLab  sprl* (owner)
 Engaged Citizen Coder for *WAJUG * (co-founder)
 Author of *Learning Play! Framework 2
*
 Bio: on visify 
*   *
Mobile: *+32 495 99 11 04*
Mails:

   - andy.petre...@nextlab.be
   - andy.petre...@gmail.com

*   *
Socials:

   - Twitter: https://twitter.com/#!/noootsab
   - LinkedIn: http://be.linkedin.com/in/andypetrella
   - Blogger: http://ska-la.blogspot.com/
   - GitHub:  https://github.com/andypetrella
   - Masterbranch: https://masterbranch.com/andy.petrella



On Tue, Apr 8, 2014 at 10:00 PM, Manas Kar  wrote:

>  Hi All,
>
> I have some spatial data in postgres machine. I want to be
> able to move that data to Hadoop and do some geo-processing.
>
> I tried using sqoop to move the data to Hadoop but it complained about the
> position data(which it says can’t recognize)
>
> Does anyone have any idea as to how to do it easily?
>
>
>
> Thanks
>
> Manas
>
>
>
>    Manas Kar  
> Intermediate
> Software Developer, Product Development | exactEarth Ltd. 60 Struck
> Ct. Cambridge, Ontario N1R 8L2  office. +1.519.622.4445 ext. 5869 |
> direct: +1.519.620.5869  email. manas@exactearth.com
>
> web. www.exactearth.com
>
>
>
>
>  This e-mail and any attachment is for authorized use by the intended
> recipient(s) only. It contains proprietary or confidential information and
> is not to be copied, disclosed to, retained or used by, any other party. If
> you are not an intended recipient then please promptly delete this e-mail,
> any attachment and all copies and inform the sender. Thank you.
>

Re: ui broken in latest 1.0.0

2014-04-08 Thread Andrew Or
Hi Koert,

Thanks for pointing this out. However, I am unable to reproduce this
locally. It seems that there is a discrepancy between what the
BlockManagerUI and the SparkContext think is persisted. This is strange
because both sources ultimately derive this information from the same place
- by doing sc.getExecutorStorageStatus. I have a couple of questions for
you:

1) In your print statements, do you print them in the beginning or at the
end of each callback? It would be good to keep them at the end, since in
the beginning the data structures have not been processed yet.
2) You mention that you get the RDD info through your own API. How do you
get this information? Is it through sc.getRDDStorageInfo?
3) What did your application do to produce this behavior? Did you make an
RDD, persist it once, and then use it many times afterwards or something
similar?

It would be super helpful if you could also print out what
StorageStatusListener's storageStatusList looks like by the end of each
onTaskEnd. I will continue to look into this on my side, but do let me know
once you have any updates.
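
For example, a standalone listener along these lines (just a sketch against the
public SparkListener API of this 1.0 snapshot, rather than a patch to Spark's
internal StorageStatusListener) would dump the driver-side storage view at the
end of every task:

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// sketch only: prints what the driver currently believes is stored, per RDD
// and per executor, at the end of every task
class StorageDebugListener(sc: SparkContext) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    println("*** onTaskEnd: driver-side storage view ***")
    sc.getRDDStorageInfo.foreach(println)
    sc.getExecutorStorageStatus.foreach { status =>
      println("executor " + status.blockManagerId.executorId +
        ": " + status.blocks.size + " blocks stored")
    }
  }
}

// register it from the application before running any jobs:
// sc.addSparkListener(new StorageDebugListener(sc))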

Andrew


On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers  wrote:

> yet at same time i can see via our own api:
>
> "storageInfo": {
> "diskSize": 0,
> "memSize": 19944,
> "numCachedPartitions": 1,
> "numPartitions": 1
> }
>
>
>
> On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers  wrote:
>
>> i put some println statements in BlockManagerUI
>>
>> i have RDDs that are cached in memory. I see this:
>>
>>
>> *** onStageSubmitted **
>> rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false,
>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>> 0.0 B; DiskSize: 0.0 B
>> _rddInfoMap: Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false,
>> false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
>> B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
>>
>>
>> *** onTaskEnd **
>> Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false, false, false,
>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>> 0.0 B; DiskSize: 0.0 B)
>>
>>
>> *** onStageCompleted **
>> Map()
>>
>> *** onStageSubmitted **
>> rddInfo: RDD "7" (7) Storage: StorageLevel(false, false, false, false,
>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>> 0.0 B; DiskSize: 0.0 B
>> _rddInfoMap: Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false,
>> false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
>> B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
>>
>> *** onTaskEnd **
>> Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false, false, false,
>> 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize:
>> 0.0 B; DiskSize: 0.0 B)
>>
>> *** onStageCompleted **
>> Map()
>>
>>
>> The storagelevels you see here are never the ones of my RDDs. and
>> apparently updateRDDInfo never gets called (i had println in there too).
>>
>>
>> On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers  wrote:
>>
>>> yes i am definitely using latest
>>>
>>>
>>> On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng  wrote:
>>>
 That commit fixed the exact problem you described. That is why I want
 to confirm that you switched to the master branch. bin/spark-shell doesn't
 detect code changes, so you need to run ./make-distribution.sh to
 re-compile Spark first. -Xiangrui


 On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers wrote:

> sorry, i meant to say: note that for a cached rdd in the spark shell
> it all works fine. but something is going wrong with the
> SPARK-APPLICATION-UI in our applications that extensively cache and re-use
> RDDs
>
>
> On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers wrote:
>
>> note that for a cached rdd in the spark shell it all works fine. but
>> something is going wrong with the spark-shell in our applications that
>> extensively cache and re-use RDDs
>>
>>
>> On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers wrote:
>>
>>> i tried again with latest master, which includes commit below, but
>>> ui page still shows nothing on storage tab.
>>>  koert
>>>
>>>
>>>
>>> commit ada310a9d3d5419e101b24d9b41398f609da1ad3
>>> Author: Andrew Or 
>>> Date:   Mon Mar 31 23:01:14 2014 -0700
>>>
>>> [Hot Fix #42] Persisted RDD disappears on storage page if re-used
>>>
>>> If a previously persisted RDD is re-used, its information
>>> disappears from the Storage page.
>>>
>>> This is because the tasks associated with re-using the RDD do
>>> not report the RDD's blocks as updated (which is correct). On stage 
>>> submit,
>>> however, we over

ETL for postgres to hadoop

2014-04-08 Thread Manas Kar
Hi All,
I have some spatial data on a Postgres machine. I want to be able
to move that data to Hadoop and do some geo-processing.
I tried using Sqoop to move the data to Hadoop, but it complained about the
position data (which it says it can't recognize).
Does anyone have any idea as to how to do it easily?

Thanks
Manas


   Manas Kar
Intermediate Software Developer, Product Development | exactEarth Ltd.

60 Struck Ct. Cambridge, Ontario N1R 8L2
office. +1.519.622.4445 ext. 5869 | direct: +1.519.620.5869
email. manas@exactearth.com

web. www.exactearth.com






This e-mail and any attachment is for authorized use by the intended 
recipient(s) only. It contains proprietary or confidential information and is 
not to be copied, disclosed to, retained or used by, any other party. If you 
are not an intended recipient then please promptly delete this e-mail, any 
attachment and all copies and inform the sender. Thank you.

Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Hi All,

  I want to measure the total network traffic for a Spark job, but I did
not see related information in the logs. Does anybody know how to measure
it? Thanks very much in advance.
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Network-Traffic-for-Spark-Job-tp3912.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
There are a couple of issues here which I was able to figure out.

1: We should not use the web port that we use to access the web UI. I was
using that initially, so it was not working.

2: All requests should go to the NameNode and not to anything else.

3: By replacing the address with localhost:9000 in the above request, it
started working fine.

One more question I have based on this: how can I make it work with a domain
name rather than localhost and a port? Maybe the answer is that you need to
change it in the core-site.xml file and specify the proper address rather
than localhost there?
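
For the record, here is a sketch of the working read (the hostname is a
placeholder; the port has to be the NameNode RPC port configured as
fs.default.name in core-site.xml, typically 9000, not the 50070 web UI port):

// placeholder hostname; point sc.textFile at the NameNode RPC address,
// not at the web UI port
val hdfsFile = sc.textFile("hdfs://namenode.example.com:9000/user/hadoop/reegal/4300.txt")
println(hdfsFile.count())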



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-Call-to-dev-17-29-25-4-50070-failed-on-local-exception-java-io-EOFException-tp3907p3910.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
So I have a cluster in EC2 doing some work, and when I take a look here

http://driver-node:4040/executors/

I see that my driver node is snoozing on the job: No tasks, no memory used,
and no RDD blocks cached.

I'm assuming that it was a conscious design choice not to have the driver
node partake in the cluster's workload.

Why is that? It seems like a wasted resource.

What's more, the slaves may rise up one day and overthrow the driver out of
resentment.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-doesn-t-the-driver-node-do-any-work-tp3909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
Thank you -- this actually helped a lot. Strangely it appears that the task
detail view is not accurate in 0.8 -- that view shows 425ms duration for
one of the tasks, but in the driver log I do indeed see Finished TID 125 in
10940ms.

On that "slow" worker I see the following:

14/04/08 18:06:24 INFO broadcast.HttpBroadcast: Started reading broadcast
variable 81
14/04/08 18:06:34 INFO storage.MemoryStore: ensureFreeSpace(503292) called
with curMem=26011999, maxMem=2662874480
14/04/08 18:06:34 INFO storage.MemoryStore: Block broadcast_81 stored as
values to memory (estimated size 491.5 KB, free 2.5 GB)
14/04/08 18:06:34 INFO broadcast.HttpBroadcast: *Reading broadcast variable
81 took 10.051249937 s*
14/04/08 18:06:34 INFO broadcast.HttpBroadcast: Started reading broadcast
variable 82
14/04/08 18:06:34 INFO storage.MemoryStore: ensureFreeSpace(503292) called
with curMem=26515291, maxMem=2662874480
14/04/08 18:06:34 INFO storage.MemoryStore: Block broadcast_82 stored as
values to memory (estimated size 491.5 KB, free 2.5 GB)
14/04/08 18:06:34 INFO broadcast.HttpBroadcast: Reading broadcast variable
82 took 0.027498244 s
14/04/08 18:06:34 INFO broadcast.HttpBroadcast: Started reading broadcast
variable 83
14/04/08 18:06:34 INFO storage.MemoryStore: ensureFreeSpace(503292) called
with curMem=27018583, maxMem=2662874480

and here is the same variable read on a different worker

14/04/08 18:06:24 INFO broadcast.HttpBroadcast: Started reading
broadcast variable 81
14/04/08 18:06:24 INFO storage.MemoryStore: ensureFreeSpace(503292)
called with curMem=35008553, maxMem=2662874480
14/04/08 18:06:24 INFO storage.MemoryStore: Block broadcast_81 stored
as values to memory (estimated size 491.5 KB, free 2.4 GB)
14/04/08 18:06:24 INFO broadcast.HttpBroadcast: Reading broadcast
variable 81 took 0.029252199 s


Any thoughts why read of variable 81 is so slow on one machine? The job is
a sum by key across several partitions -- it doesn't seem that variable 81
is any larger than the rest (per the log lines above), so it's puzzling
that it's taking so very long...


On Tue, Apr 8, 2014 at 12:24 PM, Aaron Davidson  wrote:

> Also, take a look at the driver logs -- if there is overhead before the
> first task is launched, the driver logs would likely reveal this.
>
>
> On Tue, Apr 8, 2014 at 9:21 AM, Aaron Davidson  wrote:
>
>> Off the top of my head, the most likely cause would be driver GC issues.
>> You can diagnose this by enabling GC printing at the driver and you can fix
>> this by increasing the amount of memory your driver program has (see
>> http://spark.apache.org/docs/0.9.0/tuning.html#garbage-collection-tuning
>> ).
>>
>> The "launch time" statistic would also be useful -- if all tasks are
>> launched at around the same time and complete within 300ms, yet the total
>> time is 10s, that strongly suggests that the overhead is coming before the
>> first task is launched. Similarly, it would be useful to know if there was
>> a large gap between launch times, or if it appeared that the launch times
>> were serial with respect to the durations. If for some reason Spark started
>> using only one executor, say, each task would take the same duration but
>> would be executed one after another.
>>
>>
>> On Tue, Apr 8, 2014 at 8:11 AM, Yana Kadiyska wrote:
>>
>>> Hi Spark users, I'm very much hoping someone can help me out.
>>>
>>> I have a strict performance requirement on a particular query. One of
>>> the stages shows great variance in duration -- from 300ms to 10sec.
>>>
>>> The stage is mapPartitionsWithIndex at Operator.scala:210 (running Spark
>>> 0.8)
>>>
>>> I have run the job quite a few times -- the details within the stage
>>> do not account for the overall duration shown for the stage. What
>>> could be taking up time that's not showing within the stage breakdown
>>> UI? Im thinking that reading the data in is reflected in the Duration
>>> column before, so caching should not be a reason(I'm not caching
>>> explicitly)?
>>>
>>> The details within the stage always show roughly the following (both
>>> for the 10second and 600ms query -- very little variation, nothing
>>> over 500ms, ShuffleWrite size is pretty comparable):
>>>
>>> Status  Locality Level  Executor  Launch Time  Duration  GC Time  Shuffle Write
>>> 1864  SUCCESS  NODE_LOCAL  ###  301 ms  8 ms   111.0 B
>>> 1863  SUCCESS  NODE_LOCAL  ###  273 ms         102.0 B
>>> 1862  SUCCESS  NODE_LOCAL  ###  245 ms         111.0 B
>>> 1861  SUCCESS  NODE_LOCAL  ###  326 ms  4 ms   102.0 B
>>> 1860  SUCCESS  NODE_LOCAL  ###  217 ms  6 ms   102.0 B
>>> 1859  SUCCESS  NODE_LOCAL  ###  277 ms         111.0 B
>>> 1858  SUCCESS  NODE_LOCAL  ###  262 ms         108.0 B
>>> 1857  SUCCESS  NODE_LOCAL  ###  217 ms  14 ms  112.0 B
>>> 1856  SUCCESS  NODE_LOCAL  ###  208 ms         109.0 B
>>> 1855  SUCCESS  NODE_LOCAL  ###  242 ms         74.0 B
>>> 1854  SUCCESS  NODE_LOCAL  ###  218 ms  3 ms   58.0 B
>>> 1853  SUCCESS  NODE_LOCAL  ###  254 ms  12 

java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
I am trying to read a file from HDFS in the Spark shell and getting the error
below. When I create the first RDD it works fine, but when I try to do a count
on that RDD, it throws a connection error. I have a single-node HDFS setup, and
Spark runs on the same machine. Please help. When I run the "jps" command on
the same box to check whether the Hadoop cluster is working as expected,
everything looks alright; see the output below.


[hadoop@idcrebalancedev ~]$ jps
23606 DataNode
28245 Jps
23982 TaskTracker
26537 Main
23738 SecondaryNameNode
23858 JobTracker
23488 NameNode


Below is the output for RDD creation and error on count.

scala> val hdfsFile =
sc.textFile("hdfs://idcrebalancedev.bxc.is-teledata.com:50070/user/hadoop/reegal/4300.txt")
14/04/08 14:28:35 INFO MemoryStore: ensureFreeSpace(784) called with
curMem=40160, maxMem=308713881
14/04/08 14:28:35 INFO MemoryStore: Block broadcast_7 stored as values to
memory (estimated size 784.0 B, free 294.4 MB)
hdfsFile: org.apache.spark.rdd.RDD[String] = MappedRDD[15] at textFile at
<console>:12

scala> hdfsFile.count()
java.io.IOException: Call to
idcrebalancedev.bxc.is-teledata.com/172.29.253.4:50070 failed on local
exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1107)
at org.apache.hadoop.ipc.Client.call(Client.java:1075)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy9.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:898)
at org.apache.spark.rdd.RDD.count(RDD.scala:720)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)
at .<init>(<console>:30)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
at
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:788)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:833)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:745)
at
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:593)
at
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:600)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:603)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:926)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:876)
   

Re: assumption that lib_managed is present

2014-04-08 Thread Aaron Davidson
Yup, sorry about that. This error message should not produce incorrect
behavior, but it is annoying. Posted a patch to fix it:
https://github.com/apache/spark/pull/361

Thanks for reporting it!


On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers  wrote:

> when i start spark-shell i now see
>
> ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or
> directory
>
> we do not package a lib_managed with our spark build (never did). maybe
> the logic in compute-classpath.sh that searches for datanucleus should
> check for the existence of lib_managed before doing ls on it?
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yet at same time i can see via our own api:

"storageInfo": {
"diskSize": 0,
"memSize": 19944,
"numCachedPartitions": 1,
"numPartitions": 1
}



On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers  wrote:

> i put some println statements in BlockManagerUI
>
> i have RDDs that are cached in memory. I see this:
>
>
> *** onStageSubmitted **
> rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false, 1);
> CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
> B; DiskSize: 0.0 B
> _rddInfoMap: Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false,
> false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
> B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
>
>
> *** onTaskEnd **
> Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false, false, false, 1);
> CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
> B; DiskSize: 0.0 B)
>
>
> *** onStageCompleted **
> Map()
>
> *** onStageSubmitted **
> rddInfo: RDD "7" (7) Storage: StorageLevel(false, false, false, false, 1);
> CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
> B; DiskSize: 0.0 B
> _rddInfoMap: Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false,
> false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
> B;TachyonSize: 0.0 B; DiskSize: 0.0 B)
>
> *** onTaskEnd **
> Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false, false, false, 1);
> CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
> B; DiskSize: 0.0 B)
>
> *** onStageCompleted **
> Map()
>
>
> The storagelevels you see here are never the ones of my RDDs. and
> apparently updateRDDInfo never gets called (i had println in there too).
>
>
> On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers  wrote:
>
>> yes i am definitely using latest
>>
>>
>> On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng  wrote:
>>
>>> That commit fixed the exact problem you described. That is why I want to
>>> confirm that you switched to the master branch. bin/spark-shell doesn't
>>> detect code changes, so you need to run ./make-distribution.sh to
>>> re-compile Spark first. -Xiangrui
>>>
>>>
>>> On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers  wrote:
>>>
 sorry, i meant to say: note that for a cached rdd in the spark shell it
 all works fine. but something is going wrong with the SPARK-APPLICATION-UI
 in our applications that extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers wrote:

> note that for a cached rdd in the spark shell it all works fine. but
> something is going wrong with the spark-shell in our applications that
> extensively cache and re-use RDDs
>
>
> On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers wrote:
>
>> i tried again with latest master, which includes commit below, but ui
>> page still shows nothing on storage tab.
>>  koert
>>
>>
>>
>> commit ada310a9d3d5419e101b24d9b41398f609da1ad3
>> Author: Andrew Or 
>> Date:   Mon Mar 31 23:01:14 2014 -0700
>>
>> [Hot Fix #42] Persisted RDD disappears on storage page if re-used
>>
>> If a previously persisted RDD is re-used, its information
>> disappears from the Storage page.
>>
>> This is because the tasks associated with re-using the RDD do not
>> report the RDD's blocks as updated (which is correct). On stage submit,
>> however, we overwrite any existing
>>
>> Author: Andrew Or 
>>
>> Closes #281 from andrewor14/ui-storage-fix and squashes the
>> following commits:
>>
>> 408585a [Andrew Or] Fix storage UI bug
>>
>>
>>
>> On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers wrote:
>>
>>> got it thanks
>>>
>>>
>>> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng wrote:
>>>
 This is fixed in https://github.com/apache/spark/pull/281. Please
 try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers 
 wrote:
 > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
 days ago
 > (apr 5) that the "application detail ui" no longer shows any RDDs
 on the
 > storage tab, despite the fact that they are definitely cached.
 >
 > i am running spark in standalone mode.

>>>
>>>
>>
>

>>>
>>
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i put some println statements in BlockManagerUI

i have RDDs that are cached in memory. I see this:


*** onStageSubmitted **
rddInfo: RDD "2" (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)


*** onTaskEnd **
Map(2 -> RDD "2" (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B)


*** onStageCompleted **
Map()

*** onStageSubmitted **
rddInfo: RDD "7" (7) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false,
false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
B;TachyonSize: 0.0 B; DiskSize: 0.0 B)

*** onTaskEnd **
Map(7 -> RDD "7" (7) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B)

*** onStageCompleted **
Map()


The storagelevels you see here are never the ones of my RDDs. and
apparently updateRDDInfo never gets called (i had println in there too).


On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers  wrote:

> yes i am definitely using latest
>
>
> On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng  wrote:
>
>> That commit fixed the exact problem you described. That is why I want to
>> confirm that you switched to the master branch. bin/spark-shell doesn't
>> detect code changes, so you need to run ./make-distribution.sh to
>> re-compile Spark first. -Xiangrui
>>
>>
>> On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers  wrote:
>>
>>> sorry, i meant to say: note that for a cached rdd in the spark shell it
>>> all works fine. but something is going wrong with the SPARK-APPLICATION-UI
>>> in our applications that extensively cache and re-use RDDs
>>>
>>>
>>> On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers wrote:
>>>
 note that for a cached rdd in the spark shell it all works fine. but
 something is going wrong with the spark-shell in our applications that
 extensively cache and re-use RDDs


 On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers wrote:

> i tried again with latest master, which includes commit below, but ui
> page still shows nothing on storage tab.
>  koert
>
>
>
> commit ada310a9d3d5419e101b24d9b41398f609da1ad3
> Author: Andrew Or 
> Date:   Mon Mar 31 23:01:14 2014 -0700
>
> [Hot Fix #42] Persisted RDD disappears on storage page if re-used
>
> If a previously persisted RDD is re-used, its information
> disappears from the Storage page.
>
> This is because the tasks associated with re-using the RDD do not
> report the RDD's blocks as updated (which is correct). On stage submit,
> however, we overwrite any existing
>
> Author: Andrew Or 
>
> Closes #281 from andrewor14/ui-storage-fix and squashes the
> following commits:
>
> 408585a [Andrew Or] Fix storage UI bug
>
>
>
> On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers wrote:
>
>> got it thanks
>>
>>
>> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng wrote:
>>
>>> This is fixed in https://github.com/apache/spark/pull/281. Please
>>> try
>>> again with the latest master. -Xiangrui
>>>
>>> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers 
>>> wrote:
>>> > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
>>> days ago
>>> > (apr 5) that the "application detail ui" no longer shows any RDDs
>>> on the
>>> > storage tab, despite the fact that they are definitely cached.
>>> >
>>> > i am running spark in standalone mode.
>>>
>>
>>
>

>>>
>>
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i am definitely using latest


On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng  wrote:

> That commit fixed the exact problem you described. That is why I want to
> confirm that you switched to the master branch. bin/spark-shell doesn't
> detect code changes, so you need to run ./make-distribution.sh to
> re-compile Spark first. -Xiangrui
>
>
> On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers  wrote:
>
>> sorry, i meant to say: note that for a cached rdd in the spark shell it
>> all works fine. but something is going wrong with the SPARK-APPLICATION-UI
>> in our applications that extensively cache and re-use RDDs
>>
>>
>> On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers  wrote:
>>
>>> note that for a cached rdd in the spark shell it all works fine. but
>>> something is going wrong with the spark-shell in our applications that
>>> extensively cache and re-use RDDs
>>>
>>>
>>> On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers wrote:
>>>
 i tried again with latest master, which includes commit below, but ui
 page still shows nothing on storage tab.
  koert



 commit ada310a9d3d5419e101b24d9b41398f609da1ad3
 Author: Andrew Or 
 Date:   Mon Mar 31 23:01:14 2014 -0700

 [Hot Fix #42] Persisted RDD disappears on storage page if re-used

 If a previously persisted RDD is re-used, its information
 disappears from the Storage page.

 This is because the tasks associated with re-using the RDD do not
 report the RDD's blocks as updated (which is correct). On stage submit,
 however, we overwrite any existing

 Author: Andrew Or 

 Closes #281 from andrewor14/ui-storage-fix and squashes the
 following commits:

 408585a [Andrew Or] Fix storage UI bug



 On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers wrote:

> got it thanks
>
>
> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng wrote:
>
>> This is fixed in https://github.com/apache/spark/pull/281. Please try
>> again with the latest master. -Xiangrui
>>
>> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers 
>> wrote:
>> > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
>> days ago
>> > (apr 5) that the "application detail ui" no longer shows any RDDs
>> on the
>> > storage tab, despite the fact that they are definitely cached.
>> >
>> > i am running spark in standalone mode.
>>
>
>

>>>
>>
>


Re: Pig on Spark

2014-04-08 Thread Mayur Rustagi
Hi Ankit,
Thanx for all the work on Pig.
Finally got it working. A couple of high-level bugs right now:

   - Getting it working on Spark 0.9.0
   - Getting UDFs working
   - Getting the generate functionality working
   - An exhaustive test suite for Pig on Spark

Are you maintaining a JIRA somewhere?

I am currently trying to deploy it on 0.9.0.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi  wrote:

> We will post fixes from our side at - https://github.com/twitter/pig.
>
> Top on our list are-
> 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or
> 0.9 spark).
> 2. Support for algebraic udfs (this mitigates the group by oom problems).
>
> Would definitely love more contribution on this.
>
> Thanks,
> Aniket
>
>
> On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi 
> wrote:
>
>> Dam I am off to NY for Structure Conf. Would it be possible to meet
>> anytime after 28th March?
>> I am really interested in making it stable & production quality.
>>
>> Regards
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem wrote:
>>
>>> Hi Mayur,
>>> Are you going to the Pig meetup this afternoon?
>>> http://www.meetup.com/PigUser/events/160604192/
>>> Aniket and I will be there.
>>> We would be happy to chat about Pig-on-Spark
>>>
>>>
>>>
>>> On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi 
>>> wrote:
>>>
 Hi Lin,
 We are working on getting Pig on spark functional with 0.8.0, have you
 got it working on any spark version ?
 Also what all functionality works on it?
 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi 



 On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng wrote:

> Hi Sameer,
>
> Lin (cc'ed) could also give you some updates about Pig on Spark
> development on her side.
>
> Best,
> Xiangrui
>
> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak 
> wrote:
> > Hi Mayur,
> > We are planning to upgrade our distribution MR1> MR2 (YARN) and the
> goal is
> > to get SPROK set up next month. I will keep you posted. Can you
> please keep
> > me informed about your progress as well.
> >
> > 
> > From: mayur.rust...@gmail.com
> > Date: Mon, 10 Mar 2014 11:47:56 -0700
> >
> > Subject: Re: Pig on Spark
> > To: user@spark.apache.org
> >
> >
> > Hi Sameer,
> > Did you make any progress on this. My team is also trying it out
> would love
> > to know some detail so progress.
> >
> > Mayur Rustagi
> > Ph: +1 (760) 203 3257
> > http://www.sigmoidanalytics.com
> > @mayur_rustagi
> >
> >
> >
> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak 
> wrote:
> >
> > Hi Aniket,
> > Many thanks! I will check this out.
> >
> > 
> > Date: Thu, 6 Mar 2014 13:46:50 -0800
> > Subject: Re: Pig on Spark
> > From: aniket...@gmail.com
> > To: user@spark.apache.org; tgraves...@yahoo.com
> >
> >
> > There is some work to make this work on yarn at
> > https://github.com/aniket486/pig. (So, compile pig with ant
> > -Dhadoopversion=23)
> >
> > You can look at
> https://github.com/aniket486/pig/blob/spork/pig-spark to
> > find out what sort of env variables you need (sorry, I haven't been
> able to
> > clean this up- in-progress). There are few known issues with this, I
> will
> > work on fixing them soon.
> >
> > Known issues-
> > 1. Limit does not work (spork-fix)
> > 2. Foreach requires to turn off schema-tuple-backend (should be a
> pig-jira)
> > 3. Algebraic udfs dont work (spork-fix in-progress)
> > 4. Group by rework (to avoid OOMs)
> > 5. UDF Classloader issue (requires SPARK-1053, then you can put
> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf
> jars)
> >
> > ~Aniket
> >
> >
> >
> >
> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves 
> wrote:
> >
> > I had asked a similar question on the dev mailing list a while back
> (Jan
> > 22nd).
> >
> > See the archives:
> >
> http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser->
> > look for spork.
> >
> > Basically Matei said:
> >
> > Yup, that was it, though I believe people at Twitter picked it up
> again
> > recently. I'd suggest
> > asking Dmitriy if you know him. I've seen interest in this from
> several
> > other groups, and
> > if there's enough of it, maybe we can start another open source

Re: Spark and HBase

2014-04-08 Thread Nicholas Chammas
Just took a quick look at the overview here and the quick start guide here.

It looks like Apache Phoenix aims to provide flexible SQL access to data,
both for transactional and analytic purposes, and at interactive speeds.

Nick


On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang  wrote:

> First, I have not tried it myself. However, what I have heard it has some
> basic SQL features so you can query you HBase table like query content on
> HDFS using Hive.
> So it is not "query a simple column", I believe you can do joins and other
> SQL queries. Maybe you can wrap up an EMR cluster with Hbase preconfigured
> and give it a try.
>
> Sorry cannot provide more detailed explanation and help.
>
>
>
> On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier 
> wrote:
>
>> Thanks for the quick reply Bin. Phoenix is something I'm going to try for
>> sure, but it seems somewhat useless if I can use Spark.
>> Probably, as you said, since Phoenix uses a dedicated data structure
>> within each HBase table it has more effective memory usage, but if I need to
>> deserialize data stored in an HBase cell I still have to read that object
>> into memory, and thus I need Spark. From what I understood, Phoenix is good
>> if I have to query a simple column of HBase, but things get really
>> complicated if I have to add an index for each column in my table and store
>> complex objects within the cells. Is that correct?
>>
>> Best,
>> Flavio
>>
>>
>>
>>
>> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang  wrote:
>>
>>> Hi Flavio,
>>>
>>> I am actually attending the 2014 Apache Conf, where I heard about a
>>> project called "Apache Phoenix", which fully leverages HBase and is supposed
>>> to be 1000x faster than Hive. And it is not memory bounded, whereas memory
>>> does set a limit for Spark. It is still in the incubator, and the "stats"
>>> functions Spark has already implemented are still on its roadmap. I am not
>>> sure whether it will be good, but it might be something interesting to check
>>> out.
>>>
>>> /usr/bin
>>>
>>>
>>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier >> > wrote:
>>>
 Hi to everybody,

  in these days I looked a bit at the recent evolution of the big data
 stacks and it seems that HBase is somehow fading away in favour of
 Spark+HDFS. Am I correct?
 Do you think that Spark and HBase should work together or not?

 Best regards,
 Flavio

>>>
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i call an action after cache, and i can see that the RDDs are fully
cached using context.getRDDStorageInfo which we expose via our own api.

i did not run make-distribution.sh, we have our own scripts to build a
distribution. however if your question is if i correctly deployed the
latest build, let me check to be sure.
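
For what it's worth, this is roughly the check, sketched with a toy RDD:

// sketch: cache, materialize with an action, then confirm the driver sees
// the RDD as fully cached before comparing against the web UI
val rdd = sc.parallelize(1 to 1000).cache()
rdd.count()  // without an action, nothing is actually stored
sc.getRDDStorageInfo.find(_.id == rdd.id).foreach { info =>
  println("cached " + info.numCachedPartitions + "/" + info.numPartitions +
    " partitions, " + info.memSize + " bytes in memory")
}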


On Tue, Apr 8, 2014 at 12:43 PM, Xiangrui Meng  wrote:

> That commit did work for me. Could you confirm the following:
>
> 1) After you called cache(), did you make any actions like count() or
> reduce()? If you don't materialize the RDD, it won't show up in the
> storage tab.
>
> 2) Did you run ./make-distribution.sh after you switched to the current
> master?
>
> Xiangrui
>
> On Tue, Apr 8, 2014 at 9:33 AM, Koert Kuipers  wrote:
> > i tried again with latest master, which includes commit below, but ui
> page
> > still shows nothing on storage tab.
> > koert
> >
> >
> >
> > commit ada310a9d3d5419e101b24d9b41398f609da1ad3
> > Author: Andrew Or 
> > Date:   Mon Mar 31 23:01:14 2014 -0700
> >
> > [Hot Fix #42] Persisted RDD disappears on storage page if re-used
> >
> > If a previously persisted RDD is re-used, its information disappears
> > from the Storage page.
> >
> > This is because the tasks associated with re-using the RDD do not
> report
> > the RDD's blocks as updated (which is correct). On stage submit,
> however, we
> > overwrite any existing
> >
> > Author: Andrew Or 
> >
> > Closes #281 from andrewor14/ui-storage-fix and squashes the following
> > commits:
> >
> > 408585a [Andrew Or] Fix storage UI bug
> >
> >
> >
> > On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers  wrote:
> >>
> >> got it thanks
> >>
> >>
> >> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng  wrote:
> >>>
> >>> This is fixed in https://github.com/apache/spark/pull/281. Please try
> >>> again with the latest master. -Xiangrui
> >>>
> >>> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers 
> wrote:
> >>> > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
> days
> >>> > ago
> >>> > (apr 5) that the "application detail ui" no longer shows any RDDs on
> >>> > the
> >>> > storage tab, despite the fact that they are definitely cached.
> >>> >
> >>> > i am running spark in standalone mode.
> >>
> >>
> >
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit fixed the exact problem you described. That is why I want to
confirm that you switched to the master branch. bin/spark-shell doesn't
detect code changes, so you need to run ./make-distribution.sh to
re-compile Spark first. -Xiangrui


On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers  wrote:

> sorry, i meant to say: note that for a cached rdd in the spark shell it
> all works fine. but something is going wrong with the SPARK-APPLICATION-UI
> in our applications that extensively cache and re-use RDDs
>
>
> On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers  wrote:
>
>> note that for a cached rdd in the spark shell it all works fine. but
>> something is going wrong with the spark-shell in our applications that
>> extensively cache and re-use RDDs
>>
>>
>> On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers  wrote:
>>
>>> i tried again with latest master, which includes commit below, but ui
>>> page still shows nothing on storage tab.
>>>  koert
>>>
>>>
>>>
>>> commit ada310a9d3d5419e101b24d9b41398f609da1ad3
>>> Author: Andrew Or 
>>> Date:   Mon Mar 31 23:01:14 2014 -0700
>>>
>>> [Hot Fix #42] Persisted RDD disappears on storage page if re-used
>>>
>>> If a previously persisted RDD is re-used, its information disappears
>>> from the Storage page.
>>>
>>> This is because the tasks associated with re-using the RDD do not
>>> report the RDD's blocks as updated (which is correct). On stage submit,
>>> however, we overwrite any existing
>>>
>>> Author: Andrew Or 
>>>
>>> Closes #281 from andrewor14/ui-storage-fix and squashes the
>>> following commits:
>>>
>>> 408585a [Andrew Or] Fix storage UI bug
>>>
>>>
>>>
>>> On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers  wrote:
>>>
 got it thanks


 On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng  wrote:

> This is fixed in https://github.com/apache/spark/pull/281. Please try
> again with the latest master. -Xiangrui
>
> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers 
> wrote:
> > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
> days ago
> > (apr 5) that the "application detail ui" no longer shows any RDDs on
> the
> > storage tab, despite the fact that they are definitely cached.
> >
> > i am running spark in standalone mode.
>


>>>
>>
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
sorry, i meant to say: note that for a cached rdd in the spark shell it all
works fine. but something is going wrong with the SPARK-APPLICATION-UI in
our applications that extensively cache and re-use RDDs


On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers  wrote:

> note that for a cached rdd in the spark shell it all works fine. but
> something is going wrong with the spark-shell in our applications that
> extensively cache and re-use RDDs
>
>
> On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers  wrote:
>
>> i tried again with latest master, which includes commit below, but ui
>> page still shows nothing on storage tab.
>>  koert
>>
>>
>>
>> commit ada310a9d3d5419e101b24d9b41398f609da1ad3
>> Author: Andrew Or 
>> Date:   Mon Mar 31 23:01:14 2014 -0700
>>
>> [Hot Fix #42] Persisted RDD disappears on storage page if re-used
>>
>> If a previously persisted RDD is re-used, its information disappears
>> from the Storage page.
>>
>> This is because the tasks associated with re-using the RDD do not
>> report the RDD's blocks as updated (which is correct). On stage submit,
>> however, we overwrite any existing
>>
>> Author: Andrew Or 
>>
>> Closes #281 from andrewor14/ui-storage-fix and squashes the following
>> commits:
>>
>> 408585a [Andrew Or] Fix storage UI bug
>>
>>
>>
>> On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers  wrote:
>>
>>> got it thanks
>>>
>>>
>>> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng  wrote:
>>>
 This is fixed in https://github.com/apache/spark/pull/281. Please try
 again with the latest master. -Xiangrui

 On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers 
 wrote:
 > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few
 days ago
 > (apr 5) that the "application detail ui" no longer shows any RDDs on
 the
 > storage tab, despite the fact that they are definitely cached.
 >
 > i am running spark in standalone mode.

>>>
>>>
>>
>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
note that for a cached rdd in the spark shell it all works fine. but
something is going wrong with the spark-shell in our applications that
extensively cache and re-use RDDs


On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers  wrote:

> i tried again with latest master, which includes commit below, but ui page
> still shows nothing on storage tab.
> koert
>
>
>
> commit ada310a9d3d5419e101b24d9b41398f609da1ad3
> Author: Andrew Or 
> Date:   Mon Mar 31 23:01:14 2014 -0700
>
> [Hot Fix #42] Persisted RDD disappears on storage page if re-used
>
> If a previously persisted RDD is re-used, its information disappears
> from the Storage page.
>
> This is because the tasks associated with re-using the RDD do not
> report the RDD's blocks as updated (which is correct). On stage submit,
> however, we overwrite any existing
>
> Author: Andrew Or 
>
> Closes #281 from andrewor14/ui-storage-fix and squashes the following
> commits:
>
> 408585a [Andrew Or] Fix storage UI bug
>
>
>
> On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers  wrote:
>
>> got it thanks
>>
>>
>> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng  wrote:
>>
>>> This is fixed in https://github.com/apache/spark/pull/281. Please try
>>> again with the latest master. -Xiangrui
>>>
>>> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers  wrote:
>>> > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days
>>> ago
>>> > (apr 5) that the "application detail ui" no longer shows any RDDs on
>>> the
>>> > storage tab, despite the fact that they are definitely cached.
>>> >
>>> > i am running spark in standalone mode.
>>>
>>
>>
>


assumption that lib_managed is present

2014-04-08 Thread Koert Kuipers
when i start spark-shell i now see

ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or
directory

we do not package a lib_managed with our spark build (never did). maybe the
logic in compute-classpath.sh that searches for datanucleus should check
for the existence of lib_managed before doing ls on it?
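
Something along these lines would do it (a sketch, not the actual
compute-classpath.sh contents; the variable names are illustrative):

# only look for the datanucleus jars if this distribution ships lib_managed
if [ -d "$FWDIR/lib_managed/jars" ]; then
  DATANUCLEUS_JARS=$(ls "$FWDIR"/lib_managed/jars/datanucleus-*.jar 2>/dev/null)
fi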


RDD creation on HDFS

2014-04-08 Thread gtanguy
I read in the RDD paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf):

"For example, an RDD representing an HDFS file has a partition for each block
of the file and knows which machines each block is on"

And that on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

"To minimize global bandwidth consumption and read latency, HDFS tries to
satisfy a read request from a replica that is closest to the reader. If
there exists a replica on the same rack as the reader node, then that
replica is preferred to satisfy the read request"


If I need a block to be used on two datanodes, will it use the replica too,
or will there be network communication?
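
One way to see what Spark itself records about block placement is to ask the
RDD for its preferred locations (a sketch with a placeholder path; whether a
read then stays on a local replica is up to the HDFS client, as the HDFS
documentation quoted above says):

// sketch: print the preferred hosts for each HDFS-backed partition; these
// come from the block's replica locations
val rdd = sc.textFile("hdfs://namenode.example.com:9000/data/input.txt")
rdd.partitions.foreach { p =>
  println("partition " + p.index + ": " + rdd.preferredLocations(p).mkString(", "))
}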



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-creation-on-HDFS-tp3894.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit did work for me. Could you confirm the following:

1) After you called cache(), did you make any actions like count() or
reduce()? If you don't materialize the RDD, it won't show up in the
storage tab.

2) Did you run ./make-distribution.sh after you switched to the current master?

Xiangrui

On Tue, Apr 8, 2014 at 9:33 AM, Koert Kuipers  wrote:
> i tried again with latest master, which includes commit below, but ui page
> still shows nothing on storage tab.
> koert
>
>
>
> commit ada310a9d3d5419e101b24d9b41398f609da1ad3
> Author: Andrew Or 
> Date:   Mon Mar 31 23:01:14 2014 -0700
>
> [Hot Fix #42] Persisted RDD disappears on storage page if re-used
>
> If a previously persisted RDD is re-used, its information disappears
> from the Storage page.
>
> This is because the tasks associated with re-using the RDD do not report
> the RDD's blocks as updated (which is correct). On stage submit, however, we
> overwrite any existing
>
> Author: Andrew Or 
>
> Closes #281 from andrewor14/ui-storage-fix and squashes the following
> commits:
>
> 408585a [Andrew Or] Fix storage UI bug
>
>
>
> On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers  wrote:
>>
>> got it thanks
>>
>>
>> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng  wrote:
>>>
>>> This is fixed in https://github.com/apache/spark/pull/281. Please try
>>> again with the latest master. -Xiangrui
>>>
>>> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers  wrote:
>>> > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days
>>> > ago
>>> > (apr 5) that the "application detail ui" no longer shows any RDDs on
>>> > the
>>> > storage tab, despite the fact that they are definitely cached.
>>> >
>>> > i am running spark in standalone mode.
>>
>>
>


Re: Spark and HBase

2014-04-08 Thread Bin Wang
First, I have not tried it myself. However, from what I have heard it has some
basic SQL features, so you can query your HBase table much like you query
content on HDFS using Hive.
So it is not just "query a simple column"; I believe you can do joins and other
SQL queries. Maybe you can spin up an EMR cluster with HBase preconfigured
and give it a try.

Sorry cannot provide more detailed explanation and help.



On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier wrote:

> Thanks for the quick reply Bin. Phoenix is something I'm going to try for
> sure, but it seems somewhat useless if I can use Spark.
> Probably, as you said, since Phoenix uses a dedicated data structure within
> each HBase table it has more effective memory usage, but if I need to
> deserialize data stored in an HBase cell I still have to read that object
> into memory, and thus I need Spark. From what I understood, Phoenix is good
> if I have to query a simple column of HBase, but things get really
> complicated if I have to add an index for each column in my table and store
> complex objects within the cells. Is that correct?
>
> Best,
> Flavio
>
>
>
>
> On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang  wrote:
>
>> Hi Flavio,
>>
>> I am actually attending the 2014 Apache Conf, where I heard about a
>> project called "Apache Phoenix", which fully leverages HBase and is supposed
>> to be 1000x faster than Hive. And it is not memory bounded, whereas memory
>> does set a limit for Spark. It is still in the incubator, and the "stats"
>> functions Spark has already implemented are still on its roadmap. I am not
>> sure whether it will be good, but it might be something interesting to check
>> out.
>>
>> /usr/bin
>>
>>
>> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier 
>> wrote:
>>
>>> Hi to everybody,
>>>
>>>  in these days I looked a bit at the recent evolution of the big data
>>> stacks and it seems that HBase is somehow fading away in favour of
>>> Spark+HDFS. Am I correct?
>>> Do you think that Spark and HBase should work together or not?
>>>
>>> Best regards,
>>> Flavio
>>>
>>


Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i tried again with latest master, which includes commit below, but ui page
still shows nothing on storage tab.
koert



commit ada310a9d3d5419e101b24d9b41398f609da1ad3
Author: Andrew Or 
Date:   Mon Mar 31 23:01:14 2014 -0700

[Hot Fix #42] Persisted RDD disappears on storage page if re-used

If a previously persisted RDD is re-used, its information disappears
from the Storage page.

This is because the tasks associated with re-using the RDD do not
report the RDD's blocks as updated (which is correct). On stage submit,
however, we overwrite any existing

Author: Andrew Or 

Closes #281 from andrewor14/ui-storage-fix and squashes the following
commits:

408585a [Andrew Or] Fix storage UI bug



On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers  wrote:

> got it thanks
>
>
> On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui Meng  wrote:
>
>> This is fixed in https://github.com/apache/spark/pull/281. Please try
>> again with the latest master. -Xiangrui
>>
>> On Mon, Apr 7, 2014 at 1:06 PM, Koert Kuipers  wrote:
>> > i noticed that for spark 1.0.0-SNAPSHOT which i checked out a few days
>> ago
>> > (apr 5) that the "application detail ui" no longer shows any RDDs on the
>> > storage tab, despite the fact that they are definitely cached.
>> >
>> > i am running spark in standalone mode.
>>
>
>


Re: Driver Out of Memory

2014-04-08 Thread Aaron Davidson
The driver does not do processing (for the most part), but it does do
scheduling and block management, so it can keep around a significant amount
of metadata about all the stored RDD and broadcast blocks, as well as
statistics about prior executions. Much of this is bounded in some way,
though, so it is unusual for the driver to use too much memory. If you
could get a heap dump (using jmap) of the driver when it's using a lot of
memory, that would likely be very informative about what exactly is piling
up.


On Mon, Apr 7, 2014 at 3:12 PM, Eduardo Costa Alfaia  wrote:

>  Hi Guys,
>
> I would like understanding why the Driver's RAM goes down, Does the
> processing occur only in the workers?
> Thanks
> # Start Tests
> computer1(Worker/Source Stream)
>  23:57:18 up 12:03,  1 user,  load average: 0.03, 0.31, 0.44
>                    total   used   free  shared  buffers  cached
> Mem:                3945   1084   2860       0       44     827
> -/+ buffers/cache:          212   3732
> Swap:                  0      0      0
> computer8 (Driver/Master)
>  23:57:18 up 11:53,  5 users,  load average: 0.43, 1.19, 1.31
>                    total   used   free  shared  buffers  cached
> Mem:                5897   4430   1466       0      384    2662
> -/+ buffers/cache:         1382   4514
> Swap:                  0      0      0
> computer10(Worker/Source Stream)
>  23:57:18 up 12:02,  1 user,  load average: 0.55, 1.34, 0.98
>                    total   used   free  shared  buffers  cached
> Mem:                5897    564   5332       0       18     358
> -/+ buffers/cache:          187   5709
> Swap:                  0      0      0
> computer11(Worker/Source Stream)
>  23:57:18 up 12:02,  1 user,  load average: 0.07, 0.19, 0.29
>                    total   used   free  shared  buffers  cached
> Mem:                3945    603   3342       0       54     355
> -/+ buffers/cache:          193   3751
> Swap:                  0      0      0
>
>  After 2 Minutes
>
> computer1
>  00:06:41 up 12:12,  1 user,  load average: 3.11, 1.32, 0.73
>                    total   used   free  shared  buffers  cached
> Mem:                3945   2950    994       0       46    1095
> -/+ buffers/cache:         1808   2136
> Swap:                  0      0      0
> computer8(Driver/Master)
>  00:06:41 up 12:02,  5 users,  load average: 1.16, 0.71, 0.96
>                    total   used   free  shared  buffers  cached
> Mem:                5897   5191    705       0      385    2792
> -/+ buffers/cache:         2014   3882
> Swap:                  0      0      0
> computer10
>  00:06:41 up 12:11,  1 user,  load average: 2.02, 1.07, 0.89
>                    total   used   free  shared  buffers  cached
> Mem:                5897   2567   3329       0       21     647
> -/+ buffers/cache:         1898   3998
> Swap:                  0      0      0
> computer11
>  00:06:42 up 12:12,  1 user,  load average: 3.96, 1.83, 0.88
>                    total   used   free  shared  buffers  cached
> Mem:                3945   3542    402       0       57    1099
> -/+ buffers/cache:         2385   1559
> Swap:                  0      0      0
>
>
> Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Urgently need help interpreting duration

2014-04-08 Thread Aaron Davidson
Also, take a look at the driver logs -- if there is overhead before the
first task is launched, the driver logs would likely reveal this.


On Tue, Apr 8, 2014 at 9:21 AM, Aaron Davidson  wrote:

> Off the top of my head, the most likely cause would be driver GC issues.
> You can diagnose this by enabling GC printing at the driver and you can fix
> this by increasing the amount of memory your driver program has (see
> http://spark.apache.org/docs/0.9.0/tuning.html#garbage-collection-tuning).
>
> The "launch time" statistic would also be useful -- if all tasks are
> launched at around the same time and complete within 300ms, yet the total
> time is 10s, that strongly suggests that the overhead is coming before the
> first task is launched. Similarly, it would be useful to know if there was
> a large gap between launch times, or if it appeared that the launch times
> were serial with respect to the durations. If for some reason Spark started
> using only one executor, say, each task would take the same duration but
> would be executed one after another.
>
>
> On Tue, Apr 8, 2014 at 8:11 AM, Yana Kadiyska wrote:
>
>> Hi Spark users, I'm very much hoping someone can help me out.
>>
>> I have a strict performance requirement on a particular query. One of
>> the stages shows great variance in duration -- from 300ms to 10sec.
>>
>> The stage is mapPartitionsWithIndex at Operator.scala:210 (running Spark
>> 0.8)
>>
>> I have run the job quite a few times -- the details within the stage
>> do not account for the overall duration shown for the stage. What
>> could be taking up time that's not showing within the stage breakdown
>> UI? I'm thinking that reading the data in is reflected in the Duration
>> column before, so caching should not be a reason (I'm not caching
>> explicitly)?
>>
>> The details within the stage always show roughly the following (both
>> for the 10-second and the 600ms query -- very little variation, nothing
>> over 500ms, Shuffle Write size is pretty comparable):
>>
>> Status   Locality Level   Executor   Launch Time   Duration   GC Time   Shuffle Write
>> 1864   SUCCESS   NODE_LOCAL   ###   301 ms    8 ms   111.0 B
>> 1863   SUCCESS   NODE_LOCAL   ###   273 ms           102.0 B
>> 1862   SUCCESS   NODE_LOCAL   ###   245 ms           111.0 B
>> 1861   SUCCESS   NODE_LOCAL   ###   326 ms    4 ms   102.0 B
>> 1860   SUCCESS   NODE_LOCAL   ###   217 ms    6 ms   102.0 B
>> 1859   SUCCESS   NODE_LOCAL   ###   277 ms           111.0 B
>> 1858   SUCCESS   NODE_LOCAL   ###   262 ms           108.0 B
>> 1857   SUCCESS   NODE_LOCAL   ###   217 ms   14 ms   112.0 B
>> 1856   SUCCESS   NODE_LOCAL   ###   208 ms           109.0 B
>> 1855   SUCCESS   NODE_LOCAL   ###   242 ms            74.0 B
>> 1854   SUCCESS   NODE_LOCAL   ###   218 ms    3 ms    58.0 B
>> 1853   SUCCESS   NODE_LOCAL   ###   254 ms   12 ms   102.0 B
>> 1852   SUCCESS   NODE_LOCAL   ###   274 ms    8 ms    77.0 B
>>
>
>


Re: Urgently need help interpreting duration

2014-04-08 Thread Aaron Davidson
Off the top of my head, the most likely cause would be driver GC issues.
You can diagnose this by enabling GC printing at the driver and you can fix
this by increasing the amount of memory your driver program has (see
http://spark.apache.org/docs/0.9.0/tuning.html#garbage-collection-tuning).

The "launch time" statistic would also be useful -- if all tasks are
launched at around the same time and complete within 300ms, yet the total
time is 10s, that strongly suggests that the overhead is coming before the
first task is launched. Similarly, it would be useful to know if there was
a large gap between launch times, or if it appeared that the launch times
were serial with respect to the durations. If for some reason Spark started
using only one executor, say, each task would take the same duration but
would be executed one after another.
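
For a quick sanity check before restarting anything with new JVM flags, the
driver program can also report how much time its own JVM has spent in GC so
far; a small sketch (the println is purely illustrative):

import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Sum GC time and collection counts across all collectors of the driver JVM.
val gcBeans  = ManagementFactory.getGarbageCollectorMXBeans.asScala
val gcMillis = gcBeans.map(_.getCollectionTime).sum
val gcCount  = gcBeans.map(_.getCollectionCount).sum
println("driver GC so far: " + gcMillis + " ms over " + gcCount + " collections")

If that number grows sharply around the slow runs, the GC tuning above is the
place to start.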


On Tue, Apr 8, 2014 at 8:11 AM, Yana Kadiyska wrote:

> Hi Spark users, I'm very much hoping someone can help me out.
>
> I have a strict performance requirement on a particular query. One of
> the stages shows great variance in duration -- from 300ms to 10sec.
>
> The stage is mapPartitionsWithIndex at Operator.scala:210 (running Spark
> 0.8)
>
> I have run the job quite a few times -- the details within the stage
> do not account for the overall duration shown for the stage. What
> could be taking up time that's not showing within the stage breakdown
> UI? I'm thinking that reading the data in is reflected in the Duration
> column before, so caching should not be a reason (I'm not caching
> explicitly)?
>
> The details within the stage always show roughly the following (both
> for the 10-second and the 600ms query -- very little variation, nothing
> over 500ms, Shuffle Write size is pretty comparable):
>
> Status   Locality Level   Executor   Launch Time   Duration   GC Time   Shuffle Write
> 1864   SUCCESS   NODE_LOCAL   ###   301 ms    8 ms   111.0 B
> 1863   SUCCESS   NODE_LOCAL   ###   273 ms           102.0 B
> 1862   SUCCESS   NODE_LOCAL   ###   245 ms           111.0 B
> 1861   SUCCESS   NODE_LOCAL   ###   326 ms    4 ms   102.0 B
> 1860   SUCCESS   NODE_LOCAL   ###   217 ms    6 ms   102.0 B
> 1859   SUCCESS   NODE_LOCAL   ###   277 ms           111.0 B
> 1858   SUCCESS   NODE_LOCAL   ###   262 ms           108.0 B
> 1857   SUCCESS   NODE_LOCAL   ###   217 ms   14 ms   112.0 B
> 1856   SUCCESS   NODE_LOCAL   ###   208 ms           109.0 B
> 1855   SUCCESS   NODE_LOCAL   ###   242 ms            74.0 B
> 1854   SUCCESS   NODE_LOCAL   ###   218 ms    3 ms    58.0 B
> 1853   SUCCESS   NODE_LOCAL   ###   254 ms   12 ms   102.0 B
> 1852   SUCCESS   NODE_LOCAL   ###   274 ms    8 ms    77.0 B
>


Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Thanks for the quick reply, Bin. Phoenix is something I'm going to try for
sure, but it seems somehow useless if I can use Spark.
Probably, as you said, since Phoenix uses a dedicated data structure within
each HBase table it has more effective memory usage, but if I need to
deserialize data stored in an HBase cell I still have to read that object
into memory, and thus I need Spark. From what I understood, Phoenix is good
if I have to query a simple column of HBase, but things get really
complicated if I have to add an index for each column in my table and store
complex objects within the cells. Is that correct?

Best,
Flavio



On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang  wrote:

> Hi Flavio,
>
> I happen to be attending the 2014 Apache Conf, where I heard about a
> project called "Apache Phoenix", which fully leverages HBase and is supposed
> to be 1000x faster than Hive. And it is not memory bound, whereas memory
> sets up a limit for Spark. It is still in the incubator, and the "stats"
> functions Spark has already implemented are still on its roadmap. I am not
> sure whether it will be good, but it might be something interesting to check
> out.
>
> /usr/bin
>
>
> On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier 
> wrote:
>
>> Hi to everybody,
>>
>>  Lately I have been looking a bit at the recent evolution of the big data
>> stacks, and it seems that HBase is somehow fading away in favour of
>> Spark+HDFS. Am I correct?
>> Do you think that Spark and HBase should work together or not?
>>
>> Best regards,
>> Flavio
>>
>


Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases, HBase on the
transactional side, and Spark on the analytic side. Spark is not intended
for row-based random-access updates, while far more flexible and efficient
in dataset-scale aggregations and general computations.

So yes, you can easily see them deployed side-by-side in a given enterprise.

Sent while mobile. Pls excuse typos etc.
On Apr 8, 2014 5:58 AM, "Flavio Pompermaier"  wrote:

> Hi to everybody,
>
> Lately I have been looking a bit at the recent evolution of the big data
> stacks, and it seems that HBase is somehow fading away in favour of
> Spark+HDFS. Am I correct?
> Do you think that Spark and HBase should work together or not?
>
> Best regards,
> Flavio
>


Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio,

I happen to be attending the 2014 Apache Conf, where I heard about a
project called "Apache Phoenix", which fully leverages HBase and is supposed
to be 1000x faster than Hive. And it is not memory bound, whereas memory
sets up a limit for Spark. It is still in the incubator, and the "stats"
functions Spark has already implemented are still on its roadmap. I am not
sure whether it will be good, but it might be something interesting to check
out.

/usr/bin


On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier wrote:

> Hi to everybody,
>
> Lately I have been looking a bit at the recent evolution of the big data
> stacks, and it seems that HBase is somehow fading away in favour of
> Spark+HDFS. Am I correct?
> Do you think that Spark and HBase should work together or not?
>
> Best regards,
> Flavio
>


Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Hi to everybody,

Lately I have been looking a bit at the recent evolution of the big data stacks,
and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am
I correct?
Do you think that Spark and HBase should work together or not?

Best regards,
Flavio


Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
I was able to keep the "workaround" ...around... by adding a line to
'/etc/rc.local' that overwrites the generated '/root/.ssh/authorized_keys'
file with a known good one.


On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini <
silvio.costant...@granatads.com> wrote:

> Another thing I didn't mention. The AMI and user used: naturally I've
> created several of my own AMIs with the following characteristics. None of
> which worked.
>
> 1) Enabling ssh as root as per this guide (
> http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
> When doing this, I do not specify a user for the spark-ec2 script. What
> happens is that it works! But only while it's alive. If I stop the
> instance, create an AMI, and launch a new instance based on the new AMI,
> the change I made in the '/root/.ssh/authorized_keys' file is overwritten.
>
> 2) Adding the 'ec2-user' to the 'root' group. This means that the ec2-user
> does not have to use sudo to perform any operations needing root
> privileges. When doing this, I specify the user 'ec2-user' for the
> spark-ec2 script. An error occurs: rsync fails with exit code 23.
>
> I believe HVMs still work. But it would be valuable to the community to
> know that the root user work-around does/doesn't work any more for
> paravirtual instances.
>
> Thanks,
> Marco.
>
>
> On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini <
> silvio.costant...@granatads.com> wrote:
>
>> As requested, here is the script I am running. It is a simple shell
>> script which calls spark-ec2 wrapper script. I execute it from the 'ec2'
>> directory of spark, as usual. The AMI used is the raw one from the AWS
>> Quick Start section. It is the first option (an Amazon Linux paravirtual
>> image). Any ideas or confirmation would be GREATLY appreciated. Please and
>> thank you.
>>
>>
>> #!/bin/sh
>>
>> export AWS_ACCESS_KEY_ID=MyCensoredKey
>> export AWS_SECRET_ACCESS_KEY=MyCensoredKey
>>
>> AMI_ID=ami-2f726546
>>
>> ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 -v
>> 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge launch
>> marcotest
>>
>>
>>
>> On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman <
>> shivaram.venkatara...@gmail.com> wrote:
>>
>>> Hmm -- That is strange. Can you paste the command you are using to
>>> launch the instances ? The typical workflow is to use the spark-ec2 wrapper
>>> script using the guidelines at
>>> http://spark.apache.org/docs/latest/ec2-scripts.html
>>>
>>> Shivaram
>>>
>>>
>>> On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini <
>>> silvio.costant...@granatads.com> wrote:
>>>
 Hi Shivaram,

 OK so let's assume the script CANNOT take a different user and that it
 must be 'root'. The typical workaround is as you said, allow the ssh with
 the root user. Now, don't laugh, but, this worked last Friday, but today
 (Monday) it no longer works. :D Why? ...

 ...It seems that NOW, when you launch a 'paravirtual' ami, the root
 user's 'authorized_keys' file is always overwritten. This means the
 workaround doesn't work anymore! I would LOVE for someone to verify this.

 Just to point out, I am trying to make this work with a paravirtual
 instance and not an HVM instance.

 Please and thanks,
 Marco.


 On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman <
 shivaram.venkatara...@gmail.com> wrote:

> Right now the spark-ec2 scripts assume that you have root access and a
> lot of internal scripts assume the user's home directory is hard coded as
> /root.   However all the Spark AMIs we build should have root ssh access --
> Do you find this not to be the case?
>
> You can also enable root ssh access in a vanilla AMI by editing
> /etc/ssh/sshd_config and setting "PermitRootLogin" to yes
>
> Thanks
> Shivaram
>
>
>
> On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini <
> silvio.costant...@granatads.com> wrote:
>
>> Hi all,
>> On the old Amazon Linux EC2 images, the user 'root' was enabled for
>> ssh. Also, it is the default user for the Spark-EC2 script.
>>
>> Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
>> instead of 'root'.
>>
>> I can see that the Spark-EC2 script allows you to specify which user
>> to log in with, but even when I change this, the script fails for various
>> reasons. And the output SEEMS to indicate that the script is still based on
>> the specified user's home directory being '/root'.
>>
>> Am I using this script wrong?
>> Has anyone had success with this 'ec2-user' user?
>> Any ideas?
>>
>> Please and thank you,
>> Marco.
>>
>
>

>>>
>>
>


Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
Hi Spark users, I'm very much hoping someone can help me out.

I have a strict performance requirement on a particular query. One of
the stages shows great variance in duration -- from 300ms to 10sec.

The stage is mapPartitionsWithIndex at Operator.scala:210 (running Spark 0.8)

I have run the job quite a few times -- the details within the stage
do not account for the overall duration shown for the stage. What
could be taking up time that's not showing within the stage breakdown
UI? I'm thinking that reading the data in is reflected in the Duration
column before, so caching should not be a reason (I'm not caching
explicitly)?

The details within the stage always show roughly the following (both
for the 10-second and the 600ms query -- very little variation, nothing
over 500ms, Shuffle Write size is pretty comparable):

Status   Locality Level   Executor   Launch Time   Duration   GC Time   Shuffle Write
1864   SUCCESS   NODE_LOCAL   ###   301 ms    8 ms   111.0 B
1863   SUCCESS   NODE_LOCAL   ###   273 ms           102.0 B
1862   SUCCESS   NODE_LOCAL   ###   245 ms           111.0 B
1861   SUCCESS   NODE_LOCAL   ###   326 ms    4 ms   102.0 B
1860   SUCCESS   NODE_LOCAL   ###   217 ms    6 ms   102.0 B
1859   SUCCESS   NODE_LOCAL   ###   277 ms           111.0 B
1858   SUCCESS   NODE_LOCAL   ###   262 ms           108.0 B
1857   SUCCESS   NODE_LOCAL   ###   217 ms   14 ms   112.0 B
1856   SUCCESS   NODE_LOCAL   ###   208 ms           109.0 B
1855   SUCCESS   NODE_LOCAL   ###   242 ms            74.0 B
1854   SUCCESS   NODE_LOCAL   ###   218 ms    3 ms    58.0 B
1853   SUCCESS   NODE_LOCAL   ###   254 ms   12 ms   102.0 B
1852   SUCCESS   NODE_LOCAL   ###   274 ms    8 ms    77.0 B


NPE using saveAsTextFile

2014-04-08 Thread Nick Pentreath
Hi

I'm using Spark 0.9.0.

When calling saveAsTextFile on an RDD loaded from a custom Hadoop InputFormat
(via newAPIHadoopRDD), I get the error below.

If I call count, I get the correct number of records, so the
InputFormat is being read correctly... the issue only appears when trying
to use saveAsTextFile.

If I call first() I also get the correct output. So it doesn't appear to
be anything to do with the data or the InputFormat.

Any idea what the actual problem is? The stack trace is not very revealing
(though the failure seems to be in ResultTask, which ultimately causes this).

Is this a known issue at all?


==

14/04/08 16:00:46 ERROR OneForOneStrategy:
java.lang.NullPointerException
at
com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
at
com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
at com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:975)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346)
at scala.collection.immutable.$colon$colon.writeObject(List.scala:379)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:975)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28)
at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:48)
at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:123)
at
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1443)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1414)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
Another thing I didn't mention. The AMI and user used: naturally I've
created several of my own AMIs with the following characteristics. None of
which worked.

1) Enabling ssh as root as per this guide (
http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
When doing this, I do not specify a user for the spark-ec2 script. What
happens is that it works! But only while it's alive. If I stop the
instance, create an AMI, and launch a new instance based on the new AMI,
the change I made in the '/root/.ssh/authorized_keys' file is overwritten.

2) Adding the 'ec2-user' to the 'root' group. This means that the ec2-user
does not have to use sudo to perform any operations needing root
privileges. When doing this, I specify the user 'ec2-user' for the
spark-ec2 script. An error occurs: rsync fails with exit code 23.

I believe HVMs still work. But it would be valuable to the community to
know that the root user work-around does/doesn't work any more for
paravirtual instances.

Thanks,
Marco.


On Tue, Apr 8, 2014 at 9:51 AM, Marco Costantini <
silvio.costant...@granatads.com> wrote:

> As requested, here is the script I am running. It is a simple shell script
> which calls spark-ec2 wrapper script. I execute it from the 'ec2' directory
> of spark, as usual. The AMI used is the raw one from the AWS Quick Start
> section. It is the first option (an Amazon Linux paravirtual image). Any
> ideas or confirmation would be GREATLY appreciated. Please and thank you.
>
>
> #!/bin/sh
>
> export AWS_ACCESS_KEY_ID=MyCensoredKey
> export AWS_SECRET_ACCESS_KEY=MyCensoredKey
>
> AMI_ID=ami-2f726546
>
> ./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 -v
> 0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge launch
> marcotest
>
>
>
> On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman <
> shivaram.venkatara...@gmail.com> wrote:
>
>> Hmm -- That is strange. Can you paste the command you are using to launch
>> the instances ? The typical workflow is to use the spark-ec2 wrapper script
>> using the guidelines at
>> http://spark.apache.org/docs/latest/ec2-scripts.html
>>
>> Shivaram
>>
>>
>> On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini <
>> silvio.costant...@granatads.com> wrote:
>>
>>> Hi Shivaram,
>>>
>>> OK so let's assume the script CANNOT take a different user and that it
>>> must be 'root'. The typical workaround is as you said, allow the ssh with
>>> the root user. Now, don't laugh, but, this worked last Friday, but today
>>> (Monday) it no longer works. :D Why? ...
>>>
>>> ...It seems that NOW, when you launch a 'paravirtual' ami, the root
>>> user's 'authorized_keys' file is always overwritten. This means the
>>> workaround doesn't work anymore! I would LOVE for someone to verify this.
>>>
>>> Just to point out, I am trying to make this work with a paravirtual
>>> instance and not an HVM instance.
>>>
>>> Please and thanks,
>>> Marco.
>>>
>>>
>>> On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman <
>>> shivaram.venkatara...@gmail.com> wrote:
>>>
 Right now the spark-ec2 scripts assume that you have root access and a
 lot of internal scripts assume the user's home directory is hard coded as
 /root.   However all the Spark AMIs we build should have root ssh access --
 Do you find this not to be the case ?

 You can also enable root ssh access in a vanilla AMI by editing
 /etc/ssh/sshd_config and setting "PermitRootLogin" to yes

 Thanks
 Shivaram



 On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini <
 silvio.costant...@granatads.com> wrote:

> Hi all,
> On the old Amazon Linux EC2 images, the user 'root' was enabled for
> ssh. Also, it is the default user for the Spark-EC2 script.
>
> Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
> instead of 'root'.
>
> I can see that the Spark-EC2 script allows you to specify which user
> to log in with, but even when I change this, the script fails for various
> reasons. And the output SEEMS to indicate that the script is still based on
> the specified user's home directory being '/root'.
>
> Am I using this script wrong?
> Has anyone had success with this 'ec2-user' user?
> Any ideas?
>
> Please and thank you,
> Marco.
>


>>>
>>
>


Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
As requested, here is the script I am running. It is a simple shell script
which calls spark-ec2 wrapper script. I execute it from the 'ec2' directory
of spark, as usual. The AMI used is the raw one from the AWS Quick Start
section. It is the first option (an Amazon Linux paravirtual image). Any
ideas or confirmation would be GREATLY appreciated. Please and thank you.


#!/bin/sh

export AWS_ACCESS_KEY_ID=MyCensoredKey
export AWS_SECRET_ACCESS_KEY=MyCensoredKey

AMI_ID=ami-2f726546

./spark-ec2 -k gds-generic -i ~/.ssh/gds-generic.pem -u ec2-user -s 10 -v
0.9.0 -w 300 --no-ganglia -a ${AMI_ID} -m m3.2xlarge -t m3.2xlarge launch
marcotest



On Mon, Apr 7, 2014 at 6:21 PM, Shivaram Venkataraman <
shivaram.venkatara...@gmail.com> wrote:

> Hmm -- That is strange. Can you paste the command you are using to launch
> the instances ? The typical workflow is to use the spark-ec2 wrapper script
> using the guidelines at
> http://spark.apache.org/docs/latest/ec2-scripts.html
>
> Shivaram
>
>
> On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini <
> silvio.costant...@granatads.com> wrote:
>
>> Hi Shivaram,
>>
>> OK so let's assume the script CANNOT take a different user and that it
>> must be 'root'. The typical workaround is as you said, allow the ssh with
>> the root user. Now, don't laugh, but, this worked last Friday, but today
>> (Monday) it no longer works. :D Why? ...
>>
>> ...It seems that NOW, when you launch a 'paravirtual' ami, the root
>> user's 'authorized_keys' file is always overwritten. This means the
>> workaround doesn't work anymore! I would LOVE for someone to verify this.
>>
>> Just to point out, I am trying to make this work with a paravirtual
>> instance and not an HVM instance.
>>
>> Please and thanks,
>> Marco.
>>
>>
>> On Mon, Apr 7, 2014 at 4:40 PM, Shivaram Venkataraman <
>> shivaram.venkatara...@gmail.com> wrote:
>>
>>> Right now the spark-ec2 scripts assume that you have root access and a
>>> lot of internal scripts assume the user's home directory is hard coded as
>>> /root.   However all the Spark AMIs we build should have root ssh access --
>>> Do you find this not to be the case ?
>>>
>>> You can also enable root ssh access in a vanilla AMI by editing
>>> /etc/ssh/sshd_config and setting "PermitRootLogin" to yes
>>>
>>> Thanks
>>> Shivaram
>>>
>>>
>>>
>>> On Mon, Apr 7, 2014 at 11:14 AM, Marco Costantini <
>>> silvio.costant...@granatads.com> wrote:
>>>
 Hi all,
 On the old Amazon Linux EC2 images, the user 'root' was enabled for
 ssh. Also, it is the default user for the Spark-EC2 script.

 Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
 instead of 'root'.

 I can see that the Spark-EC2 script allows you to specify which user to
 log in with, but even when I change this, the script fails for various
 reasons. And the output SEEMS to indicate that the script is still based on
 the specified user's home directory being '/root'.

 Am I using this script wrong?
 Has anyone had success with this 'ec2-user' user?
 Any ideas?

 Please and thank you,
 Marco.

>>>
>>>
>>
>


Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Thank you for your help! Let me have a try.


Nan Zhu wrote
> If that's the case, I think mapPartitions is what you need, but it seems
> that you have to load the whole partition into memory with toArray:
> 
> rdd.mapPartitions{D => {val p = D.toArray; ...}}
> 
> --  
> Nan Zhu
> 
> 
> 
> On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote:
> 
>> Yes, how can I do this conveniently? I can use filter, but there will be
>> so many RDDs and it's not concise.
>>  
>>  
>>  
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3875.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>> (http://Nabble.com).
>>  
>>





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark-shell on standalone cluster gives error " no mesos in java.library.path"

2014-04-08 Thread Christoph Böhm
Forgot to post the solution. I messed up the master URL. In particular, I gave
the bare host ("master"), not a URL. My bad. The error message is weird, though;
it seems like a bare hostname ends up being treated as a Mesos master URL.
 
No idea about the Java Runtime Environment Error.
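
For anyone else who hits this, the gist is roughly the following (a hypothetical
Scala sketch in the style of the bundled NetworkWordCount example; only the
master string is the point here):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Passing a bare hostname makes Spark fall back to the Mesos code path and
// look for the native library:
//   new StreamingContext("master", "NetworkWordCount", Seconds(1), ...)
// A standalone cluster needs the full spark:// URL of the master instead:
val ssc = new StreamingContext("spark://master:7077", "NetworkWordCount", Seconds(1),
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))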


On Mar 26, 2014, at 3:52 PM, Christoph Böhm wrote:

> Hi,
> 
> I have a similar issue to the user below:
> I’m running Spark 0.8.1 (standalone). When I test the streaming 
> NetworkWordCount example as in the docs with local[2] it works fine. As soon 
> as I want to connect to my cluster using [NetworkWordCount master …] it says:
> ---
> Failed to load native Mesos library from 
> /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> Exception in thread "main" java.lang.UnsatisfiedLinkError: no mesos in 
> java.library.path
>   at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
>   at java.lang.Runtime.loadLibrary0(Runtime.java:849)
>   at java.lang.System.loadLibrary(System.java:1088)
>   at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:52)
>   at org.apache.mesos.MesosNativeLibrary.load(MesosNativeLibrary.java:64)
>   at org.apache.spark.SparkContext.(SparkContext.scala:260)
>   at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:559)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:84)
>   at 
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:76)
>   at 
> org.apache.spark.streaming.examples.JavaNetworkWordCount.main(JavaNetworkWordCount.java:50)
> ---
> 
> I built mesos 0.13 and added the MESOS_NATIVE_LIBRARY entry in spark-env.sh. 
> But then I get:
> ---
> A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fed89801ce9, pid=13580, tid=140657358776064
> #
> # JRE version: Java(TM) SE Runtime Environment (7.0_51-b13) (build 
> 1.7.0_51-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x632ce9]  jni_GetByteArrayElements+0x89
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/vagrant/hs_err_pid13580.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> ---
> 
> The error lag says:
> ---
> Current thread (0x7fed8473d000):  JavaThread "MesosSchedulerBackend 
> driver" daemon [_thread_in_vm, id=13638, 
> stack(0x7fed57d7a000,0x7fed57e7b000)]
> …
> ---
> 
> Working on Ubuntu 12.04 in Virtual Box. Tried it with OpenJDK 6 and Oracle 
> Java 7.
> 
> 
> Any ideas??
> Many thanks.
> 
> Christoph



Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
If that's the case, I think mapPartitions is what you need, but it seems that
you have to load the whole partition into memory with toArray:

rdd.mapPartitions{D => {val p = D.toArray; ...}}
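
A slightly fuller sketch of the same idea, with made-up data (two partitions of
three Int "parts" each) just to show that the pairwise sums are computed per
partition, without a shuffle:

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "mapPartitionsSketch")

// Two partitions, each holding three "parts" (plain Ints here for illustration).
val t1 = sc.parallelize(Seq(1, 2, 3, 10, 20, 30), 2)

val t2 = t1.mapPartitions { iter =>
  val Array(a, b, c) = iter.toArray   // assumes exactly three parts per partition
  Iterator(a + b, a + c, b + c)       // D = A+B, E = A+C, F = B+C
}

// Prints "3, 4, 5" and "30, 40, 50": one line per partition, same distribution as t1.
t2.glom().collect().foreach(p => println(p.mkString(", ")))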

--  
Nan Zhu



On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote:

> Yes, how can I do this conveniently? I can use filter, but there will be so
> many RDDs and it's not concise.
>  
>  
>  
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3875.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
>  
>  




Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Yes, how can I do this conveniently? I can use filter, but there will be so
many RDDs and it's not concise.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873p3875.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
so, the data structure looks like:

D consists of D1, D2, D3 (DX is partition)

and 

DX consists of d1, d2, d3 (dx is the part in your context)?

what you want to do is to transform 

DX to (d1 + d2, d1 + d3, d2 + d3)?



Best, 

-- 
Nan Zhu



On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:

> In my application, data parts inside an RDD partition have relations, so I
> need to do some operations between them. 
> 
> for example
> RDD T1 has several partitions, each partition has three parts A, B and C.
> then I transform T1 to T2. after transform, T2 also has three parts D, E and
> F, D = A+B, E = A+C, F = B+C. As far as I know, spark only supports
> operations traversing the RDD and calling a function for each element. how
> can I do such a transform?
> 
> in hadoop I copy the data in each partition to a user defined buffer and do
> any operations I like in the buffer, finally I call output.collect() to emit
> the data. But how can I construct a new RDD with distributed partitions in
> spark? makeRDD only distributes a local Scala collection to form an RDD.
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Only TraversableOnce?

2014-04-08 Thread wxhsdp
In my application, data parts inside an RDD partition have relations, so I
need to do some operations between them.

For example:
RDD T1 has several partitions, and each partition has three parts A, B and C.
Then I transform T1 to T2. After the transform, T2 also has three parts D, E and
F, with D = A+B, E = A+C, F = B+C. As far as I know, Spark only supports
operations traversing the RDD and calling a function for each element. How
can I do such a transform?

In Hadoop I copy the data in each partition to a user-defined buffer, do
any operations I like in the buffer, and finally call output.collect() to emit
the data. But how can I construct a new RDD with distributed partitions in
Spark? makeRDD only distributes a local Scala collection to form an RDD.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Only-TraversableOnce-tp3873.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Adnan
Hello,

I am running a Cloudera 4-node cluster with 1 master and 3 slaves. I am
connecting to the Spark master from Scala using SparkContext. I am trying to
execute a simple Java function from the distributed jar on every Spark
worker, but I haven't found a way to communicate with each worker or a Spark
API function to do it.

Can somebody help me with it or point me in the right direction?

Regards,
Adnan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-execute-a-function-from-class-in-distributed-jar-on-each-worker-node-tp3870.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: CheckpointRDD has different number of partitions than original RDD

2014-04-08 Thread Tathagata Das
Yes, that is correct. If you are executing a Spark program across multiple
machines, then you need to use a distributed file system (HDFS API
compatible) for reading and writing data. In your case, your setup is
across multiple machines. So what is probably happening is that the RDD
data is being written to the worker machine's local directory (based on the
checkpoint path that has been provided), which cannot be read again as an
RDD. Hence checkpointing is failing.
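
Concretely, that means pointing the checkpoint directory at HDFS (or another
shared filesystem) rather than at a local path; a minimal sketch, with the
namenode address and path as placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("spark://10.10.41.19:7077", "StatefulNetworkWordCount", Seconds(1))

// A directory that both the driver and every worker can read and write, so
// checkpointed RDDs can be read back with the expected number of partitions.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")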

TD


On Mon, Apr 7, 2014 at 6:05 PM, Paul Mogren  wrote:

> 1.:  I will paste the full content of the environment page of the example
> application running against the cluster at the end of this message.
> 2. and 3.:  Following #2 I was able to see that the count was incorrectly
> 0 when running against the cluster, and following #3 I was able to get the
> message:
> org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[4] at count
> at :15(0) has different number of partitions than original RDD
> MappedRDD[3] at textFile at :12(2)
>
> I think I understand - state checkpoints and other file-exchange
> operations in a Spark cluster require a distributed/shared filesystem, even
> with just a single-host cluster and the driver/shell on a second host. Is
> that correct?
>
> Thank you,
> Paul
>
>
>
> NetworkWordCumulativeCountUpdateStateByKey application UI
> Environment
> Runtime Information
>
> Name            Value
> Java Home       /usr/lib/jvm/jdk1.8.0/jre
> Java Version    1.8.0 (Oracle Corporation)
> Scala Home
> Scala Version   version 2.10.3
> Spark Properties
>
> Name    Value
> spark.app.name  NetworkWordCumulativeCountUpdateStateByKey
> spark.cleaner.ttl   3600
> spark.deploy.recoveryMode   ZOOKEEPER
> spark.deploy.zookeeper.url  pubsub01:2181
> spark.driver.host   10.10.41.67
> spark.driver.port   37360
> spark.fileserver.uri    http://10.10.41.67:40368
> spark.home  /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
> spark.httpBroadcast.uri http://10.10.41.67:45440
> spark.jars
>  
> /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar
> spark.master    spark://10.10.41.19:7077
> System Properties
>
> Name    Value
> awt.toolkit sun.awt.X11.XToolkit
> file.encoding   ANSI_X3.4-1968
> file.encoding.pkg   sun.io
> file.separator  /
> java.awt.graphicsenv    sun.awt.X11GraphicsEnvironment
> java.awt.printerjob sun.print.PSPrinterJob
> java.class.version  52.0
> java.endorsed.dirs  /usr/lib/jvm/jdk1.8.0/jre/lib/endorsed
> java.ext.dirs
> /usr/lib/jvm/jdk1.8.0/jre/lib/ext:/usr/java/packages/lib/ext
> java.home   /usr/lib/jvm/jdk1.8.0/jre
> java.io.tmpdir  /tmp
> java.library.path
> java.net.preferIPv4Stack    true
> java.runtime.name   Java(TM) SE Runtime Environment
> java.runtime.version    1.8.0-b132
> java.specification.name Java Platform API Specification
> java.specification.vendor   Oracle Corporation
> java.specification.version  1.8
> java.vendor Oracle Corporation
> java.vendor.url http://java.oracle.com/
> java.vendor.url.bug http://bugreport.sun.com/bugreport/
> java.version    1.8.0
> java.vm.info    mixed mode
> java.vm.name    Java HotSpot(TM) 64-Bit Server VM
> java.vm.specification.name  Java Virtual Machine Specification
> java.vm.specification.vendor    Oracle Corporation
> java.vm.specification.version   1.8
> java.vm.vendor  Oracle Corporation
> java.vm.version 25.0-b70
> line.separator
> log4j.configuration conf/log4j.properties
> os.arch amd64
> os.name Linux
> os.version  3.5.0-23-generic
> path.separator  :
> sun.arch.data.model 64
> sun.boot.class.path
> /usr/lib/jvm/jdk1.8.0/jre/lib/resources.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/rt.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/sunrsasign.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jsse.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jce.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/charsets.jar:/usr/lib/jvm/jdk1.8.0/jre/lib/jfr.jar:/usr/lib/jvm/jdk1.8.0/jre/classes
> sun.boot.library.path   /usr/lib/jvm/jdk1.8.0/jre/lib/amd64
> sun.cpu.endian  little
> sun.cpu.isalist
> sun.io.serialization.extendedDebugInfo  true
> sun.io.unicode.encoding UnicodeLittle
> sun.java.command
>  org.apache.spark.streaming.examples.StatefulNetworkWordCount spark://
> 10.10.41.19:7077 localhost 
> sun.java.launcher   SUN_STANDARD
> sun.jnu.encoding    ANSI_X3.4-1968
> sun.management.compiler HotSpot 64-Bit Tiered Compilers
> sun.nio.ch.bugLevel
> sun.os.patch.level  unknown
> user.country    US
> user.dir    /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2
> user.home   /home/pmogren
> user.language   en
> user.name   pmogren
> user.timezone   America/New_York
> Classpath Entries
>
> Resource    Source
> /home/pmogren/streamproc/spark-0.9.0-incubating-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
> System Classpath
> /h