Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Hi,
I’m new to Spark and trying to run my first Spark program. I’m running SparkPi 
successfully in yarn-client mode, but when running the same in yarn-standalone mode, 
the app gets stuck in the ACCEPTED state. I’ve spent hours hunting down the reason, 
but the outcome is always the same. Any hints on what to look for next? 

cheers,
-jan


vagrant@vm-cluster-node1:~$ ./run_pi.sh
14/05/20 06:24:04 INFO RMProxy: Connecting to ResourceManager at 
vm-cluster-node2/10.211.55.101:8032
14/05/20 06:24:05 INFO Client: Got Cluster metric info from ApplicationsManager 
(ASM), number of NodeManagers: 2
14/05/20 06:24:05 INFO Client: Queue info ... queueName: root.default, 
queueCurrentCapacity: 0.0, queueMaxCapacity: -1.0,
  queueApplicationCount = 0, queueChildQueueCount = 0
14/05/20 06:24:05 INFO Client: Max mem capability of a single resource in this 
cluster 2048
14/05/20 06:24:05 INFO Client: Preparing Local resources
14/05/20 06:24:05 INFO Client: Uploading 
file:/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar
 to 
hdfs://vm-cluster-node2:8020/user/vagrant/.sparkStaging/application_1400563733088_0012/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar
14/05/20 06:24:07 INFO Client: Setting up the launch environment
14/05/20 06:24:07 INFO Client: Setting up container launch context
14/05/20 06:24:07 INFO Client: Command for starting the Spark 
ApplicationMaster: java -server -Xmx1024m -Djava.io.tmpdir=$PWD/tmp 
org.apache.spark.deploy.yarn.ApplicationMaster --class 
org.apache.spark.examples.SparkPi --jar 
/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar
 --args  'yarn-standalone'  --args  '10'  --worker-memory 500 --worker-cores 1 
--num-workers 1 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
14/05/20 06:24:07 INFO Client: Submitting application to ASM
14/05/20 06:24:07 INFO YarnClientImpl: Submitted application 
application_1400563733088_0012
14/05/20 06:24:08 INFO Client: Application report from ASM: (THIS PART REPEATS 
FOREVER)
 application identifier: application_1400563733088_0012
 appId: 12
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: N/A
 appQueue: root.vagrant
 appMasterRpcPort: -1
 appStartTime: 1400567047343
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED
 appTrackingUrl: 
http://vm-cluster-node2:8088/proxy/application_1400563733088_0012/
 appUser: vagrant


Log files give me no additional help. Latest log entry just acknowledges the 
status change:

hadoop-yarn/hadoop-cmf-yarn-RESOURCEMANAGER-vm-cluster-node2.log.out:2014-05-20 
06:24:07,347 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1400563733088_0012 State change from SUBMITTED to ACCEPTED


I’m running the example in a local test environment with three virtual nodes in 
Cloudera (CDH5).

Below is the run_pi.sh:

#!/bin/bash

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
export STANDALONE_SPARK_MASTER_HOST=vm-cluster-node2
export SPARK_MASTER_PORT=7077
export 
DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/hadoop

export SPARK_JAR_HDFS_PATH=/user/spark/share/lib/spark-assembly.jar

export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}

if [ -n "$HADOOP_HOME" ]; then
  export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi
export 
SPARK_JAR=hdfs://vm-cluster-node2:8020/user/spark/share/lib/spark-assembly.jar

APP_JAR=/opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.0-hadoop2.3.0-cdh5.0.0.jar

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
--jar $APP_JAR \
--class org.apache.spark.examples.SparkPi \
--args yarn-standalone \
--args 10 \
--num-workers 1 \
--master-memory 1g \
--worker-memory 500m \
--worker-cores 1





unsubscribe

2014-05-20 Thread Jayaraman Babu


Re: combinebykey throw classcastexception

2014-05-20 Thread Sean Owen
You asked off-list, and provided a more detailed example there:

val random = new Random()
val testdata = (1 to 1).map(_ => (random.nextInt(), random.nextInt()))
sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]](
  (instant: Int) => { new ArrayBuffer[Int]() },
  (bucket: ArrayBuffer[Int], instant: Int) => { bucket += instant },
  (bucket1: ArrayBuffer[Int], bucket2: ArrayBuffer[Int]) => { bucket1 ++= bucket2 }
).collect()

https://www.quora.com/Why-is-my-combinebykey-throw-classcastexception

I can't reproduce this with Spark 0.9.0 / CDH5 or Spark 1.0.0 RC9.
Your definition looks fine too. (Except that you are dropping the
first value, but that's a different problem.)
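
For reference, a minimal sketch of mine (not the original poster's code) of a
createCombiner that seeds the buffer with the first value, avoiding that drop:

import scala.collection.mutable.ArrayBuffer

// createCombiner receives the first value for each key; seeding the buffer
// with it keeps that value instead of silently discarding it.
val combined = sc.parallelize(Seq((1, 10), (1, 11), (2, 20)))
  .combineByKey[ArrayBuffer[Int]](
    (v: Int) => ArrayBuffer(v),
    (buf: ArrayBuffer[Int], v: Int) => buf += v,
    (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2)
combined.collect()  // Array((1, ArrayBuffer(10, 11)), (2, ArrayBuffer(20)))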

On Tue, May 20, 2014 at 2:05 AM, xiemeilong xiemeilong...@gmail.com wrote:
 I am using CDH5 on a three machines cluster. map data from hbase as (string,
 V) pair , then call combineByKey like this:

 .combineByKey[C](
   (v:V) => new C(v),   //this line throws java.lang.ClassCastException: C
 cannot be cast to V
   (v:C, v:V) => C,
   (c1:C, c2:C) => C)


 I am very confused by this; there isn't any C-to-V casting at all.  What's
 wrong?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/combinebykey-throw-classcastexception-tp6059.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Status stays at ACCEPTED

2014-05-20 Thread sandy . ryza
Hi Jan,

How much memory capacity is configured for each node?

If you go to the ResourceManager web UI, does it indicate any containers are 
running?

-Sandy



Re: combinebykey throw classcastexception

2014-05-20 Thread xiemeilong
It turned out that this issue was caused by a version mismatch between the
driver (0.9.1) and the server (0.9.0-cdh5.0.1). Other functions worked fine
before, but combineByKey did not.

Thank you very much for your reply.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/combinebykey-throw-classcastexception-tp6060p6087.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Hi,
each node has 4 GB of memory. After a full reboot and a re-run of SparkPi, the 
resource manager shows no running containers and 1 pending container. 

-jan




Re: Worker re-spawn and dynamic node joining

2014-05-20 Thread Han JU
Thank you guys for the detailed answers.
Akhil, yes, I would like to give your tool a try. Is it open-sourced?


2014-05-17 17:55 GMT+02:00 Mayur Rustagi mayur.rust...@gmail.com:

 A better way would be to use Mesos (and quite possibly YARN in 1.0.0).
 That will allow you to add nodes on the fly and leverage them for Spark.
 Frankly, standalone mode is not meant to handle those issues. That said, we
 use our deployment tool, as stopping the cluster to add nodes is not
 really an issue at the moment.


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Sat, May 17, 2014 at 9:05 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Thanks for the info about adding/removing nodes dynamically. That's
 valuable.

 On Friday, 16 May 2014, Akhil Das ak...@sigmoidanalytics.com wrote:

  Hi Han :)

 1. Is there a way to automatically re-spawn spark workers? We've
 situations where executor OOM causes worker process to be DEAD and it does
 not came back automatically.

 => Yes. You can either add an OOM killer exception
 (http://backdrift.org/how-to-create-oom-killer-exceptions) on
 all of your Spark processes, or you can have a cronjob that keeps
 monitoring your worker processes and brings them back if they go down.

   2. How to dynamically add (or remove) some worker machines to (from)
 the cluster? We'd like to leverage the auto-scaling group in EC2 for
 example.

 => You can add/remove worker nodes on the fly by spawning a new machine,
 adding that machine's IP address on the master node, and then rsyncing
 the Spark directory to all worker machines, including the one you added.
 Then simply use the *start-all.sh* script on the master node to bring
 the new worker into action. Removing a worker machine from the master can
 be done in the same way: remove the worker's IP address from the master's
 *slaves* file, then restart your slaves and that worker will be removed.


 FYI, we have a deployment tool (a web-based UI) that we use for internal
 purposes. It is built on top of the spark-ec2 script (with some changes)
 and it has a module for adding/removing worker nodes on the fly. It looks
 like the attached screenshot. If you want, I can give you some access.

 Thanks
 Best Regards


 On Wed, May 14, 2014 at 9:52 PM, Han JU ju.han.fe...@gmail.com wrote:

 Hi all,

 Just 2 questions:

   1. Is there a way to automatically re-spawn spark workers? We've
 situations where executor OOM causes worker process to be DEAD and it does
 not came back automatically.

   2. How to dynamically add (or remove) some worker machines to (from)
 the cluster? We'd like to leverage the auto-scaling group in EC2 for
 example.

 We're using spark-standalone.

 Thanks a lot.

 --
 *JU Han*

 Data Engineer @ Botify.com

 +33 061960






-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


question about the license of akka and Spark

2014-05-20 Thread YouPeng Yang
Hi
  I just learned that Akka is under a commercial license; however, Spark is
 under the Apache
 license.
  Is there any problem?


Regards


Re: question about the license of akka and Spark

2014-05-20 Thread Tathagata Das
Akka is under Apache 2 license too.
http://doc.akka.io/docs/akka/snapshot/project/licenses.html


On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.comwrote:

 Hi
   Just know akka is under a commercial license,however Spark is under the
 apache
 license.
   Is there any problem?


 Regards



Re: question about the license of akka and Spark

2014-05-20 Thread Sean Owen
The page says Akka is Open Source and available under the Apache 2 License.

It may also be available under another license, but that does not
change the fact that it may be used by adhering to the terms of the
AL2.

The section is referring to commercial support that Typesafe sells. I
am not even sure there is a second license; they may merely mean
commercial support contract since the target page doesn't refer to
any other license. That is, it's not a clear dual-license model or
anything.

On Tue, May 20, 2014 at 11:15 AM, YouPeng Yang
yypvsxf19870...@gmail.com wrote:
 Hi

 Well, maybe I got the wrong reference:
  http://doc.akka.io/docs/akka/2.3.2/intro/what-is-akka.html
    On that page, the last bold heading, Commercial Support, suggested to me that
  Akka is under a commercial license. By the way, the version is 2.3.2.


 2014-05-20 17:30 GMT+08:00 Tathagata Das tathagata.das1...@gmail.com:

 Akka is under Apache 2 license too.
 http://doc.akka.io/docs/akka/snapshot/project/licenses.html


 On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.com
 wrote:

 Hi
   Just know akka is under a commercial license,however Spark is under the
 apache
 license.
   Is there any problem?


 Regards





Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread gaurav.dasgupta
A few more details I would like to provide (sorry, I should have included these
in the previous post):

 *- Spark Version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2)
 - Hadoop Version = 2.4.0 (Hortonworks)
 - I am trying to execute a Spark Streaming program*

Because I am using Hortonworks Hadoop (HDP), YARN is configured with
different port numbers than the stock Apache defaults. For
example, *yarn.resourcemanager.address* is IP:8050 in HDP whereas it defaults
to IP:8032.

When I run the Spark examples using bin/run-example, I can see in the
console logs, that it is connecting to the right port configured by HDP,
i.e., 8050. Please refer the below console log:

[root@host spark-0.9.1-bin-hadoop2]# SPARK_YARN_MODE=true
SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar
bin/run-example org.apache.spark.examples.HdfsTest yarn-client
/user/root/test
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/local/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/05/20 06:55:29 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/05/20 06:55:29 INFO Remoting: Starting remoting
14/05/20 06:55:29 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@IP:60988]
14/05/20 06:55:29 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@<IP>:60988]
14/05/20 06:55:29 INFO spark.SparkEnv: Registering BlockManagerMaster
14/05/20 06:55:29 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140520065529-924f
14/05/20 06:55:29 INFO storage.MemoryStore: MemoryStore started with
capacity 4.2 GB.
14/05/20 06:55:29 INFO network.ConnectionManager: Bound socket to port 35359
with id = ConnectionManagerId(IP,35359)
14/05/20 06:55:29 INFO storage.BlockManagerMaster: Trying to register
BlockManager
14/05/20 06:55:29 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
Registering block manager IP:35359 with 4.2 GB RAM
14/05/20 06:55:29 INFO storage.BlockManagerMaster: Registered BlockManager
14/05/20 06:55:29 INFO spark.HttpServer: Starting HTTP Server
14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/05/20 06:55:29 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:59418
14/05/20 06:55:29 INFO broadcast.HttpBroadcast: Broadcast server started at
http://IP:59418
14/05/20 06:55:29 INFO spark.SparkEnv: Registering MapOutputTracker
14/05/20 06:55:29 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-fc34fdc8-d940-420b-b184-fc7a8a65501a
14/05/20 06:55:29 INFO spark.HttpServer: Starting HTTP Server
14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/05/20 06:55:29 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:53425
14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/storage,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/stage,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages/pool,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/stages,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/environment,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/executors,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/metrics/json,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/static,null}
14/05/20 06:55:29 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/,null}
14/05/20 06:55:29 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/05/20 06:55:29 INFO ui.SparkUI: Started Spark Web UI at http://IP:4040
14/05/20 06:55:29 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/05/20 06:55:29 INFO spark.SparkContext: Added JAR
/usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar
at http://IP:53425/jars/spark-examples_2.10-assembly-0.9.1.jar with
timestamp 1400586929921
14/05/20 06:55:30 INFO client.RMProxy: Connecting to ResourceManager at
IP:8050
14/05/20 06:55:30 INFO yarn.Client: Got Cluster metric info from
ApplicationsManager (ASM), number of 

Re: Status stays at ACCEPTED

2014-05-20 Thread Jan Holmberg
Still the same. I increased the memory of the node holding the resource manager to 
5 GB. I also spotted an HDFS alert about a replication factor of 3, which I have now 
dropped to the number of data nodes. I also shut down all services not in use. 
Still the issue remains.

I have noticed the following two events that are fired when I start the Spark run:

Zookeeper : caught end of stream exception
Yarn : The specific max attempts: 0 for application: 1 is invalid, because it 
is out of the range [1, 2]. Use the global max attempts

-jan
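
For reference, an application pinned at ACCEPTED on YARN usually means the
scheduler cannot find a NodeManager with enough free memory to launch the
ApplicationMaster container. A minimal yarn-site.xml sketch of the properties
worth checking (the property names are stock Hadoop 2.x; the values are
illustrative assumptions for 4 GB nodes, not CDH defaults):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>3072</value>  <!-- memory each NodeManager offers to containers -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>  <!-- largest single container the scheduler will grant -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>   <!-- requests are rounded up to at least this -->
</property>

With --master-memory 1g plus the ApplicationMaster's memory overhead, the AM
request has to fit within both limits above on some node, or the application
waits in ACCEPTED indefinitely.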




Re: life if an executor

2014-05-20 Thread Koert Kuipers
if they are tied to the spark context, then why can the subprocess not be
started up with the extra jars (sc.addJars) already on the classpath? this way
a switch like user-jars-first would be a simple rearranging of the classpath
for the subprocess, and the messing with classloaders that is
currently done in the executor (which forces people to use reflection in
certain situations and is broken if you want user jars first) would be
history
On May 20, 2014 1:07 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long to they live for? as long as the worker/slave? or are they tied
 to the sparkcontext and life/die with it?

 thx





Re: life if an executor

2014-05-20 Thread Koert Kuipers
just for my clarification: off heap cannot be java objects, correct? so we
are always talking about serialized off-heap storage?
On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 That's one of the main motivations for using Tachyon ;)
 http://tachyon-project.org/

 It gives off heap in-memory caching. And starting Spark 0.9, you can cache
 any RDD in Tachyon just by specifying the appropriate StorageLevel.
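
 A sketch, assuming a Spark build with the Tachyon storage level available
 (the OFF_HEAP level name is from the 1.0-era API; the path is illustrative):

 import org.apache.spark.storage.StorageLevel

 // Blocks are stored serialized, outside the executor JVM heap.
 val events = sc.textFile("hdfs:///logs/events")
 events.persist(StorageLevel.OFF_HEAP)
 events.count()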

 TD




 On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.comwrote:

 I guess it needs to be this way to benefit from caching of RDDs in
 memory. It would be nice, however, if the RDD cache could be dissociated from
 the JVM heap so that in cases where garbage collection is difficult to
 tune, one could choose to discard the JVM and run the next operation in a
 fresh one.


 On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long to they live for? as long as the worker/slave? or are they tied
 to the sparkcontext and life/die with it?

 thx







Ignoring S3 0 files exception

2014-05-20 Thread Laurent T
Hi,

I'm trying to get data from S3 using sc.textFile("s3n://" + filenamePattern).
It seems that if a pattern gives no results, I get an exception like so:

org.apache.hadoop.mapred.InvalidInputException: Input Pattern
s3n://bucket/20140512/* matches 0 files
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)

I'm actually doing a union of multiple RDDs, so some may have data in them but
some won't.
Is there any way to ignore empty patterns, so that this can work with the
RDDs that actually found files?

Thanks
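
One workaround sketch (my own assumption, not an official switch): force split
computation inside a Try and keep only the patterns that resolve to files.

import scala.util.Try

val patterns = Seq("s3n://bucket/20140511/*", "s3n://bucket/20140512/*")
val nonEmpty = patterns.flatMap { p =>
  val rdd = sc.textFile(p)
  // Accessing partitions triggers getSplits, where the exception is thrown.
  Try(rdd.partitions.nonEmpty).toOption.filter(identity).map(_ => rdd)
}
nonEmpty.reduceOption(_ union _).foreach(all => println(all.count()))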



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Ignoring-S3-0-files-exception-tp6101.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Advanced log processing

2014-05-20 Thread Laurent T
Thanks for the advice. I think you're right. I'm not sure we're going to use
HBase but starting by partitioning data into multiple buckets will be a
first step. I'll see how it performs on large datasets.

My original question, though, was more like: is there a Spark trick I don't
know about?
Currently here's what I'm doing:

JavaPairRDD originalData = ...;
JavaPairRDD incompleteData = originalData
    .filter(KeepIncompleteData)
    .map(CleanData)
    .cache();
List pathList = incompleteData
    .flatMap(GetPossibleConciliationPaths)
    .distinct()
    .collect();
JavaPairRDD conciliationRDD = null;
for (String filePath : pathList) {
    JavaPairRDD fileData = sc
        .textFile(filePath)
        .flatMap(ProcessData);
    if (conciliationRDD == null) {
        conciliationRDD = fileData;
    } else {
        conciliationRDD = conciliationRDD.union(fileData);
    }
}
JavaPairRDD finalData = originalData
    .filter(KeepCompleteData)
    .union(conciliationRDD.join(incompleteData));
finalData.saveAsTextFile(dir);

The collect part is what's frightening me the most, as there may be a lot of
different paths. Does that seem fine? Would an approach with HBase allow me
to simply join the incomplete data with the stored state using a key? Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Advanced-log-processing-tp5743p6102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

reading large XML files

2014-05-20 Thread Nathan Kronenfeld
We are trying to read some large GraphML files to use in spark.

Is there an easy way to read XML-based files like this that accounts for
partition boundaries and the like?

 Thanks,
 Nathan


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Re: issue with Scala, Spark and Akka

2014-05-20 Thread Gerard Maas
This error message says it can't find the config for the Akka subsystem.
That config is typically included in the Spark assembly.
First, you need to build your Spark distro by running sbt/sbt assembly
in the SPARK_HOME dir.
Then, use SPARK_HOME (through env or configuration) to point to your
Spark dir.

See running standalone app here:
http://spark.apache.org/docs/latest/quick-start.html

-kr, Gerard.





On Tue, May 20, 2014 at 5:01 PM, Greg g...@zooniverse.org wrote:

 Hi,
 I have the following Scala code:
 ===---
 import org.apache.spark.SparkContext

 class test {
   def main() {
     val sc = new SparkContext("local", "Scala Word Count")
   }
 }
 ===---
 and the following build.sbt file
 ===---
 name := "test"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.0.4"

 libraryDependencies += "org.mongodb" % "mongo-java-driver" % "2.11.4"

 libraryDependencies += "org.mongodb" % "mongo-hadoop-core" % "1.0.0"

 resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
 ===---
 I get the following error:
 com.typesafe.config.ConfigException$Missing: No configuration setting found
 for key 'akka.version'
 at

 com.typesafe.config.impl.SimpleConfig.findKey(test.sc3587202794988350330.tmp:111)
 at

 com.typesafe.config.impl.SimpleConfig.find(test.sc3587202794988350330.tmp:132)


 Any suggestions on how to fix this?
 thanks, Greg




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/issue-with-Scala-Spark-and-Akka-tp6103.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: filling missing values in a sequence

2014-05-20 Thread Mohit Jaggi
Xiangrui,
Thanks for the pointer. I think it should work... For now, I cooked up my
own, which is similar but built on top of the Spark core APIs. I would suggest
moving the sliding-window RDD to the core Spark library. It seems quite general
to me, and a cursory look at the code indicates nothing specific to machine
learning.
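
For concreteness, Sean's range-based suggestion from the quoted thread below
can be written as a short runnable sketch (reduce is used for the endpoints
in case min/max are not available on RDDs in this Spark version):

val rdd1 = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))
val lo = rdd1.reduce((a, b) => math.min(a, b))  // smallest element
val hi = rdd1.reduce((a, b) => math.max(a, b))  // largest element
val filled = sc.parallelize(lo to hi)           // dense sequence 1,2,...,11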

Mohit.


On Mon, May 19, 2014 at 10:13 PM, Xiangrui Meng men...@gmail.com wrote:

 Actually there is a sliding method implemented in
 mllib.rdd.RDDFunctions. Since this is not for general use cases, we
 didn't include it in spark-core. You can take a look at the
 implementation there and see whether it fits. -Xiangrui

 On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com
 wrote:
  Thanks Sean. Yes, your solution works :-) I did oversimplify my real
  problem, which has other parameters that go along with the sequence.
 
 
  On Fri, May 16, 2014 at 3:03 AM, Sean Owen so...@cloudera.com wrote:
 
  Not sure if this is feasible, but this literally does what I think you
  are describing:
 
  sc.parallelize(rdd1.first to rdd1.last)
 
  On Tue, May 13, 2014 at 4:56 PM, Mohit Jaggi mohitja...@gmail.com
 wrote:
   Hi,
   I am trying to find a way to fill in missing values in an RDD. The RDD
   is a
   sorted sequence.
   For example, (1, 2, 3, 5, 8, 11, ...)
   I need to fill in the missing numbers and get
 (1,2,3,4,5,6,7,8,9,10,11)
  
   One way to do this is to slide and zip
   rdd1 =  sc.parallelize(List(1, 2, 3, 5, 8, 11, ...))
   x = rdd1.first
   rdd2 = rdd1 filter (_ != x)
   rdd3 = rdd2 zip rdd1
    rdd4 = rdd3 flatmap { (x, y) => generate missing elements between x and
    y }
  
   Another method which I think is more efficient is to use
   mapParititions() on
   rdd1 to be able to iterate on elements of rdd1 in each partition.
   However,
   that leaves the boundaries of the partitions to be unfilled. Is
 there
   a
   way within the function passed to mapPartitions, to read the first
   element
   in the next partition?
  
   The latter approach also appears to work for a general sliding
 window
   calculation on the RDD. The former technique requires a lot of
 sliding
   and
   zipping and I believe it is not efficient. If only I could read the
   next
   partition...I have tried passing a pointer to rdd1 to the function
   passed to
   mapPartitions but the rdd1 pointer turns out to be NULL, I guess
 because
   Spark cannot deal with a mapper calling another mapper (since it
 happens
   on
   a worker not the driver)
  
   Mohit.
 
 



Re: life if an executor

2014-05-20 Thread Aaron Davidson
One issue is that new jars can be added during the lifetime of a
SparkContext, which can mean after executors are already started. Off-heap
storage is always serialized, correct.
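
For instance (a sketch; the jar path is illustrative):

// A jar can be shipped after the context (and its executors) already exists;
// tasks submitted afterwards see it on their classpath.
sc.addJar("/path/to/extra-udfs.jar")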


On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote:

 just for my clarification: off heap cannot be java objects, correct? so we
 are always talking about serialized off-heap storage?
 On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 That's one the main motivation in using Tachyon ;)
 http://tachyon-project.org/

 It gives off heap in-memory caching. And starting Spark 0.9, you can
 cache any RDD in Tachyon just by specifying the appropriate StorageLevel.

 TD




 On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.comwrote:

 I guess it needs to be this way to benefit from caching of RDDs in
 memory. It would be nice however if the RDD cache can be dissociated from
 the JVM heap so that in cases where garbage collection is difficult to
 tune, one could choose to discard the JVM and run the next operation in a
 few one.


 On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia matei.zaha...@gmail.com
  wrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long to they live for? as long as the worker/slave? or are they
 tied to the sparkcontext and life/die with it?

 thx







Re: reading large XML files

2014-05-20 Thread Xiangrui Meng
Try sc.wholeTextFiles(). It reads each entire file in as a single string
record. -Xiangrui
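
A usage sketch (the HDFS path and the GraphML glob are assumptions):

// Each file becomes one (path, content) record, so each file must fit in memory.
val files = sc.wholeTextFiles("hdfs:///data/graphs/*.graphml")
val parsed = files.mapValues(scala.xml.XML.loadString)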

On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
 We are trying to read some large GraphML files to use in spark.

 Is there an easy way to read XML-based files like this that accounts for
 partition boundaries and the like?

  Thanks,
  Nathan


 --
 Nathan Kronenfeld
 Senior Visualization Developer
 Oculus Info Inc
 2 Berkeley Street, Suite 600,
 Toronto, Ontario M5A 4J5
 Phone:  +1-416-203-3003 x 238
 Email:  nkronenf...@oculusinfo.com


Evaluating Spark just for Cluster Computing

2014-05-20 Thread pcutil
Hi -

We have a use case for batch processing for which we are trying to figure
out if Apache Spark would be a good fit or not.

We have a universe of identifiers sitting in an RDBMS. For each identifier we
need to fetch input data from the RDBMS and pass that input to analytical
models that generate some output numbers, which we store back in the database.
This is one unit of work for us.

So basically we are looking at how we can do this processing in parallel for
the universe of identifiers that we have. All the data is in the RDBMS; none
of it is sitting in a file system.

Can we use spark for this kind of work and would it be a good fit for that?

Thanks for your help.
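
Spark's built-in JdbcRDD may fit here, pulling the identifier universe straight
out of the database; a minimal sketch (the connection URL, table, and runModel
function are assumptions):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// The SQL must contain two '?' placeholders; each partition runs the query
// over its own slice of the [lower, upper] id range.
val ids = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://dbhost/work", "user", "pass"),
  "SELECT id FROM identifiers WHERE id >= ? AND id <= ?",
  1L, 1000000L, 100,
  (rs: ResultSet) => rs.getLong(1))
val outputs = ids.map(runModel)  // runModel: the analytical model, hypothetical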



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Evaluating-Spark-just-for-Cluster-Computing-tp6110.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Unfortunately, I don't have a bunch of moderately big XML files; I have
one really big file - big enough that reading it into memory as a single
string is not feasible.


On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng men...@gmail.com wrote:

 Try sc.wholeTextFiles(). It reads the entire file into a string
 record. -Xiangrui

 On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
 nkronenf...@oculusinfo.com wrote:
  We are trying to read some large GraphML files to use in spark.
 
  Is there an easy way to read XML-based files like this that accounts for
  partition boundaries and the like?
 
   Thanks,
   Nathan
 
 
  --
  Nathan Kronenfeld
  Senior Visualization Developer
  Oculus Info Inc
  2 Berkeley Street, Suite 600,
  Toronto, Ontario M5A 4J5
  Phone:  +1-416-203-3003 x 238
  Email:  nkronenf...@oculusinfo.com




-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com


Spark and Hadoop

2014-05-20 Thread pcutil
I'm a first-time user and need to try just a hello-world kind of program in
Spark.

Now on downloads page, I see following 3 options for Pre-built packages that
I can download:

- Hadoop 1 (HDP1, CDH3)
- CDH4
- Hadoop 2 (HDP2, CDH5)

I'm confused which one do I need to download. I need to try just the hello
world kind of program which has nothing to do with Hadoop. Or is it that I
can only use Spark with Hadoop?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Hadoop-tp6113.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark and Hadoop

2014-05-20 Thread Andrew Ash
Hi Puneet,

If you're not going to read/write data in HDFS from your Spark cluster,
then it doesn't matter which one you download.  Just go with Hadoop 2, as
that's more likely to connect to an HDFS cluster in the future, if you ever
do decide to use HDFS, because it has the newer APIs.

Cheers,
Andrew


On Tue, May 20, 2014 at 10:43 AM, pcutil puneet.ma...@gmail.com wrote:

 I'm a first time user and need to try just the hello world kind of program
 in
 spark.

 Now on downloads page, I see following 3 options for Pre-built packages
 that
 I can download:

 - Hadoop 1 (HDP1, CDH3)
 - CDH4
 - Hadoop 2 (HDP2, CDH5)

 I'm confused which one do I need to download. I need to try just the hello
 world kind of program which has nothing to do with Hadoop. Or is it that I
 can only use Spark with Hadoop?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Hadoop-tp6113.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark and Hadoop

2014-05-20 Thread Andras Barjak
You can download any of them; I would go with the latest version,
or just download the source and build it yourself.
For experimenting with basic things, you can just launch the REPL
and start right away in Spark local mode, not using any Hadoop stuff.
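
For example, a quick smoke test from bin/spark-shell in local mode (a sketch):

val nums = sc.parallelize(1 to 1000)  // sc is pre-created by the shell
nums.filter(_ % 2 == 0).count()       // res: Long = 500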


2014-05-20 19:43 GMT+02:00 pcutil puneet.ma...@gmail.com:

 I'm a first time user and need to try just the hello world kind of program
 in
 spark.

 Now on downloads page, I see following 3 options for Pre-built packages
 that
 I can download:

 - Hadoop 1 (HDP1, CDH3)
 - CDH4
 - Hadoop 2 (HDP2, CDH5)

 I'm confused which one do I need to download. I need to try just the hello
 world kind of program which has nothing to do with Hadoop. Or is it that I
 can only use Spark with Hadoop?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Hadoop-tp6113.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-20 Thread Jacob Eisinger

Howdy Gerard,

Yeah, the docker link feature seems to work well for client-server
interaction.  But, peer-to-peer architectures need more for service
discovery.

As for your addressing requirements, I don't completely understand what you
are asking for... you may also want to check out xip.io.  Their wildcard
domains sometimes make for an easy, neat hack.

Finally for the ports, Docker's new host networking [1] feature helps out
with making a Spark Docker container.  (Security is still an issue.)

Jacob

[1] http://blog.docker.io/2014/05/docker-0-11-release-candidate-for-1-0/

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Gerard Maas gerard.m...@gmail.com
To: user@spark.apache.org
Date:   05/16/2014 10:26 AM
Subject:Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't
submit jobs.



Hi Jacob,

Thanks for the help and the answer on the Docker question. Have you already
experimented with the new link feature in Docker? That does not help the
HDFS issue, as the DataNode needs the NameNode and vice versa, but it does
facilitate simpler client-server interactions.

My issue described at the beginning is related to networking between the
host and the Docker images, but I was losing too much time tracking down
the exact problem, so I moved my Spark job driver onto the Mesos node and
it started working.  Sadly, my Mesos UI is partially crippled, as workers
are not addressable (and therefore Spark job logs are hard to gather).

Your discussion about dynamic port allocation is very relevant to
understand why some components cannot talk with each other.  I'll need to
have a more in-depth read of that discussion to  find a better solution for
my local development environment.

regards,  Gerard.



On Tue, May 6, 2014 at 3:30 PM, Jacob Eisinger jeis...@us.ibm.com wrote:
  Howdy,

  You might find the discussion Andrew and I have been having about Docker
  and network security [1] applicable.

  Also, I posted an answer [2] to your stackoverflow question.

  [1]
  
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html

  [2]
  
http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100


  Jacob D. Eisinger
  IBM Emerging Technologies
  jeis...@us.ibm.com - (512) 286-6075


  From: Gerard Maas gerard.m...@gmail.com
  To: user@spark.apache.org
  Date: 05/05/2014 04:18 PM
  Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't
  submit jobs.




  Hi Benjamin,

  Yes, we initially used a modified version of the AmpLabs docker scripts
  [1]. The amplab docker images are a good starting point.
  One of the biggest hurdles has been HDFS, which requires reverse-DNS and
  I didn't want to go the dnsmasq route to keep the containers relatively
  simple to use without the need of external scripts. Ended up running a
  1-node setup nnode+dnode. I'm still looking for a better solution for
  HDFS [2]

  Our usecase using docker is to easily create local dev environments both
  for development and for automated functional testing (using cucumber). My
  aim is to strongly reduce the time of the develop-deploy-test cycle.
  That  also means that we run the minimum number of instances required to
  have a functionally working setup. E.g. 1 Zookeeper, 1 Kafka broker, ...

  For the actual cluster deployment we have Chef-based devops toolchain
  that  put things in place on public cloud providers.
  Personally, I think Docker rocks and would like to replace those complex
  cookbooks with Dockerfiles once the technology is mature enough.

  -greetz, Gerard.

  [1] https://github.com/amplab/docker-scripts
  [2]
  
http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns



  On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote:
Hi,

Before considering running on Mesos, did you try to submit the
application on Spark deployed without Mesos on Docker containers ?

Currently investigating this idea to deploy quickly a complete set
of clusters with Docker, I'm interested by your findings on sharing
the settings of Kafka and Zookeeper across nodes. How many broker
and zookeeper do you use ?

Regards,



On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com
 wrote:
  Hi all,

  I'm currently working on creating a set of docker images to
  facilitate local development with Spark/streaming on Mesos
  (+zk, hdfs, kafka)

  After solving the initial hurdles to get things 

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Arun Ahuja
I was actually able to get this to work.  I was NOT setting the classpath
properly originally.

Simply running
java -cp /etc/hadoop/conf/:<yarn, hadoop jars> com.domain.JobClass

and setting yarn-client as the Spark master worked for me.  Originally I
had not put the configuration on the classpath. Also, I now use
$SPARK_HOME/bin/compute-classpath.sh to get all of the relevant
jars.  The job properly connects to the AM at the correct port.

Is there any intuition on how Spark executors map to YARN workers, or how the
different memory settings interplay, SPARK_MEM vs YARN_WORKER_MEM?

Thanks,
Arun


On Tue, May 20, 2014 at 2:25 PM, Andrew Or and...@databricks.com wrote:

 Hi Gaurav and Arun,

 Your settings seem reasonable; as long as YARN_CONF_DIR or HADOOP_CONF_DIR
 is properly set, the application should be able to find the correct RM
 port. Have you tried running the examples in yarn-client mode, and your
 custom application in yarn-standalone (now yarn-cluster) mode?




Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Arun Ahuja
Hi Matei,

Unfortunately, I don't have more detailed information, but we have seen the
loss of workers in standalone mode as well.  If a job is killed through
CTRL-C we will often see in the Spark Master page the number of workers and
cores decrease.  They are still alive and well in the Cloudera Manager
page, but not visible on the Spark master. Simply restarting the workers
usually resolves this, but we have often seen workers disappear after a
failed or killed job.

If we see this occur again, I'll try and provide some logs.




On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Which version is this with? I haven’t seen standalone masters lose
 workers. Is there other stuff on the machines that’s killing them, or what
 errors do you see?

 Matei

 On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:

  Hey folks,
 
  I'm wondering what strategies other folks are using for maintaining and
 monitoring the stability of stand-alone spark clusters.
 
  Our master very regularly loses workers, and they (as expected) never
 rejoin the cluster.  This is the same behavior I've seen
  using akka cluster (if that's what spark is using in stand-alone mode)
 -- are there configuration options we could be setting
  to make the cluster more robust?
 
  We have a custom script which monitors the number of workers (through
 the web interface) and restarts the cluster when
  necessary, as well as resolving other issues we face (like spark shells
 left open permanently claiming resources), and it
  works, but it's nowhere close to a great solution.
 
  What are other folks doing?  Is this something that other folks observe
 as well?  I suspect that the loss of workers is tied to
  jobs that run out of memory on the client side or our use of very large
 broadcast variables, but I don't have an isolated test case.
  I'm open to general answers here: for example, perhaps we should simply
 be using mesos or yarn instead of stand-alone mode.
 
  --j
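
As an illustration of the kind of watchdog described above, here is a minimal
sketch. It assumes the standalone master serves a JSON status page at
http://<master>:8080/json (present in contemporary releases); the host,
worker count, and restart commands are placeholders for your own tooling.

import scala.io.Source
import scala.sys.process._

object WorkerWatchdog {
  val masterJsonUrl   = "http://spark-master:8080/json"  // hypothetical master host
  val expectedWorkers = 10                               // hypothetical cluster size

  // Crude count of ALIVE worker entries; a real script would parse the JSON.
  def liveWorkers(): Int = {
    val body = Source.fromURL(masterJsonUrl).mkString
    "\"state\"\\s*:\\s*\"ALIVE\"".r.findAllIn(body).length
  }

  def main(args: Array[String]) {
    while (true) {
      if (liveWorkers() < expectedWorkers) {
        // Placeholder restart hooks -- substitute your own cluster scripts.
        "/opt/spark/bin/stop-all.sh".!
        "/opt/spark/bin/start-all.sh".!
      }
      Thread.sleep(60 * 1000)
    }
  }
}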
 




Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
I'd just like to point out that, along with Matei, I have not seen workers
drop even under the most exotic job failures. We're running pretty close to
master, though; perhaps it is related to an uncaught exception in the
Worker from a prior version of Spark.


On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja aahuj...@gmail.com wrote:

 Hi Matei,

 Unfortunately, I don't have more detailed information, but we have seen
 the loss of workers in standalone mode as well.  If a job is killed through
 CTRL-C we will often see in the Spark Master page the number of workers and
 cores decrease.  They are still alive and well in the Cloudera Manager
 page, but not visible on the Spark master. Simply restarting the workers
 usually resolves this, but we have often seen workers disappear after a
 failed or killed job.

 If we see this occur again, I'll try and provide some logs.




 On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia 
 matei.zaha...@gmail.com wrote:

 Which version is this with? I haven’t seen standalone masters lose
 workers. Is there other stuff on the machines that’s killing them, or what
 errors do you see?

 Matei

 On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:

  Hey folks,
 
  I'm wondering what strategies other folks are using for maintaining and
 monitoring the stability of stand-alone spark clusters.
 
  Our master very regularly loses workers, and they (as expected) never
 rejoin the cluster.  This is the same behavior I've seen
  using akka cluster (if that's what spark is using in stand-alone mode)
 -- are there configuration options we could be setting
  to make the cluster more robust?
 
  We have a custom script which monitors the number of workers (through
 the web interface) and restarts the cluster when
  necessary, as well as resolving other issues we face (like spark shells
 left open permanently claiming resources), and it
  works, but it's nowhere close to a great solution.
 
  What are other folks doing?  Is this something that other folks observe
 as well?  I suspect that the loss of workers is tied to
  jobs that run out of memory on the client side or our use of very large
 broadcast variables, but I don't have an isolated test case.
  I'm open to general answers here: for example, perhaps we should simply
 be using mesos or yarn instead of stand-alone mode.
 
  --j
 





Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Andrew Or
I'm assuming you're running Spark 0.9.x, because in the latest version of
Spark you shouldn't have to add the HADOOP_CONF_DIR to the java class path
manually. I tested this out on my own YARN cluster and was able to confirm
that.

In Spark 1.0, SPARK_MEM is deprecated and should not be used. Instead, you
should set the per-executor memory through spark.executor.memory, which has
the same effect but takes higher priority. By YARN_WORKER_MEM, do you mean
SPARK_EXECUTOR_MEMORY? It also does the same thing. In Spark 1.0, the
priority hierarchy is as follows:

spark.executor.memory (set through spark-defaults.conf) >
SPARK_EXECUTOR_MEMORY > SPARK_MEM (deprecated)

In Spark 0.9, the hierarchy is very similar:

spark.executor.memory (set through SPARK_JAVA_OPTS in spark-env) > SPARK_MEM

For more information:

http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/configuration.html
http://spark.apache.org/docs/0.9.1/configuration.html
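
For concreteness, a minimal sketch of setting the top of that hierarchy
programmatically (the value and app name are arbitrary; in Spark 1.0 the same
key can instead live in conf/spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("MemoryConfigExample")
  .set("spark.executor.memory", "2g")  // wins over SPARK_EXECUTOR_MEMORY and SPARK_MEM

val sc = new SparkContext(conf)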



2014-05-20 11:30 GMT-07:00 Arun Ahuja aahuj...@gmail.com:

 I was actually able to get this to work.  I was NOT setting the classpath
 properly originally.

 Simply running
 java -cp /etc/hadoop/conf/:<yarn, hadoop jars> com.domain.JobClass

 and setting yarn-client as the spark master worked for me.  Originally I
 had not put the configuration on the classpath. Also, I now use
 $SPARK_HOME/bin/compute_classpath.sh to get all of the relevant
 jars.  The job properly connects to the AM at the correct port.

 Is there any intuition on how Spark executors map to YARN workers, or how
 the different memory settings interplay, SPARK_MEM vs YARN_WORKER_MEM?

 Thanks,
 Arun


 On Tue, May 20, 2014 at 2:25 PM, Andrew Or and...@databricks.com wrote:

 Hi Gaurav and Arun,

 Your settings seem reasonable; as long as YARN_CONF_DIR or
 HADOOP_CONF_DIR is properly set, the application should be able to find the
 correct RM port. Have you tried running the examples in yarn-client mode,
 and your custom application in yarn-standalone (now yarn-cluster) mode?



 2014-05-20 5:17 GMT-07:00 gaurav.dasgupta gaurav.d...@gmail.com:

 Few more details I would like to provide (sorry, I should have provided
 these with the previous post):

  - Spark Version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2)
  - Hadoop Version = 2.4.0 (Hortonworks)
  - I am trying to execute a Spark Streaming program

 Because I am using Hortonworks Hadoop (HDP), YARN is configured with
 different port numbers than Apache's defaults. For example,
 resourcemanager.address is IP:8050 in HDP, whereas it defaults to IP:8032.

 When I run the Spark examples using bin/run-example, I can see in the
 console logs that it is connecting to the right port configured by HDP,
 i.e., 8050. Please refer to the console log below:

 [root@host spark-0.9.1-bin-hadoop2]# SPARK_YARN_MODE=true

 SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar

 SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar
 bin/run-example org.apache.spark.examples.HdfsTest yarn-client
 /user/root/test
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in

 [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in

 [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 14/05/20 06:55:29 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/05/20 06:55:29 INFO Remoting: Starting remoting
 14/05/20 06:55:29 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://spark@IP:60988]
 14/05/20 06:55:29 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://spark@<IP>:60988]
 14/05/20 06:55:29 INFO spark.SparkEnv: Registering BlockManagerMaster
 14/05/20 06:55:29 INFO storage.DiskBlockManager: Created local directory
 at
 /tmp/spark-local-20140520065529-924f
 14/05/20 06:55:29 INFO storage.MemoryStore: MemoryStore started with
 capacity 4.2 GB.
 14/05/20 06:55:29 INFO network.ConnectionManager: Bound socket to port
 35359
 with id = ConnectionManagerId(IP,35359)
 14/05/20 06:55:29 INFO storage.BlockManagerMaster: Trying to register
 BlockManager
 14/05/20 06:55:29 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
 Registering block manager IP:35359 with 4.2 GB RAM
 14/05/20 06:55:29 INFO storage.BlockManagerMaster: Registered
 BlockManager
 14/05/20 06:55:29 INFO spark.HttpServer: Starting HTTP Server
 14/05/20 06:55:29 INFO server.Server: jetty-7.x.y-SNAPSHOT
 14/05/20 06:55:29 INFO server.AbstractConnector: Started
 SocketConnector@0.0.0.0:59418
 14/05/20 06:55:29 INFO broadcast.HttpBroadcast: Broadcast server started
 at
 http://IP:59418
 14/05/20 06:55:29 INFO 

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Matei Zaharia
Are you guys both using Cloudera Manager? Maybe there’s also an issue with the 
integration with that.

Matei

On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote:

 I'd just like to point out that, along with Matei, I have not seen workers 
 drop even under the most exotic job failures. We're running pretty close to 
 master, though; perhaps it is related to an uncaught exception in the Worker 
 from a prior version of Spark.
 
 
 On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja aahuj...@gmail.com wrote:
 Hi Matei,
 
 Unfortunately, I don't have more detailed information, but we have seen the
 loss of workers in standalone mode as well.  If a job is killed through
 CTRL-C we will often see in the Spark Master page the number of workers and
 cores decrease.  They are still alive and well in the Cloudera Manager page,
 but not visible on the Spark master. Simply restarting the workers usually
 resolves this, but we have often seen workers disappear after a failed or
 killed job.
 
 If we see this occur again, I'll try and provide some logs.
 
 
 
 
 On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Which version is this with? I haven’t seen standalone masters lose workers. 
 Is there other stuff on the machines that’s killing them, or what errors do 
 you see?
 
 Matei
 
 On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:
 
  Hey folks,
 
  I'm wondering what strategies other folks are using for maintaining and 
  monitoring the stability of stand-alone spark clusters.
 
  Our master very regularly loses workers, and they (as expected) never 
  rejoin the cluster.  This is the same behavior I've seen
  using akka cluster (if that's what spark is using in stand-alone mode) -- 
  are there configuration options we could be setting
  to make the cluster more robust?
 
  We have a custom script which monitors the number of workers (through the 
  web interface) and restarts the cluster when
  necessary, as well as resolving other issues we face (like spark shells 
  left open permanently claiming resources), and it
   works, but it's nowhere close to a great solution.
 
  What are other folks doing?  Is this something that other folks observe as 
  well?  I suspect that the loss of workers is tied to
  jobs that run out of memory on the client side or our use of very large 
  broadcast variables, but I don't have an isolated test case.
  I'm open to general answers here: for example, perhaps we should simply be 
  using mesos or yarn instead of stand-alone mode.
 
  --j
 
 
 
 



Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
We're using spark 0.9.0, and we're using it out of the box -- not using
Cloudera Manager or anything similar.

There are warnings from the master that there continue to be heartbeats
from the unregistered workers.   I will see if there are particular
telltale errors on the worker side.

We've had occasional problems with running out of memory on the driver side
(esp. with large broadcast variables) so that may be related.

--j

On Tuesday, May 20, 2014, Matei Zaharia matei.zaha...@gmail.com wrote:

 Are you guys both using Cloudera Manager? Maybe there’s also an issue with
 the integration with that.

 Matei

 On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote:

 I'd just like to point out that, along with Matei, I have not seen workers
 drop even under the most exotic job failures. We're running pretty close to
 master, though; perhaps it is related to an uncaught exception in the
 Worker from a prior version of Spark.


 On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja aahuj...@gmail.com wrote:

 Hi Matei,

 Unfortunately, I don't have more detailed information, but we have seen
 the loss of workers in standalone mode as well.  If a job is killed through
 CTRL-C we will often see in the Spark Master page the number of workers and
 cores decrease.  They are still alive and well in the Cloudera Manager
 page, but not visible on the Spark master. Simply restarting the workers
 usually resolves this, but we have often seen workers disappear after a
 failed or killed job.

 If we see this occur again, I'll try and provide some logs.




 On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Which version is this with? I haven’t seen standalone masters lose
 workers. Is there other stuff on the machines that’s killing them, or what
 errors do you see?

 Matei

 On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:

  Hey folks,
 
  I'm wondering what strategies other folks are using for maintaining
 and monitoring the stability of stand-alone spark clusters.
 
  Our master very regularly loses workers, and they (as expected) never
 rejoin the cluster.  This is the same behavior I've seen
  using akka cluster (if that's what spark is using in stand-alone mode)
 -- are there configuration options we could be setting
  to make the cluster more robust?
 
  We have a custom script which monitors the number of workers (through
 the web interface) and restarts the cluster when
  necessary, as well as resolving other issues we face (like spark
 shells left open permanently claiming resources), and it
   works, but it's nowhere close to a great solution.
 
  What are other folks doing?  Is this something that other folks
 observe as well?  I suspect that the loss of workers is tied to
  jobs that run out of memory on the client side or our use of very
 large broadcast variables, but I don't have an isolated test case.
  I'm open to general answers here: for example, perhaps we should
 simply be using mesos or yarn instead of stand-alone mode.
 
  --j
 







Re: life of an executor

2014-05-20 Thread Koert Kuipers
interesting, so it sounds to me like spark is forced to choose between the
ability to add jars during its lifetime and the ability to run tasks with the
user classpath first (which is important for the ability to run jobs on spark
clusters not under your control, and hence for the viability of 3rd party
spark apps)
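
For reference, the mid-lifetime jar addition mentioned in Aaron's reply below
is a single call on the live context. A minimal sketch (the jar path is
hypothetical):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "AddJarExample")
// Executors may already be running at this point; addJar ships the jar to
// them after the fact, which is what makes a fixed user-first classpath hard.
sc.addJar("/path/to/extra-udfs.jar")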


On Tue, May 20, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote:

 One issue is that new jars can be added during the lifetime of a
 SparkContext, which can mean after executors are already started. Off-heap
 storage is always serialized, correct.


 On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote:

 just for my clarification: off heap cannot be java objects, correct? so
 we are always talking about serialized off-heap storage?
 On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 That's one of the main motivations for using Tachyon ;)
 http://tachyon-project.org/

 It gives off-heap in-memory caching. And starting Spark 0.9, you can
 cache any RDD in Tachyon just by specifying the appropriate StorageLevel.

 TD
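
A minimal sketch of the StorageLevel route TD describes, assuming the
Tachyon-backed OFF_HEAP level of the 0.9/1.0 line (the input path is
hypothetical):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs:///some/input")
data.persist(StorageLevel.OFF_HEAP)  // cached in Tachyon, always serialized
data.count()                         // first action materializes the cache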




 On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com wrote:

 I guess it needs to be this way to benefit from caching of RDDs in
 memory. It would be nice, however, if the RDD cache could be dissociated from
 the JVM heap so that in cases where garbage collection is difficult to
 tune, one could choose to discard the JVM and run the next operation in a
 new one.


 On Mon, May 19, 2014 at 10:06 PM, Matei Zaharia 
 matei.zaha...@gmail.com wrote:

 They’re tied to the SparkContext (application) that launched them.

 Matei

 On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:

 from looking at the source code i see executors run in their own jvm
 subprocesses.

 how long do they live for? as long as the worker/slave? or are they
 tied to the sparkcontext and live/die with it?

 thx








Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
So, for example, I have two disassociated worker machines at the moment.
 The last messages in the spark logs are akka association error messages,
like the following:

14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://
sparkwor...@hdn3.int.meetup.com:50038] -> [akka.tcp://
sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association failed with
[akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
]

On the master side, there are lots and lots of messages of the form:

14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
worker-20140520011737-hdn3.int.meetup.com-50038

--j



On Tue, May 20, 2014 at 3:28 PM, Josh Marcus jmar...@meetup.com wrote:

 We're using spark 0.9.0, and we're using it out of the box -- not using
 Cloudera Manager or anything similar.

 There are warnings from the master that there continue to be heartbeats
 from the unregistered workers.   I will see if there are particular
 telltale errors on the worker side.

 We've had occasional problems with running out of memory on the driver
 side (esp. with large broadcast variables) so that may be related.

 --j


 On Tuesday, May 20, 2014, Matei Zaharia matei.zaha...@gmail.com wrote:

 Are you guys both using Cloudera Manager? Maybe there’s also an issue
 with the integration with that.

 Matei

 On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote:

 I'd just like to point out that, along with Matei, I have not seen
 workers drop even under the most exotic job failures. We're running pretty
 close to master, though; perhaps it is related to an uncaught exception in
 the Worker from a prior version of Spark.


 On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja aahuj...@gmail.com wrote:

 Hi Matei,

 Unfortunately, I don't have more detailed information, but we have seen
 the loss of workers in standalone mode as well.  If a job is killed through
 CTRL-C we will often see in the Spark Master page the number of workers and
 cores decrease.  They are still alive and well in the Cloudera Manager
 page, but not visible on the Spark master. Simply restarting the workers
 usually resolves this, but we have often seen workers disappear after a
 failed or killed job.

 If we see this occur again, I'll try and provide some logs.




 On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com
  wrote:

 Which version is this with? I haven’t seen standalone masters lose
 workers. Is there other stuff on the machines that’s killing them, or what
 errors do you see?

 Matei

 On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote:

  Hey folks,
 
  I'm wondering what strategies other folks are using for maintaining
 and monitoring the stability of stand-alone spark clusters.
 
  Our master very regularly loses workers, and they (as expected) never
 rejoin the cluster.  This is the same behavior I've seen
  using akka cluster (if that's what spark is using in stand-alone
 mode) -- are there configuration options we could be setting
  to make the cluster more robust?
 
  We have a custom script which monitors the number of workers (through
 the web interface) and restarts the cluster when
  necessary, as well as resolving other issues we face (like spark
 shells left open permanently claiming resources), and it
   works, but it's nowhere close to a great solution.
 
  What are other folks doing?  Is this something that other folks
 observe as well?  I suspect that the loss of workers is tied to
  jobs that run out of memory on the client side or our use of very
 large broadcast variables, but I don't have an isolated test case.
  I'm open to general answers here: for example, perhaps we should
 simply be using mesos or yarn instead of stand-alone mode.
 
  --j
 







Spark Streaming using Flume body size limitation

2014-05-20 Thread lemieud
Hi,

I am trying to send events to spark streaming via flume.
It's working fine up to a certain point.
I have problems when the size of the body is over 1020 characters.

Basically, up to 1020 it works.
From 1021 through 1024, the event will be accepted and there is no exception,
but the channel seems to be corrupted, or at least no more events can make it
through.
From 1025 and up, I am seeing some exceptions: java.io.StreamCorruptedException: 
invalid stream header: 

I created a test using the Embedded Agent and a local spark to expose the 
problem.
I am using the cdh5 distribution.
   <cdh-flume.version>1.4.0-cdh5.0.0</cdh-flume.version>
   <cdh-spark.version>0.9.0-cdh5.0.0</cdh-spark.version>
   <cdh-avro.version>1.7.5-cdh5.0.0</cdh-avro.version>

File is attached.

Has anybody ever seen this?
Any suggestions?

/D



SparkFlumeStreamTest.java (11K) 
http://apache-spark-user-list.1001560.n3.nabble.com/attachment/6127/0/SparkFlumeStreamTest.java





Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-20 Thread Arun Ahuja
Yes, we are on Spark 0.9.0, so that explains the first piece, thanks!

Also, yes, I meant SPARK_WORKER_MEMORY.  Thanks for the hierarchy.
Similarly, is there some best practice on setting SPARK_WORKER_INSTANCES and
spark.default.parallelism?

Thanks,
Arun


On Tue, May 20, 2014 at 3:04 PM, Andrew Or and...@databricks.com wrote:

 I'm assuming you're running Spark 0.9.x, because in the latest version of
 Spark you shouldn't have to add the HADOOP_CONF_DIR to the java class path
 manually. I tested this out on my own YARN cluster and was able to confirm
 that.

 In Spark 1.0, SPARK_MEM is deprecated and should not be used. Instead, you
 should set the per-executor memory through spark.executor.memory, which has
 the same effect but takes higher priority. By YARN_WORKER_MEM, do you mean
 SPARK_EXECUTOR_MEMORY? It also does the same thing. In Spark 1.0, the
 priority hierarchy is as follows:

 spark.executor.memory (set through spark-defaults.conf) >
 SPARK_EXECUTOR_MEMORY > SPARK_MEM (deprecated)

 In Spark 0.9, the hierarchy is very similar:

 spark.executor.memory (set through SPARK_JAVA_OPTS in spark-env) >
 SPARK_MEM

 For more information:

 http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/configuration.html
 http://spark.apache.org/docs/0.9.1/configuration.html



 2014-05-20 11:30 GMT-07:00 Arun Ahuja aahuj...@gmail.com:

 I was actually able to get this to work.  I was NOT setting the classpath
 properly originally.

 Simply running
 java -cp /etc/hadoop/conf/:<yarn, hadoop jars> com.domain.JobClass

 and setting yarn-client as the spark master worked for me.  Originally I
 had not put the configuration on the classpath. Also, I now use
 $SPARK_HOME/bin/compute_classpath.sh to get all of the relevant
 jars.  The job properly connects to the AM at the correct port.

 Is there any intuition on how Spark executors map to YARN workers, or how
 the different memory settings interplay, SPARK_MEM vs YARN_WORKER_MEM?

 Thanks,
 Arun


 On Tue, May 20, 2014 at 2:25 PM, Andrew Or and...@databricks.com wrote:

 Hi Gaurav and Arun,

 Your settings seem reasonable; as long as YARN_CONF_DIR or
 HADOOP_CONF_DIR is properly set, the application should be able to find the
 correct RM port. Have you tried running the examples in yarn-client mode,
 and your custom application in yarn-standalone (now yarn-cluster) mode?



 2014-05-20 5:17 GMT-07:00 gaurav.dasgupta gaurav.d...@gmail.com:

 Few more details I would like to provide (sorry, I should have provided
 these with the previous post):

  - Spark Version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2)
  - Hadoop Version = 2.4.0 (Hortonworks)
  - I am trying to execute a Spark Streaming program

 Because I am using Hortonworks Hadoop (HDP), YARN is configured with
 different port numbers than Apache's defaults. For example,
 resourcemanager.address is IP:8050 in HDP, whereas it defaults to IP:8032.

 When I run the Spark examples using bin/run-example, I can see in the
 console logs that it is connecting to the right port configured by HDP,
 i.e., 8050. Please refer to the console log below:

 [root@host spark-0.9.1-bin-hadoop2]# SPARK_YARN_MODE=true

 SPARK_JAR=assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar

 SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar
 bin/run-example org.apache.spark.examples.HdfsTest yarn-client
 /user/root/test
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in

 [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in

 [jar:file:/usr/local/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 14/05/20 06:55:29 INFO slf4j.Slf4jLogger: Slf4jLogger started
 14/05/20 06:55:29 INFO Remoting: Starting remoting
 14/05/20 06:55:29 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://spark@IP:60988]
 14/05/20 06:55:29 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://spark@<IP>:60988]
 14/05/20 06:55:29 INFO spark.SparkEnv: Registering BlockManagerMaster
 14/05/20 06:55:29 INFO storage.DiskBlockManager: Created local
 directory at
 /tmp/spark-local-20140520065529-924f
 14/05/20 06:55:29 INFO storage.MemoryStore: MemoryStore started with
 capacity 4.2 GB.
 14/05/20 06:55:29 INFO network.ConnectionManager: Bound socket to port
 35359
 with id = ConnectionManagerId(IP,35359)
 14/05/20 06:55:29 INFO storage.BlockManagerMaster: Trying to register
 BlockManager
 14/05/20 06:55:29 INFO storage.BlockManagerMasterActor$BlockManagerInfo:
 Registering block manager IP:35359 with 4.2 GB RAM
 14/05/20 06:55:29 INFO storage.BlockManagerMaster: Registered
 

Re: reading large XML files

2014-05-20 Thread Nathan Kronenfeld
Thanks, that sounds perfect



On Tue, May 20, 2014 at 1:38 PM, Xiangrui Meng men...@gmail.com wrote:

 You can search for XMLInputFormat on Google. There are some
 implementations that allow you to specify the tag to split on, e.g.:

 https://github.com/lintool/Cloud9/blob/master/src/dist/edu/umd/cloud9/collection/XMLInputFormat.java

 On Tue, May 20, 2014 at 10:31 AM, Nathan Kronenfeld
 nkronenf...@oculusinfo.com wrote:
  Unfortunately, I don't have a bunch of moderately big xml files; I have one
  really big file - big enough that reading it into memory as a single string
  is not feasible.
 
 
  On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Try sc.wholeTextFiles(). It reads the entire file into a string
  record. -Xiangrui
 
  On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
  nkronenf...@oculusinfo.com wrote:
   We are trying to read some large GraphML files to use in spark.
  
   Is there an easy way to read XML-based files like this that accounts
 for
   partition boundaries and the like?
  
Thanks,
Nathan
  
  
   --
   Nathan Kronenfeld
   Senior Visualization Developer
   Oculus Info Inc
   2 Berkeley Street, Suite 600,
   Toronto, Ontario M5A 4J5
   Phone:  +1-416-203-3003 x 238
   Email:  nkronenf...@oculusinfo.com
 
 
 
 
  --
  Nathan Kronenfeld
  Senior Visualization Developer
  Oculus Info Inc
  2 Berkeley Street, Suite 600,
  Toronto, Ontario M5A 4J5
  Phone:  +1-416-203-3003 x 238
  Email:  nkronenf...@oculusinfo.com




-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com
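
Following up with a concrete sketch of Xiangrui's suggestion, assuming the
linked Cloud9 XMLInputFormat (or Mahout's equivalent) is on the classpath and
targets the new MapReduce API; the xmlinput.start/xmlinput.end keys and the
package name are taken from those implementations, so verify them against
your version. If your copy targets the old mapred API, use sc.hadoopFile
instead.

import edu.umd.cloud9.collection.XMLInputFormat  // per the Cloud9 link above
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}

val conf = new Configuration()
conf.set("xmlinput.start", "<node>")   // emit one record per <node>...</node>
conf.set("xmlinput.end", "</node>")

val records = sc.newAPIHadoopFile(
  "hdfs:///data/huge-graph.graphml",   // hypothetical path
  classOf[XMLInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)

val xmlFragments = records.map(_._2.toString)  // one XML element per record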


java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

2014-05-20 Thread pcutil
This is the first time I'm trying Spark. I just downloaded it and am trying the
SimpleApp Java program using Maven. I added 2 Maven dependencies --
spark-core and scala-library. Even though my program is in Java, I was
forced to add the Scala dependency. Is that really required?

Now, I'm able to build, but while executing the line below I am getting this error:

JavaSparkContext sc = new JavaSparkContext("local", "Simple App",
  "C:\\tmp\\spark-0.9.1-bin-cdh4", new
String[]{"target\\spark-hello-world-1.1.jar"});

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable
at
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:81)
at SimpleApp.main(SimpleApp.java:8)
at TestSimpleApp.testMain(TestSimpleApp.java:14)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:164)
at junit.framework.TestCase.runBare(TestCase.java:130)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:120)
at junit.framework.TestSuite.runTest(TestSuite.java:228)
at junit.framework.TestSuite.run(TestSuite.java:223)
at
org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:35)
at
org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
at
org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:120)
at
org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:103)
at org.apache.maven.surefire.Surefire.run(Surefire.java:169)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
at
org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.Writable
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
... 26 more





Re: Evaluating Spark just for Cluster Computing

2014-05-20 Thread Sean Owen
My $0.02: If you are simply reading input records, running a model,
and outputting the result, then it's a simple map-only problem and
you're mostly looking for a process to baby-sit these operations. Lots
of things work -- Spark, M/R (+ Crunch), Hadoop Streaming, etc. I'd
choose whatever is simplest to integrate with the RDBMS and analytical
model; all could work.

Keep in mind that the failure recovery processes in these various
frameworks don't necessarily interact cleanly with your external
systems. For example, if a Spark worker dies while doing some work on
some of your IDs, it will happily be restarted, but if your job inserts
results into another table, you may find it has inserted them twice, of
course.
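
To make both points concrete, a sketch of the map-only pattern with an
idempotent write, so that a restarted task overwrites rather than
double-inserts; the JDBC URL, table, and model call are hypothetical:

import java.sql.DriverManager

val ids = sc.parallelize(allIds, 100)  // the universe of identifiers
ids.foreachPartition { part =>
  val conn = DriverManager.getConnection("jdbc:postgresql://db/scores")
  // UPDATE (or an upsert) is idempotent: a restarted task overwrites the
  // same rows instead of inserting duplicates.
  val stmt = conn.prepareStatement("UPDATE scores SET score = ? WHERE id = ?")
  part.foreach { id =>
    val score = runModel(id)           // hypothetical analytical model
    stmt.setDouble(1, score)
    stmt.setLong(2, id)
    stmt.executeUpdate()
  }
  conn.close()
}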

On Tue, May 20, 2014 at 6:26 PM, pcutil puneet.ma...@gmail.com wrote:
 Hi -

 We have a use case for batch processing for which we are trying to figure
 out if Apache Spark would be a good fit or not.

 We have a universe of identifiers sitting in RDBMS for which we need to go
 get input data from RDBMS and then pass that input to analytical models that
 generate some output numbers and store it back to the database. This is one
 unit of work for us.

 So basically we are looking where we can do this processing in parallel for
 the universe of identifiers that we have. All the data is in RDBMS and is
 not sitting in file system.

 Can we use spark for this kind of work and would it be a good fit for that?

 Thanks for your help.





Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Andrew Ash
If the distribution of the keys in your groupByKey is skewed (some keys
appear way more often than others) you should consider modifying your job
to use reduceByKey instead wherever possible.
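
The difference as a minimal sketch: groupByKey ships every value for a key to
a single reducer before any combining, while reduceByKey combines map-side
first, which is what softens the blow of skewed keys:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

// All values for a key are materialized on one node before summing:
val viaGroup  = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }

// Partial sums happen before the shuffle, so far less data moves:
val viaReduce = pairs.reduceByKey(_ + _)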
On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote:

 So we upped the spark.akka.frameSize value to 128 MB and still observed
 the same behavior.  It's happening not necessarily when data is being sent
 back to the driver, but when there is an inter-cluster shuffle, for example
 during a groupByKey.

 Is it possible we should focus on tuning these parameters:
 spark.storage.memoryFraction & spark.shuffle.memoryFraction?


 On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson ilike...@gmail.com wrote:

 This is very likely because the serialized map output locations buffer
 exceeds the akka frame size. Please try setting spark.akka.frameSize
 (default 10 MB) to some higher number, like 64 or 128.

 In the newest version of Spark, this would throw a better error, for what
 it's worth.
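
Applied as a sketch, the suggested setting looks like this (the value is in
megabytes, so 128 means 128 MB; master and app name are placeholders). In
0.9-era deployments the same key can also be passed as
-Dspark.akka.frameSize=128 in SPARK_JAVA_OPTS:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")              // placeholder master
  .setAppName("FrameSizeExample")
  .set("spark.akka.frameSize", "128") // megabytes

val sc = new SparkContext(conf)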



 On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler 
 jkeeble...@gmail.com wrote:

 Has anyone observed Spark worker threads stalling during a shuffle phase
 with
 the following message (one per worker host) being echoed to the terminal
 on
 the driver thread?

 INFO spark.MapOutputTrackerActor: Asked to send map output locations for
 shuffle 0 to [worker host]...


 At this point Spark-related activity on the hadoop cluster completely
 halts
 .. there's no network activity, disk IO or CPU activity, and individual
 tasks are not completing and the job just sits in this state.  At this
 point
 we just kill the job & a re-start of the Spark server service is
 required.

 Using identical jobs we were able to by-pass this halt point by
 increasing
 available heap memory to the workers, but it's odd we don't get an
 out-of-memory error or any error at all.  Upping the memory available
 isn't
 a very satisfying answer to what may be going on :)

 We're running Spark 0.9.0 on CDH5.0 in stand-alone mode.

 Thanks for any help or ideas you may have!

 Cheers,
 Jonathan




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Imports that need to be specified in a Spark application jar?

2014-05-20 Thread Shivani Rao
Hello All,

I am learning that there are certain imports done by the Spark REPL (which is
used to invoke and run code in a spark shell) that I would have to import
explicitly if I need the same functionality in a spark jar run from the
command line.

I am repeatedly running into a serialization error on an RDD that contains
a Map. The exact RDD type is org.apache.spark.rdd.RDD[(String,
Seq[Map[String,MetaData]])], where MetaData is a case class. This
serialization error is not encountered when I try to write out the RDD to
disk using saveAsTextFile or when I try to count the number of elements in
the RDD using the count() function.

Sadly, when I run the same commands in the spark-shell, I do not encounter
any error.
I appreciate your help in advance.

Thanks
Shivani

-- 
Software Engineer
Analytics Engineering Team@ Box
Mountain View, CA
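
The stack trace isn't shown, so the following is not a diagnosis, but a
common first step with case-class-heavy RDDs is switching to Kryo and
registering the types involved, which often surfaces or fixes such
serialization errors. A sketch, with MetaData standing in for the poster's
case class (use the registrator's fully-qualified name in a real app):

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
import com.esotericsoftware.kryo.Kryo

case class MetaData(name: String, value: String)  // stand-in definition

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MetaData])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")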


Re: Setting queue for spark job on yarn

2014-05-20 Thread Sandy Ryza
Hi Ron,

What version are you using?  For 0.9, you need to set it outside your code
with the SPARK_YARN_QUEUE environment variable.

-Sandy


On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:

 Hi,
   How does one submit a spark job to yarn and specify a queue?
   The code that successfully submits to yarn is:

val conf = new SparkConf()
val sc = new SparkContext("yarn-client", "Simple App", conf)

Where do I need to specify the queue?

   Thanks in advance for any help on this...

 Thanks,
 Ron
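
For the 0.9 route Sandy describes, the variable is exported before
submitting, e.g. export SPARK_YARN_QUEUE=myqueue. In later releases the same
choice is a config key; a sketch with a hypothetical queue name:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("yarn-client")
  .setAppName("Simple App")
  .set("spark.yarn.queue", "analytics")  // post-0.9 equivalent of SPARK_YARN_QUEUE

val sc = new SparkContext(conf)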



Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread jonathan.keebler
Thanks for the suggestion, Andrew.  We have also implemented our solution
using reduceByKey, but observe the same behavior.  For example if we do the
following:

map1
groupByKey
map2
saveAsTextFile

Then the stalling will occur during the map1+groupByKey execution.

If we do

map1
reduceByKey
map2
saveAsTextFile

Then the reduceByKey finishes successfully, but the stalling will occur
during the map2+saveAsTextFile execution.


On Tue, May 20, 2014 at 4:22 PM, Andrew Ash [via Apache Spark User List] 
ml-node+s1001560n6134...@n3.nabble.com wrote:

 If the distribution of the keys in your groupByKey is skewed (some keys
 appear way more often than others) you should consider modifying your job
 to use reduceByKey instead wherever possible.
 On May 20, 2014 12:53 PM, Jon Keebler [hidden email] wrote:

 So we upped the spark.akka.frameSize value to 128 MB and still observed
 the same behavior.  It's happening not necessarily when data is being sent
 back to the driver, but when there is an inter-cluster shuffle, for example
 during a groupByKey.

 Is it possible we should focus on tuning these parameters:
 spark.storage.memoryFraction & spark.shuffle.memoryFraction?


 On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson [hidden email] wrote:

 This is very likely because the serialized map output locations buffer
 exceeds the akka frame size. Please try setting spark.akka.frameSize
 (default 10 MB) to some higher number, like 64 or 128.

 In the newest version of Spark, this would throw a better error, for
 what it's worth.



 On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler [hidden email] wrote:

 Has anyone observed Spark worker threads stalling during a shuffle
 phase with
 the following message (one per worker host) being echoed to the
 terminal on
 the driver thread?

 INFO spark.MapOutputTrackerActor: Asked to send map output locations for
 shuffle 0 to [worker host]...


 At this point Spark-related activity on the hadoop cluster completely
 halts
 .. there's no network activity, disk IO or CPU activity, and individual
 tasks are not completing and the job just sits in this state.  At this
 point
 we just kill the job & a re-start of the Spark server service is
 required.

 Using identical jobs we were able to by-pass this halt point by
 increasing
 available heap memory to the workers, but it's odd we don't get an
 out-of-memory error or any error at all.  Upping the memory available
 isn't
 a very satisfying answer to what may be going on :)

 We're running Spark 0.9.0 on CDH5.0 in stand-alone mode.

 Thanks for any help or ideas you may have!

 Cheers,
 Jonathan















Re: Spark Streaming and Shark | Streaming Taking All CPUs

2014-05-20 Thread anishs...@yahoo.co.in
Thanks Mayur, it is working :)

--
Anish Sneh
http://in.linkedin.com/in/anishsneh



Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Aaron Davidson
So the current stalling is simply sitting there with no log output? Have
you jstack'd an Executor to see where it may be hanging? Are you observing
memory or disk pressure (df and df -i)?


On Tue, May 20, 2014 at 2:03 PM, jonathan.keebler jkeeble...@gmail.com wrote:

 Thanks for the suggestion, Andrew.  We have also implemented our solution
 using reduceByKey, but observe the same behavior.  For example if we do the
 following:

 map1
 groupByKey
 map2
 saveAsTextFile

 Then the stalling will occur during the map1+groupByKey execution.

 If we do

 map1
 reduceByKey
 map2
 saveAsTextFile

 Then the reduceByKey finishes successfully, but the stalling will occur
 during the map2+saveAsTextFile execution.


 On Tue, May 20, 2014 at 4:22 PM, Andrew Ash [via Apache Spark User List]
 [hidden email] wrote:

 If the distribution of the keys in your groupByKey is skewed (some keys
 appear way more often than others) you should consider modifying your job
 to use reduceByKey instead wherever possible.
 On May 20, 2014 12:53 PM, Jon Keebler [hidden email] wrote:

 So we upped the spark.akka.frameSize value to 128 MB and still observed
 the same behavior.  It's happening not necessarily when data is being sent
 back to the driver, but when there is an inter-cluster shuffle, for example
 during a groupByKey.

 Is it possible we should focus on tuning these parameters:
 spark.storage.memoryFraction & spark.shuffle.memoryFraction?


 On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson [hidden email] wrote:

 This is very likely because the serialized map output locations buffer
 exceeds the akka frame size. Please try setting spark.akka.frameSize
 (default 10 MB) to some higher number, like 64 or 128.

 In the newest version of Spark, this would throw a better error, for
 what it's worth.



 On Mon, May 19, 2014 at 8:39 PM, jonathan.keebler [hidden email] wrote:

 Has anyone observed Spark worker threads stalling during a shuffle
 phase with
 the following message (one per worker host) being echoed to the
 terminal on
 the driver thread?

 INFO spark.MapOutputTrackerActor: Asked to send map output locations
 for
 shuffle 0 to [worker host]...


 At this point Spark-related activity on the hadoop cluster completely
 halts
 .. there's no network activity, disk IO or CPU activity, and individual
 tasks are not completing and the job just sits in this state.  At this
 point
 we just kill the job & a re-start of the Spark server service is
 required.

 Using identical jobs we were able to by-pass this halt point by
 increasing
 available heap memory to the workers, but it's odd we don't get an
 out-of-memory error or any error at all.  Upping the memory available
 isn't
 a very satisfying answer to what may be going on :)

 We're running Spark 0.9.0 on CDH5.0 in stand-alone mode.

 Thanks for any help or ideas you may have!

 Cheers,
 Jonathan















Spark Performance Comparison Spark on YARN vs Spark Standalone

2014-05-20 Thread anishs...@yahoo.co.in
Hi All

I need to analyse the performance of Spark on YARN vs Spark Standalone.

Please suggest if there are some pre-published comparison statistics available.

TIA
--
Anish Sneh
http://in.linkedin.com/in/anishsneh



Re: facebook data mining with Spark

2014-05-20 Thread Michael Cutler
Hello Joe,

The first step is acquiring some data, either through the Facebook API
(https://developers.facebook.com/) or a third-party service like
Datasift (https://datasift.com/) (paid).  Once you've acquired some data,
and got it somewhere Spark can access it (like HDFS), you can then load and
manipulate it just like any other data.

Here is a pretty-printed example JSON message I got from a
Datasift (https://datasift.com/) stream this morning; it illustrates an
anonymised someone with *clearly too much time on their hands* having
reached *level 576* on Candy Crush Saga.

{
    "demographic": {
        "gender": "mostly_female"
    },
    "facebook": {
        "application": "Candy Crush Saga",
        "author": {
            "type": "user",
            "hash_id": ""
        },
        "caption": "I just completed level 576, scored 494020 points and got 3 stars.",
        "created_at": "Tue, 20 May 2014 03:08:09 +0000",
        "description": "Click here to follow my progress!",
        "id": "100_123456789012345",
        "link": "http://apps.facebook.com/candycrush/?urlMessage=",
        "name": "Yay, I completed level 576 in Candy Crush Saga!",
        "source": "Candy Crush Saga (123456789012345)",
        "type": "link"
    },
    "interaction": {
        "schema": {
            "version": 3
        },
        "type": "facebook",
        "id": "",
        "created_at": "Tue, 20 May 2014 03:08:09 +0000",
        "received_at": 1400555303.6832,
        "author": {
            "type": "user",
            "hash_id": ""
        },
        "title": "Yay, I completed level 576 in Candy Crush Saga!",
        "link": "http://www.facebook.com/100_123456789012345",
        "subtype": "link",
        "content": "Click here to follow my progress!",
        "source": "Candy Crush Saga (123456789012345)"
    },
    "language": {
        "tag": "en",
        "tag_extended": "en",
        "confidence": 97
    }
}

Much like processing Twitter streams, the data arrives as a single JSON
object on each line.  So you need to pass the RDD[String] you get from
opening the textFile through a JSON parser.  Spark has
json4s (https://github.com/json4s/json4s) and Jackson JSON parsers
embedded in the assembly, so you can basically use
those for 'free' without having to bundle them in your JAR.

Here is an example Spark job which answers the age-old question: Who is
better at Candy Crush, boys? or girls?

// Needed outside the REPL: json4s ships inside the Spark assembly
import org.json4s._
import org.json4s.jackson.JsonMethods._

// We want to extract the level number from "Yay, I completed
// level 576 in Candy Crush Saga!" -- the actual text will change based
// on the user's language but parsing the 'last number' works
val pattern = """(\d+)""".r

// Produces a RDD[String]
val lines = sc.textFile("facebook-2014-05-19.json")
lines.map(line => {
  // Parse the JSON
  parse(line)
}).filter(json => {
  // Filter out only 'Candy Crush Saga' activity
  json \ "facebook" \ "application" == JString("Candy Crush Saga")
}).map(json => {
  // Extract the 'level' or default to zero
  var level = 0;
  pattern.findAllIn( compact(json \ "interaction" \ "title") ).matchData.foreach(m => {
    level = m.group(1).toInt
  })
  // Extract the gender
  val gender = compact(json \ "demographic" \ "gender")
  // Return a Tuple of RDD[gender: String, (level: Int, count: Int)]
  ( gender, (level, 1) )
}).filter(a => {
  // Filter out entries with a level of zero
  a._2._1 > 0
}).reduceByKey( (a, b) => {
  // Sum the levels and counts so we can average later
  ( a._1 + b._1, a._2 + b._2 )
}).collect().foreach(entry => {
  // Print the results
  val gender = entry._1
  val values = entry._2
  val average = values._1 / values._2
  println(gender + ": average=" + average + ", count=" + values._2 )
})


See more: https://gist.github.com/cotdp/fda64b4248e43a3c8f46

If you run this on a small sample of data you get results like this:


   - female: average=114, count=15422
   - male: average=104, count=14727

Which basically says the average level achieved by women is slightly
higher than that achieved by guys.

Best of luck fishing through Facebook data!

MC



*Michael Cutler*
Founder, CTO

Mobile: +44 789 990 7847
Email:  mich...@tumra.com
Web:    tumra.com (http://tumra.com)




On 20 May 2014 05:07, Joe L selme...@yahoo.com wrote:

 Is there any way to get facebook data into Spark and filter the content of
 it?




Python, Spark and HBase

2014-05-20 Thread twizansk
Hello,

This seems like a basic question, but I have been unable to find an answer in
the archives or other online sources.

I would like to know if there is any way to load an RDD from HBase in Python.
In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a
TableInputFormat class.  Is there any equivalent in Python?

Thanks





Using Spark to analyze complex JSON

2014-05-20 Thread Nick Chammas
The Apache Drill (http://incubator.apache.org/drill/) home page has an
interesting heading: "Liberate Nested Data".

Is there any current or planned functionality in Spark SQL or Shark to
enable SQL-like querying of complex JSON?

Nick





Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Any tips on how to troubleshoot this?


On Thu, May 15, 2014 at 4:15 PM, Nick Chammas nicholas.cham...@gmail.com wrote:

 I’m trying to do a simple count() on a large number of GZipped files in
 S3. My job is failing with the following message:

 14/05/15 19:12:37 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.IOException
 java.io.IOException: incorrect header check
 at 
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native 
 Method)
 at 
 org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
 at 
 org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
 at 
 org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
 at java.io.InputStream.read(InputStream.java:101)

 snipped

 I traced this down to HADOOP-5281
 (https://issues.apache.org/jira/browse/HADOOP-5281),
 but I’m not sure if 1) it’s the same issue, or 2) how to go about resolving
 it.

 I gather I need to update some Hadoop jar? Any tips on where to look/what
 to do?

 I’m running Spark on an EC2 cluster created by spark-ec2 with no special
 options used.

 Nick




Unsubscribe

2014-05-20 Thread A.Khanolkar



Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Madhu
I have read gzip files from S3 successfully.

It sounds like a file is corrupt or not a valid gzip file.

Does it work with fewer gzip files?
How are you reading the files?




-
Madhu
https://www.linkedin.com/in/msiddalingaiah


How to Unsubscribe from the Spark user list

2014-05-20 Thread Nick Chammas
Send an email to this address to unsubscribe from the Spark user list:

user-unsubscr...@spark.apache.org

Sending an email to the Spark user list itself (i.e. this list) *does not
do anything*, even if you put "unsubscribe" as the subject. We will all
just see your email.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Unsubscribe-from-the-Spark-user-list-tp6150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
Unfortunately, those errors are actually due to an Executor that exited,
such that the connection between the Worker and Executor failed. This is
not a fatal issue, unless there are analogous messages from the Worker to
the Master (which should be present, if they exist, at around the same
point in time).

Do you happen to have the logs from the Master that indicate that the
Worker terminated? Is it just an Akka disassociation, or some exception?


On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote:

 This isn't helpful of me to say, but I see the same sorts of problems
 and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
 into when it happens, but usually after heavy use and after running
 for a long time. I had figured I'd see if the changes since 0.9.0
 addressed it and revisit later.

 On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com wrote:
  So, for example, I have two disassociated worker machines at the moment.
  The last messages in the spark logs are akka association error messages,
  like the following:
 
  14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
  [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] ->
  [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association
  failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
  akka.remote.EndpointAssociationException: Association failed with
  [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
  Caused by:
  akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
  Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
  ]
 
  On the master side, there are lots and lots of messages of the form:
 
  14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
  worker-20140520011737-hdn3.int.meetup.com-50038
 
  --j
 
 



Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-20 Thread Nicholas Chammas
Yes, it does work with fewer GZipped files. I am reading the files in using
sc.textFile() and a pattern string.

For example:

a = sc.textFile('s3n://bucket/2014-??-??/*.gz')
a.count()

Nick
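
One way to narrow down which file is bad, sketched here in Scala purely
for illustration: count each file on its own and record the ones that
fail. This assumes the corrupt member reliably throws when read alone;
the paths below are placeholders.

// a job that hits a corrupt .gz surfaces on the driver as an exception,
// so counting file-by-file identifies the bad input by name
val paths = Seq(
  "s3n://bucket/2014-01-01/part-00000.gz",  // placeholder paths
  "s3n://bucket/2014-01-02/part-00000.gz")
val suspects = paths.filter { p =>
  try {
    sc.textFile(p).count()
    false
  } catch {
    case e: Exception =>
      println(s"failed on $p: ${e.getMessage}")
      true
  }
}
println(s"corrupt candidates: ${suspects.mkString(", ")}")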


On Tue, May 20, 2014 at 10:09 PM, Madhu ma...@madhu.com wrote:

 I have read gzip files from S3 successfully.

 It sounds like a file is corrupt or not a valid gzip file.

 Does it work with fewer gzip files?
 How are you reading the files?




 -
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768p6149.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



IllegalStateException when creating Job from shell

2014-05-20 Thread Alex Holmes
Hi,

I'm trying to work with Spark from the shell and create a Hadoop Job
instance. I get the exception you see below because Job.toString doesn't
like to be called until the job has been submitted.

I tried using the :silent command but that didn't seem to have any impact.

scala> import org.apache.hadoop.mapreduce.Job
scala> val job = new Job()
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
at org.apache.hadoop.mapreduce.Job.toString(Job.java:462)
at
scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at .<init>(<console>:10)
at .<clinit>(<console>)
...

Any help would be greatly appreciated!

Thanks,
Alex
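
One possible workaround, offered only as an untested sketch: the REPL
calls toString on the value of each statement, so keeping the Job behind
a wrapper whose default toString is harmless avoids the exception. The
class and job name below are hypothetical.

// the REPL prints the holder's default toString, never the Job's
class JobHolder { val job = new org.apache.hadoop.mapreduce.Job() }
val holder = new JobHolder        // prints e.g. JobHolder@1a2b3c4d
holder.job.setJobName("example")  // configure the Job via the holder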


Re: Python, Spark and HBase

2014-05-20 Thread Matei Zaharia
Unfortunately this is not yet possible. There’s a patch in progress posted here 
though: https://github.com/apache/spark/pull/455 — it would be great to get 
your feedback on it.

Matei

On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote:

 Hello,
 
 This seems like a basic question but I have been unable to find an answer in
 the archives or other online sources.
 
 I would like to know if there is any way to load an RDD from HBase in Python.
 In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a
 TableInputFormat class.  Is there any equivalent in Python?
 
 Thanks
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
Aaron:

I see this in the Master's logs:

14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
worker-20140520011737-hdn3.int.meetup.com-50038

There was an executor launch that did fail, for example:
14/05/20 01:16:05 INFO Master: Launching executor app-20140520011605-0001/2
on worker worker-20140519155427-hdn3.int.meetup.com-50038
14/05/20 01:17:37 INFO Master: Removing executor app-20140520011605-0001/2
because it is FAILED

... but other executors on other machines also failed without permanently
disassociating.

There are also these messages, which may or may not be related:
14/05/20 01:17:38 INFO LocalActorRef: Message
[akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
was not delivered. [3] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/05/20 01:17:38 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.6.19%3A47252-18#1027788678]
was not delivered. [4] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.




On Tue, May 20, 2014 at 10:13 PM, Aaron Davidson ilike...@gmail.com wrote:

 Unfortunately, those errors are actually due to an Executor that exited,
 such that the connection between the Worker and Executor failed. This is
 not a fatal issue, unless there are analogous messages from the Worker to
 the Master (which should be present, if they exist, at around the same
 point in time).

 Do you happen to have the logs from the Master that indicate that the
 Worker terminated? Is it just an Akka disassociation, or some exception?


 On Tue, May 20, 2014 at 12:53 PM, Sean Owen so...@cloudera.com wrote:

 This isn't helpful of me to say, but I see the same sorts of problems
 and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
 into when it happens, but usually after heavy use and after running
 for a long time. I had figured I'd see if the changes since 0.9.0
 addressed it and revisit later.

 On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar...@meetup.com wrote:
  So, for example, I have two disassociated worker machines at the moment.
  The last messages in the spark logs are akka association error messages,
  like the following:
 
  14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
  [akka.tcp://sparkwor...@hdn3.int.meetup.com:50038] ->
  [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]: Error [Association
  failed with [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]] [
  akka.remote.EndpointAssociationException: Association failed with
  [akka.tcp://sparkexecu...@hdn3.int.meetup.com:46288]
  Caused by:
  akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
  Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
  ]
 
  On the master side, there are lots and lots of messages of the form:
 
  14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
  worker-20140520011737-hdn3.int.meetup.com-50038
 
  --j
 
 





Any way to control memory usage when streaming input arrives faster than Spark Streaming can process it?

2014-05-20 Thread Francis.Hu
sparkers,

 

Is there a better way to control memory usage when the streaming input
arrives faster than Spark Streaming can process it?

 

Thanks,

Francis.Hu
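
For illustration only, a hedged Scala sketch of one way to cap ingest so
unprocessed data cannot pile up without bound. The
spark.streaming.receiver.maxRate setting shown here was added in a later
Spark release than the one under discussion, and the rate value is a
placeholder to tune per workload.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// cap each receiver at 10,000 records/sec (placeholder value) so batches
// queued behind slow processing stop growing at the source
val conf = new SparkConf()
  .setAppName("RateLimitedStream")
  .set("spark.streaming.receiver.maxRate", "10000")
val ssc = new StreamingContext(conf, Seconds(1))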