Re: spark ssh to slave

2015-06-08 Thread James King
at 2:51 PM, James King jakwebin...@gmail.com wrote: I have two hosts, 192.168.1.15 (Master) and 192.168.1.16 (Worker). These two hosts have exchanged public keys so they have free access to each other. But when I run <spark home>/sbin/start-all.sh from 192.168.1.15 I still get 192.168.1.16

spark ssh to slave

2015-06-08 Thread James King
I have two hosts, 192.168.1.15 (Master) and 192.168.1.16 (Worker). These two hosts have exchanged public keys so they have free access to each other. But when I run <spark home>/sbin/start-all.sh from 192.168.1.15 I still get: 192.168.1.16: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
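
That Permission denied (publickey,...) message means sshd on the worker rejected key authentication for the account that start-all.sh connects as. A quick check, not from the thread and with a hypothetical user name, is to confirm the exact user that launches the scripts on the master has passwordless access:

```
# Run on 192.168.1.15 as the same user that invokes sbin/start-all.sh
ssh-copy-id spark@192.168.1.16      # install this user's public key on the worker
ssh spark@192.168.1.16 'echo ok'    # must print "ok" without a password prompt
```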

Re: Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread James Aley
this just needs further tuning? * Increasing executors, RAM, etc. This doesn't make a difference by itself for this job, so I'm thinking we're already not fully utilising the resources we have in a smaller cluster. Again, any recommendations appreciated. Thanks for the help! James. On 4 June 2015

Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread James Aley
to using hadoopRDD() with the appropriate Input/Output formats? Any advice or tips greatly appreciated! James.

Re: Worker Spark Port

2015-05-15 Thread James King
run on a specific port? Regards jk On Wed, May 13, 2015 at 7:51 PM, James King jakwebin...@gmail.com wrote: Indeed, many thanks. On Wednesday, 13 May 2015, Cody Koeninger c...@koeninger.org wrote: I believe most ports are configurable at this point, look at http://spark.apache.org/docs

Re: Worker Spark Port

2015-05-15 Thread James King
through a context. So, master != driver and executor != worker. Best Ayan On Fri, May 15, 2015 at 7:52 PM, James King jakwebin...@gmail.com wrote: So I'm using code like this to use specific ports: val conf = new SparkConf() .setMaster(master) .setAppName("namexxx") .set
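
As a reference for the snippet above, a minimal Java sketch (port numbers are hypothetical) of pinning the driver-side ports through SparkConf; the standalone Worker's own listening port is set separately, e.g. via SPARK_WORKER_PORT in conf/spark-env.sh:

```java
import org.apache.spark.SparkConf;

public class FixedPortsConf {
  public static SparkConf build(String master) {
    return new SparkConf()
        .setMaster(master)
        .setAppName("namexxx")
        .set("spark.driver.port", "7001")        // driver RPC port
        .set("spark.blockManager.port", "7003"); // block manager port
  }
}
```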

Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
From: http://spark.apache.org/docs/latest/streaming-kafka-integration.html I'm trying to use the direct approach to read messages from Kafka. Kafka is running as a cluster and is configured with Zookeeper. On the above page it mentions: In the Kafka parameters, you must specify either

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
of brokers in pre-existing Kafka project apis. I don't know why the Kafka project chose to use 2 different configuration keys. On Wed, May 13, 2015 at 5:00 AM, James King jakwebin...@gmail.com wrote: From: http://spark.apache.org/docs/latest/streaming-kafka-integration.html I'm trying to use

Kafka + Direct + Zookeeper

2015-05-13 Thread James King
I'm trying the Kafka Direct approach (for the consumer side) but when I use only this config: kafkaParams.put("group.id", groupdid); kafkaParams.put("zookeeper.connect", zookeeperHostAndPort + "/cb_kafka"); I get this: Exception in thread "main" org.apache.spark.SparkException: Must specify metadata.broker.list or
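
A minimal sketch of what that exception is asking for (broker addresses and topic name are hypothetical): the direct approach talks to the Kafka brokers directly, so metadata.broker.list (or bootstrap.servers) has to be in the parameter map instead of zookeeper.connect:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class DirectStreamSketch {
  public static JavaPairInputDStream<String, String> create(JavaStreamingContext jssc) {
    Map<String, String> kafkaParams = new HashMap<>();
    // The direct API needs the broker list rather than a ZooKeeper address.
    kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092");
    Set<String> topics = new HashSet<>();
    topics.add("cb_topic");
    return KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);
  }
}
```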

Worker Spark Port

2015-05-13 Thread James King
I understand that this port value is randomly selected. Is there a way to enforce which Spark port a Worker should use?

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
, James King jakwebin...@gmail.com wrote: Looking at the Consumer Configs in http://kafka.apache.org/documentation.html#consumerconfigs, the properties metadata.broker.list and bootstrap.servers are not mentioned. Do I need these on the consumer side? On Wed, May 13, 2015 at 3:52 PM, James King

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
Looking at the Consumer Configs in http://kafka.apache.org/documentation.html#consumerconfigs, the properties metadata.broker.list and bootstrap.servers are not mentioned. Do I need these on the consumer side? On Wed, May 13, 2015 at 3:52 PM, James King jakwebin...@gmail.com wrote: Many thanks

Re: Worker Spark Port

2015-05-13 Thread James King
Indeed, many thanks. On Wednesday, 13 May 2015, Cody Koeninger c...@koeninger.org wrote: I believe most ports are configurable at this point, look at http://spark.apache.org/docs/latest/configuration.html search for .port On Wed, May 13, 2015 at 9:38 AM, James King jakwebin...@gmail.com

Re: Master HA

2015-05-12 Thread James King
Thanks Akhil, I'm using Spark in standalone mode so i guess Mesos is not an option here. On Tue, May 12, 2015 at 1:27 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Mesos has a HA option (of course it includes zookeeper) Thanks Best Regards On Tue, May 12, 2015 at 4:53 PM, James King

Reading Real Time Data only from Kafka

2015-05-12 Thread James King
What I want is this: if the driver dies for some reason and is restarted, I want to read only the messages that arrived in Kafka after the restart of the driver program and re-connection to Kafka. Has anyone done this? Any links or resources that can help explain this? Regards jk

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
Best Regards On Tue, May 12, 2015 at 5:15 PM, James King jakwebin...@gmail.com wrote: What I want is if the driver dies for some reason and it is restarted I want to read only messages that arrived into Kafka following the restart of the driver program and re-connection to Kafka. Has anyone

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
at 9:01 AM, James King jakwebin...@gmail.com wrote: Thanks Cody. Here are the events: - Spark app connects to Kafka first time and starts consuming - Messages 1 - 10 arrive at Kafka then Spark app gets them - Now driver dies - Messages 11 - 15 arrive at Kafka - Spark driver program

Master HA

2015-05-12 Thread James King
I know that it is possible to use Zookeeper and File System (not for production use) to achieve HA. Are there any other options now or in the near future?

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
that the linked library is much more flexible/reliable than what's available in Spark at this point. James, what you're describing is the default behavior for the createDirectStream api available as part of spark since 1.3. The kafka parameter auto.offset.reset defaults to largest, ie start at the most

Re: Submit Spark application in cluster mode and supervised

2015-05-09 Thread James King
should set your master URL to be spark://host01:7077,host02:7077 And the property spark.deploy.recoveryMode=ZOOKEEPER See here for more info: http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper From: James King Date: Friday, May 8, 2015 at 11:22 AM
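
For reference, the configuration shape the reply points at (ZooKeeper hosts are hypothetical), per the standby-masters page it links:

```
# conf/spark-env.sh on every Master host
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"

# submitting against both masters: one spark:// prefix, host:port pairs comma-separated
./bin/spark-submit --master spark://host01:7077,host02:7077 ...
```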

Re: Stop Cluster Mode Running App

2015-05-08 Thread James King
Many Thanks Silvio, Someone also suggested using something similar: ./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID> Regards jk On Fri, May 8, 2015 at 2:12 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: Hi James, If you’re on Spark 1.3 you can use
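
A usage sketch of that kill command (the master URL and driver ID here are placeholders; the driver ID is the one printed by spark-submit or shown in the Master web UI):

```
./bin/spark-class org.apache.spark.deploy.Client kill spark://host01:7077 driver-20150508123456-0001
```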

Cluster mode and supervised app with multiple Masters

2015-05-08 Thread James King
Why does this not work: ./spark-1.3.0-bin-hadoop2.4/bin/spark-submit --class SomeApp --deploy-mode cluster --supervise --master spark://host01:7077,host02:7077 Some.jar It fails with: Caused by: java.lang.NumberFormatException: For input string: "7077,host02:7077" It seems to accept only one

Submit Spark application in cluster mode and supervised

2015-05-08 Thread James King
I have two hosts, host01 and host02 (let's call them that). I run one Master and two Workers on host01. I also run one Master and two Workers on host02. Now I have 1 LIVE Master on host01 and a STANDBY Master on host02. The LIVE Master is aware of all Workers in the cluster. Now I submit a Spark

Re: Submit Spark application in cluster mode and supervised

2015-05-08 Thread James King
BTW I'm using Spark 1.3.0. Thanks On Fri, May 8, 2015 at 5:22 PM, James King jakwebin...@gmail.com wrote: I have two hosts, host01 and host02 (let's call them that). I run one Master and two Workers on host01. I also run one Master and two Workers on host02. Now I have 1 LIVE Master on host01

Re: Receiver Fault Tolerance

2015-05-06 Thread James King
Many thanks all, your responses have been very helpful. Cheers On Wed, May 6, 2015 at 2:14 PM, ayan guha guha.a...@gmail.com wrote: https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics On Wed, May 6, 2015 at 10:09 PM, James King jakwebin

Receiver Fault Tolerance

2015-05-06 Thread James King
In the O'Reilly book Learning Spark, Chapter 10, section 24/7 Operation, it talks about 'Receiver Fault Tolerance'. I'm unsure what a Receiver is here; from reading, it sounds like when you submit an application to the cluster in cluster mode, i.e. --deploy-mode cluster, the driver program will

Re: Exiting driver main() method...

2015-05-04 Thread James Carman
to see here, move along. :) On Sat, May 2, 2015 at 2:44 PM Mohammed Guller moham...@glassbeam.com wrote: No, you don’t need to do anything special. Perhaps, your application is getting stuck somewhere? If you can share your code, someone may be able to help. Mohammed *From:* James

Troubling Logging w/Simple Example (spark-1.2.2-bin-hadoop2.4)...

2015-05-04 Thread James Carman
I have the following simple example program: public class SimpleCount { public static void main(String[] args) { final String master = System.getProperty("spark.master", "local[*]"); System.out.printf("Running job against spark master %s ...%n", master); final SparkConf

Re: Enabling Event Log

2015-05-01 Thread James King
/spark-events. And this folder does not exist. Best Regards, Shixiong Zhu 2015-04-29 23:22 GMT-07:00 James King jakwebin...@gmail.com: I'm unclear why I'm getting this exception. It seems to have realized that I want to enable Event Logging but to be ignoring where I want it to log to, i.e. file

Exiting driver main() method...

2015-05-01 Thread James Carman
In all the examples, it seems that the spark application doesn't really do anything special in order to exit. When I run my application, however, the spark-submit script just hangs there at the end. Is there something special I need to do to get that thing to exit normally?

Enabling Event Log

2015-04-30 Thread James King
I'm unclear why I'm getting this exception. It seems to have realized that I want to enable Event Logging but to be ignoring where I want it to log to, i.e. file:/opt/cb/tmp/spark-events, which does exist. spark-defaults.conf: # Example: spark.master spark://master1:7077,master2:7077
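
A minimal spark-defaults.conf sketch for the intent described above; the event-log directory must already exist on the machine the driver runs on:

```
spark.master            spark://master1:7077,master2:7077
spark.eventLog.enabled  true
spark.eventLog.dir      file:/opt/cb/tmp/spark-events
```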

Re: spark-defaults.conf

2015-04-28 Thread James King
explicitly Shouldn't Spark just consult ZK and use the active master? Or is ZK only used during failure? On Mon, Apr 27, 2015 at 1:53 PM, James King jakwebin...@gmail.com wrote: Thanks. I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile But when I start the worker like

submitting to multiple masters

2015-04-28 Thread James King
I have multiple masters running and I'm trying to submit an application using spark-1.3.0-bin-hadoop2.4/bin/spark-submit with this config (i.e. a comma-separated list of master URLs): --master spark://master01:7077,spark://master02:7077 But I get this exception
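
For reference, the documented multi-master URL uses a single spark:// prefix with the host:port pairs comma-separated (host names taken from the message above):

```
./spark-1.3.0-bin-hadoop2.4/bin/spark-submit --master spark://master01:7077,master02:7077 ...
```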

[Spark SQL] Problems creating a table in specified schema/database

2015-04-28 Thread James Aley
suggestions? Should this work? James.

spark-defaults.conf

2015-04-27 Thread James King
I renamed spark-defaults.conf.template to spark-defaults.conf and invoked spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh But I still get failed to launch org.apache.spark.deploy.worker.Worker: --properties-file FILE Path to a custom Spark properties file.

Re: spark-defaults.conf

2015-04-27 Thread James King
, SPARK_CONF_DIR. On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com wrote: I renamed spark-defaults.conf.template to spark-defaults.conf and invoked spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh But I still get failed to launch org.apache.spark.deploy.worker.Worker: --properties

Re: Querying Cluster State

2015-04-26 Thread James King
On Sun, Apr 26, 2015 at 6:31 PM, James King jakwebin...@gmail.com wrote: If I have 5 nodes and I wish to maintain 1 Master and 2 Workers on each node, then in total I will have 5 Masters and 10 Workers. To maintain that setup I would like to query Spark regarding the number of Masters and Workers

Querying Cluster State

2015-04-26 Thread James King
If I have 5 nodes and I wish to maintain 1 Master and 2 Workers on each node, then in total I will have 5 Masters and 10 Workers. To maintain that setup I would like to query Spark, using API calls, about the number of Masters and Workers that are currently available and then take some

Spark Cluster Setup

2015-04-24 Thread James King
I'm trying to find out how to set up a resilient Spark cluster. Things I'm thinking about include: - How to start multiple masters on different hosts? (There isn't a conf/masters file from what I can see.) Thank you.

Re: Spark Cluster Setup

2015-04-24 Thread James King
://twitter.com/deanwampler http://polyglotprogramming.com On Fri, Apr 24, 2015 at 5:01 AM, James King jakwebin...@gmail.com wrote: I'm trying to find out how to set up a resilient Spark cluster. Things I'm thinking about include: - How to start multiple masters on different hosts

Master -chatter - Worker

2015-04-22 Thread James King
Is there a good resource that covers what kind of chatter (communication) that goes on between driver, master and worker processes? Thanks

Re: Spark Unit Testing

2015-04-21 Thread James King
Hi Emre, thanks for the help will have a look. Cheers! On Tue, Apr 21, 2015 at 1:46 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello James, Did you check the following resources: - https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming - http

Spark Unit Testing

2015-04-21 Thread James King
I'm trying to write some unit tests for my Spark code. I need to pass a JavaPairDStream<String, String> to my Spark class. Is there a way to create a JavaPairDStream using the Java API? Also, is there a good resource that covers an approach (or approaches) to unit testing using Java? Regards jk
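
One way to build such a stream in a test, sketched here with queueStream (the test scaffolding and the "key,value" line format are assumptions, not from the thread):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class PairDStreamFixture {
  public static JavaPairDStream<String, String> build(JavaStreamingContext jssc) {
    // One queue entry becomes one micro-batch of the resulting stream.
    JavaRDD<String> batch = jssc.sparkContext().parallelize(Arrays.asList("k1,v1", "k2,v2"));
    Queue<JavaRDD<String>> queue = new LinkedList<>();
    queue.add(batch);
    JavaDStream<String> lines = jssc.queueStream(queue);
    // Split each "key,value" line into a pair.
    return lines.mapToPair(line -> {
      String[] parts = line.split(",", 2);
      return new Tuple2<>(parts[0], parts[1]);
    });
  }
}
```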

Skipped Jobs

2015-04-19 Thread James King
In the web UI I can see some jobs marked as 'skipped'. What does that mean? Why are these jobs skipped? Do they ever get executed? Regards jk

Re: [GraphX] aggregateMessages with active set

2015-04-13 Thread James
/apache/spark/graphx/impl/GraphImpl.scala#L237-266 Ankur On Thu, Apr 9, 2015 at 3:21 AM, James alcaid1...@gmail.com wrote: In aggregateMessagesWithActiveSet, Spark still has to read all edges. It means that a fixed cost which scales with graph size is unavoidable in a pregel-like iteration

Spark Cluster: RECEIVED SIGNAL 15: SIGTERM

2015-04-13 Thread James King
Any idea what this means? Many thanks. == logs/spark-.-org.apache.spark.deploy.worker.Worker-1-09.out.1 == 15/04/13 07:07:22 INFO Worker: Starting Spark worker 09:39910 with 4 cores, 6.6 GB RAM 15/04/13 07:07:22 INFO Worker: Running Spark version 1.3.0 15/04/13 07:07:22 INFO

Re: [GraphX] aggregateMessages with active set

2015-04-09 Thread James
(...) Ankur On Tue, Apr 7, 2015 at 2:56 AM, James alcaid1...@gmail.com wrote: Hello, The old GraphX API mapReduceTriplets has an optional parameter activeSetOpt: Option[(VertexRDD[_] that limits the input of sendMessage. However, in the new API aggregateMessages I could not find

[GraphX] aggregateMessages with active set

2015-04-07 Thread James
Hello, The old GraphX API mapReduceTriplets has an optional parameter activeSetOpt: Option[(VertexRDD[_] that limits the input of sendMessage. However, in the new API aggregateMessages I could not find this option; why is it no longer offered? Alcaid

Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
when running the thrift server, I need to create a Hive table definition first? Is that the case, or did I miss something? If it is, is there some sensible way to automate this? Many thanks! James [1] https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
by periodically restarting the server with a new context internally. That certainly beats manual curation of Hive table definitions, if it will work? Thanks again, James. On 7 April 2015 at 19:30, Michael Armbrust mich...@databricks.com wrote: 1) What exactly is the relationship between the thrift

Using DIMSUM with ids

2015-04-06 Thread James
The example below illustrates how to use the DIMSUM algorithm to calculate the similarity between each pair of rows and output row pairs whose cosine similarity is not less than a threshold.
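
MLlib exposes DIMSUM through RowMatrix.columnSimilarities; a minimal Java sketch (note that it computes column similarities, so items compared as "rows" in the message above need to be laid out as columns):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

public class DimsumSketch {
  public static CoordinateMatrix similarities(JavaRDD<Vector> rows, double threshold) {
    RowMatrix mat = new RowMatrix(rows.rdd());
    // Entries with estimated cosine similarity below the threshold may be omitted.
    return mat.columnSimilarities(threshold);
  }
}
```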

A stream of json objects using Java

2015-04-02 Thread James King
I'm reading a stream of string lines that are in JSON format. I'm using Java with Spark. Is there a way to get this via a transformation, so that I end up with a stream of JSON objects? I would also welcome any feedback about this approach or alternative approaches. Thanks, jk
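
One common approach, sketched here assuming Jackson is on the classpath (any JSON library would do), is to parse each line inside a map transformation:

```java
import java.util.Map;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.streaming.api.java.JavaDStream;

public class JsonLines {
  @SuppressWarnings("unchecked")
  public static JavaDStream<Map<String, Object>> parse(JavaDStream<String> lines) {
    return lines.map(line -> {
      // A mapper per record keeps the closure trivially serializable;
      // mapPartitions with one mapper per partition is a cheaper variant.
      ObjectMapper mapper = new ObjectMapper();
      return (Map<String, Object>) mapper.readValue(line, Map.class);
    });
  }
}
```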

Spark + Kafka

2015-04-01 Thread James King
I have a simple setup/runtime of Kafka and Spark. I have a command line consumer displaying arrivals to the Kafka topic. So I know messages are being received. But when I try to read from the Kafka topic I get no messages; here are some logs below. I'm thinking there aren't enough threads. How do I

Re: Spark + Kafka

2015-04-01 Thread James King
receiving data from sources like Kafka. 2015-04-01 16:18 GMT+08:00 James King jakwebin...@gmail.com: Thank you bit1129, From looking at the web UI I can see 2 cores. Also looking at http://spark.apache.org/docs/1.2.1/configuration.html, but I can't see an obvious configuration for the number of receivers

Re: Spark + Kafka

2015-04-01 Thread James King
: Please make sure that you have given more cores than the number of Receivers. From: James King jakwebin...@gmail.com Date: 2015-04-01 15:21 To: user user@spark.apache.org Subject: Spark + Kafka I have a simple setup/runtime of Kafka and Spark. I have a command line consumer displaying

Re: Spark + Kafka

2015-04-01 Thread James King
().getSimpleName()) .setMaster(master); JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(duration)); return ssc; } On Wed, Apr 1, 2015 at 11:37 AM, James King jakwebin...@gmail.com wrote: Thanks Saisai, Sure will do. But just a quick note that when i set master

NetworkWordCount + Spark standalone

2015-03-25 Thread James King
I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work: the text entered on the Netcat data server is not being picked up and printed to the Eclipse console output. However, if I use

Re: NetworkWordCount + Spark standalone

2015-03-25 Thread James King
at 6:31 PM, James King jakwebin...@gmail.com wrote: I'm trying to run the Java NetworkWordCount example against a simple Spark standalone runtime of one master and one worker. But it doesn't seem to work: the text entered on the Netcat data server is not being picked up and printed to the Eclipse

Clean the shuffle data during iteration

2015-03-20 Thread James
Hello, Is it possible to delete the shuffle data of a previous iteration, as it is no longer necessary? Alcaid

Re: Spark + Kafka

2015-03-19 Thread James King
On Mar 18, 2015, at 2:38 AM, James King jakwebin...@gmail.com wrote: Hi All, Which build of Spark is best when using Kafka? Regards jk

Writing Spark Streaming Programs

2015-03-19 Thread James King
Hello All, I'm using Spark for streaming but I'm unclear on which implementation language to use: Java, Scala, or Python. I don't know anything about Python, am familiar with Scala, and have been doing Java for a long time. I think the above shouldn't influence my decision on which language to use

Re: Spark + Kafka

2015-03-19 Thread James King
Many thanks all for the good responses, appreciated. On Thu, Mar 19, 2015 at 8:36 AM, James King jakwebin...@gmail.com wrote: Thanks Khanderao. On Wed, Mar 18, 2015 at 7:18 PM, Khanderao Kand Gmail khanderao.k...@gmail.com wrote: I have used various version of spark (1.0, 1.2.1) without

Re: Writing Spark Streaming Programs

2015-03-19 Thread James King
keep the most complex Scala constructions out of your code) On Thu, Mar 19, 2015 at 3:50 PM, James King jakwebin...@gmail.com wrote: Hello All, I'm using Spark for streaming but I'm unclear on which implementation language to use: Java, Scala, or Python. I don't know anything about

Re: Spark + Kafka

2015-03-18 Thread James King
not including the mailing list in the response, I'm the only one who will get your message. Regards, Jeff 2015-03-18 10:49 GMT+01:00 James King jakwebin...@gmail.com: Any sub-category recommendations hadoop, MapR, CDH? On Wed, Mar 18, 2015 at 10:48 AM, James King jakwebin...@gmail.com wrote

Null Pointer Exception due to mapVertices function in GraphX

2015-03-15 Thread James
I got a NullPointerException in aggregateMessages on a graph which is the output of the mapVertices function of another graph. I found the problem is that the mapVertices function did not affect all the triplets of the graph. // Initialize the graph, assign a counter to each vertex that contains the

How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
Hello, I have a cluster with Spark on YARN. Currently some of its nodes are running a Spark Streaming program, so their local space is not enough to support other applications. I therefore wonder whether it is possible to use a blacklist to avoid these nodes when running a new Spark program?

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
My hadoop version is 2.2.0, and my spark version is 1.2.0 2015-03-14 17:22 GMT+08:00 Ted Yu yuzhih...@gmail.com: Which release of hadoop are you using ? Can you utilize node labels feature ? See YARN-2492 and YARN-796 Cheers On Sat, Mar 14, 2015 at 1:49 AM, James alcaid1...@gmail.com

Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-08 Thread James
, James alcaid1...@gmail.com wrote: Hello, I want to execute a hql script through `spark-sql` command, my script contains: ``` ALTER TABLE xxx DROP PARTITION (date_key = ${hiveconf:CUR_DATE}); ``` when I execute ``` spark-sql -f script.hql -hiveconf CUR_DATE=20150119

[SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-06 Thread James
Hello, I want to execute a hql script through `spark-sql` command, my script contains: ``` ALTER TABLE xxx DROP PARTITION (date_key = ${hiveconf:CUR_DATE}); ``` when I execute ``` spark-sql -f script.hql -hiveconf CUR_DATE=20150119 ``` It throws an error like ``` cannot recognize input near

Re: Using graphx to calculate average distance of a big graph

2015-01-06 Thread James
shortest path is an option, you could simply find the APSP using https://github.com/apache/spark/pull/3619 and then take the average distance (apsp.map(_._2.toDouble).mean). Ankur http://www.ankurdave.com/ On Sun, Jan 4, 2015 at 6:28 PM, James alcaid1...@gmail.com wrote: Recently we want

Bug in DISK related Storage level?

2014-11-03 Thread James
Hello, I am trying to load a very large graph to run a GraphX algorithm, and the graph does not fit in memory. I found that if I use the DISK_ONLY or MEMORY_AND_DISK_SER storage level, the program hits OOM, but if I use MEMORY_ONLY_SER, it does not. Thus I want to know what kind of

Re: How to correctly estimate the number of partitions of a graph in GraphX

2014-11-02 Thread James
, Nov 1, 2014 at 10:57 PM, James alcaid1...@gmail.com wrote: Hello, I am trying to run the Connected Components algorithm on a very big graph. In practice I found that a small number of partitions would lead to OOM, while a large number would cause various timeout exceptions. Thus I wonder how

How to correctly estimate the number of partitions of a graph in GraphX

2014-11-01 Thread James
Hello, I am trying to run the Connected Components algorithm on a very big graph. In practice I found that a small number of partitions would lead to OOM, while a large number would cause various timeout exceptions. Thus I wonder how to estimate the number of partitions of a graph in GraphX?

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-09 Thread Kevin James Matzen
I have a related question. With Hadoop, I would do the same thing for non-serializable objects and setup(). I also had a use case where it was so expensive to initialize the non-serializable object that I would make it a static member of the mapper, turn on JVM reuse across tasks, and then

subscribe

2014-07-28 Thread James Todd

Re: Purpose of spark-submit?

2014-07-09 Thread Robert James
compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What

Re: Comparative study

2014-07-08 Thread Robert James
As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in between all using more or less straight Scala is unarguable. But the experience of deploying Spark has been quite painful, mainly about gaps between compile time and

Purpose of spark-submit?

2014-07-08 Thread Robert James
What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?

Requirements for Spark cluster

2014-07-08 Thread Robert James
I have a Spark app which runs well on local master. I'm now ready to put it on a cluster. What needs to be installed on the master? What needs to be installed on the workers? If the cluster already has Hadoop or YARN or Cloudera, does it still need an install of Spark?

spark-submit conflicts with dependencies

2014-07-07 Thread Robert James
When I use spark-submit (along with spark-ec2), I get dependency conflicts. spark-assembly includes older versions of apache commons codec and httpclient, and these conflict with many of the libs our software uses. Is there any way to resolve these? Or, if we use the precompiled spark, can we

spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
spark-submit includes a spark-assembly uber jar, which has older versions of many common libraries. These conflict with some of the dependencies we need. I have been racking my brain trying to find a solution (including experimenting with ProGuard), but haven't been able to: when we use

spark-assembly libraries conflict with application libraries

2014-07-07 Thread Robert James
spark-submit includes a spark-assembly uber jar, which has older versions of many common libraries. These conflict with some of the dependencies we need. I have been racking my brain trying to find a solution (including experimenting with ProGuard), but haven't been able to: when we use

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
jars in front of classpath, which should do the trick. however i had no luck with this. see here: https://issues.apache.org/jira/browse/SPARK-1863 On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote: spark-submit includes a spark-assembly uber jar, which has older

Adding and subtracting workers on Spark EC2 cluster

2014-07-06 Thread Robert James
If I've created a Spark EC2 cluster, how can I add or take away workers? Also: If I use EC2 spot instances, what happens when Amazon removes them? Will my computation be saved in any way, or will I need to restart from scratch? Finally: The spark-ec2 scripts seem to use Hadoop 1. How can I

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-06 Thread Robert James
I can say from my experience that getting Spark to work with Hadoop 2 is not for the beginner; after solving one problem after another (dependencies, scripts, etc.), I went back to Hadoop 1. Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure why, but, given so, Hadoop 2 has too

Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread Robert James
Although Spark's home page offers binaries for Spark 1.0.0 with Hadoop 2, the Maven repository only seems to have one version, which uses Hadoop 1. Is it possible to use a Maven link and Hadoop 2? What is the id? If not: How can I use the prebuilt binaries to use Hadoop 2? Do I just copy the

Re: Is it possible to use Spark, Maven, and Hadoop 2?

2014-06-29 Thread Robert James
to make a jar assembly using your approach? How? If not: How do you distribute the jars to the workers? On Sun, Jun 29, 2014 at 12:20 PM, Robert James srobertja...@gmail.com wrote: Although Spark's home page offers binaries for Spark 1.0.0 with Hadoop 2, the Maven repository only seems to have

Re: Hadoop interface vs class

2014-06-26 Thread Robert James
this problem? (Surely I'm not the only one using Hadoop 2 and sbt or maven or ivy!) On Jun 26, 2014 11:07 AM, Robert James srobertja...@gmail.com wrote: Yes. As far as I can tell, Spark seems to be including Hadoop 1 via its transitive dependency: http://mvnrepository.com/artifact

Spark's Hadoop Dependency

2014-06-25 Thread Robert James
To add Spark to an SBT project, I do: libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided" How do I make sure that the Spark version which will be downloaded will depend on, and use, Hadoop 2, and not Hadoop 1? Even with a line: libraryDependencies += "org.apache.hadoop" %
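
One commonly suggested pattern for this (a sketch; the hadoop-client version is an assumption) is to exclude Spark's transitive hadoop-client and pin your own:

```
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided" exclude("org.apache.hadoop", "hadoop-client")

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.2.0"
```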

Spark's Maven dependency on Hadoop 1

2014-06-25 Thread Robert James
According to http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.0 , spark depends on Hadoop 1.0.4. What about the versions of Spark that work with Hadoop 2? Do they also depend on Hadoop 1.0.4? How does everyone handle this?

Hadoop interface vs class

2014-06-25 Thread Robert James
After upgrading to Spark 1.0.0, I get this error: ERROR org.apache.spark.executor.ExecutorUncaughtExceptionHandler - Uncaught exception in thread Thread[Executor task launch worker-2,5,main] java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,

Centralized Spark Logging solution

2014-06-24 Thread Robert James
We need a centralized spark logging solution. Ideally, it should: * Allow any Spark process to log at multiple levels (info, warn, debug) using a single line, similar to log4j * All logs should go to a central location - so, to read the logs, we don't need to check each worker by itself *

Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Robert James
My app works fine under Spark 0.9. I just tried upgrading to Spark 1.0, by downloading the Spark distro to a dir, changing the sbt file, and running sbt assembly, but I now get NoSuchMethodErrors when trying to use spark-submit. I copied in the SimpleApp example from

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Robert James
On 6/24/14, Peng Cheng pc...@uow.edu.au wrote: I got 'NoSuchFieldError' which is of the same type. its definitely a dependency jar conflict. spark driver will load jars of itself which in recent version get many dependencies that are 1-2 years old. And if your newer version dependency is in

Re: Unsubscribe

2014-05-23 Thread James Jones
Unsubscribe James Jones Acquisition Editor [ Packt Publishing ] Tel: 0121 265 6486 Web: www.packtpub.com Linkedin: uk.linkedin.com/pub/james-jones/52/3b9/596/ Twitter: @_James_Jones_ Packt Publishing Limited Registered Office: Livery Place, 35 Livery Street, Birmingham, West Midlands

Re: Passing runtime config to workers?

2014-05-18 Thread Robert James
--- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, May 16, 2014 at 1:59 PM, Robert James srobertja...@gmail.comwrote: What is a good way to pass config variables to workers? I've tried setting them

Workers unable to find class, even when in the SparkConf JAR list

2014-05-16 Thread Robert James
I'm using spark-ec2 to run some Spark code. When I set master to local, then it runs fine. However, when I set master to $MASTER, the workers immediately fail, with java.lang.NoClassDefFoundError for the classes. I've used sbt-assembly to make a jar with the classes, confirmed using jar tvf

Passing runtime config to workers?

2014-05-16 Thread Robert James
What is a good way to pass config variables to workers? I've tried setting them in environment variables via spark-env.sh, but, as far as I can tell, the environment variables set there don't appear in workers' environments. If I want to be able to configure all workers, what's a good way to do

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-16 Thread Robert James
I've experienced the same bug, which I had to workaround manually. I posted the details here: http://stackoverflow.com/questions/23687081/spark-workers-unable-to-find-jar-on-ec2-cluster On 5/15/14, DB Tsai dbt...@stanford.edu wrote: Hi guys, I think it maybe a bug in Spark. I wrote some code

What is the difference between a Spark Worker and a Spark Slave?

2014-05-16 Thread Robert James
What is the difference between a Spark Worker and a Spark Slave?
