Re: Getting the number of slaves

2014-07-28 Thread Andrew Or
Yes, both of these are derived from the same source, and this source includes the driver. In other words, if you submit a job with 10 executors you will get back 11 for both statuses. 2014-07-28 15:40 GMT-07:00 Sung Hwan Chung coded...@cs.stanford.edu: Do getExecutorStorageStatus and
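
A minimal sketch of what this implies in practice (assuming the Spark 1.x Scala API; the val name is illustrative):

    // Both getExecutorStorageStatus and getExecutorMemoryStatus include the driver,
    // so an application running with N executors reports N + 1 entries.
    val numExecutors = sc.getExecutorStorageStatus.length - 1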

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Andrew Lee
Hi Jianshi, Could you tell us which HBase version you're using? By the way, a quick sanity check: can the Workers access HBase? Were you able to manually write one record to HBase with the serialize function? Hardcode and test it? From: jianshi.hu...@gmail.com Date: Fri, 25 Jul 2014

Re: Configuring Spark Memory

2014-07-23 Thread Andrew Ash
you want to configure it? Andrew On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson mar...@skimlinks.com wrote: We are having difficulties configuring Spark, partly because we still don't understand some key concepts. For instance, how many executors are there per machine in standalone mode

Re: spark-submit to remote master fails

2014-07-23 Thread Andrew Or
driver. Andrew 2014-07-23 10:40 GMT-07:00 didi did...@gmail.com: Hi all, I guess the problem has something to do with the fact that I submit the job to a remote location. I submit from an OracleVM running Ubuntu and suspect some NAT issues, maybe? akka tcp tries this address as follows from the STDERR

Re: Lost executors

2014-07-23 Thread Andrew Or
Hi Eric, Have you checked the executor logs? It is possible they died because of some exception, and the message you see is just a side effect. Andrew 2014-07-23 8:27 GMT-07:00 Eric Friedman eric.d.fried...@gmail.com: I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc

Re: Use of SPARK_DAEMON_JAVA_OPTS

2014-07-23 Thread Andrew Or
in spark should not be done through any config or environment variable that references java opts. Andrew 2014-07-23 1:04 GMT-07:00 MEETHU MATHEW meethu2...@yahoo.co.in: Hi all, Sorry for taking this topic again, still I am confused on this. I set SPARK_DAEMON_JAVA_OPTS=-XX:+UseCompressedOops
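
If the goal is to size the standalone daemons themselves, a hedged sketch of the dedicated knob in conf/spark-env.sh (the value is a placeholder):

    # Memory for the Master and Worker daemons, not for executors
    export SPARK_DAEMON_MEMORY=1g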

Re: driver memory

2014-07-23 Thread Andrew Or
-submit. The equivalent also holds for executor memory (i.e. --executor-memory). That way you don't have to wrangle with the millions of overlapping configs / environment variables for all the deploy modes. -Andrew 2014-07-23 4:18 GMT-07:00 mrm ma...@skimlinks.com: Hi, I figured out my problem
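
A sketch of the suggested invocation (class and jar names are placeholders):

    ./bin/spark-submit --driver-memory 4g --executor-memory 2g \
      --class com.example.MyApp myapp.jar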

Re: Why spark-submit command hangs?

2014-07-22 Thread Andrew Or
Hi Earthson, Is your problem resolved? The way you submit your application looks alright to me; spark-submit should be able to parse the combination of --master and --deploy-mode correctly. I suspect you might have hard-coded yarn-cluster or something in your application. Andrew 2014-07-22 1

Re: hadoop version

2014-07-22 Thread Andrew Or
/ephemeral-hdfs. Andrew 2014-07-22 7:07 GMT-07:00 mrm ma...@skimlinks.com: Hi, Where can I find the version of Hadoop my cluster is using? I launched my ec2 cluster using the spark-ec2 script with the --hadoop-major-version=2 option. However, the folder hadoop-native/lib in the master node only

RE: Hive From Spark

2014-07-22 Thread Andrew Lee
for Hive-on-Spark now. On Mon, Jul 21, 2014 at 6:27 PM, Andrew Lee alee...@hotmail.com wrote: Hive and Hadoop are using an older version of guava libraries (11.0.1) where Spark Hive is using guava 14.0.1+. The community isn't willing to downgrade to 11.0.1 which is the current version

RE: Hive From Spark

2014-07-21 Thread Andrew Lee
Hi All, Currently, if you are running the Spark HiveContext API with Hive 0.12, it won't work due to the following 2 libraries, which are not consistent with Hive 0.12 and Hadoop as well. (Hive libs align with Hadoop libs, and as a common practice they should be consistent to interoperate.)

Re: Why spark-submit command hangs?

2014-07-21 Thread Andrew Or
is deprecated) - add --master yarn-cluster in your spark-submit command Another worrying thing is the warning from your logs: 14/07/21 22:38:42 WARN spark.SparkConf: null jar passed to SparkContext constructor How are you creating your SparkContext? Andrew 2014-07-21 7:47 GMT-07:00 Sam Liu

Re: LiveListenerBus throws exception and weird web UI bug

2014-07-21 Thread Andrew Or
workaround for this issue, but you might try to reduce the number of concurrently running tasks (partitions) to avoid emitting too many events. The root cause of the listener queue taking too much time to process events is recorded in SPARK-2316, which we also intend to fix by Spark 1.1. Andrew

Re: gain access to persisted rdd

2014-07-21 Thread Andrew Or
. Andrew 2014-07-21 8:37 GMT-07:00 mrm ma...@skimlinks.com: Hi, I am using pyspark and have persisted a list of rdds within a function, but I don't have a reference to them anymore. The RDDs are listed in the UI, under the Storage tab, and they have names associated with them (e.g. 4
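
In the Scala API at least, a sketch of re-obtaining persisted RDDs by the ids shown on the Storage tab (id 4 is just an example):

    // Map from RDD id to the persisted RDD
    val persisted = sc.getPersistentRDDs
    persisted(4).unpersist()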

Re: Give more Java Heap Memory on Standalone mode

2014-07-21 Thread Andrew Or
line. Andrew 2014-07-21 10:01 GMT-07:00 Nick R. Katsipoulakis kat...@cs.pitt.edu: Thank you Abel, It seems that your advice worked. Even though I receive a message that it is a deprecated way of defining Spark Memory (the system prompts that I should set spark.driver.memory), the memory

Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-21 Thread Andrew Or
, it seems that you set your log level to WARN. The cause is most probably that the cache is not big enough (setting the log level to INFO will provide you with more information on the exact sizes that are being used by the storage and the blocks). Andrew 2014-07-19 13:01 GMT-07:00 rindra

Re: How to map each line to (line number, line)?

2014-07-21 Thread Andrew Ash
I'm not sure if you guys ever picked a preferred method for doing this, but I just encountered it and came up with this method that's working reasonably well on a small dataset. It should be quite easily generalizable to non-String RDDs. def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
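
One possible way to round out such a helper, as a sketch using RDD.zipWithIndex (available from Spark 1.0):

    import org.apache.spark.rdd.RDD

    def addRowNumber(r: RDD[String]): RDD[(Long, String)] =
      r.zipWithIndex.map { case (line, idx) => (idx, line) }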

Re: What is shuffle spill to memory?

2014-07-18 Thread Andrew Or
metrics are aggregated over the entire duration of the task (i.e. within each task you can spill multiple times). Andrew 2014-07-18 4:09 GMT-07:00 Sébastien Rainville sebastienrainvi...@gmail.com : Hi, in the Spark UI, one of the metrics is shuffle spill (memory). What is it exactly? Spilling

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Andrew Or
to be some inconsistency or missing pieces in the logs you posted. After an executor says driver disassociated, what happens in the driver logs? Is an exception thrown or something? It would be useful if you could also post your conf/spark-env.sh. Andrew 2014-07-17 14:11 GMT-07:00 Marcelo Vanzin

Re: Error with spark-submit (formatting corrected)

2014-07-17 Thread Andrew Or
thing to check is whether the node from which you launch spark submit can access the internal address of the master (and port 7077). One quick way to verify that is to attempt a telnet into it. Let me know if you find anything. Andrew 2014-07-17 15:57 GMT-07:00 ranjanp piyush_ran...@hotmail.com: Hi

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Andrew Or
Hi Chen, spark.executor.extraJavaOptions was introduced in Spark 1.0, not Spark 0.9. You need to export SPARK_JAVA_OPTS="-Dspark.config1=value1 -Dspark.config2=value2" in conf/spark-env.sh. Let me know if that works. Andrew 2014-07-17 18:15 GMT-07:00 Tathagata Das tathagata.das1
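
For reference, a sketch of the Spark 1.0+ equivalent in conf/spark-defaults.conf (the JVM flag is illustrative):

    spark.executor.extraJavaOptions  -XX:+PrintGCDetails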

Re: jar changed on src filesystem

2014-07-17 Thread Andrew Or
HDFS. Try removing all old jars from your .sparkStaging directory and try again? Let me know if that does the job, Andrew 2014-07-16 23:42 GMT-07:00 cmti95035 cmti95...@gmail.com: They're all the same version. Actually even without the --jars parameter it got the same error. Looks like

Re: Errors accessing hdfs while in local mode

2014-07-17 Thread Andrew Or
still work (I just tried this on my own EC2 cluster). By the way, SPARK_MASTER is actually deprecated. Instead, please use bin/spark-submit --master [your master]. Andrew 2014-07-16 23:46 GMT-07:00 Akhil Das ak...@sigmoidanalytics.com: You can try the following in the spark-shell: 1. Run

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Andrew Or
SPARK_JAVA_OPTS is deprecated as of 1.0) 2014-07-17 21:08 GMT-07:00 Chen Song chen.song...@gmail.com: Thanks Andrew. Say that I want to turn on CMS gc for each worker. All I need to do is add the following line to conf/spark-env.sh on node where I submit the application. -XX

running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
Hello community, tried to run a spark app on yarn, using cloudera hadoop and spark distro (from http://archive.cloudera.com/cdh5/cdh/5) hadoop version: hadoop-2.3.0-cdh5.0.3.tar.gz spark version: spark-0.9.0-cdh5.0.3.tar.gz DEFAULT_YARN_APPLICATION_CLASSPATH is part of the hadoop-api-yarn jar ...

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
thanks Sandy, no CM-managed cluster, straight from the cloudera tar ( http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.3.tar.gz) trying your suggestion immediately! thanks so much for taking the time.. On Wed, Jul 16, 2014 at 1:10 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Andrew

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
...@cloudera.com wrote: Andrew, Are you running on a CM-managed cluster? I just checked, and there is a bug here (fixed in 1.0), but it's avoided by having yarn.application.classpath defined in your yarn-site.xml. -Sandy On Wed, Jul 16, 2014 at 10:02 AM, Sean Owen so...@cloudera.com wrote

Re: running Spark App on Yarn produces: Exception in thread main java.lang.NoSuchFieldException: DEFAULT_YARN_APPLICATION_CLASSPATH

2014-07-16 Thread Andrew Milkowski
/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*</value> </property> On Wed, Jul 16, 2014 at 1:47 PM, Andrew Milkowski amgm2...@gmail.com wrote: Sandy, perfect! you saved me tons of time! added this in yarn-site.xml, job ran

Re: hdfs replication on saving RDD

2014-07-15 Thread Andrew Ash
In general it would be nice to be able to configure replication on a per-job basis. Is there a way to do that without changing the config values in the Hadoop conf/ directory between jobs? Maybe by modifying OutputFormats or the JobConf ? On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia
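
One per-job workaround that is sometimes suggested (a sketch; it assumes the output format reads dfs.replication from the job's Hadoop configuration, and the path is a placeholder):

    // Lower the replication factor for this job's output only
    sc.hadoopConfiguration.set("dfs.replication", "2")
    rdd.saveAsTextFile("hdfs:///path/to/output")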

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Andrew Ash
Hi Nan, Great digging in -- that makes sense to me for when a job is producing some output handled by Spark like a .count or .distinct or similar. For the other part of the question, I'm also interested in side effects like an HDFS disk write. If one task is writing to an HDFS path and another

Re: --cores option in spark-shell

2014-07-14 Thread Andrew Or
Yes, the documentation is actually a little outdated. We will get around to fixing it shortly. Please use --driver-cores or --executor-cores instead. 2014-07-14 19:10 GMT-07:00 cjwang c...@cjwang.us: They don't work in the new 1.0.1 either -- View this message in context:

RE: SPARK_CLASSPATH Warning

2014-07-11 Thread Andrew Lee
As mentioned, deprecated in Spark 1.0+. Try to use --driver-class-path instead: ./bin/spark-shell --driver-class-path yourlib.jar:abc.jar:xyz.jar Don't use a glob *; specify the JARs one by one, separated by colons. Date: Wed, 9 Jul 2014 13:45:07 -0700 From: kat...@cs.pitt.edu Subject: SPARK_CLASSPATH Warning

RE: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-11 Thread Andrew Lee
Ok, I found it on JIRA SPARK-2390: https://issues.apache.org/jira/browse/SPARK-2390 So it looks like this is a known issue. From: alee...@hotmail.com To: user@spark.apache.org Subject: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option? Date: Tue, 8 Jul 2014 15:17:00

Re: executor failed, cannot find compute-classpath.sh

2014-07-11 Thread Andrew Or
-submit (or spark-shell, which calls spark-submit) with the --verbose flag. Let me know if this fixes it. I will get to fixing the root problem soon. Andrew 2014-07-10 18:43 GMT-07:00 cjwang c...@cjwang.us: Andrew, Thanks for replying. I did the following and the result was still the same. 1

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Andrew Or
? Andrew 2014-07-10 10:17 GMT-07:00 Aris Vlasakakis a...@vlasakakis.com: Thank you very much Yana for replying! So right now the set up is a single-node machine which is my cluster, and YES you are right my submitting laptop has a different path to the spark-1.0.0 installation than the cluster

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Andrew Or
Yes, there are a few bugs in the UI in the event of a node failure. The duplicated stages in both the active and completed tables should be fixed by this PR: https://github.com/apache/spark/pull/1262 The fact that the progress bar on the stages page displays an overflow (e.g. 5/4) is still an

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread Andrew Or
.mbox/%3cCAMJOb8mYTzxrHWcaDOnVoOTw1TFrd9kJjOyj1=nkgmsk5vs...@mail.gmail.com%3e Andrew 2014-07-10 1:57 GMT-07:00 cjwang c...@cjwang.us: Not sure that was what I want. I tried to run Spark Shell on a machine other than the master and got the same error. The 192 was supposed to be a simple

Re: Purpose of spark-submit?

2014-07-09 Thread Andrew Or
I don't see why using SparkSubmit.scala as your entry point would be any different, because all that does is invoke the main class of Client.scala (e.g. for Yarn) after setting up all the class paths and configuration options. (Though I haven't tried this myself) 2014-07-09 9:40 GMT-07:00 Ron

Re: Scheduling in spark

2014-07-08 Thread Andrew Or
Here's the most updated version of the same page: http://spark.apache.org/docs/latest/job-scheduling 2014-07-08 12:44 GMT-07:00 Sujeet Varakhedi svarakh...@gopivotal.com: This is a good start: http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html On Tue, Jul 8, 2014 at 9:11

spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-08 Thread Andrew Lee
Build: Spark 1.0.0 rc11 (git commit tag: 2f1dc868e5714882cf40d2633fb66772baf34789) Hi All, When I enabled the spark-defaults.conf for the eventLog, spark-shell broke while spark-submit works. I'm trying to create a separate directory per user to keep track of their own Spark job event

Re: Spark: All masters are unresponsive!

2014-07-08 Thread Andrew Or
It seems that your driver (which I'm assuming you launched on the master node) can now connect to the Master, but your executors cannot. Did you make sure that all nodes have the same conf/spark-defaults.conf, conf/spark-env.sh, and conf/slaves? It would be good if you can post the stderr of the

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
or the --master parameter to spark-submit. We will update the documentation shortly. Thanks for letting us know. Andrew 2014-07-08 16:29 GMT-07:00 Mikhail Strebkov streb...@gmail.com: Hi! I've been using Spark compiled from 1.0 branch at some point (~2 month ago). The setup is a standalone cluster with 4

Re: tiers of caching

2014-07-07 Thread Andrew Or
Others have also asked for this on the mailing list, and hence there's a related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur brings up a good point in that any current implementation of in-memory shuffles will compete with application RDD blocks. I think we should definitely add

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
the red text is because it appears only on the driver containers, not the executor containers. This is because SparkUI belongs to the SparkContext, which only exists on the driver. Andrew 2014-07-07 11:20 GMT-07:00 Yan Fang yanfang...@gmail.com: Hi guys, Not sure if you have similar issues. Did

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
the redirect error has little to do with Spark itself, but more to do with how you set up the cluster. I have actually run into this myself, but I haven't found a workaround. Let me know if you find anything. 2014-07-07 12:07 GMT-07:00 Chester Chen ches...@alpinenow.com: As Andrew explained, the port

RE: Spark logging strategy on YARN

2014-07-07 Thread Andrew Lee
Hi Kudryavtsev, Here's what I am doing as a common practice and reference. I don't want to say it is best practice, since it requires a lot of customer experience and feedback, but from a development and operating standpoint, it will be great to separate the YARN container logs from the Spark

Re: reading compress lzo files

2014-07-06 Thread Andrew Ash
LZO-compressed data, so I know there's not a version issue. Andrew On Sun, Jul 6, 2014 at 12:02 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I’ve been reading through several pages trying to figure out how to set up my spark-ec2 cluster to read LZO-compressed files from S3
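
A sketch of reading LZO-compressed text with the hadoop-lzo input format (this assumes the hadoop-lzo jar and native libraries are on the cluster; the path is a placeholder):

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    val lines = sc.newAPIHadoopFile("s3n://bucket/logs/*.lzo",
        classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
      .map(_._2.toString)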

Re: RDD join: composite keys

2014-07-03 Thread Andrew Ash
= ((k._1, k._2), k._3))) Note that when using .join though, that is an inner join so you only get results from (id1, id2) pairs that have BOTH a score1 and a score2. Andrew On Wed, Jul 2, 2014 at 5:12 PM, Sameer Tilak ssti...@live.com wrote: Hi everyone, Is it possible to join RDDs using
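
A sketch of the full composite-key pattern being described (the input RDD names and field positions are illustrative):

    // Re-key both RDDs by the composite (id1, id2) before joining
    val left   = scores1.map(k => ((k._1, k._2), k._3))  // ((id1, id2), score1)
    val right  = scores2.map(k => ((k._1, k._2), k._3))  // ((id1, id2), score2)
    val joined = left.join(right)  // inner join: keeps keys present on BOTH sides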

Re: write event logs with YARN

2014-07-03 Thread Andrew Or
Hi Christophe, another Andrew speaking. Your configuration looks fine to me. From the stack trace it seems that we are in fact closing the file system prematurely elsewhere in the system, such that when it tries to write the APPLICATION_COMPLETE file it throws the exception you see. This does

Re: Run spark unit test on Windows 7

2014-07-02 Thread Andrew Or
Hi Konstantin, We use hadoop as a library in a few places in Spark. I wonder why the path includes null though. Could you provide the full stack trace? Andrew 2014-07-02 9:38 GMT-07:00 Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com: Hi all, I'm trying to run some transformation

Re: Help alleviating OOM errors

2014-07-02 Thread Andrew Or
executor will also die of the same problem. Best, Andrew 2014-07-02 6:22 GMT-07:00 Yana Kadiyska yana.kadiy...@gmail.com: Can you elaborate why You need to configure the spark.shuffle.spill true again in the config -- the default for spark.shuffle.spill is set to true according to the doc

Re: NullPointerException on ExternalAppendOnlyMap

2014-07-02 Thread Andrew Or
your null keys before passing your key value pairs to a combine operator (e.g. groupBy, reduceByKey). For instance, rdd.map { case (k, v) => if (k == null) (SPECIAL_VALUE, v) else (k, v) }. Best, Andrew 2014-07-02 10:22 GMT-07:00 Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com: Hi all

RE: write event logs with YARN

2014-07-02 Thread Andrew Lee
Hi Christophe, Make sure you have 3 slashes in the hdfs scheme. e.g. hdfs:///server_name:9000/user/user_name/spark-events and in the spark-defaults.conf as well: spark.eventLog.dir=hdfs:///server_name:9000/user/user_name/spark-events Date: Thu, 19 Jun 2014 11:18:51 +0200 From:

Re: About StorageLevel

2014-06-26 Thread Andrew Or
RDDs they are most interested in, so it makes sense to give them control over caching behavior. Best, Andrew 2014-06-26 5:36 GMT-07:00 tomsheep...@gmail.com tomsheep...@gmail.com: Hi all, I have a newbie question about StorageLevel of spark. I came up with these sentences in spark documents

Re: Spark 1.0.0 on yarn cluster problem

2014-06-25 Thread Andrew Or
Hi Sophia, did you ever resolve this? A common cause for not giving resources to the job is that the RM cannot communicate with the workers. This itself has many possible causes. Do you have a full stack trace from the logs? Andrew 2014-06-13 0:46 GMT-07:00 Sophia sln-1...@163.com

Re: problem about cluster mode of spark 1.0.0

2014-06-24 Thread Andrew Or
://issues.apache.org/jira/browse/SPARK-2260. Thanks for pointing this out, and we will get to fixing these shortly. Best, Andrew 2014-06-20 6:06 GMT-07:00 Gino Bustelo lbust...@gmail.com: I've found that the jar will be copied to the worker from hdfs fine, but it is not added to the spark context

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-23 Thread Andrew Lee
I checked the source code, it looks like it was re-added back based on JIRA SPARK-1588, but I don't know if there's any test case associated with this? SPARK-1588. Restore SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS for YARN. Sandy Ryza sa...@cloudera.com 2014-04-29 12:54:02 -0700

Re: hi

2014-06-23 Thread Andrew Or
Ah never mind. The 0.0.0.0 is for the UI, not for the Master, which uses the output of the hostname command. But yes, long story short, go to the web UI and use that URL. 2014-06-23 11:13 GMT-07:00 Andrew Or and...@databricks.com: Hm, spark://localhost:7077 should work, because the standalone

Re: 1.0.1 release plan

2014-06-20 Thread Andrew Ash
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the below issues without running a patched version of Spark: https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version conflicts for client applications https://issues.apache.org/jira/browse/SPARK-2043 --

Re: options set in spark-env.sh is not reflecting on actual execution

2014-06-20 Thread Andrew Or
to the SparkContext (see http://spark.apache.org/docs/latest/configuration.html#spark-properties). Andrew 2014-06-18 22:21 GMT-07:00 MEETHU MATHEW meethu2...@yahoo.co.in: Hi all, I have a doubt regarding the options in spark-env.sh. I set the following values in the file in master and 2 workers
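
A sketch of setting Spark properties programmatically rather than through spark-env.sh (the property values are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)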

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-19 Thread Andrew Or
will be done through spark-submit, so you may miss out on relevant new features or bug fixes. Andrew 2014-06-19 7:41 GMT-07:00 Koert Kuipers ko...@tresata.com: still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark standalone. for example if i have an akka timeout setting that i

Re: Getting started : Spark on YARN issue

2014-06-19 Thread Andrew Or
if that does the job. Andrew 2014-06-19 6:04 GMT-07:00 Praveen Seluka psel...@qubole.com: I am trying to run Spark on YARN. I have a hadoop 2.2 cluster (YARN + HDFS) in EC2. Then, I compiled Spark using Maven with the 2.2 hadoop profiles. Now I am trying to run the example Spark job. (In Yarn-cluster

Re: Getting started : Spark on YARN issue

2014-06-19 Thread Andrew Or
(Also, an easier workaround is to simply submit the application from within your cluster, thus saving you all the manual labor of reconfiguring everything to use public hostnames. This may or may not be applicable to your use case.) 2014-06-19 14:04 GMT-07:00 Andrew Or and...@databricks.com

HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Hi All, Has anyone run into the same problem? By looking at the source code in the official release (rc11), this property setting is set to false by default; however, I'm seeing that the .sparkStaging folder remains on HDFS and is filling up the disk pretty fast, since SparkContext deploys
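
Assuming the property in question is spark.yarn.preserve.staging.files, it would appear in conf/spark-defaults.conf as follows:

    # false (the default) means .sparkStaging should be deleted when the app finishes
    spark.yarn.preserve.staging.files  false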

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-18 Thread Andrew Lee
Forgot to mention that I am using spark-submit to submit jobs, and a verbose-mode printout looks like this with the SparkPi example. The .sparkStaging won't be deleted. My thought is that this should be part of the staging and should be cleaned up as well when sc gets terminated.

Re: Spark is now available via Homebrew

2014-06-18 Thread Andrew Ash
What's the advantage of Apache maintaining the brew installer vs users? Apache handling it means more work on this dev team, but probably a better experience for brew users. Just wanted to weigh pros/cons before committing to support this installation method. Andrew On Wed, Jun 18, 2014 at 5

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Andrew Ash
Wait, so the file only has four lines and the job is running out of heap space? Can you share the code you're running that does the processing? I'd guess that you're doing some intense processing on every line, but just writing parsed case classes back to disk sounds very lightweight. On Wed,

Re: Memory footprint of Calliope: Spark - Cassandra writes

2014-06-17 Thread Andrew Ash
Gerard, Strings in particular are very inefficient because they're stored in a two-byte format by the JVM. If you use the Kryo serializer and StorageLevel.MEMORY_ONLY_SER then Kryo stores Strings in UTF8, which for ASCII-like strings will take half the space. Andrew On Tue, Jun 17
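
A sketch of the suggested combination (Spark 1.0-era property names; conf and rdd are assumed to be in scope):

    import org.apache.spark.storage.StorageLevel

    // Serialize cached data with Kryo so ASCII-like strings are stored as UTF-8 bytes
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)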

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Andrew Or
Standalone-client mode is not officially supported at the moment. For standalone-cluster and yarn-client modes, however, they should work. For both modes, are you running spark-submit from within the cluster, or outside of it? If the latter, could you try running it from within the cluster and

Re: join operation is taking too much time

2014-06-17 Thread Andrew Or
How long does it get stuck for? This is a common sign of the OS thrashing due to running out of memory. If you keep it running longer, does it throw an error? Depending on how large your other RDD is (and your join operation), memory pressure may or may not be the problem at all. It could be

Re: Wildcard support in input path

2014-06-17 Thread Andrew Ash
In Spark you can use the normal globs supported by Hadoop's FileSystem, which are documented here: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path) On Wed, Jun 18, 2014 at 12:09 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
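
For example (a sketch; the path is illustrative):

    // Hadoop-style globs are accepted anywhere a path is
    val logs = sc.textFile("hdfs:///logs/2014-06-*/part-*")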

Re: spark master UI does not keep detailed application history

2014-06-16 Thread Andrew Or
Are you referring to accessing a SparkUI for an application that has finished? First you need to enable event logging while the application is still running. In Spark 1.0, you set this by adding a line to $SPARK_HOME/conf/spark-defaults.conf: spark.eventLog.enabled true Other than that, the
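
The corresponding lines in $SPARK_HOME/conf/spark-defaults.conf would look like this (the log directory is a placeholder):

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8020/user/spark/event-logs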

Re: Spark 1.0.0 Standalone AppClient cannot connect Master

2014-06-12 Thread Andrew Or
Hi Wang Hao, This is not removed. We moved it here: http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html If you're building with SBT, and you don't specify the SPARK_HADOOP_VERSION, then it defaults to 1.0.4. Andrew 2014-06-12 6:24 GMT-07:00 Hao Wang wh.s...@gmail.com

Re: use spark-shell in the source

2014-06-12 Thread Andrew Or
Not sure if this is what you're looking for, but have you looked at Java's ProcessBuilder? You can do something like for (line <- lines) { val command = line.split(" ") // You may need to deal with quoted strings val process = new ProcessBuilder(command) // redirect output of process to main
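
A more complete, runnable version of that sketch (argument quoting is deliberately ignored; the commands are placeholders):

    import scala.collection.JavaConverters._

    val lines = Seq("echo hello", "echo world")
    for (line <- lines) {
      val command = line.split(" ").toSeq  // naive split; quoted strings need extra care
      val process = new ProcessBuilder(command.asJava)
        .inheritIO()   // redirect the child's output to this process
        .start()
      process.waitFor()
    }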

Re: Can't find pyspark when using PySpark on YARN

2014-06-10 Thread Andrew Or
/201406.mbox/%3ccamjob8mr1+ias-sldz_rfrke_na2uubnmhrac4nukqyqnun...@mail.gmail.com%3e As described in the link, the last resort is to try building your assembly jar with JAVA_HOME set to Java 6. This usually fixes the problem (more details in the link provided). Cheers, Andrew 2014-06-10 6:35 GMT

Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
Can you try file:/root/spark_log? 2014-06-10 19:22 GMT-07:00 zhen z...@latrobe.edu.au: I checked the permission on root and it is the following: drwxr-xr-x 20 root root 4096 Jun 11 01:05 root So anyway, I changed to use /tmp/spark_log instead and this time I made sure that all

Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
No, I meant pass the path to the history server start script. 2014-06-10 19:33 GMT-07:00 zhen z...@latrobe.edu.au: Sure here it is: drwxrwxrwx 2 1000 root 4096 Jun 11 01:05 spark_logs Zhen -- View this message in context:

Re: Comprehensive Port Configuration reference?

2014-06-09 Thread Andrew Ash
Hi Jacob, The port configuration docs that we worked on together are now available at: http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security Thanks for the help! Andrew On Wed, May 28, 2014 at 3:21 PM, Jacob Eisinger jeis...@us.ibm.com wrote: Howdy

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
. Andrew On Thu, Jun 5, 2014 at 2:15 PM, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Hi All, Please help me set Executor JVM memory size. I am using Spark shell and it appears that the executors are started with a predefined JVM heap of 512m as soon as Spark shell starts. How can I change

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
for why SPARK_MEM was deprecated. See https://github.com/apache/spark/pull/99 On Thu, Jun 5, 2014 at 2:37 PM, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Thank you, Andrew, I am using Spark 0.9.1 and tried your approach like this: bin/spark-shell --driver-java-options

Re: Join : Giving incorrect result

2014-06-05 Thread Andrew Ash
think some fixes in spilling landed. Andrew On Thu, Jun 5, 2014 at 3:05 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Ajay, thanks for reporting this. There was indeed a bug, specifically in the way join tasks spill to disk (which happened when you had more concurrent tasks competing

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Andrew Ash
as the work that Aaron mentioned is happening, I think he might be referring to the discussion and code surrounding https://issues.apache.org/jira/browse/SPARK-983 Cheers! Andrew On Thu, Jun 5, 2014 at 5:16 PM, Roger Hoover roger.hoo...@gmail.com wrote: I think it would be very handy to be able

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Andrew Ash
Just curious, what do you want your custom RDD to do that the normal ones don't? On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote: hi, folks, is there any easier way to define a custom RDD in Java? I am wondering if I have to define a new java class which

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Andrew Ash
can at least confirm that the setting is making it to the application with that webui. Cheers, Andrew On Wed, Jun 4, 2014 at 3:48 AM, nilmish nilmish@gmail.com wrote: The error is resolved. I was using a comparator which was not serialised because of which it was throwing the error. I have

Re: How to change default storage levels

2014-06-04 Thread Andrew Ash
You can change the storage level on an individual RDD with .persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change the default persistence level for RDDs. Andrew On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan karda...@gmail.com wrote: Hi I'm using Spark 0.9.1 and Shark
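
A short sketch of the per-RDD form (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // The storage level is chosen per RDD; there is no global default to override
    val cached = sc.textFile("hdfs:///data/input").persist(StorageLevel.MEMORY_AND_DISK)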

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
When you group by IP address in step 1 to this: (ip1,(lat1,lon1),(lat2,lon2)) (ip2,(lat3,lon3),(lat4,lon4)) How many lat/lon locations do you expect for each IP address? avg and max are interesting. Andrew On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov oleg.proudni

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-03 Thread Andrew Or
: https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html I've tested that zipped modules can as least be imported via zipimport. Any ideas? -Simon On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or and...@databricks.com wrote: Hi Simon, You shouldn't have

Re: WebUI's Application count doesn't get updated

2014-06-03 Thread Andrew Ash
Your applications are probably not connecting to your existing cluster and instead running in local mode. Are you passing the master URL to the SparkPi application? Andrew On Tue, Jun 3, 2014 at 12:30 AM, MrAsanjar . afsan...@gmail.com wrote: - HI all, - Application running

Re: How to create RDDs from another RDD?

2014-06-03 Thread Andrew Ash
Hmm that sounds like it could be done in a custom OutputFormat, but I'm not familiar enough with custom OutputFormats to say that's the right thing to do. On Tue, Jun 3, 2014 at 10:23 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi Andrew, Thanks for your answer. The reason of the question

Re: Error related to serialisation in spark streaming

2014-06-03 Thread Andrew Ash
Hi Mayur, is that closure cleaning a JVM issue or a Spark issue? I'm used to thinking of closure cleaner as something Spark built. Do you have somewhere I can read more about this? On Tue, Jun 3, 2014 at 12:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: So are you using Java 7 or 8. 7

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Andrew Or
, the steps outlined there are quite useful. Let me know if you get it working (or not). Cheers, Andrew 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen xche...@gmail.com: Hi folks, I have a weird problem when using pyspark with yarn. I started ipython as follows: IPYTHON=1 ./pyspark --master

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-27 Thread Andrew Ash
/issues/171 Pull request that adds an AvroSerializer to Chill: https://github.com/twitter/chill/pull/172 Issue on the old Spark tracker: https://spark-project.atlassian.net/browse/SPARK-746 Matt can you comment if this change helps you streamline that gist even further? Andrew On Tue, May 27, 2014

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Andrew Ash
get you started? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala Cheers, Andrew On Tue, May 27, 2014 at 4:10 AM, Carter gyz...@hotmail.com wrote: Any suggestion is very much appreciated. -- View this message in context: http

Re: Running a spark-submit compatible app in spark-shell

2014-05-26 Thread Andrew Or
Hi Roger, This was due to a bug in the Spark shell code, and is fixed in the latest master (and RC11). Here is the commit that fixed it: https://github.com/apache/spark/commit/8edbee7d1b4afc192d97ba192a5526affc464205. Try it now and it should work. :) Andrew 2014-05-26 10:35 GMT+02:00 Perttu

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-25 Thread Andrew Ash
Hi Martin, Tim suggested that you pastebin the mesos logs -- can you share those for the list? Cheers, Andrew On Thu, May 15, 2014 at 5:02 PM, Martin Weindel martin.wein...@gmail.com wrote: Andrew, thanks for your response. When using the coarse mode, the jobs run fine. My problem

Re: problem about broadcast variable in iteration

2014-05-25 Thread Andrew Ash
/spark/pull/126 Alternatively, it sounds like your algorithm needs some additional state to join against to produce each successive iteration of RDD. Have you considered storing that data in an RDD rather than a broadcast variable? Andrew On Wed, May 7, 2014 at 10:02 PM, randylu randyl...@gmail.com

Re: KryoSerializer Exception

2014-05-25 Thread Andrew Ash
. Do you have a sense of how large the serialized items in your RDD are? Andrew On Sat, May 10, 2014 at 6:32 AM, Andrea Esposito and1...@gmail.com wrote: UP, doesn't anyone know something about it? ^^ 2014-05-06 12:05 GMT+02:00 Andrea Esposito and1...@gmail.com: Hi there, sorry if i'm

Re: Comprehensive Port Configuration reference?

2014-05-25 Thread Andrew Ash
-port-to-send-to Thanks for taking a look through! I also realized that I had a couple mistakes with the 0.9 to 1.0 transition so appropriately documented those now as well in the updated PR. Cheers! Andrew On Fri, May 23, 2014 at 2:43 PM, Jacob Eisinger jeis...@us.ibm.com wrote: Howdy Andrew

Re: Comprehensive Port Configuration reference?

2014-05-23 Thread Andrew Ash
. Cheers, Andrew On Wed, May 7, 2014 at 10:19 AM, Mark Baker dist...@acm.org wrote: On Tue, May 6, 2014 at 9:09 AM, Jacob Eisinger jeis...@us.ibm.com wrote: In a nutshell, Spark opens up a couple of well-known ports. And, then the workers and the shell open up dynamic ports for each job

Re: Computing cosine similiarity using pyspark

2014-05-23 Thread Andrew Ash
for future users to leverage! Andrew On Thu, May 22, 2014 at 10:49 AM, jamal sasha jamalsha...@gmail.com wrote: Hi, I have bunch of vectors like [0.1234,-0.231,0.23131] and so on. and I want to compute cosine similarity and pearson correlation using pyspark.. How do I do this? Any
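
For the cosine part, a minimal sketch of the arithmetic (written in Scala for concreteness; the same computation maps directly onto a pyspark map over vector pairs):

    // Cosine similarity of two equal-length dense vectors
    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot   = (a, b).zipped.map(_ * _).sum
      val normA = math.sqrt(a.map(x => x * x).sum)
      val normB = math.sqrt(b.map(x => x * x).sum)
      dot / (normA * normB)
    }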
