Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC. wrote: >user$ pyspark [some-options] --driver-java-options > spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar This command line does not look correct. "spark.yarn.jar" is not a JVM command line option. You most probably need
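
For illustration: "spark.yarn.jar" is a Spark configuration property, not a JVM option, so it would normally be passed with "--conf" (available in Spark 1.1+) or set in spark-defaults.conf. A sketch, with a hypothetical assembly path:

    pyspark [some-options] \
      --conf spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly.jar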

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC. < subscripti...@didata.us> wrote: > user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads. > user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar=' > /usr/lib/spark/assembly/lib/spark-assembly-*.jar' > > My questi

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC. < subscripti...@didata.us> wrote: > So just to clarify for me: When specifying 'spark.yarn.jar' as I did > above, even if I don't use HDFS to create a > RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is > still necessary

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC. < subscripti...@didata.us> wrote: > You're probably right about the above because, as seen *below* for > pyspark (but probably for other Spark > applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is > specified, the app invocat

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-09 Thread Marcelo Vanzin
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also explained in the link to the docs I posted earlier.) The second interpretation below is "local:" URLs in Spark, but that doesn't work with Yarn on Spark 1.0 (so it won't work with CDH 5.1 and older either). On Mon, Sep 8,

Re: spark-streaming "Could not compute split" exception

2014-09-09 Thread Marcelo Vanzin
This has all the symptoms of Yarn killing your executors due to them exceeding their memory limits. Could you check your RM/NM logs to see if that's the case? (The error was because of an executor at domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's log file.) If that's the ca

Re: Yarn Driver OOME (Java heap space) when executors request map output locations

2014-09-09 Thread Marcelo Vanzin
Hi, Yes, this is a problem, and I'm not aware of any simple workarounds (or complex one for that matter). There are people working to fix this, you can follow progress here: https://issues.apache.org/jira/browse/SPARK-1239 On Tue, Sep 9, 2014 at 2:54 PM, jbeynon wrote: > I'm running on Yarn with

Re: spark-streaming "Could not compute split" exception

2014-09-09 Thread Marcelo Vanzin
Your executor is exiting or crashing unexpectedly: On Tue, Sep 9, 2014 at 3:13 PM, Penny Espinoza wrote: > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit > code from container container_1410224367331_0006_01_03 is : 1 > 2014-09-09 21:47:26,345 WARN > org.apache.hadoo

Re: Task not serializable

2014-09-10 Thread Marcelo Vanzin
You're using "hadoopConf", a Configuration object, in your closure. That type is not serializable. You can use "-Dsun.io.serialization.extendedDebugInfo=true" to debug serialization issues. On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra wrote: > Thanks Sean. > Please find attached my code. Let
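
A common pattern for this situation (a sketch, not taken from the thread; "sc" and "rdd" are assumed to exist) is to broadcast the configuration wrapped in Spark's SerializableWritable, which works because Hadoop's Configuration implements Writable:

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SerializableWritable

    // Configuration itself is not serializable, so wrap and broadcast it
    val confBc = sc.broadcast(new SerializableWritable(new Configuration()))

    rdd.map { record =>
      val conf = confBc.value.value // unwrap on the executor
      // ... use conf to talk to HDFS, etc. ...
      record
    }

Alternatively, a field holding the Configuration can be marked @transient so it is not captured with the closure.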

Re: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably in a routing loop.

2014-09-10 Thread Marcelo Vanzin
Yes please pretty please. This is really annoying. On Sun, Sep 7, 2014 at 6:31 AM, Ognen Duzlevski wrote: > > I keep getting below reply every time I send a message to the Spark user > list? Can this person be taken off the list by powers that be? > Thanks! > Ognen > > Forwarded Message

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:15 PM, Sean Owen wrote: > This structure is not specific to Hadoop, but in theory works in any > JAR file. You can put JARs in JARs and refer to them with Class-Path > entries in META-INF/MANIFEST.MF. Funny that you mention that, since someone internally asked the same q

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Wed, Sep 10, 2014 at 3:44 PM, Sean Owen wrote: > What's the Hadoop jar structure in question then? Is it something special > like a WAR file? I confess I had never heard of this so thought this was > about generic JAR stuff. What I've been told (and Steve's e-mail alludes to) is that you can p

Re: Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-10 Thread Marcelo Vanzin
On Wed, Sep 10, 2014 at 3:48 PM, Steve Lewis wrote: > In modern projects there are a bazillion dependencies - when I use Hadoop I > just put them in a lib directory in the jar - If I have a project that > depends on 50 jars I need a way to deliver them to Spark - maybe wordcount > can be written w

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Marcelo Vanzin
Yes, what Sandy said. On top of that, I would suggest filing a bug for a new command line argument for spark-submit to make the launcher process exit cleanly as soon as a cluster job starts successfully. That can be helpful for code that launches Spark jobs but monitors the job through different m

Re: spark-submit command-line with --files

2014-09-20 Thread Marcelo Vanzin
Hi chinchu, Where does the code trying to read the file run? Is it running on the driver or on some executor? If it's running on the driver, in yarn-cluster mode, the file should have been copied to the application's work directory before the driver is started. So hopefully just doing "new FileIn

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You'll need to look at the driver output to have a better idea of what's going on. You can use "yarn logs --applicationId blah" after your app is finished (e.g. by killing it) to look at it. My guess is that your cluster doesn't have enough resources available to service the container request you'

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
t. > > > > > On Wed, Sep 24, 2014 at 11:37 PM, Marcelo Vanzin > wrote: >> >> You'll need to look at the driver output to have a better idea of >> what's going on. You can use "yarn logs --applicationId blah" after >> your app is fin

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
n im not able to view the UI how can > change it in cloudera.? > > > > > On Thu, Sep 25, 2014 at 12:04 AM, Marcelo Vanzin > wrote: >> >> You need to use the command line yarn application that I mentioned >> ("yarn logs"). You can't look at the log

Re: Question About Submit Application

2014-09-24 Thread Marcelo Vanzin
Sounds like "spark-01" is not resolving correctly on your machine (or is the wrong address). Can you ping "spark-01" and does that reach the VM where you set up the Spark Master? On Wed, Sep 24, 2014 at 1:12 PM, danilopds wrote: > Hello, > I'm learning about Spark Streaming and I'm really excited

Re: how to run spark job on yarn with jni lib?

2014-09-25 Thread Marcelo Vanzin
Hmmm, you might be suffering from SPARK-1719. Not sure what the proper workaround is, but it sounds like your native libs are not in any of the "standard" lib directories; one workaround might be to copy them there, or add their location to /etc/ld.so.conf (I'm assuming Linux). On Thu, Sep 25, 20

Re: Question About Submit Application

2014-09-25 Thread Marcelo Vanzin
Then I think it's time for you to look at the Spark Master logs... On Thu, Sep 25, 2014 at 7:51 AM, danilopds wrote: > Hi Marcelo, > > Yes, I can ping "spark-01" and I also include the IP and host in my file > /etc/hosts. > My VM can ping the local machine too. > > > > -- > View this message in c

Re: Yarn number of containers

2014-09-25 Thread Marcelo Vanzin
On Thu, Sep 25, 2014 at 8:55 AM, jamborta wrote: > I am running spark with the default settings in yarn client mode. For some > reason yarn always allocates three containers to the application (wondering > where it is set?), and only uses two of them. The default number of executors in Yarn mode

Re: SPARK 1.1.0 on yarn-cluster and external JARs

2014-09-25 Thread Marcelo Vanzin
You can pass the HDFS location of those extra jars in the spark-submit "--jars" argument. Spark will take care of using Yarn's distributed cache to make them available to the executors. Note that you may need to provide the full hdfs URL (not just the path, since that will be interpreted as a local
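
A sketch of such a command line (paths and class name hypothetical):

    spark-submit --master yarn-cluster --class com.example.MyApp \
      --jars hdfs://namenode:8020/libs/dep1.jar,hdfs://namenode:8020/libs/dep2.jar \
      myapp.jar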

Re: Yarn number of containers

2014-09-25 Thread Marcelo Vanzin
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor. On Thu, Sep 25, 2014 at 2:20 PM, Tamas Jambor wrote: > Thank you. > > Where is the number of containers set? > > On Thu, Sep 25, 2014 at 7:17 PM,

Re: how to run spark job on yarn with jni lib?

2014-09-26 Thread Marcelo Vanzin
I assume you did those things in all machines, not just on the machine launching the job? I've seen that workaround used successfully (well, actually, they copied the library to /usr/lib or something, but same idea). On Thu, Sep 25, 2014 at 7:45 PM, taqilabon wrote: > You're right, I'm suffering

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
You can't set up the driver memory programatically in client mode. In that mode, the same JVM is running the driver, so you can't modify command line options anymore when initializing the SparkContext. (And you can't really start cluster mode apps that way, so the only way to set this is through t
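
In practice this means passing the memory to the launcher, e.g. (a sketch):

    pyspark --driver-memory 4g
    spark-submit --driver-memory 4g my_app.py

Setting spark.driver.memory in a SparkConf from within the same process comes too late, since the JVM's heap size was fixed at startup.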

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
up in a few different contexts, but I don't think there's an "official" solution yet.) On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor wrote: > thanks Marcelo. > > What's the reason it is not possible in cluster mode, either? > > On Wed, Oct 1, 2014 at 5:42 P

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
No, you can't instantiate a SparkContext to start apps in cluster mode. For Yarn, for example, you'd have to call directly into org.apache.spark.deploy.yarn.Client; that class will tell the Yarn cluster to launch the driver for you and then instantiate the SparkContext. On Wed, Oct 1, 2014 at 10:

Re: Application details for failed and teminated jobs

2014-10-02 Thread Marcelo Vanzin
You may want to take a look at this PR: https://github.com/apache/spark/pull/1558 Long story short: while not a terrible idea to show running applications, your particular case should be solved differently. Applications are responsible for calling "SparkContext.stop()" at the end of their run, cur

Re: spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Marcelo Vanzin
Hi Anurag, Spark SQL (from the Spark standard distribution / sources) currently requires Hive 0.12; as you mention, CDH4 has Hive 0.10, so that's not gonna work. CDH 5.2 ships with Spark 1.1.0 and is modified so that Spark SQL can talk to the Hive 0.13.1 that is also bundled with CDH, so if that'

Re: SPARK_SUBMIT_CLASSPATH question

2014-10-15 Thread Marcelo Vanzin
Hi Greg, I'm not sure exactly what it is that you're trying to achieve, but I'm pretty sure those variables are not supposed to be set by users. You should take a look at the documentation for "spark.driver.extraClassPath" and "spark.driver.extraLibraryPath", and the equivalent options for executo

Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-15 Thread Marcelo Vanzin
Hi Eric, Check the "Debugging Your Application" section at: http://spark.apache.org/docs/latest/running-on-yarn.html Long story short: upload your log4j.properties using the "--files" argument of spark-submit. (Mental note: we could make the log level configurable via a system property...) On
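
A sketch of that approach (paths and class name hypothetical):

    spark-submit --master yarn-cluster \
      --files /local/path/to/log4j.properties \
      --class com.example.MyApp myapp.jar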

Re: Spark assembly for YARN/CDH5

2014-10-16 Thread Marcelo Vanzin
Hi Philip, The assemblies are part of the CDH distribution. You can get them here: http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html As of Spark 1.1 (and, thus, CDH 5.2), assemblies are not published to maven repositories anymore (you can see commit [1] for details). [1] h

Re: scala: java.net.BindException?

2014-10-16 Thread Marcelo Vanzin
This error is not fatal, since Spark will retry on a different port.. but this might be a problem, for different reasons, if somehow your code is trying to instantiate multiple SparkContexts. I assume "nn.SimpleNeuralNetwork" is part of your application, and since it seems to be instantiating a ne

Re: how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread Marcelo Vanzin
On top of what Andrew said, you shouldn't need to manually add the mllib jar to your jobs; it's already included in the Spark assembly jar. On Thu, Oct 16, 2014 at 11:51 PM, eric wong wrote: > Hi, > > I'm using the comma-separated style to submit multiple jar files in the > following shell but it doe

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
Hi Ashwin, Let me try to answer to the best of my knowledge. On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar wrote: > Here are my questions : > 1. Sharing spark context : How exactly multiple users can share the cluster > using same spark > context ? That's not something you might want to

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar wrote: >> That's not something you might want to do usually. In general, a >> SparkContext maps to a user application > > My question was basically this. In this page in the official doc, under > "Scheduling within an application" section, it talks a

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Marcelo Vanzin
ext to share the same resource or 2) > add dynamic resource management for Yarn mode is very much wanted. > > Jianshi > > On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin wrote: >> >> On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar >> wrote: >> >> That's

Re: JavaHiveContext class not found error. Help!!

2014-10-23 Thread Marcelo Vanzin
Hello there, This is more of a question for the cdh-users list, but in any case... In CDH 5.1 we skipped packaging of the Hive module in SparkSQL. That has been fixed in CDH 5.2, so if it's possible for you I'd recommend upgrading. On Thu, Oct 23, 2014 at 2:53 PM, nitinkak001 wrote: > I am tryin

Re: Exceptions not caught?

2014-10-23 Thread Marcelo Vanzin
On Thu, Oct 23, 2014 at 3:40 PM, ankits wrote: > 2014-10-23 15:39:50,845 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1) > java.io.IOException: org.apache.thrift.protocol.TProtocolException: This looks like an exception that's happening on an executor and just being reported in the driver's

Re: Spark using HDFS data [newb]

2014-10-23 Thread Marcelo Vanzin
Your assessment is mostly correct. I think the only thing I'd reword is the comment about splitting the data, since Spark itself doesn't do that, but read on. On Thu, Oct 23, 2014 at 6:12 PM, matan wrote: > In case I nailed it, how then does it handle a distributed hdfs file? does > it pull all of

Re: SparkContext.stop() ?

2014-10-31 Thread Marcelo Vanzin
Actually, if you don't call SparkContext.stop(), the event log information that is used by the history server will be incomplete, and your application will never show up in the history server's UI. If you don't use that functionality, then you're probably ok not calling it as long as your applicat

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-11-05 Thread Marcelo Vanzin
On Mon, Oct 27, 2014 at 7:37 PM, buring wrote: > Here is error log,I abstract as follows: > INFO [binaryTest---main]: before first > WARN [org.apache.spark.scheduler.TaskSetManager---Result resolver > thread-0]: Lost task 0.0 in stage 0.0 (TID 0, spark-dev136): > org.xerial.snappy.SnappyEr

Re: Backporting spark 1.1.0 to CDH 5.1.3

2014-11-10 Thread Marcelo Vanzin
Hello, CDH 5.1.3 ships with a version of Hive that's not entirely the same as the Hive Spark 1.1 supports. So when building your custom Spark, you should make sure you change all the dependency versions to point to the CDH versions. IIRC Spark depends on org.spark-project.hive:0.12.0, you'd have

Re: How to incrementally compile spark examples using mvn

2014-11-15 Thread Marcelo Vanzin
I haven't tried scala:cc, but you can ask maven to just build a particular sub-project. For example: mvn -pl :spark-examples_2.10 compile On Sat, Nov 15, 2014 at 5:31 PM, Yiming (John) Zhang wrote: > Hi, > > > > I have already successfully compile and run spark examples. My problem is > that i

Re: Spark on YARN

2014-11-18 Thread Marcelo Vanzin
Can you check in your RM's web UI how much of each resource does Yarn think you have available? You can also check that in the Yarn configuration directly. Perhaps it's not configured to use all of the available resources. (If it was set up with Cloudera Manager, CM will reserve some room for daem

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
Hi Anson, We've seen this error when incompatible classes are used in the driver and executors (e.g., same class name, but the classes are different and thus the serialized data is different). This can happen for example if you're including some 3rd party libraries in your app's jar, or changing t

Re: spark-shell giving me error of unread block data

2014-11-19 Thread Marcelo Vanzin
's version of Spark, not trying to run an Apache Spark release on top of CDH, right? (If that's the case, then we could probably move this conversation to cdh-us...@cloudera.org, since it would be CDH-specific.) > On Wed Nov 19 2014 at 4:52:51 PM Marcelo Vanzin wrote: >> >

Re: How to incrementally compile spark examples using mvn

2014-11-20 Thread Marcelo Vanzin
Hi Yiming, On Wed, Nov 19, 2014 at 5:35 PM, Yiming (John) Zhang wrote: > Thank you for your reply. I was wondering whether there is a method of > reusing locally-built components without installing them? That is, if I have > successfully built the spark project as a whole, how should I configur

Re: spark-submit and logging

2014-11-20 Thread Marcelo Vanzin
Check the "--files" argument in the output of "spark-submit -h". On Thu, Nov 20, 2014 at 7:51 AM, Matt Narrell wrote: > How do I configure the files to be uploaded to YARN containers. So far, I’ve > only seen "--conf spark.yarn.jar=hdfs://….” which allows me to specify the > HDFS location of the

Re: spark-submit and logging

2014-11-20 Thread Marcelo Vanzin
Hi Tobias, With the current Yarn code, packaging the configuration in your app's jar and adding the "-Dlog4j.configuration=log4jConf.xml" argument to the extraJavaOptions configs should work. That's not the recommended way for get it to work, though, since this behavior may change in the future.
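
For illustration, assuming log4jConf.xml sits at the root of the application jar, the combination described above might look like this sketch (not the recommended long-term approach, per the above):

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4jConf.xml" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4jConf.xml" \
      app.jar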

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
Do you expect to be able to use the spark context on the remote task? If you do, that won't work. You'll need to rethink what it is you're trying to do, since SparkContext is not serializable and it doesn't make sense to make it so. If you don't, you could mark the field as @transient. But the tw
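
A minimal sketch of the @transient option (class and method names hypothetical; the context then remains usable only on the driver):

    import org.apache.spark.SparkContext

    class Analysis(@transient val sc: SparkContext) extends Serializable {
      // sc is excluded from serialization; it must never be touched
      // inside code that runs on executors
      def doubledCount(data: Seq[Int]): Long =
        sc.parallelize(data).map(_ * 2).count()
    }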

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
Hello, On Mon, Nov 24, 2014 at 12:07 PM, aecc wrote: > This is the stacktrace: > > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA > - field (class "$iwC$$iwC$$iwC$$iwC", name: "aaa", typ

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
On Mon, Nov 24, 2014 at 1:56 PM, aecc wrote: > I checked sqlContext, they use it in the same way I would like to use my > class, they make the class Serializable with transient. Does this affects > somehow the whole pipeline of data moving? I mean, will I get performance > issues when doing this b

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
That's an interesting question for which I do not know the answer. Probably a question for someone with more knowledge of the internals of the shell interpreter... On Mon, Nov 24, 2014 at 2:19 PM, aecc wrote: > Ok, great, I'm gonna do do it that way, thanks :). However I still don't > understand

Re: Is there a way to turn on spark eventLog on the worker node?

2014-11-24 Thread Marcelo Vanzin
Hello, What exactly are you trying to see? Workers don't generate any events that would be logged by enabling that config option. Workers generate logs, and those are captured and saved to disk by the cluster manager, generally, without you having to do anything. On Mon, Nov 24, 2014 at 7:46 PM,

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-12-02 Thread Marcelo Vanzin
On Tue, Dec 2, 2014 at 11:22 AM, Judy Nash wrote: > Any suggestion on how can user with custom Hadoop jar solve this issue? You'll need to include all the dependencies for that custom Hadoop jar to the classpath. Those will include Guava (which is not included in its original form as part of the

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
mllib while calling it from examples > project? > > Thanks & Regards, > Meethu M > > > On Monday, 24 November 2014 3:33 PM, Yiming (John) Zhang > wrote: > > > Thank you, Marcelo and Sean, "mvn install" is a good answer for my demands. > > -Original Message

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
hat a sub-project depends on? >>> >>> I'd rather avoid "mvn install" since this creates a local maven repo. I have >>> been stung by that before (spent a day trying to do something and got weird >>> errors because some toy version I once built was stuck in my

Re: spark shell and hive context problem

2014-12-09 Thread Marcelo Vanzin
Hello, In CDH 5.2 you need to manually add Hive classes to the classpath of your Spark job if you want to use the Hive integration. Also, be aware that since Spark 1.1 doesn't really support the version of Hive shipped with CDH 5.2, this combination is to be considered extremely experimental. On

Re: Spark 1.0.0 Standalone mode config

2014-12-10 Thread Marcelo Vanzin
Hello, What do you mean by "app that uses 2 cores and 8G of RAM"? Spark apps generally involve multiple processes. The command line options you used affect only one of them (the driver). You may want to take a look at similar configuration for executors. Also, check the documentation: http://spar

Re: Spark Server - How to implement

2014-12-11 Thread Marcelo Vanzin
t developing it as a public API, but mostly for internal Hive use. It can give you a few ideas, though. Also, SPARK-3215. On Thu, Dec 11, 2014 at 5:41 PM, Marcelo Vanzin wrote: > Hi Manoj, > > I'm not aware of any public projects that do something like that, > except for the Ooya

Re: Spark Server - How to implement

2014-12-11 Thread Marcelo Vanzin
Hi Manoj, I'm not aware of any public projects that do something like that, except for the Ooyala server which you say doesn't cover your needs. We've been playing with something like that inside Hive, though: On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel wrote: > Hi, > > If spark based services

Re: java.lang.IllegalStateException: unread block data

2014-12-12 Thread Marcelo Vanzin
Hi, This is a question more suited for cdh-us...@cloudera.org, since it's probably CDH-specific. In the meantime, check the following: - if you're using Yarn, check that you've also updated the copy of the Spark assembly in HDFS (especially if you're using CM to manage things) - make sure all JDK

Re: SPARK-2243 Support multiple SparkContexts in the same JVM

2014-12-17 Thread Marcelo Vanzin
Hi Anton, That could solve some of the issues (I've played with that a little bit). But there are still some areas where this would be sub-optimal, because Spark still uses system properties in some places and those are global, not per-class loader. (SparkSubmit is the biggest offender here, but

Re: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Marcelo Vanzin
On Fri, Dec 19, 2014 at 4:05 PM, Haopu Wang wrote: > My application doesn’t depend on hadoop-client directly. > > It only depends on spark-core_2.10 which depends on hadoop-client 1.0.4. > This can be checked by Maven repository at > http://mvnrepository.com/artifact/org.apache.spark/spark-core_2

Re: Yarn not running as many executors as I'd like

2014-12-19 Thread Marcelo Vanzin
How many cores / memory do you have available per NodeManager, and how many cores / memory are you requesting for your job? Remember that in Yarn mode, Spark launches "num executors + 1" containers. The extra container, by default, reserves 1 core and about 1g of memory (more if running in cluster

Re: Who manage the log4j appender while running spark on yarn?

2014-12-22 Thread Marcelo Vanzin
If you don't specify your own log4j.properties, Spark will load the default one (from core/src/main/resources/org/apache/spark/log4j-defaults.properties, which ends up being packaged with the Spark assembly). You can easily override the config file if you want to, though; check the "Debugging" sec

Re: different akka versions and spark

2015-01-05 Thread Marcelo Vanzin
Spark doesn't really shade akka; it pulls a different build (kept under the "org.spark-project.akka" group and, I assume, with some build-time differences from upstream akka?), but all classes are still in the original location. The upgrade is a little more unfortunate than just changing akka, sin

Re: Spark History Server can't read event logs

2015-01-07 Thread Marcelo Vanzin
The Spark code generates the log directory with "770" permissions. On top of that you need to make sure of two things: - all directories up to /apps/spark/historyserver/logs/ are readable by the user running the history server - the user running the history server belongs to the group that owns /a
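
A sketch of what to verify (the user and group names are hypothetical):

    hadoop fs -ls -d /apps /apps/spark /apps/spark/historyserver/logs
    # the user running the history server must belong to the group
    # that owns the log directory
    groups history_user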

Re: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-07 Thread Marcelo Vanzin
This particular case shouldn't cause problems since both of those libraries are java-only (the scala version appended there is just for helping the build scripts). But it does look weird, so it would be nice to fix it. On Wed, Jan 7, 2015 at 12:25 AM, Aniket Bhatnagar wrote: > It seems that spar

Re: spark 1.1 got error when working with cdh5.3.0 standalone mode

2015-01-07 Thread Marcelo Vanzin
This could be cause by many things including wrong configuration. Hard to tell with just the info you provided. Is there any reason why you want to use your own Spark instead of the one shipped with CDH? CDH 5.3 has Spark 1.2, so unless you really need to run Spark 1.1, you should be better off wi

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
apr and the user > that runs Spark in our case is a unix ID called mapr (in the mapr group). > Therefore, this can't read my job event logs as shown above. > > > Thanks, > Michael > > > -Original Message- > From: Marcelo Vanzin [mailto:van...@clo

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Nevermind my last e-mail. HDFS complains about not understanding "3777"... On Thu, Jan 8, 2015 at 9:46 AM, Marcelo Vanzin wrote: > Hmm. Can you set the permissions of "/apps/spark/historyserver/logs" > to 3777? I'm not sure HDFS respects the group id bit, but it

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Sorry for the noise; but I just remembered you're actually using MapR (and not HDFS), so maybe the "3777" trick could work... On Thu, Jan 8, 2015 at 10:32 AM, Marcelo Vanzin wrote: > Nevermind my last e-mail. HDFS complains about not understanding "3777"... > &

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Marcelo Vanzin
Just to add to Sandy's comment, check your client configuration (generally in /etc/spark/conf). If you're using CM, you may need to run the "Deploy Client Configuration" command on the cluster to update the configs to match the new version of CDH. On Thu, Jan 8, 2015 at 11:38 AM, Sandy Ryza wrote

Re: SparkSQL

2015-01-08 Thread Marcelo Vanzin
Disclaimer: this seems more of a CDH question, I'd suggest sending these to the CDH mailing list in the future. CDH 5.2 actually has Spark 1.1. It comes with SparkSQL built-in, but it does not include the thrift server because of incompatibilities with the CDH version of Hive. To use Hive support,

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
Disclaimer: CDH questions are better handled at cdh-us...@cloudera.org. But the question I'd like to ask is: why do you need your own Spark build? What's wrong with CDH's Spark that it doesn't work for you? On Thu, Jan 8, 2015 at 3:01 PM, freedafeng wrote: > Could anyone come up with your experi

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 3:33 PM, freedafeng wrote: > I installed the custom as a standalone mode as normal. The master and slaves > started successfully. > However, I got error when I ran a job. It seems to me from the error message > the some library was compiled against hadoop1, but my spark was

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
Hi Manoj, As long as you're logged in (i.e. you've run kinit), everything should just work. You can run "klist" to make sure you're logged in. On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel wrote: > Hi, > > For running spark 1.2 on Hadoop cluster with Kerberos, what spark > configurations are requi
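
For reference, a typical session looks like this sketch (principal hypothetical):

    kinit alice@EXAMPLE.COM    # obtain a Kerberos ticket
    klist                      # verify the ticket cache
    spark-submit --master yarn-cluster ...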

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 4:09 PM, Manoj Samel wrote: > Some old communication (Oct 14) says Spark is not certified with Kerberos. > Can someone comment on this aspect ? Spark standalone doesn't support kerberos. Spark running on top of Yarn works fine with kerberos. -- Marcelo --

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
I ran this with CDH 5.2 without a problem (sorry don't have 5.3 readily available at the moment): $ HBASE='/opt/cloudera/parcels/CDH/lib/hbase/\*' $ spark-submit --driver-class-path $HBASE --conf "spark.executor.extraClassPath=$HBASE" --master yarn --class org.apache.spark.examples.HBaseTest /opt/

Re: /tmp directory fills up

2015-01-12 Thread Marcelo Vanzin
Hi Alessandro, You can look for a log line like this in your driver's output: 15/01/12 10:51:01 INFO storage.DiskBlockManager: Created local directory at /data/yarn/nm/usercache/systest/appcache/application_1421081007635_0002/spark-local-20150112105101-4f3d If you're deploying your application i

Re: How does unmanaged memory work with the executor memory limits?

2015-01-12 Thread Marcelo Vanzin
Short answer: yes. Take a look at: http://spark.apache.org/docs/latest/running-on-yarn.html Look for "memoryOverhead". On Mon, Jan 12, 2015 at 2:06 PM, Michael Albert wrote: > Greetings! > > My executors apparently are being terminated because they are > "running beyond physical memory limits"
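
A sketch of raising that setting (the value is illustrative and in megabytes):

    spark-submit --master yarn-cluster \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      ...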

Re: how to run python app in yarn?

2015-01-14 Thread Marcelo Vanzin
As the error message says... On Wed, Jan 14, 2015 at 3:14 PM, freedafeng wrote: > Error: Cluster deploy mode is currently not supported for python > applications. Use "yarn-client" instead of "yarn-cluster" for pyspark apps. -- Marcelo

Re: Error when running SparkPi on Secure HA Hadoop cluster

2015-01-15 Thread Marcelo Vanzin
You're specifying the queue in the spark-submit command line: --queue thequeue Are you sure that queue exists? On Thu, Jan 15, 2015 at 11:23 AM, Manoj Samel wrote: > Hi, > > Setup is as follows > > Hadoop Cluster 2.3.0 (CDH5.0) > - Namenode HA > - Resource manager HA > - Secured with Kerbero

Re: spark java options

2015-01-16 Thread Marcelo Vanzin
Hi Kane, What's the complete command line you're using to submit the app? Where do you expect these options to appear? On Fri, Jan 16, 2015 at 11:12 AM, Kane Kim wrote: > I want to add some java options when submitting application: > --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialF

Re: spark java options

2015-01-16 Thread Marcelo Vanzin
Hi Kane, Here's the command line you sent me privately: ./spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class SimpleApp --conf "spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder" --master local simpleapp.jar ./test.log You're running the app in "local" mode. In tha
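
Since local mode runs the driver and executors in a single JVM, the flags would have to go on the driver instead; a sketch based on the command above:

    ./spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class SimpleApp \
      --driver-java-options "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder" \
      --master local simpleapp.jar ./test.log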

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-04 Thread Marcelo Vanzin
Hi Francis, This might be a long shot, but do you happen to have built spark on an encrypted home dir? (I was running into the same error when I was doing that. Rebuilding on an unencrypted disk fixed the issue. This is a known issue / limitation with ecryptfs. It's weird that the build doesn't f

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
Hi Ian, When you run your packaged application, are you adding its jar file to the SparkContext (by calling the addJar() method)? That will distribute the code to all the worker nodes. The failure you're seeing seems to indicate the worker nodes do not have access to your code. On Mon, Apr 14, 2

Re: Proper caching method

2014-04-14 Thread Marcelo Vanzin
Hi Joe, If you cache rdd1 but not rdd2, any time you need rdd2's result, it will have to be computed. It will use rdd1's cached data, but it will have to compute its result again. On Mon, Apr 14, 2014 at 5:32 AM, Joe L wrote: > Hi I am trying to cache 2Gbyte data and to implement the following
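
In code, the difference looks roughly like this sketch:

    val rdd1 = sc.parallelize(1 to 1000000).map(x => x * x).cache()
    val rdd2 = rdd1.filter(_ % 2 == 0) // rdd2 itself is not cached

    rdd2.count() // reads rdd1 from the cache, then runs the filter
    rdd2.count() // runs the filter again; caching rdd1 does not cache rdd2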

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
> other code running except for that. (BTW, I don't use in my > code above... I just removed it for security purposes.) > > Thanks, > > Ian > > > > On Mon, Apr 14, 2014 at 12:45 PM, Marcelo Vanzin > wrote: >> >> Hi Ian, >> >> When

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-18 Thread Marcelo Vanzin
Hi Sung, On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung wrote: > while (true) { > rdd.map((row : Array[Double]) => { > row(numCols - 1) = computeSomething(row) > }).reduce(...) > } > > If it fails at some point, I'd imagine that the intermediate info being > stored in row(numCols - 1) w

Re: Do developers have to be aware of Spark's fault tolerance mechanism?

2014-04-21 Thread Marcelo Vanzin
Hi Sung, On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung wrote: > The goal is to keep an intermediate value per row in memory, which would > allow faster subsequent computations. I.e., computeSomething would depend on > the previous value from the previous computation. I think the fundamental

Re: Spark is slow

2014-04-21 Thread Marcelo Vanzin
Hi Joe, On Mon, Apr 21, 2014 at 11:23 AM, Joe L wrote: > And, I haven't gotten any answers to my questions. One thing that might explain that is that, at least for me, all (and I mean *all*) of your messages are ending up in my GMail spam folder, complaining that GMail can't verify that it real

Re: Problem connecting to HDFS in Spark shell

2014-04-21 Thread Marcelo Vanzin
Hi Ken, On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken wrote: > I haven't figured out how to let the hostname default to the host mentioned > in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, > but that's not so important. Try adding "/etc/hadoop/conf" to SPARK_CLASS
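
That is, something along the lines of this sketch:

    export SPARK_CLASSPATH=/etc/hadoop/conf:$SPARK_CLASSPATH

With the Hadoop config directory on the classpath, unqualified HDFS paths resolve against the namenode configured there.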

Re: NoSuchMethodError from Spark Java

2014-04-30 Thread Marcelo Vanzin
Hi, One thing you can do is set the spark version your project depends on to "1.0.0-SNAPSHOT" (make sure it matches the version of Spark you're building); then before building your project, run "sbt publishLocal" on the Spark tree. On Wed, Apr 30, 2014 at 12:11 AM, wxhsdp wrote: > i fixed it. >

Re: Task not serializable: collect, take

2014-05-01 Thread Marcelo Vanzin
Have you tried making A extend Serializable? On Thu, May 1, 2014 at 3:47 PM, SK wrote: > Hi, > > I have the following code structure. I compiles ok, but at runtime it aborts > with the error: > Exception in thread "main" org.apache.spark.SparkException: Job aborted: > Task not serializable: java
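
That is, along these lines (a sketch; the class body is hypothetical):

    class A(val offset: Int) extends Serializable {
      def transform(x: Int): Int = x + offset
    }

    val a = new A(10)
    sc.parallelize(1 to 100).map(x => a.transform(x)).collect()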

Re: Spark and Java 8

2014-05-06 Thread Marcelo Vanzin
Hi Kristoffer, You're correct that CDH5 only supports up to Java 7 at the moment. But Yarn apps do not run in the same JVM as Yarn itself (and I believe MR1 doesn't either), so it might be possible to pass arguments in a way that tells Yarn to launch the application master / executors with the Jav

Re: Spark to utilize HDFS's mmap caching

2014-05-12 Thread Marcelo Vanzin
Is that true? I believe that API Chanwit is talking about requires explicitly asking for files to be cached in HDFS. Spark automatically benefits from the kernel's page cache (i.e. if some block is in the kernel's page cache, it will be read more quickly). But the explicit HDFS cache is a differen
