Hi Francis,
This might be a long shot, but do you happen to have built spark on an
encrypted home dir?
(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't
Hi Ian,
When you run your packaged application, are you adding its jar file to
the SparkContext (by calling the addJar() method)?
That will distribute the code to all the worker nodes. The failure
you're seeing seems to indicate the worker nodes do not have access to
your code.
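Not the original poster's code, just a minimal sketch of the suggestion (the jar path and app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: register the application's jar with the SparkContext so Spark
// ships it to every worker node. Path and app name are placeholders.
val conf = new SparkConf().setAppName("my-app")
val sc = new SparkContext(conf)
sc.addJar("/path/to/my-app-assembly.jar")
```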
On Mon, Apr 14,
Hi Joe,
If you cache rdd1 but not rdd2, any time you need rdd2's result, it
will have to be computed. It will use rdd1's cached data, but it will
have to compute its result again.
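A sketch of the difference (names and the transform are hypothetical):

```scala
// Sketch: rdd1 is cached, rdd2 is not. Every action on rdd2 re-runs
// rdd2's map step, but reads rdd1's blocks from the cache instead of
// recomputing them from source.
val rdd1 = sc.textFile("input.txt").cache()
val rdd2 = rdd1.map(expensiveTransform)   // not cached

rdd2.count()  // computes rdd2, using cached rdd1
rdd2.count()  // computes rdd2 again; only rdd1 comes from the cache
```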
On Mon, Apr 14, 2014 at 5:32 AM, Joe L selme...@yahoo.com wrote:
Hi I am trying to cache 2Gbyte data and to
.)
Thanks,
Ian
On Mon, Apr 14, 2014 at 12:45 PM, Marcelo Vanzin van...@cloudera.com
wrote:
Hi Ian,
When you run your packaged application, are you adding its jar file to
the SparkContext (by calling the addJar() method)?
That will distribute the code to all the worker nodes
Hi Sung,
On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
while (true) {
  rdd.map((row: Array[Double]) => {
    row(numCols - 1) = computeSomething(row)
  }).reduce(...)
}
If it fails at some point, I'd imagine that the intermediate info being
stored in
Hi Sung,
On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
The goal is to keep an intermediate value per row in memory, which would
allow faster subsequent computations. I.e., computeSomething would depend on
the previous value from the previous computation.
I
Hi Joe,
On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote:
And, I haven't gotten any answers to my questions.
One thing that might explain that is that, at least for me, all (and I
mean *all*) of your messages are ending up in my GMail spam folder,
complaining that GMail can't
Hi Ken,
On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken
ken.willi...@windlogics.com wrote:
I haven't figured out how to let the hostname default to the host mentioned
in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do,
but that's not so important.
Try adding
Hi,
One thing you can do is set the spark version your project depends on
to 1.0.0-SNAPSHOT (make sure it matches the version of Spark you're
building); then before building your project, run sbt publishLocal
on the Spark tree.
On Wed, Apr 30, 2014 at 12:11 AM, wxhsdp wxh...@gmail.com wrote:
i
Have you tried making A extend Serializable?
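A sketch of what that suggestion looks like (class and field names are made up):

```scala
// Sketch: make the class used inside the closure serializable so Spark
// can ship its instances to the executors along with the task.
class A(val weight: Double) extends Serializable

val a = new A(0.5)
rdd.map(x => x * a.weight)   // 'a' can now be serialized with the closure
```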
On Thu, May 1, 2014 at 3:47 PM, SK skrishna...@gmail.com wrote:
Hi,
I have the following code structure. It compiles OK, but at runtime it aborts
with the error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task not
Hi Kristoffer,
You're correct that CDH5 only supports up to Java 7 at the moment. But
Yarn apps do not run in the same JVM as Yarn itself (and I believe MR1
doesn't either), so it might be possible to pass arguments in a way
that tells Yarn to launch the application master / executors with the
Is that true? I believe that API Chanwit is talking about requires
explicitly asking for files to be cached in HDFS.
Spark automatically benefits from the kernel's page cache (i.e. if
some block is in the kernel's page cache, it will be read more
quickly). But the explicit HDFS cache is a
the cache.
Ah, yeah, sure. What I meant is that Spark itself will not, AFAIK, use
that facility for adding files to the cache or anything like that. But
yes, it does benefit from things already cached.
On May 12, 2014, at 11:10 AM, Marcelo Vanzin van...@cloudera.com wrote:
Is that true? I believe
Hi Marcin,
On Wed, May 14, 2014 at 7:22 AM, Marcin Cylke
marcin.cy...@ext.allegro.pl wrote:
- This looks like some problem with HA - but I've checked the namenodes while
the job was running, and there
was no switch between the master and slave namenode.
14/05/14 15:25:44 ERROR
Hey Andrew,
Since we're seeing so many of these e-mails, I think it's worth
pointing out that it's not really obvious to find unsubscription
information for the lists.
The community link on the Spark site
(http://spark.apache.org/community.html) does not have instructions
for unsubscribing; it
On Tue, May 27, 2014 at 1:05 PM, Suman Somasundar
suman.somasun...@oracle.com wrote:
I am running this on a Solaris machine with logical partitions. All the
partitions (workers) access the same Spark folder.
Can you check whether you have multiple versions of the offending
class
Hi Sebastian,
That exception generally means you have the class loaded by two
different class loaders, and some code is trying to mix instances
created by the two different loaded classes.
Do you happen to have that class both in the spark jars and in your
app's uber-jar? That might explain the
Hi Rahul,
I'll just copy paste your question here to aid with context, and
reply afterwards.
-
Can I write the RDD data to an Excel file, along with the mapping, in
apache-spark? Is that a correct way? Won't the write be a local
function that can't be distributed over the cluster?
Below is
Hello there,
On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote:
workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()
data = sc.textFile("xyz.txt")
# xyz.txt is a file in which each line contains strings delimited by spaces
row=0
def
Hi Jamal,
If what you want is to process lots of files in parallel, the best
approach is probably to load all file names into an array and
parallelize that. Then each task will take a path as input and can
process it however it wants.
Or you could write the file list to a file, and then use
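The first approach could be sketched like this (the helper names are hypothetical):

```scala
// Sketch: parallelize the list of paths; each task then receives one
// path and processes that file however it wants.
// listAllPaths and processFile are hypothetical helpers.
val paths: Seq[String] = listAllPaths()
val results = sc.parallelize(paths, numSlices = 1000)
  .map(path => processFile(path))
```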
)
But instead of just dna.jpeg, let's say I have millions of such files and I
want to run the above logic on all of them.
How should I go about this?
Thanks
On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Jamal,
If what you want is to process lots of files in parallel
Ah, not that it should matter, but I'm on Linux and you seem to be on
Windows... maybe there is something weird going on with the Windows
launcher?
On Wed, Jun 11, 2014 at 10:34 AM, Marcelo Vanzin van...@cloudera.com wrote:
Just tried this and it worked fine for me:
./bin/spark-shell --jars
The error is saying that your client libraries are older than what
your server is using (2.0.0-mr1-cdh4.6.0 is IPC version 7).
Try double-checking that your build is actually using that version
(e.g., by looking at the hadoop jar files in lib_managed/jars).
On Wed, Jun 11, 2014 at 2:07 AM, bijoy
Coincidentally, I just ran into the same exception. What's probably
happening is that you're specifying some jar file in your job as an
absolute local path (e.g. just
/home/koert/test-assembly-0.1-SNAPSHOT.jar), but your Hadoop config
has the default FS set to HDFS.
So your driver does not know
Hi Koert,
Could you provide more details? Job arguments, log messages, errors, etc.
On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote:
i noticed that when i submit a job to yarn it mistakenly tries to upload
files to local filesystem instead of hdfs. what could cause this?
On Fri, Jun 20, 2014 at 8:22 AM, Koert Kuipers ko...@tresata.com wrote:
thanks! i will try that.
i guess what i am most confused about is why the executors are trying to
retrieve the jars directly using the info i provided to add jars to my spark
context. i mean, thats bound to fail no? i
object in Scala is similar to a class with only static fields /
methods in Java. So when you set its fields in the driver, the
object does not get serialized and sent to the executors; they have
their own copy of the class and its static fields, which haven't been
initialized.
Use a proper class,
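The difference can be seen with plain JVM serialization (a sketch; the class names are made up):

```scala
import java.io._

// An `object` is a per-JVM singleton: its mutable fields live only in the
// JVM that set them and are never serialized, so executors see their own
// uninitialized copy. An instance of a Serializable class, by contrast,
// carries its state along when serialized into a closure.
object Settings { var threshold = 0 }            // driver-side mutation stays on the driver
class Config(val threshold: Int) extends Serializable

val out = new ByteArrayOutputStream()
new ObjectOutputStream(out).writeObject(new Config(42))
val copy = new ObjectInputStream(
  new ByteArrayInputStream(out.toByteArray)).readObject().asInstanceOf[Config]
// copy.threshold is 42: the instance's state survived the round trip
```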
Someone might be able to correct me if I'm wrong, but I don't believe
standalone mode supports kerberos. You'd have to use Yarn for that.
On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote:
Hi all,
I encounter a strange issue when using spark 1.0 to access hdfs with
Kerberos
I
This is generally a side effect of your executor being killed. For
example, Yarn will do that if you're going over the requested memory
limits.
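If that is what is happening, the usual fix is to ask for more headroom, along these lines (the values are placeholders; the memoryOverhead setting is only available in newer Spark releases):

```shell
# Sketch: raise executor memory and, where supported, the off-heap
# overhead that Yarn accounts for when enforcing container limits.
spark-submit \
  --master yarn \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=768 \
  your-app.jar
```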
On Tue, Jul 8, 2014 at 12:17 PM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
HI,
I am getting this error. Can anyone help out to explain why is
want I can post
my code here.
Thanks
On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin van...@cloudera.com wrote:
This is generally a side effect of your executor being killed. For
example, Yarn will do that if you're going over the requested memory
limits.
On Tue, Jul 8, 2014 at 12:17 PM
suggest me how to increase the memory
limits or how to tackle this problem. I am a novice. If you want I can post
my code here.
Thanks
On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin van...@cloudera.com
wrote:
This is generally a side effect of your executor being killed. For
example
Sorry, that would be sc.stop() (not close).
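A sketch of the pattern:

```scala
// Sketch: stop the context in a finally block so Spark always gets a
// chance to clean up after itself, even if the job throws.
val sc = new SparkContext(conf)
try {
  // ... run the job ...
} finally {
  sc.stop()
}
```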
On Tue, Jul 8, 2014 at 1:31 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Rahul,
Can you try calling sc.close() at the end of your program, so Spark
can clean up after itself?
On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani
rahulbhojwani2
:
java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.&lt;init&gt;(Unknown Source)
at
org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62)
Can you help in that?
On Wed, Jul 9, 2014 at 2:07 AM, Marcelo Vanzin van...@cloudera.com wrote:
Sorry, that would be sc.stop
That output means you're running in yarn-cluster mode. So your code is
running inside the ApplicationMaster and has no access to the local
terminal.
If you want to see the output:
- try yarn-client mode, then your code will run inside the launcher process
- check the RM web ui and look at the
Have you looked at the slave machine to see if the process has
actually launched? If it has, have you tried peeking into its log
file?
(That error is printed whenever the executors fail to report back to
the driver. Insufficient resources to launch the executor is the most
common cause of that,
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr
mattcoarr.w...@gmail.com wrote:
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
what's causing this to break.
One interesting point that we just discovered is that if we run the driver
and the slave (worker) on the
Could you share some code (or pseudo-code)?
Sounds like you're instantiating the JDBC connection in the driver,
and using it inside a closure that would be run in a remote executor.
That means that the connection object would need to be serializable.
If that sounds like what you're doing, it
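The usual fix is to open the connection inside the task rather than on the driver; a sketch (jdbcUrl and insertRow are hypothetical):

```scala
// Sketch: create the JDBC connection per partition, on the executor, so
// no non-serializable Connection object is captured by the closure.
// Only the jdbcUrl String (serializable) crosses the wire.
rdd.foreachPartition { rows =>
  val conn = java.sql.DriverManager.getConnection(jdbcUrl)
  try {
    rows.foreach(row => insertRow(conn, row))
  } finally {
    conn.close()
  }
}
```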
at 1:21 PM, Marcelo Vanzin van...@cloudera.com wrote:
When I meant the executor log, I meant the log of the process launched
by the worker, not the worker. In my CDH-based Spark install, those
end up in /var/run/spark/work.
If you look at your worker log, you'll see it's launching the executor
sharath.abhis...@gmail.com wrote:
Hello Marcelo Vanzin,
Can you explain bit more on this? I tried using client mode but can you
explain how can i use this port to write the log or output to this
port?Thanks in advance!
You can upload your own log4j.properties using spark-submit's
--files argument.
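For example (paths are placeholders):

```shell
# Sketch: ship a custom log4j config to the driver/executor containers.
spark-submit --master yarn \
  --files /local/path/log4j.properties \
  your-app.jar
```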
On Tue, Jul 22, 2014 at 12:45 PM, abhiguruvayya
sharath.abhis...@gmail.com wrote:
I fixed the error with the yarn-client mode issue which i mentioned in my
earlier post. Now i want to edit the log4j.properties to
The spark log classes are based on the actual class names. So if you
want to filter out a package's logs you need to specify the full
package name (e.g. org.apache.spark.storage instead of just
spark.storage).
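For example, in log4j.properties (a sketch):

```properties
# Works: the full package name matches Spark's logger hierarchy
log4j.logger.org.apache.spark.storage=WARN
# Does not match any Spark logger:
# log4j.logger.spark.storage=WARN
```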
On Tue, Jul 22, 2014 at 2:07 PM, abhiguruvayya
sharath.abhis...@gmail.com wrote:
Discussions about how CDH packages Spark aside, you should be using
the spark-class script (assuming you're still in 0.9) instead of
executing Java directly. That will make sure that the environment
needed to run Spark apps is set up correctly.
CDH 5.1 ships with Spark 1.0.0, so it has
Hello,
Try something like this:
scala&gt; def newFoo[T]()(implicit ct: ClassTag[T]): T =
     |   ct.runtimeClass.newInstance().asInstanceOf[T]
newFoo: [T]()(implicit ct: scala.reflect.ClassTag[T])T
scala&gt; newFoo[String]()
res2: String = ""
scala&gt; newFoo[java.util.ArrayList[String]]()
res5:
There are two problems that might be happening:
- You're requesting more resources than the master has available, so
your executors are not starting. Given your explanation this doesn't
seem to be the case.
- The executors are starting, but are having problems connecting back
to the driver. In
Can you try with -Pyarn instead of -Pyarn-alpha?
I'm pretty sure CDH4 ships with the newer Yarn API.
On Thu, Aug 7, 2014 at 8:11 AM, linkpatrickliu linkpatrick...@live.com wrote:
Hi,
Following the document:
# Cloudera CDH 4.2.0
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests
that ~4.2 is enough
like YARN alpha, which is supported as a one-off as I understand, to
work.
All bets are off before YARN stable really, in my book.
On Thu, Aug 7, 2014 at 6:32 PM, Marcelo Vanzin van...@cloudera.com wrote:
Can you try with -Pyarn instead of -Pyarn-alpha?
I'm pretty sure CDH4
Could you share what's the cluster manager you're using and exactly
where the error shows up (driver or executor)?
A quick look reveals that Standalone and Yarn use different options to
control this, for example. (Maybe that already should be a bug.)
On Mon, Aug 11, 2014 at 12:24 PM, DNoteboom
You could create a copy of the variable inside your Parse class;
that way it would be serialized with the instance you create when
calling map() below.
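A sketch of that workaround (class and field names are made up):

```scala
// Sketch: give the (serializable) Parse class its own copy of the value
// at construction time, so it travels with each instance that is
// serialized into the map() closure.
class Parse(val lookup: Map[String, Int]) extends Serializable {
  def parse(line: String): Int = lookup.getOrElse(line, -1)
}

val parser = new Parse(driverSideLookup)  // copy made on the driver
rdd.map(line => parser.parse(line))
```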
On Tue, Aug 12, 2014 at 10:56 AM, Sunny Khatri sunny.k...@gmail.com wrote:
Are there any other workarounds that could be used to pass in the
Hi, sorry for the delay. Would you have yarn available to test? Given
the discussion in SPARK-2878, this might be a different incarnation of
the same underlying issue.
The option in Yarn is spark.yarn.user.classpath.first
On Mon, Aug 11, 2014 at 1:33 PM, DNoteboom dan...@wibidata.com wrote:
I'm
On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja aahuj...@gmail.com wrote:
/opt/cloudera/parcels/CDH/bin/spark-submit \
--master yarn \
--deploy-mode client \
This should be enough.
But when I view the job 4040 page, SparkUI, there is a single executor (just
the driver node) and I see
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote:
An “unaccepted” reply to this thread from Dean Chen suggested to build Spark
with a newer version of Hadoop (2.4.1) and this has worked to some extent.
I’m now able to submit jobs (omitting an explicit
Ah, sorry, forgot to talk about the second issue.
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote:
However, now the Spark jobs running in the ApplicationMaster on a given node
fails to find the active resourcemanager. Below is a log excerpt from one
of the assigned
Hi,
On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com wrote:
Specifying the driver-class-path yields behavior like
https://issues.apache.org/jira/browse/SPARK-2420 and
https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a
can of worms here if I also
My guess is that your test is trying to serialize a closure
referencing connectionInfo; that closure will have a reference to
the test instance, since the instance is needed to execute that
method.
Try to make the connectionInfo method local to the method where it's
needed, or declare it in an
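A sketch of the first suggestion (names are hypothetical):

```scala
// Sketch: copy the value into a local val so the closure captures only
// that value, not the enclosing (non-serializable) test instance.
def runJob(rdd: org.apache.spark.rdd.RDD[String]): Long = {
  val info = connectionInfo   // local copy; `this` is no longer referenced
  rdd.filter(row => row.startsWith(info)).count()
}
```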
That command line you mention in your e-mail doesn't look like
something started by Spark. Spark would start one of
ApplicationMaster, ExecutableRunner or CoarseGrainedSchedulerBackend,
not org.apache.hadoop.mapred.YarnChild.
On Wed, Aug 20, 2014 at 6:56 PM, centerqi hu cente...@gmail.com wrote:
Hi Du,
I don't believe the Guava change has made it to the 1.1 branch. The
Guava doc says hashInt was added in 12.0, so what's probably
happening is that you have and old version of Guava in your classpath
before the Spark jars. (Hadoop ships with Guava 11, so that may be the
source of your
The history server (and other Spark daemons) do not read
spark-defaults.conf. There's a bug open to implement that
(SPARK-2098), and an open PR to fix it, but it's still not in Spark.
On Wed, Sep 3, 2014 at 11:00 AM, Zhanfeng Huo huozhanf...@gmail.com wrote:
Hi,
I have set properties in
local means everything runs in the same process; that means there is
no need for master and worker daemons to start processes.
On Wed, Sep 3, 2014 at 3:12 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
Hello,
If launched with “local” as master, where are master
The only monitoring available is the driver's Web UI, which will
generally be available on port 4040.
On Wed, Sep 3, 2014 at 3:43 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
How can that single process be monitored? Thanks!
-Original Message-
From: Marcelo
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu dav...@databricks.com wrote:
In daily development, it's common to modify your projects and re-run
the jobs. If using zip or egg to package your code, you need to do
this every time after modification, I think it will be boring.
That's why shell
Hi Davies,
On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu dav...@databricks.com wrote:
In Douban, we use Moose FS[1] instead of HDFS as the distributed file system,
it's POSIX compatible and can be mounted just as NFS.
Sure, if you already have the infrastructure in place, it might be
worthwhile
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
user$ pyspark [some-options] --driver-java-options
spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar
This command line does not look correct. spark.yarn.jar is not a JVM
command line option.
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads.
user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar='
/usr/lib/spark/assembly/lib/spark-assembly-*.jar'
My question is,
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
So just to clarify for me: When specifying 'spark.yarn.jar' as I did
above, even if I don't use HDFS to create a
RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is
still necessary to
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC.
subscripti...@didata.us wrote:
You're probably right about the above because, as seen *below* for
pyspark (but probably for other Spark
applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is
specified, the app invocation
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also
explained in the link to the docs I posted earlier.)
The second interpretation below is local: URLs in Spark, but that doesn't
work with Yarn on Spark 1.0 (so it won't work with CDH 5.1 and older
either).
On Mon, Sep 8,
This has all the symptoms of Yarn killing your executors due to them
exceeding their memory limits. Could you check your RM/NM logs to see
if that's the case?
(The error was because of an executor at
domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's
log file.)
If that's the
Hi,
Yes, this is a problem, and I'm not aware of any simple workarounds
(or complex one for that matter). There are people working to fix
this, you can follow progress here:
https://issues.apache.org/jira/browse/SPARK-1239
On Tue, Sep 9, 2014 at 2:54 PM, jbeynon jbey...@gmail.com wrote:
I'm
You're using hadoopConf, a Configuration object, in your closure.
That type is not serializable.
You can use -Dsun.io.serialization.extendedDebugInfo=true to debug
serialization issues.
On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Thanks Sean.
On Mon, Sep 8, 2014 at 11:15 PM, Sean Owen so...@cloudera.com wrote:
This structure is not specific to Hadoop, but in theory works in any
JAR file. You can put JARs in JARs and refer to them with Class-Path
entries in META-INF/MANIFEST.MF.
Funny that you mention that, since someone internally
On Wed, Sep 10, 2014 at 3:44 PM, Sean Owen so...@cloudera.com wrote:
What's the Hadoop jar structure in question then? Is it something special
like a WAR file? I confess I had never heard of this so thought this was
about generic JAR stuff.
What I've been told (and Steve's e-mail alludes to)
Hi chinchu,
Where does the code trying to read the file run? Is it running on the
driver or on some executor?
If it's running on the driver, in yarn-cluster mode, the file should
have been copied to the application's work directory before the driver
is started. So hopefully just doing new
You'll need to look at the driver output to have a better idea of
what's going on. You can use yarn logs --applicationId blah after
your app is finished (e.g. by killing it) to look at it.
My guess is that your cluster doesn't have enough resources available
to service the container request
:37 PM, Marcelo Vanzin van...@cloudera.com
wrote:
You'll need to look at the driver output to have a better idea of
what's going on. You can use yarn logs --applicationId blah after
your app is finished (e.g. by killing it) to look at it.
My guess is that your cluster doesn't have enough
, Sep 25, 2014 at 12:04 AM, Marcelo Vanzin van...@cloudera.com
wrote:
You need to use the command line yarn application that I mentioned
(yarn logs). You can't look at the logs through the UI after the app
stops.
On Wed, Sep 24, 2014 at 11:16 AM, Raghuveer Chanda
raghuveer.cha...@gmail.com wrote
Sounds like spark-01 is not resolving correctly on your machine (or
is the wrong address). Can you ping spark-01 and does that reach the
VM where you set up the Spark Master?
On Wed, Sep 24, 2014 at 1:12 PM, danilopds danilob...@gmail.com wrote:
Hello,
I'm learning about Spark Streaming and I'm
Hmmm, you might be suffering from SPARK-1719.
Not sure what the proper workaround is, but it sounds like your native
libs are not in any of the standard lib directories; one workaround
might be to copy them there, or add their location to /etc/ld.so.conf
(I'm assuming Linux).
On Thu, Sep 25,
Then I think it's time for you to look at the Spark Master logs...
On Thu, Sep 25, 2014 at 7:51 AM, danilopds danilob...@gmail.com wrote:
Hi Marcelo,
Yes, I can ping spark-01 and I also include the IP and host in my file
/etc/hosts.
My VM can ping the local machine too.
On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote:
I am running spark with the default settings in yarn client mode. For some
reason yarn always allocates three containers to the application (wondering
where it is set?), and only uses two of them.
The default number of
You can pass the HDFS location of those extra jars in the spark-submit
--jars argument. Spark will take care of using Yarn's distributed
cache to make them available to the executors. Note that you may need
to provide the full hdfs URL (not just the path, since that will be
interpreted as a local
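For example (host, port and jar names are placeholders):

```shell
# Sketch: use full hdfs:// URLs so the paths are not taken as local files;
# Spark then distributes the jars via Yarn's distributed cache.
spark-submit --master yarn \
  --jars hdfs://namenode:8020/libs/extra-1.jar,hdfs://namenode:8020/libs/extra-2.jar \
  your-app.jar
```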
Comma separated list of archives to be
extracted into the
working directory of each executor.
On Thu, Sep 25, 2014 at 2:20 PM, Tamas Jambor jambo...@gmail.com wrote:
Thank you.
Where is the number of containers set?
On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin van
I assume you did those things in all machines, not just on the machine
launching the job?
I've seen that workaround used successfully (well, actually, they
copied the library to /usr/lib or something, but same idea).
On Thu, Sep 25, 2014 at 7:45 PM, taqilabon g945...@gmail.com wrote:
You're
You can't set up the driver memory programmatically in client mode. In
that mode, the same JVM is running the driver, so you can't modify
command line options anymore when initializing the SparkContext.
(And you can't really start cluster mode apps that way, so the only
way to set this is through
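A sketch of the command-line route (the value is a placeholder):

```shell
# Sketch: driver memory must be set before the driver JVM starts, i.e. on
# the spark-submit command line (or in spark-defaults.conf), not via the
# SparkConf inside an already-running driver.
spark-submit --driver-memory 4g your-app.jar
```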
in a
few different contexts, but I don't think there's an official
solution yet.)
On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor jambo...@gmail.com wrote:
thanks Marcelo.
What's the reason it is not possible in cluster mode, either?
On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com
No, you can't instantiate a SparkContext to start apps in cluster mode.
For Yarn, for example, you'd have to call directly into
org.apache.spark.deploy.yarn.Client; that class will tell the Yarn
cluster to launch the driver for you and then instantiate the
SparkContext.
On Wed, Oct 1, 2014 at
You may want to take a look at this PR:
https://github.com/apache/spark/pull/1558
Long story short: while not a terrible idea to show running
applications, your particular case should be solved differently.
Applications are responsible for calling SparkContext.stop() at the
end of their run,
Hi Anurag,
Spark SQL (from the Spark standard distribution / sources) currently
requires Hive 0.12; as you mention, CDH4 has Hive 0.10, so that's not
gonna work.
CDH 5.2 ships with Spark 1.1.0 and is modified so that Spark SQL can
talk to the Hive 0.13.1 that is also bundled with CDH, so if
Hi Greg,
I'm not sure exactly what it is that you're trying to achieve, but I'm
pretty sure those variables are not supposed to be set by users. You
should take a look at the documentation for
spark.driver.extraClassPath and spark.driver.extraLibraryPath, and
the equivalent options for executors.
Hi Eric,
Check the Debugging Your Application section at:
http://spark.apache.org/docs/latest/running-on-yarn.html
Long story short: upload your log4j.properties using the --files
argument of spark-submit.
(Mental note: we could make the log level configurable via a system property...)
On
Hi Philip,
The assemblies are part of the CDH distribution. You can get them here:
http://www.cloudera.com/content/cloudera/en/downloads/cdh/cdh-5-2-0.html
As of Spark 1.1 (and, thus, CDH 5.2), assemblies are not published to
maven repositories anymore (you can see commit [1] for details).
[1]
On top of what Andrew said, you shouldn't need to manually add the
mllib jar to your jobs; it's already included in the Spark assembly
jar.
On Thu, Oct 16, 2014 at 11:51 PM, eric wong win19...@gmail.com wrote:
Hi,
I am using the comma-separated style to submit multiple jar files in the
follow
Hi Ashwin,
Let me try to answer to the best of my knowledge.
On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar
ashwinshanka...@gmail.com wrote:
Here are my questions :
1. Sharing spark context: How exactly can multiple users share the cluster
using the same spark
context?
That's not
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
ashwinshanka...@gmail.com wrote:
That's not something you might want to do usually. In general, a
SparkContext maps to a user application
My question was basically this. In this page in the official doc, under
Scheduling within an application
resource or 2)
add dynamic resource management for Yarn mode is very much wanted.
Jianshi
On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin van...@cloudera.com wrote:
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
ashwinshanka...@gmail.com wrote:
That's not something you might want to do usually
Hello there,
This is more of a question for the cdh-users list, but in any case...
In CDH 5.1 we skipped packaging of the Hive module in SparkSQL. That
has been fixed in CDH 5.2, so if it's possible for you I'd recommend
upgrading.
On Thu, Oct 23, 2014 at 2:53 PM, nitinkak001
On Thu, Oct 23, 2014 at 3:40 PM, ankits ankitso...@gmail.com wrote:
2014-10-23 15:39:50,845 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1)
java.io.IOException: org.apache.thrift.protocol.TProtocolException:
This looks like an exception that's happening on an executor and just
being
Actually, if you don't call SparkContext.stop(), the event log
information that is used by the history server will be incomplete, and
your application will never show up in the history server's UI.
If you don't use that functionality, then you're probably ok not
calling it as long as your
On Mon, Oct 27, 2014 at 7:37 PM, buring qyqb...@gmail.com wrote:
Here is error log,I abstract as follows:
INFO [binaryTest---main]: before first
WARN [org.apache.spark.scheduler.TaskSetManager---Result resolver
thread-0]: Lost task 0.0 in stage 0.0 (TID 0, spark-dev136):
Hello,
CDH 5.1.3 ships with a version of Hive that's not entirely the same as
the Hive Spark 1.1 supports. So when building your custom Spark, you
should make sure you change all the dependency versions to point to
the CDH versions.
IIRC Spark depends on org.spark-project.hive:0.12.0, you'd have
I haven't tried scala:cc, but you can ask maven to just build a
particular sub-project. For example:
mvn -pl :spark-examples_2.10 compile
On Sat, Nov 15, 2014 at 5:31 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
Hi,
I have already successfully compile and run spark examples. My problem