I assume you did those things on all machines, not just on the machine
launching the job?
I've seen that workaround used successfully (well, actually, they
copied the library to /usr/lib or something, but same idea).
On Thu, Sep 25, 2014 at 7:45 PM, taqilabon g945...@gmail.com wrote:
You're
Hmmm, you might be suffering from SPARK-1719.
Not sure what the proper workaround is, but it sounds like your native
libs are not in any of the standard lib directories; one workaround
might be to copy them there, or add their location to /etc/ld.so.conf
(I'm assuming Linux).
On Thu, Sep 25,
Then I think it's time for you to look at the Spark Master logs...
On Thu, Sep 25, 2014 at 7:51 AM, danilopds danilob...@gmail.com wrote:
Hi Marcelo,
Yes, I can ping spark-01, and I also included its IP and hostname in my
/etc/hosts file.
My VM can ping the local machine too.
On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote:
I am running spark with the default settings in yarn client mode. For some
reason yarn always allocates three containers to the application (wondering
where it is set?), and only uses two of them.
The default number of
You can pass the HDFS location of those extra jars in the spark-submit
--jars argument. Spark will take care of using Yarn's distributed
cache to make them available to the executors. Note that you may need
to provide the full hdfs URL (not just the path, since that will be
interpreted as a local
Comma separated list of archives to be
extracted into the
working directory of each executor.
On Thu, Sep 25, 2014 at 2:20 PM, Tamas Jambor jambo...@gmail.com wrote:
Thank you.
Where is the number of containers set?
On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin van
You'll need to look at the driver output to have a better idea of
what's going on. You can use yarn logs --applicationId blah after
your app is finished (e.g. by killing it) to look at it.
My guess is that your cluster doesn't have enough resources available
to service the container request
, Sep 25, 2014 at 12:04 AM, Marcelo Vanzin van...@cloudera.com
wrote:
You need to use the command line yarn application that I mentioned
(yarn logs). You can't look at the logs through the UI after the app
stops.
On Wed, Sep 24, 2014 at 11:16 AM, Raghuveer Chanda
raghuveer.cha...@gmail.com wrote
Sounds like spark-01 is not resolving correctly on your machine (or
is the wrong address). Can you ping spark-01 and does that reach the
VM where you set up the Spark Master?
On Wed, Sep 24, 2014 at 1:12 PM, danilopds danilob...@gmail.com wrote:
Hello,
I'm learning about Spark Streaming and I'm
Hi chinchu,
Where does the code trying to read the file run? Is it running on the
driver or on some executor?
If it's running on the driver, in yarn-cluster mode, the file should
have been copied to the application's work directory before the driver
is started. So hopefully just doing new
You're using hadoopConf, a Configuration object, in your closure.
That type is not serializable.
You can use -Dsun.io.serialization.extendedDebugInfo=true to debug
serialization issues.
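One common way around that, sketched below under the assumption that the executors only need a couple of plain values from the configuration (the config key and processRecord below are made-up names, not from the original message):

  // Extract the serializable values you need from hadoopConf on the driver,
  // before the closure is created.
  val checksumType = hadoopConf.get("dfs.checksum.type")  // a plain String

  val result = rdd.map { record =>
    // The closure now captures only the String, not the Configuration object.
    processRecord(record, checksumType)
  }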
On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Thanks Sean.
On Mon, Sep 8, 2014 at 11:15 PM, Sean Owen so...@cloudera.com wrote:
This structure is not specific to Hadoop, but in theory works in any
JAR file. You can put JARs in JARs and refer to them with Class-Path
entries in META-INF/MANIFEST.MF.
Funny that you mention that, since someone internally
On Wed, Sep 10, 2014 at 3:44 PM, Sean Owen so...@cloudera.com wrote:
What's the Hadoop jar structure in question then? Is it something special
like a WAR file? I confess I had never heard of this so thought this was
about generic JAR stuff.
What I've been told (and Steve's e-mail alludes to)
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also
explained in the link to the docs I posted earlier.)
The second interpretation below corresponds to local: URLs in Spark, but those
don't work on Yarn in Spark 1.0 (so they won't work with CDH 5.1 and older
either).
On Mon, Sep 8,
This has all the symptoms of Yarn killing your executors due to them
exceeding their memory limits. Could you check your RM/NM logs to see
if that's the case?
(The error was because of an executor at
domU-12-31-39-0B-F1-D1.compute-1.internal, so you can check that NM's
log file.)
If that's the
Hi,
Yes, this is a problem, and I'm not aware of any simple workarounds
(or complex one for that matter). There are people working to fix
this, you can follow progress here:
https://issues.apache.org/jira/browse/SPARK-1239
On Tue, Sep 9, 2014 at 2:54 PM, jbeynon jbey...@gmail.com wrote:
I'm
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
user$ pyspark [some-options] --driver-java-options
spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar
This command line does not look correct. spark.yarn.jar is not a JVM
command line option.
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads.
user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar='
/usr/lib/spark/assembly/lib/spark-assembly-*.jar'
My question is,
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
So just to clarify for me: When specifying 'spark.yarn.jar' as I did
above, even if I don't use HDFS to create a
RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is
still necessary to
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC.
subscripti...@didata.us wrote:
You're probably right about the above because, as seen *below* for
pyspark (but probably for other Spark
applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is
specified, the app invocation
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu dav...@databricks.com wrote:
In daily development, it's common to modify your projects and re-run
the jobs. If you use zip or egg files to package your code, you need to
repackage after every modification, which I think is tedious.
That's why shell
Hi Davies,
On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu dav...@databricks.com wrote:
In Douban, we use Moose FS[1] instead of HDFS as the distributed file system,
it's POSIX compatible and can be mounted just as NFS.
Sure, if you already have the infrastructure in place, it might be
worthwhile
The history server (and other Spark daemons) do not read
spark-defaults.conf. There's a bug open to implement that
(SPARK-2098), and an open PR to fix it, but it's still not in Spark.
On Wed, Sep 3, 2014 at 11:00 AM, Zhanfeng Huo huozhanf...@gmail.com wrote:
Hi,
I have set properties in
local means everything runs in the same process; that means there is
no need for master and worker daemons to start processes.
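For reference, a minimal sketch of local mode (the app name and thread count are arbitrary):

  import org.apache.spark.{SparkConf, SparkContext}

  // "local[2]" runs the scheduler and two executor threads inside this single JVM;
  // no standalone Master or Worker daemons are started or needed.
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("local-demo"))
  println(sc.parallelize(1 to 100).sum())
  sc.stop()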
On Wed, Sep 3, 2014 at 3:12 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
Hello,
If launched with “local” as master, where are master
The only monitoring available is the driver's Web UI, which will
generally be available on port 4040.
On Wed, Sep 3, 2014 at 3:43 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
How can that single process be monitored? Thanks!
Hi Du,
I don't believe the Guava change has made it to the 1.1 branch. The
Guava doc says hashInt was added in 12.0, so what's probably
happening is that you have an old version of Guava in your classpath
before the Spark jars. (Hadoop ships with Guava 11, so that may be the
source of your
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote:
An “unaccepted” reply to this thread from Dean Chen suggested to build Spark
with a newer version of Hadoop (2.4.1) and this has worked to some extent.
I’m now able to submit jobs (omitting an explicit
Ah, sorry, forgot to talk about the second issue.
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote:
However, now the Spark jobs running in the ApplicationMaster on a given node
fails to find the active resourcemanager. Below is a log excerpt from one
of the assigned
Hi,
On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com wrote:
Specifying the driver-class-path yields behavior like
https://issues.apache.org/jira/browse/SPARK-2420 and
https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a
can of worms here if I also
My guess is that your test is trying to serialize a closure
referencing connectionInfo; that closure will have a reference to
the test instance, since the instance is needed to execute that
method.
Try to make the connectionInfo method local to the method where it's
needed, or declare it in an
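A minimal sketch of the local-val idea (all names here are made up; connectionInfo stands in for whatever method on the test class the closure currently references):

  import org.apache.spark.SparkContext

  def connectionInfo: String = "jdbc:example"   // stand-in for the test's real method

  def countRows(sc: SparkContext): Long = {
    val info = connectionInfo   // copy into a local val first...
    sc.parallelize(1 to 10).map { i =>
      // ...so the closure captures only this String, not the whole test instance
      s"$info-$i"
    }.count()
  }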
That command line you mention in your e-mail doesn't look like
something started by Spark. Spark would start one of
ApplicationMaster, ExecutorRunner or CoarseGrainedSchedulerBackend,
not org.apache.hadoop.mapred.YarnChild.
On Wed, Aug 20, 2014 at 6:56 PM, centerqi hu cente...@gmail.com wrote:
On Tue, Aug 19, 2014 at 2:34 PM, Arun Ahuja aahuj...@gmail.com wrote:
/opt/cloudera/parcels/CDH/bin/spark-submit \
--master yarn \
--deploy-mode client \
This should be enough.
But when I view the job's Spark UI page (port 4040), there is a single executor
(just the driver node) and I see
You could create a copy of the variable inside your Parse class;
that way it would be serialized with the instance you create when
calling map() below.
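A rough sketch of that idea (the field, pattern and RDD names are invented; lines is assumed to be an RDD[String]):

  // The value the tasks need is copied into a field of Parse, so it is
  // serialized together with each Parse instance shipped to the executors.
  class Parse(pattern: String) extends Serializable {
    def apply(line: String): Boolean = line.contains(pattern)
  }

  val parser = new Parse("some-pattern")          // created on the driver
  val matched = lines.map(line => parser(line))   // parser travels with the task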
On Tue, Aug 12, 2014 at 10:56 AM, Sunny Khatri sunny.k...@gmail.com wrote:
Are there any other workarounds that could be used to pass in the
Hi, sorry for the delay. Would you have yarn available to test? Given
the discussion in SPARK-2878, this might be a different incarnation of
the same underlying issue.
The option in Yarn is spark.yarn.user.classpath.first
On Mon, Aug 11, 2014 at 1:33 PM, DNoteboom dan...@wibidata.com wrote:
I'm
Could you share what's the cluster manager you're using and exactly
where the error shows up (driver or executor)?
A quick look reveals that Standalone and Yarn use different options to
control this, for example. (Maybe that in itself should be filed as a bug.)
On Mon, Aug 11, 2014 at 12:24 PM, DNoteboom
There are two problems that might be happening:
- You're requesting more resources than the master has available, so
your executors are not starting. Given your explanation this doesn't
seem to be the case.
- The executors are starting, but are having problems connecting back
to the driver. In
Can you try with -Pyarn instead of -Pyarn-alpha?
I'm pretty sure CDH4 ships with the newer Yarn API.
On Thu, Aug 7, 2014 at 8:11 AM, linkpatrickliu linkpatrick...@live.com wrote:
Hi,
Following the document:
# Cloudera CDH 4.2.0
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests
that ~4.2 is enough
like YARN alpha, which is supported as a one-off as I understand, to
work.
All bets are off before YARN stable really, in my book.
On Thu, Aug 7, 2014 at 6:32 PM, Marcelo Vanzin van...@cloudera.com wrote:
Can you try with -Pyarn instead of -Pyarn-alpha?
I'm pretty sure CDH4
Hello,
Try something like this:
scala> def newFoo[T]()(implicit ct: ClassTag[T]): T = ct.runtimeClass.newInstance().asInstanceOf[T]
newFoo: [T]()(implicit ct: scala.reflect.ClassTag[T])T
scala> newFoo[String]()
res2: String =
scala> newFoo[java.util.ArrayList[String]]()
res5:
Discussions about how CDH packages Spark aside, you should be using
the spark-class script (assuming you're still in 0.9) instead of
executing Java directly. That will make sure that the environment
needed to run Spark apps is set up correctly.
CDH 5.1 ships with Spark 1.0.0, so it has
sharath.abhis...@gmail.com wrote:
Hello Marcelo Vanzin,
Can you explain a bit more on this? I tried using client mode, but can you
explain how I can use this port to write the log or output to this
port? Thanks in advance!
You can upload your own log4j.properties using spark-submit's
--files argument.
On Tue, Jul 22, 2014 at 12:45 PM, abhiguruvayya
sharath.abhis...@gmail.com wrote:
I fixed the error with the yarn-client mode issue which i mentioned in my
earlier post. Now i want to edit the log4j.properties to
The Spark logger names are based on the actual class names. So if you
want to filter out a package's logs, you need to specify the full
package name (e.g. org.apache.spark.storage instead of just
spark.storage).
On Tue, Jul 22, 2014 at 2:07 PM, abhiguruvayya
sharath.abhis...@gmail.com wrote:
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr
mattcoarr.w...@gmail.com wrote:
Thanks Marcelo, I'm not seeing anything in the logs that clearly explains
what's causing this to break.
One interesting point that we just discovered is that if we run the driver
and the slave (worker) on the
Could you share some code (or pseudo-code)?
Sounds like you're instantiating the JDBC connection in the driver,
and using it inside a closure that would be run in a remote executor.
That means that the connection object would need to be serializable.
If that sounds like what you're doing, it
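The usual alternative in that situation is to create the connection on the executor, inside the task; a rough sketch (jdbcUrl and insertRecord are placeholders, not from the original message):

  rdd.foreachPartition { records =>
    // The connection is created on the executor, so it never has to be serialized.
    val conn = java.sql.DriverManager.getConnection(jdbcUrl)
    try {
      records.foreach(record => insertRecord(conn, record))
    } finally {
      conn.close()
    }
  }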
at 1:21 PM, Marcelo Vanzin van...@cloudera.com wrote:
When I said the executor log, I meant the log of the process launched
by the worker, not the worker. In my CDH-based Spark install, those
end up in /var/run/spark/work.
If you look at your worker log, you'll see it's launching the executor
Have you looked at the slave machine to see if the process has
actually launched? If it has, have you tried peeking into its log
file?
(That error is printed whenever the executors fail to report back to
the driver. Insufficient resources to launch the executor is the most
common cause of that,
That output means you're running in yarn-cluster mode. So your code is
running inside the ApplicationMaster and has no access to the local
terminal.
If you want to see the output:
- try yarn-client mode, then your code will run inside the launcher process
- check the RM web ui and look at the
Someone might be able to correct me if I'm wrong, but I don't believe
standalone mode supports kerberos. You'd have to use Yarn for that.
On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote:
Hi all,
I encounter a strange issue when using spark 1.0 to access hdfs with
Kerberos
I
This is generally a side effect of your executor being killed. For
example, Yarn will do that if you're going over the requested memory
limits.
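If that turns out to be the case, one knob to try is the executor memory request; a sketch only, since the right value depends entirely on your job and cluster:

  import org.apache.spark.SparkConf

  // Ask for bigger executor containers so the tasks have more headroom.
  val conf = new SparkConf()
    .setAppName("my-app")
    .set("spark.executor.memory", "4g")   // example value, not a recommendation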
On Tue, Jul 8, 2014 at 12:17 PM, Rahul Bhojwani
rahulbhojwani2...@gmail.com wrote:
HI,
I am getting this error. Can anyone help out to explain why is
want I can post
my code here.
Thanks
suggest me how to increase the memory
limits or how to tackle this problem. I am a novice. If you want I can post
my code here.
Thanks
Sorry, that would be sc.stop() (not close).
On Tue, Jul 8, 2014 at 1:31 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Rahul,
Can you try calling sc.close() at the end of your program, so Spark
can clean up after itself?
On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani
rahulbhojwani2
:
java.lang.OutOfMemoryError: Java heap space
    at java.io.BufferedOutputStream.<init>(Unknown Source)
    at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62)
Can you help in that?
On Wed, Jul 9, 2014 at 2:07 AM, Marcelo Vanzin van...@cloudera.com wrote:
Sorry, that would be sc.stop
object in Scala is similar to a class with only static fields /
methods in Java. So when you set its fields in the driver, the
object does not get serialized and sent to the executors; they have
their own copy of the class and its static fields, which haven't been
initialized.
Use a proper class,
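A small sketch of the difference (the names are made up and rdd is assumed to be an RDD[Double]):

  // `object`: one copy of the fields per JVM; assignments made on the driver
  // are never shipped to the executors.
  object Settings {
    var threshold = 0.0
  }

  // A plain serializable class: the instance (and its field) travels with the closure.
  class Config(val threshold: Double) extends Serializable

  Settings.threshold = 0.5
  val cfg = new Config(0.5)

  rdd.filter(x => x > Settings.threshold)  // executors still see the default 0.0
  rdd.filter(x => x > cfg.threshold)       // works: cfg is serialized into each task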
Hi Koert,
Could you provide more details? Job arguments, log messages, errors, etc.
On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote:
i noticed that when i submit a job to yarn it mistakenly tries to upload
files to local filesystem instead of hdfs. what could cause this?
On Fri, Jun 20, 2014 at 8:22 AM, Koert Kuipers ko...@tresata.com wrote:
thanks! i will try that.
i guess what i am most confused about is why the executors are trying to
retrieve the jars directly using the info i provided to add jars to my spark
context. i mean, thats bound to fail no? i
Coincidentally, I just ran into the same exception. What's probably
happening is that you're specifying some jar file in your job as an
absolute local path (e.g. just
/home/koert/test-assembly-0.1-SNAPSHOT.jar), but your Hadoop config
has the default FS set to HDFS.
So your driver does not know
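If the missing scheme is indeed the problem, one possible fix is to spell it out explicitly; a guess sketched below (the hdfs path in the comment is made up):

  // Make the URI scheme explicit so the path is not resolved against the
  // default (HDFS) filesystem.
  sc.addJar("file:///home/koert/test-assembly-0.1-SNAPSHOT.jar")
  // ...or put the jar on HDFS and point at it there, e.g.
  // sc.addJar("hdfs:///user/koert/test-assembly-0.1-SNAPSHOT.jar")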
Ah, not that it should matter, but I'm on Linux and you seem to be on
Windows... maybe there is something weird going on with the Windows
launcher?
On Wed, Jun 11, 2014 at 10:34 AM, Marcelo Vanzin van...@cloudera.com wrote:
Just tried this and it worked fine for me:
./bin/spark-shell --jars
The error is saying that your client libraries are older than what
your server is using (2.0.0-mr1-cdh4.6.0 is IPC version 7).
Try double-checking that your build is actually using that version
(e.g., by looking at the hadoop jar files in lib_managed/jars).
On Wed, Jun 11, 2014 at 2:07 AM, bijoy
Hi Jamal,
If what you want is to process lots of files in parallel, the best
approach is probably to load all file names into an array and
parallelize that. Then each task will take a path as input and can
process it however it wants.
Or you could write the file list to a file, and then use
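A minimal sketch of the first approach, parallelizing the list of paths (fileNames and processFile are placeholders for your own list and per-file logic):

  def processFile(path: String): Long = 0L   // stand-in for the real per-file work

  val fileNames: Seq[String] = Seq("hdfs:///data/dna-000001.jpeg", "hdfs:///data/dna-000002.jpeg")
  val results = sc.parallelize(fileNames, fileNames.size)  // one task per path
    .map(path => processFile(path))
    .collect()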
)
But instead of just dna.jpeg, let's say I have millions of such .jpeg files and
I want to run the above logic on all of those millions of files.
How should I go about this?
Thanks
On Mon, Jun 2, 2014 at 5:09 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Jamal,
If what you want is to process lots of files in parallel
Hi Rahul,
I'll just copy paste your question here to aid with context, and
reply afterwards.
-
Can I write the RDD data to an Excel file, along with the mapping, in
apache-spark? Is that the correct way? Isn't it the case that writing would be
a local function that can't be distributed over the cluster?
Below is
Hello there,
On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote:
workbook = xlsxwriter.Workbook('output_excel.xlsx')
worksheet = workbook.add_worksheet()
data = sc.textFile("xyz.txt")
# xyz.txt is a file where each line contains strings delimited by spaces
row=0
def
Hi Sebastian,
That exception generally means you have the class loaded by two
different class loaders, and some code is trying to mix instances
created by the two different loaded classes.
Do you happen to have that class both in the spark jars and in your
app's uber-jar? That might explain the
On Tue, May 27, 2014 at 1:05 PM, Suman Somasundar
suman.somasun...@oracle.com wrote:
I am running this on a Solaris machine with logical partitions. All the
partitions (workers) access the same Spark folder.
Can you check whether you have multiple versions of the offending
class
Hey Andrew,
Since we're seeing so many of these e-mails, I think it's worth
pointing out that it's not really obvious to find unsubscription
information for the lists.
The community link on the Spark site
(http://spark.apache.org/community.html) does not have instructions
for unsubscribing; it
Hi Marcin,
On Wed, May 14, 2014 at 7:22 AM, Marcin Cylke
marcin.cy...@ext.allegro.pl wrote:
- This looks like some problem with HA - but I checked the namenodes while
the job was running, and there
was no switch between the master and slave namenode.
14/05/14 15:25:44 ERROR
the cache.
Ah, yeah, sure. What I meant is that Spark itself will not, AFAIK, use
that facility for adding files to the cache or anything like that. But
yes, it does benefit from things already cached.
On May 12, 2014, at 11:10 AM, Marcelo Vanzin van...@cloudera.com wrote:
Is that true? I believe
Is that true? I believe that API Chanwit is talking about requires
explicitly asking for files to be cached in HDFS.
Spark automatically benefits from the kernel's page cache (i.e. if
some block is in the kernel's page cache, it will be read more
quickly). But the explicit HDFS cache is a
Hi Kristoffer,
You're correct that CDH5 only supports up to Java 7 at the moment. But
Yarn apps do not run in the same JVM as Yarn itself (and I believe MR1
doesn't either), so it might be possible to pass arguments in a way
that tells Yarn to launch the application master / executors with the
Have you tried making A extend Serializable?
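The real class A is in the quoted code below; this only sketches the shape of the change (the field and method are invented):

  // If A's instances are captured by RDD closures, marking it Serializable
  // lets Spark ship them to the executors.
  class A(val name: String) extends Serializable {
    def matches(line: String): Boolean = line.contains(name)
  }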
On Thu, May 1, 2014 at 3:47 PM, SK skrishna...@gmail.com wrote:
Hi,
I have the following code structure. It compiles ok, but at runtime it aborts
with the error:
Exception in thread main org.apache.spark.SparkException: Job aborted:
Task not
Hi,
One thing you can do is set the spark version your project depends on
to 1.0.0-SNAPSHOT (make sure it matches the version of Spark you're
building); then before building your project, run sbt publishLocal
on the Spark tree.
On Wed, Apr 30, 2014 at 12:11 AM, wxhsdp wxh...@gmail.com wrote:
i
Hi Sung,
On Mon, Apr 21, 2014 at 10:52 AM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
The goal is to keep an intermediate value per row in memory, which would
allow faster subsequent computations. I.e., computeSomething would depend on
the previous value from the previous computation.
I
Hi Joe,
On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote:
And, I haven't gotten any answers to my questions.
One thing that might explain that is that, at least for me, all (and I
mean *all*) of your messages are ending up in my GMail spam folder,
complaining that GMail can't
Hi Ken,
On Mon, Apr 21, 2014 at 1:39 PM, Williams, Ken
ken.willi...@windlogics.com wrote:
I haven't figured out how to let the hostname default to the host mentioned
in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do,
but that's not so important.
Try adding
Hi Sung,
On Fri, Apr 18, 2014 at 5:11 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
while (true) {
  rdd.map((row: Array[Double]) => {
    row(numCols - 1) = computeSomething(row)
  }).reduce(...)
}
If it fails at some point, I'd imagine that the intermediate info being
stored in
Hi Ian,
When you run your packaged application, are you adding its jar file to
the SparkContext (by calling the addJar() method)?
That will distribute the code to all the worker nodes. The failure
you're seeing seems to indicate the worker nodes do not have access to
your code.
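A minimal sketch of that call (the jar path is made up):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("my-app"))
  // Ship the application's jar to every worker node so executors can load its classes.
  sc.addJar("/path/to/my-app-assembly.jar")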
On Mon, Apr 14,
Hi Joe,
If you cache rdd1 but not rdd2, any time you need rdd2's result, it
will have to be computed. It will use rdd1's cached data, but it will
have to compute its result again.
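A small sketch of that behavior (expensiveParse, isInteresting and the input path are placeholders):

  val rdd1 = sc.textFile("hdfs:///data/input").map(expensiveParse).cache()
  val rdd2 = rdd1.filter(isInteresting)   // not cached

  rdd2.count()   // runs the filter, reading rdd1 from the cache
  rdd2.count()   // runs the filter again; only rdd1's work is skipped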
On Mon, Apr 14, 2014 at 5:32 AM, Joe L selme...@yahoo.com wrote:
Hi I am trying to cache 2Gbyte data and to
.)
Thanks,
Ian
Hi Francis,
This might be a long shot, but do you happen to have built spark on an
encrypted home dir?
(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't