Hi,
I’m new to Spark and trying to test my first Spark program. I can run SparkPi
successfully in yarn-client mode, but when running the same in yarn-cluster
mode the app gets stuck in the ACCEPTED state. I’ve spent hours hunting down
the reason, but the outcome is always the same. Any hints on what to look for
next?
You asked off-list, and provided a more detailed example there:
val random = new Random()
val testdata = (1 to 1).map(_ => (random.nextInt(), random.nextInt()))
sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]](
  (instant: Int) => { new ArrayBuffer[Int]() },
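For reference, a complete call of this shape could look like the following. This is only a sketch: the original snippet is truncated, so the key range and both merge functions here are assumptions, not the poster's code.

import org.apache.spark.SparkContext._  // pair-RDD operations
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val random = new Random()
// Hypothetical data: keys drawn from a small range so values group up.
val testdata = (1 to 1000).map(_ => (random.nextInt(10), random.nextInt()))
val grouped = sc.parallelize(testdata).combineByKey[ArrayBuffer[Int]](
  // createCombiner: start a buffer from the first value seen for a key
  (v: Int) => ArrayBuffer(v),
  // mergeValue: fold another value for the same key into that buffer
  (buf: ArrayBuffer[Int], v: Int) => buf += v,
  // mergeCombiners: merge per-partition buffers for the same key
  (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2
)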
Hi Jan,
How much memory capacity is configured for each node?
If you go to the ResourceManager web UI, does it indicate any containers are
running?
-Sandy
On May 19, 2014, at 11:43 PM, Jan Holmberg jan.holmb...@perigeum.fi wrote:
Hi,
I’m new to Spark and trying to test my first Spark program.
It turned out just now that this issue was caused by a version mismatch
between the driver (0.9.1) and the server (0.9.0-cdh5.0.1). Other functions
worked fine, but combineByKey did not.
Thank you very much for your reply.
Hi,
Each node has 4 GB of memory. After a total reboot and re-run of SparkPi,
the ResourceManager shows no running containers and 1 pending container.
-jan
On 20 May 2014, at 10:24, sandy.r...@cloudera.com wrote:
Hi Jan,
How much memory capacity is configured for each
Thank you guys for the detailed answer.
Akhil, yes I would like to have a try of your tool. Is it open-sourced?
2014-05-17 17:55 GMT+02:00 Mayur Rustagi mayur.rust...@gmail.com:
A better way would be to use Mesos (and quite possibly YARN in 1.0.0).
That will allow you to add nodes on the fly
Hi
I just learned that Akka is under a commercial license; however, Spark is
under the Apache license.
Is there any problem?
Regards
Akka is under Apache 2 license too.
http://doc.akka.io/docs/akka/snapshot/project/licenses.html
On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:
Hi
I just learned that Akka is under a commercial license; however, Spark is
under the Apache license.
Is there any problem?
The page says Akka is Open Source and available under the Apache 2 License.
It may also be available under another license, but that does not
change the fact that it may be used by adhering to the terms of the
AL2.
The section is referring to commercial support that Typesafe sells. I
am not even
A few more details I would like to provide (sorry, I should have included
these in the previous post):
- Spark version = 0.9.1 (using pre-built spark-0.9.1-bin-hadoop2)
- Hadoop version = 2.4.0 (Hortonworks)
- I am trying to execute a Spark Streaming program
Because I am using Hortonworks
Still the same. I increased the memory of the node holding the
ResourceManager to 5 GB. I also spotted an HDFS alert about a replication
factor of 3, which I have now dropped to the number of data nodes. I also
shut down all services not in use. Still the issue remains.
I have noticed following two events
If they are tied to the SparkContext, then why can the subprocess not be
started up with the extra jars (sc.addJars) already on the class path? That
way a switch like user-jars-first would be a simple rearranging of the class
path for the subprocess, and the messing with classloaders that is
Just for my clarification: off-heap cannot be Java objects, correct? So we
are always talking about serialized off-heap storage?
On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
That's one of the main motivations for using Tachyon ;)
http://tachyon-project.org/
It gives
Hi,
I'm trying to get data from S3 using sc.textFile("s3n://" + filenamePattern).
It seems that if a pattern matches no files, I get an exception like so:
org.apache.hadoop.mapred.InvalidInputException: Input Pattern
s3n://bucket/20140512/* matches 0 files
at
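One possible guard, sketched here against the same s3n path, is to wrap the action in scala.util.Try so that an empty glob does not abort the whole job:

import scala.util.Try

// count() is where the InvalidInputException surfaces; Try turns it
// into a recoverable value instead of a failed job.
val count = Try(sc.textFile("s3n://bucket/20140512/*").count()).getOrElse(0L)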
Thanks for the advice. I think you're right. I'm not sure we're going to use
HBase, but starting by partitioning data into multiple buckets will be a
first step. I'll see how it performs on large datasets.
My original question, though, was more like: is there a Spark trick I don't
know about?
We are trying to read some large GraphML files to use in spark.
Is there an easy way to read XML-based files like this that accounts for
partition boundaries and the like?
Thanks,
Nathan
--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley
This error message says it can't find the config for the Akka subsystem.
That is typically included in the Spark assembly.
First, you need to compile your Spark distro by running sbt/sbt assembly
in the SPARK_HOME dir.
Then, use SPARK_HOME (through env or configuration) to point to your
Xiangrui,
Thanks for the pointer. I think it should work... for now I did cook up my
own, which is similar but built on top of the Spark core APIs. I would
suggest moving the sliding window RDD to the core Spark library. It seems
quite general to me, and a cursory look at the code indicates nothing
specific to
One issue is that new jars can be added during the lifetime of a
SparkContext, which can mean after executors are already started. Off-heap
storage is always serialized, correct.
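As a sketch of what that looks like in Spark 1.0, where a Tachyon-backed storage level is exposed (the data set here is made up):

import org.apache.spark.storage.StorageLevel

// OFF_HEAP keeps blocks serialized in Tachyon rather than as Java
// objects on the executor heap.
val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.OFF_HEAP)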
On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote:
just for my clarification: off heap cannot
Try sc.wholeTextFiles(). It reads the entire file into a string
record. -Xiangrui
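A minimal sketch of its use, with a hypothetical HDFS path:

// Each record is (file path, entire file contents as one String).
val files = sc.wholeTextFiles("hdfs:///data/graphml/")
val sizes = files.map { case (path, xml) => (path, xml.length) }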
On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
We are trying to read some large GraphML files to use in spark.
Is there an easy way to read XML-based files like this that
Hi -
We have a use case for batch processing for which we are trying to figure
out if Apache Spark would be a good fit or not.
We have a universe of identifiers sitting in RDBMS for which we need to go
get input data from RDBMS and then pass that input to analytical models that
generate some
Unfortunately, I don't have a bunch of moderately big XML files; I have one
really big file, big enough that reading it into memory as a single string
is not feasible.
On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng men...@gmail.com wrote:
Try sc.wholeTextFiles(). It reads the entire file
I'm a first-time user and need to try just a hello-world kind of program in
Spark.
Now, on the downloads page, I see the following 3 options for pre-built
packages that I can download:
- Hadoop 1 (HDP1, CDH3)
- CDH4
- Hadoop 2 (HDP2, CDH5)
I'm confused about which one I need to download. I need to try
Hi Puneet,
If you're not going to read/write data in HDFS from your Spark cluster,
then it doesn't matter which one you download. Just go with Hadoop 2, as it
uses the newer APIs and is more likely to connect to an HDFS cluster if you
ever do decide to use HDFS.
Cheers,
Andrew
You can download any of them; I would go with the latest versions,
or just download the source and build it yourself.
For experimenting with basic things you can just launch the REPL
and start right away in Spark local mode, without using any Hadoop stuff.
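For instance, once the REPL is up in local mode, a first sanity check might be something like:

// sc is provided by the shell; no Hadoop involved in local mode.
val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()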
2014-05-20 19:43 GMT+02:00 pcutil
Howdy Gerard,
Yeah, the Docker link feature seems to work well for client-server
interaction. But peer-to-peer architectures need more for service
discovery.
As for your addressing requirements, I don't completely understand what you
are asking for... you may also want to check out xip.io.
I was actually able to get this to work. I was NOT setting the classpath
properly originally.
Simply running
java -cp /etc/hadoop/conf/:<yarn and hadoop jars> com.domain.JobClass
and setting yarn-client as the spark master worked for me. Originally I
had not put the configuration on the classpath.
Hi Matei,
Unfortunately, I don't have more detailed information, but we have seen the
loss of workers in standalone mode as well. If a job is killed with CTRL-C,
we will often see the number of workers and cores decrease on the Spark
Master page. They are still alive and well in the Cloudera
I'd just like to point out that, along with Matei, I have not seen workers
drop even under the most exotic job failures. We're running pretty close to
master, though; perhaps it is related to an uncaught exception in the
Worker from a prior version of Spark.
On Tue, May 20, 2014 at 11:36 AM,
I'm assuming you're running Spark 0.9.x, because in the latest version of
Spark you shouldn't have to add the HADOOP_CONF_DIR to the java class path
manually. I tested this out on my own YARN cluster and was able to confirm
that.
In Spark 1.0, SPARK_MEM is deprecated and should not be used.
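A minimal sketch of the replacement, assuming a standalone driver program rather than the shell (the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                 // hypothetical app name
  .set("spark.executor.memory", "2g")  // takes over from SPARK_MEM
val sc = new SparkContext(conf)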
Are you guys both using Cloudera Manager? Maybe there’s also an issue with
that integration.
Matei
On May 20, 2014, at 11:44 AM, Aaron Davidson ilike...@gmail.com wrote:
I'd just like to point out that, along with Matei, I have not seen workers
drop even under the most exotic job
We're using Spark 0.9.0, and we're using it out of the box -- not using
Cloudera Manager or anything similar.
There are warnings from the master that it continues to receive heartbeats
from the unregistered workers. I will see if there are particular
telltale errors on the worker side.
We've had
Interesting, so it sounds to me like Spark is forced to choose between the
ability to add jars during the lifetime and the ability to run tasks with
the user classpath first (which is important for the ability to run jobs on
Spark clusters not under your control, and so for the viability of 3rd-party
Spark apps).
So, for example, I have two disassociated worker machines at the moment.
The last messages in the spark logs are akka association error messages,
like the following:
14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://
sparkwor...@hdn3.int.meetup.com:50038] - [akka.tcp://
Hi,
I am trying to send events to Spark Streaming via Flume.
It's working fine up to a certain point.
I have problems when the size of the body is over 1020 characters.
Basically, up to 1020 it works.
From 1021 through 1024, the event will be accepted and there is no
exception, but the channel seems
Yes, we are on Spark 0.9.0 so that explains the first piece, thanks!
Also, yes, I meant SPARK_WORKER_MEMORY. Thanks for the hierarchy.
Similarly is there some best practice on setting SPARK_WORKER_INSTANCES and
spark.default.parallelism?
Thanks,
Arun
On Tue, May 20, 2014 at 3:04 PM, Andrew Or
Thanks, that sounds perfect
On Tue, May 20, 2014 at 1:38 PM, Xiangrui Meng men...@gmail.com wrote:
You can search for XMLInputFormat on Google. There are some
implementations that allow you to specify the tag to split on, e.g.:
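As a sketch of that approach, assuming an implementation such as Mahout's XmlInputFormat is on the classpath (the import, path, and tag names here are assumptions):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat  // package varies by version

val xmlConf = new Configuration()
xmlConf.set("xmlinput.start", "<node>")  // a record opens at this tag
xmlConf.set("xmlinput.end", "</node>")   // and closes at this one
val records = sc.newAPIHadoopFile(
  "hdfs:///data/big.graphml",
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  xmlConf
).map { case (_, text) => text.toString }

Because the input format splits on tag boundaries, a single huge file gets divided into many records without ever being read into one string.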
This is the first time I'm trying Spark. I just downloaded it and am trying
the SimpleApp Java program using Maven. I added 2 Maven dependencies,
spark-core and scala-library. Even though my program is in Java, I was
forced to add the Scala dependency. Is that really required?
Now, I'm able
My $0.02: If you are simply reading input records, running a model,
and outputting the result, then it's a simple map-only problem and
you're mostly looking for a process to baby-sit these operations. Lots
of things work -- Spark, M/R (+ Crunch), Hadoop Streaming, etc. I'd
choose whatever is
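To make the map-only shape concrete, here is a sketch with a hypothetical Model class standing in for the analytical model:

// Hypothetical linear model; any serializable scoring object works here.
case class Model(weights: Array[Double]) {
  def score(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}

val model = Model(Array(0.5, -1.2, 3.0))
val bcModel = sc.broadcast(model)  // ship the model to each executor once
val input = sc.parallelize(Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)))
val scores = input.map(bcModel.value.score)  // pure map, no shuffle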
If the distribution of the keys in your groupByKey is skewed (some keys
appear way more often than others) you should consider modifying your job
to use reduceByKey instead wherever possible.
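A sketch of the difference, using made-up pairs:

import org.apache.spark.SparkContext._  // pair-RDD operations

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
// groupByKey ships every value for a key across the network:
val sizes = pairs.groupByKey().mapValues(_.size)
// reduceByKey combines map-side first, so far less data is shuffled:
val counts = pairs.mapValues(_ => 1).reduceByKey(_ + _)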
On May 20, 2014 12:53 PM, Jon Keebler jkeeble...@gmail.com wrote:
So we upped the spark.akka.frameSize
Hello All,
I am learning that there are certain imports done by the Spark REPL, used to
invoke and run code in a Spark shell, that I would have to import explicitly
if I need the same functionality in a Spark jar run from the command
line.
I am getting into a repeated serialization error of an
Hi Ron,
What version are you using? For 0.9, you need to set it outside your code
with the SPARK_YARN_QUEUE environment variable.
-Sandy
On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
Hi,
How does one submit a spark job to yarn and specify a queue?
The code
Thanks for the suggestion, Andrew. We have also implemented our solution
using reduceByKey, but we observe the same behavior. For example, if we do
the following:
map1
groupByKey
map2
saveAsTextFile
Then the stalling will occur during the map1+groupByKey execution.
If we do
map1
reduceByKey
map2
Thanks Mayur, it is working :)
--
Anish Sneh
http://in.linkedin.com/in/anishsneh
So the current stalling is simply sitting there with no log output? Have
you jstack'd an Executor to see where it may be hanging? Are you observing
memory or disk pressure (df and df -i)?
On Tue, May 20, 2014 at 2:03 PM, jonathan.keebler jkeeble...@gmail.com wrote:
Thanks for the suggestion,
Hi All
I need to analyse the performance of Spark on YARN vs Spark Standalone.
Please point me to any pre-published comparison statistics, if available.
TIA
--
Anish Sneh
http://in.linkedin.com/in/anishsneh
Hello Joe,
The first step is acquiring some data, either through the Facebook API
(https://developers.facebook.com/) or a third-party service like
Datasift (https://datasift.com/) (paid). Once you've acquired some data,
and got it somewhere Spark can access it (like HDFS), you can then load and
Hello,
This seems like a basic question but I have been unable to find an answer in
the archives or other online sources.
I would like to know if there is any way to load an RDD from HBase in Python.
In Java/Scala I can do this by initializing a NewAPIHadoopRDD with a
TableInputFormat class. Is
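For comparison, the Java/Scala route mentioned above looks roughly like this in Scala (the table name is hypothetical):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "mytable")  // hypothetical table
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result]
)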
The Apache Drill (http://incubator.apache.org/drill/) home page has an
interesting heading: Liberate Nested Data.
Is there any current or planned functionality in Spark SQL or Shark to
enable SQL-like querying of complex JSON?
Nick
Any tips on how to troubleshoot this?
On Thu, May 15, 2014 at 4:15 PM, Nick Chammas nicholas.cham...@gmail.com wrote:
I’m trying to do a simple count() on a large number of GZipped files in
S3. My job is failing with the following message:
14/05/15 19:12:37 WARN scheduler.TaskSetManager:
I have read gzip files from S3 successfully.
It sounds like a file is corrupt or not a valid gzip file.
Does it work with fewer gzip files?
How are you reading the files?
-
Madhu
https://www.linkedin.com/in/msiddalingaiah
Send an email to this address to unsubscribe from the Spark user list:
user-unsubscr...@spark.apache.org
Sending an email to the Spark user list itself (i.e. this list) *does not
do anything*, even if you put unsubscribe as the subject. We will all
just see your email.
Nick
Unfortunately, those errors are actually due to an Executor that exited,
such that the connection between the Worker and Executor failed. This is
not a fatal issue, unless there are analogous messages from the Worker to
the Master (which should be present, if they exist, at around the same
point
Yes, it does work with fewer GZipped files. I am reading the files in using
sc.textFile() and a pattern string.
For example:
a = sc.textFile('s3n://bucket/2014-??-??/*.gz')
a.count()
Nick
On Tue, May 20, 2014 at 10:09 PM, Madhu ma...@madhu.com wrote:
I have read gzip files from S3
Hi,
I'm trying to work with Spark from the shell and create a Hadoop Job
instance. I get the exception you see below because Job.toString doesn't
like to be called until the job has been submitted.
I tried using the :silent command, but that didn't seem to have any impact.
scala> import
Unfortunately this is not yet possible. There’s a patch in progress posted here
though: https://github.com/apache/spark/pull/455 — it would be great to get
your feedback on it.
Matei
On May 20, 2014, at 4:21 PM, twizansk twiza...@gmail.com wrote:
Hello,
This seems like a basic question
Aaron:
I see this in the Master's logs:
14/05/20 01:17:37 INFO Master: Attempted to re-register worker at same
address: akka.tcp://sparkwor...@hdn3.int.meetup.com:50038
14/05/20 01:17:37 WARN Master: Got heartbeat from unregistered worker
worker-20140520011737-hdn3.int.meetup.com-50038
There
sparkers,
Is there a better way to control memory usage when the streaming input's
speed is faster than the speed at which Spark Streaming can handle it?
Thanks,
Francis.Hu