This ultimately means there is a problem with SSL in the version of Java you
are using to run SBT. If you look around the internet, you'll see a
bunch of discussion, most of which seems to boil down to reinstalling or
updating Java.
--
Sean Owen | Director, Data Science | London
On Fri, Apr 4, 2014 at 12:21
Hi all,
Could anyone explain the lines below to me?
computer1 - worker
computer8 - driver(master)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added
input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5
KB, free: 540.3 MB)
14/04/04 14:24:56 INFO
Hi all,
I am doing some tests using JavaNetworkWordCount and I have some
questions about machine performance; my tests run for approximately 2 minutes.
Why does the free RAM decrease so significantly? I have run tests with 2 and 3
machines and observed the same behavior.
What should I
Hi all,
Say I have an input file which I would like to partition using
HashPartitioner k times.
Calling rdd.saveAsTextFile(hdfs://) will save k files named part-0 through part-k.
Is there a way to save each partition in a specific folder?
i.e. src
part0/part-0
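A possible workaround (a rough sketch only; the class name, the pairs RDD, k and outputPath are illustrative) is to route records through Hadoop's MultipleTextOutputFormat so each key gets its own sub-directory:

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
    // write each record under a sub-directory named after its key,
    // e.g. .../part0/part-00000, .../part1/part-00000, ...
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      key.toString + "/" + name
    // drop the key so only the value appears in the output files
    override def generateActualKey(key: Any, value: Any): Any =
      NullWritable.get()
  }

  // usage, assuming an RDD of (partitionId, line) pairs and k partitions:
  pairs.partitionBy(new org.apache.spark.HashPartitioner(k))
       .saveAsHadoopFile(outputPath, classOf[Any], classOf[Any], classOf[KeyBasedOutput])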
Do we have a list of things we really want to get in for 1.X? Perhaps move
any JIRAs out to a 1.1 release if we aren't targeting them for 1.0.
It might be nice to send out reminders when these dates are approaching.
Tom
On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com
Hi Rahul,
Spark will be available in Fedora 21 (see:
https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently
scheduled for 2014-10-14, but they have already produced spec files and source
RPMs.
If you are stuck with EL6 like me, you can have a look at the attached spec
file,
Hi Guys,
Could anyone help me understand this driver behavior when I start
JavaNetworkWordCount?
computer8
16:24:07 up 121 days, 22:21, 12 users, load average: 0.66, 1.27, 1.55
             total       used       free     shared    buffers     cached
Mem:          5897
Hi All,
I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1, and so perhaps
it is already being addressed, but I am having a devil of a time with a Spark
0.9.0 client jar for Hadoop 2.X. If I go to the site and download:
- Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache
mirror
Hi Erik,
I am working with the tip of branch-0.9 (0.9.1) and the following works for me for a
Maven build:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package
And from
I believe you need to set the following:
SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is)
SPARK_YARN=true
and then run sbt/sbt assembly.
If you are using Maven to compile:
mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package
Hope this helps
-A
On Fri, Apr 4, 2014
Hi Evan,
Could you please provide a code snippet? It is not clear to me; in
Hadoop you need to use the addNamedOutput method, and I am stuck on how to use
it from Spark.
Thank you,
Konstantin Kudryavtsev
On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote:
Have a look
Hi,
Can you explain a little more about what's going on? Which one submits a job to the
YARN cluster that creates an application master and spawns containers for the
local jobs? I tried yarn-client and submitted to our YARN cluster, and it seems
to work that way. Shouldn't Client.scala be running
Thanks all for the update - I have actually built using those options every
way I can think of, so perhaps this is something to do with how
I upload the jar to our Artifactory repo server. Does anyone have a working POM
file for publishing a Spark 0.9 Hadoop 2.X build to a Maven repo
Hi all,
I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20
Is this parallelism level correct?
The machine's processor is a dual core.
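In case it matters, here is a rough sketch of the programmatic way to set the same property (if the spark-env.sh line is not being picked up; the master URL below is only an example):

  // set the property before the SparkContext is created
  System.setProperty("spark.default.parallelism", "20")
  val sc = new org.apache.spark.SparkContext("spark://master:7077", "NetworkWordCount")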
Thanks
--
Privacy notice: http://www.unibs.it/node/8155
On Wed, Apr 2, 2014 at 7:11 PM, yh18190 yh18...@gmail.com wrote:
Does the SparkContext object always need to be created in the main method of a
class? Is that necessary? Can we create the sc object in another class and
use it by passing the object to a function?
The Spark context can
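For what it is worth, a minimal sketch (class, object, and file names are just placeholders) of creating the context in main and handing it to another class:

  import org.apache.spark.SparkContext

  class WordCounter(sc: SparkContext) {
    // the context is an ordinary reference; any class that holds it can use it
    def count(path: String): Long =
      sc.textFile(path).flatMap(_.split(" ")).count()
  }

  object Main {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "PassContextAround")
      val counter = new WordCounter(sc)   // pass the same context into another class
      println(counter.count("input.txt"))
      sc.stop()
    }
  }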
Hi Wisely,
Could you please post your pom.xml here.
Thanks
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p3770.html
Sent from the Apache Spark User List
Hi Guys,
Could anyone explain this behavior to me? After 2 minutes of tests:
computer1 - worker
computer10 - worker
computer8 - driver(master)
computer1
18:24:31 up 73 days, 7:14, 1 user, load average: 3.93, 2.45, 1.14
             total       used       free     shared    buffers     cached
If you're running on one machine with 2 cores, I believe all you can get
out of it are 2 concurrent tasks at any one time. So setting your default
parallelism to 20 won't help.
On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi all,
I have put this line
Hi Francis,
This might be a long shot, but do you happen to have built Spark on an
encrypted home dir?
(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't
I'm trying to get a clear idea of how exceptions are handled in Spark.
Is there somewhere I can read about this? I'm on Spark 0.7.
For some reason I was under the impression that such exceptions are
swallowed and the value that produced them ignored, but the exception is
logged. However,
Exceptions should be sent back to the driver program and logged there (with a
SparkException thrown if a task fails more than 4 times), but there were some
bugs before where this did not happen for non-Serializable exceptions. We
changed it to pass back the stack traces only (as text), which
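A small sketch of what that looks like in practice (assuming default settings, where a task is retried 4 times before the job is failed; sc and the data are just an example):

  val data = sc.parallelize(1 to 10)
  try {
    data.map { x =>
      if (x == 5) throw new RuntimeException("bad record: " + x)   // fails on the worker
      x * 2
    }.collect()
  } catch {
    case e: org.apache.spark.SparkException =>
      // after the retries are exhausted, the failure (with the remote stack trace
      // passed back as text) surfaces here on the driver
      println("Job failed: " + e.getMessage)
  }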
In such a construct, each operator builds on the previous one, including any
materialized results, etc. If I use a SQL query for each of them, I suspect the
later queries will not leverage the earlier ones in any way - hence these
will be less efficient than the first approach. Let me know if this is not
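For what it is worth, a rough sketch of the second style, where the result of one query is explicitly registered so a later query can build on it (method names as in the current SQL alpha, so adjust for your version; assumes the usual hiveContext import and the src(key INT, value STRING) table from the example):

  val all = sql("SELECT * FROM src")         // first query
  all.registerAsTable("src_all")             // expose the intermediate result to SQL
  all.cache()                                // materialize it so later queries reuse it
  val filtered = sql("SELECT * FROM src_all WHERE key > 10")   // second query builds on the first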
What do you advise me, Nicholas?
On 4/4/14, 19:05, Nicholas Chammas wrote:
If you're running on one machine with 2 cores, I believe all you can
get out of it are 2 concurrent tasks at any one time. So setting your
default parallelism to 20 won't help.
On Fri, Apr 4, 2014 at 11:41 AM,
Is there a way to log exceptions inside a mapping function? logError and
logInfo seem to freeze things.
On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.comwrote:
Exceptions should be sent back to the driver program and logged there
(with a SparkException thrown if a task
Btw, thank you for your help.
On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.comwrote:
Is there a way to log exceptions inside a mapping function? logError and
logInfo seem to freeze things.
On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.comwrote:
Minor typo in the example. The first SELECT statement should actually be:
sql("SELECT * FROM src")
where `src` is a Hive table with schema (key INT, value STRING).
On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.comwrote:
In such construct, each operator builds on the
Spark community,
What's the size of the largest Spark cluster ever deployed? I've heard
Yahoo is running Spark on several hundred nodes but don't know the actual
number.
Can someone share?
Thanks
If you want more parallelism, you need more cores. So, use a machine with
more cores, or use a cluster of machines.
spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the
easiest way to do this.
If you're stuck on a single machine with 2 cores, then set your default
parallelism to
FYI, one thing we’ve added now is support for reading multiple text files from
a directory as separate records: https://github.com/apache/spark/pull/327. This
should remove the need for mapPartitions discussed here.
Avro and SequenceFiles look like they may not make it for 1.0, but there’s a
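If it is the method I am thinking of (wholeTextFiles - check the linked PR for the exact name and signature), usage would look roughly like this (the HDFS path is only an example):

  // each file under the directory becomes one (fileName, fileContent) record
  val files = sc.wholeTextFiles("hdfs://namenode:8020/data/mydir")
  files.map { case (name, content) => (name, content.length) }.collect()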
This can’t be done through the script right now, but you can do it manually as
long as the cluster is stopped. If the cluster is stopped, just go into the AWS
Console, right click a slave and choose “launch more of these” to add more. Or
select multiple slaves and delete them. When you run
Hi Christophe,
Thanks for your reply and the spec file. I have solved my issue for now. I
didn't want to rely on building Spark using the spec file (%build section), as I
don't want to be maintaining the list of files that need to be packaged. I
ended up adding Maven build support to
Hi Tathagata,
You are right, this code compiles, but I am having some problems with high
memory consumption. I sent some emails about this today, but no response
so far.
Thanks
On 4/4/14, 22:56, Tathagata Das wrote:
I haven't really compiled the code, but it looks good to me. Why? Is
there any
Logging inside a map function shouldn't freeze things. The messages
should be logged on the worker logs, since the code is executed on the
executors. If you throw a SparkException, however, it'll be propagated to
the driver after it has failed 4 or more times (by default).
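A minimal sketch of the kind of logging that should work (log4j here, since that is what Spark ships with; the logger name and rdd are arbitrary):

  import org.apache.log4j.Logger

  rdd.map { x =>
    // obtain the logger inside the closure so nothing non-serializable is captured
    val log = Logger.getLogger("myapp.mapper")
    log.info("processing record " + x)    // appears in the executor/worker logs
    x * 2
  }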
On Fri, Apr 4, 2014 at
There is no compression type setting for Snappy.
Sent from my iPhone 5s
On April 4, 2014, at 23:06, Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com wrote:
Can anybody suggest how to change the compression type (Record, Block) for
Snappy?
If it is possible, of course.
Thank you in advance
Thank
All
Are there any drawbacks or technical challenges (or any information, really)
related to using Spark directly on a global parallel filesystem like
Lustre/GPFS?
Any idea of what would be involved in doing a minimal proof of concept? Is it
just possible to run Spark unmodified (without the
As long as the filesystem is mounted at the same path on every node, you should
be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won’t expose data
locality info to Spark, the way HDFS does. That may not matter if it’s a
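In other words, something like this should just work, assuming the shared filesystem is mounted at the same path (here /mnt/lustre, purely as an example) on every node:

  val logs = sc.textFile("file:///mnt/lustre/datasets/events.log")
  println(logs.count())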
Thanks will take a look...
Sent from my iPad
On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu
wrote:
We use Avro objects in our project, and have a Kryo serializer for generic
Avro SpecificRecords. Take a look at:
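The general shape of such a setup (a sketch only; MyAvroRecord and AvroSpecificSerializer are placeholders for your record class and the serializer in that project) is:

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.serializer.KryoRegistrator

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      // register the Avro record class with a custom Kryo serializer
      kryo.register(classOf[MyAvroRecord], new AvroSpecificSerializer)
    }
  }

  // then, before creating the SparkContext:
  // System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // System.setProperty("spark.kryo.registrator", "MyRegistrator")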
Does Spark in general assure exactly-once semantics? What happens to
those guarantees in the presence of updateStateByKey operations -- are
they also assured to be exactly once?
Thanks
manku.timma at outlook dot com
Hey Parviz,
There was a similar thread a while ago... I think that many companies like
to be discreet about the size of large clusters. But of course it would be
great if people wanted to share openly :)
For my part, I can say that Spark has been benchmarked on
clusters of hundreds of nodes before
We might be able to incorporate the Maven RPM plugin into our build. If
that can be done in an elegant way, it would be nice to have that
distribution target for people who want to try this with arbitrary Spark
versions...
Personally I have no familiarity with that plug-in, so curious if anyone
We run Spark (in Standalone mode) on top of a network-mounted file system
(NFS), rather than HDFS, and find it to work great. It required no modification
or special configuration to set this up; as Matei says, we just point Spark to
data using the file location.
-- Jeremy
On Apr 4, 2014, at
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
As long as the filesystem is mounted at the same path on every node, you
should be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won't expose