I haven’t seen this but it may be a bug in Typesafe Config, since this is
serializing a Config object. We don’t actually use Typesafe Config ourselves.
Do you have any nulls in the data itself by any chance? And do you know how
that Config object is getting there?
Matei
On Apr 9, 2014, at
Ok, I thought it may be closing over the config object. I am using config
for job configuration, but extracting vals from it, so I'm not sure why, as
I thought I'd avoided closing over it. Will go back to the source and see
where it is creeping in.
On Thu, Apr 10, 2014 at 8:42 AM, Matei Zaharia
rdd.foreach(p => {
  print(p)
})
The above closure gets executed on workers, you need to look at the logs of
the workers to see the output.
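A minimal sketch of the distinction (the RDD name is illustrative): the closure passed to foreach runs on the executors, so to see values on the driver you have to bring the data back first:

```scala
// Runs on the workers: the printed output goes to each executor's stdout
// log, not to the driver's console.
rdd.foreach(p => println(p))

// To print on the driver, pull (a sample of) the data back first.
// collect()/take() are only safe when the result fits in driver memory.
rdd.take(10).foreach(println)
```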
but if i'm in local mode, where's the logs of local driver, there are no
/logs and /work dirs in /SPARK_HOME which are set in standalone mode.
hi, you can take a look here:
http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Shark-CDH5-Final-Release-tp3826p4055.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Mayur,
I wondered if you could share your findings in some way (github, blog post,
etc). I guess your experience will be very interesting/useful for many
people
sent from Lenovo YogaTablet
On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Hi Ankit,
Thanx for all the work
Yes, that helps me understand better how Spark works. But that was also what
I was afraid of: I think the network communications will take too much time
for my job.
I will continue to look for a trick to avoid the network communications.
I saw on the Hadoop website that: To minimize global
You need to use proper HDFS URI with saveAsTextFile.
For example:
rdd.saveAsTextFile("hdfs://NameNode:Port/tmp/Iris/output.tmp")
Regards,
Adnan
Asaf Lahav wrote
Hi,
We are using Spark with data files on HDFS. The files are stored as files
for predefined hadoop user (hdfs).
The folder is
Then the problem is not on the Spark side. You have three options; choose any
one of them:
1. Change permissions on /tmp/Iris folder from shell on NameNode with hdfs
dfs -chmod command.
2. Run your hadoop service with hdfs user.
3. Disable dfs.permissions in conf/hdfs-site.xml.
Regards,
Adnan
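The three options above could look like this (user and path taken from the thread; a sketch, not verified against a live cluster):

```shell
# Option 1: open permissions on the target folder (run as the HDFS superuser)
hdfs dfs -chmod -R 777 /tmp/Iris

# Option 2: run the job as the hdfs user instead of your own
sudo -u hdfs <your spark launch command>

# Option 3: turn off permission checking cluster-wide in conf/hdfs-site.xml
#   <property>
#     <name>dfs.permissions</name>
#     <value>false</value>
#   </property>
```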
avito
Hello Spark Users,
With the recent graduation of Spark to a top level project (grats, btw!),
maybe a well timed question. :)
We are at the very beginning of a large scale big data project and after
two months of exploration work we'd like to settle on the technologies to
use, roll up our sleeves
When you say Spark is one of the forerunners for our technology choice,
what are the other options you are looking into ?
I start cross validation runs on a 40 core, 160 GB spark job using a
script...I woke up in the morning, none of the jobs crashed ! and the
project just came out of incubation
Hi Asaf,
The user who runs SparkContext is decided by the code below in SparkContext;
normally this user.name is the user who started the JVM. You can start your
application with -Duser.name=xxx to specify the username you want, and that
username will be the one used to communicate with HDFS.
val
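For example, with the standalone launch scripts of that era (the class name is a placeholder, and this assumes your launch script honors SPARK_JAVA_OPTS):

```shell
# user.name is a standard JVM system property; per the explanation above,
# Spark uses it to decide which user talks to HDFS.
export SPARK_JAVA_OPTS="-Duser.name=hdfs"
./bin/spark-class org.example.MyApp
```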
Spark has been endorsed by Cloudera as the successor to MapReduce. That
says a lot...
On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
Hello Spark Users,
With the recent graduation of Spark to a top level project (grats, btw!),
maybe a well timed
I am getting a Python ImportError on a Spark standalone cluster. I have set the
PYTHONPATH on both master and slave, and the package imports properly when I
run the PySpark command line on both machines. This only happens with Master -
Slave communication. Here is the error below:
14/04/10 13:40:19
Here are several good ones:
https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8
On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com wrote:
Do you have the link to the Cloudera
Mike Olson's comment:
http://vision.cloudera.com/mapreduce-spark/
Here's the partnership announcement:
http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html
On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com
wrote:
Do you have the
I'll provide answers from our own experience at Bizo. We've been using
Spark for 1+ year now and have found it generally better than previous
approaches (Hadoop + Hive mostly).
On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
I. Is it too much magic? Lots
Bam !!!
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Apr 10, 2014 at 3:07 AM, Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com
I don't think it'll do failure detection etc. of the Spark job in Oozie as of
yet. You should be able to trigger it from Oozie (worst case as a shell
script).
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Apr 10, 2014 at
Thank you for the reply Mayur, it would be nice to have a comparison about
that.
I hope one day it will be available, or to have the time to test it myself
:)
So you're using Mesos for the moment, right? What are the main differences
in your experience? YARN seems to be more flexible and
Probably for the XML case, the best resources I found are
http://stevenskelton.ca/real-time-data-mining-spark/ and
http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
.
And what about JSON? What if I have to work with JSON and I want to use the
fasterxml implementation?
There was a closure over the config object lurking around - but in any case,
upgrading to 1.2.0 for config did the trick, as it seems to have been a bug
in Typesafe Config.
Thanks Matei!
On Thu, Apr 10, 2014 at 8:46 AM, Nick Pentreath nick.pentre...@gmail.com wrote:
Ok I thought it may be
The biggest issue I've come across is that the cluster is somewhat unstable
when under memory pressure. Meaning that if you attempt to persist an RDD
that's too big for memory, even with MEMORY_AND_DISK, you'll often still
get OOMs. I had to carefully modify some of the space tuning parameters
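For reference, the "space tuning parameters" usually meant in this era of Spark are the memory fractions; the values below are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

// Shrink the cache's share of the heap so task working memory has more room.
val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.4") // default 0.6
  .set("spark.shuffle.memoryFraction", "0.2") // default 0.3

// MEMORY_AND_DISK still needs each partition to fit in memory while it is
// being computed, which is one source of the OOMs described above.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
```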
I would echo much of what Andrew has said.
I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk
space dedicated to spark, data storage in separate HDFS shares). I've
been using spark since 0.7, and as with Andrew I've observed
significant and consistent improvements in stability
On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote:
The biggest issue I've come across is that the cluster is somewhat
unstable when under memory pressure. Meaning that if you attempt to
persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll
often
14/04/10 08:00:42 INFO AppClient$ClientActor: Executor added:
app-20140410080041-0017/9 on worker-20140409145028-ken-
VirtualBox-39159 (ken-VirtualBox:39159) with 4 cores
14/04/10
08:00:42 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20140410080041-0017/9 on hostPort
Can anyone comment on their experience running Spark Streaming in
production?
On Thu, Apr 10, 2014 at 10:33 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote:
The biggest issue I've come across is that the cluster is
Sorry, I forgot to mention this is spark-0.9.1 and shark-0.9.1.
Ken
On Thursday, April 10, 2014 9:02 AM, Ken Ellinwood kellinw...@yahoo.com wrote:
14/04/10 08:00:42 INFO AppClient$ClientActor: Executor added:
app-20140410080041-0017/9 on worker-20140409145028-ken-
VirtualBox-39159
Here are answers to a subset of your questions:
1. Memory management
The general direction of these questions is whether it's possible to take
RDD caching related memory management more into our own hands as LRU
eviction is nice most of the time but can be very suboptimal in some of our
use
4. Shuffle on disk
Is it true - I couldn't find it in official docs, but did see this mentioned
in various threads - that shuffle _always_ hits disk? (Disregarding OS
caches.) Why is this the case? Are you planning to add a function to do
shuffle in memory or are there some intrinsic reasons
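On the first question above (taking RDD caching into your own hands): within the existing API, the main lever is explicit persist/unpersist rather than relying on LRU eviction. A sketch, with an illustrative RDD name:

```scala
import org.apache.spark.storage.StorageLevel

val cached = expensiveRdd.persist(StorageLevel.MEMORY_ONLY)
cached.count()     // materialize while you control what else is cached

// ... reuse `cached` across jobs ...

cached.unpersist() // evict deterministically instead of waiting for LRU
```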
Hello friends:
I recently compiled and installed Spark v0.9 from the Apache distribution.
Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well
(actually, the
entire big-data suite from CDH is installed), but for the moment I'm
using my
manually built Apache Spark for 'ground-up'
To add onto the discussion about memory working space, 0.9 introduced the
ability to spill data within a task to disk, and in 1.0 we’re also changing the
interface to allow spilling data within the same *group* to disk (e.g. when you
do groupBy and get a key with lots of values). The main
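A concrete illustration of the groupBy case mentioned above (pair RDD name is illustrative): when only an aggregate per key is needed, reduceByKey sidesteps building the full value list for a hot key in the first place:

```scala
// groupByKey gathers every value of a key into one collection -- exactly
// the "key with lots of values" case that needs spilling:
val sums1 = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so no single key's values
// ever need to co-exist in memory:
val sums2 = pairs.reduceByKey(_ + _)
```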
I am doing the exact same thing for the purpose of learning. I also
don't have a hadoop cluster and plan to scale on ec2 as soon as I get
it working locally.
I am having good success just using the binaries and not compiling
from source... Is there a reason why you aren't just using the
I think this was solved in a recent merge:
https://github.com/apache/spark/pull/204/files#diff-364713d7776956cb8b0a771e9b62f82dR779
Is that what you are looking for? If so, mind marking the JIRA as resolved?
On Wed, Apr 9, 2014 at 3:30 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
I see that this was fixed using a fixed string in SparkContext.scala.
Wouldn’t it be better to use something like:
getClass.getPackage.getImplementationVersion
to get the version from the jar manifest (and thus from the sbt definition)?
The same holds for SparkILoopInit.scala in the welcome
Pierre - I'm not sure that would work. I just opened a Spark shell and did
this:
scala> classOf[SparkContext].getClass.getPackage.getImplementationVersion
res4: String = 1.7.0_25
It looks like this is the JVM version.
- Patrick
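A likely reason, for what it's worth: the extra `.getClass` asks for the package of `java.lang.Class` itself (loaded from the JDK's runtime jar), hence the JVM version. Dropping it would query the Spark jar's manifest instead, though it still only helps if that manifest actually sets Implementation-Version:

```scala
// Package of SparkContext's own jar, not of java.lang.Class:
classOf[SparkContext].getPackage.getImplementationVersion
```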
On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans
This is likely because hdfs's core-site.xml (or something similar) provides
an fs.default.name which changes the default FileSystem and Spark uses
the Hadoop FileSystem API to resolve paths. Anyway, your solution is
definitely a good one -- another would be to remove hdfs from Spark's
classpath if
I just updated and got the following:
[error] (external-mqtt/*:update) sbt.ResolveException: unresolved dependency:
org.eclipse.paho#mqtt-client;0.4.0: not found
[error] Total time: 7 s, completed Apr 10, 2014 4:27:09 PM
Chesters-MacBook-Pro:spark chester$ git branch
* branch-1.0
master
Kind of strange because we haven’t updated CloudPickle AFAIK. Is this a package
you added on the PYTHONPATH? How did you set the path, was it in
conf/spark-env.sh?
Matei
On Apr 10, 2014, at 7:39 AM, aazout albert.az...@velos.io wrote:
I am getting a python ImportError on Spark standalone
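For reference, setting the path the way Matei suggests would look like this on every node (the package path is a placeholder):

```shell
# conf/spark-env.sh (must be present on the master and on each worker)
export PYTHONPATH=/path/to/your/package:$PYTHONPATH
```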
Looks like it. I'm guessing this didn't make the cut for 0.9.1, and will
instead be included with 1.0.0.
So would you access it just by calling sc.version from the shell? And will
this automatically make it into the Python API?
I'll mark the JIRA issue as resolved.
On Thu, Apr 10, 2014 at 5:05