Found the issue: the splits in HBase were not uniform, so one job was
taking 90% of the time.
BTW, is there a way to save the details available on port 4040 after the job is
finished?
On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote:
It's tricky really since you may not
Hi,
I am looking for ways to share the SparkContext, meaning I need to
be able to perform multiple operations on the same Spark context.
Below is the code of a simple app I am testing:
def main(args: Array[String]) {
  println("Welcome to example application!")
  val sc = new
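A minimal sketch of how the truncated app above might continue, reusing one
SparkContext for several operations (the master URL, app name, and file path
are placeholders):

import org.apache.spark.SparkContext

object ExampleApp {
  def main(args: Array[String]) {
    println("Welcome to example application!")
    val sc = new SparkContext("local[4]", "ExampleApp")
    val data = sc.textFile("README.md")
    println(data.count())                             // first operation
    println(data.filter(_.contains("Spark")).count()) // second operation, same context
    sc.stop()
  }
}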
The fair scheduler merely reorders tasks. I think he is looking to run
multiple pieces of code on a single context, on demand from customers... if
the code order is decided, then the fair scheduler will ensure that all tasks
get equal cluster time :)
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
Okay, you caught me on this.. I haven't used the Python API.
Let's try
http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy
on the RDD to customize the partitioner, instead of hashing, with a custom
function.
Please update the list if it works; it seems to be a
Hi Mayur,
Thanks for replying. Is it usually double the size of the data on disk?
I have observed this many times. The Storage section of the Spark UI tells me that
100% of the RDD is cached, using 97 GB of RAM, while the data in HDFS is only 47 GB.
Thanks and Regards,
Suraj Sheth
From: Mayur Rustagi
The problem is that Java objects can take more space than the underlying data,
but there are options in Spark to store data in serialized form to get around
this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html.
Matei
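A minimal sketch of the serialized-storage option mentioned above (the path is
a placeholder, and sc is assumed to be a live SparkContext):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/input")
rdd.persist(StorageLevel.MEMORY_ONLY_SER) // store serialized bytes rather than Java objects
rdd.count()                               // first action materializes the cache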
On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar
It seems you are already using partitionBy; you can simply plug in
your custom function instead of lambda x: x and it should use that to
partition. A range partitioner is available in Scala; I am not sure if it's
exposed directly in Python.
Regards
Mayur
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
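A hedged Scala sketch of the two options mentioned, a RangePartitioner and a
fully custom Partitioner (sc and the sample data are assumptions):

import org.apache.spark.{Partitioner, RangePartitioner}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
// option 1: range partitioning, which samples the keys to pick boundaries
val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))
// option 2: any custom rule mapping a key to a partition index
val custom = pairs.partitionBy(new Partitioner {
  def numPartitions = 2
  def getPartition(key: Any) = key.hashCode.abs % 2
})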
Thank you Mayur, I think that will help me a lot
Best,
Tao
2014-02-26 8:56 GMT+08:00 Mayur Rustagi mayur.rust...@gmail.com:
The type of shuffling is best explained by Matei in the Spark Internals talk:
http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203
Why don't you look at that, and then if you have follow
I'm not able to run the GraphX examples from the Scala REPL. Can anyone
point to the correct documentation that talks about the configuration
and/or how to build GraphX for the REPL ?
Thanks
Hi hyqgod,
This is probably a better question for the spark user's list than the dev
list (cc'ing user and bcc'ing dev on this reply).
To answer your question, though:
Amazon's Public Datasets Page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with spark because
In Spark 0.9 and master, you can pass the -i argument to spark-shell to load a
script containing commands before opening the prompt. This is also a feature of
the Scala shell as a whole (try scala -help for details).
Also, once you’re in the shell, you can use :load file.scala to execute the
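A minimal sketch of such a script (file name and contents are hypothetical;
sc is predefined by the shell):

// init.scala -- preload with: bin/spark-shell -i init.scala
// or from inside the shell with: :load init.scala
val readme = sc.textFile("README.md")
println("README lines: " + readme.count())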
If I use groupByKey as so...
JavaPairRDD<String, List<String>> twos = ones.groupByKey(3).cache();
How would I write the contents of the List of Strings to a file or to Hadoop?
Do I need to transform the JavaPairRDD to a JavaRDD and call saveAsTextFile?
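A hedged sketch in Scala rather than the Java API (assuming ones is the pair
RDD from the question): since a pair RDD is just an RDD of tuples,
saveAsTextFile works on it directly, and a map first gives control over the
output format.

val twos = ones.groupByKey(3).cache()
twos.map { case (k, vs) => k + "\t" + vs.mkString(",") } // one line per key
    .saveAsTextFile("hdfs:///path/out")                  // placeholder path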
Hi, all
I'm trying to build Spark in IntelliJ IDEA 13.
I cloned the latest repo and ran sbt/sbt gen-idea in the root folder,
then imported it into IntelliJ IDEA. The Scala plugin for IntelliJ IDEA is
installed.
Everything seemed OK until I ran Build > Make Project:
Information: Using javac
I also use IntelliJ 13 on a Mac, with only Java 7, and have never seen this.
If you look at the Spark build, you will see that it specifies Java 6, not 7.
Even if you changed java.version in the build, you would not get this
error, since it specifies source and target to be the same value.
In
I am not sure.. the suggestion is to open a TB file and remove a line?
That doesn't sound that good.
I am hacking my way around it by using a filter..
Can I put a try/except clause in my lambda function? Maybe I should just
try that out.
But thanks for the suggestion.
Also, can I run scripts against Spark
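A hedged Scala analogue of the filter workaround described above: keep only
the lines that parse, instead of editing the huge input file (lines and the
parse expression are placeholders).

import scala.util.Try

val clean = lines.filter(line => Try(line.split("\t")(1).toInt).isSuccess)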
Can someone point me to a simple, short code example of creating a basic
Actor that gets a context and runs an operation such as .textFile.count?
I am trying to figure out how to create just a basic actor that gets a
message like this:
case class Msg(filename:String, ctx: SparkContext)
and
I am a newbie!! I am running Spark 0.9.0 in standalone mode on my Mac. The
master and worker run on the same machine. Both of them start up fine (at
least that is what I see in the log).
*Upon start-up master log is:*
14/02/26 15:38:08 INFO Slf4jLogger: Slf4jLogger started
14/02/26 15:38:08
Hello Andy,
This is a problem we have seen in using the CQL Java driver under heavy
read loads, where it is using NIO and waiting on many pending responses,
which causes too many open sockets and hence too many open files. Are you by
any chance using async queries?
I am the maintainer of
Agree that filter is perhaps unintuitive, though the Scala collections API has
filter and filterNot, which together provide context that makes it more
intuitive.
And yes, the change could be made via added methods that don't break the existing API.
Still, overall I would be -1 on this unless a
On Fri, Feb 7, 2014 at 7:48 AM, Aaron Davidson ilike...@gmail.com wrote:
Sorry for delay, by long-running I just meant if you were running an
iterative algorithm that was slowing down over time. We have observed this
in the spark-perf benchmark; as file system state builds up, the job can
Yes! Spark streaming programs are just like any spark program and so any
ec2 cluster setup using the spark-ec2 scripts can be used to run spark
streaming programs as well.
On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia buendia...@gmail.com wrote:
Hi,
Does the ec2 support for spark 0.9
On Thu, Feb 27, 2014 at 6:17 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Yes! Spark streaming programs are just like any spark program and so any
ec2 cluster setup using the spark-ec2 scripts can be used to run spark
streaming programs as well.
Great. Does it come with any input
sortByKey would be better, I think, as I am not sure groupByKey will sort the
keyspace globally.
I would say you should:
take input (K, V)
groupByKey: (K, V) => (K, Seq(V..))
partitionBy the default partitioner (hash)
sortByKey: (K, Seq(V..))
Output this; the only thing is, if you need (K, V) pairs you will have to
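A minimal sketch of the pipeline described above (sc and the sample data are
assumptions):

val input = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))
val grouped = input.groupByKey() // (K, Seq(V..)) per key
val sorted = grouped.sortByKey() // globally sorted by K across partitions
sorted.saveAsTextFile("out")     // placeholder path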
Just as a second note, I am able to build the source in the official 0.9.0
release
(http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz).
The provided Spark EC2 scripts
https://spark.incubator.apache.org/docs/0.9.0/ec2-scripts.html and the
default AMI ship with Python 2.6.8.
I would like to use Python 2.7.5 or later. I believe that among the 2.x
versions, 2.7 is the most popular.
What's the easiest way to get my Spark cluster on Python
Yes, the default spark EC2 cluster runs the standalone deploy mode. Since
Spark 0.9, the standalone deploy mode allows you to launch the driver app
within the cluster itself and automatically restart it if it fails. You can
read about launching your app inside the cluster
After successfully building the official 0.9.0 release I attempted to build
off of the github code again and was successfully able to do so. Not really
sure what happened, but it works now.
Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using
spark streaming in production, the author seems to have missed the topic of
how to manage cloud instances.
On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia buendia...@gmail.com wrote:
What's the updated way of deploying
Spark 0.9 uses protobuf 2.5.0.
Hadoop 2.2 uses protobuf 2.5.0.
protobuf 2.5.0 can read messages serialized with protobuf 2.4.1.
So there is no reason why you can't read messages from Hadoop 2.2
with protobuf 2.5.0; probably you somehow have 2.4.1 in your classpath. Of
course that's very bad,
In that same pom
<profile>
  <id>yarn</id>
  <properties>
    <hadoop.major.version>2</hadoop.major.version>
    <hadoop.version>2.2.0</hadoop.version>
    <protobuf.version>2.5.0</protobuf.version>
  </properties>
  <modules>
    <module>yarn</module>
  </modules>
</profile>
Hi,
Running:
./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher
tcp://127.0.1.1:1234 foo
causes over 100% CPU usage on OS X. Given that it's just a simple ZMQ
publisher, this shouldn't be expected. Is there something wrong with that
example?
Yeah, the Spark on EMR bootstrap scripts referenced here
http://aws.amazon.com/articles/4926593393724923 need some
polishing. I had a lot of trouble just getting through that
tutorial. And yes, the version of Spark they're using is 0.8.1.
On Fri, Feb 28, 2014 at 2:39 PM, Aureliano Buendia
I'm trying to run a simple execution of the SparkPi example. I started the
master and one worker, then executed the job on my local cluster, but end
up getting a sequence of errors all ending with
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection
I am using Spark 0.9
I have an array of tuples, and I want to sort these tuples using the sortByKey
API as follows in Spark shell:
val A: Array[(String, String)] = Array(("1", "One"), ("9", "Nine"), ("3",
"three"), ("5", "five"), ("4", "four"))
val P = sc.parallelize(A)
// MyComparator is an example, maybe I have
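A hedged sketch of plugging in a custom ordering: in newer Spark versions
sortByKey resolves an implicit Ordering on the key type, so MyComparator could
become an Ordering like this (here sorting the string keys numerically):

implicit val numeric: Ordering[String] = Ordering.by(_.toInt)
P.sortByKey().collect() // Array(("1","One"), ("3","three"), ("4","four"), ("5","five"), ("9","Nine"))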
Does this suggest value in an integration of GraphX and neo4j?
Sent from my Verizon Wireless Phone
- Reply message -
From: Matei Zaharia matei.zaha...@gmail.com
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: Incrementally add/remove vertices in GraphX
Date: Sun,
Nope, nested RDDs aren't supported:
https://groups.google.com/d/msg/spark-users/_Efj40upvx4/DbHCixW7W7kJ
https://groups.google.com/d/msg/spark-users/KC1UJEmUeg8/N_qkTJ3nnxMJ
https://groups.google.com/d/msg/spark-users/rkVPXAiCiBk/CORV5jyeZpAJ
On Sun, Mar 2, 2014 at 5:37 PM, Cosmin Radoi
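Regarding the nested-RDD limitation above, a hedged sketch of a common
workaround: collect the small dataset and broadcast it, instead of referencing
one RDD inside another's closure (the data here is made up).

val small = sc.parallelize(1 to 100).collect().toSet
val bc = sc.broadcast(small)
val big = sc.parallelize(1 to 1000000)
big.filter(x => bc.value.contains(x % 100)).count() // closure uses bc.value, not an RDD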
I have an RDD of (K, Array[V]) pairs.
For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2)))
How can I do a groupByKey such that I get back an RDD of (K,
Array[V]) pairs, with the arrays merged per key?
Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
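A minimal sketch of one way to get that result: reduceByKey with array
concatenation merges the arrays per key directly (sc is assumed).

val rdd = sc.parallelize(Seq(
  ("key1", Array(1, 2, 3)), ("key2", Array(3, 2, 4)), ("key1", Array(4, 3, 2))))
val merged = rdd.reduceByKey(_ ++ _)
// merged: ("key1", Array(1,2,3,4,3,2)), ("key2", Array(3,2,4))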
Hi,
I am sorry for the beginner's question, but...
I have a Spark Java program which reads a file (c:\my-input.csv), processes it,
and writes an output file (my-output.csv).
Now I want to run it on Hadoop in a distributed environment.
1) Should my input file be one big file or separate smaller files?
2) if
Hi, I am a beginner too, but as I have learned, Hadoop works better with
big files, at least 64 MB, 128 MB, or even more. I think you need to
aggregate all the files into one new big file. Then you must copy it to HDFS
using this command:
hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE
Hadoop just
If you need quick responses, re-use your Spark context between queries and
cache RDDs in memory.
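A minimal sketch of that pattern (the path is a placeholder):

val data = sc.textFile("hdfs:///data/input").cache()
data.count() // first action reads from disk and populates the cache
data.count() // later queries against the same context hit memory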
On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote:
Thanks for the advice Mayur.
I thought I'd report back on the performance difference... Spark
standalone
mode has executors processing
Are you running in yarn-standalone mode or yarn-client mode? Also, which
YARN scheduler, and what NodeManager heartbeat?
On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote:
Thanks for the advice Mayur.
I thought I'd report back on the performance difference... Spark
polkosity, have you seen the job server that Ooyala open sourced? I think
it's very similar to what you're proposing with a REST API and re-using a
SparkContext.
https://github.com/apache/incubator-spark/pull/222
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server
On Mon, Mar
+1
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Mar 3, 2014 at 4:10 PM, polkosity polkos...@gmail.com wrote:
That's exciting! Will be looking into that, thanks Andrew.
Related topic, has anyone had any
Yes, Tachyon is in-memory serialized, which is not as fast as cached in
memory in Spark (deserialized). The difference really depends on your job
type.
On Mon, Mar 3, 2014 at 7:10 PM, polkosity polkos...@gmail.com wrote:
That's exciting! Will be looking into that, thanks Andrew.
Related
Vector is an enhanced Array[Double]. You can compare it like Array[Double].
E.g.,
scala> val v1 = Vector(1.0, 2.0)
v1: org.apache.spark.util.Vector = (1.0, 2.0)
scala> val v2 = Vector(1.0, 2.0)
v2: org.apache.spark.util.Vector = (1.0, 2.0)
scala> val exactResult =
Where on the filesystem does spark write the shuffle files?
From BlockManager code + ShuffleMapTask code, it writes under
spark.local.dir or java.io.tmpdir.
val diskBlockManager = new DiskBlockManager(shuffleBlockManager,
  conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")))
On Mon, Mar 3, 2014 at 10:45 PM, Usman Ghani us...@platfora.com
I've encountered similar problems.
Maybe you can try using the hostname or FQDN (rather than the IP address) of your node
for the master URI.
In my case, Akka picks the FQDN for the master URI, and the worker has to use exactly
the same string for the connection.
From: Benny Thompson [mailto:ben.d.tho...@gmail.com]
Hi Ognen,
See if this helps. I was working on this :
class MyClass[T](sc: SparkContext, flag1: Boolean, rdd: RDD[T], hdfsPath: String) extends Actor {
  def act() {
    if (flag1) this.process()
    else this.count
  }
  private def process() {
    println(sc.textFile(hdfsPath).count)
  }
  // presumably count prints the count of the rdd parameter (truncated in the original)
  private def count = println(rdd.count)
}
Hi,
Try to clean your temp dir, System.getProperty("java.io.tmpdir").
Also, can you paste a longer stack trace?
Thanks
Best Regards
On Tue, Mar 4, 2014 at 2:55 PM, goi cto goi@gmail.com wrote:
Hi,
I am running a Spark Java program on a local machine. When I try to write
the output to
Exception in thread "delete Spark temp dir C:\Users\..."
java.io.IOException: failed to delete: C:\Users\...\simple-project-1.0.jar
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:495)
at
org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:491)
I deleted my
Hello, I am using Spark with Scala and I am attempting to understand the
different filtering and mapping capabilities available. I haven't found an
example of the specific task I would like to do.
I am trying to read in a tab spaced text file and filter specific entries.
I would like this
Thanks Sean, I think that is doing what I needed. It was much simpler than
what I had been attempting.
Is it possible to do an OR-statement filter? So that, for example, column 2
can be filtered by A2 appearances and column 3 by A4?
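A hedged sketch of such an OR filter over a tab-separated file (the path is a
placeholder; column indices and the A2/A4 values are taken from the question):

val rows = sc.textFile("data.tsv").map(_.split("\t"))
val kept = rows.filter(cols => cols(1) == "A2" || cols(2) == "A4") // column 2 OR column 3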
Hi Ognen,
Any particular reason for choosing Scalatra over options like Play or Spray?
Is Scalatra much better at serving APIs, or is it due to its similarity with
Ruby's Sinatra?
Did you try the other options and then pick Scalatra?
Thanks.
Deb
On Tue, Mar 4, 2014 at 4:50 AM, Ognen Duzlevski
Thanks.
Does it make sense to add an ==/equals method for Vector with this (or the same)
behavior?
2014-03-04 6:00 GMT+02:00 Shixiong Zhu zsxw...@gmail.com:
Vector is an enhanced Array[Double]. You can compare it like
Array[Double]. E.g.,
scala> val v1 = Vector(1.0, 2.0)
v1:
Deb,
On 3/4/14, 9:02 AM, Debasish Das wrote:
Hi Ognen,
Any particular reason of choosing scalatra over options like play or
spray ?
Is scalatra much better in serving apis or is it due to similarity
with ruby's sinatra ?
Did you try the other options and then pick scalatra ?
Not really.
Hi Mayur,
I am using CDH4.6.0p0.26. And the latest Cloudera Spark parcel is Spark
0.9.0 CDH4.6.0p0.50.
As I mentioned, somehow, the Cloudera Spark version doesn't contain the
run-example shell scripts.. However, it is automatically configured and it
is pretty easy to set up across the cluster...
Hi there,
I tried the Kafka WordCount example; it works perfectly and the code is
pretty straightforward to understand.
Can anyone show me how to start my own Maven project with the
KafkaWordCount example with minimum effort?
1. What should the pom file look like (including the jar plugin?
Hi there,
I tried the SimpleApp WordCount example and it works perfectly in a local
environment. My code:
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "README.md"
    val conf = new SparkConf()
      .setMaster("zk://172.31.0.11:2181/mesos")
      .setAppName("Simple App")
Hi!
I created an EMR cluster with Spark and HBase according to
http://aws.amazon.com/articles/4926593393724923 with --hbase flag to
include HBase. Although spark and shark both work nicely with the provided
S3 examples, there is a problem with external tables pointing to the HBase
instance.
We
Hi TD,
I have seen in the web UI that the result for the stage has been zero, and
the GC Time field shows nothing.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n2306/CaptureStage.png
Hi Sean,
We're not using log4j, actually; we're trying to redirect all logging to
slf4j, which then uses logback as the logging implementation.
The fix you mentioned - am I right to assume it is not part of the latest
released Spark version (0.9.0)? If so, are there any workarounds or advice
on
Rob,
I have seen this too. I have 16 nodes in my spark cluster and for some
reason (after app failures) one of the workers will go offline. I will
ssh to the machine in question and find that the java process is running
but for some reason the master is not noticing this. I have not had the
Hi Christian,
The PYSPARK_PYTHON environment variable specifies the python executable to
use for pyspark. You can put the path to a virtualenv's python executable
and it will work fine. Remember you have to have the same installation at
the same path on each of your cluster nodes for pyspark to
Whoopdeedoo, after just waiting for like an hour (well, I was doing other
stuff) the process holding that address seems to have died automatically
and now I can start up pyspark without any warnings.
Would there be a faster way to go through this than just wait around for
the orphaned process to
Thanks Bryn.
On Wed, Mar 5, 2014 at 9:00 PM, Bryn Keller xol...@xoltar.org wrote:
Hi Christian,
The PYSPARK_PYTHON environment variable specifies the python executable to
use for pyspark. You can put the path to a virtualenv's python executable
and it will work fine. Remember you have to
Hi Patrick,
Thanks for the patch. I tried building a patched version
of spark-core_2.10-0.9.0-incubating.jar but the Maven build fails:
[ERROR]
/home/das/Work/thx/incubator-spark/core/src/main/scala/org/apache/spark/Logging.scala:22:
object impl is not a member of package org.slf4j
[ERROR]
The real question is why you want to run a Pig script using Spark.
Are you planning to use Spark as the underlying processing engine for Pig?
That's not simple.
Are you planning to feed Pig data to Spark for further processing? Then you
can write it to HDFS and trigger your Spark script.
rdd.pipe is
I also noticed that jobs (with a new JobGroupId) which I run after this,
and which use the same RDDs, get very confused. I see lots of cancelled stages
and retries that go on forever.
On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.com wrote:
I have a running job that I cancel while
How do you cancel the job? Which API do you use?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers ko...@tresata.com wrote:
i also noticed that jobs (with a new JobGroupId) which
Hi,
Quick question: do I need to compile Spark against exactly the same version of
the Mesos library? Currently Spark depends on 0.13.
The problem I am facing is the following: I am running the MLlib example with SVM,
and it works nicely when I use coarse-grained mode; however, when running
fine-grained mode on
One issue is that job cancellation is posted on the event loop, so it's possible
that subsequent jobs submitted to the job queue may beat the job cancellation
event; hence the job cancellation event may end up closing them too.
So there's definitely a race condition you are risking, even if you are not running
into it.
Got it. Seems like I'd better stay away from this feature for now..
On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
One issue is that job cancellation is posted on eventloop. So its possible
that subsequent jobs submitted to job queue may beat the job cancellation
Hey,
Maybe I don't understand the slf4j model completely, but I think you
need to add a concrete implementation of a logger. So in your case
you'd use the logback-classic binding in place of the log4j binding at
compile time:
http://mvnrepository.com/artifact/ch.qos.logback/logback-classic/1.1.1
-
Hi,
I've tried to enable debug logging, but can't figure out what might be
going wrong. Can anyone assist in deciphering the log?
The log of the startup and run attempts is at http://pastebin.com/XyeY92VF
This uses SparkILoop, DEBUG level logging, and the settings.debug.value = true
option.
Line 323:
We are trying to use Kryo serialization, but with Kryo serialization ON, the
memory consumption does not change. We have tried this on multiple sets of
data.
We have also checked the logs of Kryo serialization and have confirmed that
Kryo is being used.
Can somebody please help us with this?
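A hedged sketch of one thing to check: Kryo only shrinks cached data when the
RDD is persisted in serialized form; plain MEMORY_ONLY caches deserialized
Java objects regardless of the serializer setting (rdd is a placeholder).

import org.apache.spark.storage.StorageLevel

System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)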
The
So this happened again today. As I noted before, the Spark shell starts up
fine after I reconnect to the cluster, but this time around I tried opening
a file and doing some processing. I get this message over and over (and
can't do anything):
14/03/06 15:43:09 WARN scheduler.TaskSchedulerImpl:
Hi,
I've successfully built 0.9.0-incubating on Solaris using sbt, following
the instructions at http://spark.incubator.apache.org/docs/latest/ and
it seems to work OK. However, when I start it up I get an error about
missing Hadoop native libraries. I can't find any mention of how to
build
Thanks Mayur. I don't have a clear idea of how pipe works and wanted to
understand more about it. When do we use pipe() and how does it work? Can
you please share some sample code if you have any (even pseudo-code is fine)?
It will really help.
Regards,
Suman Bharadwaj S
On Thu, Mar 6, 2014 at 3:46 AM,
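A minimal sketch of pipe(): each element of the RDD is written as a line to
the external command's stdin, and each line of the command's stdout becomes an
element of the resulting RDD (the command here is arbitrary).

val nums = sc.parallelize(1 to 10)
val piped = nums.pipe("grep -v 0") // drops lines containing "0", i.e. 10
piped.collect()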
Is it an error, or just a warning? In any case, you need to get those libraries
from a build of Hadoop for your platform. Then add them to the
SPARK_LIBRARY_PATH environment variable in conf/spark-env.sh, or to your
-Djava.library.path if launching an application separately.
These libraries
Hi,
I am trying to set up Spark on Windows for a development environment. I get the
following error when I run sbt. Please help me resolve this issue. I work
for Verizon and am on my company network, and can't access the internet without
a proxy.
C:\Users>sbt
Getting org.fusesource.jansi jansi 1.11 ...
export JAVA_OPTS=$JAVA_OPTS -Dhttp.proxyHost=yourserver
-Dhttp.proxyPort=8080 -Dhttp.proxyUser=username
-Dhttp.proxyPassword=password
Also, please use a separate thread for different questions.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Thanks Alan.
I am very new to Spark. I am trying to set up a Spark development environment on
Windows. I added the export mentioned below as a set in the sbt.bat file and tried it,
but it was not working. Where will I see .gitconfig?
set JAVA_OPTS=%JAVA_OPTS% -Dhttp.proxyHost=myservername -Dhttp.proxyPort=8080
Dana,
When you run multiple applications under Spark, and each application
takes up the entire cluster resources, it is expected that one will block
the other completely; thus you're seeing the wall times add together
sequentially. In addition there is some overhead associated with
Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on
Spark at https://github.com/dvryaboy/pig and am not sure if it is still active.
Can someone please let me know the status of Spork, or any other effort that
will let us run Pig on Spark? We can
I had asked a similar question on the dev mailing list a while back (Jan 22nd).
See the archives:
http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look
for spork.
Basically Matei said:
Yup, that was it, though I believe people at Twitter picked it up again
recently.
There is some work to make this work on yarn at
https://github.com/aniket486/pig. (So, compile pig with ant
-Dhadoopversion=23)
You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
find out what sort of env variables you need (sorry, I haven't been able to
clean this up-
On 06/03/2014 18:55, Matei Zaharia wrote:
For the native libraries, you can use an existing Hadoop build and
just put them on the path. For linking to Hadoop, Spark grabs it
through Maven, but you can do mvn install locally on your version
of Hadoop to install it to your local Maven cache, and
Hi Aniket, Many thanks! I will check this out.
Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket...@gmail.com
To: user@spark.apache.org; tgraves...@yahoo.com
There is some work to make this work on yarn at
https://github.com/aniket486/pig. (So, compile pig with ant
Can you see the Spark web UI? Is it running? (It would run on
masterurl:8080.)
If so, what is the master URL shown there?
MASTER=spark://URL:PORT ./bin/spark-shell
should work.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On
I see the same error. I am trying a standalone example integrated into a Play
Framework v2.2.2 application. The error occurs when I try to create a Spark
Streaming Context. Compilation succeeds, so I am guessing it has to do with
the version of Akka getting picked up at runtime.
Are you launching your application using the scala or java command? The scala
command brings in a version of Akka that we have found to cause conflicts
with Spark's version of Akka. So it's best to launch using java.
TD
On Thu, Mar 6, 2014 at 3:45 PM, Deepak Nulu deepakn...@gmail.com wrote:
I see the
I was just able to fix this in my environment.
By looking at the repository/cache in my Play Framework installation, I was
able to determine that spark-0.9.0-incubating uses Akka version 2.2.3.
Similarly, looking at repository/local revealed that Play Framework 2.2.2
ships with Akka version
The difference between your two jobs is that take() is optimized and
only runs on the machine where you are using the shell, whereas
sortByKey requires using many machines. It seems like maybe python
didn't get upgraded correctly on one of the slaves. I would look in
the /root/spark/work/ folder
I don't have an Eclipse setup, so I am not sure what is going on here. I would
try to use Maven on the command line with a pom to see if this compiles.
Also, try to clean up your system Maven cache. Who knows if it has pulled in
a wrong version of Kafka 0.8 and been using it all the time. Blowing away the
Many thanks for the guidance.
2014-03-06 23:39 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com:
Hi qingyang,
1. You do not need to install Shark on every node.
2. Not really sure; it's just a warning, so I'd see if it works despite it.
3. You need to provide the actual HDFS path, e.g.
We're not using Ooyala's job server. We are holding the Spark context for
reuse within our own REST server (with a service to run each job).
Our low-latency job now reads all its data from a memory-cached RDD, instead
of from an HDFS seq file (upstream jobs cache resultant RDDs for downstream
jobs
Hello,
What is the general approach people take when trying to do analysis
across multiple large files where the data to be extracted from a
successive file depends on the data extracted from a previous file or
set of files?
For example:
I have the following: a group of HDFS files each
Would you be the best person in the world to share some code? It's a pretty
common problem.
On Mar 6, 2014 6:36 PM, polkosity polkos...@gmail.com wrote:
We're not using Ooyala's job server. We are holding the spark context for
reuse within our own REST server (with a service to run each job).