Hi Prabeesh,
Do an `export _JAVA_OPTIONS=-Xmx10g` before starting Shark. You can also run
`ps aux | grep shark` to see how much memory it has been allocated; it is
usually 512 MB by default, in which case increase the limit.
Thanks
Best Regards
On Fri, May 23, 2014 at 10:22 AM, prabeesh k
We have an internally patched version of the Spark web UI which exports
application-related data as JSON. For our specific application we use
monitoring systems, as well as an alternate UI, on top of that JSON data.
We found it much cleaner. I can provide a 0.9.1 version.
I will submit it as a pull request soon.
Mayur
I am not sure if the EC2 script was updated for R3; R3 doesn't provide a
formatted instance store and also requires a newer version of the AMI.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, May 23, 2014
Well, it's hard to use text data as the time of input.
But if you are adamant, here's what you would do:
have a DStream that watches a folder using fileStream/textFileStream,
then have another process (Spark Streaming or cron) read through the files
you receive and push them into the folder in order.
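A minimal sketch of the DStream side (paths, app name, and batch interval are illustrative, not from the original question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Watch a landing folder for new text files. A separate process
// (Spark Streaming or cron) moves incoming files into this folder
// in the desired order, so each batch sees them in sequence.
val conf = new SparkConf().setAppName("ordered-file-stream")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.textFileStream("hdfs:///data/landing")
lines.foreachRDD(rdd => println(s"Lines in this batch: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()
```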
How many cores do you see on your Spark master (port 8080)?
By default a Spark application should take all the cores when you launch it,
unless you have set the max cores configuration (spark.cores.max).
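If you do want to cap an application, a minimal sketch (the master URL and core count are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Caps this application at 4 cores on a standalone cluster.
// Without spark.cores.max, the application takes all available cores.
val conf = new SparkConf()
  .setMaster("spark://master:7077") // illustrative standalone master URL
  .setAppName("core-limit-example")
  .set("spark.cores.max", "4")
val sc = new SparkContext(conf)
```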
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Also, I am unsure whether Spark on HBase leverages locality. When you cache and
process data, do you see NODE_LOCAL tasks in the process list?
Spark on HDFS leverages locality quite well and can really boost performance
by 3-4x in my experience.
If you are loading all your data from HBase to Spark then you are
You might use bin/shark-withdebug to find the exact cause of the failure.
That said, the easiest way to get the cluster running is to remove the
dysfunctional machine from the Spark cluster (remove it from the slaves file).
Hope that helps.
On Thu, May 22, 2014 at 9:04 PM, Yana Kadiyska
I have been analyzing Storm performance and there's no significant overhead
added to the processing nodes. I'm interested in those results over Spark
as well.
Thanks in advance,
Otávio Carvalho.
Undergrad. CompSci Student at UFRGS
Porto Alegre, Brazil.
2014-05-20 18:46 GMT-03:00
Hi,
I think I found the problem.
In SparkFlumeEvent, the readExternal method uses in.read(bodyBuff), which reads
only the first 1020 bytes and no more. The code should make sure to read
everything.
The following change will fix the problem:
in.read(bodyBuff)
to:
in.readFully(bodyBuff)
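For context, a minimal sketch of the difference (the helper below is mine, not the actual SparkFlumeEvent code; java.io.ObjectInput is the type readExternal receives):

```scala
import java.io.ObjectInput

// in.read(bodyBuff) may return after filling only part of the buffer
// (e.g. the first 1020 bytes), silently dropping the rest of the body.
// in.readFully(bodyBuff) keeps reading until the whole buffer is filled
// (or throws EOFException), which is what deserialization needs here.
def readBody(in: ObjectInput, bodyLength: Int): Array[Byte] = {
  val bodyBuff = new Array[Byte](bodyLength)
  in.readFully(bodyBuff)
  bodyBuff
}
```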
I attached a
Hi TD,
I use 0.9.1. Thanks for letting me know. This issue drove me up the wall. I
even made a method to close all that I could think of:
def stopSpark(ssc: StreamingContext) = {
  ssc.sparkContext.cleanup(500)
  ssc.sparkContext.clearFiles()
  ssc.sparkContext.clearJars()
}
For some reason the patch did not make it.
Trying via email:
/D
On May 23, 2014, at 9:52 AM, lemieud david.lemi...@radialpoint.com wrote:
Hi,
I think I found the problem.
In SparkFlumeEvent, the readExternal method uses in.read(bodyBuff), which reads
only the first 1020 bytes and no more. The
In trying to sort some largish datasets, we came across the
spark.shuffle.consolidateFiles property, and I found in the source code
that it is set, by default, to false, with a note to default it to true
when the feature is stable.
Does anyone know what is unstable about this? If we set it true,
That would be great, Mayur, thanks!
Anyhow, to be more specific, my question really was the following:
Is there any way to link events in the SparkListener to an action triggered in
your code?
Cheers
Pierre Borckmans
Software team
RealImpact Analytics | Brussels Office
Hi Nathan,
There's some explanation in the Spark configuration section:
```
If set to true, consolidates intermediate files created during a shuffle.
Creating fewer files can improve filesystem performance for shuffles with
large numbers of reduce tasks. It is recommended to set this to true
```
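If you do decide to try it, a minimal sketch of turning the flag on (only the property name comes from the docs above; the app name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enables shuffle file consolidation for this application.
val conf = new SparkConf()
  .setAppName("shuffle-consolidation-test")
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)
```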
Sounds like just what we need. For Hadoop we have a progress bar to show the
current status of the job. We would like to do the same for Spark. The YARN
client only shows the percentage progress and doesn't show any text info.
Does your PR work for YARN mode?
Chester
Sent from my iPhone
On May 23,
Mayur,
I'm interested in it as well. Can you send it to me?
Cheers,
Otávio Carvalho.
Undergrad. Student at Federal University of Rio Grande do Sul
Porto Alegre, Brazil.
2014-05-23 11:00 GMT-03:00 Pierre Borckmans
pierre.borckm...@realimpactanalytics.com:
That would be great, Mayur, thanks!
Hi everyone,
I've also been interested in better understanding what ports are used where
and the direction the network connections go. I've observed a running
cluster and read through code, and came up with the below documentation
addition.
https://github.com/apache/spark/pull/856
Scott and
I’ve been looking at how this is implemented in the UI:
https://github.com/apache/spark/blob/branch-0.9/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
1/ it’s easy to get the RDD name at the Stage events level
2/ the tricky part is that at the task level, we cannot link
Hi Jamal,
I don't believe there are pre-written algorithms for cosine similarity or
Pearson correlation in PySpark that you can re-use. If you end up writing
your own implementation of the algorithm though, the project would
definitely appreciate if you shared that code back with the project for
Hi Pierre,
I asked a similar question on this list about 6 weeks ago. Here is one
answer
http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E
The part of the answer that is of particular note:
In the upcoming release of
Thanks Philip,
I don’t want to go the JobLogger way (too hacky ;) )
In version 1.0, if I'm not mistaken, you can even do what I'm asking for, since
they removed the “private” for TaskInfo and such and replaced it with the
“@DeveloperApi” annotation.
I was looking for a simple way to do this
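For what it's worth, a minimal sketch of a listener against the 1.0-style API (exact field names differ slightly across versions; the class name and println are mine):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Logs each completed stage. In 1.0 the StageInfo fields (stageId, name, ...)
// are exposed under @DeveloperApi, so this compiles against the public API.
class StageLoggingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"Stage ${info.stageId} (${info.name}) completed")
  }
}

// Usage: sc.addSparkListener(new StageLoggingListener)
```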
Hi,
I have looked at the code around UGI in Spark. If Spark interacts with a
Kerberos-secured HDFS, Spark acquires a delegation token on the scheduler side
and stores it as a credential in the UGI; the credential is then transferred to
the Spark executors so that they can authenticate to HDFS. My question is
Do you need cosine distance and correlation between vectors or between
variables (elements of vector)? It would be helpful if you could tell us
details of your task.
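In the meantime, a minimal sketch of cosine similarity between two dense vectors (plain Scala, no MLlib; the function name and Array[Double] representation are my own choice):

```scala
// cos(a, b) = (a . b) / (||a|| * ||b||)
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
```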
On Thu, May 22, 2014 at 5:49 PM, jamal sasha jamalsha...@gmail.com wrote:
Hi,
I have bunch of vectors like
Created https://issues.apache.org/jira/browse/SPARK-1916
I'll submit a pull request soon.
/D
On May 23, 2014, at 9:56 AM, David Lemieux david.lemi...@radialpoint.com
wrote:
For some reason the patch did not make it.
Trying via email:
/D
On May 23, 2014, at 9:52 AM, lemieud
Hi Shrikar,
How did you build Spark 1.0.0-SNAPSHOT on your machine? My
understanding is that `sbt publishLocal` is not enough and you really
need `sbt assembly` instead. Give it a try and report back.
As to your build.sbt, upgrade Scala to 2.10.4 and org.apache.spark
%% spark-streaming %
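For reference, a minimal build.sbt along those lines (the Spark version is an assumption based on the snapshot being discussed, and presumes it was published locally or is otherwise resolvable):

```scala
name := "streaming-app"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0-SNAPSHOT"
```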
Hi all,
Configuration: standalone 0.9.1-cdh4 cluster, 7 workers per node, 32 GB per
worker.
I'm running a job on a Spark cluster and running into some strange
behavior. After a while, the Akka frame sizes exceed 10 MB, and then the
whole job seizes up. I set spark.akka.frameSize to 128 in the
Hi,
I get the following exception when using Spark to run various programs.
java.io.InvalidClassException: org.apache.spark.SerializableWritable;
local class incompatible: stream classdesc serialVersionUID =
6301214776158303468, local class serialVersionUID = -7785455416944904980
at
Michael,
What an excellent example! Thank you for posting such a detailed
explanation and sample code. So I see what you’re doing and it looks like
it works very well as long as your source data has a well-known and fixed
structure.
I’m looking for a pattern that can be used to expose JSON data
Still the same error, no change.
Thanks,
Shrikar
On Fri, May 23, 2014 at 2:38 PM, Jacek Laskowski ja...@japila.pl wrote:
Hi Shrikar,
How did you build Spark 1.0.0-SNAPSHOT on your machine? My
understanding is that `sbt publishLocal` is not enough and you really
need `sbt assembly` instead.
I'm running into an authentication issue when running against YARN. I am
using my own method to create the JAR assembly file and most likely I am
missing something. This method used to work, but I recently ran into this
problem. Here is the error from the YARN server:
14/05/23 19:03:02 INFO
We are using a back version of Spark (0.8.1) that depends on a customized
version of Kafka, 0.7.2-spark. Where are the sources for it, either on
svn/GitHub or simply as a sources jar?
For reference, here is the Maven repo location for the binaries:
For those who were not able to attend the last Seattle Spark Meetup, we had a
great session by Claudiu Barbura on xPatterns on Spark, Shark, Tachyon, and
Mesos - you can find the slides at:
http://www.slideshare.net/ClaudiuBarbura/seattle-spark-meetup-may-2014.
As well, check out the next
Hi Spark Users,
I'm trying to read and process an Avro dataset using the interactive spark
scala shell. When my pipeline executes I get the ClassNotFoundException
pasted at the end of this email.
I'm trying to use the Generic Avro API (not the Specific API).
Here's a gist of the commands I'm
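For anyone attempting the same thing, a minimal sketch of reading Avro data as GenericRecords in the shell (the path is illustrative and not taken from the gist above; it assumes avro-mapred is on the classpath):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Read Avro files with the Generic API (no generated classes needed).
val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]]("hdfs:///data/events.avro")
  .map { case (key, _) => key.datum() }
```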
On Fri, May 16, 2014 at 11:19 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Hi
I have sent a pull request https://github.com/apache/spark/pull/681 you
can verify it and add it :)
Matei,
Would you please verify this pull request for Jenkins? It has been a couple
of weeks.
Thanks
Best