The memory requirements seem to grow rapidly when using a higher rank... I
am unable to get over 20 without running out of memory. Is this expected?
Thanks, Antony.
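(Assuming this refers to MLlib's ALS: a minimal sketch of the knob in question. The factor matrices are roughly (numUsers + numItems) vectors of length rank, so memory grows with the rank argument; the input parsing below is hypothetical.)

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical "user,item,rating" input; memory grows with the rank argument.
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(u, i, r) = line.split(',')
  Rating(u.toInt, i.toInt, r.toDouble)
}
val model = ALS.train(ratings, 20 /* rank */, 10 /* iterations */, 0.01 /* lambda */)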
Can someone explain the difference between a parameter server and Spark?
There's already an issue on this topic
https://issues.apache.org/jira/browse/SPARK-4590
Another example of DL on Spark, essentially based on Downpour SGD:
http://deepdist.com
On Sat, Jan 10, 2015 at 2:27 AM, Peng
The actual case looks like this:
* Spark 1.1.0 on YARN (CDH 5.2.1)
* ~8-10 executors, 36GB physical RAM per host
* input RDD is roughly 3GB containing ~150-200M items (and this RDD is made persistent using .cache())
* using pyspark
YARN is configured with the limit yarn.nodemanager.resource.memory-mb of
Hi Michael,
I see you capped the cores at 60.
I wonder what settings you used for the standalone mode that you compared
with?
I can try to run an MLlib workload on both to compare.
Tim
On Jan 9, 2015, at 6:42 AM, Michael V Le m...@us.ibm.com wrote:
Hi Tim,
Thanks for your response.
It worked, thanks.
This doc page
https://spark.apache.org/docs/1.2.0/sql-programming-guide.html
recommends using spark.sql.parquet.compression.codec to set the compression codec,
and I thought this setting would be forwarded to the Hive context given
that HiveContext extends SQLContext, but it was
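(For reference, a minimal sketch of the setting being discussed, assuming a HiveContext instance you create yourself; the codec value "snappy" is just an example.)

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// setConf on the HiveContext itself is one way to pass the Parquet codec setting,
// since HiveContext extends SQLContext.
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")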
Thanks, Gaurav and Corey,
Probably I didn’t make myself clear. I am looking for a best practice for Spark
similar to Shiny for R, where analysis/visualization results can be easily
published to a web server and shown in a web browser. Or is there any dashboard for Spark?
Best regards,
Cui Lin
From: gtinside
You may try changing the schedulingMode to FAIR; the default is FIFO. Take
a look at this page:
https://spark.apache.org/docs/1.1.0/job-scheduling.html#scheduling-within-an-application
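(A minimal sketch of switching the scheduling mode, assuming you build the SparkConf yourself; any per-pool configuration would go in a separate fairscheduler.xml.)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-example")   // hypothetical app name
  .set("spark.scheduler.mode", "FAIR")     // default is FIFO
val sc = new SparkContext(conf)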
On Sat, Jan 10, 2015 at 10:24 AM, YaoPau jonrgr...@gmail.com wrote:
I'm looking for ways to reduce the
Update: I resolved this by increasing the granularity of RDD persistence for
complex map-reduce operations, such as the one whose reduceByKey stage was
failing.
Coolio.
Lucio
I’m curious what the status is of implementing Hive analytic functions in
Spark.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Many of these seem to be missing. I’m assuming they’re not implemented yet?
Is there an ETA on them?
Or am I the first to bring this
I'm looking for ways to reduce the runtime of my Spark job. My code is a
single file of Scala code written in this order:
(1) val lines = import the full dataset using sc.textFile
(2) val ABonly = parse out all rows that are not of type A or B
(3) val processA = process only the A rows from
Hey,
I am having a similar issue; did you manage to find a solution yet? Please
check my post below for reference:
http://apache-spark-user-list.1001560.n3.nabble.com/IOError-Errno-2-No-such-file-or-directory-tmp-spark-9e23f17e-2e23-4c26-9621-3cb4d8b832da-tmp3i3xno-td21076.html
Thank you,
As Jerry said, this is not related to shuffle file consolidation.
The unique thing about this problem is that it's failing to find a file
while trying to _write_ to it, in append mode. The simplest explanation for
this would be that the file is deleted in between some check for existence
and
From your pseudocode, it would be sequential and done twice:
1+2+3,
then 1+2+4.
If you do a .cache() in step 2, then you would have 1+2+3, then 4.
I ran several steps in parallel from the same program, but never using the
same source RDD, so I do not know the limitations there. I simply started
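(A minimal sketch of the caching pattern described above; the input path and the A/B filters are hypothetical stand-ins for the poster's steps.)

val lines  = sc.textFile("hdfs:///data/input")                                   // step (1), hypothetical path
val abOnly = lines.filter(l => l.startsWith("A") || l.startsWith("B")).cache()   // step (2), cached once
val countA = abOnly.filter(_.startsWith("A")).count()   // step (3) computes (1)+(2)+(3)
val countB = abOnly.filter(_.startsWith("B")).count()   // step (4) reuses the cached (2) instead of recomputing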
Hi Xiangrui,
Thanks a lot for your answer.
So I fixed my Julia code and also calculated the PCA using R.
R Code:
-
data <- read.csv('/home/upul/Desktop/iris.csv');
X <- data[,1:4]
pca <- prcomp(X, center = TRUE, scale = FALSE)
transformed <- predict(pca, newdata = X)
Julia Code (Fixed)
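(For comparison, a hedged sketch of the same PCA with Spark MLlib's RowMatrix; the parsing assumes a headerless, purely numeric CSV, and note that multiply projects the uncentered rows, so the scores can differ from R's predict(pca, ...) by a constant offset.)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.textFile("/home/upul/Desktop/iris.csv")   // assumes no header row
  .map(line => Vectors.dense(line.split(',').take(4).map(_.toDouble)))
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(4)   // top 4 principal components
val scores = mat.multiply(pc)                // project the rows onto the components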
Guys,
registerTempTable("Employees")
gives me the error
Exception in thread "main" scala.ScalaReflectionException: class
org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial
classloader with boot classpath
Just make sure you are not connecting to the old RPC port (9160); the new
binary port runs on 9042.
What is the rpc_address listed in your cassandra.yaml? Also make sure you
have start_native_transport: true in the yaml file.
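(A hedged sketch of pointing the DataStax spark-cassandra-connector at the native port; the host address is hypothetical, and the exact port property name has varied between connector versions, so treat it as an assumption and check your connector's docs.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.5")   // hypothetical node address
  .set("spark.cassandra.connection.port", "9042")       // native/binary transport, not Thrift 9160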
Thanks
Best Regards
On Sat, Jan 10, 2015 at 8:44 AM, Ankur Srivastava
Hi Kevin,
I'm currently working on implementing windowing. If you'd like to see
something that's not covered by a JIRA, please file one!
best,
wb
- Original Message -
From: Kevin Burton bur...@spinn3r.com
To: user@spark.apache.org
Sent: Saturday, January 10, 2015 12:12:38 PM
I've got a data set of activity by user. For each user, I'd like to train a
decision tree model. I currently have the feature-creation step implemented
in Spark and would naturally like to use MLlib's decision tree model.
However, it looks like the decision tree model expects the whole RDD and
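(For reference, a hedged sketch of the MLlib 1.2-era call being referred to: trainClassifier takes one whole RDD[LabeledPoint], which is why a per-user model does not drop in directly; the input file is hypothetical.)

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")   // hypothetical input
val model = DecisionTree.trainClassifier(data, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
  maxDepth = 5, maxBins = 32)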
Hi,
How can I measure the time an RDD computation takes to execute?
In particular, I want to do it for the following piece of code:
«
val ssc = new StreamingContext(sparkConf, Seconds(5))
val distFile = ssc.textFileStream("/home/myuser/twitter-dump")
val words = distFile.flatMap(_.split(" ")).filter(_.length
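(One hedged way to time it: transformations are lazy, so measure around an action; here count() is wrapped per micro-batch via foreachRDD. The variable names follow the snippet above.)

words.foreachRDD { rdd =>
  val start = System.nanoTime()
  val n = rdd.count()                                // forces this batch to be computed
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(s"batch of $n records took $elapsedMs ms")
}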
Which Spark version is running on the EC2 cluster? From the build
file https://github.com/knoldus/Play-Spark-Scala/blob/master/build.sbt of
your Play application, it seems to use Spark 1.0.1.
Thanks
Best Regards
On Fri, Jan 9, 2015 at 7:17 PM, Eduardo Cusa
-dev, +user
http://spark.apache.org/docs/latest/job-scheduling.html
On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta alexbare...@gmail.com
wrote:
Is it possible to specify a priority level for a job, such that the active
jobs might be scheduled in order of priority?
Alex
Thanks, Cheng and Michael! Makes sense. Appreciate the tips!
Idiomatic Scala isn't performant. I’ll definitely start using while loops or
tail-recursive methods. I have noticed this in the Spark code base.
I might try turning off columnar compression (via
http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
Setting a high weight such as 1000 also makes it possible to implement
*priority* between pools—in essence, the weight-1000 pool will always get
to launch tasks first whenever it has jobs active.
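(A hedged sketch of the mechanism the docs describe: jobs submitted from a thread run in whichever pool is set as a local property; the pool name here is hypothetical, and its weight would be defined in fairscheduler.xml.)

// Route subsequent jobs from this thread to a high-weight pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
// ... submit the important job ...
sc.setLocalProperty("spark.scheduler.pool", null)   // back to the default pool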
On Sat, Jan 10,
Cody,
Maybe I'm not getting this, but it doesn't look like this page is
describing a priority queue scheduling policy. What this section discusses
is how resources are shared between queues. A weight-1000 pool will get
1000 times more resources allocated to it than a weight-1 pool. Great,
but
Hi Rajesh,
There's a great web-based notebook visualization tool called Zeppelin. (And
it's open source!)
Check it out:
http://zeppelin.incubator.apache.org
Regards,
Kevin
Mark,
Thanks, but I don't see how this documentation solves my problem. You are
referring me to documentation on fair scheduling, whereas I am asking
about as unfair a scheduling policy as can be: a priority queue.
Alex
On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra m...@clearstorydata.com
There is a path /tmp/spark-jobserver/file where all the JARs are kept by
default. Probably deleting them from there should work.
On 11 Jan 2015 12:51, Sasi [via Apache Spark User List]
ml-node+s1001560n21081...@n3.nabble.com wrote:
How to remove submitted JARs from spark-jobserver?
Hi, I am new to MLlib in Spark. Can the DecisionTree model in MLlib deal
with missing values? If so, what data structure should I use for the input?
Moreover, my data has categorical features, but LabeledPoint requires
Double values; what can I do in this case?
Thank you very much.
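(A hedged sketch of one common workaround for the categorical part: map each category to a Double index and declare the feature's arity via categoricalFeaturesInfo; rawRows and the column names are hypothetical.)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val colorIndex = Map("red" -> 0.0, "green" -> 1.0, "blue" -> 2.0)   // hypothetical categories
val points = rawRows.map { case (label, color, weight) =>
  LabeledPoint(label, Vectors.dense(colorIndex(color), weight))
}
// Feature 0 is categorical with 3 values; feature 1 stays continuous.
val model = DecisionTree.trainClassifier(points, numClasses = 2,
  categoricalFeaturesInfo = Map(0 -> 3), impurity = "gini",
  maxDepth = 5, maxBins = 32)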