I submit tasks in standalone mode, but I often get an error. The stack trace is:
2014-06-12 11:37:36,578 [INFO] [org.apache.spark.Logging$class]
[Method:logInfo] [Line:49] [Thread:spark-akka.actor.default-dispatcher-18]
- Executor updated: app-20140612092238-0007/0 is now FAILED (Command exited
with
If you are interested in openstack/swift integration with Spark, please
drop me a line. We are looking into improving the integration.
Thanks.
With yarn-client mode, I submit a job from the client to YARN, and my
spark-env.sh file is:
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_EXECUTOR_INSTANCES=4
SPARK_EXECUTOR_CORES=1
SPARK_EXECUTOR_MEMORY=1G
SPARK_DRIVER_MEMORY=2G
Hi all,
I want to perform text classification using Spark 1.0's Naïve Bayes. I am
looking for a way to convert text into sparse vectors with a TF-IDF weighting
scheme.
I found that the MLI library supports this, but it is only compatible with Spark 0.8.
What are all the options available to achieve text
You can create tf vectors and then use
RowMatrix.computeColumnSummaryStatistics to get df (numNonzeros). For the
tokenizer and stemmer, you can use scalanlp/chalk. Yes, it would be worth
having a simple interface for it. -Xiangrui
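A rough sketch of that approach from the shell (the tokenized input docs and the numFeatures hashing size are made up for illustration; only the RowMatrix/numNonzeros part comes from the suggestion above):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical tokenized documents
val docs = sc.parallelize(Seq(Seq("spark", "mllib", "tfidf"), Seq("spark", "bayes")))
val numFeatures = 10000

// Term-frequency vectors via simple feature hashing
val tf = docs.map { tokens =>
  val counts = scala.collection.mutable.Map.empty[Int, Double]
  tokens.foreach { t =>
    val i = math.abs(t.hashCode) % numFeatures
    counts(i) = counts.getOrElse(i, 0.0) + 1.0
  }
  Vectors.sparse(numFeatures, counts.toSeq)
}

// Document frequency per term = number of non-zero entries per column
val df = new RowMatrix(tf).computeColumnSummaryStatistics().numNonzeros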
On Fri, Jun 13, 2014 at 1:21 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
Hi guys,
I ran into the same exception (while trying the same example), and after
overriding the hadoop-client artifact in my pom.xml, I got another error
(below).
System config:
Ubuntu 12.04
IntelliJ 13
Scala 2.10.3
Maven:
<dependency>
<groupId>org.apache.spark</groupId>
Hi,
How do I check the RDDs that I have persisted? I have some code that looks
like:
rd1.cache()
rd2.cache()
...
rdN.cache()
How can I unpersist all RDDs at once? And is it possible to get the names
of the RDDs that are currently persisted (list = rd1, rd2, ..., rdN)?
Thank you!
Check out SparkContext.getPersistentRDDs!
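And for the "unpersist everything" part of the question, a small sketch from the Scala shell (getPersistentRDDs returns a Map keyed by RDD id; a name is only present if you called setName):

val persisted = sc.getPersistentRDDs        // Map[Int, RDD[_]]

// List what is currently cached (name is null unless setName was used)
persisted.foreach { case (id, rdd) => println(s"$id -> ${rdd.name}") }

// Unpersist everything at once
persisted.values.foreach(_.unpersist())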
On Fri, Jun 13, 2014 at 1:06 PM, mrm ma...@skimlinks.com wrote:
Hi,
How do I check the rdds that I have persisted? I have some code that looks
like:
rd1.cache()
rd2.cache()
...
rdN.cache()
How can I unpersist all rdd's at once? And
Hi, I see this has been asked before but has not gotten any satisfactory
answer so I'll try again:
(here is the original thread I found:
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E
)
I have a set of workers dying and coming back
Hi Daniel,
Thank you for your help! This is the sort of thing I was looking for.
However, when I type sc.getPersistentRDDs, I get the error
AttributeError: 'SparkContext' object has no attribute
'getPersistentRDDs'.
I don't get any error when I type sc.defaultParallelism for example.
I would
My exception stack looks about the same.
java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml
does not exist.
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at
Hi, all
When I try to run Spark PageRank using:
./bin/spark-submit \
--master spark://192.168.1.12:7077 \
--class org.apache.spark.examples.bagel.WikipediaPageRank \
~/Documents/Scala/WikiPageRank/target/scala-2.10/wikipagerank_2.10-1.0.jar \
hdfs://192.168.1.12:9000/freebase-13G 0.05 100
Hi, Laurent
You can set spark.executor.memory and the heap size using the following methods:
1. In your conf/spark-env.sh:
export SPARK_WORKER_MEMORY=38g
export SPARK_JAVA_OPTS=-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m
2. You could also add the modification for
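As a minimal sketch of the programmatic route (property names from the Spark 1.0 configuration; the app name, master URL, and values are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")                       // placeholder
  .setMaster("spark://master:7077")           // placeholder
  .set("spark.executor.memory", "2g")         // per-executor heap
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -XX:MaxPermSize=256m")
val sc = new SparkContext(conf)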
Hi,
You can use map functions like flatMapValues and mapValues, which will apply
the map function to each pair RDD contained in your input
pair DStream[K,V] and return a pair DStream[K,V].
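A small illustration of the idea as compilable helpers (the (word, count) stream type is hypothetical; in Spark 1.0 the pair DStream functions come in via the StreamingContext._ import):

import org.apache.spark.streaming.StreamingContext._   // implicit pair DStream functions
import org.apache.spark.streaming.dstream.DStream

// mapValues keeps the key and transforms only the value
def scale(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.mapValues(_ * 2)

// flatMapValues lets one value expand into several (or zero) pairs
def expand(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.flatMapValues(v => Seq(v, v + 1))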
On Fri, Jun 13, 2014 at 8:48 AM, ryan_seq [via Apache Spark User List]
val myRdds = sc.getPersistentRDDs
assert(myRdds.size === 1)
It'll return a map. It's been available for a while, from 0.8.0 onwards.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jun 13, 2014 at 9:42 AM, mrm
I have also had trouble with workers rejoining the working set. I have typically
moved to a Mesos-based setup. Frankly, for high availability you are better
off using a cluster manager.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Sorry if this is a dumb question, but why not make several calls to
mapPartitions sequentially? Are you looking to avoid function
serialization, or is your function damaging partitions?
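For reference, the sequential version would look roughly like this (stepA and stepB are placeholder per-partition functions):

import org.apache.spark.rdd.RDD

// Placeholder per-partition transformations
def stepA(it: Iterator[Int]): Iterator[Int] = it.map(_ + 1)
def stepB(it: Iterator[Int]): Iterator[Int] = it.filter(_ % 2 == 0)

// Two sequential mapPartitions calls; each stays within its partition,
// so no shuffle occurs between the steps.
def pipeline(rdd: RDD[Int]): RDD[Int] =
  rdd.mapPartitions(stepA).mapPartitions(stepB)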
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
This appears to be missing from PySpark.
Reported in SPARK-2141 https://issues.apache.org/jira/browse/SPARK-2141.
On Fri, Jun 13, 2014 at 10:43 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
val myRdds = sc.getPersistentRDDs
assert(myRdds.size === 1)
It'll return a map. Its
You can resolve the columns to create keys from them, then join. Is that
what you did?
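Roughly what I mean, as a shell sketch (the two RDDs and the join column are made up; keyBy pulls the key out of each row):

// Hypothetical rows keyed by an id column
val users  = sc.parallelize(Seq("a,alice", "b,bob")).keyBy(_.split(",")(0))
val scores = sc.parallelize(Seq("a,1.0", "b,2.0")).keyBy(_.split(",")(0))

// Join on the derived key; result is RDD[(String, (String, String))]
val joined = users.join(scores)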
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jun 12, 2014 at 9:24 PM, SK skrishna...@gmail.com wrote:
This issue is
I just forgot to call start on the context.
Works now.
Hi Nick,
Thank you for the reply, I forgot to mention I was using pyspark in my first
message.
Maria
There is probably a subtlety between the ability to run tasks with data
process-local and node-local that I think I'm missing.
I'm doing a basic test which is the following:
1) Copy a large text file from the local file system into HDFS using
hadoop fs -copyFromLocal
2) Run Spark's wordcount
Yeah, unfortunately PySpark still lags behind the Scala API a bit, but it's
being patched up at a good pace.
On Fri, Jun 13, 2014 at 1:43 PM, mrm ma...@skimlinks.com wrote:
Hi Nick,
Thank you for the reply, I forgot to mention I was using pyspark in my
first
message.
Maria
Thanks Saisai, I think I will just try lowering my spark.cleaner.ttl value
- I've set it to an hour.
On Thu, Jun 12, 2014 at 7:32 PM, Shao, Saisai saisai.s...@intel.com wrote:
Hi Michael,
I think you can set spark.cleaner.ttl=xxx to enable the time-based metadata
cleaner, which will clean
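For reference, a minimal sketch of setting it in code (the one-hour value and app name are just examples; in Spark 1.0 the value is in seconds):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-app")          // placeholder
  .set("spark.cleaner.ttl", "3600")     // clean metadata older than one hour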
I'm interested in this issue as well. I have Spark Streaming jobs that
seem to run well for a while, but slowly degrade and don't recover.
On Wed, Jun 11, 2014 at 11:08 PM, Boduo Li onpo...@gmail.com wrote:
It seems that the slow reduce tasks are caused by slow shuffling. Here is
the logs
Hi All,
I'm new to Spark. I just tried out the example code on the Spark website for
L-BFGS, but the code val model = new LogisticRegressionModel(... gave me
an error:
console:19: error: constructor LogisticRegressionModel in class
LogisticRegressionModel cannot be accessed in class $iwC
val
Aaron,
spark.executor.memory is set to 2454m in my spark-defaults.conf, which is a
reasonable value for EC2 instances which I use (they are m3.medium
machines). However, it doesn't help and each executor uses only 512 MB of
memory. To figure out why, I examined spark-submit and spark-class scripts
I used groupBy to create the keys for both RDDs. Then I did the join.
I think, though, it would be useful if in the future Spark allowed us to
specify the fields on which to join, even when the keys are different.
Scalding supports this.
I'm running a 1.0.0 standalone cluster based on amplab/dockerscripts with 3
workers. I'm testing out spark-submit and I'm getting errors using
*--deploy-mode cluster* and using an http:// url to my JAR. I'm getting the
following error back.
Sending launch command to spark://master:7077
Driver
Hi Congrui,
Since it's private to the mllib package, one workaround is to write your
code in a Scala file under the mllib package in order to use the constructor
of LogisticRegressionModel.
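A rough sketch of that workaround (assuming the Spark 1.0 constructor takes (weights: Vector, intercept: Double); the object and method names here are made up):

// This file must be compiled under the org.apache.spark.mllib package
// so the private[mllib] constructor is accessible.
package org.apache.spark.mllib.classification

import org.apache.spark.mllib.linalg.Vector

object LogisticRegressionModelBuilder {
  def build(weights: Vector, intercept: Double): LogisticRegressionModel =
    new LogisticRegressionModel(weights, intercept)
}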
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
I've seen these caused by the OOM killer. I recommend checking
/var/log/syslog to see if it was activated due to lack of system
memory.
On Thu, Jun 12, 2014 at 11:45 PM, libl 271592...@qq.com wrote:
I submit tasks in standalone mode, but I often get an error. The stack trace is:
2014-06-12
Hi DB,
Thank you for the help! I'm new to this, so could you give a bit more
detail on how this could be done?
Sincerely,
Congrui Yi
This is very odd. If it is running fine on Mesos, I don't see an obvious
reason why it won't work on a Spark standalone cluster.
Is the 0.4-million-record file already present in the monitored directory when
the context is started? In that case, the file will not be picked up (unless
textFileStream is created
I get the same problem, but I'm running in a dev environment based on
docker scripts. The additional issue is that the worker processes do not
die and so the docker container does not exit. So I end up with worker
containers that are not participating in the cluster.
On Fri, Jun 13, 2014 at 9:44
There doesn't seem to be any obvious reason - that's why it looks like a bug.
The 0.4-million-record file is present in the directory when the context is
started - same as for all other files (which are processed just fine by the
application). In the logs we can see that the file is being picked up by
Hi All
I am new to Spark, working on a 3-node test cluster. I am trying to explore
Spark's scope in analytics; my Spark code interacts mostly with HDFS.
I am confused about how Spark chooses on which node it will distribute its
work.
Since we assume that it can be an alternative to Hadoop
If you look at the file 400k.output, you'll see the string
file:/newdisk1/praveshj/pravesh/data/input/testing4lk.txt
This file contains 0.4 million records. So the file is being picked up, but the
app goes on to hang later on.
Also, you mentioned the term standalone cluster in your previous reply
On Fri, Jun 13, 2014 at 1:55 PM, Albert Chu ch...@llnl.gov wrote:
1) How is this data process-local? I *just* copied it into HDFS. No
spark worker or executor should have loaded it.
Yeah, I thought that PROCESS_LOCAL meant the data was already in the JVM on
the worker node, but I do see the
Hi,
I have a List[(String, Int, Int)] that I would like to convert to an RDD.
I tried to use sc.parallelize and sc.makeRDD, but in each case the original
order of items in the List gets modified. Is there a simple way to convert a
List to an RDD without using SparkContext?
thanks
I may be wrong, but I think RDDs must be created inside a
SparkContext. To somehow preserve the order of the list, perhaps you
could try something like:
sc.parallelize((1 to xs.size).zip(xs))
On Fri, Jun 13, 2014 at 6:08 PM, SK skrishna...@gmail.com wrote:
Hi,
I have a List[ (String, Int,
I have been trying to get a detailed history of previous spark-shell executions
(after exiting the spark shell). In standalone mode with Spark 1.0, I think the
Spark master UI is supposed to provide detailed execution statistics of all
previously run jobs. This is supposed to be viewable by clicking on
Thanks. But that did not work.
Sorry, I wasn't being clear. The idea off the top of my head was that
you could append an original position index to each element (using the
line above), and modify whatever processing functions you have in
mind to make them aware of these indices. And I think you are right
that RDD collections
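Concretely, a shell sketch of that idea (the example list is made up; sorting on the carried index restores the original order at the end):

val xs = List(("a", 1, 1), ("b", 2, 2), ("c", 3, 3))   // example data

// Carry the original position alongside each element
val indexed = sc.parallelize(xs.zipWithIndex)

// ... per-element processing, keeping the index as the key ...
val processed = indexed.map { case (elem, i) => (i, elem) }

// Restore the original order when collecting the result
val backInOrder = processed.sortByKey().values.collect()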
Hi,
My unit test is failing (the output does not match the expected output). I
would like to print out the value of the output, but
rdd.foreach(r => println(r)) does not work from the unit test. How can I print
or write the output to a file or the screen?
thanks.
You need to factor your program so that it’s not just a main(). This is not a
Spark-specific issue, it’s about how you’d unit test any program in general. In
this case, your main() creates a SparkContext, so you can’t pass one from
outside, and your code has to read data from a file and write
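As an illustration (names made up): move the logic into a function that takes an RDD, so a test can build a local SparkContext and feed in-memory data, while main() only does the wiring.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed in Spark 1.0)
import org.apache.spark.rdd.RDD

object WordCountJob {
  // The testable core: RDD in, RDD out, no file paths, no context creation
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)

  // main() only wires up the context and the I/O
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))
    countWords(sc.textFile(args(0))).saveAsTextFile(args(1))
    sc.stop()
  }
}

// In a test:
//   val sc  = new SparkContext("local", "test")
//   val out = WordCountJob.countWords(sc.parallelize(Seq("a a b"))).collect()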
Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a csv file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
Hello,
I have been playing around with mllib's decision tree library. It is
working great, thanks.
I have a question regarding overfitting. It appears to me that the current
implementation doesn't allow the user to specify the minimum number of samples
per node. This results in some nodes only
Thank you for your suggestion. We will try it out and see how it performs. We
think the single call to mapPartitions will be faster but we could be wrong.
It would be nice to have a clone method on the iterator.
Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD)
works fine locally. Now I can do groupByKey et al. I am not sure whether it is
scalable and memory-efficient for millions of records.
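For what it's worth, the explicit wrapper usually isn't needed; a shell sketch of the implicit route (example data made up):

import org.apache.spark.SparkContext._   // implicit conversion to PairRDDFunctions

val pairs   = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairs.groupByKey()          // RDD[(String, Iterable[Int])]
val summed  = pairs.reduceByKey(_ + _)    // often cheaper than groupByKey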
Cheers
k/
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote:
Hi,