Command exited with code 137

2014-06-13 Thread libl
I submit tasks in standalone mode, but I often get an error. The stack trace is: 2014-06-12 11:37:36,578 [INFO] [org.apache.spark.Logging$class] [Method:logInfo] [Line:49] [Thread:spark-akka.actor.default-dispatcher-18] - Executor updated: app-20140612092238-0007/0 is now FAILED (Command exited with

openstack swift integration with Spark

2014-06-13 Thread Reynold Xin
If you are interested in openstack/swift integration with Spark, please drop me a line. We are looking into improving the integration. Thanks.

Spark 1.0.0 on yarn cluster problem

2014-06-13 Thread Sophia
In yarn-client mode, I submit a job from the client to YARN. My spark-env.sh file contains: export HADOOP_HOME=/usr/lib/hadoop export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop SPARK_EXECUTOR_INSTANCES=4 SPARK_EXECUTOR_CORES=1 SPARK_EXECUTOR_MEMORY=1G SPARK_DRIVER_MEMORY=2G

Convert text into tfidf vectors for Classification

2014-06-13 Thread Stuti Awasthi
Hi all, I want to perform text classification using Spark 1.0 Naïve Bayes. I am looking for a way to convert text into sparse vectors with the TF-IDF weighting scheme. I found that the MLI library supports this, but it is only compatible with Spark 0.8. What are all the options available to achieve text

Re: Convert text into tfidf vectors for Classification

2014-06-13 Thread Xiangrui Meng
You can create tf vectors and then use RowMatrix.computeColumnSummaryStatistics to get df (numNonzeros). For a tokenizer and stemmer, you can use scalanlp/chalk. Yes, it is worth having a simple interface for this. -Xiangrui On Fri, Jun 13, 2014 at 1:21 AM, Stuti Awasthi stutiawas...@hcl.com wrote:
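
A minimal sketch of the approach described above, assuming Spark 1.0 MLlib and that term-frequency vectors (one Vector per document) have already been built; names are illustrative only:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def tfidf(tf: RDD[Vector]): RDD[Vector] = {
      val numDocs = tf.count()
      // numNonzeros per column gives the document frequency of each term
      val df = new RowMatrix(tf).computeColumnSummaryStatistics().numNonzeros.toArray
      val idf = df.map(d => if (d > 0) math.log(numDocs / d) else 0.0)
      // scale each tf vector by the per-term idf weights
      tf.map(v => Vectors.dense(v.toArray.zip(idf).map { case (x, w) => x * w }))
    }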

Re: wholeTextFiles() : java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected

2014-06-13 Thread visenger
Hi guys, I ran into the same exception (while trying the same example), and after overriding the hadoop-client artifact in my pom.xml, I got another error (below). System config: Ubuntu 12.04, IntelliJ 13, Scala 2.10.3, Maven dependency: <dependency> <groupId>org.apache.spark</groupId>

list of persisted rdds

2014-06-13 Thread mrm
Hi, How do I check which RDDs I have persisted? I have some code that looks like: rd1.cache() rd2.cache() ... rdN.cache() How can I unpersist all RDDs at once? And is it possible to get the names of the RDDs that are currently persisted (list = rd1, rd2, ..., rdN)? Thank you!

Re: list of persisted rdds

2014-06-13 Thread Daniel Darabos
Check out SparkContext.getPersistentRDDs! On Fri, Jun 13, 2014 at 1:06 PM, mrm ma...@skimlinks.com wrote: Hi, How do I check the rdds that I have persisted? I have some code that looks like: rd1.cache() rd2.cache() ... rdN.cache() How can I unpersist all rdd's at once? And

Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)

2014-06-13 Thread Yana Kadiyska
Hi, I see this has been asked before but has not gotten any satisfactory answer so I'll try again: (here is the original thread I found: http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E ) I have a set of workers dying and coming back

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Daniel, Thank you for your help! This is the sort of thing I was looking for. However, when I type sc.getPersistentRDDs, I get the error AttributeError: 'SparkContext' object has no attribute 'getPersistentRDDs'. I don't get any error when I type sc.defaultParallelism, for example. I would

Re: wholeTextFiles not working with HDFS

2014-06-13 Thread Sguj
My exception stack looks about the same. java.io.FileNotFoundException: File /user/me/target/capacity-scheduler.xml does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) at

BUG? Why does MASTER have to be set to spark://hostname:port?

2014-06-13 Thread Hao Wang
Hi, all When I try to run Spark PageRank using: ./bin/spark-submit \ --master spark://192.168.1.12:7077 \ --class org.apache.spark.examples.bagel.WikipediaPageRank \ ~/Documents/Scala/WikiPageRank/target/scala-2.10/wikipagerank_2.10-1.0.jar \ hdfs://192.168.1.12:9000/freebase-13G 0.05 100

Re: how to set spark.executor.memory and heap size

2014-06-13 Thread Hao Wang
Hi Laurent, You can set spark.executor.memory and the heap size in the following ways: 1. In your conf/spark-env.sh: export SPARK_WORKER_MEMORY=38g export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m" 2. You could also add modification for
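
The executor heap can also be set per application rather than per worker, e.g. through spark.executor.memory on the SparkConf; a minimal sketch (the 2g value is only an example):

    import org.apache.spark.{SparkConf, SparkContext}

    // Per-application memory setting; it takes effect when the context is created.
    val conf = new SparkConf()
      .setAppName("memory-example")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)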

Re: Transform K,V pair to a new K,V pair

2014-06-13 Thread lalit1303
Hi, You can use map functions like flatMapValues and mapValues, which apply the map function to each pair RDD contained in your input pair DStream[K, V] and return a new pair DStream[K, V]. On Fri, Jun 13, 2014 at 8:48 AM, ryan_seq [via Apache Spark User List]
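
A minimal sketch of those operations, assuming a socket text source and a hypothetical "key value" line format (the master is supplied by spark-submit):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream functions (Spark 1.x)

    val ssc = new StreamingContext(new SparkConf().setAppName("pair-transform"), Seconds(5))
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(" "))
      .map(a => (a(0), a(1).toInt))                       // DStream[(String, Int)]
    val doubled  = pairs.mapValues(_ * 2)                  // same key, transformed value
    val expanded = pairs.flatMapValues(v => Seq(v, v + 1)) // one value becomes several
    doubled.union(expanded).print()
    ssc.start()
    ssc.awaitTermination()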

Re: list of persisted rdds

2014-06-13 Thread Mayur Rustagi
val myRdds = sc.getPersistentRDDs assert(myRdds.size === 1) It returns a map. It has been around for a while, from 0.8.0 onwards. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 13, 2014 at 9:42 AM, mrm
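
A minimal sketch of listing and unpersisting everything at once from the Scala shell, assuming an existing SparkContext sc:

    val persisted = sc.getPersistentRDDs        // Map[Int, RDD[_]] keyed by RDD id
    persisted.foreach { case (id, rdd) =>
      println(s"RDD $id: ${rdd.name}")          // name is null unless set via rdd.setName(...)
      rdd.unpersist(blocking = true)            // drop it from the cache
    }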

Re: Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)

2014-06-13 Thread Mayur Rustagi
I have also had trouble with workers joining the working set. I have typically moved to a Mesos-based setup. Frankly, for high availability you are better off using a cluster manager. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: multiple passes in mapPartitions

2014-06-13 Thread Mayur Rustagi
Sorry if this is a dumb question, but why not make several calls to mapPartitions sequentially? Are you looking to avoid function serialization, or is your function damaging partitions? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas
This appears to be missing from PySpark. Reported in SPARK-2141 https://issues.apache.org/jira/browse/SPARK-2141. On Fri, Jun 13, 2014 at 10:43 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: val myRdds = sc.getPersistentRDDs assert(myRdds.size === 1) It'll return a map. Its

Re: specifying fields for join()

2014-06-13 Thread Mayur Rustagi
You can pull out the relevant columns, use them to create keys, and then join. Is that what you did? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Jun 12, 2014 at 9:24 PM, SK skrishna...@gmail.com wrote: This issue is

Re: Java Custom Receiver onStart method never called

2014-06-13 Thread jsabin
I just forgot to call start on the context. Works now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Java-Custom-Receiver-onStart-method-never-called-tp7525p7579.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: list of persisted rdds

2014-06-13 Thread mrm
Hi Nick, Thank you for the reply, I forgot to mention I was using pyspark in my first message. Maria -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/list-of-persisted-rdds-tp7564p7581.html Sent from the Apache Spark User List mailing list archive at

process local vs node local subtlety question/issue

2014-06-13 Thread Albert Chu
There is probably a subtlety I'm missing about when tasks run process-local versus node-local. I'm doing a basic test, which is the following: 1) Copy a large text file from the local file system into HDFS using hadoop fs -copyFromLocal 2) Run Spark's wordcount

Re: list of persisted rdds

2014-06-13 Thread Nicholas Chammas
Yeah, unfortunately PySpark still lags behind the Scala API a bit, but it's being patched up at a good pace. On Fri, Jun 13, 2014 at 1:43 PM, mrm ma...@skimlinks.com wrote: Hi Nick, Thank you for the reply, I forgot to mention I was using pyspark in my first message. Maria -- View

Re: Spilled shuffle files not being cleared

2014-06-13 Thread Michael Chang
Thanks Saisai, I think I will just try lowering my spark.cleaner.ttl value - I've set it to an hour. On Thu, Jun 12, 2014 at 7:32 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Michael, I think you can set up spark.cleaner.ttl=xxx to enable time-based metadata cleaner, which will clean
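
For reference, a minimal sketch of setting the cleaner on the application's SparkConf (the value is in seconds; one hour shown here):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-job")
      .set("spark.cleaner.ttl", "3600")  // periodically drop metadata and shuffle state older than 1h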

Re: How to achieve reasonable performance on Spark Streaming?

2014-06-13 Thread Michael Chang
I'm interested in this issue as well. I have Spark Streaming jobs that seem to run well for a while, but slowly degrade and don't recover. On Wed, Jun 11, 2014 at 11:08 PM, Boduo Li onpo...@gmail.com wrote: It seems that the slow reduce tasks are caused by slow shuffling. Here are the logs

MLlib-a problem of example code for L-BFGS

2014-06-13 Thread Congrui Yi
Hi All, I'm new to Spark. I just tried out the example code on the Spark website for L-BFGS, but the code val model = new LogisticRegressionModel(... gave me an error: console:19: error: constructor LogisticRegressionModel in class LogisticRegressionModel cannot be accessed in class $iwC val

Re: How to specify executor memory in EC2 ?

2014-06-13 Thread Aliaksei Litouka
Aaron, spark.executor.memory is set to 2454m in my spark-defaults.conf, which is a reasonable value for the EC2 instances I use (m3.medium machines). However, it doesn't help, and each executor uses only 512 MB of memory. To figure out why, I examined the spark-submit and spark-class scripts

Re: specifying fields for join()

2014-06-13 Thread SK
I used groupBy to create the keys for both RDDs. Then I did the join. I think, though, that it would be useful if in the future Spark allowed us to specify the fields on which to join, even when the keys are different. Scalding supports this feature. -- View this message in context:
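
A minimal sketch of the key-then-join pattern with made-up data: each RDD is first turned into (joinField, rest) pairs, so join matches on that field.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)

    val sc = new SparkContext(new SparkConf().setAppName("join-on-field").setMaster("local[2]"))
    // (userId, name) and (userId, amount): both keyed on the field to join on
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 20.0)))
    val joined = users.join(orders)          // RDD[(Int, (String, Double))]
    joined.collect().foreach(println)
    sc.stop()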

spark-submit fails to get jar from http source

2014-06-13 Thread lbustelo
I'm running a 1.0.0 standalone cluster based on amplab/dockerscripts with 3 workers. I'm testing out spark-submit and I'm getting errors when using --deploy-mode cluster with an http:// URL to my JAR. I'm getting the following error back: Sending launch command to spark://master:7077 Driver

Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread DB Tsai
Hi Congrui, Since the constructor is private to the mllib package, one workaround is to write your code in a Scala file inside the mllib package so that you can use the constructor of LogisticRegressionModel. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com
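
A minimal sketch of that workaround, with a hypothetical sub-package and object name; because the file lives under org.apache.spark.mllib, the package-private constructor becomes visible:

    package org.apache.spark.mllib.myapp   // hypothetical sub-package inside mllib

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vectors

    object LBFGSModelBuilder {
      // weightsWithIntercept: weights produced by LBFGS, with the intercept as the last element
      def build(weightsWithIntercept: Array[Double]): LogisticRegressionModel = {
        val weights   = Vectors.dense(weightsWithIntercept.dropRight(1))
        val intercept = weightsWithIntercept.last
        new LogisticRegressionModel(weights, intercept)
      }
    }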

Re: Command exited with code 137

2014-06-13 Thread Jim Blomo
I've seen these caused by the OOM killer. I recommend checking /var/log/syslog to see if it was activated due to lack of system memory. On Thu, Jun 12, 2014 at 11:45 PM, libl 271592...@qq.com wrote: I submit tasks in standalone mode, but I often get an error. The stack trace is: 2014-06-12

Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread Congrui Yi
Hi DB, Thank you for the help! I'm new to this, so could you give a bit more detail on how this could be done? Sincerely, Congrui Yi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7596.html Sent from

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread Tathagata Das
This is very odd. If it is running fine on Mesos, I don't see an obvious reason why it won't work on a Spark standalone cluster. Is the 0.4-million-record file already present in the monitored directory when the context is started? In that case, the file will not be picked up (unless textFileStream is created

Re: Master not seeing recovered nodes(Got heartbeat from unregistered worker ....)

2014-06-13 Thread Gino Bustelo
I get the same problem, but I'm running in a dev environment based on docker scripts. The additional issue is that the worker processes do not die and so the docker container does not exit. So I end up with worker containers that are not participating in the cluster. On Fri, Jun 13, 2014 at 9:44

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread praveshjain1991
There doesn't seem to be any obvious reason - that's why it looks like a bug. The 0.4-million-record file is present in the directory when the context is started - the same as for all other files (which are processed just fine by the application). In the logs we can see that the file is being picked up by

How Spark Choose Worker Nodes for respective HDFS block

2014-06-13 Thread anishs...@yahoo.co.in
Hi All, I am new to Spark, working on a 3-node test cluster. I am trying to explore Spark's scope in analytics; my Spark code mostly interacts with HDFS. I am confused about how Spark chooses which node it will distribute its work to. Since we assume that it can be an alternative to Hadoop

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread praveshjain1991
If you look at the file 400k.output, you'll see the string file:/newdisk1/praveshj/pravesh/data/input/testing4lk.txt. This file contains 0.4 million records. So the file is being picked up, but the app goes on to hang later on. Also, you mentioned the term Standalone cluster in your previous reply

Re: process local vs node local subtlety question/issue

2014-06-13 Thread Nicholas Chammas
On Fri, Jun 13, 2014 at 1:55 PM, Albert Chu ch...@llnl.gov wrote: 1) How is this data process-local? I *just* copied it into HDFS. No spark worker or executor should have loaded it. Yeah, I thought that PROCESS_LOCAL meant the data was already in the JVM on the worker node, but I do see the

convert List to RDD

2014-06-13 Thread SK
Hi, I have a List[(String, Int, Int)] that I would like to convert to an RDD. I tried to use sc.parallelize and sc.makeRDD, but in each case the original order of items in the List gets modified. Is there a simple way to convert a List to an RDD without using SparkContext? Thanks -- View this

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
I may be wrong, but I think RDDs must be created via a SparkContext. To somehow preserve the order of the list, perhaps you could try something like: sc.parallelize((1 to xs.size).zip(xs)) On Fri, Jun 13, 2014 at 6:08 PM, SK skrishna...@gmail.com wrote: Hi, I have a List[ (String, Int,

spark master UI does not keep detailed application history

2014-06-13 Thread zhen
I have been trying to get detailed history of previous spark shell executions (after exiting spark shell). In standalone mode and Spark 1.0, I think the spark master UI is supposed to provide detailed execution statistics of all previously run jobs. This is supposed to be viewable by clicking on

Re: convert List to RDD

2014-06-13 Thread SK
Thanks. But that did not work. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/convert-List-to-RDD-tp7606p7609.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
Sorry I wasn't being clear. The idea off the top of my head was that you could append an original position index to each element (using the line above) and modify whatever processing functions you have in mind to make them aware of these indices. And I think you are right that RDD collections
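
A minimal sketch of that idea, assuming an existing SparkContext sc: each element keeps its original position as its key, so the order can be recovered after any shuffle.

    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)

    val xs = List(("a", 1, 10), ("b", 2, 20), ("c", 3, 30))
    val indexed = sc.parallelize(xs.zipWithIndex.map(_.swap))   // RDD[(Int, (String, Int, Int))]
    // ... index-aware transformations here ...
    val restored = indexed.sortByKey().values.collect()         // back in the original order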

printing in unit test

2014-06-13 Thread SK
Hi, My unit test is failing (the output does not match the expected output). I would like to print out the value of the output, but rdd.foreach(r => println(r)) does not work from the unit test. How can I print or write out the output to a file or the screen? Thanks. -- View this message in context:
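
A minimal sketch, assuming the RDD is small enough for a test: foreach(println) runs inside the executors, so (outside local mode, or under some test runners) its output never reaches the test's console; collecting to the driver first makes it visible and lets the assertion compare values directly. The expected values below are hypothetical.

    val result = rdd.collect()                  // bring the (small) test output to the driver
    result.foreach(println)                     // now printed in the test's own stdout
    assert(result.toSeq == Seq("a", "b", "c"))  // or compare directly against the expected output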

Re: guidance on simple unit testing with Spark

2014-06-13 Thread Matei Zaharia
You need to factor your program so that it’s not just a main(). This is not a Spark-specific issue, it’s about how you’d unit test any program in general. In this case, your main() creates a SparkContext, so you can’t pass one from outside, and your code has to read data from a file and write
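
A minimal sketch of that factoring, with hypothetical names: the core logic takes an RDD and returns an RDD, so a test can construct a local SparkContext and feed it in-memory data instead of going through main() and real files.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)
    import org.apache.spark.rdd.RDD

    object WordCountLogic {
      def countWords(lines: RDD[String]): RDD[(String, Int)] =
        lines.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)
    }

    // In a test (shell-style here): local context, in-memory input, assert on the collected result.
    val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
    val counts = WordCountLogic.countWords(sc.parallelize(Seq("a b", "a"))).collect().toMap
    assert(counts == Map("a" -> 2, "b" -> 1))
    sc.stop()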

Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Hi, Would appreciate insights and wisdom on a problem we are working on: 1. Context: - Given a csv file like: - d1,c1,a1 - d1,c1,a2 - d1,c2,a1 - d1,c1,a1 - d2,c1,a3 - d2,c2,a1 - d3,c1,a1 - d3,c3,a1 - d3,c2,a1 - d3,c3,a2

MLLib : Decision Tree with minimum points per node

2014-06-13 Thread Justin Yip
Hello, I have been playing around with MLlib's decision tree library. It is working great, thanks. I have a question regarding overfitting. It appears to me that the current implementation doesn't allow the user to specify the minimum number of samples per node. This results in some nodes only

Re: multiple passes in mapPartitions

2014-06-13 Thread zhen
Thank you for your suggestion. We will try it out and see how it performs. We think the single call to mapPartitions will be faster but we could be wrong. It would be nice to have a clone method on the iterator. -- View this message in context:

Re: Multi-dimensional Uniques over large dataset

2014-06-13 Thread Krishna Sankar
Answered one of my questions (#5): val pairs = new PairRDDFunctions(RDD) works fine locally. Now I can do groupByKey et al. I am not sure whether it is scalable and memory-efficient for millions of records. Cheers k/ On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar ksanka...@gmail.com wrote: Hi,
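
A minimal sketch of one alternative that avoids holding whole groups in memory, assuming an existing SparkContext sc and using the sample rows from the first message: count unique attributes per date with distinct plus reduceByKey rather than groupByKey.

    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)

    val triples = sc.parallelize(Seq(
      ("d1", "c1", "a1"), ("d1", "c1", "a2"), ("d1", "c2", "a1"), ("d1", "c1", "a1"),
      ("d2", "c1", "a3"), ("d2", "c2", "a1"),
      ("d3", "c1", "a1"), ("d3", "c3", "a1"), ("d3", "c2", "a1"), ("d3", "c3", "a2")))

    val uniqueAttrsPerDate = triples
      .map { case (d, _, a) => (d, a) }   // project the dimensions of interest
      .distinct()                         // de-duplicate before counting
      .map { case (d, _) => (d, 1) }
      .reduceByKey(_ + _)                 // unique attribute count per date

    uniqueAttrsPerDate.collect().foreach(println)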