Spark on Mesos causes mesos-master OOM

2014-08-22 Thread Chengwei Yang
Hi List, We've recently been trying to run Spark on Mesos; however, we encountered a fatal error: the mesos-master process continuously consumes memory until it is finally killed by the OOM killer. This only happens when a Spark job (fine-grained mode) is running. We finally root caused the

Re: OOM Java heap space error on saveAsTextFile

2014-08-22 Thread Akhil Das
What operation are you performing before the saveAsTextFile? If you are doing a groupBy/sortBy/mapPartitions/reduceByKey operation, then you can specify the number of partitions. We were facing this kind of problem, and specifying the correct number of partitions solved the issue. Thanks Best Regards
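A minimal sketch of that suggestion (Scala, assuming the Spark shell's sc; the paths and the partition count are illustrative):

    val pairs = sc.textFile("hdfs:///input/words").flatMap(_.split(" ")).map((_, 1))
    // An explicit partition count on reduceByKey controls how many (and how
    // large) the shuffle output partitions are, which can avoid OOM later.
    val counts = pairs.reduceByKey(_ + _, 200)
    counts.saveAsTextFile("hdfs:///output/counts")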

Re: Finding previous and next element in a sorted RDD

2014-08-22 Thread Evan Chan
There's no way to avoid a shuffle, since the first and last elements of each partition need to be computed with the others, but I wonder if there is a way to do a minimal shuffle. On Thu, Aug 21, 2014 at 6:13 PM, cjwang c...@cjwang.us wrote: One way is to do zipWithIndex on the RDD. Then use
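A sketch of the zipWithIndex approach quoted above (Scala; the data is illustrative and assumed already sorted):

    val rdd = sc.parallelize(1 to 9).sortBy(identity) // illustrative sorted RDD
    val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
    // Shift indices so each element can be joined with its neighbors.
    val prev = indexed.map { case (i, v) => (i + 1, v) } // v precedes element i + 1
    val next = indexed.map { case (i, v) => (i - 1, v) } // v follows element i - 1
    val withNeighbors = indexed
      .leftOuterJoin(prev)   // (i, (v, Option[previous]))
      .leftOuterJoin(next)   // (i, ((v, Option[previous]), Option[next]))
      .map { case (i, ((v, p), n)) => (i, (p, v, n)) }

This still shuffles for the joins, so it is only the baseline being discussed, not an answer to the minimal-shuffle question.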

Re: Advantage of using cache()

2014-08-22 Thread Nieyuan
Because map-reduce tasks like join will save shuffle data to disk, the only difference between the caching and no-caching versions is: .map { case (x, (n, i)) => (x, n) } - Thanks, Nieyuan

Re: LDA example?

2014-08-22 Thread Burak Yavuz
You can check out this pull request: https://github.com/apache/spark/pull/476 LDA is on the roadmap for the 1.2 release; hopefully we will officially support it then! Best, Burak

Re: DStream start a separate DStream

2014-08-22 Thread Mayur Rustagi
Why don't you directly use the DStream created as the output of the windowing process? Any reason? Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

iterator cause NotSerializableException

2014-08-22 Thread Kevin Jung
Hi, The following code gives me 'Task not serializable: java.io.NotSerializableException: scala.collection.mutable.ArrayOps$ofInt':

    var x = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3)
    var iter = Array(5).toIterator
    var value = 5
    var value2 = iter.next
    x.map(q => q * value).collect() // Line 1, it works.
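The snippet is cut off above, but the usual culprit in this class of error is a closure that references the non-serializable iterator (directly, or in the REPL via the enclosing line object). A hedged sketch of the standard workaround, assuming that is the cause here:

    // value2 already holds the Int pulled from the iterator; a closure that
    // touches only value2 (never iter) has nothing non-serializable to capture.
    x.map(q => q * value2).collect()
    // Avoid closures like q => q * iter.next, which drag the iterator into the task.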

countByWindow save the count ?

2014-08-22 Thread Josh J
Hi, hopefully a simple question: is there an example of where to save the output of countByWindow? I would like to save the results to external storage (Kafka or Redis). The examples show only stream.print(). Thanks, Josh
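One common pattern is foreachRDD (a sketch in Scala; stream is the questioner's DStream, the window lengths are illustrative, and createSinkConnection/send are hypothetical stand-ins for a Kafka or Redis client):

    import org.apache.spark.streaming.Seconds

    val counts = stream.countByWindow(Seconds(30), Seconds(10))
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        val sink = createSinkConnection() // hypothetical: one connection per partition
        iter.foreach(count => sink.send(count.toString))
        sink.close()
      }
    }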

Installation On Windows machine

2014-08-22 Thread Mishra, Abhishek
Hello Team, I was just trying to install Spark on my Windows Server 2012 machine and use it in my project, but unfortunately I do not find any documentation for this. Please let me know if we have drafted anything for Spark users on Windows. I am really in need of it, as we are using

On Spark Standalone mode, Where the driver program will run?

2014-08-22 Thread taoist...@gmail.com
Hi all, 1. In Spark standalone mode, when a client submits an application, where does the driver program run: on the client or on the master? 2. Is standalone mode reliable? Can it be used in production? taoist...@gmail.com

[PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I am using PySpark with the IPython notebook.

    data = sc.parallelize(range(1000), 10)
    # successful
    data.map(lambda x: x+1).collect()
    # Error
    data.count()

Something similar: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pyspark-script-td3415.html But it does not

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I'm running PySpark with Python 2.7.8 under virtualenv; the system Python version is 2.6.x.

Re: Extracting an element from the feature vector in LabeledPoint

2014-08-22 Thread LPG
Hi all, Somewhat related to this question and this data structure: what is the best way to extract features using names instead of positions? Of course, it is first necessary to store the names in some way... Thanks in advance
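A minimal sketch of one approach (Scala, MLlib; the feature names and values are illustrative): keep a name-to-index map alongside the data and index into the feature vector by name.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val featureIndex = Map("age" -> 0, "income" -> 1) // stored by you, not by MLlib
    val p = LabeledPoint(1.0, Vectors.dense(37.0, 52000.0))
    val income = p.features(featureIndex("income")) // Vector.apply(i) returns a Double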

Block input-* already exists on this machine; not re-adding it warnings

2014-08-22 Thread Aniket Bhatnagar
Hi everyone, I back-ported kinesis-asl to Spark 1.0.2 and ran a quick test on my local machine. It seems to be working fine, but I keep getting the following warnings. I am not sure what they mean and whether they are something to worry about. 2014-08-22 15:53:43,803 [pool-1-thread-7] WARN

Re: Finding Rank in Spark

2014-08-22 Thread athiradas
Does anyone know a way to do this? I tried sorting and writing an auto-increment function, but since the computation is parallel, the result is wrong. Is there any way? Please reply.

Understanding how to create custom DStreams in Spark streaming

2014-08-22 Thread Aniket Bhatnagar
Hi everyone, Sorry about the noob question, but I am struggling to understand the ways to create DStreams in Spark. Here is my understanding, based on what I could gather from the documentation and from studying the Spark code (as well as some hunches). Please correct me if I am wrong. 1. In most cases, one would

Losing Executors on cluster with RDDs of 100GB

2014-08-22 Thread Yadid Ayzenberg
Hi all, I have a Spark cluster of 30 machines, each with 16GB of memory and 8 cores, running in standalone mode. Previously my application was working well (several RDDs, the largest being around 50GB). When I started processing larger amounts of data (RDDs of 100GB), my app started losing executors. I'm currently

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work correctly?

Do we have to install the snappy when running the shuffle jobs

2014-08-22 Thread carlmartin
Hi everyone! Spark has now set Snappy as the default compression codec in spark-1.1.0-Snapshot. So if I want to run a shuffle job, do I have to install Snappy on Linux?

Manipulating/Analyzing CSV files in Spark on local machine

2014-08-22 Thread Hingorani, Vineet
Hello all, I am new to Spark and I want to analyze a CSV file using Spark on my local machine. The CSV file contains an airline database, and I want to compute a few descriptive statistics for my file (e.g. the maximum of one column, the mean and standard deviation of a column, etc.). I am reading the file using
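A minimal sketch of such statistics (Scala, Spark shell; the path, delimiter, and column index are illustrative, and header-row handling is omitted):

    val lines = sc.textFile("airline.csv")
    val col = lines.map(_.split(",")).map(fields => fields(8).toDouble)
    val stats = col.stats() // StatCounter with count, mean, stdev, max, min
    println(s"max=${stats.max} mean=${stats.mean} stdev=${stats.stdev}")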

why classTag not typeTag?

2014-08-22 Thread Mohit Jaggi
Folks, I am wondering why Spark uses ClassTag in RDD[T: ClassTag] instead of the more functional TypeTag option. I have some code that needs TypeTag functionality and I don't know if a typeTag can be converted to a classTag. Mohit.

Re: pyspark/yarn and inconsistent number of executors

2014-08-22 Thread Sandy Ryza
Hi Calvin, When you say until all the memory in the cluster is allocated and the job gets killed, do you know what's going on? Spark apps should never be killed for requesting or using too many resources. Is there an associated error message? Unfortunately, there are no tools currently for tweaking the

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hi Sankar, You need to create an external table in order to specify the location of the data (i.e. using CREATE EXTERNAL TABLE user1 LOCATION). You can take a look at this page https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable for

Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin, I have tried the CREATE EXTERNAL TABLE command as well; I get the same error. Please help me find the root cause. Thanks and Regards, Sankar S.

Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin, Forgot to mention one thing: the same query works fine in Hive and Shark. Thanks and Regards, Sankar S.

Re: Finding previous and next element in a sorted RDD

2014-08-22 Thread cjwang
It would be nice if an RDD that was massaged by OrderedRDDFunctions could know its neighbors.

importing scala libraries from python?

2014-08-22 Thread Jonathan Haddad
This is probably a bit ridiculous, but I'm wondering if it's possible to use Scala libraries from a Python module? The Cassandra connector here https://github.com/datastax/spark-cassandra-connector is in Scala; would I need a Python version of that library to use it from PySpark? Personally I have no

FetchFailed when collect at YARN cluster

2014-08-22 Thread Jiayu Zhou
Hi, I am having this FetchFailed issue when the driver is about to collect about 2.5M lines of short strings (about 10 characters per line) from a YARN cluster with 400 nodes: 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage 0.0 (TID 1228, aaa.xxx.com):

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hello Sankar, ADD JAR in SQL is not supported at the moment. We are working on it (https://issues.apache.org/jira/browse/SPARK-2219). For now, can you try SparkContext.addJar, or use --jars your-jar when launching the Spark shell? Thanks, Yin
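A sketch of the two workarounds mentioned (the jar path is illustrative):

    // 1) From code, on an existing SparkContext:
    sc.addJar("/path/to/your-udf.jar")

    // 2) Or at launch time (command line, not Scala):
    //    bin/spark-shell --jars /path/to/your-udf.jar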

[PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
Is there any way to control the ordering of values for each key during a groupByKey() operation? Is there some sort of implicit ordering already in place? Thanks, Arpan

spark streaming - realtime reports - storing current state of resources

2014-08-22 Thread salemi
Hi All, I have a set of 1000k workers of a company, with different attributes associated with them. I would like to be able to report on their current state at any time and update the reports every 5 seconds. Spark Streaming allows you to receive events about the workers' state changes and process them.
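A hedged sketch of one way to keep such per-worker state (Scala, using updateStateByKey; the events stream, the String state type, and the paths are illustrative, and checkpointing must be enabled for stateful operations):

    // events: DStream[(workerId, newState)] from some receiver (assumed)
    ssc.checkpoint("hdfs:///checkpoints") // required for updateStateByKey
    val current = events.updateStateByKey[String] {
      (newStates: Seq[String], prev: Option[String]) =>
        newStates.lastOption.orElse(prev) // keep the most recent state per worker
    }
    current.print() // or write each batch to an external store for the report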

cache table with JDBC

2014-08-22 Thread ken
I am using Spark's Thrift server to connect to Hive and using JDBC to issue queries. Is there a way to cache a table in Spark via a JDBC call? Thanks, Ken

Re: Hive From Spark

2014-08-22 Thread Du Li
I thought the fix had been pushed to the Apache master, ref. commit [SPARK-2848] Shade Guava in uber-jars by Marcelo Vanzin on 8/20. So my previous email was based on my own build of the Apache master, which turned out not to be working yet. Marcelo: please correct me if I got that commit wrong. Thanks,

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I had the same issue with spark-1.0.2-bin-hadoop*1*, and indeed the issue seems related to Hadoop1. When switching to using spark-1.0.2-bin-hadoop*2*, the issue disappears.

Re: AppMaster OOME on YARN

2014-08-22 Thread Vipul Pandey
This is all that I see related to spark.MapOutputTrackerMaster in the master logs after the OOME: 14/08/21 13:24:45 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-27] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: Java heap space

ODBC and HiveThriftServer2

2014-08-22 Thread prnicolas
Is it possible to connect to the thrift server using an ODBC client (ODBC-JDBC)? My thrift server is built from branch-1.0-jdbc using Hive 0.13.1.

Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Matthew Farrellee
there's no implicit ordering in place. the same holds for the order of keys,

Spark: Why Standalone mode can not set Executor Number.

2014-08-22 Thread Victor Sheng
As far as I know, only YARN mode can set --num-executors. Someone showed that setting more executors performs better than setting only 1 or 2 executors with large memory and many cores; see http://apache-spark-user-list.1001560.n3.nabble.com/executor-cores-vs-num-executors-td9878.html Why

Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-22 Thread guxiaobo1982
Hi Xiangrui, You can refer to An Introduction to Statistical Learning with Applications in R; there are many standard hypothesis tests to perform for linear regression and logistic regression. They should be implemented first; then we will list some other tests, which are also

Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Matthew Farrellee
you can kv.mapValues(sorted), but that's definitely less efficient than sorting during the groupBy. you could try using combineByKey directly w/ heapq...

    from heapq import heapify, heappush, merge

    def createCombiner(x):
        return [x]

    def mergeValues(xs, x):
        heappush(xs, x)
        return xs

Re: why classTag not typeTag?

2014-08-22 Thread Matei Zaharia
TypeTags are unfortunately not thread-safe in Scala 2.10. They were still somewhat experimental at the time, so we decided not to use them. If you want, though, you can probably design other APIs that pass a TypeTag around (e.g. make a method that takes an RDD[T] but also requires an implicit
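A rough sketch of the pattern being described (Scala; the method name is illustrative): an API that takes an RDD[T] plus an implicit TypeTag, so callers get TypeTag functionality without the RDD itself carrying one.

    import scala.reflect.runtime.universe._
    import org.apache.spark.rdd.RDD

    def describeElementType[T](rdd: RDD[T])(implicit tag: TypeTag[T]): String =
      tag.tpe.toString // e.g. "Int" for an RDD[Int]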

Re: Installation On Windows machine

2014-08-22 Thread Matei Zaharia
You should be able to just download and unzip a Spark release and run it on a Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on Windows, but you can launch a standalone cluster manually using

FetchFailedException from Block Manager

2014-08-22 Thread Victor Tso-Guillen
Anyone know why I would see this in a bunch of executor logs? Is it just classic overloading of the cluster network, OOM, or something else? If anyone has seen this before, what do I need to tune to make some headway here? Thanks, Victor Caused by: org.apache.spark.FetchFailedException: Fetch

Re: Configuration for big worker nodes

2014-08-22 Thread Zhan Zhang
I think it depends on your job. From my personal experience running TB-scale data: Spark got connection-lost failures when I used a big JVM with large memory, but with more executors with smaller memory it ran very smoothly. I was running Spark on YARN. Thanks. Zhan Zhang