Re: Advantage of using cache()

2014-08-22 Thread Nieyuan
Because map-reduce tasks like join save their shuffle data to disk, the only difference between the caching and no-caching versions is the final step: > .map { case (x, (n, i)) => (x, n) } - Thanks, Nieyuan
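A minimal sketch of the pattern under discussion (the RDDs and names below are hypothetical, not from the thread):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val names = sc.parallelize(Seq(("a", "x"), ("b", "y")))
    // The join's shuffle output is written to disk by Spark itself, so
    // repeated actions over `joined` reuse the shuffle files even
    // without an explicit cache().
    val joined = pairs.join(names)      // RDD[(String, (Int, String))]
    // Caching would therefore only save re-running this cheap map step:
    val result = joined.map { case (x, (n, i)) => (x, n) }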

Re: LDA example?

2014-08-22 Thread Burak Yavuz
You can check out this pull request: https://github.com/apache/spark/pull/476 LDA is on the roadmap for the 1.2 release, hopefully we will officially support it then! Best, Burak - Original Message - From: "Denny Lee" To: user@spark.apache.org Sent: Thursday, August 21, 2014 10:10:35 P

Re: LDA example?

2014-08-22 Thread Debasish Das
Hi Burak, This LDA implementation is friendly to the equality- and positivity-constrained ALS code that I added in the following JIRA to formulate robust PLSA: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-2426 Should I build upon the PR that you pointed to? I want to run some experiment

Re: DStream start a separate DStream

2014-08-22 Thread Mayur Rustagi
Why don't you directly use the DStream created as the output of the windowing process? Any reason not to? Regards, Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Aug 21, 2014 at 8:38 PM, Josh J wrote: > Hi, > > I would like to ha

iterator cause NotSerializableException

2014-08-22 Thread Kevin Jung
Hi, The following code gives me 'Task not serializable: java.io.NotSerializableException: scala.collection.mutable.ArrayOps$ofInt':

    var x = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3)
    var iter = Array(5).toIterator
    var value = 5
    var value2 = iter.next
    x.map( q => q*value).collect // Line 1, it works.
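The snippet is cut off before the failing line, but a plausible minimal reproduction and workaround (an assumption, not from the thread) is that the failure appears when the closure captures the iterator itself:

    // Fails: the closure would capture `iter`, whose underlying
    // ArrayOps$ofInt wrapper is not serializable.
    // x.map(q => q * iter.next).collect
    // Works: materialize the value on the driver before the closure.
    val head = iter.next
    x.map(q => q * head).collect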

countByWindow save the count ?

2014-08-22 Thread Josh J
Hi, Hopefully a simple question: is there an example of how to save the output of countByWindow? I would like to save the results to external storage (Kafka or Redis), but the examples show only stream.print(). Thanks, Josh
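A minimal sketch of one common approach, assuming `stream` is your existing DStream and using foreachRDD to hand each windowed count to arbitrary storage code (writeToRedis is a hypothetical placeholder for your own client):

    def writeToRedis(key: String, value: Long): Unit = {
      // placeholder: wire up your Redis or Kafka client here
    }
    // countByWindow yields a DStream[Long] with one count per window.
    val counts = stream.countByWindow(Seconds(30), Seconds(10))
    counts.foreachRDD { rdd =>
      rdd.collect().foreach(count => writeToRedis("window.count", count))
    }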

Installation On Windows machine

2014-08-22 Thread Mishra, Abhishek
Hello Team, I was just trying to install Spark on my Windows Server 2012 machine and use it in my project, but unfortunately I could not find any documentation for this. Please let me know if we have drafted anything for Spark users on Windows. I am really in need of it as we are using Windows

On Spark Standalone mode, Where the driver program will run?

2014-08-22 Thread taoist...@gmail.com
Hi all, 1. In Spark standalone mode, when a client submits an application, where does the driver program run? On the client or on the master? 2. Is standalone mode reliable? Can it be used in production? taoist...@gmail.com

[PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I am using PySpark with IPython notebook.

    data = sc.parallelize(range(1000), 10)
    # successful
    data.map(lambda x: x+1).collect()
    # Error
    data.count()

Something similar: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pyspark-script-td3415.html But it does not figure out

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I'm running pyspark with Python 2.7.8 under a virtualenv; the system Python version is Python 2.6.x.

Re: Extracting an element from the feature vector in LabeledPoint

2014-08-22 Thread LPG
Hi all, Somewhat related to this question and this data structure: what is the best way to extract features using names instead of positions? Of course, it is necessary to store the names somewhere first... Thanks in advance
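A minimal sketch of one way to do this, assuming you keep a name-to-index map alongside the data (the feature names and values below are hypothetical):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Map each feature name to its position in the vector.
    val featureIndex = Map("age" -> 0, "income" -> 1, "score" -> 2)
    val p = LabeledPoint(1.0, Vectors.dense(42.0, 55000.0, 0.7))

    // Look a feature up by name instead of by position.
    def feature(p: LabeledPoint, name: String): Double =
      p.features(featureIndex(name))

    feature(p, "income")  // 55000.0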

"Block input-* already exists on this machine; not re-adding it" warnings

2014-08-22 Thread Aniket Bhatnagar
Hi everyone I back-ported kinesis-asl to Spark 1.0.2 and ran a quick test on my local machine. It seems to be working fine but I keep getting the following warnings. I am not sure what they mean and whether they are something to worry about. 2014-08-22 15:53:43,803 [pool-1-thread-7] WARN o.ap

Re: Finding Rank in Spark

2014-08-22 Thread athiradas
Does anyone know a way to do this? I tried sorting the data and writing an auto-increment function, but since the computation is parallel the result is wrong. Is there any way? Please reply.
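A minimal sketch of one way to compute a global rank without a driver-side counter, assuming Spark 1.0+ where sortBy and zipWithIndex are available:

    val scores = sc.parallelize(Seq(("a", 3.0), ("b", 1.0), ("c", 2.0)))
    // Sort globally, then derive the rank from the partition-aware index.
    val ranked = scores
      .sortBy(_._2, ascending = false)
      .zipWithIndex()                    // index 0 = highest score
      .map { case ((id, score), idx) => (id, score, idx + 1) }
    ranked.collect()  // Array((a,3.0,1), (c,2.0,2), (b,1.0,3))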

Understanding how to create custom DStreams in Spark streaming

2014-08-22 Thread Aniket Bhatnagar
Hi everyone Sorry about the noob question, but I am struggling to understand ways to create DStreams in Spark. Here is my understanding based on what I could gather from documentation and studying Spark code (as well as some hunch). Please correct me if I am wrong. 1. In most cases, one would eit

Losing Executors on cluster with RDDs of 100GB

2014-08-22 Thread Yadid Ayzenberg
Hi all, I have a Spark cluster of 30 machines, 16GB / 8 cores each, running in standalone mode. Previously my application was working well (several RDDs, the largest being around 50G). When I started processing larger amounts of data (RDDs of 100G), my app started losing executors. I'm currently ju

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
Do I have to deploy Python to every machine to make "$PYSPARK_PYTHON" work correctly?
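For context, the usual answer is yes: every worker needs a compatible Python interpreter available at the same path. A sketch of the common setup (the path is hypothetical), in conf/spark-env.sh on each node:

    export PYSPARK_PYTHON=/usr/local/bin/python2.7

The driver and workers then all run the same interpreter version, avoiding the 2.6/2.7 mismatch described earlier in this thread.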

Do we have to install the snappy when running the shuffle jobs

2014-08-22 Thread carlmartin
Hi everyone! Spark now sets Snappy as the default compression codec in spark-1.1.0-SNAPSHOT. So if I want to run a shuffle job, do I have to install snappy on Linux?

Manipulating/Analyzing CSV files in Spark on local machine

2014-08-22 Thread Hingorani, Vineet
Hello all, I am new to Spark and I want to analyze CSV files using Spark on my local machine. The CSV file contains an airline database, and I want to compute a few descriptive statistics (e.g. the maximum of one column, the mean and standard deviation of a column, etc.) for my file. I am reading the file using s
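A minimal sketch of the usual approach, assuming a comma-separated file with a numeric column at a known position (the path and column index are hypothetical):

    val lines = sc.textFile("airline.csv")
    // Extract one numeric column (index 3 here) as doubles.
    val col = lines.map(_.split(",")(3).toDouble)
    val st  = col.stats()            // one pass: count, mean, stdev, ...
    val max = col.reduce(math.max)   // maximum of the column
    println(s"mean=${st.mean}, stdev=${st.stdev}, max=$max")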

why classTag not typeTag?

2014-08-22 Thread Mohit Jaggi
Folks, I am wondering why Spark uses ClassTag in RDD[T: ClassTag] instead of the more functional TypeTag option. I have some code that needs TypeTag functionality and I don't know if a typeTag can be converted to a classTag. Mohit.
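On the conversion question: in Scala 2.10 a TypeTag can be turned into a ClassTag at runtime. A sketch of the standard reflection idiom (not Spark-specific):

    import scala.reflect.ClassTag
    import scala.reflect.runtime.universe._

    // Build a ClassTag from the runtime class behind the TypeTag.
    def typeToClassTag[T: TypeTag]: ClassTag[T] =
      ClassTag[T](typeTag[T].mirror.runtimeClass(typeTag[T].tpe))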

Re: pyspark/yarn and inconsistent number of executors

2014-08-22 Thread Sandy Ryza
Hi Calvin, When you say "until all the memory in the cluster is allocated and the job gets killed", do you know what's going on? Spark apps should never be killed for requesting / using too many resources. Any associated error message? Unfortunately there are no tools currently for tweaking the

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hi Sankar, You need to create an external table in order to specify the location of data (i.e. using CREATE EXTERNAL TABLE user1 LOCATION). You can take a look at this page for r

Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin, I have tried the create external table command as well. I get the same error. Please help me to find the root cause. Thanks and Regards, Sankar S. On Friday, 22 August 2014, 22:43, Yin Huai wrote: Hi Sankar, You need to create an external table in order to specify the loca

Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin, Forgot to mention one thing: the same query works fine in Hive and Shark. Thanks and Regards, Sankar S. On , S Malligarjunan wrote: Hello Yin, I have tried the create external table command as well. I get the same error. Please help me to find the root cause. Thanks and

Re: Spark SQL Parser error

2014-08-22 Thread S Malligarjunan
Hello Yin/All, @Yin - Thanks for helping. I solved the SQL parser error. I am getting the following exception now:

    scala> hiveContext.hql("ADD JAR s3n://hadoop.anonymous.com/lib/myudf.jar");
    warning: there were 1 deprecation warning(s); re-run with -deprecation for details
    14/08/22 17:58:55 INFO

Re: Finding previous and next element in a sorted RDD

2014-08-22 Thread cjwang
It would be nice if an RDD that was massaged by OrderedRDDFunctions could know its "neighbors".
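A minimal sketch of one workaround, assuming Spark 1.0+ (zipWithIndex): index the sorted RDD, then join it against the index shifted by one to recover each element's predecessor:

    val sorted  = sc.parallelize(Seq(3, 1, 4, 1, 5)).sortBy(identity)
    val indexed = sorted.zipWithIndex().map(_.swap)        // (index, value)
    val prev    = indexed.map { case (i, v) => (i + 1, v) }
    // Pair each element with its predecessor in sort order (head has none).
    val withPrev = indexed.leftOuterJoin(prev).sortByKey()
      .map { case (i, (v, p)) => (v, p) }
    withPrev.collect()  // Array((1,None), (1,Some(1)), (3,Some(1)), ...)

The same shift with i - 1 recovers the successor.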

RE: Hive From Spark

2014-08-22 Thread Andrew Lee
Hopefully there could be some progress on SPARK-2420. It looks like shading may be the preferred solution over downgrading. Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark 1.1.2? By the way, regarding bin/spark-sql: is this more of a debugging tool for Spark jobs integrating

importing scala libraries from python?

2014-08-22 Thread Jonathan Haddad
This is probably a bit ridiculous, but I'm wondering if it's possible to use Scala libraries from a Python module? The Cassandra connector here https://github.com/datastax/spark-cassandra-connector is in Scala; would I need a Python version of that library to use it with Python Spark? Personally I have no

Re: Hive From Spark

2014-08-22 Thread Marcelo Vanzin
SPARK-2420 is fixed. I don't think it will be in 1.1, though - might be too risky at this point. I'm not familiar with spark-sql. On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee wrote: > Hopefully there could be some progress on SPARK-2420. It looks like shading > may be the voted solution among do

How to start master and workers on Windows

2014-08-22 Thread Steve Lewis
Thank you for your advice - I really need this help and promise to post a blog entry once it works. I ran

    >bin\spark-class.cmd org.apache.spark.deploy.master.Master

and this ran successfully and I got a web page at http://localhost:8080 which says Spark Master at spark://192.168.1.4:7077. My machi

RE: Hive From Spark

2014-08-22 Thread Jeremy Chambers
> How do people use spark-sql See: http://spark.apache.org/docs/latest/sql-programming-guide.html From: Andrew Lee [mailto:alee...@hotmail.com] Sent: Friday, August 22, 2014 2:25 PM To: Marcelo Vanzin Cc: user@spark.apache.org; u...@spark.incubator.apache.org; Patrick Wendell Subject: RE: Hiv

FetchFailed when collect at YARN cluster

2014-08-22 Thread Jiayu Zhou
Hi, I am having this FetchFailed issue when the driver is about to collect about 2.5M lines of short strings (about 10 characters each line) from a YARN cluster with 400 nodes: 14/08/22 11:43:27 WARN scheduler.TaskSetManager: Lost task 205.0 in stage 0.0 (TID 1228, aaa.xxx.com): FetchFailed(Bloc

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hello Sankar, "Add JAR" in SQL is not supported at the moment. We are working on it ( https://issues.apache.org/jira/browse/SPARK-2219). For now, can you try SparkContext.addJar or using "--jars " to launch spark shell? Thanks, Yin On Fri, Aug 22, 2014 at 2:01 PM, S Malligarjunan wrote: > He
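For reference, the two suggested workarounds look roughly like this (the local jar path is hypothetical; the S3 path is the one from the thread):

    bin/spark-shell --jars /path/to/myudf.jar

or, from code already running in the shell:

    sc.addJar("s3n://hadoop.anonymous.com/lib/myudf.jar")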

[PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
Is there any way to control the ordering of values for each key during a groupByKey() operation? Is there some sort of implicit ordering in place already? Thanks Arpan

spark streaming - realtime reports - storing current state of resources

2014-08-22 Thread salemi
Hi All, I have a set of 1000k workers of a company with different attributes associated with them. I would like to be able to report on their current state at any time and update the reports every 5 seconds. Spark Streaming allows you to receive events about the workers' state changes and process them. Where
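One canonical pattern for keeping per-key current state in Spark Streaming (a sketch under the assumption that events arrive as "workerId state" text lines; not from this thread) is updateStateByKey:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("/tmp/state")  // required by updateStateByKey
    // Hypothetical source: one "workerId state" pair per line.
    val events = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(id, st) = line.split(" "); (id, st) }
    // Keep the most recent state per worker across batches.
    val current = events.updateStateByKey(
      (updates: Seq[String], old: Option[String]) =>
        updates.lastOption.orElse(old))
    current.print()  // replace with your reporting sink
    ssc.start()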

cache table with JDBC

2014-08-22 Thread ken
I am using Spark's Thrift server to connect to Hive and issue queries over JDBC. Is there a way to cache a table in Spark by using a JDBC call? Thanks, Ken
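If the Thrift server build in use supports it (an assumption; Spark 1.1's SQL dialect does), caching can be requested as a plain SQL statement over the same JDBC connection:

    CACHE TABLE my_table;    -- my_table is a hypothetical table name
    UNCACHE TABLE my_table;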

Re: Hive From Spark

2014-08-22 Thread Du Li
I thought the fix had been pushed to the apache master, ref. commit "[SPARK-2848] Shade Guava in uber-jars" by Marcelo Vanzin on 8/20. So my previous email was based on my own build of the apache master, which turned out not to be working yet. Marcelo: Please correct me if I got that commit wrong. Thanks,

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I had the same issue with spark-1.0.2-bin-hadoop1, and indeed the issue seems related to Hadoop 1. When switching to spark-1.0.2-bin-hadoop2, the issue disappears.

Re: wholeTextFiles not working with HDFS

2014-08-22 Thread pierred
I forgot to say, I am using bin/spark-shell, spark-1.0.2. That host has Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_11).

Re: AppMaster OOME on YARN

2014-08-22 Thread Vipul Pandey
This is all that I see related to spark.MapOutputTrackerMaster in the master logs after OOME 14/08/21 13:24:45 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-27] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: Java heap space Exception

ODBC and HiveThriftServer2

2014-08-22 Thread prnicolas
Is it possible to connect to the thrift server using an ODBC client (ODBC-JDBC)? My thrift server is built from branch-1.0-jdbc using Hive 0.13.1

Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Matthew Farrellee
On 08/22/2014 04:32 PM, Arpan Ghosh wrote: > Is there any way to control the ordering of values for each key during a groupByKey() operation? Is there some sort of implicit ordering in place already? Thanks Arpan

there's no implicit ordering in place. the same holds for the order of keys, unle

Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Arpan Ghosh
I was grouping time series data by a key. I want the values to be sorted by timestamp after the grouping. On Fri, Aug 22, 2014 at 7:26 PM, Matthew Farrellee wrote: > On 08/22/2014 04:32 PM, Arpan Ghosh wrote: > >> Is there any way to control the ordering of values for each key during a >> group

Spark: Why Standalone mode can not set Executor Number.

2014-08-22 Thread Victor Sheng
As far as I know, only YARN mode can set --num-executors. It has been shown that setting more executors performs better than setting only 1 or 2 executors with large memory and many cores; see http://apache-spark-user-list.1001560.n3.nabble.com/executor-cores-vs-num-executors-td9878.html Why Standalon
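For context, a common workaround on standalone clusters of this era (an assumption about the setup) is to run several smaller workers per node via conf/spark-env.sh, since each application gets at most one executor per worker:

    export SPARK_WORKER_INSTANCES=4   # workers per node
    export SPARK_WORKER_CORES=2       # cores per worker
    export SPARK_WORKER_MEMORY=8g     # memory per worker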

Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-22 Thread guxiaobo1982
Hi Xiangrui, You can refer to <>; there are many standard hypothesis tests to perform for linear regression and logistic regression. They should be implemented first; then we will list some other tests, which are also important when using logistic regression to build scorecards.

Re: [PySpark] order of values in GroupByKey()

2014-08-22 Thread Matthew Farrellee
you can kv.mapValues(sorted), but that's definitely less efficient than sorting during the groupBy. you could try using combineByKey directly w/ heapq...

    from heapq import heapify, heappush, merge

    def createCombiner(x):
        return [x]

    def mergeValues(xs, x):
        heappush(xs, x)
        return xs

    def mergeCombiners(xs, ys):
        # merge two heaps: concatenate, then restore the heap invariant
        xs.extend(ys)
        heapify(xs)
        return xs

    # usage: kv.combineByKey(createCombiner, mergeValues, mergeCombiners)

Re: why classTag not typeTag?

2014-08-22 Thread Matei Zaharia
TypeTags are unfortunately not thread-safe in Scala 2.10. They were still somewhat experimental at the time so we decided not to use them. If you want though, you can probably design other APIs that pass a TypeTag around (e.g. make a method that takes an RDD[T] but also requires an implicit Type

Re: Installation On Windows machine

2014-08-22 Thread Matei Zaharia
You should be able to just download / unzip a Spark release and run it on a Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on Windows, but you can launch a standalone cluster manually using b
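For reference, manually launching a standalone master and a worker on Windows looks roughly like this (a sketch; the master URL is the one reported earlier in this digest, yours will differ):

    bin\spark-class.cmd org.apache.spark.deploy.master.Master
    bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://192.168.1.4:7077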

FetchFailedException from Block Manager

2014-08-22 Thread Victor Tso-Guillen
Anyone know why I would see this in a bunch of executor logs? Is it just classical overloading of the cluster network, OOM, or something else? If anyone's seen this before, what do I need to tune to make some headway here? Thanks, Victor Caused by: org.apache.spark.FetchFailedException: Fetch fa

Re: Configuration for big worker nodes

2014-08-22 Thread Zhan Zhang
I think it depends on your job. In my personal experience running TB-scale data, Spark got connection-loss failures when I used a big JVM with large memory, but with more executors with smaller memory it ran very smoothly. I was running Spark on YARN. Thanks. Zhan Zhang On Aug 21, 2014, at 3:42 PM, s