Re: Restarting a Streaming Context

2014-07-10 Thread Tathagata Das
I confirm that is indeed the case. It is designed to be so because it keeps things simpler - less chances of issues related to cleanup when stop() is called. Also it keeps things consistent with the spark context - once a spark context is stopped it cannot be used any more. You can create a new

Re: spark1.0 principal component analysis

2014-07-10 Thread Sean Owen
To clarify, you are looking for eigenvectors of what, the covariance matrix? So for example you are looking for the sqrt of the eigenvalues when you talk about stdev of components? Looking at

Re: Does MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Sean Owen
Have a look at the code maybe? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala Yes there is a smoothing parameter, and yes from the looks of it it is simply additive / Laplace smoothing. It's been in there for a while. On

KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
Can someone please run the standard kMeans code on this input with 2 centers?: 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...) Thanks, Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Sean Owen
I ran it, and your answer is exactly what I got. import org.apache.spark.mllib.linalg._ import org.apache.spark.mllib.clustering._ val vectors = sc.parallelize(Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3)).map(p => Vectors.dense(Array[Double](p._1, p._2)))) val
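For readers who want to reproduce this in the spark-shell, here is a completed, runnable version of that snippet (a minimal sketch; k = 2 comes from the original question, while the iteration count of 20 is an arbitrary choice):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.clustering.KMeans

    // the twelve 2-D points from the original post, as MLlib vectors
    val points = Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
    val vectors = sc.parallelize(points.map(p => Vectors.dense(p._1.toDouble, p._2.toDouble)))

    // train with k = 2; 20 iterations is more than enough for such a tiny data set
    val model = KMeans.train(vectors, 2, 20)
    model.clusterCenters.foreach(println)   // expect centers near (2.0, 2.0) and (5.0, 2.0)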

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread cjwang
Not sure that was what I want. I tried to run Spark Shell on a machine other than the master and got the same error. The 192 was supposed to be a simple shell script change that alters SPARK_HOME before submitting jobs. Too bad it wasn't there anymore. The build described in the pull request

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
so this is what I am running:  ./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001 And this is the input file: ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$ └───#!cat ~/Documents/2dim2.txt 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 This is the final output from spark: 14/07/10

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Tathagata Das
Do you see any errors in the logs of the driver? On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang hw...@qilinsoft.com wrote: I'm running an App for hours in a standalone cluster. From the data injector and Streaming tab of web ui, it's running well. However, I see quite a lot of Active stages in

Re: NoSuchElementException: key not found when changing the window length and interval in Spark Streaming

2014-07-10 Thread richiesgr
Hi, I get exactly the same problem here. Have you found the problem? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-key-not-found-when-changing-the-window-lenght-and-interval-in-Spark-Streaming-tp9010p9283.html Sent from the

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how

Getting Persistent Connection using socketStream?

2014-07-10 Thread kytay
Hi, I am trying out a simple piece of code by writing my own JavaNetworkCount app to test out Spark Streaming. So here are the 2 sets of code. // #1 JavaReceiverInputDStream<String> lines = sctx.socketTextStream("127.0.0.1", ); // #2 JavaReceiverInputDStream<String> lines =

running scrapy (or any other scraper) on the cluster?

2014-07-10 Thread mrm
Hi all, Has anybody tried to run scrapy on a cluster? If yes, I would appreciate hearing about the general approach that was taken (multiple spiders? single spider? how to distribute urls across nodes?...etc). I would also be interested in hearing about any experience running a different scraper

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Surendranauth Hiraman
History Server is also very helpful. On Thu, Jul 10, 2014 at 7:37 AM, Haopu Wang hw...@qilinsoft.com wrote: I didn't keep the driver's log. It's a lesson. I will try to run it again to see if it happens again. -- *From:* Tathagata Das

Re: Does MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Rahul Bhojwani
Ya thanks. I can see that lambda is used as the parameter. On Thu, Jul 10, 2014 at 1:40 PM, Sean Owen so...@cloudera.com wrote: Have a look at the code maybe? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala Yes there

Re: Does MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Rahul Bhojwani
And also, there is a small bug in the implementation, as I mentioned earlier. This is the first time I am reporting a bug, so I just wanted to ask: does your name appear somewhere, or do you get something after correcting/reporting a bug? So that I can mention that in my

Difference between collect() and take(n)

2014-07-10 Thread MEETHU MATHEW
Hi all, I want to know how collect() works and how it is different from take(). I am just reading a file of 330 MB which has 43 lakh (4.3 million) rows with 13 columns and calling take(430) to save to a variable. But the same is not working with collect(). So is there any difference in the operation of

SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread premdass
Hi, any suggestions on how to implement the OR clause and the IN clause in the SparkSQL language-integrated queries? For example: 'SELECT name FROM people WHERE age >= 10 AND month = 2' can be written as val teenagers = people.where('age >= 10).where('month === 2).select('name) How do we write

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Takuya UESHIN
Hi Prem, You can write like: people.where('age >= 10 && 'month === 2).select('name) people.where('age >= 10 || 'month === 2).select('name) people.where(In('month, Seq(2, 6))).select('name) The last one needs to import the catalyst.expressions package. Thanks. 2014-07-10 22:15 GMT+09:00

Re: Restarting a Streaming Context

2014-07-10 Thread Nicholas Chammas
Okie doke. Thanks for the confirmation, Burak and Tathagata. On Thu, Jul 10, 2014 at 2:23 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I confirm that is indeed the case. It is designed to be so because it keeps things simpler - less chances of issues related to cleanup when stop()

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Daniel Siegmann
One thing to keep in mind is that the progress bar doesn't take into account tasks which are rerun. If you see 4/4 but the stage is still active, click the stage name and look at the task list. That will show you if any are actually running. When rerun tasks complete, it can result in the number

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread premdass
Thanks Takuya, works like a charm. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Language-Integrated-query-OR-clause-and-IN-clause-tp9298p9303.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Terminal freeze during SVM

2014-07-10 Thread AlexanderRiggers
Tried the newest branch, but still get stuck on the same task: (kill) runJob at SlidingRDD.scala:74 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9304.html Sent from the Apache Spark User List mailing list

Re: Does MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Bertrand Dechoux
A patch proposal on the apache JIRA for Spark? https://issues.apache.org/jira/browse/SPARK/ Bertrand On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: And also that there is a small bug in implementation. As I mentioned this earlier also. This is my first

Re: Yay for 1.0.0! EC2 Still has problems.

2014-07-10 Thread nit
I am also running into modules/mod_authn_alias.so issue on r3.8xlarge when launched cluster with ./spark-ec2; so ganglia is not accessible. From the posts it seems that Patrick suggested using Ubuntu 12.04. Can you please provide name of AMI that can be used with -a flag that will not have this

Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
Hi Spark developers, I have the following hqls that spark will throw exceptions of this kind: 14/07/10 15:07:55 INFO TaskSetManager: Loss was due to org.apache.spark.TaskKilledException [duplicate 17] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:736 failed 4 times,

GraphX: how to specify partition strategy?

2014-07-10 Thread Yifan LI
Hi, I am doing graph computation using GraphX, but there seems to be an error with the graph partition strategy specification. As in the GraphX programming guide: The Graph.partitionBy operator allows users to choose the graph partitioning strategy, but due to SPARK-1931, this method is broken in Spark

How to read saved model?

2014-07-10 Thread rohitspujari
Hello Folks: I attended the session Aron D did at Hadoop Summit. He mentioned training the model and saving it on HDFS. When you start scoring you can read the saved model. So, I can save the model using sc.makeRDD(model.clusterCenters).saveAsObjectFile("model") But when I try to read the

SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread yadid
Hi All, I have a class with too many variables to be implemented as a case class, therefore I am using a regular class that implements Scala's Product interface, like so: class Info () extends Product with Serializable { var param1 : String = var param2 : String = ... var param38:
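For readers hitting the same limit, a minimal sketch of the pattern being described (field names are hypothetical and the class is trimmed to three members; a real one would enumerate all 38):

    class Info extends Product with Serializable {
      var param1: String = ""
      var param2: String = ""
      var param3: Int = 0

      // the Product members Spark SQL relies on to treat instances like rows
      def canEqual(that: Any): Boolean = that.isInstanceOf[Info]
      def productArity: Int = 3
      def productElement(n: Int): Any = n match {
        case 0 => param1
        case 1 => param2
        case 2 => param3
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }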

Re: Does MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Rahul Bhojwani
I have created the issue: In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Have a look at it. Thanks, On Thu, Jul 10, 2014 at 8:37 PM, Bertrand Dechoux decho...@gmail.com wrote: A patch proposal on the apache JIRA for Spark?

Re: Use Spark Streaming to update result whenever data come

2014-07-10 Thread Bill Jay
Tobias, Your help on the problems I have met has been very helpful. Thanks a lot! Bill On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until now, I have used Spark only

Re: KMeans code is rubbish

2014-07-10 Thread Xiangrui Meng
SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I ran the SparkKMeans example (not the

Re: Spark job tracker.

2014-07-10 Thread abhiguruvayya
Hi Mayur, Thanks so much for the explanation. It did help me. Is there a way I can print these details to the console rather than logging them? As of now, once I start my application I could see this: 14/07/10 00:48:20 INFO yarn.Client: Application report from ASM: application identifier:

Re: Terminal freeze during SVM

2014-07-10 Thread Xiangrui Meng
news20.binary's feature dimension is 1.35M, so the serialized task size is above the default limit of 10M. You need to set spark.akka.frameSize to, e.g., 20. Due to a bug (SPARK-1112), this parameter is not passed to executors automatically, which causes Spark to freeze. This was fixed in the latest
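As a hedged illustration of that setting (the value 20 is only an example; the property takes a size in MB), it can be passed through SparkConf when the context is created:

    import org.apache.spark.{SparkConf, SparkContext}

    // raise the Akka frame size so large serialized tasks, e.g. a ~1.35M-dimensional
    // weight vector, can be shipped to executors
    val conf = new SparkConf()
      .setAppName("svm-news20")
      .set("spark.akka.frameSize", "20")   // in MB
    val sc = new SparkContext(conf)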

Re: Spark job tracker.

2014-07-10 Thread Marcelo Vanzin
That output means you're running in yarn-cluster mode. So your code is running inside the ApplicationMaster and has no access to the local terminal. If you want to see the output: - try yarn-client mode, then your code will run inside the launcher process - check the RM web ui and look at the

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Aris Vlasakakis
Thank you very much Yana for replying! So right now the setup is a single-node machine which is my cluster, and YES, you are right, my submitting laptop has a different path to the spark-1.0.0 installation than the cluster machine. I tried to set SPARK_HOME on my submitter laptop using the actual

Recommended pipeline automation tool? Oozie?

2014-07-10 Thread k.tham
I'm just wondering what's the general recommendation for data pipeline automation. Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc... It looks like Oozie might be the best choice. But I'd like some

Running Spark on Yarn vs Mesos

2014-07-10 Thread k.tham
What do people usually do for this? It looks like Yarn might be the simplest since the Cloudera distribution already installs this for you when you install hadoop. Any advantages of using Mesos instead? Thanks. -- View this message in context:

Stateful RDDs?

2014-07-10 Thread Sargun Dhillon
So, one portion of our Spark streaming application requires some state. Our application takes a bunch of application events (i.e. user_session_started, user_session_ended, etc..), and calculates out metrics from these, and writes them to a serving layer (see: Lambda Architecture). Two related

Re: Some question about SQL and streaming

2014-07-10 Thread hsy...@gmail.com
Yes, this is what I tried, but thanks! On Wed, Jul 9, 2014 at 6:02 PM, Tobias Pfeiffer t...@preferred.jp wrote: Siyuan, I do it like this: // get data from Kafka val ssc = new StreamingContext(...) val kvPairs = KafkaUtils.createStream(...) // we need to wrap the data in a case class

Difference between SparkSQL and shark

2014-07-10 Thread hsy...@gmail.com
I have a newbie question. What is the difference between SparkSQL and Shark? Best, Siyuan

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Paul Brown
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.) — p...@mult.ifario.us |

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I ran the example with ./bin/run-example SparkKMeans file.txt 2 0.001 I get this response: Finished iteration (delta = 0.0) Final centers: DenseVector(2.8571428571428568, 2.0) DenseVector(5.6005, 2.0) The start point is not random. It uses the first K points from the given set. On

Re: Difference between SparkSQL and shark

2014-07-10 Thread Nicholas Chammas
In short, Spark SQL is the future, built from the ground up. Shark was built as a drop-in replacement for Hive, will be retired, and will perhaps be replaced by a future initiative to run Hive on Spark https://issues.apache.org/jira/browse/HIVE-7292. More info: -

RE: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Haoming Zhang
Hi Yadid, I have the same problem as you, so I implemented the Product interface as well; even the code is similar to yours. But now I face another problem: I don't know how to run the code... My whole program is like this: object SimpleApp { class Record(val x1: String,

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I am running spark-1.0.0 with java 1.8 java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) which spark-shell ~/bench/spark-1.0.0/bin/spark-shell which scala ~/bench/scala-2.10.4/bin/scala On Thursday, July

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Andrew Or
Setting SPARK_HOME is not super effective, because it is overridden very quickly by bin/spark-submit here https://github.com/apache/spark/blob/88006a62377d2b7c9886ba49ceef158737bc1b97/bin/spark-submit#L20. Instead you should set the config spark.home. Here's why: Each of your executors inherits
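A minimal sketch of that workaround (the master URL and path are placeholders; spark.home should point at the Spark installation as seen by the worker machines, not by the submitting laptop):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")            // placeholder master URL
      .setAppName("remote-submit-test")
      .set("spark.home", "/path/to/spark/on/workers")   // what the executors will use
    val sc = new SparkContext(conf)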

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Andrew Or
Yes, there are a few bugs in the UI in the event of a node failure. The duplicated stages in both the active and completed tables should be fixed by this PR: https://github.com/apache/spark/pull/1262 The fact that the progress bar on the stages page displays an overflow (e.g. 5/4) is still an

How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-10 Thread Yan Fang
Hi all, I am working to improve the parallelism of the Spark Streaming application. But I have problem in understanding how the executors are used and the application is distributed. 1. In YARN, is one executor equal one container? 2. I saw the statement that a streaming receiver runs on one

Re: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Zongheng Yang
Hi Haoming, For your spark-submit question: can you try using an assembly jar (sbt/sbt assembly will build it for you)? Another thing to check is if there is any package structure that contains your SimpleApp; if so, you should include the hierarchical name. Zongheng On Thu, Jul 10, 2014 at 11:33

Re: GraphX: how to specify partition strategy?

2014-07-10 Thread Ankur Dave
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI iamyifa...@gmail.com wrote: - how to build the latest version of Spark from the master branch, which contains a fix? Instead of downloading a prebuilt Spark release from http://spark.apache.org/downloads.html, follow the instructions under Development

Use of the SparkContext.hadoopRDD function in Scala code

2014-07-10 Thread Nick R. Katsipoulakis
Hello, I want to run an MLlib task in the Scala API that creates a hadoopRDD from a CustomInputFormat. According to the Spark API: def hadoopRDD[K, V](conf: JobConf, inputFormatClass: Class[_ <: org.apache.hadoop.mapred.InputFormat[K,V]], keyClass: Class[K], valueClass: Class[V], minSplits: Int): RDD
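For context, a hedged sketch of calling that method (it uses TextInputFormat so the snippet is self-contained; substitute your CustomInputFormat and its key/value types, and note the input path is a placeholder):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    // assumes `sc` is an existing SparkContext
    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input")

    val rdd = sc.hadoopRDD(jobConf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
    rdd.map(_._2.toString).take(5).foreach(println)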

How to RDD.take(middle 10 elements)

2014-07-10 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/q/24677180/877069 Basically, is there a way to take() elements of an RDD at an arbitrary index? Nick ​ -- View this message in context:

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
The implementation of the input-stream-to-iterator function in #2 is incorrect. The function should be such that, when the hasNext is called on the iterator, it should try to read from the buffered reader. If an object (that is, line) can be read, then return it, otherwise block and wait for data
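A sketch of what such a converter could look like in Scala (this is not the code from the thread, just an illustration of the behaviour described above; the Java version would be analogous):

    import java.io.{BufferedReader, InputStream, InputStreamReader}

    // Turns the raw socket InputStream into an iterator of lines. Each call blocks on the
    // BufferedReader until a line arrives; the iterator ends when the stream is closed
    // (readLine returns null).
    def bytesToLines(in: InputStream): Iterator[String] = {
      val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
      Iterator.continually(reader.readLine()).takeWhile(_ != null)
    }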

Re: How to RDD.take(middle 10 elements)

2014-07-10 Thread Xiangrui Meng
This is expensive but doable: rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect() -Xiangrui On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Interesting question on Stack Overflow: http://stackoverflow.com/q/24677180/877069 Basically, is

writing Flume data to HDFS

2014-07-10 Thread Sundaram, Muthu X.
I am new to Spark. I am trying to do the following: Netcat -> Flume -> Spark Streaming (process Flume data) -> HDFS. My Flume config file has the following setup: source = netcat, sink = avro sink. Spark Streaming code: I am able to print data from Flume to the monitor. But I am struggling to create a file.

Re: Potential bugs in SparkSQL

2014-07-10 Thread Stephen Boesch
Hi Jerry, To add to your question: the following does work (from master), notice the registerAsTable is commented out (I took the liberty of adding the order by clause): val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._ hql("USE test") // hql("select id from

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread Andrew Or
Hi C.J., The PR Yana pointed out seems to fix this. However, it is not merged in master yet, so for now I would recommend that you try the following workaround: set spark.home to the executor's /path/to/spark. I provided more detail here:

Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found out a potential performance issue using Hive via SparkSQL. It is very bothersome. So your help in understanding why it is terribly slow is very very important. First, we have some text files in HDFS which

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also tried hql("select * from m").count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found out a potential performance issue using Hive via

Re: RDD registerAsTable gives error on regular scala class records

2014-07-10 Thread Thomas Robert
Hi, I'm quite a Spark newbie so I might be wrong, but I think that registerAsTable works either on case classes or on classes extending Product. You can find this info in an example on the Spark SQL doc page: http://spark.apache.org/docs/latest/sql-programming-guide.html // Define the schema
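The doc example being referred to is essentially the following (a sketch against the 1.0 API; the file path and age bounds are those used in the guide and are only illustrative):

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._   // brings createSchemaRDD and the query DSL into scope

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerAsTable("people")
    val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)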

RE: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Haoming Zhang
Hi Zongheng, Thanks a lot for your reply. I was editing my code in my group project and I forgot to remove the package declaration... How silly! Regards, Haoming Date: Thu, 10 Jul 2014 12:00:40 -0700 Subject: Re: SPARKSQL problem with implementing Scala's Product interface From:

What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Nick Chammas
Looks like twitter4j 2.2.6 (http://twitter4j.org/archive/) is what works, but I don’t believe it’s documented anywhere. Using 3.0.6 works for a while, but then causes the following error: 14/07/10 18:34:13 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote: By the way, I also try hql(select * from m).count. It is terribly

Multiple SparkContexts with different configurations in same JVM

2014-07-10 Thread Philip Ogren
In various previous versions of Spark (and I believe the current version, 1.0.0, as well) we have noticed that it does not seem possible to have a local SparkContext and a SparkContext connected to a cluster via either a Spark Cluster (i.e. using the Spark resource manager) or a YARN cluster.

incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread SK
Hi, I have a csv data file, which I have organized in the following format to be read as a LabeledPoint (following the example in mllib/data/sample_tree_data.csv): 1,5.1,3.5,1.4,0.2 1,4.9,3,1.4,0.2 1,4.7,3.2,1.3,0.2 1,4.6,3.1,1.5,0.2 The first column is the binary label (1 or 0) and the
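If the built-in loader turns out not to match the file layout, a hedged alternative is to parse the CSV by hand (a minimal sketch; the path is a placeholder and `sc` is assumed to be an existing SparkContext):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // first column is the 0/1 label, the remaining columns are features
    val data = sc.textFile("path/to/data.csv").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }
    data.take(5).foreach(println)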

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Andrei
I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in presence of Spark. Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can

SparkR failed to connect to the master

2014-07-10 Thread cjwang
I have a cluster running. I was able to run Spark Shell and submit programs. But when I tried to use SparkR, I got these errors: wifi-orcus:sparkR cwang$ MASTER=spark://wifi-orcus.dhcp.carrieriq.com:7077 sparkR R version 3.1.0 (2014-04-10) -- Spring Dance Copyright (C) 2014 The R Foundation

Submitting to a cluster behind a VPN, configuring different IP address

2014-07-10 Thread Aris Vlasakakis
Hi Spark folks, So on our production Spark cluster, it lives in the data center and I need to attach to a VPN from my laptop, so that I can then submit a Spark application job to the Spark Master (behind the VPN). However, the problem arises that I have a local IP address on the laptop which is

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote: For the curious mind, the dataset is about 200-300GB and we are using 10 machines for this benchmark. Given the env is equal between the two experiments, why is pure Spark faster than SparkSQL? There is going to be some

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hi Jerry, Thanks for reporting this. It would be helpful if you could provide the output of the following command: println(hql("select s.id from m join s on (s.id=m_id)").queryExecution) Michael On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam chiling...@gmail.com wrote: Hi Spark developers, I

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Michael Armbrust
I'll add that the SQL parser is very limited right now, and that you'll get much wider coverage using hql inside of HiveContext. We are working on bringing sql() much closer to SQL-92 though in the future. On Thu, Jul 10, 2014 at 7:28 AM, premdass premdas...@yahoo.co.in wrote: Thanks Takuya .

Re: EC2 Cluster script. Shark install fails

2014-07-10 Thread Michael Armbrust
There is no version of Shark that is compatible with Spark 1.0, however, Spark SQL does come included automatically. More information here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

Re: What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Tathagata Das
Spark Streaming uses twitter4j 3.0.3. 3.0.6 should probably work fine. The exception that you are seeing is something that should be looked into. Can you give us more logs (especially executor logs) with stack traces that show the error? TD On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Are you specifying the number of reducers in all the DStream *ByKey operations? If the number of reducers is not set, then the number of reducers used in the stages can keep changing across batches. TD On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a

Re: NoSuchElementException: key not found when changing the window length and interval in Spark Streaming

2014-07-10 Thread Tathagata Das
This bug has been fixed. Either use the master branch of Spark, or maybe wait a few days for Spark 1.0.1 to be released (voting has successfully closed). TD On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote: Hi I get exactly the same problem here, do you've found the

Re: Spark Streaming using File Stream in Java

2014-07-10 Thread Tathagata Das
The fileStream is not designed to work with a continuously updating file, as one of the main design goals of Spark is immutability (to guarantee fault-tolerance by recomputation), and files that are appended to (mutating) defeat that. It is rather designed to pick up new files added atomically (using
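For illustration, the usual pattern (a minimal sketch; the directory is a placeholder) is to write each batch of records to a new file and move it into a monitored directory, letting the stream pick it up:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // assumes `sc` is an existing SparkContext; files must appear in the directory
    // atomically (e.g. written elsewhere and then renamed into place)
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.textFileStream("hdfs:///flume/out/")
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()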

Re: Number of executors change during job running

2014-07-10 Thread Bill Jay
Hi Tathagata, I set default parallelism as 300 in my configuration file. Sometimes there are more executors in a job. However, it is still slow. And I further observed that most executors take less than 20 seconds but two of them take much longer such as 2 minutes. The data size is very small

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Michael, Yes the table is partitioned on 1 column. There are 11 columns in the table and they are all String type. I understand that SerDes contributes to some overheads but using pure Hive, we could run the query about 5 times faster than SparkSQL. Given that Hive also has the same SerDes

Re: Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
Hi Michael, I got the log you asked for. Note that I manually edited the table name and the field names to hide some sensitive information. == Logical Plan == Project ['s.id] Join Inner, Some((id#106 = 'm.id)) Project [id#96 AS id#62] MetastoreRelation test, m, None MetastoreRelation

Re: incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread Yana Kadiyska
I do not believe the order of points in a distributed RDD is in any way guaranteed. For a simple test, you can always add a last column which is an id (make it double and throw it in the feature vector). Printing the rdd back will not give you the points in file order. If you don't want to go that

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
Yeah, sorry. I think you are seeing some weirdness with partitioned tables that I have also seen elsewhere. I've created a JIRA and assigned someone at databricks to investigate. https://issues.apache.org/jira/browse/SPARK-2443 On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam chiling...@gmail.com

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Can you try setting the number of partitions in all the shuffle-based DStream operations explicitly? It may be that the default parallelism (that is, spark.default.parallelism) is not being respected. Regarding the unusual delay, I would look at the task details of that stage
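A hedged example of what setting the partition count explicitly looks like (it assumes a DStream[(String, Long)] named pairs; 300 mirrors the default parallelism mentioned above and is only illustrative):

    // passing the count explicitly keeps the number of reduce tasks stable across batches
    val counts = pairs.reduceByKey(_ + _, 300)
    // the other *ByKey operations accept the same optional argument, e.g. pairs.groupByKey(300)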

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hmm, yeah, looks like the table name is not getting applied to the attributes of m. You can work around this by rewriting your query as: hql("select s.id from (SELECT * FROM m) m join s on (s.id=m.id) order by s.id") This explicitly gives the alias m to the attributes of that table. You can also

Re: Some question about SQL and streaming

2014-07-10 Thread Tobias Pfeiffer
Hi, I think it would be great if we could do the string parsing only once and then just apply the transformation for each interval (reducing the processing overhead for short intervals). Also, one issue with the approach above is that transform() has the following signature: def

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread cjwang
Andrew, Thanks for replying. I did the following and the result was still the same. 1. Added spark.home /root/spark-1.0.0 to local conf/spark-defaults.conf, where /root was the place in the cluster where I put Spark. 2. Ran bin/spark-shell --master

Re: Some question about SQL and streaming

2014-07-10 Thread Tathagata Das
Yeah, the right solution is to have something like SchemaDStream, where the schema of all the schemaRDD generated by it can be stored. Something I really would like to see happen in the future :) TD On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I think it

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread kytay
Hi TD, Thanks. I have a problem understanding the code on GitHub, object SocketReceiver.bytesToLines(...) https://github.com/apache/spark/blob/095b5182536a43e2ae738be93294ee5215d86581/streaming/src/main/scala/org/apache/spark/streaming/dstream/SocketInputDStream.scala private[streaming]

RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
Actually we have a POC project which shows the power of combining Spark Streaming and Catalyst, it can manipulate SQL on top of Spark Streaming and get SchemaDStream. You can take a look at it: https://github.com/thunderain-project/StreamSQL Thanks Jerry From: Tathagata Das

Streaming. Cannot get socketTextStream to receive anything.

2014-07-10 Thread kytay
Hi, I am learning Spark Streaming and am trying out the JavaNetworkCount example. #1 - This is the code I wrote: JavaStreamingContext sctx = new JavaStreamingContext("local", appName, new Duration(5000)); JavaReceiverInputDStream<String> lines = sctx.socketTextStream("127.0.0.1", );

RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
No specific plans to do so, since there is some functional loss, like the time-based windowing function which is important for streaming SQL. Also, keeping compatible with the fast-growing SparkSQL is quite hard. So no clear plans to submit it upstream. -Jerry From: Tobias Pfeiffer

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
Right, this uses NextIterator, which is elsewhere in the repo. It just makes it cleaner to implement a custom iterator. But I guess you got the high-level point, so it's okay. TD On Thu, Jul 10, 2014 at 7:21 PM, kytay kaiyang@gmail.com wrote: Hi TD Thanks. I have problem understanding

Re: Join two Spark Streaming

2014-07-10 Thread Tathagata Das
Do you want to continuously maintain the set of unique integers seen since the beginning of the stream? var uniqueValuesRDD: RDD[Int] = ... dstreamOfIntegers.transform(newDataRDD => { val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct uniqueValuesRDD = newUniqueValuesRDD //
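A completed, hedged version of that idea (it assumes `sc` and a DStream[Int] called dstreamOfIntegers already exist; `distinctValues` in the snippet above refers to the same running RDD):

    import org.apache.spark.rdd.RDD

    // running set of distinct integers seen so far
    var uniqueValuesRDD: RDD[Int] = sc.parallelize(Seq.empty[Int])

    val uniqueValues = dstreamOfIntegers.transform { newDataRDD =>
      uniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct().cache()
      uniqueValuesRDD
    }
    uniqueValues.count().print()
    // in practice the growing lineage should be truncated periodically (e.g. by
    // checkpointing uniqueValuesRDD), otherwise recovery time keeps increasing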

Wanna know more about Pyspark Internals

2014-07-10 Thread Baofeng Zhang
In addition to the wiki on Confluence https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals or reading the source code, where/how can I get more information about PySpark internals, for I am so familiar with Python :( -- View this message in context:

Spark Streaming with Kafka NoClassDefFoundError

2014-07-10 Thread Dilip
Hi, I am trying to run a program with spark streaming using Kafka on a stand alone system. These are my details: Spark 1.0.0 hadoop2 Scala 2.10.3 I am trying a simple program using my custom sbt project but this is the error I am getting: Exception in thread main

Spark summit 2014 videos ?

2014-07-10 Thread Ajay Srivastava
Hi, I did not find any videos on the Apache Spark channel on YouTube yet. Any idea when these will be made available? Regards, Ajay

Re: Wanna know more about Pyspark Internals

2014-07-10 Thread Davies Liu
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals Hope it's useful for you. Davies On Thu, Jul 10, 2014 at 8:49 PM, Baofeng Zhang pelickzh...@qq.com wrote: In addition to wiki on Confluence https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals or reading