Re: Number of executors change during job running

2014-07-11 Thread Praveen Seluka
If I understand correctly, you cannot change the number of executors at runtime, right? (Correct me if I am wrong.) It's defined when we start the application and stays fixed. Do you mean the number of tasks? On Fri, Jul 11, 2014 at 6:29 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you try

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Akhil Das
The easiest fix would be adding the Kafka jars to the SparkContext while creating it. Thanks Best Regards On Fri, Jul 11, 2014 at 4:39 AM, Dilip dilip_ram...@hotmail.com wrote: Hi, I am trying to run a program with Spark Streaming using Kafka on a standalone system. These are my details:
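
For illustration, a minimal Scala sketch of attaching the jars when the context is created (the paths and versions below are assumptions, not taken from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Ship the Kafka-related jars to the executors when the context is created.
    val conf = new SparkConf()
      .setAppName("KafkaStreaming")
      .setJars(Seq(
        "/path/to/spark-streaming-kafka_2.10-1.0.1.jar",   // hypothetical paths
        "/path/to/kafka_2.10-0.8.0.jar"))

    val ssc = new StreamingContext(conf, Seconds(2))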

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-11 Thread Akhil Das
You can simply use the *nc* command to do this, like: nc -p 12345 will open port 12345, and from the terminal you can provide whatever input you require for your streaming code. Thanks Best Regards On Fri, Jul 11, 2014 at 2:41 AM, kytay kaiyang@gmail.com wrote: Hi I am learning spark

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
Hi Praveen, I did not change the total number of executors. I specified 300 as the number of executors when I submitted the jobs. However, for some stages, the number of executors is very small, leading to long calculation times even for a small data set. That means not all executors were used for

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-11 Thread Akhil Das
Sorry, the command is nc -lk 12345 Thanks Best Regards On Fri, Jul 11, 2014 at 6:46 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You simply use the *nc* command to do this. like: nc -p 12345 will open the 12345 port and from the terminal you can provide whatever input you require for
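
For reference, a minimal Scala sketch of the receiving side that would pair with nc -lk 12345 (app name and batch interval are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("SocketWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Connect to the port nc is listening on and count words per batch.
    val lines = ssc.socketTextStream("localhost", 12345)
    lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()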

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Dilip
Hi Akhil, Can you please guide me through this? Because the code I am running already has this in it: [java] SparkContext sc = new SparkContext(); sc.addJar("/usr/local/spark/external/kafka/target/scala-2.10/spark-streaming-kafka_2.10-1.1.0-SNAPSHOT.jar"); Is there something I am

Re: Join two Spark Streaming

2014-07-11 Thread Bill Jay
Hi Tathagata, Thanks for the solution. Actually, I will use the number of unique integers in the batch instead of the cumulative number of unique integers. I do have two questions about your code: 1. Why do we need uniqueValuesRDD? Why do we need to call uniqueValuesRDD.checkpoint()? 2. Where

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Bill Jay
I have met similar issues. The reason is probably because in Spark assembly, spark-streaming-kafka is not included. Currently, I am using Maven to generate a shaded package with all the dependencies. You may try to use sbt assembly to include the dependencies in your jar file. Bill On Thu, Jul

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Nick Pentreath
You may look into the new Azkaban - which, while being quite heavyweight, is actually quite pleasant to use when set up. You can run Spark jobs (spark-submit) using Azkaban shell commands and pass parameters between jobs. It supports dependencies, simple DAGs and scheduling with retries.

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
Hi Tathagata, I also tried to use the number of partitions as a parameter to functions such as groupByKey. It seems the number of executors is around 50 instead of 300, which is the number of executors I specified in the submission script. Moreover, the running time of different executors is

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread 明风
We used Azkaban for a short time and suffered a lot. Finally we almost rewrote it entirely. I really don't recommend it. From: Nick Pentreath nick.pentre...@gmail.com Reply-To: user@spark.apache.org Date: Friday, July 11, 2014, 3:18 PM To: user@spark.apache.org Subject: Re: Recommended pipeline automation tool? Oozie?

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Nick Pentreath
Did you use the old Azkaban or Azkaban 2.5? It has been completely rewritten. Not saying it is the best, but I found it way better than Oozie, for example. Sent from my iPhone On 11 Jul 2014, at 09:24, 明风 mingf...@taobao.com wrote: We used Azkaban for a short time and suffered a lot. Finally we

Re: KMeans code is rubbish

2014-07-11 Thread Wanda Hawk
I also took a look at spark-1.0.0/examples/src/main/scala/org/apache/spark/examples/mllib/DenseKMeans.scala and ran the code in a shell. There is an issue here: val initMode = params.initializationMode match { case Random => KMeans.RANDOM case Parallel => KMeans.K_MEANS_PARALLEL

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Dilip
A simple sbt assembly is not working. Is there any other way to include particular jars with assembly command? Regards, Dilip On Friday 11 July 2014 12:45 PM, Bill Jay wrote: I have met similar issues. The reason is probably because in Spark assembly, spark-streaming-kafka is not

KMeans for large training data

2014-07-11 Thread durin
Hi, I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic clustering with Strings. My code works great when I use a five-figure amount of training elements. However, with for example 2 million elements, it gets extremely slow. A single stage may take up to 30 minutes. From

Re: KMeans for large training data

2014-07-11 Thread Sean Owen
How many partitions do you use for your data? if the default is 1, you probably need to manually ask for more partitions. Also, I'd check that your executors aren't thrashing close to the GC limit. This can make things start to get very slow. On Fri, Jul 11, 2014 at 9:53 AM, durin

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-11 Thread kytay
Hi Akhil Das I have tried the nc -lk command too. I was hoping that System.out.println("Print text: " + arg0); would be printed when a stream is processed, when lines.flatMap(...) is called. But from my test with nc -lk , nothing is printed on the console at all. == To test out whether the nc

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-11 Thread kytay
I think I should be seeing any line of text that I have typed in the nc command. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-socketTextStream-to-receive-anything-tp9382p9410.html Sent from the Apache Spark User List mailing list

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-11 Thread Akhil Das
Can you try this piece of code? SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount"); JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(1000)); JavaReceiverInputDStream<String> lines = ssc.socketTextStream( args[0],

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
Hi TD: The input file is on hdfs.   The file is approx 2.7 GB and when the process starts, there are 11 tasks (since hdfs block size is 256M) for processing and 2 tasks for reduce by key.   After the file has been processed, I see new stages with 2 tasks that continue to be generated. I

RE: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-11 Thread Haopu Wang
I saw some exceptions like this in driver log. Can you shed some lights? Is it related with the behaviour? 14/07/11 20:40:09 ERROR LiveListenerBus: Listener JobProgressListener threw an exception java.util.NoSuchElementException: key not found: 64019 at

Iteration question

2014-07-11 Thread Nathan Kronenfeld
Hi, folks. We're having a problem with iteration that I don't understand. We have the following test code: org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.WARN) org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.WARN) def test (caching: Boolean,

Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
Hi Folks, Does any one have experience or recommendations on incorporating categorical features (attributes) into k-means clustering in Spark? In other words, I want to cluster on a set of attributes that include categorical variables. I know I could probably implement some custom code to

Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Sean Owen
Since you can't define your own distance function, you will need to convert these to numeric dimensions. 1-of-n encoding can work OK, depending on your use case. So a dimension that takes on 3 categorical values becomes 3 dimensions, of which all are 0 except one that has value 1. On Fri, Jul
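
A minimal sketch of such 1-of-n encoding before k-means (the category values and numeric features are made up):

    import org.apache.spark.mllib.linalg.Vectors

    val categories = Seq("red", "green", "blue")
    val index = categories.zipWithIndex.toMap

    // Append a one-hot block for the categorical value to the numeric features.
    def encode(numeric: Array[Double], value: String) = {
      val oneHot = Array.fill(categories.size)(0.0)
      oneHot(index(value)) = 1.0
      Vectors.dense(numeric ++ oneHot)
    }

    encode(Array(1.5, 0.2), "green")   // -> [1.5, 0.2, 0.0, 1.0, 0.0]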

Re: Categorical Features for K-Means Clustering

2014-07-11 Thread Wen Phan
I see. So, basically, kind of like dummy variables like with regressions. Thanks, Sean. On Jul 11, 2014, at 10:11 AM, Sean Owen so...@cloudera.com wrote: Since you can't define your own distance function, you will need to convert these to numeric dimensions. 1-of-n encoding can work OK,

Spark Streaming timing considerations

2014-07-11 Thread Laeeq Ahmed
Hi, In the Spark Streaming paper, slack time has been suggested for delaying the batch creation in case of external timestamps. I don't see any such option in StreamingContext. Is it available in the API? Also, going through the previous posts, queueStream has been suggested for this. I

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-07-11 Thread Rohit Rai
Hi Gerard, This has been on my todo list for a while... I just published a Calliope snapshot built against Hadoop 2.2.x. Take it for a spin if you get a chance - You can get the jars from here - -

Databricks demo

2014-07-11 Thread Debasish Das
Hi, The Databricks demo at Spark Summit was amazing... what's the frontend stack used, specifically for rendering multiple reactive charts on the same DOM? Looks like that's an emerging pattern for correlating different data APIs... Thanks Deb

Re: KMeans code is rubbish

2014-07-11 Thread Ameet Talwalkar
Hi Wanda, As Sean mentioned, K-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic to deal with this issue is to run kmeans multiple times and choose the best answer. You can do this by changing the runs parameter from the default value
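
A minimal sketch of what that could look like (toy data run from a spark-shell sc; the parameter values are illustrative only):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 1.0), Vectors.dense(1.2, 0.9),
      Vectors.dense(9.0, 9.1), Vectors.dense(8.8, 9.3)))

    // runs = 10 restarts k-means from 10 different initializations and keeps the
    // model with the lowest cost, which usually avoids bad local optima.
    val model = KMeans.train(data, k = 2, maxIterations = 20, runs = 10)
    println(model.clusterCenters.mkString(", "))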

Re: RDD join, index key: composite keys

2014-07-11 Thread marspoc
I want to do indexing similar to an RDBMS on keyPnl, on the pnl_type_code, so that group by can be done efficiently. How do I achieve that? Currently the code below blows out of memory in Spark on 60GB of data. keyPnl is a very large file. We have been stuck for 1 week, trying kryo, mapValues etc. but without

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-11 Thread Bill Jay
You may try to use this one: https://github.com/sbt/sbt-assembly I had an issue of duplicate files in the uber jar file. But I think this library will assemble dependencies into a single jar file. Bill On Fri, Jul 11, 2014 at 1:34 AM, Dilip dilip_ram...@hotmail.com wrote: A simple sbt
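
For illustration, a minimal build.sbt sketch for this approach (sbt 0.13-era sbt-assembly syntax; all version numbers and the merge rules are assumptions, not taken from the thread):

    // in project/plugins.sbt (plugin version assumed):
    //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    import AssemblyKeys._

    assemblySettings

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.1"   // bundled into the uber jar
    )

    // Resolve duplicate-file errors by picking one copy of conflicting entries.
    mergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }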

RE: SPARK_CLASSPATH Warning

2014-07-11 Thread Andrew Lee
As mentioned, deprecated in Spark 1.0+. Try to use --driver-class-path: ./bin/spark-shell --driver-class-path yourlib.jar:abc.jar:xyz.jar Don't use the glob *; specify the JARs one by one, separated by colons. Date: Wed, 9 Jul 2014 13:45:07 -0700 From: kat...@cs.pitt.edu Subject: SPARK_CLASSPATH Warning

Re: Join two Spark Streaming

2014-07-11 Thread Tathagata Das
1. Since the RDD of the previous batch is used to create the RDD of the next batch, the lineage of dependencies in the RDDs continues to grow infinitely. That's not good because it increases fault-recovery times, task sizes, etc. Checkpointing saves the data of an RDD to HDFS and truncates the
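
As a minimal illustration (not the code from the thread; stream source, types and intervals are assumed), carrying an RDD across batches and checkpointing it periodically could look like:

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("UniqueValues"), Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints")          // checkpoint directory (hypothetical)

    val stream = ssc.socketTextStream("localhost", 9999).map(_.toInt)

    var uniqueValues: RDD[Int] = ssc.sparkContext.parallelize(Seq.empty[Int])
    var batch = 0L

    stream.foreachRDD { rdd =>
      uniqueValues = uniqueValues.union(rdd).distinct()
      batch += 1
      if (batch % 10 == 0) {            // every 10 batches...
        uniqueValues.checkpoint()       // ...save to HDFS and truncate the growing lineage
        uniqueValues.count()            // force materialization so the checkpoint is written
      }
    }

    ssc.start()
    ssc.awaitTermination()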

RE: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-11 Thread Andrew Lee
Ok, I found it on JIRA SPARK-2390: https://issues.apache.org/jira/browse/SPARK-2390 So it looks like this is a known issue. From: alee...@hotmail.com To: user@spark.apache.org Subject: spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option? Date: Tue, 8 Jul 2014 15:17:00

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
Whenever you need to do a shuffle-based operation like reduceByKey, groupByKey, join, etc., the system is essentially redistributing the data across the cluster and it needs to know how many parts it should divide the data into. That's where the default parallelism is used. TD On Fri, Jul 11,
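
A minimal sketch of controlling that number explicitly (values are illustrative; assumes a spark-shell sc):

    // Pass the number of partitions directly to the shuffle-based operation...
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 10, 1))
    val counts = pairs.reduceByKey(_ + _, 300)      // 300 reduce-side partitions

    // ...or set the fallback used when no number is given:
    //   sparkConf.set("spark.default.parallelism", "300")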

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Yan Fang
Hi Praveen, Thank you for the answer. That's interesting, because if I only bring up one executor for Spark Streaming, it seems only the receiver is working and no other tasks are happening, judging by the log and UI. Maybe it's just because the receiving task eats all the resources, not because

Re: writing FLume data to HDFS

2014-07-11 Thread Tathagata Das
What is the error you are getting when you say "I was trying to write the data to hdfs.. but it fails"? TD On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X. muthu.x.sundaram@sabre.com wrote: I am new to spark. I am trying to do the following. Netcat → Flume → Spark Streaming (process Flume

Re: Spark Streaming RDD to Shark table

2014-07-11 Thread patwhite
Hi, I'm running into an identical issue running Spark 1.0.0 on Mesos 0.19. Were you able to get it sorted? There's no real documentation for the spark.httpBroadcast.uri except what's in the code - is this config setting required for running on a Mesos cluster? I'm running this in a dev

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
So, is it expected for the process to generate stages/tasks even after processing a file ? Also, is there a way to figure out the file that is getting processed and when that process is complete ? Thanks On Friday, July 11, 2014 1:51 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

Top N predictions

2014-07-11 Thread Rich Kroll
Hello all, In our use case we would like to return the top 10 predicted values. I've looked at NaiveBayes and LogisticRegressionModel and cannot seem to find a way to get the predicted values for a vector - is this possible with mllib/spark? Thanks, Rich

Re: Top N predictions

2014-07-11 Thread Sean Owen
I don't believe it is. Recently when I needed to do this, I just copied out the underlying probability / margin function and calculated it from the model params. It's just a dot product. On Fri, Jul 11, 2014 at 7:48 PM, Rich Kroll rich.kr...@modernizingmedicine.com wrote: Hello all, In our use
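
A minimal sketch of that approach (helper names are made up; assumes an MLlib LogisticRegressionModel exposing weights and intercept):

    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Recompute the raw score from the model parameters: margin = w . x + b,
    // then the probability via the logistic function.
    def score(model: LogisticRegressionModel, features: Vector): Double = {
      val margin = model.weights.toArray.zip(features.toArray)
        .map { case (w, x) => w * x }.sum + model.intercept
      1.0 / (1.0 + math.exp(-margin))
    }

    def topN(model: LogisticRegressionModel, data: RDD[Vector], n: Int) = {
      implicit val byScore: Ordering[(Double, Vector)] = Ordering.by(_._1)
      data.map(v => (score(model, v), v)).top(n)     // n highest-scoring vectors
    }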

Job getting killed

2014-07-11 Thread Srikrishna S
I am trying to run Logistic Regression on the url dataset (from libsvm) using the exact same code as the example on a 5-node YARN cluster. I get a pretty cryptic error that says "Killed". Nothing more. Settings: --master yarn-client --verbose --driver-memory 24G --executor-memory 24G

MLlib feature request

2014-07-11 Thread Joseph Feng
Hi all, My company is actively using spark machine learning library, and we would love to see Gradient Boosting Machine algorithm (and perhaps Adaboost algorithm as well) being implemented. I’d greatly appreciate it if anyone could help to move it forward or to elevate this request. Thanks,

Re: KMeans for large training data

2014-07-11 Thread Sean Owen
On Fri, Jul 11, 2014 at 7:32 PM, durin m...@simon-schaefer.net wrote: How would you get more partitions? You can specify this as the second arg to methods that read your data originally, like: sc.textFile(..., 20) I ran broadcastVector.value.repartition(5), but

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Wei Tan
Just curious: how about using scala to drive the workflow? I guess if you use other tools (oozie, etc) you lose the advantage of reading from RDD -- you have to read from HDFS. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research

Re: MLlib feature request

2014-07-11 Thread Ameet Talwalkar
Hi Joseph, Thanks for your email. Many users are requesting this functionality. While it would be a stretch for it to appear in Spark 1.1, various people (including Manish Amde and folks at the AMPLab, Databricks and Alpine Labs) are actively working on developing ensembles of decision trees

not getting output from socket connection

2014-07-11 Thread Walrus theCat
Hi, I have a java application that is outputting a string every second. I'm running the wordcount example that comes with Spark 1.0, and running nc -lk . When I type words into the terminal running netcat, I get counts. However, when I write the String onto a socket on port , I don't get

Re: not getting output from socket connection

2014-07-11 Thread Walrus theCat
I forgot to add that I get the same behavior if I tail -f | nc localhost on a log file. On Fri, Jul 11, 2014 at 1:25 PM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I have a java application that is outputting a string every second. I'm running the wordcount example that comes

Re: not getting output from socket connection

2014-07-11 Thread Sean Owen
netcat is listening for a connection on port . It echoes what you type into its console to anything that connects to it and reads. That is what Spark Streaming does. If you yourself connect to it and write, nothing happens except that netcat echoes it. This does not cause Spark to

Spark Questions

2014-07-11 Thread Gonzalo Zarza
Hi all, We've been evaluating Spark for a long-term project. Although we've been reading several topics in forum, any hints on the following topics we'll be extremely welcomed: 1. Which are the data partition strategies available in Spark? How configurable are these strategies? 2. How would be

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
The model for file stream is to pick up and process new files written atomically (by move) into a directory. So your file is being processed in a single batch, and then its waiting for any new files to be written into that directory. TD On Fri, Jul 11, 2014 at 11:46 AM, M Singh

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
The same executor can be used for both receiving and processing, irrespective of the deployment mode (yarn, spark standalone, etc.). It boils down to the number of cores / task slots that executor has. Each receiver is like a long-running task, so each of them occupies a slot. If there are free slots

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
Can you show us the program that you are running? If you are setting the number of partitions in the XYZ-ByKey operation as 300, then there should be 300 tasks for that stage, distributed over the 50 executors allocated to your context. However, the data distribution may be skewed, in which case you

Re: executor failed, cannot find compute-classpath.sh

2014-07-11 Thread Andrew Or
Hi CJ, Looks like I overlooked a few lines in the spark shell case. It appears that spark shell explicitly overwrites https://github.com/apache/spark/blob/f4f46dec5ae1da48738b9b650d3de155b59c4674/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L955 spark.home to whatever SPARK_HOME is

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Yan Fang
Hi Tathagata, Thank you. Is task slot equivalent to the core number? Or actually one core can run multiple tasks at the same time? Best, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The same executor can

pyspark sc.parallelize running OOM with smallish data

2014-07-11 Thread Mohit Jaggi
spark_data_array here has about 35k rows with 4k columns. I have 4 nodes in the cluster and gave 48g to executors. Also tried Kryo serialization. Traceback (most recent call last): File /mohit/./m.py, line 58, in module spark_data = sc.parallelize(spark_data_array) File

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
Hi Tathagata, Below is my main function. I omit some filtering and data conversion functions. These functions are just a one-to-one mapping, which should not noticeably increase running time. The only reduce function I have here is groupByKey. There are 4 topics in my Kafka brokers and two of the

Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Hi, I am trying GraphX on the LiveJournal data. I have a cluster of 17 computing nodes, 1 master and 16 workers. I have a few questions about this. * I built Spark from spark-master (to avoid the partitionBy error of Spark 1.0). * I am using edgeListFile() to load data and I figured I need to provide

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
If you are on 1.0.0 release you can also try converting your RDD to a SchemaRDD and run a groupBy there. The SparkSQL optimizer may yield better results. It's worth a try at least. On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Solution 2 is to map the
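
A minimal sketch of that idea on Spark 1.0.x (the record type and field names are made up; assumes a spark-shell sc):

    import org.apache.spark.sql.SQLContext

    case class Event(day: Int, value: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD    // implicit RDD[Event] -> SchemaRDD conversion

    val events = sc.parallelize(Seq(Event(0, 1.0), Event(0, 2.5), Event(1, 0.3)))
    events.registerAsTable("events")

    // Let the SQL optimizer handle the grouping by day.
    val perDay = sqlContext.sql("SELECT day, COUNT(*) FROM events GROUP BY day")
    perDay.collect().foreach(println)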

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
ssimanta wrote Solution 2 is to map the objects into a pair RDD where the key is the number of the day in the interval, then group by key, collect, and parallelize the resulting grouped data. However, I worry collecting large data sets is going to be a serious performance bottleneck. Why do

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
On Fri, Jul 11, 2014 at 2:23 PM, ShreyanshB shreyanshpbh...@gmail.com wrote: -- Is it a correct way to load the file to get the best performance? Yes, edgeListFile should be efficient at loading the edges. -- What should the partition size be: = number of computing nodes or = number of cores? In general it should be a
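
For reference, a minimal sketch of loading with an explicit partition count (the path, the value 16, and the partition strategy are assumptions):

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    // One edge partition per worker (or per core) is a common starting point.
    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///data/soc-LiveJournal1.txt", minEdgePartitions = 16)
      .partitionBy(PartitionStrategy.EdgePartition2D)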

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Thanks a lot Ankur, I'll follow that. One last quickie: does that error affect performance? ~Shreyansh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-optimal-partitions-for-a-graph-and-error-in-logs-tp9455p9462.html Sent from the Apache Spark User List

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
I think my best option is to partition my data in directories by day before running my Spark application, and then direct my Spark application to load RDD's from each directory when I want to load a date range. How does this sound? If your upstream system can write data by day then it makes

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread bdamos
Sean Owen-2 wrote Can you not just filter the range you want, then groupBy timestamp/86400 ? That sounds like your solution 1 and is about as fast as it gets, I think. Are you thinking you would have to filter out each day individually from there, and that's why it would be slow? I don't
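
A minimal sketch of the filter-then-groupBy(timestamp/86400) approach being discussed (the record type is made up; timestamps in seconds):

    import org.apache.spark.rdd.RDD

    case class Record(ts: Long, payload: String)

    def groupByDay(data: RDD[Record], start: Long, end: Long) =
      data.filter(r => r.ts >= start && r.ts < end)
          .groupBy(r => r.ts / 86400)            // key = day number since the epoch
          // RDD[(Long, Iterable[Record])]: per-day work can continue with mapValues etc.
          .mapValues(_.size)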

Streaming training@ Spark Summit 2014

2014-07-11 Thread SK
Hi, I tried out the streaming program on the Spark training web page. I created a Twitter app as per the instructions (pointing to http://www.twitter.com). When I run the program, my credentials get printed out correctly but thereafter, my program just keeps waiting. It does not print out the

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Sean Owen
On Fri, Jul 11, 2014 at 10:53 PM, bdamos a...@adobe.com wrote: I didn't make it clear in my first message that I want to obtain an RDD instead of an Iterable, and will be doing map-reduce like operations on the data by day. My problem is that groupBy returns an RDD[(K, Iterable[T])], but I

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
I don't think it should affect performance very much, because GraphX doesn't serialize ShippableVertexPartition in the fast path of mapReduceTriplets execution (instead it calls ShippableVertexPartition.shipVertexAttributes and serializes the result). I think it should only get serialized for

Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Li Pu
I like the idea of using scala to drive the workflow. Spark already comes with a scheduler, why not program a plugin to schedule other types of tasks (copy file, send email, etc.)? Scala could handle any logic required by the pipeline. Passing objects (including RDDs) between tasks is also easier.

Re: confirm subscribe to user@spark.apache.org

2014-07-11 Thread Veeranagouda Mukkanagoudar
On Fri, Jul 11, 2014 at 3:11 PM, user-h...@spark.apache.org wrote: Hi! This is the ezmlm program. I'm managing the user@spark.apache.org mailing list. To confirm that you would like veera...@gmail.com added to the user mailing list, please send a short reply to this address:

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-11 Thread Zongheng Yang
Hey Jerry, When you ran these queries using different methods, did you see any discrepancy in the returned results (i.e. the counts)? On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sorry. I think you are seeing some weirdness with partitioned tables that

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
You don't get any exception from twitter.com, saying credential error or something? I have seen this happen when someone was behind a VPN to their office, and twitter was probably blocked in their office. You could be having a similar issue. TD On Fri, Jul 11, 2014 at 2:57 PM, SK

try JDBC server

2014-07-11 Thread Nan Zhu
Hi, all I would like to give a try on JDBC server (which is supposed to be released in 1.1) where can I find the document about that? Best, -- Nan Zhu

Re: Some question about SQL and streaming

2014-07-11 Thread Tathagata Das
Yes, even though we don't have immediate plans, I definitely would like to see it happen some time in the not-so-distant future. TD On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai saisai.s...@intel.com wrote: No specific plans to do so, since there would be some functional loss like time based

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
I totally agree that if we are able to do this it will be very cool. However, this requires having a common trait / interface between RDD and DStream, which we don't have as of now. It would be very cool though. On my wish list for sure. TD On Thu, Jul 10, 2014 at 11:53 AM, mshah

Re: try JDBC server

2014-07-11 Thread Nan Zhu
nvm for others with the same question: https://github.com/apache/spark/commit/8032fe2fae3ac40a02c6018c52e76584a14b3438 -- Nan Zhu On Friday, July 11, 2014 at 7:02 PM, Nan Zhu wrote: Hi, all I would like to give a try on JDBC server (which is supposed to be released in 1.1) where

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread SK
I don't get any exceptions or error messages. I tried it both with and without VPN and had the same outcome. But I can try again without VPN later today and report back. thanks. -- View this message in context:

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
Hi folks, I just ran another job that only received data from Kafka, did some filtering, and then save as text files in HDFS. There was no reducing work involved. Surprisingly, the number of executors for the saveAsTextFiles stage was also 2 although I specified 300 executors in the job

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Do you have a proxy server ? If yes you need to set the proxy for twitter4j On Jul 11, 2014, at 7:06 PM, SK skrishna...@gmail.com wrote: I dont get any exceptions or error messages. I tried it both with and without VPN and had the same outcome. But I can try again without VPN later

Re: Generic Interface between RDD and DStream

2014-07-11 Thread andy petrella
A while ago, I wrote this: ``` package com.virdata.core.compute.common.api import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext import org.apache.spark.streaming.dstream.DStream import org.apache.spark.streaming.StreamingContext sealed trait SparkEnvironment extends Serializable

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread SK
I don't have a proxy server. thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-training-Spark-Summit-2014-tp9465p9481.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Linkage error - duplicate class definition

2014-07-11 Thread _soumya_
Facing a funny issue with the Spark class loader. Testing out a basic functionality on a vagrant VM with spark running - looks like it's attempting to ship the jar to a remote instance (in this case local) and somehow is encountering the jar twice? 14/07/11 23:27:59 INFO DAGScheduler: Got job 0

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Great! Thanks a lot. Hate to say this, but I promise this is the last quickie: I looked at the configurations but I didn't find any parameter to tune for network bandwidth, i.e. is there any way to tell GraphX (Spark) that I'm using a 1G network or 10G network or InfiniBand? Does it figure it out on its

spark ui on yarn

2014-07-11 Thread Koert Kuipers
I just tested a long-lived application (that we normally run in standalone mode) on yarn in client mode. It looks to me like cached RDDs are missing in the storage tab of the UI. Accessing the RDD storage information via the spark context shows RDDs as fully cached, but they are missing on

ML classifier and data format for dataset with variable number of features

2014-07-11 Thread SK
Hi, I need to perform binary classification on an image dataset. Each image is a data point described by a Json object. The feature set for each image is a set of feature vectors, each feature vector corresponding to a distinct object in the image. For example, if an image has 5 objects, its

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread Ankur Dave
Spark just opens up inter-slave TCP connections for message passing during shuffles (I think the relevant code is in ConnectionManager). Since TCP automatically determines the optimal sending rate (http://en.wikipedia.org/wiki/TCP_congestion-avoidance_algorithm), Spark doesn't need any

Re: Graphx : optimal partitions for a graph and error in logs

2014-07-11 Thread ShreyanshB
Perfect! Thanks Ankur. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-optimal-partitions-for-a-graph-and-error-in-logs-tp9455p9488.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ML classifier and data format for dataset with variable number of features

2014-07-11 Thread Xiangrui Meng
You can load the dataset as an RDD of JSON object and use a flatMap to extract feature vectors at object level. Then you can filter the training examples you want for binary classification. If you want to try multiclass, checkout DB's PR at https://github.com/apache/spark/pull/1379 Best, Xiangrui
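
A minimal sketch of that flatMap (the JSON field names and structure are invented, since the actual schema isn't shown in the thread; assumes a spark-shell sc):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import scala.util.parsing.json.JSON

    val labeled = sc.textFile("hdfs:///data/images.json").flatMap { line =>
      JSON.parseFull(line) match {
        case Some(m: Map[_, _]) =>
          val obj = m.asInstanceOf[Map[String, Any]]
          val label = obj("label").asInstanceOf[Double]
          // one feature vector per object in the image -> several examples per image
          obj("objects").asInstanceOf[List[List[Double]]].map { fv =>
            LabeledPoint(label, Vectors.dense(fv.toArray))
          }
        case _ => Nil
      }
    }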

Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Try running a simple standalone program if you are using Scala and see if you are getting any data. I use this to debug any connection/twitter4j issues. import twitter4j._ //put your keys and creds here object Util { val config = new twitter4j.conf.ConfigurationBuilder()

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
Does nothing get printed on the screen? If you are not getting any tweets but Spark Streaming is running successfully, you should at least get counts printed every batch (which would be zero). If they are not being printed either, check the Spark web UI to see whether stages are running or not. If

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
Task slot is equivalent to core number. So one core can only run one task at a time. TD On Fri, Jul 11, 2014 at 1:57 PM, Yan Fang yanfang...@gmail.com wrote: Hi Tathagata, Thank you. Is task slot equivalent to the core number? Or actually one core can run multiple tasks at the same time?

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
Aah, I get it now. That is because the input data stream is replicated on two machines, so by locality the data is processed on those two machines. So the map stage on the data uses 2 executors, but the reduce stage (after groupByKey) and the saveAsTextFiles would use 300 tasks. And the default

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-11 Thread Mohit Jaggi
I put the same dataset into Scala (using spark-shell) and it acts weird. I cannot do a count on it; the executors seem to hang. The WebUI shows 0/96 in the status bar and shows details about the worker nodes, but there is no progress. sc.parallelize does finish (takes too long for the data size) in

Re: Spark Streaming timing considerations

2014-07-11 Thread Tathagata Das
This is not in the current streaming API. Queue stream is useful for testing with generated RDDs, but not for actual data. For an actual data stream, the slack time can be implemented by doing DStream.window over a larger window that takes the slack time into consideration, and then the required
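
A minimal sketch of that windowing idea (batch interval and slack values are assumed; stream stands for an existing DStream):

    import org.apache.spark.streaming.Seconds

    val batchInterval = Seconds(10)
    val slack = Seconds(5)

    // Each emitted RDD covers batchInterval + slack worth of received data,
    // produced once per batchInterval, so late records within the slack are included.
    val withSlack = stream.window(batchInterval + slack, batchInterval)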

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
Hey Andy, That's pretty cool!! Is there a github repo where you can share this piece of code for us to play around with? If we can come up with a simple enough general pattern, that can be very useful! TD On Fri, Jul 11, 2014 at 4:12 PM, andy petrella andy.petre...@gmail.com wrote: A while ago, I

Re: Number of executors change during job running

2014-07-11 Thread Bill Jay
Hi Tathagata, Do you mean that the data is not shuffled until the reduce stage? That means groupBy still only uses 2 machines? I think I used repartition(300) after I read the data from Kafka into DStream. It seems that it did not guarantee that the map or reduce stages will be run on 300

Re: Announcing Spark 1.0.1

2014-07-11 Thread Henry Saputra
Congrats to the Spark community ! On Friday, July 11, 2014, Patrick Wendell pwend...@gmail.com wrote: I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.0 includes fixes across several areas of Spark, including the core

RE: EC2 Cluster script. Shark install fails

2014-07-11 Thread Jason H
Thanks Michael. Missed that point, as well as the integration of SQL within the scala shell (with setting the SQLContext). Looking forward to feature parity in future releases (Shark → Spark SQL). Cheers. From: mich...@databricks.com Date: Thu, 10 Jul 2014 16:20:20 -0700 Subject: Re: EC2

Re: One question about RDD.zip function when trying Naive Bayes

2014-07-11 Thread x
I tried my test case with Spark 1.0.1 and saw the same result (27 pairs become 25 pairs after zip). Could someone please check it? Regards, xj On Thu, Jul 3, 2014 at 2:31 PM, Xiangrui Meng men...@gmail.com wrote: This is due to a bug in sampling, which was fixed in 1.0.1 and latest master.