Re: amp lab spark streaming twitter example

2014-08-26 Thread Akhil Das
I think your *sparkUrl* points to an invalid cluster URL. Just make sure you are giving the correct URL (the one you see at the top left of the master:8080 web UI). Thanks Best Regards On Tue, Aug 26, 2014 at 11:07 AM, Forest D dev24a...@gmail.com wrote: Hi Jonathan, Thanks for the reply. I ran

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
With a lower number of partitions, I keep losing executors during collect at KMeans.scala:283 The error message is ExecutorLostFailure (executor lost). The program recovers by automatically repartitioning the whole dataset (126G), which takes very long and seems to only delay the inevitable

Re: Spark webUI - application details page

2014-08-26 Thread Akhil Das
Have a look at the history server, looks like you have enabled history server on your local and not on the remote server. http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html Thanks Best Regards On Tue, Aug 26, 2014 at 7:01 AM, SK skrishna...@gmail.com wrote: Hi, I am

Re: Block input-* already exists on this machine; not re-adding it warnings

2014-08-26 Thread Aniket Bhatnagar
Answering my own question, it seems that the warnings are expected as explained by TD @ http://apache-spark-user-list.1001560.n3.nabble.com/streaming-questions-td3281.html . Here is what he wrote: Spark Streaming is designed to replicate the received data within the machines in a Spark cluster

Re: How do you hit breakpoints using IntelliJ In functions used by an RDD

2014-08-26 Thread Akhil Das
You need to run your app in local mode (i.e. master=local[2]) to debug it locally. If you are running it on a cluster, then you can use the remote debugging feature. http://stackoverflow.com/questions/19128264/how-to-remote-debug-in-intellij-12-1-4 For remote debugging, you need to pass the
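A minimal sketch of passing such a debug option through SparkConf (the app name and port 5005 are assumptions; an IntelliJ "Remote" run configuration would then attach to the executor host on that port):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hedged sketch: make each executor JVM listen for a remote debugger.
    // suspend=n keeps executors from blocking on start-up while waiting for the IDE.
    val conf = new SparkConf()
      .setAppName("RemoteDebugExample")
      .set("spark.executor.extraJavaOptions",
        "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
    val sc = new SparkContext(conf)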

Re: Request for Help

2014-08-26 Thread Akhil Das
Hi, Not sure this is the right way of doing it, but if you can create a PairRDDFunctions from that RDD then you can use the following piece of code to access the filenames from the RDD. PairRDDFunctions<K, V> ds = ...; //getting the

My Post Related Query

2014-08-26 Thread Sandeep Vaid
I wrote a post on this forum but it shows the message "This post has NOT been accepted by the mailing list yet." above my post. How long will it take for it to get posted? Regards, Sandeep Vaid +91 - 09881710301

Re: Running Wordcount on large file stucks and throws OOM exception

2014-08-26 Thread motte1988
Hello, it's me again. Now I've got an explanation for the behaviour. It seems that the driver memory is not large enough to hold the whole result set of saveAsTextFile in memory, and then the OOM occurs. I tested it with a filter step that removes KV pairs with word counts smaller than 100,000. So now the job

Re: Pair RDD

2014-08-26 Thread Yanbo Liang
val node = textFile.map(line => { val fields = line.split("\\s+"); (fields(1), fields(2)) }) then you can manipulate the node RDD with the PairRDD functions. 2014-08-26 12:55 GMT+08:00 Deep Pradhan pradhandeep1...@gmail.com: Hi, I have an input file of a graph in the format source_node
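A self-contained sketch of the same idea (the input path and whitespace-separated layout are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    object PairRddExample extends App {
      val sc = new SparkContext(new SparkConf().setAppName("PairRddExample").setMaster("local[2]"))

      // Assumed input: whitespace-separated lines such as "<id> <source_node> <dest_node>"
      val textFile = sc.textFile("graph.txt")

      val node = textFile.map { line =>
        val fields = line.split("\\s+")
        (fields(1), fields(2))          // key on the second column
      }

      // Pair RDD operations (groupByKey, reduceByKey, join, ...) now apply.
      node.groupByKey().collect().foreach(println)

      sc.stop()
    }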

Re: Trying to run SparkSQL over Spark Streaming

2014-08-26 Thread praveshjain1991
Thanks for the reply. Yeah, it doesn't seem doable straight away. Someone suggested this: For each of your streams, first create an empty RDD that you register as a table, obtaining an empty table. For your example, let's say you call it allTeenagers. Then, for each of your queries, use SchemaRDD's

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-26 Thread BertrandR
I actually tried without unpersisting, but given the performance I tried to add these in order to free the memory. After your answer I tried to remove them again, but without any change in the execution time... Looking at the web interface, I can see that the mapPartitions at GraphImpl.scala:184

Re: Losing Executors on cluster with RDDs of 100GB

2014-08-26 Thread MEETHU MATHEW
Hi, Please give it a try by changing the worker memory such that worker memory > executor memory. Thanks & Regards, Meethu M On Friday, 22 August 2014 5:18 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi all, I have a spark cluster of 30 machines, 16GB / 8 cores on each running in standalone

Re: Printing the RDDs in SparkPageRank

2014-08-26 Thread Deep Pradhan
println(parts(0)) does not solve the problem. It does not work On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen so...@cloudera.com wrote: On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: When I add parts(0).collect().foreach(println)

Key-Value in PairRDD

2014-08-26 Thread Deep Pradhan
I have the following code: val nodes = lines.map(s => { val fields = s.split("\\s+"); (fields(0), fields(1)) }).distinct().groupByKey().cache() and when I print out the nodes RDD I get the following: (4,ArrayBuffer(1)) (2,ArrayBuffer(1)) (3,ArrayBuffer(1)) (1,ArrayBuffer(3, 2,
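A small follow-on sketch, assuming nodes is the RDD built above; mapValues turns the grouped ArrayBuffers into something easier to print or count:

    // Number of neighbours per node, e.g. (1, 2) instead of (1, ArrayBuffer(3, 2))
    val neighbourCounts = nodes.mapValues(_.size)

    // Or a comma-joined neighbour list, e.g. (1, "3,2")
    val neighbourLists = nodes.mapValues(_.mkString(","))

    neighbourCounts.collect().foreach(println)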

Re: Key-Value in PairRDD

2014-08-26 Thread Sean Owen
I'd suggest first reading the scaladoc for RDD and PairRDDFunctions to familiarize yourself with all the operations available: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

Re: Storage Handlers in Spark SQL

2014-08-26 Thread chutium
It seems he means to query an RDBMS or Cassandra using Spark SQL, i.e. multiple data sources for Spark SQL. I looked through the link he posted https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-CreatingHivetablesforvariousdatasources using their

spark.default.parallelism bug?

2014-08-26 Thread Grzegorz Białek
Hi, consider the following code: import org.apache.spark.{SparkContext, SparkConf} object ParallelismBug extends App { var sConf = new SparkConf() .setMaster("spark://hostName:7077") // .setMaster("local[4]") .set("spark.default.parallelism", "7") // or without it val sc = new
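A runnable reconstruction of the reported program (hostName is a placeholder; the parallelize/groupByKey step is added only to show where spark.default.parallelism should take effect):

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismBug extends App {
      val sConf = new SparkConf()
        .setMaster("spark://hostName:7077")        // or .setMaster("local[4]")
        .set("spark.default.parallelism", "7")     // or without it
      val sc = new SparkContext(sConf)

      // Operations without an explicit partition count should pick up the default (7).
      val grouped = sc.parallelize(1 to 100).map(i => (i % 10, i)).groupByKey()
      println(s"partitions: ${grouped.partitions.length}")

      sc.stop()
    }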

Re: Spark Streaming Output to DB

2014-08-26 Thread Ravi Sharma
Hello People, I'm using Java Spark Streaming. I'm just wondering, can I make a simple JDBC connection in the JavaDStream map() method? Or do I need to create a JDBC connection for each JavaPairDStream after the map task? Kindly give your thoughts. Cheers, Ravi Sharma

Prevent too many partitions

2014-08-26 Thread Grzegorz Białek
Hi, I have many union operations in my application. But union increases the number of partitions of the resulting RDDs, and performance with more partitions is sometimes very slow. Is there any cleaner way to prevent the number of partitions from increasing than adding coalesce(numPartitions) after each union?
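A minimal sketch of the pattern in question, assuming sc is an existing SparkContext: union sums the partition counts, and a shuffle-free coalesce caps them again:

    val a = sc.parallelize(1 to 100, 4)
    val b = sc.parallelize(101 to 200, 4)

    val unioned = a.union(b)            // 4 + 4 = 8 partitions
    val capped  = unioned.coalesce(4)   // back to 4 partitions, no shuffle

    println(s"${unioned.partitions.length} -> ${capped.partitions.length}")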

Re: Spark Streaming Output to DB

2014-08-26 Thread Akhil Das
Yes, you can open a JDBC connection at the beginning of the map method, then close this connection at the end of map(), and in between you can use this connection. Thanks Best Regards On Tue, Aug 26, 2014 at 6:12 PM, Ravi Sharma raviprincesha...@gmail.com wrote: Hello People, I'm using java

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
Hi, As I understand, your problem is similar to this JIRA. https://issues.apache.org/jira/browse/SPARK-1647 The issue in this case, Kafka can not replay the message as offsets are already committed. Also I think existing KafkaUtils ( The Default High Level Kafka Consumer) also have this issue.

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
For the record, I'm using Chrome 36.0.1985.143 on 10.9.4 as well. Maybe it's a Chrome add-on I'm running? Anyway, as Matei pointed out, if I change the https to http, it works fine. On Tue, Aug 26, 2014 at 1:46 AM, Michael Hausenblas michael.hausenb...@gmail.com wrote:

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
Hi Bharat, Thanks for your email. If the Kafka Reader worker process dies, it will be replaced by different machine, and it will start consuming from the offset where it left over ( for each partition). Same case can happen even if I tried to have individual Receiver for every partition.

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 10:28 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Maybe it's a Chrome add-on I'm running? Hmm, scratch that. Trying in incognito mode (which disables add-ons, I believe) also yields the same behavior. Nick

Re: Spark SQL Parser error

2014-08-26 Thread Yin Huai
I have not tried it. But I guess you need to add your credentials in the S3 path. Or, can you copy the jar to your driver node and try again? On Sun, Aug 24, 2014 at 9:35 AM, S Malligarjunan smalligarju...@yahoo.com wrote: Hello Yin, Additional note: In ./bin/spark-shell --jars

What is a Block Manager?

2014-08-26 Thread Victor Tso-Guillen
I'm curious not only about what they do, but what their relationship is to the rest of the system. I find that I get listener events for n block managers added where n is also the number of workers I have available to the application. Is this a stable constant? Also, are there ways to determine

Re: Spark webUI - application details page

2014-08-26 Thread SK
I have already tried setting the history server and accessing it on master-url:18080 as per the link. But the page does not list any completed applications. As I mentioned in my previous mail, I am running Spark in standalone mode on the cluster (as well as on my local machine). According to the

Re: unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-26 Thread Yin Huai
Hello Du, Can you check if there is a dir named metastore in the place where you launch your program? If so, can you delete it and try again? Also, can you try HiveContext? LocalHiveContext is deprecated. Thanks, Yin On Mon, Aug 25, 2014 at 6:33 PM, Du Li l...@yahoo-inc.com.invalid wrote: Hi, I

Submit to the Powered By Spark Page!

2014-08-26 Thread Patrick Wendell
Hi All, I want to invite users to submit to the Spark Powered By page. This page is a great way for people to learn about Spark use cases. Since Spark activity has increased a lot in the higher level libraries and people often ask who uses each one, we'll include information about which

Spark Streaming - Small file in HDFS

2014-08-26 Thread Ravi Sharma
Hi People, I'm using Java Kafka Spark Streaming and saving the result files into HDFS. As per my understanding, Spark Streaming writes every processed message or event to an HDFS file. The reason for creating one file per message or event could be to ensure fault tolerance. Is there any way spark handle

Re: SPARK Hive Context UDF Class Not Found Exception,

2014-08-26 Thread S Malligarjunan
Hello Michael, I have executed git pull now. As per the pom version entry, it is 1.1.0-SNAPSHOT.   Thanks and Regards, Sankar S.   On Tuesday, 26 August 2014, 1:00, Michael Armbrust mich...@databricks.com wrote: Which version of Spark SQL are you using?  Several issues with custom hive UDFs

Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
How many partitions now? Btw, which Spark version are you using? I checked your code and I don't understand why you want to broadcast vectors2, which is an RDD. var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) var broadcastVector =

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
Right now, I have issues even at a far earlier point. I'm fetching data from a registered table via var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) //persisted because it's used again

Kinesis receiver spark streaming partition

2014-08-26 Thread Wei Liu
We are exploring using Kinesis and Spark Streaming together. I took a look at the Kinesis receiver code in 1.1.0. I have a question regarding Kinesis partitions vs. Spark Streaming partitions. It seems to be pretty difficult to align these partitions. Kinesis partitions a stream of data into

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Matei Zaharia
It should be fixed now. Maybe you have a cached version of the page in your browser. Open DevTools (Cmd-Shift-I), press the gear icon, and check "disable cache while DevTools is open", then refresh the page to refresh without cache. Matei On August 26, 2014 at 7:31:18 AM, Nicholas Chammas

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
Confirmed. Works now. Thanks Matei. (BTW, on OS X Command + Shift + R also refreshes the page without cache.) On Tue, Aug 26, 2014 at 3:06 PM, Matei Zaharia matei.zaha...@gmail.com wrote: It should be fixed now. Maybe you have a cached version of the page in your browser. Open DevTools

OutofMemoryError when generating output

2014-08-26 Thread SK
Hi, I have the following piece of code that I am running on a cluster with 10 nodes with 2GB memory per node. The tasks seem to complete, but at the point where it is generating output (saveAsTextFile), the program freezes after some time and reports an out of memory error (error transcript

Spark 1.1. doesn't work with hive context

2014-08-26 Thread S Malligarjunan
Hello all, I have just checked out branch-1.1 and executed the below commands: ./bin/spark-shell --driver-memory 1G val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT,

Re: disable log4j for spark-shell

2014-08-26 Thread Aaron
If someone doesn't have the access to do that, is there any easy way to specify a different properties file to be used? Patrick Wendell wrote: If you want to customize the logging behavior - the simplest way is to copy conf/log4j.properties.template to conf/log4j.properties. Then you can go and

Re: OutofMemoryError when generating output

2014-08-26 Thread Burak Yavuz
Hi, The error doesn't occur during saveAsTextFile but rather during the groupByKey as far as I can tell. We strongly urge users not to use groupByKey if they don't have to. What I would suggest is the following work-around: sc.textFile(baseFile).map { line => val fields = line.split("\t")
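A hedged sketch of that work-around using reduceByKey, so counts are combined map-side instead of materialising per-key groups (baseFile, the tab-separated layout, keying on the first field, and the output path are assumptions from the thread):

    val counts = sc.textFile(baseFile)
      .map { line =>
        val fields = line.split("\t")
        (fields(0), 1L)                 // one count per occurrence of the key
      }
      .reduceByKey(_ + _)               // no per-key value lists are ever built
      .filter { case (_, count) => count >= 100000 }

    counts.saveAsTextFile("counts-output")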

Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
Hi David, Your job is probably hanging on the groupByKey process. Probably GC is kicking in and the process starts to hang or the data is unbalanced and you end up with stragglers (Once GC kicks in you'll start to get the connection errors you shared). If you don't care about the list of

Re: Out of memory on large RDDs

2014-08-26 Thread Andrew Ash
Hi Grega, Did you ever get this figured out? I'm observing the same issue in Spark 1.0.2. For me it was after 1.5hr of a large .distinct call, followed by a .saveAsTextFile() 14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18500 14/08/26 20:57:43 INFO

Re: Does HiveContext support Parquet?

2014-08-26 Thread lyc
Hi Silvio, I re-downloaded hive-0.12-bin and reset the related path in spark-env.sh. However, I still got some errors. Do you happen to know which step I did wrong? Thank you! My detailed steps are as follows: #enter spark-shell (successful) /bin/spark-shell --master spark://S4:7077 --jars

Re: countByWindow save the count ?

2014-08-26 Thread Josh J
Thanks. I'm just confused on the syntax; I'm not sure which variables hold the value of the count or where it is stored so that I can save it. Any examples or tips? On Mon, Aug 25, 2014 at 9:49 PM, Daniil Osipov daniil.osi...@shazam.com wrote: You could try to use foreachRDD on the result of
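A sketch of the foreachRDD approach, assuming stream is an existing DStream and the window/slide durations are placeholders; countByWindow yields a DStream[Long] holding one count per batch:

    import org.apache.spark.streaming.Seconds

    val counts = stream.countByWindow(Seconds(30), Seconds(10))

    counts.foreachRDD { rdd =>
      // The RDD holds at most one element: the count for the current window.
      rdd.collect().foreach { count =>
        println(s"events in window: $count")   // or persist it to a file / DB here
      }
    }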

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Chris Fregly
I've seen this done using mapPartitions() where each partition represents a single, multi-line JSON file. You can rip through each partition (JSON file) and parse the JSON doc as a whole. This assumes you use sc.textFile("path/*.json") or equivalent to load in multiple files at once. Each json

Re: Spark-Streaming collect/take functionality.

2014-08-26 Thread Chris Fregly
Good suggestion, TD. And I believe the optimization that jon.burns is referring to - from the big data mini course - is a step earlier: the sorting mechanism that produces sortedCounts. You can use mapPartitions() to get a top k locally on each partition, then shuffle only (k * # of partitions)
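A sketch of that per-partition top-k step, assuming counts is an RDD[(String, Long)] of (word, count) pairs and k is small:

    val k = 10

    val topK = counts
      .mapPartitions { iter =>
        // Keep only the k largest counts from this partition.
        iter.toList.sortBy { case (_, c) => -c }.take(k).iterator
      }
      .collect()                        // only k * numPartitions records come back
      .sortBy { case (_, c) => -c }
      .take(k)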

Specifying classpath

2014-08-26 Thread Ashish Jain
Hello, I'm using the following version of Spark - 1.0.0+cdh5.1.0+41 (1.cdh5.1.0.p0.27). I've tried to specify the libraries Spark uses in the following ways - 1) Adding it to the spark context 2) Specifying the jar path in a) spark.executor.extraClassPath b) spark.executor.extraLibraryPath

Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Victor Tso-Guillen
I wanted to make sure that there's full compatibility between minor releases. I have a project that has a dependency on spark-core so that it can be a driver program and that I can test locally. However, when connecting to a cluster you don't necessarily know what version you're connecting to. Is

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-26 Thread Ankur Dave
At 2014-08-26 01:20:09 -0700, BertrandR bertrand.rondepierre...@gmail.com wrote: I actually tried without unpersisting, but given the performance I tried to add these in order to free the memory. After your answer I tried to remove them again, but without any change in the execution time...

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
Is this a standalone mode cluster? We don't currently make this guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem though is that the standalone mode grabs the executors' version of Spark code from what's installed on the cluster, while your driver might be built against

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Matei Zaharia
You can use sc.wholeTextFiles to read each file as a complete String, though it requires each file to be small enough for one task to process. On August 26, 2014 at 4:01:45 PM, Chris Fregly (ch...@fregly.com) wrote: i've seen this done using mapPartitions() where each partition represents a
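A sketch of that approach, assuming sc is a SparkContext, one JSON document per file, and json4s (bundled with Spark) purely as an illustrative parser:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    // (path, fullContents) pairs; each file must be small enough for one task to parse.
    val docs = sc.wholeTextFiles("hdfs:///data/json/*.json")
      .map { case (path, contents) => (path, parse(contents)) }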

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Victor Tso-Guillen
Yes, we are standalone right now. Do you have literature on why one would want to consider Mesos or YARN for Spark deployments? Sounds like I should try upgrading my project and seeing if everything compiles without modification. Then I can connect to an existing 1.0.0 cluster and see what

CUDA in spark, especially in MLlib?

2014-08-26 Thread Wei Tan
Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei

RE: What is a Block Manager?

2014-08-26 Thread Liu, Raymond
Basically, a Block Manager manages the storage for most of the data in Spark. To name a few: blocks that represent a cached RDD partition, intermediate shuffle data, broadcast data, etc. It is per executor, while in standalone mode, normally, you have one executor per worker. You don't control how

Re: CUDA in spark, especially in MLlib?

2014-08-26 Thread Matei Zaharia
You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
Things will definitely compile, and apps compiled on 1.0.0 should even be able to link against 1.0.2 without recompiling. The only problem is if you run your driver with 1.0.0 on its classpath, but the cluster has 1.0.2 in executors. For Mesos and YARN vs standalone, the difference is that they

Ask for help, how to integrate Sparkstreaming and IBM MQ

2014-08-26 Thread 35597...@qq.com
hi, dear Now I am working on a project in the below scenario. We will use Spark Streaming to receive data from IBM MQ. I checked the API document of streaming; it only supports ZeroMQ, Kafka, etc. I have some questions: 1. we can use the MQTT protocol to get data in this scenario, right? any other

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Nan Zhu
Hi, Victor, the issue with having different versions in the driver and the cluster is that the master will shut down your application due to an inconsistent serialVersionUID in ExecutorState. Best, -- Nan Zhu On Tuesday, August 26, 2014 at 10:10 PM, Matei Zaharia wrote: Things will

Execute HiveFormSpark ERROR.

2014-08-26 Thread CharlieLin
hi, all: I tried to use Spark SQL in the spark-shell, as in the Spark examples. When I execute: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._ hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") it then reports an error like below: scala> hiveContext.hql("CREATE

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Chris Fregly
Great work, Dibyendu. Looks like this would be a popular contribution. Expanding on Bharat's question a bit: what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams as in the kinesis example here:

Re: Spark Streaming Output to DB

2014-08-26 Thread Mayur Rustagi
I would suggest you use the JDBC connector in mapPartitions instead of map, as JDBC connections are costly and can really impact your performance. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Aug 26, 2014 at 6:45 PM,
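A hedged sketch of that pattern with foreachPartition, so one JDBC connection serves a whole partition rather than one per record (dstream, the JDBC URL, credentials, and table are placeholders):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One connection per partition, reused for every record in it.
        val conn = java.sql.DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO events (payload) VALUES (?)")
        try {
          records.foreach { record =>
            stmt.setString(1, record.toString)
            stmt.executeUpdate()
          }
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }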

Re: What is a Block Manager?

2014-08-26 Thread Victor Tso-Guillen
We're a single-app deployment, so we want to launch as many executors as the system has workers. We accomplish this by not configuring the max for the application. However, is there really no way to inspect what machines/executor ids/number of workers/etc. are available in context? I'd imagine that