Re: Spark job resource allocation best practices

2014-11-04 Thread Romi Kuntsman
How can I configure Mesos allocation policy to share resources between all current Spark applications? I can't seem to find it in the architecture docs. *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Tue, Nov 4, 2014 at 9:11 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Yes.

RE: Key-Value decomposition

2014-11-04 Thread Suraj Satishkumar Sheth
Hi David, Use something like: val outputRDD = rdd.flatMap(keyValue => keyValue._2.split(";").map(value => (keyValue._1, value)).toArray) Thanks and Regards, Suraj Sheth -Original Message- From: david [mailto:david...@free.fr] Sent: Tuesday, November 04, 2014 1:28 PM To:

Re: Spark Streaming - Most popular Twitter Hashtags

2014-11-04 Thread Akhil Das
This might help https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala Thanks Best Regards On Tue, Nov 4, 2014 at 6:03 AM, Harold Nguyen har...@nexgate.com wrote: Hi all, I was just reading this nice documentation

R: Spark Kafka Performance

2014-11-04 Thread Eduardo Alfaia
Hi Gwen, I have changed the Java code KafkaWordCount to use reduceByKeyAndWindow in Spark. - Original Message - From: Gwen Shapira gshap...@cloudera.com Sent: 03/11/2014 21:08 To: us...@kafka.apache.org us...@kafka.apache.org Cc: u...@spark.incubator.apache.org

save as JSON objects

2014-11-04 Thread Andrejs Abele
Hi, Can someone please suggest the best way to output Spark data as a JSON file (a file where each line is a JSON object)? Cheers, Andrejs

Re: Spark job resource allocation best practices

2014-11-04 Thread Akhil Das
You can look at the different modes over here http://docs.sigmoidanalytics.com/index.php/Spark_On_Mesos#Mesos_Run_Modes These people have a very good tutorial to get you started http://mesosphere.com/docs/tutorials/run-spark-on-mesos/#overview Thanks Best Regards On Tue, Nov 4, 2014 at 1:44 PM, Romi

Re: Spark job resource allocation best practices

2014-11-04 Thread Romi Kuntsman
I have a single Spark cluster, not multiple frameworks and not multiple versions. Is it relevant for my use-case? Where can I find information about exactly how to make Mesos tell Spark how many resources of the cluster to use? (instead of the default take-all) *Romi Kuntsman*, *Big Data

Re: Spark job resource allocation best practices

2014-11-04 Thread Akhil Das
You need to install Mesos on your cluster. Then you will run your Spark applications by specifying the Mesos master (mesos://...) instead of the Spark one (spark://...). Spark can run over Mesos in two modes: “*fine-grained*” (default) and “ *coarse-grained*”. In “*fine-grained*” mode (default), each Spark task runs as
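
For reference, a minimal sketch of switching an application to coarse-grained mode and capping its share of the cluster (the Mesos master address and the core count below are placeholders, not values from this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://mesos-master:5050")    // placeholder Mesos master URL
      .setAppName("MyApp")
      .set("spark.mesos.coarse", "true")         // switch from the default fine-grained mode
      .set("spark.cores.max", "6")               // cap the cores this application may hold
    val sc = new SparkContext(conf)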

Re: Spark job resource allocation best practices

2014-11-04 Thread Romi Kuntsman
Let's say that I run Spark on Mesos in fine-grained mode, and I have 12 cores and 64GB memory. I run application A on Spark, and some time after that (but before A finished) application B. How many CPUs will each of them get? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Tue,

Re: java.io.NotSerializableException: org.apache.spark.SparkEnv

2014-11-04 Thread sivarani
Same Issue .. How did you solve it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-org-apache-spark-SparkEnv-tp10641p18047.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Got java.lang.SecurityException: class javax.servlet.FilterRegistration's when running job from intellij Idea

2014-11-04 Thread Sean Owen
Generally this means you included some javax.servlet dependency in your project deps. You should exclude these, as they conflict badly with the copy of the servlet API that ships with Spark. On Tue, Nov 4, 2014 at 7:55 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I have a
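
As one illustration, a hedged sbt fragment for excluding servlet-api transitively; the hadoop-client dependency here is only an example of a common culprit, not necessarily the one in this build:

    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0" excludeAll (
      ExclusionRule(organization = "javax.servlet")
    )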

Re: java.io.NotSerializableException: org.apache.spark.SparkEnv

2014-11-04 Thread Akhil Das
Can you paste the piece of code that you are running? Thanks Best Regards On Tue, Nov 4, 2014 at 3:24 PM, sivarani whitefeathers...@gmail.com wrote: Same Issue .. How did you solve it? -- View this message in context:

RE: Key-Value decomposition

2014-11-04 Thread david
Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Key-Value-decomposition-tp17966p18050.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To

loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
Hi, My application needs to query data loaded into the Spark context (I mean a SchemaRDD loaded from JSON file(s)). For this purpose, I created the SchemaRDD and called the registerTempTable method in a standalone program and submitted the application using the spark-submit command. Then I have

Re: How to make sure a ClassPath is always shipped to workers?

2014-11-04 Thread Akhil Das
You can add your custom jar to the SPARK_CLASSPATH inside the spark-env.sh file and restart the cluster to get it shipped to all the workers. You can also use the .setJars option and add the jar while creating the SparkContext. Thanks Best Regards On Tue, Nov 4, 2014 at 8:12 AM, Peng Cheng
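
A minimal sketch of the second option, assuming /path/to/custom.jar is a placeholder for the jar that must reach the workers:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")          // placeholder master URL
      .setAppName("MyApp")
      .setJars(Seq("/path/to/custom.jar"))       // shipped to every executor
    val sc = new SparkContext(conf)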

Re: netty on classpath when using spark-submit

2014-11-04 Thread M. Dale
Tobias, From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property: spark.files.userClassPathFirst Whether to give user-added jars precedence over Spark's own jars when loading classes in Executors. This feature can be used to mitigate

stdout in spark applications

2014-11-04 Thread lokeshkumar
Hi Forum, I am running a simple Spark application with 1 master and 1 worker, submitting my application through spark-submit as a Java program. I have System.out.println calls in the program, but I am not finding this output in the stdout/stderr links in the master's web UI, nor in the SPARK_HOME/work directory.

Re: Spark SQL takes unexpected time

2014-11-04 Thread Corey Nolet
Michael, I should probably look closer myself @ the design of 1.2 vs 1.1 but I've been curious why Spark's in-memory data uses the heap instead of putting it off heap? Was this the optimization that was done in 1.2 to alleviate GC? On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari

Spark Streaming getOrCreate

2014-11-04 Thread sivarani
Hi All, I am using Spark Streaming.. public class SparkStreaming { SparkConf sparkConf = new SparkConf().setAppName("Sales"); JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(5000)); String chkPntDir = ; //get checkpoint dir jssc.checkpoint(chkPntDir); JavaSpark jSpark =
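
For reference, a Scala sketch of the getOrCreate pattern the subject line refers to; the checkpoint directory is a placeholder and the real code should build its whole DStream graph inside the creating function:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("Sales")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint("hdfs:///checkpoints/sales")   // placeholder checkpoint dir
      // ... define the DStreams and output operations here ...
      ssc
    }

    // recovers from the checkpoint if one exists, otherwise calls createContext
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/sales", createContext _)
    ssc.start()
    ssc.awaitTermination()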

MEMORY_ONLY_SER question

2014-11-04 Thread Mohit Jaggi
Folks, If I have an RDD persisted in MEMORY_ONLY_SER mode and then it is needed for a transformation/action later, is the whole partition of the RDD deserialized into Java objects first before my transform/action code works on it? Or is it deserialized in a streaming manner as the iterator moves

Re: SparkSQL - No support for subqueries in 1.2-snapshot?

2014-11-04 Thread Michael Armbrust
This is not supported yet. It would be great if you could open a JIRA (though I think apache JIRA is down ATM). On Tue, Nov 4, 2014 at 9:40 AM, Terry Siu terry@smartfocus.com wrote: I’m trying to execute a subquery inside an IN clause and am encountering an unsupported language feature
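
Until that lands, a commonly used workaround is to express the IN (SELECT ...) predicate as a LEFT SEMI JOIN, which HiveQL does support. A hedged sketch with made-up table and column names:

    // hypothetical tables: orders(customer_id, ...) and vip_customers(customer_id)
    val result = hiveContext.sql("""
      SELECT o.*
      FROM orders o
      LEFT SEMI JOIN vip_customers v ON o.customer_id = v.customer_id
    """)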

Fwd: Master example.MovielensALS

2014-11-04 Thread Debasish Das
Hi, I just built the master today and I was testing the IR metrics (MAP and prec@k) on Movielens data to establish a baseline... I am getting a weird error which I have not seen before: MASTER=spark://TUSCA09LMLVT00C.local:7077 ./bin/run-example mllib.MovieLensALS --kryo --lambda 0.065

Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
I'm using the same code https://github.com/apache/spark/blob/83b7a1c6503adce1826fc537b4db47e534da5cae/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L687, though I still receive: not enough arguments for method sortBy: (f: String => K, ascending: Boolean, numPartitions: Int)(implicit ord:
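
A minimal sketch of a sortBy call with every argument supplied explicitly, in case the defaults or the implicit Ordering are not being resolved in your code (the data is made up):

    val words = sc.parallelize(Seq("banana", "apple", "cherry"))
    val sorted = words.sortBy(w => w, ascending = true, numPartitions = 2)
    sorted.collect()   // Array(apple, banana, cherry)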

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread Michael Armbrust
Temporary tables are local to the context that creates them (just like RDDs). I'd recommend saving the data out as Parquet to share it between contexts. On Tue, Nov 4, 2014 at 3:18 AM, vdiwakar.malladi vdiwakar.mall...@gmail.com wrote: Hi, There is a need in my application to query the
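
A minimal sketch of that hand-off via Parquet, using the Spark 1.1 SchemaRDD API; all paths are placeholders:

    // in the application that produces the data
    val people = sqlContext.jsonFile("hdfs:///input/people.json")
    people.saveAsParquetFile("hdfs:///shared/people.parquet")

    // in another application / SQLContext
    val shared = sqlContext.parquetFile("hdfs:///shared/people.parquet")
    shared.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people").collect()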

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread Sean Owen
On Tue, Nov 4, 2014 at 8:02 PM, spr s...@yarcdata.com wrote: To state this another way, it seems like there's no way to straddle the streaming world and the non-streaming world; to get input from both a (vanilla, Linux) file and a stream. Is that true? If so, it seems I need to turn my

[ANN] Spark resources searchable

2014-11-04 Thread Otis Gospodnetic
Hi everyone, We've recently added indexing of all Spark resources to http://search-hadoop.com/spark . Everything is nicely searchable: * user dev mailing lists * JIRA issues * web site * wiki * source code * javadoc. Maybe it's worth adding to http://spark.apache.org/community.html ? Enjoy!

spark sql create nested schema

2014-11-04 Thread tridib
I am trying to create a schema which will look like:
root
 |-- ParentInfo: struct (nullable = true)
 |    |-- ID: string (nullable = true)
 |    |-- State: string (nullable = true)
 |    |-- Zip: string (nullable = true)
 |-- ChildInfo: struct (nullable = true)
 |    |-- ID: string (nullable =

StructField of StructType

2014-11-04 Thread tridib
How do I create a StructField of StructType? I need to create a nested schema. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StructField-of-StructType-tp18091.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: StructField of StructType

2014-11-04 Thread Michael Armbrust
Structs are Rows nested in other rows. This might also be helpful: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema On Tue, Nov 4, 2014 at 12:21 PM, tridib tridib.sama...@live.com wrote: How do I create a StructField of StructType? I need to

Best practice for join

2014-11-04 Thread Benyi Wang
I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did:
// build (K,V) from A and B to prepare the join
val ja = A.map(r => (K1, Va))
val jb = B.map(r => (K1, Vb))
// join A and B
val jab = ja.join(jb)
// build (K,V) from the joined result of A and B to prepare joining with C
val jc =

Re: MEMORY_ONLY_SER question

2014-11-04 Thread Tathagata Das
It is deserialized in a streaming manner as the iterator moves over the partition. This is functionality of core Spark, and Spark Streaming just uses it as is. What do you want to customize it to? On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, If I have an RDD
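
To make the behavior concrete, a minimal sketch of persisting in MEMORY_ONLY_SER and consuming the partition lazily; the input path is a placeholder:

    import org.apache.spark.storage.StorageLevel

    val cached = sc.textFile("hdfs:///input/data").persist(StorageLevel.MEMORY_ONLY_SER)
    val lengths = cached.mapPartitions { iter =>
      // iter pulls (and deserializes) one record at a time from the serialized blocks
      iter.map(_.length)
    }
    lengths.count()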

Re: Model characterization

2014-11-04 Thread vinay453
Got it from a friend - println(model.weights) and println(model.intercept). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Model-characterization-tp17985p18106.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Streaming window operations not producing output

2014-11-04 Thread Tathagata Das
Didn't you get any errors in the log4j logs saying that you have to enable checkpointing? TD On Tue, Nov 4, 2014 at 7:20 AM, diogo di...@uken.com wrote: So, to answer my own n00b question, in case anyone ever needs it: you have to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed

Re: spark sql create nested schema

2014-11-04 Thread Yin Huai
Hello Tridib, For your case, you can use StructType(StructField("ParentInfo", parentInfo, true) :: StructField("ChildInfo", childInfo, true) :: Nil) to create the StructType representing the schema (parentInfo and childInfo are two existing StructTypes). You can take a look at our docs (
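
Putting that together, a minimal sketch of the nested schema and applySchema, with field names taken from the question and hypothetical sample values:

    import org.apache.spark.sql._

    val parentInfo = StructType(Seq(
      StructField("ID", StringType, true),
      StructField("State", StringType, true),
      StructField("Zip", StringType, true)))
    val childInfo = StructType(Seq(
      StructField("ID", StringType, true),
      StructField("State", StringType, true),
      StructField("Zip", StringType, true)))
    val schema = StructType(Seq(
      StructField("ParentInfo", parentInfo, true),
      StructField("ChildInfo", childInfo, true)))

    // rows must nest the same way: Row(Row(id, state, zip), Row(id, state, zip))
    val rowRDD = sc.parallelize(Seq(Row(Row("p1", "CA", "94105"), Row("c1", "CA", "94105"))))
    val nested = sqlContext.applySchema(rowRDD, schema)
    nested.printSchema()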

Re: How to ship cython library to workers?

2014-11-04 Thread freedafeng
Thanks for the solution! I did figure out how to create an .egg file to ship out to the workers. Using ipython seems to be another cool solution. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-ship-cython-library-to-workers-tp14467p18116.html Sent

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
BTW while I haven't actually used Redshift, I've seen many companies that use both, usually using Spark for ETL and advanced analytics and Redshift for SQL on the cleaned / summarized data. Xiangrui Meng also wrote https://github.com/mengxr/redshift-input-format to make it easy to read data

Re: Spark v Redshift

2014-11-04 Thread Jimmy McErlain
This is pretty spot on.. though I would also add that the Spark features it touts around speed all depend on caching the data in memory... reading off the disk still takes time, i.e. pulling the data into an RDD. This is the reason that Spark is great for ML... the data is used over

Re: Spark v Redshift

2014-11-04 Thread Akshar Dave
There is no one-size-fits-all solution available in the market today. If somebody tells you they have one, they are simply lying :) Both solutions cater to different sets of problems. My recommendation is to focus on getting a better understanding of the problems you are trying to solve

RE: Workers not registering after master restart

2014-11-04 Thread Ashic Mahtab
Hi Nan, Cool. Thanks. Regards, Ashic. Date: Tue, 4 Nov 2014 18:26:48 -0500 From: zhunanmcg...@gmail.com To: as...@live.com CC: user@spark.apache.org Subject: Re: Workers not registering after master restart Hi, Ashic, this is expected for the latest released

Re: deploying a model built in mllib

2014-11-04 Thread Simon Chan
The latest version of PredictionIO, which is now under the Apache 2 license, supports the deployment of MLlib models in production. The engine you build will include a few components, such as: - Data - includes Data Source and Data Preparator - Algorithm(s) - Serving I believe that you can do the

spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Hi, Amazon AWS has started to provide service for mainland China; the region name is cn-north-1. But the script Spark provides, spark_ec2.py, will query AMI ids from https://github.com/mesos/spark-ec2/tree/v4/ami-list and there's no AMI information for the cn-north-1 region. Can anybody update the

Using SQL statements vs. SchemaRDD methods

2014-11-04 Thread SK
SchemaRDD supports some SQL-like functionality such as groupBy(), distinct(), and select(). However, SparkSQL also supports SQL statements which provide this functionality. In terms of future support and performance, is it better to use SQL statements or the SchemaRDD methods that provide

Re: netty on classpath when using spark-submit

2014-11-04 Thread Tobias Pfeiffer
Markus, thanks for your help! On Tue, Nov 4, 2014 at 8:33 PM, M. Dale medal...@yahoo.com.invalid wrote: Tobias, From http://spark.apache.org/docs/latest/configuration.html it seems that there is an experimental property: spark.files.userClassPathFirst Thank you very much, I didn't

Why mapred for the HadoopRDD?

2014-11-04 Thread Corey Nolet
I'm fairly new to spark and I'm trying to kick the tires with a few InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf instead of a MapReduce Job object. Is there future planned support for the mapreduce packaging?

Re: Spark v Redshift

2014-11-04 Thread agfung
Sounds like context would help, I just didn't want to subject people to a wall of text if it wasn't necessary :) Currently we use neither Spark SQL (or anything else in the Hadoop stack) nor Redshift. We service templated queries from the appserver, i.e. user fills out some forms, dropdowns: we

Re: Using SQL statements vs. SchemaRDD methods

2014-11-04 Thread Michael Armbrust
They both compile down to the same logical plans, so the performance of running the query should be the same. The Scala DSL uses a lot of Scala magic and thus is experimental, whereas HiveQL is pretty set in stone. On Tue, Nov 4, 2014 at 5:22 PM, SK skrishna...@gmail.com wrote: SchemaRDD
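
To illustrate the equivalence, a minimal sketch of the two forms against a hypothetical table "people" with columns name and age; here people stands for the SchemaRDD registered under that name, and the DSL symbols assume the SQLContext implicits are imported:

    import sqlContext._

    // SQL statement
    val bySql = sql("SELECT name FROM people WHERE age >= 21")

    // SchemaRDD / DSL methods, compiling to the same logical plan
    val byDsl = people.where('age >= 21).select('name)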

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html cn-north-1 is not a supported region for EC2, as far as I can tell. There may be other AWS services that can use that region, but spark-ec2 relies on EC2. Nick On Tue, Nov 4, 2014 at 8:09 PM, haitao .yao

Re: Why mapred for the HadoopRDD?

2014-11-04 Thread raymond
You could take a look at sc.newAPIHadoopRDD() On Nov 5, 2014, at 9:29 AM, Corey Nolet cjno...@gmail.com wrote: I'm fairly new to spark and I'm trying to kick the tires with a few InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf instead of a MapReduce Job object. Is there
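
A minimal sketch of the new-API (org.apache.hadoop.mapreduce) entry point, using TextInputFormat purely as an example and a placeholder path:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val rdd = sc.newAPIHadoopFile("hdfs:///input/data",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    rdd.map(_._2.toString).take(5)

sc.newAPIHadoopRDD() works the same way when the input is configured through a Hadoop Configuration object rather than a path.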

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
I'm afraid not. We have been using EC2 instances in the cn-north-1 region for a while, and the latest version of boto has added the region cn-north-1. Here's the output:
>>> from boto import ec2
>>> ec2.regions()
[RegionInfo:us-east-1, RegionInfo:cn-north-1, RegionInfo:ap-northeast-1,

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread Nicholas Chammas
Oh, I can see that region via boto as well. Perhaps the doc is indeed out of date. Do you mind opening a JIRA issue https://issues.apache.org/jira/secure/Dashboard.jspa to track this request? I can do it if you've never opened a JIRA issue before. Nick On Tue, Nov 4, 2014 at 9:03 PM, haitao

Re: stdout in spark applications

2014-11-04 Thread lokeshkumar
Got my answer from this thread, http://apache-spark-user-list.1001560.n3.nabble.com/no-stdout-output-from-worker-td2437.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stdout-in-spark-applications-tp18056p18134.html Sent from the Apache Spark User List

Re: pass unique ID to mllib algorithms pyspark

2014-11-04 Thread Xiangrui Meng
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue. We carry over extra columns with training and prediction and then leverage on Spark SQL's execution plan optimization to decide which columns are really needed. For the current set of APIs, we can add `predictOnValues`

GraphX and Spark

2014-11-04 Thread Deep Pradhan
Hi, Can Spark achieve whatever GraphX can? Keeping aside the performance comparison between Spark and GraphX, if I want to implement any graph algorithm and I do not want to use GraphX, can I get the work done with Spark? Thank You

Re: loading, querying schemaRDD using SparkSQL

2014-11-04 Thread vdiwakar.malladi
Thanks Michael for your response. Just now, I saw the saveAsTable method on the JavaSchemaRDD object (in the Spark 1.1.0 API), but I couldn't find the corresponding documentation. Will that help? Please let me know. Thanks in advance. -- View this message in context:

MLlib and PredictionIO sample code

2014-11-04 Thread Simon Chan
Hey guys, I have written a tutorial on deploying MLlib models in production with open source PredictionIO: http://docs.prediction.io/0.8.1/templates/ The goal is to add the following features to MLlib, with production applications in mind: - JSON query to retrieve prediction online -

Re: Spark Streaming getOrCreate

2014-11-04 Thread sivarani
Anybody any luck? I am also trying to return None to delete a key from state; will null help? How do I use Scala's None in Java? My code goes this way: public static class ScalaLang { public static <T> Option<T> none() { return (Option<T>) None$.MODULE$; }

Issue in Spark Streaming

2014-11-04 Thread Suman S Patil
I am trying to run the Spark Streaming program as given in the Spark Streaming programming guide https://spark.apache.org/docs/latest/streaming-programming-guide.html, in the interactive shell. I am getting an error as shown here file:///C:\Users\10609685\Desktop\stream-spark.png as an

Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
I have the following code in my program. I don't get any error, but it's not consuming the messages either. Shouldn't the following code print the line in the 'call' method? What am I missing? Please help. Thanks. JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new

RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
Hi, would you mind describing your problem a little more specifically? 1. Does the Kafka broker currently have any data feeding in? 2. This code will print the lines, but not on the driver side; the code is running on the executor side, so you can check the log in the worker dir to see if there's

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
The Kafka broker definitely has messages coming in. But your #2 point is valid. Needless to say I am a newbie to Spark. I can't figure out where the 'executor' logs would be. How would I find them? All I see printed on my screen is this: 14/11/04 22:21:23 INFO Slf4jLogger: Slf4jLogger

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Sean Owen
This code only expresses a transformation and so does not actually cause any action. I think you intend to use foreachRDD. On Wed, Nov 5, 2014 at 5:57 AM, Something Something mailinglist...@gmail.com wrote: I've following code in my program. I don't get any error, but it's not consuming the
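
For completeness, a Scala sketch of the same point: without an output operation such as print() or foreachRDD, none of the DStream transformations ever execute (statuses below stands in for the mapped DStream from the thread):

    statuses.print()                        // prints a few elements of every batch on the driver

    statuses.foreachRDD { rdd =>
      rdd.foreach(line => println(line))    // runs on the executors, so it shows up in worker logs
    }

    ssc.start()
    ssc.awaitTermination()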

RE: MEMORY_ONLY_SER question

2014-11-04 Thread Shao, Saisai
From my understanding, the Spark code uses Kryo in a streaming manner for RDD partitions; the deserialization comes with the iteration as it moves forward. But whether Kryo deserializes all the objects at once or incrementally is really internal Kryo behavior; I guess Kryo will not deserialize

Re: ERROR UserGroupInformation: PriviledgedActionException

2014-11-04 Thread Akhil Das
It's more likely that you have different versions of Spark. Thanks Best Regards On Wed, Nov 5, 2014 at 3:05 AM, Saiph Kappa saiph.ka...@gmail.com wrote: I set the host and port of the driver and now the error slightly changed Using Spark's default log4j profile:

RE: Kafka Consumer in Spark Streaming

2014-11-04 Thread Shao, Saisai
If you’re running in standalone mode, the log is under the SPARK_HOME/work/ directory. I’m not sure about YARN or Mesos; you can check the Spark documentation for the details. Thanks Jerry From: Something Something [mailto:mailinglist...@gmail.com] Sent: Wednesday, November 05, 2014 2:28 PM To:

Re: Issue in Spark Streaming

2014-11-04 Thread Akhil Das
Which error are you referring here? Can you paste the error logs? Thanks Best Regards On Wed, Nov 5, 2014 at 11:04 AM, Suman S Patil suman.pa...@lntinfotech.com wrote: I am trying to run the Spark streaming program as given in the Spark streaming Programming guide

Re: stackoverflow error

2014-11-04 Thread Sean Owen
With so many iterations, your RDD lineage is too deep. You should not need nearly so many iterations. 10 or 20 is usually plenty. On Tue, Nov 4, 2014 at 11:13 PM, Hongbin Liu hongbin@theice.com wrote: Hi, can you help with the following? We are new to spark. Error stack: 14/11/04
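
If many iterations really are needed, a common complementary fix (not part of this reply, just a general technique) is to checkpoint periodically so the lineage gets truncated. A self-contained sketch with made-up data and update step:

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///checkpoints")          // placeholder path

    var current: RDD[Double] = sc.parallelize(1 to 1000).map(_.toDouble)
    for (i <- 1 to 50) {                                // stand-in for the iterative update
      current = current.map(_ * 0.9).cache()
      if (i % 10 == 0) {
        current.checkpoint()                            // cuts the lineage every 10 iterations
        current.count()                                 // materialize so the checkpoint is written
      }
    }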

Re: save as JSON objects

2014-11-04 Thread Akhil Das
Something like this? val json = myRDD.map(map_obj => new JSONObject(map_obj)) Here map_obj will be a map containing values (eg: Map("name" -> "Akhil", "mail" -> "xyz@xyz")). Performance wasn't so good with this one though. Thanks Best Regards On Wed, Nov 5, 2014 at 3:02 AM, Yin Huai
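
Another minimal sketch for the "one JSON object per line" requirement, assuming json4s (which ships with Spark) is available and each record is a Map[String, String]; the data and output path are placeholders:

    import org.json4s.JsonDSL._
    import org.json4s.jackson.JsonMethods._

    val myRDD = sc.parallelize(Seq(Map("name" -> "Akhil", "mail" -> "xyz@xyz")))
    val jsonLines = myRDD.map(record => compact(render(record)))   // one JSON object per record
    jsonLines.saveAsTextFile("hdfs:///output/json-objects")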

How to increase hdfs read parallelism

2014-11-04 Thread Rajat Verma
Hi, I have a simple use case where I have to join two feeds. I have two worker nodes, each having 96 GB memory and 24 cores. I am running spark (1.1.0) with yarn (2.4.0). I have allocated 80% of the resources to the spark queue and my spark config looks like spark.executor.cores=18 spark.executor.memory=66g
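
A minimal sketch of the usual knobs for raising read parallelism (paths and partition counts are placeholders, not recommendations for this cluster):

    // ask for more input splits up front
    val feedA = sc.textFile("hdfs:///feeds/a", minPartitions = 144)
    val feedB = sc.textFile("hdfs:///feeds/b", minPartitions = 144)

    // or re-shuffle an already-loaded RDD into more partitions
    val widened = feedA.repartition(144)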

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
Added foreach as follows. Still don't see any output on my console. Would this go to the worker logs as Jerry indicated? JavaPairReceiverInputDStream<String, String> tweets = KafkaUtils.createStream(ssc, "mymachine:2181", "1", map); JavaDStream<String> statuses = tweets.map( new

Re: GraphX and Spark

2014-11-04 Thread Kamal Banga
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can. On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark achieve whatever GraphX can? Keeping aside the performance comparison between Spark and GraphX, if I want to implement any

Re: Best practice for join

2014-11-04 Thread Akhil Das
How about Using SparkSQL https://spark.apache.org/sql/? Thanks Best Regards On Wed, Nov 5, 2014 at 1:53 AM, Benyi Wang bewang.t...@gmail.com wrote: I need to join RDD[A], RDD[B], and RDD[C]. Here is what I did, # build (K,V) from A and B to prepare the join val ja = A.map( r = (K1, Va))

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Something Something
It's not local. My spark url is something like this: String sparkUrl = "spark://host name:7077"; On Tue, Nov 4, 2014 at 11:03 PM, Jain Rahul ja...@ivycomptech.com wrote: I think you are running it locally. Do you have local[1] here for master url? If yes change it to local[2] or

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Jain Rahul
I think you are running it locally. Do you have local[1] here for the master url? If yes, change it to local[2] or a higher number of threads. It may also be due to a topic name mismatch. sparkConf.setMaster("local[1]"); Regards, Rahul From: Something Something

Re: Kafka Consumer in Spark Streaming

2014-11-04 Thread Akhil Das
Your code doesn't trigger any action. How about the following? JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(60 * 1 * 1000)); JavaPairReceiverInputDStream<String, String> tweets = KafkaUtils.createStream(ssc, "machine:2181", "1", map); JavaDStream<String> statuses =

Re: Best practice for join

2014-11-04 Thread Benyi Wang
I'm using spark-1.0.0 in CDH 5.1.0. The big problem is SparkSQL doesn't support Hash join in this version. On Tue, Nov 4, 2014 at 10:54 PM, Akhil Das ak...@sigmoidanalytics.com wrote: How about Using SparkSQL https://spark.apache.org/sql/? Thanks Best Regards On Wed, Nov 5, 2014 at 1:53

Re: spark_ec2.py for AWS region: cn-north-1, China

2014-11-04 Thread haitao .yao
Done, JIRA link: https://issues.apache.org/jira/browse/SPARK-4241 Thanks. 2014-11-05 10:58 GMT+08:00 Nicholas Chammas nicholas.cham...@gmail.com: Oh, I can see that region via boto as well. Perhaps the doc is indeed out of date. Do you mind opening a JIRA issue

Re: Best practice for join

2014-11-04 Thread Akhil Das
Oh, in that case, if you want to reduce the GC time, you can specify the level of parallelism along with your join and reduceByKey operations. Thanks Best Regards On Wed, Nov 5, 2014 at 1:11 PM, Benyi Wang bewang.t...@gmail.com wrote: I'm using spark-1.0.0 in CDH 5.1.0. The big problem is
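
A minimal sketch of passing the parallelism explicitly; the data and the count of 200 are placeholders standing in for the (K,V) RDDs built from A and B earlier in the thread:

    import org.apache.spark.HashPartitioner

    val ja = sc.parallelize(Seq((1, "a1"), (2, "a2")))
    val jb = sc.parallelize(Seq((1, "b1"), (2, "b2")))

    // pass an explicit partition count to the shuffle
    val jab = ja.join(jb, 200)

    // or reuse one partitioner so consecutive joins avoid extra shuffles
    val part = new HashPartitioner(200)
    val jab2 = ja.partitionBy(part).join(jb.partitionBy(part))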

sparse x sparse matrix multiplication

2014-11-04 Thread ll
what is the best way to implement a sparse x sparse matrix multiplication with spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sparse-x-sparse-matrix-multiplication-tp18163.html Sent from the Apache Spark User List mailing list archive at Nabble.com.