Spark Streaming
Hello Folks:

I'm running into the following error while executing relatively straightforward Spark Streaming code. Am I missing anything?

Exception in thread "main" java.lang.AssertionError: assertion failed: No output streams registered, so nothing to execute

Code:

val conf = new SparkConf().setMaster("local[2]").setAppName("Streams")
val ssc = new StreamingContext(conf, Seconds(1))

val kafkaStream = {
  val sparkStreamingConsumerGroup = "spark-streaming-consumer-group"
  val kafkaParams = Map(
    "zookeeper.connect" -> "node1.c.emerald-skill-783.internal:2181",
    "group.id" -> "twitter",
    "zookeeper.connection.timeout.ms" -> "1000")
  val inputTopic = "twitter"
  val numPartitionsOfInputTopic = 2
  val streams = (1 to numPartitionsOfInputTopic) map { _ =>
    KafkaUtils.createStream(ssc, kafkaParams, Map(inputTopic -> 1), StorageLevel.MEMORY_ONLY_SER)
  }
  val unifiedStream = ssc.union(streams)
  val sparkProcessingParallelism = 1
  unifiedStream.repartition(sparkProcessingParallelism)
}

// print(kafkaStream)
ssc.start()
ssc.awaitTermination()

--
Rohit Pujari
Re: Spark Streaming
Hi Francois:

I tried using print(kafkaStream) as the output operator, but no luck. It throws the same error. Any other thoughts?

Thanks,
Rohit

From: francois.garil...@typesafe.com
Date: Saturday, January 17, 2015 at 4:10 AM
To: Rohit Pujari rpuj...@hortonworks.com
Subject: Re: Spark Streaming

Streams are lazy. Their computation is triggered by an output operator, which is apparently missing from your code. See the programming guide: https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

- FG
Re: Spark Streaming
That was it. Thanks, Akhil and Owen, for your quick response.

On Sat, Jan 17, 2015 at 4:27 AM, Sean Owen so...@cloudera.com wrote:

Not print(kafkaStream), which would just print some String description of the stream to the console, but kafkaStream.print(), which actually invokes the print operation on the stream.

--
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899
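For anyone landing on this thread with the same assertion error: the fix Sean describes is to register an output operation on the stream before starting the context. A minimal sketch, using the kafkaStream value from the original post (the repartitioned DStream returned by the block):

// Register an output operation; without one, Spark Streaming has no jobs
// to run and fails with "No output streams registered, so nothing to execute".
kafkaStream.print()

ssc.start()            // start only after output operations are registered
ssc.awaitTermination()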
Re: Market Basket Analysis
This is a typical use case: people who buy electric razors also tend to buy batteries and shaving gel along with them. The goal is to build a model that looks through POS records and finds which product categories have a higher likelihood of appearing together in a given transaction. What would you recommend?

On Fri, Dec 5, 2014 at 7:21 AM, Sean Owen so...@cloudera.com wrote:

Generally I don't think frequent-item-set algorithms are that useful. They're simple and not probabilistic; they don't tell you which sets occurred unusually frequently. Usually people ask for frequent-item-set algos when they really mean they want to compute item similarity or make recommendations. What's your use case?

--
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899
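A note for later readers: SPARK-4001 was eventually resolved by adding FP-Growth, a frequent-item-set algorithm in the same family as Apriori, to MLlib (Spark 1.3+). A minimal sketch of that API for the POS use case above, assuming an active SparkContext sc and a hypothetical /data/pos path holding comma-separated item lists:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Each record is one basket, e.g. "razor,batteries,shaving-gel".
val transactions: RDD[Array[String]] = sc.textFile("/data/pos").map(_.split(","))

val model = new FPGrowth()
  .setMinSupport(0.2)     // keep item sets present in at least 20% of baskets
  .setNumPartitions(10)   // parallelism for building the FP-trees
  .run(transactions)

// Print each frequent item set with the number of baskets it appears in.
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ": " + itemset.freq)
}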
Market Basket Analysis
Hello Folks:

I'd like to do market basket analysis using Spark. What are my options?

Thanks,
Rohit Pujari
Solutions Architect, Hortonworks
Re: Market Basket Analysis
Sure, I'm looking to perform frequent-item-set analysis on a POS data set. Apriori is a classic algorithm for such tasks. Since an Apriori implementation is not part of MLlib yet (see https://issues.apache.org/jira/browse/SPARK-4001), what are some other options/algorithms I could use to perform a similar task? If there's no spoon-to-spoon substitute, spoon-to-fork will suffice too. Hopefully this provides some clarification.

Thanks,
Rohit

From: Tobias Pfeiffer t...@preferred.jp
Date: Thursday, December 4, 2014 at 7:20 PM
To: Rohit Pujari rpuj...@hortonworks.com
Cc: user@spark.apache.org
Subject: Re: Market Basket Analysis

Hi,

On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com wrote:
I'd like to do market basket analysis using Spark. What are my options?

To do it or not to do it ;-)

Seriously, could you elaborate a bit on what you want to know?

Tobias
Python Scientific Libraries in Spark
Hello Folks:

Since Spark exposes Python bindings and lets you express your logic in Python, is there a way to leverage sophisticated libraries like NumPy, SciPy, and scikit-learn in a Spark job and run at scale? What's been your experience? Any insights you can share on what's possible today and on active development in the community that's on the horizon?

Thanks,
Rohit Pujari
Solutions Architect, Hortonworks
Re: Spark job doesn't clean after itself
Reviving this .. any thoughts, experts?

On Thu, Oct 9, 2014 at 3:47 PM, Rohit Pujari rpuj...@hortonworks.com wrote:

Hello Folks: I'm running a Spark job on YARN. After execution, I would expect the job to clean up the staging area, but it seems every run creates a new staging directory. Is there a way to force a Spark job to clean up after itself?

Thanks,
Rohit

--
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899
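Since the thread never got a reply, one knob worth checking: on YARN, Spark decides at shutdown whether to delete the HDFS staging directory based on a configuration flag. A hedged sketch; false is already the default, so persistently leftover directories usually mean the application exited before its shutdown cleanup could run:

import org.apache.spark.SparkConf

// spark.yarn.preserve.staging.files=false (the default) tells Spark to delete
// the staging files (Spark jar, app jar, distributed cache files) at job end.
val conf = new SparkConf()
  .setAppName("StagingCleanupCheck")
  .set("spark.yarn.preserve.staging.files", "false")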
Spark job doesn't clean after itself
Hello Folks:

I'm running a Spark job on YARN. After execution, I would expect the job to clean up the staging area, but it seems every run creates a new staging directory. Is there a way to force a Spark job to clean up after itself?

Thanks,
Rohit
Debug Spark in Cluster Mode
Hello Folks:

What are some best practices for debugging Spark in cluster mode?

Thanks,
Rohit
Re: Can Spark stack scale to petabyte scale without performance degradation?
Thanks, Matei.

On Tue, Jul 15, 2014 at 11:47 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Yup, as mentioned in the FAQ, we are aware of multiple deployments running jobs on over 1000 nodes. Some of our proofs of concept involved people running a 2000-node job on EC2. I wouldn't confuse buzz with FUD :).

Matei

On Jul 15, 2014, at 9:17 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

Hi Rohit,

I think the 3rd question on the FAQ may help you: https://spark.apache.org/faq.html

Some other links that talk about building bigger clusters and processing more data:
http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf
http://apache-spark-user-list.1001560.n3.nabble.com/Largest-Spark-Cluster-td3782.html

Best Regards,
Sonal
Nube Technologies
http://www.nubetech.co/
Can Spark stack scale to petabyte scale without performance degradation?
Hello Folks:

There is a lot of buzz in the Hadoop community around Spark's supposed inability to scale beyond 1 TB datasets (or 10-20 nodes). It is regarded as great tech for CPU-intensive workloads on smaller data (less than a TB) but said to fail to scale and perform effectively on larger datasets. How true is this? Are there any customers running petabyte-scale workloads on Spark in production? Are there any benchmarks performed by Databricks or other companies to clear up this perception?

I'm a big fan of Spark. Knowing Spark is in its early stages, I'd like to better understand the boundaries of the tech and recommend the right solution for the right problem.

Thanks,
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899
KMeansModel Construtor error
Hello Folks:

I have written a simple program to read an already saved model from HDFS and score with it. But when I try to read the saved model, I get the following error. Any clues what might be going wrong here?

val x = sc.objectFile[Vector]("/data/model").collect()
val y = new KMeansModel(x)

constructor KMeansModel in class KMeansModel cannot be accessed in object KMeansScore
val y = new KMeansModel(x);
        ^

Thanks,
Rohit
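Since the thread never got a reply: the compile error means the KMeansModel constructor is not public in the MLlib version being compiled against (it was restricted to private[mllib] in early releases). A hedged sketch of the save/load route that later MLlib versions (1.4+) expose, which sidesteps the constructor entirely; the /data/model path follows the original post:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Persist with the model's own writer rather than saving centers via objectFile:
//   model.save(sc, "/data/model")

// Reload without ever calling the restricted constructor.
val model = KMeansModel.load(sc, "/data/model")

// Score a point by asking which cluster center it is closest to.
val cluster = model.predict(Vectors.dense(1.0, 2.0))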