Spark Streaming

2015-01-17 Thread Rohit Pujari
Hello Folks:

I'm running into the following error while executing relatively straightforward
spark-streaming code. Am I missing anything?

Exception in thread "main" java.lang.AssertionError: assertion failed: No
output streams registered, so nothing to execute


Code:

val conf = new SparkConf().setMaster("local[2]").setAppName("Streams")
val ssc = new StreamingContext(conf, Seconds(1))

val kafkaStream = {
  val sparkStreamingConsumerGroup = "spark-streaming-consumer-group"
  val kafkaParams = Map(
    "zookeeper.connect" -> "node1.c.emerald-skill-783.internal:2181",
    "group.id" -> "twitter",
    "zookeeper.connection.timeout.ms" -> "1000")
  val inputTopic = "twitter"
  val numPartitionsOfInputTopic = 2
  val streams = (1 to numPartitionsOfInputTopic) map { _ =>
    KafkaUtils.createStream(ssc, kafkaParams, Map(inputTopic -> 1),
      StorageLevel.MEMORY_ONLY_SER)
  }
  val unifiedStream = ssc.union(streams)
  val sparkProcessingParallelism = 1
  unifiedStream.repartition(sparkProcessingParallelism)
}

//print(kafkaStream)
ssc.start()
ssc.awaitTermination()

-- 
Rohit Pujari



Re: Spark Streaming

2015-01-17 Thread Rohit Pujari
Hi Francois:

I tried using "print(kafkaStream)" as an output operator but no luck. It throws
the same error. Any other thoughts?

Thanks,
Rohit


From: francois.garil...@typesafe.com
Date: Saturday, January 17, 2015 at 4:10 AM
To: Rohit Pujari rpuj...@hortonworks.com
Subject: Re: Spark Streaming

Streams are lazy. Their computation is triggered by an output operator, which 
is apparently missing from your code. See the programming guide:

https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

-
FG
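
A minimal sketch of the fix Francois describes, assuming the kafkaStream value built in the code above: register an output operation on the DStream itself, i.e. kafkaStream.print() rather than print(kafkaStream), before starting the context.

// Sketch only: same stream setup as in the original code, with an output
// operation registered so the streaming graph has something to execute.
kafkaStream.print()   // or kafkaStream.foreachRDD { rdd => ... } for custom handling

ssc.start()
ssc.awaitTermination()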



On Sat, Jan 17, 2015 at 11:06 AM, Rohit Pujari rpuj...@hortonworks.com wrote:

Hello Folks:

I'm running into the following error while executing relatively straightforward
spark-streaming code. Am I missing anything?

Exception in thread "main" java.lang.AssertionError: assertion failed: No
output streams registered, so nothing to execute


Code:

val conf = new SparkConf().setMaster("local[2]").setAppName("Streams")
val ssc = new StreamingContext(conf, Seconds(1))

val kafkaStream = {
  val sparkStreamingConsumerGroup = "spark-streaming-consumer-group"
  val kafkaParams = Map(
    "zookeeper.connect" -> "node1.c.emerald-skill-783.internal:2181",
    "group.id" -> "twitter",
    "zookeeper.connection.timeout.ms" -> "1000")
  val inputTopic = "twitter"
  val numPartitionsOfInputTopic = 2
  val streams = (1 to numPartitionsOfInputTopic) map { _ =>
    KafkaUtils.createStream(ssc, kafkaParams, Map(inputTopic -> 1),
      StorageLevel.MEMORY_ONLY_SER)
  }
  val unifiedStream = ssc.union(streams)
  val sparkProcessingParallelism = 1
  unifiedStream.repartition(sparkProcessingParallelism)
}

//print(kafkaStream)
ssc.start()
ssc.awaitTermination()

--
Rohit Pujari




Re: Spark Streaming

2015-01-17 Thread Rohit Pujari
That was it. Thanks Akhil and Owen for your quick response.

On Sat, Jan 17, 2015 at 4:27 AM, Sean Owen so...@cloudera.com wrote:

 Not print(kafkaStream), which would just print some String description
 of the stream to the console, but kafkaStream.print(), which actually
 invokes the print operation on the stream.

 On Sat, Jan 17, 2015 at 10:17 AM, Rohit Pujari rpuj...@hortonworks.com
 wrote:
  Hi Francois:
 
  I tried using "print(kafkaStream)" as an output operator but no luck. It throws
  the same error. Any other thoughts?
 
  Thanks,
  Rohit
 
 
  From: francois.garil...@typesafe.com francois.garil...@typesafe.com
  Date: Saturday, January 17, 2015 at 4:10 AM
  To: Rohit Pujari rpuj...@hortonworks.com
  Subject: Re: Spark Streaming
 
  Streams are lazy. Their computation is triggered by an output operator,
  which is apparently missing from your code. See the programming guide:
 
 
 https://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
 
  —
  FG
 
 
  On Sat, Jan 17, 2015 at 11:06 AM, Rohit Pujari rpuj...@hortonworks.com
  wrote:
 
  Hello Folks:
 
  I'm running into the following error while executing relatively straightforward
  spark-streaming code. Am I missing anything?
 
  Exception in thread "main" java.lang.AssertionError: assertion failed: No
  output streams registered, so nothing to execute
 
 
  Code:
 
  val conf = new SparkConf().setMaster("local[2]").setAppName("Streams")
  val ssc = new StreamingContext(conf, Seconds(1))

  val kafkaStream = {
    val sparkStreamingConsumerGroup = "spark-streaming-consumer-group"
    val kafkaParams = Map(
      "zookeeper.connect" -> "node1.c.emerald-skill-783.internal:2181",
      "group.id" -> "twitter",
      "zookeeper.connection.timeout.ms" -> "1000")
    val inputTopic = "twitter"
    val numPartitionsOfInputTopic = 2
    val streams = (1 to numPartitionsOfInputTopic) map { _ =>
      KafkaUtils.createStream(ssc, kafkaParams, Map(inputTopic -> 1),
        StorageLevel.MEMORY_ONLY_SER)
    }
    val unifiedStream = ssc.union(streams)
    val sparkProcessingParallelism = 1
    unifiedStream.repartition(sparkProcessingParallelism)
  }
 
  //print(kafkaStream)
  ssc.start()
  ssc.awaitTermination()
 
  --
  Rohit Pujari
 
 




-- 
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



Re: Market Basket Analysis

2014-12-05 Thread Rohit Pujari
This is a typical use case: people who buy electric razors also tend to
buy batteries and shaving gel along with them. The goal is to build a model
that will look through POS records and find which product categories have a
higher likelihood of appearing together in a given transaction.

What would you recommend?
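
Along the lines of the item-similarity suggestion below, one hedged sketch is to count how often pairs of categories co-occur in the same transaction using plain RDD operations. The input path and record format here are illustrative assumptions (one comma-separated list of categories per transaction, and an existing SparkContext named sc):

import org.apache.spark.SparkContext._   // pair-RDD functions (needed explicitly on older Spark releases)

// Sketch only: pairwise category co-occurrence counts over POS transactions.
// Each input line is assumed to be one transaction's categories, comma-separated.
val transactions = sc.textFile("/data/pos_transactions")
  .map(_.split(",").map(_.trim).distinct.sorted)

val pairCounts = transactions.flatMap { items =>
  for {
    i <- items.indices
    j <- (i + 1) until items.length
  } yield ((items(i), items(j)), 1L)
}.reduceByKey(_ + _)

// Category pairs that most often appear together in a transaction.
pairCounts.top(20)(Ordering.by(_._2)).foreach(println)

Normalizing a pair's count by the counts of its individual categories gives a rough lift-style similarity measure rather than raw frequency, which is closer to what Sean suggests.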

On Fri, Dec 5, 2014 at 7:21 AM, Sean Owen so...@cloudera.com wrote:

 Generally I don't think frequent-item-set algorithms are that useful.
 They're simple and not probabilistic; they don't tell you what sets
 occurred unusually frequently. Usually people ask for frequent item
 set algos when they really mean they want to compute item similarity
 or make recommendations. What's your use case?

 On Thu, Dec 4, 2014 at 8:23 PM, Rohit Pujari rpuj...@hortonworks.com
 wrote:
  Sure, I'm looking to perform frequent item set analysis on a POS data set.
  Apriori is a classic algorithm used for such tasks. Since an Apriori
  implementation is not part of MLlib yet (see
  https://issues.apache.org/jira/browse/SPARK-4001), what are some other
  options/algorithms I could use to perform a similar task? If there's no
  spoon-to-spoon substitute, a spoon-to-fork one will suffice too.
 
  Hopefully this provides some clarification.
 
  Thanks,
  Rohit
 
 
 
  From: Tobias Pfeiffer t...@preferred.jp
  Date: Thursday, December 4, 2014 at 7:20 PM
  To: Rohit Pujari rpuj...@hortonworks.com
  Cc: user@spark.apache.org user@spark.apache.org
  Subject: Re: Market Basket Analysis
 
  Hi,
 
  On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com
  wrote:
 
  I'd like to do market basket analysis using spark, what're my options?
 
 
  To do it or not to do it ;-)
 
  Seriously, could you elaborate a bit on what you want to know?
 
  Tobias
 
 
 




-- 
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



Market Basket Analysis

2014-12-04 Thread Rohit Pujari
Hello Folks:

I'd like to do market basket analysis using Spark; what're my options?

Thanks,
Rohit Pujari
Solutions Architect, Hortonworks



Re: Market Basket Analysis

2014-12-04 Thread Rohit Pujari
Sure, I'm looking to perform frequent item set analysis on a POS data set.
Apriori is a classic algorithm used for such tasks. Since an Apriori
implementation is not part of MLlib yet (see
https://issues.apache.org/jira/browse/SPARK-4001), what are some other
options/algorithms I could use to perform a similar task? If there's no
spoon-to-spoon substitute, a spoon-to-fork one will suffice too.

Hopefully this provides some clarification.

Thanks,
Rohit



From: Tobias Pfeiffer t...@preferred.jp
Date: Thursday, December 4, 2014 at 7:20 PM
To: Rohit Pujari rpuj...@hortonworks.com
Cc: user@spark.apache.org
Subject: Re: Market Basket Analysis

Hi,

On Thu, Dec 4, 2014 at 11:58 PM, Rohit Pujari rpuj...@hortonworks.com wrote:
I'd like to do market basket analysis using spark, what're my options?

To do it or not to do it ;-)

Seriously, could you elaborate a bit on what you want to know?

Tobias





Python Scientific Libraries in Spark

2014-11-24 Thread Rohit Pujari
Hello Folks:

Since Spark exposes Python bindings and allows you to express your logic in
Python, is there a way to leverage some of the sophisticated libraries like
NumPy, SciPy, and scikit-learn in a Spark job and run at scale?

What's been your experience? Any insights you can share in terms of what's
possible today and some of the active development in the community that's
on the horizon?

Thanks,
Rohit Pujari
Solutions Architect, Hortonworks



Re: Spark job doesn't clean after itself

2014-10-12 Thread Rohit Pujari
Reviving this... any thoughts, experts?

On Thu, Oct 9, 2014 at 3:47 PM, Rohit Pujari rpuj...@hortonworks.com
wrote:

 Hello Folks:

 I'm running a Spark job on YARN. After the execution, I would expect the
 Spark job to clean up the staging area, but it seems every run creates a new
 staging directory. Is there a way to force the Spark job to clean up after itself?

 Thanks,
 Rohit




-- 
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



Spark job doesn't clean after itself

2014-10-09 Thread Rohit Pujari
Hello Folks:

I'm running a Spark job on YARN. After the execution, I would expect the
Spark job to clean up the staging area, but it seems every run creates a new
staging directory. Is there a way to force the Spark job to clean up after itself?

Thanks,
Rohit
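
One hedged note, not from this thread: on YARN the per-application .sparkStaging directory is normally removed when the application shuts down cleanly, and spark.yarn.preserve.staging.files (false by default) controls whether staged files are kept, so jobs that never stop their context can leave directories behind. A minimal sketch of making sure the context is always stopped:

// Sketch only: stop the SparkContext even when the job fails, so the
// YARN application finishes cleanly and its staging files can be removed.
val sc = new SparkContext(new SparkConf().setAppName("CleanShutdown"))
try {
  // ... job logic ...
} finally {
  sc.stop()
}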



Debug Spark in Cluster Mode

2014-10-09 Thread Rohit Pujari
Hello Folks:

What're some best practices to debug Spark in cluster mode?


Thanks,
Rohit
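
One commonly used option, offered here only as a hedged sketch: the driver and executor logs plus the web UI are the usual first stop, and for stepping through executor code the standard JVM remote-debug agent can be passed via Spark's extra-Java-options settings (the port below is an arbitrary choice):

// Hedged sketch: attach a remote debugger (JDWP) to executor JVMs via
// spark.executor.extraJavaOptions; port 5005 is an illustrative choice.
val conf = new SparkConf()
  .setAppName("DebugInClusterMode")
  .set("spark.executor.extraJavaOptions",
       "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
val sc = new SparkContext(conf)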



Re: Can Spark stack scale to petabyte scale without performance degradation?

2014-07-16 Thread Rohit Pujari
Thanks Matei.


On Tue, Jul 15, 2014 at 11:47 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Yup, as mentioned in the FAQ, we are aware of multiple deployments running
 jobs on over 1000 nodes. Some of our proof of concepts involved people
 running a 2000-node job on EC2.

 I wouldn't confuse buzz with FUD :).

 Matei

 On Jul 15, 2014, at 9:17 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 Hi Rohit,

 I think the 3rd question on the FAQ may help you.

 https://spark.apache.org/faq.html

 Some other links that talk about building bigger clusters and processing
 more data:


 http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf

 http://apache-spark-user-list.1001560.n3.nabble.com/Largest-Spark-Cluster-td3782.html



 Best Regards,
 Sonal
 Nube Technologies http://www.nubetech.co/

  http://in.linkedin.com/in/sonalgoyal




 On Wed, Jul 16, 2014 at 9:17 AM, Rohit Pujari rpuj...@hortonworks.com
 wrote:

 Hello Folks:

  There is a lot of buzz in the Hadoop community around Spark's inability to
  scale beyond 1 TB datasets (or 10-20 nodes). It is regarded as great tech
  for CPU-intensive workloads on smaller data (less than a TB) but said to fail
  to scale and perform effectively on larger datasets. How true is that?

  Are there any customers who are running petabyte-scale workloads on
  Spark in production? Are there any benchmarks performed by Databricks or
  other companies to clear up this perception?

  I'm a big fan of Spark. Knowing Spark is in its early stages, I'd like
  to better understand the boundaries of the tech and recommend the right
  solution for the right problem.

 Thanks,
 Rohit Pujari
 Solutions Engineer, Hortonworks
 rpuj...@hortonworks.com
 716-430-6899







-- 
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



Can Spark stack scale to petabyte scale without performance degradation?

2014-07-15 Thread Rohit Pujari
Hello Folks:

There is a lot of buzz in the Hadoop community around Spark's inability to
scale beyond 1 TB datasets (or 10-20 nodes). It is regarded as great tech
for CPU-intensive workloads on smaller data (less than a TB) but said to fail
to scale and perform effectively on larger datasets. How true is that?

Are there any customers who are running petabyte-scale workloads on
Spark in production? Are there any benchmarks performed by Databricks or
other companies to clear up this perception?

I'm a big fan of Spark. Knowing Spark is in its early stages, I'd like to
better understand the boundaries of the tech and recommend the right solution
for the right problem.

Thanks,
Rohit Pujari
Solutions Engineer, Hortonworks
rpuj...@hortonworks.com
716-430-6899



KMeansModel Construtor error

2014-07-14 Thread Rohit Pujari
Hello Folks:

I have written a simple program to read an already-saved model from HDFS
and score it. But when I try to read the saved model, I get the
following error. Any clues as to what might be going wrong here?


val x = sc.objectFile[Vector]("/data/model").collect()
val y = new KMeansModel(x);

constructor KMeansModel in class KMeansModel cannot be accessed in object
KMeansScore
val y = new KMeansModel(x);
        ^

Thanks,
Rohit
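
A hedged workaround sketch, assuming the error comes from the KMeansModel constructor being package-private (private[mllib]) in the Spark version in use, which is what the compiler message suggests: declare the scoring object inside that package so the constructor becomes visible (later Spark releases added public constructors and built-in model save/load, which avoids this).

// Hedged workaround sketch: placing the scoring code inside the
// org.apache.spark.mllib.clustering package makes a private[mllib]
// constructor accessible.
package org.apache.spark.mllib.clustering

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vector

object KMeansScore {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansScore"))
    val x = sc.objectFile[Vector]("/data/model").collect()
    val y = new KMeansModel(x)   // visible from inside org.apache.spark.mllib
    // y.predict(...) can now be used to score new points
    sc.stop()
  }
}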
