Python: Streaming Question

2014-12-21 Thread Samarth Mailinglist
Iā€™m trying to run the stateful network word count at
https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py
using the command:

./bin/spark-submit \
  examples/src/main/python/streaming/stateful_network_wordcount.py \
  localhost 

I am also running netcat (started before running the above command):

nc -lk 

However, no wordcount is printed (even though pprint() is being called).

   1. How do I print the results?
   2. How do I otherwise access the data in real time? Suppose I want to
   have a dashboard showing the data in running_counts; a sketch of what I
   mean is below.
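
For question 2, would something along these lines be the right approach? This
is only a rough sketch: push_to_dashboard is just a placeholder for whatever
backs the dashboard, and I am assuming foreachRDD is the intended hook for
getting at each batch.

    def push_to_dashboard(time, rdd):
        # collect() is fine for a small word-count table; a real dashboard
        # would write to an external store instead of printing
        counts = dict(rdd.collect())
        print(time, counts)

    running_counts.pprint()                       # prints a sample of each batch
    running_counts.foreachRDD(push_to_dashboard)  # hands every batch over as an RDD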

Note that
https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py
works perfectly fine.

Running Spark 1.2.0 (prebuilt for Hadoop 2.4.x).

Thanks,
Samarth


Re: spark-submit question

2014-11-17 Thread Samarth Mailinglist
I figured it out. I had to use pyspark.files.SparkFiles to get the
locations of files loaded into Spark.
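
Roughly, what worked was along these lines (a minimal sketch, using the
config.properties file from my original command):

    from pyspark import SparkFiles

    # Files shipped with --files are copied to each node; SparkFiles.get
    # resolves their local path wherever the code runs.
    config_path = SparkFiles.get("config.properties")
    with open(config_path) as f:
        config = f.read()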


On Mon, Nov 17, 2014 at 1:26 PM, Sean Owen so...@cloudera.com wrote:

 You are changing these paths and filenames to match your own actual
 scripts and file locations, right?
 On Nov 17, 2014 4:59 AM, Samarth Mailinglist 
 mailinglistsama...@gmail.com wrote:

 I am trying to run a job written in python with the following command:

 bin/spark-submit --master spark://localhost:7077 \
   /path/spark_solution_basic.py \
   --py-files /path/*.py \
   --files /path/config.properties

 I always get an exception that config.properties is not found:

 INFO - IOError: [Errno 2] No such file or directory: 'config.properties'

 Why isn't this working?




Probability in Naive Bayes

2014-11-17 Thread Samarth Mailinglist
I am trying to use Naive Bayes for a project of mine in Python, and I want
to obtain class probabilities after having built the model.

Suppose I have two classes, A and B. Currently there is an API to find
which class a sample belongs to (predict). Now I want the probability of a
sample belonging to class A or to class B.

Can you please provide any details on how I can obtain these probability
values? A rough sketch of what I am attempting is below.
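
Would something like this be a reasonable workaround? It assumes the PySpark
NaiveBayesModel exposes pi (log class priors) and theta (log conditional
probabilities) as numpy arrays, which I have not confirmed is the intended
usage:

    import numpy as np

    def class_probabilities(model, features):
        # log P(c | x) is proportional to log pi_c + x . log theta_c
        log_posterior = model.pi + np.dot(model.theta, np.asarray(features))
        log_posterior -= log_posterior.max()   # stabilise before exponentiating
        probs = np.exp(log_posterior)
        return probs / probs.sum()             # entry i corresponds to model.labels[i]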


Re: Functions in Spark

2014-11-16 Thread Samarth Mailinglist
Check this video out:
https://www.youtube.com/watch?v=dmL0N3qfSc8&list=UURzsq7k4-kT-h3TDUBQ82-w
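
Beyond that, the simplest comparison is just wall-clock time over the same
action; a rough sketch (the two implementation functions are placeholders for
your own pipelines):

    import time

    def time_action(build_rdd, sc):
        # build_rdd constructs the pipeline; count() forces it to actually run
        start = time.time()
        result = build_rdd(sc).count()
        return result, time.time() - start

    # _, t1 = time_action(implementation_one, sc)
    # _, t2 = time_action(implementation_two, sc)

The Spark web UI also shows per-stage timings, shuffle sizes and memory use,
which are worth comparing alongside raw processing time.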

On Mon, Nov 17, 2014 at 9:43 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 Is there any way to know which of my functions performs better in Spark? In
 other words, say I have achieved the same thing using two different
 implementations. How do I judge which implementation is better than the
 other? Is processing time the only metric we can use to claim that one
 implementation is better than the other?
 Can anyone please share some thoughts on this?

 Thank You



Re: Scala vs Python performance differences

2014-11-12 Thread Samarth Mailinglist
I was about to ask this question.

On Wed, Nov 12, 2014 at 3:42 PM, Andrew Ash and...@andrewash.com wrote:

 Jeremy,

 Did you complete this benchmark in a way that's shareable with those
 interested here?

 Andrew

 On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'd also be interested in seeing such a benchmark.


 On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira ianferre...@hotmail.com
 wrote:

 This would be super useful. Thanks.

 On 4/15/14, 1:30 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:

 Hi Andrew,

 I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on
 ML algorithms, as I'm particularly curious about the relative performance of
 MLlib in Scala vs the Python MLlib API vs pure Python implementations.

 Will share real results as soon as I have them, but roughly, in our hands,
 that 40% number is ballpark correct, at least for some basic operations
 (e.g. textFile, count, reduce).

 -- Jeremy

 -
 Jeremy Freeman, PhD
 Neuroscientist
 @thefreemanlab

 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p4261.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Read a HDFS file from Spark source code

2014-11-11 Thread Samarth Mailinglist
Instead of a local file path, use an HDFS URI. For example (in Python):

data = sc.textFile("hdfs://localhost/user/someuser/data")


On Wed, Nov 12, 2014 at 10:12 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi

 I am trying to access a file in HDFS from the Spark source code. Basically,
 I am tweaking the Spark source code and need to read a file in HDFS from
 within it, but I am really not sure how to go about doing this.

 Can someone please help me out in this regard?
 Thank you!!
 Karthik



Re: Using mongo with PySpark

2014-06-04 Thread Samarth Mailinglist
Thanks a lot, sorry for the really late reply! (Didn't have my laptop)

This is working, but it's dreadfully slow and doesn't seem to run in parallel?
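
Would a batched variant like the sketch below be the right approach? This is
only a guess on my part: it assumes pymongo 2.x accepts a list of documents
for a bulk insert, and reuses convertToJSON and indexMap from the code
further down.

    from pymongo import MongoClient

    def bulk_mapper(iterator):
        # one client and one bulk insert per partition, instead of one
        # round trip to Mongo per record
        db = MongoClient()['spark_test_db']
        collec = db['programs']
        docs = [convertToJSON(val.encode('ascii', 'ignore'), indexMap)
                for val in iterator]
        if docs:
            collec.insert(docs)   # pymongo 2.x bulk insert of a list
        yield len(docs)

    # parallelism here is one task per partition of data
    inserted = data.mapPartitions(bulk_mapper).sum()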


On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 You need to use mapPartitions (or foreachPartition) to instantiate your
 client in each partition, as the client is not serializable by the pickle
 library. Something like:

 from pymongo import MongoClient

 def mapper(iter):
     # one MongoClient per partition; the client object itself cannot be pickled
     db = MongoClient()['spark_test_db']
     collec = db['programs']
     for val in iter:
         asc = val.encode('ascii', 'ignore')
         json = convertToJSON(asc, indexMap)
         yield collec.insert(json)

 def convertToJSON(string, indexMap):
     values = string.strip().split(",")
     json = {}
     for i in range(len(values)):
         json[indexMap[i]] = values[i]
     return json

 doc_ids = data.mapPartitions(mapper)




 On Mon, May 19, 2014 at 8:00 AM, Samarth Mailinglist 
 mailinglistsama...@gmail.com wrote:

 db = MongoClient()['spark_test_db']
 collec = db['programs']

 def mapper(val):
     asc = val.encode('ascii', 'ignore')
     json = convertToJSON(asc, indexMap)
     collec.insert(json)  # this is not working

 def convertToJSON(string, indexMap):
     values = string.strip().split(",")
     json = {}
     for i in range(len(values)):
         json[indexMap[i]] = values[i]
     return json

 jsons = data.map(mapper)



 The last line does the mapping. I am very new to Spark; can you explain
 what explicit serialization, etc. means in the context of Spark? The error I
 am getting is:
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/local/spark-0.9.1/python/pyspark/rdd.py", line 712, in saveAsTextFile
     keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
   File "/usr/local/spark-0.9.1/python/pyspark/rdd.py", line 1178, in _jrdd
     pickled_command = CloudPickleSerializer().dumps(command)
   File "/usr/local/spark-0.9.1/python/pyspark/serializers.py", line 275, in dumps
     def dumps(self, obj): return cloudpickle.dumps(obj, 2)
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 801, in dumps
     cp.dump(obj)
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 140, in dump
     return pickle.Pickler.dump(self, obj)
   File "/usr/lib/python2.7/pickle.py", line 224, in dump
     self.save(obj)
   File "/usr/lib/python2.7/pickle.py", line 286, in save
     f(self, obj) # Call unbound method with explicit self
   File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
     save(element)
   File "/usr/lib/python2.7/pickle.py", line 286, in save
     f(self, obj) # Call unbound method with explicit self
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 259, in save_function
     self.save_function_tuple(obj, [themodule])
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 316, in save_function_tuple
     save(closure)
   File "/usr/lib/python2.7/pickle.py", line 286, in save
     f(self, obj) # Call unbound method with explicit self
   File "/usr/lib/python2.7/pickle.py", line 600, in save_list
     self._batch_appends(iter(obj))
   File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
     save(x)
   File "/usr/lib/python2.7/pickle.py", line 286, in save
     f(self, obj) # Call unbound method with explicit self
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 259, in save_function
     self.save_function_tuple(obj, [themodule])
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 316, in save_function_tuple
     save(closure)
   File "/usr/lib/python2.7/pickle.py", line 286, in save
     f(self, obj) # Call unbound method with explicit self
   File "/usr/lib/python2.7/pickle.py", line 600, in save_list
     self._batch_appends(iter(obj))
   File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
     save(tmp[0])
   File "/usr/lib/python2.7/pickle.py", line 286, in save
     f(self, obj) # Call unbound method with explicit self
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 254, in save_function
     self.save_function_tuple(obj, modList)
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 314, in save_function_tuple
     save(f_globals)
   File "/usr/lib/python2.7/pickle.py", line 286, in save
     f(self, obj) # Call unbound method with explicit self
   File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 181, in save_dict
     pickle.Pickler.save_dict(self, obj)
   File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
     self._batch_setitems(obj.iteritems())
   File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
     save(v)
   File "/usr/lib/python2.7/pickle.py", line 306, in save
     rv = reduce(self.proto)
   File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 1489, in __call__
     self.__name.split(".")[-1])
 TypeError: 'Collection' object is not callable. If you meant to call the
 '__getnewargs__' method on a 'Collection' object it is failing because no
 such method exists.


 On Sat, May 17, 2014 at 9:30 PM