Re: Using mongo with PySpark

2014-06-06 Thread Mayur Rustagi
Yes, initializing the connection on each record is expensive.. you seem to be using Python. Another, riskier thing you can try is to serialize the mongoclient object using any serializer (like kryo wrappers in Python) and pass it on to the mappers.. then in each mapper you just have to deserialize it and use it directly... This may
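A minimal sketch of this serialize-the-client idea, assuming pymongo and a placeholder URI. In most pymongo versions MongoClient holds sockets and thread locks, so pickling it fails with a TypeError, which is exactly what makes the approach risky:

    import pickle
    from pymongo import MongoClient

    # Placeholder URI; pymongo 3+ connects lazily, so constructing
    # the client does not require a live mongod.
    client = MongoClient("mongodb://localhost:27017")

    try:
        payload = pickle.dumps(client)
        # If this ever succeeded, the bytes could be shipped to the
        # mappers and restored there with pickle.loads(payload).
    except TypeError as exc:
        # MongoClient holds sockets and thread locks, which pickle
        # refuses to serialize -- hence the "risky" caveat above.
        print("MongoClient is not picklable:", exc)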

Re: Using mongo with PySpark

2014-06-04 Thread Samarth Mailinglist
Thanks a lot, and sorry for the really late reply! (Didn't have my laptop.) This is working, but it's dreadfully slow and doesn't seem to run in parallel? On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote: You need to use mapPartitions (or foreachPartition) to instantiate
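A minimal sketch of the foreachPartition pattern Nick describes, assuming pymongo, a placeholder URI, and hypothetical database/collection names (mydb.results); one client is opened per partition rather than per record:

    from pyspark import SparkContext
    from pymongo import MongoClient

    def save_partition(records):
        # One client per partition; the connection is opened on the
        # worker, so nothing unpicklable crosses the driver/worker boundary.
        client = MongoClient("mongodb://localhost:27017")  # placeholder URI
        coll = client.mydb.results                         # hypothetical names
        docs = [{"key": k, "value": v} for k, v in records]
        if docs:  # insert_many rejects an empty document list
            coll.insert_many(docs)
        client.close()

    sc = SparkContext(appName="mongo-example")
    rdd = sc.parallelize([("a", 1), ("b", 2)], numSlices=4)
    rdd.foreachPartition(save_partition)

If the job still crawls, the writes may be limited by partition count: each partition is one task, so rdd.repartition(n) before the foreachPartition spreads the inserts over more concurrent tasks.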

Re: Using mongo with PySpark

2014-05-17 Thread Nicholas Chammas
Where's your driver code (the code interacting with the RDDs)? Are you getting serialization errors? On Saturday, May 17, 2014, Samarth Mailinglist <mailinglistsama...@gmail.com> wrote: Hi all, I am trying to store the results of a reduce into mongo. I want to share the variable collection in the

Re: Using mongo with PySpark

2014-05-17 Thread Mayur Rustagi
Ideally you have to pass the mongoclient object along with your data in the mapper (Python should try to serialize your mongoclient, but explicit is better). If the client is serializable then all should end well.. if not, then you are better off using mapPartitions, initializing the driver in each
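A sketch of that mapPartitions variant with a module-level client cache, so each Python worker process connects once and reuses the client across the partitions it handles (URI and database/collection names are again placeholders):

    from pymongo import MongoClient

    _client = None  # one cached client per Python worker process

    def _get_client():
        global _client
        if _client is None:
            _client = MongoClient("mongodb://localhost:27017")  # placeholder URI
        return _client

    def write_partition(records):
        coll = _get_client().mydb.results  # hypothetical names
        docs = [{"key": k, "value": v} for k, v in records]
        if docs:
            coll.insert_many(docs)
        # mapPartitions expects an iterator back; yield a count per partition.
        yield len(docs)

    # usage: counts = rdd.mapPartitions(write_partition).collect()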