akka error: play framework (2.3.3) and spark (1.0.2)

2014-08-17 Thread Sujee Maniyam
Hi, I am trying to connect to Spark from the Play framework and am getting the following Akka error... [ERROR] [08/16/2014 17:12:05.249] [spark-akka.actor.default-dispatcher-3] [ActorSystem(spark)] Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-3] shutting down ActorSystem [spark]

application as a service

2014-08-17 Thread Zhanfeng Huo
Hi, all: I need to use Spark to load business data daily and cache it as an RDD or a Spark SQL RDD, so that other users can query against it (in memory). In summary, the app must run as a daemon service that lasts for at least one day, and users' apps can

Re: application as a service

2014-08-17 Thread Eugen Cepoi
Hi, you can achieve this by running, for example, a spray service that has access to the RDD in question. When starting the app, you first build your RDD and cache it. Your spray endpoints then translate the HTTP requests into operations on that RDD. 2014-08-17 17:27 GMT+02:00 Zhanfeng Huo
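
The spray approach above is Scala; as a rough Python analogue of the same shape (Flask and every name below are my own illustration, not from the thread): build and cache the RDD once at startup, then have each HTTP endpoint translate the request into an operation on it.

    from flask import Flask, jsonify
    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-service")

    # Build and cache the RDD once, when the service starts.
    data = sc.textFile("hdfs:///data/business/daily").cache()
    data.count()  # force materialization so later queries hit the cache

    app = Flask(__name__)

    @app.route("/count/<term>")
    def count_term(term):
        # Each endpoint translates an HTTP request into an RDD operation.
        n = data.filter(lambda line: term in line).count()
        return jsonify(term=term, count=n)

    if __name__ == "__main__":
        app.run(port=8080)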

Vast network traffic during RDD creation

2014-08-17 Thread Acco
Hi, I am building a distributed machine learning algorithm on top of Spark. Datasets reside on HDFS in .*sv format, and I build the RDDs using the textFile method of SparkContext. The number of partitions is usually a lot larger than the number of blocks on HDFS (usually I aim to split the default
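
As context for the partition count mentioned above, PySpark's textFile takes the desired minimum number of partitions as its second argument; a minimal sketch (the path and numbers are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="load-sv")
    # Ask for more partitions than the file has HDFS blocks;
    # each block is then split across several partitions.
    data = sc.textFile("hdfs:///data/train.csv", 256)
    print(data.getNumPartitions())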

Spark: why need a masterLock when sending heartbeat to master

2014-08-17 Thread Victor Sheng
I don't understand why the worker needs a master lock when sending a heartbeat. Is it caused by master HA? Can anyone explain this in detail? Thanks! Please refer to: http://stackoverflow.com/questions/25173219/why-does-the-spark-worker-actor-use-a-masterlock case SendHeartbeat => masterLock.synchronized {

Re: Question on mappartitionwithsplit

2014-08-17 Thread Chengi Liu
Hi, thanks for the response. In the second case, f2, foo will have to be declared globally, right? My function is something like: def indexing(splitIndex, iterator): count = 0 offset = sum(offset_lists[:splitIndex]) if splitIndex else 0 indexed = [] for i, e in enumerate(iterator):

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Wei Tan
Hi Xiangrui, yes, I was not clear in my previous posting. You did the optimization at block level (which is brilliant!) so that blocks are joined first to avoid redundant data transfer. I have two follow-up questions: when you do rdd_a.join(rdd_b), at which site will this join be done? Say, if
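
On the "which site" question, the general rule in Spark (illustrated with my own toy example, not code from the thread) is that join shuffles both inputs unless they already share a partitioner, in which case matching partitions are combined where they sit:

    from pyspark import SparkContext

    sc = SparkContext(appName="join-demo")

    # Hash-partition both sides identically and cache them.
    a = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(4).cache()
    b = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(4).cache()

    # Co-partitioned inputs: partition i of the result is built from
    # partition i of a and of b, avoiding a full re-shuffle of either side.
    print(a.join(b).collect())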

Re: Question on mappartitionwithsplit

2014-08-17 Thread Mohit Singh
Building on what Davies Liu said, how about something like: def indexing(splitIndex, iterator, offset_lists): count = 0 offset = sum(offset_lists[:splitIndex]) if splitIndex else 0 indexed = [] for i, e in enumerate(iterator): index = count + offset + i for j, ele in

Re: SparkStreaming 0.9.0 / Java / Twitter issue

2014-08-17 Thread Jörn Franke
Could it be that the Twitter source is not supported in Java? For my use case (teaching) I have now created a simple shell server using netcat, which streams tweets I collected beforehand (several GB from the World Cup), and I use the network source to process them using Spark Streaming. On Sun, Aug
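
A sketch of the netcat setup described above, shown in Python for illustration (the Python streaming API appeared in Spark releases later than the 0.9.0 discussed here; tweets.txt and the port are made up):

    # In a shell, replay the pre-collected tweets over a socket:
    #   nc -lk 9999 < tweets.txt
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="tweet-replay")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    lines = ssc.socketTextStream("localhost", 9999)  # the "network source"
    lines.count().pprint()  # e.g. tweets per batch

    ssc.start()
    ssc.awaitTermination()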

Re: s3:// sequence file startup time

2014-08-17 Thread Aaron Davidson
The driver must initially compute the partitions and their preferred locations for each part of the file, which results in a serial getFileBlockLocations() call on each part. However, I would expect this to take several seconds, not minutes, on 1000 parts. Is your driver inside or outside of

Re: Spark: why need a masterLock when sending heartbeat to master

2014-08-17 Thread Aaron Davidson
Yes, good point, I believe the masterLock is now unnecessary altogether. The reason for its initial existence was that changeMaster() originally could be called out-of-band of the actor, and so we needed to make sure the master reference did not change out from under us. Now it appears that all

Re: application as a service

2014-08-17 Thread ryaminal
You can also look into using Ooyala's job server at https://github.com/ooyala/spark-jobserver This already has a spray server built in that allows you to do what has been explained above. Sounds like it should solve your problem. Enjoy!

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Debasish Das
Hi Wei, the Sparkler code was not available for benchmarking, so I picked up Jellyfish, which uses SGD. If you look at the paper, the ideas are very similar to the Sparkler paper, but Jellyfish runs on shared memory and uses C code, while Sparkler was built on top of Spark... Jellyfish used some

Re: [Spark Streaming] How can we use consecutive data points as the features?

2014-08-17 Thread Tobias Pfeiffer
Hi, On Sat, Aug 16, 2014 at 3:29 AM, Yan Fang yanfang...@gmail.com wrote: If all consecutive data points are in one batch, it's not complicated except that the order of data points is not guaranteed in the batch and so I have to use the timestamp in the data point to reach my goal. However,
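
A minimal sketch of the timestamp idea being discussed, assuming each record is a (timestamp, value) pair (the names here are hypothetical): re-sort each batch's RDD by the embedded timestamp before extracting features.

    def order_batch(rdd):
        # rdd holds (timestamp, value) pairs; sortByKey restores the event
        # order that the batch itself does not guarantee.
        return rdd.sortByKey()

    # With a DStream of (timestamp, value) records:
    # ordered = stream.transform(order_batch)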

Re: Re: application as a service

2014-08-17 Thread Zhanfeng Huo
Thank you, Eugen Cepoi, I will try it now. Zhanfeng Huo

Re: Question on mappartitionwithsplit

2014-08-17 Thread Davies Liu
On Sun, Aug 17, 2014 at 11:21 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, thanks for the response. In the second case, f2, foo will have to be declared globally, right? My function is something like: def indexing(splitIndex, iterator): count = 0 offset =

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Wei Tan
Hi Deb, thanks for sharing your result. Please find my comments inline in blue. Best regards, Wei

Re: Question on mappartitionwithsplit

2014-08-17 Thread Josh Rosen
Has anyone tried using functools.partial (https://docs.python.org/2/library/functools.html#functools.partial) with PySpark? If it works, it might be a nice way to address this use case. On Sun, Aug 17, 2014 at 7:35 PM, Davies Liu dav...@databricks.com wrote: On Sun, Aug 17, 2014 at 11:21 AM,
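
A quick sketch of the functools.partial idea applied to the indexing function from this thread (the offset_lists values are made up; the method name follows the current PySpark API):

    from functools import partial
    from pyspark import SparkContext

    sc = SparkContext(appName="partial-demo")

    def indexing(offset_lists, splitIndex, iterator):
        # offset = number of elements in all earlier partitions
        offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
        for i, e in enumerate(iterator):
            yield (offset + i, e)

    offset_lists = [3, 3, 3]  # per-partition element counts (example values)
    rdd = sc.parallelize(range(9), 3)
    # partial binds offset_lists, leaving the (splitIndex, iterator)
    # signature that mapPartitionsWithIndex expects.
    indexed = rdd.mapPartitionsWithIndex(partial(indexing, offset_lists))
    print(indexed.collect())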

OutOfMemory Error

2014-08-17 Thread Ghousia Taj
Hi, I am trying to implement machine learning algorithms on Spark. I am working on a 3-node cluster, with each node having 5GB of memory. Whenever I work with a slightly larger number of records, I end up with an OutOfMemory error. The problem is, even if the number of records is only slightly higher, the

Segmented fold count

2014-08-17 Thread fil
Can anyone assist with a scan of the following kind (Python preferred, but whatever..)? I'm looking for a kind of segmented fold count. Input: [1,1,1,2,2,3,4,4,5,1] Output: [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)] or preferably two output columns: id: [1,2,3,4,5,1] count: [3,2,1,2,1,1]
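
Locally, itertools.groupby computes exactly this segmented run-length count; on an RDD the same pass can run per partition via mapPartitions, though runs straddling a partition boundary then need to be merged. A sketch of the local core:

    from itertools import groupby

    data = [1, 1, 1, 2, 2, 3, 4, 4, 5, 1]
    runs = [(k, sum(1 for _ in g)) for k, g in groupby(data)]
    print(runs)  # [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]

    ids = [k for k, _ in runs]     # [1, 2, 3, 4, 5, 1]
    counts = [c for _, c in runs]  # [3, 2, 1, 2, 1, 1]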