Hi,
I am trying to connect to Spark from the Play framework and am getting the
following Akka error...
[ERROR] [08/16/2014 17:12:05.249]
[spark-akka.actor.default-dispatcher-3] [ActorSystem(spark)] Uncaught
fatal error from thread [spark-akka.actor.default-dispatcher-3]
shutting down ActorSystem [spark]
Hi, All:
I have a requirement to load business data daily with Spark and cache it as
an RDD or Spark SQL RDD, so that other users can query against it (in memory).
In summary, the app must run as a daemon service that lasts for at least one
day, and users' apps can
Hi,
You can achieve this by running, for example, a spray service that has access
to the RDD in question. When starting the app you first build your RDD and
cache it. In your spray endpoints you translate the HTTP requests into
operations on that RDD.
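As a rough illustration of that pattern (spray is a Scala toolkit, so this is
only a minimal PySpark analogue using Python's standard HTTP server; the data
path and port are hypothetical):

from http.server import BaseHTTPRequestHandler, HTTPServer
from pyspark import SparkContext

sc = SparkContext(appName="rdd-service")

# Build and cache the RDD once, when the long-running service starts.
# The path below is a placeholder for wherever the daily business data lives.
data = sc.textFile("hdfs:///data/business/daily").cache()
data.count()  # force materialization into the cache

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Translate the HTTP request into an operation on the cached RDD;
        # here: count the lines containing the requested term.
        term = self.path.lstrip("/")
        n = data.filter(lambda line: term in line).count()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(str(n).encode("utf-8"))

HTTPServer(("0.0.0.0", 8090), QueryHandler).serve_forever()

Every later GET then runs against the already cached data, so the daemon can
keep serving queries for the whole day without reloading it.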
2014-08-17 17:27 GMT+02:00 Zhanfeng Huo
Hi,
I am building a distributed machine learning algorithm on top of Spark.
Datasets reside on HDFS in .*sv format and I build the RDDs using the
textFile method of SparkContext.
The number of partitions is usually a lot larger than the number of blocks on
HDFS (usually I aim to split the default
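For reference, a minimal sketch of asking textFile for more partitions than
there are HDFS blocks, assuming an existing SparkContext sc (the path and the
number 200 are placeholders):

# request at least 200 partitions even if the file has fewer HDFS blocks
rdd = sc.textFile("hdfs:///data/train.csv", minPartitions=200)
print(rdd.getNumPartitions())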
I don't understand why the worker needs a master lock when sending a heartbeat.
Is it caused by master HA? Can anyone explain this in detail? Thanks~
Please refer:
http://stackoverflow.com/questions/25173219/why-does-the-spark-worker-actor-use-a-masterlock
case SendHeartbeat =>
  masterLock.synchronized {
Hi,
Thanks for the response..
In the second case, f2?
foo will have to be declared globally, right?
My function is something like:
def indexing(splitIndex, iterator):
    count = 0
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    indexed = []
    for i, e in enumerate(iterator):
Hi Xiangrui, yes, I was not clear in my previous posting. You did the
optimization at the block level (which is brilliant!) so that blocks are
joined first to avoid redundant data transfer.
I have two follow-up questions:
when you do rdd_a.join(rdd_b), at which site will this join be done? Say, if
Building on what Davies Liu said,
How about something like:
def indexing(splitIndex, iterator, offset_lists):
    count = 0
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    indexed = []
    for i, e in enumerate(iterator):
        index = count + offset + i
        for j, ele in
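Presumably the extra offset_lists argument gets bound when the function is
handed to mapPartitionsWithIndex. One way to do that, assuming indexing is
completed so that it yields its indexed elements and that offset_lists has
already been computed on the driver:

# the lambda closes over offset_lists, so the function Spark sees
# still has the (index, iterator) signature it expects
indexed_rdd = rdd.mapPartitionsWithIndex(
    lambda i, it: indexing(i, it, offset_lists))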
Could it be that the Twitter source is not supported in Java?
For my use case (teaching) I have now created a simple shell server using
netcat, which streams tweets I collected beforehand (several GB from the
World Cup), and I use the network source to process them with
Spark Streaming.
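A minimal sketch of that replay setup, assuming a PySpark build with streaming
support; the file name and port are placeholders:

# shell: replay the collected tweets line by line on a local socket
#   nc -lk 9999 < tweets.txt

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="tweet-replay")
ssc = StreamingContext(sc, 5)              # 5-second batches
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()                     # e.g. tweets per batch
ssc.start()
ssc.awaitTermination()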
On Sun, Aug
The driver must initially compute the partitions and their preferred
locations for each part of the file, which results in a serial
getFileBlockLocations() on each part. However, I would expect this to take
several seconds, not minutes, to perform on 1000 parts. Is your driver
inside or outside of
Yes, good point, I believe the masterLock is now unnecessary altogether.
The reason for its initial existence was that changeMaster() originally
could be called out-of-band of the actor, and so we needed to make sure the
master reference did not change out from under us. Now it appears that all
You can also look into using ooyala's job server at
https://github.com/ooyala/spark-jobserver
This already has a spray server built in that allows you to do what has
already been explained above. Sounds like it should solve your problem.
Enjoy!
Hi Wei,
Sparkler code was not available for benchmarking, so I picked up Jellyfish,
which uses SGD; if you look at the paper, the ideas are very similar to the
Sparkler paper, but Jellyfish runs on shared memory and uses C code while
Sparkler was built on top of Spark...Jellyfish used some
Hi,
On Sat, Aug 16, 2014 at 3:29 AM, Yan Fang yanfang...@gmail.com wrote:
If all consecutive data points are in one batch, it's not complicated, except
that the order of data points is not guaranteed within the batch, so I have to
use the timestamp in the data point to reach my goal. However,
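One hedged way to impose that order inside a batch, assuming each record has
already been parsed into a (timestamp, value) tuple in an RDD named batch_rdd:

# sort the batch by the embedded timestamp before applying order-sensitive logic
ordered = batch_rdd.sortBy(lambda rec: rec[0])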
Thank you Eugen Cepoi, I will try it now.
Zhanfeng Huo
From: Eugen Cepoi
Date: 2014-08-17 23:34
To: Zhanfeng Huo
CC: user
Subject: Re: application as a service
Hi,
You can achieve it by running a spray service for example that has access to
the RDD in question. When starting the app you
On Sun, Aug 17, 2014 at 11:21 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
Thanks for the response..
In the second case, f2?
foo will have to be declared globally, right?
My function is something like:
def indexing(splitIndex, iterator):
    count = 0
    offset =
Hi Deb, thanks for sharing your result. Please find my comments inline in
blue.
Best regards,
Wei
From: Debasish Das debasish.da...@gmail.com
To: Wei Tan/Watson/IBM@IBMUS,
Cc: Xiangrui Meng men...@gmail.com, user@spark.apache.org
user@spark.apache.org
Date: 08/17/2014 08:15 PM
Has anyone tried using functools.partial (
https://docs.python.org/2/library/functools.html#functools.partial) with
PySpark? If it works, it might be a nice way to address this use-case.
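A hedged sketch of what that could look like with the indexing example from
this thread, assuming the per-partition counts are small enough to collect on
the driver:

from functools import partial
from pyspark import SparkContext

sc = SparkContext(appName="partial-demo")
rdd = sc.parallelize(range(100), 4)

# per-partition element counts, gathered on the driver
offset_lists = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()

def indexing(splitIndex, iterator, offset_lists):
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    for i, e in enumerate(iterator):
        yield (offset + i, e)

# partial() binds offset_lists, leaving the (index, iterator)
# signature that mapPartitionsWithIndex expects
indexed = rdd.mapPartitionsWithIndex(partial(indexing, offset_lists=offset_lists))
print(indexed.take(5))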
On Sun, Aug 17, 2014 at 7:35 PM, Davies Liu dav...@databricks.com wrote:
On Sun, Aug 17, 2014 at 11:21 AM,
Hi,
I am trying to implement machine learning algorithms on Spark. I am working
on a 3-node cluster, with each node having 5GB of memory. Whenever I work
with a slightly larger number of records, I end up with an OutOfMemory
error. The problem is, even if the number of records is only slightly high, the
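Two common first steps for this kind of OutOfMemory error, as a hedged sketch
(the memory value and path are placeholders, not a known fix for this
particular job):

from pyspark import SparkConf, SparkContext, StorageLevel

# give each executor more memory (placeholder value)
conf = SparkConf().set("spark.executor.memory", "4g")
sc = SparkContext(conf=conf)

# let cached data spill to disk instead of keeping it all in memory
data = sc.textFile("hdfs:///data/records").persist(StorageLevel.MEMORY_AND_DISK)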
Can anyone assist with a scan of the following kind (Python preferred, but
whatever..)? I'm looking for a kind of segmented fold count.
Input: [1,1,1,2,2,3,4,4,5,1]
Output: [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
or preferably two output columns:
id: [1,2,3,4,5,1]
count: [3,2,1,2,1,1]
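One hedged way to get that result in PySpark: run-length encode each partition
locally, then stitch together runs that cross partition boundaries on the
driver (this assumes the compressed result fits on the driver):

from itertools import groupby
from pyspark import SparkContext

sc = SparkContext(appName="segmented-fold-count")
rdd = sc.parallelize([1, 1, 1, 2, 2, 3, 4, 4, 5, 1], 3)

def rle(iterator):
    # local run-length encoding within one partition
    return [(k, sum(1 for _ in g)) for k, g in groupby(iterator)]

runs = rdd.mapPartitions(rle).collect()

# merge runs that were split across a partition boundary
merged = []
for k, c in runs:
    if merged and merged[-1][0] == k:
        merged[-1] = (k, merged[-1][1] + c)
    else:
        merged.append((k, c))

ids = [k for k, _ in merged]      # [1, 2, 3, 4, 5, 1]
counts = [c for _, c in merged]   # [3, 2, 1, 2, 1, 1]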