Hi,
Has someone tried using Spark Streaming with MySQL (or any other
database/data store)? I can write to MySQL at the beginning of the driver
application. However, when I am trying to write the result of every
streaming processing window to MySQL, it fails with the following error:
Hi all,
I am implementing a crawler/scraper. It should be able to process a
crawling/scraping request within a few seconds of the job being
submitted (around 1mil/sec); for the rest I can take some time (scheduled
evenly over the day). What is the best way to implement this?
Thanks.
--
After looking at the source code of SparkConf.scala, I found the following
solution.
Just set the following Java system property:
-Dspark.master=local
Shing
On Monday, 1 September 2014, 22:09, Shing Hing Man mat...@yahoo.com.INVALID
wrote:
Hi,
I have noticed that the GroupByTest
Hi,
I would like to make sure I'm not exceeding the quota on the local
cluster's hdfs. I have a couple of questions:
1. How do I know the quota? Here's the output of hadoop fs -count -q,
which does not tell me much:
[root@ip-172-31-7-49 ~]$ hadoop fs -count -q /
2147483647
On 9/7/2014 7:27 AM, Tomer Benyamini wrote:
2. What should I do to increase the quota? Should I bring down the
existing slaves and upgrade to ones with more storage? Is there a way
to add disks to existing slaves? I'm using the default m1.large slaves
set up using the spark-ec2 script.
Take a
I keep getting the reply below every time I send a message to the Spark user
list. Can this person be taken off the list by the powers that be?
Thanks!
Ognen
Forwarded Message
Subject: DELIVERY FAILURE: Error transferring to
QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded.
Thanks! I found the hdfs ui on this port: http://[master-ip]:50070/.
It shows a 1-node hdfs, though I have 4 slaves on my cluster.
Any idea why?
On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski
ognen.duzlev...@gmail.com wrote:
On 9/7/2014 7:27 AM, Tomer Benyamini wrote:
2. What should
Hi,
I would like to copy log files from s3 to the cluster's
ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
running on the cluster - I'm getting the exception below.
Is there a way to activate it, or is there a spark alternative to distcp?
Thanks,
Tomer
mapreduce.Cluster
The standard pattern is to initialize the MySQL JDBC driver in your
mapPartitions call, update the database, then close off the driver.
A couple of gotchas:
1. A new driver is initiated for each of your partitions
2. If the effect (inserts, updates) is not idempotent, then if your server
crashes, Spark will replay
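
The per-partition pattern and the idempotency gotcha above can be sketched in plain Python. This is only an illustration under stated assumptions: sqlite3 stands in for MySQL, a list of lists stands in for an RDD's partitions, and the names write_partition, read_all and the counts table are hypothetical. It is not Spark code.

```python
import sqlite3


def write_partition(db_path, rows):
    """Open one connection for the whole partition, write, then close it."""
    conn = sqlite3.connect(db_path)  # stand-in for a MySQL JDBC connection
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)"
        )
        # Upsert keyed on the primary key: replaying the same rows leaves the
        # table unchanged, so a retried task does not duplicate data.
        conn.executemany(
            "INSERT OR REPLACE INTO counts (word, n) VALUES (?, ?)", rows
        )
        conn.commit()
    finally:
        conn.close()  # always release the connection, even on failure


def read_all(db_path):
    """Return every (word, n) row, sorted, for inspection."""
    conn = sqlite3.connect(db_path)
    try:
        return sorted(conn.execute("SELECT word, n FROM counts").fetchall())
    finally:
        conn.close()
```

Calling write_partition once per partition mirrors gotcha 1 (one connection per partition, not per record). Calling it twice with the same rows mirrors a replay after a crash; because the write is an upsert, the table ends up the same.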
Statements are executed only when you try to cause some effect on the
server (produce data, collect data on the driver). At execution time Spark
does all the dependency resolution, truncates paths that don't go anywhere,
and optimizes the execution pipelines. So you really don't have to worry
about
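
As a rough analogy for the deferred execution described above, Python's lazy map and filter behave similarly: nothing runs until something consumes the result. This is a plain-Python sketch, not Spark, and it does not model the pruning of unused paths or pipeline optimization; the traced helper and the sample numbers are hypothetical.

```python
log = []  # records every function call that actually executes


def traced(label, fn):
    """Wrap fn so that each call is recorded in the log."""
    def wrapper(x):
        log.append(label)
        return fn(x)
    return wrapper


numbers = range(5)
doubled = map(traced("double", lambda x: x * 2), numbers)    # nothing runs yet
large = filter(traced("filter", lambda x: x >= 4), doubled)  # still nothing

calls_before_action = len(log)  # no work has been done so far

result = list(large)  # the "action": only now do both steps execute

calls_after_action = len(log)
```

Here calls_before_action is 0, and the recorded calls appear only once list() forces the pipeline, which is the same reason a Spark job with no action never touches the cluster.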
Your question is a bit confusing.
I assume you have an RDD containing nodes and some metadata (child nodes
maybe), and you are trying to attach another piece of metadata to it (a byte
array). If it's just the same byte array for all nodes, you can generate an RDD
with the count of nodes and zip the two RDDs together, you can
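
The zip idea above, sketched with ordinary Python lists instead of RDDs. The node records and the payload bytes are made-up sample data, and this only shows the shape of the result, not how Spark partitions it.

```python
from itertools import repeat

# Hypothetical nodes: (node id, child node ids)
nodes = [("n1", ["n2", "n3"]), ("n2", []), ("n3", ["n1"])]

# Hypothetical byte array to attach to every node
payload = bytes([0x01, 0x02, 0x03])

# Build a second sequence with one copy of the payload per node ...
payloads = list(repeat(payload, len(nodes)))

# ... and pair the two sequences element by element.
annotated = list(zip(nodes, payloads))
```

Each element of annotated is a (node, payload) pair. Zipping only works when both sides have the same number of elements in the same order, which is why the second sequence is generated from the node count.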
Spark will simply have a backlog of tasks; it will still manage to process
them, though if it keeps falling behind you may run out of memory
or see unreasonable latency. For momentary spikes, Spark Streaming will
manage.
Mostly if you are looking to do 100% processing, you'll have to go with
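
To make the backlog argument above concrete, here is a small arithmetic sketch in plain Python. All numbers are invented for illustration, and the model (a fixed processing capacity per batch interval) is an assumption, not a description of Spark Streaming's scheduler.

```python
def backlog_per_batch(arrivals, capacity):
    """Return the number of unprocessed records left after each batch interval."""
    backlog = 0
    history = []
    for arrived in arrivals:
        # Whatever cannot be processed in this interval carries over.
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history


# A momentary spike: the backlog builds up and then drains again.
spike = backlog_per_batch([100, 100, 300, 100, 50, 50], capacity=150)

# Sustained overload: the backlog grows without bound.
sustained = backlog_per_batch([200, 200, 200, 200], capacity=150)
```

With these numbers spike is [0, 0, 150, 100, 0, 0] and sustained is [50, 100, 150, 200]. The second case is the one where memory or latency eventually becomes a problem.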
I've installed a spark standalone cluster on ec2 as defined here -
https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if
mr1/2 is part of this installation.
On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin advance...@gmail.com wrote:
Distcp requires an mr1 (or mr2) cluster to start. Do
Have you actually tested this?
I have two instances, one is standalone master and the other one just
has spark installed, same versions of spark (1.0.0).
The security group on the master allows all (0-65535) TCP and UDP
traffic from the other machine and the other machine allows all TCP/UDP
Have you tried ssh? It will be much more secure (only 1 port open),
and you'll be able to run spark-shell over the network. I'm using it that
way in my project (https://github.com/data-tsunami/smoke) with good
results.
I can't try it right now, but something like this should work:
ssh -tt
Horacio,
Thanks, I have not tried that. However, I am not after security right
now - I am just wondering why something so obvious won't work ;)
Ognen
On 9/7/2014 12:38 PM, Horacio G. de Oro wrote:
Have you tried ssh? It will be much more secure (only 1 port open),
and you'll be able to run
I think you need to run start-all.sh or something similar on the EC2
cluster. MR is installed but is not running by default on EC2 clusters spun
up by spark-ec2.
On Sun, Sep 7, 2014 at 12:33 PM, Tomer Benyamini tomer@gmail.com
wrote:
I've installed a spark standalone cluster on ec2 as
If I recall, you should be able to start Hadoop MapReduce using
~/ephemeral-hdfs/sbin/start-mapred.sh.
On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote:
Hi,
I would like to copy log files from s3 to the cluster's
ephemeral-hdfs. I tried to use distcp, but I guess
Hi Tathagata,
I have managed to implement logic in the Kafka-Spark consumer to
recover from Driver failure. This is just an interim fix until the actual fix
is done on the Spark side.
The logic is something like this.
1. When the individual receiver starts for each topic partition, it
writes
Also keep in mind there is a non-trivial amount of traffic between the
driver and cluster. It's not something I would do by default, running
the driver so remotely. With enough ports open it should work though.
On Sun, Sep 7, 2014 at 7:05 PM, Ognen Duzlevski
ognen.duzlev...@gmail.com wrote:
... I'd call out that last bit as actually tricky: "close off the driver".
See this message for the right-est way to do that, along with the
right way to open DB connections remotely instead of trying to
serialize them:
Sometimes the underlying Hive code will also print exceptions during
successful execution (for example CREATE TABLE IF NOT EXISTS). If there is
actually a problem Spark SQL should throw an exception.
What is the command you are running and what is the error you are seeing?
On Sat, Sep 6, 2014
I have the following code:
stream foreachRDD { rdd =>
  if (rdd.take(1).size == 1) {
    rdd foreachPartition { iterator =>
      initDbConnection()
      iterator foreach {
        write to db
When a MappedRDD is handled by the groupByKey transformation, tuples with the
same key that are distributed across different worker nodes will be collected
onto one worker node, say,
(K, V1), (K, V2), ..., (K, Vn) -> (K, Seq(V1, V2, ..., Vn)).
I want to know whether the value /Seq(V1, V2, ..., Vn)/ of a tuple
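
The grouping described above, modelled in plain Python. The three inner lists are hypothetical stand-ins for the tuples held on three workers, and group_by_key is an illustrative helper, not Spark's implementation (there is no shuffle here).

```python
from collections import defaultdict
from itertools import chain


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical (key, value) tuples spread over three "workers"
partitions = [
    [("K", 1), ("J", 10)],
    [("K", 2)],
    [("K", 3), ("J", 20)],
]

result = group_by_key(chain.from_iterable(partitions))
```

This toy version happens to keep values in input order. Spark's RDD documentation does not guarantee the ordering of elements within a group, so code that consumes the grouped values should not rely on it.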
Hi,
On Fri, Sep 5, 2014 at 6:16 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Does Spark support recursive calls?
Can you give an example of which kind of recursion you would like to use?
Tobias
Hi,
On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Err... there's no such feature?
The problem is that the SQLContext's `catalog` member is protected, so you
can't access it from outside. If you subclass SQLContext, and make sure
that `catalog` is always a
Hi,
I'm trying to determine which Spark deployment models are the most popular
- Standalone, YARN, Mesos, or SIMR. Does anyone know?
I thought I'd use search-hadoop.com to help me figure this out, and this is
what I found:
1) Standalone
I would say that the first three are all used pretty heavily. Mesos
was the first one supported (long ago), the standalone is the
simplest and most popular today, and YARN is newer but growing a lot
in activity.
SIMR is not used as much... it was designed mostly for environments
where users had
Doing a quick Google search, it appears to me that there are a number of
people who have implemented algorithms for solving systems of (sparse) linear
equations on Hadoop MapReduce.
However, I can find no such thing for Spark.
Does anyone have information on whether there are attempts at creating such