Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread jchen
Hi, Has anyone tried using Spark Streaming with MySQL (or any other database/data store)? I can write to MySQL at the beginning of the driver application. However, when I try to write the result of each streaming processing window to MySQL, it fails with the following error:

Crawler and Scraper with different priorities

2014-09-07 Thread Sandeep Singh
Hi all, I am implementing a crawler and scraper. It should be able to process a crawling/scraping request within a few seconds of submitting the job (around 1 mil/sec); for the rest I can take some time (scheduled evenly over the day). What is the best way to implement this? Thanks. --
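Not from the thread, but a minimal sketch of one way to serve the two priorities from a single queue: urgent requests sort ahead of the bulk backlog. All names here are hypothetical, and a real system would add workers and rate limiting.

```python
import heapq
import itertools

HIGH, LOW = 0, 1

class CrawlScheduler:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, url, priority=LOW):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def next_job(self):
        priority, _, url = heapq.heappop(self._heap)
        return priority, url

sched = CrawlScheduler()
sched.submit("http://example.com/bulk-1")        # low priority, spread over the day
sched.submit("http://example.com/urgent", HIGH)  # must run within seconds
sched.submit("http://example.com/bulk-2")

print(sched.next_job())  # the urgent request comes out first
```

The bulk jobs then drain in submission order whenever no high-priority work is pending.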

Re: Spark 1.0.2 Can GroupByTest example be run in Eclipse without change

2014-09-07 Thread Shing Hing Man
After looking at the source code of SparkConf.scala, I found the following solution. Just set the following Java system property: -Dspark.master=local Shing On Monday, 1 September 2014, 22:09, Shing Hing Man mat...@yahoo.com.INVALID wrote: Hi, I have noticed that the GroupByTest
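For reference, the same property works outside Eclipse too; a hedged sketch (the classpath variables are illustrative, not a spark-ec2 or Eclipse convention):

```shell
# In Eclipse: Run Configurations -> Arguments -> VM arguments:
#   -Dspark.master=local
# Rough shell equivalent when launching the example directly:
java -Dspark.master=local \
    -cp "$SPARK_EXAMPLES_JAR:$SPARK_CLASSPATH" \
    org.apache.spark.examples.GroupByTest
```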

Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Hi, I would like to make sure I'm not exceeding the quota on the local cluster's hdfs. I have a couple of questions: 1. How do I know the quota? Here's the output of hadoop fs -count -q which essentially does not tell me a lot root@ip-172-31-7-49 ~]$ hadoop fs -count -q / 2147483647
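For what it's worth, HDFS quotas are read and set with the commands below; a sketch against Hadoop 1.x as shipped by spark-ec2 (directory name illustrative):

```shell
# Columns of -count -q: QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIRS FILES BYTES PATH
hadoop fs -count -q /some/dir

# Cap the raw space consumed under a directory, then remove the cap again:
hadoop dfsadmin -setSpaceQuota 100g /some/dir
hadoop dfsadmin -clrSpaceQuota /some/dir
```

Note the space quota counts raw bytes, so replication factor 3 means a 100g quota holds roughly 33g of data.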

Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Ognen Duzlevski
On 9/7/2014 7:27 AM, Tomer Benyamini wrote: 2. What should I do to increase the quota? Should I bring down the existing slaves and upgrade to ones with more storage? Is there a way to add disks to existing slaves? I'm using the default m1.large slaves set up using the spark-ec2 script. Take a

Fwd: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded. Message probably in a routing loop.

2014-09-07 Thread Ognen Duzlevski
I keep getting the reply below every time I send a message to the Spark user list. Can this person be taken off the list by the powers that be? Thanks! Ognen Forwarded Message Subject: DELIVERY FAILURE: Error transferring to QCMBSJ601.HERMES.SI.SOCGEN; Maximum hop count exceeded.

Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Thanks! I found the hdfs ui via this port - http://[master-ip]:50070/. It shows a 1-node hdfs, although I have 4 slaves on my cluster. Any idea why? On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On 9/7/2014 7:27 AM, Tomer Benyamini wrote: 2. What should

distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer mapreduce.Cluster

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Mayur Rustagi
The standard pattern is to initialize the MySQL JDBC driver in your mapPartitions call, update the database, then close off the driver. A couple of gotchas: 1. A new driver is initialized for each of your partitions. 2. If the effect (inserts/updates) is not idempotent and your server crashes, Spark will replay
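The per-partition pattern, sketched in plain Python so it runs without Spark: `save_partition` plays the role of the function you would pass to rdd.foreachPartition, and `FakeConnection` stands in for a real MySQL connection (both names are mine, not Spark or JDBC API). Keyed upserts keep the write idempotent, which matters because Spark may replay a partition after a failure.

```python
class FakeConnection:
    """Stand-in for a real DB connection; tracks rows and close() for the demo."""
    def __init__(self):
        self.rows, self.closed = [], False

    def upsert(self, key, value):
        # Idempotent write: replaying the same record leaves one row, not two.
        self.rows = [(k, v) for k, v in self.rows if k != key] + [(key, value)]

    def close(self):
        self.closed = True

def save_partition(records):
    conn = FakeConnection()        # one connection per partition, not per record
    try:
        for key, value in records:
            conn.upsert(key, value)
    finally:
        conn.close()               # always release the connection
    return conn

# A replayed record ("a", 1) appears twice but is stored once:
conn = save_partition([("a", 1), ("b", 2), ("a", 1)])
```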

Re: Q: About scenarios where driver execution flow may block...

2014-09-07 Thread Mayur Rustagi
Statements are executed only when you try to cause some effect on the server (produce data, collect data on the driver). At execution time Spark does all the dependency resolution, truncates paths that don't go anywhere, and optimizes execution pipelines. So you really don't have to worry about
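A plain-Python analogy of that laziness, with generators standing in for RDDs (nothing here is Spark API): building the pipeline does no work; only consuming it does.

```python
log = []  # records when work actually happens

def numbers():
    for i in range(5):
        log.append(i)   # side effect marks real execution
        yield i

pipeline = (i * 2 for i in numbers())  # "transformation": nothing executed yet
assert log == []                        # no work has happened so far
result = list(pipeline)                 # "action": the whole pipeline runs now
```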

Re: Array and RDDs

2014-09-07 Thread Mayur Rustagi
Your question is a bit confusing. I assume you have an RDD containing nodes and some metadata (child nodes, maybe), and you are trying to attach another piece of metadata to it (byte array). If it's just the same byte array for all nodes, you can generate an RDD with the count of nodes and zip the two RDDs together; you can
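The zip idea in miniature, with plain Python lists standing in for RDDs (in Spark, rdd1.zip(rdd2) additionally requires both RDDs to have the same number of partitions and elements per partition):

```python
nodes = ["n1", "n2", "n3"]
meta = b"\x00\x01"                  # the same byte array for every node

# Generate a companion sequence of the same length, then pair element-wise:
with_meta = list(zip(nodes, [meta] * len(nodes)))
```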

Re: how to choose right DStream batch interval

2014-09-07 Thread Mayur Rustagi
Spark will simply have a backlog of tasks; it'll manage to process them nonetheless, though if it keeps falling behind, you may run out of memory or have unreasonable latency. For momentary spikes, Spark Streaming will manage. Mostly, if you are looking to do 100% processing, you'll have to go with
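A back-of-the-envelope way to see the stability condition (a sketch, not Spark API): the backlog stays bounded only while average processing time per batch is below the batch interval; otherwise scheduling delay grows without limit.

```python
def backlog_after(batches, batch_interval_s, processing_time_s):
    """Accumulated scheduling delay after N batches, in seconds."""
    delay = 0.0
    for _ in range(batches):
        delay = max(0.0, delay + processing_time_s - batch_interval_s)
    return delay

# 2 s batches, 1.5 s of work per batch: the stream keeps up.
stable = backlog_after(100, batch_interval_s=2.0, processing_time_s=1.5)

# 2 s batches, 2.5 s of work per batch: 0.5 s of delay added every batch.
falling_behind = backlog_after(100, batch_interval_s=2.0, processing_time_s=2.5)
```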

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
I've installed a spark standalone cluster on ec2 as defined here - https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if mr1/2 is part of this installation. On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin advance...@gmail.com wrote: Distcp requires a mr1(or mr2) cluster to start. Do

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Ognen Duzlevski
Have you actually tested this? I have two instances, one is standalone master and the other one just has spark installed, same versions of spark (1.0.0). The security group on the master allows all (0-65535) TCP and UDP traffic from the other machine and the other machine allows all TCP/UDP

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Horacio G. de Oro
Have you tried ssh? It will be much more secure (only 1 port open), and you'll be able to run spark-shell over the network. I'm using it that way in my project (https://github.com/data-tsunami/smoke) with good results. I can't make a try now, but something like this should work: ssh -tt
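A hedged sketch of the kind of command being described (hostname and paths are illustrative, assuming the spark-ec2 layout): the shell runs on the master, so only the ssh port needs to be open to the remote machine.

```shell
# -tt forces a TTY so the interactive spark-shell REPL works over ssh:
ssh -tt root@ec2-master-host \
    '/root/spark/bin/spark-shell --master spark://ec2-master-host:7077'
```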

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Ognen Duzlevski
Horacio, Thanks, I have not tried that; however, I am not after security right now - I am just wondering why something so obvious won't work ;) Ognen On 9/7/2014 12:38 PM, Horacio G. de Oro wrote: Have you tried ssh? It will be much more secure (only 1 port open), and you'll be able to run

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Nicholas Chammas
I think you need to run start-all.sh or something similar on the EC2 cluster. MR is installed but is not running by default on EC2 clusters spun up by spark-ec2. On Sun, Sep 7, 2014 at 12:33 PM, Tomer Benyamini tomer@gmail.com wrote: I've installed a spark standalone cluster on ec2 as

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh. On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess
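Putting the two suggestions together as a sketch for the spark-ec2 layout (bucket name and paths are illustrative):

```shell
# 1. Start the MapReduce daemons that distcp needs:
~/ephemeral-hdfs/sbin/start-mapred.sh

# 2. Then copy the S3 logs into the cluster's ephemeral HDFS
#    (S3 credentials must be configured, e.g. in core-site.xml):
~/ephemeral-hdfs/bin/hadoop distcp \
    s3n://my-log-bucket/logs/ hdfs:///logs/
```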

Re: Low Level Kafka Consumer for Spark

2014-09-07 Thread Dibyendu Bhattacharya
Hi Tathagata, I have managed to implement the logic in the Kafka-Spark consumer to recover from driver failure. This is just an interim fix till the actual fix is done on the Spark side. The logic is something like this: 1. When the individual receiver starts for every topic partition, it writes
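The recovery idea in miniature, in plain Python: a dict stands in for ZooKeeper (or whichever external store the consumer uses), and each receiver commits the last processed offset per topic partition, so a restarted receiver resumes from there instead of reprocessing from the beginning. All names here are mine, not the consumer's API.

```python
store = {}  # topic partition -> next offset to read (stand-in for ZooKeeper)

def process(partition, messages, processed):
    start = store.get(partition, 0)            # resume point after a crash
    for offset in range(start, len(messages)):
        processed.append(messages[offset])
        store[partition] = offset + 1          # commit offset after processing

msgs = ["m0", "m1", "m2", "m3"]
out = []
process("topic-0", msgs[:2], out)   # receiver dies after two messages
process("topic-0", msgs, out)       # restarted receiver resumes at m2, no duplicates
```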

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-07 Thread Sean Owen
Also keep in mind there is a non-trivial amount of traffic between the driver and cluster. It's not something I would do by default, running the driver so remotely. With enough ports open it should work though. On Sun, Sep 7, 2014 at 7:05 PM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote:

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Sean Owen
... I'd call out that last bit as actually tricky: "close off the driver". See this message for the right-est way to do that, along with the right way to open DB connections remotely instead of trying to serialize them:

Re: Spark SQL check if query is completed (pyspark)

2014-09-07 Thread Michael Armbrust
Sometimes the underlying Hive code will also print exceptions during successful execution (for example CREATE TABLE IF NOT EXISTS). If there is actually a problem Spark SQL should throw an exception. What is the command you are running and what is the error you are seeing? On Sat, Sep 6, 2014

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-07 Thread Soumitra Kumar
I have the following code:

stream foreachRDD { rdd =>
  if (rdd.take (1).size == 1) {
    rdd foreachPartition { iterator =>
      initDbConnection ()
      iterator foreach {
        write to db

Spark groupByKey partition out of memory

2014-09-07 Thread julyfire
When a MappedRDD is handled by the groupByKey transformation, tuples with the same key that are distributed across different worker nodes will be collected onto one worker node, say, (K, V1), (K, V2), ..., (K, Vn) -> (K, Seq(V1, V2, ..., Vn)). I want to know whether the value Seq(V1, V2, ..., Vn) of a tuple
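Related sketch: a reduceByKey-style combine sidesteps the issue, because values are folded into a running accumulator per key instead of materializing the whole Seq(V1, ..., Vn) for a hot key in one worker's memory (plain Python, not Spark API):

```python
def reduce_by_key(pairs, combine):
    """Fold values per key incrementally, like Spark's reduceByKey."""
    acc = {}
    for k, v in pairs:
        acc[k] = combine(acc[k], v) if k in acc else v
    return acc

pairs = [("K", 1), ("K", 2), ("x", 5), ("K", 3)]
sums = reduce_by_key(pairs, lambda a, b: a + b)  # never holds [1, 2, 3] at once
```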

Re: Recursion

2014-09-07 Thread Tobias Pfeiffer
Hi, On Fri, Sep 5, 2014 at 6:16 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Does Spark support recursive calls? Can you give an example of which kind of recursion you would like to use? Tobias

Re: How to list all registered tables in a sql context?

2014-09-07 Thread Tobias Pfeiffer
Hi, On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Err... there's no such feature? The problem is that the SQLContext's `catalog` member is protected, so you can't access it from outside. If you subclass SQLContext, and make sure that `catalog` is always a

Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Otis Gospodnetic
Hi, I'm trying to determine which Spark deployment models are the most popular - Standalone, YARN, Mesos, or SIMR. Anyone know? I thought I'd use search-hadoop.com to help me figure this out, and this is what I found: 1) Standalone

Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Patrick Wendell
I would say that the first three are all used pretty heavily. Mesos was the first one supported (long ago), the standalone is the simplest and most popular today, and YARN is newer but growing a lot in activity. SIMR is not used as much... it was designed mostly for environments where users had

Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread durin
Doing a quick Google search, it appears to me that a number of people have implemented algorithms for solving systems of (sparse) linear equations on Hadoop MapReduce. However, I can find no such thing for Spark. Does anyone have information on whether there are attempts at creating such
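Not an existing Spark library, but for flavor: Jacobi iteration is one classic scheme that distributes naturally, since each component's update depends only on the previous iterate, so the rows can be partitioned across workers. A plain-Python sketch (it converges for diagonally dominant systems like the one below):

```python
def jacobi(A, b, iterations=50):
    """Solve Ax = b by Jacobi iteration; each row update is independent."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# 4x + y = 9, x + 3y = 5  has the solution x = 2, y = 1:
solution = jacobi([[4.0, 1.0], [1.0, 3.0]], [9.0, 5.0])
```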