You can try out a few tricks employed by folks at Lynx Analytics...
Daniel Darabos gave some details at Spark Summit:
https://www.youtube.com/watch?v=zt1LdVj76LU&index=13&list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs
On 22.7.2015. 17:00, Louis Hust wrote:
My code like below:
Map<String, String> t11opt = new HashMap<String, String>();
t11opt.put("url", DB_URL);
t11opt.put("dbtable", "t11");
DataFrame t11 = sqlContext.load("jdbc", t11opt);
t11.registerTempTable("t11");
// ... the same for t12, t21, t22
DataFrame t1 = t11.unionAll(t12);
t1.registerTempTable("t1");
DataFrame t2 = t21.unionAll(t22);
t2.registerTempTable("t2");
for (int i = 0; i < 10; i++) {
    System.out.println(new Date(System.currentTimeMillis()));
    DataFrame crossjoin = sqlContext.sql(
        "select txt from t1 join t2 on t1.id = t2.id");
    crossjoin.show();
    System.out.println(new Date(System.currentTimeMillis()));
}
Here t11, t12, t21 and t22 are all DataFrames loaded via JDBC from a
MySQL database running on the same host as the Spark job.
But each loop iteration takes about 3 seconds, and I do not know why
it costs so much time.
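One likely cause (an assumption on my part, not confirmed anywhere in this thread): each call to sqlContext.sql(...) re-reads t1 and t2 from MySQL over JDBC, so every iteration pays the full load cost again. A minimal sketch of persisting the unioned DataFrames once, before the timed loop, using the Spark 1.x DataFrame API (t11, t12, t21, t22 and sqlContext as in the snippet above; requires spark-sql on the classpath, so this is not runnable standalone):

```java
// cache() marks the DataFrame for in-memory storage; the first action
// materializes it, and later actions read the cached partitions
// instead of going back to MySQL over JDBC.
DataFrame t1 = t11.unionAll(t12).cache();
t1.registerTempTable("t1");
DataFrame t2 = t21.unionAll(t22).cache();
t2.registerTempTable("t2");

// Force materialization once, outside the timed loop, so the loop
// measures only the join itself.
t1.count();
t2.count();
```

With the JDBC reads out of the loop, the remaining per-iteration time is the fixed scheduling and planning overhead of each Spark job.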
2015-07-22 19:52 GMT+08:00 Robin East <robin.e...@xense.co.uk>:
Here’s an example using spark-shell on my laptop:
sc.textFile("LICENSE").filter(_ contains "Spark").count
This takes less than a second the first time I run it and is
instantaneous on every subsequent run.
What code are you running?
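The first-run-slow, warm-run-fast pattern Robin describes can be sketched in plain Java with a memoized computation. This is a stdlib-only illustration of the measurement pattern, not Spark code: countMatches and its sleep are hypothetical stand-ins for an expensive first computation whose result is then served from cache.

```java
import java.util.HashMap;
import java.util.Map;

public class WarmupDemo {
    // Hypothetical cache standing in for Spark's materialized state.
    private static final Map<String, Long> cache = new HashMap<>();

    // Simulates a computation that is expensive on first run and
    // served from cache on every subsequent run.
    static long countMatches(String key) throws InterruptedException {
        Long cached = cache.get(key);
        if (cached != null) {
            return cached;      // warm run: answered from cache
        }
        Thread.sleep(100);      // simulate the expensive cold run
        long result = 42L;      // placeholder result
        cache.put(key, result);
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        long s1 = System.nanoTime();
        countMatches("Spark");
        long cold = System.nanoTime() - s1;

        long s2 = System.nanoTime();
        countMatches("Spark");
        long warm = System.nanoTime() - s2;

        System.out.println("cold run ms: " + cold / 1_000_000);
        System.out.println("warm run ms: " + warm / 1_000_000);
        System.out.println("warm faster: " + (warm < cold));
    }
}
```

The same shape shows up in Spark: the first action pays JVM and scheduler warm-up plus any data loading, while later actions over already-materialized data return much faster.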
On 22 Jul 2015, at 12:34, Louis Hust <louis.h...@gmail.com> wrote:
I did a simple test using Spark in standalone mode (not a cluster)
and found that a simple action takes a few seconds, even though the
data size is small, just a few rows.
So will each Spark job cost some time on initialization or preparation
work, no matter what the job is?
I mean, will the basic framework of a Spark job cost seconds?
2015-07-22 19:17 GMT+08:00 Robin East <robin.e...@xense.co.uk>:
Real-time is, of course, relative, but you’ve mentioned the
microsecond level. Spark is designed to process large amounts
of data in a distributed fashion. No distributed system I
know of could give any kind of guarantee at the microsecond
level.
Robin
> On 22 Jul 2015, at 11:14, Louis Hust <louis.h...@gmail.com> wrote:
>
> Hi, all
>
> I am using a Spark jar in standalone mode, fetching data from
different MySQL instances and doing some actions, but I found the
latency is at the second level.
>
> So I want to know whether a Spark job is suitable for real-time
queries that need microsecond latency?