You can try out a few of the tricks employed by the folks at Lynx Analytics. Daniel Darabos gave some details at Spark Summit:
https://www.youtube.com/watch?v=zt1LdVj76LU&index=13&list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs


On 22.7.2015. 17:00, Louis Hust wrote:
My code is like below:
            Map<String, String> t11opt = new HashMap<String, String>();
            t11opt.put("url", DB_URL);
            t11opt.put("dbtable", "t11");
            DataFrame t11 = sqlContext.load("jdbc", t11opt);
            t11.registerTempTable("t11");

            .......the same for t12, t21, t22


            DataFrame t1 = t11.unionAll(t12);
            t1.registerTempTable("t1");
            DataFrame t2 = t21.unionAll(t22);
            t2.registerTempTable("t2");
            for (int i = 0; i < 10; i++) {
                System.out.println(new Date(System.currentTimeMillis()));
                DataFrame crossjoin = sqlContext.sql(
                        "select txt from t1 join t2 on t1.id = t2.id");
                crossjoin.show();
                System.out.println(new Date(System.currentTimeMillis()));
            }

Here t11, t12, t21 and t22 are all DataFrames loaded over JDBC from a MySQL database that runs on the same host as the Spark job.

But each loop iteration takes about 3 seconds, and I do not know why it costs so much time.
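
One thing worth checking: nothing here is cached, so every sqlContext.sql(...) call in the loop re-reads t11, t12, t21 and t22 from MySQL over JDBC. A minimal sketch of caching the unioned DataFrames before the loop, using the same Spark 1.x DataFrame API as above (the count() calls are only there to force materialization up front):

            // Cache the unioned DataFrames so each loop iteration reads
            // from Spark's block cache instead of re-querying MySQL.
            DataFrame t1 = t11.unionAll(t12).cache();
            t1.registerTempTable("t1");
            DataFrame t2 = t21.unionAll(t22).cache();
            t2.registerTempTable("t2");
            // Materialize the caches once, so the timed loop measures
            // steady-state query time rather than the initial JDBC fetch.
            t1.count();
            t2.count();

With that in place, each iteration should cost only the join itself plus Spark's per-job scheduling overhead, rather than four JDBC reads.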




2015-07-22 19:52 GMT+08:00 Robin East <robin.e...@xense.co.uk>:

    Here’s an example using spark-shell on my laptop:

    sc.textFile("LICENSE").filter(_ contains "Spark").count

    This takes less than a second the first time I run it and is
    instantaneous on every subsequent run.

    What code are you running?


    On 22 Jul 2015, at 12:34, Louis Hust <louis.h...@gmail.com> wrote:

    I did a simple test using Spark in standalone mode (not a cluster)
    and found that a simple action takes a few seconds, even though the
    data size is small, just a few rows.
    So will each Spark job cost some time for initialization or
    preparation work, no matter what the job is?
    I mean, will the basic framework of a Spark job cost seconds?
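
    A rough way to check is to time the same small action twice in one
    application: the first run pays the one-off warm-up costs (JIT
    compilation, executor startup), while the second shows the
    steady-state per-job scheduling overhead. A minimal sketch, assuming
    an existing JavaSparkContext named sc:

        import java.util.Arrays;
        import org.apache.spark.api.java.JavaRDD;

        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        long start1 = System.currentTimeMillis();
        rdd.count();  // first action: pays the warm-up overhead
        System.out.println("first:  " + (System.currentTimeMillis() - start1) + " ms");

        long start2 = System.currentTimeMillis();
        rdd.count();  // second action: per-job scheduling overhead only
        System.out.println("second: " + (System.currentTimeMillis() - start2) + " ms");

    In standalone mode the steady-state per-job overhead is typically on
    the order of milliseconds to hundreds of milliseconds, not
    microseconds.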

    2015-07-22 19:17 GMT+08:00 Robin East <robin.e...@xense.co.uk>:

        Real-time is, of course, relative but you’ve mentioned
        microsecond level. Spark is designed to process large amounts
        of data in a distributed fashion. No distributed system I
        know of could give any kind of guarantees at the microsecond
        level.

        Robin

        > On 22 Jul 2015, at 11:14, Louis Hust <louis.h...@gmail.com> wrote:
        >
        > Hi, all
        >
        > I am using a Spark jar in standalone mode, fetching data from
        > different MySQL instances and doing some actions, but I found
        > the time is at the second level.
        >
        > So I want to know whether a Spark job is suitable for
        > real-time queries at the microsecond level?




