So there are static costs associated with parsing the query and constructing the
operators, but those should not be that much.
Another difference is that in Shark all the data is passed through a parser,
serialized, passed through the filter, and then sent to the driver.
In Spark the data is simply read as text, run through contains, and the result
is returned to the driver.
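To make the two code paths concrete, here is a minimal plain-Scala sketch (hypothetical data, no Spark cluster needed) of the difference described above: a raw substring test per line versus parsing each line into fields before filtering, which adds per-row overhead even when the query only needs a count.

```scala
// Hypothetical sample lines standing in for test.txt rows.
val lines = Seq(
  "2013-01-02\talpha",
  "2014-05-06\tbeta",
  "2013-11-30\tgamma"
)

// Spark-style path: read as text, run through contains, return the count.
// No per-line parsing happens at all.
val fastCount = lines.count(_.contains("2013-"))

// Shark-style path (simplified sketch): every line is first parsed into
// columns, and only then filtered -- extra work on every row.
val parsedCount = lines
  .map(_.split("\t"))                       // parse each line into fields
  .count(fields => fields(0).startsWith("2013-"))

println(s"fast=$fastCount parsed=$parsedCount")  // prints "fast=2 parsed=2"
```

Both paths produce the same answer; the point is that the parsed path does strictly more work per row, which is where Shark's extra wall-clock time goes.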

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Mar 6, 2014 at 7:39 PM, qingyang li <liqingyang1...@gmail.com> wrote:

> Hi, community, I have set up a 3-node Spark cluster using standalone mode;
> each machine has 16 GB of memory and 4 cores.
>
>
>
> When I run:
>   val file = sc.textFile("/user/hive/warehouse/b/test.txt")
>   file.filter(line => line.contains("2013-")).count()
>
> it takes 2.7 s,
>
>
>
> but when I run "select count(*) from b;" using Shark, it takes 15.81 s.
>
>
>
> So, why does Shark take more time than Spark?
>
> Other info:
>
> 1. I have set export SPARK_MEM=10g in shark-env.sh.
> 2. test.txt is 4.21 GB and exists in each machine's directory
> /user/hive/warehouse/b/; it has been loaded into memory.
>
> 3. There are 38,532,979 lines in test.txt.
>
