So there are static costs associated with parsing the query and structuring the operators, but those should not be large. The bigger difference is that in Shark all the data is passed through a parser, serialized, passed through the filter, and then sent to the driver. In Spark the data is simply read as text, run through contains, and the count is returned to the driver.
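The per-row overhead described above can be simulated outside Spark. The snippet below is only an illustrative Python micro-benchmark (not Spark or Shark code, and the tab-separated row format is an assumption for the sake of the example): it contrasts a raw substring scan over text lines with a path that parses each line into columns and serializes/deserializes every row before filtering, mimicking the extra work a SQL operator pipeline does.

```python
# Illustrative simulation of the two code paths, not actual Spark/Shark internals.
import pickle
import time

# Synthetic tab-separated lines standing in for test.txt rows.
lines = ["id%d\t2013-03-06\tvalue" % i for i in range(100_000)]

# "Spark-like" path: substring test directly on the raw line.
t0 = time.perf_counter()
spark_count = sum(1 for line in lines if "2013-" in line)
t_spark = time.perf_counter() - t0

# "Shark-like" path: parse each line into columns, then serialize and
# deserialize the row (as an operator pipeline would between stages)
# before applying the filter -- extra work for every single row.
t0 = time.perf_counter()
shark_count = 0
for line in lines:
    row = line.split("\t")                  # parse into a row of columns
    row = pickle.loads(pickle.dumps(row))   # round-trip serialization cost
    if "2013-" in row[1]:
        shark_count += 1
t_shark = time.perf_counter() - t0

assert spark_count == shark_count  # same answer, different cost
print(f"raw scan: {t_spark:.3f}s, parse+serde: {t_shark:.3f}s")
```

Both paths return the same count; the second simply pays a fixed per-row tax, which is why the gap grows with the 38 million rows in the original question.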
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Thu, Mar 6, 2014 at 7:39 PM, qingyang li <liqingyang1...@gmail.com> wrote:

> Hi, community, I have set up a 3-node Spark cluster using standalone mode;
> each machine has 16G of memory and 4 cores.
>
> When I run
>
>     val file = sc.textFile("/user/hive/warehouse/b/test.txt")
>     file.filter(line => line.contains("2013-")).count()
>
> it costs 2.7s.
>
> But when I run "select count(*) from b;" using Shark, it costs 15.81s.
>
> So, why does Shark take more time than Spark?
>
> Other info:
> 1. I have set export SPARK_MEM=10g in shark-env.sh.
> 2. test.txt is 4.21G and exists on each machine in the directory
>    /user/hive/warehouse/b/, and test.txt has been loaded into memory.
> 3. There are 38,532,979 lines in test.txt.