What is the format of your input data, prior to insertion into Tachyon?
On Sun, May 25, 2014 at 7:52 PM, qingyang li <liqingyang1...@gmail.com> wrote:

> i tried "set mapred.map.tasks=30", but it does not work; it seems shark does
> not support this setting.
> i also tried "SET mapred.max.split.size=64000000", and it does not work, either.
> is there another way to control the task number in the shark CLI?
>
> 2014-05-26 10:38 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>
>> You can try setting "mapred.map.tasks" to get Hive to do the right thing.
>>
>> On Sun, May 25, 2014 at 7:27 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>
>>> Hi, Aaron, thanks for sharing.
>>>
>>> I am using shark to execute queries, and the table is created on tachyon. I
>>> think I cannot use RDD#repartition() in the shark CLI.
>>> Does shark support "SET mapred.max.split.size" to control the file size?
>>> If yes, then after I create a table I can control the file number, and in
>>> turn control the task number.
>>> If not, does anyone know another way to control the task number in the shark CLI?
>>>
>>> 2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>
>>>> How many partitions are in your input data set? A possibility is that
>>>> your input data has 10 unsplittable files, so you end up with 10
>>>> partitions. You could improve this by using RDD#repartition().
>>>>
>>>> Note that mapPartitionsWithIndex is sort of the "main processing loop"
>>>> for many Spark functions. It iterates through all the elements of the
>>>> partition and does some computation (probably running your user code) on
>>>> them.
>>>>
>>>> You can see the number of partitions in your RDD by visiting the Spark
>>>> driver web interface. To access this, visit port 8080 on the host running
>>>> your Standalone Master (assuming you're running standalone mode), which
>>>> will have a link to the application web interface. The Tachyon master also
>>>> has a useful web interface, available at port 19999.
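[Editor's note: Aaron's RDD#repartition() suggestion above can be sketched in plain Spark (Scala). This is an illustrative sketch, not code from the thread; the Tachyon path and the target partition count of 40 are placeholder assumptions, and it needs a running Spark/Tachyon cluster to execute.]

```scala
// Hedged sketch: inspect an RDD's partition count and widen it with
// RDD#repartition(). Path and counts below are placeholder assumptions.
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

    // With 10 unsplittable input files, this RDD would have 10 partitions.
    val raw = sc.textFile("tachyon://tachyon-master:19998/my_table")
    println(s"partitions before: ${raw.partitions.length}")

    // One task per partition, so widening partitions widens the task count --
    // e.g. 2x the total core count of a 5-machine, 4-cores-each cluster.
    val wider = raw.repartition(40)
    println(s"partitions after: ${wider.partitions.length}")

    sc.stop()
  }
}
```

Note that repartition() triggers a shuffle; it helps when a few unsplittable files bottleneck downstream parallelism, but it is not free.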
>>>>
>>>> On Sun, May 25, 2014 at 5:43 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>
>>>>> hi, Mayur, thanks for replying.
>>>>> I know a spark application should take all cores by default. My question
>>>>> is how to set the task number on each core.
>>>>> If one slice is one task, how can I set the slice file size?
>>>>>
>>>>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>>>>
>>>>>> How many cores do you see on your spark master (port 8080)?
>>>>>> By default a spark application should take all cores when you launch
>>>>>> it, unless you have set the max cores configuration.
>>>>>>
>>>>>> Mayur Rustagi
>>>>>> Ph: +1 (760) 203 3257
>>>>>> http://www.sigmoidanalytics.com
>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>
>>>>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>>>>>
>>>>>>> my aim in setting the task number is to increase the query speed, and
>>>>>>> I have also found that "mapPartitionsWithIndex at
>>>>>>> Operator.scala:333 <http://192.168.1.101:4040/stages/stage?id=17>"
>>>>>>> is costing much time. so, my other question is:
>>>>>>> how to tune mapPartitionsWithIndex
>>>>>>> <http://192.168.1.101:4040/stages/stage?id=17>
>>>>>>> to bring that cost down?
>>>>>>>
>>>>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>
>>>>>>>> i have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 "
>>>>>>>> in shark-env.sh,
>>>>>>>> but i find there are only 10 tasks on the cluster and 2 tasks on each
>>>>>>>> machine.
>>>>>>>>
>>>>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>
>>>>>>>>> i have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 "
>>>>>>>>> in shark-env.sh
>>>>>>>>>
>>>>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> i am using tachyon as the storage system and shark to query a
>>>>>>>>>> big table. i have 5 machines as a spark cluster, and there are
>>>>>>>>>> 4 cores on each machine.
>>>>>>>>>> My questions are:
>>>>>>>>>> 1. how to set the task number on each core?
>>>>>>>>>> 2. where to see how many partitions one RDD has?
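[Editor's note: for readers of the archive, the two knobs discussed in this thread can be summarized in one hedged config sketch. The values are illustrative only, and as the thread reports, whether the Hive-style SET commands are honored depends on the Shark/Hive version in use.]

```shell
# shark-env.sh -- illustrative values, not a verified fix.
# spark.default.parallelism sets the default partition count for
# shuffle stages (e.g. group-bys and joins), but NOT the number of
# map tasks, which is driven by the input splits -- this is why the
# thread still sees only 10 tasks with parallelism=40.
export SPARK_JAVA_OPTS+=" -Dspark.default.parallelism=40 "

# Hive-style split settings, entered at the Shark CLI prompt
# (support varies by version, as reported above):
#   SET mapred.map.tasks=30;
#   SET mapred.max.split.size=64000000;
```

If the input files themselves are unsplittable, neither setting changes the initial partition count; repartitioning (or writing more, smaller files into Tachyon) is the remaining lever.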