When I use "create table bigtable002 tblproperties('shark.cache'='tachyon')
as select * from bigtable001 limit 400000;", 4 files are created on Tachyon.
But when I use "create table bigtable002 tblproperties('shark.cache'='tachyon')
as select * from bigtable001;", 35 files are created on Tachyon.
So Spark/Shark apparently knows how to split files when creating a table. Could
I control that splitting by setting some configuration, such as
"map.split.size=64M"?


2014-05-26 12:14 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:

> I using " create table bigtable002 tblproperties('shark.cache'='tachyon')
> as select * from bigtable001"  to create table bigtable002; while
> bigtable001 is load from hdfs, it's format is text file ,  so i think
> bigtable002's is text.
>
>
> 2014-05-26 11:14 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>
> What is the format of your input data, prior to insertion into Tachyon?
>>
>>
>> On Sun, May 25, 2014 at 7:52 PM, qingyang li <liqingyang1...@gmail.com> wrote:
>>
>>> i tried "set mapred.map.tasks=30" , it does not work, it seems shark
>>> does not support this setting.
>>> i also tried "SET mapred.max.split.size=64000000", it does not work,too.
>>> is there other way to control task number in shark CLI ?
>>>
>>>
>>>
>>> 2014-05-26 10:38 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>
>>> You can try setting "mapred.map.tasks" to get Hive to do the right
>>>> thing.
>>>>
>>>>
>>>> On Sun, May 25, 2014 at 7:27 PM, qingyang li 
>>>> <liqingyang1...@gmail.com> wrote:
>>>>
>>>>> Hi, Aaron, thanks for sharing.
>>>>>
>>>>> I am using Shark to execute queries, and the table is created on Tachyon. I
>>>>> think I cannot use RDD#repartition() in the Shark CLI.
>>>>> Does Shark support "SET mapred.max.split.size" to control file size?
>>>>> If yes, I can control the file number after I create the table, and then I
>>>>> can control the task number.
>>>>> If not, does anyone know another way to control the task number in the Shark CLI?
>>>>>
>>>>>
>>>>> 2014-05-26 9:36 GMT+08:00 Aaron Davidson <ilike...@gmail.com>:
>>>>>
>>>>> How many partitions are in your input data set? A possibility is that
>>>>>> your input data has 10 unsplittable files, so you end up with 10
>>>>>> partitions. You could improve this by using RDD#repartition().
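>>>>>> For example, a rough sketch of that call (the path and the target count
>>>>>> are made up, assuming the Spark shell where sc is already defined):
>>>>>>
>>>>>> // Say the input arrives as 10 unsplittable files -> 10 partitions.
>>>>>> val raw = sc.textFile("tachyon://master:19998/tmp/bigtable001")
>>>>>> // repartition() shuffles the rows into exactly the requested number of
>>>>>> // partitions; coalesce() goes the other way (fewer) without a full shuffle.
>>>>>> val evened = raw.repartition(40)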
>>>>>>
>>>>>> Note that mapPartitionsWithIndex is sort of the "main processing
>>>>>> loop" for many Spark functions. It is iterating through all the elements of
>>>>>> the partition and doing some computation (probably running your user code)
>>>>>> on it.
>>>>>>
>>>>>> You can see the number of partitions in your RDD by visiting the
>>>>>> Spark driver web interface. To access this, visit port 8080 on the host
>>>>>> running your Standalone Master (assuming you're running standalone mode),
>>>>>> which will have a link to the application web interface. The Tachyon
>>>>>> master also has a useful web interface, available at port 19999.
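>>>>>> If it is more convenient, the partition count can also be read directly in
>>>>>> the Spark shell; a one-liner sketch (the path is made up, sc is the shell's
>>>>>> SparkContext):
>>>>>>
>>>>>> // partitions.size is the number of partitions (each partition becomes one task).
>>>>>> println(sc.textFile("tachyon://master:19998/tmp/bigtable001").partitions.size)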
>>>>>>
>>>>>>
>>>>>> On Sun, May 25, 2014 at 5:43 PM, qingyang li <
>>>>>> liqingyang1...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Mayur, thanks for replying.
>>>>>>> I know a Spark application should take all cores by default. My
>>>>>>> question is how to set the task number on each core.
>>>>>>> If it is one task per slice, how can I set the slice file size?
>>>>>>>
>>>>>>>
>>>>>>> 2014-05-23 16:37 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
>>>>>>>
>>>>>>>> How many cores do you see on your Spark master (port 8080)?
>>>>>>>> By default a Spark application should take all cores when you launch
>>>>>>>> it, unless you have set the max cores configuration.
>>>>>>>>
>>>>>>>>
>>>>>>>> Mayur Rustagi
>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>  @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 22, 2014 at 4:07 PM, qingyang li <
>>>>>>>> liqingyang1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> My aim in setting the task number is to increase the query speed,
>>>>>>>>> and I have also found that "mapPartitionsWithIndex at
>>>>>>>>> Operator.scala:333 <http://192.168.1.101:4040/stages/stage?id=17>"
>>>>>>>>> is costing much time. So my other question is:
>>>>>>>>> how can I tune
>>>>>>>>> mapPartitionsWithIndex <http://192.168.1.101:4040/stages/stage?id=17>
>>>>>>>>> to bring that time down?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2014-05-22 18:09 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 "
>>>>>>>>>> in shark-env.sh,
>>>>>>>>>> but I find there are only 10 tasks on the cluster and 2 tasks on
>>>>>>>>>> each machine.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-05-22 18:07 GMT+08:00 qingyang li <liqingyang1...@gmail.com>
>>>>>>>>>> :
>>>>>>>>>>
>>>>>>>>>>> I have added SPARK_JAVA_OPTS+="-Dspark.default.parallelism=40 "
>>>>>>>>>>> in shark-env.sh.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2014-05-22 17:50 GMT+08:00 qingyang li <liqingyang1...@gmail.com
>>>>>>>>>>> >:
>>>>>>>>>>>
>>>>>>>>>>>> I am using Tachyon as the storage system and Shark to query a
>>>>>>>>>>>> table which is a big table. I have 5 machines as a Spark cluster,
>>>>>>>>>>>> and there are 4 cores on each machine.
>>>>>>>>>>>> My questions are:
>>>>>>>>>>>> 1. How do I set the task number on each core?
>>>>>>>>>>>> 2. Where can I see how many partitions one RDD has?
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
