Side note: I thought bzip2 was splittable. Perhaps you meant gzip? 2014년 10월 18일 토요일, Aaron Davidson<ilike...@gmail.com>님이 작성한 메시지:
> The "minPartitions" argument of textFile/hadoopFile cannot decrease the > number of splits past the physical number of blocks/files. So if you have 3 > HDFS blocks, asking for 2 minPartitions will still give you 3 partitions > (hence the "min"). It can, however, convert a file with fewer HDFS blocks > into more (so you could ask for and get 4 partitions), assuming the blocks > are "splittable". HDFS blocks are usually splittable, but if it's > compressed with something like bzip2, it would not be. > > If you wish to combine splits from a larger file, you can use > RDD#coalesce. With shuffle=false, this will simply concatenate partitions, > but it does not provide any ordering guarantees (it uses an algorithm which > attempts to coalesce co-located partitions, to maintain locality > information). > > coalesce() with shuffle=true causes all of the elements will be shuffled > around randomly into new partitions, which is an expensive operation but > guarantees uniformity of data distribution. > > On Sat, Oct 18, 2014 at 10:47 AM, Mayur Rustagi <mayur.rust...@gmail.com > <javascript:_e(%7B%7D,'cvml','mayur.rust...@gmail.com');>> wrote: > >> Does it retain the order if its pulling from the hdfs blocks, meaning >> if file1 => a, b, c partition in order >> if I convert to 2 partition read will it map to ab, c or a, bc or it can >> also be a, cb ? >> >> >> Mayur Rustagi >> Ph: +1 (760) 203 3257 >> http://www.sigmoidanalytics.com >> @mayur_rustagi <https://twitter.com/mayur_rustagi> >> >> >> On Sat, Oct 18, 2014 at 9:09 AM, Ilya Ganelin <ilgan...@gmail.com >> <javascript:_e(%7B%7D,'cvml','ilgan...@gmail.com');>> wrote: >> >>> Also - if you're doing a text file read you can pass the number of >>> resulting partitions as the second argument. >>> On Oct 17, 2014 9:05 PM, "Larry Liu" <larryli...@gmail.com >>> <javascript:_e(%7B%7D,'cvml','larryli...@gmail.com');>> wrote: >>> >>>> Thanks, Andrew. What about reading out of local? >>>> >>>> On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash <and...@andrewash.com >>>> <javascript:_e(%7B%7D,'cvml','and...@andrewash.com');>> wrote: >>>> >>>>> When reading out of HDFS it's the HDFS block size. >>>>> >>>>> On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu <larryli...@gmail.com >>>>> <javascript:_e(%7B%7D,'cvml','larryli...@gmail.com');>> wrote: >>>>> >>>>>> What is the default input split size? How to change it? >>>>>> >>>>> >>>>> >>>> >> >