As per the Spark doc:
The textFile method also takes an optional second argument for controlling
the number of partitions of the file. By default, Spark creates one
partition for each block of the file (blocks being 128MB by default in
HDFS), but you can also ask for a higher number of partitions by passing a
larger value. Note that you cannot have fewer partitions than blocks.
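For example, a minimal PySpark sketch (the path and partition count here
are placeholders, not from the thread):

# Hypothetical HDFS path; by default Spark creates one partition per
# 128MB block, but you can request more via the second argument:
rdd = sc.textFile("hdfs:///data/events.log", 400)
print(rdd.getNumPartitions())  # >= 400, and never fewer than the block count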


sc.textFile doesn't commence any reading. It simply defines a
driver-resident data structure which can be used for further processing.

It is not until an action is called on an RDD that Spark will build up a
strategy to perform all the required transforms (including the read) and
then return the result.
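You can observe this laziness directly (hypothetical path; nothing below
touches the file until the action on the last line):

rdd = sc.textFile("hdfs:///data/events.log")  # returns immediately; no I/O yet
lengths = rdd.map(lambda line: len(line))     # still no I/O; only builds the lineage
total = lengths.sum()                         # action: Spark now reads and computes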

When an action is called to run the sequence, and the first transformation
after the read is a map, Spark will read a small section of lines of the
file (according to the partitioning strategy, based on the number of
cores), immediately start mapping it, and continue until it needs to return
a result to the driver or shuffle before the next stage of transformations.
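As a sketch (again with a placeholder path): an action like take only pulls
as much data as it needs, so the read and the map are pipelined per
partition rather than the whole file being read up front:

r = sc.textFile("hdfs:///data/events.log")
first_fields = r.map(lambda line: line.split(",")[0])
# Each task reads its own split and streams lines through the map function;
# take(5) triggers reading only as many partitions as needed.
print(first_fields.take(5))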

If your partitioning strategy (defaultMinPartitions) seems to be swamping
the workers because the Java representation of a partition (an InputSplit,
in HDFS terms) is larger than the available executor memory, then you need
to specify the number of partitions to read as the second parameter to
textFile. You can calculate the ideal number of partitions by dividing the
file size by your target partition size (allowing for memory growth). A
simple check that the file can be read would be:

sc.textFile(file, numPartitions).count()
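A rough sizing sketch, assuming a 100GB file and a ~128MB target partition
size (both numbers are illustrative, not prescriptive):

import math

file_size_bytes = 100 * 1024**3     # 100GB file (assumption)
target_split_bytes = 128 * 1024**2  # aim for roughly 128MB per partition
num_partitions = math.ceil(file_size_bytes / target_split_bytes)  # = 800

r = sc.textFile("file:///nfs/my/file", num_partitions)
r.count()  # smoke test: forces a full read of the file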

You can find a good explanation here:
https://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs



Regards,
Vaquar khan


On Jun 11, 2017 5:28 AM, "ayan guha" <[email protected]> wrote:

> Hi
>
> My question is: what happens if I have one file of, say, 100GB? How many
> partitions will there be?
>
> Best
> Ayan
> On Sun, 11 Jun 2017 at 9:36 am, vaquar khan <[email protected]> wrote:
>
>> Hi Ayan,
>>
>> If you have multiple files (for example, 12 files) and you are using the
>> following code, then you will get at least 12 partitions (one or more per
>> file, depending on how many blocks each file spans).
>>
>> r = sc.textFile("file://my/file/*")
>>
>> Not sure what you want to know about the file system; please check the API doc.
>>
>>
>> Regards,
>> Vaquar khan
>>
>>
>> On Jun 8, 2017 10:44 AM, "ayan guha" <[email protected]> wrote:
>>
>> Any one?
>>
>> On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <[email protected]> wrote:
>>
>>> Hi Guys
>>>
>>> Quick one: how does Spark deal with (i.e., create partitions for) large
>>> files sitting on NFS, assuming all executors can see the file in exactly
>>> the same way?
>>>
>>> ie, when I run
>>>
>>> r = sc.textFile("file://my/file")
>>>
>>> what happens if the file is on NFS?
>>>
>>> is there any difference from
>>>
>>> r = sc.textFile("hdfs://my/file")
>>>
>>> Are the input formats used the same in both cases?
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>> --
> Best Regards,
> Ayan Guha
>
