Hi Ayan,

You might be interested in the official Spark docs: https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and the spark.default.parallelism setting described there.
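To make that concrete, here is a minimal PySpark sketch; the application name, the value 48, and the file path are illustrative assumptions, not something from this thread:

    from pyspark import SparkConf, SparkContext

    # Illustrative configuration; 48 is an arbitrary example value, not a recommendation.
    conf = (SparkConf()
            .setAppName("parallelism-example")
            .set("spark.default.parallelism", "48"))
    sc = SparkContext(conf=conf)

    print(sc.defaultParallelism)    # reflects spark.default.parallelism
    print(sc.defaultMinPartitions)  # min(defaultParallelism, 2), used by textFile when no value is passed

    # textFile's second argument is a minimum number of partitions; the underlying
    # Hadoop input format may still create more splits than requested.
    rdd = sc.textFile("file:///path/to/some/file", 48)  # hypothetical path
    print(rdd.getNumPartitions())

Note that for textFile, spark.default.parallelism only feeds into defaultMinPartitions, so passing minPartitions explicitly is the more direct knob for the number of read partitions. A second sketch at the end of this thread applies the same idea to the NFS question.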
Best,

On Mon, Jun 12, 2017 at 6:18 AM, ayan guha <guha.a...@gmail.com> wrote:

> I understand how it works with HDFS. My question is: when HDFS is not the
> file system, how is the number of partitions calculated? Hope that makes
> it clearer.
>
> On Mon, 12 Jun 2017 at 2:42 am, vaquar khan <vaquar.k...@gmail.com> wrote:
>
>> As per the Spark docs:
>> The textFile method also takes an optional second argument for
>> controlling the number of partitions of the file. *By default, Spark
>> creates one partition for each block of the file (blocks being 128MB by
>> default in HDFS)*, but you can also ask for a higher number of partitions
>> by passing a larger value. Note that you cannot have fewer partitions
>> than blocks.
>>
>> sc.textFile doesn't commence any reading. It simply defines a
>> driver-resident data structure which can be used for further processing.
>>
>> It is not until an action is called on an RDD that Spark will build up a
>> strategy to perform all the required transforms (including the read) and
>> then return the result.
>>
>> If an action is called to run the sequence, and your next transformation
>> after the read is a map, then Spark will need to read a small section of
>> lines of the file (according to the partitioning strategy based on the
>> number of cores) and then immediately start to map it, until it needs to
>> return a result to the driver or shuffle before the next sequence of
>> transformations.
>>
>> If your partitioning strategy (defaultMinPartitions) seems to be swamping
>> the workers because the Java representation of your partition (an
>> InputSplit in HDFS terms) is bigger than the available executor memory,
>> then you need to specify the number of partitions to read as the second
>> parameter to textFile. You can calculate the ideal number of partitions
>> by dividing your file size by your target partition size (allowing for
>> memory growth). A simple check that the file can be read would be:
>>
>> sc.textFile(file, numPartitions).count()
>>
>> You can find a good explanation here:
>> https://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs
>>
>> Regards,
>> Vaquar khan
>>
>> On Jun 11, 2017 5:28 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My question is: what happens if I have one file of, say, 100 GB? How
>>> many partitions will there be?
>>>
>>> Best
>>> Ayan
>>>
>>> On Sun, 11 Jun 2017 at 9:36 am, vaquar khan <vaquar.k...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ayan,
>>>>
>>>> If you have multiple files (for example, 12 files) and you are using
>>>> the following code, then you will get 12 partitions:
>>>>
>>>> r = sc.textFile("file://my/file/*")
>>>>
>>>> Not sure what you want to know about the file system; please check the
>>>> API docs.
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> On Jun 8, 2017 10:44 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>>>>
>>>> Anyone?
>>>>
>>>> On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> Quick one: how does Spark deal with (i.e. create partitions for) large
>>>>> files sitting on NFS, assuming all executors can see the file in
>>>>> exactly the same way?
>>>>>
>>>>> i.e., when I run
>>>>>
>>>>> r = sc.textFile("file://my/file")
>>>>>
>>>>> what happens if the file is on NFS?
>>>>>
>>>>> Is there any difference from
>>>>>
>>>>> r = sc.textFile("hdfs://my/file")
>>>>>
>>>>> Are the input formats used the same in both cases?
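For the original question about one large file on NFS, here is a minimal sketch that applies Vaquar's suggestion above: compute the partition count from the file size, then verify what Spark actually produced. The path and the 128 MB target are assumptions, not values from the thread:

    import os
    from pyspark import SparkContext

    sc = SparkContext(appName="nfs-partition-check")

    path = "/mnt/nfs/my/file"                   # hypothetical NFS-mounted path
    target_partition_bytes = 128 * 1024 * 1024  # aim for roughly 128 MB per partition

    # Ideal partition count as described above: file size divided by target partition size.
    num_partitions = max(1, os.path.getsize(path) // target_partition_bytes)

    rdd = sc.textFile("file://" + path, num_partitions)
    print(rdd.getNumPartitions())  # what the input format actually produced
    print(rdd.count())             # the simple "can the file be read" check from the thread

As far as I know, file:// and hdfs:// paths both go through Hadoop's text input format, so the split calculation is analogous; the main differences are the block size reported by the local filesystem and the lack of data locality, which is why passing minPartitions explicitly is the most predictable way to control the partition count on NFS.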