Hi Ayan,

You might be interested in the official Spark docs: https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and the spark.default.parallelism setting described there.
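To make that concrete, here is a minimal PySpark sketch; the application name, the value 48, and the file path are illustrative assumptions, not something from this thread:

    from pyspark import SparkConf, SparkContext

    # Illustrative configuration; 48 is an arbitrary example value, not a recommendation.
    conf = (SparkConf()
            .setAppName("parallelism-example")
            .set("spark.default.parallelism", "48"))
    sc = SparkContext(conf=conf)

    print(sc.defaultParallelism)    # reflects spark.default.parallelism
    print(sc.defaultMinPartitions)  # min(defaultParallelism, 2), used by textFile when no value is passed

    # textFile's second argument is a minimum number of partitions; the underlying
    # Hadoop input format may still create more splits than requested.
    rdd = sc.textFile("file:///path/to/some/file", 48)  # hypothetical path
    print(rdd.getNumPartitions())

Note that for textFile, spark.default.parallelism only feeds into defaultMinPartitions, so passing minPartitions explicitly is the more direct knob for the number of read partitions. A second sketch at the end of this thread applies the same idea to the NFS question.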
Best,

On Mon, Jun 12, 2017 at 6:18 AM, ayan guha <guha.a...@gmail.com> wrote:

> I understand how it works with HDFS. My question is: when HDFS is not the
> file system, how is the number of partitions calculated? Hope that makes
> it clearer.
>
> On Mon, 12 Jun 2017 at 2:42 am, vaquar khan <vaquar.k...@gmail.com> wrote:
>
>> As per the Spark docs:
>> The textFile method also takes an optional second argument for
>> controlling the number of partitions of the file. *By default, Spark
>> creates one partition for each block of the file (blocks being 128MB by
>> default in HDFS)*, but you can also ask for a higher number of partitions
>> by passing a larger value. Note that you cannot have fewer partitions
>> than blocks.
>>
>> sc.textFile doesn't commence any reading. It simply defines a
>> driver-resident data structure which can be used for further processing.
>>
>> It is not until an action is called on an RDD that Spark will build up a
>> strategy to perform all the required transforms (including the read) and
>> then return the result.
>>
>> If an action is called to run the sequence, and your next transformation
>> after the read is a map, then Spark will need to read a small section of
>> lines of the file (according to the partitioning strategy based on the
>> number of cores) and then immediately start to map it, until it needs to
>> return a result to the driver or shuffle before the next sequence of
>> transformations.
>>
>> If your partitioning strategy (defaultMinPartitions) seems to be swamping
>> the workers because the Java representation of your partition (an
>> InputSplit in HDFS terms) is bigger than the available executor memory,
>> then you need to specify the number of partitions to read as the second
>> parameter to textFile. You can calculate the ideal number of partitions
>> by dividing your file size by your target partition size (allowing for
>> memory growth). A simple check that the file can be read would be:
>>
>> sc.textFile(file, numPartitions).count()
>>
>> You can find a good explanation here:
>> https://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs
>>
>> Regards,
>> Vaquar khan
>>
>> On Jun 11, 2017 5:28 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> My question is: what happens if I have one file of, say, 100 GB? How
>>> many partitions will there be?
>>>
>>> Best
>>> Ayan
>>>
>>> On Sun, 11 Jun 2017 at 9:36 am, vaquar khan <vaquar.k...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ayan,
>>>>
>>>> If you have multiple files (for example, 12 files) and you are using
>>>> the following code, then you will get 12 partitions:
>>>>
>>>> r = sc.textFile("file://my/file/*")
>>>>
>>>> Not sure what you want to know about the file system; please check the
>>>> API docs.
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> On Jun 8, 2017 10:44 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>>>>
>>>> Anyone?
>>>>
>>>> On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> Quick one: how does Spark deal with (i.e. create partitions for) large
>>>>> files sitting on NFS, assuming all executors can see the file in
>>>>> exactly the same way?
>>>>>
>>>>> i.e., when I run
>>>>>
>>>>> r = sc.textFile("file://my/file")
>>>>>
>>>>> what happens if the file is on NFS?
>>>>>
>>>>> Is there any difference from
>>>>>
>>>>> r = sc.textFile("hdfs://my/file")
>>>>>
>>>>> Are the input formats used the same in both cases?
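For the original question about one large file on NFS, here is a minimal sketch that applies Vaquar's suggestion above: compute the partition count from the file size, then verify what Spark actually produced. The path and the 128 MB target are assumptions, not values from the thread:

    import os
    from pyspark import SparkContext

    sc = SparkContext(appName="nfs-partition-check")

    path = "/mnt/nfs/my/file"                   # hypothetical NFS-mounted path
    target_partition_bytes = 128 * 1024 * 1024  # aim for roughly 128 MB per partition

    # Ideal partition count as described above: file size divided by target partition size.
    num_partitions = max(1, os.path.getsize(path) // target_partition_bytes)

    rdd = sc.textFile("file://" + path, num_partitions)
    print(rdd.getNumPartitions())  # what the input format actually produced
    print(rdd.count())             # the simple "can the file be read" check from the thread

As far as I know, file:// and hdfs:// paths both go through Hadoop's text input format, so the split calculation is analogous; the main differences are the block size reported by the local filesystem and the lack of data locality, which is why passing minPartitions explicitly is the most predictable way to control the partition count on NFS.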