Re: RDD Partition number

2015-02-20 Thread Alessandro Lulli
Hi All,

Thanks for your answers.
I have one more detail to point out.

It is now clear how the number of partitions is determined for a file on HDFS.

However, suppose I have my dataset replicated on all the machines at the same
absolute path, and each machine has a local filesystem such as ext3.

If I load the file into an RDD, how many partitions are defined in this case,
and why?

I found that Spark defines a number of partitions, say K. If I force the
number of partitions to be less than K, my parameter is ignored.
If I set a value K* >= K, then Spark creates K* partitions.
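
To make this concrete, here is roughly what I am running in the spark-shell
(the path is made up; the dataset sits at the same absolute path on every
machine's local ext3 filesystem):

  // Hypothetical local path, identical on every machine (not HDFS).
  val path = "file:///data/mydataset.txt"

  // Let Spark choose: this is the K I am referring to.
  val rdd = sc.textFile(path)
  println(rdd.partitions.length)   // K

  // Requesting fewer than K is ignored; the second argument of
  // textFile is only a minimum number of partitions.
  val fewer = sc.textFile(path, 1)
  println(fewer.partitions.length) // still K in my tests

  // Requesting K* >= K is honored.
  val more = sc.textFile(path, 4 * rdd.partitions.length)
  println(more.partitions.length)  // K* (or more)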

Thanks for your help
Alessandro


On Thu, Feb 19, 2015 at 6:27 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. *blocks being 64MB by default in HDFS*


 *In Hadoop 2.1+, the default block size has been increased to 128MB.*
 See https://issues.apache.org/jira/browse/HDFS-4053

 Cheers

 On Thu, Feb 19, 2015 at 8:32 AM, Ted Yu yuzhih...@gmail.com wrote:

 What file system are you using?

 If you use HDFS, the documentation you cited is pretty clear on how
 partitions are determined.

 bq. file X replicated on 4 machines

 I don't think the replication factor plays a role w.r.t. partitions.

 On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli lu...@di.unipi.it
 wrote:

 Hi All,

 Could you please help me understand how Spark defines the number of
 partitions of an RDD when it is not specified?

 I found the following in the documentation for files loaded from HDFS:
 *The textFile method also takes an optional second argument for
 controlling the number of partitions of the file. By default, Spark creates
 one partition for each block of the file (blocks being 64MB by default in
 HDFS), but you can also ask for a higher number of partitions by passing a
 larger value. Note that you cannot have fewer partitions than blocks*

 What is the rule for files loaded from an ordinary (non-HDFS) file system?
 For instance, I have a file X replicated on 4 machines. If I load the
 file X into an RDD, how many partitions are defined, and why?

 Thanks for your help on this
 Alessandro






Re: RDD Partition number

2015-02-19 Thread Ted Yu
What file system are you using?

If you use HDFS, the documentation you cited is pretty clear on how
partitions are determined.

bq. file X replicated on 4 machines

I don't think the replication factor plays a role w.r.t. partitions.

On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli lu...@di.unipi.it wrote:

 Hi All,

 Could you please help me understand how Spark defines the number of
 partitions of an RDD when it is not specified?

 I found the following in the documentation for files loaded from HDFS:
 *The textFile method also takes an optional second argument for
 controlling the number of partitions of the file. By default, Spark creates
 one partition for each block of the file (blocks being 64MB by default in
 HDFS), but you can also ask for a higher number of partitions by passing a
 larger value. Note that you cannot have fewer partitions than blocks*

 What is the rule for files loaded from an ordinary (non-HDFS) file system?
 For instance, I have a file X replicated on 4 machines. If I load the file
 X into an RDD, how many partitions are defined, and why?

 Thanks for your help on this
 Alessandro



Re: RDD Partition number

2015-02-19 Thread Ilya Ganelin
By default you will have (file size in MB / 64) partitions. You can also set
the number of partitions when you read in a file, via the optional second
parameter of sc.textFile.
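
A quick sketch in the spark-shell (the HDFS path here is hypothetical):

  // A ~640MB file in 64MB blocks loads as roughly 10 partitions.
  val logs = sc.textFile("hdfs:///data/logs.txt")
  println(logs.partitions.length)    // about fileSize / blockSize

  // The optional second argument is a minimum: Hadoop's input format
  // produces at least this many splits, and never fewer than blocks.
  val logs100 = sc.textFile("hdfs:///data/logs.txt", 100)
  println(logs100.partitions.length) // >= 100
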
On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli lu...@di.unipi.it wrote:

 Hi All,

 Could you please help me understand how Spark defines the number of
 partitions of an RDD when it is not specified?

 I found the following in the documentation for files loaded from HDFS:
 *The textFile method also takes an optional second argument for
 controlling the number of partitions of the file. By default, Spark creates
 one partition for each block of the file (blocks being 64MB by default in
 HDFS), but you can also ask for a higher number of partitions by passing a
 larger value. Note that you cannot have fewer partitions than blocks*

 What is the rule for files loaded from an ordinary (non-HDFS) file system?
 For instance, I have a file X replicated on 4 machines. If I load the file
 X into an RDD, how many partitions are defined, and why?

 Thanks for your help on this
 Alessandro



Re: RDD Partition number

2015-02-19 Thread Ted Yu
bq. *blocks being 64MB by default in HDFS*


*In Hadoop 2.1+, the default block size has been increased to 128MB.*
See https://issues.apache.org/jira/browse/HDFS-4053
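
If you want to verify what block size your client configuration uses, a
quick check in the spark-shell (dfs.blocksize is the Hadoop 2 key; older
releases used dfs.block.size):

  // Prints the configured HDFS block size in bytes; 134217728 (128MB)
  // is the Hadoop 2.1+ default per HDFS-4053.
  println(sc.hadoopConfiguration.get("dfs.blocksize", "134217728"))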

Cheers

On Thu, Feb 19, 2015 at 8:32 AM, Ted Yu yuzhih...@gmail.com wrote:

 What file system are you using?

 If you use HDFS, the documentation you cited is pretty clear on how
 partitions are determined.

 bq. file X replicated on 4 machines

 I don't think the replication factor plays a role w.r.t. partitions.

 On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli lu...@di.unipi.it
 wrote:

 Hi All,

 Could you please help me understand how Spark defines the number of
 partitions of an RDD when it is not specified?

 I found the following in the documentation for files loaded from HDFS:
 *The textFile method also takes an optional second argument for
 controlling the number of partitions of the file. By default, Spark creates
 one partition for each block of the file (blocks being 64MB by default in
 HDFS), but you can also ask for a higher number of partitions by passing a
 larger value. Note that you cannot have fewer partitions than blocks*

 What is the rule for files loaded from an ordinary (non-HDFS) file system?
 For instance, I have a file X replicated on 4 machines. If I load the
 file X into an RDD, how many partitions are defined, and why?

 Thanks for your help on this
 Alessandro





RE: RDD Partition number

2015-02-19 Thread Ganelin, Ilya
As Ted Yu points out, the default block size is 128MB as of Hadoop 2.1.





-Original Message-
From: Ilya Ganelin [ilgan...@gmail.com]
Sent: Thursday, February 19, 2015 12:13 PM Eastern Standard Time
To: Alessandro Lulli; user@spark.apache.org
Cc: Massimiliano Bertolucci
Subject: Re: RDD Partition number

By default you will have (file size in MB / 64) partitions. You can also set the
number of partitions when you read in a file, via the optional second parameter
of sc.textFile.
On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli 
lu...@di.unipi.it wrote:
Hi All,

Could you please help me understand how Spark defines the number of
partitions of an RDD when it is not specified?

I found the following in the documentation for files loaded from HDFS:
The textFile method also takes an optional second argument for controlling the
number of partitions of the file. By default, Spark creates one partition for
each block of the file (blocks being 64MB by default in HDFS), but you can also
ask for a higher number of partitions by passing a larger value. Note that you
cannot have fewer partitions than blocks

What is the rule for files loaded from an ordinary (non-HDFS) file system?
For instance, I have a file X replicated on 4 machines. If I load the file X
into an RDD, how many partitions are defined, and why?

Thanks for your help on this
Alessandro

