Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-28 Thread Xiangrui Meng
It extends CombineFileInputFormat from Hadoop. isSplitable=false means each
individual file cannot be split. If you only see one partition even with a
large minPartitions, perhaps the total size of the files is not big enough.
The split sizes are configurable in the Hadoop conf. -Xiangrui
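
A rough way to experiment with that from spark-shell (`sc` is the shell's
SparkContext; the path and the numbers are arbitrary placeholders):

    // minPartitions is the first knob: Spark derives a max combined-split
    // size from it, so a larger hint tends to yield more partitions.
    val rdd = sc.wholeTextFiles("hdfs://a-hdfs-path", minPartitions = 32)
    println(rdd.partitions.length)

    // For CombineFileInputFormat in general, the Hadoop 2.x key below caps the
    // combined split size; whether it takes effect for wholeTextFiles is an
    // assumption to verify against your Spark/Hadoop versions.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize",
      (16 * 1024 * 1024).toString)  // 16 MB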



Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
EDIT: not a mapper but rather a task for HadoopRDD, as far as I know.

I think the clearest way is just to run a job on multiple files with the
API and check the number of tasks in the job.
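
For instance, from spark-shell (the bucket and path are hypothetical):

    val rdd = sc.wholeTextFiles("s3n://some-bucket/many-files/")
    println(rdd.partitions.length)  // partitions in the read stage
    rdd.count()                     // compare with the task count in the Spark UI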


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
The wholeTextFiles() API uses WholeTextFileInputFormat,
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
which returns false for isSplitable. In this case, only a single mapper
appears for the entire file, as far as I know.
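
For reference, the relevant part of the linked file looks roughly like this
(abridged, imports omitted):

    private[spark] class WholeTextFileInputFormat
      extends CombineFileInputFormat[Text, Text] with Configurable {

      // an individual file is never split: each whole file goes to one task
      override protected def isSplitable(context: JobContext, file: Path): Boolean =
        false

      // ... createRecordReader, setMinPartitions, etc.
    }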



Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu
Spark can create distributed datasets from any storage source supported
by Hadoop, including your local file system, HDFS, Cassandra, HBase,
Amazon S3, etc. Spark supports text files, SequenceFiles, and any other
Hadoop InputFormat.

Text file RDDs can be created using SparkContext's textFile method.
This method takes a URI for the file (either a local path on the
machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection
of lines. Here is an example invocation:
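
(The invocation the guide shows at that point is essentially the following:)

    val distFile = sc.textFile("data.txt")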



I could not find a concrete statement saying whether the read (of more
than one file) is distributed or not.


On 26.04.2016 18:00, Hyukjin Kwon wrote:

> then this would not be distributed




Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
And see also https://spark.apache.org/docs/1.6.0/programming-guide.html

If the input is a single file, then this would not be distributed.
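
To make the contrast concrete (hypothetical path; behavior as described in
this thread): textFile may split one large file across several tasks, while
wholeTextFiles reads each file in a single task:

    val lines = sc.textFile("hdfs:///data/big.log")        // splittable: possibly many partitions
    val whole = sc.wholeTextFiles("hdfs:///data/big.log")  // not splittable: one file, one partition
    println(lines.partitions.length)
    println(whole.partitions.length)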


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Ted Yu
Please take a look at:
core/src/main/scala/org/apache/spark/SparkContext.scala

   * Do `val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")`,
   *
   *  then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-0, its content)
   *   (a-hdfs-path/part-1, its content)
   *   ...
   *   (a-hdfs-path/part-n, its content)
   * }}}
...
   * @param minPartitions A suggestion value of the minimal splitting number
   *                      for input data.

  def wholeTextFiles(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
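
A hypothetical call passing that hint (the path is the scaladoc placeholder):

    val rdd = sc.wholeTextFiles("hdfs://a-hdfs-path", minPartitions = 8)
    // minPartitions is only a suggestion; the actual partition count also
    // depends on the number and total size of the input files
    println(rdd.partitions.length)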



Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Vadim Vararu

Hi guys,

I'm trying to read many files from S3 using
JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
manner? Please give me a link to the place in the documentation where it's
specified.


Thanks, Vadim.
