Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-12 Thread Jeff Zhang
Didn't notice that I can pass comma-separated paths in the existing API
(SparkContext#textFile). So there's no need for a new API. Thanks all.
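For the record, a minimal sketch of the resolved approach (the file paths
here are placeholders):

val lines = sc.textFile("hdfs:///data/file1.txt,hdfs:///data/file2.txt")
// The path string is handed to Hadoop's FileInputFormat, which splits it
// on commas, so both files end up in the same RDD.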



-- 
Best Regards

Jeff Zhang


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Hi Pradeep

>>> Looks like what I was suggesting doesn't work. :/
I guess you mean putting comma-separated paths into one string and passing
it to the existing API (SparkContext#textFile). That should not work. I
suggest creating a new API, SparkContext#textFiles, that accepts an array
of strings. I have already implemented a simple patch and it works.
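Roughly, the shape I have in mind (an illustrative sketch, not the actual
patch):

// Hypothetical addition to SparkContext: build one RDD per path and
// union them in a single step.
def textFiles(paths: Array[String],
              minPartitions: Int = defaultMinPartitions): RDD[String] =
  union(paths.map(p => textFile(p, minPartitions)).toSeq)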




On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota wrote:

> Looks like what I was suggesting doesn't work. :/

-- 
Best Regards

Jeff Zhang


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Yes, that's what I'm suggesting. TextInputFormat supports multiple inputs, so
on the Spark side we just need to provide an API for that.
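For reference, a sketch of the same idea expressed directly through the
Hadoop input-format API (the file names are placeholders):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// hadoopFile hands its path argument to FileInputFormat.setInputPaths,
// which accepts a comma-separated list of paths.
val lines = sc
  .hadoopFile[LongWritable, Text, TextInputFormat]("file1,file2")
  .map(_._2.toString)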

On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota wrote:

> IIRC, TextInputFormat supports an input path that is a comma-separated
> list. I haven't tried this, but I think you should just be able to do
> sc.textFile("file1,file2,...")


-- 
Best Regards

Jeff Zhang


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
I know these workarounds, but wouldn't it be more convenient and
straightforward to use SparkContext#textFiles?



-- 
Best Regards

Jeff Zhang


Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using
SparkContext#union instead of RDD#union.  That will avoid building up a
lengthy lineage.
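For example, a sketch of the difference (assuming paths is a collection of
input paths):

val rdds = paths.map(sc.textFile(_))

// Chained RDD#union nests one union per file, growing the lineage linearly.
val chained = rdds.reduce(_ union _)

// SparkContext#union builds a single UnionRDD over all inputs at once.
val flat = sc.union(rdds)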



Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Shixiong Zhu
In addition, if you have more than two text files, you can just put them
into a Seq and use "reduce(_ ++ _)".
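For example (the file names are placeholders):

val paths = Seq("file1", "file2", "file3")
val rdd = paths.map(sc.textFile(_)).reduce(_ ++ _)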

Best Regards,
Shixiong Zhu



Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff,
Do you mean reading from multiple text files? In that case, as a
workaround, you can use the RDD#union() (or ++) method to concatenate
multiple RDDs. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")

val rdd = lines1 union lines2

regards,
--Jakob

On 11 November 2015 at 01:20, Jeff Zhang  wrote:

> Although users can use the HDFS glob syntax to support multiple inputs,
> sometimes it is not convenient to do that. I'm not sure why there's no
> SparkContext#textFiles API. It should be easy to implement. I'd love to
> create a ticket and contribute a patch if there's no other consideration
> that I'm not aware of.
>
> --
> Best Regards
>
> Jeff Zhang
>