Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-12 Thread Jeff Zhang
I didn't notice that I can pass comma-separated paths to the existing API
(SparkContext#textFile), so there's no need for a new API. Thanks all.
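
For example, a minimal illustration of that comma-separated form (the file
names here are placeholders):

val rdd = sc.textFile("file1,file2,file3")  // one call, several inputs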



On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote:

> Hi Pradeep,
>
> >>> Looks like what I was suggesting doesn't work. :/
> I guess you mean putting the comma-separated paths into one string and
> passing it to the existing API (SparkContext#textFile). That should not
> work. I suggest creating a new API, SparkContext#textFiles, that accepts
> an array of strings. I have already implemented a simple patch and it works.
>
> --
> Best Regards
>
> Jeff Zhang



-- 
Best Regards

Jeff Zhang


Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jeff Zhang
Although users can use HDFS glob syntax to specify multiple inputs, sometimes
it is not convenient to do that. I'm not sure why there's no API like
SparkContext#textFiles; it should be easy to implement. I'd be happy to
create a ticket and contribute it if there's no other consideration I'm not
aware of.
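
For reference, a minimal example of the glob approach mentioned above (the
path is made up):

val rdd = sc.textFile("hdfs:///data/logs/2015-11-*/part-*")  // HDFS glob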

-- 
Best Regards

Jeff Zhang


Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Shixiong Zhu
In addition, if you have more than two text files, you can just put them
into a Seq and use "reduce(_ ++ _)".
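
For example, a minimal sketch of that approach (file names are placeholders):

val paths = Seq("file1", "file2", "file3")
// read each file into its own RDD, then concatenate them pairwise with ++
val rdd = paths.map(sc.textFile(_)).reduce(_ ++ _)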

Best Regards,
Shixiong Zhu



Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jakob Odersky
Hey Jeff,
Do you mean reading from multiple text files? In that case, as a
workaround, you can use the RDD#union() (or ++) method to concatenate
multiple RDDs. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")

val rdd = lines1 union lines2

regards,
--Jakob



Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Pradeep Gollakota
IIRC, TextInputFormat supports an input path that is a comma-separated list.
I haven't tried this, but I think you should just be able to do
sc.textFile("file1,file2,...")



Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Pradeep Gollakota
Looks like what I was suggesting doesn't work. :/



Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jeff Zhang
I know these workarounds, but wouldn't it be more convenient and
straightforward to use SparkContext#textFiles?



-- 
Best Regards

Jeff Zhang


Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using
SparkContext#union instead of RDD#union.  That will avoid building up a
lengthy lineage.
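
For example, a minimal sketch of that approach (file names are placeholders):

val paths = Seq("file1", "file2", "file3")
// SparkContext#union builds a single UnionRDD over all inputs instead of
// chaining pairwise RDD#union calls, so the lineage stays flat
val rdd = sc.union(paths.map(sc.textFile(_)))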



Re: Why there's no API for SparkContext#textFiles to support multiple inputs?

2015-11-11 Thread Jeff Zhang
Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on
the Spark side we just need to provide an API for that.
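
A minimal sketch of what such an API might look like (hypothetical, not the
actual patch), assuming it simply joins the paths and delegates to the
existing SparkContext#textFile, which hands the comma-separated list to
Hadoop's TextInputFormat:

// hypothetical helper, not the actual patch; assumes sc is in scope
def textFiles(paths: Array[String]): org.apache.spark.rdd.RDD[String] =
  sc.textFile(paths.mkString(","))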



-- 
Best Regards

Jeff Zhang