Didn't notice that I can pass a comma separated list of paths to the existing API (SparkContext#textFile), so there's no need for a new API. Thanks all.
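For example, in spark-shell (where sc is already defined; the paths below are just placeholders), a single comma separated string works with the existing API:

val lines = sc.textFile("/data/part1.txt,/data/part2.txt")
// The comma separated string is handed to Hadoop's TextInputFormat, which
// splits it into individual input paths, so the resulting RDD contains the
// lines of both files.
println(lines.count())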
On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> Hi Pradeep
>
> >>> Looks like what I was suggesting doesn't work. :/
>
> I guess you mean putting the comma separated paths into one string and
> passing it to the existing API (SparkContext#textFile). That should not
> work. I suggest creating a new API, SparkContext#textFiles, that accepts
> an array of strings. I have already implemented a simple patch and it
> works.
>
> On Thu, Nov 12, 2015 at 10:17 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>
>> Looks like what I was suggesting doesn't work. :/
>>
>> On Wed, Nov 11, 2015 at 4:49 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> Yes, that's what I suggest. TextInputFormat supports multiple inputs,
>>> so on the Spark side we just need to provide an API for that.
>>>
>>> On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota <pradeep...@gmail.com> wrote:
>>>
>>>> IIRC, TextInputFormat supports an input path that is a comma separated
>>>> list. I haven't tried this, but I think you should just be able to do
>>>> sc.textFile("file1,file2,...")
>>>>
>>>> On Wed, Nov 11, 2015 at 4:30 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> I know these workarounds, but wouldn't it be more convenient and
>>>>> straightforward to use SparkContext#textFiles?
>>>>>
>>>>> On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>
>>>>>> For more than a small number of files, you'd be better off using
>>>>>> SparkContext#union instead of RDD#union. That will avoid building up
>>>>>> a lengthy lineage.
>>>>>>
>>>>>> On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky <joder...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Jeff,
>>>>>>> Do you mean reading from multiple text files? In that case, as a
>>>>>>> workaround, you can use the RDD#union() (or ++) method to
>>>>>>> concatenate multiple RDDs. For example:
>>>>>>>
>>>>>>> val lines1 = sc.textFile("file1")
>>>>>>> val lines2 = sc.textFile("file2")
>>>>>>>
>>>>>>> val rdd = lines1 union lines2
>>>>>>>
>>>>>>> regards,
>>>>>>> --Jakob
>>>>>>>
>>>>>>> On 11 November 2015 at 01:20, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Users can use the HDFS glob syntax to specify multiple inputs, but
>>>>>>>> sometimes that is not convenient. Not sure why there's no
>>>>>>>> SparkContext#textFiles API; it should be easy to implement. I'd
>>>>>>>> love to create a ticket and contribute it if there's no other
>>>>>>>> consideration that I'm not aware of.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> Jeff Zhang
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>
> --
> Best Regards
>
> Jeff Zhang

--
Best Regards

Jeff Zhang
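For more than a couple of files, Mark's SparkContext#union suggestion looks roughly like this (again a minimal sketch in spark-shell, with sc predefined and placeholder paths):

val paths = Seq("/data/part1.txt", "/data/part2.txt", "/data/part3.txt")
// Build one RDD per path and union them in a single call; unlike chaining
// RDD#union, this keeps the lineage flat instead of adding one level per file.
val combined = sc.union(paths.map(p => sc.textFile(p)))
println(combined.count())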