Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-12 Thread Jeff Zhang
I didn't notice that I can pass comma-separated paths to the existing API (SparkContext#textFile), so there is no need for a new API. Thanks all. On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote: > Hi Pradeep > > >>> Looks like what I was suggesting doesn't work. :/ > I guess you
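A minimal sketch of the comma-separated form mentioned in this message, assuming a local Spark installation. The object name, temp directory, file names, and file contents are hypothetical and exist only to make the example self-contained:

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CommaSeparatedPathsSketch {
  // Writes two small temp files, then reads both with a single textFile call.
  def countLines(): Long = {
    val dir = Files.createTempDirectory("textfile-sketch") // hypothetical scratch dir
    val a = Files.write(dir.resolve("a.txt"), "one\ntwo\n".getBytes("UTF-8"))
    val b = Files.write(dir.resolve("b.txt"), "three\n".getBytes("UTF-8"))

    val conf = new SparkConf().setMaster("local[1]").setAppName("comma-separated-paths")
    val sc = new SparkContext(conf)
    try {
      // Join the paths with commas and pass them as one string.
      val paths = Seq(a, b).map(_.toString).mkString(",")
      sc.textFile(paths).count()
    } finally {
      sc.stop()
    }
  }
}
```

Because the comma acts as the separator, this form is probably unsuitable for paths that themselves contain commas.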

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Shixiong Zhu
In addition, if you have more than two text files, you can just put them into a Seq and use "reduce(_ ++ _)". Best Regards, Shixiong Zhu 2015-11-11 10:21 GMT-08:00 Jakob Odersky : > Hey Jeff, > Do you mean reading from multiple text files? In that case, as a > workaround,
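The `reduce(_ ++ _)` suggestion could look roughly like the sketch below. It assumes a local Spark installation; the object name and the temp files are hypothetical and only make the example runnable on its own:

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ReduceUnionSketch {
  // Reads each file into its own RDD, then folds the RDDs together pairwise with ++.
  def countLines(): Long = {
    val dir = Files.createTempDirectory("reduce-union-sketch") // hypothetical scratch dir
    val paths = Seq("a.txt", "b.txt", "c.txt").map { name =>
      Files.write(dir.resolve(name), (name + "\n").getBytes("UTF-8")).toString
    }

    val conf = new SparkConf().setMaster("local[1]").setAppName("reduce-union")
    val sc = new SparkContext(conf)
    try {
      val lines = paths.map(p => sc.textFile(p)).reduce(_ ++ _)
      lines.count()
    } finally {
      sc.stop()
    }
  }
}
```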

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jakob Odersky
Hey Jeff, Do you mean reading from multiple text files? In that case, as a workaround, you can use the RDD#union() (or ++) method to concatenate multiple RDDs. For example:

    val lines1 = sc.textFile("file1")
    val lines2 = sc.textFile("file2")
    val rdd = lines1 union lines2

regards, --Jakob On 11

Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Although users can use the HDFS glob syntax to support multiple inputs, sometimes that is not convenient. I'm not sure why there's no SparkContext#textFiles API. It should be easy to implement. I'd love to create a ticket and contribute it if there's no other consideration

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Hi Pradeep >>> Looks like what I was suggesting doesn't work. :/ I guess you mean putting comma-separated paths into one string and passing it to the existing API (SparkContext#textFile). It should not work. I suggest creating a new API, SparkContext#textFiles, that accepts an array of strings. I have already

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
I know these workarounds, but wouldn't it be more convenient and straightforward to use SparkContext#textFiles? On Thu, Nov 12, 2015 at 2:27 AM, Mark Hamstra wrote: > For more than a small number of files, you'd be better off using > SparkContext#union instead of

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In
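A sketch of the SparkContext#union variant, assuming a local Spark installation (object name and temp files are hypothetical). It passes the whole sequence of RDDs to one `union` call, so the result is a single union over all inputs instead of a chain of pairwise unions:

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object ContextUnionSketch {
  // Builds one RDD per file and combines them all in a single sc.union call.
  def countLines(): Long = {
    val dir = Files.createTempDirectory("context-union-sketch") // hypothetical scratch dir
    val paths = (1 to 4).map { i =>
      Files.write(dir.resolve(s"part-$i.txt"), s"line $i\n".getBytes("UTF-8")).toString
    }

    val conf = new SparkConf().setMaster("local[1]").setAppName("context-union")
    val sc = new SparkContext(conf)
    try {
      val rdds = paths.map(p => sc.textFile(p))
      sc.union(rdds).count()
    } finally {
      sc.stop()
    }
  }
}
```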

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Jeff Zhang
Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on the Spark side we just need to provide an API for that. On Thu, Nov 12, 2015 at 8:45 AM, Pradeep Gollakota wrote: > IIRC, TextInputFormat supports an input path that is a comma separated > list. I