Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
I didn't notice that I can pass comma-separated paths to the existing API (SparkContext#textFile), so there's no need for a new API. Thanks all.

--
Best Regards

Jeff Zhang
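For reference, a minimal sketch of that usage (the file paths are placeholders):

// TextInputFormat treats a comma-separated string as multiple input
// paths, so a single textFile call can read several files at once.
val rdd = sc.textFile("hdfs:///data/file1.txt,hdfs:///data/file2.txt")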
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
Hi Pradeep,

> Looks like what I was suggesting doesn't work. :/

I guess you mean putting the comma-separated paths into one string and passing it to the existing API (SparkContext#textFile). That should not work. I suggest creating a new API, SparkContext#textFiles, that accepts an array of strings. I have already implemented a simple patch and it works.

--
Best Regards

Jeff Zhang
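For illustration only, a hypothetical sketch of what such a method on SparkContext could look like (the actual patch isn't shown in this thread, so the signature and body below are assumptions):

// Hypothetical sketch, not the actual patch: read each path separately
// and union the resulting RDDs in a single step.
def textFiles(paths: Array[String]): RDD[String] =
  union(paths.map(path => textFile(path)).toSeq)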
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
Yes, that's what I suggest. TextInputFormat supports multiple inputs, so on the Spark side we just need to provide an API for that.

--
Best Regards

Jeff Zhang
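A short sketch of the Hadoop side of that claim (the paths are placeholders):

import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

// FileInputFormat (which TextInputFormat extends) accepts a
// comma-separated list of input paths directly.
val job = Job.getInstance()
FileInputFormat.setInputPaths(job, "hdfs:///data/file1,hdfs:///data/file2")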
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
I know these workarounds, but wouldn't it be more convenient and straightforward to use SparkContext#textFiles?

--
Best Regards

Jeff Zhang
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage.
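A minimal sketch of that approach (the file paths are placeholders):

// One RDD per input file, then a single SparkContext#union over the
// whole sequence: this builds one UnionRDD, so the lineage stays flat
// instead of growing by one level per file as chained RDD#union does.
val paths = Seq("file1", "file2", "file3")
val all = sc.union(paths.map(p => sc.textFile(p)))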
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
In addition, if you have more than two text files, you can just put them into a Seq and use "reduce(_ ++ _)".

Best Regards,

Shixiong Zhu
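A minimal sketch of that pattern (the file names are placeholders):

// Pairwise concatenation via reduce; each ++ is an RDD#union, so this
// chains one union per file (see the lineage caveat in Mark's message).
val files = Seq("file1", "file2", "file3")
val combined = files.map(sc.textFile(_)).reduce(_ ++ _)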
Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?
Hey Jeff,

Do you mean reading from multiple text files? In that case, as a workaround, you can use the RDD#union() (or ++) method to concatenate multiple RDDs. For example:

val lines1 = sc.textFile("file1")
val lines2 = sc.textFile("file2")

val rdd = lines1 union lines2

regards,
--Jakob

On 11 November 2015 at 01:20, Jeff Zhang wrote:

> Although users can use the HDFS glob syntax to support multiple inputs,
> sometimes it is not convenient to do that. Not sure why there's no API
> like SparkContext#textFiles. It should be easy to implement. I'd love to
> create a ticket and contribute for that if there's no other consideration
> that I'm not aware of.
>
> --
> Best Regards
>
> Jeff Zhang
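For reference, the HDFS glob workaround mentioned above looks like this (the path is a placeholder):

// Glob patterns are expanded by the underlying FileInputFormat,
// so one textFile call can match many files.
val logs = sc.textFile("hdfs:///data/2015/*/part-*")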