Re: retrieving all the rows with collect()

2016-02-11 Thread Mich Talebzadeh
Thanks Jakob, much appreciated. Mich On 11/02/2016 00:01, Jakob Odersky wrote: > Exactly! > As a final note, `foreach` is also defined on RDDs. This means that > you don't need to `collect()` the results into an array (which could > give you an OutOfMemoryError in case the RDD is really

retrieving all the rows with collect()

2016-02-10 Thread mich . talebzadeh
Hi, I have a bunch of files stored in the HDFS /unix_files directory, 319 files in total. scala> val errlog = sc.textFile("/unix_files/*.ksh") scala> errlog.filter(line => line.contains("sed"))count() res104: Long = 1113 So it returns 1113 lines containing the word "sed". If I want to see the collection

Re: retrieving all the rows with collect()

2016-02-10 Thread Chandeep Singh
Hi Mich, If you would like to print everything to the console you could - errlog.filter(line => line.contains("sed"))collect()foreach(println) or you could always save to a file using any of the saveAs methods. Thanks, Chandeep On Wed, Feb 10, 2016 at 8:14 PM, <

Re: retrieving all the rows with collect()

2016-02-10 Thread Mich Talebzadeh
Hi Chandeep, Many thanks for your help. In the line below errlog.filter(line => line.contains("sed"))collect()foreach(println) can you please clarify the components with the correct naming, as I am new to Scala? * errlog --> is the RDD? * filter(line =>

Re: retrieving all the rows with collect()

2016-02-10 Thread Ted Yu
Mich: When you execute the statements in Spark shell, you would see the types of the intermediate results. scala> val errlog = sc.textFile("/home/john/s.out") errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out MapPartitionsRDD[1] at textFile at <console>:24 scala> val sed = errlog.filter(line =>

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Hi Mich, your assumptions 1 to 3 are all correct (nitpick: they're method *calls*, the methods being the part before the parentheses, but I assume that's what you meant). The last one is also a method call but uses syntactic sugar on top: `foreach(println)` boils down to `foreach(line =>

Re: retrieving all the rows with collect()

2016-02-10 Thread Mich Talebzadeh
Many thanks Jakob. So it basically boils down to this demarcation as suggested, which looks clearer: val errlog = sc.textFile("/unix_files/*.ksh") errlog.filter(line => line.contains("sed")).collect().foreach(line => println(line)) Regards, Mich On 10/02/2016 23:21, Jakob Odersky wrote:
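
For readers following the thread, here is a minimal sketch of the equivalence discussed above, using a plain Scala `Seq` in place of the RDD so it runs without Spark. The object name `ForeachSugar`, the helper `containing`, and the sample lines are made up for illustration and are not from the thread.

```scala
// Plain Scala collections stand in for the RDD; no Spark is needed here.
object ForeachSugar {
  // Hypothetical helper: keeps only the lines that contain `word`.
  def containing(lines: Seq[String], word: String): Seq[String] =
    lines.filter(line => line.contains(word))

  def main(args: Array[String]): Unit = {
    // Made-up sample input, not taken from the files in the thread.
    val lines = Seq("sed -e 's/a/b/' in.ksh", "echo hello", "cat x | sed 's/x/y/'")
    val matched = containing(lines, "sed")

    // Method passed directly: the compiler turns println into a function.
    matched.foreach(println)

    // Explicit lambda: the form the line above boils down to.
    matched.foreach(line => println(line))

    // Placeholder syntax: a third spelling of the same call.
    matched.foreach(println(_))
  }
}
```

All three `foreach` calls print the same lines; the dotted, parenthesised style just makes each method call easier to pick out.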

Re: retrieving all the rows with collect()

2016-02-10 Thread Jakob Odersky
Exactly! As a final note, `foreach` is also defined on RDDs. This means that you don't need to `collect()` the results into an array (which could give you an OutOfMemoryError in case the RDD is really really large) before printing them. Personally, when I learn to use a new library, I like to look
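
A rough sketch of the point about calling `foreach` directly on the RDD, written as a standalone program. The object name `PrintMatches`, the helper `matching`, the application name, the `local[*]` master and the sample size of 20 are assumptions for illustration; in spark-shell the context `sc` already exists and should be used instead of creating one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PrintMatches {
  // Hypothetical helper: lines under `path` that contain `word`.
  def matching(sc: SparkContext, path: String, word: String): RDD[String] =
    sc.textFile(path).filter(line => line.contains(word))

  def main(args: Array[String]): Unit = {
    // Assumed settings for a standalone local run.
    val conf = new SparkConf().setAppName("print-matches").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val matches = matching(sc, "/unix_files/*.ksh", "sed")

      // foreach on the RDD itself: nothing is gathered into a driver-side array.
      // The println calls run on the executors, so on a cluster the output
      // goes to the executors' stdout rather than the driver console.
      matches.foreach(line => println(line))

      // To look at a bounded sample on the driver, take(n) returns an Array.
      matches.take(20).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In local mode both calls print to the same console. On a cluster, `take` (or one of the saveAs methods mentioned earlier in the thread) is usually the more practical way to inspect results.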