Thanks Jakob, much appreciated.
Mich
On 11/02/2016 00:01, Jakob Odersky wrote:
> Exactly!
> As a final note, `foreach` is also defined on RDDs. This means that
> you don't need to `collect()` the results into an array (which could
> give you an OutOfMemoryError in case the RDD is really
Hi,
I have a bunch of files stored in the HDFS directory /unix_files, 319
files in total.
scala> val errlog = sc.textFile("/unix_files/*.ksh")
scala> errlog.filter(line => line.contains("sed"))count()
res104: Long = 1113
So it returns 1113 lines that contain the word "sed".
If I want to see the collection
Hi Mich,
If you would like to print everything to the console you could use:
errlog.filter(line => line.contains("sed"))collect()foreach(println)
or you could always save to a file using any of the saveAs methods.
Thanks,
Chandeep
On Wed, Feb 10, 2016 at 8:14 PM, <
Hi Chandeep
Many thanks for your help
In the line below
errlog.filter(line => line.contains("sed"))collect()foreach(println)
Can you please clarify the components with the correct naming as I am
new to Scala
* errlog --> is the RDD?
* filter(line =>
Mich:
When you execute the statements in Spark shell, you would see the types of
the intermediate results.
scala> val errlog = sc.textFile("/home/john/s.out")
errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out
MapPartitionsRDD[1] at textFile at <console>:24
scala> val sed = errlog.filter(line =>
Hi Mich,
your assumptions 1 to 3 are all correct (nitpick: they're method
*calls*, the methods being the part before the parentheses, but I
assume that's what you meant). The last one is also a method call but
uses syntactic sugar on top: `foreach(println)` boils down to
`foreach(line =>
Many thanks Jakob.
So it basically boils down to this demarcation, as suggested, which
looks clearer:
val errlog = sc.textFile("/unix_files/*.ksh")
errlog.filter(line => line.contains("sed")).collect().foreach(line =>
println(line))
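
For reference, the equivalence between `foreach(println)` and
`foreach(line => println(line))` can be tried without a cluster. The
following is only a minimal sketch on an ordinary Scala collection, not
on an RDD; the object name `NotationDemo` and the sample lines are
hypothetical and stand in for the contents of the .ksh files.

```scala
object NotationDemo {
  // Hypothetical sample lines standing in for the contents of the .ksh files.
  val lines: Seq[String] = Seq(
    "sed -e 's/a/b/' input.txt",
    "echo hello",
    "cat file | sed 's/x/y/'"
  )

  // Explicit lambda: keep only the lines that contain "sed".
  val dotted: Seq[String] = lines.filter(line => line.contains("sed"))

  // The same filter written with the underscore placeholder syntax.
  val short: Seq[String] = lines.filter(_.contains("sed"))

  def main(args: Array[String]): Unit = {
    // Explicit lambda passed to foreach.
    dotted.foreach(line => println(line))
    // Passing the method directly; the compiler expands it to the lambda above.
    dotted.foreach(println)
  }
}
```

Both `foreach` calls print the same two lines; the second form is just
shorthand for the first.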
Regards,
Mich
On 10/02/2016 23:21, Jakob Odersky wrote:
Exactly!
As a final note, `foreach` is also defined on RDDs. This means that
you don't need to `collect()` the results into an array (which could
give you an OutOfMemoryError in case the RDD is really really large)
before printing them.
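
A minimal sketch of what this could look like as a standalone program,
assuming Spark is on the classpath and running in local mode. The
object name `SedLines`, the helper `sedLines`, the `local[*]` master
and the limit of 20 lines are illustrative assumptions; in the Spark
shell `sc` already exists and only the lines inside `try` are needed.
Note that on a real cluster `foreach` runs on the executors, so the
output goes to their stdout rather than to the driver console; `take(n)`
is one way to look at a bounded sample on the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SedLines {
  // Hypothetical helper: keep only the lines that mention "sed".
  def sedLines(lines: RDD[String]): RDD[String] =
    lines.filter(line => line.contains("sed"))

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val conf = new SparkConf().setAppName("SedLines").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val errlog = sc.textFile("/unix_files/*.ksh")
      val matches = sedLines(errlog)

      // foreach directly on the RDD: no array is built on the driver.
      matches.foreach(line => println(line))

      // Bring at most 20 lines back to the driver and print them there.
      matches.take(20).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```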
Personally, when learning to use a new library, I like to look