Mich: When you execute the statements in the Spark shell, you will see the types of the intermediate results.

scala> val errlog = sc.textFile("/home/john/s.out")
errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out MapPartitionsRDD[1] at textFile at <console>:24

scala> val sed = errlog.filter(line => line.contains("sed"))
sed: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:26

scala> sed.collect()
res0: Array[String] = Array([WARNING] Unrecognised ...

Cheers

On Wed, Feb 10, 2016 at 2:35 PM, Mich Talebzadeh <mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Hi Chandeep
>
> Many thanks for your help
>
> In the line below
>
> errlog.filter(line => line.contains("sed"))collect()foreach(println)
>
> Can you please clarify the components with the correct naming as I am new
> to Scala
>
> 1. errlog --> is the RDD?
> 2. filter(line => line.contains("sed")) is a method
> 3. collect() is another method ?
> 4. foreach (println) ?
>
> Thanks
>
> On 10/02/2016 21:28, Chandeep Singh wrote:
>
> Hi Mich,
>
> If you would like to print everything to the console you could -
> errlog.filter(line => line.contains("sed"))collect()foreach(println)
>
> or you could always save to a file using any of the saveAs methods.
>
> Thanks,
> Chandeep
>
> On Wed, Feb 10, 2016 at 8:14 PM, <mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
>> Hi,
>>
>> I have a bunch of files stored in hdfs /unit_files directory in total 319
>> files
>> scala> val errlog = sc.textFile("/unix_files/*.ksh")
>>
>> scala> errlog.filter(line => line.contains("sed"))count()
>> res104: Long = 1113
>> So it returns 1113 instances the word "sed"
>>
>> If I want to see the collection I can do
>>
>> *scala> errlog.filter(line => line.contains("sed"))collect()*
>>
>> res105: Array[String] = Array(" DSQUERY=${1} ;
>> DBNAME=${2} ; ERROR=0 ; PROGNAME=$(basename $0 | sed -e s/.ksh//)", # .
>> in environment based on argument for script., " exec sp_spaceused", "
>> exec sp_spaceused", PROGNAME=$(basename $0 | sed -e s/.ksh//), "
>> BACKUPSERVER=$5 # Server that is used to load the transaction dump",
>> " BACKUPSERVER=$5 # Server that is used to load the
>> transaction dump", " BACKUPSERVER=$5 # Server that is used to
>> load the transaction dump", " cat $TMPDIR/${DBNAME}_trandump.sql | sed
>> s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_trandump.tmpsql", cat
>> $TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/ >
>> $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e
>> s/.ksh//), " B...
>> scala>
>>
>> Now is there anyway I can retrieve all these instances or perhaps they are
>> all wrapped up and I only see few lines?
>>
>> Thanks,
>>
>> Mich
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
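
For reference, the chain discussed in this thread can be split into named steps with explicit types. This is only a sketch meant to be pasted into spark-shell, where sc is already defined. The sample lines are made up for illustration and stand in for the real .ksh files, and the value names (sample, n, matches) are my own. Note that contains("sed") matches any substring, so lines such as "exec sp_spaceused" match as well, which is consistent with the output quoted above. As far as I know, the trailing "..." in the shell output is only the REPL shortening what it displays; the array returned by collect() still holds every matching line.

```scala
// Paste into spark-shell, where `sc` (the SparkContext) is already defined.
import org.apache.spark.rdd.RDD

// Hypothetical sample lines standing in for the contents of the .ksh files.
val sample: Seq[String] = Seq(
  "PROGNAME=$(basename $0 | sed -e s/.ksh//)",
  "exec sp_spaceused",                  // matches too: "spaceused" contains "sed"
  "cat in.sql | sed s/a/b/ > out.sql",
  "echo done"                           // does not match
)

// 1. errlog is an RDD[String], one element per line
//    (sc.textFile returns the same type).
val errlog: RDD[String] = sc.parallelize(sample)

// 2. filter is a transformation: it returns a new RDD and is evaluated lazily.
val sed: RDD[String] = errlog.filter(line => line.contains("sed"))

// 3. count and collect are actions: they run the job and bring the
//    result back to the driver.
val n: Long = sed.count()
val matches: Array[String] = sed.collect()

// 4. foreach here is the ordinary Scala method on Array; println is called
//    once per element, so each matching line is printed in full.
matches.foreach(println)
```

The dotted form sed.collect().foreach(println) is the same call chain as the version written without dots earlier in the thread. Since collect() copies every matching line to the driver, sed.take(20) may be a safer way to look at a few lines when the result could be large.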