Mich:
If you run the statements one at a time in the Spark shell, you can see the
type of each intermediate result.

scala> val errlog = sc.textFile("/home/john/s.out")
errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out MapPartitionsRDD[1] at textFile at <console>:24

scala> val sed = errlog.filter(line => line.contains("sed"))
sed: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:26

scala> sed.collect()
res0: Array[String] = Array([WARNING] Unrecognised ...
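
To put names on the four pieces you asked about, here is a rough sketch as a
standalone program. The object name SedLines, the local[*] master and the
three sample lines are my own assumptions for illustration (in the Spark
shell, sc already exists and you would keep using sc.textFile).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone version of the shell session above.
object SedLines {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode; spark-shell creates `sc` for you.
    val conf = new SparkConf().setAppName("SedLines").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // 1. errlog is an RDD[String], one element per line.
      //    (Small in-memory stand-in for sc.textFile(...).)
      val errlog = sc.parallelize(Seq(
        "PROGNAME=$(basename $0 | sed -e s/.ksh//)",
        "exec sp_spaceused",
        "cat input.sql | sed s/a/b/"))

      // 2. filter is a transformation: lazy, it returns a new RDD[String].
      val sed = errlog.filter(line => line.contains("sed"))

      // 3. collect() is an action: it runs the job and brings the
      //    results back to the driver as an Array[String].
      val matches: Array[String] = sed.collect()

      // 4. foreach(println) is the ordinary Scala method on that Array;
      //    it prints every element, one per line.
      matches.foreach(println)

      // Two of the three sample lines contain "sed".
      assert(matches.length == 2)
    } finally {
      sc.stop()
    }
  }
}
```

As far as I know, the shell shortens long results when it echoes them (hence
the "..." at the end of res0), whereas foreach(println) writes each element
out in full. Keep in mind that collect() pulls everything into the driver, so
on a large result take(n) may be the safer way to look at a sample.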

Cheers

On Wed, Feb 10, 2016 at 2:35 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

>
>
> Hi Chandeep
>
>
>
> Many thanks for your help
>
>
>
> In the line below
>
>
>
> errlog.filter(line => line.contains("sed")).collect().foreach(println)
>
>
>
> Could you please clarify the correct names for the components? I am new
> to Scala.
>
>    1. errlog is the RDD?
>    2. filter(line => line.contains("sed")) is a method?
>    3. collect() is another method?
>    4. foreach(println)?
>
>
>
> Thanks
>
>
>
> On 10/02/2016 21:28, Chandeep Singh wrote:
>
> Hi Mich,
>
> If you would like to print everything to the console, you could use:
>
> errlog.filter(line => line.contains("sed")).collect().foreach(println)
>
> or you could always save to a file using any of the saveAs methods.
>
> Thanks,
> Chandeep
>
> On Wed, Feb 10, 2016 at 8:14 PM, <
> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
>>
>>
>> Hi,
>>
>> I have 319 files stored in the HDFS directory /unix_files.
>>
>> scala> val errlog = sc.textFile("/unix_files/*.ksh")
>>
>> scala>  errlog.filter(line => line.contains("sed"))count()
>> res104: Long = 1113
>> So 1113 lines contain the string "sed".
>>
>> If I want to see the collection I can do
>>
>>
>> scala>  errlog.filter(line => line.contains("sed"))collect()
>>
>> res105: Array[String] = Array("                         DSQUERY=${1} ; 
>> DBNAME=${2} ; ERROR=0 ; PROGNAME=$(basename $0 | sed -e s/.ksh//)", #    . 
>> in environment based on argument for script., "       exec sp_spaceused", "  
>>       exec sp_spaceused", PROGNAME=$(basename $0 | sed -e s/.ksh//), "       
>>  BACKUPSERVER=$5        # Server that is used to load the transaction dump", 
>> "        BACKUPSERVER=$5         # Server that is used to load the 
>> transaction dump", "        BACKUPSERVER=$5         # Server that is used to 
>> load the transaction dump", "    cat $TMPDIR/${DBNAME}_trandump.sql | sed 
>> s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_trandump.tmpsql", cat 
>> $TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/ > 
>> $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e 
>> s/.ksh//), "        B...
>> scala>
>>
>>
>> Is there any way I can retrieve all of these instances? Or is the output
>> truncated, so that I only see a few lines?
>>
>> Thanks,
>>
>> Mich
>>
>>
>
>
> --
>
> Dr Mich Talebzadeh
