Re: multiple splits fails

Mich Talebzadeh Sun, 03 Apr 2016 05:07:12 -0700

Hi Eliran,

Many thanks for your input on this.


I thought about what I was trying to achieve so I rewrote the logic as
follows:


   1. Read the text file in
   2. Filter out empty lines (well not really needed here)
   3. Keep only the lines that contain "ASE 15" and also contain the phrase
   "UPDATE INDEX STATISTICS"
   4. Split the text by "\t" and ","
   5. Print the outcome


This is what I did, with your suggestions included:

val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
f.cache()
f.filter(_.length > 0).
  filter(_.contains("ASE 15")).
  filter(_.contains("UPDATE INDEX STATISTICS")).
  flatMap(line => line.split("\t,")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  collect.foreach(println)
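
For comparison, here is a minimal local sketch of the same five steps on an
in-memory Seq, so it runs without a SparkContext. The sample lines, the
object name and the helper name are all made up for illustration. Two things
differ from the Spark version and are assumptions about the intent: both
contains checks are combined into one filter, and the split uses the
character class "[\t,]" (tab OR comma). groupBy stands in for reduceByKey,
which plain Scala collections do not have.

```scala
object LocalPipelineSketch {
  // Hypothetical sample lines standing in for /tmp/ASE15UpgradeGuide.txt.
  val sampleLines: Seq[String] = Seq(
    "",
    "ASE 15\tUPDATE INDEX STATISTICS,big_table",
    "ASE 15\tno matching phrase on this line",
    "ASE 12.5\tUPDATE INDEX STATISTICS,old_table"
  )

  // Filter, split and count, mirroring steps 2 to 4 of the list above.
  def tokenCounts(lines: Seq[String]): Map[String, Int] =
    lines
      // One filter with both conditions; an empty line fails both checks,
      // so no separate length test is needed.
      .filter(l => l.contains("ASE 15") && l.contains("UPDATE INDEX STATISTICS"))
      // Character class: split wherever a tab or a comma occurs.
      .flatMap(_.split("[\t,]"))
      // Local stand-in for map(word => (word, 1)).reduceByKey(_ + _).
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit =
    tokenCounts(sampleLines).foreach(println)
}
```

On an RDD[String] the same filter and flatMap calls should carry over
unchanged, with the groupBy step replaced by the original
map(word => (word, 1)).reduceByKey(_ + _).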


A couple of questions, if I may:


   1. I take it that "_" refers to the content of the file read in by default?
   2. _.length > 0 basically filters out blank lines (not really needed
   here)
   3. Multiple filters are needed, one for each *contains* condition?
   4. split("\t,") splits the filtered lines by tab AND by ","?
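
A small plain-Scala sketch (no Spark needed) may help with questions 1 and 4.
The input strings and all names are invented for illustration. It relies on
two facts: String.split takes a regular expression, and "_" is placeholder
syntax for the single argument of an anonymous function, so in
f.filter(_.length > 0) it stands for each line in turn rather than the file
as a whole.

```scala
object SplitAndPlaceholderDemo {
  // Hypothetical input: one tab and one comma, never adjacent.
  val line = "a\tb,c"

  // "\t," is a regex for the two-character sequence tab-then-comma.
  // That sequence does not occur here, so the line comes back unsplit.
  val bySequence: Array[String] = line.split("\t,")

  // "[\t,]" is a character class: split on a tab OR a comma.
  val byEither: Array[String] = line.split("[\t,]")

  // "_" is shorthand for the function's single argument, so these two
  // filters are the same function written two ways.
  val lines = Seq("ASE 15", "", "other")
  val withPlaceholder = lines.filter(_.length > 0)
  val withNamedParam = lines.filter(l => l.length > 0)

  def main(args: Array[String]): Unit = {
    println(bySequence.length)                 // 1
    println(byEither.mkString("|"))            // a|b|c
    println(withPlaceholder == withNamedParam) // true
  }
}
```

Because each "_" in an expression stands for a separate parameter, a
predicate that tests the same line twice needs a named parameter, for
example filter(l => l.contains("ASE 15") && l.contains("UPDATE INDEX STATISTICS")).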


Regards


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 3 April 2016 at 12:35, Eliran Bivas <elir...@iguaz.io> wrote:

> Hi Mich,
>
> A few comments:
>
> When doing .filter(_ > "") you’re actually doing a lexicographic
> comparison and not filtering for empty lines (which could be achieved with
> _.nonEmpty or _.length > 0).
> I think that filtering with _.contains should be sufficient and the first
> filter can be omitted.
>
> As for line => line.split("\t").split(","):
> You have to do a second map or (since split() requires a regex as input)
> .split("\t,").
> The problem is that your first split() call will generate an Array and
> then your second call will result in an error.
> e.g.
>
> val lines: Array[String] = line.split("\t")
> lines.split(",") // Compilation error - no method split() exists for Array
>
> So either go with map(_.split("\t")).map(_.split(",")) or
> map(_.split("\t,"))
>
> Hope that helps.
>
> *Eliran Bivas*
> Data Team | iguaz.io
>
>
> On 3 Apr 2016, at 13:31, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Hi,
>
> I am not sure this is the correct approach
>
> Read a text file in
>
> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
>
>
> Now I want to get rid of empty lines and filter only the lines that
> contain "ASE15"
>
>  f.filter(_ > "").filter(_ contains("ASE15")).
>
> The above works but I am not sure whether I need the two filter
> transformations above. Can it be done in one?
>
> Now I want to split the filtered lines on tabs, and then split them
> by ","
>
> f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t")))
> res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at map at <console>:30
>
> Now I want to split the output by ","
>
> scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t").split(",")))
> <console>:30: error: value split is not a member of Array[String]
>               f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t").split(",")))
>                                                                                          ^
> Any advice will be appreciated
>
> Thanks
>
> Dr Mich Talebzadeh
>
