Hi Mich,
1. The first underscore in your filter call refers to a line in the file
(since textFile() returns a collection of strings).
2. You're correct. No need for it.
3. filter() expects a Boolean result, so you can merge your contains filters
into one using a logical AND (&&).
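4. Not quite. split() takes a regular expression, so "\t," matches only a tab
immediately followed by a comma (note that "\t" is a tab, not a carriage
return). To split on either character, use a character class such as "[\t,]".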
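
To illustrate points 3 and 4, here is a minimal sketch using plain Scala strings rather than an RDD. The sample line is made up for illustration and is not taken from the actual guide; the names sampleLine, bySequence, byEither and keep are hypothetical.

```scala
// Hypothetical sample line (invented for this sketch).
val sampleLine = "ASE 15\tUPDATE INDEX STATISTICS,done"

// "\t," is a regex for a tab immediately followed by a comma.
// That pair does not occur in sampleLine, so nothing is split.
val bySequence: Array[String] = sampleLine.split("\t,")

// "[\t,]" is a character class: split on a tab OR a comma.
val byEither: Array[String] = sampleLine.split("[\t,]")

// A single predicate combining both contains checks with &&.
val keep: String => Boolean =
  l => l.contains("ASE 15") && l.contains("UPDATE INDEX STATISTICS")

println(bySequence.length)
println(byEither.mkString("|"))
println(keep(sampleLine))
```

The same predicate and regex could be passed to filter and flatMap on the RDD returned by textFile().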
Eliran Bivas
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Apr 3, 2016 15:06
To: Eliran Bivas
Cc: user @spark
Subject: Re: multiple splits fails
Hi Eliran,
Many thanks for your input on this.
I thought about what I was trying to achieve so I rewrote the logic as follows:
1. Read the text file in
2. Filter out empty lines (well not really needed here)
3. Search for lines that contain "ASE 15" and also contain the phrase "UPDATE
INDEX STATISTICS"
4. Split the text by "\t" and ","
5. Print the outcome
This is what I did, with your suggestions included:
val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
f.cache()
f.filter(_.length > 0)
  .filter(_.contains("ASE 15"))
  .filter(_.contains("UPDATE INDEX STATISTICS"))
  .flatMap(line => line.split("\t,"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect
  .foreach(println)
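
For checking the logic without a cluster, the same steps can be sketched on a local Seq. This is only an approximation: groupBy followed by a size count stands in for reduceByKey, the sample lines are invented, and the split uses the character class "[\t,]" on the assumption that the intent is to split on either a tab or a comma.

```scala
// Invented sample input; the real file content is not shown in this thread.
val sample: Seq[String] = Seq(
  "",
  "ASE 15\tUPDATE INDEX STATISTICS,ASE 15",
  "unrelated line"
)

val counts: Map[String, Int] = sample
  .filter(_.contains("ASE 15"))                  // keep lines mentioning ASE 15
  .filter(_.contains("UPDATE INDEX STATISTICS")) // and the phrase of interest
  .flatMap(_.split("[\t,]"))                     // split on tab or comma
  .groupBy(identity)                             // local stand-in for reduceByKey
  .map { case (word, hits) => (word, hits.size) }

counts.foreach(println)
```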
A couple of questions, if I may:
1. I take it that "_" refers to the content of the file read in by default?
2. _.length > 0 basically filters out blank lines (not really needed here)?
3. Multiple filters are needed, one for each *contains* check?
4. split("\t,") splits the line by carriage return AND ","?
Regards
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 3 April 2016 at 12:35, Eliran Bivas
<elir...@iguaz.io> wrote:
Hi Mich,
A few comments:
When doing .filter(_ > "") you're actually doing a lexicographic comparison. It
happens to drop empty strings, but the intent is clearer with _.nonEmpty or
_.length > 0.
I think that filtering with _.contains should be sufficient and the first
filter can be omitted.
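
A small sketch of the comparison above, on a local Seq with invented sample lines (the variable names are hypothetical). It suggests that the three forms agree on plain empty strings, that a whitespace-only line survives unless it is trimmed, and that a contains filter alone already excludes empty lines.

```scala
// Invented sample lines, including an empty and a whitespace-only one.
val lines: Seq[String] = Seq("", "ASE 15 upgrade", " ", "other")

// Lexicographic comparison: any non-empty string sorts after "".
val viaCompare = lines.filter(_ > "")

// Explicit emptiness check: same result, clearer intent.
val viaNonEmpty = lines.filter(_.nonEmpty)

// Trimming first also drops whitespace-only lines.
val viaTrim = lines.filter(_.trim.nonEmpty)

// An empty line cannot contain "ASE 15", so no extra filter is needed.
val viaContains = lines.filter(_.contains("ASE 15"))

println(viaCompare == viaNonEmpty)
println(viaTrim)
println(viaContains)
```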
As for line => line.split("\t").split(","):
You have to either split each element in a second step or (since split() takes
a regex as input) use a character class: .split("[\t,]").
The problem is that your first split() call returns an Array[String], and Array
has no split() method, so your second call fails to compile.
e.g.
val lines: Array[String] = line.split("\t")
lines.split(",") // Compilation error - no method split() exists for Array
So either go with map(_.split("\t")).map(_.flatMap(_.split(","))) or
map(_.split("[\t,]")).
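
The options can be compared on a single string, outside Spark. This is a sketch with a made-up input value; nested, flat and oneStep are hypothetical names.

```scala
// Made-up input: one tab and one comma.
val line = "a\tb,c"

// Splitting each tab-separated piece again yields a nested array.
val nested: Array[Array[String]] = line.split("\t").map(_.split(","))

// flatMap flattens the pieces into a single array.
val flat: Array[String] = line.split("\t").flatMap(_.split(","))

// A character class does both splits in one call.
val oneStep: Array[String] = line.split("[\t,]")

println(nested.map(_.mkString("[", ",", "]")).mkString(" "))
println(flat.mkString(","))
println(oneStep.mkString(","))
```

Which shape is preferable depends on whether the tab-level grouping still matters downstream; for a word count, the flat form is usually the convenient one.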
Hope that helps.
Eliran Bivas
Data Team | iguaz.io
On 3 Apr 2016, at 13:31, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
Hi,
I am not sure this is the correct approach
Read a text file in
val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
Now I want to get rid of empty lines and filter only the lines that contain
"ASE15"
f.filter(_ > "").filter(_ contains("ASE15")).
The above works, but I am not sure whether I need the two filter transformations
above. Can it be done in one?
Now I want to map the above filter to lines with carriage return and split them
by ","
f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t")))
res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at map at <console>:30
Now I want to split the output by ","
scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t").split(",")))
<console>:30: error: value split is not a member of Array[String]
f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t").split(",")))
^
Any advice will be appreciated
Thanks
Dr Mich Talebzadeh