Re: multiple splits fails

Ted Yu Sun, 03 Apr 2016 07:32:29 -0700

bq. split"\t," splits the filter by carriage return

Minor correction: "\t" denotes tab character.


On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas <elir...@iguaz.io> wrote:

> Hi Mich,
>
> 1. The first underscore in your filter call is refering to a line in the
> file (as textFile() results in a collection of strings)
> 2. You're correct. No need for it.
> 3. Filter is expecting a Boolean result. So you can merge your contains
> filters to one with AND (&&) statement.
> 4. Correct. Each character in split() is used as a divider.
>
> Eliran Bivas
>
> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
> *Sent:* Apr 3, 2016 15:06
> *To:* Eliran Bivas
> *Cc:* user @spark
> *Subject:* Re: multiple splits fails
>
> Hi Eliran,
>
> Many thanks for your input on this.
>
> I thought about what I was trying to achieve so I rewrote the logic as
> follows:
>
>
>    1. Read the text file in
>    2. Filter out empty lines (well not really needed here)
>    3. Search for lines that contain "ASE 15" and further have sentence
>    "UPDATE INDEX STATISTICS" in the said line
>    4. Split the text by "\t" and ","
>    5. Print the outcome
>
>
> This was what I did with your suggestions included
>
> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
> f.cache()
>  f.filter(_.length > 0).filter(_ contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\t,")).map(word => (word, 1)).reduceByKey(_ +
> _).collect.foreach(println)
>
>
> Couple of questions if I may
>
>
>    1. I take that "_" refers to content of the file read in by default?
>    2. _.length > 0 basically filters out blank lines (not really needed
>    here)
>    3. Multiple filters are needed for each *contains* logic
>    4. split"\t," splits the filter by carriage return AND ,?
>
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 12:35, Eliran Bivas <elir...@iguaz.io> wrote:
>
>> Hi Mich,
>>
>> Few comments:
>>
>> When doing .filter(_ > “”) you’re actually doing a lexicographic
>> comparison and not filtering for empty lines (which could be achieved with
>> _.notEmpty or _.length > 0).
>> I think that filtering with _.contains should be sufficient and the
>> first filter can be omitted.
>>
>> As for line => line.split(“\t”).split(“,”):
>> You have to do a second map or (since split() requires a regex as input)
>> .split(“\t,”).
>> The problem is that your first split() call will generate an Array and
>> then your second call will result in an error.
>> e.g.
>>
>> val lines: Array[String] = line.split(“\t”)
>> lines.split(“,”) // Compilation error - no method split() exists for Array
>>
>> So either go with map(_.split(“\t”)).map(_.split(“,”)) or
>> map(_.split(“\t,”))
>>
>> Hope that helps.
>>
>> *Eliran Bivas*
>> Data Team | iguaz.io
>>
>>
>> On 3 Apr 2016, at 13:31, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I am not sure this is the correct approach
>>
>> Read a text file in
>>
>> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
>>
>>
>> Now I want to get rid of empty lines and filter only the lines that
>> contain "ASE15"
>>
>>  f.filter(_ > "").filter(_ contains("ASE15")).
>>
>> The above works but I am not sure whether I need two filter
>> transformation above? Can it be done in one?
>>
>> Now I want to map the above filter to lines with carriage return ans
>> split them by ","
>>
>> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t")))
>> res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at
>> map at <console>:30
>>
>> Now I want to split the output by ","
>>
>> scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t").split(",")))
>> <console>:30: error: value split is not a member of Array[String]
>>               f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t").split(",")))
>>
>> ^
>> Any advice will be appreciated
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>

Re: multiple splits fails

Reply via email to