Re: Breaking down text String into Array elements

Mich Talebzadeh Tue, 23 Aug 2016 14:22:27 -0700

Thanks Nick, Sean and everyone. That did it.

BTW, I registered the UDF for later use in a program.


Anyway, this is the much simplified code:

import scala.util.Random
//
// UDF to create a random string of length characters
//
def randomString(chars: String, length: Int): String =
   (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString
spark.udf.register("randomString", randomString(_:String, _:Int))
case class columns (col1: Int, col2: String)
//val chars = ('a' to 'z') ++ ('A' to 'Z') ++ ('0' to '9') ++ ("-!£$")
val chars = ('a' to 'z') ++ ('A' to 'Z')
val text = (1 to 10).map(i =>
  (i.toString, randomString(chars.mkString(""), 10))).toArray
val df = sc.parallelize(text).map(p =>
  columns(p._1.toInt, p._2)).toDF()
df.show
sys.exit
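
A possible further simplification, as a sketch only: if the key is kept as an
Int from the start, the toString/toInt round trip is not needed. The name
rows is hypothetical and not part of the code above. The snippet is plain
Scala with no Spark calls, so it can be pasted into the Scala REPL as well as
spark-shell.

```scala
import scala.util.Random

// Same helper as above: pick `length` random characters from `chars`.
def randomString(chars: String, length: Int): String =
  (0 until length).map(_ => chars(Random.nextInt(chars.length))).mkString

// Build the alphabet once as a String instead of calling mkString per row.
val chars: String = (('a' to 'z') ++ ('A' to 'Z')).mkString

// `rows` is a hypothetical name: ten (Int, String) pairs, key already an Int.
val rows: Array[(Int, String)] =
  (1 to 10).map(i => (i, randomString(chars, 10))).toArray

rows.foreach(println)
```

In spark-shell, sc.parallelize(rows).map(p => columns(p._1, p._2)).toDF()
should then produce a DataFrame of the same shape as the one below.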


And this is the result

Loading dynamic_ARRAY_generator.scala...
import scala.util.Random
randomString: (chars: String, length: Int)String
res0: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function2>,StringType,Some(List(StringType,
IntegerType)))
defined class columns
chars: scala.collection.immutable.IndexedSeq[Char] = Vector(a, b, c, d, e,
f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, A, B, C, D,
E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z)
text: Array[(String, String)] = Array((1,KUQyYfnnlu), (2,uYRHdRvSOc),
(3,BmrUBiMOgY), (4,LbvcqCUcQt), (5,GJlJmWFHwc), (6,zLuhPtoHJH),
(7,oCQaoCkFHG), (8,wUghlvXvQF), (9,zCHhwMsvaw), (10,pQCYUJuFyt))
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string]
+----+----------+
|col1|      col2|
+----+----------+
|   1|KUQyYfnnlu|
|   2|uYRHdRvSOc|
|   3|BmrUBiMOgY|
|   4|LbvcqCUcQt|
|   5|GJlJmWFHwc|
|   6|zLuhPtoHJH|
|   7|oCQaoCkFHG|
|   8|wUghlvXvQF|
|   9|zCHhwMsvaw|
|  10|pQCYUJuFyt|
+----+----------+


Cheers





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 23 August 2016 at 21:20, RK Aduri <rkad...@collectivei.com> wrote:

> That’s because of this:
>
> scala> val text = Array((1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),
>   (4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),
>   (8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr"))
> text: Array[(Int, String)] = Array((1,hNjLJEgjxn), (2,lgryHkVlCN),
> (3,ukswqcanVC), (4,ZFULVxzAsv), (5,LNzOozHZPF), (6,KZPYXTqMkY),
> (7,DVjpOvVJTw), (8,LKRYrrLrLh), (9,acheneIPDM), (10,iGZTrKfXNr))
>
> scala> Array(text).getClass()
> res1: Class[_ <: Array[Array[(Int, String)]]] = class [[Lscala.Tuple2;
>
> scala> Array(text).length
> res2: Int = 1
>
> As you can see, Array(text) is a single-element array whose only element
> is the whole of text, so parallelize sees just one row.
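
The same point can be checked without Spark. This is a minimal sketch in plain
Scala, using a shortened three-element stand-in for the array above; the name
wrapped is hypothetical.

```scala
// Shortened stand-in for the ten-element array in the thread.
val text: Array[(Int, String)] =
  Array((1, "hNjLJEgjxn"), (2, "lgryHkVlCN"), (3, "ukswqcanVC"))

// Array(text) builds a new array whose single element is `text` itself.
val wrapped: Array[Array[(Int, String)]] = Array(text)

println(wrapped.length) // number of elements in the outer array
println(text.length)    // number of tuples in the inner array
```

Passing text itself to sc.parallelize, rather than Array(text), is what gives
one row per tuple.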
>
>
> On Aug 23, 2016, at 12:26 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>
> How about something like
>>
>> scala> val text = (1 to 10).map(i => (i.toString,
>> random_string(chars.mkString(""), 10))).toArray
>>
>> text: Array[(String, String)] = Array((1,FBECDoOoAC), (2,wvAyZsMZnt),
>> (3,KgnwObOFEG), (4,tAZPRodrgP), (5,uSgrqyZGuc), (6,ztrTmbkOhO),
>> (7,qUbQsKtZWq), (8,JDokbiFzWy), (9,vNHgiHSuUM), (10,CmnFjlHnHx))
>>
>> scala> sc.parallelize(text).count
>> res0: Long = 10
>>
>> By the way, I'm not sure exactly why you need the UDF registration here.
>>
>>
>> On Tue, 23 Aug 2016 at 20:12 Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Hi gents,
>>>
>>> Well, I was trying to see whether I could create an array of elements,
>>> then go from RDD to DF, register it as a temp table and store it as a
>>> Hive table.
>>>
>>> import scala.util.Random
>>> //
>>> // UDF to create a random string of charlength characters
>>> //
>>> def random_string(chars: String, charlength: Int) : String = {
>>>   val newKey = (1 to charlength).map(
>>>     x =>
>>>     {
>>>       val index = Random.nextInt(chars.length)
>>>       chars(index)
>>>     }
>>>    ).mkString("")
>>>    return newKey
>>> }
>>> spark.udf.register("random_string", random_string(_:String, _:Int))
>>> case class columns (col1: Int, col2: String)
>>> val chars = ('a' to 'z') ++ ('A' to 'Z')
>>> var text = ""
>>> val comma = ","
>>> val terminator = "))"
>>> var random_char = ""
>>> for (i <- 1 to 10) {
>>>   random_char = random_string(chars.mkString(""), 10)
>>>   if (i < 10) {
>>>     text = text + """(""" + i.toString + ""","""" + random_char + """")""" + comma
>>>   } else {
>>>     text = text + """(""" + i.toString + ""","""" + random_char + """")"""
>>>   }
>>> }
>>> println(text)
>>> val df = sc.parallelize((Array(text)))
>>>
>>>
>>> Unfortunately, Spark sees that as one text string rather than as a list
>>> of array elements.
>>>
>>> I can write it easily as a shell script with ${text} passed to Array and
>>> it will work. I was wondering if I could do this in Spark/Scala with my
>>> limited knowledge.
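
If the data really does arrive as one string in that format, one way to break
it into array elements is a regular expression. This is only a sketch in plain
Scala, not what the thread ended up using: the names pair and parsed are
hypothetical, the input is shortened to three entries, and the pattern assumes
every entry looks exactly like (digits,"letters").

```scala
// Shortened sample in the same format as println(text) in the thread.
val text = """(1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC")"""

// One match per (digits,"letters") group; the capture groups are the two fields.
val pair = """\((\d+),"([A-Za-z]+)"\)""".r

// Turn every match into an (Int, String) tuple.
val parsed: Array[(Int, String)] =
  pair.findAllMatchIn(text).map(m => (m.group(1).toInt, m.group(2))).toArray

parsed.foreach(println)
```

In spark-shell, sc.parallelize(parsed) should then see one element per tuple.
Building the tuples directly, as suggested elsewhere in this thread, avoids the
string round trip altogether.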
>>>
>>> Cheers
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>>
>>> On 23 August 2016 at 19:00, Nick Pentreath <nick.pentre...@gmail.com>
>>> wrote:
>>>
>>>> What is "text"? I.e. what is the "val text = ..." definition?
>>>>
>>>> If text is a String itself then indeed sc.parallelize(Array(text)) is
>>>> doing the correct thing in this case.
>>>>
>>>>
>>>> On Tue, 23 Aug 2016 at 19:42 Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> I am sure someone knows this :)
>>>>>
>>>>> I created a dynamic text string which has the format
>>>>>
>>>>> scala> println(text)
>>>>> (1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),
>>>>> (4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),
>>>>> (7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr")
>>>>>
>>>>> Now if I do
>>>>>
>>>>> scala> val df = sc.parallelize((Array((1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),
>>>>>   (3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),
>>>>>   (7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr"))))
>>>>> df: org.apache.spark.rdd.RDD[(Int, String)] =
>>>>> ParallelCollectionRDD[230] at parallelize at <console>:39
>>>>> scala> df.count
>>>>> res157: Long = 10
>>>>> It shows ten Array elements, which is correct.
>>>>>
>>>>> Now if I pass that text into Array it only sees one row
>>>>>
>>>>> scala> val df = sc.parallelize((Array(text)))
>>>>> df: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[228] at
>>>>> parallelize at <console>:41
>>>>> scala> df.count
>>>>> res158: Long = 1
>>>>>
>>>>> Basically it sees the whole string as one element of the array
>>>>>
>>>>> scala> df.first
>>>>> res165: String = (1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),
>>>>> (4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),
>>>>> (7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr")
>>>>> Which is not what I want.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> This works fine
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>
