>
> Yeah, I tried with reduceByKey and was able to do it. Thanks Ayan and Jatin.
> For reference, here is the final solution:
>
> def main(args: Array[String]): Unit = {
>     val conf = new SparkConf().setAppName("HBaseStream")
>     val sc = new SparkContext(conf)
>     // create a StreamingContext, the main entry point for all streaming functionality
>     val ssc = new StreamingContext(sc, Seconds(2))
>     val inputStream = ssc.socketTextStream("hostname", 9999)
>     val parsedDstream = inputStream
>       .map(line => {
>         val splitLines = line.split(",")
>         (splitLines(1), splitLines.slice(2, splitLines.length).map(_.trim.toInt))
>       })
>       .reduceByKey((first, second) => {
>         val listOfArrays = ArrayBuffer(first, second)
>         listOfArrays.toList.transpose.map(_.sum).toArray
>       })
>
>     parsedDstream.foreachRDD(rdd => rdd.foreach(Blaher.blah))
> }
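>
> (For completeness, a minimal sketch of the imports and the start/await calls
> the snippet above relies on but does not show:
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>     import scala.collection.mutable.ArrayBuffer
>
>     // at the end of main(), to actually start the streaming computation:
>     ssc.start()
>     ssc.awaitTermination()
>
> Blaher.blah is presumably the poster's own sink, likely the HBase writer
> given the app name, and is not shown here.)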
>
>
> Regards,
> Vinti
>
> On Sun, Feb 21, 2016 at 2:22 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> I believe the best way would be to use the reduceByKey operation.
>>
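>> (As a rough illustration of that suggestion, assuming a hypothetical
>> parsedStream of type DStream[(String, Array[Int])] keyed by file name,
>> with arrays of equal length within each key:
>>
>>     val summed = parsedStream.reduceByKey((a, b) =>
>>       a.zip(b).map { case (x, y) => x + y })
>>
>> The zip-and-sum is just another way of writing the transpose-and-sum used
>> in the final solution at the top of the thread.)
>>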
>> On Mon, Feb 22, 2016 at 5:10 AM, Jatin Kumar <
>> jku...@rocketfuelinc.com.invalid> wrote:
>>
>>> You will need to do a collect and update a global map (on the driver) if you want to keep one.
>>>
>>> myDStream.map(record => (record._2, (record._3, record._4, record._5)))
>>>          .groupByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2, r1._3 + r2._3))
>>>          .foreachRDD(rdd => {
>>>            rdd.collect().foreach { case (fileName, valueTuple) => <update global map here> }
>>>          })
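>>>
>>> (One hypothetical way to fill in the <update global map here> placeholder,
>>> assuming the aggregated value is the (Int, Int, Int) tuple produced above:
>>>
>>>     // on the driver, before setting up the DStream:
>>>     val globalCounts = scala.collection.mutable.Map[String, (Int, Int, Int)]()
>>>
>>>     // inside foreachRDD, in place of the placeholder:
>>>     rdd.collect().foreach { case (fileName, valueTuple) =>
>>>       globalCounts(fileName) = valueTuple
>>>     }
>>>
>>> Since collect() pulls the whole RDD back to the driver, this only makes
>>> sense when the per-batch aggregates are small.)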
>>>
>>> --
>>> Thanks
>>> Jatin Kumar | Rocket Scientist
>>> +91-7696741743 m
>>>
>>> On Sun, Feb 21, 2016 at 11:30 PM, Vinti Maheshwari <vinti.u...@gmail.com
>>> > wrote:
>>>
>>>> Never mind, it seems like an executor-level mutable map is not
>>>> recommended, as discussed in
>>>> http://blog.guillaume-pitel.fr/2015/06/spark-trick-shot-node-centered-aggregator/
>>>>
>>>> On Sun, Feb 21, 2016 at 9:54 AM, Vinti Maheshwari <vinti.u...@gmail.com
>>>> > wrote:
>>>>
>>>>> Thanks for your reply, Jatin. I changed my parsing logic to what you
>>>>> suggested:
>>>>>
>>>>>     def parseCoverageLine(str: String) = {
>>>>>       val arr = str.split(",")
>>>>>       ...
>>>>>       ...
>>>>>       (arr(0), arr(1) :: count.toList)  // (test, [file, 1, 1, 2])
>>>>>     }
>>>>>
>>>>> Then, in the grouping, can I use a global hash map per executor/partition
>>>>> to aggregate the results?
>>>>>
>>>>> val globalMap: Map[String, List[Int]] = Map()
>>>>> val coverageDStream = inputStream.map(parseCoverageLine)
>>>>> coverageDStream.reduceByKey((fileAndCountsA, fileAndCountsB) => {
>>>>>   // if exists in global map, append result else add new key
>>>>>
>>>>>   // globalMap
>>>>>   // { "file1": [1+1+5, 1+2+5, 1+3+5], "file2": [ 2, 2, 2, 2 ] }
>>>>> })
>>>>>
>>>>> Thanks,
>>>>> Vinti
>>>>>
>>>>> On Sun, Feb 21, 2016 at 9:31 AM, Jatin Kumar <jku...@rocketfuelinc.com
>>>>> > wrote:
>>>>>
>>>>>> Hello Vinti,
>>>>>>
>>>>>> One way to get this done is to split each input line into a key and a
>>>>>> value tuple; then you can simply use groupByKey and handle the values
>>>>>> the way you want. For example:
>>>>>>
>>>>>> Assuming you have already split the values into a 5-tuple:
>>>>>> myDStream.map(record => (record._2, (record._3, record._4, record._5)))
>>>>>>          .groupByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2, r1._3 + r2._3))
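>>>>>>
>>>>>> (Strictly speaking, groupByKey in the Scala API takes no combining
>>>>>> function; a (r1, r2) => ... function like the one above is what
>>>>>> reduceByKey expects, e.g.
>>>>>>
>>>>>>     keyed.reduceByKey((r1, r2) => (r1._1 + r2._1, r1._2 + r2._2, r1._3 + r2._3))
>>>>>>
>>>>>> where keyed is a stand-in name for the mapped DStream above. That is
>>>>>> the reduceByKey route the final solution at the top of the thread takes.)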
>>>>>>
>>>>>> I hope that helps.
>>>>>>
>>>>>> --
>>>>>> Thanks
>>>>>> Jatin Kumar | Rocket Scientist
>>>>>> +91-7696741743 m
>>>>>>
>>>>>> On Sun, Feb 21, 2016 at 10:35 PM, Vinti Maheshwari <
>>>>>> vinti.u...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have input lines like the ones below.
>>>>>>>
>>>>>>> *Input*
>>>>>>> t1, file1, 1, 1, 1
>>>>>>> t1, file1, 1, 2, 3
>>>>>>> t1, file2, 2, 2, 2, 2
>>>>>>> t2, file1, 5, 5, 5
>>>>>>> t2, file2, 1, 1, 2, 2
>>>>>>>
>>>>>>> and I want to produce output like the rows below, which is a vertical
>>>>>>> (column-wise) addition of the corresponding numbers.
>>>>>>>
>>>>>>> *Output*
>>>>>>> “file1” : [ 1+1+5, 1+2+5, 1+3+5 ]
>>>>>>> “file2” : [ 2+1, 2+1, 2+2, 2+2 ]
>>>>>>>
>>>>>>> I am in a Spark Streaming context and I am having a hard time figuring
>>>>>>> out how to group by file name.
>>>>>>>
>>>>>>> It seems like I will need to use something like the snippet below, but
>>>>>>> I am not sure how to get the syntax right. Any input would be helpful.
>>>>>>>
>>>>>>> myDStream.foreachRDD(rdd => rdd.groupBy())
>>>>>>>
>>>>>>> I know how to do the vertical sum of a given array of numbers, but I
>>>>>>> am not sure how to feed that function into the group by.
>>>>>>>
>>>>>>>   def compute_counters(counts : ArrayBuffer[List[Int]]) = {
>>>>>>>       counts.toList.transpose.map(_.sum)
>>>>>>>   }
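>>>>>>>
>>>>>>> (For example, feeding it the three file1 rows from the input above:
>>>>>>>
>>>>>>>     compute_counters(ArrayBuffer(List(1, 1, 1), List(1, 2, 3), List(5, 5, 5)))
>>>>>>>     // => List(7, 8, 9), i.e. [1+1+5, 1+2+5, 1+3+5]
>>>>>>>
>>>>>>> so the missing piece is only how to get those per-file buffers out of
>>>>>>> the grouping.)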
>>>>>>>
>>>>>>> ~Thanks,
>>>>>>> Vinti
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
