Re: Migrate Relational to Distributed

2015-05-23 Thread Dmitry Tolpeko
Hi Brant,

Let me partially answer your concerns: please follow a new open source
project PL/HQL (www.plhql.org) aimed at letting you reuse existing logic
and leverage existing skills to some extent, so you do not need to rewrite
everything in Scala/Java and can migrate gradually. I hope it helps.
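
Regarding your Question 2 below, a rough sketch of how existing SQL text can
be submitted to Spark SQL (Spark 1.x). The table, columns and query here are
placeholders, not something from your system; in practice the DataFrame would
be loaded from Cassandra (for example via the spark-cassandra-connector)
rather than built in memory:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("reuse-existing-sql"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Placeholder data standing in for a table that would come from Cassandra.
val claims = sc.parallelize(Seq(
  ("P001", "2015-01-10", 120.50),
  ("P001", "2015-02-03", 75.00),
  ("P002", "2015-01-22", 310.25)
)).toDF("patient_id", "service_date", "amount")

// Register the DataFrame so existing SQL text can refer to it by name.
claims.registerTempTable("claims")

// Existing relational SQL can often be submitted unchanged, as long as it
// stays within the SQL subset Spark SQL supports.
val totals = sqlContext.sql(
  "SELECT patient_id, SUM(amount) AS total_amount FROM claims GROUP BY patient_id")

totals.collect().foreach(println)

Logic that lives in stored procedures rather than plain SQL is where PL/HQL
is meant to help.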

Thanks,

Dmitry

On Sat, May 23, 2015 at 1:22 AM, Brant Seibert brantseib...@hotmail.com
wrote:

 Hi,  The healthcare industry can do wonderful things with Apache Spark.
 But there is already a very large base of data and applications firmly
 rooted in the relational paradigm, and they are resistant to change -
 stuck on Oracle.

 **
 QUESTION 1 - Migrate legacy relational data (plus new transactions) to
 distributed storage?

 DISCUSSION 1 - The primary advantage I see is not having to engage in the
 lengthy (1+ years) process of creating a relational data warehouse and
 cubes.  Just store the data in a distributed system and analyze first in
 memory with Spark.

 **
 QUESTION 2 - Will we have to re-write the enormous amount of logic that is
 already built for the old relational system?

 DISCUSSION 2 - If we move the data to distributed storage, can we simply run
 that existing relational logic as SparkSQL queries?  [existing SQL -> Spark
 Context -> Cassandra -> process in SparkSQL -> display in existing UI].
 Can we create an RDD that uses existing SQL?  Or do we need to rewrite all
 our SQL?

 **
 DATA SIZE - We are adding many new data sources to a system that already
 manages health care data for over a million people.  The number of rows may
 not be enormous right now compared to, say, the advertising industry, but
 the number of dimensions runs well into the thousands.  If we add IoT data
 for each health care patient, generating billions of events per day, the
 number of rows grows by orders of magnitude.  We would like to be prepared
 to handle that huge-data scenario.






Re: Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-17 Thread Dmitry Tolpeko
I ended up with the following:

def firstValue(items: Iterable[String]) = for { i <- items
} yield (i, items.head)

data.groupByKey().map { case (a, b) => (a, firstValue(b)) }.collect

More details:
http://dmtolpeko.com/2015/02/17/first_value-last_value-lead-and-lag-in-spark/
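
A rough, untested sketch of LAG along the same lines (assuming an empty
string as the default for the first row of each group):

def lagValue(items: Iterable[String]): Iterable[(String, String)] = {
  val seq = items.toSeq
  if (seq.isEmpty) Seq.empty
  else seq.zip("" +: seq.init)  // pair each value with the previous one in its group
}

data.groupByKey().map { case (k, v) => (k, lagValue(v)) }.collect

LEAD would be the mirror image (pair each value with the next one instead).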

I would appreciate any feedback.

Dmitry

On Fri, Feb 13, 2015 at 11:54 AM, Dmitry Tolpeko dmtolp...@gmail.com
wrote:

 Hello,

 To convert existing Map Reduce jobs to Spark, I need to implement window
 functions such as FIRST_VALUE, LEAD, LAG and so on. For example,
 FIRST_VALUE function:

 Source (1st column is key):

 A, A1
 A, A2
 A, A3
 B, B1
 B, B2
 C, C1

 and the result should be

 A, A1, A1
 A, A2, A1
 A, A3, A1
 B, B1, B1
 B, B2, B1
 C, C1, C1

 You can see that the first value in a group is repeated in each row.

 My current Spark/Scala code:

 def firstValue(b: Iterable[String]) : List[(String, String)] = {
   val c = scala.collection.mutable.MutableList[(String, String)]()
   var f = ""
   b.foreach(d => { if (f.isEmpty()) f = d; c += d -> f })
   c.toList
 }

 val data = sc.parallelize(List(
   ("A", "A1"),
   ("A", "A2"),
   ("A", "A3"),
   ("B", "B1"),
   ("B", "B2"),
   ("C", "C1")))

 data.groupByKey().map { case (a, b) => (a, firstValue(b)) }.collect

 So I create a new list after groupByKey. Is this the right approach in
 Spark? Are there any other options? Please point me to any drawbacks.

 Thanks,

 Dmitry




Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-13 Thread Dmitry Tolpeko
Hello,

To convert existing Map Reduce jobs to Spark, I need to implement window
functions such as FIRST_VALUE, LEAD, LAG and so on. For example,
FIRST_VALUE function:

Source (1st column is key):

A, A1
A, A2
A, A3
B, B1
B, B2
C, C1

and the result should be

A, A1, A1
A, A2, A1
A, A3, A1
B, B1, B1
B, B2, B1
C, C1, C1

You can see that the first value in a group is repeated in each row.

My current Spark/Scala code:

def firstValue(b: Iterable[String]) : List[(String, String)] = {
  val c = scala.collection.mutable.MutableList[(String, String)]()
  var f = ""
  b.foreach(d => { if (f.isEmpty()) f = d; c += d -> f })
  c.toList
}

val data = sc.parallelize(List(
  ("A", "A1"),
  ("A", "A2"),
  ("A", "A3"),
  ("B", "B1"),
  ("B", "B2"),
  ("C", "C1")))

data.groupByKey().map { case (a, b) => (a, firstValue(b)) }.collect

So I create a new list after groupByKey. Is this the right approach in
Spark? Are there any other options? Please point me to any drawbacks.
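
For reference, an untested flatMap variant that would produce the
three-column layout above directly (one row per input record with the
group's first value repeated) instead of a per-key list:

val flattened = data.groupByKey().flatMap { case (key, values) =>
  val first = values.head
  values.map(v => (key, v, first))  // (key, value, FIRST_VALUE within the group)
}
flattened.collect().foreach(println)  // e.g. (A,A1,A1), (A,A2,A1), (A,A3,A1), ...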

Thanks,

Dmitry