Near Real time analytics with Spark and tokenization

2017-10-15 Thread Mich Talebzadeh
Hi,

When doing micro-batch streaming of trade data, we need to tokenize
certain columns before the data lands in HBase, in a Lambda architecture.

There are two ways of tokenizing data: vault-based, and vault-less using
something like Protegrity tokenization.

Vault-based tokenization requires the clear-text and token values to be
stored in a vault, say HBase, and crucially the vault cannot be on the same
Hadoop cluster where we do the real-time processing. It could be another
Hadoop cluster dedicated to tokenization.

This adds latency to real-time analytics, because token values have to be
calculated and then stored in the remote HBase vault.

What is the general approach to this type of issue? Is the sensible choice
here to use vault-less tokenization?
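For illustration only, a minimal sketch of what vault-less (deterministic, keyed) tokenization of a column could look like in Spark/Scala. This is not Protegrity's algorithm; the key source, column name and key handling are hypothetical placeholders:

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import org.apache.spark.sql.functions.udf

// Derive a deterministic token from the clear value plus a secret key,
// so no clear-text-to-token mapping has to be stored in a vault.
val secret = sys.env.getOrElse("TOKEN_SECRET", "change-me")   // hypothetical key source

def tokenize(clear: String): String = {
  val mac = Mac.getInstance("HmacSHA256")
  mac.init(new SecretKeySpec(secret.getBytes("UTF-8"), "HmacSHA256"))
  mac.doFinal(clear.getBytes("UTF-8")).map("%02x".format(_)).mkString
}

val tokenizeUdf = udf(tokenize _)

// Applied per micro-batch before the write to HBase, e.g.
// trades.withColumn("account_no", tokenizeUdf($"account_no"))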

Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Near Real time analytics with Spark and tokenization

2017-10-15 Thread Jörn Franke
Can’t you cache the token vault in a caching solution, such as Ignite? The
lookup of single tokens would then be really fast.
What volumes are we talking about?

I assume you refer to PCI DSS, so security might be an important aspect, and that
might not be that easy to achieve with vault-less tokenization. Also, with
vault-less tokenization you need to recalculate all tokens in case the secret
is compromised.
There might be other compliance requirements, which may need to be weighed by
the users.
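For illustration, a minimal sketch of that caching idea in Scala, assuming a running Ignite node; the cache name and the remote-vault lookup are hypothetical placeholders, not a reference implementation:

import org.apache.ignite.Ignition

// Start (or connect to) an Ignite node and get a key/value cache for tokens.
val ignite = Ignition.start()
val tokenCache = ignite.getOrCreateCache[String, String]("tokenVault")

// Placeholder for the call to the remote HBase vault.
def lookupFromRemoteVault(clear: String): String = ???

// Serve tokens from the cache, falling back to the remote vault on a miss.
def tokenOf(clear: String): String =
  Option(tokenCache.get(clear)).getOrElse {
    val token = lookupFromRemoteVault(clear)
    tokenCache.put(clear, token)
    token
  }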



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
On 03 Oct 2017 at 20:08, Nicolas Paris wrote:
> I wonder about the differences when accessing Hive tables in two different ways:
> - with jdbc access
> - with sparkContext

Well, there is also a third way to access the Hive data from Spark:
- with direct file access (here, the ORC format)


For example:

// Hive-aware context with ORC predicate pushdown enabled
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
// Read the ORC files directly from HDFS, bypassing the Hive metastore
val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_people")
people.createOrReplaceTempView("people")
sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()


This method looks much faster than both of the others:
- with JDBC access
- with sparkContext

Any experience with that?
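For reference, a rough Spark 2.x equivalent using SparkSession (HiveContext is deprecated there); the path and filter below are placeholders:

import org.apache.spark.sql.SparkSession

// Direct file access does not need the Hive metastore at all.
val spark = SparkSession.builder().appName("orc-direct-read").getOrCreate()
spark.conf.set("spark.sql.orc.filterPushdown", "true")

val people = spark.read.format("orc").load("hdfs://cluster/orc_people")
people.createOrReplaceTempView("people")
spark.sql("SELECT count(1) FROM people WHERE age > 30").show()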





Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas,

what if the table has partitions and sub-partitions, and you do not want to
access the entire dataset?


Regards,
Gourav



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
Hi Gourav

> what if the table has partitions and sub-partitions? 

Well, this also works with multiple ORC files having the same schema:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")
Am I missing something?

> And you do not want to access the entire dataset?

This works for static datasets. When new data comes in from batch processes,
the Spark application has to be reloaded to pick up the new files in the
folder.
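On the partition question, a rough sketch of what a direct read of a Hive-style partitioned ORC layout could look like (the paths and the dt column are hypothetical); Spark discovers dt as a partition column and prunes directories when it appears in a filter:

// Layout assumed: hdfs://cluster/orc_people/dt=2017-10-15/part-*.orc
val people = sqlContext.read.format("orc").load("hdfs://cluster/orc_people")
people.where("dt = '2017-10-15'").count()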






Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
> I do not think that SPARK will automatically determine the partitions. 
> Actually
> it does not automatically determine the partitions. In case a table has a few
> million records, it all goes through the driver.

Hi Gourav

Actually, the Spark JDBC data source is able to deal directly with partitions.
Spark creates a JDBC connection for each partition.

All the details are explained in this post:
http://www.gatorsmile.io/numpartitionsinjdbc/

Also an example with the Greenplum database:
http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
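For illustration, a rough sketch of such a partitioned JDBC read; the connection details, table and bounds are placeholders. Spark opens one connection per partition and splits the reads on ranges of partitionColumn:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/analytics")   // placeholder connection
  .option("dbtable", "public.people")
  .option("partitionColumn", "id")                            // numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")                               // 8 parallel connections
  .load()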




Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas,

Without the Hive Thrift server, if you try to run a select * on a table
which has around 10,000 partitions, Spark will give you some surprises.
Presto works fine in these scenarios, and I am sure the Spark community will
soon learn from its algorithms.
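One mitigation worth noting (a sketch with a hypothetical table and partition column): filter on the partition column and let Spark ask the metastore only for the matching partitions via the spark.sql.hive.metastorePartitionPruning setting:

// Ask the metastore for matching partitions only, instead of listing all of them.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
spark.sql("SELECT * FROM big_table WHERE dt = '2017-10-15'").show()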


Regards,
Gourav

On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:

> > I do not think that SPARK will automatically determine the partitions.
> Actually
> > it does not automatically determine the partitions. In case a table has
> a few
> > million records, it all goes through the driver.
>
> Hi Gourav
>
> Actualy spark jdbc driver is able to deal direclty with partitions.
> Sparks creates a jdbc connection for each partition.
>
> All details explained in this post :
> http://www.gatorsmile.io/numpartitionsinjdbc/
>
> Also an example with greenplum database:
> http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
>


Is Spark suited for this use case?

2017-10-15 Thread Saravanan Thirumalai
We are an investment firm and have an MDM platform in Oracle at a vendor
location, and we use Oracle GoldenGate to replicate data to our data center
for reporting needs.
Our data is not big data (total size 6 TB, including 2 TB of archive data).
Moreover, our data doesn't get updated often: once nightly (around 50 MB) plus
some correction transactions during the day (<10 MB). We don't have external
users, hence the data doesn't grow in real time like e-commerce.

When we replicate data from source to target, we transfer the data through
files. So if there are DML operations (corrections) on a source table during
the day, the corresponding file would have perhaps 100 lines of table data
that need to be loaded into the target database. Due to the low volume of
data, we designed this with Informatica, and it completes in less than 2-5
minutes. Can Spark be used in this case, or would it be technology overkill?


RE: Is Spark suited for this use case?

2017-10-15 Thread van den Heever, Christian CC
Hi,

We basically have the same scenario, but worldwide and with bigger datasets; we
use OGG --> local files --> Sqoop into Hadoop.
By all means you can have Spark read the Oracle tables and then apply changes
to the data that cannot be done in the Sqoop query, e.g. fraud detection on
transaction records.
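For illustration, a rough sketch of Spark reading an Oracle table over JDBC before applying such logic; the URL, credentials, table and filter are placeholders, and the Oracle JDBC driver jar must be on the classpath:

val trades = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   // placeholder connection
  .option("dbtable", "MDM.TRANSACTIONS")
  .option("user", "spark_reader")
  .option("password", sys.env("ORACLE_PWD"))
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

// Placeholder for real fraud-detection rules.
val suspicious = trades.where("AMOUNT > 100000")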

But sometimes the simplest way is the best. Unless you need such a change or
need more, I would advise against adding another hop.
I would rather move away from files, as OGG can do both files and direct table
loading, and then use Sqoop for the rest.

Simpler is better.

Hope this helps.
C.



Dependency error due to scala version mismatch in SBT and Spark 2.1

2017-10-15 Thread patel kumar
Hi,

I am using a CDH cluster with Spark 2.1 and Scala version 2.11.8.
The sbt version is 1.0.2.

While doing assembly, I am getting the following error:

[error] java.lang.RuntimeException: Conflicting cross-version suffixes in:
org.scala-lang.modules:scala-xml, org.scala-lang.modules:scala-parser-combinators

I tried to override the version mismatch using dependencyOverrides and
force(), but none of the solutions worked.

Please help me resolve this version conflict.

Details of the configuration are below:

build.sbt

***
name := "newtest"
version := "0.0.2"

scalaVersion := "2.11.8"

sbtPlugin := true

val sparkVersion = "2.1.0"

mainClass in (Compile, run) := Some("com.testpackage.sq.newsparktest")

assemblyJarName in assembly := "newtest.jar"


libraryDependencies ++= Seq(
   "org.apache.spark" % "spark-core_2.11" % "2.1.1" % "provided",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.1" % "provided",
  "com.databricks" % "spark-avro_2.11" % "3.2.0",
"org.apache.spark" % "spark-hive_2.11" % "2.1.1" % "provided"
   )


libraryDependencies +=
 "log4j" % "log4j" % "1.2.15" excludeAll(
   ExclusionRule(organization = "com.sun.jdmk"),
   ExclusionRule(organization = "com.sun.jmx"),
   ExclusionRule(organization = "javax.jms")
 )

resolvers += "SparkPackages" at "https://dl.bintray.com/spark-
packages/maven/"
resolvers += Resolver.url("bintray-sbt-plugins", url("http://dl.bintray.com/
sbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}

***
plugins.sbt



dependencyOverrides += ("org.scala-lang.modules" % "scala-xml_2.11" % "1.0.4")
dependencyOverrides += ("org.scala-lang.modules" % "scala-parser-combinators_2.11" % "1.0.4")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
resolvers += Resolver.url("bintray-sbt-plugins", url("https://dl.bintray.com/eed3si9n/sbt-plugins/"))(Resolver.ivyStylePatterns)

*
Error Message after assembly

**
[error] Modules were resolved with conflicting cross-version suffixes in {file:/D:/Tools/scala_ide/test_workspace/test/NewSparkTest/}newsparktest:
[error]    org.scala-lang.modules:scala-xml _2.11, _2.12
[error]    org.scala-lang.modules:scala-parser-combinators _2.11, _2.12
[error] java.lang.RuntimeException: Conflicting cross-version suffixes in: org.scala-lang.modules:scala-xml, org.scala-lang.modules:scala-parser-combinators
[error]   at scala.sys.package$.error(package.scala:27)
[error]   at sbt.librarymanagement.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:39)
[error]   at sbt.librarymanagement.ConflictWarning$.apply(ConflictWarning.scala:19)
[error]   at sbt.Classpaths$.$anonfun$ivyBaseSettings$64(Defaults.scala:1971)
[error]   at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error]   at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:42)
[error]   at sbt.std.Transform$$anon$4.work(System.scala:64)
[error]   at sbt.Execute.$anonfun$submit$2(Execute.scala:257)
[error]   at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error]   at sbt.Execute.work(Execute.scala:266)
[error]   at sbt.Execute.$anonfun$submit$1(Execute.scala:257)
[error]   at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:167)
[error]   at sbt.CompletionService$$anon$2.call(CompletionService.scala:32)
[error]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error]   at java.lang.Thread.run(Thread.java:748)
[error] (*:update) Conflicting cross-version suffixes in: org.scala-lang.modules:scala-xml, org.scala-lang.modules:scala-parser-combinators
[error] Total time: 413 s, completed Oct 12, 2017 3:28:02 AM

**


Re: Is Spark suited for this use case?

2017-10-15 Thread Jörn Franke
Hi,

What is the motivation behind your question? Saving costs?

You seem to be happy with the functional and non-functional requirements, so
the only remaining driver could be cost, or the need for innovation in the
future.

Best regards
