Dependency error due to scala version mismatch in SBT and Spark 2.1

2017-10-15 Thread patel kumar
Hi, I am using a CDH cluster with Spark 2.1 and Scala version 2.11.8; the sbt version is 1.0.2. While doing assembly, I am getting the error: [error] java.lang.RuntimeException: Conflicting cross-version suffixes in: org.scala-lang.modules:scala-xml, org.scala-lang.
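This error usually means one dependency was built against Scala 2.10 while the rest of the build targets 2.11. A minimal sketch of the usual workaround in build.sbt, assuming the _2.10 artifact is scala-xml pulled in transitively ("some.org" %% "some-lib" and all version numbers are placeholders, not taken from the thread):

    // Exclude the Scala 2.10 build of scala-xml that the offending library drags in ...
    libraryDependencies += ("some.org" %% "some-lib" % "1.2.3")
      .exclude("org.scala-lang.modules", "scala-xml_2.10")

    // ... and depend on the Scala 2.11 build explicitly if the code still needs scala-xml.
    libraryDependencies += "org.scala-lang.modules" %% "scala-xml" % "1.0.6"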

RE: Is Spark suited for this use case?

2017-10-15 Thread van den Heever, Christian CC
Hi, We basically have the same scenario, but worldwide. As we have bigger datasets we use OGG --> local --> Sqoop into Hadoop. By all means you can have Spark reading the Oracle tables and then make changes to the data as needed, which will not be done in the Sqoop query, i.e. fraudulent detection on
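A rough sketch of reading an Oracle table directly from Spark, as suggested above (the URL, table, credentials and filter are placeholders, and the Oracle JDBC driver must be on the classpath):

    val trades = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   // hypothetical connection
      .option("dbtable", "SCHEMA.TRADES")
      .option("user", "spark_user")
      .option("password", "********")
      .load()

    // Apply whatever cleansing or fraud-detection logic the Sqoop query cannot express,
    // then persist the result, e.g. as Parquet.
    trades.filter("AMOUNT > 0").write.parquet("hdfs:///staging/trades")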

Is Spark suited for this use case?

2017-10-15 Thread Saravanan Thirumalai
We are an investment firm and have an MDM platform in Oracle at a vendor location, and we use Oracle GoldenGate to replicate data to our data center for reporting needs. Our data is not big data (total size 6 TB, including 2 TB of archive data). Moreover, our data doesn't get updated often, nightly once

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas, without the Hive thrift server, if you try to run a SELECT * on a table which has around 10,000 partitions, Spark will give you some surprises. Presto works fine in these scenarios, and I am sure the Spark community will soon learn from their algorithms. Regards, Gourav On Sun, Oct 15,

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
> I do not think that Spark will automatically determine the partitions. Actually it does not automatically determine the partitions. In case a table has a few million records, it all goes through the driver. Hi Gourav, actually the Spark JDBC driver is able to deal directly with partitions.
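For reference, a minimal sketch of a partitioned JDBC read (the URL, column names and bounds are made up). With partitionColumn set, each of the numPartitions tasks issues its own range query instead of pulling everything through a single connection on the driver:

    val people = spark.read
      .format("jdbc")
      .option("url", "jdbc:hive2://hiveserver:10000/default")   // placeholder; any JDBC source
      .option("dbtable", "people")
      .option("partitionColumn", "id")   // a numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")     // 16 parallel range queries
      .load()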

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
Hi Gourav > what if the table has partitions and sub-partitions? Well, this also works with multiple ORC files having the same schema: val people = sqlContext.read.format("orc").load("hdfs://cluster/people*") Am I missing something? > And you do not want to access the entire data? This works for
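One way to avoid touching the entire dataset with direct file access, sketched here under the assumption that the table sits in the usual key=value partition directory layout on HDFS (the path and column names are hypothetical):

    val people = sqlContext.read
      .format("orc")
      .load("hdfs://cluster/warehouse/people")    // root of a year=.../month=... layout
      .filter("year = 2017 AND month = 10")       // only matching partition directories are scanned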

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas, what if the table has partitions and sub-partitions? And you do not want to access the entire data? Regards, Gourav On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris wrote: > On 03 Oct 2017 at 20:08, Nicolas Paris wrote: > > I wonder about the differences

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
On 03 Oct 2017 at 20:08, Nicolas Paris wrote: > I wonder about the differences when accessing Hive tables in two different ways: > - with jdbc access > - with sparkContext Well, there is also a third way to access the Hive data from Spark: - with direct file access (here ORC format) For example: val
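The example is truncated here; the reply above quotes it in full:

    val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")

This goes straight to the warehouse files and bypasses both HiveServer2 and the Hive metastore.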

Re: Near Real time analytics with Spark and tokenization

2017-10-15 Thread Jörn Franke
Can’t you cache the token vault in a caching solution, such as Ignite? The lookup of single tokens would be really fast. What volumes are we talking about? I assume you refer to PCI DSS, so security might be an important aspect, which might not be that easy to achieve with vault-less
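A very small sketch of the idea, assuming an embedded Ignite node and a plain key/value cache (the cache name and sample mapping are made up; a real PCI DSS deployment would add encryption, persistence and access control):

    import org.apache.ignite.Ignition
    import org.apache.ignite.configuration.CacheConfiguration

    val ignite = Ignition.start()                      // start/join an Ignite node
    val vault  = ignite.getOrCreateCache(
      new CacheConfiguration[String, String]("tokenVault"))

    vault.put("4111111111111111", "tok-8f3a21")        // hypothetical PAN -> token entry
    val token = vault.get("4111111111111111")          // single-token lookup, in memory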

Near Real time analytics with Spark and tokenization

2017-10-15 Thread Mich Talebzadeh
Hi, When doing micro-batch streaming of trade data we need to tokenize certain columns before the data lands in HBase with a Lambda architecture. There are two ways of tokenizing data: vault-based, and vault-less using something like Protegrity tokenization. The vault-based tokenization requires
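As a rough illustration only (the DataFrame and column name are hypothetical, and the hashing stand-in is not a substitute for a real vault-less product such as Protegrity), tokenizing a column in each micro-batch before the HBase write might look like:

    import org.apache.spark.sql.functions.{col, udf}

    // Stand-in tokenizer: deterministic, irreversible mapping; not format-preserving.
    val tokenize = udf((v: String) =>
      java.util.UUID.nameUUIDFromBytes(v.getBytes("UTF-8")).toString)

    val tokenized = tradesDF.withColumn("account_number", tokenize(col("account_number")))
    // ... then write `tokenized` to HBase as part of the micro-batch.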