Hi,
I am using CDH cluster with Spark 2.1 with Scala Version 2.11.8.
sbt version is 1.0.2.
While doing assembly, I am getting the error:
[error] java.lang.RuntimeException: Conflicting cross-version suffixes in:
org.scala-lang.modules:scala-xml, org.scala-lang.*
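This error usually means two dependencies pull in scala-xml built for different Scala binary versions (e.g. _2.10 and _2.11). One common workaround is to force a single version and exclude the cross-built duplicate in build.sbt; this is only a sketch, since the exact offending module depends on your dependency tree (which `sbt evicted` can show), and "com.example" below is a hypothetical placeholder:

```scala
// build.sbt sketch: pin one scala-xml for Scala 2.11 and exclude the
// duplicate dragged in under the other cross-version suffix.
dependencyOverrides += "org.scala-lang.modules" %% "scala-xml" % "1.0.6"

libraryDependencies += ("com.example" %% "some-lib" % "1.0")  // hypothetical offender
  .exclude("org.scala-lang.modules", "scala-xml_2.10")
```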
Hi,
We basically have the same scenario, but worldwide; as we have bigger datasets we
use OGG --> local --> Sqoop into Hadoop.
By all means you can have Spark reading the Oracle tables and then make
changes to the data as needed, which would not be done in the Sqoop query, i.e. fraudulent
detection on
We are an investment firm and have an MDM platform in Oracle at a vendor
location, and use Oracle GoldenGate to replicate data to our data center for
reporting needs.
Our data is not big data (total size 6 TB, including 2 TB of archive data).
Moreover, our data doesn't get updated often, only nightly.
Hi Nicolas,
Without the Hive Thrift Server, if you try to run a select * on a table
which has around 10,000 partitions, SPARK will give you some surprises.
PRESTO works fine in these scenarios, and I am sure SPARK community will
soon learn from their algorithms.
Regards,
Gourav
On Sun, Oct 15,
> I do not think that SPARK will automatically determine the partitions.
> Actually
> it does not automatically determine the partitions. In case a table has a few
> million records, it all goes through the driver.
Hi Gourav
Actually the Spark JDBC driver is able to deal directly with partitions.
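As a sketch of what that looks like (the connection URL, table, and column names here are hypothetical), Spark's JDBC source can split the read across parallel queries via the partitionColumn/lowerBound/upperBound/numPartitions options, so the data does not all flow through the driver:

```scala
// Hypothetical Oracle connection; Spark issues numPartitions parallel
// queries, each covering a slice of the partition column's value range.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("dbtable", "schema.trades")
  .option("partitionColumn", "trade_id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "16")
  .load()
```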
Hi Gourav
> what if the table has partitions and sub-partitions?
well, this also works with multiple ORC files having the same schema:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")
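For a partitioned table, the same direct-file approach can target just the needed partition directories; this sketch assumes a Hive-style partitioned layout (the paths and partition columns are hypothetical):

```scala
// Assuming Hive-style partition directories such as
// hdfs://cluster/people/year=2017/month=10/, loading only the relevant
// subdirectory avoids scanning the entire table.
val oct2017 = sqlContext.read.format("orc")
  .load("hdfs://cluster/people/year=2017/month=10")
```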
Am I missing something?
> And you do not want to access the entire data?
This works for
Hi Nicolas,
what if the table has partitions and sub-partitions? And you do not want to
access the entire data?
Regards,
Gourav
On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris wrote:
> Le 03 oct. 2017 à 20:08, Nicolas Paris écrivait :
> > I wonder the differences
Le 03 oct. 2017 à 20:08, Nicolas Paris écrivait :
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext
Well there is also a third way to access the hive data from spark:
- with direct file access (here ORC format)
For example:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")
Can’t you cache the token vault in a caching solution, such as Ignite? The
lookup of single tokens would be really fast.
What volumes are we talking about?
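The cached-vault idea can be sketched as follows; this is a minimal in-process version using a concurrent map (a distributed cache such as Ignite exposes a similar get/put style), and the vaultLookup function is a hypothetical stand-in for the remote token-vault call:

```scala
import scala.collection.concurrent.TrieMap

// Minimal sketch of a cached vault-based tokenization lookup.
object TokenCache {
  private val cache = TrieMap.empty[String, String]

  // Hypothetical vault call, standing in for the remote lookup.
  private def vaultLookup(value: String): String = "tok-" + value.reverse

  // Returns the cached token, hitting the vault only on a miss.
  def tokenize(value: String): String =
    cache.getOrElseUpdate(value, vaultLookup(value))
}

println(TokenCache.tokenize("4111111111111111"))
```

Repeated lookups of the same value then never leave the cache, which is what makes single-token lookups fast.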
I assume you refer to PCI DSS, so security might be an important aspect which
might not be that easy to achieve with vault-less
Hi,
When doing micro-batch streaming of trade data we need to tokenize
certain columns before the data lands in HBase with a Lambda architecture.
There are two ways of tokenizing data: vault-based and vault-less, using
something like Protegrity tokenization.
The vault-based tokenization requires
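Applied to a micro-batch, the column-level step could be sketched as a Spark UDF; everything here is illustrative (the trades DataFrame, the card_number column, and the UUID-based tokenizer standing in for a real vault or Protegrity call are all hypothetical):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical tokenizer; a real pipeline would call the tokenization
// service instead of deriving a name-based UUID locally.
val tokenize = udf((value: String) =>
  java.util.UUID.nameUUIDFromBytes(value.getBytes("UTF-8")).toString)

// Replace the sensitive column before writing the micro-batch to HBase.
val masked = trades.withColumn("card_number", tokenize(col("card_number")))
```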