Thanks.

Starting the thriftserver with IgniteRDD tables doesn't seem very hard.
Implementing a security layer over the Ignite cache may be harder, as I need to:
- get the username from the thriftserver
- intercept each request and check permissions

Maybe Spark will also be able to handle permissions...
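To make the two steps above concrete, here is a minimal sketch in plain Python. It is not Spark or Ignite API: `QueryGate`, its `acl` mapping, and the `execute` callback are hypothetical names, and it assumes the user name and the set of tables a query touches have already been extracted from the thriftserver session.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


class PermissionDenied(Exception):
    """Raised when a user queries a table they are not allowed to read."""


@dataclass
class QueryGate:
    # Hypothetical ACL: user name -> set of table names the user may read.
    acl: Dict[str, Set[str]] = field(default_factory=dict)

    def check(self, user: str, tables: Set[str]) -> None:
        """Reject the request unless every referenced table is allowed."""
        denied = tables - self.acl.get(user, set())
        if denied:
            raise PermissionDenied(f"{user} may not read {sorted(denied)}")

    def run(self, user: str, tables: Set[str], execute: Callable[[], object]):
        """Check permissions first, then run the query via the shared account."""
        self.check(user, tables)
        return execute()
```

The point of the design is that the cache itself only ever sees one shared account; the per-user decision is made in front of it, once per request.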
I will keep you informed.

On 6 Oct 2016 at 00:12, "Denis Magda" <dma...@gridgain.com> wrote:

> Vincent,
>
> Please see below
>
> On Oct 5, 2016, at 4:31 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>
> Hi,
> thanks for your explanations. Please find more questions inline.
>
> Vincent
>
> 2016-10-05 3:33 GMT+02:00 Denis Magda <dma...@gridgain.com>:
>
>> Hi Vincent,
>>
>> See my answers inline
>>
>> On Oct 4, 2016, at 12:54 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>
>> Hi,
>> I know that Ignite has SQL support but:
>> - The ODBC driver doesn't seem to provide HTTP(S) support, which is easier
>> to integrate on corporate networks with rules, firewalls, proxies
>>
>>
>> *Igor Sapego*, what URIs are supported presently?
>>
>> - The SQL engine doesn't seem to scale like Spark SQL would. For
>> instance, Spark won't generate an OOM if the dataset (source or result)
>> doesn't fit in memory. From the Ignite side, it's not clear…
>>
>>
>> OOM is not related to the scalability topic at all. This is about the
>> application’s logic.
>>
>> The Ignite SQL engine scales out along with your cluster. Moreover,
>> Ignite supports indexes, which allow you to get O(log N) running time
>> complexity for your SQL queries, while in the case of Spark you will face
>> full scans (O(N)) all the time.
>>
>> However, to benefit from Ignite SQL queries you have to put all the data
>> in memory. Ignite doesn’t go to a CacheStore (Cassandra, relational
>> database, MongoDB, etc.) while a SQL query is executed and won’t preload
>> anything from an underlying CacheStore. Automatic preloading works for
>> key-value queries like cache.get(key).
>>
>
> This is an issue because I will potentially have to query TB of data. If I
> use Spark thriftserver backed by IgniteRDD, does it solve this point and
> can I get automatic preloading from C*?
>
> IgniteRDD will load missing key-value pairs from Cassandra because
> essentially IgniteRDD is an IgniteCache and Cassandra is a CacheStore. The
> only thing left to check is whether Spark thriftserver can work with
> IgniteRDDs. I hope you will be able to figure this out and share your
> feedback with us.
>
>> - Spark thrift can manage multi-tenancy: different users can connect to
>> the same SQL engine and share the cache. In Ignite it's one cache per
>> user, so a big waste of RAM.
>>
>>
>> Everyone can connect to an Ignite cluster and work with the same set of
>> distributed caches. I’m not sure why you need to create caches with the
>> same content for every user.
>>
>
> It's a security issue: an Ignite cache doesn't provide multiple user
> accounts per cache. I am thinking of using Spark to authenticate multiple
> users and then having Spark use a shared account on the Ignite cache.
>
>
> Basically, Ignite provides basic security interfaces and some
> implementations which you can rely on when building your secure solution.
> This article may be useful for your case:
> http://smartkey.co.uk/development/securing-an-apache-ignite-cluster/
>
> --
> Denis
>
>> If you need real multi-tenancy support, where cacheA may be accessed
>> only by users from group A and cacheB only by users from group B, then
>> you can take a look at GridGain, which is built on top of Ignite:
>> https://gridgain.readme.io/docs/multi-tenancy
>>
>
> OK, but I am evaluating open-source-only solutions (kylin, druid,
> alluxio...); it's a constraint from my management.
>
>>
>> What I want to achieve is:
>> - use Cassandra as the data store, as it provides idempotence (HDFS/Hive
>> doesn't), resulting in exactly-once semantics without any duplicates.
>> - use Spark SQL thriftserver in multi-tenancy for large-scale ad hoc
>> analytics queries (> TB) from an ODBC driver through HTTP(S)
>> - accelerate Cassandra reads when the data modeling of the Cassandra
>> table doesn't fit the queries.
>> Queries would be OLAP style: target multiple C* partitions, with
>> group-bys or filters on lots of dimensions that aren't necessarily in
>> the C* table key.
>>
>>
>> As was mentioned, Ignite uses Cassandra as a CacheStore. You should
>> keep this in mind. Before trying to assemble the whole chain, I would
>> recommend that you try connecting the Spark SQL thrift server directly
>> to Ignite and work with its shared RDDs [1]. A shared RDD (basically an
>> Ignite cache) can be backed by Cassandra. This chain will probably work
>> for you, but I can’t give more precise guidance on this.
>>
>
> I will try to make it work and give you feedback.
>
>> [1] https://apacheignite-fs.readme.io/docs/ignite-for-spark
>>
>> --
>> Denis
>>
>> Thanks for your advice
>>
>>
>> 2016-10-04 6:51 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:
>>
>>> I am not sure that this will be performant. What do you want to achieve
>>> here? Fast lookups? Then the Cassandra Ignite store might be the right
>>> solution. If you want to do more analytic-style queries, then you can
>>> put the data on HDFS/Hive and use the Ignite HDFS cache to cache certain
>>> partitions/tables in Hive in memory. If you want to run iterative
>>> machine learning algorithms, you can go for Spark on top of this. You
>>> can then also use the Ignite cache for Spark RDDs.
>>>
>>> On 4 Oct 2016, at 02:24, Alexey Kuznetsov <akuznet...@gridgain.com> wrote:
>>>
>>> Hi, Vincent!
>>>
>>> Ignite also has SQL support (also scalable); I think it will be much
>>> faster to query directly from Ignite than to query from Spark.
>>> Also please keep in mind that before executing queries you should load
>>> all needed data into the cache.
>>> To load data from Cassandra into Ignite you may use the Cassandra store [1].
>>>
>>> [1] https://apacheignite.readme.io/docs/ignite-with-apache-cassandra
>>>
>>> On Tue, Oct 4, 2016 at 4:19 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I am evaluating the possibility of using Spark SQL (and its scalability)
>>>> over an Ignite cache with a Cassandra persistent store to speed up read
>>>> workloads such as OLAP-style analytics.
>>>> Is there any way to configure Spark thriftserver to load an external
>>>> table in Ignite like we can do in Cassandra?
>>>> Here is an example of config for Spark backed by Cassandra:
>>>>
>>>>     CREATE EXTERNAL TABLE MyHiveTable
>>>>         ( id int, data string )
>>>>         STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
>>>>         TBLPROPERTIES ("cassandra.host" = "x.x.x.x",
>>>>             "cassandra.ks.name" = "test",
>>>>             "cassandra.cf.name" = "mytable",
>>>>             "cassandra.ks.repfactor" = "1",
>>>>             "cassandra.ks.strategy" = "org.apache.cassandra.locator.SimpleStrategy" );
>>>>
>>>
>>> --
>>> Alexey Kuznetsov
>>>
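
For reference, the difference discussed in this thread between key-value reads (which read through to the CacheStore) and SQL queries (which only see data already in memory) can be illustrated with a toy model. This is a plain-Python sketch, not Ignite's API: `ReadThroughCache`, `get`, and `query` are hypothetical names, and a dict stands in for Cassandra.

```python
class ReadThroughCache:
    """Toy stand-in for a cache backed by a persistent store."""

    def __init__(self, store):
        self.store = store   # plays the role of Cassandra as the CacheStore
        self.memory = {}     # entries currently held in memory

    def get(self, key):
        # Key-value access: a miss is loaded from the backing store.
        if key not in self.memory and key in self.store:
            self.memory[key] = self.store[key]
        return self.memory.get(key)

    def query(self, predicate):
        # Query-style access: only entries already in memory are scanned;
        # the backing store is never consulted.
        return [value for value in self.memory.values() if predicate(value)]


cache = ReadThroughCache({1: "a", 2: "b"})
print(cache.query(lambda v: True))  # prints [] because nothing is loaded yet
cache.get(1)                        # read-through pulls key 1 into memory
print(cache.query(lambda v: True))  # prints ['a']
```

In this model a query over TBs of data returns correct results only if everything relevant was loaded beforehand, which is the constraint raised above.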