Hello,

I am evaluating Ignite as an HDFS cache to speed up my Hive queries. I am using Hive with Tez. Below are my cluster and Ignite configurations.

*Cluster:*
4 data nodes with 32 GB RAM each, plus 1 edge node.
4 Ignite servers, one on each data node. The Ignite servers were started with -Xmx10g.

*Setup done using:*
https://apacheignite-fs.readme.io/docs/installing-on-hortonworks-hdp
https://apacheignite-fs.readme.io/docs/running-apache-hive-over-ignited-hadoop

*Ignite configuration file (provided to each Ignite server):*

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="memoryConfiguration">
            <bean class="org.apache.ignite.configuration.MemoryConfiguration">
                <property name="defaultMemoryPolicySize" value="#{8L * 1024 * 1024 * 1024}"/>
            </bean>
        </property>

        <property name="connectorConfiguration">
            <bean class="org.apache.ignite.configuration.ConnectorConfiguration">
                <property name="port" value="11211"/>
            </bean>
        </property>

        <property name="fileSystemConfiguration">
            <list>
                <bean class="org.apache.ignite.configuration.FileSystemConfiguration">
                    <!-- IGFS name you will use to access IGFS through Hadoop API. -->
                    <property name="name" value="igfs"/>

                    <!-- Configure TCP endpoint for communication with the file system instance. -->
                    <property name="ipcEndpointConfiguration">
                        <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
                            <property name="type" value="TCP" />
                            <property name="host" value="0.0.0.0" />
                            <property name="port" value="10500" />
                        </bean>
                    </property>

                    <!-- Configure secondary file system if needed. -->
                    <property name="secondaryFileSystem">
                        <bean class="org.apache.ignite.hadoop.fs.IgniteHadoopIgfsSecondaryFileSystem">
                            <property name="fileSystemFactory">
                                <bean class="org.apache.ignite.hadoop.fs.CachingHadoopFileSystemFactory">
                                    <property name="uri" value="hdfs://<hostip>:8020/"/>
                                </bean>
                            </property>
                        </bean>
                    </property>
                </bean>
            </list>
        </property>

        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>node1:47500..47509</value>
                                <value>node2:47500..47509</value>
                                <value>node3:47500..47509</value>
                                <value>node4:47500..47509</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>

*Dataset used for the experiment:*
TPC-H, with the following tables:

    customer     1500000 rows
    lineitem    59986052 rows
    nation            25 rows
    orders      15000000 rows
    part         2000000 rows
    partsupp     8000000 rows
    region             5 rows
    supplier      100000 rows

I am running the standard TPC-H queries.

*Querying from the Hive shell with the property below:*

    set fs.default.name=igfs://igfs@node1:10500/;

I now have the following questions:

1) My queries run fine with the above configuration. I want to see whether the data is being cached and served from the cache. How should I check this? I used Ignite Visor to see if the data is available in a cache, but I did not find any cache there. However, in the Ignite server logs I can see local node metrics messages like the one shown below. The heap usage increases continuously while a query is running. What does this mean?

    Metrics for local node (to disable set 'metricsLogFrequency' to 0)
        ^-- Node [id=e38943b2, name=null, uptime=03:02:18:866]
        ^-- H/N/C [hosts=4, nodes=4, CPUs=32]
        ^-- CPU [cur=0.23%, avg=0.13%, GC=0%]
        ^-- PageMemory [pages=7381]
        ^-- Heap [used=1050MB, free=88.46%, comm=3343MB]
        ^-- Non heap [used=83MB, free=98.45%, comm=84MB]
        ^-- Public thread pool [active=0, idle=0, qSize=0]
        ^-- System thread pool [active=0, idle=6, qSize=0]
        ^-- Outbound messages queue [size=0]

2) I ran the queries on both hive+tez+hdfs and hive+tez+ignite+hdfs, and found that they are slower when using Ignite as a cache layer. For example, consider the standard TPC-H query below:

    select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
    from customer, orders, lineitem, supplier, nation, region
    where c_custkey = o_custkey
      and l_orderkey = o_orderkey
      and l_suppkey = s_suppkey
      and c_nationkey = s_nationkey
      and s_nationkey = n_nationkey
      and n_regionkey = r_regionkey
      and r_name = 'AFRICA'
      and o_orderdate >= '1993-01-01'
      and o_orderdate < '1994-01-01'
    group by n_name
    order by revenue desc;

    Hive+tez avg time:        35.542s
    Hive+tez+ignite avg time: 38.221s

Am I using the wrong configuration?

3) I tried running the queries with Ignite MR, with the configs below set in Hive:

    set hive.rpc.query.plan = true;
    set hive.execution.engine = mr;
    set mapreduce.framework.name = ignite;
    set mapreduce.jobtracker.address = node1:11211;

The queries were even slower than with hive+tez+ignite. Is there any other configuration for Ignite MR that I need to set?

4) Is my configuration optimal? If not, can you please suggest a better one?

5) Which serialization mechanism (Kryo, native Java, ...) does Ignite use?

Thanks