Hello,

I am evaluating Ignite as an HDFS cache to speed up my Hive queries. I am
using Hive with Tez. My cluster and Ignite configurations are below.

*Cluster: *
4 data nodes with 32gb RAM each, 1 edge node
4 Ignite servers, one per data node, each started with -Xmx10g

*Setup done using:*
https://apacheignite-fs.readme.io/docs/installing-on-hortonworks-hdp
https://apacheignite-fs.readme.io/docs/running-apache-hive-over-ignited-hadoop

*Ignite configuration file (provided to each ignite server): *
<bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
<property name="memoryConfiguration">
    <bean class="org.apache.ignite.configuration.MemoryConfiguration">
        <property name="defaultMemoryPolicySize" value="#{8L * 1024 * 1024 * 1024}"/>
    </bean>
</property>
<property name="connectorConfiguration">
    <bean class="org.apache.ignite.configuration.ConnectorConfiguration">
        <property name="port" value="11211"/>
    </bean>
</property>
<property name="fileSystemConfiguration">
    <list>
        <bean class="org.apache.ignite.configuration.FileSystemConfiguration">
            <!-- IGFS name you will use to access IGFS through Hadoop API. -->
            <property name="name" value="igfs"/>

            <!-- Configure TCP endpoint for communication with the file system instance. -->
            <property name="ipcEndpointConfiguration">
                <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
                    <property name="type" value="TCP" />
                    <property name="host" value="0.0.0.0" />
                    <property name="port" value="10500" />
                </bean>
            </property>

            <!--
                Configure secondary file system if needed.
            -->

            <property name="secondaryFileSystem">
                <bean class="org.apache.ignite.hadoop.fs.IgniteHadoopIgfsSecondaryFileSystem">
                    <property name="fileSystemFactory">
                        <bean class="org.apache.ignite.hadoop.fs.CachingHadoopFileSystemFactory">
                            <property name="uri" value="hdfs://<hostip>:8020/"/>
                        </bean>
                    </property>
                </bean>
            </property>

        </bean>
    </list>
</property>
<property name="discoverySpi">
    <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="ipFinder">
            <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                <property name="addresses">
                    <list>
                        <value>node1:47500..47509</value>
                        <value>node2:47500..47509</value>
                        <value>node3:47500..47509</value>
                        <value>node4:47500..47509</value>
                    </list>
                </property>
            </bean>
        </property>
    </bean>
</property>
</bean>
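
As a sanity check on the sizing, the per-node memory arithmetic implied by the numbers above can be sketched as follows. This is only a rough upper bound: it assumes the 10 GB heap and the 8 GB default memory policy are both fully used and that "32gb" means 32 GiB. The variable names are illustrative and are not Ignite settings.

```python
# Rough per-node memory budget from the figures in this post.
GIB = 1024 ** 3

node_ram = 32 * GIB        # RAM per data node (assuming "32gb" means 32 GiB)
ignite_heap = 10 * GIB     # JVM heap limit from Xmx10g
ignite_offheap = 8 * GIB   # defaultMemoryPolicySize = 8L * 1024 * 1024 * 1024

ignite_total = ignite_heap + ignite_offheap
left_for_rest = node_ram - ignite_total  # OS, HDFS DataNode, YARN/Tez containers

print(f"off-heap bytes per node: {ignite_offheap}")
print(f"Ignite upper bound per node: {ignite_total // GIB} GiB")
print(f"left for everything else: {left_for_rest // GIB} GiB")
print(f"off-heap across 4 nodes: {4 * ignite_offheap // GIB} GiB")
```

Under these assumptions Ignite may claim up to 18 GiB on each data node, which leaves about 14 GiB for the operating system, HDFS, and the Tez containers.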

*Dataset used for the experiment: *
TPCH
customer 1500000 rows
lineitem 59986052 rows
nation 25 rows
orders 15000000 rows
part 2000000 rows
partsupp 8000000 rows
region 5 rows
supplier 100000 rows

The queries are the standard TPCH queries.

*Querying from the Hive shell with the following property:*
set fs.default.name=igfs://igfs@node1:10500/;



I now have the following questions:

1) My queries run fine with the above configuration. I want to verify
whether the data is actually being cached and served from the cache. How
should I check this? I used Ignite Visor to look for the cached data, but
it did not show any cache.

However, the Ignite server logs show local node metrics like the ones
below. Heap usage increases continuously while a query is running. What
does this mean?

Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=e38943b2, name=null, uptime=03:02:18:866]
    ^-- H/N/C [hosts=4, nodes=4, CPUs=32]
    ^-- CPU [cur=0.23%, avg=0.13%, GC=0%]
    ^-- PageMemory [pages=7381]
    ^-- Heap [used=1050MB, free=88.46%, comm=3343MB]
    ^-- Non heap [used=83MB, free=98.45%, comm=84MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=6, qSize=0]
    ^-- Outbound messages queue [size=0]


2) I ran queries on both hive+tez+hdfs and hive+tez+ignite+hdfs, and found
that the queries are slower when Ignite is used as a cache layer. For
example, consider the standard TPCH query below:

select
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
customer,
orders,
lineitem,
supplier,
nation,
region
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
and o_orderdate >= '1993-01-01'
and o_orderdate < '1994-01-01'
group by
n_name
order by
revenue desc;

Hive+tez avg time: 35.542s
Hive+tez+ignite avg time: 38.221s

Am I using the wrong configuration?
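
To put the gap in perspective, the relative slowdown can be computed from the two averages above. This is a minimal sketch using only those two numbers; it says nothing about run-to-run variance, which is not reported here.

```python
# Average wall-clock times from the runs above, in seconds.
hive_tez_s = 35.542          # Hive + Tez reading directly from HDFS
hive_tez_ignite_s = 38.221   # Hive + Tez with IGFS in front of HDFS

delta_s = hive_tez_ignite_s - hive_tez_s
slowdown_pct = 100 * delta_s / hive_tez_s

print(f"difference: {delta_s:.3f} s ({slowdown_pct:.1f}% slower with Ignite)")
```

The difference is about 2.7 s, or roughly 7.5% of the baseline time.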

3) I tried running queries with Ignite MR, with the following settings in Hive:
set hive.rpc.query.plan = true;
set hive.execution.engine = mr;
set mapreduce.framework.name = ignite;
set mapreduce.jobtracker.address = node1:11211;

These queries were even slower than with hive+tez+ignite. Is there any
other configuration that I need to do for Ignite MR?

4) Is my configuration optimal? If not, could you please suggest a better one?

5) Which serialization mechanism (Kryo, native Java, ...) does Ignite use?

Thanks
