[jira] [Updated] (KYLIN-1506) Refactor resource interface for timeseries-based data like jobs to much better performance

Hao Chen (JIRA) Sat, 19 Mar 2016 04:04:32 -0700

     [ https://issues.apache.org/jira/browse/KYLIN-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hao Chen updated KYLIN-1506:
----------------------------
    Description: 
h1. Problem

Currently, operations such as getJobOutputs/getJobs scan the store twice to 
build a response. The scan always proceeds in two steps:
1. List the keys, sort them, and take the first and last key (in effect just a 
prefix filter) with "store.listResources(resourcePath)".
2. Re-scan that key range with a timestamp filter: 
"store.getAllResources(startKey, endKey, startTime, endTime, Class, Serializer)"

{code}
public List<ExecutableOutputPO> getJobOutputs(long timeStartInMillis, long timeEndInMillis)
        throws PersistentException {
    try {
        NavigableSet<String> resources =
                store.listResources(ResourceStore.EXECUTE_OUTPUT_RESOURCE_ROOT);
        if (resources == null || resources.isEmpty()) {
            return Collections.emptyList();
        }
        // Collections.sort(resources);
        String rangeStart = resources.first();
        String rangeEnd = resources.last();
        return store.getAllResources(rangeStart, rangeEnd, timeStartInMillis, timeEndInMillis,
                ExecutableOutputPO.class, JOB_OUTPUT_SERIALIZER);
    } catch (IOException e) {
        logger.error("error get all Jobs:", e);
        throw new PersistentException(e);
    }
}
{code}
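
Since the first pass is in effect a prefix filter, the key range can be derived from the prefix alone, without listing the keys first. The following is a minimal sketch of that idea, assuming string keys held in a sorted in-memory map. The class name, the helper "withPrefix" and the sample paths are hypothetical and are not part of Kylin's ResourceStore API.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration only; not Kylin's ResourceStore.
class PrefixRangeSketch {

    // Returns a view of every entry whose key starts with the given prefix.
    // The upper bound is the prefix followed by the largest char value, so the
    // range is computed from the prefix itself and no key listing is needed.
    // Assumption: no key has Character.MAX_VALUE directly after the prefix.
    static <V> NavigableMap<String, V> withPrefix(NavigableMap<String, V> sorted, String prefix) {
        return sorted.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> store = new TreeMap<>();
        store.put("/execute/job-1", "a");
        store.put("/execute_output/job-1", "b");
        store.put("/execute_output/job-2", "c");
        store.put("/table/t1", "d");

        // Only the two keys under "/execute_output/" fall inside the range.
        System.out.println(withPrefix(store, "/execute_output/").keySet());
    }
}
```

A store backed by a sorted key-value engine could apply the same bounds to a single range scan, which is what makes the separate listing pass redundant.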

h2. Solution
The two scans can be combined into a single one with methods that take the 
resource path directly:
{code}
store.getAllResources(resourcePath, startTime, endTime, Class, Serializer)
store.getAllResources(resourcePath, Class, Serializer)
{code}
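
As a rough picture of what a combined call could do internally, here is a minimal sketch of one range scan that applies the prefix bound and the timestamp filter in the same pass. Everything in it is an assumption made for illustration: the in-memory map, the "Resource" holder, the "rowsVisited" counter, and the choice of a half-open time window (start inclusive, end exclusive). It is not the Kylin implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a resource store; not Kylin's ResourceStore.
class SinglePassStoreSketch {

    // A stored resource: its content plus a last-modified time in milliseconds.
    static final class Resource {
        final String content;
        final long timestamp;

        Resource(String content, long timestamp) {
            this.content = content;
            this.timestamp = timestamp;
        }
    }

    private final NavigableMap<String, Resource> rows = new TreeMap<>();

    // Counts rows touched by scans, to show that each row in range is read once.
    int rowsVisited = 0;

    void put(String path, String content, long timestamp) {
        rows.put(path, new Resource(content, timestamp));
    }

    // Single pass: the prefix bounds the key range and the time window filters
    // rows inside it. Assumed window semantics: [timeStart, timeEndExclusive).
    List<String> getAllResources(String prefix, long timeStart, long timeEndExclusive) {
        List<String> result = new ArrayList<>();
        NavigableMap<String, Resource> range =
                rows.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
        for (Resource r : range.values()) {
            rowsVisited++;
            if (r.timestamp >= timeStart && r.timestamp < timeEndExclusive) {
                result.add(r.content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SinglePassStoreSketch store = new SinglePassStoreSketch();
        store.put("/execute/job-1", "job-1", 150L);
        store.put("/execute_output/job-1", "out-1", 100L);
        store.put("/execute_output/job-2", "out-2", 200L);
        store.put("/execute_output/job-3", "out-3", 300L);

        System.out.println(store.getAllResources("/execute_output/", 100L, 250L));
        System.out.println("rows visited: " + store.rowsVisited);
    }
}
```

In this sketch the caller supplies only the path prefix and the time window, so there is no separate listing step whose first and last keys have to be fed back into a second call.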

For example, "List<ExecutableOutputPO> getJobOutputs(long timeStartInMillis, 
long timeEndInMillis)" can be refactored as follows:

{code}
public List<ExecutableOutputPO> getJobOutputs(long timeStartInMillis, long timeEndInMillis)
        throws PersistentException {
    try {
        return store.getAllResources(ResourceStore.EXECUTE_OUTPUT_RESOURCE_ROOT,
                timeStartInMillis, timeEndInMillis,
                ExecutableOutputPO.class, JOB_OUTPUT_SERIALIZER);
    } catch (IOException e) {
        logger.error("error get all Jobs:", e);
        throw new PersistentException(e);
    }
}
{code}

  was:
h1. Problem

Currently, operations such as getJobOutputs/getJobs scan the store twice to 
build a response. The scan always proceeds in two steps:
1. List the keys, sort them, and take the first and last key (in effect just a 
prefix filter) with "store.listResources(resourcePath)".
2. Re-scan that key range with a timestamp filter: 
"store.getAllResources(startKey, endKey, startTime, endTime, Class, Serializer)"

{code}
public List<ExecutableOutputPO> getJobOutputs(long timeStartInMillis, long timeEndInMillis)
        throws PersistentException {
    try {
        NavigableSet<String> resources =
                store.listResources(ResourceStore.EXECUTE_OUTPUT_RESOURCE_ROOT);
        if (resources == null || resources.isEmpty()) {
            return Collections.emptyList();
        }
        // Collections.sort(resources);
        String rangeStart = resources.first();
        String rangeEnd = resources.last();
        return store.getAllResources(rangeStart, rangeEnd, timeStartInMillis, timeEndInMillis,
                ExecutableOutputPO.class, JOB_OUTPUT_SERIALIZER);
    } catch (IOException e) {
        logger.error("error get all Jobs:", e);
        throw new PersistentException(e);
    }
}
{code}

h2. Solution
The two scans can be combined into a single one with a method that takes the 
resource path directly:
{code}
store.getAllResources(resourcePath, startTime, endTime, Class, Serializer)
{code}


> Refactor resource interface for timeseries-based data like jobs to much 
> better performance
> ------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-1506
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1506
>             Project: Kylin
>          Issue Type: Sub-task
>            Reporter: Hao Chen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
