[ 
https://issues.apache.org/jira/browse/USERGRID-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Johnson updated USERGRID-788:
-----------------------------------
    Description: 
The idea is to use multiple files to make the Migration tool export run faster 
and to support entities with a huge number of connections. Here are some 
questions to consider and a proposal.

h3. Should an application be saved as multiple files?

One advantage of saving to multiple files is that we can use multiple threads 
to write them, which will make the export faster.  For example, we could 
start a thread to write out each collection of an app as its own file, or set 
of files.
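
As a rough illustration of the thread-per-collection idea, here is a minimal 
Python sketch. It is not the Migration tool's code: the function names, the 
one-entity-per-line layout, the simplified file names and the sample data are 
all assumptions made for the example.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_collection(out_dir, name, entities):
    """Write one collection to its own file, one JSON entity per line."""
    path = Path(out_dir) / f"{name}.json"
    with path.open("w", encoding="utf-8") as out:
        for entity in entities:
            out.write(json.dumps(entity) + "\n")
    return path


def export_app(out_dir, collections):
    """Export every collection of an app concurrently, one task per collection."""
    with ThreadPoolExecutor(max_workers=max(1, len(collections))) as pool:
        futures = {
            name: pool.submit(write_collection, out_dir, name, entities)
            for name, entities in collections.items()
        }
        # result() re-raises any exception raised in the worker thread.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical sample data standing in for entities read from the datastore.
sample = {
    "users": [{"name": "alice"}, {"name": "bob"}],
    "devices": [{"name": "phone-1"}],
}
out_dir = tempfile.mkdtemp()
paths = export_app(out_dir, sample)
```

Each collection is still written by a single thread, so order within a 
collection is unaffected; only different collections are written in parallel.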

h3. Should each collection be saved as multiple files?

Each collection must be written out serially if we want to preserve order. If 
that is the case, then splitting a collection across multiple files won't do 
much to speed up the export.
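
Splitting can still keep individual files small without giving up order. The 
Python sketch below shows one way a single serial writer could roll over to a 
new numbered file once a size limit is reached. The function name, the 
one-entity-per-line layout, the simplified file names and numbering from 1 are 
assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path


def write_in_chunks(out_dir, prefix, entities, max_bytes):
    """Write entities in order, starting a new numbered file when one fills up."""
    paths = []
    out = None
    written = 0
    try:
        for entity in entities:
            line = (json.dumps(entity) + "\n").encode("utf-8")
            # Roll over before exceeding the limit; an entity larger than
            # max_bytes still gets written, alone in its own file.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = Path(out_dir) / f"{prefix}_{len(paths) + 1}.json"
                out = path.open("wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    finally:
        if out is not None:
            out.close()
    return paths


# Hypothetical entities; each serialized line is 9 bytes long.
entities = [{"n": i} for i in range(5)]
chunk_paths = write_in_chunks(tempfile.mkdtemp(), "users", entities, max_bytes=20)
```

Reading the files back in file-number order yields the entities in their 
original order.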

h3. Should connections be separated out from entities in collections?

Currently, we write an entity's connections directly inside the entity 
itself. This will be a problem for entities with a huge number of 
connections: entity size will bloat, which could cause an import 
program to fail.  Connections should be stored in a separate file.
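
To make the separation concrete, here is a small Python sketch that strips the 
connections out of an entity and emits them as standalone records. The shape 
of the embedded {{connections}} list and the {{uuid}} and {{type}} keys are 
hypothetical, chosen only for the example; the record fields follow the 
source/sourceType/target/targetType names used in this proposal.

```python
def split_connections(entity):
    """Return (entity_without_connections, connection_records)."""
    # Copy every field except the embedded connections (assumed key name).
    stripped = {k: v for k, v in entity.items() if k != "connections"}
    records = [
        {
            "source": entity["uuid"],
            "sourceType": entity["type"],
            "target": target["uuid"],
            "targetType": target["type"],
        }
        for target in entity.get("connections", [])
    ]
    return stripped, records


# Hypothetical entity with two embedded outgoing connections.
entity = {
    "uuid": "u-1",
    "type": "user",
    "name": "alice",
    "connections": [
        {"uuid": "d-1", "type": "device"},
        {"uuid": "d-2", "type": "device"},
    ],
}
stripped, records = split_connections(entity)
```

The entity record stays the same size no matter how many connections it has, 
and the connection records can be streamed to their own file.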

h3. Multiple files proposal

1. Each collection will be written out to a set of files named like this:

   {{<orgname>_<appname>_<collname>_collection_N.json}}


2. For each collection, outgoing connections will be written to a set of files 
named like this:

   {{<orgname>_<appname>_<collname>_connections.N.json}}


Each connection will be a JSON object with fields: 

   {{source, sourceType, target, targetType, targetType}}


3. A command-line parameter specifies the maximum size of each output file.

4. The implementation should use one thread for each collection of an 
application. Currently, we have only one write thread, which limits our 
throughput.
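
The naming scheme and the size parameter in the list above could be sketched 
in Python as follows. The helper names, the {{--max-file-size}} flag, its 
default value and the sample org/app/collection names are all hypothetical; 
the proposal only fixes the file name patterns and says that a parameter sets 
the maximum size.

```python
import argparse


def collection_file(org, app, coll, n):
    """Name of the Nth entity file for a collection."""
    return f"{org}_{app}_{coll}_collection_{n}.json"


def connections_file(org, app, coll, n):
    """Name of the Nth connections file for a collection."""
    # The proposal writes ".N" here rather than "_N"; reproduced as written.
    return f"{org}_{app}_{coll}_connections.{n}.json"


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Export sketch")
    # Hypothetical flag name and default value.
    parser.add_argument(
        "--max-file-size",
        type=int,
        default=64 * 1024 * 1024,
        help="maximum size of each output file, in bytes",
    )
    return parser.parse_args(argv)


args = parse_args(["--max-file-size", "1048576"])
first_collection_file = collection_file("acme", "shop", "users", 1)
first_connections_file = connections_file("acme", "shop", "users", 1)
```

Keeping the names in one place means the export threads and any import 
program agree on where each collection's files live.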



  was:
The idea is to use multiple files to make the Migration tool export run faster 
and to support entities with a huge number of connections. 

Here are some questions to consider:

h3. Should an application be saved as multiple files?

One advantage of saving to multiple files is that we can use multiple threads 
to write them, which will make the export faster.  For example, we could 
start a thread to write out each collection of an app as its own file, or set 
of files.

h3. Should each collection be saved as multiple files?

Each collection must be written out serially if we want to preserve order. If 
that is the case, then splitting a collection across multiple files won't do 
much to speed up the export.

h3. Should connections be separated out from entities in collections?

Currently, we write an entity's connections directly inside the entity 
itself. This will be a problem for entities with a huge number of 
connections: entity size will bloat, which could cause an import 
program to fail.  Connections should be stored in a separate file.

h3. Multiple files proposal

1. Each collection will be written out to a set of files named like this:
{{monospaced}}
   <orgname>_<appname>_<collname>_collection_N.json
{{monospaced}}

2. For each collection, outgoing connections will be written to a set of files 
named like this:
{{monospaced}}
   <orgname>_<appname>_<collname>_connections.N.json
{{monospaced}}

Each connection will be a JSON object with fields: 
{{monospaced}}
   source, sourceType, target, targetType, targetType
{{monospaced}}

3. A command-line parameter specifies the maximum size of each output file.

4. The implementation should use one thread for each collection of an 
application. Currently, we have only one write thread, which limits our 
throughput.




> Use multiple output files in Migration/export tool
> --------------------------------------------------
>
>                 Key: USERGRID-788
>                 URL: https://issues.apache.org/jira/browse/USERGRID-788
>             Project: Usergrid
>          Issue Type: Story
>            Reporter: David Johnson
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
