[ 
https://issues.apache.org/jira/browse/JENA-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118211#comment-13118211
 ] 

Stephen Allen commented on JENA-126:
------------------------------------

I believe what you describe is what Pig does [1].  It is complicated for a 
few reasons:

1) It is hard to reason about which bags to spill.  To do a good job, I 
think you need:
   1a) Some way to estimate the size of each bag (Pig does this).
   1b) Maybe some way to know which bags are "hot" so that they are not 
spilled, perhaps some kind of LRU scheme.  Memory-mapped files would work 
well here, but now you're talking about off-heap memory and 
serialization/deserialization costs (Pig does not do this).
2) The DataBag classes need to be thread-safe and handle spill requests from 
the memory management thread.  Perhaps this can be avoided by checking free 
memory in-line in the thread that owns the bag, but testing the free memory 
size may be expensive, and then you can only spill bags in your own thread.
3) We need to make sure we spill before the operating system starts using 
virtual memory (swap) on its own.
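
To make points 1 and 2 concrete, here is a rough sketch of a central manager 
that spills the largest registered bags first.  This is not Pig's 
SpillableMemoryManager and not existing ARQ code: the names Spillable, 
SpillManager, and DemoBag are made up for illustration, and the "spill" in 
DemoBag only pretends to move data to disk.  The methods are synchronized 
because the manager would call them from a different thread than the one 
filling the bag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: anything that can report its size and move itself to disk.
interface Spillable {
    long estimatedBytes(); // rough in-memory footprint (point 1a)
    long spill();          // move contents to disk, return the bytes released
}

// Hypothetical central manager that owns the decision of what to spill.
final class SpillManager {
    private final List<Spillable> registered = new ArrayList<>();

    synchronized void register(Spillable s) {
        registered.add(s);
    }

    // Spill the largest bags first until at least bytesToFree have been released.
    synchronized long freeAtLeast(long bytesToFree) {
        List<Spillable> bySize = new ArrayList<>(registered);
        bySize.sort((a, b) -> Long.compare(b.estimatedBytes(), a.estimatedBytes()));
        long freed = 0;
        for (Spillable s : bySize) {
            if (freed >= bytesToFree) {
                break;
            }
            freed += s.spill();
        }
        return freed;
    }
}

// Stand-in bag for the example; a real bag would write its contents to a file.
final class DemoBag implements Spillable {
    private long bytes;
    private boolean spilled;

    DemoBag(long bytes) {
        this.bytes = bytes;
    }

    @Override
    public synchronized long estimatedBytes() {
        return bytes;
    }

    @Override
    public synchronized long spill() {
        long released = bytes;
        bytes = 0;
        spilled = true;
        return released;
    }

    synchronized boolean isSpilled() {
        return spilled;
    }
}
```

Largest-first is only one possible policy.  It ignores point 1b entirely, so 
a large bag that is being read heavily would still be the first one spilled.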

I think that doing a memory limit per operator may be simpler, since it 
essentially only requires you to do 1a (estimating the size of each bag).  
Willing to be proven wrong though, since your idea eliminates the need for a 
user configuration option.
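
For comparison, a per-operator limit could look roughly like the sketch 
below.  Again this is an illustration and not ARQ code: ThresholdBag, the 32 
byte per-item overhead, and the 2 bytes per character estimate are all 
assumptions, and it stores plain strings rather than bindings.  The bag keeps 
a running size estimate and switches to a temporary file once the estimate 
exceeds its limit.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bag with a per-operator memory limit (not thread-safe by design:
// only the owning operator's thread touches it).
final class ThresholdBag implements AutoCloseable {
    private static final long PER_ITEM_OVERHEAD = 32; // assumed fixed cost per item

    private final long maxBytes;
    private final List<String> inMemory = new ArrayList<>();
    private long estimatedBytes = 0;
    private long spilledCount = 0;
    private File spillFile;
    private DataOutputStream spillOut;

    ThresholdBag(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    void add(String item) {
        try {
            if (spillOut != null) {
                writeItem(item); // already on disk: append directly
                return;
            }
            inMemory.add(item);
            // Rough estimate: fixed overhead plus 2 bytes per char.
            estimatedBytes += PER_ITEM_OVERHEAD + 2L * item.length();
            if (estimatedBytes > maxBytes) {
                spillToDisk();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void spillToDisk() throws IOException {
        spillFile = File.createTempFile("tmptable", ".spill");
        spillFile.deleteOnExit();
        spillOut = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(spillFile.toPath())));
        for (String s : inMemory) {
            writeItem(s);
        }
        inMemory.clear();
        estimatedBytes = 0;
    }

    // Length-prefixed UTF-8 so that very long strings can be written too.
    private void writeItem(String item) throws IOException {
        byte[] utf8 = item.getBytes(StandardCharsets.UTF_8);
        spillOut.writeInt(utf8.length);
        spillOut.write(utf8);
        spilledCount++;
    }

    boolean isSpilled() {
        return spillOut != null;
    }

    long size() {
        return inMemory.size() + spilledCount;
    }

    @Override
    public void close() {
        try {
            if (spillOut != null) {
                spillOut.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (spillFile != null) {
                spillFile.delete();
            }
        }
    }
}
```

With a 100 byte limit, two 4 character items (estimated at 40 bytes each) 
stay in memory and the third pushes the bag to disk.  No memory management 
thread is involved, which is the simplification I am arguing for, at the cost 
of the user having to pick the limit.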


[1] 
http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/impl/util/SpillableMemoryManager.java


                
> Change temporary table threshold policy from count to memory size
> -----------------------------------------------------------------
>
>                 Key: JENA-126
>                 URL: https://issues.apache.org/jira/browse/JENA-126
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Stephen Allen
>
> The "workCount" setting for temporary table sizes is not a good configuration 
> option.  Binding sizes could potentially vary from as little as 32 bytes (8 
> byte ref to the binding + 8 byte ref to a variable + 8 byte nodeID + 8 byte 
> object overhead), to some bindings with multi-megabyte strings.  Asking the 
> user to know which one it is likely to be, and then how that count translates 
> into memory usage (the real resource we are attempting to control) is already 
> way too much IMO.
> OK, so what the user wants is a way to specify the amount of memory that can 
> be used by each query operator for temporary tables [1][2][3].  Hmm, wait, no, 
> what he maybe wants is a way to specify the total memory used for temporary 
> tables per query?  No, maybe he wants to specify it for the whole query 
> engine.
> But that last paragraph is not accurate.  What he *really* wants is a system 
> that answers all of his queries for whatever data he has as fast as possible. 
>  He doesn't want to have to configure any parameters.  Unfortunately, this is 
> a really hard dynamic optimization problem so we foist it off on the user, 
> hoping he'll be able to come up with some value.
> We need to decide on what we want to use as a config parameter.  I believe it 
> should be a "workMem" or "tmpTableSize" setting that specifies the max memory 
> usage of a temporary table before it is converted into an on-disk table.
> [1] This is what most DB systems provide, specifically PostgreSQL and MySQL 
> both have per operator temporary table sizes.  PostgreSQL calls the setting 
> "work_mem" and MySQL calls it "tmp_table_size"
> [2] http://www.postgresql.org/docs/8.3/static/runtime-config-resource.html
> [3] http://dev.mysql.com/doc/refman/5.0/en/internal-temporary-tables.html


        
