neumarcx commented on issue #568: Add Aggregate Median to SPARQL ARQ syntax
URL: https://github.com/apache/jena/pull/568#issuecomment-493487384
 
 
   > One observation is that the Commons Math library used notes that for
   > percentile based stats to be evaluated correctly the data should be at least
   > partially ordered
   > (http://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/descriptive/rank/Percentile.html).
   > Since aggregation is computed prior to sorting in SPARQL there’s no guarantee
   > that the accumulator will see the data in a sensible order and thus calculate a
   > meaningful result.
   >
   > Could do with some test cases to see what happens in the case of generated
   > random data inputs and may need some refactoring to do internal sorting of the
   > accumulated values prior to passing them to the Math library
   
   @rvesse that is a worthwhile observation about the Commons Math library, but
   by definition the median is the middle value in sorted order, and the Commons
   Math implementation here works the same way. The sort in the Commons Math
   library is certainly much more efficient than a standard sort with e.g.
   java.util.Arrays. In preliminary tests I tend to run into heap space issues
   with sets of 200M+ aggregate values in an array, and with 1 billion+ values
   in settings with a large Xmx allocation. Do you see this being an issue for
   a general release in ARQ?
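
To illustrate the point about input order, here is a minimal sketch in plain Java. It is not the Commons Math or ARQ code: `MedianSketch` and `median` are hypothetical names, and it uses a full `Arrays.sort` on a copy purely for clarity. Because the values are ordered internally before the middle is taken, the order in which an accumulator received them should not affect the result.

```java
import java.util.Arrays;

class MedianSketch {
    // Hypothetical helper: returns the median of the accumulated values.
    // Sorting a copy leaves the caller's array untouched, at the cost of
    // temporarily holding two arrays in memory.
    static double median(double[] values) {
        if (values.length == 0) {
            return Double.NaN; // assumption: no values means no defined median
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid]; // odd count: the single middle value
        }
        // even count: mean of the two middle values
        return (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        double[] shuffled = {5, 1, 4, 2, 3};
        double[] ordered = {1, 2, 3, 4, 5};
        System.out.println(median(shuffled)); // 3.0
        System.out.println(median(ordered));  // 3.0
    }
}
```

Either way, all values have to be held in memory before the median can be taken, which is where the heap pressure comes from: 200 million values stored as primitive doubles already occupy about 1.6 GB, before any per-object overhead.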
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
