ibobak opened a new issue, #599:
URL: https://github.com/apache/datasketches-java/issues/599
The code below implements a method "IsMemberOf", which is very much needed in
big data work. I can explain later why it is so important.
But there is a serious issue with it:
```
package org.example;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.*;

// simplified file operations and no error handling for clarity
public class Main {
    static final long SIZE = 7680;
    static final long SIZE_LEFT = SIZE / 2;
    static final long SIZE_RIGHT = SIZE_LEFT + SIZE;

    /**
     * This function generates two sketches with some overlap and serializes
     * them into files.
     */
    public static void generateFiles() throws IOException {
        // SIZE unique keys
        UpdateSketch sketch1 = UpdateSketch.builder().setSeed(1).build();
        for (long key = 0; key < SIZE; key++)
            sketch1.update("item" + key);
        FileOutputStream out1 = new FileOutputStream("ThetaSketch1.bin");
        out1.write(sketch1.compact().toByteArray());
        out1.close();

        // SIZE unique keys
        // the first SIZE / 2 of these keys overlap with sketch1
        UpdateSketch sketch2 = UpdateSketch.builder().setSeed(1).build();
        for (long key = SIZE_LEFT; key < SIZE_RIGHT; key++)
            sketch2.update("item" + key);
        FileOutputStream out2 = new FileOutputStream("ThetaSketch2.bin");
        out2.write(sketch2.compact().toByteArray());
        out2.close();
    }

    public static boolean IsMemberOf(String aMember, Sketch aSketch) {
        UpdateSketch singleItemSketch =
            UpdateSketch.builder().setSeed(1).setNominalEntries(8192).build();
        singleItemSketch.update(aMember);

        Intersection intersection =
            SetOperation.builder().setSeed(1).setNominalEntries(8192).buildIntersection();
        // the first call just initializes the intersection state (nothing is there yet)
        intersection.intersect(aSketch);
        // intersect with the single item
        intersection.intersect(singleItemSketch);

        // get the result of the intersection
        Sketch result = intersection.getResult();
        return (result.getEstimate() > 0);
    }

    public static void main(String[] args) throws Exception {
        generateFiles();

        FileInputStream in1 = new FileInputStream("ThetaSketch1.bin");
        byte[] bytes1 = new byte[in1.available()];
        in1.read(bytes1);
        in1.close();
        Sketch sketch1 = Sketches.wrapSketch(Memory.wrap(bytes1));
        System.out.println(String.format("SIZE = %d, sketch1 size = %f",
            SIZE, sketch1.getEstimate()));

        // check every key that was inserted into sketch1
        int found1 = 0;
        for (long key = 0; key < SIZE; key++)
            if (IsMemberOf("item" + key, sketch1))
                found1++;
        System.out.println(String.format(
            "\nIndividual checks for existing: found %d, must be found - %d",
            found1, SIZE));
    }
}
```
It works OK with SIZE = 7680:
```
Individual checks for existing: found 7680, must be found - 7680
```
However, change 7680 to 7681 and re-run it, and you will get
this:
```
Individual checks for existing: found 4096, must be found - 7681
```
So, what is the problem? Why doesn't the intersection work any more?
Note: I tried setting different values in setNominalEntries, but this doesn't
help.
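One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: sketch1 is built without setNominalEntries, so it would use the default k, and once the number of retained hashes passes a rebuild threshold the sketch keeps only the k smallest hashes and drops the rest. A dropped key can then never show up in an intersection. The toy model below is not the DataSketches implementation. The class `ThetaMembershipDemo`, its hash function, the default k of 4096, and the threshold of 15/16 of a table of size 2k (that is, 7680) are all assumptions made for illustration and should be checked against the library source.

```java
import java.util.TreeSet;

// Toy model of a theta-style sketch. NOT the DataSketches implementation.
public class ThetaMembershipDemo {
    static final int K = 4096;                             // assumed default nominal entries
    static final int REBUILD_THRESHOLD = 2 * K * 15 / 16;  // assumed threshold: 7680

    // Retained hashes, ordered as unsigned 64-bit values.
    private final TreeSet<Long> retained = new TreeSet<Long>(Long::compareUnsigned);
    private boolean estimationMode = false;
    private long theta; // meaningful only once estimationMode is true

    // SplitMix64-style mixer (hypothetical stand-in for the real hash).
    // It is a bijection on 64-bit values, so distinct keys never collide.
    static long hash(long key) {
        long z = key + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void update(long key) {
        long h = hash(key);
        // In estimation mode, hashes at or above theta are discarded.
        if (estimationMode && Long.compareUnsigned(h, theta) >= 0) {
            return;
        }
        retained.add(h);
        if (retained.size() > REBUILD_THRESHOLD) {
            // Keep only the K smallest hashes; theta becomes the (K+1)-th smallest.
            while (retained.size() > K + 1) {
                retained.pollLast();
            }
            theta = retained.pollLast();
            estimationMode = true;
        }
    }

    // Membership by lookup among retained hashes, mirroring the intersection trick.
    boolean mightContain(long key) {
        return retained.contains(hash(key));
    }

    // Insert keys 0..n-1, then count how many of them are still found.
    static int countFound(int n) {
        ThetaMembershipDemo sketch = new ThetaMembershipDemo();
        for (long key = 0; key < n; key++) {
            sketch.update(key);
        }
        int found = 0;
        for (long key = 0; key < n; key++) {
            if (sketch.mightContain(key)) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(countFound(7680)); // prints 7680
        System.out.println(countFound(7681)); // prints 4096
    }
}
```

If this model matches what the library does, it would also explain why changing setNominalEntries inside IsMemberOf has no effect: the keys were already dropped when sketch1 was built, before any intersection takes place.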
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]