That's great! Thanks for checking.

Aaron

On Wed, Aug 17, 2016 at 6:13 PM, Prasanth J <[email protected]> wrote:

> I can confirm that ORC-54 fixes the issue.
>
> I ran the test case initially provided by Aaron, and I am getting the
> expected test results.
>
> Total Batches Added [977]
> Total Batches Read [90] with columnNames [[a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), leaf-1 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-1)].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), leaf-1 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-1)].
> Total Batches Read [90] with columnNames [[a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
> Total Batches Read [90] with columnNames [[a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
> Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
> Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
>
> Thanks
> Prasanth
>
> On Aug 17, 2016, at 3:09 PM, Owen O'Malley <[email protected]> wrote:
> >
> > This issue might have been fixed as part of ORC-54, which got committed
> > this morning. Do you have a testcase already?
> >
> > .. Owen
> >
> > On Mon, Aug 15, 2016 at 1:08 PM, Aaron McCurry <[email protected]> wrote:
> >
> >> I have been writing some test code that creates a simple orc writer and
> >> reader with bloom filters enabled. The issue I have is that when the
> >> SearchArgument matches the first column name provided in the Options
> >> searchArgument method (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> >> the bloom filter doesn't seem to get applied.
> >>
> >> The test program creates an orc file with 2 string columns. Then it
> >> populates the orc file with 1 million records with the same UUID in both
> >> columns, but different values for each row. Then it performs a series of
> >> reads on the file, counts the number of batches read, and displays the
> >> output.
> >>
> >> Test program:
> >> https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118
> >>
> >> NOTE: I'm assuming the column names passed to the searchArgument method (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
> >> are there to inform the orc reader which indexes it should read to
> >> perform the search operations.
> >>
> >> High Level Output:
> >>
> >> where a1 == literal
> >>   colNames : ["a1"] reads 977 batches
> >>   colNames : ["a1", "a2"] reads 977 batches
> >>   colNames : ["a2", "a1"] reads 90 batches
> >>
> >> where a2 == literal
> >>   colNames : ["a2"] reads 977 batches
> >>   colNames : ["a1", "a2"] reads 90 batches
> >>   colNames : ["a2", "a1"] reads 977 batches
> >>
> >> where a1 == literal AND where a2 == literal
> >>   colNames : ["a1", "a2"] reads 90 batches
> >>   colNames : ["a2", "a1"] reads 90 batches
> >>
> >> where a1 == literal AND where a1 == literal
> >>   colNames : ["a1"] reads 977 batches
> >>   colNames : ["a1", "a2"] reads 977 batches
> >>   colNames : ["a2", "a1"] reads 90 batches
> >>
> >> where a2 == literal AND where a2 == literal
> >>   colNames : ["a2"] reads 977 batches
> >>   colNames : ["a1", "a2"] reads 90 batches
> >>   colNames : ["a2", "a1"] reads 977 batches
> >>
> >> Given that every row has the same value in both columns a1 and a2, I
> >> would assume that every one of these test runs would yield the same
> >> number of batches read, which should be 90.
> >>
> >> Raw Output:
> >> https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4
> >>
> >> I think the issue comes from the mapSargColumnsToOrcInternalColIdx
> >> method, where the rootColumn value is hard coded to '0':
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> >>
> >> The mapSargColumnsToOrcInternalColIdx method checks each provided column
> >> against the columns in the orc schema. During this it calls findColumns (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104)
> >> where, if the column name matches one of the values in the columnNames
> >> array, the index and rootColumn are added and returned.
> >>
> >> Then, when mapSargColumnsToOrcInternalColIdx returns, the caller checks
> >> each value in the filterColumns array to make sure its value is greater
> >> than '0'. If the column is the first column and the rootColumn is '0',
> >> then the returned value is '0' and the logical column filter gets
> >> omitted.
> >>
> >> I think the rootColumn literal should be '1' instead of '0' (
> >> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
> >> ).
> >>
> >> Thoughts?
> >>
> >> Thanks,
> >>
> >> Aaron
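
The failure mode described in the original report (a column in the first position mapping to index 0 and then being dropped by a greater-than-zero check) can be reproduced in isolation. What follows is a minimal sketch under stated assumptions, not the ORC source: the class name SargColumnMappingSketch and the helper markFilterColumns are hypothetical, findColumns and mapSargColumnsToOrcInternalColIdx are simplified stand-ins modeled only on the description in the report (predicate leaves are reduced to plain column names), and the schema is assumed to be a struct whose root has column id 0 with a1 and a2 at ids 1 and 2.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the behavior described in the report.
// This is NOT the ORC implementation.
final class SargColumnMappingSketch {

    // Position of columnName within columnNames, shifted by rootColumn; -1 if absent.
    static int findColumns(String[] columnNames, String columnName, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnName.equals(columnNames[i])) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Maps each predicate leaf to a column id. Assumption: a leaf is
    // represented here by its column name only.
    static int[] mapSargColumnsToOrcInternalColIdx(
            List<String> leafColumnNames, String[] columnNames, int rootColumn) {
        int[] result = new int[leafColumnNames.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = findColumns(columnNames, leafColumnNames.get(i), rootColumn);
        }
        return result;
    }

    // Marks the column ids whose indexes would be consulted.
    // Only ids strictly greater than 0 are kept, as described in the report.
    static boolean[] markFilterColumns(int[] filterColumns, int columnCount) {
        boolean[] sargColumns = new boolean[columnCount];
        for (int id : filterColumns) {
            if (id > 0) {
                sargColumns[id] = true;
            }
        }
        return sargColumns;
    }

    public static void main(String[] args) {
        List<String> leaves = Arrays.asList("a1");   // where a1 == literal
        String[] columnNames = {"a1"};               // colNames : ["a1"]
        int columnCount = 3; // assumed ids: 0 = root struct, 1 = a1, 2 = a2

        int[] rootZero = mapSargColumnsToOrcInternalColIdx(leaves, columnNames, 0);
        int[] rootOne = mapSargColumnsToOrcInternalColIdx(leaves, columnNames, 1);

        // prints: rootColumn=0: [0] -> [false, false, false]
        System.out.println("rootColumn=0: " + Arrays.toString(rootZero)
                + " -> " + Arrays.toString(markFilterColumns(rootZero, columnCount)));
        // prints: rootColumn=1: [1] -> [false, true, false]
        System.out.println("rootColumn=1: " + Arrays.toString(rootOne)
                + " -> " + Arrays.toString(markFilterColumns(rootOne, columnCount)));
    }
}
```

In this model, a rootColumn of 0 leaves no column marked for the single-column case, which is consistent with the 977-batch reads in the report, while a rootColumn of 1 keeps the column. The thread does not say how ORC-54 actually resolved the problem, so this only illustrates the reported diagnosis.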
