> On Aug. 5, 2014, 9:14 p.m., Matthew Hayes wrote:
> > How do you expect this to be used in practice? Would one large dictionary
> > be applied to a large collection of strings to identify the matches within
> > each string? Or, do you expect a different dictionary to be applied to
> > each string? If you expect the same dictionary to be used, then it seems
> > we miss out on the potential with this implementation to build the trie
> > once and reuse it over and over. Should the dictionary instead be loaded
> > from HDFS via the distributed cache and lazy loaded on the first call to
> > exec()? This way you only build the trie once.
You make a good point. My plan was to GROUP a relation of match words ALL,
then CROSS it with the text to be matched, so the same words would be matched
against a large number of strings. Compared to your suggestion, my plan is
dumb. I think I will do what you suggest.


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24309/#review49639
-----------------------------------------------------------


On Aug. 5, 2014, 4:14 p.m., Russell Jurney wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24309/
> -----------------------------------------------------------
> 
> (Updated Aug. 5, 2014, 4:14 p.m.)
> 
> 
> Review request for DataFu, Jakob Homan, Matthew Hayes, and Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> See DATAFU-65
> 
> 
> Diffs
> -----
> 
>   datafu-pig/build.gradle e21a5b1 
>   datafu-pig/src/main/java/datafu/pig/text/AhoCorasickMatch.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/text/AhoCorasickMatchTest.java
>   PRE-CREATION 
>   gradle/dependency-versions.gradle eb24e4a 
> 
> Diff: https://reviews.apache.org/r/24309/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>
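
For reference, the lazy-loading approach discussed in this thread could look
roughly like the sketch below. It is not the code under review. The class name
LazyDictionaryMatcher, the one-word-per-line dictionary file, and the
buildCount counter are all assumptions made for illustration. It also uses a
plain trie scanned from every start position rather than full Aho-Corasick
failure links, and it is a standalone class rather than a Pig EvalFunc, so it
runs without Pig on the classpath. The only point is that the trie is built on
the first call to exec() and reused afterwards.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the UDF. A real Pig UDF would extend EvalFunc and
// read the dictionary file shipped to each task via the distributed cache.
class LazyDictionaryMatcher {
    // One trie node: children keyed by character, plus the word ending here.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word; // non-null when a dictionary word ends at this node
    }

    private final String dictionaryPath; // assumed: local path of the cached file
    private Node root;                   // built on the first exec() call, then reused
    private int buildCount = 0;          // illustration only: counts trie builds

    LazyDictionaryMatcher(String dictionaryPath) {
        this.dictionaryPath = dictionaryPath;
    }

    // Reads one word per line and inserts each word into a fresh trie.
    private Node buildTrie() throws IOException {
        Node newRoot = new Node();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(dictionaryPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (word.isEmpty()) {
                    continue;
                }
                Node node = newRoot;
                for (char c : word.toCharArray()) {
                    node = node.children.computeIfAbsent(c, k -> new Node());
                }
                node.word = word;
            }
        }
        buildCount++;
        return newRoot;
    }

    // Returns every dictionary word found in text, ordered by start position.
    List<String> exec(String text) throws IOException {
        if (root == null) {
            root = buildTrie(); // lazy load: happens once per instance
        }
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length() && node != null; i++) {
                node = node.children.get(text.charAt(i));
                if (node != null && node.word != null) {
                    matches.add(node.word);
                }
            }
        }
        return matches;
    }

    int getBuildCount() {
        return buildCount;
    }
}
```

With this shape, the dictionary path would be passed to the UDF constructor
once, and each input string would be passed to exec(), so no GROUP ALL / CROSS
step is needed to bring the match words alongside every row.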
