When there are one thousand OR clauses, Hive throws the following exception:

Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.StackOverflowError
    at java.beans.Statement.<init>(Statement.java:60)
    at java.beans.Expression.<init>(Expression.java:47)
    at java.beans.Expression.<init>(Expression.java:65)
    at java.beans.PrimitivePersistenceDelegate.instantiate(MetaData.java:79)
    at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:97)
    at java.beans.Encoder.writeObject(Encoder.java:54)
    at java.beans.XMLEncoder.writeObject(XMLEncoder.java:257)
    at java.beans.Encoder.writeObject1(Encoder.java:206)
    at java.beans.Encoder.cloneStatement(Encoder.java:219)
    at java.beans.Encoder.writeExpression(Encoder.java:278)
    at java.beans.XMLEncoder.writeExpression(XMLEncoder.java:372)
    at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:97)
    at java.beans.Encoder.writeObject(Encoder.java:54)
    at java.beans.XMLEncoder.writeObject(XMLEncoder.java:257)
    at java.beans.Encoder.writeObject1(Encoder.java:206)
    at java.beans.Encoder.cloneStatement(Encoder.java:219)
    at java.beans.Encoder.writeExpression(Encoder.java:278)
    at java.beans.XMLEncoder.writeExpression(XMLEncoder.java:372)
    at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:97)
    at java.beans.Encoder.writeObject(Encoder.java:54)
    at java.beans.XMLEncoder.writeObject(XMLEncoder.java:257)
    at java.beans.Encoder.writeExpression(Encoder.java:279)
    at java.beans.XMLEncoder.writeExpression(XMLEncoder.java:372)
    at java.beans.DefaultPersistenceDelegate.doProperty(DefaultPersistenceDelegate.java:212)
    at java.beans.DefaultPersistenceDelegate.initBean(DefaultPersistenceDelegate.java:247)
    at java.beans.DefaultPersistenceDelegate.initialize(DefaultPersistenceDelegate.java:395)
    at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:100)

With two hundred OR clauses, the query is very slow. I am currently on version 0.4.1. If I upgrade to version 0.6, what do I need to do? Also, when will version 0.6 be released?

Thanks,

LiuLei

2010/8/5 Ning Zhang <nzh...@facebook.com>

> I tested (1000 disjunctions) and it was extremely slow but no OOM. The
> issue seems to be the fact that we serialize the plan by writing to HDFS
> file directly. We probably should cache it locally and then write it to
> HDFS.
>
> On Aug 4, 2010, at 10:23 AM, Edward Capriolo wrote:
>
> > On Wed, Aug 4, 2010 at 1:15 PM, Ning Zhang <nzh...@facebook.com> wrote:
> >> Currently an expression tree (series of ORs in this case) is not
> >> collapsed to one operator or otherwise optimized. It would be great to
> >> have this optimization rule to convert an OR operator tree to one IN
> >> operator. Would you be able to file a JIRA and contribute a patch?
> >>
> >> On Aug 4, 2010, at 7:46 AM, Mark Tozzi wrote:
> >>
> >>> I haven't looked at the code, but I assume the query parser would sort
> >>> the 'in' terms and then do a binary search lookup into them for each
> >>> row, while the 'or' terms don't have that kind of obvious relationship
> >>> and are probably tested in sequence. This would give the in O(log N)
> >>> performance compared to a chain of or's having O(N) performance, per
> >>> row queried. For large N, that could add up. That being said, I'm
> >>> just speculating here. The query parser may be smart enough to
> >>> optimize the related or's in the same way, or it may not optimize that
> >>> at all. If I get a chance, I'll try to dig around and see what it's
> >>> doing, as I have also had a lot of large 'in' queries and could use
> >>> every drop of performance I can get.
> >>>
> >>> --Mark
> >>>
> >>> On Wed, Aug 4, 2010 at 9:47 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >>>> On Wed, Aug 4, 2010 at 6:10 AM, lei liu <liulei...@gmail.com> wrote:
> >>>>> Because my company requires us to use version 0.4.1, which doesn't
> >>>>> support the IN clause, I want to use OR clauses (example: where id=1
> >>>>> or id=2 or id=3) to implement the IN clause (example: id in(1,2,3)).
> >>>>> I know it will be slower, especially when the list after "in" is very
> >>>>> long. Could anybody tell me why it is slow to use OR clauses to
> >>>>> implement the IN clause?
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>>
> >>>>> LiuLei
> >>>>>
> >>>>
> >>>> I cannot imagine the performance difference between 'or' and 'in'
> >>>> would be that great, but I never benchmarked it. The big looming
> >>>> problem is that if you string enough 'or's together (say 8000) the
> >>>> query parser, which uses Java beans serialization, will OOM.
> >>>>
> >>>> Edward
> >>>>
> >>
> >>
> >
> > For reference I did this as a test case....
> > SELECT * FROM src where
> > key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0
> > OR key=0 OR key=0 OR key=0 OR
> > key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0 OR key=0
> > OR key=0 OR key=0 OR key=0 OR
> > ...(100 more of these)
> >
> > No OOM but I gave up after the test case did not go anywhere for about
> > 2 minutes.
> >
> > Edward
> >
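
For anyone stuck on 0.4.1 without IN, one possible workaround for the StackOverflowError is to parenthesize the ORs into a balanced tree, so the expression nests roughly log2(N) levels deep instead of N. This is only a sketch. It rests on an assumption that nobody in this thread has confirmed: that the recursion in the XMLEncoder stack trace follows the depth of the expression tree. It has not been tested against Hive 0.4.1, and it does nothing about the per-row cost of evaluating N comparisons. The function names (or_chain, balanced_or, max_nesting) are made up for illustration; src and key are taken from Edward's test case.

```python
def or_chain(column, values):
    """Flat chain: col=v1 OR col=v2 OR ...

    Assumption: the parser turns this into a deep, one-sided OR tree.
    """
    return " OR ".join(f"{column}={v}" for v in values)


def balanced_or(column, values):
    """Same predicate, parenthesized so the two halves nest evenly.

    Values are inserted as-is, so this only suits numeric literals;
    string values would need quoting and escaping.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if len(values) == 1:
        return f"{column}={values[0]}"
    mid = len(values) // 2
    left = balanced_or(column, values[:mid])
    right = balanced_or(column, values[mid:])
    return f"({left} OR {right})"


def max_nesting(predicate):
    """Deepest parenthesis nesting level in a generated predicate."""
    depth = deepest = 0
    for ch in predicate:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


if __name__ == "__main__":
    # Hypothetical usage: 1000 ids against the src table from the test case.
    query = "SELECT * FROM src WHERE " + balanced_or("key", range(1000))
    print(max_nesting(query))            # 10 levels for 1000 values
    print(balanced_or("id", [1, 2, 3]))  # (id=1 OR (id=2 OR id=3))
```

If the assumption holds, 1000 values nest 10 levels deep rather than about 1000. If it does not, the generated query is still equivalent to the flat chain, so trying it costs little.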