Re: why is slow when use OR clause instead of IN clause

lei liu Thu, 05 Aug 2010 00:33:09 -0700

With one thousand OR clauses, Hive throws the exception below:
Total MapReduce jobs = 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.StackOverflowError
        at java.beans.Statement.<init>(Statement.java:60)
        at java.beans.Expression.<init>(Expression.java:47)
        at java.beans.Expression.<init>(Expression.java:65)
        at java.beans.PrimitivePersistenceDelegate.instantiate(MetaData.java:79)
        at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:97)
        at java.beans.Encoder.writeObject(Encoder.java:54)
        at java.beans.XMLEncoder.writeObject(XMLEncoder.java:257)
        at java.beans.Encoder.writeObject1(Encoder.java:206)
        at java.beans.Encoder.cloneStatement(Encoder.java:219)
        at java.beans.Encoder.writeExpression(Encoder.java:278)
        at java.beans.XMLEncoder.writeExpression(XMLEncoder.java:372)
        at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:97)
        at java.beans.Encoder.writeObject(Encoder.java:54)
        at java.beans.XMLEncoder.writeObject(XMLEncoder.java:257)
        at java.beans.Encoder.writeObject1(Encoder.java:206)
        at java.beans.Encoder.cloneStatement(Encoder.java:219)
        at java.beans.Encoder.writeExpression(Encoder.java:278)
        at java.beans.XMLEncoder.writeExpression(XMLEncoder.java:372)
        at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:97)
        at java.beans.Encoder.writeObject(Encoder.java:54)
        at java.beans.XMLEncoder.writeObject(XMLEncoder.java:257)
        at java.beans.Encoder.writeExpression(Encoder.java:279)
        at java.beans.XMLEncoder.writeExpression(XMLEncoder.java:372)
        at java.beans.DefaultPersistenceDelegate.doProperty(DefaultPersistenceDelegate.java:212)
        at java.beans.DefaultPersistenceDelegate.initBean(DefaultPersistenceDelegate.java:247)
        at java.beans.DefaultPersistenceDelegate.initialize(DefaultPersistenceDelegate.java:395)
        at java.beans.PersistenceDelegate.writeObject(PersistenceDelegate.java:100)
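
A rough sketch of why a long OR chain could lead to this kind of trace, assuming the parser builds "a OR b OR c" as a left-deep binary tree and the plan serializer recurses once per nesting level (neither is confirmed in this thread). The tuple-based tree and the names left_deep_or and depth are illustrative only, not Hive classes:

```python
# Illustrative model only: tuples stand in for expression nodes.

def left_deep_or(terms):
    """Fold terms as a left-associative parser would: ((a OR b) OR c) ..."""
    tree = terms[0]
    for term in terms[1:]:
        tree = ("OR", tree, term)
    return tree

def depth(tree):
    """Compute nesting depth iteratively, so the sketch cannot overflow itself."""
    deepest = 0
    stack = [(tree, 1)]
    while stack:
        node, level = stack.pop()
        deepest = max(deepest, level)
        if isinstance(node, tuple):
            _, left, right = node
            stack.append((left, level + 1))
            stack.append((right, level + 1))
    return deepest

terms = ["id=%d" % i for i in range(1000)]
print(depth(left_deep_or(terms)))  # prints 1000
```

Under that assumption the nesting depth grows linearly with the number of OR terms, and a serializer that uses several stack frames per level would exhaust the stack well before the heap.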




With two hundred OR clauses, the query is very slow.
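
One thing that might be worth trying is to parenthesize the OR terms into a balanced tree when generating the query, so the nesting depth is about log2(N) instead of N. This is only a sketch: I have not tested it against 0.4.1, it assumes the parser preserves the grouping given by the parentheses, and it may not help with the slowness at all. The names balanced_or and max_paren_depth are made up for this example:

```python
def balanced_or(terms):
    """Return an OR expression grouped as a balanced binary tree."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return "(%s OR %s)" % (balanced_or(terms[:mid]), balanced_or(terms[mid:]))

def max_paren_depth(expr):
    """Deepest parenthesis nesting in the generated expression."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest

terms = ["id=%d" % i for i in range(1000)]
where_clause = balanced_or(terms)
print(max_paren_depth(where_clause))  # prints 10
print("SELECT * FROM src WHERE " + balanced_or(["id=1", "id=2", "id=3"]))
```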

I am currently on version 0.4.1. If I upgrade to version 0.6, what do I need
to do?

Also, when will version 0.6 be released?

Thanks,


LiuLei

2010/8/5 Ning Zhang <nzh...@facebook.com>

> I tested (1000 disjunctions) and it was extremely slow but no OOM. The
> issue seems to be the fact that we serialize the plan by writing to HDFS
> file directly. We probably should cache it locally and then write it to
> HDFS.
>
> On Aug 4, 2010, at 10:23 AM, Edward Capriolo wrote:
>
> > On Wed, Aug 4, 2010 at 1:15 PM, Ning Zhang <nzh...@facebook.com> wrote:
> >> Currently an expression tree (a series of ORs in this case) is not
> >> collapsed into one operator, nor is any other optimization applied. It
> >> would be great to have an optimization rule that converts an OR operator
> >> tree to one IN operator. Would you be able to file a JIRA and contribute
> >> a patch?
> >>
> >> On Aug 4, 2010, at 7:46 AM, Mark Tozzi wrote:
> >>
> >>> I haven't looked at the code, but I assume the query parser would sort
> >>> the 'in' terms and then do a binary search lookup into them for each
> >>> row, while the 'or' terms don't have that kind of obvious relationship
> >>> and are probably tested in sequence.  This would give the 'in' O(log N)
> >>> performance compared to a chain of or's having O(N) performance, per
> >>> row queried.  For large N, that could add up.  That being said, I'm
> >>> just speculating here.  The query parser may be smart enough to
> >>> optimize the related or's in the same way, or it may not optimize that
> >>> at all.  If I get a chance, I'll try to dig around and see what it's
> >>> doing, as I have also had a lot of large 'in' queries and could use
> >>> every drop of performance I can get.
> >>>
> >>> --Mark
> >>>
> >>> On Wed, Aug 4, 2010 at 9:47 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> >>>> On Wed, Aug 4, 2010 at 6:10 AM, lei liu <liulei...@gmail.com> wrote:
> >>>>> Because my company requires us to use version 0.4.1, which doesn't
> >>>>> support the IN clause, I want to use OR clauses (for example: where
> >>>>> id=1 or id=2 or id=3) to emulate the IN clause (for example:
> >>>>> id in (1,2,3)).  I know it will be slower, especially when the list
> >>>>> after "in" is very long.  Could anybody tell me why it is slow to use
> >>>>> OR clauses in place of an IN clause?
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>>
> >>>>> LiuLei
> >>>>>
> >>>>
> >>>> I cannot imagine the performance difference between 'or' and 'in'
> >>>> would be that great, but I never benchmarked it. The big looming
> >>>> problem is that if you string enough 'or' together (say 8000) the
> >>>> query parser, which uses java beans serialization, will OOM.
> >>>>
> >>>> Edward
> >>>>
> >>
> >>
> >
> > For reference I did this as a test case....
> > SELECT * FROM src where
> > key=0 OR key=0 OR key=0 OR  key=0 OR key=0 OR key=0 OR key=0 OR key=0
> > OR key=0 OR key=0 OR key=0 OR
> > key=0 OR key=0 OR key=0 OR  key=0 OR key=0 OR key=0 OR key=0 OR key=0
> > OR key=0 OR key=0 OR key=0 OR
> > ...(100 more of these)
> >
> > No OOM but I gave up after the test case did not go anywhere for about
> > 2 minutes.
> >
> > Edward
>
>
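
To make Mark's speculation quoted above concrete, here is a small sketch of the two per-row lookup strategies he describes: testing OR terms in sequence versus a binary search over sorted IN terms. It only illustrates the O(N) versus O(log N) argument; nobody in this thread confirmed that Hive evaluates IN this way, and the function names are made up for the example:

```python
from bisect import bisect_left

def matches_or_chain(value, candidates):
    """Test each term in sequence, like id=1 OR id=2 OR ...: O(N) per row."""
    for candidate in candidates:
        if value == candidate:
            return True
    return False

def matches_sorted_in(value, sorted_candidates):
    """Binary search over pre-sorted terms: O(log N) per row."""
    pos = bisect_left(sorted_candidates, value)
    return pos < len(sorted_candidates) and sorted_candidates[pos] == value

candidates = sorted(range(0, 2000, 2))  # 1000 even ids, sorted once up front
for row_id in (998, 999):
    print(row_id, matches_or_chain(row_id, candidates),
          matches_sorted_in(row_id, candidates))
```

Both functions return the same answer for every row; only the number of comparisons per row differs, which is the difference Mark expects to add up for large N.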
