InvalidArgsError - passing TopDocs object

2010-08-24 Thread Roman Chyla
Hi,

I am trying to understand PyLucene better and to see whether it is
faster to retrieve result ids with Java than with Python. The use case
is retrieving millions of recids -- with Python, 700K ids take about
1.5s (even though the query itself takes just a fraction of that).

I wrote some simple Java code (it works in Java) which returns an array
of ints. I have wrapped it with JCC and it is visible from inside
Python, but calling the static method throws InvalidArgsError (an
example Python session is below).

JCC is version 2.4, built in shared mode -- the DistUtils is in a
different package than lucene (i.e. not inside the lucene jars). Can
this problem be similar to passing jcc-wrapped objects between
different jcc-packages? http://search-lucene.com/m/SPgeW1hDtAw1

The java class is very simple:

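import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class DumpUtils {
    // Returns the Lucene doc ids of the hits held in topdocs.
    public static int[] GetDocIds(TopDocs topdocs) {
        ScoreDoc[] hits = topdocs.scoreDocs;
        // Bound by hits.length: totalHits counts every match and may
        // exceed the number of hits actually returned.
        int[] out = new int[hits.length];
        for (int i = 0; i < hits.length; i++) {
            out[i] = hits[i].doc;
        }
        return out;
    }
}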
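
A note on the loop bound in the class above: in Lucene, totalHits is the total number of matches for the query, while scoreDocs holds only the top hits that were requested (50 in the session below), so the number of ids that can be collected is the length of scoreDocs. A minimal Python sketch of the same collection step, using hypothetical namedtuple stand-ins rather than the real wrapped Lucene classes:

```python
from collections import namedtuple

# Hypothetical stand-ins for the Lucene classes, for illustration only.
ScoreDoc = namedtuple("ScoreDoc", ["doc", "score"])
TopDocs = namedtuple("TopDocs", ["totalHits", "scoreDocs"])


def get_doc_ids(topdocs):
    """Collect the ids of the hits that were actually returned."""
    return [hit.doc for hit in topdocs.scoreDocs]


# 700 documents match, but only the top 3 were requested.
top = TopDocs(
    totalHits=700,
    scoreDocs=[ScoreDoc(4, 1.0), ScoreDoc(9, 0.8), ScoreDoc(2, 0.5)],
)
ids = get_doc_ids(top)
```

With the real PyLucene objects the attribute names are the same as in the Java code, but that part is not exercised here.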

Thanks for any help/pointers,

   roman


Here is an example python session:

In [1]: import pyjama

In [2]: pyjama.initVM(pyjama.CLASSPATH)
Out[2]: <jcc.JCCEnv object at 0x00C0E1F0>

In [3]: import lucene as lu

In [4]: pyjama.DumpUtils
Out[4]: <type 'DumpUtils'>

In [5]: pyjama.DumpUtils.GetDocIds
Out[5]: <built-in method GetDocIds of type object at 0x0189E780>

In [6]:

In [7]: import newseman.pyjamic.slucene.searcher as se

In [8]: s = se.Searcher();s.open('/tmp/whisper/')

In [9]: hits = s._search(s._query('key:bo*',None), 50)

In [10]: hits
Out[10]: <TopDocs: org.apache.lucene.search.topd...@480457>

In [11]:

In [12]: pyjama.DumpUtils.GetDocIds(hits)
---
InvalidArgsError  Traceback (most recent call last)

InvalidArgsError: (<type 'DumpUtils'>, 'GetDocIds', <TopDocs: org.apache.lucene.search.topd...@480457>)


Re: InvalidArgsError - passing TopDocs object

2010-08-24 Thread Andi Vajda



Ah yes, importing separately built extensions that share classes (or
dependencies) didn't work until support for the --import parameter was
added in jcc 2.6 to solve the problem of incompatible shared classes.
To make this work:

  - first, build PyLucene as usual, with --shared
  - then, build your DistUtils package with --import lucene and with
    --shared

That way, instead of generating code and wrapper classes again for the
lucene classes, jcc will import them at build time, making for a much
smaller library and a faster build. The resulting shared library is
linked against the lucene one.

See the docs and list archives about --import for more examples. Then,
when running all this, you should also import lucene first, then your
other package.


Andi..
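
The second build step above can be sketched as a small Python helper that assembles the jcc command line. This is only an illustration under stated assumptions: the jar name dumputils.jar is hypothetical, the module name pyjama is taken from the session in the original message, and the exact way to invoke jcc (python -m jcc here) differs between Python versions. A real build will probably need further options, such as a classpath pointing at the lucene jars.

```python
import sys


def jcc_build_args(jar, module, imports=(), shared=True):
    """Assemble a `python -m jcc` command line for one extension."""
    args = [sys.executable, "-m", "jcc"]
    if shared:
        args.append("--shared")
    # Each --import names an already built jcc extension whose
    # wrappers are reused instead of being generated again.
    for name in imports:
        args += ["--import", name]
    args += ["--jar", jar, "--python", module, "--build", "--install"]
    return args


# Hypothetical jar name; "pyjama" is the module name used in the session.
cmd = jcc_build_args("dumputils.jar", "pyjama", imports=["lucene"])
print(" ".join(cmd[1:]))
```

The list could be passed to subprocess.check_call once PyLucene itself has been built with --shared. At run time, following the advice above, import lucene would come before import pyjama.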





Re: InvalidArgsError - passing TopDocs object

2010-08-24 Thread Roman Chyla
Thank you very much, Andi.
Best,

  roman



do the Java and Python garbage collectors talk to each other, with JCC?

2010-08-24 Thread Bill Janssen
I'm starting to see traces like the following in my UpLib setup (OS X
10.5.8, 32-bit Python 2.5, Java 6, JCC-2.6, PyLucene-2.9.3) that
indicate an out-of-memory issue.  I spawn a lot of short-lived threads
in Python; each of them is attached to Java, and detached after the run
method returns.  I've run test programs that do nothing but repeatedly
start new threads that then invoke pylucene to index a document, and
see no problems.

I'm trying to come up with a hypothesis for this.  One thing I'm
wondering: if my Python memory space is approaching the limit, does
PyLucene arrange for the Java garbage collector to invoke the Python
garbage collector when it can't allocate memory?  I keep a lot of
objects via weak references in my Python memory space, and I may just
be filling up VM so that Java can't allocate enough heap/stack space
for a new thread.  Note that the thread being unsuccessfully started
isn't mine; it's being started by Java.

Bill
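
On the weak-reference part of the hypothesis: in CPython, an object that is reachable only through weak references is freed as soon as its last strong reference goes away, so a weak-value cache by itself should not keep memory pinned. A minimal sketch of that behaviour, using a hypothetical Document class rather than anything from UpLib:

```python
import gc
import weakref


class Document(object):
    """Hypothetical stand-in for an object cached via weak references."""

    def __init__(self, doc_id):
        self.doc_id = doc_id


def cache_state_around_release():
    cache = weakref.WeakValueDictionary()
    doc = Document("doc-1")
    cache["doc-1"] = doc
    alive_before = "doc-1" in cache  # a strong reference still exists
    del doc                          # drop the only strong reference
    gc.collect()                     # only matters for reference cycles
    alive_after = "doc-1" in cache
    return alive_before, alive_after


before, after = cache_state_around_release()
```

This says nothing about how JCC coordinates the two collectors; it only shows that the Python side releases such objects without waiting for a collection pass.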

thr1730: Running document rippers raised the following exception:
thr1730: Traceback (most recent call last):
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/newFolder.py, line 282, in _run_rippers
thr1730:     ripper.rip(folderpath, id)
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/createIndexEntry.py, line 187, in rip
thr1730:     index_folder(location, self.repository().index_path())
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/createIndexEntry.py, line 82, in index_folder
thr1730:     c.index(folder, doc_id)
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/indexing.py, line 813, in index
thr1730:     self.reopen()
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/indexing.py, line 635, in reopen
thr1730:     self.current_writer.flush()
thr1730: JavaError: java.lang.OutOfMemoryError: unable to create new native thread
thr1730: Java stacktrace:
thr1730: java.lang.OutOfMemoryError: unable to create new native thread
thr1730:     at java.lang.Thread.start0(Native Method)
thr1730:     at java.lang.Thread.start(Thread.java:592)
thr1730:     at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:221)
thr1730:     at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:3070)
thr1730:     at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:3065)
thr1730:     at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:3061)
thr1730:     at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4256)
thr1730:     at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4060)