InvalidArgsError - passing TopDocs object
Hi,

I am trying to understand PyLucene better and to see whether it is faster to retrieve result ids with Java than with Python. The use case is retrieving millions of recids: with Python, 700K ids take about 1.5s (even though the query itself takes just a fraction of that).

I wrote some simple Java code (it works in Java) which returns an array of ints. I have wrapped it with JCC and it is visible from inside Python, but calling the static method throws InvalidArgsError (an example Python session is below).

JCC is version 2.4, built in shared mode -- DumpUtils is in a different package than lucene (i.e. not inside the lucene jars).

Could this problem be similar to passing JCC-wrapped objects between different JCC packages?
http://search-lucene.com/m/SPgeW1hDtAw1

The Java class is very simple:

    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class DumpUtils {
        public static int[] GetDocIds(TopDocs topdocs) {
            int[] out;
            // one slot per hit; assumes scoreDocs holds all totalHits entries
            out = new int[topdocs.totalHits];
            ScoreDoc[] hits = topdocs.scoreDocs;
            for (int i = 0; i < topdocs.totalHits; i++) {
                out[i] = hits[i].doc;
            }
            return out;
        }
    }

Thanks for any help/pointers,

roman

Here is an example python session:

    In [1]: import pyjama
    In [2]: pyjama.initVM(pyjama.CLASSPATH)
    Out[2]: <jcc.JCCEnv object at 0x00C0E1F0>
    In [3]: import lucene as lu
    In [4]: pyjama.DumpUtils
    Out[4]: <type 'DumpUtils'>
    In [5]: pyjama.DumpUtils.GetDocIds
    Out[5]: <built-in method GetDocIds of type object at 0x0189E780>
    In [6]:
    In [7]: import newseman.pyjamic.slucene.searcher as se
    In [8]: s = se.Searcher();s.open('/tmp/whisper/')
    In [9]: hits = s._search(s._query('key:bo*',None), 50)
    In [10]: hits
    Out[10]: <TopDocs: org.apache.lucene.search.topd...@480457>
    In [11]:
    In [12]: pyjama.DumpUtils.GetDocIds(hits)
    ---
    InvalidArgsError                Traceback (most recent call last)
    InvalidArgsError: (<type 'DumpUtils'>, 'GetDocIds', <TopDocs: org.apache.lucene.search.topd...@480457>)
Re: InvalidArgsError - passing TopDocs object
On Aug 24, 2010, at 8:03, Roman Chyla roman.ch...@gmail.com wrote:

> JCC is version 2.4, built with shared mode -- the DumpUtils is in a
> different package than lucene (ie. not inside lucene jars). Can this
> problem be similar to passing jcc-wrapped objects between different
> jcc-packages?

Ah yes, importing separately built extensions that share classes (or dependencies) didn't work until support for the --import parameter was added in jcc 2.6 to solve the problem of incompatible shared classes.

To make this work:
  - first, build PyLucene as usual, with --shared
  - then, build your DumpUtils package with --import lucene and with --shared

That way, instead of generating code and wrapper classes again for the lucene classes, jcc will import them at build time, thus making a much smaller library and a faster build. The resulting shared library is linked against the lucene one. See the docs and list archives about --import for more examples.

Then, when running all this, you should also import lucene first, then your other package.

Andi..
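The second build step above can be sketched as a small Python helper that assembles the jcc command line. This is only an illustration under assumptions: the jar name `dumputils.jar` is hypothetical, the module name `pyjama` is taken from the session in the original message, and the exact `-m` target for launching jcc may differ between Python versions. The helper only builds the argument list; it does not run anything.

```python
import sys


def jcc_build_command(jar, module, imports=()):
    """Return the argument list for a shared JCC build of `jar`.

    `module` is the name of the Python extension to generate; each name in
    `imports` is an already-built JCC extension whose wrappers should be
    reused (via --import) instead of being generated again.
    """
    cmd = [sys.executable, "-m", "jcc", "--shared"]
    for name in imports:
        cmd += ["--import", name]
    cmd += ["--jar", jar, "--python", module, "--build", "--install"]
    return cmd


if __name__ == "__main__":
    # Hypothetical jar name; "pyjama" and "lucene" come from the thread.
    args = jcc_build_command("dumputils.jar", "pyjama", imports=["lucene"])
    print(" ".join(args))
```

At run time, following the advice above, `import lucene` would come before `import pyjama`.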
Re: InvalidArgsError - passing TopDocs object
Thank you very much, Andi.

Best,

roman

On Tue, Aug 24, 2010 at 5:36 PM, Andi Vajda va...@apache.org wrote:

> To make this work:
>   - first, build PyLucene as usual, with --shared
>   - then, build your DumpUtils package with --import lucene and with --shared
do the Java and Python garbage collectors talk to each other, with JCC?
I'm starting to see traces like the following in my UpLib (OS X 10.5.8, 32-bit Python 2.5, Java 6, JCC-2.6, PyLucene-2.9.3) that indicate an out-of-memory issue. I spawn a lot of short-lived threads in Python, and each of them is attached to Java, and detached after the run method returns. I've run test programs that do nothing but repeatedly start new threads that then invoke pylucene to index a document, and see no problems.

I'm trying to come up with a hypothesis for this. One of the things I'm wondering is if my Python memory space is approaching the limit, does PyLucene arrange for the Java garbage collector to invoke the Python garbage collector if it can't allocate memory? I keep a lot of objects via weak references in my Python memory space, and I may just be filling up VM so that Java can't allocate enough heap/stack space for a new thread.

Note that the thread being unsuccessfully started isn't mine; it's being started by Java.

Bill

thr1730: Running document rippers raised the following exception:
thr1730: Traceback (most recent call last):
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/newFolder.py, line 282, in _run_rippers
thr1730:     ripper.rip(folderpath, id)
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/createIndexEntry.py, line 187, in rip
thr1730:     index_folder(location, self.repository().index_path())
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/createIndexEntry.py, line 82, in index_folder
thr1730:     c.index(folder, doc_id)
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/indexing.py, line 813, in index
thr1730:     self.reopen()
thr1730:   File /local/share/UpLib-1.7.9/code/uplib/indexing.py, line 635, in reopen
thr1730:     self.current_writer.flush()
thr1730: JavaError: java.lang.OutOfMemoryError: unable to create new native thread
thr1730: Java stacktrace:
thr1730: java.lang.OutOfMemoryError: unable to create new native thread
thr1730:     at java.lang.Thread.start0(Native Method)
thr1730:     at java.lang.Thread.start(Thread.java:592)
thr1730:     at org.apache.lucene.index.ConcurrentMergeScheduler.merge(ConcurrentMergeScheduler.java:221)
thr1730:     at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:3070)
thr1730:     at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:3065)
thr1730:     at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:3061)
thr1730:     at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4256)
thr1730:     at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4060)
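For readers unfamiliar with the thread pattern described in the message (attach each short-lived Python thread to Java, detach when its work returns), here is a minimal sketch. It is not UpLib's code and does not address the garbage-collector question. `FakeVMEnv` is a hypothetical stand-in so the example runs without PyLucene; with PyLucene installed, the object returned by `lucene.getVMEnv()` would presumably take its place. The counters are only there to check that every attach is matched by a detach.

```python
import threading


class FakeVMEnv(object):
    """Hypothetical stand-in for the JCC VM environment object."""

    def __init__(self):
        self._lock = threading.Lock()
        self.attached = 0  # threads currently attached
        self.total = 0     # attach calls seen so far

    def attachCurrentThread(self):
        with self._lock:
            self.attached += 1
            self.total += 1

    def detachCurrentThread(self):
        with self._lock:
            self.attached -= 1


def run_attached(env, work, item):
    """Attach the calling thread, run work(item), and always detach."""
    env.attachCurrentThread()
    try:
        work(item)
    finally:
        # Detach even if work() raises, so no thread stays attached.
        env.detachCurrentThread()


def run_in_threads(env, work, items):
    """Start one short-lived thread per item and wait for all of them."""
    threads = [threading.Thread(target=run_attached, args=(env, work, item))
               for item in items]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    env = FakeVMEnv()
    done = []
    run_in_threads(env, done.append, range(5))
    print(len(done), env.attached)
```

If `attached` does not return to zero after all threads have been joined, some thread exited without detaching, which would be one thing to rule out before looking at memory limits.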