Itamar,

First of all, thanks for fixing StandardAnalyzer! The zero norm issues are gone. I have to agree with celtix44, you are indeed a legend. :)

I'm attaching a patch against HEAD that will add large file support for 32-bit Unix systems using #define hacks. Windows is going to be more difficult but at least CLucene can support large files on Unix easily.

I also wrote a test for IndexWriter, but I couldn't get cl_test to build (to be exact, it fails to link: "make cl_test" produces lots of linker errors). The test attempts to create a huge index in the data/bigIndex directory. If the resulting index file stops at about 2.1 GB, that is bad: we hit the 2^31-byte ceiling. If it creates an index file bigger than that, then things are good.
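The size check itself could be automated along the following lines. This is only a sketch: the helper names crossedTwoGiBCeiling and fileSizeBytes are made up for illustration and are not part of CLucene or cl_test, and it assumes a POSIX system where stat() is available.

```cpp
// Ask for a 64-bit off_t before any system header is pulled in, so that
// stat() can report sizes above 2 GB on 32-bit Unix systems.
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <sys/stat.h>
#include <stdint.h>
#include <cassert>
#include <cstddef>

// Largest offset a signed 32-bit off_t can hold: 2^31 - 1 bytes.
static const int64_t kSigned32Ceiling = 2147483647LL;

// Hypothetical helper: true if a file of this size got past the ceiling.
static bool crossedTwoGiBCeiling(int64_t sizeBytes)
{
  return sizeBytes > kSigned32Ceiling;
}

// Hypothetical helper: size of the file at path, or -1 if stat() fails.
static int64_t fileSizeBytes(const char *path)
{
  assert(path != NULL);
  struct stat st;
  if (stat(path, &st) != 0)
    return -1;
  return static_cast<int64_t>(st.st_size);
}
```

A test could then call fileSizeBytes() on the largest file in data/bigIndex and fail unless crossedTwoGiBCeiling() returns true for it.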

Apologies if the changes in the patch are stylistically out of place with CLucene's order of things but I couldn't think up a better way to do it. I don't know CMake but perhaps the defines should be emitted by CMake on 32-bit systems only?
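For reference, the feature-test macros only take effect if they are defined before the first system header is included (or passed on the compiler command line, e.g. -D_FILE_OFFSET_BITS=64). Below is a minimal sketch for checking that the setting took hold, assuming a glibc-style Unix system; offsetBits and requireLargeFileSupport are made-up helper names, not CLucene functions.

```cpp
// Feature-test macro: request a 64-bit off_t even on 32-bit systems.
// It has to precede every system header to have any effect.
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <climits>
#include <cassert>

// Hypothetical helper: width of off_t in bits for this build.
static int offsetBits()
{
  return static_cast<int>(sizeof(off_t) * CHAR_BIT);
}

// Hypothetical helper: aborts (when assertions are enabled) if the build
// did not end up with an off_t of at least 64 bits.
static void requireLargeFileSupport()
{
  assert(offsetBits() >= 64);
}
```

On a 64-bit system off_t is already 64 bits wide, so the macro changes nothing there; that is why emitting it unconditionally should be harmless, even if CMake ends up emitting it only for 32-bit builds.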

Itamar Syn-Hershko wrote:
Michael,

Please update your code, I just committed a fix for the bug you reported
(commit c89f8a39fa1faa34374d8a6e92ae9c2467deeda7). Please test this with
your code as well.

With regards to the 64bit FS issue, it would be nice if you could provide a
test and a fix for this (using some #define hacks or our cmake scripts). I'm
just so swamped at the moment that I'm afraid I won't be able to do this
myself anytime soon. I can provide pointers if necessary.

Itamar.
-----Original Message-----
From: Michael Levin [mailto:mele...@stanford.edu]
Sent: Thursday, November 05, 2009 10:07 PM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Cannot write >2gb index file

Itamar Syn-Hershko wrote:
Michael,

Thanks. However, the O_LARGEFILE flag isn't supported on Windows (in any version, as far as I can tell), and might not be supported on some Linux distributions or on Mac. That being said, this is something we need to test and find a solution for (probably another cmake check). I'm no cross-platform wiz, so if anyone is willing to take this up, please be my guest.

Thanks for looking into this.

It's true that Windows doesn't support this flag. I believe on Windows you
don't have the open64 style functions either so you must use the Windows API
equivalents (e.g. CreateFile()). I can see how inconvenient this can get
though as you probably won't be able to get away with a platform-agnostic
_cl_open() function...
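If the C runtime's 64-bit variants turn out to be enough, the platform split could perhaps be hidden behind a small wrapper instead. The sketch below is an illustration under that assumption, not CLucene code: cl_seek64 is a made-up name, the Windows branch uses the CRT's _lseeki64() rather than CreateFile(), and the Windows branch is untested here.

```cpp
#define _FILE_OFFSET_BITS 64  // POSIX: 64-bit off_t; has no effect on Windows

#include <stdint.h>
#include <stdio.h>            // SEEK_SET, SEEK_CUR, SEEK_END
#include <cassert>
#ifdef _WIN32
  #include <io.h>             // _lseeki64
#else
  #include <sys/types.h>
  #include <unistd.h>         // lseek
#endif

// Hypothetical wrapper: seek with a 64-bit offset on either platform.
// Returns the new position, or -1 on error.
static int64_t cl_seek64(int fd, int64_t offset, int whence)
{
  assert(fd >= 0);
#ifdef _WIN32
  return _lseeki64(fd, offset, whence);
#else
  return static_cast<int64_t>(lseek(fd, static_cast<off_t>(offset), whence));
#endif
}
```

Seeking to an offset above 2^31 with this wrapper and checking the returned position would be a cheap way to see whether a given platform really handles large offsets.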

On that note, I haven't tested your code to see if it crashes on Windows as well. Might be interesting to see, though.

I believe it should, though it's definitely something worth testing.

I see no reason why this will break StandardAnalyzer. Can you provide more details please?

Honestly, I don't know why it would either. It may not have been my changes
actually; celtix44 just sent out an email saying the last two commits broke
the StandardAnalyzer ("[CLucene-dev] StandardAnalyzer broken - GIT
364c21b6c3f54fbb90df223621b660197366fb93"). I was using git to switch from
my branch to HEAD and I thought that the StandardAnalyzer was working in
HEAD, though I may have made a mistake...

The exact problem is missing norms. When I generate a new file and open it
in Luke or query with CLucene only the first term processed with
StandardAnalyzer has a norm (of 1.0) and every other term has zero norm and
won't appear in search results.

I am currently using StopAnalyzer and it works fine so I wonder if the
problem is somewhere in StandardFilter?

Itamar.
-----Original Message-----
From: Michael Levin [mailto:mele...@stanford.edu]
Sent: Thursday, November 05, 2009 11:55 AM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Cannot write >2gb index file

(Sorry for the email spam...)

This change seems to break StandardAnalyzer though. I can't figure out why... all of the other analyzers work fine. :-\

Michael Levin wrote:
Really easy fix, please add "O_LARGEFILE" flag everywhere _cl_open() is used. E.g.:

   _cl_open(buffer, O_RDWR, _S_IWRITE) -->
   _cl_open(buffer, O_RDWR | O_LARGEFILE, _S_IWRITE)

The required header define is already defined in config files and adding this flag shouldn't affect 64-bit machines in any way. Thanks!

--
Michael Levin <mele...@stanford.edu>
diff --git a/src/core/CLucene/store/FSDirectory.cpp b/src/core/CLucene/store/FSDirectory.cpp
index 470b3d8..d1fd597 100644
--- a/src/core/CLucene/store/FSDirectory.cpp
+++ b/src/core/CLucene/store/FSDirectory.cpp
@@ -32,6 +32,11 @@
     #include "_MMap.h"
 #endif
 
+// Flag needed to support large files on 32-bit systems, not always available
+#ifndef O_LARGEFILE
+#define O_LARGEFILE 0
+#endif
+
 CL_NS_DEF(store)
 CL_NS_USE(util)
 
@@ -122,7 +127,7 @@ CL_NS_USE(util)
 	  SharedHandle* handle = _CLNEW SharedHandle(path);
 
 	  //Open the file
-	  handle->fhandle  = ::_cl_open(path, _O_BINARY | O_RDONLY | _O_RANDOM, _S_IREAD );
+	  handle->fhandle  = ::_cl_open(path, _O_BINARY | O_RDONLY | _O_RANDOM | O_LARGEFILE, _S_IREAD );
 
 	  //Check if a valid handle was retrieved
 	  if (handle->fhandle >= 0){
@@ -267,9 +272,9 @@ void FSDirectory::FSIndexInput::readInternal(uint8_t* b, const int32_t len) {
 	//O_RANDOM - Specifies that caching is optimized for, but not restricted to, random access from disk.
 	//O_WRONLY - Opens file for writing only;
 	  if ( Misc::dir_Exists(path) )
-	    fhandle = _cl_open( path, _O_BINARY | O_RDWR | _O_RANDOM | O_TRUNC, _S_IREAD | _S_IWRITE);
+	    fhandle = _cl_open( path, _O_BINARY | O_RDWR | _O_RANDOM | O_TRUNC | O_LARGEFILE, _S_IREAD | _S_IWRITE);
 	  else // added by JBP
-	    fhandle = _cl_open( path, _O_BINARY | O_RDWR | _O_RANDOM | O_CREAT, _S_IREAD | _S_IWRITE);
+	    fhandle = _cl_open( path, _O_BINARY | O_RDWR | _O_RANDOM | O_CREAT | O_LARGEFILE, _S_IREAD | _S_IWRITE);
 
 	  if ( fhandle < 0 ){
       int err = errno;
@@ -526,7 +531,7 @@ void FSDirectory::FSIndexInput::readInternal(uint8_t* b, const int32_t len) {
     char buffer[CL_MAX_DIR];
     _snprintf(buffer,CL_MAX_DIR,"%s%s%s",directory.c_str(),PATH_DELIMITERA,name);
 
-    int32_t r = _cl_open(buffer, O_RDWR, _S_IWRITE);
+    int32_t r = _cl_open(buffer, O_RDWR | O_LARGEFILE, _S_IWRITE);
 	if ( r < 0 )
 		_CLTHROWA(CL_ERR_IO,"IO Error while touching file");
 	::_close(r);
diff --git a/src/core/CLucene/store/Lock.cpp b/src/core/CLucene/store/Lock.cpp
index 76c8bce..0aeb9e8 100644
--- a/src/core/CLucene/store/Lock.cpp
+++ b/src/core/CLucene/store/Lock.cpp
@@ -23,6 +23,11 @@
 #endif
 #include <fcntl.h>
 
+// Flag needed to support large files on 32-bit systems, not always available
+#ifndef O_LARGEFILE
+#define O_LARGEFILE 0
+#endif
+
 
 CL_NS_USE(util)
 CL_NS_DEF(store)
@@ -142,7 +147,7 @@ CL_NS_DEF(store)
 	   		  _CLTHROWA_DEL(CL_ERR_IO, err );
 	         }
 	       }
-	       int32_t r = _cl_open(lockFile,  O_RDWR | O_CREAT | _O_RANDOM | O_EXCL,
+	       int32_t r = _cl_open(lockFile,  O_RDWR | O_CREAT | _O_RANDOM | O_EXCL | O_LARGEFILE,
 	       	_S_IREAD | _S_IWRITE); //must do this or file will be created Read only
 	   	if ( r < 0 ) {
 	   	  return false;
diff --git a/src/core/CLucene/util/Reader.cpp b/src/core/CLucene/util/Reader.cpp
index 91a0dc7..e634ed7 100644
--- a/src/core/CLucene/util/Reader.cpp
+++ b/src/core/CLucene/util/Reader.cpp
@@ -25,6 +25,11 @@
 
 #include "_bufferedstream.h"
 
+// Flag needed to support large files on 32-bit systems, not always available
+#ifndef O_LARGEFILE
+#define O_LARGEFILE 0
+#endif
+
 CL_NS_DEF(util)
 
 StringReader::StringReader ( const TCHAR* _value, const int32_t _length, bool copyData )
@@ -213,7 +218,7 @@ public:
 	JStreamsBuffer* jsbuffer;
 
 	Internal(const char* path, int32_t buffersize){
-		int32_t fhandle = _cl_open(path, _O_BINARY | O_RDONLY | _O_RANDOM, _S_IREAD );
+		int32_t fhandle = _cl_open(path, _O_BINARY | O_RDONLY | _O_RANDOM | O_LARGEFILE, _S_IREAD );
 		
 		//Check if a valid handle was retrieved
 	   if (fhandle < 0){
diff --git a/src/core/CMakeLists.txt b/src/core/CMakeLists.txt
index 6084f2f..5a08303 100644
diff --git a/src/shared/CLucene/SharedHeader.h b/src/shared/CLucene/SharedHeader.h
index 84816ec..7ce2d90 100644
--- a/src/shared/CLucene/SharedHeader.h
+++ b/src/shared/CLucene/SharedHeader.h
@@ -49,7 +49,12 @@
 #endif
 ////////////////////////////////////////////////////////
 
-
+////////////////////////////////////////////////////////
+//Support for large files on 32-bit systems, include before standard library headers
+////////////////////////////////////////////////////////
+#define _FILE_OFFSET_BITS 64
+#define _LARGEFILE_SOURCE
+#define _LARGEFILE64_SOURCE
 
 ////////////////////////////////////////////////////////
 //platform includes that MUST be included for the public headers to work...
diff --git a/src/test/analysis/TestAnalyzers.cpp b/src/test/analysis/TestAnalyzers.cpp
index 3675317..7aad2b3 100644
diff --git a/src/test/index/TestIndexWriter.cpp b/src/test/index/TestIndexWriter.cpp
index 2ffd73c..e36a31d 100644
--- a/src/test/index/TestIndexWriter.cpp
+++ b/src/test/index/TestIndexWriter.cpp
@@ -249,6 +249,43 @@ void testHashingBug(CuTest *tc){
 }
 
 
+static const TCHAR *genValue()
+{
+  static TCHAR buf[4096];
+  for (unsigned int i = 0; i + 1 < sizeof (buf) / sizeof (*buf); i++)  // last element stays 0 as the terminator
+    buf[i] = ' ' + (rand() % 16);
+  return buf;
+}
+
+
+void testIWbigIndex(CuTest *tc){
+  srand(time(NULL));
+
+  // Create a big index
+  char dirPath[CL_MAX_PATH];
+  snprintf(dirPath, sizeof (dirPath), "%s/bigIndex", clucene_data_location);
+  StandardAnalyzer analyzer;
+  IndexWriter writer(FSDirectory::getDirectory(dirPath, true), &analyzer, true);
+  writer.setRAMBufferSizeMB(512);
+
+  // Add documents of random terms to the index until we get a >2gb index
+  CL_NS(document)::Document doc;
+  int flags = Field::STORE_YES | Field::INDEX_TOKENIZED;
+  for (int i = 0; i < 550000; i++) {
+    doc.clear();
+    doc.add(*(_CLNEW Field(_T("First"), genValue(), flags)));
+    doc.add(*(_CLNEW Field(_T("Second"), genValue(), flags)));
+    doc.add(*(_CLNEW Field(_T("Fifth"), genValue(), flags)));
+    doc.add(*(_CLNEW Field(_T("Eigth"), genValue(), flags)));
+    doc.add(*(_CLNEW Field(_T("Ninth"), genValue(), flags)));
+    writer.addDocument(&doc);
+  }
+
+  writer.optimize();
+  writer.close();
+}
+
+
 CuSuite *testindexwriter(void)
 {
 	CuSuite *suite = CuSuiteNew(_T("CLucene IndexWriter Test"));
@@ -257,6 +294,7 @@ CuSuite *testindexwriter(void)
 	SUITE_ADD_TEST(suite, testIWmergeSegments1);
   SUITE_ADD_TEST(suite, testIWmergeSegments2);
 	SUITE_ADD_TEST(suite, testIWmergePhraseSegments);
+	SUITE_ADD_TEST(suite, testIWbigIndex);
 
   return suite;
 }
diff --git a/src/test/test.h b/src/test/test.h
index c8cb0e2..aa8444f 100644
--- a/src/test/test.h
+++ b/src/test/test.h
@@ -19,6 +19,7 @@
 #include "CLucene/index/TermVector.h"
 #include "CLucene/queryParser/MultiFieldQueryParser.h"
 #include <string.h>
+#include <cstdio>
 
 #define LUCENE_INT64_MAX_SHOULDBE _ILONGLONG(0x7FFFFFFFFFFFFFFF)
 #define LUCENE_INT64_MIN_SHOULDBE (-LUCENE_INT64_MAX_SHOULDBE - _ILONGLONG(1) )
_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers
