Hi Karl,

The changes above give great throughput when crawling the local system.
Next I tried running the job against a shared drive. I know it won't be as fast as a local drive, but I wanted to see how things change with a shared drive. The job errored out with this message:

"Error: IO Error: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The specified network name is no longer available."

This is not a Solr error, because I could not find it in the Solr log. I found the error in the MCF log. Please find the stack trace below:

ERROR 2014-08-01 14:41:17,531 (Worker thread '88') - Exception tossed: IO Error: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The specified network name is no longer available.
org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO Error: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The specified network name is no longer available.
        at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:417)
        at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:433)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:565)
Caused by: java.nio.file.FileSystemException: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The specified network name is no longer available.
        at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
        at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
        at sun.nio.fs.WindowsAclFileAttributeView.getFileSecurity(Unknown Source)
        at sun.nio.fs.WindowsAclFileAttributeView.getOwner(Unknown Source)
        at sun.nio.fs.FileOwnerAttributeViewImpl.getOwner(Unknown Source)
        at java.nio.file.Files.getOwner(Unknown Source)
        at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:391)
        ... 2 more

Can you give me some suggestions?

Thanks,
Ameya

On Thu, Jul 31, 2014 at 4:24 PM, Karl Wright <daddy...@gmail.com> wrote:

> Hi Ameya,
>
> You cannot just comment out that line; instead you must supply an input stream.
> But you can create a null input stream, for example:
>
> data.setBinary(new ByteArrayInputStream(new byte[0]),0);
>
> Karl
>
> On Thu, Jul 31, 2014 at 4:22 PM, Ameya Aware <ameya.aw...@gmail.com> wrote:
>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>> long fileBytes = file.length();
>> RepositoryDocument data = new RepositoryDocument();
>> data.setBinary(is,fileBytes);
>> String fileName = file.getName();
>> data.setFileName(fileName);
>> data.setMimeType(mapExtensionToMimeType(fileName));
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<
>>
>> Do I just need to comment out the 3rd line, i.e. data.setBinary(is,fileBytes); ?
>>
>> Thanks,
>> Ameya
>>
>> On Thu, Jul 31, 2014 at 4:17 PM, Ameya Aware <ameya.aw...@gmail.com> wrote:
>>
>>> I could not exactly locate the position where this is happening.
>>>
>>> Can you please help me out with the changes?
>>>
>>> Thanks,
>>> Ameya
>>>
>>> On Thu, Jul 31, 2014 at 4:10 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>
>>>> Hi Ameya,
>>>>
>>>> Since you are already modifying the connector for your purposes, nothing is stopping you from modifying it further to not fetch the document and instead substitute an empty input stream.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Jul 31, 2014 at 3:03 PM, Ameya Aware <ameya.aw...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have modified the code a little to add different metadata fields, such as below (FileConnector.java):
>>>>>
>>>>> data.addField("created", new Date((attr.creationTime().toMillis())));
>>>>> data.addField("last_accessed", new Date(attr.lastAccessTime().toMillis()));
>>>>> data.addField("last_modified", new Date(file.lastModified()));
>>>>> data.addField("size", file.length());
>>>>>
>>>>> which are being passed to Solr.
>>>>>
>>>>> Now can I stop MCF from reading a file and sending that content, and just pass the above information to Solr?
>>>>>
>>>>> Thanks,
>>>>> Ameya
>>>>>
>>>>> On Thu, Jul 31, 2014 at 2:57 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ameya,
>>>>>>
>>>>>> The file system connector does not retrieve any metadata for a document at all. So I'm not sure what metadata you are talking about.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Jul 31, 2014 at 2:44 PM, Ameya Aware <ameya.aw...@gmail.com> wrote:
>>>>>>
>>>>>>> So the thing here is I am not looking for any data or content of any of the files. I am just interested in the metadata of each file.
>>>>>>>
>>>>>>> So I thought it should be possible to not read any file and just get the metadata of the file and give it to Solr.
>>>>>>>
>>>>>>> This should save lots of time.
>>>>>>>
>>>>>>> Is it possible to do this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ameya
>>>>>>>
>>>>>>> On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Ameya,
>>>>>>>>
>>>>>>>> (1) Please look at the Simple History report. Note what kinds of documents are being fetched, what kinds are being indexed, and how long it is taking. I have noted from your previous posts that you seem to be indexing a lot of very large EXE files. This is useless and you should be excluding them.
>>>>>>>>
>>>>>>>> (2) Please look in the manifoldcf.log file for evidence that fetches and/or Solr indexing requests are being retried due to errors. It doesn't take many documents being chronically retried before forward progress drops to near zero.
>>>>>>>>
>>>>>>>> (3) If you look into (1) & (2) and everything seems fine, it may be a misalignment between availability of several kinds of resources that is the problem. Please get a thread dump of the agents process while it is crawling, using jstack.
>>>>>>>> Post that thread dump and we can tell you what to look at next.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware <ameya.aw...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am using the filesystem connector to index my entire C drive, using Solr as the output connector.
>>>>>>>>>
>>>>>>>>> The initial 100000 documents were crawled and indexed successfully in a couple of hours, but after that indexing slowed down badly (around 15-20 documents per min).
>>>>>>>>>
>>>>>>>>> I am not able to figure out whether the issue is with MCF or Solr.
>>>>>>>>>
>>>>>>>>> Can you advise me how to proceed with this?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ameya
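
For reference, the metadata-only approach discussed in this thread (collect file attributes, then hand over a zero-length stream instead of the file content) can be sketched in plain Java. This is only a rough sketch under stated assumptions, not the connector's actual code: the class name MetadataOnlySketch and the helpers readMetadata and emptyContent are hypothetical, and a plain Map stands in for RepositoryDocument so the example runs without ManifoldCF on the classpath. Unlike the snippet quoted above, it takes all four values from a single BasicFileAttributes read rather than mixing in file.lastModified() and file.length().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataOnlySketch {

    // Hypothetical helper: gathers the four fields named in the thread
    // with one attribute read, without opening the file's content.
    static Map<String, Object> readMetadata(Path path) throws IOException {
        BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("created", new Date(attr.creationTime().toMillis()));
        fields.put("last_accessed", new Date(attr.lastAccessTime().toMillis()));
        fields.put("last_modified", new Date(attr.lastModifiedTime().toMillis()));
        fields.put("size", attr.size());
        return fields;
    }

    // Hypothetical helper: the zero-length stream that would be passed
    // in place of the real content, as in
    // data.setBinary(new ByteArrayInputStream(new byte[0]),0);
    static InputStream emptyContent() {
        return new ByteArrayInputStream(new byte[0]);
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary file so the sketch runs anywhere.
        Path tmp = Files.createTempFile("mcf-sketch", ".txt");
        try {
            Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(readMetadata(tmp));
            try (InputStream is = emptyContent()) {
                System.out.println("content bytes available: " + is.available());
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

In the modified connector, the map entries would correspond to the data.addField(...) calls and the empty stream to the data.setBinary(...) call with a length of 0, so the file is never opened for reading.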