Re: skip index directory in search results

2010-05-01 Thread b k
http://wiki.apache.org/nutch/WritingPluginExample-0.9 After adding this plugin, I was able to index the files by skipping this index page... hope this helps... On Wed, Apr 28, 2010 at 1:54 PM, BK wrote: > Hello all, > > I have indexed
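An editorial aside: the WritingPluginExample approach linked above boils down to an indexing filter that drops auto-generated directory listings before they reach the index. Below is a minimal, hypothetical sketch of the title check such a filter might perform; the class and method names are invented for illustration, not Nutch's actual plugin API (a real plugin would implement the IndexingFilter extension point).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not Nutch's plugin API: Apache-style directory
// listings are titled "Index of <path>", so an indexing filter can
// recognize and skip those pages by title.
public class DirectoryListingCheck {
    private static final Pattern LISTING_TITLE = Pattern.compile("^Index of .*");

    public static boolean isDirectoryListing(String title) {
        return title != null && LISTING_TITLE.matcher(title).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDirectoryListing("Index of C:\\temp\\html")); // prints "true"
        System.out.println(isDirectoryListing("My project homepage"));     // prints "false"
    }
}
```

In a real filter plugin the same check would run inside the filter method, which (as I recall) can return null to tell Nutch to skip the document entirely.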

skip index directory in search results

2010-04-28 Thread BK
Hello all, I have indexed a few directories which contain html files, and the *index to each directory* is showing up as one of the search results. Is there any way to skip this directory from the search results? e.g. *Index of C:\temp\html*, *Index of C:\temp\html\dir2* are showing up in the results

Separate Nutch(crawl) and Lucene (index/search)

2010-04-25 Thread sb101h
I have a requirement where I want to index and search file system contents (my local server contents), and at the same time crawl a select set of web-sites on the same search query. I have search over my local file system implemented through Lucene. I would like to have Nutch just crawl the web

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-21 Thread joshua paul
YES - I forgot to include that... robots.txt is fine. it is wide open: ### # # sample robots.txt file for this website # # addresses all robots by using wild card * User-agent: * # # list folders robots are not allowed to index #Disallow: /tutorials/404redirect

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Harry Nutch
esoftware.com] >>> Sent: Wednesday, 21 April 2010 9:44 AM >>> To: nutch-user@lucene.apache.org >>> Subject: nutch says No URLs to fetch - check your seed list and URL >>> filters when trying to index fmforums.com >>> >>> nutch says No URLs to fetch - chec

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread joshua paul
.com] Sent: Wednesday, 21 April 2010 9:44 AM To: nutch-user@lucene.apache.org Subject: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com. I am using th

RE: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Arkadi.Kosmynin
ers when trying to index fmforums.com > > nutch says No URLs to fetch - check your seed list and URL filters when > trying to index fmforums.com. > > I am using this command: > > bin/nutch crawl urls -dir crawl -depth 3 -topN 50 > > - urls directory contains urls.txt which

nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread joshua paul
nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com. I am using this command: bin/nutch crawl urls -dir crawl -depth 3 -topN 50 - urls directory contains urls.txt which contains http://www.fmforums.com/ - crawl-urlfilter.txt contains +^http
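A note on the usual cause of this symptom: Nutch's RegexURLFilter walks the +/- rules in crawl-urlfilter.txt top-down, the first matching rule wins, and a seed URL that matches no rule at all is dropped, which produces exactly the "No URLs to fetch" message. A self-contained sketch of that matching logic follows; the two rules are illustrative examples, not the poster's actual file.

```java
import java.util.regex.Pattern;

// Sketch of RegexURLFilter semantics: "+" rules accept, "-" rules
// reject, first match wins, and a URL matching no rule is dropped.
public class UrlFilterCheck {
    static final String[] RULES = {
        "-\\.(gif|jpg|png|css|js)$",                 // skip static content
        "+^http://([a-z0-9]*\\.)*fmforums\\.com/"    // keep only fmforums.com
    };

    public static boolean accepts(String url) {
        for (String rule : RULES) {
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false; // seed matched nothing: "No URLs to fetch"
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.fmforums.com/")); // prints "true"
        System.out.println(accepts("http://www.example.com/"));  // prints "false"
    }
}
```

So the thing to verify is that the seed in urls.txt actually matches the `+^http...` include pattern, including any trailing slash the regex expects.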

nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-16 Thread joshuasottpaul
trying to index fmforums.com??? also fmforums.com/robots.txt looks ok: ### # # sample robots.txt file for this website # # addresses all robots by using wild card * User-agent: * # # list folders robots are not allowed to index #Disallow: /tutorials/404redirect/ Disallow

Re: problem: crawl pdfs from a website and index these to solr

2010-04-05 Thread toocrazymail
ilter, seeds, paths, > nutch-site.xml and so on by overwriting the solr- and nutch-configurations). > the solr-server shall index the configured paths of the intranet (file, > smb-shares, svn, ...) and hold the index, and nutch shall crawl the configured > websites (html, pdf, doc, ...) and i

Re: Can't open a nutch 1.0 index with luke

2010-04-01 Thread Magnús Skúlason
Hi, I found the problem. I could open the index on my server (Linux) but not on my desktop (Windows), so something must be getting messed up in transferring the files (FTP); the same thing used to work just fine with nutch-0.9. I tried to zip it on the server and then unzip it on Windows, and then I can

Re: Can't open a nutch 1.0 index with luke

2010-04-01 Thread Andrzej Bialecki
On 2010-04-01 21:09, Magnús Skúlason wrote: > Hi, > > I am getting the following exception when I try to open a nutch 1.0 (I am > using the official release) index with Luke (0.9.9.1) > > java.io.IOException: read past EOF > at > org.apache.lucene.store.B

Can't open a nutch 1.0 index with luke

2010-04-01 Thread Magnús Skúlason
Hi, I am getting the following exception when I try to open a nutch 1.0 (I am using the official release) index with Luke (0.9.9.1) java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput. java:151) at

problem: crawl pdfs from a website and index these to solr

2010-04-01 Thread toocrazymail
the index and nutch shall crawl the configured websites (html, pdf, doc, ...) and indexing these to solr-server. currently i use the whole-web-crawl-script shown below. the indexing of the plain/text websites into solr is not a problem, but when i would like to crawl a website (included pdf

Problem with writing index

2010-03-30 Thread hareesh
I was trying a crawl with 200 seeds. In previous cases it used to create the index without any problem; now when I started the crawl it shows the following exception at depth 2 attempt_201003301923_0007_m_00_0: Aborting with 100 hung threads. Task attempt_201003301923_0007_m_04_0

Re: Plugin installed , deployed and works correctly but no new field in the index ????????????

2010-03-24 Thread Ahmad Al-Amri
Hello Arnaud; why not start by making sure you are parsing correctly, by debugging your code and checking if the author is added to the metadata. If it is parsed and added to the metadata, then move to the index part. In the indexer do something like this, and make sure it is added to the doc: String[] tags

Re: Plugin installed , deployed and works correctly but no new field in the index ????????????

2010-03-23 Thread Arnaud Garcia
Hello Ahmad and all nutch-users, I would like to thank you for your response. Exactly as you say, the interface isn't the same in version 1.0 of Nutch. So actually I don't have an error building the plugin, but the problem is that no new field appears in the index when I display the i

Re: Plugin installed , deployed and works correctly but no new field in the index ????????????

2010-03-22 Thread Ahmad Al-Amri
__ From: Arnaud Garcia To: nutch-user@lucene.apache.org Sent: Wed, March 17, 2010 8:25:26 AM Subject: Re: Plugin installed , deployed and works correctly but no new field in the index 2010/3/17 Arnaud Garcia > > > 2010/3/17 Arnaud Garcia > > Hello everybody >&g

reading solr index

2010-03-18 Thread Fadzi Ushewokunze
hi there, I need to open a solr 1.4 index from the file system using SolrIndexReader.open(); this seems to do the job ok and I can read my documents. The problem arises when I try to get a date field which was indexed as text; here is what toString gives me: stored/uncompressed,binary

Re: Plugin installed , deployed and works correctly but no new field in the index ????????????

2010-03-17 Thread Arnaud Garcia
y >> in the file /nutch/src/plugin/ and >> >> the "author" directory has been created in the directory /nutch/build/ . >> >> >> THE PROBLEM IS : >> >> No new field named "author" exists in the index . >> >> I'm using Luke

Re: Plugin installed , deployed and works correctly but no new field in the index ????????????

2010-03-17 Thread Arnaud Garcia
, all things are built successfully (plugin (separately) + Nutch), > the name of the plugin ("author") was added in the nutch-site.xml file, > > and the tag was added correctly in > the file /nutch/src/plugin/ and > > the "author" directory has been created in the di

Plugin installed , deployed and works correctly but no new field in the index ????????????

2010-03-17 Thread Arnaud Garcia
uthor") was added in the nutch-site.xml file, and the tag was added correctly in the file /nutch/src/plugin/ and the "author" directory has been created in the directory /nutch/build/ . THE PROBLEM IS : No new field named "author" exists in the index . I'm using Luke to read

Can nutch index file-exchanger such as depositfiles.com

2010-03-12 Thread michaelnazaruk
Is it possible to do this with nutch? -- View this message in context: http://old.nabble.com/Can-nutch-index-file-exchanger-such-as-depositfiles.com-tp27874535p27874535.html Sent from the Nutch - User mailing list archive at Nabble.com.

OutOfMemoryError when index

2010-03-04 Thread xiao yang
Hi all, I get an OutOfMemory Error when indexing using bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/* I have configured HADOOP_HEAPSIZE in hadoop-env.sh and mapred.child.java.opts in mapred-site.xml to the hardware limit. mapred.child.java.opts -Xmx2600m
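For reference, the flattened property quoted above would normally appear in mapred-site.xml roughly like this (the -Xmx value is the poster's own example, not a recommendation for any particular hardware):

```xml
<!-- mapred-site.xml fragment: heap for Hadoop child JVMs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2600m</value>
</property>
```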

Re: Inject and index single url

2010-02-24 Thread xiao yang
There's no good way to do this. I'm waiting for HBase integration with Nutch, which will make this operation much easier. The data store structure nutch is using now is not suitable for adding a single url to the index, as far as I know. Thanks! Xiao On Tue, Feb 16, 2010 at 7:47 PM, Ahmad Al-A

Re: Two index

2010-02-22 Thread xiao yang
or you can add a privilege field in the lucene index, and add a clause on this field to each search query. On Mon, Feb 22, 2010 at 10:48 AM, QueroVc wrote: > > Looking for a solution to the following subject: > > My research will be available to both internal and external audience

Two index

2010-02-22 Thread QueroVc
access the results? Now, thank you. -- View this message in context: http://old.nabble.com/Two-index-tp27692301p27692301.html Sent from the Nutch - User mailing list archive at Nabble.com.

Inject and index single url

2010-02-16 Thread Ahmad Al-Amri
and inject them into the crawldb... after that, I need to index it only; I guess just using the current generator and other stuff with depth equal to one does it. What am I supposed to use for doing this, and is there any other missing information I should know? And is building a plug-in more suitable

Re: Update live search index

2010-01-05 Thread Alexander Aristov
rts. So I had to write the restarting procedure myself. Nothing difficult. Just close the current index and reopen it again. And then I just pull the corresponding URL to re-initialize the index after re-crawling. private void reInitNutchBean() throws IOException { bean.close(); bean.reinitilize(); } p

Update live search index

2010-01-05 Thread Joshua J Pavel
Hello all - I need to update the live search index - most preferably without restarting the application. I'm using nutch 0.9 in WebSphere. By doing a few searches, it seems that this is a large issue with a lot of history. Where does it stand today? Is there a .jsp I can crea

Re: Luke reading index in hdfs

2009-12-12 Thread MilleBii
Great, thx. I can open it, which will help, but I don't get the summary page to be populated. Is this normal??? 2009/12/11 Andrzej Bialecki > On 2009-12-11 22:21, MilleBii wrote: > >> Guys is there a way you can get Luke to read the index from hdfs:// ??? >> Or you have to c

Re: Luke reading index in hdfs

2009-12-11 Thread Andrzej Bialecki
On 2009-12-11 22:21, MilleBii wrote: Guys is there a way you can get Luke to read the index from hdfs:// ??? Or you have to copy it out to the local filesystem? Luke 0.9.9 can open indexes directly from HDFS hosted on Hadoop 0.19.x. Luke 0.9.9.1 can do the same, but uses Hadoop 0.20.1. Start

Luke reading index in hdfs

2009-12-11 Thread MilleBii
Guys, is there a way you can get Luke to read the index from hdfs:// ??? Or do you have to copy it out to the local filesystem? -- -MilleBii-

RE: How to successfully crawl and index office 2007 documents in Nutch 1.0

2009-12-07 Thread Rupesh Mankar
Is there any ready-made plug-in for office 2007 documents available, or do I have to write it on my own? -Original Message- From: yangfeng [mailto:yea...@gmail.com] Sent: Monday, December 07, 2009 4:35 PM To: nutch-user@lucene.apache.org Subject: Re: How to successfully crawl and index

Re: How to successfully crawl and index office 2007 documents in Nutch 1.0

2009-12-07 Thread yangfeng
docx should be parsed. A plugin can be used to parse docx files. You can get some help info from the parse-html plugin and so on. 2009/12/4 Rupesh Mankar > Hi, > > I am new to Nutch. I want to crawl and search office 2007 documents (.docx, > .pptx etc) from Nutch. But when I try to crawl, crawler throws
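As background for why a dedicated plugin is the suggested route: a .docx file is just a zip archive whose body text lives in the entry word/document.xml, so the core of such a parser is unzip-then-strip-tags. Here is a stdlib-only sketch of that idea; it is not Nutch's Parser interface, and a real parse plugin would wrap logic like this in that API.

```java
import java.io.*;
import java.util.zip.*;

// Sketch: pull plain text out of a docx stream by reading
// word/document.xml and crudely stripping the XML tags.
public class DocxTextSketch {
    public static String extractText(InputStream docx) throws IOException {
        ZipInputStream zip = new ZipInputStream(docx);
        for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
            if (e.getName().equals("word/document.xml")) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] b = new byte[4096];
                for (int n; (n = zip.read(b)) > 0; ) buf.write(b, 0, n);
                // crude tag stripping, good enough for a sketch
                return buf.toString("UTF-8").replaceAll("<[^>]+>", " ").trim();
            }
        }
        return "";
    }
}
```

A production parser would of course use a real XML parser and handle the other package parts; this only shows why a zip-aware plugin is needed where a plain HTML parser fails.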

How to successfully crawl and index office 2007 documents in Nutch 1.0

2009-12-04 Thread Rupesh Mankar
Hi, I am new to Nutch. I want to crawl and search office 2007 documents (.docx, .pptx etc) from Nutch. But when I try to crawl, the crawler throws the following error: fetching http://10.88.45.140:8081/tutorial/Office-2007-document.docx Error parsing: http://10.88.45.140:8081/tutorial/Office-2007-docume

Re: can you incrementally build an index?

2009-11-24 Thread Andrzej Bialecki
Jesse Hires wrote: Does "bin/nutch merge" only create a whole new index out of several smaller indexes, or can it be used to incrementally update a single large index with newly fetched and indexed smaller segments? It can do either - the tool merges indexes as-is without de-duplica

can you incrementally build an index?

2009-11-23 Thread Jesse Hires
Does "bin/nutch merge" only create a whole new index out of several smaller indexes, or can it be used to incrementally update a single large index with newly fetched and indexed smaller segments? Jesse int GetRandomNumber() { return 4; // Chosen by fair ro

Is there a way to create and index a segment that only has fetched URLs?

2009-11-14 Thread Jesse Hires
mp): 2548 status 5 (db_redir_perm): 3252 status 6 (db_notmodified): 1499 CrawlDb statistics: done Once I do the generate/fetch/updatedb/mergesegs, I do a mergesegs -slice (1/2 * total_urls) This is the step that is taking too long. I then index each of the two new segments indivi

Re: How to make a Lucene-built index work with Nutch?

2009-11-10 Thread fadzi
hi, not sure if this will work in your case; but in a nutshell - first create a nutch index by crawling some urls. open both indexes ie IndexReader r = IndexReader.open(nutch_index) IndexReader r2 = IndexReader.open(your_index) then write a new index; IndexWriter writer = new IndexWriter

How to make a Lucene-built index work with Nutch?

2009-11-10 Thread Wang Muyuan
I have built a custom index using Lucene, as the data source is not web pages but some custom text files. I stored several indexed fields in my index. Now my colleague requires me to make sure this index works with Nutch. So I set up Nutch on the server, followed the steps, crawled some pages

Re: changing/addding field in existing index

2009-11-09 Thread Fadzi Ushewokunze
that seems to work. thanks for that. it was a bit more fiddly than I expected, but I got the index sorted. found an issue with sorting, as most fields cannot be sorted by, and it throws a java.lang.RuntimeException: Unknown sort value type! at

Re: changing/addding field in existing index

2009-11-09 Thread Andrzej Bialecki
fa...@butterflycluster.net wrote: hi all, i have an existing index - we have a custom field that needs to be added or changed in every currently indexed document; what's the best way to go about this without recreating the index again? There are ways to do it directly on the index, but this

Re: Growing the index : Merging vs incremental

2009-11-08 Thread fadzi
? db.fetch.schedule.class = AdaptiveFetchSchedule db.update.additions.allowed = true db.ignore.internal.links = false db.ignore.external.links = true (because we are intranet only) > > > Currently we crawl every two days and create a new index and then merge > with > earlier index. For one it

changing/addding field in existing index

2009-11-08 Thread fadzi
hi all, i have an existing index - we have a custom field that needs to be added or changed in every currently indexed document; what's the best way to go about this without recreating the index again? currently some documents have the field, some don't; the ones that have it need to be updated

Growing the index : Merging vs incremental

2009-11-06 Thread sprabhu_PN
Currently we crawl every two days, create a new index, and then merge with the earlier index. For one, it takes too long, as mergesegs seems to take time proportional to the size of both indexes combined. An equally problematic issue is that mergesegs fails a significant portion of the time. Probability

Multiple index from webapp

2009-11-05 Thread Bartosz Gadzimski
Hello, I am looking for a way to search multiple indexes from one webapp and found some code. I can always make one webapp = one website, but what if it grows? Is it possible to make this code work: in search.jsp /* Comment this original line of code and use code below. Configuratio

Re: How to index files only with specific type

2009-10-27 Thread Dmitriy Fundak
ed >> So didn't get outlinks to kml files from html. >> So I can't parse and index kml files. >> I might not be right, but I have a feeling that it's not possible >> without modifying source code. > > It's possible to do this with a custom indexing

Re: How to index files only with specific type

2009-10-27 Thread Andrzej Bialecki
Dmitriy Fundak wrote: If I disable the html-parser (remove "parse-(html" from the plugin.includes property), html files don't get parsed, so I don't get outlinks to kml files from html, and I can't parse and index kml files. I might not be right, but I have a feeling that it's

Re: How to index files only with specific type

2009-10-27 Thread Dmitriy Fundak
If I disable the html-parser (remove "parse-(html" from the plugin.includes property), html files don't get parsed, so I don't get outlinks to kml files from html, and I can't parse and index kml files. I might not be right, but I have a feeling that it's not possible without mod

RE: How to index files only with specific type

2009-10-26 Thread BELLINI ADAM
disable the html-parser in nutch-site and keep only your parser. you can also add this to your filter file: -(htm|html)$ thx > Date: Mon, 26 Oct 2009 17:53:11 +0300 > Subject: How to index files only with specific type > From: dfun...@gmail.com > To: nutch-user@lucen

How to index files only with specific type

2009-10-26 Thread Dmitriy Fundak
Hi, I've created a parser and indexer for a specific file type (geo xml meta file - kml). I am trying to crawl a couple of sites, and index only files of this type. I don't want to index html or anything else. How can I achieve this? Thanks.

Re: Missing pages from Index in NUTCH 1.0

2009-10-25 Thread reinhard schwab
paul tomblin sent a patch on 14.10.2009 to filter out not-modified pages. makes sense for me if the index is built incrementally and if these pages are already in the index which is updated. then lucene offers the option to update an index, but in my case i always build a new one. you may

Re: Missing pages from Index in NUTCH 1.0

2009-10-24 Thread kevin chen
Some groups of urls are all from the same site. After I > am done with all groups, I copy all the segments together, do a crawldb > update, which will create a new crawldb, and then index. > > This scheme worked well with nutch 0.9. But when I switch to nutch 1.0, > search results

Missing pages from Index in NUTCH 1.0

2009-10-24 Thread kevin chen
, and schedules. Some groups of urls are all from the same site. After I am done with all groups, I copy all the segments together, do a crawldb update, which will create a new crawldb, and then index. This scheme worked well with nutch 0.9. But when I switch to nutch 1.0, search results will miss urls

Re: Accessing an Index from a shared location

2009-10-21 Thread JusteAvantToi
Andrzej Bialecki wrote: > > JusteAvantToi wrote: >> Hi all, >> >> I am new on using Nutch and I found that Nutch is really good. I have a >> problem and hope somebody can shed a light. >> >> I have built an index and a web application that makes u

Re: Accessing an Index from a shared location

2009-10-21 Thread Andrzej Bialecki
JusteAvantToi wrote: Hi all, I am new on using Nutch and I found that Nutch is really good. I have a problem and hope somebody can shed a light. I have built an index and a web application that makes use of that index. I plan to have two web application servers running the application. Since

Accessing an Index from a shared location

2009-10-21 Thread JusteAvantToi
Hi all, I am new to using Nutch and I found that Nutch is really good. I have a problem and hope somebody can shed some light. I have built an index and a web application that makes use of that index. I plan to have two web application servers running the application. Since I do not want to

Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread malcolm smith
esting it again. >> >> Needless to say it would seem more straightforward to tackle this in some >> kind of parser plugin that could break the original page into pieces that >> are treated as standalone pages for indexing purposes. >> >> Last but not least conce

Re: Extending HTML Parser to create subpage index documents

2009-10-19 Thread Andrzej Bialecki
ward to tackle this in some kind of parser plugin that could break the original page into pieces that are treated as standalone pages for indexing purposes. Last but not least conceptually a plugin for the indexer might be able to take a set of custom meta data for a replies "collection" an

Extending HTML Parser to create subpage index documents

2009-10-19 Thread malcolm smith
me kind of parser plugin that could break the original page into pieces that are treated as standalone pages for indexing purposes. Last but not least conceptually a plugin for the indexer might be able to take a set of custom meta data for a replies "collection" and index it as separate

Re: how can I index only a portion of html content?

2009-10-10 Thread winz
cation in nutch mean and how does it work?? Is it possible to configure nutch to remove duplicate content like navigation bars during its de-duplication process?? Regards, Winz -- View this message in context: http://www.nabble.com/how-can-I-index-only-a-portion-of-html-content--tp5149557p258

Re: NutchBean refresh index problem

2009-10-05 Thread Marko Bauhardt
structure. the nutchgui-searcher uses a slightly different folder structure than the original nutch. + you need a property 'nutch.instance.folder' which defines the folder where your crawl folders exist. for example nutch.instance.folder: 'tmp/nutch/crawls' /tmp/nutch/crawl

NutchBean refresh index problem

2009-10-02 Thread Haris Papadopoulos
Hi, I'm sorry I'm sending the same question again. Each time I create a new index for my web app, NutchBean throws an exception and the search page crashes. I know that this is an old problem for nutch (it needs a context restart after index updating) but I was hoping that it would

NutchBean refresh index problem

2009-09-28 Thread Haris Papadopoulos
Hi, Each time I create a new index for my web app, NutchBean throws an exception and the search page crashes. I know that this is an old problem for nutch (it needs a context restart after index updating) but I was hoping that it would be solved with newer versions. I'm using Nutch 1.

Re: splitting an index (yes, again)

2009-09-25 Thread Jesse Hires
Perhaps I have my terminology wrong, so I am looking at this the wrong way. If I want to distribute my search across multiple nodes, having only a portion of the data on each node, is this just a matter of using mergesegs to get the number and size of segments I want, then rebuilding the index (house

Re: splitting an index (yes, again)

2009-09-23 Thread Jesse Hires
> Ok, I will paraphrase the question. > > Consider I want to use distributed search using 3 servers: one primary and > two secondary nodes. > > I create single BIG index using distributed crawler using other computers. > Now I want to split this single BIG index on two parts t

Re: splitting an index (yes, again)

2009-09-23 Thread Alexander Aristov
Ok, I will paraphrase the question. Consider I want to use distributed search using 3 servers: one primary and two secondary nodes. I create a single BIG index using a distributed crawler using other computers. Now I want to split this single BIG index into two parts to put on the search nodes. How

AW: splitting an index (yes, again)

2009-09-22 Thread Koch Martina
Hi Jesse, I'm not sure what you're trying to achieve. Do you want to use the distributed search or do you want to split an existing index? Neither of these tasks is a prerequisite for the other. If you want to split an index, there are several ways to do this. Which way to choose depe

splitting an index (yes, again)

2009-09-22 Thread Jesse Hires
My apologies in advance. I've been digging through the mail archives searching for information on splitting the index after crawling, but I am getting even more confused or the information is too incomplete for a newbie like myself. I see reference to using mergesegs, but not enough to ma

Re: Adding Lucene Index with Nutch Crawl

2009-09-14 Thread MilleBii
You should give a bit more details, because it depends how specific your Lucene indexer is. The way to extend indexing in Nutch is via the indexing plug-in mechanism (look in the wiki for plugin addition). A good starting point is the index-more plug-in. As in Lucene, you want to have a query extension too

Adding Lucene Index with Nutch Crawl

2009-09-14 Thread mervyn_lee
Hi, I'm exploring nutch and this forum for help on integrating a Lucene indexer into the Nutch Crawl (version 0.9) process in Java. Can anyone suggest how to do so or recommend an example/similar thread? Thanks! -- View this message in context: http://www.nabble.com/Adding-Lucene-Index

Re: The index file made by executing main method of org.apache.nutch.crawl.Crawl can not be read from Luke.

2009-09-06 Thread Katsuki FUJISAWA
using > bin/nutch command. > An index file made by nutch 0.9 via the main method of > org.apache.nutch.crawl.Crawl can be read from the program. > But an index file made by nutch 1.0 via the main method of > org.apache.nutch.crawl.Crawl cannot be read from the program. > > > Also re

The index file made by executing main method of org.apache.nutch.crawl.Crawl can not be read from Luke.

2009-09-06 Thread Katsuki FUJISAWA
Hi, I am new to nutch. Now I am trying to do crawling from a Java servlet program without using the bin/nutch command. An index file made by nutch 0.9 via the main method of org.apache.nutch.crawl.Crawl can be read from the program. But an index file made by nutch 1.0 via the main method of

how to effectively update index

2009-09-03 Thread alxsss
Hello, I have a crawl folder with 2GB of data and its index is 160MB. Then nutch indexed another set of domains, and its crawl folder is about 1MB. I wondered if there is an effective way of making the indexes from both folders available for search without using the merge script, since merging large

Re: Which Java objects to index a web page ?

2009-08-12 Thread Fabrice Estiévenart
I like using Nutch for the crawlDB, scalability, threading, document parsing, ... but crawling is not important to me as I index targeted data sources. Obviously, I'm using it with Solr for indexing and searching documents. Fabrice Alexander Aristov a écrit : Nutch primarily is a crawl

Re: How do I get all the documents in the index without searching?

2009-08-12 Thread Paul Tomblin
On Tue, Aug 11, 2009 at 2:10 PM, Paul Tomblin wrote: > I want to iterate through all the documents that are in the crawl, > programmatically. The only code I can find does searches. I don't > want to search for a term, I want everything. Is there a way to do > this? To answer my own question, w

Re: Which Java objects to index a web page ?

2009-08-12 Thread Alexander Aristov
Nutch primarily is a crawler. I would suggest you take a look at solr, which is just an indexer and searcher. You may use its API as well as open interfaces. Best Regards Alexander Aristov 2009/8/12 Fabrice Estiévenart > Hello, > > How can I use Nutch Java objects to index

Re: How do I get all the documents in the index without searching?

2009-08-12 Thread Alex McLintock
Try looking at how the indexers work. They *do* iterate through all the documents in the crawl (or rather one segment at a time). However they do it in a Hadoop way... 2009/8/11 Paul Tomblin : > I want to iterate through all the documents that are in the crawl, > programmatically. The only code

Which Java objects to index a web page ?

2009-08-12 Thread Fabrice Estiévenart
Hello, How can I use Nutch Java objects to index one (or a very limited set of) web page(s) without crawling them ? Do I need to use the crawling tools (such as Injector, Generator, ...) or can I do it by the means of lower-level objects (Content, ParseResult, ...) ? Thanks for your help

How do I get all the documents in the index without searching?

2009-08-11 Thread Paul Tomblin
I want to iterate through all the documents that are in the crawl, programmatically. The only code I can find does searches. I don't want to search for a term, I want everything. Is there a way to do this? -- http://www.linkedin.com/in/paultomblin

Re: How to index other fields in solr

2009-07-27 Thread Doğacan Güney
On Mon, Jul 27, 2009 at 09:34, Saurabh Suman wrote: > > I am using solr for searching. I used the class SolrIndexer. But I can search > on content only. I want to search on author also. How do I index on author? You need to write your own query plugin. Take a look at the query-basic plugin under s

Re: How to index other fields in solr

2009-07-27 Thread Paul Tomblin
Wouldn't that be using facets, as per http://wiki.apache.org/solr/SimpleFacetParameters On Mon, Jul 27, 2009 at 2:34 AM, Saurabh Suman wrote: > > I am using solr for searching. I used the class SolrIndexer. But I can search > on content only. I want to search on author also. How to i

How to index other fields in solr

2009-07-26 Thread Saurabh Suman
I am using solr for searching. I used the class SolrIndexer. But I can search on content only. I want to search on author also. How do I index on author? -- View this message in context: http://www.nabble.com/How-to-index-other-fields-in-solr-tp24674208p24674208.html Sent from the Nutch - User

Re: how to crawl a page but not index it

2009-07-15 Thread Jake Jacobson
of failure, but of succeeding at something that doesn't really matter. -- ANONYMOUS On Tue, Jul 14, 2009 at 8:32 AM, Beats wrote: > > hi, > > actually what i want is to crawl a web page say 'page A' and all its > outlinks. > i want to index all the content

Re: how to crawl a page but not index it

2009-07-14 Thread Beats
hi, actually what i want is to crawl a web page, say 'page A', and all its outlinks. i want to index all the content gathered by crawling the outlinks, but not 'page A'. is there any way to do it in a single run? with Regards Beats be...@yahoo.com SunGod wrote: > &

Re: how to crawl a page but not index it

2009-07-13 Thread SunGod
h first > > 2009/7/13 Beats > > >> can anyone help me on this.. >> >> i m using solr to index the nutch doc. >> So i think prune tool will not work. >> >> i do not want to index the document taken from a particular set of sites >> >> wi

Re: how to crawl a page but not index it

2009-07-13 Thread SunGod
test/segments/20090628160619 loop steps 3 - 5; writing a bash script to run this is best! next time please use google search first 2009/7/13 Beats > > can anyone help me on this.. > > i m using solr to index the nutch doc. > So i think prune tool will not work. > > i do not want

Re: how to crawl a page but not index it

2009-07-13 Thread Beats
can anyone help me on this.. i'm using solr to index the nutch docs, so i think the prune tool will not work. i do not want to index the documents taken from a particular set of sites. with regards Beats -- View this message in context: http://www.nabble.com/how-to-crawl-a-page-but-not-index-it

how to crawl a page but not index it

2009-07-11 Thread Beats
Hi all, I want to crawl a page and then crawl all its outlinks and index the content of those crawled outlinks. The problem is I don't want to index the page from which I got these outlinks. Thanks in advance -- View this message in context: http://www.nabble.com/how-to-crawl-a-page-but
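A URL filter won't work here, since blocking 'page A' would also stop it from being fetched and its outlinks from being discovered. An indexing filter, on the other hand, can drop a document at index time (by returning null) while the crawl still fetches it. The Nutch plugin wiring is version-specific, so the sketch below shows only the exclusion logic, self-contained; the +/- rule syntax (borrowed from regex-urlfilter's conventions), class name, and example URLs are all assumptions:

```java
import java.util.List;
import java.util.regex.Pattern;

/**
 * Sketch: decide whether a crawled URL should be indexed.
 * Rules mimic regex-urlfilter syntax: a line starting with '-'
 * excludes matching URLs, '+' includes them; first match wins.
 */
public class IndexRules {
    private final List<String> rules;

    public IndexRules(List<String> rules) {
        this.rules = rules;
    }

    public boolean shouldIndex(String url) {
        for (String rule : rules) {
            char sign = rule.charAt(0);
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {
                return sign == '+';
            }
        }
        return true; // no rule matched: index by default
    }

    public static void main(String[] args) {
        IndexRules rules = new IndexRules(List.of(
            "-^http://example\\.com/pageA$", // crawl, but do not index, page A
            "+."                             // index everything else
        ));
        System.out.println(rules.shouldIndex("http://example.com/pageA"));    // false
        System.out.println(rules.shouldIndex("http://example.com/outlink1")); // true
    }
}
```

Inside a real IndexingFilter, `shouldIndex(url)` returning false would translate to returning null from `filter(...)`, so the document is skipped while its outlinks remain in the crawl.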

Re: How to parse and index content field of RSS-Feed?

2009-07-10 Thread Beats
Hi, I'm getting the same problem here. Are there some changes that need to be done before using the "feed" plugin? I'm getting a parsing error. Felix Zimmermann-2 wrote: > Hi, > Is there an easy way to parse and index the content field of feeds wi
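One common cause of "same problem here" with feeds is that the feed plugin simply isn't enabled. In Nutch 1.0 it must appear in `plugin.includes`; the exact value below is an assumption and should be merged with your existing setting in conf/nutch-site.xml, not copied blindly:

```xml
<!-- Hypothetical nutch-site.xml override: enable the "feed" plugin
     alongside the usual parse/index plugins. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|feed|index-basic|query-(basic|site|url)</value>
</property>
```

If the plugin is enabled and parsing still fails, the feed itself may not be well-formed XML; validating it separately helps narrow that down.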

Re: Index weightings of different types of text node...h1, h2 anchor etc..

2009-07-09 Thread Magnús Skúlason
Yes, that is correct. In order to do that you could modify the parser to store the content of special tags in another field that you would give a higher boost. Best regards, Magnus On Thu, Jul 9, 2009 at 3:30 PM, Joel Halbert wrote: > Hi, Would I be correct in thinking that Nutch, when indexin

Index weightings of different types of text node...h1, h2 anchor etc..

2009-07-09 Thread Joel Halbert
Hi, Would I be correct in thinking that Nutch, when indexing an html document, does not weight the different text nodes (h1, h2, anchor etc) differently - instead it just lumps together all text as one? (this is the impression I get from looking at org.apache.nutch.parse.html.HtmlParser) Rgs, Joe
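A self-contained sketch of the approach from the reply above: pull heading text out during parsing so it can go into a separate, higher-boosted field instead of being lumped into the single "content" field. The regex extraction here is purely illustrative (Nutch's actual HtmlParser walks a DOM, and regexes are fragile against real-world HTML); the class name is an assumption:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch: extract <h1>/<h2> text so it can be stored in a separate,
 * higher-boosted index field rather than merged into "content".
 */
public class HeadingExtractor {
    private static final Pattern HEADING =
        Pattern.compile("<h[12][^>]*>(.*?)</h[12]>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String extractHeadings(String html) {
        StringBuilder sb = new StringBuilder();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            if (sb.length() > 0) sb.append(' ');
            // strip any nested tags inside the heading
            sb.append(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>body text</p><h2>Sub</h2>";
        System.out.println(extractHeadings(html)); // prints "Title Sub"
    }
}
```

At index time the extracted string would be added as its own field (e.g. "headings") with a boost above the default, so heading matches rank higher than body matches.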

Re: Problems when index .chm files

2009-07-06 Thread Ken Krugler
Example1: Error parsing: http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm: org.apache.nutch.parse.ParseException: parser not found for contentType=chemical/x-chemdraw url=http://localhost/mydocs/Programacion/Web/Ajax/Ajax.H

Problems when index .chm files

2009-07-06 Thread Yaidel Guedes Beltran
Example1: Error parsing: http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Hacks.Tips.and.Tools.for.Creating.Responsive.Web.Sites.Mar.2006.chm: org.apache.nutch.parse.ParseException: parser not found for contentType=chemical/x-chemdraw url=http://localhost/mydocs/Programacion/Web/Ajax/Ajax.Ha
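The error above says the server reported the .chm files with content type `chemical/x-chemdraw` (ChemDraw), for which no parser is registered. One fix is to make Nutch map the .chm extension to its correct MIME type so a suitable parser (or parse-ext with an external converter) can be matched. A sketch, assuming the Nutch 1.0 mime-type configuration format, which varies across versions:

```xml
<!-- Hypothetical entry for Nutch's MIME-type configuration: map .chm to
     Compiled HTML Help instead of the server-reported chemical/x-chemdraw. -->
<mime-type name="application/vnd.ms-htmlhelp" description="Compiled HTML Help">
  <ext>chm</ext>
</mime-type>
```

Even with the type corrected, a .chm parser still has to be available; otherwise the files will fail with the same "parser not found" error under the new content type.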

list documents within nutch index

2009-06-18 Thread dimi
How can I list the documents stored in the Nutch index? I have indexed some file.zip to get information about the data within the .txt file (using parse-text). Below ~/nutch-1.0/index/ are some binary files, and I want to know if that .txt information went into the Nutch index. Please advise
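Two ways to inspect what went into a Nutch 1.0 crawl from the command line are dumping a segment's parsed text and dumping the crawldb; the directory names below are placeholders for whatever your crawl actually produced:

```
# Dump the parsed content stored in one segment (segment name is hypothetical):
bin/nutch readseg -dump crawl/segments/20090618000000 segdump

# List the URLs and status known to the crawl database:
bin/nutch readdb crawl/crawldb -dump crawldump
```

If the .txt content appears in the segment dump, it was parsed; whether it also reached the Lucene index then depends on the indexing step having run over that segment.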

Re: spliting an index

2009-06-17 Thread Dennis Kubes
Short answer: no way to do it currently. Now for the long answer. You can handle searching in two ways: 1) Have a single massive index and segments, merge everything including segments and indexes. Then split the indexes and segments (don't forget having to split the segments othe
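For the distributed-search side of this discussion, Nutch's search front end can fan a query out to several back-end servers, each holding one slice of the index and segments. A sketch of the wiring, assuming Nutch 1.0's distributed search; host names and the port are placeholders:

```
# conf/search-servers.txt: one "host port" pair per line; each host runs
# a Nutch search server (bin/nutch server <port> <crawl-dir>) over its slice
node1 9999
node2 9999
```

The front end merges results from all listed servers, which is why the hard part is the split itself (index and segments must be partitioned consistently), not the search.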

Re: spliting an index

2009-06-16 Thread lei wang
I agree with you that we should split up the index at the stage of indexing. We are thinking on the same page. Maybe we can read the index file directory and segments directory via the Nutch API, split the segments dir by documents, and build an index on each segments file? Nutch claims that it a

Re: spliting an index

2009-06-16 Thread Alexander Aristov
Off the top of my head, I am afraid there is no direct solution. The index can be split at the stage of indexing by setting the parameter for the number of URLs in an index, but you also have segments, which store page content, and you cannot alter them after you have done indexing. Probably the solution might

Re: spliting an index

2009-06-16 Thread beyiwork
I am considering this problem now; can anyone help? charlie w wrote: > With regard to distributed search I see lots of discussion about splitting > the index, but no actual discussion about specifically how that's done. > I have a small, but growing, index. Is it
