I'm so new to Nutch that I wasn't sure yet how to tie the feature into
a configuration file, but here's a first-pass hardcoded version that
seems to do OK. At least on the perfectly clean data that I've been
feeding it. It probably blows up if someone forgets their
tag. I'd definitely like to see
Hi
I am sorry, it should be the getTextHelper() method.
Say I want to index the content in this block:
This is not Ads
The code may look like this:
boolean inContent = false;
if (node.getNodeType() == Node.COMMENT_NODE) {
  // you can move the marker values to your configuration file
  String marker = node.getNodeValue().trim();
  if ("index-start".equals(marker)) inContent = true;    // hypothetical start marker
  else if ("index-end".equals(marker)) inContent = false; // hypothetical end marker
}
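Filling this idea out into a self-contained sketch, assuming comment markers named "index-start" and "index-end" (placeholders of my own; the original post's marker strings were lost in the archive, and real code would live inside a Nutch HTMLParseFilter rather than a standalone class):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class CommentMarkerFilter {

    // Walk the DOM, toggling a shared flag on the marker comments and
    // collecting only the text nodes seen while the flag is on.
    static void collect(Node node, StringBuilder sb, boolean[] inContent) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            String marker = node.getNodeValue().trim();
            if ("index-start".equals(marker)) inContent[0] = true;    // hypothetical marker
            else if ("index-end".equals(marker)) inContent[0] = false; // hypothetical marker
        } else if (node.getNodeType() == Node.TEXT_NODE && inContent[0]) {
            sb.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, sb, inContent);
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<body>Some Ads<!--index-start-->This is not Ads"
                    + "<!--index-end-->More Ads</body>";
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(html.getBytes("UTF-8")));
        StringBuilder sb = new StringBuilder();
        collect(doc.getDocumentElement(), sb, new boolean[] { false });
        System.out.println(sb.toString());  // prints: This is not Ads
    }
}
```

The flag travels in a one-element array so the single boolean is shared across the recursive calls, which keeps the "inside the markers" state correct even when the markers and the text are at different nesting depths.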
Check the list for my earlier discussions. There are
tweaks you can do to enhance the performance if you
have available memory resources.
How large are your segments that you are indexing?
What file system do you use? What OS/JVM are you
building your index on?
-byron
--- "R.Mayoran" <[EMAIL PR
I would recommend that you search the list for some
great discussions on NDFS. Doug has a nice writeup of
his vision of using a map reduce job to push the
indexes to your query servers so they're updated as
the webdb is, and managed that way.
NDFS just wasn't designed for the I/O of a query. You
wa
Teruhiko Kurosaka wrote:
Andrzej,
Thank you for explanation.
No, in this case, if "web" and "services" were added to
common-grams.utf8, the result would look like:
web|web-services, services|services-is, cool
where | marks tokens indexed at the same position in the index.
I guess
Andrzej,
Thank you for explanation.
> No, in this case, if "web" and "services" were added to
> common-grams.utf8, the result would look like:
>
> web|web-services, services|services-is, cool
>
> where | marks tokens indexed at the same position in the index.
I guess you meant common-terms.utf
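The position-aligned output quoted above can be modeled with a toy sketch. This is only an illustration of the idea, not the actual CommonGrams algorithm; the common-term set, the stop-word handling, and the `token|gram` rendering are all assumptions made to reproduce the quoted example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {

    // Toy model: each common term also emits a bigram with its successor
    // at the SAME position (rendered here as "token|gram"); stop words
    // survive only inside a gram, never as standalone tokens.
    static List<String> analyze(String[] tokens, Set<String> common, Set<String> stop) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (stop.contains(tokens[i])) continue;  // dropped as a unigram
            StringBuilder position = new StringBuilder(tokens[i]);
            if (common.contains(tokens[i]) && i + 1 < tokens.length) {
                position.append('|').append(tokens[i]).append('-').append(tokens[i + 1]);
            }
            out.add(position.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = analyze(
            new String[] { "web", "services", "is", "cool" },
            new HashSet<>(Arrays.asList("web", "services")),
            new HashSet<>(Arrays.asList("is")));
        // prints: web|web-services, services|services-is, cool
        System.out.println(String.join(", ", result));
    }
}
```

Indexing the gram at the same position as the unigram is what lets a phrase query match either form without changing the positions of the surrounding terms.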
Hi, I asked this question a while back and didn't get a response, so I rolled
my own parse solution using jericho-html and applying it to the
HTMLParseFilter
extension point.
I just took a look at the getText() method of the DOMContentUtils
class and I don't see any way to add your own custo
Teruhiko Kurosaka wrote:
I thought n-grams are used for language identification only but
I see they are used in another area.
In the source code of CommonGrams and the API doc:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html
I see (tokens representing) n-gram
I thought n-grams are used for language identification only but
I see they are used in another area.
In the source code of CommonGrams and the API doc:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html
I see (tokens representing) n-grams are "inserted" to the toke
So just use the ndfs command to download the relevant files from NDFS,
put them on the search server, and from there follow the sample in
your documentation project?
Thanks for all the help.
P.S. Do you have a clear view for the solution to the "slowness in
search over NDFS"? If so, I wou
There will be a solution soon, if I find some more time. Until then,
for smaller installations you need a shell script that downloads the
index and segments to the box that runs the search server.
You can also move the index from NDFS to local instead of copying it.
Check "bin/nutch ndfs" for documentation.
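A command sketch of that shell script (the flags and the paths here are assumptions; run `bin/nutch ndfs` with no arguments to see the exact usage for your Nutch version):

```shell
# List what is stored in NDFS (path is illustrative):
bin/nutch ndfs -ls /crawl

# Copy the index and segments out of NDFS onto the local search box:
bin/nutch ndfs -get /crawl/index /local/search/index
bin/nutch ndfs -get /crawl/segments /local/search/segments

# "Move instead of copy": copy out, then delete the NDFS copy,
# so only one set of the data remains.
bin/nutch ndfs -rm /crawl/index
```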
Whoa, that was fast...
So all in all you would need two sets of the same data?
Did I understand correctly that there is an effort to improve the "poor performance" issue?
And if we are at it, would you care to explain how to download the index
to local and what happens if the data is growing over the boundar
Download index to a local file system.
Am 28.12.2005 um 14:25 schrieb Gal Nitzan:
Hi,
If using search over NDFS is too slow, then what is the alternative
when all your data is in NDFS?
Thanks, Gal
Hi,
If using search over NDFS is too slow, then what is the alternative when
all your data is in NDFS?
Thanks, Gal
Yes, you need to use map reduce on several boxes.
Anyway, 100 million files will also work on a powerful box.
There are some configuration values in nutch-default.xml that can
improve indexing speed.
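For example (property names as found in a Nutch 0.7-era nutch-default.xml; verify them against your version, and treat the values as illustrative only):

```xml
<!-- Larger values buffer more documents in RAM before merging
     index segments on disk; both trade memory for indexing speed. -->
<property>
  <name>indexer.mergeFactor</name>
  <value>50</value>
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
</property>
```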
Am 28.12.2005 um 09:56 schrieb R.Mayoran:
Hi,
I need to index about 100million files.
Is it
Hi
I have had no problem doing distributed crawl.
On 12/28/05, Pushpesh Kr. Rajwanshi <[EMAIL PROTECTED]> wrote:
> Hi NN,
>
> Thanks for replying to me. Actually I wanted to know if distributed crawling in
> nutch is working fine, and to what success? I am successful in setting
> up a distributed
Hi NN,
Thanks for replying to me. Actually I wanted to know if distributed crawling in
nutch is working fine, and to what success? I am successful in setting
up a distributed crawl for 2 machines (1 master and 1 slave), but when I try
with more than two machines there seems to be a problem, especially while I
Hi,
I need to index about 100million files.
Is it possible to cluster this job?
Are there any suggestions to increase the speed of indexing?
Thank you in advance.
Mayu.
I had exactly the same problem with JDK 1.5. Also, when I worked with
only one data node, the problem didn't occur.
Thanks
On 12/28/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Interesting!
> That is not a feature, that is a bug; maybe you can open a minor bug
> report.
> Thanks.
> Stefan
> Am 28.12