-servers.txt) and that you can bring down a single
search server, update the index and pieces, and then bring the single
search server back up. This way the entire index is never down.
Hope this helps and let me know if you have any questions.
Dennis Kubes
--
Doğacan Güney
that, perhaps we can stop creating
a ParseUtil instance for every ParseSegment.map [even though it has a
smaller overhead]).
--
Doğacan Güney
-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find
Luca Rondanini
Doğacan Güney wrote:
On 7/25/07, Luca Rondanini [EMAIL PROTECTED] wrote:
this is my hadoop log(just in case):
2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment:
/home/semantix/nutch
to a slowish indexing
filter like language-identifier?)
Dennis
--
Doğacan Güney
--
Doğacan Güney
--
Doğacan Güney
create patches:
http://wiki.apache.org/nutch/HowToContribute )
Cheers,
Carl.
--
Doğacan Güney
--
Doğacan Güney
--
Doğacan Güney
reading plugin's source code.
Thanks
--
Doğacan Güney
think it will solve
your problem but Content.java has changed recently so I am not sure
what was in line 146. So, if problem reoccurs with latest trunk I can
check exactly which line is failing. Alternatively, you can send that
part of Content.java's code.
Cheers,
Carl.
--
Doğacan Güney
(db_redir_perm): 4
CrawlDb statistics: done
Luca Rondanini
--
Doğacan Güney
Luca Rondanini
--
Doğacan Güney
there is such a difference and is there some way to
eliminate part of this overhead ?
Regards,
--
Marc
--
Doğacan Güney
-Original Message-
From: Doğacan Güney [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 25, 2007 2:44 PM
To: [EMAIL PROTECTED]
Subject: Re: RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower
than Lucene ?)
On 7/25/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
I
commit it.
--
Doğacan Güney
-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net
--
Doğacan Güney
--
Doğacan Güney
/hadoop.log or your tasktracker's log files and you should see a
more detailed log about your problem.
--
Doğacan Güney
- Original Message
From: Doğacan Güney [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, June 27, 2007 10:59:52 PM
Subject: Re: Possibly use a different library to parse RSS feed for improved
performance and compatibility
On 6/28/07, Kai_testing Middleton [EMAIL PROTECTED
and it will work too.
--Kai Middleton
- Original Message
From: Doğacan Güney [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, June 22, 2007 1:39:12 AM
Subject: Re: Possibly use a different library to parse RSS feed for improved
performance and compatibility
On 6/21/07, Kai_testing
appreciated :).
Thanks
Rob
--
Doğacan Güney
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--
Doğacan Güney
$ ant clean
$ ant
--
Doğacan Güney
___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
--
Doğacan Güney
On 6/26/07, Sami Siren [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
Hi,
On 6/26/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Is this actually planned (addition of SolrIndexer to Nutch)?
A search for SolrIndexer in JIRA got no hits.
There is NUTCH-442 (one of the most popular
command parses the segment.
Thanks
--
Doğacan Güney
is happening :( when i remove the recommended
plugin from there the search.jsp page is displayed normally
If you are using tomcat, please start it in 'run' mode(./catalina.sh
run) and check if tomcat prints an exception.
please help, it's really urgent
--
Doğacan Güney
help
On 6/24/07, Doğacan Güney [EMAIL PROTECTED] wrote:
On 6/24/07, karan [EMAIL PROTECTED] wrote:
hi
i just tried to build the recommended plugin that is given in the plugin
writing example. when i included the plugin in the plugin.includes property
the search.jsp shows nothing
on/reviewing NUTCH-505 would be a nice place to start :).
--
Doğacan Güney
, with your fetcher patch
applied. I will report back with the result when the process is done.
- Original Message
From: Doğacan Güney [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, June 19, 2007 7:12:32 AM
Subject: Re: Indexing problems in nutch-nightly
Here is the patch
(NUTCH-504, rev 550196). See discussion here:
http://www.nabble.com/Indexing-problems-in-nutch-nightly-tf3923427.html
for why the problem occurs.
Conf:
1 single machine
Linux 2.6, Java 1.6
nutch nightly + hadoop 0.12.3
Thanks in advance for your help
--
Doğacan Güney
for your help
--
Doğacan Güney
the parameters like the one above.
On 6/24/07, Doğacan Güney [EMAIL PROTECTED] wrote:
On 6/24/07, karan [EMAIL PROTECTED] wrote:
hey...
thanks for the reply. tomcat in run mode does generate exceptions at the
terminal :) and the output shows the plugin is in the registered list of plugins
On 6/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
These 'urls' most likely come from parse-js plugin. Can you disable it
and see if they disappear? To extract links from js code, parse-js
uses a heuristic that unfortunately also may extract garbage urls
On 6/23/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
On 6/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
These 'urls' most likely come from parse-js plugin. Can you disable it
and see if they disappear? To extract links from js code, parse
On 6/23/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
On 6/23/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
On 6/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
These 'urls' most likely come from parse-js plugin. Can
--
Doğacan Güney
--
Doğacan Güney
a little closer to the building
of the Lucene query (and allow this behaviour) via a Nutch plugin?
Andrzej Bialecki is working on this - see NUTCH-479.
Thanks
Rob
--
Doğacan Güney
server I would love to hear about it.
The former suggestions of space and architecture are what we have
experienced.
Dennis Kubes
--
Doğacan Güney
and your index size will grow very
large.
-Brian
--
Doğacan Güney
to access segmented data.
Best regards,
Ronny
-Original Message-
From: Doğacan Güney [mailto:[EMAIL PROTECTED]
Sent: 20 June 2007 08:14
To: [EMAIL PROTECTED]
Subject: Re: Lucene client and nutch index
On 6/20/07, Naess, Ronny [EMAIL PROTECTED] wrote:
I tried your tip Brian
is the best to do to
gain in term of performance and to stay enough polite ?
That's kind of between you and the server you are fetching but I
wouldn't recommend a delay lower than 5 seconds.
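(For reference, a delay like that is usually set in nutch-site.xml; the property name below is from Nutch's default configuration, and the value is only an example:)

```xml
<!-- wait 5 seconds between successive requests to the same server -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
```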
More tricks to gain performance are welcome
E
--
Doğacan Güney
-tf3788992.html
I have put up a patchified version here:
http://www.ceng.metu.edu.tr/~e1345172/segment_reader_hang.patch
Can you retry with this patch?
Thanks!
--
Doğacan Güney
Here is the patch for the fetchers:
http://www.ceng.metu.edu.tr/~e1345172/parse_in_fetchers.patch
--
Doğacan Güney
fun:)
Anyway, it seems you are running into the problem described here:
http://www.nabble.com/bug-in-SegmentReader-tf3788992.html
I have put up a patchified version here:
http://www.ceng.metu.edu.tr/~e1345172/segment_reader_hang.patch
Can you retry with this patch?
Thanks!
--
Doğacan Güney
EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 seconds (3.473E-4 days)
Score: 1.0
Signature: c079280b4afb4347372982d5a034d51b
Metadata: _ngt_:1181243348572 _pst_:success(1), lastModified=0
- Original Message
From: Doğacan Güney
pages that return 200.
You can fix this by putting status code in Content's Metadata then
only parsing pages that have status code 200. (or, nutch stores page's
headers in content's metadata. You can check if content's metadata has
a location header).
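A minimal sketch of that idea, using a plain map to stand in for Content's metadata (the key names here are illustrative assumptions, not Nutch's actual constants):

```java
import java.util.Map;

// Sketch: decide whether a fetched page should be parsed, based on a status
// code stored in the content's metadata, as suggested above. A Location
// header is treated as a redirect and skipped.
public class ParseGate {
    static final String STATUS_KEY = "status-code"; // hypothetical key name
    static final String LOCATION_KEY = "Location";  // HTTP redirect header

    // Parse only pages that returned 200 and carry no redirect target.
    public static boolean shouldParse(Map<String, String> metadata) {
        if (metadata.containsKey(LOCATION_KEY)) {
            return false; // page redirected elsewhere; don't parse it
        }
        return "200".equals(metadata.get(STATUS_KEY));
    }
}
```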
--
Doğacan Güney
is necessary for
politeness). So, in your case, you either have very few hosts (of
which one has almost 100K urls) or there is a problem with
partitioning.
Patrik
[...snip...]
--
Doğacan Güney
<a href="./servlet/cached?<%= id %>">link</a> to download it directly.
<% } %>
You can get a url's ParseText with bean.getParseText(details).
--
Doğacan Güney
to a value greater than 1
and you have a very impolite fetcher. Please don't run this to fetch
a site you don't control :)
--
Doğacan Güney
like this:
int h = 0;
Iterator<Entry<K,V>> i = entrySet().iterator();
while (i.hasNext())
h += i.next().hashCode();
return h;
So if configuration's hashCode changes, CACHE's hashCode also changes.
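That cascading behaviour is easy to demonstrate with plain java.util collections; a small sketch (not Nutch code) showing that mutating an object stored in a map changes the map's own hashCode:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Demonstrates the point above: a map's hashCode is the sum of its entries'
// hashCodes, so mutating a stored (mutable) value changes the map's hash too.
public class HashCascade {
    // Returns true if mutating a value held by the map changed the map's hashCode.
    public static boolean hashChangesOnMutation() {
        Map<String, List<Integer>> cache = new HashMap<>();
        List<Integer> conf = new ArrayList<>();
        cache.put("conf", conf); // "conf" stands in for a Configuration key
        int before = cache.hashCode();
        conf.add(42);            // mutate the stored object in place
        return before != cache.hashCode();
    }
}
```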
Thanks for the detailed analysis!
Enzo
--
Doğacan
On 6/8/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
- Original Message -
From: Doğacan Güney [EMAIL PROTECTED]
Sent: Friday, June 08, 2007 3:49 PM
[...]
Any idea?
This will certainly help a lot. If it is not too much trouble, can you
add debug outputs for hashCodes of conf
On 6/8/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
- Original Message -
From: Doğacan Güney [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, June 08, 2007 8:27 PM
Subject: Re: Loading mechanism of plugin classes and singleton objects
[...]
This is strange, because, as you
to implement the management
of cookies in Nutch.
Thanks
--
Doğacan Güney
cookies across fetcher, well I am not sure
how to do it:) Perhaps, you can write an extra job that puts the
cookie to every datum from that host, then pick it up in fetcher. Or
perhaps someone has a better idea :)
Thanks
--
Doğacan Güney
machine.
(By the way, what does "Until Nutch runtime" mean here? Before Nutch
runtime, no class whatsoever is supposed to be alive in the JVM, is it?)
Enzo
--
Doğacan Güney
--
Doğacan Güney
code (at least for gzip) and do compression. It will just
be very slow :).
Can you create an OutputFormat in CrawlDbMerger and set compression
type to BLOCK manually? You can take a look at ParseOutputFormat's
code as an example.
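(With the Hadoop of that era, block compression could also be requested through configuration; the properties below are Hadoop's standard mapred settings, shown as a sketch:)

```xml
<!-- compress job output as block-compressed SequenceFiles -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```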
Any clues ?
--
Doğacan Güney
. Plugin protocol-httpclient uses
commons-httpclient library, nutch disables redirects in this library
because nutch handles redirects itself.
--
Doğacan Güney
minds are what make reality real
--
Doğacan Güney
Content-Type: text/html
So, I'm lost.
On 6/1/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 6/1/07, Briggs [EMAIL PROTECTED] wrote:
So, I have been having huge problems with parsing. It seems that many
urls are being ignored because the parser plugins throw
(plugin.includes property).
How can I make it parse these type of content while crawling?
And if I run the fetch in non-parsing mode how can I make it parse
them later and update it in crawl folder.
Please help.
--
Doğacan Güney
or point
me to proper articles or wiki where I can learn this.
On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Time and again I get this error and as a result the segment remains
incomplete. This wastes one iteration
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
Hi everyone,
Has anyone tried Fetcher2 from latest trunk? On our tests, Fetcher2 is
always slower (by a large margin) than Fetcher
On 5/31/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
I am still not sure about the source of this bug, but I think I found
some unnecessary waits in Fetcher2. Even if a url is blocked by
robots.txt (or has a crawl delay larger than max.crawl.delay),
Fetcher2 still
to display the parsed content of the PDF instead of
this message?
As its name implies, cached content shows url's content:) . What you
want to see is its parse text. Nutch doesn't do this but it is simple
to change it so that it reads from segment/parse_text instead of
segment/content .
--
Doğacan Güney
? Are there other URL filters? If so, in
what order are the filters called?
--
Doğacan Güney
)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
--
Doğacan Güney
, what do I need
to check. Please help.
In your case, crawl-urlfilter.txt is not read because you are not
running 'crawl' command (as in bin/nutch crawl). You have to update
regex-urlfilter.txt or prefix-urlfilter.txt and make sure that you
enable them in your conf.
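A minimal regex-urlfilter.txt, for example, follows Nutch's +/- regex convention (the domain here is a placeholder):

```
# accept urls under one host (example.com is a placeholder)
+^http://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
```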
--
Doğacan Güney
, but could take up to a week to complete.
The new cluster was supposed to fix that and make this easier...
It looks like your problem is related to
https://issues.apache.org/jira/browse/NUTCH-246 .
Jeff
-Original Message-
From: Doğacan Güney [mailto:[EMAIL PROTECTED]
Sent: Friday, May 25
fetch those urls?)
Thanks for the help.
Jeff
--
Doğacan Güney
]
--
Doğacan Güney
are
fetching?
I have ~3 urls with ~1000 hosts. Hosts have at most 500 urls and
there are 23 hosts that have 500 urls. I generally run Fetcher with
100-200 threads and Fetcher2 with 50 threads.
-vishal.
[snip]
--
Doğacan Güney
http.content.limit. And parse-pdf can't parse partial pdf files.
--
Doğacan Güney
(Fetcher finished in 1 hour, Fetcher2 in about 2.5). Though I
have performed other tests where their performance is similar(and I
have no idea why). I am trying to find the cause of problem, but so
far, had no luck.
Otis
[snip]
--
Doğacan Güney
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
Hi everyone,
Has anyone tried Fetcher2 from latest trunk? On our tests, Fetcher2 is
always slower (by a large margin) than Fetcher.
For a segment with ~3 urls, we ran Fetcher with 150 threads and
Fetcher2
things) calculate score.
There are 6 CrawlDatum fields and all of them are exactly identical.
Is this a bug or am I missing something here?
Any light on this matter would be greatly appreciated.
Thank you.
Florent
--
Doğacan Güney
to get any result when you use a query
type:pdf with the webapp on your index ?
yes.
type:pdf something
cu
*pike
--
Doğacan Güney
of
question, I would suggest using Luke (http://www.getopt.org/luke/).
You can view each document to check whether type field is indexed
correctly, then you can do a search in Luke to see if that works.
--
Doğacan Güney
] wrote:
It should look like this but change out domain for your domain. Try
this and let me know if it works.
127.0.0.1 dhcppc0.domain.com dhcppc0 localhost.localdomain localhost
Dennis Kubes
--
Doğacan Güney
to feed these urls to java.net.URL you get this exception. It is
not a big deal (computation continues ignoring that url candidate)
though it may be a bit annoying.
[snip]
--
Doğacan Güney
be appreciated. Thanks!
--
View this message in context:
http://www.nabble.com/Plugin-to-index-categories-by-url-rules-tf3621139.html#a10112854
Sent from the Nutch - User mailing list archive at Nabble.com.
--
Doğacan Güney
) will assume
that you are accessing /user/username/relative_path. You either
have to put your crawldb there or configure nutch to use local fs or
change generate's arguments.
[snip]
--
Doğacan Güney
-
Take Surveys. Earn Cash
what the problem is then. Can you include
the output of commands:
hadoop dfs -ls /nutch/filesystem/crawl/
hadoop dfs -ls /nutch/filesystem/crawl/crawldb
Any other ideas?
--
Oleg.
--
Doğacan Güney
on a different sort value.
The second part can be written with a different scoring plugin. Simply
put whatever it is you need in CrawlDatum's metadata then change
ScoringFilter.generatorSortValue to look up that value and give a
good/bad score.
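As a sketch of that second part (plain Java with a hypothetical "priority" metadata key, not Nutch's actual ScoringFilter types):

```java
import java.util.Map;

// Illustrates a generatorSortValue-style hook: look up a flag previously
// written into the datum's metadata and boost or keep the base sort value.
// The key name and boost factor are illustrative assumptions.
public class SortValueSketch {
    static final String PRIORITY_KEY = "priority"; // hypothetical metadata key

    public static float generatorSortValue(float baseScore, Map<String, String> metadata) {
        if ("high".equals(metadata.get(PRIORITY_KEY))) {
            return baseScore * 10.0f; // flagged urls sort (and get generated) first
        }
        return baseScore;             // everything else keeps its base score
    }
}
```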
[snip]
--
Doğacan Güney
the indexerScore method to give it an even
higher boost.
-Brian
--
Doğacan Güney
sdeck wrote:
That sort of gets me there in understanding what is going on.
Still not all the way though.
So, let's look at the trunk of deleteduplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
Nowhere in there do I see
Doğacan Güney wrote:
Hi,
After hadoop-0.9.1, parsing and indexing don't seem to work.
If you parse while fetching then it is fine, but if you run parse as a
different job, it creates an essentially empty parse_data
directory(which has index files, but doesn't have data files). I am
looking
Daniel López wrote:
Hi again,
I finally ignored the RTF and MP3 plugins and was able to compile
Nutch from scratch and then proceeded to create my own web search
application.
I get it up and running and I'm now displaying the same information as
the demo search pages that come with
Hi,
Feng Ji wrote:
hi there,
I got the huge percentage of fetching error for httpclient in hadoop
log as
followings:
httpclient.HttpMethodDirector
:
httpclient.HttpMethodDirector - Redirect requested but followRedirects is
disabled
:
I am not sure if this is an error. Plugin