Hi,
After today's big update, it seems invertlinks doesn't work if a linkdb
doesn't exist already, because fs.exists checks the wrong directory
(linkdb/ but not linkdb/current).
A simple patch is attached.
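To illustrate the kind of check involved (only a sketch, not the attached patch): the code below uses java.nio on a local directory rather than Hadoop's FileSystem, and the names LinkDbCheck and hasCurrent are made up for the example. The idea is that the existence test should look at linkdb/current, presumably because linkdb/ itself can exist before linkdb/current has been written.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only; not part of Nutch.
class LinkDbCheck {
    // Name of the subdirectory holding the live data, as in linkdb/current.
    static final String CURRENT_NAME = "current";

    // Returns true only when linkdb/current exists as a directory.
    // Testing linkdb/ alone is not enough if linkdb/ can exist while empty.
    static boolean hasCurrent(Path linkDb) {
        return Files.isDirectory(linkDb.resolve(CURRENT_NAME));
    }

    public static void main(String[] args) throws IOException {
        // A fresh temporary directory stands in for an empty linkdb/.
        Path linkDb = Files.createTempDirectory("linkdb");
        System.out.println(hasCurrent(linkDb)); // false: no current/ yet

        Files.createDirectory(linkDb.resolve(CURRENT_NAME));
        System.out.println(hasCurrent(linkDb)); // true: current/ now exists
    }
}
```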
--
Doğacan Güney
Index: src/java/org/apache/nutch/crawl/LinkDb.java
Hi,
Armel T. Nene wrote:
Hi guys,
I want to extend Nutch to use real-time indexing on a local file system. I
have been through the source code to find out ways to modify values stored
in CrawlDB. The idea is simple:
I have an external program (or a script) which checks for changes in a
.
--
Doğacan Güney
Could something like that work?
Doug
Hi,
Doug Cutting wrote:
Doğacan Güney wrote:
I think it would make much more sense to change parse plugins to take
content and return Parse[] instead of Parse.
You're right. That does make more sense.
OK, then should I go forward with this and implement something? This
should be pretty
-feeds) very fast (with a 1 second delay), and then get the
items with a 5 second delay.)
(I hope it is not stupid to point out Yahoo's crawler to someone who
works at Yahoo :)
--
Doğacan Güney
Thanks,
Renaud
for
everything but plugins) within the day.
--
Doğacan Güney
Thanks,
Renaud
rubdabadub wrote:
Hi:
I am unable to get the attached patch via mail. It's better if you
create a JIRA issue and attach the patch there.
Thank you.
I don't know, this bug seems too minor to require its own JIRA issue.
So I put the patch to
--
Doğacan Güney
) think?
Thanks!
--
Doğacan Güney
file.
I put a simple patch at
http://www.ceng.metu.edu.tr/~e1345172/use-nutch-job.patch . Can you try it
with this?
--
Doğacan Güney
and
db.score.external.link is 1.0, filter will almost always distribute
less than its cash).
This will also work for your case, since you will just ignore the
outlinks and return the adjust datum based on information in parse
metadata.
What do you (and others) think?
Thanks!
--
Doğacan Güney
On 4/21/07, Lorenzo [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
On 4/19/07, Lorenzo [EMAIL PROTECTED] wrote:
Hi,
Sorry to re-open this thread, but I am facing the same problem as
Nicolás.
I like both your (Doğacan's) and Nicolás' ideas, yours more, as I think
abstract
classes
thread should wait crawl delay before making another
request to the same host. Am I missing something here?
--
Doğacan Güney
difference) to false to indicate lib-http shouldn't
handle blocking internally. Because of this, when you use Fetcher2,
lib-http still tries to block them which makes Fetcher2 much less
useful.
I am not sending a patch for this yet because I first want to get some
feedback on the first bug.
--
Doğacan
On 4/24/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
I don't get it. The code seems to do exactly the opposite of what you
are saying. If maxThreads == 1 then maxThreads > 1 is false, thus the
expression evaluates to minCrawlDelay not crawlDelay. Shouldn't the
expression
of all the pages in segment_dir/content. You can extract
individual contents with the command ./nutch readseg -get
segment_dir url -noparse -nofetch -nogenerate -noparsetext
-noparsedata.
Thanks for any help.
-Charlie
--
Doğacan Güney
the patch there.
For this case, there is a similar issue (with patch) at NUTCH-369.
Cheers,
Marcin
--
Doğacan Güney
if it solves your problem?
(*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch
Bye!
--
Doğacan Güney
/07, Doğacan Güney [EMAIL PROTECTED] wrote:
Hi,
On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:
I'm having big trouble with Nutch 0.9 that I didn't have with 0.8. It seems
that the plugin repository initializes itself all the time until I get
an out of memory exception. I've been
, like
fetcher.store.content shouldn't force loading plugins again, though it
seems it may be inevitable
Anyway, I'll try to build my own Nutch to test your patch.
Thanks!
--
Doğacan Güney
On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there:)
One important piece of information: in future versions of Hadoop the method
Configuration.setObject
On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:
On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
My patch is just a draft to see if we can create a better caching
mechanism. There are definitely some rough edges there:)
One important piece of information: in future
: info at sigram dot com
--
Conscious decisions by conscious minds are what make reality real
--
Doğacan Güney
based
Internet startup company. For more information please visit
http://www.ilial.com/crawler
[EMAIL PROTECTED])
--
Doğacan Güney
code. Like,
it returns tr for ISO-8859-9.
That being said, language identification is a very crucial feature and
if it doesn't work properly, well, someone should do something about
it :).
--
Sami Siren
--
Doğacan Güney
even harder.
--
Doğacan Güney
, invertlinks, index, dedup) and they
work fine.
--
Doğacan Güney
, Errors: 0, Time elapsed: 1.304 sec
--
Doğacan Güney
On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote:
This is rather strange. Here is part of the console output:
test:
[echo] Testing plugin: parse-swf
[junit] Running org.apache.nutch.parse.swf.TestSWFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec
/prefix-urlfilter.txt
Cheers,
Chris
On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED] wrote:
On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote:
This is rather strange. Here is part of the console output:
test:
[echo] Testing plugin: parse-swf
[junit] Running
[junit] Running org.apache.nutch.parse.feed.TestFeedParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.663 sec
BUILD SUCCESSFUL
Total time: 3 seconds
[XXX:src/plugin/feed] mattmann%
Any ideas?
Cheers,
Chris
On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED
On 6/20/07, Chris Mattmann [EMAIL PROTECTED] wrote:
On 6/20/07 7:17 AM, Doğacan Güney [EMAIL PROTECTED] wrote:
It never passes for me (not even when I do it in src/plugin/feed). If
you check the output, parseResult only contains a single entry, which
is rsstest.rss.
Okay, please tell me I'm
,
-vishal.
--
Doğacan Güney
it to -1 (which
means, store all outlinks).
--
Doğacan
you are responding to an imaginary person:)
or through email (in which case, part of the discussion doesn't get
documented in JIRA). Why doesn't this work?
--
Doğacan Güney
Message
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, June 26, 2007 10:56:32 PM
Subject: Re: NUTCH-119 :: how hard to fix
On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:
I am evaluating nutch+lucene as a crawl and search solution.
However, I am
: NUTCH-474
URL: https://issues.apache.org/jira/browse/NUTCH-474
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Andrzej Bialecki
Fix For: 1.0.0
.
Cheers,
Carl.
--
Doğacan Güney
.
--
Doğacan Güney
at sigram dot com
--
Doğacan Güney
Rob
07/07/25 11:52:00 WARN crawl.MapWritable: Unable to load meta data
entry, ignoring.. : java.io.IOException: unable to load class for id:
36
On 7/25/07, Doğacan Güney (JIRA) [EMAIL PROTECTED] wrote:
[
https://issues.apache.org/jira/browse/NUTCH-527?page
it and saying you won't merge it because something would be much
better than leaving it without a single comment. This may reduce your active
community.
Think of this.
Best regards,
Marcin Okraszewski
--
Doğacan Güney
://www.sigram.com Contact: info at sigram dot com
--
Doğacan Güney
On 8/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
If the same content is available under multiple urls, I think it makes
sense to assume that the url with the highest score should be 'the
representative' url.
Not necessarily - it depends how you defined your
On 8/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
Hmm. The index should somehow contain _all_ urls, which point to the
same document. I.e. when you search for url http://example.com it
should ideally return exactly the same Lucene document as when you
search
Java 1.6 and Ant 1.7.
CB
--
Doğacan Güney
comments/patches/etc. there. Btw, I agree that using a CSV is
better than using a new configuration parameter for every tag.
Regards,
Marcin
--
Doğacan Güney
-J
--
Doğacan Güney
Updating NUTCH-546
--
Doğacan Güney
(build 1.6.0_02-b05, mixed mode, sharing)
Anyways, I am going to commit a small fix that removes override
annotations so that code can be compiled.
Regards,
Susam Pal
http://susam.in/
On 9/11/07, Doğacan Güney [EMAIL PROTECTED] wrote:
On 9/11/07, [EMAIL PROTECTED]
[EMAIL PROTECTED] wrote
dot com
--
Doğacan Güney
).
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--
Doğacan Güney
On 9/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration
config);
public void prepareGeneratorConfig(Path crawlDb, Configuration config);
public void prepareIndexerConfig(Path crawlDb, Path linkDb
Recording test results
--
Doğacan Güney
elsewhere (logs/hadoop.log
directory if you are local and your tasktracker's logs if you are
running in distributed mode). If you can send those logs we can make a
more informed analysis about your problem.
Thank You!
- Sagar
--
Doğacan Güney
) by accident. You should check your
plugin.includes option in nutch-site.xml, there is probably something
wrong with that. Perhaps, you put a new line there?
- Sagar
--
Doğacan Güney
.
thanks
--
Eyal Edri
--
Doğacan Güney
, you can send some statistics
regarding overhead of running two extra jobs or fetch performance
increase as a result of smarter url ordering. Again personally, I find
that patches with such numbers and test cases are a lot easier to
review (thus, easier to commit:).
--
Doğacan Güney
.
--
Best regards,
Andrzej Bialecki
--
Doğacan Güney
. Configuration error?
Updating NUTCH-548
Updating NUTCH-494
Updating NUTCH-547
Updating NUTCH-538
--
Doğacan Güney
and first run this conversion job then run
requested job.
I personally favor starting from scratch when switching versions, but
there are probably users who wish to convert older data. Or are there?
--
Sami Siren
--
Doğacan Güney
--
Doğacan Güney
there that we could learn?
--
Doğacan Güney
On Sat, Sep 27, 2008 at 10:32 PM, Nimesh Priyodit [EMAIL PROTECTED] wrote:
Hi,
Recently I developed my own stemmer.
Can you please tell me how to integrate the module I wrote into
Nutch?
Where exactly do you want to integrate it? Into indexing?
Regards,
Nimesh
--
Doğacan
-nocontent
This will give you parsed text of segments.
Thanks for your help
--
Allan Roberto Avendaño Sudario
Guayaquil-Ecuador
Home : +593(4) 2800 692
Office : +593(4) 2269 268
+ MSN-Messenger: [EMAIL PROTECTED]
+ Gmail: [EMAIL PROTECTED]
--
Doğacan Güney
--
Doğacan Güney
one around?)
Dennis
--
Doğacan Güney
: info at sigram dot com
--
Doğacan Güney
1.0 out the door already :D
Oh and finally we should do a review of all libraries in nutch
(libraries in plugins included) and update them to latest versions. I
am going to open an issue with the intention of updating all the
libraries that do not require code changes.
--
Doğacan Güney
I forgot: I think there is a huge bug with MapWritable in Nutch. I
haven't yet figured out exactly what it is, but it has something to do
with the fact that id-class maps are static.
On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
And here is a list of issues from me
OK one last thing: Get rid of Fetcher and promote Fetcher2 to be the
default fetcher.
On Thu, Nov 27, 2008 at 7:15 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
I forgot: I think there is a huge bug with MapWritable in nutch. I
didn't yet figure out what it is
exactly but it has something to do
On Thu, Nov 27, 2008 at 11:40 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
It seems I wrote the patch in NUTCH-92. My recollection was that you
wrote it, Andrzej :D
No, I didn't - you did! :) I only came up with the proposal, after
discussing it with Doug.
Anyway, I
)
at
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
at
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
... 10 more
Should I file both bugs on JIRA ?
This I am not sure, but did you try ant clean; ant? It may be a
version mismatch.
--
Doğacan Güney
On Thu, Dec 4, 2008 at 11:33 AM, brainstorm [EMAIL PROTECTED] wrote:
On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney [EMAIL PROTECTED] wrote:
On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote:
Using nutch 0.9 (hadoop 0.17.1):
[EMAIL PROTECTED] working]$ bin/nutch readlinkdb
you,
Vlad
--
Doğacan Güney
!
Piotr
On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) j...@apache.org wrote:
Update external jars to latest versions
---
Key: NUTCH-680
URL: https://issues.apache.org/jira/browse/NUTCH-680
Project
- Nutch
- Original Message
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 10:49:44 AM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
versions
2009/1/20 Piotr Kosiorowski :
pmd-ext contains PMD (http
have pmd)?
http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 1:13:20 PM
this
is happening? What am I doing wrong? Thanks.
Maybe there are no inlinks to that page so inlinks is null? What is
the exception exactly?
--
Doğacan Güney
that were the result of fixes for problems reported by
pmd. Or maybe they run pmd by hand?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 3
will try it now.
Thank you!
I have no information about the exception. It seems the program simply
skips this part of the code... maybe a ScoringFilterException is
thrown?
On Wed, Jan 21, 2009 at 9:47 AM, Doğacan Güney doga...@gmail.com wrote:
On Tue, Jan 20, 2009 at 7:18 PM, Pau pau
--
Doğacan Güney
:)
Anyway, my non-binding +1.
--
Sami Siren
--
Doğacan Güney
On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote:
Doğacan Güney wrote:
On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote:
Hello,
I have packaged the first release candidate for Apache Nutch 1.0
release at
http://people.apache.org/~siren/nutch-1.0/rc0/
See
On Mon, Mar 9, 2009 at 17:46, Sami Siren ssi...@gmail.com wrote:
Doğacan Güney wrote:
On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com
mailto:ssi...@gmail.com wrote:
Doğacan Güney wrote:
On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com
mailto:ssi...@gmail.com wrote
Again, my non-binding +1 :)
On 10.Mar.2009, at 09:34, Sami Siren ssi...@gmail.com wrote:
Hello,
I have packaged the second release candidate for Apache Nutch 1.0
release at
http://people.apache.org/~siren/nutch-1.0/rc1/
See the CHANGES.txt[1] file for details on release contents and
--
Doğacan Güney
but I think we should hold it for 1.1 (or something else) if it
requires architectural changes (thus needs review and testing).
--
Sami Siren
--
Doğacan Güney
://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ jars.
Thanks in advance
--
Rodrigo Reyes C.
--
Doğacan Güney
Do not release the packages because...
Here's my +1
Thanks!
[1]
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511
--
Sami Siren
--
Doğacan Güney
--
Doğacan Güney
/%7Emattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
--
Doğacan Güney
not release the packages because...
Here's my +1
Thanks!
[1]
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511
--Sami Siren
--
Doğacan Güney
?
Best regards
George Herlin
--
Doğacan Güney
running Lucene 2.1.0. Any idea why I am getting the
ArrayIndexOutOfBoundsException?
Nic
--
Doğacan Güney
at
org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:79)
I have no idea why we get a 503 there?
--
Doğacan Güney
Recording test results
--
Doğacan Güney
you guys think?
--
Doğacan Güney
1 - 100 of 115 matches