[
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-471.
--
Resolution: Fixed
> Fix synchronization in NutchBean creat
[
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512712
]
Dennis Kubes commented on NUTCH-471:
Ah, sorry, my configuration was the problem. If you don't upgrad
[
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes reopened NUTCH-471:
This patch breaks the search.jsp with a null pointer because the nutch bean is
no longer created in
Ooops... gotta remember to do that. Done.
Dennis
Chris Mattmann wrote:
> On 6/25/07 8:34 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>
>> Author: kubes
>> Date: Mon Jun 25 20:33:59 2007
>> New Revision: 550669
>>
>> URL: http://svn.apache.org/viewvc?view=rev&rev=550669
>> Log:
>> NUTCH-4
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-497.
--
Issue resolved and committed.
> Extreme Nested Tags causes StackOverflowException in DomContentUt
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-497.
Resolution: Fixed
Committed with revision 550669
> Extreme Nested Tags cau
If no one has any objections, I will go ahead and commit this.
Dennis Kubes
Dennis Kubes (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Dennis Kubes up
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap3.patch
added nested-tags-trap3.patch with apache grant
> Extreme Nes
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap2.patch
added nested-tags-trap2.patch with apache grant
> Extreme Nes
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: (was: nested-tags-trap3.patch)
> Extreme Nested Tags causes StackOverflowException
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: (was: nested-tags-trap2.patch)
> Extreme Nested Tags causes StackOverflowException
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap3.patch
Adds a utility class called NodeWalker which allows a generic
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506894
]
Dennis Kubes commented on NUTCH-497:
I agree, I think it would be better to have something generic if we are
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506725
]
Dennis Kubes commented on NUTCH-497:
Doğacan, that is correct. By using the stack we shouldn't
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap2.patch
Patch with the curNodeDepth removed. The patch file is nested
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506596
]
Dennis Kubes commented on NUTCH-497:
The newest patch is the nested-tags-trap.patch file.
> Extreme Nested T
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap.patch
This patch reworks DomContentUtils.getOutlinks to use a stack
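As an illustration of the general idea only (this is not the NUTCH-497 patch, and the class and method names below are hypothetical), a DOM tree can be walked with an explicit stack instead of recursion, so that nesting depth consumes heap rather than call-stack frames. The sketch assumes outlinks are simply the href attributes of a elements:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch: iterative (non-recursive) DOM traversal.
public class IterativeDomWalk {

    /** Collects href values of <a> elements in document order, without recursion. */
    public static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                String href = ((Element) node).getAttribute("href");
                if (!href.isEmpty()) {
                    hrefs.add(href);
                }
            }
            // Push children in reverse so the first child is visited first.
            NodeList children = node.getChildNodes();
            for (int i = children.getLength() - 1; i >= 0; i--) {
                stack.push(children.item(i));
            }
        }
        return hrefs;
    }

    /** Helper for trying the walker on a small, well-formed XML string. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Calling IterativeDomWalk.collectHrefs(IterativeDomWalk.parse(...)) on a small document returns the href values in document order. The pending nodes live in the ArrayDeque, so very deep nesting grows the heap rather than overflowing the thread stack.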
Is this the same java 6 error that was popping up a while back? For
some reason with java 6 the XML is being parsed differently in the SWF
parser and therefore unit tests looking for exact strings were failing.
Could this be happening in the feed parser as well?
Dennis Kubes
Chris Mattmann
Congratulations Doğacan, it is good to have you on board.
Dennis
Andrzej Bialecki wrote:
> Hi all,
>
> I'm glad to announce that the Lucene PMC has voted to add Doğacan Güney
> as Nutch committer.
>
> Welcome, Doğacan! There are 192 open issues in Nutch JIRA waiting to be
> solved ... just di
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: ExtremeNestedTags.patch
This is a rudimentary fix for those that want a workaround for
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0, 0.8.1, 1.0.0
Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0
Some webpages have a form of a spider trap that causes a
Take a look at this from the wiki:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
It shows how to create a patch from SVN. To apply a patch to your
source code you would use the patch command (on linux) like this:
patch -p0 < your_patch_file.patch
Dennis Kubes
Manoharam Reddy wr
explicitly specify all plugins in the plugin.includes
configuration variable.
Dennis Kubes
Manoharam Reddy wrote:
> I was observing plugins.include property of my nutch-site.xml
>
> It has:
>
> protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-
the file.content.limit and ftp.content.limit options in
your nutch-site.xml file.
Dennis Kubes
Manoharam Reddy wrote:
> Time and again I get this error and as a result the segment remains
> incomplete. This wastes one iteration of the for() loop in which I am
> doing generate, fetch a
what happens when java/nutch gets a hostname
> that is obviously malformed?
I believe it should throw a malformed URL exception.
Dennis Kubes
>
> -Brian
>
>
>
>
> On May 6, 2007, at 11:00 AM, Andrzej Bialecki wrote:
>
>> Brian Whitman wrote:
>>> Got thi
SIGSEGV is usually the result of hardware errors. At least that is what
I have found in the past. I would run memtest on the machine to check
for bad memory.
Dennis Kubes
Brian Whitman wrote:
> Got this segfault + crash when fetching in the middle of a large fetch.
> Seems to be in l
Without more information, this sounds like your Tomcat search
nutch-site.xml file is set up to use the DFS rather than the local file
system. Remember that processing jobs occurs on the DFS but for
searching, indexes are best moved to the local file system.
Dennis Kubes
JoostRuiter wrote:
>
I run the crawler through Nutch all the time. What are the specific
errors that you are getting?
Dennis Kubes
Tanmoy Kumar Mukherjee wrote:
> Hi .
> I am having certain problems in running the nutch crawler on eclipse
> after having followed the tutorial on Nutch wiki. It says ca
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks()
which is called from org.apache.nutch.parse.html.HtmlParser. Running
some simple tests on your fragment below I get no outlink for this.
What version of Nutch are you running?
Dennis Kubes
Ian Holsman wrote:
>
Andrzej Bialecki wrote:
> wangxu wrote:
>> Has anybody thought of replacing CrawlDb with any kind of relational
>> DB, MySQL, for example?
>>
>> CrawlDb is so difficult to manipulate.
>> I often have the requirement to edit several entries in CrawlDb;
>> But that would cost too much waiting for the
Yeah, I agree, I just didn't know how to proceed with the new branch
structure. I will go ahead and put it into the trunk if there are no
objections from anyone.
Dennis
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> That works. I created the JIRA and attached your pat
That works. I created the JIRA and attached your patch. It passes all
build tests and works on my 150K run across my 5 machine dev cluster.
Should we go ahead and commit this?
Dennis
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> Ok, I ran some bigger test crawls > 150K wi
[
https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-467:
---
Attachment: nutch-467.patch
Submitted by Andrzej Bialecki.
> DeleteDuplicate fails if Segment in
Components: indexer
Affects Versions: 0.9.0
Environment: all
Reporter: Dennis Kubes
Fix For: 0.9.0
If any of the segment indexes have 0 documents, then the DDRecordReader in
DeleteDuplicates throws an IndexOutOfBoundsException. The record reader needs
to
[X] +1 Release the packages as Apache Nutch 0.9
[ ] -1 Do not release the packages because...
Andrzej Bialecki wrote:
> Chris Mattmann wrote:
> [..]
>> [ ] +1 Release the packages as Apache Nutch 0.9
>> [ ] -1 Do not release the packages because...
>
> +1.
>
>
-
ng it up.
My guess would be that this is a small bug within the lucene libraries
when the directories have 0 results. What is everyone's opinion on this
in terms of the release? My vote would be to move forward with the release.
Dennis Kubes
Task Id : task_0027_m_03_
Chris,
I have updated changes and resolved and closed the issue. Sorry about
not getting to it sooner.
Dennis Kubes
Chris Mattmann wrote:
> Hi Dennis,
>
> Thanks for taking care of this. :-) Could you update CHANGES.txt as well?
> Once you take care of that, in about 2 hrs (whe
[
https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-333.
--
> SegmentMerger and SegmentReader should use Nutch
[
https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-333.
Resolution: Fixed
Issue resolved
> SegmentMerger and SegmentReader should use Nutch
[
https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-333:
---
Attachment: use-nutch-job_patch.txt
updated patch, submitted by Doğacan Güney
> SegmentMerger
ime, Los Angeles, PST)
> on removing the tag, and starting the process over again.
>
> In the meanwhile, Dennis, do you have the patch that fixes the issue with
> Hadoop? If so, could you commit it ASAP to the trunk. Once that's done,
> I'll remove the tag, and star th
+1
Andrzej Bialecki wrote:
> Hi all,
>
> I know it's a trivial issue, but still ... When this release is out, I
> propose that we should name the next release 1.0.0, and not 0.10.0. The
> effect is purely psychological, but it also reflects our confidence in
> the platform.
I think that a 1.
again (large scale)
> 5. if all goes well, finish release process
> 6. tag tags/release-0.9
I agree with this process.
>
> Thoughts?
>
> Thanks!
>
> Cheers,
> Chris
>
>
> On 3/28/07 10:35 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:
Yes. This seems to have fixed the problem. All, do we want to create a
JIRA and commit this for the 0.9 release?
Dennis
Andrzej Bialecki wrote:
> Doğacan Güney wrote:
>> Hi,
>>
>> On 3/28/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>>>
>>> This is
class loading.
Dennis Kubes
Dennis Kubes wrote:
> I spoke too soon. Below is the output of errors on mergesegs. This
> looks more like a Hadoop issue to me, but I will need to dig into it. It
> also may be something that I am doing on my end. This was a merge of
> three differe
ahead.
Dennis Kubes
java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.ClassNotFoundException: org.apache.nutch.metadata.MetaWrapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:344)
at
org.apache.hadoop.mapred.JobConf.getOutputValue
[X] +1 Release the packages as Apache Nutch 0.9
[ ] -1 Do not release the packages because...
I have been running some bigger crawls with the release this morning.
Everything looks good.
Dennis Kubes
Chris Mattmann wrote:
> Hi Folks,
>
> I have posted a candidate for the Apache
Let me know if I can help in any way?
Dennis Kubes
Chris Mattmann wrote:
> Hi Folks,
>
> As your friendly neighborhood 0.9 release manager, I just wanted to give
> you all a heads up that I'd like to begin the release process today. If I
> hear no objections by 00:00:00 UT
You would need to setup your logging configuration to include INFO in
the log4j.properties file in the conf directory.
Dennis Kubes
z0mbi3 wrote:
> Hi,
> I m new to nutch. I have been trying to understand the working of the opic
> scoring plugin but have certain issues:
>
> I
I worked through this swf issue a little more and it seems that java 6
parses out the content differently than java 5. My guess is that it is
some type of collection change from 5 to 6 because it looks like only
the ordering of the elements is different.
Dennis Kubes
Sample
Help
It shouldn't be too much trouble to attack this with the logging changes.
Dennis Kubes
Chris Mattmann wrote:
> Hey Doug,
>
> Do you think we should do this in Nutch too? I'm in favor of doing this --
> what does everyone else feel?
>
> Th
I did an update, clean, and test and got no errors.
BUILD SUCCESSFUL
Total time: 6 minutes
Sami Siren wrote:
> 2007/3/21, Andrzej Bialecki <[EMAIL PROTECTED]>:
>>
>> Sami Siren wrote:
>> > for me it works:
>> >
>> > ...
>> > BUILD SUCCESSFUL
>> > Total time: 4 minutes 3 seconds
>>
>> I did a fresh
I am good to go as well.
Dennis Kubes
Andrzej Bialecki wrote:
> Sami Siren wrote:
>> Andrzej Bialecki wrote:
>>> Hi all,
>>>
>>> I just committed Hadoop 0.12.1. Let's double-check that it works ok.
>>> Here's the list of Critical/Blocker
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>>>
>>> Could you perhaps create a JIRA issue and attach the patches from the
>>> current trunk/ to your 0.12.1-based version? As soon as 0.12.1 is out
>>> the door we can upgrade, and then finally wrap
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Fix For: 0.9.0
Attachments: hadoop-0.12.1-dev-core.jar
This JIRA contains the new hadoop-0.12.1-dev-core.jar as of revision 518636. As
far as I can tell this jar doesn't break any of the current Nutch trunk
[
https://issues.apache.org/jira/browse/NUTCH-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-459:
---
Attachment: hadoop-0.12.1-dev-core.jar
hadoop-0.12.1-dev-core.jar as of revision 518636
> Upgr
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> The crawl for 1M pages completed successfully. There was an issue
>> with doing a copyToLocal but that has already been filed as a HADOOP
>> bug and the patch will be included in 0.12.x
>>
>
>
> That
s that take a lot of RAM so I have the childopts set to
1024M. For standard fetching I don't know how much difference it would
make.
Dennis Kubes
>
> Thanks
> Marc
>
> On 3/14/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>> Marc Bouch
ll case. I don't have any benchmarks as of yet but I will keep the
list informed of our progress.
Dennis Kubes
>
> Thanks
> Marc Boucher, aTerra
: 23022
min score: 0.0090
avg score: 0.173
max score: 2119.167
status 1 (db_unfetched):9899275
status 2 (db_fetched): 667354
status 3 (db_gone): 11195
status 4 (db_redir_temp): 219507
status 5 (db_redir_perm): 41839
Dennis Kubes
Andrzej Bialecki
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I agree there may be subtle bugs.
>>
>> I can do say a full dmoz crawl (~5M pages) with nutch trunk and hadoop
>> 12.1 on a small cluster of 5 machines if this would help? We have
>> already
>>
>
so) then wrote it all to disk (in the
Yes. The Hadoop team implemented an in-memory buffer and spill-to-disk
functionality. I believe the amount stored in memory before spills is
configurable.
Dennis Kubes
> Hadoop temp directory) at once. During the write operation, which lasted
> no
lp? We have already
done some crawls > 100K urls with 11.2 without problems. I say let's test
it and if there aren't any significant issues then let's go with 12.1 if
the hadoop team thinks it will be more stable.
One question though, are there any concerns about upgrading clu
the unreleased changes section). Could you please append your changes to the
> end of the file, and recommit?
>
> Thanks a lot!
>
> Cheers,
> Chris
Sorry about that. I saw the warning message thinking it was a version
break. Everything should be fixed now.
Dennis Kube
[
https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-233.
--
Issue closed
> wrong regular expression hang reduce process for e
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-436.
Resolution: Fixed
Patch tested on 10,000 URL run with no apparent issues. Reviewed and committed
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-436.
--
Issue closed.
> Incorrect handling of relative paths when the embedded URL path is em
[
https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-233.
Resolution: Fixed
The new regex has been added to both the regex-urlfilter.txt and the
crawl
Steve Severance wrote:
>> -Original Message-
>> From: Dennis Kubes [mailto:[EMAIL PROTECTED]
>> Sent: Friday, March 09, 2007 9:47 AM
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: How to read data from segments
>>
>>
>>
>> Steve S
Dennis Kubes wrote:
>> Dennis Kubes wrote:
>>> I was looking through the JIRA to try and help create a list for this
>>> release and to say the least it is a little overwhelming. It looks
>>> like there are 183 issues total with 152 being unassigned. What has
this year the lack of detailed
> information for new developers was cited as a barrier to more involvement. I
> would be happy to contribute this back to the wiki if there is interest.
Absolutely. The more documentation we have, especially for new
develope
/jira/browse/NUTCH-457
Project: Nutch
Issue Type: Task
Environment: N/A
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Priority: Minor
The KEYS file contains public keys of committers and is used to sign releases.
According to a
> Dennis Kubes wrote:
>> I was looking through the JIRA to try and help create a list for this
>> release and to say the least it is a little overwhelming. It looks
>> like there are 183 issues total with 152 being unassigned. What has
>> been the current process fo
re of, as soon as you all give me
> the green light.
Good by me.
>
> So, please, committer-brethren, let me know what you think about 1-3, as it
> would help me understand how to move forward.
>
> Thanks!
>
> Cheers,
> Chris
>
>
Dennis Kubes
g when I said that I would email Piotr. It's too bad that
> this has turned out to be an issue that I've handled incorrectly, and for
> that, I apologize. I will do my best to thoroughly vet all such discussions
> on the nutch list in the future.
No issues with me.
Dennis Kub
Chris Mattmann wrote:
> Hi Guys,
>
>> Blocker
>>
>> * NUTCH-400 (Update & add missing license headers) - I believe this is
>> fixed and should be closed
>
> +1, thanks to Sami for closing it.
>
>> * NUTCH-353 (pages that serverside forwards will be refetched every
>> time) - this was
That is a problem with the hadoop.log.dir value not being set. It is trying to
use the DRFA appender to a file and can't find the log directory.
Dennis
Gal Nitzan wrote:
>
> Just installed latest from trunk.
>
> I run mergesegs and I get the following error in all tasks log files (I use
> default log4
OK. I finally figured out how to republish the site. Only took me 3
days. Feeling hazed now! :)
Dennis Kubes
Sami Siren wrote:
> Welcome on board Dennis!
>
> --
> Sami Siren
>
> Dennis Kubes wrote:
>> Hi All,
>>
>> Thank you Andrzej for your kind wo
NUTCH-436 has a patch now if we want to add that to this release.
Dennis Kubes
Andrzej Bialecki wrote:
> Sean Dean wrote:
>> As for which Hadoop version is included in the next Nutch release, I
>> share the same concern as Sami with 0.10.1 as it NPE's on anything
>>
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-436:
---
Attachment: NUTCH-436-20070304.patch
NUTCH-436-20070304.patch handles correct encoding of the params
ersion of Nutch to be soon followed by a minor stable release ...
+1 for using 0.11.2. I looked through the release notes for 0.12 and
there were some niceties such as HADOOP-432 for undeletes and a lot of bug
fixes, but it didn't look like there were any critical issues as far as
Nutch is con
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes reassigned NUTCH-436:
--
Assignee: Dennis Kubes
> Incorrect handling of relative paths when the embedded URL path
: Dennis Kubes
Assigned To: Chris A. Mattmann
Fix For: 0.9.0, 0.8.1
There are currently log guards (i.e. is*Enabled type code) in many different
places in the code. NUTCH-309 is related to removing those log guards. The
caveat is that debug level log guards should be
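For readers unfamiliar with the term, a log guard is a level check wrapped around a log call so that an expensive message is only built when it will actually be written. The sketch below is only an illustration under stated assumptions: it uses java.util.logging (isLoggable) to stay self-contained rather than the is*Enabled methods mentioned above, and the class and method names are hypothetical, not Nutch code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of a log guard; not taken from the Nutch source.
public class LogGuardSketch {
    private static final Logger LOG = Logger.getLogger(LogGuardSketch.class.getName());

    // Counts how often the costly message was built, for demonstration only.
    static int expensiveCalls = 0;

    static {
        // Assumed configuration: debug-level (FINE) output is switched off.
        LOG.setLevel(Level.INFO);
    }

    /** Stands in for a costly string build, e.g. dumping a large structure. */
    static String expensiveDump() {
        expensiveCalls++;
        return "state=" + System.nanoTime();
    }

    /** Guarded: the message is only built if FINE is enabled. */
    public static void logGuarded() {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("dump: " + expensiveDump());
        }
    }

    /** Unguarded: the message is always built, then discarded by the logger. */
    public static void logUnguarded() {
        LOG.fine("dump: " + expensiveDump());
    }
}
```

With FINE disabled, logGuarded() never calls expensiveDump(), while logUnguarded() pays for the string build on every call. That cost only matters for messages that are expensive to construct, which is the usual argument for treating debug-level guards differently from the rest.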
. I am 28
and have been programming for about 12 years.
So as first commit I need to add my name and re-publish the website.
Let the hazing begin.
Dennis Kubes
Andrzej Bialecki wrote:
> Hi all,
>
> Some time ago I proposed to Lucene PMC that Dennis should become a Nutch
> committer
I can also work on this, Chris do you want me to do it or do you want to
coordinate our efforts?
Dennis Kubes
Jérôme Charron wrote:
> Hi Chris,
>
> The JIRA issue is the 309 : https://issues.apache.org/jira/browse/NUTCH-309
> Thanks for your help.
>
> Jérôme
>
> On
[
https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-448:
---
Attachment: plugin-fromfile.patch
The plugin-fromfile.patch file contains the functionality for
Environment: all platforms
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Priority: Minor
Fix For: 0.9.0
This functionality allows the plugin.includes and plugin.excludes values to be
moved out of the nutch-default.xml and nutch-site.xml files and loaded
[
https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474713
]
Dennis Kubes commented on NUTCH-447:
This tool is for people who need a defined category structure or want to
[
https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-447:
---
Attachment: dmoz-structure.patch
Patch that contains the DmozStructureParser class.
> Dmoz Struct
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Priority: Minor
This is a tool that will take the dmoz structure RDF file and return a listing
of the categories. The categories returned can be limited by depth or by regular
expression pattern. This tool borrows heavily from
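A rough sketch of what limiting by depth or pattern could look like follows. It is not the DmozStructureParser code: the class name, the slash-separated category format such as Top/Arts/Music, and the definition of depth as the number of path segments are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not the DmozStructureParser from the patch.
public class CategoryFilterSketch {

    /** Assumed definition: depth is the number of slash-separated segments. */
    static int depth(String category) {
        return category.split("/").length;
    }

    /**
     * Keeps categories that satisfy both limits.
     * maxDepth <= 0 means no depth limit; a null pattern means no pattern limit.
     */
    public static List<String> filter(List<String> categories, int maxDepth, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String category : categories) {
            boolean depthOk = maxDepth <= 0 || depth(category) <= maxDepth;
            boolean patternOk = pattern == null || pattern.matcher(category).find();
            if (depthOk && patternOk) {
                kept.add(category);
            }
        }
        return kept;
    }
}
```

For example, filter(categories, 2, null) would keep only entries with at most two segments, and filter(categories, 0, Pattern.compile("^Top/Arts")) would keep only entries under Top/Arts.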
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474355
]
Dennis Kubes commented on NUTCH-247:
We could move the code to a utility class but if we want it to be called
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-247:
---
Attachment: agent-names3.patch.txt
This patch logs and throws an exception if the agent name is not
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474068
]
Dennis Kubes commented on NUTCH-247:
I agree, but then should we approach the check as a configurable option
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-247:
---
Attachment: agent-names.patch
This patch removes the checks and severe logging from the
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes reassigned NUTCH-247:
--
Assignee: Dennis Kubes
> robot parser to restr
Ahhh... Now I get it :)
Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> Sorry. I am still not getting this. I understand the reason but I am
>> not seeing how it works.
>
> Ah, because apparently it doesn't ... :( You were right, the first job
> consists on
come
from in the Reducer.
Dennis Kubes
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi Andrzej,
>>
>> Does it mean that when you inject an existing (in crawldb) a URL it
>> changes
>> its status to STATUS_DB_UNFETCHED?
>>
>>
>
> With the c
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473295
]
Dennis Kubes commented on NUTCH-247:
I think the idea here is to NOT allow people to run fetchers for which they
pdfbox software to parse PDF files so you may want to take the
specific file and see if it parses correctly outside of nutch using pdfbox.
Dennis Kubes
Armel T. Nene wrote:
> Dennis
>
> I was wondering if this patch could fix my problem which is, if not the
> same, very similar to
Actually I take it back. I don't think it is the same problem but I do
think it is the right solution.
Dennis Kubes
Dennis Kubes wrote:
> This has to do with HADOOP-964. Replace the jar files in your Nutch
> versions with the most recent versions from Hadoop. You will also need