[
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512712
]
Dennis Kubes commented on NUTCH-471:
Ah, sorry, my configuration was the problem. If you don't upgrade
If no one has any objections, I will go ahead and commit this.
Dennis Kubes
Dennis Kubes (JIRA) wrote:
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-497.
--
Issue resolved and committed.
Extreme Nested Tags causes StackOverflowException in DomContentUtils
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-497.
Resolution: Fixed
Committed with revision 550669
Extreme Nested Tags causes StackOverflowException
Ooops, gotta remember to do that. Done.
Dennis
Chris Mattmann wrote:
On 6/25/07 8:34 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Author: kubes
Date: Mon Jun 25 20:33:59 2007
New Revision: 550669
URL: http://svn.apache.org/viewvc?view=rev&rev=550669
Log:
NUTCH-497: Fixes problems
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: (was: nested-tags-trap2.patch)
Extreme Nested Tags causes StackOverflowException
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: (was: nested-tags-trap3.patch)
Extreme Nested Tags causes StackOverflowException
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap2.patch
added nested-tags-trap2.patch with apache grant
Extreme Nested
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap3.patch
added nested-tags-trap3.patch with apache grant
Extreme Nested
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506894
]
Dennis Kubes commented on NUTCH-497:
I agree, I think it would be better to have something generic if we
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap.patch
This patch reworks DomContentUtils.getOutlinks to use a stack
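The general idea of swapping a recursive DOM walk for an explicit stack can be sketched roughly as below. This is only an illustration under assumptions, not the attached patch: the class name IterativeLinkWalker and both of its methods are hypothetical, and it uses only the JDK's org.w3c.dom API rather than anything from Nutch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; not the code from nested-tags-trap.patch.
public class IterativeLinkWalker {

    /** Collects href values of <a> elements using an explicit stack, so
     *  nesting depth consumes heap rather than call-stack frames. */
    public static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                String href = ((Element) node).getAttribute("href");
                if (!href.isEmpty()) {
                    hrefs.add(href);
                }
            }
            // Push children last-to-first so they are popped in document order.
            for (Node child = node.getLastChild(); child != null;
                    child = child.getPreviousSibling()) {
                stack.push(child);
            }
        }
        return hrefs;
    }

    /** Builds a test document: one <a> wrapped in `depth` nested <div> elements. */
    public static Document deeplyNested(int depth) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element link = doc.createElement("a");
        link.setAttribute("href", "http://example.com/");
        Node inner = link;
        // Wrap from the inside out so each append touches a parentless node.
        for (int i = 0; i < depth; i++) {
            Element wrapper = doc.createElement("div");
            wrapper.appendChild(inner);
            inner = wrapper;
        }
        doc.appendChild(inner);
        return doc;
    }
}
```

Calling IterativeLinkWalker.collectHrefs(IterativeLinkWalker.deeplyNested(10000)) walks the whole tree in a single loop; the pending nodes sit in the ArrayDeque instead of in nested method calls.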
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506596
]
Dennis Kubes commented on NUTCH-497:
The newest patch is the nested-tags-trap.patch file.
Extreme Nested Tags
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: nested-tags-trap2.patch
Patch with the curNodeDepth removed. The patch file is nested
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506725
]
Dennis Kubes commented on NUTCH-497:
Doğacan, that is correct. By using the stack we shouldn't get
Congratulations Doğacan, it is good to have you on board.
Dennis
Andrzej Bialecki wrote:
Hi all,
I'm glad to announce that the Lucene PMC has voted to add Doğacan Güney
as Nutch committer.
Welcome, Doğacan! There are 192 open issues in Nutch JIRA waiting to be
solved ... just dive in!
[
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-497:
---
Attachment: ExtremeNestedTags.patch
This is a rudimentary fix for those that want a workaround
Take a look at this from the wiki:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
It shows how to create a patch from SVN. To apply a patch to your
source code you would use the patch command (on Linux) like this:
patch -p0 < your_patch_file.patch
Dennis Kubes
Manoharam Reddy wrote
to explicitly specify all plugins in the plugin.includes
configuration variable.
Dennis Kubes
Manoharam Reddy wrote:
I was observing plugins.include property of my nutch-site.xml
It has:
<value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url
the file.content.limit and ftp.content.limit options in
your nutch-site.xml file.
Dennis Kubes
Manoharam Reddy wrote:
Time and again I get this error and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update
believe it should throw a malformed URL exception.
Dennis Kubes
-Brian
On May 6, 2007, at 11:00 AM, Andrzej Bialecki wrote:
Brian Whitman wrote:
Got this segfault + crash when fetching in the middle of a large
fetch. Seems to be in looking up a hostname?
Is this by any chance
SIGSEGV is usually the result of hardware errors. At least that is what
I have found in the past. I would run memtest on the machine to check
for bad memory.
Dennis Kubes
Brian Whitman wrote:
Got this segfault + crash when fetching in the middle of a large fetch.
Seems to be in looking up
Without more information, this sounds like the nutch-site.xml file for
your Tomcat search application is set up to use the DFS rather than the
local file system. Remember that processing jobs run on the DFS, but for
searching, indexes are best moved to the local file system.
Dennis Kubes
JoostRuiter wrote:
Hi
Andrzej Bialecki wrote:
wangxu wrote:
Has anybody thought of replacing CrawlDb with some kind of relational
DB, MySQL for example?
CrawlDb is so difficult to manipulate.
I often need to edit several entries in the crawldb,
but that costs too much time waiting for the MapReduce job.
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks()
which is called from org.apache.nutch.parse.html.HtmlParser. Running
some simple tests on your fragment below, I get no outlinks for this.
What version of Nutch are you running?
Dennis Kubes
Ian Holsman wrote:
Hi.
I'm
I run the crawler through Nutch all the time. What are the specific
errors that you are getting?
Dennis Kubes
Tanmoy Kumar Mukherjee wrote:
Hi .
I am having certain problems in running the nutch crawler on eclipse
after having followed the tutorial on Nutch wiki. It says cannot build
[X] +1 Release the packages as Apache Nutch 0.9
[ ] -1 Do not release the packages because...
Andrzej Bialecki wrote:
Chris Mattmann wrote:
[..]
[ ] +1 Release the packages as Apache Nutch 0.9
[ ] -1 Do not release the packages because...
+1.
Components: indexer
Affects Versions: 0.9.0
Environment: all
Reporter: Dennis Kubes
Fix For: 0.9.0
If any of the segment indexes have 0 documents, then the DDRecordReader in
DeleteDuplicates throws an IndexOutOfBoundsException. The record reader needs
[
https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-467:
---
Attachment: nutch-467.patch
Submitted by Andrzej Bialecki.
DeleteDuplicate fails if Segment index
That works. I created the JIRA and attached your patch. It passes all
build tests and works on my 150K run across my 5 machine dev cluster.
Should we go ahead and commit this?
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Ok, I ran some bigger test crawls 150K with the 0.9RC
Yeah, I agree, I just didn't know how to proceed with the new branch
structure. I will go ahead and put it into the trunk if there are no
objections from anyone.
Dennis
Andrzej Bialecki wrote:
Dennis Kubes wrote:
That works. I created the JIRA and attached your patch. It passes
all
guess would be that this is a small bug within the lucene libraries
when the directories have 0 results. What is everyone's opinion on this
in terms of the release? My vote would be to move forward with the release.
Dennis Kubes
Task Id : task_0027_m_03_3, Status : FAILED
task_0027_m_03_3
? If so, could you commit it ASAP to the trunk? Once that's done,
I'll remove the tag, and start the release process over again, and get an RC
out for a vote. Then, we can move forward from there.
I will do this immediately.
Dennis Kubes
Thanks, guys!
Cheers,
Chris
I still propose
[
https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-333:
---
Attachment: use-nutch-job_patch.txt
updated patch, submitted by Doğacan Güney
SegmentMerger
[
https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-333.
Resolution: Fixed
Issue resolved
SegmentMerger and SegmentReader should use NutchJob
[
https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-333.
--
SegmentMerger and SegmentReader should use NutchJob
Chris,
I have updated changes and resolved and closed the issue. Sorry about
not getting to it sooner.
Dennis Kubes
Chris Mattmann wrote:
Hi Dennis,
Thanks for taking care of this. :-) Could you update CHANGES.txt as well?
Once you take care of that, in about 2 hrs (when I get home
[X] +1 Release the packages as Apache Nutch 0.9
[ ] -1 Do not release the packages because...
I have been running some bigger crawls with the release this morning.
Everything looks good.
Dennis Kubes
Chris Mattmann wrote:
Hi Folks,
I have posted a candidate for the Apache Nutch 0.9 release
.
Dennis Kubes
java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.ClassNotFoundException: org.apache.nutch.metadata.MetaWrapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:344)
at
org.apache.hadoop.mapred.JobConf.getOutputValueClass
It shouldn't be too much trouble to attack this with the logging changes.
Dennis Kubes
Chris Mattmann wrote:
Hey Doug,
Do you think we should do this in Nutch too? I'm in favor of doing this --
what does everyone else feel?
Thanks!
Cheers,
Chris
I am good to go as well.
Dennis Kubes
Andrzej Bialecki wrote:
Sami Siren wrote:
Andrzej Bialecki wrote:
Hi all,
I just committed Hadoop 0.12.1. Let's double-check that it works ok.
Here's the list of Critical/Blocker issues I mentioned before, and their
current status:
Any other stuff
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Fix For: 0.9.0
Attachments: hadoop-0.12.1-dev-core.jar
This JIRA contains the new hadoop-0.12.1-dev-core.jar as of revision 518636. As
far as I can tell this jar doesn't break any of the current Nutch trunk code
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Could you perhaps create a JIRA issue and attach the patches from the
current trunk/ to your 0.12.1-based version? As soon as 0.12.1 is out
the door we can upgrade, and then finally wrap up our release.
Do you want me to create a JIRA issue
: 23022
min score: 0.0090
avg score: 0.173
max score: 2119.167
status 1 (db_unfetched):9899275
status 2 (db_fetched): 667354
status 3 (db_gone): 11195
status 4 (db_redir_temp): 219507
status 5 (db_redir_perm): 41839
Dennis Kubes
Andrzej Bialecki
as of yet but I will keep the
list informed of our progress.
Dennis Kubes
Thanks
Marc Boucher, aTerra
have the childopts set to
1024M. For standard fetching I don't know how much difference it would
make.
Dennis Kubes
Thanks
Marc
On 3/14/07, Dennis Kubes [EMAIL PROTECTED] wrote:
Marc Boucher wrote:
Dennis,
I'm curious what kind of hardware your 5 system cluster uses? CPU, RAM
Andrzej Bialecki wrote:
Dennis Kubes wrote:
The crawl for 1M pages completed successfully. There was an issue
with doing a copyToLocal but that has already been filed as a HADOOP
bug and the patch will be included in 0.12.x
That's very good news, Dennis - thanks for taking the time
Andrzej Bialecki wrote:
Dennis Kubes wrote:
I agree there may be subtle bugs.
I can do say a full dmoz crawl (~5M pages) with nutch trunk and hadoop
12.1 on a small cluster of 5 machines if this would help? We have
already
Certainly, that would be most welcome.
I will start
with 11.2 without problems. I say let's test
it and if there aren't any significant issues then let's go with 12.1 if
the hadoop team thinks it will be more stable.
One question though, are there any concerns about upgrading clusters as
opposed to new fetches?
Dennis Kubes
--
Best regards
to disk (in the
Yes. The Hadoop team implemented an in-memory buffer with spill-to-disk
functionality. I believe the amount stored in memory before it spills is
configurable.
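As a rough illustration of the buffer-then-spill idea (not Hadoop's actual implementation), a toy version might look like the sketch below. The SpillingBuffer class, its threshold parameter, and the temp-file naming are all assumptions made up for this example; records are assumed not to contain line breaks.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Toy sketch only; class and method names are hypothetical, not Hadoop API.
public class SpillingBuffer {
    private final int maxInMemory;                       // configurable spill threshold
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    public SpillingBuffer(int maxInMemory) {
        if (maxInMemory < 1) {
            throw new IllegalArgumentException("maxInMemory must be >= 1");
        }
        this.maxInMemory = maxInMemory;
    }

    /** Buffers a record in memory and spills once the threshold is reached. */
    public void add(String record) {
        buffer.add(record);
        if (buffer.size() >= maxInMemory) {
            spill();
        }
    }

    /** Writes the buffered records to a new temp file and clears the buffer. */
    private void spill() {
        try {
            Path file = Files.createTempFile("spill-", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, buffer, StandardCharsets.UTF_8);
            spillFiles.add(file);
            buffer.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public int spillCount() {
        return spillFiles.size();
    }

    public int inMemoryCount() {
        return buffer.size();
    }

    /** Returns all records in insertion order: spilled files first, then memory. */
    public List<String> readAll() {
        List<String> all = new ArrayList<>();
        try {
            for (Path file : spillFiles) {
                all.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        all.addAll(buffer);
        return all;
    }
}
```

With a threshold of 2, adding three records leaves one spill file on disk and one record still in memory, so no single write has to flush everything at once.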
Dennis Kubes
Hadoop temp directory) at once. During the write operation, which lasted
no more than 8 seconds each time
section). Could you please append your changes to the
end of the file, and recommit?
Thanks a lot!
Cheers,
Chris
Sorry about that. I saw the warning message and thought it was a version
break. Everything should be fixed now.
Dennis Kubes
On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL
. The more documentation we have, especially for new
developers, the better. If you need any questions answered in doing
this, give me a shout and I will help as much as I can.
Dennis Kubes
Regards,
Steve
Dennis Kubes wrote:
Dennis Kubes wrote:
I was looking through the JIRA to try and help create a list for this
release and to say the least it is a little overwhelming. It looks
like there are 183 issues total with 152 being unassigned. What has
been the current process for testing
Steve Severance wrote:
-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Friday, March 09, 2007 9:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: How to read data from segments
Steve Severance wrote:
I am trying to learn the internals of Nutch
[
https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-233.
Resolution: Fixed
The new regex has been added to both the regex-urlfilter.txt and the
crawl
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-436.
--
Issue closed.
Incorrect handling of relative paths when the embedded URL path is empty
[
https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes resolved NUTCH-436.
Resolution: Fixed
Patch tested on 10,000 URL run with no apparent issues. Reviewed and committed
[
https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes closed NUTCH-233.
--
Issue closed
wrong regular expression hang reduce process for ever
, and for
that, I apologize. I will do my best to thoroughly vet all such discussions
on the nutch list in the future.
No issues with me.
Dennis Kubes
Cheers,
Chris
-- Forwarded Message
From: Chris Mattmann [EMAIL PROTECTED]
Date: Mon, 05 Mar 2007 21:25:30 -0800
To: Piotr
That is a problem with the hadoop.log.dir value not being set. It is trying to
use the DRFA appender to write to a file and can't find the log directory.
Dennis
Gal Nitzan wrote:
Just installed latest from trunk.
I run mergesegs and I get the following error in all tasks log files (I use
default
Chris Mattmann wrote:
Hi Guys,
Blocker
* NUTCH-400 (Update add missing license headers) - I believe this is
fixed and should be closed
+1, thanks to Sami for closing it.
* NUTCH-353 (pages that serverside forwards will be refetched every
time) - this was partially fixed
NUTCH-436 has a patch now if we want to add that to this release.
Dennis Kubes
Andrzej Bialecki wrote:
Sean Dean wrote:
As for which Hadoop version is included in the next Nutch release, I
share the same concern as Sami with 0.10.1 as it NPE's on anything
above 100-200k URLs. I can
OK. I finally figured out how to republish the site. Only took me 3
days. Feeling hazed now! :)
Dennis Kubes
Sami Siren wrote:
Welcome on board Dennis!
--
Sami Siren
Dennis Kubes wrote:
Hi All,
Thank you Andrzej for your kind words. I am looking forward to working
together
.
Dennis Kubes
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
I can also work on this, Chris do you want me to do it or do you want to
coordinate our efforts?
Dennis Kubes
Jérôme Charron wrote:
Hi Chris,
The JIRA issue is the 309 : https://issues.apache.org/jira/browse/NUTCH-309
Thanks for your help.
Jérôme
On 2/13/07, Chris Mattmann [EMAIL
[
https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474713
]
Dennis Kubes commented on NUTCH-447:
This tool is for people who need a defined category structure or want
Environment: all platforms
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Priority: Minor
Fix For: 0.9.0
This functionality allows the plugin.includes and plugin.excludes values to be
moved out of the nutch-default.xml and nutch-site.xml files and loaded
[
https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-448:
---
Attachment: plugin-fromfile.patch
The plugin-fromfile.patch file contains the functionality
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Priority: Minor
This is a tool that will take the dmoz structure RDF file and return a listing
of the categories. The categories returned can be limited by depth or by regular
expression pattern. This tool borrows heavily from
[
https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-447:
---
Attachment: dmoz-structure.patch
Patch that contains the DmozStructureParser class.
Dmoz Structure
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-247:
---
Attachment: agent-names3.patch.txt
This patch logs and throws an exception if the agent name
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474355
]
Dennis Kubes commented on NUTCH-247:
We could move the code to a utility class but if we want it to be called
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474068
]
Dennis Kubes commented on NUTCH-247:
I agree, but then should we approach the check as a configurable option
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes reassigned NUTCH-247:
--
Assignee: Dennis Kubes
robot parser to restrict
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-247:
---
Attachment: agent-names.patch
This patch removes the checks and severe logging from
in the Reducer.
Dennis Kubes
Andrzej Bialecki wrote:
Gal Nitzan wrote:
Hi Andrzej,
Does it mean that when you inject a URL that already exists in the
crawldb, it changes
its status to STATUS_DB_UNFETCHED?
With the current version of Injector - it won't. With previous versions
- it might
Ahhh... Now I get it :)
Andrzej Bialecki wrote:
Dennis Kubes wrote:
Sorry. I am still not getting this. I understand the reason but I am
not seeing how it works.
Ah, because apparently it doesn't ... :( You were right, the first job
consists only of new records. Now that I checked
pdfbox software to parse PDF files so you may want to take the
specific file and see if it parses correctly outside of nutch using pdfbox.
Dennis Kubes
Armel T. Nene wrote:
Dennis
I was wondering if this patch could fix my problem which is, if not the
same, very similar to this one. I am
[
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473295
]
Dennis Kubes commented on NUTCH-247:
I think the idea here is to NOT allow people to run fetchers for which
[
https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-437:
---
Description: The MapFile.Writer signature has changed in hadoop trunk
(version 10.x +) to include
This has to do with HADOOP-964. Replace the jar files in your Nutch
versions with the most recent versions from Hadoop. You will also need
to apply NUTCH-437 patch to get Nutch to work with the most recent
changes to the Hadoop codebase.
Dennis Kubes
Gal Nitzan wrote:
Hi,
Does anybody
Actually I take it back. I don't think it is the same problem but I do
think it is the right solution.
Dennis Kubes
Dennis Kubes wrote:
This has to do with HADOOP-964. Replace the jar files in your Nutch
versions with the most recent versions from Hadoop. You will also need
to apply
If no mapper or reducer class is set in the jobConf then the code
defaults to IdentityMapper and IdentityReducer respectively which
essentially are pass throughs of key/value pairs.
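The defaulting behaviour described above can be illustrated with a toy model like the one below. To be clear about the assumptions: ToyJob and its methods are hypothetical stand-ins written for this sketch and are not Hadoop classes; the only point is that an unset mapper falls back to an identity function that passes each key/value pair through unchanged.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model of "no mapper set means identity"; none of this is Hadoop API.
public class IdentityDefaultDemo {

    /** Hypothetical stand-in for a job configuration plus its map phase. */
    public static class ToyJob<K, V> {
        // null means "no mapper class was set"
        private Function<Map.Entry<K, V>, Map.Entry<K, V>> mapper;

        public void setMapper(Function<Map.Entry<K, V>, Map.Entry<K, V>> mapper) {
            this.mapper = mapper;
        }

        /** Runs the map phase, defaulting to a pass-through when no mapper is set. */
        public List<Map.Entry<K, V>> run(List<Map.Entry<K, V>> input) {
            Function<Map.Entry<K, V>, Map.Entry<K, V>> effective =
                    (mapper != null) ? mapper : Function.identity();
            List<Map.Entry<K, V>> output = new ArrayList<>();
            for (Map.Entry<K, V> record : input) {
                output.add(effective.apply(record));
            }
            return output;
        }
    }

    /** Builds a small sample input of url/score pairs. */
    public static List<Map.Entry<String, Integer>> sampleInput() {
        List<Map.Entry<String, Integer>> input = new ArrayList<>();
        input.add(new SimpleEntry<>("http://example.com/a", 1));
        input.add(new SimpleEntry<>("http://example.com/b", 2));
        return input;
    }
}
```

Running a ToyJob without calling setMapper returns the input records unchanged, while setting a mapper replaces the pass-through with the supplied function.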
Dennis Kubes
Charlie Williams wrote:
I am very new to the Nutch source code, and have been reading over
Versions: 0.8.2, 0.9.0
Environment: windows xp and java
Reporter: Dennis Kubes
Assigned To: Dennis Kubes
Fix For: 0.8.2, 0.9.0
The MapFile.Writer signature has changed in hadoop 0.10.2 to include a
Configuration object. Object in the Nutch codebase
[
https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis Kubes updated NUTCH-437:
---
Attachment: nutch-hadoop-0.10.2-mapfile.patch
This patch changes the references to MapFile.Writer
and we will see if we
can integrate the requests into the development.
Dennis Kubes
Zaheed Haque wrote:
On 1/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Well ... so far this process was very informal, because there were so
few key developers that they more or less knew what needs to be done,
and who is doing what.
Hadoop follows a much stricter and formalized model,
night so everything
should be done in a couple of days.
Dennis Kubes
Chris Mattmann wrote:
Hi Dennis,
On 1/21/07 11:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote:
All,
I am working on a How to Become a Nutch Developer document for the
wiki and I need some input.
I need an overview
Doug
Can you answer the question of how to add developer names to JIRA or if
that is only for committers?
Dennis
Doug Cutting wrote:
Andrzej Bialecki wrote:
The workflow is different - I'm not sure about the details, perhaps
Doug can correct me if I'm wrong ... and yes, it uses JIRA
in the
JIRA system or with the mailing lists, committers, etc?
Getting this information together in one place will go a long way toward
helping others to start contributing more and more. Thanks for all your
input.
Dennis Kubes
Andrzej Bialecki wrote:
Dennis Kubes wrote:
I completely agree with this. I am interested in devoting as much
time as possible to seeing the success of Nutch, Hadoop, and Lucene.
As our business grows I would also be willing to devote developers
full time to work on Nutch, Hadoop
to know to help. At this time I don't think it is a design
problem I think it is a people problem. I will be more than willing to
head up training, documenting, and helping developers get up to speed.
I just need direction in this area myself.
Dennis Kubes
.
Dennis Kubes
Scott Green wrote:
Thanks Andrzej and Doug!
I will try both in my later work and evaluate them.
On 1/17/07, Doug Cutting [EMAIL PROTECTED] wrote:
Andrzej Bialecki wrote:
The reason is that if you pack this file into your job JAR, the job jar
would become very large (presumably
(conf);
PluginDescriptor desc = rep.getPluginDescriptor("parse-html");
String path = desc.getPluginPath();
System.out.println(path);
Dennis Kubes
Scott Green wrote:
Can someone give an answer? I don't think it is a good idea that we put all
configuration/resources under conf dir.
On 1/15/07
to get corr
category score and use that for sorting. Any thoughts?
You could populate the sort field dynamically but still only a single
field. Are you trying to sort on multiple category fields?
Dennis Kubes
Thanks,
On 1/11/07, Dennis Kubes [EMAIL PROTECTED] wrote:
You
You can write a scoring filter. That is much easier than changing
NutchSimilarity. Take a look at the scoring-opic plugin under src.
That will demonstrate the default scoring algorithm.
Dennis Kubes
DS jha wrote:
Hello -
I would like to score summarize results on a different set
of the scoring algorithm.
Dennis Kubes
Otis Gospodnetic wrote:
Stephane,
Nutch uses Lucene for indexing, and Lucene has a class called IndexWriter
that is used for indexing Lucene Documents. Here is a quick grep in Nutch's
*java files:
$ ffjg -l IndexWriter
./src/test/org/apache/nutch
bruce wrote:
hi...
if it's ok, i've got some basic research questions.
can someone tell me if there's a limit to the number of simultaneous
websites that nutch/lucene can return...?
I assume you are asking its indexing capacity. If that is the case it
is billions, it is pretty much
. I am willing to put it into practice and test.
Dennis Kubes
The tasktracker requires intermediate space while performing the map
and reduce functions. Many smaller files are produced during the map
and reduce processes that are deleted when the processes finish. If you
are using the DFS then more disk space is required than is actually used
since disk
The MapWritable acts as a shared memory area or Map that you can put
other writables into and retrieve them from. To add metadata to the
CrawlDatum you would use something like this:
datum.getMetaData().put(key, value)
Where key and value are both Writable implementations such as UTF8 or