[jira] Updated: (NUTCH-162) country code jp is used instead of language code ja for Japanese
[ https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hiroaki Kawai updated NUTCH-162:

Attachment: anchors_ja.properties
            cached_ja.properties
            explain_ja.properties

We need some Japanese property files so that ja can be chosen by the default language selection (because of

    String language = ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale()).getLocale().getLanguage();

in search.jsp, for example). I'll submit those property files.

country code jp is used instead of language code ja for Japanese

Key: NUTCH-162
URL: https://issues.apache.org/jira/browse/NUTCH-162
Project: Nutch
Issue Type: Bug
Components: web gui
Affects Versions: 0.7.1
Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial
Attachments: anchors_ja.properties, cached_ja.properties, explain_ja.properties

In the locale switching link for Japanese, jp is used as the language code, but it is an ISO country code. The language code ja should be used.

By the way, I don't think many users are familiar with the ISO language codes. A Canadian user may click on ca, not knowing that ca stands for Catalan, not Canadian English or French. Rather than listing the language codes, listing the language names in the respective languages may be better. (I say "may be" because the browser could show some language names as corrupted text if the current font does not support that language; this is a difficult problem.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
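As a rough illustration of the distinction raised in this issue, here is a minimal sketch that assumes nothing beyond the standard java.util.Locale API. The class name LocaleCodes and its helper methods are hypothetical and are not part of Nutch. It shows that a locale carries a language code (ja) separately from a country code (JP), and how a language name could be rendered in its own language rather than shown as a bare code.

```java
import java.util.Locale;

// Hypothetical helper, not Nutch code: separates language codes from country codes.
final class LocaleCodes {

    /** Returns the ISO 639 language code of a locale, for example "ja". */
    static String languageCode(Locale locale) {
        return locale.getLanguage();
    }

    /** Returns the ISO 3166 country code of a locale, for example "JP". */
    static String countryCode(Locale locale) {
        return locale.getCountry();
    }

    /** Returns the name of a language written in that language, as far as the JDK's locale data allows. */
    static String nameInOwnLanguage(String languageCode) {
        Locale locale = Locale.forLanguageTag(languageCode);
        return locale.getDisplayLanguage(locale);
    }

    public static void main(String[] args) {
        System.out.println(languageCode(Locale.JAPAN));  // language part of ja_JP
        System.out.println(countryCode(Locale.JAPAN));   // country part of ja_JP
        // "ca" is a language code (Catalan), not the country code for Canada.
        System.out.println(Locale.forLanguageTag("ca").getDisplayLanguage(Locale.ENGLISH));
        System.out.println(nameInOwnLanguage("ja"));
    }
}
```

Whether the last line renders correctly still depends on the fonts available to the reader, which is the caveat the reporter mentions.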
[jira] Updated: (NUTCH-162) country code jp is used instead of language code ja for Japanese
[ https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hiroaki Kawai updated NUTCH-162:

Attachment: search_ja.properties
            text_ja.properties

Please put these property files in src/web/locale/org/nutch/jsp/ .

country code jp is used instead of language code ja for Japanese

Key: NUTCH-162
URL: https://issues.apache.org/jira/browse/NUTCH-162
Project: Nutch
Issue Type: Bug
Components: web gui
Affects Versions: 0.7.1
Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial
Attachments: anchors_ja.properties, cached_ja.properties, explain_ja.properties, search_ja.properties, text_ja.properties

In the locale switching link for Japanese, jp is used as the language code, but it is an ISO country code. The language code ja should be used.

By the way, I don't think many users are familiar with the ISO language codes. A Canadian user may click on ca, not knowing that ca stands for Catalan, not Canadian English or French. Rather than listing the language codes, listing the language names in the respective languages may be better. (I say "may be" because the browser could show some language names as corrupted text if the current font does not support that language; this is a difficult problem.)
[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The RunningNutchAndSolr page has been changed by Dmitrius.
The comment on this change is: Fixed command (single quotes missing).
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29

--

  = New in Nutch 1.0-dev =
- Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded so you can start to use a lot easier. Just download a nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that in the nightly version of Apache Nutch there is now a Solr integration embedded so you can start to use a lot easier. Just download a nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
  = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to [[http://variogram.com||Brian Whitman at Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for all the help! You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with Solr. I'm sure there are better ways of doing some/all of it, but I'm not aware of them. By all means, please do correct/update this if someone has a better idea. Many thanks to http://variogram.com and http://blog.foofactory.fi for all the help! You guys saved me a lot of time! :)
  I'm posting it under Nutch rather than Solr on the presumption that people are more likely to be learning/using Solr first, then come here looking to combine it with Nutch. I'm going to skip over doing command by command for right now. I'm running/building on Ubuntu 7.10 using Java 1.6.0_05. I'm assuming that the Solr trunk code is checked out into solr-trunk and Nutch trunk code is checked out into nutch-trunk.
@@ -12, +12 @@
  * apt-get install sun-java6-jdk subversion ant patch unzip
  == Steps ==
- The first step to get started is to download the required software components, namely Apache Solr and Nutch.
  '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
@@ -23, +22 @@
  '''4.''' Extract the Nutch package
  tar xzf apache-nutch-1.0.tar.gz
+ '''5.''' Configure Solr For the sake of simplicity we are going to use the example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
  We want to allow Solr to create the snippets for search results so we need to store the content in addition to indexing it:
@@ -52, +48 @@
  <str name="qf">
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2 </str>
- </str>
- <str name="pf">
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- </str>
+ <str name="fl"> url </str>
- <str name="fl">
- url
- </str>
+ <str name="mm"> 2<-1 5<-2 6<90% </str>
- <str name="mm">
- 2&lt;-1 5&lt;-2 6&lt;90%
- </str>
  <int name="ps">100</int>
@@ -91, +80 @@
  '''6.''' Start Solr
+ cd apache-solr-1.3.0/example java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar
  '''7. Configure Nutch'''
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace it’s contents with the following (we specify our crawler name, active plugins and limit maximum url count for single host per run to be 100) :
+ <?xml version="1.0"?> <configuration>
- <?xml version="1.0"?>
- <configuration>
  <property>
@@ -109, +96 @@
  </property>
- <property>
- <name>generate.max.per.host</name>
+ <property> <name>generate.max.per.host</name>
  <value>100</value>
@@ -126, +112 @@
  </configuration>
- '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,replace it’s content with following:
  -^(https|telnet|file|ftp|mailto):
+
-
- # skip some suffixes
- -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-
+
- # skip URLs
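The regex-urlfilter.txt rules quoted in this diff can be read as an ordered list of signed patterns. The following is a simplified sketch of that idea using only java.util.regex; it is not Nutch's actual filter implementation. The class UrlFilterSketch is hypothetical, the suffix list is shortened for readability, and the final catch-all "+." rule is an assumption, since the diff above is cut off before the end of the file. The sketch also assumes that the first matching rule decides and that a URL matching no rule is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified model of a signed-regex URL filter; not Nutch code.
final class UrlFilterSketch {

    /** One rule line: a leading '+' accepts, a leading '-' rejects. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses one rule line such as "-\\.(gif|jpg)$" and appends it to the rule list. */
    UrlFilterSketch add(String line) {
        char sign = line.charAt(0);
        if (sign != '+' && sign != '-') {
            throw new IllegalArgumentException("rule must start with + or -: " + line);
        }
        rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        return this;
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject (assumed). */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** Rules modelled on the diff above, with a shortened suffix list and an assumed catch-all. */
    static UrlFilterSketch example() {
        return new UrlFilterSketch()
                .add("-^(https|telnet|file|ftp|mailto):")
                .add("-\\.(gif|GIF|jpg|JPG|png|PNG|pdf|PDF|zip)$")
                .add("+.");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = example();
        System.out.println(filter.accepts("http://example.com/index.html"));
        System.out.println(filter.accepts("http://example.com/logo.png"));
    }
}
```

Listing upper- and lower-case suffixes separately, as the wiki page does, avoids depending on a case-insensitive flag in the pattern.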
[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The RunningNutchAndSolr page has been changed by Dmitrius.
The comment on this change is: It's a problem to make the wiki display a grave accent. Managed to do that using HTML codes.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=29&rev2=30

--

  The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:
- export SEGMENT=crawl/segments/``ls -tr crawl/segments|tail -1``
+ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
  Now I launch the fetcher that actually goes to get the content:
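The shell command in this diff picks the most recently modified entry under crawl/segments. For readers driving the crawl from Java instead of a shell, a minimal sketch of the same selection could look like the following. It assumes only java.nio.file; the class name LatestSegment is hypothetical and not part of Nutch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper, not Nutch code: mirrors `ls -tr crawl/segments | tail -1`.
final class LatestSegment {

    /** Returns the entry of segmentsDir with the newest modification time, if there is one. */
    static Optional<Path> latest(Path segmentsDir) throws IOException {
        try (Stream<Path> entries = Files.list(segmentsDir)) {
            return entries.max(Comparator.comparing(LatestSegment::modifiedTime));
        }
    }

    /** Reads the modification time, wrapping the checked exception so it can be used in a comparator. */
    private static FileTime modifiedTime(Path path) {
        try {
            return Files.getLastModifiedTime(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path segments = Paths.get(args.length > 0 ? args[0] : "crawl/segments");
        if (!Files.isDirectory(segments)) {
            System.out.println("no segments directory at " + segments);
            return;
        }
        System.out.println(latest(segments).map(Path::toString).orElse("no segments found"));
    }
}
```

Like the shell version, this relies on modification times, so it gives the intended result only if no older segment has been touched since the last generate step.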
[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The RunningNutchAndSolr page has been changed by Dmitrius.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=30&rev2=31

--

  The above command will generate a new segment directory under crawl/segments that at this point contains files that store the url(s) to be fetched. In the following commands we need the latest segment dir as parameter so we’ll store it in an environment variable:
- export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ export SEGMENT=crawl/segments/&#96;ls -tr crawl/segments|tail -1&#96;
  Now I launch the fetcher that actually goes to get the content:
[jira] Work started: (NUTCH-816) Add zip target to build.xml
[ https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-816 started by Chris A. Mattmann.

Add zip target to build.xml
---
Key: NUTCH-816
URL: https://issues.apache.org/jira/browse/NUTCH-816
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.0.0
Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.1

Just like we have an ant tar target (pun intended) we should have an ant zip target. I'd like to have this ready for the release and future releases.
[jira] Resolved: (NUTCH-816) Add zip target to build.xml
[ https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-816.

Resolution: Fixed

- fixed in r942427

Add zip target to build.xml
---
Key: NUTCH-816
URL: https://issues.apache.org/jira/browse/NUTCH-816
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.0.0
Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.1

Just like we have an ant tar target (pun intended) we should have an ant zip target. I'd like to have this ready for the release and future releases.
[VOTE] Apache Nutch 1.1 Release Candidate #3
Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc3/

The major differences between this release and rc #2 are the application of NUTCH-816, NUTCH-732, NUTCH-815, NUTCH-814, and NUTCH-812, based on feedback from prior release candidates. For more detailed information on release contents and latest changes, see the included CHANGES.txt file. The release was made using the Nutch release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

<note>
In response to several user requests during the last RC cycle, I've also included *binary* releases (labeled as apache-nutch-1.1-bin.tar.gz and apache-nutch-1.1-bin.zip). This addresses Sami Siren's request that the tutorial be updated to reflect the fact that this release is a source-only release. Sami also requested to integrate RAT into the build; however, in the interest of getting this 1.1 out and getting going on the Nutch TLP, my proposal is:

* run RAT and integrate into the build on releases post 1.1
</note>

Please vote on releasing these packages as Apache Nutch 1.1. The vote is open for the next 72 hours. Only votes from Nutch PMC are binding, but folks are welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.
[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
[jira] Commented: (NUTCH-811) Develop an ORM framework
[ https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865226#action_12865226 ]

Enis Soztutar commented on NUTCH-811:

Hi Piet,

The code for Gora will reside in GitHub for now, since Nutch and Gora are pretty orthogonal. But as stated before, Nutch is the first user of Gora, and Gora does not yet have a separate community so I intend to always keep nutch community updated (via this issue and nutch-dev mailing list), and hope for feedback from the Nutch community. Moreover, NutchBase has already been ported to using Gora, so at some point, Gora should be reviewed and accepted as a dependency for Nutch.

Develop an ORM framework
-
Key: NUTCH-811
URL: https://issues.apache.org/jira/browse/NUTCH-811
Project: Nutch
Issue Type: New Feature
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

By Nutch-808, it is clear that we need an ORM layer on top of the datastore, so that different backends can be used to store data. This issue will track the development of the ORM layer. Initially full support for HBase is planned, with RDBM, Hadoop MapFile and Cassandra support scheduled for later.
[jira] Commented: (NUTCH-811) Develop an ORM framework
[ https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864744#action_12864744 ]

Piet Schrijver commented on NUTCH-811:

Will development for gora be tracked under this or any nutch ticket?

Develop an ORM framework
-
Key: NUTCH-811
URL: https://issues.apache.org/jira/browse/NUTCH-811
Project: Nutch
Issue Type: New Feature
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

By Nutch-808, it is clear that we need an ORM layer on top of the datastore, so that different backends can be used to store data. This issue will track the development of the ORM layer. Initially full support for HBase is planned, with RDBM, Hadoop MapFile and Cassandra support scheduled for later.
[jira] Assigned: (NUTCH-817) parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1
[ https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche reassigned NUTCH-817:

Assignee: Julien Nioche

parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1

Key: NUTCH-817
URL: https://issues.apache.org/jira/browse/NUTCH-817
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.1
Environment: Suse linux 11.1, java version 1.6.0_13
Reporter: matthew a. grisius
Assignee: Julien Nioche
Attachments: sample-javadoc.html

submitted per Julien Nioche. I did not see where to attach a file so I pasted it here. btw: Tika command line returns empty html body for this file.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<!--NewPage-->
<HTML>
<HEAD>
<!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
<TITLE>
Matrix Application Development Kit
</TITLE>
<SCRIPT type="text/javascript">
    targetPage = "" + window.location.search;
    if (targetPage != "" && targetPage != "undefined")
        targetPage = targetPage.substring(1);
    function loadFrames() {
        if (targetPage != "" && targetPage != "undefined")
            top.classFrame.location = top.targetPage;
    }
</SCRIPT>
<NOSCRIPT>
</NOSCRIPT>
</HEAD>
<FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
<FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
<FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
<FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and interfaces (except non-static nested types)">
</FRAMESET>
<FRAME src="overview-summary.html" name="classFrame" title="Package, class and interface descriptions" scrolling="yes">
<NOFRAMES>
<H2>
Frame Alert</H2>
<P>
This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client.
<BR>
Link to<A HREF="overview-summary.html">Non-frame version.</A>
</NOFRAMES>
</FRAMESET>
</HTML>
[jira] Updated: (NUTCH-814) SegmentMerger bug
[ https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated NUTCH-814:

Attachment: merger.patch

Patch fixing the issue, and a unit test. I will commit this shortly.

SegmentMerger bug
-
Key: NUTCH-814
URL: https://issues.apache.org/jira/browse/NUTCH-814
Project: Nutch
Issue Type: Bug
Affects Versions: 1.1
Reporter: Dennis Kubes
Assignee: Andrzej Bialecki
Fix For: 1.1
Attachments: merger.patch

Dennis reported:
{quote}
In the SegmentMerger.java file about line 150 we have this:

    final SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(), job);

Then about line 166 in the record reader we have this:

    boolean res = reader.next(key, w);

If I am reading that right, that would mean that the map task would loop over all records for a given file and not just a given split.
{quote}

Right, this should instead use SequenceFileRecordReader that already has the logic to handle splits. Patch coming shortly - thanks for spotting this! This could be the reason for out of disk space errors that many users reported.
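To make the reported problem easier to picture, here is a small self-contained model in plain Java. It does not use the Hadoop API and it is not the patch attached to this issue. The class SplitReadSketch, the Rec type and both reader methods are hypothetical. The model assumes a record belongs to the split that contains its start offset, which is a simplification of how a split-aware reader behaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of split handling; not Hadoop or Nutch code.
final class SplitReadSketch {

    /** Stand-in for one stored record: its byte offset in the file and its key. */
    static final class Rec {
        final long offset;
        final String key;

        Rec(long offset, String key) {
            this.offset = offset;
            this.key = key;
        }
    }

    /** Models the reported behaviour: the split bounds are ignored and every record is returned. */
    static List<String> readWholeFile(List<Rec> file, long start, long length) {
        List<String> keys = new ArrayList<>();
        for (Rec rec : file) {
            keys.add(rec.key);
        }
        return keys;
    }

    /** Models split-aware reading: only records that start inside [start, start + length). */
    static List<String> readSplit(List<Rec> file, long start, long length) {
        List<String> keys = new ArrayList<>();
        for (Rec rec : file) {
            if (rec.offset >= start && rec.offset < start + length) {
                keys.add(rec.key);
            }
        }
        return keys;
    }

    /** A file with four records, to be processed as two splits of 64 bytes each. */
    static List<Rec> exampleFile() {
        return Arrays.asList(new Rec(0, "a"), new Rec(40, "b"), new Rec(80, "c"), new Rec(120, "d"));
    }

    public static void main(String[] args) {
        List<Rec> file = exampleFile();
        int ignoringSplits = readWholeFile(file, 0, 64).size() + readWholeFile(file, 64, 64).size();
        int honouringSplits = readSplit(file, 0, 64).size() + readSplit(file, 64, 64).size();
        System.out.println("records emitted when splits are ignored: " + ignoringSplits);  // 8
        System.out.println("records emitted when splits are honoured: " + honouringSplits); // 4
    }
}
```

In this model every extra split multiplies the emitted records, which would fit the suspicion above that the bug is behind the reported out of disk space errors.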
[jira] Work stopped: (NUTCH-466) Flexible segment format
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-466 stopped by Andrzej Bialecki.

Flexible segment format
---
Key: NUTCH-466
URL: https://issues.apache.org/jira/browse/NUTCH-466
Project: Nutch
Issue Type: Improvement
Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Attachments: ParseFilters.java, segmentparts.patch

In many situations it is necessary to store more data associated with pages than is possible with the current segment format. Quite often it's binary data. There are two common workarounds for this: one is to use per-page metadata, either in Content or ParseData, the other is to use an external independent database using page ID-s as foreign keys.

Currently segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data.

I propose a third option, which is a natural extension of this existing segment format, i.e. to introduce the ability to add arbitrarily named segment parts, with the only requirement that they should be MapFile-s that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.

Existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.

Example applications:
* storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
* storing pre-tokenized version of plain text for faster snippet generation
* storing linguistically tagged text for sophisticated data mining
* storing image thumbnails
etc, etc ...

I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.
[jira] Created: (NUTCH-816) Add zip target to build.xml
Add zip target to build.xml
---
Key: NUTCH-816
URL: https://issues.apache.org/jira/browse/NUTCH-816
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.0.0
Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.1

Just like we have an ant tar target (pun intended) we should have an ant zip target. I'd like to have this ready for the release and future releases.
Re: [VOTE] Apache Nutch 1.1 Release Candidate #2
Might I suggest, that since Nutch is now a TLP that you delay this release by a few weeks and have the vote done under the auspices of the Nutch PMC?

Cheers,
Grant
Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2
Hi David,

Thanks. In fact, running ant is probably simpler than running Nutch. The steps would be:

* what OS are you on (Ant is available for all of them to my knowledge)?
* if you need ant, grab a distro from ant.apache.org, otherwise, I'll assume that you've got ant installed and callable from the command line.
* unpack the nutch src distribution, cd into that directory, type ant job, and there you go.

HTH! You could try it out by taking the Nutch src code from SVN at: http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1, and then trying the steps above.

Cheers,
Chris

On 4/26/10 7:24 AM, David M. Cole d...@colegroup.com wrote:

> At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
>> Most folks that use Nutch are likely familiar with running ant IMHO.
>
> I guess then I fall into the category of not most folks. Have been running Nutch for about 14 months and I haven't a clue how to run ant.
>
> If there's a place to vote to suggest that compiled versions still be distributed, I vote for that.
>
> Thanks.
>
> \dmc
>
> --
> David M. Cole d...@colegroup.com
> Editor & Publisher, NewsInc. http://newsinc.net V: (650) 557-2993
> Consultant: The Cole Group http://colegroup.com/ F: (650) 475-8479
Re: [VOTE] Apache Nutch 1.1 Release Candidate #2
Hi Grant,

Thanks. I think it actually makes sense to finish off 1.1, and since there is overlap with the Nutch PMC and the Lucene PMC and since the thread started in Lucene before the TLP, I think it would be great e.g., if Andrzej, and Sami could check the release and that way we still have the continuity and can safely push it out as the last Nutch rel under the Lucene umbrella...

Then all releases post 1.1 can cleanly be done under the auspices of the new PMC :)

Cheers,
Chris
[jira] Closed: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar closed NUTCH-808.

Resolution: Fixed

We have decided to go on with implementing an ORM layer as per the discussion on NUTCH-811. Closing this issue.

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
--
Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

We have an ORM layer in the NutchBase branch, which uses Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks, whether they suit our needs.

We want at least the following capabilities:
- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments, suggestions for other frameworks are welcome.
Re: [VOTE] Apache Nutch 1.1 Release Candidate #2
Hey Andrzej,

Okey dokey, np! Let's get the patch in first :) I can cut as many RCs as needed.

Cheers,
Chris

On 4/26/10 11:30 AM, Andrzej Bialecki a...@getopt.org wrote:

> On 2010-04-26 17:19, Mattmann, Chris A (388J) wrote:
>> Hi Grant,
>>
>> Thanks. I think it actually makes sense to finish off 1.1, and since there is overlap with the Nutch PMC and the Lucene PMC and since the thread started in Lucene before the TLP, I think it would be great e.g., if Andrzej, and Sami could check the release and that way we still have the continuity and can safely push it out as the last Nutch rel under the Lucene umbrella...
>>
>> Then all releases post 1.1 can cleanly be done under the auspices of the new PMC :)
>
> I know that Dennis Kubes just discovered a bug in SegmentMerger (he may report on it in a moment) - this bug has been there for a while, it's likely the cause of the mysterious out of disk space errors, and it manifests itself only with input files larger than HDFS block size (64MB).
>
> Since 1.1 is likely the final release of Nutch 1.x I think it would make sense to fix this bug before we release ...
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
[VOTE] Apache Nutch 1.1 Release Candidate #2
Hi Folks, I have posted an updated candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/ The major difference between this release and rc #1 is the application of NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE - as well as some commits by Sami Siren to fix missing ASL license headers. For more detail on release contents and the latest changes, see the included CHANGES.txt file. The release was made using the Nutch release process, documented on the Wiki here: http://bit.ly/d5ugid A Nutch 1.1 tag is at: http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/ <note> There was a request by Sami Siren that the tutorial be updated to reflect the fact that this release is a source-only release, as well as a request to integrate RAT into the build. However, in the interest of getting this 1.1 out and getting going on the Nutch TLP, my proposal is: * update the docs independent of this release (the tutorial as it exists right now says 0.7 on it anyways and doesn't look like it's been updated in a while, so I think users can live with what's there and support on u...@nutch.apache.org or d...@nutch.apache.org until it's updated) * begin source-only releases in general, since we've long had the debate as to the size of the Nutch release. Most folks that use Nutch are likely familiar with running ant IMHO. * run RAT and integrate it into the build </note> Please vote on releasing these packages as Apache Nutch 1.1. The vote is open for the next 72 hours. Since Nutch is now a TLP and has its own PMC, there is a question of whose release VOTES are binding in this particular thread. My gut reaction is that since I started this release while we were under the Lucene PMC, for continuity purposes, only votes from the Lucene PMC are binding, but everyone (especially newly minted Nutch PMC members!) 
are welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.1. [ ] -1 Do not release the packages because... Thanks! Cheers, Chris P.S. Here is my +1. ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] Commented: (NUTCH-710) Support for rel=canonical attribute
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286 ] Julien Nioche commented on NUTCH-710:
As suggested previously, we could treat canonicals either as redirections or during deduplication. Neither is a satisfactory solution.
Redirection: we want to index the document if/when the target of the canonical is not available for indexing. We also want to follow the outlinks.
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more complex because we need to follow redirections.
We probably need a third approach: prefilter by going through the crawlDB to detect URLs which have a canonical target already indexed or ready to be indexed. We need to follow up to X levels of redirection, e.g. doc A names doc B as its canonical representation, doc B redirects to doc C, etc. If the end of the redirection chain exists and is valid, then mark A as a duplicate of C (intermediate redirs will not get indexed anyway). As we don't know whether it has been indexed yet, we would give it a special marker (e.g. status_duplicate) in the crawlDB. Then:
- if the indexer comes across such an entry: skip it
- make it so that *DeleteDuplicates can take a list of URLs with status_duplicate as an additional source of input, OR have a custom resource that deletes such entries in SOLR or Lucene indices
The implementation would be as follows: go through all redirections and generate all redirection chains. E.g. A -> B, B -> C, D -> C, where C is an indexable document (i.e. it has been fetched and parsed; it may have been already indexed), will yield A -> C, B -> C, D -> C but also C -> C. Once we have all possible redirections, go through the crawlDB in search of canonicals. If the target of a canonical is the source of a valid alias (e.g. A -> B -> C -> D), mark it as 'status:duplicate'.
This design implies generating quite a few intermediate structures + scanning the whole crawlDB twice (once for the aliases, then for the canonicals) + rewriting the whole crawlDB to mark some of the entries as duplicates. This would be much easier to do when we have Nutch2/HBase: we could simply follow the redirs from the initial URL having a canonical tag instead of generating these intermediate structures. We can then modify the entries one by one instead of regenerating the whole crawlDB. WDYT?
Support for rel=canonical attribute - Key: NUTCH-710 URL: https://issues.apache.org/jira/browse/NUTCH-710 Project: Nutch Issue Type: New Feature Affects Versions: 1.1 Reporter: Frank McCown Priority: Minor There is a new rel=canonical attribute which is now being supported by Google, Yahoo, and Live: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html Adding support for this attribute value will potentially reduce the number of URLs crawled and indexed and reduce duplicate page content.
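The chain-following step proposed in this comment can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Nutch code: the class CanonicalResolver, its methods, and the plain in-memory maps standing in for the crawlDB are all hypothetical, and maxHops plays the role of the "X levels of redirection" limit.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names exist in Nutch. */
final class CanonicalResolver {

    /**
     * Follows redirects from start for at most maxHops hops.
     * Returns the final URL if it is indexable, or null if the chain is
     * too long, loops, or ends at a URL that is not indexable.
     */
    static String resolve(String start, Map<String, String> redirects,
                          Set<String> indexable, int maxHops) {
        String current = start;
        Set<String> seen = new HashSet<>();
        for (int hops = 0; hops <= maxHops; hops++) {
            if (!seen.add(current)) {
                return null; // redirect loop
            }
            String next = redirects.get(current);
            if (next == null) {
                // End of chain: valid only if the document can be indexed.
                return indexable.contains(current) ? current : null;
            }
            current = next;
        }
        return null; // more than maxHops redirects
    }

    /** Maps each URL that declares a canonical to the document it duplicates. */
    static Map<String, String> findDuplicates(Map<String, String> canonicals,
                                              Map<String, String> redirects,
                                              Set<String> indexable, int maxHops) {
        Map<String, String> duplicates = new HashMap<>();
        for (Map.Entry<String, String> e : canonicals.entrySet()) {
            String target = resolve(e.getValue(), redirects, indexable, maxHops);
            // A page whose canonical resolves to itself is not a duplicate.
            if (target != null && !target.equals(e.getKey())) {
                duplicates.put(e.getKey(), target);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("A", "B");
        redirects.put("B", "C");
        redirects.put("D", "C");
        Set<String> indexable = new HashSet<>();
        indexable.add("C");
        Map<String, String> canonicals = new HashMap<>();
        canonicals.put("P", "B"); // page P names B as its canonical
        System.out.println(findDuplicates(canonicals, redirects, indexable, 5));
    }
}
```

In this sketch the entries returned by findDuplicates are the ones that would receive the status_duplicate marker; a real implementation would read redirects and canonicals from the crawlDB rather than from maps held in memory.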
TLP Status
The Board has approved Mahout, Tika, and Nutch moving to top-level project (TLP) status. Congrats! Now begins the fun part of changing mailing lists, domains, etc. -Grant
Re: Developing Nutch for semantic search
Hi Adarsh, we built a search engine for a limited Web space: ~100M pages. Our background is in semantic search, but first we needed to address all the general crawl and search issues, as in a traditional search engine. They are in no way less work than introducing some semantics. So I'd suggest you start with being able to crawl, index and search your content, then go on with extending the functionality. borislav
On Apr 17, 2010, at 6:59 PM, Adarsh malu wrote:
> Hello, I am running Nutch 0.9. Our aim is to build a semantic search engine (for agriculture) using Nutch. However I am unable to proceed from where to start. Help me how could I proceed Adarsh
[jira] Commented: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implme
[ https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859116#action_12859116 ] Ilguiz Latypov commented on NUTCH-427:
I hesitate to add the .zip file because (a) it hides the intention of the change and (b) other developers who might have already modified their copies would have difficulty merging my change. I believe the GNU patch tool will apply my suggested change automatically, provided that one is in the right working directory and, possibly, uses the -pX option, where X is the number of leading directory names to strip from the paths in the patch.
protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation. -- Key: NUTCH-427 URL: https://issues.apache.org/jira/browse/NUTCH-427 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: JAVA - OS independent Reporter: Armel Nene Priority: Minor Attachments: protocol-smb-diff.txt, protocol-smb.zip, protocol-smb.zip, protocol-smb.zip
Title: protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares Author: Armel T. Nene Update: Vadim Bauer Email: armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r AT g m x . d e
A. Introduction The protocol-smb plugin allows you to crawl Microsoft Windows shares. It implements the CIFS/SMB protocol, which is commonly used on Microsoft operating systems. The plugin replicates the behaviour of protocol-file over the CIFS/SMB protocol. It uses the JCifs library and supports all the properties of that library. You can find more information on the following site: http://jcifs.samba.org/ The smb URL syntax for crawling is as follows: smb://x (e.g. smb://server/share).
B. Installation
1) Binaries only: The protocol-smb files can be found in the ../plugins directory. Copy protocol-smb to the NUTCHHOME/build/plugins directory. Put the smb.properties file in the NUTCHHOME/conf directory. Configure the properties in the smb.properties file. Enable the plugin by updating the nutch-site.xml file found in the NUTCHHOME/conf directory, e.g. <property> <name>plugin.includes</name> <value>protocol-smb| other plugins...</value> <description> </description> </property>
2) Source code: The protocol-smb sources can be found in the ../src directory. Always refer to the Nutch wiki for detailed instructions on building Nutch. In short: Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin. Update the build.xml in NUTCHHOME/src/plugin to include the plugin. Update the NUTCHHOME/default.properties file to include the plugin. Run ant to build. Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the properties. Enable the plugin by updating the nutch-site.xml file.
C. Known Issues
1) MalformedURLException: unknown protocol: smb The SMB URL protocol handler is not being installed successfully. In short, the jCIFS jar must be loaded by the System class loader. Workaround: a) A short-term solution is to install the JCIFS jar library found in the protocol-smb folder in JDKHOME/jre/lib/ext and/or JREHOME/lib/ext. b) If the exception is still thrown after completing step a), set the system property by passing the following argument to the JVM: -Djava.protocol.handler.pkgs=jcifs c) You can also set the property in your code. For example, if you start crawling with org.apache.nutch.crawl.Crawl, add the following two lines, which have the same effect as b): public static void main(String args[]) throws Exception { System.setProperty("java.protocol.handler.pkgs", "jcifs"); new
[jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implment
[ https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilguiz Latypov updated NUTCH-427: - Attachment: (was: protocol-smb.zip)
protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation. -- Key: NUTCH-427 URL: https://issues.apache.org/jira/browse/NUTCH-427 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: JAVA - OS independent Reporter: Armel Nene Priority: Minor Attachments: protocol-smb-diff.txt, protocol-smb.zip, protocol-smb.zip
Title: protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares Author: Armel T. Nene Update: Vadim Bauer Email: armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r AT g m x . d e
A. Introduction The protocol-smb plugin allows you to crawl Microsoft Windows shares. It implements the CIFS/SMB protocol, which is commonly used on Microsoft operating systems. The plugin replicates the behaviour of protocol-file over the CIFS/SMB protocol. It uses the JCifs library and supports all the properties of that library. You can find more information on the following site: http://jcifs.samba.org/ The smb URL syntax for crawling is as follows: smb://x (e.g. smb://server/share).
B. Installation
1) Binaries only: The protocol-smb files can be found in the ../plugins directory. Copy protocol-smb to the NUTCHHOME/build/plugins directory. Put the smb.properties file in the NUTCHHOME/conf directory. Configure the properties in the smb.properties file. Enable the plugin by updating the nutch-site.xml file found in the NUTCHHOME/conf directory, e.g. <property> <name>plugin.includes</name> <value>protocol-smb| other plugins...</value> <description> </description> </property>
2) Source code: The protocol-smb sources can be found in the ../src directory. Always refer to the Nutch wiki for detailed instructions on building Nutch. In short: Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin. Update the build.xml in NUTCHHOME/src/plugin to include the plugin. Update the NUTCHHOME/default.properties file to include the plugin. Run ant to build. Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the properties. Enable the plugin by updating the nutch-site.xml file.
C. Known Issues
1) MalformedURLException: unknown protocol: smb The SMB URL protocol handler is not being installed successfully. In short, the jCIFS jar must be loaded by the System class loader. Workaround: a) A short-term solution is to install the JCIFS jar library found in the protocol-smb folder in JDKHOME/jre/lib/ext and/or JREHOME/lib/ext. b) If the exception is still thrown after completing step a), set the system property by passing the following argument to the JVM: -Djava.protocol.handler.pkgs=jcifs c) You can also set the property in your code. For example, if you start crawling with org.apache.nutch.crawl.Crawl, add the following two lines, which have the same effect as b): public static void main(String args[]) throws Exception { System.setProperty("java.protocol.handler.pkgs", "jcifs"); new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write") //and so on You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html
2) FATAL smb.SMB - Could not read content of protocol: smb://xx This problem usually occurs if the following properties are not set correctly in the smb.properties file: - username - password - domain Also refer to the following resources for
[jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implment
[ https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilguiz Latypov updated NUTCH-427: - Attachment: protocol-smb-dist.zip
Applied my diff to simplify importing into the Subversion tree. The build directory should not be imported, and the src/plugin/build.xml file should only add the new protocol-smb deploy and clean targets. The previous author did not grant the license to ASF.
protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implementation. -- Key: NUTCH-427 URL: https://issues.apache.org/jira/browse/NUTCH-427 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: JAVA - OS independent Reporter: Armel Nene Priority: Minor Attachments: protocol-smb-diff.txt, protocol-smb-dist.zip, protocol-smb.zip, protocol-smb.zip
Title: protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares Author: Armel T. Nene Update: Vadim Bauer Email: armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r AT g m x . d e
A. Introduction The protocol-smb plugin allows you to crawl Microsoft Windows shares. It implements the CIFS/SMB protocol, which is commonly used on Microsoft operating systems. The plugin replicates the behaviour of protocol-file over the CIFS/SMB protocol. It uses the JCifs library and supports all the properties of that library. You can find more information on the following site: http://jcifs.samba.org/ The smb URL syntax for crawling is as follows: smb://x (e.g. smb://server/share).
B. Installation
1) Binaries only: The protocol-smb files can be found in the ../plugins directory. Copy protocol-smb to the NUTCHHOME/build/plugins directory. Put the smb.properties file in the NUTCHHOME/conf directory. Configure the properties in the smb.properties file. Enable the plugin by updating the nutch-site.xml file found in the NUTCHHOME/conf directory, e.g. <property> <name>plugin.includes</name> <value>protocol-smb| other plugins...</value> <description> </description> </property>
2) Source code: The protocol-smb sources can be found in the ../src directory. Always refer to the Nutch wiki for detailed instructions on building Nutch. In short: Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin. Update the build.xml in NUTCHHOME/src/plugin to include the plugin. Update the NUTCHHOME/default.properties file to include the plugin. Run ant to build. Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the properties. Enable the plugin by updating the nutch-site.xml file.
C. Known Issues
1) MalformedURLException: unknown protocol: smb The SMB URL protocol handler is not being installed successfully. In short, the jCIFS jar must be loaded by the System class loader. Workaround: a) A short-term solution is to install the JCIFS jar library found in the protocol-smb folder in JDKHOME/jre/lib/ext and/or JREHOME/lib/ext. b) If the exception is still thrown after completing step a), set the system property by passing the following argument to the JVM: -Djava.protocol.handler.pkgs=jcifs c) You can also set the property in your code. For example, if you start crawling with org.apache.nutch.crawl.Crawl, add the following two lines, which have the same effect as b): public static void main(String args[]) throws Exception { System.setProperty("java.protocol.handler.pkgs", "jcifs"); new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write") //and so on You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html
2) FATAL
Re: Developing Nutch for semantic search
We need a bit more detail... Besides, why don't you take the 1.0 release? 1.1 is not far from release, either.
2010/4/17, Adarsh malu adarsh_th...@yahoo.co.in:
> Hello, I am running Nutch 0.9. Our aim is to build a semantic search engine (for agriculture) using Nutch. However I am unable to proceed from where to start. Help me how could I proceed Adarsh
-- -MilleBii-
[jira] Work started: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-812 started by Chris A. Mattmann. Crawl.java incorrectly uses the Generator API resulting in NPE -- Key: NUTCH-812 URL: https://issues.apache.org/jira/browse/NUTCH-812 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Chris A. Mattmann Priority: Critical As reported by Phil Barnett on nutch-user: {quote} The Fix. In line 131 of Crawl.java, Generate no longer returns segments like it used to. Now it returns segs. Line 131 needs to read if (segs == null) instead of the current if (segments == null). After that change and a recompile, crawl is working just fine. {quote}
[jira] Assigned: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-812: --- Assignee: Chris A. Mattmann Crawl.java incorrectly uses the Generator API resulting in NPE -- Key: NUTCH-812 URL: https://issues.apache.org/jira/browse/NUTCH-812 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Chris A. Mattmann Priority: Critical As reported by Phil Barnett on nutch-user: {quote} The Fix. In line 131 of Crawl.java, Generate no longer returns segments like it used to. Now it returns segs. Line 131 needs to read if (segs == null) instead of the current if (segments == null). After that change and a recompile, crawl is working just fine. {quote}
[jira] Resolved: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-812. - Fix Version/s: 1.1 Resolution: Fixed - fixed in r935453. Thanks, Phil and Andrzej! Crawl.java incorrectly uses the Generator API resulting in NPE -- Key: NUTCH-812 URL: https://issues.apache.org/jira/browse/NUTCH-812 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Andrzej Bialecki Assignee: Chris A. Mattmann Priority: Critical Fix For: 1.1 As reported by Phil Barnett on nutch-user: {quote} The Fix. In line 131 of Crawl.java, Generate no longer returns segments like it used to. Now it returns segs. Line 131 needs to read if (segs == null) instead of the current if (segments == null). After that change and a recompile, crawl is working just fine. {quote}
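The pattern behind this fix, checking the value a generate step has just returned before using it, can be shown with a self-contained sketch. It is not the actual Crawl.java code: CrawlLoopSketch, generate and crawl are hypothetical names, and the only assumption taken from the report is that the generate step can return null when there is nothing left to fetch.

```java
/** Hypothetical sketch of a crawl loop; not the Nutch Crawl class. */
final class CrawlLoopSketch {

    /** Stand-in for a generate step: returns null once no URLs are due. */
    static String[] generate(int round) {
        return round < 2 ? new String[] {"segment-" + round} : null;
    }

    /** Runs at most depth rounds; returns how many rounds produced segments. */
    static int crawl(int depth) {
        int completed = 0;
        for (int i = 0; i < depth; i++) {
            String[] segs = generate(i);
            // Test the value just returned. Testing some other variable that
            // is never null would let a null segs reach the code below.
            if (segs == null) {
                break; // nothing more to fetch
            }
            // Safe to dereference segs from here on.
            if (segs.length > 0) {
                completed++;
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println("rounds completed: " + crawl(5));
    }
}
```

With the stand-in generate above, the loop stops at the third round instead of throwing a NullPointerException.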
Developing Nutch for semantic search
Hello, I am running Nutch 0.9. Our aim is to build a semantic search engine (for agriculture) using Nutch. However, I am not sure where to start. Could you advise me on how to proceed? Adarsh
[jira] Updated: (NUTCH-813) Repetitive crawl 403 status page
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-813: --- Attachment: Patch Repetitive crawl 403 status page Key: NUTCH-813 URL: https://issues.apache.org/jira/browse/NUTCH-813 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Nguyen Manh Tien Attachments: Patch When we crawl a page that returns a 403 status, it is crawled again every day under the default schedule, even when we restrict retries with the db.fetch.retry.max parameter. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-813) Repetitive crawl 403 status page
Repetitive crawl 403 status page Key: NUTCH-813 URL: https://issues.apache.org/jira/browse/NUTCH-813 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Nguyen Manh Tien Attachments: Patch When we crawl a page that returns a 403 status, it is crawled again every day under the default schedule, even when we restrict retries with the db.fetch.retry.max parameter.
[jira] Updated: (NUTCH-813) Repetitive crawl 403 status page
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-813: --- Priority: Minor (was: Major) Repetitive crawl 403 status page Key: NUTCH-813 URL: https://issues.apache.org/jira/browse/NUTCH-813 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Nguyen Manh Tien Priority: Minor Attachments: Patch When we crawl a page that returns a 403 status, it is crawled again every day under the default schedule, even when we restrict retries with the db.fetch.retry.max parameter.
[jira] Updated: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE
[ https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-812: Affects Version/s: 1.1 Priority: Critical (was: Major) Crawl.java incorrectly uses the Generator API resulting in NPE -- Key: NUTCH-812 URL: https://issues.apache.org/jira/browse/NUTCH-812 Project: Nutch Issue Type: Bug Affects Versions: 1.1 Reporter: Andrzej Bialecki Priority: Critical As reported by Phil Barnett on nutch-user: {quote} The Fix. In line 131 of Crawl.java, Generate no longer returns segments like it used to. Now it returns segs. Line 131 needs to read if (segs == null) instead of the current if (segments == null). After that change and a recompile, crawl is working just fine. {quote}
Re: [VOTE 2] Board resolution for Nutch as TLP
On 04/12/2010 02:08 PM, Andrzej Bialecki wrote:
> Hi, Take two, after s/crawling/search/ ... Following the discussion, below is the text of the proposed Board Resolution to vote upon.
[X] +1. Request the Board make Nutch a TLP
-- Sami Siren
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349 ] Julien Nioche commented on NUTCH-808: - Hi Enis, {quote} On the other hand, current implementation is ... {quote} What do you mean by current implementation? NutchBase? My gut feeling would be to write a custom framework instead of relying on DataNucleus, and to use Avro if possible. I really think that HBase support is urgently needed but am less convinced that we need MySQL in the very short term. I know that Cascading has various Tap/Sink implementations, including JDBC and HBase but also SimpleDB. Maybe it would be worth having a look at how they do it? Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs -- Key: NUTCH-808 URL: https://issues.apache.org/jira/browse/NUTCH-808 Project: Nutch Issue Type: Task Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 2.0 We have an ORM layer in the NutchBase branch, which uses Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks, whether they suit our needs. We want at least the following capabilities: - Using POJOs - Able to persist objects to at least HBase, Cassandra, and RDBMs - Able to efficiently serialize objects as task outputs from Hadoop jobs - Allow native queries, along with standard queries Any comments, suggestions for other frameworks are welcome.
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856360#action_12856360 ] Enis Soztutar commented on NUTCH-808:
bq. What do you mean by current implementation? NutchBase?
Indeed. The package o.a.n.storage deals with ORM (though not all of its classes do).
bq. I know that Cascading have various Tape/Sink implementations including JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how they do it?
The way Cascading does this is to convert Tuples (the Cascading data structure) to HBase/JDBC records. The schema for HBase/JDBC is given as metadata. Since they deal only with tuple -> table row, it is not that difficult. But again, Cascading does not allow for mapping lists to columns, etc.
bq. My gut feeling would be to write a custom framework instead of relying on DataNucleus and use AVRO if possible. I really think that HBase support is urgently needed but am less convinced that we need MySQL in the very short term.
Yeah, the more I think about it, the more I come to terms with a custom implementation. However, I think we might benefit a lot from the ideas in JDO in the long term. Also, a JDBC implementation may not be relevant for large-scale deployments, but it will be a very nice side effect of the ORM layer: it will allow easy deployment, which in turn will hopefully bring more users.
Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs -- Key: NUTCH-808 URL: https://issues.apache.org/jira/browse/NUTCH-808 Project: Nutch Issue Type: Task Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 2.0 We have an ORM layer in the NutchBase branch, which uses Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks, whether they suit our needs. We want at least the following capabilities: - Using POJOs - Able to persist objects to at least HBase, Cassandra, and RDBMs - Able to efficiently serialize objects as task outputs from Hadoop jobs - Allow native queries, along with standard queries Any comments, suggestions for other frameworks are welcome.
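To make the "tuple to table row, with the schema given as metadata" idea from this comment concrete, here is a minimal Java sketch. It is an illustration only and does not use Cascading, HBase, or JDBC APIs: TupleRowMapper and toRow are hypothetical names, a tuple is modelled as a plain list of values, and the schema is just an ordered list of column names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of flat tuple-to-row mapping; not a Cascading API. */
final class TupleRowMapper {

    /**
     * Pairs each value of the tuple with the column name at the same
     * position. The column list is the only schema metadata needed.
     */
    static Map<String, Object> toRow(List<String> columns, List<Object> tuple) {
        if (columns.size() != tuple.size()) {
            throw new IllegalArgumentException(
                "tuple has " + tuple.size() + " values but schema has "
                + columns.size() + " columns");
        }
        Map<String, Object> row = new LinkedHashMap<>(); // keeps column order
        for (int i = 0; i < columns.size(); i++) {
            row.put(columns.get(i), tuple.get(i));
        }
        return row;
    }
}
```

A mapping this flat is easy to write, which matches the remark that the tuple-only case is not that difficult. It also shows the limitation mentioned in the comment: a field that is itself a list has no natural single column, so an ORM layer has to decide how to spread or serialize it.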
[VOTE] Board resolution for Nutch as TLP
Hi, Following the discussion, below is the text of the proposed Board Resolution to vote upon. [] +1. Request the Board make Nutch a TLP [] +0. I don't feel strongly about it, but I'm okay with this. [] -1. No, don't request the Board make Nutch a TLP, and here are my reasons... This is a majority count vote (i.e. no vetoes). The vote is open for 72 hours. Here's my +1. === X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web crawling platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... 
RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. === -- Best regards, Andrzej Bialecki. Information Retrieval, Semantic Web; Embedded Unix, System Integration. http://www.sigram.com Contact: info at sigram dot com
Hold on... (Re: [VOTE] Board resolution for Nutch as TLP)
On 2010-04-12 12:57, Andrzej Bialecki wrote:
> Hi, Following the discussion, below is the text of the proposed Board Resolution to vote upon.

Ehh, scrap that ... I missed one occurrence of the crawling platform. Resending...

--
Best regards,
Andrzej Bialecki
[VOTE 2] Board resolution for Nutch as TLP
Hi,

Take two, after s/crawling/search/ ...

Following the discussion, below is the text of the proposed Board Resolution to vote upon.

[] +1. Request the Board make Nutch a TLP
[] +0. I don't feel strongly about it, but I'm okay with this.
[] -1. No, don't request the Board make Nutch a TLP, and here are my reasons...

This is a majority count vote (i.e. no vetoes). The vote is open for 72 hours.

Here's my +1.

===
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web search platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed.
===

--
Best regards,
Andrzej Bialecki
Re: [VOTE 2] Board resolution for Nutch as TLP
On Mon, Apr 12, 2010 at 14:08, Andrzej Bialecki a...@getopt.org wrote:
> Here's my +1.

And here is my +1.

-- Doğacan Güney
Re: [VOTE 2] Board resolution for Nutch as TLP
+1, thanks for pushing this forward Andrzej!

Cheers,
Chris

On 4/12/10 4:32 AM, Doğacan Güney doga...@gmail.com wrote:
> And here is my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: [VOTE 2] Board resolution for Nutch as TLP
+1

Scott Ganyo
Actor, Writer, Producer, Technologist
www.scottganyo.com 310.359.8728
Where the spirit does not work with the hand, there is no art. - Leonardo da Vinci

On Apr 12, 2010, at 4:08 AM, Andrzej Bialecki wrote:
> Here's my +1.
[jira] Resolved: (NUTCH-570) Improvement of URL Ordering in Generator.java
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic resolved NUTCH-570.
Resolution: Won't Fix

Improvement of URL Ordering in Generator.java
Key: NUTCH-570
URL: https://issues.apache.org/jira/browse/NUTCH-570
Project: Nutch
Issue Type: Improvement
Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
Attachments: GeneratorDiff.out, GeneratorDiff_v1.out

[Copied directly from my email to nutch-dev list]

Recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. Compared with old benchmarks I had set with the regular Fetcher.java, this took at least three times as long.

Anyway, I realized that a good ordering can be approached by randomization, but to get an optimal ordering, URLs from the same host should be as far apart in the list as possible. So I wrote a series of two map/reduce jobs to optimize the ordering, and for a list of 25M documents it takes about 10 minutes on our cluster. Right now I have it in its own class, but I figured it can go in Generator.java, with a flag in nutch-default.xml determining whether the user wants to use it.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
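The ordering goal described in the issue (URLs from the same host kept far apart in the fetch list) can be illustrated with a small in-memory sketch. This is not the pair of map/reduce jobs from the attached diffs, which are not reproduced here; it is a single-machine round-robin over hosts, and the class and method names (`HostSpread`, `spreadByHost`) are hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not code from Nutch or from the NUTCH-570 patch.
public class HostSpread {

    /** Returns the URLs reordered so that consecutive entries rotate across hosts. */
    public static List<String> spreadByHost(List<String> urls) {
        // Group URLs by host, keeping first-seen host order so the result is deterministic.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost(); // throws IllegalArgumentException on malformed input
            if (host == null) {
                host = ""; // URLs without a parsable host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        // Round-robin: take one URL from each non-empty host queue per pass.
        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : byHost.values()) {
                String next = queue.poll();
                if (next != null) {
                    ordered.add(next);
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/1", "http://a.example/2", "http://a.example/3",
                "http://b.example/1", "http://b.example/2",
                "http://c.example/1");
        for (String url : spreadByHost(urls)) {
            System.out.println(url);
        }
    }
}
```

Round-robin is only a rough stand-in for "as far apart as possible": once the smaller hosts run out, the remaining URLs of the largest host end up adjacent at the tail of the list, which is the case a more careful ordering would need to handle.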
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856124#action_12856124 ]

Enis Soztutar commented on NUTCH-808:

So, these are the results so far:

DataNucleus was previously known as JPOX, and it was the reference implementation for Java Data Objects (JDO). JDO is a Java standard for persistence. A similar specification, named JPA, is also a persistence standard, which is forked from EJB 3. However, JPA is designed for RDBMSs only, so it will not be useful for us (http://www.datanucleus.org/products/accessplatform/persistence_api.html).

In JDO, the first step is to define the domain objects as POJOs. Then the persistence metadata is specified using annotations, XML, or both. Then a byte code enhancer uses instrumentation to add the required methods to the classes defined as @PersistenceCapable. The database tables can be generated by hand, automatically by DataNucleus, or by using a tool (SchemaTool). The persistence layer uses standard JDO syntax, which is similar to JDBC. The objects can be queried using JPQL.

I have run a small test to persist objects of the WebTableRow class (from the NutchBase branch) to both MySQL and HBase. Although it took me a fair bit of time to set up, I was able to persist objects to both. However, although it is possible to map complex fields (like lists, maps, arrays, etc.) to RDBMSs using different strategies (such as serializing directly, using joins, or using foreign keys), I was not able to find a way to leverage the HBase data model. For example, we want to be able to map lists and maps to columns in column families. Without such functionality, using column-oriented stores does not bring any advantage.

For the byte[] serialization for MapReduce, we can either implement a new datastore for DataNucleus that also implements Hadoop's Serialization, or use Avro to generate Java classes to be fed into the JPOX enhancer, or else manually implement Writable.

To sum up, DataNucleus brings the following advantages:
- out-of-the-box RDBMS support
- XML or annotation metadata
- JDO is a Java standard
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex, and implementing a datastore is not trivial
- we need to write patches to DataNucleus to flexibly map complex fields to leverage HBase's data model
- we have no control over the source code
- no native HBase support (for example using filters, etc.)

On the other hand, the current implementation
- is tested in production,
- can leverage the HBase data model,
- can be modified to work with Avro serialization directly,
- could gain Cassandra support with little effort,
- can support multiple languages (in the future).

I believe that having SQLite, MySQL and HBase support is critical for Nutch 2.0, for out-of-the-box use, ease of deployment and real-scale computing respectively. But obviously we cannot use DataNucleus out of the box either. ORM is inherently a hard problem. I propose we go ahead and make the changes to DataNucleus to see if it is feasible, and continue with it if it suits our needs. Of course, having a custom framework will also be great, so any feedback would be more than welcome.

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

We have an ORM layer in the NutchBase branch, which uses the Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks to see whether they suit our needs.

We want at least the following capabilities:
- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMSs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments or suggestions for other frameworks are welcome.
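The mapping the comment asks for (each entry of a map-valued field stored as its own column inside a column family, rather than as one serialized blob) can be pictured without any datastore API. The following is a plain-Java sketch under stated assumptions: it calls neither HBase nor DataNucleus, cells are modelled as strings keyed by "family:qualifier", and the names `ColumnFamilySketch`, `toCells`, `fromCells` and the `outlinks` family are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; models column-family cells as a sorted String map.
public class ColumnFamilySketch {

    /** Flattens a map-valued field into one cell per entry, keyed "family:qualifier". */
    public static Map<String, String> toCells(String family, Map<String, String> field) {
        Map<String, String> cells = new TreeMap<>();
        for (Map.Entry<String, String> entry : field.entrySet()) {
            cells.put(family + ":" + entry.getKey(), entry.getValue());
        }
        return cells;
    }

    /** Rebuilds the map-valued field from the cells that belong to the given family. */
    public static Map<String, String> fromCells(String family, Map<String, String> cells) {
        String prefix = family + ":";
        Map<String, String> field = new TreeMap<>();
        for (Map.Entry<String, String> entry : cells.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                field.put(entry.getKey().substring(prefix.length()), entry.getValue());
            }
        }
        return field;
    }

    public static void main(String[] args) {
        Map<String, String> outlinks = new TreeMap<>();
        outlinks.put("http://example.org/a", "anchor text a");
        outlinks.put("http://example.org/b", "anchor text b");

        // One cell per outlink, all inside the assumed "outlinks" family.
        Map<String, String> row = toCells("outlinks", outlinks);
        row.put("meta:title", "Example page"); // a cell from another family in the same row

        System.out.println(row.keySet());
        System.out.println(fromCells("outlinks", row));
    }
}
```

With this layout a single entry can be read or updated as one column, whereas a field serialized as a blob has to be rewritten whole; that difference is the kind of advantage the comment says is lost when the ORM cannot map complex fields onto the column model.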
Re: [DISCUSS] Board resolution for Nutch as TLP
Hi,

On Sat, Apr 10, 2010 at 16:32, Jukka Zitting jukka.zitt...@gmail.com wrote:
> Hi,
>
> On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
>> WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public.
>
> Would it make sense to simplify the scope to "... open-source software related to large-scale web crawling for distribution at no charge to the public"?

Actually, shouldn't that be something like "web search platform", or maybe a "crawling and search platform"? Nutch is not just a crawler.

Anyway, +1 from me.

> BR,
>
> Jukka Zitting

-- Doğacan Güney
Re: [DISCUSS] Board resolution for Nutch as TLP
Hi Dogacan,

+1 to calling it a web search platform, since I agree, it’s not just a crawler.

Cheers,
Chris

On 4/11/10 11:40 AM, Doğacan Güney doga...@gmail.com wrote:
> Actually, shouldn't that be something like web search platform, or maybe a crawling and search platform? Nutch is not just a crawler.
Re: [DISCUSS] Board resolution for Nutch as TLP
On 2010-04-10 04:13, Mattmann, Chris A (388J) wrote:
> Hi Andrzej,
>
> +1, with the following amendment:
>
> RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged.
>
> This should read:
>
> RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged.

Good catch, thanks.

--
Best regards,
Andrzej Bialecki
Re: [DISCUSS] Board resolution for Nutch as TLP
Hi,

On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
> WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public.

Would it make sense to simplify the scope to "... open-source software related to large-scale web crawling for distribution at no charge to the public"?

BR,

Jukka Zitting
Adding jpeg parser to nutch
Hello.

I'm working on a school task, which is to modify Nutch so that it can identify and download JPEGs, create a thumbnail, and index the URLs of these JPEGs along with the other crawl results, so that the web interface can show images as well.

At the start I found that ParserNotFound.java could do the trick for me. I modified the constructor so that it matches the end of the URL against a pattern, and if the URL ends in jpeg it creates a file, named after the md5sum of the URL, in a directory on my filesystem and writes the URL into it. Well, this is ugly. I wanted to add the working directory to ParserNotFound.java, but I couldn't.

To move forward with my work, I first need to know how to make my own JPEG parser. After that I would like to index my results somehow :)

So, my question: how can I add my JPEG parser? Or, how can I add a new parser to the Nutch system?

Thanks for your answers.
Re: Adding jpeg parser to nutch
Hi David,

The latest Nutch release candidate (1.1, http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1) includes the tika-parser plugin, which provides a JpegParser (see here: http://bit.ly/b0zRX8) that hopefully can suit your needs. Let me know what you think.

Cheers,
Chris

On 4/10/10 6:56 AM, Gombkötő Dávid madav...@gmail.com wrote:
> So.. my question.. how can i add my jpeg parser? Or, how can i add a new parser to the nutch system?
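For reference, the stopgap described in the original question (recognize URLs that end in a JPEG extension and name a file after the MD5 sum of the URL) comes down to two small helpers. This is a plain-Java sketch that uses only the JDK; it does not touch the Nutch plugin API and does not show how to register a parser. The class and method names (`JpegUrlSketch`, `looksLikeJpeg`, `md5Hex`) are hypothetical, and accepting both `.jpg` and `.jpeg` is an assumption, since the question only mentions "jpeg".

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch or of the tika-parser plugin.
public class JpegUrlSketch {

    /** True if the URL string ends in .jpg or .jpeg, ignoring case. Query strings are not handled. */
    public static boolean looksLikeJpeg(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.endsWith(".jpg") || lower.endsWith(".jpeg");
    }

    /** Returns the MD5 digest of the URL's UTF-8 bytes as 32 lowercase hex characters. */
    public static String md5Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            // Left-pad with zeros so the name always has 32 characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://example.org/images/photo.JPG";
        if (looksLikeJpeg(url)) {
            System.out.println(md5Hex(url) + " <- " + url);
        }
    }
}
```

A suffix check is a weak test, since servers may serve JPEGs from URLs with no extension at all; deciding by content type, as a parser plugin selected per MIME type would, avoids that guesswork.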
Re: [DISCUSS] Board resolution for Nutch as TLP
On 2010-04-10 15:32, Jukka Zitting wrote:
> Hi,
>
> On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
>> WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public.
>
> Would it make sense to simplify the scope to "... open-source software related to large-scale web crawling for distribution at no charge to the public"?

Yes, that's a good change too.

--
Best regards,
Andrzej Bialecki
Re: [DISCUSS] Board resolution for Nutch as TLP
Looks good to me after the proposed changes.

-- Sami Siren

On Sat, Apr 10, 2010 at 6:09 PM, Andrzej Bialecki a...@getopt.org wrote:
> Yes, that's a good change too.
Re: [DISCUSS] Board resolution for Nutch as TLP
I think it looks good after the minor changes. +1.

Dennis

Andrzej Bialecki wrote:
> The text of the resolution follows. Committers, please read it and optionally comment on the salient points of the text, the rest is boilerplate.
[DISCUSS] Board resolution for Nutch as TLP
Hi,

I was told that the next step is to come up with the proposed Board resolution and vote on it among the committers. Here's the proposed text (shameless copy-paste from the Tika and Mahout proposals).

IMPORTANT NOTE: I removed from the members of the PMC those existing Nutch committers who haven't been active for more than 1 year, with the intention of moving them to Emeritus status. If any one of these people feels left out and would like to become an active committer in the project, please let us know and we will gladly welcome you back :)

The text of the resolution follows. Committers, please read it and optionally comment on the salient points of the text; the rest is boilerplate. If there's an overall consensus I will call for a formal vote to submit this proposal to the Board.

===
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed.
===

--
Best regards,
Andrzej Bialecki
Re: [DISCUSS] Board resolution for Nutch as TLP
Hi Andrzej,

+1, with the following amendment:

> RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Nutch Project are hereafter discharged.

This should read:

RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged.

Cheers,
Chris

++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Nutch 2.0 roadmap
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-07 18:54, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Hmm .. this puzzles me, do you think we should port changes from 1.1 to nutchbase? I thought we should do it the other way around, i.e. merge nutchbase bits to trunk. Hmm, I am a bit out of touch with the latest changes but I know that the differences between trunk and nutchbase are unfortunately rather large right now. If merging nutchbase back into trunk would be easier then sure, let's do that. * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. Again, the advantage of DataNucleus is that we don't have to handcraft all the mid- to low-level mappings, just the mid-level ones (JOQL or whatever) - the cost of maintenance is lower, and the number of backends that are supported out of the box is larger. 
Of course, this is just IMHO - we won't know for sure until we try to use both your custom ORM and DataNucleus... I am obviously a bit biased here but I have no strong feelings really. DataNucleus is an excellent project. What I like about avro-based approach is the essentially free MapReduce support we get and the fact that supporting another language is easy. So, we can expose partial hbase data through a server and a python-client can easily read/write to it, thanks to avro. That being said, I am all for DataNucleus or something else. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Nutch 2.0 roadmap
Hi, On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote: Just a question ? Will the new HBase implementation allow more sophisticated crawling strategies than the current score based. Give you a few example of what I'd like to do : Define different crawling frequency for different set of URLs, say weekly for some url, monthly or more for others. Select URLs to re-crawl based on attributes previously extracted.Just one example: recrawl urls that contained a certain keyword (or set of) Select URLs that have not yet been crawled, at the frontier of the crawl therefore At some point, it would be nice to change generator so that it is only a handful of methods and a pig (or something else) script. So, we would provide most of the functions you may need during generation (accessing various data) but actual generation would be a pig process. This way, anyone can easily change generate any way they want (even make it more jobs than 2 if they want more complex schemes). 2010/4/7, Doğacan Güney doga...@gmail.com: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. 
Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. 
This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney -- -MilleBii- -- Doğacan Güney
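As an illustration of the crawling strategies asked about in this message (per-set recrawl frequencies, selection on previously extracted attributes, frontier-only selection), here is a minimal sketch in Python. It is not Nutch code and not the proposed Pig-based generator: `UrlRecord`, `RECRAWL_INTERVALS`, the label names and `generate` are hypothetical names chosen for the example, and a real generator would run as a distributed job over the crawl store rather than over an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable, List, Optional, Set


@dataclass
class UrlRecord:
    """Hypothetical per-URL row; field names are assumptions, not Nutch's schema."""
    url: str
    last_fetched: Optional[datetime] = None  # None means never crawled (frontier)
    labels: Set[str] = field(default_factory=set)  # e.g. set by user-supplied logic


# Assumed configuration: labelled URLs are recrawled weekly, everything else monthly.
RECRAWL_INTERVALS: Dict[str, timedelta] = {"topic-a": timedelta(days=7)}
DEFAULT_INTERVAL = timedelta(days=30)


def recrawl_interval(record: UrlRecord) -> timedelta:
    """Use the shortest interval among the record's labels, else the default."""
    intervals = [RECRAWL_INTERVALS[l] for l in record.labels if l in RECRAWL_INTERVALS]
    return min(intervals) if intervals else DEFAULT_INTERVAL


def is_due(record: UrlRecord, now: datetime) -> bool:
    """Frontier URLs are always due; fetched URLs are due once their interval has elapsed."""
    if record.last_fetched is None:
        return True
    return now - record.last_fetched >= recrawl_interval(record)


def generate(
    records: Iterable[UrlRecord],
    now: datetime,
    predicates: Iterable[Callable[[UrlRecord], bool]] = (),
) -> List[str]:
    """Select URLs that are due and satisfy every extra user-supplied predicate."""
    predicates = list(predicates)
    return [r.url for r in records if is_due(r, now) and all(p(r) for p in predicates)]
```

Passing `[lambda r: r.last_fetched is None]` as `predicates` restricts the selection to the frontier, and a predicate on `labels` gives the "recrawl URLs that matched topic A" case. The point of the sketch is only that multi-criteria selection reduces to composing small functions over the stored attributes.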
Re: Nutch 2.0 roadmap
Not sure what you mean by a pig script, but I'd like to be able to make a multi-criteria selection of URLs for fetching...

The scoring method forces a kind of one-dimensional approach, which is not really easy to deal with. The regex filters are good, but they assume you want to select URLs based on data that is in the URL itself... pretty limited, in fact.

I basically would like to do 'content'-based crawling. Say, for example, that I'm interested in topic A. I'd like to label URLs that match topic A (user-supplied logic). Later on I would want to crawl topic A URLs at a certain frequency, and unlabeled URLs in a different way, for exploring. This looks hard to do right now.

2010/4/8, Doğacan Güney doga...@gmail.com:
> Hi,
>
> On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
>> Just a question ? Will the new HBase implementation allow more sophisticated crawling strategies than the current score based. Give you a few example of what I'd like to do :
>> Define different crawling frequency for different set of URLs, say weekly for some url, monthly or more for others.
>> Select URLs to re-crawl based on attributes previously extracted. Just one example: recrawl urls that contained a certain keyword (or set of)
>> Select URLs that have not yet been crawled, at the frontier of the crawl therefore
>
> At some point, it would be nice to change generator so that it is only a handful of methods and a pig (or something else) script. So, we would provide most of the functions you may need during generation (accessing various data) but actual generation would be a pig process. This way, anyone can easily change generate any way they want (even make it more jobs than 2 if they want more complex schemes).
Re: Nutch 2.0 roadmap
On Thu, Apr 8, 2010 at 21:11, MilleBii mille...@gmail.com wrote:
> Not sure what u mean by pig script, but I'd like to be able to make a multi-criteria selection of Url for fetching...

I mean a query language like Pig (http://hadoop.apache.org/pig/). If we expose the data correctly, then you should be able to generate on any criteria that you want.

> The scoring method forces into a kind of mono dimensional approach which is not really easy to deal with. The regex filters are good but it assumes you want select URLs on data which is in the URL... Pretty limited in fact I basically would like to do 'content' based crawling. Say for example: that I'm interested in topic A. I'd'like to label URLs that match Topic A (user supplied logic). Later on I would want to crawl topic A urls at a certain frequency and non labeled urls for exploring in a different way. This looks like hard to do right now
>
> 2010/4/8, Doğacan Güney doga...@gmail.com:
>> Hi,
>>
>> On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
>>> Just a question ? Will the new HBase implementation allow more sophisticated crawling strategies than the current score based. Give you a few example of what I'd like to do :
>>> Define different crawling frequency for different set of URLs, say weekly for some url, monthly or more for others.
>>> Select URLs to re-crawl based on attributes previously extracted. Just one example: recrawl urls that contained a certain keyword (or set of)
>>> Select URLs that have not yet been crawled, at the frontier of the crawl therefore
>>
>> At some point, it would be nice to change generator so that it is only a handful of methods and a pig (or something else) script. So, we would provide most of the functions you may need during generation (accessing various data) but actual generation would be a pig process. This way, anyone can easily change generate any way they want (even make it more jobs than 2 if they want more complex schemes).
Re: [VOTE] Apache Nutch 1.1 Release Candidate #1
..and here is to a Vote: +1 Oh, per usual, forgot to throw in my +1. So, +1! Cheers, Chris On 4/7/10 1:14 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I have posted a candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made using the Nutch release process, documented on the Wiki here: http://bit.ly/d5ugid A Nutch 1.1 tag is at: http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/ Please vote on releasing these packages as Apache Nutch 1.1. The vote is open for the next 72 hours. Only votes from Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.1. [ ] -1 Do not release the packages because... Thanks! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[Nutch Wiki] Update of FrontPage by JulienNioche
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The FrontPage page has been changed by JulienNioche.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=128&rev2=129

--
   * [[Mailing]] Lists
   * AcademicArticles that deal with Nutch
   * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch search engine author:Doug Cutting,Video Lecture
-
  == Nutch Administration ==
   * DownloadingNutch
@@ -89, +88 @@
   * TikaPlugin - Comments on the Tika integration and differences with existing parse plugins
  == Nutch 2.0 ==
+  * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
-  * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.
+  * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture (old)
   * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
   * NewScoringIndexingExample -- Two full fetch cycles of commands using new scoring and indexing systems.
[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The Nutch2Roadmap page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap

--
New page:

= Nutch2Roadmap =

Here is a list of the features and architectural changes that will be implemented in Nutch 2.0.

 * Storage Abstraction
  * initially with back end implementations for HBase and HDFS
  * extend it to other storages later e.g. MySQL etc...
 * Plugin cleanup : Tika only for parsing document formats
  * keep only stuff HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format.
 * Externalize functionalities to crawler-commons project [http://code.google.com/p/crawler-commons/]
  * robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
 * Remove index / search and delegate to SOLR
  * we may still keep a thin abstract layer to allow other indexing/search backends (ElasticSearch?), but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.
 * Various new functionalities
  * e.g. sitemap support, canonical tag, better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc.

This document is meant to serve as a basis for discussion, feel free to contribute to it
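To make the "Storage Abstraction" item above more concrete, the following is a minimal sketch, in Python, of what a backend-neutral storage interface could look like. It is an assumption made purely for illustration: `WebPageStore`, `InMemoryStore` and `mark_fetched` are hypothetical names, not the NutchBase API, and an in-memory dictionary stands in for a real HBase or HDFS backend.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional, Tuple


class WebPageStore(ABC):
    """Hypothetical backend-neutral interface; crawler code would depend only on this."""

    @abstractmethod
    def put(self, url: str, columns: Dict[str, str]) -> None:
        """Create the row for `url` or merge `columns` into the existing row."""

    @abstractmethod
    def get(self, url: str) -> Optional[Dict[str, str]]:
        """Return a copy of the row for `url`, or None if it is unknown."""

    @abstractmethod
    def scan(self, prefix: str = "") -> Iterator[Tuple[str, Dict[str, str]]]:
        """Iterate over rows in key order, optionally limited to a key prefix."""


class InMemoryStore(WebPageStore):
    """Stand-in backend for the sketch; a real one would wrap HBase, HDFS, MySQL, ..."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, str]] = {}

    def put(self, url: str, columns: Dict[str, str]) -> None:
        self._rows.setdefault(url, {}).update(columns)

    def get(self, url: str) -> Optional[Dict[str, str]]:
        row = self._rows.get(url)
        return dict(row) if row is not None else None

    def scan(self, prefix: str = "") -> Iterator[Tuple[str, Dict[str, str]]]:
        for url in sorted(self._rows):
            if url.startswith(prefix):
                yield url, dict(self._rows[url])


def mark_fetched(store: WebPageStore, url: str, status: str) -> None:
    """Example of crawler logic written against the interface only."""
    store.put(url, {"status": status})
```

Because `mark_fetched` only sees `WebPageStore`, swapping the backend would not require touching crawler logic, which is the property the roadmap item is after.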
Re: Nutch 2.0 roadmap
Hi,

> I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ...

yes, maybe we should start the 2.0 branch from 1.1 instead. Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it

> Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.

definitely

> +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.

I think that separating the parsing filters from the indexing filters can have its merits, e.g. combining the metadata generated by 2 or more different parsing filters into a single field in the NutchDocument, keeping only a subset of the available information etc...

> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update?

Have created a new page to serve as a support for discussion : http://wiki.apache.org/nutch/Nutch2Roadmap

julien

-- DigitalPebble Ltd http://www.digitalpebble.com
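The case made above for keeping indexing filters separate from parsing filters (merging the output of several parse filters into one field, and keeping only a subset of the metadata) can be sketched as follows. This is a Python illustration under stated assumptions, not the Nutch plugin API: a plain dictionary stands in for NutchDocument, and `combine_fields`, `keep_only`, `build_document` and the metadata keys are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set

ParseMetadata = Dict[str, List[str]]  # what the parse filters produced
Document = Dict[str, List[str]]       # stand-in for a NutchDocument
IndexingFilter = Callable[[ParseMetadata, Document], Document]


def combine_fields(sources: Iterable[str], target: str) -> IndexingFilter:
    """Merge the values of several parse-metadata keys into one document field."""
    sources = list(sources)

    def indexing_filter(meta: ParseMetadata, doc: Document) -> Document:
        values: List[str] = []
        for key in sources:
            for value in meta.get(key, []):
                if value not in values:  # drop duplicates, keep first-seen order
                    values.append(value)
        if values:
            doc[target] = values
        return doc

    return indexing_filter


def keep_only(allowed: Set[str]) -> IndexingFilter:
    """Keep only a subset of the fields in the final document."""

    def indexing_filter(meta: ParseMetadata, doc: Document) -> Document:
        return {name: vals for name, vals in doc.items() if name in allowed}

    return indexing_filter


def build_document(meta: ParseMetadata, filters: Iterable[IndexingFilter]) -> Document:
    """Start from a copy of the parse metadata and apply the indexing filters in order."""
    doc: Document = {name: list(vals) for name, vals in meta.items()}
    for indexing_filter in filters:
        doc = indexing_filter(meta, doc)
    return doc
```

With metadata such as `{"meta.keywords": [...], "ner.topics": [...]}`, chaining `combine_fields(["meta.keywords", "ner.topics"], "tags")` and `keep_only({"tags"})` yields a document with a single merged `tags` field, which is the kind of post-parse step that would be awkward if parsing wrote straight into the index document.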
[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification.

The Nutch2Roadmap page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=1&rev2=2

--
   * Storage Abstraction
    * initially with back end implementations for HBase and HDFS
    * extend it to other storages later e.g. MySQL etc...
-  * Plugin cleanup : Tika only for parsing document formats
+  * Plugin cleanup : Tika only for parsing document formats (see http://wiki.apache.org/nutch/TikaPlugin)
    * keep only stuff HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format.
   * Externalize functionalities to crawler-commons project [http://code.google.com/p/crawler-commons/]
    * robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-808:
    Fix Version/s: 2.0

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
--
Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 2.0

We have an ORM layer in the NutchBase branch, which uses Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks, whether they suit our needs. We want at least the following capabilities:
- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments, suggestions for other frameworks are welcome.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
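For readers unfamiliar with the idea of "class definitions given in JSON", the sketch below shows the general principle in Python using only the standard library. It does not use Avro and does not reproduce the NutchBase ORM: the schema merely follows the shape of an Avro record declaration, and `SCHEMA_JSON`, its field names, `PRIMITIVES` and `make_record_class` are assumptions made for the example.

```python
import json

# Hypothetical record definition, shaped like an Avro record schema.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "WebPage",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "fetchTime", "type": "long"},
    {"name": "score", "type": "float"}
  ]
}
"""

# Only a few primitive types are mapped in this sketch.
PRIMITIVES = {"string": str, "long": int, "float": float}


def make_record_class(schema_json: str) -> type:
    """Build a plain, type-checked class from a JSON record definition."""
    schema = json.loads(schema_json)
    field_types = {f["name"]: PRIMITIVES[f["type"]] for f in schema["fields"]}

    class Record:
        __slots__ = tuple(field_types)

        def __init__(self, **values):
            if set(values) != set(field_types):
                raise ValueError("expected exactly the fields: %s" % sorted(field_types))
            for name, expected in field_types.items():
                if type(values[name]) is not expected:
                    raise TypeError("%s must be %s" % (name, expected.__name__))
                setattr(self, name, values[name])

        def to_dict(self) -> dict:
            """Plain-dict view, e.g. for handing the object to a storage backend."""
            return {name: getattr(self, name) for name in field_types}

    Record.__name__ = schema["name"]
    return Record
```

A call such as `make_record_class(SCHEMA_JSON)(url="http://example.org/", fetchTime=0, score=1.0)` then yields a POJO-like object whose fields are fixed by the JSON definition. A real framework would additionally generate serialization code and backend mappings, which is the part the evaluation in this issue is concerned with.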
Re: Nutch 2.0 roadmap
Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. 
* remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Nutch 2.0 roadmap
Hi, On 04/07/2010 07:54 PM, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. A suggestion would be to continue with trunk until nutchbase is stable. Once it is, we can merge the nutchbase branch into trunk (after the 1.1 split), at which point trunk becomes nutchbase plus the other merged issues. Then, when the time comes, we can fork branch-2.0 and release when the blockers are done. I strongly advise against having both a trunk and a 2.0 branch for development. Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808 https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. The current ORM code is merged with the nutchbase code, but I think the sooner we split it the better, since development will be much clearer and simpler this way. I have opened NUTCH-808 to explore the alternatives, but we might as well continue with the current implementation. I intend to share my findings in a couple of days.
* plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. So, it seems that at some point, we need to bite the bullet, and refactor plugins, dropping backwards compatibility. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? 
Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
Forgot to say that, at Hadoop, it is the convention that big issues, like the ones under discussion come with a design document. So that a solid design is agreed upon for the work. We can apply the same pattern at Nutch. On 04/07/2010 07:54 PM, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialeckia...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. 
We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-07 18:54, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Hmm .. this puzzles me, do you think we should port changes from 1.1 to nutchbase? I thought we should do it the other way around, i.e. merge nutchbase bits to trunk. * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. Again, the advantage of DataNucleus is that we don't have to handcraft all the mid- to low-level mappings, just the mid-level ones (JOQL or whatever) - the cost of maintenance is lower, and the number of backends that are supported out of the box is larger. Of course, this is just IMHO - we won't know for sure until we try to use both your custom ORM and DataNucleus... -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
On 2010-04-07 19:24, Enis Söztutar wrote: Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. So, it seems that at some point, we need to bite the bullet, and refactor plugins, dropping backwards compatibility. Right, that was my point - now is the time to break it, with the cut-over to 2.0, and leaving 1.1 branch in a good shape, to serve well enough in the interim period. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Nutch 2.0 roadmap
Just a question: will the new HBase implementation allow more sophisticated crawling strategies than the current score-based one? A few examples of what I'd like to do:
- Define different crawling frequencies for different sets of URLs, say weekly for some URLs and monthly or longer for others.
- Select URLs to re-crawl based on attributes previously extracted. Just one example: recrawl URLs that contained a certain keyword (or set of keywords).
- Select URLs that have not yet been crawled, i.e. those at the frontier of the crawl.

2010/4/7, Doğacan Güney doga...@gmail.com: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808 https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. * plugin cleanup : Tika only for parsing - get rid of everything else? 
Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. 
___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney -- -MilleBii-
[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854665#action_12854665 ] Otis Gospodnetic commented on NUTCH-570: I'm tempted to close this issue as Won't Fix, because: * I have no way to test and verify this * nobody seems to be using this * this issue has only 2 votes and only 3 watchers * the original reporter mentioned he noticed only marginal speedups Improvement of URL Ordering in Generator.java - Key: NUTCH-570 URL: https://issues.apache.org/jira/browse/NUTCH-570 Project: Nutch Issue Type: Improvement Components: generator Reporter: Ned Rockson Assignee: Otis Gospodnetic Priority: Minor Attachments: GeneratorDiff.out, GeneratorDiff_v1.out [Copied directly from my email to nutch-dev list] Recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. In comparison with old benchmarks I had set with regular Fetcher.java this was at least 3 fold more time. Anyway, I realized that the best situation for ordering can be approached by randomization, but in order to get optimal ordering, urls from the same host should be as far apart in the list as possible. So I wrote a series of 2 map/reduces to optimize the ordering and for a list of 25M documents it takes about 10 minutes on our cluster. Right now I have it in its own class, but I figured it can go in Generator.java and just add a flag in nutch-default.xml determining if the user wants to use it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java
[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854767#action_12854767 ] Chris A. Mattmann commented on NUTCH-570: - Hi Otis: I think your logic is perfectly rational here. Maybe you could leave it open for another 48 hrs, and then close it out if you don't get any feedback from the original reporter, or those that were interested. Cheers, Chris Improvement of URL Ordering in Generator.java - Key: NUTCH-570 URL: https://issues.apache.org/jira/browse/NUTCH-570 Project: Nutch Issue Type: Improvement Components: generator Reporter: Ned Rockson Assignee: Otis Gospodnetic Priority: Minor Attachments: GeneratorDiff.out, GeneratorDiff_v1.out [Copied directly from my email to nutch-dev list] Recently I switched to Fetcher2 over Fetcher for larger whole web fetches (50-100M at a time). I found that the URLs generated are not optimal because they are simply randomized by a hash comparator. In one crawl on 24 machines it took about 3 days to crawl 30M URLs. In comparison with old benchmarks I had set with regular Fetcher.java this was at least 3 fold more time. Anyway, I realized that the best situation for ordering can be approached by randomization, but in order to get optimal ordering, urls from the same host should be as far apart in the list as possible. So I wrote a series of 2 map/reduces to optimize the ordering and for a list of 25M documents it takes about 10 minutes on our cluster. Right now I have it in its own class, but I figured it can go in Generator.java and just add a flag in nutch-default.xml determining if the user wants to use it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
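The ordering idea in the NUTCH-570 description (keep URLs from the same host as far apart in the fetch list as possible) can be illustrated with a small single-machine sketch. This is not the attached GeneratorDiff patch, which is described as two map/reduce jobs; the class and method names below are hypothetical, and round-robin interleaving is only one simple approximation of "as far apart as possible".

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
class HostSpreadOrdering {

    /**
     * Groups URLs by host (in first-seen order), then emits one URL per host
     * in turn until every per-host queue is empty.
     * Assumes every input string is a syntactically valid URI;
     * URI.create throws IllegalArgumentException otherwise.
     */
    static List<String> interleaveByHost(List<String> urls) {
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            String key = (host == null) ? "" : host;
            byHost.computeIfAbsent(key, h -> new ArrayDeque<>()).add(url);
        }

        List<String> ordered = new ArrayList<>(urls.size());
        while (!byHost.isEmpty()) {
            // One pass takes a single URL from each host that still has URLs left.
            Iterator<Deque<String>> it = byHost.values().iterator();
            while (it.hasNext()) {
                Deque<String> queue = it.next();
                ordered.add(queue.poll());
                if (queue.isEmpty()) {
                    it.remove();
                }
            }
        }
        return ordered;
    }
}
```

With this scheme a host that dominates the list still ends up with consecutive URLs at the tail, which is one reason a distributed implementation might prefer a smarter spacing strategy.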
[jira] Created: (NUTCH-810) Upgrade to Tika 0.7
Upgrade to Tika 0.7 --- Key: NUTCH-810 URL: https://issues.apache.org/jira/browse/NUTCH-810 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Upgrading to Tika 0.7 before 1.1 release The TikaConfig mechanism has changed and does not rely on a default XML config file anymore. Am working on it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-789: Component/s: (was: fetcher) parser Fix Version/s: (was: 1.1) Have created a separate issue for the upgrade of Tika 0.7 and moved this one out of 1.1 Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: parser Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7
[ https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-810. --- Resolution: Fixed Committed in rev 931098. http://issues.apache.org/jira/browse/TIKA-317 changed the way the TikaConfig is created as it does not rely on a tika-config.xml file any longer. Our custom TikaConfig has been modified to reflect these changes. This was the last remaining issue marked for 1.1 Upgrade to Tika 0.7 --- Key: NUTCH-810 URL: https://issues.apache.org/jira/browse/NUTCH-810 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Upgrading to Tika 0.7 before 1.1 release The TikaConfig mechanism has changed and does not rely on a default XML config file anymore. Am working on it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
release of 1.1?
Chris, Just to let you know that I have committed https://issues.apache.org/jira/browse/NUTCH-810 which was the last open issue before the release of 1.1 Thanks Julien -- DigitalPebble Ltd http://www.digitalpebble.com
Nutch 2.0 roadmap
Hi guys,

I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch?

Talking about features, what else would we add apart from :
* support for HBase : via ORM or not (see NUTCH-808 https://issues.apache.org/jira/browse/NUTCH-808 )
* plugin cleanup : Tika only for parsing - get rid of everything else?
* remove index / search and delegate to SOLR
* new functionalities e.g. sitemap support, canonical tag etc...

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update?

I look forward to hearing your thoughts on this

Julien -- DigitalPebble Ltd http://www.digitalpebble.com
Re: release of 1.1?
Thanks Julien! OK, I'll cut the RC at some point today. Thanks! Cheers, Chris On 4/6/10 4:47 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Chris, Just to let you know that I have committed https://issues.apache.org/jira/browse/NUTCH-810 which was the last open issue before the release of 1.1 Thanks Julien ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Nutch 2.0 roadmap
On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. 
This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
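The "thin abstract layer" for indexing backends mentioned in this reply could look roughly like the sketch below. It is only an illustration under assumptions: IndexBackend, BufferingBackend and IndexingSketch are hypothetical names, a plain field map stands in for NutchDocument, and no Solr or Nutch API is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical backend contract: it receives finished documents and nothing else. */
interface IndexBackend {
    void write(Map<String, String> document);
    void commit();
}

/** Stand-in backend that only buffers documents; a real one would forward them to a search server. */
class BufferingBackend implements IndexBackend {
    final List<Map<String, String>> committed = new ArrayList<>();
    private final List<Map<String, String>> pending = new ArrayList<>();

    @Override
    public void write(Map<String, String> document) {
        // Copy defensively so later changes by the caller do not leak into the buffer.
        pending.add(new LinkedHashMap<>(document));
    }

    @Override
    public void commit() {
        committed.addAll(pending);
        pending.clear();
    }
}

/** The crawler side builds a document and hands it over; it knows nothing about the backend. */
class IndexingSketch {
    static void index(IndexBackend backend, String url, String title) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("title", title);
        backend.write(doc);
    }
}
```

The point of the design is that query parsing and text analysis never appear on the crawler side of the interface, which matches the suggestion to leave that processing to the search backend.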
[jira] Commented: (NUTCH-810) Upgrade to Tika 0.7
[ https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854332#action_12854332 ] Hudson commented on NUTCH-810: -- Integrated in Nutch-trunk #1116 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1116/]) Upgraded to Tika 0.7 Upgrade to Tika 0.7 --- Key: NUTCH-810 URL: https://issues.apache.org/jira/browse/NUTCH-810 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.0.0 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.1 Upgrading to Tika 0.7 before 1.1 release The TikaConfig mechanism has changed and does not rely on a default XML config file anymore. Am working on it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[VOTE] Apache Nutch 1.1 Release Candidate #1
Hi Folks,

I have posted a candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/

See the included CHANGES.txt file for details on release contents and latest changes. The release was made using the Nutch release process, documented on the Wiki here: http://bit.ly/d5ugid

A Nutch 1.1 tag is at: http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

Please vote on releasing these packages as Apache Nutch 1.1. The vote is open for the next 72 hours. Only votes from Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.
[ ] -1 Do not release the packages because...

Thanks!

Cheers, Chris

++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853251#action_12853251 ] Julien Nioche commented on NUTCH-789: - Will upgrade as soon as 0.7 is available from http://repo1.maven.org/maven2/org/apache/tika/ - which is not the case yet. I will leave this issue open but unmark it as 1.1 Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Question: Nutch 0.8.2 and Nutch 0.7.3?
On 2010-04-04 02:59, Mattmann, Chris A (388J) wrote: Hey Guys, Question. I see 2 releases that haven't been cut in JIRA: 0.8.2: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312064 0.7.3: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312176 I'm happy to cut 0.8.2 as part of the 1.1 effort, to get it out the door. However, I have a question: is this Nutch 0.8.2 in SVN? http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/ That's the code that was intended to become 0.8.2 ... However, I'm not sure whether there's any benefit in releasing either of these. Those who really had the need to track this branch (or 0.7) likely used the code from this branch even though it wasn't released. And I believe we are not interested in maintaining a new release based on this code...? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Question: Nutch 0.8.2 and Nutch 0.7.3?
Hey Andrzej, http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/ That's the code that was intended to become 0.8.2 ... However, I'm not sure whether there's any benefit in releasing either of these. Those who really had the need to track this branch (or 0.7) likely used the code from this branch even though it wasn't released. And I believe we are not interested in maintaining a new release based on this code...? No problem, just wanted to gauge interest. Is everyone OK with me closing out those releases in JIRA, then? Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853285#action_12853285 ] Chris A. Mattmann commented on NUTCH-789: - Hey Julien, Tika 0.7 is available from Maven central: http://repo1.maven.org/maven2/org/apache/tika/tika-parsers/ Cheers, Chris Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-807) JSParseFilter produces malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Minyao Zhu updated NUTCH-807: - Summary: JSParseFilter produces malformed URL (was: JSParseFilter produces weired URL) JSParseFilter produces malformed URL Key: NUTCH-807 URL: https://issues.apache.org/jira/browse/NUTCH-807 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.0.0 Environment: Redhat 2.6.18-128.1.6.el5PAE i686 i686 i386 GNU/Linux Reporter: Minyao Zhu This is found when crawling site: http://zhidao.baidu.com/( a Chinese language site ) It appears this page contains javascripts which confused JSParseFilter, which produced URL like this: http://zhidao.baidu.com/){if(A===46){baidu.hide( Not sure the impact/scope of this issue in general. The observation for this specific site is, much less pages got crawled. Thanks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-789) Improvements to Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853212#action_12853212 ] Chris A. Mattmann commented on NUTCH-789: - Hey Julien -- okey dok, Tika 0.7 has been released. Feel free to upgrade, and close this one out...after that, I'll cut the Nutch 1.1 RC. Thanks! Cheers, Chris Improvements to Tika parser --- Key: NUTCH-789 URL: https://issues.apache.org/jira/browse/NUTCH-789 Project: Nutch Issue Type: Improvement Components: fetcher Environment: reported by Sami, in NUTCH-766 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.1 Attachments: NutchTikaConfig.java, TikaParser.java As reported by Sami in NUTCH-766, Sami has a few improvements he made to the Tika parser. We'll track that progress here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Question: Nutch 0.8.2 and Nutch 0.7.3?
Hey Guys, Question. I see 2 releases that haven't been cut in JIRA: 0.8.2: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312064 0.7.3: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312176 I'm happy to cut 0.8.2 as part of the 1.1 effort, to get it out the door. However, I have a question: is this Nutch 0.8.2 in SVN? http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/ Nutch 0.7.3 has no issues associated with it, so should I remove it? It's been a few years since it was created it seems and I don't think it's got active maintenance, or a user base. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[jira] Created: (NUTCH-807) JSParseFilter produces weird URL
JSParseFilter produces weird URL - Key: NUTCH-807 URL: https://issues.apache.org/jira/browse/NUTCH-807 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.0.0 Environment: Redhat 2.6.18-128.1.6.el5PAE i686 i686 i386 GNU/Linux Reporter: Minyao Zhu This was found when crawling the site http://zhidao.baidu.com/ (a Chinese-language site). It appears this page contains JavaScript that confuses JSParseFilter, which produces URLs like this: http://zhidao.baidu.com/){if(A===46){baidu.hide( I am not sure of the impact or scope of this issue in general. For this specific site, the observed effect is that far fewer pages get crawled. Thanks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs -- Key: NUTCH-808 URL: https://issues.apache.org/jira/browse/NUTCH-808 Project: Nutch Issue Type: Task Reporter: Enis Soztutar Assignee: Enis Soztutar We have an ORM layer in the NutchBase branch, which uses Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating other frameworks, whether they suit our needs. We want at least the following capabilities: - Using POJOs - Able to persist objects to at least HBase, Cassandra, and RDBMs - Able to efficiently serialize objects as task outputs from Hadoop jobs - Allow native queries, along with standard queries Any comments, suggestions for other frameworks are welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
(apologies for the cross-post, but this impacts Nutch 1.1, so just wanted folks to see it) * +1 on extending the deadline until Monday, April 5th. Right now, we have 3 +1s, so technically we could still do the 72 hrs and still be OK, but I'm fine with giving folks some more time to take a look * Thanks to jzitting and gsingers for taking a look and voting so far * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That OK, Nutchers? * Thanks for comments on the CHANGES from gsingers, and the mention to include the sha1 of the src archive from jzitting. Will do on both, going forward. * +1 for having a direct link to tika-app on the website. Cheers, Chris On 4/1/10 11:41 PM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Wed, Mar 31, 2010 at 10:01 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Please vote on releasing these packages as Apache Tika 0.7. +1 Thanks! Some minor notes: * It would be good to have also a SHA1 checksum for the release archive. * Perhaps we should start offering also the tika-app jar as a direct download from l.a.o/tika/download.html? The vote is open for the next 72 hours. It looks like people.apache.org is not accessible at the moment (I downloaded the release candidate yesterday), so it might be a good idea to extend the vote period over the Easter holidays. BR, Jukka Zitting
[jira] Created: (NUTCH-809) Parse-metatags plugin
Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-809.patch h2. Parse-metatags plugin *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).* To use the legacy HTML parser, specify in parse-plugins.xml {code:xml} <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> {code} The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter allows the fields above to be included in Nutch queries. This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
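The handling of the metatags.names value described in the issue above (a ';'-separated list with '*' as the default) can be illustrated with a minimal sketch. This is not code from NUTCH-809.patch: the class MetatagNames and its methods parse and wanted are hypothetical names, and the trimming and lower-casing of names are assumptions made only for the illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the NUTCH-809 patch. */
final class MetatagNames {
    private MetatagNames() {}

    /**
     * Splits a metatags.names-style value on ';'.
     * A null or empty value falls back to the '*' wildcard, mirroring the
     * default described in the issue. Trimming and lower-casing are assumptions.
     */
    static Set<String> parse(String value) {
        Set<String> names = new LinkedHashSet<>();
        if (value != null) {
            for (String part : value.split(";")) {
                String name = part.trim().toLowerCase(Locale.ROOT);
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        if (names.isEmpty()) {
            names.add("*");
        }
        return names;
    }

    /** Returns true if the metatag should be extracted under the given configuration. */
    static boolean wanted(Set<String> names, String metatag) {
        return names.contains("*")
                || names.contains(metatag.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> names = parse("description;keywords");
        System.out.println(names);
        System.out.println(wanted(names, "Keywords"));
        System.out.println(wanted(names, "author"));
    }
}
```

With the value description;keywords from the nutch-site.xml example, only those two metatags would be selected; with no value configured, the '*' fallback selects every metatag.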
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: NUTCH-809.patch Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-809.patch h2. Parse-metatags plugin *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).* To use the legacy HTML parser, specify in parse-plugins.xml {code:xml} <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> {code} The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter allows the fields above to be included in Nutch queries. This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
Hi Chris, * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That OK, Nutchers? Great. I'll definitely give 0.7 a try and make sure it works in Nutch. Julien -- DigitalPebble Ltd http://www.digitalpebble.com On 4/1/10 11:41 PM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Wed, Mar 31, 2010 at 10:01 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Please vote on releasing these packages as Apache Tika 0.7. +1 Thanks! Some minor notes: * It would be good to have also a SHA1 checksum for the release archive. * Perhaps we should start offering also the tika-app jar as a direct download from l.a.o/tika/download.html? The vote is open for the next 72 hours. It looks like people.apache.org is not accessible at the moment (I downloaded the release candidate yesterday), so it might be a good idea to extend the vote period over the Easter holidays. BR, Jukka Zitting
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: (was: NUTCH-809.patch) Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Julien Nioche h2. Parse-metatags plugin *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).* To use the legacy HTML parser, specify in parse-plugins.xml {code:xml} <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> {code} The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter allows the fields above to be included in Nutch queries. This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: NUTCH-809.patch Modified version of the plugin which is compatible with parse-tika Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-809.patch h2. Parse-metatags plugin *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).* To use the legacy HTML parser, specify in parse-plugins.xml {code:xml} <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> {code} The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter allows the fields above to be included in Nutch queries. This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Description: h2. Parse-metatags plugin The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter allows the fields above to be included in Nutch queries. This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com was: h2. Parse-metatags plugin *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).* To use the legacy HTML parser, specify in parse-plugins.xml {code:xml} <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> {code} The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter allows the fields above to be included in Nutch queries. This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Reporter: Julien Nioche Assignee: Julien Nioche Attachments: NUTCH-809.patch h2. Parse-metatags plugin The parse-metatags plugin consists of an HTMLParserFilter which takes as a parameter a list of metatag names, with '*' as the default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The MetaTagsQueryFilter allows the fields above to be included in Nutch queries. This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.