[jira] Updated: (NUTCH-162) country code jp is used instead of language code ja for Japanese

2010-05-10 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated NUTCH-162:


Attachment: anchors_ja.properties
cached_ja.properties
explain_ja.properties

We need some Japanese property files so that ja works for the default language 
selection (because of String language = 
ResourceBundle.getBundle("org.nutch.jsp.search", 
request.getLocale()).getLocale().getLanguage(); in search.jsp, for example).

I'll submit those property files.
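
For readers unfamiliar with the two kinds of code, here is a minimal sketch of the difference between a language code and a country code in java.util.Locale. The class and method names (LocaleCodes, languageOf, countryOf) are made up for illustration and are not part of Nutch. Note also that ResourceBundle.getLocale() reports the locale of the bundle that was actually loaded, so without a *_ja.properties file the call quoted above would not be expected to return ja.

```java
import java.util.Locale;

public class LocaleCodes {
    /** Returns the ISO 639 language code of a locale, e.g. "ja" for Japanese. */
    static String languageOf(Locale locale) {
        return locale.getLanguage();
    }

    /** Returns the ISO 3166 country code of a locale, e.g. "JP" for Japan. */
    static String countryOf(Locale locale) {
        return locale.getCountry();
    }

    public static void main(String[] args) {
        // "ja-JP" means: language Japanese (ja), country Japan (JP).
        Locale japan = Locale.forLanguageTag("ja-JP");
        System.out.println("language = " + languageOf(japan));
        System.out.println("country  = " + countryOf(japan));
    }
}
```

Property files are selected by the language part, which is why the attached files carry the _ja suffix rather than _jp.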

 country code jp is used instead of language code ja for Japanese
 

 Key: NUTCH-162
 URL: https://issues.apache.org/jira/browse/NUTCH-162
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 0.7.1
 Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial
 Attachments: anchors_ja.properties, cached_ja.properties, 
 explain_ja.properties


 In the locale switching link for Japanese, jp is used as the language code, but 
 it is an ISO country code.  The language code ja should be used.
 By the way, I don't think many users are familiar with the ISO language 
 codes.  A Canadian user may click on ca, not knowing that ca stands for 
 Catalan, not Canadian English or French. Rather than listing the language 
 codes, listing the language names in the respective languages may be better. 
 (I say "may be" because the browser could show some language names as 
 corrupted text if the current font does not support that language; this is 
 a difficult problem.)
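
 The suggestion above, listing language names rather than codes, could be 
 sketched roughly as follows. This is only an illustration using 
 java.util.Locale; the class and method names (LanguageNames, nativeName, 
 englishName) are hypothetical and nothing here is taken from the Nutch JSPs. 
 The exact native names printed depend on the locale data shipped with the JDK.

```java
import java.util.Locale;

public class LanguageNames {
    /** Name of the language written in that language itself. */
    static String nativeName(String languageCode) {
        Locale locale = Locale.forLanguageTag(languageCode);
        return locale.getDisplayLanguage(locale);
    }

    /** Name of the language in English, usable as a fallback label. */
    static String englishName(String languageCode) {
        return Locale.forLanguageTag(languageCode).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Print one switcher label per language code.
        for (String code : new String[] {"ja", "ca", "fr", "en"}) {
            System.out.println(code + " -> " + nativeName(code)
                    + " (" + englishName(code) + ")");
        }
    }
}
```

 Showing the English name next to the native one is one possible way to keep 
 a label readable when the font cannot render the native name.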

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-162) country code jp is used instead of language code ja for Japanese

2010-05-10 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated NUTCH-162:


Attachment: search_ja.properties
text_ja.properties

Please put these property files in src/web/locale/org/nutch/jsp/ .


 country code jp is used instead of language code ja for Japanese
 

 Key: NUTCH-162
 URL: https://issues.apache.org/jira/browse/NUTCH-162
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 0.7.1
 Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial
 Attachments: anchors_ja.properties, cached_ja.properties, 
 explain_ja.properties, search_ja.properties, text_ja.properties


 In the locale switching link for Japanese, jp is used as the language code, but 
 it is an ISO country code.  The language code ja should be used.
 By the way, I don't think many users are familiar with the ISO language 
 codes.  A Canadian user may click on ca, not knowing that ca stands for 
 Catalan, not Canadian English or French. Rather than listing the language 
 codes, listing the language names in the respective languages may be better. 
 (I say "may be" because the browser could show some language names as 
 corrupted text if the current font does not support that language; this is 
 a difficult problem.)




[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius

2010-05-10 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The RunningNutchAndSolr page has been changed by Dmitrius.
The comment on this change is: Fixed command (single quotes missed).
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=28&rev2=29

--

  = New in Nutch 1.0-dev =
- Please note that in the nightly version of Apache Nutch there is now a Solr 
integration embedded so you can start to use a lot easier. Just download a 
nightly version from [[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/]].
+ Please note that in the nightly version of Apache Nutch there is now a Solr 
integration embedded so you can start to use a lot easier. Just download a 
nightly version from http://hudson.zones.apache.org/hudson/job/Nutch-trunk/.
  
  = Pre Solr Nutch integration =
- This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to [[http://variogram.com||Brian Whitman at 
Variogr.am]] and [[http://blog.foofactory.fi||Sami Siren at FooFactory]] for 
all the help!  You guys saved me a lot of time! :)
+ This is just a quick first pass at a guide for getting Nutch running with 
Solr.  I'm sure there are better ways of doing some/all of it, but I'm not 
aware of them.  By all means, please do correct/update this if someone has a 
better idea.  Many thanks to http://variogram.com and http://blog.foofactory.fi 
for all the help!  You guys saved me a lot of time! :)
  
  I'm posting it under Nutch rather than Solr on the presumption that people 
are more likely to be learning/using Solr first, then come here looking to 
combine it with Nutch.  I'm going to skip over doing command by command for 
right now.  I'm running/building on Ubuntu 7.10 using Java 1.6.0_05.  I'm 
assuming that the Solr trunk code is checked out into solr-trunk and Nutch 
trunk code is checked out into nutch-trunk.
  
@@ -12, +12 @@

   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
- 
  The first step to get started is to download the required software 
components, namely Apache Solr and Nutch.
  
  '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
@@ -23, +22 @@

  
  '''4.''' Extract the Nutch package   tar xzf apache-nutch-1.0.tar.gz
  
+ '''5.''' Configure Solr For the sake of simplicity we are going to use the 
example configuration of Solr as a base.
- '''5.''' Configure Solr
- For the sake of simplicity we are going to use the example
- configuration of Solr as a base.
  
- '''a.''' Copy the provided Nutch schema from directory
- apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf 
(override the existing file)
+ '''a.''' Copy the provided Nutch schema from directory apache-nutch-1.0/conf 
to directory apache-solr-1.3.0/example/solr/conf (override the existing file)
  
  We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
  
@@ -52, +48 @@

  
  <str name="qf">
  
- content^0.5 anchor^1.0 title^1.2
+ content^0.5 anchor^1.0 title^1.2 </str>
- </str>
  
- <str name="pf">
- content^0.5 anchor^1.5 title^1.2 site^1.5
+ <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
- </str>
  
+ <str name="fl"> url </str>
- <str name="fl">
- url
- </str>
  
+ <str name="mm"> 2<-1 5<-2 6<90% </str>
- <str name="mm">
- 2&lt;-1 5&lt;-2 6&lt;90%
- </str>
  
  <int name="ps">100</int>
  
@@ -91, +80 @@

  
  '''6.''' Start Solr
  
+ cd apache-solr-1.3.0/example java -jar start.jar
- cd apache-solr-1.3.0/example
- java -jar start.jar
  
  '''7. Configure Nutch'''
  
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace it’s 
contents with the following (we specify our crawler name, active plugins and 
limit maximum url count for single host per run to be 100) :
  
+ <?xml version="1.0"?> <configuration>
- <?xml version="1.0"?>
- <configuration>
  
  <property>
  
@@ -109, +96 @@

  
  </property>
  
- <property>
- <name>generate.max.per.host</name>
+ <property> <name>generate.max.per.host</name>
  
  <value>100</value>
  
@@ -126, +112 @@

  
  </configuration>
  
- 
  '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,replace 
it’s content with following:
  
  -^(https|telnet|file|ftp|mailto):
+ 
-  
- # skip some suffixes
- 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+ # skip some suffixes 
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-  
+ 
- # skip URLs 
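
The regex-urlfilter.txt rules quoted in the diff above are read top to bottom: each rule is a '+' (accept) or '-' (reject) followed by a regular expression, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is rejected. A minimal sketch of that behaviour follows. It does not use Nutch's own filter class; UrlFilterSketch and its methods are hypothetical, the suffix list is shortened, and the final "+." catch-all is an assumption because the page text is cut off here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    /** One rule: accept ('+') or reject ('-') URLs in which the pattern is found. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    /** Parses rule lines, skipping blank lines and '#' comments. */
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', line.substring(1)));
        }
        return rules;
    }

    /** The first matching rule wins; no match means the URL is rejected. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) return rule.accept;
        }
        return false;
    }

    /** Shortened version of the rules above plus an assumed catch-all. */
    static List<Rule> exampleRules() {
        return parse(
            "-^(https|telnet|file|ftp|mailto):",
            "# skip some suffixes",
            "-\\.(swf|doc|mp3|pdf|gif|jpg|png|css|zip|exe)$",
            "+.");
    }

    public static void main(String[] args) {
        List<Rule> rules = exampleRules();
        System.out.println(accepts(rules, "http://example.com/index.html"));
        System.out.println(accepts(rules, "https://example.com/"));
        System.out.println(accepts(rules, "http://example.com/manual.pdf"));
    }
}
```

Because the first match decides, the reject rules have to appear before any broad accept rule.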

[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius

2010-05-10 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by Dmitrius.
The comment on this change is: It's a problem to make wiki to display grave 
accent. Managed to do that using html codes.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=29&rev2=30

--

  
  The above command will generate a new segment directory under crawl/segments 
that at this point contains files that store the url(s) to be fetched. In the 
following commands we need the latest segment dir as parameter so we’ll store 
it in an environment variable:
  
- export SEGMENT=crawl/segments/``ls -tr crawl/segments|tail -1``
+ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
  
  Now I launch the fetcher that actually goes to get the content:
  


[Nutch Wiki] Update of RunningNutchAndSolr by Dmitrius

2010-05-10 Thread Apache Wiki

The RunningNutchAndSolr page has been changed by Dmitrius.
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=30&rev2=31

--

  
  The above command will generate a new segment directory under crawl/segments 
that at this point contains files that store the url(s) to be fetched. In the 
following commands we need the latest segment dir as parameter so we’ll store 
it in an environment variable:
  
- export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
+ export SEGMENT=crawl/segments/&#96;ls -tr crawl/segments|tail -1&#96;
  
  Now I launch the fetcher that actually goes to get the content:
  


[jira] Work started: (NUTCH-816) Add zip target to build.xml

2010-05-08 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-816 started by Chris A. Mattmann.

 Add zip target to build.xml
 ---

 Key: NUTCH-816
 URL: https://issues.apache.org/jira/browse/NUTCH-816
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.0.0
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.1


 Just like we have an ant tar target (pun intended) we should have an ant zip 
 target. I'd like to have this ready for the release and future releases.




[jira] Resolved: (NUTCH-816) Add zip target to build.xml

2010-05-08 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-816.
-

Resolution: Fixed

- fixed in r942427

 Add zip target to build.xml
 ---

 Key: NUTCH-816
 URL: https://issues.apache.org/jira/browse/NUTCH-816
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.0.0
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.1


 Just like we have an ant tar target (pun intended) we should have an ant zip 
 target. I'd like to have this ready for the release and future releases.




[VOTE] Apache Nutch 1.1 Release Candidate #3

2010-05-08 Thread Mattmann, Chris A (388J)
Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The
source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc3/

The major differences between this release and rc #2 are the application of:
NUTCH-816, NUTCH-732, NUTCH-815, NUTCH-814, and NUTCH-812 based on feedback
from prior release candidates.

For more detailed information, see the included CHANGES.txt file for details
on release contents and latest changes. The release was made using the Nutch
release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

<note>
In response to several user requests during the last RC cycle, I've also
included *binary* releases (labeled as apache-nutch-1.1-bin.tar.gz and
apache-nutch-1.1-bin.zip). This addresses Sami Siren's request that the
tutorial be updated to reflect the fact that this release is a source-only
release.

Sami also requested to integrate RAT into the build, however, in the
interest of getting this 1.1 out and getting going on the Nutch TLP, my
proposal is:

* run RAT and integrate into the build on releases post 1.1

</note>

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours.

Only votes from Nutch PMC are binding, but folks are welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




[jira] Commented: (NUTCH-811) Develop an ORM framework

2010-05-07 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865226#action_12865226
 ] 

Enis Soztutar commented on NUTCH-811:
-

Hi Piet,
The code for Gora will reside in GitHub for now, since Nutch and Gora are 
pretty orthogonal. But as stated before, Nutch is the first user of Gora, and 
Gora does not yet have a separate community so I intend to always keep nutch 
community updated (via this issue and nutch-dev mailing list), and hope for 
feedback from the Nutch community.

Moreover, NutchBase has already been ported to using Gora, so at some point, 
Gora should be reviewed and accepted as a dependency for Nutch.

 Develop an ORM framework 
 -

 Key: NUTCH-811
 URL: https://issues.apache.org/jira/browse/NUTCH-811
 Project: Nutch
  Issue Type: New Feature
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 By Nutch-808, it is clear that we need an ORM layer on top of the datastore, 
 so that different backends can be used to store data. 
 This issue will track the development of the ORM layer. Initially full 
 support for HBase is planned, with RDBM, Hadoop MapFile and Cassandra support 
 scheduled for later. 




[jira] Commented: (NUTCH-811) Develop an ORM framework

2010-05-06 Thread Piet Schrijver (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864744#action_12864744
 ] 

Piet Schrijver commented on NUTCH-811:
--

Will development for gora be tracked under this or any nutch ticket?

 Develop an ORM framework 
 -

 Key: NUTCH-811
 URL: https://issues.apache.org/jira/browse/NUTCH-811
 Project: Nutch
  Issue Type: New Feature
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 By Nutch-808, it is clear that we need an ORM layer on top of the datastore, 
 so that different backends can be used to store data. 
 This issue will track the development of the ORM layer. Initially full 
 support for HBase is planned, with RDBM, Hadoop MapFile and Cassandra support 
 scheduled for later. 




[jira] Assigned: (NUTCH-817) parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1

2010-05-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-817:
---

Assignee: Julien Nioche

 parse-(html)does follow links of full html page, parse-(tika) does follow any 
 links and stops at level 1
 

 Key: NUTCH-817
 URL: https://issues.apache.org/jira/browse/NUTCH-817
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Suse linux 11.1, java version 1.6.0_13
Reporter: matthew a. grisius
Assignee: Julien Nioche
 Attachments: sample-javadoc.html


 submitted per Julien Nioche. I did not see where to attach a file so I pasted 
 it here. btw: Tika command line returns empty html body for this file.
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" 
 "http://www.w3.org/TR/html4/frameset.dtd">
 <!--NewPage-->
 <HTML>
 <HEAD>
 <!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
 <TITLE>
 Matrix Application Development Kit
 </TITLE>
 <SCRIPT type="text/javascript">
 targetPage = "" + window.location.search;
 if (targetPage != "" && targetPage != "undefined")
targetPage = targetPage.substring(1);
 function loadFrames() {
 if (targetPage != "" && targetPage != "undefined")
  top.classFrame.location = top.targetPage;
 }
 </SCRIPT>
 <NOSCRIPT>
 </NOSCRIPT>
 </HEAD>
 <FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
 <FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
 <FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
 <FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and 
 interfaces (except non-static nested types)">
 </FRAMESET>
 <FRAME src="overview-summary.html" name="classFrame" title="Package, class 
 and interface descriptions" scrolling="yes">
 <NOFRAMES>
 <H2>
 Frame Alert</H2>
 <P>
 This document is designed to be viewed using the frames feature. If you see 
 this message, you are using a non-frame-capable web client.
 <BR>
 Link to<A HREF="overview-summary.html">Non-frame version.</A>
 </NOFRAMES>
 </FRAMESET>
 </HTML>




[jira] Updated: (NUTCH-814) SegmentMerger bug

2010-04-27 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-814:


Attachment: merger.patch

Patch fixing the issue, and a unit test. I will commit this shortly.

 SegmentMerger bug
 -

 Key: NUTCH-814
 URL: https://issues.apache.org/jira/browse/NUTCH-814
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Dennis Kubes
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: merger.patch


 Dennis reported:
 {quote}
 In the SegmentMerger.java file about line 150 we have this:
final SequenceFile.Reader reader =
  new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(),
 job);
 Then about line 166 in the record reader we have this:
 boolean res = reader.next(key, w);
 If I am reading that right, that would mean that the map task would loop
 over all records for a given file and not just a given split.
 {quote}
 Right, this should instead use SequenceFileRecordReader that already has the 
 logic to handle splits. Patch coming shortly - thanks for spotting this! This 
 could be the reason for out of disk space errors that many users reported.
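
 To make the effect concrete, here is a simplified model of the problem 
 described above. It does not use the Hadoop API: Record, Split and the 
 methods below are hypothetical stand-ins, and real SequenceFile readers 
 align to sync markers rather than to exact record offsets. The sketch only 
 shows that a reader which ignores its split boundaries makes every map task 
 process the whole file.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitReadSketch {
    /** A record plus the byte offset at which it starts in the file. */
    static final class Record {
        final long offset;
        final String key;
        Record(long offset, String key) { this.offset = offset; this.key = key; }
    }

    /** A byte range [start, start + length) of one file (simplified split). */
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
        long end() { return start + length; }
    }

    /** Models the reported behaviour: the split is ignored, all records are read. */
    static List<Record> readIgnoringSplit(List<Record> file, Split split) {
        return new ArrayList<>(file);
    }

    /** Split-aware read: only records that start inside the split are returned. */
    static List<Record> readWithinSplit(List<Record> file, Split split) {
        List<Record> out = new ArrayList<>();
        for (Record r : file) {
            if (r.offset >= split.start && r.offset < split.end()) out.add(r);
        }
        return out;
    }

    /** Total records handled by all map tasks, with one task per split. */
    static int totalProcessed(List<Record> file, List<Split> splits, boolean splitAware) {
        int total = 0;
        for (Split s : splits) {
            List<Record> seen = splitAware ? readWithinSplit(file, s) : readIgnoringSplit(file, s);
            total += seen.size();
        }
        return total;
    }

    /** Six records of 10 bytes each, at offsets 0, 10, ..., 50. */
    static List<Record> exampleFile() {
        List<Record> file = new ArrayList<>();
        for (int i = 0; i < 6; i++) file.add(new Record(i * 10L, "key" + i));
        return file;
    }

    /** Three 20-byte splits that together cover the 60-byte file. */
    static List<Split> exampleSplits() {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < 3; i++) splits.add(new Split(i * 20L, 20L));
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(totalProcessed(exampleFile(), exampleSplits(), false)); // 18
        System.out.println(totalProcessed(exampleFile(), exampleSplits(), true));  // 6
    }
}
```

 With three splits the split-unaware reader handles 18 records instead of 6, 
 so in this model the duplicated work grows with the number of splits per file.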




[jira] Work stopped: (NUTCH-466) Flexible segment format

2010-04-27 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-466 stopped by Andrzej Bialecki .

 Flexible segment format
 ---

 Key: NUTCH-466
 URL: https://issues.apache.org/jira/browse/NUTCH-466
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: ParseFilters.java, segmentparts.patch


 In many situations it is necessary to store more data associated with pages 
 than it's possible now with the current segment format. Quite often it's a 
 binary data. There are two common workarounds for this: one is to use 
 per-page metadata, either in Content or ParseData, the other is to use an 
 external independent database using page ID-s as foreign keys.
 Currently segments can consist of the following predefined parts: content, 
 crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
 propose a third option, which is a natural extension of this existing segment 
 format, i.e. to introduce the ability to add arbitrarily named segment 
 parts, with the only requirement that they should be MapFile-s that store 
 Writable keys and values. Alternatively, we could define a 
 SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
 Existing segment API and searcher API (NutchBean, DistributedSearch 
 Client/Server) should be extended to handle such arbitrary parts.
 Example applications:
 * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
 documents
 * storing pre-tokenized version of plain text for faster snippet generation
 * storing linguistically tagged text for sophisticated data mining
 * storing image thumbnails
 etc, etc ...
 I'm going to prepare a patchset shortly. Any comments and suggestions are 
 welcome.




[jira] Created: (NUTCH-816) Add zip target to build.xml

2010-04-27 Thread Chris A. Mattmann (JIRA)
Add zip target to build.xml
---

 Key: NUTCH-816
 URL: https://issues.apache.org/jira/browse/NUTCH-816
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.0.0
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.1


Just like we have an ant tar target (pun intended) we should have an ant zip 
target. I'd like to have this ready for the release and future releases.




Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Grant Ingersoll
Might I suggest, that since Nutch is now a TLP that you delay this release by a 
few weeks and have the vote done under the auspices of the Nutch PMC?

Cheers,
Grant

On Apr 26, 2010, at 1:55 AM, Mattmann, Chris A (388J) wrote:

 Hi Folks,
 
 I have posted an updated candidate for the Apache Nutch 1.1 release. The
 source code is at:
 
 http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/
 
 The major difference between this release and rc #1 is the application of
 NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
 as well as some commits by Sami Siren to fix missing ASL license headers.
 
 For more detailed information, see the included CHANGES.txt file for details
 on release contents and latest changes. The release was made using the Nutch
 release process, documented on the Wiki here:
 
 http://bit.ly/d5ugid
 
 A Nutch 1.1 tag is at:
 
 http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/
 
 <note>
 There was a request by Sami Siren that the tutorial be updated to reflect
 the fact that this release is a source-only release, as well as a request to
 integrate RAT into the build, however, in the interest of getting this 1.1
 out and getting going on the Nutch TLP, my proposal is:
 
 * update the docs independent of this release (the tutorial as it exists
 right now says 0.7 on it anyways and doesn't look like it's been updated in
 a while, so I think users can live with what's there and support on
 u...@nutch.apache.org or d...@nutch.apache.org until it's updated)
 
 * begin source only releases in general since we've long had the debate as
 to the size of the Nutch release. Most folks that use Nutch are likely
 familiar with running ant IMHO.
 
 * run RAT and integrate into the build
 
 </note>
 
 Please vote on releasing these packages as Apache Nutch 1.1. The vote is
 open for the next 72 hours.
 
 Since Nutch is now a TLP and has its own PMC, there is a question of who are
 the binding release VOTES in this particular thread. My gut reaction is that
 since I started this release while we were under the Lucene PMC, for
 continuity purposes, only votes from Lucene PMC are binding, but everyone
 (especially newly minted Nutch PMC members!) are  welcome to check the
 release candidate and voice their approval or disapproval. The vote passes
 if at least three binding +1 votes are cast.
 
 [ ] +1 Release the packages as Apache Nutch 1.1.
 
 [ ] -1 Do not release the packages because...
 
 Thanks!
 
 Cheers,
 Chris
 
 P.S. Here is my +1.
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
 
 
 




Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hi David,

Thanks. In fact, running ant is probably simpler than running Nutch. The steps 
would be:


 *   what OS are you on (Ant is available for all of them to my knowledge)?
 *   if you need ant, grab a distro from ant.apache.org, otherwise, I'll assume 
that you've got ant installed and callable from the command line.
 *   unpack the nutch src distribution, cd into that directory, type ant job, 
and there you go.

HTH! You could try it out by taking the Nutch src code from SVN at: 
http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1, and then trying the 
steps above.

Cheers,
Chris


On 4/26/10 7:24 AM, David M. Cole d...@colegroup.com wrote:

At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
Most folks that use Nutch are likely
familiar with running ant IMHO.

I guess then I fall into the category of not most folks. Have been
running Nutch for about 14 months and I haven't a clue how to run ant.

If there's a place to vote to suggest that compiled versions still be
distributed, I vote for that.

Thanks.

\dmc

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole                                      d...@colegroup.com
Editor & Publisher, NewsInc. http://newsinc.net    V: (650) 557-2993
Consultant: The Cole Group http://colegroup.com/   F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+






Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hi Grant,

Thanks. I think it actually makes sense to finish off 1.1, and since there is 
overlap with the Nutch PMC and the Lucene PMC and since the thread started in 
Lucene before the TLP, I think it would be great e.g., if Andrzej, and Sami 
could check the release and that way we still have the continuity and can 
safely push it out as the last Nutch rel under the Lucene umbrella...

Then all releases post 1.1 can cleanly be done under the auspices of the new 
PMC :)

Cheers,
Chris


On 4/26/10 5:34 AM, Grant Ingersoll gsing...@apache.org wrote:

Might I suggest, that since Nutch is now a TLP that you delay this release by a 
few weeks and have the vote done under the auspices of the Nutch PMC?

Cheers,
Grant












[jira] Closed: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-26 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar closed NUTCH-808.
---

Resolution: Fixed

We have decided to go on with implementing an ORM layer as per the discussion 
on NUTCH-811. Closing this issue. 

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.




Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hey Andrzej,

Okey dokey, np! Let's get the patch in first :) I can cut as many RCs as needed.

Cheers,
Chris

On 4/26/10 11:30 AM, Andrzej Bialecki a...@getopt.org wrote:

On 2010-04-26 17:19, Mattmann, Chris A (388J) wrote:
 Hi Grant,

 Thanks. I think it actually makes sense to finish off 1.1, and since there is 
 overlap with the Nutch PMC and the Lucene PMC and since the thread started in 
 Lucene before the TLP, I think it would be great e.g., if Andrzej, and Sami 
 could check the release and that way we still have the continuity and can 
 safely push it out as the last Nutch rel under the Lucene umbrella...

 Then all releases post 1.1 can cleanly be done under the auspices of the new 
 PMC :)

I know that Dennis Kubes just discovered a bug in SegmentMerger (he may
report on it in a moment) - this bug has been there for a while, it's
likely the cause of the mysterious out of disk space errors, and it
manifests itself only with input files larger than HDFS block size
(64MB). Since 1.1 is likely the final release of Nutch 1.x I think it
would make sense to fix this bug before we release ...

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-25 Thread Mattmann, Chris A (388J)
Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The
source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/

The major difference between this release and rc #1 is the application of
NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
as well as some commits by Sami Siren to fix missing ASL license headers.

For more detailed information on the release contents and the latest changes,
see the included CHANGES.txt file. The release was made using the Nutch
release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

<note>
Sami Siren requested that the tutorial be updated to reflect the fact that
this release is a source-only release, and also that RAT be integrated into
the build. However, in the interest of getting this 1.1 out and getting going
on the Nutch TLP, my proposal is:

* update the docs independent of this release (the tutorial as it exists
right now says 0.7 on it anyway and doesn't look like it's been updated in
a while, so I think users can live with what's there and support on
u...@nutch.apache.org or d...@nutch.apache.org until it's updated)

* begin source-only releases in general, since we've long had the debate as
to the size of the Nutch release. Most folks that use Nutch are likely
familiar with running ant, IMHO.

* run RAT and integrate it into the build

</note>

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours.

Since Nutch is now a TLP and has its own PMC, there is a question of whose
release votes are binding in this particular thread. My gut reaction is that
since I started this release while we were under the Lucene PMC, for
continuity purposes, only votes from Lucene PMC members are binding, but
everyone (especially newly minted Nutch PMC members!) is welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





[jira] Commented: (NUTCH-710) Support for rel=canonical attribute

2010-04-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286
 ] 

Julien Nioche commented on NUTCH-710:
-

As suggested previously, we could handle canonicals either as redirections or 
during deduplication. Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical 
is not available for indexing. We also want to follow the outlinks. 
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more 
complex because we need to follow redirections.

We probably need a third approach: prefilter by going through the crawldb and 
detecting URLs which have a canonical target already indexed or ready to be 
indexed. We need to follow up to X levels of redirection, e.g. doc A is marked 
with doc B as its canonical representation, doc B redirects to doc C, etc. If 
the end of the redirection chain exists and is valid, then mark A as a 
duplicate of C (intermediate redirs will not get indexed anyway).

As we don't know whether such a document has already been indexed, we would 
give it a special marker (e.g. status_duplicate) in the crawlDB. Then:
- if the indexer comes across such an entry: skip it
- make it so that *DeleteDuplicates can take a list of URLs with 
status_duplicate as an additional source of input, OR have a custom resource 
that deletes such entries in SOLR or Lucene indices

The implementation would be as follows:

Go through all redirections and generate all redirection chains, e.g.

A -> B
B -> C
D -> C

where C is an indexable document (i.e. it has been fetched and parsed; it may 
already have been indexed)

will yield

A -> C
B -> C
D -> C

but also

C -> C

Once we have all possible redirections: go through the crawlDB in search of 
canonicals. If the target of a canonical is the source of a valid alias (e.g. A 
-> B -> C -> D), mark it as 'status:duplicate'.
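
The chain resolution described above can be sketched in plain Java. This is only an illustration under assumptions, not Nutch code: the class RedirectChainSketch, the methods resolve and findDuplicates, the maxHops limit (the "X levels" mentioned earlier) and the in-memory maps standing in for the crawlDB are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Nutch. */
final class RedirectChainSketch {

    /**
     * Follows at most maxHops redirects starting from start. Returns the
     * final URL if it is indexable, or null for a broken chain, a redirect
     * loop, or a chain longer than maxHops.
     */
    static String resolve(String start, Map<String, String> redirects,
                          Set<String> indexable, int maxHops) {
        Set<String> seen = new HashSet<>();
        String current = start;
        for (int hops = 0; hops <= maxHops; hops++) {
            if (!seen.add(current)) {
                return null; // redirect loop
            }
            String next = redirects.get(current);
            if (next == null) {
                // End of chain: valid only if the target was fetched and parsed.
                return indexable.contains(current) ? current : null;
            }
            current = next;
        }
        return null; // chain longer than maxHops
    }

    /** Maps each page to the indexable URL its canonical target resolves to. */
    static Map<String, String> findDuplicates(Map<String, String> canonicals,
                                              Map<String, String> redirects,
                                              Set<String> indexable, int maxHops) {
        Map<String, String> duplicates = new HashMap<>();
        for (Map.Entry<String, String> e : canonicals.entrySet()) {
            String target = resolve(e.getValue(), redirects, indexable, maxHops);
            if (target != null && !target.equals(e.getKey())) {
                duplicates.put(e.getKey(), target); // would get status_duplicate
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("A", "B");
        redirects.put("B", "C");
        redirects.put("D", "C");
        Set<String> indexable = new HashSet<>();
        indexable.add("C");
        Map<String, String> canonicals = new HashMap<>();
        canonicals.put("X", "A"); // page X names A as its canonical
        System.out.println(findDuplicates(canonicals, redirects, indexable, 5));
    }
}
```

In this sketch a page is only marked when the chain ends at a document that has been fetched and parsed, which matches the "exists and is valid" condition; loops and over-long chains are simply left unmarked.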

This design implies generating quite a few intermediate structures, scanning 
the whole crawlDB twice (once for the aliases, then for the canonicals), and 
rewriting the whole crawlDB to mark some of the entries as duplicates.

This would be much easier to do once we have Nutch2/HBase: we could simply 
follow the redirs from the initial URL having a canonical tag instead of 
generating these intermediate structures. We could then modify the entries one 
by one instead of regenerating the whole crawlDB.

WDYT?



 Support for rel=canonical attribute
 -

 Key: NUTCH-710
 URL: https://issues.apache.org/jira/browse/NUTCH-710
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.1
Reporter: Frank McCown
Priority: Minor

 There is a new rel=canonical attribute which is
 now being supported by Google, Yahoo, and Live:
 http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
 Adding support for this attribute value will potentially reduce the number of 
 URLs crawled and indexed and reduce duplicate page content.




TLP Status

2010-04-21 Thread Grant Ingersoll
The Board has approved Mahout, Tika, and Nutch moving to top-level project (TLP) status.  
Congrats!  Now begins the fun part of changing mailing lists, domains, etc.  

-Grant

Re: Developing Nutch for semantic search

2010-04-20 Thread borislav popov

Hi Adarsh,
	we built a search engine for a limited Web space: ~100M pages. Our  
background is in semantic search, but first we needed to address all  
the general crawl & search issues, as in a traditional search engine.  
They are in no way less work than introducing some semantics. So I'd  
suggest you start with being able to crawl, index and search your  
content, then go on to extend the functionality.

borislav

On Apr 17, 2010, at 6:59 PM, Adarsh malu wrote:


Hello,
 I am running Nutch 0.9 .
 Our aim is to build a semantic search engine (for agriculture)  
using Nutch.

 However, I do not know where to start.
How should I proceed?

Adarsh





[jira] Commented: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implme

2010-04-20 Thread Ilguiz Latypov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859116#action_12859116
 ] 

Ilguiz Latypov commented on NUTCH-427:
--

I hesitate to add the .zip file because (a) it hides the intention of the 
change and (b) other developers who might have already modified their copies 
would have difficulty merging my change.

I believe the GNU patch tool will apply my suggested change automatically, 
provided that one runs it from the right working directory and, if needed, 
passes the -pX option, where X is the number of leading directory components 
to strip from the file names in the patch.


 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implmentation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Attachments: protocol-smb-diff.txt, protocol-smb.zip, 
 protocol-smb.zip, protocol-smb.zip


 Title:    protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares
 Author:   Armel T. Nene
 Update:   Vadim Bauer
 Email:    armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r AT g m x . d e

 A. Introduction
 The protocol-smb plugin allows you to crawl Microsoft Windows shares. It
 implements the CIFS/SMB protocol, which is commonly used on Microsoft
 operating systems. The plugin replicates the behaviour of protocol-file over
 the CIFS/SMB protocol. It uses the JCifs library and supports all the
 properties of that library. You can find more information on the following
 site: http://jcifs.samba.org/
 The smb protocol syntax for crawling is as follows: smb://x (i.e.
 smb://server/share).

 B. Installation
 1) Binaries only: The protocol-smb files can be found in the ../plugins
    directory.
    - Copy protocol-smb to the NUTCHHOME/build/plugins directory.
    - Put the smb.properties file in the NUTCHHOME/conf directory.
    - Configure the properties in the smb.properties file.
    - Enable the plugin by updating the nutch-site.xml file found in the
      NUTCHHOME/conf directory, e.g.
        <property>
          <name>plugin.includes</name>
          <value>protocol-smb| other plugins...</value>
          <description>
          </description>
        </property>
 2) Source code: The protocol-smb sources can be found in the ../src
    directory. Always refer to the Nutch wiki for detailed instructions on
    building Nutch. In short:
    - Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin.
    - Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
    - Update the NUTCHHOME/default.properties file to include the plugin.
    - Run ant to build.
    - Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the
      properties.
    - Enable the plugin by updating the nutch-site.xml file.

 C. Known Issues
 1) MalformedURLException: unknown protocol: smb
    The SMB URL protocol handler is not being successfully installed.
    In short, the jCIFS jar must be loaded by the System class loader.
    Workaround:
    a) A short-term solution is to install the JCIFS jar library found in the
       protocol-smb folder in JDKHOME/jre/lib/ext and (or) JREHOME/lib/ext.
    b) If the exception is still thrown after completing step a), set the
       system property by passing the following argument to the JVM:
         -Djava.protocol.handler.pkgs=jcifs
    c) You can also set the property in your code, for example if you start
       crawling with org.apache.nutch.crawl.Crawl. Add the following two
       lines; this has the same effect as b):
         public static void main(String args[]) throws Exception {
           System.setProperty("java.protocol.handler.pkgs", "jcifs");
           new

[jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implment

2010-04-20 Thread Ilguiz Latypov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilguiz Latypov updated NUTCH-427:
-

Attachment: (was: protocol-smb.zip)

 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implmentation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Attachments: protocol-smb-diff.txt, protocol-smb.zip, protocol-smb.zip


 Title:    protocol-smb - Nutch protocol plugin for crawling Microsoft Windows shares
 Author:   Armel T. Nene
 Update:   Vadim Bauer
 Email:    armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r AT g m x . d e

 A. Introduction
 The protocol-smb plugin allows you to crawl Microsoft Windows shares. It
 implements the CIFS/SMB protocol, which is commonly used on Microsoft
 operating systems. The plugin replicates the behaviour of protocol-file over
 the CIFS/SMB protocol. It uses the JCifs library and supports all the
 properties of that library. You can find more information on the following
 site: http://jcifs.samba.org/
 The smb protocol syntax for crawling is as follows: smb://x (i.e.
 smb://server/share).

 B. Installation
 1) Binaries only: The protocol-smb files can be found in the ../plugins
    directory.
    - Copy protocol-smb to the NUTCHHOME/build/plugins directory.
    - Put the smb.properties file in the NUTCHHOME/conf directory.
    - Configure the properties in the smb.properties file.
    - Enable the plugin by updating the nutch-site.xml file found in the
      NUTCHHOME/conf directory, e.g.
        <property>
          <name>plugin.includes</name>
          <value>protocol-smb| other plugins...</value>
          <description>
          </description>
        </property>
 2) Source code: The protocol-smb sources can be found in the ../src
    directory. Always refer to the Nutch wiki for detailed instructions on
    building Nutch. In short:
    - Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin.
    - Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
    - Update the NUTCHHOME/default.properties file to include the plugin.
    - Run ant to build.
    - Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the
      properties.
    - Enable the plugin by updating the nutch-site.xml file.

 C. Known Issues
 1) MalformedURLException: unknown protocol: smb
    The SMB URL protocol handler is not being successfully installed.
    In short, the jCIFS jar must be loaded by the System class loader.
    Workaround:
    a) A short-term solution is to install the JCIFS jar library found in the
       protocol-smb folder in JDKHOME/jre/lib/ext and (or) JREHOME/lib/ext.
    b) If the exception is still thrown after completing step a), set the
       system property by passing the following argument to the JVM:
         -Djava.protocol.handler.pkgs=jcifs
    c) You can also set the property in your code, for example if you start
       crawling with org.apache.nutch.crawl.Crawl. Add the following two
       lines; this has the same effect as b):
         public static void main(String args[]) throws Exception {
           System.setProperty("java.protocol.handler.pkgs", "jcifs");
           new java.util.PropertyPermission("java.protocol.handler.pkgs", "read,write");
           //and so on
    Also you can visit the FAQ page:
    http://jcifs.samba.org/src/docs/faq.html
 2) FATAL smb.SMB - Could not read content of protocol: smb://xx
    This problem usually occurs if the following properties are not set
    correctly in the smb.properties file:
    - username
    - password
    - domain
    Also refer to the following resources for
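
For Known Issue 1, workaround c) amounts to adding jcifs to the java.protocol.handler.pkgs system property before the first smb:// URL is created. A minimal sketch follows, assuming the property may already list other handler packages that should be kept. The class HandlerPkgsSketch and its methods are hypothetical; only the property name and the '|' separator used for this list come from the JDK.

```java
/** Hypothetical helper for illustration; not part of Nutch or jCIFS. */
final class HandlerPkgsSketch {

    static final String KEY = "java.protocol.handler.pkgs";

    /** Returns the '|'-separated package list with pkg appended unless already present. */
    static String withPackage(String existing, String pkg) {
        if (existing == null || existing.trim().isEmpty()) {
            return pkg;
        }
        for (String p : existing.split("\\|")) {
            if (p.trim().equals(pkg)) {
                return existing; // already registered, keep the list unchanged
            }
        }
        return existing + "|" + pkg;
    }

    /** Sets the system property; call this before any smb:// URL is constructed. */
    static void register(String pkg) {
        System.setProperty(KEY, withPackage(System.getProperty(KEY), pkg));
    }

    public static void main(String[] args) {
        register("jcifs");
        System.out.println(KEY + "=" + System.getProperty(KEY));
    }
}
```

Appending instead of overwriting is a design choice of this sketch: the System.setProperty call in workaround c) replaces whatever value was configured before, which may drop handler packages registered by other code.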

[jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implment

2010-04-20 Thread Ilguiz Latypov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilguiz Latypov updated NUTCH-427:
-

Attachment: protocol-smb-dist.zip

Applied my diff to simplify importing into the Subversion tree.  The build 
directory should not be imported, and the src/plugin/build.xml file should only 
add the new protocol-smb deploy and clean targets.

The previous author did not grant the license to ASF.


 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implmentation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Attachments: protocol-smb-diff.txt, protocol-smb-dist.zip, 
 protocol-smb.zip, protocol-smb.zip



Re: Developing Nutch for semantic search

2010-04-19 Thread MilleBii
Need a few more details... Besides, why don't you use the 1.0 release?
1.1 is not far from release either.

2010/4/17, Adarsh malu adarsh_th...@yahoo.co.in:
 Hello,
  I am running Nutch 0.9 .
  Our aim is to build a semantic search engine (for agriculture) using
 Nutch.
  However, I do not know where to start.
  How should I proceed?

 Adarsh




-- 
-MilleBii-


[jira] Work started: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-812 started by Chris A. Mattmann.

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Chris A. Mattmann
Priority: Critical

 As reported by Phil Barnett on nutch-user:
 {quote}
 The fix:
 In line 131 of Crawl.java, generate no longer returns segments like it used
 to. It now returns segs.
 Line 131 needs to read
  if (segs == null)
 instead of the current
  if (segments == null)
 After that change and a recompile, crawl works just fine.
 {quote}
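
The fix quoted above comes down to testing the value returned by the generate step rather than the segments directory. A minimal, self-contained sketch follows, assuming (this is not stated in the report) that `segments` names the segments directory and is therefore never null. The generate and crawl methods below are hypothetical stand-ins, not the real Generator API or Crawl.java.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the crawl loop; not the real Crawl.java. */
final class GenerateLoopSketch {

    /**
     * Hypothetical generate step: returns the names of newly generated
     * segments, or null once there is nothing left to fetch.
     */
    static String[] generate(int depth, int productiveRounds) {
        return depth < productiveRounds ? new String[] {"segment-" + depth} : null;
    }

    /** Runs up to maxDepth rounds and returns the segment paths that were produced. */
    static List<String> crawl(String segments, int maxDepth, int productiveRounds) {
        List<String> produced = new ArrayList<>();
        for (int depth = 0; depth < maxDepth; depth++) {
            String[] segs = generate(depth, productiveRounds);
            // Test the return value (segs). Testing the directory (segments)
            // would never stop the loop, and segs[0] below would throw an NPE.
            if (segs == null) {
                break;
            }
            produced.add(segments + "/" + segs[0]);
        }
        return produced;
    }

    public static void main(String[] args) {
        // Two productive rounds, then generate returns null and the loop stops.
        System.out.println(crawl("crawl/segments", 5, 2));
    }
}
```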




[jira] Assigned: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-812:
---

Assignee: Chris A. Mattmann

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Chris A. Mattmann
Priority: Critical

 As reported by Phil Barnett on nutch-user:
 {quote}
 The fix:
 In line 131 of Crawl.java, generate no longer returns segments like it used
 to. It now returns segs.
 Line 131 needs to read
  if (segs == null)
 instead of the current
  if (segments == null)
 After that change and a recompile, crawl works just fine.
 {quote}




[jira] Resolved: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-812.
-

Fix Version/s: 1.1
   Resolution: Fixed

- fixed in r935453. Thanks, Phil and Andrzej!

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Chris A. Mattmann
Priority: Critical
 Fix For: 1.1


 As reported by Phil Barnett on nutch-user:
 {quote}
 The fix:
 In line 131 of Crawl.java, generate no longer returns segments like it used
 to. It now returns segs.
 Line 131 needs to read
  if (segs == null)
 instead of the current
  if (segments == null)
 After that change and a recompile, crawl works just fine.
 {quote}




Developing Nutch for semantic search

2010-04-17 Thread Adarsh malu
Hello,
 I am running Nutch 0.9 .
 Our aim is to build a semantic search engine (for agriculture) using Nutch.
 However, I do not know where to start.
 How should I proceed?

Adarsh



[jira] Updated: (NUTCH-813) Repetitive crawl 403 status page

2010-04-17 Thread Nguyen Manh Tien (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Manh Tien updated NUTCH-813:
---

Attachment: Patch

 Repetitive crawl 403 status page
 

 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
 Attachments: Patch


 When we crawl a page that returns a 403 status, it is crawled again every 
 day under the default schedule, 
 even when we restrict retries with the parameter db.fetch.retry.max.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-813) Repetitive crawl 403 status page

2010-04-17 Thread Nguyen Manh Tien (JIRA)
Repetitive crawl 403 status page


 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
 Attachments: Patch

When we crawl a page that returns a 403 status, it is crawled again every 
day under the default schedule, 
even when we restrict retries with the parameter db.fetch.retry.max.






[jira] Updated: (NUTCH-813) Repetitive crawl 403 status page

2010-04-17 Thread Nguyen Manh Tien (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Manh Tien updated NUTCH-813:
---

Priority: Minor  (was: Major)

 Repetitive crawl 403 status page
 

 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
Priority: Minor
 Attachments: Patch


 When we crawl a page that returns a 403 status, it is crawled again every 
 day under the default schedule, 
 even when we restrict retries with the parameter db.fetch.retry.max.





[jira] Updated: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-812:


Affects Version/s: 1.1
 Priority: Critical  (was: Major)

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Priority: Critical

 As reported by Phil Barnett on nutch-user:
 {quote}
 The fix:
 In line 131 of Crawl.java, generate no longer returns segments like it used
 to. It now returns segs.
 Line 131 needs to read
  if (segs == null)
 instead of the current
  if (segments == null)
 After that change and a recompile, crawl works just fine.
 {quote}





Re: [VOTE 2] Board resolution for Nutch as TLP

2010-04-13 Thread Sami Siren

On 04/12/2010 02:08 PM, Andrzej Bialecki wrote:

Hi,

Take two, after s/crawling/search/ ...

Following the discussion, below is the text of the proposed Board
Resolution to vote upon.



[X] +1.  Request the Board make Nutch a TLP

--
 Sami Siren


[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856349#action_12856349
 ] 

Julien Nioche commented on NUTCH-808:
-

Hi Enis,

{quote}
On the other hand, current implementation is ...
{quote}

What do you mean by current implementation? NutchBase?

My gut feeling would be to write a custom framework instead of relying on 
DataNucleus, and to use Avro if possible. I really think that HBase support is 
urgently needed, but I am less convinced that we need MySQL in the very short 
term. 

I know that Cascading has various Tap/Sink implementations, including JDBC, 
HBase and also SimpleDB. Maybe it would be worth having a look at how they do 
it?

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856360#action_12856360
 ] 

Enis Soztutar commented on NUTCH-808:
-

bq. What do you mean by current implementation? NutchBase?
Indeed. The o.a.n.storage package deals with ORM (though not all of its classes do).

bq. I know that Cascading have various Tap/Sink implementations including 
JDBC, HBase but also SimpleDB. Maybe it would be worth having a look at how 
they do it?
Cascading does this by converting Tuples (the Cascading data structure) to 
HBase/JDBC records. The schema for HBase/JDBC is given as metadata. Since it 
only maps a tuple to a table row, it is not that difficult. But again, 
Cascading does not allow mapping lists to columns, etc. 

bq. My gut feeling would be to write a custom framework instead of relying on 
DataNucleus and use AVRO if possible. I really think that HBase support is 
urgently needed but am less convinced that we need MySQL in the very short 
term. 
Yeah, the more I think about it, the more I come to terms with a custom 
implementation. However, I think we might benefit a lot from the ideas in JDO 
in the long term. Also, a JDBC implementation may not be relevant for 
large-scale deployments, but it will be a very nice side effect of the ORM 
layer: it will allow easy deployment, which in turn will hopefully bring more 
users. 





[VOTE] Board resolution for Nutch as TLP

2010-04-12 Thread Andrzej Bialecki
Hi,

Following the discussion, below is the text of the proposed Board
Resolution to vote upon.

[] +1.  Request the Board make Nutch a TLP
[] +0.  I don't feel strongly about it, but I'm okay with this.
[] -1.  No, don't request the Board make Nutch a TLP, and here are my
 reasons...

This is a majority count vote (i.e. no vetoes). The vote is open for 72
hours.

Here's my +1.

===
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web search
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Lucene Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
===


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Hold on... (Re: [VOTE] Board resolution for Nutch as TLP)

2010-04-12 Thread Andrzej Bialecki
On 2010-04-12 12:57, Andrzej Bialecki wrote:
 Hi,
 
 Following the discussion, below is the text of the proposed Board
 Resolution to vote upon.

Ehh, scrap that ... I missed one occurrence of the "crawling platform".
Resending...


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[VOTE 2] Board resolution for Nutch as TLP

2010-04-12 Thread Andrzej Bialecki
Hi,

Take two, after s/crawling/search/ ...

Following the discussion, below is the text of the proposed Board
Resolution to vote upon.

[] +1.  Request the Board make Nutch a TLP
[] +0.  I don't feel strongly about it, but I'm okay with this.
[] -1.  No, don't request the Board make Nutch a TLP, and here are my
 reasons...

This is a majority count vote (i.e. no vetoes). The vote is open for 72
hours.

Here's my +1.

===
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web search
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web search platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Lucene Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
===


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: [VOTE 2] Board resolution for Nutch as TLP

2010-04-12 Thread Doğacan Güney
On Mon, Apr 12, 2010 at 14:08, Andrzej Bialecki a...@getopt.org wrote:
 Hi,

 Take two, after s/crawling/search/ ...

 Following the discussion, below is the text of the proposed Board
 Resolution to vote upon.

 [] +1.  Request the Board make Nutch a TLP
 [] +0.  I don't feel strongly about it, but I'm okay with this.
 [] -1.  No, don't request the Board make Nutch a TLP, and here are my
  reasons...

 This is a majority count vote (i.e. no vetoes). The vote is open for 72
 hours.

 Here's my +1.

And here is my +1.









-- 
Doğacan Güney


Re: [VOTE 2] Board resolution for Nutch as TLP

2010-04-12 Thread Mattmann, Chris A (388J)
+1, thanks for pushing this forward Andrzej!

Cheers,
Chris





++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [VOTE 2] Board resolution for Nutch as TLP

2010-04-12 Thread Scott Ganyo
+1

Scott Ganyo
Actor, Writer, Producer, Technologist
www.scottganyo.com
310.359.8728

Where the spirit does not work with the hand, there is no art. - Leonardo da 
Vinci

 
 
 



[jira] Resolved: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-12 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved NUTCH-570.


Resolution: Won't Fix

 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched from Fetcher to Fetcher2 for larger whole-web fetches 
 (50-100M URLs at a time).  I found that the generated URL order is not optimal 
 because the URLs are simply randomized by a hash comparator.  In one crawl on 
 24 machines it took about 3 days to crawl 30M URLs.  Compared with old 
 benchmarks I had set with the regular Fetcher.java, this took at least 3 times 
 as long.
 Anyway, I realized that randomization only approximates the best ordering; 
 to get an optimal ordering, URLs from the same host should be as far apart in 
 the list as possible.  So I wrote a series of 2 map/reduce jobs to optimize 
 the ordering; for a list of 25M documents it takes about 10 minutes on our 
 cluster.  Right now I have it in its own class, but I figure it can go in 
 Generator.java, with a flag in nutch-default.xml that determines whether the 
 user wants to use it.
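
 The ordering goal described above (URLs from the same host kept far apart) 
 can be approximated on a single machine by round-robin interleaving across 
 hosts. This is only a minimal sketch under that assumption; it is not the 
 attached patch, which uses two map/reduce jobs, and the class and method 
 names are hypothetical. When hosts contribute very different numbers of 
 URLs, the tail of the list still ends up dominated by the largest host.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
final class HostInterleaver {

    /** Returns the same URLs, reordered so that hosts are visited round-robin. */
    static List<String> interleaveByHost(List<String> urls) {
        // Group URLs per host, keeping first-seen host order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = ""; // URLs without a host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        // Take one URL from each host per round until every queue is empty.
        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : byHost.values()) {
                String next = queue.poll();
                if (next != null) {
                    ordered.add(next);
                }
            }
        }
        return ordered;
    }
}
```

 Calling HostInterleaver.interleaveByHost on a list holding three URLs from 
 one host and two from another alternates between the two hosts before 
 emitting the leftover URL.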





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-12 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856124#action_12856124
 ] 

Enis Soztutar commented on NUTCH-808:
-

So, these are the results so far: 

DataNucleus was previously known as JPOX, and it was the reference 
implementation for Java Data Objects (JDO). JDO is a Java standard for 
persistence. A similar specification, JPA, is also a persistence standard, 
which was forked from EJB 3. However, JPA is designed for RDBMSs only, so it 
will not be useful for us 
(http://www.datanucleus.org/products/accessplatform/persistence_api.html). 

In JDO, the first step is to define the domain objects as POJOs. Then, the 
persistence metadata is specified using annotations, XML, or both. Then a 
bytecode enhancer uses instrumentation to add the required methods to the 
classes marked as @PersistenceCapable. The database tables can be generated by 
hand, automatically by DataNucleus, or by using a tool (SchemaTool). 
The persistence layer uses standard JDO syntax, which is similar to JDBC. The 
objects can be queried using JPQL. 

I have run a small test to persist objects of the WebTableRow class (from the 
NutchBase branch) to both MySQL and HBase. Although it took me a fair bit of 
time to set up, I was able to persist objects to both. 

However, although it is possible to map complex fields (like lists, maps, 
arrays, etc.) to RDBMSs using different strategies (such as serializing 
directly, using joins, or using foreign keys), I was not able to find a way to 
leverage the HBase data model. For example, we want to be able to map lists and 
maps to columns in column families. Without such functionality, using 
column-oriented stores does not bring any advantage. 
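
To illustrate what mapping lists and maps to columns in a column family could 
look like, here is a minimal sketch. It assumes cells are addressed as 
"family:qualifier" and uses a plain in-memory Map in place of the HBase client 
API; the class and method names are hypothetical and are not taken from 
NutchBase or DataNucleus.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: one cell per collection element instead of one serialized blob.
final class ColumnFamilyMapper {

    /** Maps each entry of a map-valued field to its own column: "family:key" -> value. */
    static Map<String, byte[]> mapToCells(String family, Map<String, String> field) {
        Map<String, byte[]> cells = new TreeMap<>();
        for (Map.Entry<String, String> entry : field.entrySet()) {
            cells.put(family + ":" + entry.getKey(),
                      entry.getValue().getBytes(StandardCharsets.UTF_8));
        }
        return cells;
    }

    /** Maps each list element to its own column, using a zero-padded index as qualifier. */
    static Map<String, byte[]> listToCells(String family, List<String> field) {
        Map<String, byte[]> cells = new TreeMap<>();
        for (int i = 0; i < field.size(); i++) {
            cells.put(String.format("%s:%08d", family, i),
                      field.get(i).getBytes(StandardCharsets.UTF_8));
        }
        return cells;
    }
}
```

With this layout a single entry can be read or updated without deserializing 
the whole collection, which is the kind of advantage a column-oriented store is 
expected to bring here.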

For the byte[] serialization for MapReduce, we can either implement a new 
datastore for DataNucleus that also implements Hadoop's Serialization, use 
Avro to generate Java classes to be fed into the JPOX enhancer, or else 
manually implement Writable. 
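
As a rough illustration of the last option, the sketch below mirrors the two 
methods of Hadoop's Writable contract, write(DataOutput) and 
readFields(DataInput), using only java.io so that it runs without Hadoop on the 
classpath. The PageRecord class and its fields are hypothetical; a real 
implementation would declare "implements org.apache.hadoop.io.Writable".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type; the field names are illustrative only.
final class PageRecord {
    String url = "";
    int fetchStatus;

    // Same signature as Writable.write: serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(fetchStatus);
    }

    // Same signature as Writable.readFields: read the fields back in the same order.
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        fetchStatus = in.readInt();
    }

    /** Serializes a record to the byte[] form a job would emit. */
    static byte[] toBytes(PageRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds a record from bytes produced by toBytes. */
    static PageRecord fromBytes(byte[] bytes) {
        PageRecord record = new PageRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }
}
```

The manual approach is simple but has to be kept in sync with the class by 
hand, which is the trade-off against generating the classes with Avro.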

To sum up, DataNucleus brings the following advantages:
- out-of-the-box RDBMS support 
- XML or annotation metadata
- JDO is a Java standard 
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex; implementing a datastore is not trivial
- We need to write patches to DataNucleus to flexibly map complex fields to 
leverage HBase's data model
- We have no control over the source code
- no native HBase support (for example using filters, etc.)

On the other hand, the current implementation 
- is tested in production, 
- can leverage the HBase data model, 
- can be modified to work with Avro serialization directly, 
- could gain Cassandra support with little effort,
- can support multiple languages (in the future).

I believe that having SQLite, MySQL and HBase support is critical for Nutch 
2.0, for out-of-the-box use, ease of deployment and real-scale computing 
respectively. But obviously we cannot use DataNucleus out of the box either. 


ORM is inherently a hard problem. I propose we go ahead and make the changes to 
DataNucleus to see if it is feasible, and continue with it if it suits our 
needs. Of course, having a custom framework will also be great, so any feedback 
would be more than welcome. 





Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-11 Thread Doğacan Güney
Hi,

On Sat, Apr 10, 2010 at 16:32, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web crawling
 platform for distribution at no charge to the public.

 Would it make sense to simplify the scope to ... open-source software
 related to large-scale web crawling for distribution at no charge to
 the public?


Actually, shouldn't that be something like "web search platform", or maybe a
"crawling and search platform"? Nutch is not just a crawler.

Anyway, +1 from me.

 BR,

 Jukka Zitting




-- 
Doğacan Güney


Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-11 Thread Mattmann, Chris A (388J)
Hi Dogacan,

+1 to calling it a web search platform, since I agree, it’s not just a
crawler.

Cheers,
Chris


 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Andrzej Bialecki
On 2010-04-10 04:13, Mattmann, Chris A (388J) wrote:
 Hi Andrzej,
 
 +1, with the following amendment:
 

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Nutch Project are hereafter discharged.
 
 This should read:
 
 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

Good catch, thanks.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Jukka Zitting
Hi,

On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web crawling
 platform for distribution at no charge to the public.

Would it make sense to simplify the scope to "... open-source software
related to large-scale web crawling for distribution at no charge to
the public"?

BR,

Jukka Zitting


Adding jpeg parser to nutch

2010-04-10 Thread Gombkötő Dávid

Hello.

I'm working on a school task, which is to modify Nutch so that it can
identify and download JPEGs, create a thumbnail, and index the URLs of
these JPEGs with the other crawl results so that the web interface can
show images as well.

At first I found that ParserNotFound.java could do the trick for me.
I modified the constructor so that it matches the end of the URL against a
pattern, and if the URL ends in jpeg it creates a file named after the
md5sum of the URL, in a directory on my filesystem, and writes the URL into
it. Well, this is ugly. I wanted to pass the working directory
to ParserNotFound.java, but I couldn't. To move forward with my
work, I first need to know how to write my own JPEG parser. After
that I would like to index my results somehow :)

So, my question: how can I add my JPEG parser? Or, how can I add a new
parser to the Nutch system? Thanks for your answers.
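
For reference, the URL check and the md5sum-based file naming described above
can be sketched in isolation as follows. This is only an illustration of that
workaround, not a Nutch parser plugin; the class and method names are
hypothetical, and the pattern assumes the URL path ends in .jpg or .jpeg with
no query string after it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

// Hypothetical helper, independent of the Nutch code base.
final class JpegUrlNames {
    // Matches URLs whose last characters are ".jpg" or ".jpeg", ignoring case.
    private static final Pattern JPEG_SUFFIX =
        Pattern.compile(".*\\.jpe?g$", Pattern.CASE_INSENSITIVE);

    static boolean looksLikeJpeg(String url) {
        return JPEG_SUFFIX.matcher(url).matches();
    }

    /** Returns the lowercase hex MD5 digest of the URL, usable as a file name. */
    static String md5Hex(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```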



Re: Adding jpeg parser to nutch

2010-04-10 Thread Mattmann, Chris A (388J)
Hi David,

The latest Nutch release candidate (1.1, 
http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1) includes the tika-parser 
plugin, which provides a JpegParser (see here: http://bit.ly/b0zRX8) that 
hopefully can suit your needs.

Let me know what you think.

Cheers,
Chris





++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Andrzej Bialecki
On 2010-04-10 15:32, Jukka Zitting wrote:
 Hi,
 
 On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web crawling
 platform for distribution at no charge to the public.
 
 Would it make sense to simplify the scope to ... open-source software
 related to large-scale web crawling for distribution at no charge to
 the public?

Yes, that's a good change too.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Sami Siren
Looks good to me after the proposed changes.

--
 Sami Siren

On Sat, Apr 10, 2010 at 6:09 PM, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-10 15:32, Jukka Zitting wrote:
 Hi,

 On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web crawling
 platform for distribution at no charge to the public.

 Would it make sense to simplify the scope to ... open-source software
 related to large-scale web crawling for distribution at no charge to
 the public?

 Yes, that's a good change too.

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Dennis Kubes

I think it looks good after the minor changes.  +1.

Dennis

Andrzej Bialecki wrote:

Hi,

I was told that the next step is to come up with the proposed Board
resolution and vote on it among committers. Here's the proposed text
(shameless copy-paste from the Tika and Mahout proposals).

IMPORTANT NOTE: I removed from the members of the PMC those existing
Nutch committers that haven't been active for more than 1 year, with the
intention of moving them to Emeritus status. If any one of these people
feels left out and would like to become an active committer in the
project, please let us know and we will gladly welcome you back :)

The text of the resolution follows. Committers, please read it and
optionally comment on the salient points of the text, the rest is
boilerplate. If there's an overall consensus I will call for a formal
vote to submit this proposal to the Board.


==
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web crawling
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Nutch Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
=






[DISCUSS] Board resolution for Nutch as TLP

2010-04-09 Thread Andrzej Bialecki
Hi,

I was told that the next step is to come up with the proposed Board
resolution and vote on it among committers. Here's the proposed text
(shameless copy-paste from the Tika and Mahout proposals).

IMPORTANT NOTE: I removed from the members of the PMC those existing
Nutch committers that haven't been active for more than 1 year, with the
intention of moving them to Emeritus status. If any one of these people
feels left out and would like to become an active committer in the
project, please let us know and we will gladly welcome you back :)

The text of the resolution follows. Committers, please read it and
optionally comment on the salient points of the text, the rest is
boilerplate. If there's an overall consensus I will call for a formal
vote to submit this proposal to the Board.


==
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web crawling
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Nutch Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
=




-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-09 Thread Mattmann, Chris A (388J)
Hi Andrzej,

+1, with the following amendment:

 
 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Nutch Project are hereafter discharged.

This should read:

 RESOLVED, that all responsibilities pertaining to the Apache
 Lucene Nutch sub-project encumbered upon the
 Apache Lucene Project are hereafter discharged.

Cheers,
Chris


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-07 18:54, Doğacan Güney wrote:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think 
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.

 Hmm .. this puzzles me, do you think we should port changes from 1.1 to
 nutchbase? I thought we should do it the other way around, i.e. merge
 nutchbase bits to trunk.


Hmm, I am a bit out of touch with the latest changes but I know that
the differences between trunk and nutchbase are unfortunately rather
large right now. If merging nutchbase back into trunk would be easier
then sure, let's do that.


 * support for HBase : via ORM or not (see
 NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808
 )

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 Again, the advantage of DataNucleus is that we don't have to handcraft
 all the mid- to low-level mappings, just the mid-level ones (JOQL or
 whatever) - the cost of maintenance is lower, and the number of backends
 that are supported out of the box is larger. Of course, this is just
 IMHO - we won't know for sure until we try to use both your custom ORM
 and DataNucleus...

I am obviously a bit biased here but I have no strong feelings really.
DataNucleus is an excellent project. What I like about the avro-based
approach is the essentially free MapReduce support we get and the fact
that supporting another language is easy. So, we can expose partial
hbase data through a server and a python-client can easily read/write
to it, thanks to avro. That being said, I am all for DataNucleus or
something else.


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
Hi,

On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted.Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


At some point, it would be nice to change the generator so that it is only
a handful of methods and a pig (or something else) script. So, we would
provide most of the functions you may need during generation (accessing
various data) but the actual generation would be a pig process. This way,
anyone can easily change generation any way they want (even make it more
than 2 jobs if they want more complex schemes).




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808
 )

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





 --
 Doğacan Güney



 --
 -MilleBii-




-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-08 Thread MilleBii
Not sure what you mean by pig script, but I'd like to be able to make a
multi-criteria selection of URLs for fetching...
The scoring method forces a kind of one-dimensional approach,
which is not really easy to deal with.

The regex filters are good, but they assume you want to select URLs on data
which is in the URL... Pretty limited, in fact.

I basically would like to do 'content'-based crawling. Say, for
example, that I'm interested in topic A.
I'd like to label URLs that match topic A (user-supplied logic).
Later on I would want to crawl topic A URLs at a certain frequency
and explore non-labeled URLs in a different way.

This looks hard to do right now.
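
To make the request above concrete, here is a rough sketch of label-based selection, assuming each URL carries an optional label and the day it was last fetched. Everything in it (LabelledGenerateSketch, Entry, selectDue, the day-number convention) is hypothetical and only illustrates the idea; it is not how the Nutch generator works today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Nutch.
final class LabelledGenerateSketch {

    static final class Entry {
        final String url;
        final String label;       // null means "not labelled"
        final long lastFetchDay;  // day number; negative means "never fetched"

        Entry(String url, String label, long lastFetchDay) {
            this.url = url;
            this.label = label;
            this.lastFetchDay = lastFetchDay;
        }
    }

    // Returns the URLs due for fetching on the given day: never-fetched URLs
    // (the crawl frontier) plus URLs whose per-label interval has elapsed.
    // Unlabelled URLs, and labels missing from the map, use the default interval.
    static List<String> selectDue(List<Entry> entries,
                                  Map<String, Integer> intervalDaysByLabel,
                                  int defaultIntervalDays,
                                  long today) {
        List<String> due = new ArrayList<>();
        for (Entry e : entries) {
            int interval = defaultIntervalDays;
            if (e.label != null && intervalDaysByLabel.containsKey(e.label)) {
                interval = intervalDaysByLabel.get(e.label);
            }
            boolean neverFetched = e.lastFetchDay < 0;
            if (neverFetched || today - e.lastFetchDay >= interval) {
                due.add(e.url);
            }
        }
        return due;
    }
}
```

With a weekly interval for "topicA" and a 30-day default, a topic A URL fetched ten days ago is selected, an unlabelled URL fetched ten days ago is not, and a never-fetched URL always is. The labelling step itself (the user-supplied logic) is deliberately left out of the sketch.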

2010/4/8, Doğacan Güney doga...@gmail.com:
 Hi,

 On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted.Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


 At some point, it would be nice to change generator so that it is only a
 handful
 of methods and a pig (or something else) script. So, we would provide
 most of the functions
 you may need during generation (accessing various data) but actual
 generation would be a pig
 process. This way, anyone can easily change generate any way they want
 (even make it more jobs
 than 2 if they want more complex schemes).




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will
 be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808
 )

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 

Re: Nutch 2.0 roadmap

2010-04-08 Thread Doğacan Güney
On Thu, Apr 8, 2010 at 21:11, MilleBii mille...@gmail.com wrote:
 Not sure what u mean by pig script, but I'd like to be able to make a
 multi-criteria selection of Url for fetching...

I mean a query language like

http://hadoop.apache.org/pig/

If we expose the data correctly, then you should be able to generate on any
criteria that you want.

  The scoring method forces into a kind of mono dimensional approach
 which is not really easy to deal with.

 The regex filters are good but it assumes you want select URLs on data
 which is in the URL... Pretty limited in fact

 I basically would like to do 'content' based crawling. Say for
 example: that I'm interested in topic A.
 I'd'like to label URLs that match Topic A (user supplied logic).
 Later on I would want to crawl topic A urls at a certain frequency
 and non labeled urls for exploring in a different way.

  This looks like hard to do right now

 2010/4/8, Doğacan Güney doga...@gmail.com:
 Hi,

 On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote:
 Just a question ?
 Will the new HBase implementation allow more sophisticated crawling
 strategies than the current score based.

 Give you a few  example of what I'd like to do :
 Define different crawling frequency for different set of URLs, say
 weekly for some url, monthly or more for others.

 Select URLs to re-crawl based on attributes previously extracted.Just
 one example: recrawl urls that contained a certain keyword (or set of)

 Select URLs that have not yet been crawled, at the frontier of the
 crawl therefore


 At some point, it would be nice to change generator so that it is only a
 handful
 of methods and a pig (or something else) script. So, we would provide
 most of the functions
 you may need during generation (accessing various data) but actual
 generation would be a pig
 process. This way, anyone can easily change generate any way they want
 (even make it more jobs
 than 2 if they want more complex schemes).




 2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will
 be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808
 )

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical 

Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-07 Thread Fadzi Ushewokunze
..and here is to a Vote: +1

 Oh, per usual, forgot to throw in my +1. So, +1!

 Cheers,
 Chris


 On 4/7/10 1:14 AM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Folks,

 I have posted a candidate for the Apache Nutch 1.1 release. The source
 code
 is at:

 http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/

 See the included CHANGES.txt file for details on release contents and
 latest
 changes. The release was made using the Nutch release process, documented
 on
 the Wiki here:

 http://bit.ly/d5ugid

 A Nutch 1.1 tag is at:

 http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

 Please vote on releasing these packages as Apache Nutch 1.1. The vote is
 open for the next 72 hours. Only votes from Lucene PMC are binding, but
 everyone is welcome to check the release candidate and voice their
 approval
 or disapproval. The vote passes if at least three binding +1 votes are
 cast.

 [ ] +1 Release the packages as Apache Nutch 1.1.

 [ ] -1 Do not release the packages because...

 Thanks!

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++





 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






[Nutch Wiki] Update of FrontPage by JulienNioche

2010-04-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by JulienNioche.
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=128&rev2=129

--

   * [[Mailing]] Lists
   * AcademicArticles that deal with Nutch
   * http://videolectures.net/iiia06_cutting_ense/| Experiences with the Nutch 
search engine author:Doug   Cutting,Video Lecture
- 
  
  == Nutch Administration ==
   * DownloadingNutch
@@ -89, +88 @@

   * TikaPlugin - Comments on the Tika integration and differences with 
existing parse plugins
  
  == Nutch 2.0 ==
+  * Nutch2Roadmap -- Discussions on the architecture and features of Nutch 2.0
-  * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture.
+  * Nutch2Architecture -- Discussions on the Nutch 2.0 architecture (old)
   * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
   * NewScoringIndexingExample -- Two full fetch cycles of commands using new 
scoring and indexing systems.
  


[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche

2010-04-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Nutch2Roadmap page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap

--

New page:
= Nutch2Roadmap =

Here is a list of the features and architectural changes that will be 
implemented in Nutch 2.0.

  * Storage Abstraction
* initially with back end implementations for HBase and HDFS  
* extend it to other storages later e.g. MySQL etc...
  * Plugin cleanup : Tika only for parsing document formats
* keep only stuff like HtmlParseFilters (probably with a different API) so that 
we can post-process the DOM created in Tika from whatever original format.
  * Externalize functionalities to crawler-commons project 
[http://code.google.com/p/crawler-commons/] 
* robots handling, url filtering and url normalization, URL state 
management, perhaps deduplication. We should coordinate our efforts, and share 
code freely so that other projects (bixo, heritrix, droids) may contribute to 
this shared pool of functionality, much like Tika does for the common need of 
parsing complex formats.
  * Remove index / search and delegate to SOLR
* we may still keep a thin abstract layer to allow other indexing/search 
backends (ElasticSearch?), but the current mess of indexing/query filters and 
competing indexing frameworks (lucene, fields, solr) should go away. We should 
go directly from DOM to a NutchDocument, and stop there.
  * Various new functionalities 
* e.g. sitemap support, canonical tag, better handling of redirects, 
detecting duplicated sites, detection of spam cliques, tools to manage the 
webgraph, etc.


This document is meant to serve as a basis for discussion; feel free to 
contribute to it.
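
As one possible reading of the "Storage Abstraction" item above, the sketch below shows a small storage interface with an in-memory implementation standing in for a real back end such as HBase, HDFS or MySQL. The names PageStore and InMemoryPageStore, and the choice of plain string fields, are assumptions made for illustration; they are not taken from the Nutch or NutchBase code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical storage interface; not taken from Nutch or NutchBase.
interface PageStore {
    // Stores or updates the given fields for a URL.
    void put(String url, Map<String, String> fields);

    // Returns a copy of the stored fields, or empty if the URL is unknown.
    Optional<Map<String, String>> get(String url);
}

// In-memory stand-in for a real back end (HBase, HDFS, MySQL, ...).
final class InMemoryPageStore implements PageStore {
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    @Override
    public void put(String url, Map<String, String> fields) {
        // Merge into any existing row so separate jobs can each add their own fields.
        rows.computeIfAbsent(url, k -> new HashMap<>()).putAll(fields);
    }

    @Override
    public Optional<Map<String, String>> get(String url) {
        Map<String, String> row = rows.get(url);
        if (row == null) {
            return Optional.empty();
        }
        Map<String, String> copy = new HashMap<>(row);
        return Optional.of(copy);
    }
}
```

Crawl jobs written against an interface like this would not need to know which back end is configured, which is the point of the roadmap item. A real design would also need scans, batching and typed columns, none of which are shown here.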


Re: Nutch 2.0 roadmap

2010-04-07 Thread Julien Nioche
Hi,

I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


yes, maybe we should start the 2.0 branch from 1.1 instead
Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it


 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.


definitely

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.



I think that separating the parsing filters from the indexing filters can
have its merits e.g. combining the metadata generated by 2 or more different
parsing filters into a single field in the NutchDocument, keeping only a
subset of the available information etc...
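
To illustrate what I mean by combining metadata, here is a minimal sketch. It assumes the parse metadata is a plain map from key to values; the class and method names are made up for the example and do not exist in Nutch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class IndexingFilterSketch {
    /**
     * Merge the values stored under several parse-metadata keys into the
     * value list of a single index field, dropping duplicates but keeping
     * the order in which values were first seen.
     */
    static List<String> mergeFields(Map<String, List<String>> parseMeta,
                                    List<String> sourceKeys) {
        Set<String> merged = new LinkedHashSet<>();
        for (String key : sourceKeys) {
            merged.addAll(parseMeta.getOrDefault(key, List.of()));
        }
        return new ArrayList<>(merged);
    }
}
```

An indexing filter built this way can also ignore any key it is not given, which covers the "subset of the available information" case.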


 
  I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
  update?


I have created a new page to serve as a basis for discussion:
http://wiki.apache.org/nutch/Nutch2Roadmap

julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


[Nutch Wiki] Update of Nutch2Roadmap by JulienNioche

2010-04-07 Thread Apache Wiki
The Nutch2Roadmap page has been changed by JulienNioche.
http://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=1&rev2=2

--

* Storage Abstraction
  * initially with back end implementations for HBase and HDFS  
  * extend it to other storages later e.g. MySQL etc...
-   * Plugin cleanup : Tika only for parsing document formats
+   * Plugin cleanup : Tika only for parsing document formats (see 
http://wiki.apache.org/nutch/TikaPlugin)
  * keep only stuff like HtmlParseFilters (probably with a different API) so 
that we can post-process the DOM created in Tika from whatever original format.
* Externalize functionalities to crawler-commons project 
[http://code.google.com/p/crawler-commons/] 
  * robots handling, url filtering and url normalization, URL state 
management, perhaps deduplication. We should coordinate our efforts, and share 
code freely so that other projects (bixo, heritrix, droids) may contribute to 
this shared pool of functionality, much like Tika does for the common need of 
parsing complex formats.


[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-808:


Fix Version/s: 2.0

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Nutch 2.0 roadmap

2010-04-07 Thread Doğacan Güney
Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar

Hi,

On 04/07/2010 07:54 PM, Doğacan Güney wrote:

Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialeckia...@getopt.org  wrote:
   

On 2010-04-06 15:43, Julien Nioche wrote:
 

Hi guys,

I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?
   

I'm not sure what is the status of the nutchbase - it's missed a lot of
fixes and changes in trunk since it's been last touched ...

 

I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.
   


A suggestion would be to continue with trunk until nutch-base is stable. 
Once it is, then we can merge the nutchbase branch to trunk (after 1.1 
split), at which point trunk becomes the nutchbase+other issues merged. 
Then when the time comes, we can fork branch-2.0 and release when 
blockers are done. I strongly advise against developing on both trunk and 
a 2.0 branch in parallel.


   

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see
NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
   

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 

Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.
   
The current ORM code is merged with the nutchbase code, but I think the 
sooner we split it the better, since development will be much clearer and 
simpler this way. I have opened NUTCH-808 to explore the alternatives, 
but we might as well continue with the current implementation. I intend to 
share my findings in a couple of days.


   

* plugin cleanup : Tika only for parsing - get rid of everything else?
   

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 


So, it seems that at some point, we need to bite the bullet, and 
refactor plugins, dropping backwards compatibility.



* remove index / search and delegate to SOLR
   

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

 

Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

   

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations, We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 

* new functionalities e.g. sitemap support, canonical tag etc...
   

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?
   

Definitely. :)

--
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


 



   




Re: Nutch 2.0 roadmap

2010-04-07 Thread Enis Söztutar
I forgot to say that, at Hadoop, it is the convention that big issues 
like the ones under discussion come with a design document, so that a 
solid design is agreed upon before the work starts. We can apply the same 
pattern at Nutch.


On 04/07/2010 07:54 PM, Doğacan Güney wrote:

Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialeckia...@getopt.org  wrote:
   

On 2010-04-06 15:43, Julien Nioche wrote:
 

Hi guys,

I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?
   

I'm not sure what is the status of the nutchbase - it's missed a lot of
fixes and changes in trunk since it's been last touched ...

 

I know... But I still intend to finish it, I just need to schedule
some time for it.

My vote would be to go with nutchbase.

   

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see
NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
   

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 

Yeah, there is already a simple ORM within nutchbase that is
avro-based and should
be generic enough to also support MySQL, cassandra and berkeleydb. But
any good ORM will
be a very good addition.

   

* plugin cleanup : Tika only for parsing - get rid of everything else?
   

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 

* remove index / search and delegate to SOLR
   

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

 

Agreed. I would like to add support for katta and other indexing
backends at some point but
NutchDocument should be our canonical representation. The rest should
be up to indexing backends.

   

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations, We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 

* new functionalities e.g. sitemap support, canonical tag etc...
   

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?
   

Definitely. :)

--
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


 



   




Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 18:54, Doğacan Güney wrote:
 Hey everyone,
 
 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...

 
 I know... But I still intend to finish it, I just need to schedule
 some time for it.
 
 My vote would be to go with nutchbase.

Hmm .. this puzzles me, do you think we should port changes from 1.1 to
nutchbase? I thought we should do it the other way around, i.e. merge
nutchbase bits to trunk.


 * support for HBase : via ORM or not (see
 NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.

 
 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

Again, the advantage of DataNucleus is that we don't have to handcraft
all the mid- to low-level mappings, just the mid-level ones (JOQL or
whatever) - the cost of maintenance is lower, and the number of backends
that are supported out of the box is larger. Of course, this is just
IMHO - we won't know for sure until we try to use both your custom ORM
and DataNucleus...

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread Andrzej Bialecki
On 2010-04-07 19:24, Enis Söztutar wrote:

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

  
 
 So, it seems that at some point, we need to bite the bullet, and
 refactor plugins, dropping backwards compatibility.

Right, that was my point - now is the time to break it, with the
cut-over to 2.0, while leaving the 1.1 branch in good shape to serve well
enough in the interim period.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 roadmap

2010-04-07 Thread MilleBii
Just a question:
Will the new HBase implementation allow more sophisticated crawling
strategies than the current score-based one?

Here are a few examples of what I'd like to do:
Define different crawling frequencies for different sets of URLs, say
weekly for some URLs, monthly or more for others.

Select URLs to re-crawl based on attributes previously extracted. Just
one example: recrawl URLs that contained a certain keyword (or set of
keywords).

Select URLs that have not yet been crawled, i.e. those at the frontier
of the crawl.
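
To make the first example concrete, here is a rough sketch of the kind of rule I have in mind. It is not based on any existing Nutch or NutchBase API; the class name, the regex rules and the 30-day default are all assumptions for illustration.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class RecrawlIntervalSketch {
    // Rules are checked in insertion order; the first match wins.
    private final Map<Pattern, Duration> rules = new LinkedHashMap<>();
    private final Duration fallback;

    RecrawlIntervalSketch(Duration fallback) {
        this.fallback = fallback;
    }

    /** Register a re-crawl interval for URLs matching the given regex. */
    RecrawlIntervalSketch rule(String regex, Duration interval) {
        rules.put(Pattern.compile(regex), interval);
        return this;
    }

    /** Interval for a URL, or the fallback if no rule matches. */
    Duration intervalFor(String url) {
        for (Map.Entry<Pattern, Duration> e : rules.entrySet()) {
            if (e.getKey().matcher(url).find()) {
                return e.getValue();
            }
        }
        return fallback;
    }

    /** True if the URL has not been fetched for at least its interval. */
    boolean isDue(String url, Duration sinceLastFetch) {
        return sinceLastFetch.compareTo(intervalFor(url)) >= 0;
    }
}
```

For instance, a rule such as `rule("^https?://news\\.", Duration.ofDays(7))` on top of a 30-day fallback would give weekly re-crawls for news hosts and monthly ones for everything else.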




2010/4/7, Doğacan Güney doga...@gmail.com:
 Hey everyone,

 On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,

 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly
 for
 JIRA so that we can file issues / feature requests on 2.0? Do you think
 that
 the current NutchBase could be used as a basis for the 2.0 branch?

 I'm not sure what is the status of the nutchbase - it's missed a lot of
 fixes and changes in trunk since it's been last touched ...


 I know... But I still intend to finish it, I just need to schedule
 some time for it.

 My vote would be to go with nutchbase.


 Talking about features, what else would we add apart from :

 * support for HBase : via ORM or not (see
 NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)

 This IMHO is promising, this could open the doors to small-to-medium
 installations that are currently too cumbersome to handle.


 Yeah, there is already a simple ORM within nutchbase that is
 avro-based and should
 be generic enough to also support MySQL, cassandra and berkeleydb. But
 any good ORM will
 be a very good addition.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

 Basically, yes - keep only stuff like HtmlParseFilters (probably with a
 different API) so that we can post-process the DOM created in Tika from
 whatever original format.

 Also, the goal of the crawler-commons project is to provide APIs and
 implementations of stuff that is needed for every open source crawler
 project, like: robots handling, url filtering and url normalization, URL
 state management, perhaps deduplication. We should coordinate our
 efforts, and share code freely so that other projects (bixo, heritrix,
 droids) may contribute to this shared pool of functionality, much like
 Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

 +1 - we may still keep a thin abstract layer to allow other
 indexing/search backends, but the current mess of indexing/query filters
 and competing indexing frameworks (lucene, fields, solr) should go away.
 We should go directly from DOM to a NutchDocument, and stop there.


 Agreed. I would like to add support for katta and other indexing
 backends at some point but
 NutchDocument should be our canonical representation. The rest should
 be up to indexing backends.

 Regarding search - currently the search API is too low-level, with the
 custom text and query analysis chains. This needlessly introduces the
 (in)famous Nutch Query classes and Nutch query syntax limitations, We
 should get rid of it and simply leave this part of the processing to the
 search backend. Probably we will use the SolrCloud branch that supports
 sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

 Plus a better handling of redirects, detecting duplicated sites,
 detection of spam cliques, tools to manage the webgraph, etc.


 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

 Definitely. :)

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





 --
 Doğacan Güney



-- 
-MilleBii-


[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-07 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854665#action_12854665
 ] 

Otis Gospodnetic commented on NUTCH-570:


I'm tempted to close this issue as Won't Fix, because:
* I have no way to test and verify this
* nobody seems to be using this
* this issue has only 2 votes and only 3 watchers
* the original reporter mentioned he noticed only marginal speedups


 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
 had set with regular Fetcher.java this was at least 3 fold more time.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java and just add a flag in nutch-default.xml 
 determining if the user wants to use it.
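
 The ordering goal described above (URLs from the same host kept far apart) can be approximated on a single machine by a round-robin over per-host queues. The sketch below is only an illustration of that idea; it is not the attached patch, which uses two map/reduce jobs, and the class and method names are made up.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HostInterleaveSketch {
    /**
     * Reorder URLs by taking one URL per host in turn, so that consecutive
     * entries come from different hosts for as long as several hosts remain.
     */
    static List<String> interleaveByHost(List<String> urls) {
        // Group URLs by host, keeping first-seen host order.
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            String key = (host == null) ? "" : host;
            byHost.computeIfAbsent(key, h -> new ArrayDeque<>()).add(url);
        }
        // Round-robin over the hosts until every queue is drained.
        List<String> ordered = new ArrayList<>(urls.size());
        while (!byHost.isEmpty()) {
            Iterator<Deque<String>> it = byHost.values().iterator();
            while (it.hasNext()) {
                Deque<String> queue = it.next();
                ordered.add(queue.poll());
                if (queue.isEmpty()) {
                    it.remove();
                }
            }
        }
        return ordered;
    }
}
```

 When one host has far more URLs than the others, its remaining URLs still end up adjacent at the tail of the list, so this is an approximation rather than an optimal spacing.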




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854767#action_12854767
 ] 

Chris A. Mattmann commented on NUTCH-570:
-

Hi Otis:

I think your logic is perfectly rational here. Maybe you could leave it open for 
another 48 hrs, and then close it out if you don't get any feedback from the 
original reporter, or those that were interested.

Cheers,
Chris


 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
 had set with regular Fetcher.java this was at least 3 fold more time.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java and just add a flag in nutch-default.xml 
 determining if the user wants to use it.




[jira] Created: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
Upgrade to Tika 0.7
---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


Upgrading to Tika 0.7 before 1.1 release

The TikaConfig mechanism has changed and no longer relies on a default XML 
config file. I am working on it.




[jira] Updated: (NUTCH-789) Improvements to Tika parser

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-789:


  Component/s: (was: fetcher)
   parser
Fix Version/s: (was: 1.1)

I have created a separate issue for the upgrade to Tika 0.7 and moved this one 
out of 1.1.

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: parser
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-810.
---

Resolution: Fixed

Committed in rev 931098.

http://issues.apache.org/jira/browse/TIKA-317 changed the way the TikaConfig is 
created: it no longer relies on a tika-config.xml file. Our custom 
TikaConfig has been modified to reflect these changes.

This was the last remaining issue marked for 1.1 



 Upgrade to Tika 0.7
 ---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 Upgrading to Tika 0.7 before 1.1 release
 The TikaConfig mechanism has changed and does not rely on a default XML 
 config file anymore. Am working on it.




release of 1.1?

2010-04-06 Thread Julien Nioche
Chris,

Just to let you know that I have committed
https://issues.apache.org/jira/browse/NUTCH-810 which was the last open
issue before the release of 1.1

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Nutch 2.0 roadmap

2010-04-06 Thread Julien Nioche
Hi guys,

I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
based on what is currently referred to as NutchBase. Shall we create a
branch for 2.0 in the Nutch SVN repository and have a label accordingly for
JIRA so that we can file issues / feature requests on 2.0? Do you think that
the current NutchBase could be used as a basis for the 2.0 branch?

Talking about features, what else would we add apart from :

* support for HBase : via ORM or not (see
NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
* plugin cleanup : Tika only for parsing - get rid of everything else?
* remove index / search and delegate to SOLR
* new functionalities e.g. sitemap support, canonical tag etc...

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
update?

I look forward to hearing your thoughts on this

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Re: release of 1.1?

2010-04-06 Thread Mattmann, Chris A (388J)
Thanks Julien!

OK, I'll cut the RC at some point today. Thanks!

Cheers,
Chris


On 4/6/10 4:47 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

Chris,

Just to let you know that I have committed 
https://issues.apache.org/jira/browse/NUTCH-810 which was the last open issue 
before the release of 1.1

Thanks

Julien


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Nutch 2.0 roadmap

2010-04-06 Thread Andrzej Bialecki
On 2010-04-06 15:43, Julien Nioche wrote:
 Hi guys,
 
 I gather that we'll jump straight to  2.0 after 1.1 and that 2.0 will be
 based on what is currently referred to as NutchBase. Shall we create a
 branch for 2.0 in the Nutch SVN repository and have a label accordingly for
 JIRA so that we can file issues / feature requests on 2.0? Do you think that
 the current NutchBase could be used as a basis for the 2.0 branch?

I'm not sure what the status of NutchBase is - it has missed a lot of
fixes and changes in trunk since it was last touched ...

 
 Talking about features, what else would we add apart from :
 
 * support for HBase : via ORM or not (see NUTCH-808,
 https://issues.apache.org/jira/browse/NUTCH-808)

This IMHO is promising, this could open the doors to small-to-medium
installations that are currently too cumbersome to handle.

 * plugin cleanup : Tika only for parsing - get rid of everything else?

Basically, yes - keep only stuff like HtmlParseFilters (probably with a
different API) so that we can post-process the DOM created in Tika from
whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and
implementations of stuff that is needed for every open source crawler
project, like: robots handling, url filtering and url normalization, URL
state management, perhaps deduplication. We should coordinate our
efforts, and share code freely so that other projects (bixo, heritrix,
droids) may contribute to this shared pool of functionality, much like
Tika does for the common need of parsing complex formats.

 * remove index / search and delegate to SOLR

+1 - we may still keep a thin abstract layer to allow other
indexing/search backends, but the current mess of indexing/query filters
and competing indexing frameworks (lucene, fields, solr) should go away.
We should go directly from DOM to a NutchDocument, and stop there.

Regarding search - currently the search API is too low-level, with the
custom text and query analysis chains. This needlessly introduces the
(in)famous Nutch Query classes and Nutch query syntax limitations. We
should get rid of it and simply leave this part of the processing to the
search backend. Probably we will use the SolrCloud branch that supports
sharding and global IDF.

 * new functionalities e.g. sitemap support, canonical tag etc...

Plus a better handling of redirects, detecting duplicated sites,
detection of spam cliques, tools to manage the webgraph, etc.

 
 I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
 update?

Definitely. :)

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Commented: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854332#action_12854332
 ] 

Hudson commented on NUTCH-810:
--

Integrated in Nutch-trunk #1116 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1116/])
 Upgraded to Tika 0.7


 Upgrade to Tika 0.7
 ---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 Upgrading to Tika 0.7 before 1.1 release
 The TikaConfig mechanism has changed and does not rely on a default XML 
 config file anymore. Am working on it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-06 Thread Mattmann, Chris A (388J)
Hi Folks,

I have posted a candidate for the Apache Nutch 1.1 release. The source code
is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/

See the included CHANGES.txt file for details on release contents and latest
changes. The release was made using the Nutch release process, documented on
the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours. Only votes from Lucene PMC are binding, but
everyone is welcome to check the release candidate and voice their approval
or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853251#action_12853251
 ] 

Julien Nioche commented on NUTCH-789:
-

Will upgrade as soon as 0.7 is available from 
http://repo1.maven.org/maven2/org/apache/tika/ - which is not the case yet.
I will leave this issue open but unmark it as 1.1

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




Re: Question: Nutch 0.8.2 and Nutch 0.7.3?

2010-04-04 Thread Andrzej Bialecki
On 2010-04-04 02:59, Mattmann, Chris A (388J) wrote:
 Hey Guys,
 
 Question. I see 2 releases that haven't been cut in JIRA:
 
 0.8.2: 
 https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312064
 
 0.7.3:
 
 https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312176
 
 I'm happy to cut 0.8.2 as part of the 1.1 effort, to get it out the door.
 However, I have a question: is this Nutch 0.8.2 in SVN?
 
 http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/

That's the code that was intended to become 0.8.2 ...

However, I'm not sure whether there's any benefit in releasing either of
these. Those who really had the need to track this branch (or 0.7)
likely used the code from this branch even though it wasn't released.
And I believe we are not interested in maintaining a new release based
on this code...?


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Question: Nutch 0.8.2 and Nutch 0.7.3?

2010-04-04 Thread Mattmann, Chris A (388J)
Hey Andrzej,

 http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/
 
 That's the code that was intended to become 0.8.2 ...
 
 However, I'm not sure whether there's any benefit in releasing either of
 these. Those who really had the need to track this branch (or 0.7)
 likely used the code from this branch even though it wasn't released.
 And I believe we are not interested in maintaining a new release based
 on this code...?

No problem, just wanted to gauge interest. Is everyone OK with me closing
out those releases in JIRA, then?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853285#action_12853285
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Hey Julien, Tika 0.7 is available from Maven central:

http://repo1.maven.org/maven2/org/apache/tika/tika-parsers/

Cheers,
Chris


 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Updated: (NUTCH-807) JSParseFilter produces malformed URL

2010-04-03 Thread Minyao Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Minyao Zhu updated NUTCH-807:
-

Summary: JSParseFilter produces malformed URL  (was: JSParseFilter produces 
weired URL)

 JSParseFilter produces malformed URL
 

 Key: NUTCH-807
 URL: https://issues.apache.org/jira/browse/NUTCH-807
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0.0
 Environment: Redhat 2.6.18-128.1.6.el5PAE  i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu

 This is found when crawling site: http://zhidao.baidu.com/( a Chinese 
 language site )
 It appears this page contains javascripts which confused JSParseFilter, which 
 produced URL like this:
 http://zhidao.baidu.com/){if(A===46){baidu.hide(
 Not sure the impact/scope of this issue in general.  The observation for this 
 specific site is, much less pages got crawled.
 Thanks.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853212#action_12853212
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Hey Julien -- okey dok, Tika 0.7 has been released. Feel free to upgrade, and 
close this one out...after that, I'll cut the Nutch 1.1 RC.

Thanks!

Cheers,
Chris


 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




Question: Nutch 0.8.2 and Nutch 0.7.3?

2010-04-03 Thread Mattmann, Chris A (388J)
Hey Guys,

Question. I see 2 releases that haven't been cut in JIRA:

0.8.2: 
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312064

0.7.3:

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10680&fixfor=12312176

I'm happy to cut 0.8.2 as part of the 1.1 effort, to get it out the door.
However, I have a question: is this Nutch 0.8.2 in SVN?

http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8/

Nutch 0.7.3 has no issues associated with it, so should I remove it? It's
been a few years since it was created it seems and I don't think it's got
active maintenance, or a user base.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




[jira] Created: (NUTCH-807) JSParseFilter produces weired URL

2010-04-02 Thread Minyao Zhu (JIRA)
JSParseFilter produces weired URL
-

 Key: NUTCH-807
 URL: https://issues.apache.org/jira/browse/NUTCH-807
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0.0
 Environment: Redhat 2.6.18-128.1.6.el5PAE  i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu


This was found when crawling the site http://zhidao.baidu.com/ (a Chinese
language site).

It appears this page contains JavaScript that confused JSParseFilter, which
produced a URL like this:

http://zhidao.baidu.com/){if(A===46){baidu.hide(

I am not sure of the impact/scope of this issue in general. The observation for
this specific site is that far fewer pages got crawled.
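
As an illustration of the symptom only (this is not JSParseFilter's code, and
the class and method names below are hypothetical), a strict syntax check with
java.net.URI would reject a candidate like the one above, because characters
such as '{' are not legal in a URI. A minimal sketch, assuming only http and
https outlinks are of interest:

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper, not part of Nutch: screens outlink candidates by URI syntax. */
final class OutlinkCandidateCheck {

    /** Returns true only if the candidate parses as an absolute http(s) URI with a host. */
    static boolean isPlausibleUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Illegal characters such as '{' make the URI constructor throw,
            // so JavaScript fragments glued onto a base URL end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/"));
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/){if(A===46){baidu.hide("));
    }
}
```

Note that java.net.URL alone would not catch this case, since its constructor
does not validate the path characters.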

Thanks.




[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)
Evaluate ORM Frameworks which support non-relational column-oriented datastores 
and RDBMs 
--

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar


We have an ORM layer in the NutchBase branch, which uses the Avro Specific
Compiler to compile class definitions given in JSON. Before moving on with
this, we might benefit from evaluating other frameworks to see whether they
suit our needs.

We want at least the following capabilities:
- Using POJOs 
- Able to persist objects to at least HBase, Cassandra, and RDBMs 
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries 
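
To make the first two requirements concrete, here is a rough sketch of what
"POJOs plus a pluggable backend" could look like. Everything in it is
hypothetical: PageRecord, RecordStore and InMemoryRecordStore are names made up
for this illustration and do not come from NutchBase or from any of the
frameworks under evaluation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical POJO for one crawled page; the field names are illustrative only. */
final class PageRecord {
    private final String url;
    private final int status;

    PageRecord(String url, int status) {
        this.url = url;
        this.status = status;
    }

    String getUrl() { return url; }
    int getStatus() { return status; }
}

/** Hypothetical storage contract that an HBase, Cassandra or RDBMS backend could implement. */
interface RecordStore<K, V> {
    void put(K key, V value);
    V get(K key); // returns null when the key is absent
}

/** In-memory stand-in so the sketch runs without any datastore. */
final class InMemoryRecordStore<K, V> implements RecordStore<K, V> {
    private final Map<K, V> rows = new HashMap<>();

    @Override public void put(K key, V value) { rows.put(key, value); }
    @Override public V get(K key) { return rows.get(key); }
}
```

The point of the interface is only that crawler code would depend on
RecordStore, while the choice of backend stays a configuration matter.
Serialization for Hadoop jobs and native queries are not covered by this sketch.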




Any comments, suggestions for other frameworks are welcome.




Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Mattmann, Chris A (388J)
(apologies for the cross-post, but this impacts Nutch 1.1, so just wanted
folks to see it)

* +1 on extending the deadline until Monday, April 5th. Right now, we have 3
+1s, so technically we could still do the 72 hrs and still be OK, but I'm
fine with giving folks some more time to take a look
* Thanks to jzitting and gsingers for taking a look and voting so far
* Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch
1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That
OK, Nutchers?
* Thanks for comments on the CHANGES from gsingers, and the mention to
include the sha1 of the src archive from jzitting. Will do on both, going
forward. 
* +1 for having a direct link to tika-app on the website.
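
For reference, a SHA-1 checksum like the one requested above can be computed
with the JDK's MessageDigest class. This is a minimal sketch, not the release
tooling actually used; the class name Sha1Sketch is made up, and for a real
archive the bytes would come from the file (for example via
java.nio.file.Files.readAllBytes) rather than from a string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: hex-encoded SHA-1 digest of a byte array. */
final class Sha1Sketch {

    static String sha1Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                // Mask to an unsigned value so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```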

Cheers,
Chris




On 4/1/10 11:41 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:

 Hi,
 
 On Wed, Mar 31, 2010 at 10:01 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Please vote on releasing these packages as Apache Tika 0.7.
 
 +1 Thanks!
 
 Some minor notes:
 * It would be good to have also a SHA1 checksum for the release archive.
 * Perhaps we should start offering also the tika-app jar as a direct
 download from l.a.o/tika/download.html?
 
 The vote is open for the next 72 hours.
 
 It looks like people.apache.org is not accessible at the moment (I
 downloaded the release candidate yesterday), so it might be a good
 idea to extend the vote period over the Easter holidays.
 
 BR,
 
 Jukka Zitting
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




[jira] Created: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
Parse-metatags plugin
-

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser specify in parse-plugins.xml

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of a HTMLParserFilter which takes as 
parameter a list of metatag names with '*' as default value. The values are 
separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}
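
To illustrate how such a value could be interpreted (this is a sketch, not the
code in the attached patch), the snippet below splits the configured string on
';' and treats '*' as "accept every metatag". The class and method names are
hypothetical, and lower-casing the names is an assumption made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: interprets a ';'-separated metatags.names value. */
final class MetatagNamesSketch {

    /** Splits the configured value on ';'; a missing or blank value falls back to "*". */
    static Set<String> parseNames(String configured) {
        String value = (configured == null || configured.trim().isEmpty()) ? "*" : configured;
        Set<String> names = new LinkedHashSet<>();
        for (String part : value.split(";")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** A metatag is wanted if it is listed explicitly or the wildcard is present. */
    static boolean isWanted(Set<String> names, String metatag) {
        return names.contains("*") || names.contains(metatag.toLowerCase(Locale.ROOT));
    }
}
```

With the configuration above, parseNames would yield the two names description
and keywords, and isWanted would reject any other metatag.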

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com






[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch


 h2. Parse-metatags plugin
 *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
 [TIKA-379]).* 
 To use the legacy HTML parser specify in parse-plugins.xml
 {code:xml}
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 {code}
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows to include the fields above in the Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com




Re: [VOTE] Apache Tika 0.7 Release Candidate #1

2010-04-02 Thread Julien Nioche
Hi Chris,


 * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch
 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That
 OK, Nutchers?


Great. I'll definitely give 0.7 a try and make sure it works in Nutch.

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com




 On 4/1/10 11:41 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:

  Hi,
 
  On Wed, Mar 31, 2010 at 10:01 PM, Mattmann, Chris A (388J)
  chris.a.mattm...@jpl.nasa.gov wrote:
  Please vote on releasing these packages as Apache Tika 0.7.
 
  +1 Thanks!
 
  Some minor notes:
  * It would be good to have also a SHA1 checksum for the release archive.
  * Perhaps we should start offering also the tika-app jar as a direct
  download from l.a.o/tika/download.html?
 
  The vote is open for the next 72 hours.
 
  It looks like people.apache.org is not accessible at the moment (I
  downloaded the release candidate yesterday), so it might be a good
  idea to extend the vote period over the Easter holidays.
 
  BR,
 
  Jukka Zitting
 


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++





[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: (was: NUTCH-809.patch)

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche

 h2. Parse-metatags plugin
 *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
 [TIKA-379]).* 
 To use the legacy HTML parser specify in parse-plugins.xml
 {code:xml}
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 {code}
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows to include the fields above in the Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com




[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch


 h2. Parse-metatags plugin
 *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
 [TIKA-379]).* 
 To use the legacy HTML parser specify in parse-plugins.xml
 {code:xml}
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 {code}
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows to include the fields above in the Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com




[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Description: 
h2. Parse-metatags plugin

The parse-metatags plugin consists of a HTMLParserFilter which takes as 
parameter a list of metatag names with '*' as default value. The values are 
separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com



  was:
h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser specify in parse-plugins.xml

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of a HTMLParserFilter which takes as 
parameter a list of metatag names with '*' as default value. The values are 
separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify in nutch-site.xml

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows to include the fields above in the Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com




 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The MetaTagsQueryFilter allows to include the fields above in the Nutch 
 queries.
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com



