[jira] Updated: (NUTCH-162) country code jp is used instead of language code ja for Japanese

2010-05-10 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated NUTCH-162:


Attachment: anchors_ja.properties
cached_ja.properties
explain_ja.properties

We need some Japanese property files to make ja the default language 
selection (because of String language = 
ResourceBundle.getBundle("org.nutch.jsp.search", 
request.getLocale()).getLocale().getLanguage(); in search.jsp, for example).

I'll submit those property files.
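
As a rough illustration of why the files must be named with ja rather than jp, here is a minimal sketch in plain Java. It assumes only the base name org.nutch.jsp.search from the comment above; the class BundleLookupSketch is hypothetical and not part of Nutch. It lists the bundle names that ResourceBundle would consider for a Japanese request locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

// Hypothetical helper class for illustration; not part of Nutch.
public class BundleLookupSketch {

    // Returns the bundle names ResourceBundle would try, in order, for the locale.
    public static List<String> candidateBundleNames(String baseName, Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            names.add(control.toBundleName(baseName, candidate));
        }
        return names;
    }

    public static void main(String[] args) {
        // "ja" is the ISO 639 language code; "JP" is the ISO 3166 country code.
        System.out.println(Locale.JAPAN.getLanguage());
        System.out.println(Locale.JAPAN.getCountry());
        System.out.println(candidateBundleNames("org.nutch.jsp.search", Locale.JAPAN));
    }
}
```

A name ending in _jp never appears among the candidates, so a file such as search_jp.properties would not be picked up for a Japanese browser.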

 country code jp is used instead of language code ja for Japanese
 

 Key: NUTCH-162
 URL: https://issues.apache.org/jira/browse/NUTCH-162
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 0.7.1
 Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial
 Attachments: anchors_ja.properties, cached_ja.properties, 
 explain_ja.properties


 In the locale-switching link for Japanese, jp is used as the language code, but it 
 is an ISO country code.  The language code ja should be used.
 By the way, I don't think many users are familiar with the ISO language 
 codes.  A Canadian user may click on ca without knowing that ca stands for 
 Catalan, not Canadian English or French. Rather than listing the language 
 codes, listing the language names in the respective languages may be better. 
 (I say "may be" because the browser could show some language names as 
 corrupted text if the current font does not support that language; this is 
 a difficult problem.)
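
The suggestion of showing each language under its own name can be sketched with the standard Locale API. This is only an illustration under assumptions: the class LanguageNamesSketch and the list of codes are hypothetical, and the exact names returned depend on the locale data shipped with the JVM.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration; not part of Nutch.
public class LanguageNamesSketch {

    // Maps each ISO 639 language code to the language's name in that same language.
    public static Map<String, String> selfNames(String... languageCodes) {
        Map<String, String> names = new LinkedHashMap<>();
        for (String code : languageCodes) {
            Locale locale = Locale.forLanguageTag(code);
            names.put(code, locale.getDisplayLanguage(locale));
        }
        return names;
    }

    public static void main(String[] args) {
        // Whether the names render correctly still depends on the font, as noted above.
        System.out.println(selfNames("ja", "ca", "fr", "en"));
    }
}
```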

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-162) country code jp is used instead of language code ja for Japanese

2010-05-10 Thread Hiroaki Kawai (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hiroaki Kawai updated NUTCH-162:


Attachment: search_ja.properties
text_ja.properties

Please put these property files in src/web/locale/org/nutch/jsp/ .


 country code jp is used instead of language code ja for Japanese
 

 Key: NUTCH-162
 URL: https://issues.apache.org/jira/browse/NUTCH-162
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 0.7.1
 Environment: n/a
Reporter: KuroSaka TeruHiko
Priority: Trivial
 Attachments: anchors_ja.properties, cached_ja.properties, 
 explain_ja.properties, search_ja.properties, text_ja.properties





[jira] Work started: (NUTCH-816) Add zip target to build.xml

2010-05-08 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-816 started by Chris A. Mattmann.

 Add zip target to build.xml
 ---

 Key: NUTCH-816
 URL: https://issues.apache.org/jira/browse/NUTCH-816
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.0.0
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.1


 Just like we have an ant tar target (pun intended) we should have an ant zip 
 target. I'd like to have this ready for the release and future releases.




[jira] Resolved: (NUTCH-816) Add zip target to build.xml

2010-05-08 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-816.
-

Resolution: Fixed

- fixed in r942427




[jira] Commented: (NUTCH-811) Develop an ORM framework

2010-05-07 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865226#action_12865226
 ] 

Enis Soztutar commented on NUTCH-811:
-

Hi Piet,
The code for Gora will reside on GitHub for now, since Nutch and Gora are 
pretty orthogonal. But as stated before, Nutch is the first user of Gora, and 
Gora does not yet have a separate community, so I intend to always keep the 
Nutch community updated (via this issue and the nutch-dev mailing list), and I 
hope for feedback from the Nutch community.

Moreover, NutchBase has already been ported to Gora, so at some point 
Gora should be reviewed and accepted as a dependency for Nutch.

 Develop an ORM framework 
 -

 Key: NUTCH-811
 URL: https://issues.apache.org/jira/browse/NUTCH-811
 Project: Nutch
  Issue Type: New Feature
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 By Nutch-808, it is clear that we need an ORM layer on top of the datastore, 
 so that different backends can be used to store data. 
 This issue will track the development of the ORM layer. Initially full 
 support for HBase is planned, with RDBM, Hadoop MapFile and Cassandra support 
 scheduled for later. 




[jira] Commented: (NUTCH-811) Develop an ORM framework

2010-05-06 Thread Piet Schrijver (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864744#action_12864744
 ] 

Piet Schrijver commented on NUTCH-811:
--

Will development of Gora be tracked under this or any Nutch ticket?




[jira] Assigned: (NUTCH-817) parse-(html)does follow links of full html page, parse-(tika) does follow any links and stops at level 1

2010-05-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-817:
---

Assignee: Julien Nioche

 parse-(html)does follow links of full html page, parse-(tika) does follow any 
 links and stops at level 1
 

 Key: NUTCH-817
 URL: https://issues.apache.org/jira/browse/NUTCH-817
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1
 Environment: Suse linux 11.1, java version 1.6.0_13
Reporter: matthew a. grisius
Assignee: Julien Nioche
 Attachments: sample-javadoc.html


 submitted per Julien Nioche. I did not see where to attach a file so I pasted 
 it here. btw: Tika command line returns empty html body for this file.
 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" 
 "http://www.w3.org/TR/html4/frameset.dtd">
 <!--NewPage-->
 <HTML>
 <HEAD>
 <!-- Generated by javadoc on Fri Mar 28 17:23:42 EDT 2008-->
 <TITLE>
 Matrix Application Development Kit
 </TITLE>
 <SCRIPT type="text/javascript">
 targetPage = "" + window.location.search;
 if (targetPage != "" && targetPage != "undefined")
    targetPage = targetPage.substring(1);
 function loadFrames() {
 if (targetPage != "" && targetPage != "undefined")
  top.classFrame.location = top.targetPage;
 }
 </SCRIPT>
 <NOSCRIPT>
 </NOSCRIPT>
 </HEAD>
 <FRAMESET cols="20%,80%" title="" onLoad="top.loadFrames()">
 <FRAMESET rows="30%,70%" title="" onLoad="top.loadFrames()">
 <FRAME src="overview-frame.html" name="packageListFrame" title="All Packages">
 <FRAME src="allclasses-frame.html" name="packageFrame" title="All classes and 
 interfaces (except non-static nested types)">
 </FRAMESET>
 <FRAME src="overview-summary.html" name="classFrame" title="Package, class 
 and interface descriptions" scrolling="yes">
 <NOFRAMES>
 <H2>
 Frame Alert</H2>
 <P>
 This document is designed to be viewed using the frames feature. If you see 
 this message, you are using a non-frame-capable web client.
 <BR>
 Link to<A HREF="overview-summary.html">Non-frame version.</A>
 </NOFRAMES>
 </FRAMESET>
 </HTML>




[jira] Updated: (NUTCH-814) SegmentMerger bug

2010-04-27 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-814:


Attachment: merger.patch

Patch fixing the issue, and a unit test. I will commit this shortly.

 SegmentMerger bug
 -

 Key: NUTCH-814
 URL: https://issues.apache.org/jira/browse/NUTCH-814
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Dennis Kubes
Assignee: Andrzej Bialecki 
 Fix For: 1.1

 Attachments: merger.patch


 Dennis reported:
 {quote}
 In the SegmentMerger.java file, at about line 150, we have this:
     final SequenceFile.Reader reader =
       new SequenceFile.Reader(FileSystem.get(job), fSplit.getPath(), job);
 Then, at about line 166 in the record reader, we have this:
     boolean res = reader.next(key, w);
 If I am reading that right, that would mean that the map task would loop
 over all records for a given file and not just a given split.
 {quote}
 Right, this should instead use SequenceFileRecordReader, which already has the 
 logic to handle splits. Patch coming shortly - thanks for spotting this! This 
 could be the reason for the out-of-disk-space errors that many users reported.
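
The effect of ignoring split boundaries can be shown with a deliberately simplified model in plain Java. This is not Hadoop's API and not the attached patch: a record is modelled only by the byte offset at which it starts, and the class SplitReadSketch with its two methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model for illustration; not Hadoop code and not the actual patch.
public class SplitReadSketch {

    // Behaviour described in the report: a reader opened on the file returns
    // every record, regardless of which split the map task was given.
    public static List<Long> readWholeFile(List<Long> recordOffsets) {
        return new ArrayList<>(recordOffsets);
    }

    // Split-aware behaviour: keep only records that start inside [start, end).
    public static List<Long> readSplit(List<Long> recordOffsets, long start, long end) {
        List<Long> out = new ArrayList<>();
        for (long offset : recordOffsets) {
            if (offset >= start && offset < end) {
                out.add(offset);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> offsets = Arrays.asList(0L, 100L, 200L, 300L);
        // Two splits over the same file: [0, 200) and [200, 400).
        int buggyTotal = readWholeFile(offsets).size() + readWholeFile(offsets).size();
        int fixedTotal = readSplit(offsets, 0, 200).size() + readSplit(offsets, 200, 400).size();
        System.out.println("records processed without split handling: " + buggyTotal);
        System.out.println("records processed with split handling: " + fixedTotal);
    }
}
```

In this model every record is processed once per split when split boundaries are ignored, which is consistent with the suspicion that the bug inflated output and disk usage.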




[jira] Work stopped: (NUTCH-466) Flexible segment format

2010-04-27 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-466 stopped by Andrzej Bialecki .

 Flexible segment format
 ---

 Key: NUTCH-466
 URL: https://issues.apache.org/jira/browse/NUTCH-466
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: ParseFilters.java, segmentparts.patch


 In many situations it is necessary to store more data associated with pages 
 than is possible with the current segment format. Quite often this is 
 binary data. There are two common workarounds for this: one is to use 
 per-page metadata, either in Content or ParseData; the other is to use an 
 external independent database using page IDs as foreign keys.
 Currently segments can consist of the following predefined parts: content, 
 crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I 
 propose a third option, which is a natural extension of this existing segment 
 format, i.e. to introduce the ability to add arbitrarily named segment 
 parts, with the only requirement that they should be MapFile-s that store 
 Writable keys and values. Alternatively, we could define a 
 SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
 Existing segment API and searcher API (NutchBean, DistributedSearch 
 Client/Server) should be extended to handle such arbitrary parts.
 Example applications:
 * storing HTML previews of non-HTML pages, such as PDF, PS and Office 
 documents
 * storing pre-tokenized version of plain text for faster snippet generation
 * storing linguistically tagged text for sophisticated data mining
 * storing image thumbnails
 etc, etc ...
 I'm going to prepare a patchset shortly. Any comments and suggestions are 
 welcome.




[jira] Created: (NUTCH-816) Add zip target to build.xml

2010-04-27 Thread Chris A. Mattmann (JIRA)
Add zip target to build.xml
---

 Key: NUTCH-816
 URL: https://issues.apache.org/jira/browse/NUTCH-816
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.0.0
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.1


Just like we have an ant tar target (pun intended) we should have an ant zip 
target. I'd like to have this ready for the release and future releases.




[jira] Closed: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-26 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar closed NUTCH-808.
---

Resolution: Fixed

We have decided to go on with implementing an ORM layer as per the discussion 
on NUTCH-811. Closing this issue. 

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.




[jira] Commented: (NUTCH-710) Support for rel=canonical attribute

2010-04-21 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859286#action_12859286
 ] 

Julien Nioche commented on NUTCH-710:
-

As suggested previously, we could treat canonicals either as redirections or 
during deduplication. Neither is a satisfactory solution.

Redirection: we want to index the document if/when the target of the canonical 
is not available for indexing. We also want to follow the outlinks. 
Dedup: we could modify the *DeleteDuplicates code, but canonicals are more 
complex because we need to follow redirections.

We probably need a third approach: prefilter by going through the crawlDB and 
detecting URLs which have a canonical target already indexed or ready to be 
indexed. We need to follow up to X levels of redirection, e.g. doc A names 
doc B as its canonical representation, doc B redirects to doc C, etc. If the 
end of the redirection chain exists and is valid, then mark A as a duplicate 
of C (intermediate redirs will not get indexed anyway).

As we don't know whether it has been indexed yet, we would give it a special 
marker (e.g. status_duplicate) in the crawlDB. Then:
- if the indexer comes across such an entry: skip it
- make it so that *DeleteDuplicates can take a list of URLs with 
status_duplicate as an additional source of input, OR have a custom resource 
that deletes such entries in SOLR or Lucene indices

The implementation would be as follows:

Go through all redirections and generate all redirection chains, e.g.

A -> B
B -> C
D -> C

where C is an indexable document (i.e. it has been fetched and parsed; it may 
already have been indexed)

will yield

A -> C
B -> C
D -> C

but also

C -> C

Once we have all possible redirections: go through the crawlDB in search of 
canonicals. If the target of a canonical is the source of a valid alias (e.g. 
A -> B -> C -> D), mark it as 'status:duplicate'.

This design implies generating quite a few intermediate structures + scanning 
the whole crawlDB twice (once for the aliases, then for the canonicals) + 
rewriting the whole crawlDB to mark some of the entries as duplicates.

This would be much easier to do when we have Nutch2/HBase: we could simply 
follow the redirs from the initial URL having a canonical tag instead of 
generating these intermediate structures. We can then modify the entries one by 
one instead of regenerating the whole crawlDB.

WDYT?
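
To make the proposal concrete, here is a minimal in-memory sketch of the chain resolution, assuming the redirects, the canonical links and the set of indexable URLs are already available as plain maps and sets. The class CanonicalChainSketch, its method names and the maxHops parameter are hypothetical; a real implementation would work on the crawlDB rather than on in-memory collections.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// In-memory illustration only; names and data structures are assumptions.
public class CanonicalChainSketch {

    // Follows redirects starting at url, accepting a target only if it is
    // reached within maxHops redirects. Returns the indexable end of the
    // chain, or null if the chain is broken, too long or cyclic.
    public static String resolve(String url, Map<String, String> redirects,
                                 Set<String> indexable, int maxHops) {
        String current = url;
        for (int hops = 0; hops <= maxHops; hops++) {
            if (indexable.contains(current)) {
                return current;
            }
            current = redirects.get(current);
            if (current == null) {
                return null;
            }
        }
        return null;
    }

    // Maps each page carrying a canonical link to the indexable URL it duplicates.
    public static Map<String, String> findDuplicates(Map<String, String> canonicals,
                                                     Map<String, String> redirects,
                                                     Set<String> indexable,
                                                     int maxHops) {
        Map<String, String> duplicates = new HashMap<>();
        for (Map.Entry<String, String> entry : canonicals.entrySet()) {
            String target = resolve(entry.getValue(), redirects, indexable, maxHops);
            // A page whose canonical resolves to itself is not a duplicate.
            if (target != null && !target.equals(entry.getKey())) {
                duplicates.put(entry.getKey(), target);
            }
        }
        return duplicates;
    }
}
```

The hop limit plays the role of the "X levels of redirection" above and also guarantees termination on redirect loops.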



 Support for rel=canonical attribute
 -

 Key: NUTCH-710
 URL: https://issues.apache.org/jira/browse/NUTCH-710
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.1
Reporter: Frank McCown
Priority: Minor

 There is a new rel=canonical attribute which is
 now supported by Google, Yahoo, and Live:
 http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
 Adding support for this attribute value will potentially reduce the number of 
 URLs crawled and indexed and reduce duplicate page content.




[jira] Commented: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implme

2010-04-20 Thread Ilguiz Latypov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859116#action_12859116
 ] 

Ilguiz Latypov commented on NUTCH-427:
--

I hesitate to add the .zip file because (a) it hides the intention of the 
change and (b) other developers who might have already modified their copies 
would have difficulty merging my change.

I believe the GNU patch tool will apply my suggested change automatically, 
provided that one is in the right working directory and, possibly, passes 
the -pX option, where X is the number of leading directory names to strip 
from the paths in the patch.


 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implmentation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Attachments: protocol-smb-diff.txt, protocol-smb.zip, 
 protocol-smb.zip, protocol-smb.zip



[jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implment

2010-04-20 Thread Ilguiz Latypov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilguiz Latypov updated NUTCH-427:
-

Attachment: (was: protocol-smb.zip)

 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implmentation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Attachments: protocol-smb-diff.txt, protocol-smb.zip, protocol-smb.zip


 Title:    protocol-smb - Nutch protocol plugin for crawling Microsoft Windows 
           shares
 Author:   Armel T. Nene
 Update:   Vadim Bauer
 Email:    armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r 
           AT g m x . d e

 A.  Introduction
 The protocol-smb plugin allows you to crawl Microsoft Windows shares. It 
 implements the CIFS/SMB protocol, which is commonly used on Microsoft 
 operating systems. The plugin replicates the behaviour of protocol-file 
 over the CIFS/SMB protocol. It uses the JCifs library and supports all 
 the properties of that library.
 You can find more information on the following site: 
 http://jcifs.samba.org/
 The smb protocol syntax for crawling is as follows: smb://x (e.g. 
 smb://server/share).

 B.  Installation
 1) Binaries only: The protocol-smb files can be found in the ../plugins 
    directory.
      Copy the protocol-smb folder to the NUTCHHOME/build/plugins directory.
      Put the smb.properties file in the NUTCHHOME/conf directory.
      Configure the properties in the smb.properties file.
      Enable the plugin by updating the nutch-site.xml file found in the 
      NUTCHHOME/conf directory, e.g.
        <property>
          <name>plugin.includes</name>
          <value>protocol-smb| other plugins...</value>
          <description>
          </description>
        </property>
 2) Source code: The protocol-smb sources can be found in the ../src 
    directory.
      Always refer to the Nutch wiki for detailed instructions on building 
      Nutch. In short:
      Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin.
      Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
      Update the NUTCHHOME/default.properties file to include the plugin.
      Run ant to build.
      Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the 
      properties.
      Enable the plugin by updating the nutch-site.xml file.

 C.  Known Issues
 1) MalformedURLException: unknown protocol: smb
    The SMB URL protocol handler is not being successfully installed. 
    In short, the jCIFS jar must be loaded by the System class loader.
    Workaround:
    a) A short-term solution is to install the JCIFS jar library found in 
       the protocol-smb folder in JDKHOME/jre/lib/ext and (or) 
       JREHOME/lib/ext.
    b) After completing step a), if the exception is still thrown, set the 
       system property by passing the following argument to the JVM: 
       -Djava.protocol.handler.pkgs=jcifs
    c) You can also set the property in your code, for example if you 
       start crawling with org.apache.nutch.crawl.Crawl. Add the following 
       two lines; this has the same effect as b):
         public static void main(String args[]) throws Exception {
           System.setProperty("java.protocol.handler.pkgs", "jcifs");
           new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write");
           //and so on
    Also you can visit the FAQ page: 
    http://jcifs.samba.org/src/docs/faq.html
 2) FATAL smb.SMB - Could not read content of protocol: smb://xx
    This problem usually occurs if the following properties are not set 
    correctly in the smb.properties file:
    - username
    - password
    - domain
    Also refer to the following resources
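
Workaround c) overwrites any value that java.protocol.handler.pkgs already holds. A slightly more careful variant, sketched below, appends to the existing '|'-separated list instead. The class HandlerPkgsSketch and its register method are hypothetical and not part of the plugin; whether the JVM then finds a handler for smb URLs still depends on the jCIFS jar being visible to the system class loader, as described above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of protocol-smb or jCIFS.
public class HandlerPkgsSketch {

    static final String KEY = "java.protocol.handler.pkgs";

    // Adds pkg to the '|'-separated package list, keeping existing entries
    // and avoiding duplicates. Returns the resulting property value.
    public static String register(String pkg) {
        String current = System.getProperty(KEY);
        String updated;
        if (current == null || current.isEmpty()) {
            updated = pkg;
        } else if (Arrays.asList(current.split("\\|")).contains(pkg)) {
            updated = current;
        } else {
            updated = current + "|" + pkg;
        }
        System.setProperty(KEY, updated);
        return updated;
    }

    public static void main(String[] args) {
        // Must run before the first smb:// URL is constructed.
        System.out.println(register("jcifs"));
    }
}
```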

[jira] Updated: (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implment

2010-04-20 Thread Ilguiz Latypov (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilguiz Latypov updated NUTCH-427:
-

Attachment: protocol-smb-dist.zip

Applied my diff to simplify importing into the Subversion tree.  The build 
directory should not be imported, and the src/plugin/build.xml file should only 
add the new protocol-smb deploy and clean targets.

The previous author did not grant the license to ASF.


 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implmentation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Attachments: protocol-smb-diff.txt, protocol-smb-dist.zip, 
 protocol-smb.zip, protocol-smb.zip



[jira] Work started: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-812 started by Chris A. Mattmann.

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Chris A. Mattmann
Priority: Critical

 As reported by Phil Barnett on nutch-user:
 {quote}
 The Fix.
 In line 131 of Crawl.java
 Generate no longer returns segments like it used to. Now it returns segs.
 Line 131 needs to read
     if (segs == null)
 instead of the current
     if (segments == null)
 After that change and a recompile, crawl is working just fine.
 {quote}
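
The pattern behind the fix, checking the value actually returned by the generate step before using it, can be sketched as follows. This is not the code from Crawl.java: the class GenerateCheckSketch, the generate stand-in and its int parameter are hypothetical, and only the variable name segs is taken from the report.

```java
// Hypothetical stand-in for illustration; not the Nutch Generator API.
public class GenerateCheckSketch {

    // Returns null when there is nothing left to fetch, otherwise the new segments.
    public static String[] generate(int pendingUrls) {
        if (pendingUrls == 0) {
            return null;
        }
        return new String[] { "segment-" + pendingUrls };
    }

    // Runs one round per entry and stops at the first round that yields no
    // segments. Returns the number of rounds that produced segments.
    public static int crawl(int[] pendingPerRound) {
        int rounds = 0;
        for (int pending : pendingPerRound) {
            String[] segs = generate(pending);
            if (segs == null) {
                // Testing a different, stale variable here would let the null
                // through and fail on the next line with an NPE.
                break;
            }
            System.out.println("fetching " + segs[0]);
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(crawl(new int[] { 3, 2, 0, 5 }));
    }
}
```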




[jira] Assigned: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-812:
---

Assignee: Chris A. Mattmann

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Chris A. Mattmann
Priority: Critical

 As reported by Phil Barnett on nutch-user:
 {quote}
 The Fix.
 In line 131 of Crawl.java
 Generate no longer returns segments like it used to. Now it returns segs.
 line 131 needs to read
  if (segs == null)
  Instead of the current
  if (segments == null)
 After that change and a recompile, crawl is working just fine.
 {quote}




[jira] Resolved: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-812.
-

Fix Version/s: 1.1
   Resolution: Fixed

- fixed in r935453. Thanks, Phil and Andrzej!

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Chris A. Mattmann
Priority: Critical
 Fix For: 1.1


 As reported by Phil Barnett on nutch-user:
 {quote}
 The Fix.
 In line 131 of Crawl.java
 Generate no longer returns segments like it used to. Now it returns segs.
 line 131 needs to read
  if (segs == null)
  Instead of the current
  if (segments == null)
 After that change and a recompile, crawl is working just fine.
 {quote}




[jira] Updated: (NUTCH-813) Repetitive crawl 403 status page

2010-04-17 Thread Nguyen Manh Tien (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Manh Tien updated NUTCH-813:
---

Attachment: Patch

 Repetitive crawl 403 status page
 

 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
 Attachments: Patch


 When we crawl a page that returns a 403 status, it is crawled again 
 each day under the default schedule,
 even when we restrict retries with the parameter db.fetch.retry.max.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Created: (NUTCH-813) Repetitive crawl 403 status page

2010-04-17 Thread Nguyen Manh Tien (JIRA)
Repetitive crawl 403 status page


 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
 Attachments: Patch

When we crawl a page that returns a 403 status, it is crawled again 
each day under the default schedule,
even when we restrict retries with the parameter db.fetch.retry.max.






[jira] Updated: (NUTCH-813) Repetitive crawl 403 status page

2010-04-17 Thread Nguyen Manh Tien (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Manh Tien updated NUTCH-813:
---

Priority: Minor  (was: Major)

 Repetitive crawl 403 status page
 

 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
Priority: Minor
 Attachments: Patch


 When we crawl a page that returns a 403 status, it is crawled again 
 each day under the default schedule,
 even when we restrict retries with the parameter db.fetch.retry.max.





[jira] Updated: (NUTCH-812) Crawl.java incorrectly uses the Generator API resulting in NPE

2010-04-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-812:


Affects Version/s: 1.1
 Priority: Critical  (was: Major)

 Crawl.java incorrectly uses the Generator API resulting in NPE
 --

 Key: NUTCH-812
 URL: https://issues.apache.org/jira/browse/NUTCH-812
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Priority: Critical

 As reported by Phil Barnett on nutch-user:
 {quote}
 The Fix.
 In line 131 of Crawl.java
 Generate no longer returns segments like it used to. Now it returns segs.
 line 131 needs to read
  if (segs == null)
  Instead of the current
  if (segments == null)
 After that change and a recompile, crawl is working just fine.
 {quote}





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856349#action_12856349
 ] 

Julien Nioche commented on NUTCH-808:
-

Hi Enis,

{quote}
On the other hand, current implementation is ...
{quote}

What do you mean by current implementation? NutchBase?

My gut feeling would be to write a custom framework instead of relying on 
DataNucleus and use AVRO if possible. I really think that HBase support is 
urgently needed but am less convinced that we need MySQL in the very short 
term. 

I know that Cascading has various Tap/Sink implementations including JDBC, 
HBase and also SimpleDB. Maybe it would be worth having a look at how they do 
it?

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-13 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856360#action_12856360
 ] 

Enis Soztutar commented on NUTCH-808:
-

bq. What do you mean by current implementation? NutchBase?
Indeed. The package o.a.n.storage deals with ORM (though not all of its classes do).

bq. I know that Cascading has various Tap/Sink implementations including 
JDBC, HBase and also SimpleDB. Maybe it would be worth having a look at how 
they do it?
The way Cascading does this is to convert Tuples (the Cascading data structure) to 
HBase/JDBC records. The schema for HBase/JDBC is given as metadata. Since 
they only deal with mapping a tuple to a table row, it is not that difficult. But again, 
Cascading does not allow for mapping lists to columns, etc.

bq. My gut feeling would be to write a custom framework instead of relying on 
DataNucleus and use AVRO if possible. I really think that HBase support is 
urgently needed but am less convinced that we need MySQL in the very short 
term. 
Yeah, the more I think about it, the more I come to terms with a custom 
implementation. However, I think we might benefit a lot from the ideas in JDO 
in the long term. Also, a JDBC implementation may not be relevant for large-scale 
deployments, but it will be a very nice side effect of the ORM layer, which 
will allow easy deployment, which in turn will hopefully bring more users.

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.





[jira] Resolved: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-12 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved NUTCH-570.


Resolution: Won't Fix

 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
 had set with regular Fetcher.java this was at least 3 fold more time.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java and just add a flag in nutch-default.xml 
 determining if the user wants to use it.
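
The ordering goal described above (URLs from the same host kept far apart in the list) can be illustrated with a small in-memory sketch. This is not the attached map/reduce implementation: it swaps in a simple round-robin interleave over per-host queues, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-machine stand-in for the idea, not the patch.
final class HostInterleaveSketch {

    // Groups URLs by host, then emits one URL per host per round so that
    // URLs from the same host are spread through the list.
    static List<String> interleaveByHost(List<String> urls) {
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : byHost.values()) {
                String next = queue.poll();   // null once a host is exhausted
                if (next != null) {
                    ordered.add(next);
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        System.out.println(interleaveByHost(Arrays.asList(
                "http://a.example/1", "http://a.example/2",
                "http://b.example/1", "http://c.example/1")));
    }
}
```

Round-robin is the simplest way to separate same-host URLs; when one host dominates the list, its remaining URLs still end up adjacent at the tail, which is the case a more careful ordering would have to handle.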





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-12 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12856124#action_12856124
 ] 

Enis Soztutar commented on NUTCH-808:
-

So, these are the results so far:

DataNucleus was previously known as JPOX, and it was the reference 
implementation for Java Data Objects (JDO). JDO is a Java standard for 
persistence. A similar specification, named JPA, is also a persistence standard, 
which is forked from EJB 3. However, JPA is designed for RDBMs only, so it will 
not be useful for us 
(http://www.datanucleus.org/products/accessplatform/persistence_api.html). 

In JDO, the first step is to define the domain objects as POJOs. Then, the 
persistence metadata is specified using annotations, XML, or both. Then a 
bytecode enhancer uses instrumentation to add the required methods to the classes 
marked as @PersistenceCapable. The database tables can be generated by hand, 
automatically by DataNucleus, or by using a tool (SchemaTool). 
The persistence layer uses standard JDO syntax, which is similar to JDBC. The 
objects can be queried using JPQL. 

I have run a small test to persist objects of WebTableRow class (from NutchBase 
branch) to both MySQL and HBase. Although it took me a fair bit of time to 
set-up, I was able to persist objects to both. 

However, although it is possible to map complex fields (like lists, maps, 
arrays, etc) to RDBMs using different strategies (such as serializing directly, 
using joins, or using foreign keys), I was not able to find a way to leverage 
the HBase data model. For example, we want to be able to map lists and maps to 
columns in column families. Without such functionality, using column-oriented 
stores does not bring any advantage. 
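
To make the desired mapping concrete: each entry of a map-valued field would become its own column inside a column family. The sketch below only illustrates that flattening with plain Java collections. It does not use the HBase client API, and the family name "ol" and the outlinks field are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows the shape of the mapping, not a datastore.
final class ColumnFamilyMappingSketch {

    // Flattens a map field into "family:qualifier" -> value cells,
    // producing one cell per map entry.
    static Map<String, String> toCells(String family, Map<String, String> field) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : field.entrySet()) {
            cells.put(family + ":" + entry.getKey(), entry.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> outlinks = new LinkedHashMap<>();
        outlinks.put("http://example.org/a", "anchor A");
        outlinks.put("http://example.org/b", "anchor B");
        System.out.println(toCells("ol", outlinks));
    }
}
```

With this layout a single outlink can be read or updated without deserializing the whole map, which is the advantage a column-oriented store is expected to bring.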

For the byte[] serialization for MapReduce, we can either implement a new 
datastore for DataNucleus, which also implements Hadoop's Serialization, or use 
Avro to generate Java classes to be fed into the JPOX enhancer, or else manually 
implement Writable. 
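
The last option, implementing Writable by hand, amounts to writing each field to a DataOutput and reading it back in the same order. The sketch below shows that pattern with java.io only. It mirrors the shape of Hadoop's Writable contract (write/readFields) without depending on Hadoop, and the PageRow class and its fields are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical row class; the two methods follow the Writable pattern.
final class PageRow {
    String url = "";
    long fetchTime;

    // Fields are written in a fixed order...
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(fetchTime);
    }

    // ...and must be read back in exactly the same order.
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        fetchTime = in.readLong();
    }

    static byte[] toBytes(PageRow row) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            row.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static PageRow fromBytes(byte[] bytes) {
        try {
            PageRow row = new PageRow();
            row.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return row;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost of this approach is that every new field has to be added to both methods by hand, which is the maintenance burden the generated-class options avoid.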

To sum up, DataNucleus brings the following advantages:
- out of the box RDBMs support 
- XML or annotation metadata
- JDO is a Java standard 
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex; implementing a datastore is not trivial
- We need to write patches to DataNucleus to flexibly map complex fields to 
leverage HBase's data model
- We have no control over the source code
- no native HBase support (for example using filters, etc)

On the other hand, current implementation is 
- tested on production, 
- can leverage HBase data model, 
- can be modified to work with Avro serialization directly, 
- cassandra support could be added with little effort
- can support multiple languages (in the future)

I believe that having SQLite, MySQL and HBase support is critical for Nutch 
2.0, for out-of-the-box use, ease of deployment and real-scale computing 
respectively. But obviously we cannot use DataNucleus out of the box either. 


ORM is inherently a hard problem. I propose we go ahead and make the changes to 
DataNucleus to see if it is feasible, and continue with it if it suits our 
needs. Of course, having a custom framework will also be great, so any feedback 
would be more than welcome. 

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.





[jira] Updated: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-808:


Fix Version/s: 2.0

 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 2.0


 We have an ORM layer in the NutchBase branch, which uses Avro Specific 
 Compiler to compile class definitions given in JSON. Before moving on with 
 this, we might benefit from evaluating other frameworks, whether they suit 
 our needs. 
 We want at least the following capabilities:
 - Using POJOs 
 - Able to persist objects to at least HBase, Cassandra, and RDBMs 
 - Able to efficiently serialize objects as task outputs from Hadoop jobs
 - Allow native queries, along with standard queries 
 Any comments, suggestions for other frameworks are welcome.




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-07 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854665#action_12854665
 ] 

Otis Gospodnetic commented on NUTCH-570:


I'm tempted to close this issue as Won't Fix, because:
* I have no way to test and verify this
* nobody seems to be using this
* this issue has only 2 votes and only 3 watchers
* the original reporter mentioned he noticed only marginal speedups


 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
 had set with regular Fetcher.java this was at least 3 fold more time.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java and just add a flag in nutch-default.xml 
 determining if the user wants to use it.




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-04-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854767#action_12854767
 ] 

Chris A. Mattmann commented on NUTCH-570:
-

Hi Otis:

I think your logic is perfectly rational here. Maybe you could leave it open for 
another 48 hrs, and then close it out if you don't get any feedback from the 
original reporter, or those that were interested.

Cheers,
Chris


 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  In comparison with old benchmarks I 
 had set with regular Fetcher.java this was at least 3 fold more time.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java and just add a flag in nutch-default.xml 
 determining if the user wants to use it.




[jira] Created: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)
Upgrade to Tika 0.7
---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


Upgrading to Tika 0.7 before 1.1 release

The TikaConfig mechanism has changed and does not rely on a default XML config 
file anymore. Am working on it.




[jira] Updated: (NUTCH-789) Improvements to Tika parser

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-789:


  Component/s: (was: fetcher)
   parser
Fix Version/s: (was: 1.1)

Have created a separate issue for the upgrade of Tika 0.7 and moved this one 
out of 1.1

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: parser
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Closed: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-810.
---

Resolution: Fixed

Committed in rev 931098.

http://issues.apache.org/jira/browse/TIKA-317 changed the way the TikaConfig is 
created, as it does not rely on a tika-config.xml file any longer. Our custom 
TikaConfig has been modified to reflect these changes.

This was the last remaining issue marked for 1.1 



 Upgrade to Tika 0.7
 ---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 Upgrading to Tika 0.7 before 1.1 release
 The TikaConfig mechanism has changed and does not rely on a default XML 
 config file anymore. Am working on it.




[jira] Commented: (NUTCH-810) Upgrade to Tika 0.7

2010-04-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854332#action_12854332
 ] 

Hudson commented on NUTCH-810:
--

Integrated in Nutch-trunk #1116 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1116/])
 Upgraded to Tika 0.7


 Upgrade to Tika 0.7
 ---

 Key: NUTCH-810
 URL: https://issues.apache.org/jira/browse/NUTCH-810
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 Upgrading to Tika 0.7 before 1.1 release
 The TikaConfig mechanism has changed and does not rely on a default XML 
 config file anymore. Am working on it.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853251#action_12853251
 ] 

Julien Nioche commented on NUTCH-789:
-

Will upgrade as soon as 0.7 is available from 
http://repo1.maven.org/maven2/org/apache/tika/ - which is not the case yet.
I will leave this issue open but unmark it as 1.1

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-04 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853285#action_12853285
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Hey Julien, Tika 0.7 is available from Maven central:

http://repo1.maven.org/maven2/org/apache/tika/tika-parsers/

Cheers,
Chris


 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Updated: (NUTCH-807) JSParseFilter produces malformed URL

2010-04-03 Thread Minyao Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Minyao Zhu updated NUTCH-807:
-

Summary: JSParseFilter produces malformed URL  (was: JSParseFilter produces 
weired URL)

 JSParseFilter produces malformed URL
 

 Key: NUTCH-807
 URL: https://issues.apache.org/jira/browse/NUTCH-807
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0.0
 Environment: Redhat 2.6.18-128.1.6.el5PAE  i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu

 This was found when crawling the site http://zhidao.baidu.com/ (a Chinese 
 language site).
 It appears this page contains JavaScript that confused JSParseFilter, which 
 produced a URL like this:
 http://zhidao.baidu.com/){if(A===46){baidu.hide(
 I am not sure of the impact/scope of this issue in general.  The observation for this 
 specific site is that far fewer pages got crawled.
 Thanks.
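
One way to catch strings like the one above before they are treated as outlinks is to run each candidate through a strict URI parser. The following is a hedged sketch and not the JSParseFilter code: isPlausibleUrl is a hypothetical helper, and it relies on java.net.URI rejecting characters such as '{' that are not legal in a URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical post-filter for link candidates extracted from JavaScript.
final class JsLinkCheckSketch {

    // Accepts only candidates that parse as an absolute URI with a host.
    static boolean isPlausibleUrl(String candidate) {
        try {
            URI uri = new URI(candidate);
            return uri.isAbsolute() && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // e.g. an illegal character such as '{' in the path
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/"));
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/){if(A===46){baidu.hide("));
    }
}
```

A check like this only filters out syntactically invalid results. It would not stop a code fragment that happens to contain legal URI characters only, so it is a mitigation rather than a fix for the extraction itself.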




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-04-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853212#action_12853212
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Hey Julien -- okey dok, Tika 0.7 has been released. Feel free to upgrade, and 
close this one out...after that, I'll cut the Nutch 1.1 RC.

Thanks!

Cheers,
Chris


 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Created: (NUTCH-807) JSParseFilter produces weired URL

2010-04-02 Thread Minyao Zhu (JIRA)
JSParseFilter produces weired URL
-

 Key: NUTCH-807
 URL: https://issues.apache.org/jira/browse/NUTCH-807
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.0.0
 Environment: Redhat 2.6.18-128.1.6.el5PAE  i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu


This was found when crawling the site http://zhidao.baidu.com/ (a Chinese 
language site).

It appears this page contains JavaScript that confused JSParseFilter, which 
produced a URL like this:

http://zhidao.baidu.com/){if(A===46){baidu.hide(

I am not sure of the impact/scope of this issue in general.  The observation for this 
specific site is that far fewer pages got crawled.

Thanks.




[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)
Evaluate ORM Frameworks which support non-relational column-oriented datastores 
and RDBMs 
--

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar


We have an ORM layer in the NutchBase branch, which uses Avro Specific Compiler 
to compile class definitions given in JSON. Before moving on with this, we 
might benefit from evaluating other frameworks, whether they suit our needs. 

We want at least the following capabilities:
- Using POJOs 
- Able to persist objects to at least HBase, Cassandra, and RDBMs 
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries 




Any comments, suggestions for other frameworks are welcome.




[jira] Created: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)
Parse-metatags plugin
-

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify the following in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as 
parameter a list of metatag names, with '*' as the default value. The names 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify the following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.
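
As a rough illustration of the metatags.names format described above (a ';'-separated list, with '*' meaning every metatag), the value could be interpreted along these lines. This is only a sketch: `MetatagNames`, `parse` and `wants` are hypothetical names rather than the plugin's actual code, and the case-insensitive matching is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not the plugin's actual code.
class MetatagNames {
    /** Splits a metatags.names value such as "description;keywords". */
    static Set<String> parse(String value) {
        Set<String> names = new LinkedHashSet<String>();
        if (value == null || value.trim().isEmpty()) {
            names.add("*"); // '*' is the documented default
            return names;
        }
        for (String name : value.split(";")) {
            // Assumption: names are compared case-insensitively.
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** True if the given metatag should be extracted. */
    static boolean wants(Set<String> names, String metatag) {
        return names.contains("*")
            || names.contains(metatag.toLowerCase(Locale.ROOT));
    }
}
```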

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com






[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch





[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: (was: NUTCH-809.patch)

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche




[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch





[jira] Updated: (NUTCH-809) Parse-metatags plugin

2010-04-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-809:


Description: 
h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter which takes as 
parameter a list of metatag names, with '*' as the default value. The names 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify the following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com



  was:
h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see 
[TIKA-379]).* 

To use the legacy HTML parser, specify the following in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter which takes as 
parameter a list of metatag names, with '*' as the default value. The names 
are separated by ';'.

In order to extract the values of the metatags description and keywords, you 
must specify the following in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 
'keywords' and 'description'. Note that keywords is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community 
by ANT.com




 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-809.patch





[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

2010-04-02 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840
 ] 

Enis Soztutar commented on NUTCH-808:
-

A candidate framework is DataNucleus. It has the following benefits: 

- Apache 2 license 
- JDO support 
- HBase, RDBMS, and XML persistence 

I will further investigate whether we can integrate Hadoop Writables/Avro 
serialization so that objects can be passed from MapReduce jobs. 


 Evaluate ORM Frameworks which support non-relational column-oriented 
 datastores and RDBMs 
 --

 Key: NUTCH-808
 URL: https://issues.apache.org/jira/browse/NUTCH-808
 Project: Nutch
  Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar




[jira] Updated: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-706:


Fix Version/s: (was: 1.1)

Both variants of the substitution rule above break existing tests. More work 
will be needed to get a pattern which covers the case described by Meghna *and* 
is compatible with the existing test cases.
Moving it to post-1.1

 Url regex normalizer
 

 Key: NUTCH-706
 URL: https://issues.apache.org/jira/browse/NUTCH-706
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor

 Hey,
 I encountered the following problem while trying to crawl a site using
 nutch-trunk. In the file regex-normalize.xml, the following regex is
 used to remove session ids:
 <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>.
 This pattern also transforms a url, such as,
 newsId=2000484784794&newsLang=en into new&newsLang=en (since it
 matches 'sId' in the 'newsId'), which is incorrect and hence does not
 get fetched. This expression needs to be changed to prevent this.
 Thanks,
 Meghna
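
The behaviour reported above can be reproduced outside Nutch with a few lines of plain Java. This is a sketch under two assumptions: the XML entity in the rule decodes to a literal '&', and the substitution is "$4" (only the pattern is quoted in the report). `SessionIdRuleDemo` is a made-up class name.

```java
import java.util.regex.Pattern;

class SessionIdRuleDemo {
    // The rule as quoted in the issue, with the XML entity decoded to '&'.
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    static String normalize(String url) {
        // Assumed substitution: "$4" keeps only the delimiter that ended
        // the match and drops the session-id part in front of it.
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Intended case: a real session id is removed.
        System.out.println(normalize("page.jsp;jsessionid=ABC123?x=1"));
        // prints "page.jsp?x=1"

        // Reported case: "sId" inside "newsId" matches case-insensitively.
        System.out.println(normalize("newsId=2000484784794&newsLang=en"));
        // prints "new&newsLang=en"
    }
}
```

The second call shows the problem: nothing in the pattern requires the key to start at a parameter boundary, so any parameter name ending in "sid" is affected.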




[jira] Commented: (NUTCH-706) Url regex normalizer

2010-03-31 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851923#action_12851923
 ] 

Ken Krugler commented on NUTCH-706:
---

Two comments about this:

1. From my experience with Nutch & Bixo, I think that URL normalization 
ultimately needs to be more structured, i.e. first break the URL into pieces, 
then apply rules against the pieces. Trying to craft regular expressions to 
handle target cases leads to big, hairy, hard-to-understand strings.

2. URL normalization is something that makes a lot of sense for 
crawler-commons. If somebody from the Nutch side wants to define a target API, 
I could look at porting existing Bixo code to crawler-commons.
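
To make the "break the URL into pieces first" idea from point 1 concrete, here is a minimal sketch that splits off the fragment and query string and then compares whole parameter names instead of running a regex over the entire URL. `QueryParamNormalizer` and the list in `SESSION_KEYS` are assumptions for illustration only; this is neither Bixo code nor a crawler-commons API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class QueryParamNormalizer {
    // Assumed list of session-id parameter names, for illustration only.
    static final Set<String> SESSION_KEYS = new HashSet<String>(
        Arrays.asList("sid", "phpsessid", "sessionid", "jsessionid"));

    static String stripSessionParams(String url) {
        // Piece 1: the fragment, if any.
        int hash = url.indexOf('#');
        String fragment = hash < 0 ? "" : url.substring(hash);
        String main = hash < 0 ? url : url.substring(0, hash);

        // Piece 2: the query string, if any.
        int q = main.indexOf('?');
        if (q < 0) {
            return url;
        }
        String base = main.substring(0, q);

        // Piece 3: individual parameters, matched on the whole name.
        List<String> kept = new ArrayList<String>();
        for (String pair : main.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = (eq < 0 ? pair : pair.substring(0, eq))
                .toLowerCase(Locale.ROOT);
            if (!SESSION_KEYS.contains(name)) {
                kept.add(pair);
            }
        }
        String query = String.join("&", kept);
        return base + (query.isEmpty() ? "" : "?" + query) + fragment;
    }
}
```

Because the comparison is against the complete parameter name, a parameter such as newsId is left alone while sid or PHPSESSID is dropped. Path parameters like ";jsessionid=..." are not covered by this sketch.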


 Url regex normalizer
 

 Key: NUTCH-706
 URL: https://issues.apache.org/jira/browse/NUTCH-706
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Meghna Kukreja
Priority: Minor




[jira] Updated: (NUTCH-249) black- white list url filtering

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-249:


Fix Version/s: (was: 1.1)

- push out per http://bit.ly/c7tBv9

 black- white list url filtering
 ---

 Key: NUTCH-249
 URL: https://issues.apache.org/jira/browse/NUTCH-249
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assignee: Dennis Kubes
Priority: Trivial
 Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch, bw.patch


 Existing url filter mechanisms need to process each url against each filter 
 pattern. For very large filter sets this may not scale very well.
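
One generic way to avoid testing every url against every pattern is to index the patterns by host, so that a url is only checked against the rules registered for its own host. The sketch below illustrates that idea only; `HostIndexedFilter` is a hypothetical class and is not necessarily what the attached patches implement.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: deny patterns grouped by host.
class HostIndexedFilter {
    private final Map<String, List<Pattern>> denyByHost =
        new HashMap<String, List<Pattern>>();

    void deny(String host, String regex) {
        List<Pattern> list = denyByHost.get(host);
        if (list == null) {
            list = new ArrayList<Pattern>();
            denyByHost.put(host, list);
        }
        list.add(Pattern.compile(regex));
    }

    boolean accepts(String url) {
        // One hash lookup selects the (usually short) list for this host.
        String host = URI.create(url).getHost();
        List<Pattern> patterns = denyByHost.get(host);
        if (patterns == null) {
            return true;
        }
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return false;
            }
        }
        return true;
    }
}
```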




[jira] Updated: (NUTCH-309) Uses commons logging Code Guards

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-309:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Uses commons logging Code Guards
 

 Key: NUTCH-309
 URL: https://issues.apache.org/jira/browse/NUTCH-309
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor

 Code guards are typically used to guard code that only needs to execute in 
 support of logging, that otherwise introduces undesirable runtime overhead in 
 the general case (logging disabled). Examples are multiple parameters, or 
 expressions (e.g. string + " more") for parameters. Use the guard methods of 
 the form log.is<Priority>() to verify that logging should be performed, 
 before incurring the overhead of the logging method call. Yes, the logging 
 methods will perform the same check, but only after resolving parameters.
 (description extracted from 
 http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)
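
A small runnable illustration of the guard idea follows. To stay self-contained it uses java.util.logging, where the guard is `isLoggable(Level.FINE)`; with Commons Logging the corresponding guard would be `log.isDebugEnabled()`. `CodeGuardDemo` and its counter are hypothetical and exist only to show when the message is actually built.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class CodeGuardDemo {
    static final Logger LOG = Logger.getLogger("CodeGuardDemo");

    // Counts how often the (supposedly expensive) message is built.
    static int describeCalls = 0;

    static String describe(int pages) {
        describeCalls++;
        return "fetched " + pages + " pages";
    }

    static void unguarded(int pages) {
        // The argument is built before fine() can check the level.
        LOG.fine("status: " + describe(pages));
    }

    static void guarded(int pages) {
        // The guard skips building the message when FINE is disabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("status: " + describe(pages));
        }
    }
}
```

With the logger at INFO, `unguarded` still pays for `describe` while `guarded` does not, which is exactly the overhead the guide describes.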




[jira] Updated: (NUTCH-763) Separate configuration files from resources to be included in the job file

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-763:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Separate configuration files from resources to be included in the job file
 --

 Key: NUTCH-763
 URL: https://issues.apache.org/jira/browse/NUTCH-763
 Project: Nutch
  Issue Type: Wish
Reporter: Julien Nioche
Priority: Minor

 One of the things I found confusing when I was learning Nutch was the fact 
 that the conf/ directory contains at the same time: 
 - configuration files for Hadoop / Nutch which are put in the jar files but 
 not used there
 - resource files (e.g. filtering rules) which MUST be up to date in the job 
 file
 I would separate the conf/ directory from, say, a resources/ directory which 
 would contain the rule files and other things to put in the job file. Unless 
 I am mistaken, none of the configuration files need to be in the job file. I 
 know it is a very minor point, but that would probably simplify things and 
 make it easier for beginners to understand what has to be modified where. 




[jira] Updated: (NUTCH-577) Use explicit tika-config.xml file to enable mime magic detection to be turned on and off

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-577:


 Due Date: 30/Nov/07  (was: 30/Nov/07)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Use explicit tika-config.xml file to enable mime magic detection to be turned 
 on and off
 

 Key: NUTCH-577
 URL: https://issues.apache.org/jira/browse/NUTCH-577
 Project: Nutch
  Issue Type: Improvement
  Components: mime_type_detector
Affects Versions: 1.0.0
 Environment: Mac Book Pro Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS 
 X 10.4, although the improvement is independent of the environment.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor

 Currently, there is a configuration file for Tika (which the trunk in Nutch 
 uses for its mime type detection) called tika-config.xml left unexposed (a 
 default one lives in the tika-0.1-dev.jar file). Tika's mime system has two 
 config files it relies on: tika-mimetypes.xml (which Nutch has its own 
 version of, that overrides the version that comes with the tika jar file), 
 and tika-config.xml (to turn on or off magic char detection). We should 
 probably have a nutch version of tika-config.xml, so that Nutch users can 
 employ magic char mime detection. I'll get going on this in the next day or 
 so.




[jira] Updated: (NUTCH-310) Review Log Levels

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-310:


Fix Version/s: (was: 1.1)
 Assignee: Chris A. Mattmann  (was: Jerome Charron)

- pushing this out per http://bit.ly/c7tBv9 (and assign to me, I think this can 
be closed but will wait until after 1.1 to revisit)

 Review Log Levels
 -

 Key: NUTCH-310
 URL: https://issues.apache.org/jira/browse/NUTCH-310
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Jerome Charron
Assignee: Chris A. Mattmann
Priority: Minor

 Review of log content and log levels (see Commons Logging Best Practices: 
 http://jakarta.apache.org/commons/logging/guide.html#Message_Priorities_Levels)




[jira] Updated: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-673:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor

 Version 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree and upgrading it to the 
 latest version before 1.0-release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is that JDK 1.5 must be used, but this is 
 also now required for Hadoop 0.19, so this wouldn't be the only reason for 
 the switch.




[jira] Updated: (NUTCH-664) Possibility to update already stored documents.

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-664:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Possibility to update already stored documents.
 ---

 Key: NUTCH-664
 URL: https://issues.apache.org/jira/browse/NUTCH-664
 Project: Nutch
  Issue Type: Wish
Reporter: Sergey Khilkov
Priority: Minor

 We have a huge index of stored documents. Fetching a page and merging 
 indexes every time some information about the page changes is a costly 
 procedure. The information can change 1-3 times per day. At the moment we 
 have to store the changed info in a database, but in that case we have lots 
 of problems with sorting, search restrictions and so on. Lucene itself allows 
 deleting a single document and adding a new one to an existing index. But 
 there is a problem with Hadoop: as I understand it, the Hadoop filesystem 
 cannot write at random positions. Still, it would be a great feature if 
 Nutch were able to update an existing index.




[jira] Updated: (NUTCH-750) HtmlParser plugin - page title extraction

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-750:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 HtmlParser plugin - page title extraction
 -

 Key: NUTCH-750
 URL: https://issues.apache.org/jira/browse/NUTCH-750
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Alexey Torochkov
Priority: Minor
 Attachments: SkipBody.patch


 A small improvement that tries to extract the title tag from the body if it 
 doesn't exist in the head.
 In the current version, DOMContentUtils just skips everything after body in 
 the getTitle() method.
 The attached patch allows this behavior to be changed (by default it doesn't 
 change anything) and can cope with webmasters' mistakes.




[jira] Updated: (NUTCH-564) External parser supports encoding attribute

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-564:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 External parser supports encoding attribute
 ---

 Key: NUTCH-564
 URL: https://issues.apache.org/jira/browse/NUTCH-564
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Antony Bowesman
Priority: Minor
 Attachments: ExtParser_0.9.0.patch, ExtParser_1.0.0.patch


 When an external component generates text, which is returned to the external 
 parser, the parser always converts the text using the default character set 
 (os.toString()).  For example, the returned text may be UTF-8, but it will 
 then not be converted to a String correctly.
 I added an "encoding" attribute to the implementation XML in plugin.xml, 
 and this is then used to convert the text.
 I have tested my original fix on my local 0.9 and include a patch, but have 
 also made an untested patch for trunk.
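
The difference between the two conversions can be shown with a `ByteArrayOutputStream` alone. This is a sketch, not the attached patch: `ExternalOutputDecoder` is a hypothetical name, and the `encoding` argument stands in for the value of the new attribute.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

class ExternalOutputDecoder {
    /**
     * Decodes the bytes collected from an external command.
     * A null encoding falls back to the platform default charset,
     * which is what a plain os.toString() does.
     */
    static String decode(ByteArrayOutputStream os, String encoding) {
        try {
            return encoding == null ? os.toString() : os.toString(encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(
                "unknown encoding: " + encoding, e);
        }
    }
}
```

Passing "UTF-8" explicitly gives the same result regardless of the platform default, whereas the null branch depends on how the JVM was started.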




[jira] Updated: (NUTCH-477) Extend URLFilters to support different filtering chains

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-477:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Minor
 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
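
To sketch what an int return code buys over the current accept/reject result, here is a hypothetical chain that stops as soon as one filter makes a final decision. All names (`FilterChainSketch`, `CodedFilter`, `PASS`, `ACCEPT`, `REJECT`) are made up for illustration and are not taken from urlfilters.patch.

```java
import java.util.List;

// Hypothetical names throughout; this is not the attached patch.
class FilterChainSketch {
    static final int PASS = 0;    // no decision, ask the next filter
    static final int ACCEPT = 1;  // accept the URL and stop
    static final int REJECT = 2;  // reject the URL and stop

    interface CodedFilter {
        int filter(String url);
    }

    /** Runs the chain until a filter makes a final decision. */
    static boolean accepts(List<CodedFilter> chain, String url) {
        for (CodedFilter f : chain) {
            int code = f.filter(url);
            if (code == ACCEPT) {
                return true;
            }
            if (code == REJECT) {
                return false;
            }
        }
        return true; // assumption: undecided URLs are accepted
    }
}
```

With codes like these, a cheap filter placed early in the chain can end processing before more expensive filters run.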




[jira] Updated: (NUTCH-251) Administration GUI

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-251:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9 (comment from me: would be nice to 
get this into 1.2)

 Administration GUI
 --

 Key: NUTCH-251
 URL: https://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Minor
 Attachments: hadoop_nutch_gui_v1.patch, Nutch-251-AdminGUI.tar.gz, 
 nutch_gui_plugins_v1.zip, nutch_gui_v1.patch


 Having a web-based administration interface would help to make Nutch 
 administration and management much more user-friendly.




[jira] Updated: (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-609:


 Due Date: 13/Feb/08  (was: 13/Feb/08)
   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Allow Plugins to be Loaded from Jar File(s)
 ---

 Key: NUTCH-609
 URL: https://issues.apache.org/jira/browse/NUTCH-609
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Attachments: NUTCH-609-1-20080212.patch


 Currently plugins cannot be loaded from a jar file.  Plugins must be unzipped 
 in one or more directories specified by the plugin.folders config.  I have 
 been thinking about an extension to PluginRepository or PluginManifestParser 
 (or both) that would allow plugins to be packaged into multiple independent 
 jar files and placed on the classpath.  The system would search the classpath 
 for resources with the correct folder name and would load any plugins in 
 those jars.
 This functionality would be very useful in making the Nutch core more 
 flexible in terms of packaging.  It would also help with web applications 
 where we don't want to have a plugins directory included in the webapp.
 My thoughts so far are to unzip those plugin jars into a common temp 
 directory before loading.  Another option is using something like Commons VFS 
 to interact with the jar files.  VFS essentially uses a disk-based temporary 
 cache for jar files, so it is pretty much the same solution.   What are 
 everyone else's thoughts on this?
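
The "search the classpath for resources with the correct folder name" step could start from `ClassLoader.getResources`, which returns one URL per classpath entry (directory or jar) containing the named resource. The sketch below shows only that lookup; `PluginFolderLocator` is a hypothetical name, and unpacking or loading the plugins found this way is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class PluginFolderLocator {
    /** Lists every classpath location that contains the named folder. */
    static List<URL> locate(String folderName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            // One URL per classpath entry that has the resource.
            return Collections.list(loader.getResources(folderName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For entries inside jars the returned URLs use the jar: scheme, so they would still have to be read through the class loader or unpacked to a temp directory as discussed above.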




[jira] Resolved: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-794.
-

Resolution: Fixed

@julien -- I think this issue has been fixed in Tika right? If not, feel free 
to reopen, or better yet, re-file the issue against a post 1.1 Nutch release. 
Thanks!

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document: 
 <html lang="fi"><head>document 1 title</head><body>jotain 
 suomeksi</body></html>
 is rendered as the following XHTML by Tika: 
 <?xml version="1.0" encoding="UTF-8"?><html 
 xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore.




[jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-578:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Dennis Kubes
 Attachments: crawl-urlfilter.txt, NUTCH-578.patch, 
 NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, nutch-site.xml, 
 regex-normalize.xml, urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL which is on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more times than 3):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.




[jira] Updated: (NUTCH-540) some problem about the Nutch cache

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-540:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 some problem about the Nutch cache
 --

 Key: NUTCH-540
 URL: https://issues.apache.org/jira/browse/NUTCH-540
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
 Attachments: 1.gif, 1186733525.jpg


 I am Chinese.
 I am testing searches for Chinese words in Nutch. I installed Nutch 0.9 in 
 Tomcat 5 on Linux. The Tomcat charset is UTF-8, and I used Nutch to crawl a 
 Chinese website whose charset is also UTF-8. When I search for a Chinese word 
 with Nutch on Tomcat, the title and description of each search result are 
 displayed correctly. But when I click the cached link, the cached page is 
 displayed with the wrong character encoding, even though the cached page's 
 charset is also UTF-8. I found another website that uses Nutch, 
 http://www.synoo.com:8080/zh/, and searching for a Chinese word there shows 
 the same error.
 When I use Luke to look at the segments, the Chinese words are displayed 
 correctly, so I think this may be a bug.




[jira] Updated: (NUTCH-455) dedup on tokenized fields is faulty

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-455:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 Nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values of 
 this field are cached by Lucene's FieldCacheImpl. The default is the site 
 field, which is indexed and tokenized. However, for a tokenized field (for 
 example url in Nutch), FieldCacheImpl returns an array of terms rather than 
 an array of field values, so dedup'ing becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields and, as described above, 
 caches only terms. 
 So when we search using url as the dedup field and 
 a Hit is constructed in IndexSearcher, the dedupValue becomes a token of 
 the url (such as www or com) rather than the whole url. This prevents 
 using tokenized fields as the dedup field. 
 I have written a patch for Lucene and attached it to 
 http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
 aforementioned issue with tokenized field caching. However, building such a 
 cache for about 1.5M documents takes 20+ secs. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and on the first search call in IndexSearcher, the cache is built. 
 Long story short, I have written a patch against IndexSearcher which warms 
 up the caches of the wanted (configurable) fields in the constructor. I think we 
 should vote for LUCENE-252, and then commit the above patch with the latest 
 version of Lucene.
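
 To make the dedup problem described above concrete, here is a minimal, 
 self-contained sketch. It does not use Lucene at all: tokenize() is a 
 hypothetical stand-in for an analyzer, and cachedTerm() assumes that a 
 term-based cache ends up holding a single token per document (modelled here 
 as the lexicographically greatest one). Both are assumptions made only for 
 illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    // Hypothetical stand-in for an analyzer: lower-cases and splits on non-alphanumerics.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assumption: a term-based cache keeps one term per document; we model that
    // as the lexicographically greatest token of the field value.
    static String cachedTerm(String value) {
        String best = "";
        for (String t : tokenize(value)) {
            if (t.compareTo(best) > 0) {
                best = t;
            }
        }
        return best;
    }

    // Keeps the first hit per dedup key and returns the surviving URLs.
    static List<String> dedup(List<String> urls, boolean termBasedCache) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String key = termBasedCache ? cachedTerm(url) : url;
            if (seen.add(key)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.example.com/a",
            "http://www.example.com/b");
        System.out.println("whole value: " + dedup(urls, false));
        System.out.println("single term: " + dedup(urls, true));
    }
}
```

 With whole values as keys both URLs survive; with a single cached term as the 
 key they collapse into one hit, which is the faulty behaviour reported here.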




[jira] Updated: (NUTCH-747) injectIndex metadatas and inherit these metadatas to all matching suburls

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-747:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 injectIndex metadatas and inherit these metadatas to all matching suburls
 --

 Key: NUTCH-747
 URL: https://issues.apache.org/jira/browse/NUTCH-747
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, injector
Reporter: Marko Bauhardt
 Attachments: index-metadata.patch, metadata.patch


 Hi.
 The following two patches support:
 + injecting metadata for urls into a metadatadb
 url.com TAB METAKEY : TAB METAVALUE TAB METAVALUE METAKEY : 
 METAVALUE ...
 ...
 + updating the parse_data metadata from a shard and writing the metadata to all 
 fetched urls that start with a url from the metadatadb
 + in other words, this patch supports metadata inheritance for all matching suburls
 The second patch implements an index-metadata plugin.
 + this plugin extracts all metadata from the parse_data of a shard and indexes 
 it. Which metadata is indexed can be configured in the plugin.properties.
 + to index, for example, the lang you have to configure the plugin.properties: 
 lang=STORE,UNTOKENIZED
 + that means that the index plugin extracts metadata values with the key lang. If 
 any exist, all values are indexed stored and untokenized
 Example
 create start url's in /tmp/urls/start/urls.txt
 http://lucene.apache.org/nutch/apidocs-1.0/index.html
 http://lucene.apache.org/nutch/apidocs-0.9/index.html
 create metadata url's in /tmp/urls/metadata/urls.txt
 http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0
 http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9
 Inject Urls
 bin/nutch inject crawldb /tmp/urls/start/
 bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb 
 /tmp/urls/metadata/
 Fetch  Parse  Update
 bin/nutch generate crawldb segments
 bin/nutch fetch segments/20090806105717/
 bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb 
 segments/20090806105717
 bin/nutch updatedb crawldb/ segments/20090806105717/
 Fetch  Parse  Update Again
 ...
 Index
 bin/nutch invertlinks linkdb -dir segments/
 bin/nutch index index crawldb/ linkdb/ segments/20090806105717 
 segments/20090806110127
 Check your Index
 All urls starting with http://lucene.apache.org/nutch/apidocs-1.0/  are 
 indexed with version:1.0.
 All urls starting with http://lucene.apache.org/nutch/apidocs-0.9/  are 
 indexed with version:0.9.
 This issue is somewhat related to http://issues.apache.org/jira/browse/NUTCH-655
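
 The inheritance rule described above (a fetched url receives the metadata of 
 every injected url it starts with) can be sketched roughly as follows. This 
 is only an illustration: MetadataInheritanceSketch and its in-memory map are 
 hypothetical stand-ins for the metadatadb, not code from the attached 
 patches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MetadataInheritanceSketch {
    // Hypothetical in-memory stand-in for the metadatadb: URL prefix -> metadata.
    private final Map<String, Map<String, String>> metadataDb = new LinkedHashMap<>();

    // Registers one metadata key/value pair for a URL prefix.
    void inject(String urlPrefix, String key, String value) {
        metadataDb.computeIfAbsent(urlPrefix, p -> new LinkedHashMap<>()).put(key, value);
    }

    // Collects the metadata of every injected prefix that the fetched URL starts with.
    Map<String, String> metadataFor(String fetchedUrl) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : metadataDb.entrySet()) {
            if (fetchedUrl.startsWith(e.getKey())) {
                result.putAll(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MetadataInheritanceSketch db = new MetadataInheritanceSketch();
        db.inject("http://lucene.apache.org/nutch/apidocs-1.0/", "version", "1.0");
        db.inject("http://lucene.apache.org/nutch/apidocs-0.9/", "version", "0.9");
        System.out.println(db.metadataFor("http://lucene.apache.org/nutch/apidocs-1.0/index.html"));
    }
}
```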




[jira] Updated: (NUTCH-479) Support for OR queries

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-479:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: nutch_0.9_OR.patch, or.patch, or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.




[jira] Updated: (NUTCH-677) Segment merge filering based on segment content

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-677:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Segment merge filering based on segment content
 ---

 Key: NUTCH-677
 URL: https://issues.apache.org/jira/browse/NUTCH-677
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Marcin Okraszewski
 Attachments: MergeFilter.patch, MergeFilter_for_1.0.patch, 
 SegmentMergeFilter.java, SegmentMergeFilter.java, SegmentMergeFilters.java, 
 SegmentMergeFilters.java


 I needed segment filtering based on metadata detected during the parse phase. 
 Unfortunately the current URL-based filtering does not allow for this. So I have 
 created a new SegmentMergeFilter extension which receives the segment entry that 
 is being merged and decides whether it should be included or not. Even though I 
 needed only ParseData for my purpose, I have made it a bit more general-purpose, 
 so the filter receives all merged data.
 The attached patch is for version 0.9, which I use. Unfortunately I didn't 
 have time to check how it fits the trunk version. Sorry :(
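
 As a rough illustration of the idea (a filter that sees each entry being 
 merged and decides whether to keep it), here is a small sketch. The 
 EntryFilter interface and the merge() helper are hypothetical and are NOT 
 the signatures from the attached SegmentMergeFilter.java; the real 
 extension receives all merged data, while this sketch only passes parse 
 metadata as a plain map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MergeFilterSketch {
    // Hypothetical extension point; not the interface from the attached patch.
    interface EntryFilter {
        boolean keep(String url, Map<String, String> parseMetadata);
    }

    // Returns the URLs of the entries that the filter accepts, in input order.
    static List<String> merge(Map<String, Map<String, String>> entries, EntryFilter filter) {
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : entries.entrySet()) {
            if (filter.keep(e.getKey(), e.getValue())) {
                merged.add(e.getKey());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> entries = new LinkedHashMap<>();
        Map<String, String> en = new LinkedHashMap<>();
        en.put("lang", "en");
        Map<String, String> fi = new LinkedHashMap<>();
        fi.put("lang", "fi");
        entries.put("http://example.org/en", en);
        entries.put("http://example.org/fi", fi);
        // Keep only entries whose parse metadata says the page is English.
        System.out.println(merge(entries, (url, meta) -> "en".equals(meta.get("lang"))));
    }
}
```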




[jira] Updated: (NUTCH-774) Retry interval in crawl date is set to 0

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-774:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Retry interval in crawl date is set to 0
 

 Key: NUTCH-774
 URL: https://issues.apache.org/jira/browse/NUTCH-774
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Reinhard Schwab
Assignee: Andrzej Bialecki 
 Attachments: NUTCH-774.patch, NUTCH-774_2.patch


 When I fetch and parse a feed with the feed plugin,
 http://www.wachauclimbing.net/home/impressum-disclaimer/feed/
 another crawl date is generated:
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/
 After fetching a second round,
 the dump of the crawl db still shows a retry interval with value 0:
 http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ 
 Version: 7
 Status: 2 (db_fetched)
 Fetch time: Wed Dec 02 12:48:22 CET 2009
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 0 seconds (0 days)
 Score: 1.084
 Signature: db9ab2193924cd2d0b53113a500ca604
 Metadata: _pst_: success(1), lastModified=0
 A check should be done in DefaultFetchSchedule (or AbstractFetchSchedule) in 
 the setFetchSchedule method.
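
 The kind of check suggested above could look roughly like this. It is a 
 standalone sketch, not Nutch code: FetchIntervalSketch, effectiveInterval() 
 and the 30-day default are assumptions for illustration, and the real 
 default would come from the configuration.

```java
final class FetchIntervalSketch {
    // Assumed default of 30 days, in seconds; the real value comes from configuration.
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    // Returns the interval to store: falls back to the default when the stored one is not positive.
    static int effectiveInterval(int storedIntervalSeconds) {
        return storedIntervalSeconds > 0 ? storedIntervalSeconds : DEFAULT_INTERVAL_SECONDS;
    }

    public static void main(String[] args) {
        System.out.println(effectiveInterval(0));
        System.out.println(effectiveInterval(3600));
    }
}
```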




[jira] Updated: (NUTCH-460) RDF parser plugin

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 RDF parser plugin
 -

 Key: NUTCH-460
 URL: https://issues.apache.org/jira/browse/NUTCH-460
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Ricardo J. Méndez
 Attachments: rubyspider-rdf.zip


 I've written a couple plugins that I'd like to contribute.  
 RDFLinkParseFilter looks for links on the pages that point towards RDF 
 information, and tags the pages with metadata about the type of links they 
 hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
 information from several possible formats using Jena, and extracts the links 
 that the file points to as Outlinks so that they can be fetched as well.




[jira] Updated: (NUTCH-460) RDF parser plugin

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-460:


Patch Info: [Patch Available]

- pushing this out per http://bit.ly/c7tBv9

 RDF parser plugin
 -

 Key: NUTCH-460
 URL: https://issues.apache.org/jira/browse/NUTCH-460
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Ricardo J. Méndez
 Attachments: rubyspider-rdf.zip


 I've written a couple plugins that I'd like to contribute.  
 RDFLinkParseFilter looks for links on the pages that point towards RDF 
 information, and tags the pages with metadata about the type of links they 
 hold. RDFLinkIndexingFilter indexes said metadata.  RDFParser parses RDF 
 information from several possible formats using Jena, and extracts the links 
 that the file points to as Outlinks so that they can be fetched as well.




[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-729:


 Due Date: 26/Mar/09  (was: 26/Mar/09)
   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 NPE in FieldIndexer when BasicFields url doesn't exist
 --

 Key: NUTCH-729
 URL: https://issues.apache.org/jira/browse/NUTCH-729
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-729-1-20090235.patch


 There is a NullPointerException during a logging call in FieldIndexer when 
 there isn't a url for a document.  Documents shouldn't be without urls but 
 since the FieldIndexer doesn't validate fields it is possible for it to 
 occur.  Most often this happens when BasicFields is run with the wrong 
 segments directory and doesn't complain.  It could also occur if using the 
 FieldIndexer to index things other than basic fields.




[jira] Updated: (NUTCH-573) Multiple Domains - Query Search

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-573:



- pushing this out per http://bit.ly/c7tBv9

 Multiple Domains - Query Search
 ---

 Key: NUTCH-573
 URL: https://issues.apache.org/jira/browse/NUTCH-573
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 0.9.0
 Environment: All
Reporter: Rajasekar Karthik
Assignee: Enis Soztutar
 Attachments: multiTermQuery_v1.patch


 Searching multiple domains can be done in Lucene, but not that efficiently 
 in Nutch.
 Query:
 +content:abc +(site:www.aaa.com site:www.bbb.com)
 works in Lucene, but the same concept does not work in Nutch.
 In Lucene, it works with 
 org.apache.lucene.analysis.KeywordAnalyzer
 org.apache.lucene.analysis.standard.StandardAnalyzer 
 but NOT on
 org.apache.lucene.analysis.SimpleAnalyzer 
 Is Nutch analyzer based on SimpleAnalyzer? In this case, is there a 
 workaround to make this work? Is there an option to change what analyzer 
 nutch is using? 
 Just FYI, another solution (inefficient I believe) which seems to be working 
 on nutch
 query -site:ccc.com -site:ddd.com 




[jira] Updated: (NUTCH-717) Make Nutch Solr integration easier

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-717:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Make Nutch Solr integration easier
 --

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren

 Erik Hatcher proposed we should provide a full Solr config dir to be used 
 with Nutch-Solr. Currently we only provide the index schema. It would be considerably 
 easier to set up Nutch-Solr if we provided the whole conf dir that you could 
 use with Solr like:
 java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar




[jira] Updated: (NUTCH-541) Index url field untokenized

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-541:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Index url field untokenized
 ---

 Key: NUTCH-541
 URL: https://issues.apache.org/jira/browse/NUTCH-541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 The url field is indexed as Store.YES, Index.TOKENIZED. We also need the 
 untokenized version of the url field in some contexts: 
 1. For deleting duplicates by url (at search time); see NUTCH-455.
 2. For restricting the search to a certain url (may be used in the case of 
 RSS search, where each entry in the RSS feed is added as a distinct document with 
 (possibly) the same url). 
 query-url extends FieldQueryFilter, so: 
 Query: url:http://www.apache.org/
 Parsed: url:http http-www http-www-apache www www-apache apache org
 Translated: +url:http-http-www http-www-http-www-apache 
 http-www-apache-www www-www-apache www-apache apache org
 3. For accessing document(s) in the search servers 
 (using a query plugin).
 I suggest we add url as in index-basic and implement a query-url-untoken 
 plugin. 
 doc.add(new Field("url", url.toString(), Field.Store.YES, 
 Field.Index.TOKENIZED));
 doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
 Field.Index.UN_TOKENIZED));




[jira] Updated: (NUTCH-628) Host database to keep track of host-level information

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-628:


   Patch Info: [Patch Available]
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Host database to keep track of host-level information
 -

 Key: NUTCH-628
 URL: https://issues.apache.org/jira/browse/NUTCH-628
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator
Reporter: Otis Gospodnetic
 Attachments: domain_statistics_v2.patch, 
 NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch


 Nutch would benefit from having a DB with per-host/domain/TLD information.  
 For instance, Nutch could detect hosts that are timing out, store information 
 about that in this DB.  Segment/fetchlist Generator could then skip such 
 hosts, so they don't slow down the fetch job.  Another good use for such a DB 
 is keeping track of various host scores, e.g. spam score.
 From the recent thread on nutch-u...@lucene:
 Otis asked:
  While we are at it, how would one go about implementing this DB, as far as 
  its structures go?
 Andrzej said:
 The easiest I can imagine is to use something like <Text, MapWritable>.
 This way you could store arbitrary information under arbitrary keys.
 I.e. a single database then could keep track of aggregate statistics at
 different levels, e.g. TLD, domain, host, ip range, etc. The basic set
 of statistics could consist of a few predefined gauges, totals and averages.
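
 A minimal sketch of the structure Andrzej describes, using plain Java maps 
 as a stand-in for a Text key with a MapWritable value so that it runs 
 without Hadoop. The class name, the "timeouts" statistic and the skip policy 
 are hypothetical and only illustrate how a generator might consult such a 
 DB.

```java
import java.util.HashMap;
import java.util.Map;

final class HostDbSketch {
    // Plain-Java stand-in for a (Text, MapWritable) store: key -> (statistic name -> value).
    // The key can be a host, a domain, a TLD or any other aggregation level.
    private final Map<String, Map<String, Long>> db = new HashMap<>();

    // Adds one to the named statistic for the given key.
    void increment(String key, String statistic) {
        db.computeIfAbsent(key, k -> new HashMap<>()).merge(statistic, 1L, Long::sum);
    }

    // Returns the stored value, or 0 when the key or statistic is unknown.
    long get(String key, String statistic) {
        Map<String, Long> stats = db.get(key);
        if (stats == null) {
            return 0L;
        }
        Long value = stats.get(statistic);
        return value == null ? 0L : value;
    }

    // Hypothetical policy: skip a host once it has timed out at least maxTimeouts times.
    boolean shouldSkip(String host, long maxTimeouts) {
        return get(host, "timeouts") >= maxTimeouts;
    }

    public static void main(String[] args) {
        HostDbSketch hostDb = new HostDbSketch();
        hostDb.increment("slow.example.org", "timeouts");
        hostDb.increment("slow.example.org", "timeouts");
        System.out.println(hostDb.shouldSkip("slow.example.org", 2));
        System.out.println(hostDb.shouldSkip("fast.example.org", 2));
    }
}
```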




[jira] Updated: (NUTCH-650) Hbase Integration

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-650:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Hbase Integration
 -

 Key: NUTCH-650
 URL: https://issues.apache.org/jira/browse/NUTCH-650
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
 malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, 
 NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch


 This issue will track nutch/hbase integration




[jira] Updated: (NUTCH-583) FeedParser empty links for items

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-583:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 FeedParser empty links for items
 

 Key: NUTCH-583
 URL: https://issues.apache.org/jira/browse/NUTCH-583
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar

 FeedParser in the feed plugin just discards an item if it does not have a link 
 element. However, RSS 2.0 does not require the link element for each 
 item. 
 Moreover, sometimes the link is given in the guid element, which is a 
 globally unique identifier for the item. I think we can search for the url of an 
 item first; then, if it is still not found, we can use the feed's url, but 
 merging all the parse texts into one Parse object. 
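
 The fallback order proposed above (link, then guid, then the feed's own url) 
 might be sketched like this. It is independent of the actual FeedParser 
 code; resolveUrl() and the rule that a guid is only used when it looks like 
 an http(s) url are assumptions made for this example.

```java
final class FeedItemUrlSketch {
    // True when the value is present and not blank.
    private static boolean isUsable(String value) {
        return value != null && !value.trim().isEmpty();
    }

    // Assumption: a guid is only accepted as a URL when it starts with http:// or https://.
    private static boolean looksLikeUrl(String value) {
        String v = value.trim().toLowerCase();
        return v.startsWith("http://") || v.startsWith("https://");
    }

    // Picks the item URL: link first, then a URL-like guid, then the feed's URL.
    static String resolveUrl(String link, String guid, String feedUrl) {
        if (isUsable(link)) {
            return link.trim();
        }
        if (isUsable(guid) && looksLikeUrl(guid)) {
            return guid.trim();
        }
        return feedUrl;
    }

    public static void main(String[] args) {
        String feed = "http://example.org/feed";
        System.out.println(resolveUrl(null, "http://example.org/item/1", feed));
        System.out.println(resolveUrl("", "urn:uuid:123", feed));
    }
}
```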




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


 Due Date: 27/Nov/08  (was: 27/Nov/08)
Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-666:


Patch Info: [Patch Available]

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-666-1-20081126.patch, NUTCH-666-2-20091217-nf.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.




[jira] Updated: (NUTCH-475) Adaptive crawl delay

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-475:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Adaptive crawl delay
 

 Key: NUTCH-475
 URL: https://issues.apache.org/jira/browse/NUTCH-475
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Doğacan Güney
 Attachments: adaptive-delay_draft.patch


 The current fetcher implementation waits a default interval before making another 
 request to the same server (if crawl-delay is not specified in robots.txt). 
 IMHO, an adaptive implementation would be better. If the server is under 
 little load and can serve requests fast, then the fetcher can ask for more pages 
 in a given interval. Similarly, if the server is suffering from heavy load, 
 the fetcher can slow down (w.r.t. that host), easing the load on the server.
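
 One simple way to make the delay adaptive is to derive it from the last 
 observed response time and clamp it to configured bounds. The sketch below 
 is only one possible rule, not the one in the attached draft patch; the 
 factor of 5 and the method name nextDelayMs() are assumptions.

```java
final class AdaptiveDelaySketch {
    // Assumed multiplier: wait five times as long as the last response took.
    static final long FACTOR = 5;

    // Returns the delay before the next request to the same host, clamped to [minMs, maxMs].
    static long nextDelayMs(long lastResponseTimeMs, long minMs, long maxMs) {
        long proposed = lastResponseTimeMs * FACTOR;
        return Math.max(minMs, Math.min(maxMs, proposed));
    }

    public static void main(String[] args) {
        System.out.println(nextDelayMs(50, 500, 5000));   // fast server: minimum delay
        System.out.println(nextDelayMs(400, 500, 5000));  // moderate load: proportional delay
        System.out.println(nextDelayMs(2000, 500, 5000)); // heavy load: maximum delay
    }
}
```

 A fast host is polled at the minimum delay, while a slow host is backed off 
 up to the maximum, which matches the behaviour described above.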




[jira] Updated: (NUTCH-771) Add WebGraph classes to the bin/nutch script

2010-03-31 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-771:


Fix Version/s: (was: 1.1)

- pushing this out per http://bit.ly/c7tBv9

 Add WebGraph classes to the bin/nutch script
 

 Key: NUTCH-771
 URL: https://issues.apache.org/jira/browse/NUTCH-771
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All, shell script
Reporter: Dennis Kubes
Assignee: Dennis Kubes

 Currently the webgraph jobs are called on the command line by calling main 
 methods on their classes.  I propose to upgrade the bin/nutch shell script to 
 allow calling these jobs as well.  This would include the webgraphdb, 
 linkrank, scoreupdater, and nodedumper jobs.




[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852047#action_12852047
 ] 

Chris A. Mattmann commented on NUTCH-673:
-

Folks: if you get time to put together a patch for 1.1 or feel that this should 
go into 1.1, please see:  http://bit.ly/c7tBv9 and comment in the next 48 hrs...

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor

 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree and upgrading it to the 
 latest version before 1.0-release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is that JDK 1.5 must be used, but this is also 
 now required for Hadoop 0.19, so this wouldn't be the only reason for the 
 switch.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852048#action_12852048
 ] 

Chris A. Mattmann commented on NUTCH-789:
-

Folks, I'm going to put together an RC for Tika 0.7 and take care of JIRA now. 
Once I do that, we can try and close out this issue for 1.1. I should be able 
to do this before the 48 hr deadline I threw up for Nutch 1.1...

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-794) Language Identification must use check the parse metadata for language values

2010-03-31 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12852101#action_12852101
 ] 

Chris A. Mattmann commented on NUTCH-794:
-

Hey Julien, yepper, I posted an RC of Tika 0.7, see: http://bit.ly/c7FZRc. If 
the VOTE passes on that in say the next 72 hours, I will push out a Tika 0.7 
release to the mirrors. If everyone is OK with that, we can release Nutch 1.1 
after...thoughts?

 Language Identification must use check the parse metadata for language values 
 --

 Key: NUTCH-794
 URL: https://issues.apache.org/jira/browse/NUTCH-794
 Project: Nutch
  Issue Type: Bug
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-794.patch


 The following HTML document : 
 <html lang="fi"><head>document 1 title</head><body>jotain 
 suomeksi</body></html>
 is rendered as the following xhtml by Tika : 
 <?xml version="1.0" encoding="UTF-8"?><html 
 xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
 titlejotain suomeksi</body></html>
 with the lang attribute getting lost.  The lang is not stored in the metadata 
 either.
 I will open an issue on Tika and modify TestHTMLLanguageParser so that the 
 tests don't break anymore 




[jira] Updated: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Serykh Evgeniy (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Serykh Evgeniy updated NUTCH-570:
-

Attachment: GeneratorDiff_v1.out

 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  Compared with old benchmarks I 
 had set with the regular Fetcher.java, this took at least three times as long.
 Anyway, I realized that the best situation for ordering can be approached by 
 randomization, but in order to get optimal ordering, urls from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduces to optimize the ordering, and for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java, with a flag in nutch-default.xml 
 determining whether the user wants to use it.
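
 To illustrate the ordering goal (urls from the same host spread apart in the 
 fetch list), here is a small in-memory sketch that interleaves urls 
 round-robin by host. It is not the pair of map/reduce jobs from the attached 
 diff, and round-robin is only a simple approximation of "as far apart as 
 possible"; the class and method names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HostInterleaveSketch {
    // Groups URLs by host, then emits one URL per host per round until all are used.
    static List<String> interleaveByHost(List<String> urls) {
        Map<String, Deque<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        List<String> ordered = new ArrayList<>(urls.size());
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : byHost.values()) {
                if (!queue.isEmpty()) {
                    ordered.add(queue.poll());
                }
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1",
            "http://a.example/2",
            "http://a.example/3",
            "http://b.example/1");
        System.out.println(interleaveByHost(urls));
    }
}
```

 Note that once the smaller hosts run out, the remaining urls of a large host 
 still end up adjacent, so a real implementation would need something smarter 
 for skewed host distributions.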




[jira] Resolved: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-779.
-

   Resolution: Fixed
Fix Version/s: 1.1

Committed revision 929038.

Thanks Andrzej for your feedback

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The attached patch allows parse metadata to be passed to the corresponding 
 entry of the crawldb.  
 Comments are welcome




[jira] Closed: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-30 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-785.
---

Resolution: Fixed

Committed revision 929039

Thanks Andrzej for reviewing it

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher neither copies the metadata from 
 the original URL to the new one nor calls the method scfilters.initialScore




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851316#action_12851316
 ] 

Julien Nioche commented on NUTCH-789:
-

Shall we postpone the work on this issue to after 1.1?

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-03-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851331#action_12851331
 ] 

Andrzej Bialecki  commented on NUTCH-789:
-

There are no diffs, so it's difficult to figure out what's changed ... I think 
that Tika will soon release v. 0.7 which may also impact this patch if we 
decide to upgrade before our release. I asked the Tika guys about their 
release, let's wait a couple days more.

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851461#action_12851461
 ] 

Otis Gospodnetic commented on NUTCH-570:


Serykh, what does your version of the patch do differently? (maybe it's just an 
update so it applies to trunk?)

Julien, want to take this?


 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  Compared with old benchmarks I had 
 set with the regular Fetcher.java, this was at least 3 times longer.
 Anyway, I realized that a good ordering can be approximated by 
 randomization, but to get an optimal ordering, URLs from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduce jobs to optimize the ordering; for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java, with a flag in nutch-default.xml 
 determining if the user wants to use it.




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851545#action_12851545
 ] 

Julien Nioche commented on NUTCH-570:
-

{quote}Julien, want to take this?{quote}

Not particularly. I am busy with short-term issues for 1.1, so feel free to take 
it if you have a particular interest in this. 
I would be curious to see some figures on the improvements from this patch. My 
impression is that NUTCH-776 would be quicker to implement and maintain and 
might give similar gains. 

 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  Compared with old benchmarks I had 
 set with the regular Fetcher.java, this was at least 3 times longer.
 Anyway, I realized that a good ordering can be approximated by 
 randomization, but to get an optimal ordering, URLs from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduce jobs to optimize the ordering; for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java, with a flag in nutch-default.xml 
 determining if the user wants to use it.




[jira] Commented: (NUTCH-570) Improvement of URL Ordering in Generator.java

2010-03-30 Thread Dmitry Lihachev (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851710#action_12851710
 ] 

Dmitry Lihachev commented on NUTCH-570:
---

Yeah, Otis. It's just an update so it applies to trunk.

 Improvement of URL Ordering in Generator.java
 -

 Key: NUTCH-570
 URL: https://issues.apache.org/jira/browse/NUTCH-570
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Ned Rockson
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: GeneratorDiff.out, GeneratorDiff_v1.out


 [Copied directly from my email to nutch-dev list]
 Recently I switched to Fetcher2 over Fetcher for larger whole web fetches 
 (50-100M at a time).  I found that the URLs generated are not optimal because 
 they are simply randomized by a hash comparator.  In one crawl on 24 machines 
 it took about 3 days to crawl 30M URLs.  Compared with old benchmarks I had 
 set with the regular Fetcher.java, this was at least 3 times longer.
 Anyway, I realized that a good ordering can be approximated by 
 randomization, but to get an optimal ordering, URLs from the same host 
 should be as far apart in the list as possible.  So I wrote a series of 2 
 map/reduce jobs to optimize the ordering; for a list of 25M documents it takes 
 about 10 minutes on our cluster.  Right now I have it in its own class, but I 
 figured it can go in Generator.java, with a flag in nutch-default.xml 
 determining if the user wants to use it.




[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851719#action_12851719
 ] 

Hudson commented on NUTCH-779:
--

Integrated in Nutch-trunk #1112 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1112/])
 Mechanism for passing metadata from parse to crawldb


 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The attached patch allows parse metadata to be passed to the corresponding 
 entry of the crawldb.  
 Comments are welcome




[jira] Closed: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-784.
---

Resolution: Fixed

Committed revision 928746

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the entries matching a 
 regular expression on their URL. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know which URL we want to look 
 at.
 The CrawlDBScanner can either generate a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner crawldb output regex [-s status] -text
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be of TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance, the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched
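
The selection rule described above (regex on the URL key plus an optional status constraint) can be sketched in plain Java. This is only an illustration on an in-memory map, not the code from NUTCH-784.patch: the class CrawlDbFilterSketch and its filter method are hypothetical, statuses are modelled as plain strings, and whole-key matching (Matcher.matches) is an assumption suggested by the shape of the example pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the filtering step: URL -> status name.
class CrawlDbFilterSketch {

    // Keeps entries whose URL matches the regex as a whole and, when a
    // status is given (non-null), whose status equals it.
    static Map<String, String> filter(Map<String, String> entries,
                                      String regex, String status) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, String> selected = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            boolean urlMatches = pattern.matcher(entry.getKey()).matches();
            boolean statusMatches =
                status == null || status.equals(entry.getValue());
            if (urlMatches && statusMatches) {
                selected.put(entry.getKey(), entry.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up sample entries for illustration only.
        Map<String, String> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://www.amazon.com/books", "db_fetched");
        crawlDb.put("http://www.amazon.com/new", "db_unfetched");
        crawlDb.put("http://example.org/", "db_fetched");

        Map<String, String> hits =
            filter(crawlDb, ".+amazon.com.*", "db_fetched");
        for (Map.Entry<String, String> hit : hits.entrySet()) {
            System.out.println(hit.getKey() + "\t" + hit.getValue());
        }
    }
}
```

Note that the unescaped dots in .+amazon.com.* match any character, so the pattern is slightly broader than a literal "amazon.com"; writing amazon\.com would tighten it.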




[jira] Updated: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-784:


Fix Version/s: 1.1

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the entries matching a 
 regular expression on their URL. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know which URL we want to look 
 at.
 The CrawlDBScanner can either generate a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner crawldb output regex [-s status] -text
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be of TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance, the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched




[jira] Commented: (NUTCH-784) CrawlDBScanner

2010-03-29 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850896#action_12850896
 ] 

Andrzej Bialecki  commented on NUTCH-784:
-

This should have been reviewed first - I don't question the usefulness of this 
class, but I think that this should have been added as an option to 
CrawlDbReader. As it is now we get a new tool with a cryptic name that performs 
a function that is a variant of another existing tool...

 CrawlDBScanner 
 ---

 Key: NUTCH-784
 URL: https://issues.apache.org/jira/browse/NUTCH-784
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-784.patch


 The patch file contains a utility which dumps all the entries matching a 
 regular expression on their URL. The dump mechanism of the crawldb reader is 
 not very useful on large crawldbs, as the output can be extremely large, and 
 the -url function can't help if we don't know which URL we want to look 
 at.
 The CrawlDBScanner can either generate a text representation of the 
 CrawlDatum-s or binary objects which can then be used as a new CrawlDB. 
 Usage: CrawlDBScanner crawldb output regex [-s status] -text
 regex: regular expression on the crawldb key
 -s status : constraint on the status of the crawldb entries e.g. db_fetched, 
 db_unfetched
 -text : if this parameter is used, the output will be of TextOutputFormat; 
 otherwise it generates a 'normal' crawldb with the MapFileOutputFormat
 For instance, the command below: 
 ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* 
 -s db_fetched -text
 will generate a text file /tmp/amazon-dump containing all the entries of the 
 crawldb matching the regexp  .+amazon.com.* and having a status of db_fetched




[jira] Created: (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

2010-03-29 Thread Julien Nioche (JIRA)
Merge CrawlDBScanner with CrawlDBReader
---

 Key: NUTCH-806
 URL: https://issues.apache.org/jira/browse/NUTCH-806
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche


The CrawlDBScanner [NUTCH-784] should be merged with the CrawlDBReader. Will do 
that after the 1.1 release 




[jira] Updated: (NUTCH-783) IndexerChecker Utilty

2010-03-29 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-783:


Fix Version/s: (was: 1.1)

Removed tag 1.1
Will rename to IndexingPluginsChecker later

 IndexerChecker Utilty
 -

 Key: NUTCH-783
 URL: https://issues.apache.org/jira/browse/NUTCH-783
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-783.patch


 This patch contains a new utility for checking the configuration of 
 the indexing filters. The IndexerChecker reads and parses a URL and runs the 
 indexers on it, then displays the fields obtained and the first 
 100 characters of their values.
 Example usage: ./nutch org.apache.nutch.indexer.IndexerChecker 
 http://www.lemonde.fr/
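
The display step described above (field name plus the first 100 characters of its value) could look roughly like the sketch below. It is not taken from NUTCH-783.patch: the class FieldPreview, the method preview and the sample field values are all made up for illustration, and "characters" is taken to mean Java char units.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of printing a truncated preview of each field.
class FieldPreview {

    static final int MAX_CHARS = 100;

    // Returns the value unchanged if it is short enough, otherwise only its
    // first MAX_CHARS char units (a surrogate pair at the cut may be split).
    static String preview(String value) {
        return value.length() <= MAX_CHARS
            ? value
            : value.substring(0, MAX_CHARS);
    }

    public static void main(String[] args) {
        // Build a long made-up value so the truncation is visible.
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            longText.append("word").append(i).append(' ');
        }

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Sample page title");
        fields.put("content", longText.toString());

        for (Map.Entry<String, String> field : fields.entrySet()) {
            System.out.println(field.getKey() + " :\t"
                + preview(field.getValue()));
        }
    }
}
```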

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850912#action_12850912
 ] 

Julien Nioche commented on NUTCH-785:
-

Could anyone please review this issue? I would like to commit it in time for 
the 1.1 release

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher neither copies the metadata from 
 the original URL to the new one nor calls the method scfilters.initialScore




[jira] Commented: (NUTCH-779) Mechanism for passing metadata from parse to crawldb

2010-03-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850915#action_12850915
 ] 

Julien Nioche commented on NUTCH-779:
-

Could anyone please review this issue? I would like to commit it in time for 
the 1.1 release

 Mechanism for passing metadata from parse to crawldb
 

 Key: NUTCH-779
 URL: https://issues.apache.org/jira/browse/NUTCH-779
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-779, NUTCH-779-v2.patch


 The attached patch allows parse metadata to be passed to the corresponding 
 entry of the crawldb.  
 Comments are welcome




[jira] Commented: (NUTCH-785) Fetcher : copy metadata from origin URL when redirecting + call scfilters.initialScore on newly created URL

2010-03-29 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850931#action_12850931
 ] 

Andrzej Bialecki  commented on NUTCH-785:
-

+1. The scoring api should allow us to set this metadata in one call, but 
changing the API now would be problematic.

 Fetcher : copy metadata from origin URL when redirecting + call 
 scfilters.initialScore on newly created URL
 ---

 Key: NUTCH-785
 URL: https://issues.apache.org/jira/browse/NUTCH-785
 Project: Nutch
  Issue Type: Bug
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-785.patch


 When following redirections, the Fetcher neither copies the metadata from 
 the original URL to the new one nor calls the method scfilters.initialScore



