Re: [VOTE 2] Board resolution for Nutch as TLP

2010-04-13 Thread Sami Siren

On 04/12/2010 02:08 PM, Andrzej Bialecki wrote:

Hi,

Take two, after s/crawling/search/ ...

Following the discussion, below is the text of the proposed Board
Resolution to vote upon.



[X] +1.  Request the Board make Nutch a TLP

--
 Sami Siren


Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Sami Siren
Looks good to me after the proposed changes.

--
 Sami Siren

On Sat, Apr 10, 2010 at 6:09 PM, Andrzej Bialecki a...@getopt.org wrote:
 On 2010-04-10 15:32, Jukka Zitting wrote:
 Hi,

 On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote:
 WHEREAS, the Board of Directors deems it to be in the best
 interests of the Foundation and consistent with the
 Foundation's purpose to establish a Project Management
 Committee charged with the creation and maintenance of
 open-source software related to a large-scale web crawling
 platform for distribution at no charge to the public.

 Would it make sense to simplify the scope to ... open-source software
 related to large-scale web crawling for distribution at no charge to
 the public?

 Yes, that's a good change too.

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




Re: [DISCUSS] Nutch as a top level project (TLP)?

2010-03-23 Thread Sami Siren
My opinion is neutral on this matter. I don't see any technical benefit 
in becoming a top level project, and exposure-wise I think the impact is 
probably negative. So for me the reason would be strictly political.


But the fact is that Nutch is pretty independent from Lucene/Solr and 
there is not much overlap between the dev communities.


--
 Sami Siren


[jira] Commented: (NUTCH-798) Upgrade to SOLR1.4

2010-03-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843546#action_12843546
 ] 

Sami Siren commented on NUTCH-798:
--

+1

 Upgrade to SOLR1.4
 --

 Key: NUTCH-798
 URL: https://issues.apache.org/jira/browse/NUTCH-798
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.1


 in particular SOLR1.4 has a StreamingUpdateSolrServer which would simplify 
 the way we buffer the docs before sending them to the SOLR instance 
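
 As a rough illustration of the kind of manual buffering this issue wants to
 simplify (this is not Nutch's indexer code; `DocBuffer` and its sink are
 hypothetical names used only for this sketch), a writer might collect
 documents and flush them in batches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects documents and hands them to a sink in batches. */
class DocBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;   // stands in for "send to the Solr instance"
    private final List<T> pending = new ArrayList<>();

    DocBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one document and flushes automatically once the batch is full. */
    void add(T doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending; must also be called once at the end. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));  // copy so the sink owns its batch
        pending.clear();
    }
}
```

 The bookkeeping above (batch size, the final flush) is what the issue hopes
 StreamingUpdateSolrServer would take over.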

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: need advice trouble shooting zero results problem

2010-02-18 Thread Sami Siren
I think we should add some logging to the initialization code of the
various back ends. Currently they log nearly nothing and it's hard to
find out what's happening (especially when something is wrong).

You can go ahead and propose a patch in jira that adds proper logging
statements so that it's easier to diagnose situations like that.
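
A minimal sketch of what such logging could look like. It assumes
java.util.logging as a stand-in for whatever logging API the back ends
actually use, and `BackendInit` is a hypothetical class, not existing
Nutch code:

```java
import java.io.File;
import java.util.logging.Logger;

/** Hypothetical back end initializer; names are illustrative only. */
class BackendInit {
    private static final Logger LOG = Logger.getLogger(BackendInit.class.getName());

    /** Returns true if the server list file was found, logging either outcome. */
    static boolean init(File serversFile) {
        String path = serversFile.getAbsolutePath();
        LOG.info("Initializing search back end, looking for " + path);
        if (!serversFile.isFile()) {
            // This is the breadcrumb that is missing today.
            LOG.warning("Server list not found: " + path);
            return false;
        }
        LOG.info("Using server list " + path);
        return true;
    }
}
```

The point is only that both the success and the failure path leave a
trace in the log.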

--
 Sami Siren




On Fri, Feb 19, 2010 at 5:24 AM, Jesse Hires jhi...@gmail.com wrote:
 I am getting zero results when I search, but have no idea where to look for
 clues as to why. Is there a log that shows failure to find
 search-servers.txt, or failures to connect to the searchers? Is there a way
 I can verify the searchers can find the indexes?


 There seem to be very few breadcrumbs to follow when the configuration is
 not quite correct.

 I have had this working at one point, but decided to start over with the
 latest version from the trunk. I have a feeling I missed something, but I
 just don't know where to look.

 Jesse

 int GetRandomNumber()
 {
    return 4; // Chosen by fair roll of dice
                 // Guaranteed to be random
 } // xkcd.com




[jira] Created: (NUTCH-793) search.jsp compile errors

2010-02-15 Thread Sami Siren (JIRA)
search.jsp compile errors
-

 Key: NUTCH-793
 URL: https://issues.apache.org/jira/browse/NUTCH-793
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


With the recently committed searcher interface changes I broke search.jsp, 
which currently does not compile.




Re: exception in search.jsp

2010-02-15 Thread Sami Siren

Hi Jesse,

thanks for spotting this. I fixed the problem in trunk, see 
https://issues.apache.org/jira/browse/NUTCH-793


--
 Sami Siren

Jesse Hires wrote:

I am seeing the following and am unable to find any notes anywhere on it.

org.apache.jasper.JasperException: Unable to compile class for JSP: 


An error occurred at line: 207 in the jsp file: /search.jsp

query.getParams cannot be resolved or is not a field
204: // position this is good, bad?... ugly?
205: Hits hits;
206: try{
207:   query.getParams.initFrom(start + hitsToRetrieve, hitsPerSite, site, sort, reverse);
208:   hits = bean.search(query);
209: } catch (IOException e){
210:   hits = new Hits(0,new Hit[0]);



It looks like this change came in recently to SVN

--- lucene/nutch/trunk/src/web/jsp/search.jsp   2009/10/09 17:02:32 823614
+++ lucene/nutch/trunk/src/web/jsp/search.jsp   2010/02/01 20:47:34 905410
@@ -204,8 +204,8 @@
    // position this is good, bad?... ugly?
    Hits hits;
    try{
-     hits = bean.search(query, start + hitsToRetrieve, hitsPerSite, site,
-                        sort, reverse);
+     query.getParams.initFrom(start + hitsToRetrieve, hitsPerSite, site, sort, reverse);
+     hits = bean.search(query);
    } catch (IOException e){
      hits = new Hits(0,new Hit[0]);
    }


Has anyone else run into this, or did I miss something when updating to 
the latest version?


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com http://xkcd.com





[jira] Resolved: (NUTCH-793) search.jsp compile errors

2010-02-15 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-793.
--

Resolution: Fixed

committed a fix

 search.jsp compile errors
 -

 Key: NUTCH-793
 URL: https://issues.apache.org/jira/browse/NUTCH-793
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


 With the recently committed searcher interface changes I broke search.jsp, 
 which currently does not compile.




[jira] Resolved: (NUTCH-788) search.jsp typo causing searches to fail

2010-02-15 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-788.
--

   Resolution: Fixed
Fix Version/s: 1.1
 Assignee: Sami Siren

Thanks Sammy for the fix, I did not realize you had spotted this too. It's now 
fixed in trunk.

 search.jsp typo causing searches to fail
 

 Key: NUTCH-788
 URL: https://issues.apache.org/jira/browse/NUTCH-788
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.1
 Environment: On trunk
Reporter: Sammy Yu
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: 0001-Fix-up-servlet.patch


 Call to initialize the servlet parameter is missing parentheses.
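
 The kind of mistake described here can be sketched in isolation. The classes
 below are hypothetical stand-ins for the ones used in search.jsp, and
 `initFrom` is reduced to a single argument for brevity:

```java
/** Hypothetical stand-in for the parameter holder used in search.jsp. */
class QueryParams {
    int numHits;

    void initFrom(int numHits) {
        this.numHits = numHits;
    }
}

/** Hypothetical stand-in for the query class. */
class Query {
    private final QueryParams params = new QueryParams();

    QueryParams getParams() {
        return params;
    }
}

class ParenthesesDemo {
    static int run() {
        Query query = new Query();
        // query.getParams.initFrom(10);  // does not compile: getParams is a method, not a field
        query.getParams().initFrom(10);   // the accessor needs its parentheses
        return query.getParams().numHits;
    }
}
```

 Without the parentheses the compiler looks for a field named `getParams`,
 which matches the "cannot be resolved or is not a field" message reported for
 search.jsp.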




[jira] Commented: (NUTCH-789) Improvements to Tika parser

2010-02-15 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833714#action_12833714
 ] 

Sami Siren commented on NUTCH-789:
--

It would be really useful to include the improvements in the functionality 
since that way almost all (-flash ?) parsers would be covered.

 Improvements to Tika parser
 ---

 Key: NUTCH-789
 URL: https://issues.apache.org/jira/browse/NUTCH-789
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
 Environment: reported by Sami, in NUTCH-766
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.1

 Attachments: NutchTikaConfig.java, TikaParser.java


 As reported by Sami in NUTCH-766, Sami has a few improvements he made to the 
 Tika parser. We'll track that progress here.




[jira] Created: (NUTCH-790) Some external javadoc links are broken

2010-02-14 Thread Sami Siren (JIRA)
Some external javadoc links are broken
--

 Key: NUTCH-790
 URL: https://issues.apache.org/jira/browse/NUTCH-790
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Trivial


Nutch javadoc links for lucene and hadoop are broken.




[jira] Updated: (NUTCH-790) Some external javadoc links are broken

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-790:
-

Attachment: NUTCH-790.patch

proposed patch, fixes links for lucene and hadoop, also updates j2se link to 
version 1.6

 Some external javadoc links are broken
 --

 Key: NUTCH-790
 URL: https://issues.apache.org/jira/browse/NUTCH-790
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Trivial
 Attachments: NUTCH-790.patch


 Nutch javadoc links for lucene and hadoop are broken.




[jira] Created: (NUTCH-791) External links for published javadocs are partially broken

2010-02-14 Thread Sami Siren (JIRA)
External links for published javadocs are partially broken
--

 Key: NUTCH-791
 URL: https://issues.apache.org/jira/browse/NUTCH-791
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Reporter: Sami Siren


Lucene and Hadoop links point to nonexistent URLs. For some versions of 
apidocs the links are just broken and for some they do not exist at all. 
Basically what is required is that the javadocs are generated again with proper 
URLs for external packages.




[jira] Resolved: (NUTCH-790) Some external javadoc links are broken

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-790.
--

   Resolution: Fixed
Fix Version/s: 1.1

committed

 Some external javadoc links are broken
 --

 Key: NUTCH-790
 URL: https://issues.apache.org/jira/browse/NUTCH-790
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Trivial
 Fix For: 1.1

 Attachments: NUTCH-790.patch


 Nutch javadoc links for lucene and hadoop are broken.




[jira] Updated: (NUTCH-792) Nutch version still contains 1.0

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-792:
-

Attachment: NUTCH-792.patch

bump version to 1.1-dev

 Nutch version still contains 1.0
 

 Key: NUTCH-792
 URL: https://issues.apache.org/jira/browse/NUTCH-792
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
 Attachments: NUTCH-792.patch


 Should be 1.1-dev now in trunk.




[jira] Created: (NUTCH-792) Nutch version still contains 1.0

2010-02-14 Thread Sami Siren (JIRA)
Nutch version still contains 1.0


 Key: NUTCH-792
 URL: https://issues.apache.org/jira/browse/NUTCH-792
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
 Attachments: NUTCH-792.patch

Should be 1.1-dev now in trunk.




[jira] Resolved: (NUTCH-792) Nutch version still contains 1.0

2010-02-14 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-792.
--

Resolution: Fixed

committed

 Nutch version still contains 1.0
 

 Key: NUTCH-792
 URL: https://issues.apache.org/jira/browse/NUTCH-792
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Sami Siren
Assignee: Sami Siren
 Attachments: NUTCH-792.patch


 Should be 1.1-dev now in trunk.




[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832406#action_12832406
 ] 

Sami Siren commented on NUTCH-766:
--

I suggest that we drive this a bit further. Currently this patch does not use 
Tika for package formats or HTML.

Julien: was there a reason not to use AutoDetectParser? The only thing that I 
could come up with was that the mime type detection would be done twice. We could 
get around this by implementing something similar to what the composite parser does 
(it uses a parser class (AutoDetectParser) from the context to do further 
parsing) to cover all supported package formats.

Also, was there a reason not to parse HTML with Tika?

I have a patch nearby that demonstrates some of the improvements; I will try 
to post it shortly.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
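
 The lookup rule described in this paragraph (an explicit mapping wins, the
 `*` entry catches everything else) can be sketched as follows. This is not
 the actual ParserFactory code; `ParserLookup` and the parser ids used with it
 are assumptions made for illustration:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup: explicit mime-type mappings win, "*" is the catch-all. */
class ParserLookup {
    private final Map<String, String> byMimeType = new HashMap<>();

    void register(String mimeType, String parserId) {
        byMimeType.put(mimeType, parserId);
    }

    /** Returns the parser id for a mime type, or null if nothing matches. */
    String parserFor(String mimeType) {
        String explicit = byMimeType.get(mimeType);
        return explicit != null ? explicit : byMimeType.get("*");
    }
}
```

 With `*` registered for the Tika parser, an explicit entry is only needed
 for the mime types that should go to a different parser.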
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed

[jira] Updated: (NUTCH-766) Tika parser

2010-02-10 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-766:
-

Attachment: NutchTikaConfig.java

Extended TikaConfig that is able to load parsers and can be used with existing 
Tika classes. The call to the super constructor cannot load the parsers, but the 
config is then processed again locally. This is a hack and hopefully at some 
point we can drop the class altogether.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com




[jira] Updated: (NUTCH-766) Tika parser

2010-02-10 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-766:
-

Attachment: TikaParser.java

Modified parser that can process package formats too. To get rid of the mime 
type detection happening twice we have to extend AutoDetectParser so that it 
skips the initial detection but does the detection for the rest of the content 
(in package formats).

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java






[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2010-02-05 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830053#action_12830053
 ] 

Sami Siren commented on NUTCH-673:
--

{quote}
Any plans or reasons not to upgrade to Lucene 3.0?
{quote}

I see no reason to stick with 2.9

{quote}
I can prepare a patch replacing Lucene 2.9 with Lucene 3.0 (as a separate 
issue).
{quote}

+1

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor
 Fix For: 1.1


 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree and upgrading it to the 
 latest version before 1.0-release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is for JDK 1.5 to be used, but this is also 
 now required for Hadoop 0.19 so this wouldn't be the only reason for the 
 switch.




[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-02 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828561#action_12828561
 ] 

Sami Siren commented on NUTCH-781:
--

{quote}
the version we had was the same as the one provided by Tika 0.4 so I suppose we 
could safely rely on the Tika defaults. MimeUtil currently requires 
tika-mimetypes.xml to be available in the classpath but we could modify 
that so that it uses the default version from the tika jar if nothing can be 
found in conf. Let's put that in a separate JIRA issue if we really want it; in 
the meantime I'll commit the v 0.6 of tika-mimetypes.xml
{quote}

ok. thanks.
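
The fallback idea quoted above (use the file from conf if present, otherwise
the default bundled in a jar) could look roughly like this. It is only a
sketch with hypothetical names (`ConfigFallback`, `open`) and does not show
how MimeUtil actually loads its configuration:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: prefer a file in conf/, else a resource bundled on the classpath. */
class ConfigFallback {
    static InputStream open(File confFile, Class<?> anchor, String bundledResource)
            throws IOException {
        if (confFile.isFile()) {
            return new FileInputStream(confFile);   // local override wins
        }
        // Resolved relative to the anchor class, e.g. a class inside the jar
        // that ships the default file.
        InputStream in = anchor.getResourceAsStream(bundledResource);
        if (in == null) {
            throw new IOException("Neither " + confFile + " nor " + bundledResource + " found");
        }
        return in;
    }
}
```

The caller is responsible for closing the returned stream.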

 Update Tika to v0.6  for the MimeType detection
 ---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 [from announcement]
 Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
 extracting metadata and structured text content from various documents using
 existing parser libraries.
 Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
 be found in the changes file:
 http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt




[jira] Resolved: (NUTCH-775) Enhance Searcher interface

2010-02-01 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-775.
--

Resolution: Fixed

I committed this

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.
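
 To make the proposal concrete, here is a self-contained sketch of the
 context-based method. `Query`, `Hits` and `Metadata` below are minimal
 stand-ins so the example compiles on its own (they are not the Nutch
 classes), and the `numHits` key and `FixedSearcher` are hypothetical:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Minimal stand-ins so the sketch compiles on its own; not the Nutch classes. */
class Query {
    final String text;
    Query(String text) { this.text = text; }
}

class Hits {
    final int total;
    Hits(int total) { this.total = total; }
}

class Metadata {
    private final Map<String, String> values = new HashMap<>();
    void set(String key, String value) { values.put(key, value); }
    String get(String key) { return values.get(key); }
}

/** The proposed shape: new options travel in the context, not in the signature. */
interface Searcher {
    Hits search(Query query, Metadata context) throws IOException;
}

/** Toy implementation that only reads one (hypothetical) option from the context. */
class FixedSearcher implements Searcher {
    @Override
    public Hits search(Query query, Metadata context) {
        String numHits = context.get("numHits");
        int n = (numHits == null) ? 10 : Integer.parseInt(numHits);
        return new Hits(n);
    }
}
```

 A new feature then only needs a new key in the context; existing
 implementations keep compiling because the method signature stays the same.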




[jira] Commented: (NUTCH-781) Update Tika to v0.6 for the MimeType detection

2010-02-01 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12828275#action_12828275
 ] 

Sami Siren commented on NUTCH-781:
--

Did you forget to update conf/tika-mimetypes.xml?

Related question: do we actually need our own version of the Tika config 
anymore? I saw there were some old issues that were fixed in the custom version, 
but I would guess those changes, if important, have already made their way into 
Tika.



 Update Tika to v0.6  for the MimeType detection
 ---

 Key: NUTCH-781
 URL: https://issues.apache.org/jira/browse/NUTCH-781
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1


 [from announcement]
 Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and
 extracting metadata and structured text content from various documents using
 existing parser libraries.
 Apache Tika 0.6 contains a number of improvements and bug fixes. Details can
 be found in the changes file:
 http://www.apache.org/dist/lucene/tika/CHANGES-0.6.txt




[jira] Commented: (NUTCH-775) Enhance Searcher interface

2010-01-28 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806019#action_12806019
 ] 

Sami Siren commented on NUTCH-775:
--

If there are no objections I'll commit the proposed patch within a few days.

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.




[jira] Commented: (NUTCH-775) Enhance Searcher interface

2010-01-28 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806051#action_12806051
 ] 

Sami Siren commented on NUTCH-775:
--

{quote}IMHO this could go as it is ... one suggestion though: this 
Query/QueryContext now resembles SolrQuery/SolrParams. Perhaps we could rename 
QueryContext to QueryParams?
{quote}
That sounds reasonable, I will change the name before committing. I also forgot 
to change the web GUI to use the new API; I will do that as well.

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.




[jira] Commented: (NUTCH-766) Tika parser

2010-01-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805661#action_12805661
 ] 

Sami Siren commented on NUTCH-766:
--

{quote}
Sure, it's more of a configuration backwards-compat issue. For those folks who 
have gone to the trouble of customizing their nutch configuration 
(nutch-site.xml, or nutch-default.xml, or even parse-plugins), to remove the 
parsing plugins (e.g., basically say they don't exist anymore and update your 
deployed configuration to use the tika-plugin), this patch would require a 
configuration update in their deployed environments. Because of that, why don't 
we ease them into that upgrade with at least one released version before the 
plugins go away. It would make it easier from a configuration backwards-compat 
perspective.
{quote}

Ok, so you mean that we need to have duplicate parser plugins because we don't 
want to ask people already using nutch to reconfigure the bits this involves 
now even though we have to do it later? How is postponing going to ease the 
task they need to do anyway at some point? I still don't understand the (longer 
term) benefit.

I am not strongly against the idea of keeping duplicate plugins; it's 
just another ~20M in the .job. What I am worried about is that history will 
repeat itself and we will end up having one more case of duplicate components 
(in this case many of them) doing the same work and no interest in cleaning up 
afterwards. Doing it the way I suggested would guarantee that this will not 
happen.


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates to the parsing mechanism of Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
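
The SAX-to-DOM step described above can be sketched with plain JDK classes.
This is a minimal illustration, not the code from the patch: SaxToDomSketch and
its methods are hypothetical names, the SAX events are written by hand instead
of coming from a Tika parser, and the JDK identity TransformerHandler is used
here as one convenient way to build the DOM.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Sketch only: class and method names are hypothetical.
class SaxToDomSketch {

  /** Hand-written stand-in for the XHTML SAX events a parser would emit. */
  static void emitSampleEvents(ContentHandler out) throws SAXException {
    AttributesImpl none = new AttributesImpl();
    AttributesImpl href = new AttributesImpl();
    href.addAttribute("", "href", "href", "CDATA", "http://example.org/");
    char[] label = "Example".toCharArray();

    out.startDocument();
    out.startElement("", "html", "html", none);
    out.startElement("", "body", "body", none);
    out.startElement("", "a", "a", href);
    out.characters(label, 0, label.length);
    out.endElement("", "a", "a");
    out.endElement("", "body", "body");
    out.endElement("", "html", "html");
    out.endDocument();
  }

  /** Builds a DOM Document from SAX events via the identity transformer. */
  static Document buildSampleDom() throws Exception {
    SAXTransformerFactory factory =
        (SAXTransformerFactory) TransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    DOMResult result = new DOMResult();
    handler.setResult(result);
    emitSampleEvents(handler);
    return (Document) result.getNode();
  }

  /** Collects href values of all <a> elements, as a link detector might. */
  static List<String> extractLinks(Document doc) {
    List<String> links = new ArrayList<>();
    NodeList anchors = doc.getElementsByTagName("a");
    for (int i = 0; i < anchors.getLength(); i++) {
      Element anchor = (Element) anchors.item(i);
      if (anchor.hasAttribute("href")) {
        links.add(anchor.getAttribute("href"));
      }
    }
    return links;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(extractLinks(buildSampleDom()));
  }
}
```

Once the events are in a DOM tree, utilities written against DOM nodes (link
detection, meta tag handling, parse filters) can run on it regardless of the
original document format.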
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   library

[jira] Commented: (NUTCH-766) Tika parser

2010-01-25 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804448#action_12804448
 ] 

Sami Siren commented on NUTCH-766:
--

+1, I'm going to agree on this one here Julien. Other communities  have 
convinced me of the need for backwards compat and unobtrusiveness when 
bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving 
the old plugins (perhaps mentioning they should be deprecated and replaced by 
the Tika functionality) and then removing them in 1.2 or 1.3.

Chris, can you please explain to me how keeping two components doing identical 
work would be more backwards compatible than having only one? 



 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates to the parsing mechanism of Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803664#action_12803664
 ] 

Sami Siren commented on NUTCH-766:
--

I took a brief look at the proposed patch, some comments:

The public API footprint of the new classes should be smaller, e.g. use private, 
package-private or protected methods/classes as much as possible.

I think the end result of this plugin should be replacing all Tika-supported 
parsers (or the parsers we choose to replace) with the TikaParser, not 
building parallel ways to parse the same formats. So I think we need to copy all 
of the existing test files and move/adapt the existing test cases fully before 
committing this. That is a good way of seeing that the parse result is what is 
expected and also of finding out about possible differences between the old and 
the Tika versions.


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates to the parsing mechanism of Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language

[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803673#action_12803673
 ] 

Sami Siren commented on NUTCH-766:
--

 Sure, but it would be silly to block the whole Tika plugin because Tika does 
 not support such or such format as well as the original Nutch plugins. As I 
 explained above we can configure which parser to use for which mimetype and 
 use the Tika-plugin by default. Hopefully the Tika implementation will get 
 better and better and there will be no need for keeping the old plugins.

I meant test files for the parsers we replace, not all of them.

 BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the 
 current version of Tika and the existing Nutch parsers

OK, I had missed that one. 
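
The per-mimetype selection mentioned in the quoted reply (an explicit parser
for some mime types, the Tika plugin for everything else) could look roughly
like the sketch below. It is an illustration only: ParserSelectionSketch, the
"*" lookup in a plain map and the parser ids "rss-sketch" and "tika-sketch"
are assumptions made for the example, not the actual ParserFactory code or
real plugin ids.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: names and ids in this class are hypothetical.
class ParserSelectionSketch {

  /** Key under which the catch-all parser is registered. */
  static final String WILDCARD = "*";

  /**
   * Returns the parser id explicitly mapped to the mime type,
   * or the wildcard parser when there is no explicit mapping.
   */
  static String selectParser(Map<String, String> mimeToParser, String mimeType) {
    String explicit = mimeToParser.get(mimeType);
    if (explicit != null) {
      return explicit;
    }
    String fallback = mimeToParser.get(WILDCARD);
    if (fallback == null) {
      throw new IllegalStateException("No parser configured for " + mimeType);
    }
    return fallback;
  }

  public static void main(String[] args) {
    Map<String, String> mapping = new LinkedHashMap<>();
    mapping.put("application/rss+xml", "rss-sketch"); // format kept on an older parser
    mapping.put(WILDCARD, "tika-sketch");             // default for everything else

    System.out.println(selectParser(mapping, "application/rss+xml"));
    System.out.println(selectParser(mapping, "application/pdf"));
  }
}
```

Explicit entries are then needed only for the formats that should not go
through the default parser.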

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates to the parsing mechanism of Tika but can still coexist with the 
 existing parsing plugins which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress, your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers), in the work described here we decided 
 to put the libs in 2 different places
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Tika being used by the core only for its Mimetype functionalities we only 
 need to put tika-core at the main lib level whereas the tika plugin obviously 
 needs the tika-parsers.jar + all the jars used internally by Tika
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser i.e. link detection,  metatag handling but also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and make the tika parser 
 depend on it.
 The following libraries are required in the lib/ directory of the tika-parser 
 : 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your

[jira] Updated: (NUTCH-775) Enhance Searcher interface

2009-12-30 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-775:
-

Attachment: NUTCH-775.patch

I ended up changing the Query API instead, since the changes were smaller from 
an API perspective that way.

 Enhance Searcher interface
 --

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1

 Attachments: NUTCH-775.patch


 Current Searcher interface is too limited for many purposes:
 Hits search(Query query, int numHits, String dedupField, String sortField,
   boolean reverse) throws IOException;
 It would be nice if we had an interface that allowed adding different 
 features without changing the interface. I am proposing that we deprecate the 
 current search method and introduce something like:
 Hits search(Query query, Metadata context) throws IOException;
 Also at the same time we should enhance the QueryFilter interface to look 
 something like:
 BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
 throws QueryException;
 I would like to hear your comments before proceeding with a patch.




[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-12-16 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791829#action_12791829
 ] 

Sami Siren commented on NUTCH-666:
--

We should also consider switching to Tika for language identification and route 
the proposed improvements in that area through Tika?

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, 
 russian, and thai.  Also includes a new Language Identifier tool that used 
 the new indexing framework in NUTCH-646.




[jira] Created: (NUTCH-775) Enhance Searcher interface

2009-12-15 Thread Sami Siren (JIRA)
Enhance Searcher interface
--

 Key: NUTCH-775
 URL: https://issues.apache.org/jira/browse/NUTCH-775
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.1


Current Searcher interface is too limited for many purposes:

Hits search(Query query, int numHits, String dedupField, String sortField,
  boolean reverse) throws IOException;

It would be nice if we had an interface that allowed adding different 
features without changing the interface. I am proposing that we deprecate the 
current search method and introduce something like:

Hits search(Query query, Metadata context) throws IOException;

Also at the same time we should enhance the QueryFilter interface to look 
something like:

BooleanQuery filter(Query input, BooleanQuery translation, Metadata context)
throws QueryException;

I would like to hear your comments before proceeding with a patch.




[jira] Resolved: (NUTCH-743) Site search powered by Lucene/Solr

2009-07-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-743.
--

Resolution: Fixed

committed

 Site search powered by Lucene/Solr
 --

 Key: NUTCH-743
 URL: https://issues.apache.org/jira/browse/NUTCH-743
 Project: Nutch
  Issue Type: New Feature
  Components: documentation
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: NUTCH-743.patch


 Replace current Nutch site search with Lucene/Solr powered search hosted by 
 Lucid Imagination (http://www.lucidimagination.com/search).  It allows one to 
 search all of the Nutch (content from other parts of the Lucene ecosystem is 
 also available) content from a single place, including web, wiki, JIRA and 
 mail archives. Lucid has a fault tolerant setup with replication and fail 
 over as well as monitoring services in place. 
 A preview of the site with the new search enabled is available at 
 http://people.apache.org/~siren/site/




[jira] Created: (NUTCH-743) Site search powered by Lucene/Solr

2009-06-23 Thread Sami Siren (JIRA)
Site search powered by Lucene/Solr
--

 Key: NUTCH-743
 URL: https://issues.apache.org/jira/browse/NUTCH-743
 Project: Nutch
  Issue Type: New Feature
  Components: documentation
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor


Replace current Nutch site search with Lucene/Solr powered search hosted by 
Lucid Imagination (http://www.lucidimagination.com/search).  It allows one to 
search all of the Nutch (content from other parts of the Lucene ecosystem is 
also available) content from a single place, including web, wiki, JIRA and mail 
archives. Lucid has a fault tolerant setup with replication and fail over as 
well as monitoring services in place. 

A preview of the site with the new search enabled is available at 
http://people.apache.org/~siren/site/





[jira] Updated: (NUTCH-743) Site search powered by Lucene/Solr

2009-06-23 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-743:
-

Attachment: NUTCH-743.patch

If there are no objections I will commit this within a week or so.

 Site search powered by Lucene/Solr
 --

 Key: NUTCH-743
 URL: https://issues.apache.org/jira/browse/NUTCH-743
 Project: Nutch
  Issue Type: New Feature
  Components: documentation
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Attachments: NUTCH-743.patch


 Replace current Nutch site search with Lucene/Solr powered search hosted by 
 Lucid Imagination (http://www.lucidimagination.com/search).  It allows one to 
 search all of the Nutch (content from other parts of the Lucene ecosystem is 
 also available) content from a single place, including web, wiki, JIRA and 
 mail archives. Lucid has a fault tolerant setup with replication and fail 
 over as well as monitoring services in place. 
 A preview of the site with the new search enabled is available at 
 http://people.apache.org/~siren/site/




[ANNOUNCE] Apache Nutch 1.0

2009-03-28 Thread Sami Siren

I am pleased to announce the availability of Apache Nutch 1.0.

Apache Nutch, a subproject of Apache Lucene, is open source web-search 
software. It builds on Lucene Java, adding web-specifics such as a 
crawler, a link-graph database, and parsers for HTML and other document formats.


Apache Nutch 1.0 contains a number of bug fixes and improvements, such as 
Solr integration, a new indexing framework and a new scoring framework, just 
to mention a few. Details can be found in the changes file:


http://svn.apache.org/repos/asf/lucene/nutch/tags/release-1.0/CHANGES.txt

Apache Nutch is available for download from the following download page:
http://www.apache.org/dyn/closer.cgi/lucene/nutch/nutch-1.0.tar.gz

When downloading from a mirror site, please remember to verify the 
downloads using signatures found on the Apache site:

http://www.apache.org/dist/lucene/nutch/KEYS

For more information on Apache Nutch, visit the project home page:
http://lucene.apache.org/nutch

-- Sami Siren (on behalf of the Apache Nutch community)


Re: [VOTE] Release Apache Nutch 1.0

2009-03-27 Thread Sami Siren

Thanks Andrzej,

This vote has passed, we now have a release with three binding +1 votes 
from:


-Andrzej Bialecki
-Dennis Kubes
-Sami Siren

I'll finalize the remaining tasks and do the announcement after the 
package has been mirrored.


ps. we should perhaps create jira issues for all the findings, small and 
big, so we can take care of them before next release.


--
 Sami Siren



Andrzej Bialecki wrote:

Sami Siren wrote:

Hello,

I have packaged the third release candidate for Apache Nutch 1.0 
release at http://people.apache.org/~siren/nutch-1.0/rc2/


See the CHANGES.txt[1] file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/


The following issues that were discovered during the review of last rc 
have been fixed:


https://issues.apache.org/jira/browse/NUTCH-722
https://issues.apache.org/jira/browse/NUTCH-723
https://issues.apache.org/jira/browse/NUTCH-725
https://issues.apache.org/jira/browse/NUTCH-726
https://issues.apache.org/jira/browse/NUTCH-727

Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least 
three binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...


+1. There's a minor issue when using the supplied build.xml to rebuild
the sources - there are no conf/*.template files in the package, so Ant
fails with an error. Creating an empty conf/dummy.template fixes this.
IMHO this is a minor thing, so I vote for releasing the package as is.






[jira] Updated: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph

2009-03-27 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-730:
-

Fix Version/s: (was: 1.0.0)

 NPE in LinkRank if no nodes with which to create the WebGraph
 -

 Key: NUTCH-730
 URL: https://issues.apache.org/jira/browse/NUTCH-730
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-730-1-20090325.patch


 For LinkRank, if there are no nodes to process, then a NullPointerException 
 is thrown when trying to count number of nodes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-23 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-722.
--

Resolution: Fixed

removed the jars and added note about this in README.txt

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukkas comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



NUTCH-722 is resolved

2009-03-23 Thread Sami Siren
I think we are good to go for rc2, and it also seems that the smartest 
thing to do with the package contents at this point is to not touch them.


I will roll out the new rc later today.

--
 Sami Siren


[VOTE] Release Apache Nutch 1.0

2009-03-23 Thread Sami Siren

Hello,

I have packaged the third release candidate for Apache Nutch 1.0 release 
at http://people.apache.org/~siren/nutch-1.0/rc2/


See the CHANGES.txt[1] file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/


The following issues that were discovered during the review of last rc 
have been fixed:


https://issues.apache.org/jira/browse/NUTCH-722
https://issues.apache.org/jira/browse/NUTCH-723
https://issues.apache.org/jira/browse/NUTCH-725
https://issues.apache.org/jira/browse/NUTCH-726
https://issues.apache.org/jira/browse/NUTCH-727

Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least three 
binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511

--
Sami Siren


[jira] Commented: (NUTCH-728) Improve nutch release packaging

2009-03-20 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683814#action_12683814
 ] 

Sami Siren commented on NUTCH-728:
--

not really, it just happens to be the mirror I use.

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [VOTE] Release Apache Nutch 1.0

2009-03-19 Thread Sami Siren

Fellow PMC members,

As you might know already we have posted a release candidate for Nutch 
1.0 some time ago. However we have so far received only two +1 votes 
from Lucene PMC members and one more is required before we can actually 
finalize the release.


The vote thread as it currently is can be seen from:
http://www.lucidimagination.com/search/document/33b2a26db25db492/vote_release_apache_nutch_1_0

We (as a Nutch community) would really appreciate if somebody from the 
PMC had the time to check it out.


Thanks for your time,

 Sami Siren



Sami Siren wrote:

We're lacking one +1, could someone please take a look?

Thanks,

Sami Siren



Sami Siren wrote:

Hello,

I have packaged the second release candidate for Apache Nutch 1.0 
release at


http://people.apache.org/~siren/nutch-1.0/rc1/

See the CHANGES.txt[1] file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004 



Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least 
three binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] 
*http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=log&pathrev=752004



*--
Sami Siren







Re: [VOTE] Release Apache Nutch 1.0

2009-03-19 Thread Sami Siren

thanks Jukka,

Jukka Zitting wrote:

Hi,

On Thu, Mar 19, 2009 at 10:32 AM, Sami Siren ssi...@gmail.com wrote:

We (as a Nutch community) would really appreciate if somebody from the PMC
had the time to check it out.


-1 The release contains the Java Advanced Imaging libraries
(jai_core.jar and jai_codec.jar) which are licensed under Sun's Binary
Code License. We can't redistribute those libraries.


ok, we need to address that somehow.


Other comments based on a quick look:

* The LICENSE.txt file should have at least references to the licenses
of the bundled libraries.

* The NOTICE.txt file should start with the following lines:

  Apache Nutch
  Copyright 2009 The Apache Software Foundation

* The NOTICE.txt file should contain the required copyright notices
from all bundled libraries.

* The README.txt should start with Apache Nutch instead of Nutch

* Why does the release package contain pre-built documentation and
binaries? Downloading the 90MB package takes much longer than checking
out and building the 40MB tag from svn.
IMHO it would be a service to users to make the release contain just the
svn export with instruction on how to build the rest.


I see your point about the fat artifact but I am not totally convinced 
that users (as in end users) would prefer the idea of fetching the 
development tools and compiling the software before they use it, at 
least I am not doing that with the software I use.


I will discuss this with the rest of the devs and see what we can do here. 
One solution could be to split the release into two parts, binary only and 
source, as you propose below (they would both be about the same size, 
since our build process currently copies jars around; I think that's 
mostly the reason for the gigantic size).



We can also still provide pre-built binaries as separate downloads.
More notably: how am I to verify that the release came from the sources
in our svn when it contains stuff that doesn't exist in the svn?


It may be that I don't understand what you're trying to say here, but 
isn't that always the case with binary releases (the difficulty of 
verifying that the binary was built from a certain tag in svn)?


--
 Sami Siren


[jira] Created: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)
Nutch contains jars that we cannot redistribute
---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


It seems that we have some jars (as part of pdf parser) that we cannot 
redistribute.

Jukkas comment from email:

The release contains the Java Advanced Imaging libraries (jai_core.jar and 
jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
redistribute those libraries.





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
LICENCE.txt is lacking info that should be there


 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren


Jukkas comment from email:

* The LICENSE.txt file should have at least references to the licenses of the 
bundled libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
NOTICE.txt is lacking info that should be there
---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren


Jukkas comment from email:

* The NOTICE.txt file should start with the following lines:

  Apache Nutch
  Copyright 2009 The Apache Software Foundation

* The NOTICE.txt file should contain the required copyright notices
from all bundled libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-726) README.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)
README.txt is lacking info that should be there
---

 Key: NUTCH-726
 URL: https://issues.apache.org/jira/browse/NUTCH-726
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren


from Jukkas email:

* The README.txt should start with Apache Nutch instead of Nutch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-727) Add KEYS file to release artifact

2009-03-19 Thread Sami Siren (JIRA)
Add KEYS file to release artifact
-

 Key: NUTCH-727
 URL: https://issues.apache.org/jira/browse/NUTCH-727
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren


comment from Grant:

 Where's the KEYS file for Nutch?

 hi,

 the keys file is at the top level nutch directory (eg: 
 http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS)

OK, I think it should be in the tarball, too, at the top.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0, 
but we could perhaps start the discussion about this anyway, so throw in 
your opinions...



the related snippet from email discussion:

Sami Siren wrote:
 Jukka Zitting wrote:
 * Why does the release package contain pre-built documentation and
 binaries? Downloading the 90MB package takes much longer than checking
 out and building the 40MB tag from svn.
 IMHO it would be a service to users to make the release contain just
 the svn export with instruction
 on how to build the rest.

 I see your point about the fat artifact but I am not totally convinced
 that users (as in end users) would prefer the idea of fetching the
 development tools and compiling the software before they use it, at
 least I am not doing that with the software I use.

 I will discuss this with the rest of the devs and see what we can do here.
 One solution could be to split the release into two parts, binary only and
 source, as you propose below (they would both be about the same size,
 since our build process currently copies jars around; I think that's
 mostly the reason for the gigantic size).


--
 Sami Siren


[jira] Resolved: (NUTCH-726) README.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-726.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

committed

 README.txt is lacking info that should be there
 ---

 Key: NUTCH-726
 URL: https://issues.apache.org/jira/browse/NUTCH-726
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren
 Fix For: 1.0.0


 from Jukkas email:
 * The README.txt should start with Apache Nutch instead of Nutch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-724) Drop the JAI libraries

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-724.
--

Resolution: Duplicate

 Drop the JAI libraries
 --

 Key: NUTCH-724
 URL: https://issues.apache.org/jira/browse/NUTCH-724
 Project: Nutch
  Issue Type: Bug
Reporter: Jukka Zitting
Priority: Blocker
 Fix For: 1.0.0


 The PDF parser plugin contains Java Advanced Imaging (JAI) libraries 
 (jai_core.jar and jai_codec.jar) that are licensed under the Sun Binary Code 
 License. The license is incompatible with Apache policies, so we need to drop 
 those libraries.
 AFAIK (see PDFBOX-381) PDFBox only uses the JAI libraries for handling page 
 rotations and tiff images, so simply dropping the JAI jars shouldn't have too 
 much impact. A better solution would be to switch to using Apache PDFBox that 
 has a proper workaround for this issue, but the first Apache PDFBox release 
 has not yet been made.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683482#action_12683482
 ] 

Sami Siren commented on NUTCH-722:
--

+1, i am fine with this solution too

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukkas comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren wrote:


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0, 
but we could perhaps start the discussion about this anyway, so throw 
in your opinions...


I agree with you and Jukka that we should provide separate tarballs of 
source and binaries. This likely won't result in significant size 
reductions (anyway, what's a measly 90MB nowadays .. ;) but it would 
help other parties to deploy clean binaries and/or track the officially 
released sources.


The source package is the straightforward one; its size would be about 
30GB. The binary package, however, will still remain quite big if we 
need to allow it to run in both local and distributed mode (plugins in 
exploded format and also the .job + .war); the size of such a binary 
package would still be nearly 80G.


We could split the binary to yet smaller pieces: one for local mode, one 
for distributed mode, and the .war separately but I am not sure if 
that's worth the effort.


--
 Sami Siren




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren
The source package is the straightforward one; its size would be about 
30GB. The binary package, however, will still remain quite big if we 

   

Now, this is big, indeed ;)


heh, some serious software, need to buy more disc just to download it 
(yes I was thinking of M not G)  :)


--
 Sami Siren




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a 
local job, no optional filesystems etc), the *.job and *.war files and 
scripts. Scripts would check for the presence of plugins/ dir, and offer 
an option to create it from *.job. Assumption here is that this should be 
enough to run full cycle in local mode, and that people who want to run 
a distributed cluster will first install a plain Hadoop release, and 
then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), simple 
tgz.



this sounds good to me. additionally some new documentation needs to be 
written too.


--
 Sami Siren
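
The script behaviour proposed above (check for a plugins/ dir and create 
it from the *.job file when missing) can be sketched roughly as follows. 
This is only an illustration in Python, not the actual bin/nutch logic: 
it assumes the .job file is a zip-format archive with a top-level 
plugins/ folder, and the name ensure_plugins is hypothetical.

```python
import os
import tempfile
import zipfile


def ensure_plugins(job_path, target_dir, prefix="plugins/"):
    """Unpack `prefix` entries from a job archive unless they already exist.

    Returns the number of files extracted (0 if the directory was present).
    """
    plugins_dir = os.path.join(target_dir, prefix.rstrip("/"))
    if os.path.isdir(plugins_dir):
        return 0  # nothing to do, plugins/ is already there
    extracted = 0
    with zipfile.ZipFile(job_path) as job:
        for name in job.namelist():
            # Skip directory entries and anything outside plugins/.
            if name.startswith(prefix) and not name.endswith("/"):
                job.extract(name, target_dir)
                extracted += 1
    return extracted


# Demo with a throwaway archive standing in for a real *.job file.
work = tempfile.mkdtemp()
job_file = os.path.join(work, "demo.job")
with zipfile.ZipFile(job_file, "w") as z:
    z.writestr("plugins/parse-html/plugin.xml", "<plugin/>")
    z.writestr("lib/some.jar", "")

first_run = ensure_plugins(job_file, work)   # extracts the plugin file
second_run = ensure_plugins(job_file, work)  # plugins/ exists, so a no-op
```

The second call does nothing because plugins/ already exists, which is 
the "offer to create it only when missing" behaviour described in the 
proposal.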



[jira] Resolved: (NUTCH-725) NOTICE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-725.
--

Resolution: Fixed

went through the libs and added copyright notices

 NOTICE.txt is lacking info that should be there
 ---

 Key: NUTCH-725
 URL: https://issues.apache.org/jira/browse/NUTCH-725
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukkas comment from email:
 * The NOTICE.txt file should start with the following lines:
   Apache Nutch
   Copyright 2009 The Apache Software Foundation
 * The NOTICE.txt file should contain the required copyright notices
 from all bundled libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-723.
--

Resolution: Fixed

added licenses of 4rd party software

 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukkas comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (NUTCH-723) LICENCE.txt is lacking info that should be there

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683618#action_12683618
 ] 

Sami Siren edited comment on NUTCH-723 at 3/19/09 2:11 PM:
---

added licenses of 3rd party software

  was (Author: siren):
added licenses of 4rd party software
  
 LICENCE.txt is lacking info that should be there
 

 Key: NUTCH-723
 URL: https://issues.apache.org/jira/browse/NUTCH-723
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.0
Reporter: Sami Siren

 Jukkas comment from email:
 * The LICENSE.txt file should have at least references to the licenses of the 
 bundled libraries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-728) Improve nutch release packaging

2009-03-19 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-728:
-

Attachment: NUTCH-728.patch

Added a simple target to generate a source release tgz from an svn tag.

- did not touch the binary one

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
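
The idea behind the source-release target (a plain tgz of the tagged 
sources with no build artifacts and no .svn metadata) can be sketched 
as below. This is a Python illustration, not the Ant target from the 
attached patch, which is not shown in this thread; the excluded 
directory names and the helper name make_source_tgz are assumptions.

```python
import os
import tarfile
import tempfile

EXCLUDED = {".svn", "build"}  # assumed directory names to leave out


def make_source_tgz(src_dir, out_path, top_name):
    """Write src_dir to a gzipped tar, skipping EXCLUDED directories."""
    def keep(info):
        # info.name is the archive path; ignore the top-level folder itself.
        parts = info.name.split("/")[1:]
        return None if any(p in EXCLUDED for p in parts) else info

    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=top_name, filter=keep)


# Demo on a throwaway tree standing in for a checked-out tag.
tree = tempfile.mkdtemp()
for rel in ("src/java/Foo.java", ".svn/entries", "build/Foo.class"):
    path = os.path.join(tree, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

out = os.path.join(tempfile.mkdtemp(), "nutch-src.tgz")
make_source_tgz(tree, out, "nutch-src")
with tarfile.open(out) as tar:
    names = sorted(tar.getnames())
```

Returning None from the filter drops a directory together with 
everything below it, so the archive ends up equivalent to an svn export 
of the tree.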



[jira] Commented: (NUTCH-722) Nutch contains jars that we cannot redistribute

2009-03-19 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683634#action_12683634
 ] 

Sami Siren commented on NUTCH-722:
--

if there are no objections I will commit this change tomorrow morning (EET)

 Nutch contains jars that we cannot redistribute
 ---

 Key: NUTCH-722
 URL: https://issues.apache.org/jira/browse/NUTCH-722
 Project: Nutch
  Issue Type: Bug
Reporter: Sami Siren
Priority: Blocker
 Fix For: 1.0.0


 It seems that we have some jars (as part of pdf parser) that we cannot 
 redistribute.
 Jukkas comment from email:
 
 The release contains the Java Advanced Imaging libraries (jai_core.jar and 
 jai_codec.jar) which are licensed under Sun's Binary Code License. We can't 
 redistribute those libraries.
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Sami Siren wrote:

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start 
a local job, no optional filesystems etc), the *.job and *.war files 
and scripts. Scripts would check for the presence of plugins/ dir, and 
offer an option to create it from *.job. Assumption here is that this 
should be enough to run full cycle in local mode, and that people who 
want to run a distributed cluster will first install a plain Hadoop 
release, and then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), 
simple tgz.



this sounds good to me. additionally some new documentation needs to be 
written too.




I added a simple patch to NUTCH-728 to make a plain source release from 
svn. What do people think: should we add the plain source package to the 
next rc? I would not like to make changes to the binary package now, but 
propose that we do those changes post 1.0.


--
 Sami Siren


Re: [VOTE] Release Apache Nutch 1.0

2009-03-15 Thread Sami Siren

Grant Ingersoll wrote:
Where's the KEYS file for Nutch? 


hi,

the keys file is at the top level nutch directory (eg: 
http://www.nic.funet.fi/pub/mirrors/apache.org/lucene/nutch/KEYS)



 I don't see it in the tarball and I
don't think Sami's key is on a public server that I am aware of (at 
least not pgp.mit.edu).



http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x0B7E6CFA

--
 Sami Siren





On Mar 11, 2009, at 10:13 AM, Andrzej Bialecki wrote:


Sami Siren wrote:

Hello,
I have packaged the second release candidate for Apache Nutch 1.0 
release at

http://people.apache.org/~siren/nutch-1.0/rc1/






Re: [VOTE] Release Apache Nutch 1.0

2009-03-10 Thread Sami Siren
This vote has been cancelled due to some last minute additions. I will 
post another RC soon.


Sami Siren wrote:

--
Sami Siren

Hello,

I have packaged the first release candidate for Apache Nutch 1.0 
release at


http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents and 
latest changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 



Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least 
three binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!

--
Sami Siren










Re: Nutch ML cleanup

2009-03-10 Thread Sami Siren
Like I suspected: I have no power to do or view any admin stuff there. 
Btw. I am not seeing any spam, perhaps google takes care of that for me?


--
Sami Siren

Sami Siren wrote:
I'll take a look at this, I am pretty sure we have to ask Doug at the 
end :)


--
Sami Siren

Otis Gospodnetic wrote:

Hi,

This has been bugging me for a while now.  For some reason Nutch MLs 
get the most junk emails - both rude/rudeish emails, as well as 
clear spam (with SPAM in the subject - something must be detecting 
it). 
I just looked at the headers of the clearly labeled spam messages and 
found that they all seem to come from SF:


 To: nutch-...@lists.sourceforge.net
 To: nutch-gene...@lists.sourceforge.net

I assume there is some kind of a mail forward from the old Nutch MLs 
on SF to the new Nutch MLs at ASF.

Do you think we could remove this forwarding and get rid of this spam?

Sami & Andrzej seem to be members who might be able to make this 
change:


http://sourceforge.net/project/memberlist.php?group_id=59548

Otis
  






[jira] Resolved: (NUTCH-715) Subcollection plugin doesn't work with default subcollections.xml file

2009-03-10 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-715.
--

Resolution: Fixed

committed, thanks Dmitry!

 Subcollection plugin doesn't work with default subcollections.xml file
 --

 Key: NUTCH-715
 URL: https://issues.apache.org/jira/browse/NUTCH-715
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Assignee: Sami Siren
 Fix For: 1.0.0

 Attachments: NUTCH-715-testcase.patch, 
 NUTCH-715_subcollections_fix.patch


 Subcollection plugin can't parse its configuration file because it contains 
 a top-level comment (ASF notice) and DomUtil doesn't take care of top-level 
 comments

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
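
The kind of failure described in NUTCH-715 can be reproduced with any 
DOM API: when a comment precedes the root element, the first child of 
the document is the comment, not the root. The following is a small 
sketch in Python rather than Nutch's Java; the XML content is a made-up 
stand-in for subcollections.xml, and it does not show the actual 
DomUtil code or the committed fix.

```python
from xml.dom import minidom, Node

# Hypothetical stand-in for a config file with a license header on top.
XML = """<?xml version="1.0"?>
<!-- Licensed to the Apache Software Foundation (ASF) ... -->
<subcollections>
  <subcollection><name>nutch</name></subcollection>
</subcollections>
"""

doc = minidom.parseString(XML)

# Fragile: assumes the first child of the document is the root element.
first = doc.firstChild
# Robust: ask the DOM for the document element directly.
root = doc.documentElement

first_is_comment = first.nodeType == Node.COMMENT_NODE
root_name = root.tagName
```

Code that takes the first child as the root sees the comment node here, 
while the document element is still found regardless of any leading 
comments.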



[VOTE] Release Apache Nutch 1.0

2009-03-10 Thread Sami Siren

Hello,

I have packaged the second release candidate for Apache Nutch 1.0 
release at


http://people.apache.org/~siren/nutch-1.0/rc1/

See the CHANGES.txt[1] file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004


Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least three 
binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] 
*http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=log&pathrev=752004


*--
Sami Siren



Re: [VOTE] Release Apache Nutch 1.0

2009-03-10 Thread Sami Siren

!!!NOTE!!!
There was a faulty link in the message I sent earlier, hopefully I get
it right this time:


Hello,

I have packaged the second release candidate for Apache Nutch 1.0 release at

http://people.apache.org/~siren/nutch-1.0/rc1/

See the CHANGES.txt[1] file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004


Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least three 
binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=log&pathrev=752004


--
Sami Siren


[jira] Commented: (NUTCH-705) parse-rtf plugin

2009-03-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680411#action_12680411
 ] 

Sami Siren commented on NUTCH-705:
--

I think we should start looking at Apache Tika for most (or all) of our parsers.

 parse-rtf plugin
 

 Key: NUTCH-705
 URL: https://issues.apache.org/jira/browse/NUTCH-705
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Priority: Minor
 Fix For: 1.1

 Attachments: NUTCH-705.patch


 Demoting this issue and moving to 1.1 - current patch is not suitable due to 
 LGPL licensed parts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-717) Make Nutch Solr integration easier

2009-03-10 Thread Sami Siren (JIRA)
Make Nutch Solr integration easier
--

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren
 Fix For: 1.1


Erik Hatcher proposed we should provide a full Solr config dir to be used with 
Nutch-Solr. Currently we only provide the index schema. It would be considerably 
easier to set up Nutch-Solr if we provided the whole conf dir that you could use 
with Solr like:

java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Moving Nutch parsers to Tika

2009-03-10 Thread Sami Siren

Andrzej Bialecki wrote:

Hi all,

I've been debating this for a while, too, what Sami suggested in another 
thread: I think we should start looking at Apache Tika for most (or 
all) of our parsers.


This is actually a part of my broader vision for Nutch, that this 
project should not duplicate functionality of other well-established 
projects by re-implementing the same functionality, only poorly - 
because our focus is not on parsers, plugins, mime/charset detection, 
distributed RPC, but on building a robust platform for crawling.


I share that same vision.



We could start working on this particular issue by donating the Nutch 
parsers to Tika, those that are not already present there, and start 
using Tika's parsers in Nutch where it's already possible. Once Tika 
supports all types of parsers that we have, we should switch completely 
to Tika.


I think that the only parser that is totally missing from Tika is swf 
(https://issues.apache.org/jira/browse/TIKA-147). Tika also supports 
some formats that Nutch currently does not (in addition to providing 
more advanced parsing on some formats).


--
 Sami Siren


NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]

2009-03-09 Thread Sami Siren

Doğacan Güney wrote:

On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote:
  

Hello,

I have packaged the first release candidate for Apache Nutch 1.0 release at

http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents and latest
changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

Please vote on releasing this package as Apache Nutch 1.0. The vote is open
for the next 72 hours. Only votes from Lucene PMC members are binding, but
everyone is welcome to check the release candidate and voice their approval
or disapproval. The vote  passes if at least three binding +1 votes are
cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!




That's great!

I would like to see NUTCH-684 in but I guess I was too late :)

Anyway, my non-binding +1.
  


uh, I missed that one, sorry. Do you think it's ready to be included? 
(IMO that's an important feature) It's not a big deal for me to rebuild 
the package with that feature included.


--
Sami Siren



Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]

2009-03-09 Thread Sami Siren

Doğacan Güney wrote:



On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote:



Doğacan Güney wrote:

On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote:
  

Hello,

I have packaged the first release candidate for Apache Nutch 1.0 release at

http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents and latest
changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

Please vote on releasing this package as Apache Nutch 1.0. The vote is open
for the next 72 hours. Only votes from Lucene PMC members are binding, but
everyone is welcome to check the release candidate and voice their approval
or disapproval. The vote  passes if at least three binding +1 votes are
cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!



That's great!

I would like to see NUTCH-684 in but I guess I was too late :)

Anyway, my non-binding +1.
  


uh, I missed that one, sorry. Do you think it's ready to be included? 
(IMO that's an important feature) It's not a big deal for me to 
rebuild the package with that feature included.




I only tested it on a small crawl. Still, I believe it is important 
too so I would like to include it. Worst case we release a 1.0.1 soon 
after:)
I am fine either way. So if you think it's good enough to go in just 
commit it and I'll build another rc. If not then we can release it later 
too when it's ready.


--
Sami Siren





--
 Sami Siren





[VOTE] Release Apache Nutch 1.0

2009-03-08 Thread Sami Siren

Hello,

I have packaged the first release candidate for Apache Nutch 1.0 release at

http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents and latest 
changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480

Please vote on releasing this package as Apache Nutch 1.0. The vote is open for 
the next 72 hours. Only votes from Lucene PMC members are binding, but everyone 
is welcome to check the release candidate and voice their approval or 
disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!

--
Sami Siren








Re: planning for nutch-1.0-rc1

2009-03-05 Thread Sami Siren
I am sure all of you noticed that the release planned to be cut during 
this week was delayed because of a new discovery right before the 
deadline (NUTCH-711). That has now been fixed so it's time to move on. I 
am now going to build the first RC during the weekend.


--
Sami Siren

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me, 
actually I only count the issues assigned to developers as real 
candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)

I am also volunteering to push all open issues to 1.1 before starting 
the RC build on Tuesday. Any objections on the proposed procedure or 
timing?


--
Sami Siren





[jira] Commented: (NUTCH-711) Indexer failing after upgrade to Hadoop 0.19.1

2009-03-04 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12678691#action_12678691
 ] 

Sami Siren commented on NUTCH-711:
--

+1

 Indexer failing after upgrade to Hadoop 0.19.1
 --

 Key: NUTCH-711
 URL: https://issues.apache.org/jira/browse/NUTCH-711
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.0.0

 Attachments: patch.txt


 After upgrade to Hadoop 0.19.1 Reducer is initialized in a different order 
 than before (see http://svn.apache.org/viewvc?view=revrevision=736239). 
 IndexingFilters populate current JobConf with field options that are required 
 for IndexerOutputFormat to function properly. However, the filters are 
 instantiated in Reducer.configure(), which is now called after the 
 OutputFormat is initialized, and not before as previously.
 The workaround for now is to instantiate IndexingFilters once again inside 
 IndexerOutputFormat.  This issue should be revisited before 1.1 in order to 
 find a better solution.
 See this thread for more information: 
 http://www.lucidimagination.com/search/document/7c62c625c7ea17fe/problem_with_crawling_using_the_latest_1_0_trunk
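
The ordering problem described above can be illustrated with a minimal sketch. All class names here (InitOrderSketch, Conf, Filters, OutputFormat) are hypothetical stand-ins made up for illustration; they are not the real Nutch or Hadoop types, and the "field.options" key is an assumption as well.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; these are NOT the real Nutch or Hadoop classes.
final class InitOrderSketch {

    /** Plays the role of the job configuration shared by all components. */
    static final class Conf {
        private final Map<String, String> values = new HashMap<>();
        void set(String key, String value) { values.put(key, value); }
        String get(String key) { return values.get(key); }
    }

    /** Plays the role of the indexing filters: constructing them writes field options into the conf. */
    static final class Filters {
        Filters(Conf conf) { conf.set("field.options", "title:stored"); }
    }

    /** Plays the role of the output format: it reads the field options when it is initialized. */
    static final class OutputFormat {
        final String fieldOptions;
        OutputFormat(Conf conf, boolean reinstantiateFilters) {
            if (reinstantiateFilters) {
                new Filters(conf); // the workaround: populate the conf again right here
            }
            fieldOptions = conf.get("field.options");
        }
    }

    /** New order: the output format is initialized first, the reducer's configure step second. */
    static String runNewOrder(boolean workaround) {
        Conf conf = new Conf();
        OutputFormat out = new OutputFormat(conf, workaround);
        new Filters(conf); // what the reducer's configure step does, now too late for the output format
        return out.fieldOptions;
    }
}
```

Without the workaround the output format sees no field options at all; with it, the options are present regardless of when the reducer is configured.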

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Resolved: (NUTCH-711) Indexer failing after upgrade to Hadoop 0.19.1

2009-03-04 Thread Sami Siren


Alternatively you could create another issue to track the proper fix and 
let this close during the release process.


--
Sami Siren

Andrzej Bialecki (JIRA) wrote:

 [ 
https://issues.apache.org/jira/browse/NUTCH-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-711.
-

Resolution: Fixed

Applied the patch in rev. 750037. I'm not closing this issue, because this 
needs to be solved in a better way after 1.0.

  

Indexer failing after upgrade to Hadoop 0.19.1
--

Key: NUTCH-711
URL: https://issues.apache.org/jira/browse/NUTCH-711
Project: Nutch
 Issue Type: Bug
   Affects Versions: 1.0.0
   Reporter: Andrzej Bialecki 
   Assignee: Andrzej Bialecki 
   Priority: Blocker

Fix For: 1.0.0

Attachments: patch.txt


After upgrade to Hadoop 0.19.1 Reducer is initialized in a different order than 
before (see http://svn.apache.org/viewvc?view=revrevision=736239). 
IndexingFilters populate current JobConf with field options that are required for 
IndexerOutputFormat to function properly. However, the filters are instantiated in 
Reducer.configure(), which is now called after the OutputFormat is initialized, and 
not before as previously.
The workaround for now is to instantiate IndexingFilters once again inside 
IndexerOutputFormat.  This issue should be revisited before 1.1 in order to 
find a better solution.
See this thread for more information: 
http://www.lucidimagination.com/search/document/7c62c625c7ea17fe/problem_with_crawling_using_the_latest_1_0_trunk



  




[jira] Updated: (NUTCH-700) Neko1.9.11 goes into a loop

2009-03-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-700:
-

Fix Version/s: 1.0.0
 Assignee: Sami Siren

This one just bit me - the effect is that parsing hangs forever. I am promoting 
it to be fixed in 1.0.

 Neko1.9.11 goes into a loop
 ---

 Key: NUTCH-700
 URL: https://issues.apache.org/jira/browse/NUTCH-700
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche
Assignee: Sami Siren
Priority: Critical
 Fix For: 1.0.0


 Neko1.9.11 goes into a loop on some documents e.g. 
 http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm
 http://cizel.co.kr/main.php
 reverting to 0.9.4 seems to fix the problem
 The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 
 could be a way to alleviate similar issues
 PS: haven't had time to report to the Neko people yet, will do at some stage

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: planning for nutch-1.0-rc1

2009-03-02 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me, 
actually I only count the issues assigned to developers as real 
candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)


There's one Critical issue reported, related to NekoHTML (NUTCH-700). 
I'm not sure what are the feature differences (pertinent to Nutch) 
between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course of 
action.

I will take care of that.



I am also volunteering to push all open issues to 1.1 before starting 
the RC build on Tuesday. Any objections on the proposed procedure or 
timing?


Sounds good.

great!

--
Sami Siren




[jira] Resolved: (NUTCH-700) Neko1.9.11 goes into a loop

2009-03-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-700.
--

Resolution: Fixed

reverted to 0.9.4

 Neko1.9.11 goes into a loop
 ---

 Key: NUTCH-700
 URL: https://issues.apache.org/jira/browse/NUTCH-700
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: julien nioche
Assignee: Sami Siren
Priority: Critical
 Fix For: 1.0.0


 Neko1.9.11 goes into a loop on some documents e.g. 
 http://mediacet.com/Archive/FourYorkshiremen/bb/post.htm
 http://cizel.co.kr/main.php
 reverting to 0.9.4 seems to fix the problem
 The approach mentioned in https://issues.apache.org/jira/browse/NUTCH-696 
 could be a way to alleviate similar issues
 PS: haven't had time to report to the Neko people yet, will do at some stage

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-669.
--

Resolution: Fixed

replaced fetcher with fetcher2

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?
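
The queue-per-host model mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: HostQueuesSketch and its methods are hypothetical names, not Fetcher2's actual code, and a single fixed crawl delay is assumed for every host.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-host queueing; not the actual Fetcher2 implementation.
final class HostQueuesSketch {
    private final long crawlDelayMs;
    // One FIFO queue per host, kept in insertion order so polling is deterministic.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Earliest time (in ms) at which each host may be fetched again.
    private final Map<String, Long> nextFetchTime = new LinkedHashMap<>();

    HostQueuesSketch(long crawlDelayMs) { this.crawlDelayMs = crawlDelayMs; }

    /** Adds a URL to the queue of its host. */
    void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /** Returns a URL whose host may be fetched at nowMs, or null if every host is empty or waiting. */
    String poll(long nowMs) {
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            Deque<String> q = e.getValue();
            if (q.isEmpty()) continue;
            if (nextFetchTime.getOrDefault(e.getKey(), 0L) <= nowMs) {
                nextFetchTime.put(e.getKey(), nowMs + crawlDelayMs);
                return q.poll();
            }
        }
        return null;
    }
}
```

Because the delay is tracked by the queue owner, the protocol layer does not need to know about politeness at all, which is the delegation of duties the description asks about.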

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Resolved: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-03-02 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren (JIRA) wrote:
 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel 
]


Sami Siren resolved NUTCH-669.
--

Resolution: Fixed

replaced fetcher with fetcher2


I'm puzzled ..  it seemed the goal was to integrate Todd's patch, 
which effectively replaces both Fetchers. Does this mean that Todd's 
version was not ready, or is the current code based on Todd's version?

There was no patch from Todd that I could see; he never provided one even 
after being asked multiple times, first by you in Dec 2008, then by Doğacan 
in Jan 2009, and finally by me last week.


My motivation for getting this fixed was, as I understood most of the 
developers wanted too, to get rid of the burden of supporting two 
classes providing roughly the same functionality. I opened a Jira issue 
for this but closed it soon after, as you told me it was a duplicate 
of this one.


So what I did was replace the original Fetcher with Fetcher2. The Fetcher 
is still there to be improved by Todd and others at will.


--
Sami Siren


Re: Release 1.0?

2009-02-28 Thread Sami Siren

dealmaker wrote:

Hi,
  Is there going to be a delay of the 1.0 release?  Today is almost Feb 28. 
You said that 1.0 will come in Feb.  I am customizing Nutch 0.9, and I am
wondering if I should wait a couple more days for the 1.0 release.

I think that no one else but me made any guesses about the release date? 
(since it is virtually impossible due to the fact that this is not a paid 
project).


The general consensus seems to be that we should get the next release 
out preferably sooner than later. I personally still think that the 
first release candidate is not that far away - we have no blocker issues 
left and it seems (judged by the lack of activity on working with those 
remaining issues) that the ones still there are not too important.


I am going to commit NUTCH-669 soon and after that I am fine with 
starting the release process. Other devs might have different opinions.


--
Sami Siren







Thanks.


Andrzej Bialecki wrote:
  

Marko Bauhardt wrote:


Hi,
Is there anybody out there? ;)
Is there a plan for when version 1.0 will be released?

thanks
marko


On Jan 28, 2009, at 9:45 AM, Marko Bauhardt wrote:

  

Hi all,
Is there a timeline for release 1.0? Currently there are 33 issues 
(9 bugs).
Is there a plan for a feature freeze? Maybe some big issues can be 
moved to version 1.1?

We do exist. ;) We plan to release in February - I can't tell you yet 
when exactly, we need to review the (few) remaining issues that we want 
to resolve before the release.




--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






  




Re: Release 1.0?

2009-02-28 Thread Sami Siren

Sami Siren wrote:
I think that no one else but me made any guesses about the release 
date? (since it is virtually impossible due to fact that this is not a 
paid project).


Andrzej Bialecki wrote:  

We do exist. ;) We plan to release in February - I can't tell you 
yet when exactly, we need to review the (few) remaining issues that 
we want to resolve before the release.

Oh, I now see Andrzej made some as well :)
--
Sami Siren


planning for nutch-1.0-rc1

2009-02-28 Thread Sami Siren
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me, 
actually I only count the issues assigned to developers as real 
candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)

I am also volunteering to push all open issues to 1.1 before starting 
the RC build on Tuesday. Any objections on the proposed procedure or timing?


--
Sami Siren



[jira] Commented: (NUTCH-705) parse-rtf plugin

2009-02-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677508#action_12677508
 ] 

Sami Siren commented on NUTCH-705:
--

I think that the patch contains some LGPL code that we cannot commit into 
the Apache repository.

 parse-rtf plugin
 

 Key: NUTCH-705
 URL: https://issues.apache.org/jira/browse/NUTCH-705
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
 Fix For: 1.0.0

 Attachments: NUTCH-705.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Url regex normalizer

2009-02-27 Thread Sami Siren

Meghna Kukreja wrote:

Thanks Andrzej.

Here is the issue that I created in JIRA:
https://issues.apache.org/jira/browse/NUTCH-706. I have suggested an
alternative regular expression but would appreciate it if someone could
verify this as I am not very good with those :)
  


Perhaps you could write a JUnit test to verify it behaves as expected?
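
Something along these lines might be a starting point. Note that this is only a sketch: the rule below is a made-up example (stripping a jsessionid path parameter), not the expression proposed in NUTCH-706, and RegexRuleCheck is a hypothetical helper rather than part of Nutch. The same checks could be moved into a JUnit test method.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking a single normalization rule in isolation.
final class RegexRuleCheck {
    // Made-up rule, NOT the one from NUTCH-706: remove a ";jsessionid=..." path parameter.
    private static final Pattern SESSION_ID = Pattern.compile("(?i);jsessionid=[^?#]*");

    /** Applies the rule to a URL and returns the normalized form. */
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }
}
```

The idea is to list a handful of input URLs with the output you expect for each, including URLs the rule must leave untouched, and assert on every pair.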

--
Sami Siren



[jira] Resolved: (NUTCH-699) Add an official solr schema for solr integration

2009-02-26 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-699.
--

Resolution: Fixed

committed

 Add an official solr schema for solr integration
 --

 Key: NUTCH-699
 URL: https://issues.apache.org/jira/browse/NUTCH-699
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0


 See Andrzej's comments on NUTCH-684 for more info.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-02-26 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren reassigned NUTCH-669:


Assignee: Sami Siren

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Assignee: Sami Siren
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-703) Upgrade to Hadoop 0.19.1

2009-02-26 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12677266#action_12677266
 ] 

Sami Siren commented on NUTCH-703:
--

Andrzej, are you working with this now?

 Upgrade to Hadoop 0.19.1
 

 Key: NUTCH-703
 URL: https://issues.apache.org/jira/browse/NUTCH-703
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Blocker
 Fix For: 1.0.0


 From release notes: Release 0.19.1 fixes many critical bugs in 0.19.0, 
 including ***some data loss issues***..

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-247) robot parser to restrict.

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-247.
--

Resolution: Fixed
  Assignee: Sami Siren  (was: Dennis Kubes)

committed this

- added checking to F2 (which is soon to be Fetcher)



 robot parser to restrict.
 -

 Key: NUTCH-247
 URL: https://issues.apache.org/jira/browse/NUTCH-247
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.0.0

 Attachments: agent-names.patch, agent-names3.patch.txt


 If the agent name and the robots agents are not properly configured, the Robot 
 rule parser uses LOG.severe to log the problem but also resolves it. 
 Later on, the fetcher thread checks for severe errors and stops if there is one.
 RobotRulesParser:
 if (agents.size() == 0) {
   agents.add(agentName);
   LOG.severe("No agents listed in 'http.robots.agents' property!");
 } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
   agents.add(0, agentName);
   LOG.severe("Agent we advertise (" + agentName
     + ") not listed first in 'http.robots.agents' property!");
 }
 Fetcher.FetcherThread:
 if (LogFormatter.hasLoggedSevere()) // something bad happened
   break;
 I suggest using warn or something similar instead of severe to log this 
 problem.
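
A minimal sketch of the suggested change, using java.util.logging and a warning instead of a severe message. RobotAgentsSketch and normalizeAgents are hypothetical names chosen for illustration; this is not the code that was actually committed for this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch; not the actual RobotRulesParser code.
final class RobotAgentsSketch {
    private static final Logger LOG = Logger.getLogger(RobotAgentsSketch.class.getName());

    /** Returns the agent list with the advertised agent name guaranteed to be first. */
    static List<String> normalizeAgents(String agentName, List<String> configured) {
        List<String> agents = new ArrayList<>(configured);
        if (agents.isEmpty()) {
            agents.add(agentName);
            // The problem is repaired here, so a warning is enough.
            LOG.warning("No agents listed in 'http.robots.agents' property!");
        } else if (!agents.get(0).equalsIgnoreCase(agentName)) {
            agents.add(0, agentName);
            LOG.warning("Agent we advertise (" + agentName
                + ") not listed first in 'http.robots.agents' property!");
        }
        return agents;
    }
}
```

Since the parser already repairs the list itself, nothing here needs to stop the fetcher thread, so a later check for severe log entries would no longer be triggered by this case.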

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-701) replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)
replace Fetcher with Fetcher2
-

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


Currently there are two fetcher implementations within Nutch, one too many. This 
task tracks the process of promoting Fetcher2.

My plan is basically to
- remove Fetcher altogether and rename Fetcher2 to Fetcher
- fix the Crawl class so it works with the F2 API.

If there are no objections I will proceed with this soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-701) Replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-701:
-

Summary: Replace Fetcher with Fetcher2  (was: replace Fetcher with Fetcher2)

 Replace Fetcher with Fetcher2
 -

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


 Currently there are two fetcher implementations within Nutch, one too many. 
 This task tracks the process of promoting Fetcher2.
 My plan is basically to
 - remove Fetcher altogether and rename Fetcher2 to Fetcher
 - fix the Crawl class so it works with the F2 API.
 If there are no objections I will proceed with this soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-698) CrawlDb is corrupted after a few crawl cycles

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-698.
--

Resolution: Fixed

committed. thanks guys

 CrawlDb is corrupted after a few crawl cycles
 -

 Key: NUTCH-698
 URL: https://issues.apache.org/jira/browse/NUTCH-698
 Project: Nutch
  Issue Type: Bug
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-698_v1.patch


 After change to hadoop's MapWritable, crawldb becomes corrupted after some 
 fetch cycles. For more details see this discussion thread:
 http://www.nabble.com/Fetcher2-crashes-with-current-trunk-td21978049.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-699) Add an official solr schema for solr integration

2009-02-24 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12676233#action_12676233
 ] 

Sami Siren commented on NUTCH-699:
--

We could put it under conf/ ?

 Add an official solr schema for solr integration
 --

 Key: NUTCH-699
 URL: https://issues.apache.org/jira/browse/NUTCH-699
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0


 See Andrzej's comments on NUTCH-684 for more info.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-701) Replace Fetcher with Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-701.
--

Resolution: Duplicate

 Replace Fetcher with Fetcher2
 -

 Key: NUTCH-701
 URL: https://issues.apache.org/jira/browse/NUTCH-701
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Sami Siren
Assignee: Sami Siren
 Fix For: 1.0.0


 Currently there are two fetcher implementations within Nutch, one too many. 
 This task tracks the process of promoting Fetcher2.
 My plan is basically to
 - remove Fetcher altogether and rename Fetcher2 to Fetcher
 - fix the Crawl class so it works with the F2 API.
 If there are no objections I will proceed with this soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-02-24 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated NUTCH-669:
-

Fix Version/s: (was: 1.1)
   1.0.0

Moving this back to 1.0

Are you close with your patch? As discussed in this thread we should just 
replace Fetcher with Fetcher2, change the Crawl class and check that the tests 
pass. Other issues we can deal with in their own tickets.

I can also help with this if you don't have the time.



 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-694) Distributed Search Server fails

2009-02-22 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren resolved NUTCH-694.
--

Resolution: Fixed

Committed. Thanks for testing it.

 Distributed Search Server fails
 ---

 Key: NUTCH-694
 URL: https://issues.apache.org/jira/browse/NUTCH-694
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
 Environment: Single Server with one Nutch instance in 
 DistributedSearchServerMode, not in PseudoDistributedMode
Reporter: Dr. Nadine Hochstotter
Assignee: Sami Siren
Priority: Blocker
 Fix For: 1.0.0

 Attachments: NUTCH-694-2.patch, NUTCH-694.patch


 I run Nutch on a single server with two crawl directories, which is why I 
 use Nutch in distributed search server mode as described in the Hadoop 
 manual.
 But since I updated to a new trunk version (04.02.2009) it fails. Local search on 
 one index works fine, but distributed search throws the following exception:
 In catalina.out (server)
 2009-02-18 17:08:14,906 ERROR NutchBean - org.apache.hadoop.ipc.RemoteException: java.io.IOException: Unknown Protocol classname:org.apache.nutch.searcher.RPCSegmentBean
     at org.apache.nutch.searcher.NutchBean.getProtocolVersion(NutchBean.java:403)
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
     at org.apache.hadoop.ipc.Client.call(Client.java:696)
     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
     at $Proxy4.getProtocolVersion(Unknown Source)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
     at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
     at org.apache.nutch.searcher.DistributedSegmentBean.init(DistributedSegmentBean.java:103)
     at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:111)
     at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:80)
     at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:422)
     at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
     at org.apache.catalina.core.StandardContext.start(StandardContext.java:4350)
     at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3099)
     at org.apache.catalina.manager.ManagerServlet.reload(ManagerServlet.java:913)
     at org.apache.catalina.manager.HTMLManagerServlet.reload(HTMLManagerServlet.java:536)
     at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:114)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
     at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
     at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:269)
     at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:81)
     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
     at java.lang.Thread.run(Thread.java:619)
 And in Hadoop.log:
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server handler 48 on 13001: starting
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server handler 49 on 13001: starting
 2009-02-18 17:07:52,847 INFO  ipc.Server - IPC Server handler 40 on 13001: starting
 2009-02-18 17

[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2009-02-22 Thread Sami Siren (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675793#action_12675793 ]

Sami Siren commented on NUTCH-477:
--

It's your call.

IMO the whole URLFilters/URLFilter and URLNormalizers/URLNormalizer arrangement 
is a bit too complex as it is now. We can make it cleaner, but it's probably 
not worth the trouble pre-1.0.



 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
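
 As a rough illustration of the two proposed changes, the sketch below uses a
 per-scope filter chain and an int return code that lets a filter stop the
 chain early. All names here (ScopedUrlFilter, ScopedUrlFilters, the three
 codes, the "default" scope) are hypothetical and are not taken from
 urlfilters.patch; treating undecided URLs as accepted is also an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical filter contract with an int result instead of a String/null. */
interface ScopedUrlFilter {
    int CONTINUE = 0; // no decision, hand the URL to the next filter
    int ACCEPT = 1;   // accept the URL and stop the chain
    int REJECT = 2;   // reject the URL and stop the chain

    int filter(String url);
}

/** Hypothetical holder of one filter chain per scope (context of execution). */
class ScopedUrlFilters {
    static final String DEFAULT_SCOPE = "default";

    private final Map<String, List<ScopedUrlFilter>> chains = new HashMap<>();

    void setChain(String scope, List<ScopedUrlFilter> chain) {
        chains.put(scope, chain);
    }

    /** Runs the chain for the scope, falling back to the default chain. */
    boolean accept(String scope, String url) {
        List<ScopedUrlFilter> fallback =
            chains.getOrDefault(DEFAULT_SCOPE, Collections.<ScopedUrlFilter>emptyList());
        List<ScopedUrlFilter> chain = chains.getOrDefault(scope, fallback);
        for (ScopedUrlFilter f : chain) {
            int code = f.filter(url);
            if (code == ScopedUrlFilter.ACCEPT) {
                return true;  // early termination: later filters are skipped
            }
            if (code == ScopedUrlFilter.REJECT) {
                return false; // early termination: later filters are skipped
            }
        }
        return true; // assumption: a URL no filter decided on is accepted
    }
}
```

 The point of the int code is that an expensive filter placed late in a long
 chain never runs for URLs that an earlier, cheaper filter already decided on.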



