[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-04-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255502#comment-13255502
 ] 

Markus Jelsma commented on NUTCH-585:
-

This issue is not going to be part of Nutch 1.5, which is likely to be released 
very soon. However, you can download the patch and see if it works for you; the 
plugin builds fine for 1.4, 1.5 and the to-be 1.6-SNAPSHOT.

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factorizing the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any 
 expression of interest - or to an explanation of why what we are doing is 
 plain wrong!
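A minimal sketch of the idea, operating on the raw HTML string rather than on the DOM that the attached patches work on; the marker strings and class name here are illustrative only:

```java
import java.util.regex.Pattern;

public class IgnoreRegionStripper {
    // Hypothetical marker strings; the proposal is to make these configurable
    // (parser.html.ignore.start / parser.html.ignore.stop in nutch-site.xml).
    private static final String START = "<!-- START-IGNORE -->";
    private static final String STOP  = "<!-- STOP-IGNORE -->";

    // Remove everything between the start and stop markers, inclusive.
    private static final Pattern IGNORED =
        Pattern.compile(Pattern.quote(START) + ".*?" + Pattern.quote(STOP),
                        Pattern.DOTALL);

    public static String strip(String html) {
        return IGNORED.matcher(html).replaceAll("");
    }
}
```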

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-04-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255528#comment-13255528
 ] 

Markus Jelsma commented on NUTCH-585:
-

You should take the latest patch: blacklist_whitelist_plugin.patch. It contains 
example config etc. Please let us know if you get it to work.





[jira] [Commented] (NUTCH-1339) Default URL normalization rules to remove page anchors completely

2012-04-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255936#comment-13255936
 ] 

Markus Jelsma commented on NUTCH-1339:
--

The anchor is still removed by the BasicURLNormalizer. We worked around the 
problem for the AJAXNormalizer by simply changing the normalizer order. First 
the AJAXNormalizer and then everything else. But, when indexing, first do the 
BasicNormalizer (if enabled) and only then the AJAXNormalizer.
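For context, the fragment-stripping part of such a rule reduces to the following; this is a plain-Java equivalent of a regex-normalize.xml rule, not the actual normalizer code:

```java
public class AnchorStrip {
    // Drop the page anchor entirely: everything from the first '#' onward.
    public static String normalize(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }
}
```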


 Default URL normalization rules to remove page anchors completely
 -

 Key: NUTCH-1339
 URL: https://issues.apache.org/jira/browse/NUTCH-1339
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 1.6
Reporter: Sebastian Nagel
 Attachments: NUTCH-1339-2.patch, NUTCH-1339.patch


 The default rules of URLNormalizerRegex remove the anchor up to the first
 occurrence of ? or &. The remaining part of the anchor is kept,
 which may cause a large, possibly infinite number of outlinks when the same
 document is fetched again and again with different URLs;
 see http://www.mail-archive.com/user%40nutch.apache.org/msg05940.html
 Parameters in inner-page anchors are a common practice in AJAX web sites.
 Currently, crawling AJAX content is not supported (NUTCH-1323).





[jira] [Commented] (NUTCH-1234) Upgrade to Tika 1.1

2012-04-02 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244147#comment-13244147
 ] 

Markus Jelsma commented on NUTCH-1234:
--

Excellent! I'll remember next time!
Thanks :)

 Upgrade to Tika 1.1
 ---

 Key: NUTCH-1234
 URL: https://issues.apache.org/jira/browse/NUTCH-1234
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1234-1.5-1.patch








[jira] [Commented] (NUTCH-1321) IDNNormalizer

2012-03-30 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242219#comment-13242219
 ] 

Markus Jelsma commented on NUTCH-1321:
--

...or, we could do a toUnicode for outlinks or directly in the fetcher. This 
also makes sense because as ASCII these URL's are longer, sometimes much 
longer. This can cause trouble for filters that partly rely on string length. 
If both conversions are implemented in the fetcher or protocol library then we 
don't have to worry about it, and we get better logging in the fetcher!



 IDNNormalizer
 -

 Key: NUTCH-1321
 URL: https://issues.apache.org/jira/browse/NUTCH-1321
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma

 Right now, IDN's are indexed as ASCII. An IDNNormalizer is to be used with an 
 indexer so it will encode ASCII URL's to their proper Unicode equivalent.
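Java's built-in java.net.IDN class already provides both conversions, so a hypothetical IDNNormalizer could be little more than a wrapper around it. A sketch, not the actual plugin:

```java
import java.net.IDN;

public class IdnDemo {
    // Convert the punycode (ASCII) host form stored in the CrawlDB back to
    // its Unicode form for indexing.
    public static String hostToUnicode(String asciiHost) {
        return IDN.toUnicode(asciiHost);
    }

    public static void main(String[] args) {
        String unicode = "例子.測試";          // IANA IDN test domain
        String ascii = IDN.toASCII(unicode);   // the xn--... form
        System.out.println(ascii + " -> " + hostToUnicode(ascii));
    }
}
```
Note that IDN converts host names, not full URLs; a real normalizer would have to parse the URL and convert only the host part.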





[jira] [Commented] (NUTCH-1234) Upgrade to Tika 1.1

2012-03-30 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242239#comment-13242239
 ] 

Markus Jelsma commented on NUTCH-1234:
--

Julien or Chris, can either of you check this out? I'm wasting time and gaining 
frustration! I cannot get it to work :)





[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-29 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241090#comment-13241090
 ] 

Markus Jelsma commented on NUTCH-1024:
--

I'll change the legacy sys.out to logging. HttpHeaders doesn't have Text 
representations of the strings, but I'll be happy to add them if you want.

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, 
 adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config
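As an illustration, such a key\tvalue configuration file might look like this; the column layout (MIME type, then a fetchInterval in seconds) is an assumption based on the description, not the committed format:

```
# MIME type <tab> fetchInterval in seconds (layout is an assumption)
text/html	86400
application/pdf	2592000
```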





[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-03-29 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241225#comment-13241225
 ] 

Markus Jelsma commented on NUTCH-1320:
--

Somewhere down the line IDN's enter the CrawlDB in ASCII, so there is no problem 
there, but these tools lack conversion. The filter and normalizer checker tools 
would also benefit. This also suggests the need for an IDNNormalizer that does 
toUnicode when indexing; you don't want http://xn--*/ URL's in your index.

 IndexChecker and ParseChecker choke on IDN's
 

 Key: NUTCH-1320
 URL: https://issues.apache.org/jira/browse/NUTCH-1320
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1320-1.5-1.patch


 These handy debug tools do not handle IDN's and throw an NPE
 bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
 {code}





[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-29 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241229#comment-13241229
 ] 

Markus Jelsma commented on NUTCH-1024:
--

I'll fix the logging, this is old code. The inc and dec rate directives are 
already in nutch-default but the mime-file and the file itself are missing.

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, Nutch.patch, adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config





[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-28 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240430#comment-13240430
 ] 

Markus Jelsma commented on NUTCH-1024:
--

Thoughts? I'd like to send this one in.





[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

2012-03-27 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239418#comment-13239418
 ] 

Markus Jelsma commented on NUTCH-1300:
--

I think a scope "index" makes sense. It would make building a two-way 
normalizer a bit easier. Command-line options can be added, but you can use the 
-D option as well.

 Indexer to normalize URL's
 --

 Key: NUTCH-1300
 URL: https://issues.apache.org/jira/browse/NUTCH-1300
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1300-1.5-1.patch


 Indexers should be able to normalize URL's. This is useful when a new 
 normalizer is applied to the entire CrawlDB. Without it, some or all records 
 in a segment cannot be indexed at all.





[jira] [Commented] (NUTCH-1317) Max content length by MIME-type

2012-03-20 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233970#comment-13233970
 ] 

Markus Jelsma commented on NUTCH-1317:
--

I am not sure about the root of the problem. We only use Tika for parsing PDF 
and (X)HTML and rely on Boilerpipe. Some HTML pages are quite a thing, full of 
stuff or endless tables. You'll press page down over a hundred times to scroll 
to the bottom. I've not tested all bad URL's, but I think Tika does the job 
eventually; if not, I'll file a ticket. Most I tested work, given enough time.
HTML pages that take more than one second to parse are considered bad; it 
should be less than 50ms on average. Those that are bad usually contain too 
many elements and are large in size.

 Max content length by MIME-type
 ---

 Key: NUTCH-1317
 URL: https://issues.apache.org/jira/browse/NUTCH-1317
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 The good old http.content.length directive is not sufficient in large 
 internet crawls. For example, a 5MB PDF file may be parsed without issues but 
 a 5MB HTML file may time out.





[jira] [Commented] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?

2012-03-19 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232914#comment-13232914
 ] 

Markus Jelsma commented on NUTCH-1315:
--

Speculative task execution is enabled by default but the fetch and index jobs 
disable them. We have disabled speculative execution altogether at some point 
only because we need those slots to be free for other jobs.

Should extended OutputFormats take care of this? It isn't clear in MapRed's 
API docs whether this is a problem. The name parameter is to be unique for the 
task's part of the output for the entire job, which it is.

Wouldn't including a task ID in the output name cause a mess in the final 
output?

In the meantime I would indeed disable speculative execution. In my opinion 
and experience with Nutch and other jobs it's not really worth it. It takes 
empty slots that you can use for other jobs, and if there are no other jobs it 
still takes additional CPU cycles, RAM and disk I/O for a few seconds. I 
must add that our network is homogeneous (fallacy) and all nodes have almost 
equal load.
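For reference, disabling speculative execution cluster-wide uses the standard Hadoop 1.x properties in mapred-site.xml:

```xml
<!-- Disable speculative execution for both map and reduce tasks
     (Hadoop 1.x property names). -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```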

 reduce speculation on but ParseOutputFormat doesn't name output files 
 correctly?
 

 Key: NUTCH-1315
 URL: https://issues.apache.org/jira/browse/NUTCH-1315
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: ubuntu 64bit, hadoop 1.0.1, 3 Node Cluster, segment size 
 1.5M urls
Reporter: Rafael
  Labels: hadoop, hdfs

 From time to time the Reducer log contains the following and one tasktracker 
 gets blacklisted.
 org.apache.hadoop.ipc.RemoteException: 
 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
 create file 
 /user/test/crawl/segments/20120316065507/parse_text/part-1/data for 
 DFSClient_attempt_201203151054_0028_r_01_1 on client xx.x.xx.xx.10, 
 because this file is already being created by 
 DFSClient_attempt_201203151054_0028_r_01_0 on xx.xx.xx.9
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1404)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1244)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1186)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:628)
   at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
   at org.apache.hadoop.ipc.Client.call(Client.java:1066)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
   at $Proxy2.create(Unknown Source)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
   at $Proxy2.create(Unknown Source)
   at 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3245)
   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
   at 
 org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)
   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
   at 
 org.apache.hadoop.io.SequenceFile$RecordCompressWriter.init(SequenceFile.java:1132)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
   at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:157)
   at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:134)
   at org.apache.hadoop.io.MapFile$Writer.init(MapFile.java:92)
   at 
 

[jira] [Commented] (NUTCH-1311) Add response headers to datastore for the protocol-httpclient plugin

2012-03-16 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231172#comment-13231172
 ] 

Markus Jelsma commented on NUTCH-1311:
--

Hi, HTTP response headers are available as Content metadata in trunk.

 Add response headers to datastore for the protocol-httpclient plugin
 -

 Key: NUTCH-1311
 URL: https://issues.apache.org/jira/browse/NUTCH-1311
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Dan Rosher
Priority: Minor
 Fix For: nutchgora

 Attachments: NUTCH-1311.patch


 Response headers need to be added to the page so that they end up in the 
 datastore for this plugin.





[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-03-16 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231368#comment-13231368
 ] 

Markus Jelsma commented on NUTCH-1314:
--

This should then also work for the Tika parser and the OutlinkExtractor, I 
think. Parse-html is similar to parse-tika: if there are no outlinks obtained 
by getOutlinks in DOMContentUtils, then the OutlinkExtractor is used.

 Impose a limit on the length of outlink target urls
 ---

 Key: NUTCH-1314
 URL: https://issues.apache.org/jira/browse/NUTCH-1314
 Project: Nutch
  Issue Type: Improvement
Reporter: Ferdy Galema
 Attachments: NUTCH-1314.patch


 In the past we have encountered situations where crawling specific broken 
 sites resulted in ridiculously long urls that caused the stalling of tasks. 
 The regex plugins (normalizing/filtering) processed single urls for hours, if 
 not indefinitely hanging.
 My suggestion is to limit the outlink url target length as soon as possible. 
 It is a configurable limit; the default is 3000. This should be reasonably 
 long enough for most uses, but sufficiently strict to make sure regex 
 plugins do not choke on urls that are too long. Please see the attached patch 
 for the Nutchgora implementation.
 I'd like to hear what you think about this.
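The core of the proposed check is tiny; a sketch, with class and method names that are illustrative rather than taken from the patch:

```java
public class OutlinkLengthFilter {
    // Proposed default cap on outlink target length; configurable in the patch.
    private static final int MAX_URL_LENGTH = 3000;

    // Returns null for over-long targets so they are dropped before the
    // regex normalizers/filters ever see them.
    public static String filter(String url) {
        return (url != null && url.length() <= MAX_URL_LENGTH) ? url : null;
    }
}
```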





[jira] [Commented] (NUTCH-1310) Nutch to send HTTP-accept header

2012-03-14 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229254#comment-13229254
 ] 

Markus Jelsma commented on NUTCH-1310:
--

Any idea on how to resolve this? Suggestions for code location and header value?

 Nutch to send HTTP-accept header
 

 Key: NUTCH-1310
 URL: https://issues.apache.org/jira/browse/NUTCH-1310
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 Nutch does not send an HTTP-accept header with its requests. This is usually 
 not a problem, but some firewalls do not like it and will reject the request.
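A sketch of what sending the header could look like with plain HttpURLConnection; the Accept value shown is an assumption modeled on common browser defaults, and the actual value and code location are exactly what this issue asks about:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class AcceptHeaderDemo {
    // Prepare a request with an Accept header set; no connection is made yet.
    // The header value is an assumption, not a value from Nutch.
    public static HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            return conn;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```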





[jira] [Commented] (NUTCH-1310) Nutch to send HTTP-accept header

2012-03-14 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229295#comment-13229295
 ] 

Markus Jelsma commented on NUTCH-1310:
--

Ah, yes, that should work out just fine. Thanks for pointing me to it!





[jira] [Commented] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225209#comment-13225209
 ] 

Markus Jelsma commented on NUTCH-1305:
--

Thanks Lewis.

 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.





[jira] [Commented] (NUTCH-1282) linkdb scalability

2012-03-03 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221598#comment-13221598
 ] 

Markus Jelsma commented on NUTCH-1282:
--

There is an issue for that. In my opinion, with that issue implemented the 
current linkdb can be deprecated. Please check NUTCH-1181 if you have a patch 
for this.

 linkdb scalability
 --

 Key: NUTCH-1282
 URL: https://issues.apache.org/jira/browse/NUTCH-1282
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 1.4
Reporter: behnam nikbakht

 As described in NUTCH-1054, the linkdb is optional in solrindex; its 
 usage is only for anchors, with no impact on scoring. 
 As it seems, the size of the linkdb in an incremental crawl grows very fast, 
 making it unscalable for huge web sites.
 So there are two choices: one, drop invertlinks and the linkdb from the 
 crawl; two, make it scalable.
 invertlinks runs 2 jobs: the first constructs a new linkdb from newly 
 parsed segments, and the second merges the new linkdb with the old one. The 
 second job is unscalable, and we can skip it with these changes in solrIndex:
 in the class IndexerMapReduce, reduce method, if fetchDatum == null or 
 dbDatum == null or parseText == null or parseData == null, then add the 
 anchor to the doc and update solr (no insert).
 Some changes to NutchDocument are also required here.





[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-03-01 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220080#comment-13220080
 ] 

Markus Jelsma commented on NUTCH-1258:
--

The patch won't apply, as patch complains about it being malformed. Also, the 
Writable class is not imported for some reason. It seems to work. Want me to 
commit?

 MoreIndexingFilter should be able to read Content-Type from both parse 
 metadata and content metadata
 

 Key: NUTCH-1258
 URL: https://issues.apache.org/jira/browse/NUTCH-1258
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1258-1.5-1.patch, NUTCH-1258-v2.patch


 The MoreIndexingFilter reads the Content-Type from parse metadata. However, 
 this usually contains a lot of crap because web developers can set it to 
 anything they like. The filter must be able to read the Content-Type field 
 from content metadata as well because that contains the type detected by 
 Tika's Detector.
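One way to express that fallback; plain maps stand in for Nutch's Metadata class, and this is a sketch of the intended behavior, not the filter's actual code:

```java
import java.util.Map;

public class ContentTypeResolver {
    // Prefer the parse metadata value, but fall back to content metadata,
    // which holds the type detected by Tika.
    public static String resolve(Map<String, String> parseMeta,
                                 Map<String, String> contentMeta) {
        String type = parseMeta.get("Content-Type");
        if (type == null || type.isEmpty()) {
            type = contentMeta.get("Content-Type");
        }
        return type;
    }
}
```
The real filter would presumably make the preferred source configurable rather than hard-coding the fallback order shown here.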





[jira] [Commented] (NUTCH-945) Indexing to multiple SOLR Servers

2012-02-29 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219079#comment-13219079
 ] 

Markus Jelsma commented on NUTCH-945:
-

Perhaps good to know for other readers, the patches submitted by Sujit are for 
the Nutch Gora branch.

 Indexing to multiple SOLR Servers
 -

 Key: NUTCH-945
 URL: https://issues.apache.org/jira/browse/NUTCH-945
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Charan Malemarpuram
 Attachments: MurmurHashPartitioner.java, 
 NonPartitioningPartitioner.java, patch-NUTCH-945.txt


 It would be nice to have a default indexer in Nutch which can submit docs to 
 multiple SOLR servers.
  Partitioning is always the question when writing to multiple SOLR servers.
  Default partitioning can be a simple hashcode-based distribution with 
  additional hooks for customization.
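
The simple hashcode-based distribution mentioned above can be sketched as follows; the class and method names are illustrative, not the attached partitioners:

```java
// Sketch of hashcode-based distribution: route each document to one of
// N Solr servers by hashing its id. Math.floorMod keeps the result
// non-negative even when hashCode() is negative.
public class SolrShardChooser {
    static int choose(String docId, int numServers) {
        return Math.floorMod(docId.hashCode(), numServers);
    }

    public static void main(String[] args) {
        // the same id always maps to the same server
        System.out.println(choose("http://example.org/page", 3));
    }
}
```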
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

2012-02-27 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217157#comment-13217157
 ] 

Markus Jelsma commented on NUTCH-1289:
--

In trunk records of the same queue end up in the same fetch list which 
corresponds to a single mapper.

 In distributed mode URL's are not partitioned
 -

 Key: NUTCH-1289
 URL: https://issues.apache.org/jira/browse/NUTCH-1289
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: nutchgora
Reporter: Dan Rosher
 Fix For: nutchgora

 Attachments: NUTCH-1289.patch


 In distributed mode URLs are not partitioned to a specific machine, which 
 means the politeness policy is voided

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

2012-02-24 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215507#comment-13215507
 ] 

Markus Jelsma commented on NUTCH-965:
-

has this been fixed now?

 Skip parsing for truncated documents
 

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, 
 NUTCH-965-v3-trunk.txt, parserJob.patch


 The issue you're likely to run into when parsing truncated FLV files is 
 described here:
 http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
 The parser library gets stuck in an infinite loop as it encounters corrupted 
 data caused by, for example, big binary files being truncated at fetch time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

2012-02-24 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215561#comment-13215561
 ] 

Markus Jelsma commented on NUTCH-965:
-

Hi Ferdy,

With a parsing fetcher on trunk we see the ParseStatus.success counter rarely 
being incremented. A test crawl successfully fetches 10,000 records but the 
success counter hovers around 15 records. Most, if not all, fetched pages are 
well below the truncation threshold.

Cheers

 Skip parsing for truncated documents
 

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, 
 NUTCH-965-v3-trunk.txt, parserJob.patch


 The issue you're likely to run into when parsing truncated FLV files is 
 described here:
 http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
 The parser library gets stuck in an infinite loop as it encounters corrupted 
 data caused by, for example, big binary files being truncated at fetch time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-965) Skip parsing for truncated documents

2012-02-24 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215585#comment-13215585
 ] 

Markus Jelsma commented on NUTCH-965:
-

Hmm, cleaning and rebuilding the job fixes that issue here. Please ignore :)

 Skip parsing for truncated documents
 

 Key: NUTCH-965
 URL: https://issues.apache.org/jira/browse/NUTCH-965
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Alexis
Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5

 Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt, 
 NUTCH-965-v3-trunk.txt, parserJob.patch


 The issue you're likely to run into when parsing truncated FLV files is 
 described here:
 http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
 The parser library gets stuck in an infinite loop as it encounters corrupted 
 data caused by, for example, big binary files being truncated at fetch time.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1283) Ridically update all Solr configuration in Nutchgora

2012-02-20 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211732#comment-13211732
 ] 

Markus Jelsma commented on NUTCH-1283:
--

1.4 is the schema version of Solr 3.5. It is up to date.

 Ridically update all Solr configuration in Nutchgora
 

 Key: NUTCH-1283
 URL: https://issues.apache.org/jira/browse/NUTCH-1283
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora


 We're currently running with a schema which states it's 1.4 :0| There should 
 be better support for the newer stuff going on over in Solr-land. This issue 
 should track those improvements entirely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-02-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210339#comment-13210339
 ] 

Markus Jelsma commented on NUTCH-1246:
--

hmm, the jackson dep is still there but it should be removed as it is properly 
included with the deps of 1.0.0.

 Upgrade to Hadoop 1.0.0
 ---

 Key: NUTCH-1246
 URL: https://issues.apache.org/jira/browse/NUTCH-1246
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.5
Reporter: Julien Nioche



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter

2012-02-16 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209436#comment-13209436
 ] 

Markus Jelsma commented on NUTCH-1210:
--

Sure, will include a sample file and change ivy's include path. That property 
is also not included for the regular domainfilter. I don't think users are 
likely to change the value.

 DomainBlacklistFilter
 -

 Key: NUTCH-1210
 URL: https://issues.apache.org/jira/browse/NUTCH-1210
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1210-1.5-1.patch


 The current DomainFilter acts as a white list. We also need a filter that 
 acts as a black list so we can allow TLDs and/or domains with DomainFilter 
 but blacklist specific subdomains. If we would patch the current DomainFilter 
 for this behaviour it would break current semantics such as its precedence. 
 Therefore I would propose a new filter instead.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2012-02-15 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208309#comment-13208309
 ] 

Markus Jelsma commented on NUTCH-1129:
--

This is a parser plugin, right? How will this work if we, for example, would like 
to parse microdata with Any23 and use Tika's BoilerpipeContentHandler for 
extraction? In the current BP patch we use multiple content handlers to parse 
all in one go, so I wonder if this could be implemented as such.

Please correct me when wrong :)

 Any23 Nutch plugin
 --

 Key: NUTCH-1129
 URL: https://issues.apache.org/jira/browse/NUTCH-1129
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1129.patch


 This plugin should build on the Any23 library to provide us with a plugin 
 which extracts RDF data from HTTP and file resources. Although as of writing 
 Any23 is not part of the ASF, the project is working towards integration into 
 the Apache Incubator. Once the project proves its value, this would be an 
 excellent addition to the Nutch 1.X codebase. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-02-15 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208461#comment-13208461
 ] 

Markus Jelsma commented on NUTCH-1262:
--

Is this issue still subject to debate? Opinions?

 Map `duplicating` content-types to a single type
 

 Key: NUTCH-1262
 URL: https://issues.apache.org/jira/browse/NUTCH-1262
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1262-1.5-1.patch


 Similar or duplicating content-types can end-up differently in an index. 
 With, for example, both application/xhtml+xml and text/html it is impossible 
 to use a single filter to select `web pages`.
 See also: 
 http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html
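
 The mapping described above can be sketched with a small alias table; the 
 table entries and class name below are illustrative assumptions, not the 
 attached patch:

```java
import java.util.Map;

// Sketch: map `duplicating` content-types onto one canonical type so a
// single filter query matches all `web pages`. The alias table is an
// illustrative assumption, not the patch's actual configuration.
public class ContentTypeMapper {
    static final Map<String, String> ALIASES = Map.of(
            "application/xhtml+xml", "text/html");

    static String canonical(String type) {
        return ALIASES.getOrDefault(type, type);
    }

    public static void main(String[] args) {
        System.out.println(canonical("application/xhtml+xml"));
    }
}
```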

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) Store detected content type in crawldatum metadata

2012-02-14 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207746#comment-13207746
 ] 

Markus Jelsma commented on NUTCH-1259:
--

Splendid work my friend! The fetcher runs smoothly again! I'll check out your 
patch for NUTCH-1258 this week.
But what about segments fetched with and without this new feature and the 
db.parsemeta.to.crawldb=Content-Type property?

I assume I'd have to update the segments fetched before this change with the 
property enabled, and update the segments fetched with this feature without the 
db.parsemeta.to.crawldb property.


 Store detected content type in crawldatum metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-11 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206142#comment-13206142
 ] 

Markus Jelsma commented on NUTCH-1259:
--

Great. Thanks!

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205413#comment-13205413
 ] 

Markus Jelsma commented on NUTCH-1259:
--

Sounds good! We already store the Content-Type in the CrawlDatum's metadata for 
NUTCH-1024 via db.parsemeta.to.crawldb. Wouldn't it be better to store it in 
the CrawlDatum object itself, just like the signature? Then someone cannot 
override it by accident.

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205548#comment-13205548
 ] 

Markus Jelsma commented on NUTCH-1259:
--

NUTCH-1024 relies on the Content-Type being added to crawldatum metadata via 
db.parsemeta.to.crawldb.

Anyway, i agree. Will you open another issue?

have a nice weekend :)


 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-09 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204556#comment-13204556
 ] 

Markus Jelsma commented on NUTCH-1259:
--

Hi,

Consider the following URL that produces bad output. This URL is not the only 
one producing bad output; we've seen countless examples that produce funky 
values in both content meta and parse meta, or no value at all.

http://kam.mff.cuni.cz/conferences/GraDR/

The current Nutch trunk shows us the following meta data for this URL obtained 
via parsechecker with only parse-tika enabled:

{code}
Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 
14:37:47 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip 
Content-Location=index.html.bak Content-Type=application/x-trash 
Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) 
mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 
OpenSSL/0.9.8g 
Parse Metadata: Content-Encoding=ISO-8859-1
{code}

It's an application/x-trash according to content meta and no data is available 
in parse meta. But it's just an ordinary HTML page. This cannot be right; from 
an indexing point of view we will never know that this is an HTML page. With this 
patch enabled we get the following output:

{code}
Content Metadata: Vary=negotiate,accept,Accept-Encoding Date=Thu, 09 Feb 2012 
14:40:15 GMT Content-Length=4911 TCN=choice Content-Encoding=gzip 
Content-Location=index.html.bak Content-Type=application/x-trash 
Connection=close Accept-Ranges=bytes Server=Apache/2.2.9 (Debian) 
mod_auth_kerb/5.3 PHP/5.2.6-1+lenny14 with Suhosin-Patch mod_ssl/2.2.9 
OpenSSL/0.9.8g 
Parse Metadata: Content-Encoding=ISO-8859-1 Content-Type=text/html
{code}

For us, this solves all problems, as we now rely only on Tika's MIME detector 
and store its result in parse meta. The value in content meta cannot be trusted. 
It's the same as with languages: when we do not use Tika to detect the language 
we get all sorts of crap.

Since the upgrade to Tika 1.0 and with NUTCH-1230 we obtain the detected 
MIME-type but it's not added to the parse meta. Now it is.

Do you have another suggestion? 

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203488#comment-13203488
 ] 

Markus Jelsma commented on NUTCH-1269:
--

It won't apply to trunk; all hunks fail. Anyway, this issue looks like 
NUTCH-1074. Segment sizes are uniform and the correct number of records per 
queue ends up in a segment. I think this duplicates NUTCH-1074, which was fixed 
for 1.4. Which Nutch version are you using, Behnam?

 Generate main problems
 --

 Key: NUTCH-1269
 URL: https://issues.apache.org/jira/browse/NUTCH-1269
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: Generate, MaxHostCount, MaxNumSegments
 Attachments: NUTCH-1269.patch


 there are some problems with the current Generate method, with the maxNumSegments 
 and maxHostCount options:
 1. sizes of generated segments are different
 2. with the maxHostCount option, it is unclear whether it was applied or not
 3. urls from one host are distributed non-uniformly between segments
 we change Generator.java as described below:
 in the Selector class:
 private int maxNumSegments;
 private int segmentSize;
 private int maxHostCount;
 public void config
 ...
   maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
   segmentSize = job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
   maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
 ...
 public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
     OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
     throws IOException {
   int limit2 = (limit * 3) / 2;
   while (values.hasNext()) {
     if (count == limit)
       break;
     if (count % segmentSize == 0) {
       if (currentsegmentnum < maxNumSegments - 1) {
         currentsegmentnum++;
       } else {
         currentsegmentnum = 0;
       }
     }
     boolean full = true;
     for (int jk = 0; jk < maxNumSegments; jk++) {
       if (segCounts[jk] < segmentSize) {
         full = false;
       }
     }
     if (full) {
       break;
     }
     SelectorEntry entry = values.next();
     Text url = entry.url;
     //logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
     String urlString = url.toString();
     URL u = null;
     String hostordomain = null;
     try {
       if (normalise && normalizers != null) {
         urlString = normalizers.normalize(urlString,
             URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
       }
       u = new URL(urlString);
       if (byDomain) {
         hostordomain = URLUtil.getDomainName(u);
       } else {
         hostordomain = new URL(urlString).getHost();
       }
       hostordomain = hostordomain.toLowerCase();
       boolean countLimit = true;
       // only filter if we are counting hosts or domains
       int[] hostCount = hostCounts.get(hostordomain);
       // host count {a,b,c,d} means this host has a urls in segment 0,
       // b urls in segment 1, and so on
       if (hostCount == null) {
         hostCount = new int[maxNumSegments];
         for (int kl = 0; kl < hostCount.length; kl++)
           hostCount[kl] = 0;
         hostCounts.put(hostordomain, hostCount);
       }
       // select the segment holding the fewest urls from this host
       int selectedSeg = currentsegmentnum;
       int minCount = hostCount[selectedSeg];
       for (int jk = 0; jk < maxNumSegments; jk++) {
         if (hostCount[jk] < minCount) {
           minCount = hostCount[jk];
           selectedSeg = jk;
         }
       }
       if (hostCount[selectedSeg] <= maxHostCount) {
         count++;
         entry.segnum = new IntWritable(selectedSeg);
         hostCount[selectedSeg]++;
         output.collect(key, entry);
       }
     } catch (Exception e) {
       LOG.warn("Malformed URL: '" + urlString + "', skipping ("
           + StringUtils.stringifyException(e) + ")");
       //logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
       //continue;
     }
   }
 }
 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203501#comment-13203501
 ] 

Markus Jelsma commented on NUTCH-1269:
--

Ah, yes, i understand now. Your patch is an attempt to spread the host (or 
domain) limit over all generated segments. Interesting. Can you provide a patch 
that works with trunk and has this feature enabled via configuration?

 Generate main problems
 --

 Key: NUTCH-1269
 URL: https://issues.apache.org/jira/browse/NUTCH-1269
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: Generate, MaxHostCount, MaxNumSegments
 Attachments: NUTCH-1269.patch


 there are some problems with the current Generate method, with the maxNumSegments 
 and maxHostCount options:
 1. sizes of generated segments are different
 2. with the maxHostCount option, it is unclear whether it was applied or not
 3. urls from one host are distributed non-uniformly between segments
 we change Generator.java as described below:
 in the Selector class:
 private int maxNumSegments;
 private int segmentSize;
 private int maxHostCount;
 public void config
 ...
   maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
   segmentSize = job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
   maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
 ...
 public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
     OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
     throws IOException {
   int limit2 = (limit * 3) / 2;
   while (values.hasNext()) {
     if (count == limit)
       break;
     if (count % segmentSize == 0) {
       if (currentsegmentnum < maxNumSegments - 1) {
         currentsegmentnum++;
       } else {
         currentsegmentnum = 0;
       }
     }
     boolean full = true;
     for (int jk = 0; jk < maxNumSegments; jk++) {
       if (segCounts[jk] < segmentSize) {
         full = false;
       }
     }
     if (full) {
       break;
     }
     SelectorEntry entry = values.next();
     Text url = entry.url;
     //logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
     String urlString = url.toString();
     URL u = null;
     String hostordomain = null;
     try {
       if (normalise && normalizers != null) {
         urlString = normalizers.normalize(urlString,
             URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
       }
       u = new URL(urlString);
       if (byDomain) {
         hostordomain = URLUtil.getDomainName(u);
       } else {
         hostordomain = new URL(urlString).getHost();
       }
       hostordomain = hostordomain.toLowerCase();
       boolean countLimit = true;
       // only filter if we are counting hosts or domains
       int[] hostCount = hostCounts.get(hostordomain);
       // host count {a,b,c,d} means this host has a urls in segment 0,
       // b urls in segment 1, and so on
       if (hostCount == null) {
         hostCount = new int[maxNumSegments];
         for (int kl = 0; kl < hostCount.length; kl++)
           hostCount[kl] = 0;
         hostCounts.put(hostordomain, hostCount);
       }
       // select the segment holding the fewest urls from this host
       int selectedSeg = currentsegmentnum;
       int minCount = hostCount[selectedSeg];
       for (int jk = 0; jk < maxNumSegments; jk++) {
         if (hostCount[jk] < minCount) {
           minCount = hostCount[jk];
           selectedSeg = jk;
         }
       }
       if (hostCount[selectedSeg] <= maxHostCount) {
         count++;
         entry.segnum = new IntWritable(selectedSeg);
         hostCount[selectedSeg]++;
         output.collect(key, entry);
       }
     } catch (Exception e) {
       LOG.warn("Malformed URL: '" + urlString + "', skipping ("
           + StringUtils.stringifyException(e) + ")");
       //logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
       //continue;
     }
   }
 }
 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-07 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202219#comment-13202219
 ] 

Markus Jelsma commented on NUTCH-1005:
--

i'll commit this one shortly if there are no objections
thanks

 Index headings plugin
 -

 Key: NUTCH-1005
 URL: https://issues.apache.org/jira/browse/NUTCH-1005
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, 
 NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch, 
 NUTCH-1005-1.5-5.patch


 Very simple plugin for extracting and indexing a comma separated list of 
 headings via the headings configuration directive.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1266) Subcollection to optionally write to configured fields

2012-02-07 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202220#comment-13202220
 ] 

Markus Jelsma commented on NUTCH-1266:
--

comments? 

 Subcollection to optionally write to configured fields
 --

 Key: NUTCH-1266
 URL: https://issues.apache.org/jira/browse/NUTCH-1266
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1266-1.5-1.patch


 The subcollection plugin writes the contents of the name element of a given 
 subcollection to the subcollection field. There are cases in which writing to 
 fields other than subcollection is useful.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1210) DomainBlacklistFilter

2012-02-07 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202221#comment-13202221
 ] 

Markus Jelsma commented on NUTCH-1210:
--

I'll send this one in if there are no objections.

 DomainBlacklistFilter
 -

 Key: NUTCH-1210
 URL: https://issues.apache.org/jira/browse/NUTCH-1210
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1210-1.5-1.patch


 The current DomainFilter acts as a white list. We also need a filter that 
 acts as a black list so we can allow TLDs and/or domains with DomainFilter 
 but blacklist specific subdomains. If we patched the current DomainFilter 
 for this behaviour it would break current semantics such as its precedence. 
 Therefore I would propose a new filter instead.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1266) Subcollection to optionally write to configured fields

2012-02-07 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202378#comment-13202378
 ] 

Markus Jelsma commented on NUTCH-1266:
--

I'll commit this one in a few hours unless there are objections.

 Subcollection to optionally write to configured fields
 --

 Key: NUTCH-1266
 URL: https://issues.apache.org/jira/browse/NUTCH-1266
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1266-1.5-1.patch


 The subcollection plugin writes the contents of the name element of a given 
 subcollection to the subcollection field. There are cases in which writing to 
 fields other than subcollection is useful.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-07 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202460#comment-13202460
 ] 

Markus Jelsma commented on NUTCH-1259:
--

I'll comment on it myself then: the code above fixes the issue and adds a 
proper content-type to parsemeta. Consider the following URL with a very bad 
content-type:

http://kam.mff.cuni.cz/conferences/GraDR/

I'll upload a patch in a minute that sets the detected content type in the 
metadata instead.

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-07 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202483#comment-13202483
 ] 

Markus Jelsma commented on NUTCH-1259:
--

You're right. But since you're the only person reviewing most of the time, and 
this issue has your attention now, what is your opinion on this problem? ;)

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1259-1.5-1.patch


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1264) Configurable indexing plugin (index-extra)

2012-02-06 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201303#comment-13201303
 ] 

Markus Jelsma commented on NUTCH-1264:
--

+1

Didn't manage to test last week but it works like a charm now! I'll upload a 
headings plugin without indexing that works with this plugin. 

 Configurable indexing plugin (index-extra) 
 ---

 Key: NUTCH-1264
 URL: https://issues.apache.org/jira/browse/NUTCH-1264
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.5
Reporter: Julien Nioche
 Attachments: NUTCH-1264-trunk.patch


 We currently have several plugins already distributed or proposed which do 
 very comparable things : 
 - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
 index them
 - headings [NUTCH-1005] to generate headings fields in parse-metadata and 
 index them
 - index-extra [NUTCH-422] to index configurable fields 
 - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks 
 and index them
 - index-static [NUTCH-940] to generate configurable static fields 
 All these plugins have in common that they extract information from 
 various sources and generate fields from them, and are largely redundant. 
 Instead, this issue proposes a single plugin that generates 
 configurable fields from : 
 - static values
 - parse metadata
 - content metadata
 - crawldb metadata
 and let the other plugins focus on the parsing and extraction of the values 
 to index. This will make the addition of new fields simpler by relying on a 
 stable common plugin instead of multiplying the code in various plugins.
 This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] 
 and will serve as a basis for further improvements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1264) Configurable indexing plugin (index-metadata)

2012-02-06 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13201392#comment-13201392
 ] 

Markus Jelsma commented on NUTCH-1264:
--

Works fine!

 Configurable indexing plugin (index-metadata) 
 --

 Key: NUTCH-1264
 URL: https://issues.apache.org/jira/browse/NUTCH-1264
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.5
Reporter: Julien Nioche
 Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch


 We currently have several plugins already distributed or proposed which do 
 very comparable things : 
 - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and 
 index them
 - headings [NUTCH-1005] to generate headings fields in parse-metadata and 
 index them
 - index-extra [NUTCH-422] to index configurable fields 
 - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks 
 and index them
 - index-static [NUTCH-940] to generate configurable static fields 
 All these plugins have in common that they extract information from 
 various sources and generate fields from them, and are largely redundant. 
 Instead, this issue proposes a single plugin that generates 
 configurable fields from : 
 - static values
 - parse metadata
 - content metadata
 - crawldb metadata
 and let the other plugins focus on the parsing and extraction of the values 
 to index. This will make the addition of new fields simpler by relying on a 
 stable common plugin instead of multiplying the code in various plugins.
 This plugin will replace index-extra [NUTCH-422] and will serve as a basis 
 for further improvements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197870#comment-13197870
 ] 

Markus Jelsma commented on NUTCH-1005:
--

Hi!

Don't you mean:
{code}
parse.getData().getParseMeta().set(headings[i], heading.trim());
{code}

That still works well with the indexfilter when testing via indexchecker.

 Index headings plugin
 -

 Key: NUTCH-1005
 URL: https://issues.apache.org/jira/browse/NUTCH-1005
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, 
 NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch


 Very simple plugin for extracting and indexing a comma separated list of 
 headings via the headings configuration directive.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197886#comment-13197886
 ] 

Markus Jelsma commented on NUTCH-1005:
--

Yes, I'll give it a shot this week. Your patch can index fields from content, 
parse and db metadata, which replaces the indexing filter of this headings 
plugin. I assume I have to disable the indexing filter of this plugin but keep 
the parse filter, since your patch does not do any parsing, right?

 Index headings plugin
 -

 Key: NUTCH-1005
 URL: https://issues.apache.org/jira/browse/NUTCH-1005
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, 
 NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch


 Very simple plugin for extracting and indexing a comma separated list of 
 headings via the headings configuration directive.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197894#comment-13197894
 ] 

Markus Jelsma commented on NUTCH-1005:
--

index-meta comes to mind! That's exactly what it does, right?

I'll try the patch with the headings indexing filter disabled and, if the 
results are good, will provide a new patch without the indexing filter extension.

 Index headings plugin
 -

 Key: NUTCH-1005
 URL: https://issues.apache.org/jira/browse/NUTCH-1005
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, 
 NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch, NUTCH-1005-1.5-4.patch


 Very simple plugin for extracting and indexing a comma separated list of 
 headings via the headings configuration directive.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1256) WebGraph to dump host + score

2012-01-31 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196830#comment-13196830
 ] 

Markus Jelsma commented on NUTCH-1256:
--

I'll commit this one today unless there are objections.

 WebGraph to dump host + score
 -

 Key: NUTCH-1256
 URL: https://issues.apache.org/jira/browse/NUTCH-1256
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1256-1.5-1.patch


 WebGraph's NodeDumper tool can dump url,score information but a 
 host|domain,score output can also be put to good use. This is likely to 
 require a new MapReduce job as the NodeDumper's anatomy is not suited to 
 return max or summed scores. Code could also be merged with the tool.
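As a rough sketch of the proposed host|domain,score dump (names here are illustrative, not the actual NodeDumper code), the reduce side would boil down to grouping url scores by host and keeping the max:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class HostScore {
    // Reduce (url, score) pairs to the max score per host -- roughly what a
    // host-level NodeDumper job would emit. Names are illustrative only.
    static Map<String, Float> maxScorePerHost(Map<String, Float> urlScores) {
        Map<String, Float> perHost = new HashMap<>();
        for (Map.Entry<String, Float> e : urlScores.entrySet()) {
            // keep the maximum score seen for each host
            perHost.merge(hostOf(e.getKey()), e.getValue(), Math::max);
        }
        return perHost;
    }

    static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new HashMap<>();
        scores.put("http://example.com/a", 0.5f);
        scores.put("http://example.com/b", 1.5f);
        scores.put("http://other.org/x", 0.2f);
        // example.com keeps its highest score, 1.5
        System.out.println(maxScorePerHost(scores));
    }
}
```

Summed scores would be the same loop with Float::sum instead of Math::max.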

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment

2012-01-31 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196962#comment-13196962
 ] 

Markus Jelsma commented on NUTCH-1242:
--

Yes it should. Thanks! It's now changed to equalsIgnoreCase!

 Allow disabling of URL Filters in ParseSegment
 --

 Key: NUTCH-1242
 URL: https://issues.apache.org/jira/browse/NUTCH-1242
 Project: Nutch
  Issue Type: Improvement
Reporter: Edward Drapkin
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1242-1.5-1.patch, ParseSegment.patch, 
 parseoutputformat.patch, trunk.patch


 Right now, the ParseSegment job does not allow you to disable URL filtration. 
  For reasons that aren't worth explaining, I need to do this, so I enabled 
 this behavior through the use of a boolean configuration value 
 parse.filter.urls which defaults to true.
 I've attached a simple, preliminary patch that enables this behavior with 
 that configuration option.  I'm not sure if it should be made a command line 
 option or not.
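A minimal sketch of the proposed switch, using java.util.Properties in place of Hadoop's Configuration (the property name parse.filter.urls comes from the patch; the helper itself is hypothetical):

```java
import java.util.Properties;

public class ParseFilterConfig {
    // Read parse.filter.urls with a default of true, mirroring the way
    // Hadoop's Configuration.getBoolean(key, true) would behave.
    static boolean filterUrls(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty("parse.filter.urls", "true"));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // unset: URL filtering stays enabled by default
        System.out.println(filterUrls(conf));
        conf.setProperty("parse.filter.urls", "false");
        // explicitly disabled for ParseSegment
        System.out.println(filterUrls(conf));
    }
}
```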

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1256) WebGraph to dump host + score

2012-01-30 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196033#comment-13196033
 ] 

Markus Jelsma commented on NUTCH-1256:
--

Comments?

 WebGraph to dump host + score
 -

 Key: NUTCH-1256
 URL: https://issues.apache.org/jira/browse/NUTCH-1256
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1256-1.5-1.patch


 WebGraph's NodeDumper tool can dump url,score information but a 
 host|domain,score output can also be put to good use. This is likely to 
 require a new MapReduce job as the NodeDumper's anatomy is not suited to 
 return max or summed scores. Code could also be merged with the tool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-01-30 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13196034#comment-13196034
 ] 

Markus Jelsma commented on NUTCH-1259:
--

Comments, please.

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13192992#comment-13192992
 ] 

Markus Jelsma commented on NUTCH-1258:
--

Comments? Tested and things work as expected, tests pass. I'll commit shortly 
unless there are objections.

 MoreIndexingFilter should be able to read Content-Type from both parse 
 metadata and content metadata
 

 Key: NUTCH-1258
 URL: https://issues.apache.org/jira/browse/NUTCH-1258
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1258-1.5-1.patch


 The MoreIndexingFilter reads the Content-Type from parse metadata. However, 
 this usually contains a lot of crap because web developers can set it to 
 anything they like. The filter must be able to read the Content-Type field 
 from content metadata as well because that contains the type detected by 
 Tika's Detector.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13193001#comment-13193001
 ] 

Markus Jelsma commented on NUTCH-1258:
--

That may be a good idea indeed but we need to extend it too. This patch fixes 
some issues with bad content-types but it seems the problem is bigger. The 
example URL [1] doesn't provide any Content-Type in ParseMeta, and a bad 
Content-Type in ContentMeta: application/x-trash, which is found in the HTTP 
response header. However, parserchecker (and indexchecker) both show 
contentType: text/html at the top but this value is not added to any metadata 
AFAIK. In this case only contentType = content.getContentType(); returns the 
desired Content-Type.

Any idea how we can get a hold on that value when we have an instance of 
ParseData in the MoreIndexingFilter? 

[1]: http://kam.mff.cuni.cz/conferences/GraDR/
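The fallback this issue asks for could look roughly like this; plain maps stand in for Nutch's Metadata class, and the helper name is made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class ContentTypeLookup {
    // Sketch of the proposed behaviour: prefer the parse-metadata
    // Content-Type, but fall back to content metadata when it is missing.
    static String contentType(Map<String, String> parseMeta,
                              Map<String, String> contentMeta) {
        String type = parseMeta.get("Content-Type");
        if (type == null || type.isEmpty()) {
            type = contentMeta.get("Content-Type");
        }
        return type;
    }

    public static void main(String[] args) {
        // like the example URL: nothing in ParseMeta, only ContentMeta is set
        Map<String, String> parseMeta = new HashMap<>();
        Map<String, String> contentMeta = new HashMap<>();
        contentMeta.put("Content-Type", "text/html");
        System.out.println(contentType(parseMeta, contentMeta));
    }
}
```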

 MoreIndexingFilter should be able to read Content-Type from both parse 
 metadata and content metadata
 

 Key: NUTCH-1258
 URL: https://issues.apache.org/jira/browse/NUTCH-1258
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1258-1.5-1.patch


 The MoreIndexingFilter reads the Content-Type from parse metadata. However, 
 this usually contains a lot of crap because web developers can set it to 
 anything they like. The filter must be able to read the Content-Type field 
 from content metadata as well because that contains the type detected by 
 Tika's Detector.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13193019#comment-13193019
 ] 

Markus Jelsma commented on NUTCH-1258:
--

Ah, the Content-Type detected by Tika is never added to ParseMeta in the first 
place! I've modified TikaParser with nutchMetadata.add("Content-Type", 
mimeType);. In cases where at first I had a bad Content-Type in ParseMeta (but 
a good one in ContentMeta) I now have good old text/html. The problem is with 
Content-Types already added to the metadata by the parser. In that case both 
the good and bad Content-Types are present in ParseMeta.

Just as commented in the code we now have a problem with multi values fields.

{code}
// populate Nutch metadata with Tika metadata
String[] TikaMDNames = tikamd.names();
for (String tikaMDName : TikaMDNames) {
  if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
    continue;
  // TODO what if multivalued?
  nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
}
{code}

This needs another issue opened but some comments are more than appreciated 
first.

Thanks

 MoreIndexingFilter should be able to read Content-Type from both parse 
 metadata and content metadata
 

 Key: NUTCH-1258
 URL: https://issues.apache.org/jira/browse/NUTCH-1258
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1258-1.5-1.patch


 The MoreIndexingFilter reads the Content-Type from parse metadata. However, 
 this usually contains a lot of crap because web developers can set it to 
 anything they like. The filter must be able to read the Content-Type field 
 from content metadata as well because that contains the type detected by 
 Tika's Detector.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13193030#comment-13193030
 ] 

Markus Jelsma commented on NUTCH-1259:
--

A solution would be to prevent the type from being added, just like what is 
already being done with the title field. Now a reliable Content-Type value is 
added to the ParseMetaData.

{code}
// populate Nutch metadata with Tika metadata
String[] TikaMDNames = tikamd.names();
for (String tikaMDName : TikaMDNames) {
  if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
    continue;

  // DO NOT ADD Content-Type FROM HTTP HEADERS, ONLY ADD THE DETECTED TYPE,
  // SEE https://issues.apache.org/jira/browse/NUTCH-1259
  if (tikaMDName.equalsIgnoreCase(Metadata.CONTENT_TYPE))
    continue;

  // TODO what if multivalued?
  nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
}
// Only add the detected type
nutchMetadata.add("Content-Type", mimeType);
{code}

 TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata
 --

 Key: NUTCH-1259
 URL: https://issues.apache.org/jira/browse/NUTCH-1259
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 The MIME-type detected by Tika's Detect() API is never added to a Parse's 
 ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end 
 up in the documents. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187884#comment-13187884
 ] 

Markus Jelsma commented on NUTCH-1201:
--

Hi Edward,

I've already modified Fetcher to allow for different Fetcher impls via 
configuration that inherit from Fetcher itself. It works fine and I can 
override the methods I need. However, it may not be that elegant. There's no 
code to use other queue impls. I'll cook a patch tomorrow.

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 For certain cases we need to modify parts in FetcherThread and make it 
 pluggable. This introduces a new config directive, fetcher.impl, that takes a 
 FQCN and is used in Fetcher.fetch to load a class for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and an inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.
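A self-contained sketch of the fetcher.impl idea, assuming only that the configured FQCN must extend a Fetcher base class; the class names below are illustrative, not the actual Nutch code:

```java
public class FetcherLoader {
    // Stand-in for Nutch's Fetcher; purely illustrative.
    public static class Fetcher {
        public String fetch() { return "default fetch"; }
    }

    public static class CustomFetcher extends Fetcher {
        @Override public String fetch() { return "custom fetch"; }
    }

    // Resolve a fetcher.impl-style FQCN: load the class reflectively and
    // verify it extends Fetcher before instantiating it.
    static Fetcher load(String fqcn) {
        try {
            Class<?> clazz = Class.forName(fqcn);
            if (!Fetcher.class.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(fqcn + " does not extend Fetcher");
            }
            return (Fetcher) clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("cannot load fetcher impl " + fqcn, e);
        }
    }

    public static void main(String[] args) {
        Fetcher f = load("FetcherLoader$CustomFetcher");
        System.out.println(f.fetch()); // prints "custom fetch"
    }
}
```

The actual patch would pass the loaded class to job.setMapRunnerClass() instead of instantiating it directly.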

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-01-17 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188095#comment-13188095
 ] 

Markus Jelsma commented on NUTCH-1251:
--

Can you provide a patch for trunk?

 Deletion of duplicates fails with 
 org.apache.solr.client.solrj.SolrServerException
 --

 Key: NUTCH-1251
 URL: https://issues.apache.org/jira/browse/NUTCH-1251
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
 Environment: Any crawl where the number of URLs in Solr exceeds 1024 
 (the default max number of clauses in a Lucene boolean query).  
Reporter: Arkadi Kosmynin
Priority: Critical
 Fix For: 1.5


 Deletion of duplicates fails. This happens because the get all query used 
 to get the Solr index size is id:[* TO *], which is a range query. Lucene 
 tries to expand it to a Boolean query and gets as many clauses as there are 
 ids in the index. This is too many in a real situation and it throws an 
 exception. 
 To correct this problem, change the get all query (SOLR_GET_ALL_QUERY) to 
 \*:\*, which is the standard Solr get all query.
 Indexing log extract:
 java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error 
 executing query
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
 query
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
   at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
   ... 3 more
 Caused by: org.apache.solr.common.SolrException: Internal Server Error
 Internal Server Error
 request: http://localhost:8081/arch/select?q=id:[* TO 
 *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
   ... 5 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-14 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186177#comment-13186177
 ] 

Markus Jelsma commented on NUTCH-1247:
--

Lewis, we're seeing many URLs with a high retry value. When the value is 
greater than 127 it turns negative. This is in itself not a problem but it 
seems in my setup it will continue to increase.
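The wrap-around is plain Java byte arithmetic; this standalone snippet reproduces the -128/-127 values from the CrawlDbReader log lines (the helper is illustrative, not CrawlDatum code):

```java
public class RetryOverflow {
    // Simulate CrawlDatum's byte retry counter being incremented n times.
    static byte incrementRetries(byte retries, int times) {
        for (int i = 0; i < times; i++) retries++;
        return retries;
    }

    public static void main(String[] args) {
        // incrementing past Byte.MAX_VALUE (127) wraps to negative values,
        // matching the "retry -128" / "retry -127" CrawlDbReader output
        System.out.println(incrementRetries((byte) 127, 1)); // -128
        System.out.println(incrementRetries((byte) 127, 2)); // -127
    }
}
```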

Andrzej, there may indeed be something wrong. Might this be related to 
NUTCH-1245 then? There seems to be something wrong with the following 
CrawlDBReducer code:

{code}
case CrawlDatum.STATUS_FETCH_RETRY: // temporary failure
  if (oldSet) {
    result.setSignature(old.getSignature()); // use old signature
  }
  result = schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime());
  if (result.getRetriesSinceFetch() < retryMax) {
    result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  } else {
    result.setStatus(CrawlDatum.STATUS_DB_GONE);
  }
  break;
{code}

In setPageRetrySchedule() the number of retries is always incremented. This 
causes records with exceptions such as UnknownHostException to be refetched in 
each segment. This makes sense because the first segment in our cycle has many 
more exceptions than average.

Do you follow?
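The negative retry values quoted below are ordinary Java byte wrap-around; a minimal standalone sketch (class name is hypothetical, this is not Nutch code):

```java
public class RetryByteOverflow {
    public static void main(String[] args) {
        byte retries = 127;          // Byte.MAX_VALUE, the widest positive value a byte can hold
        retries++;                   // one more retry increment wraps around...
        System.out.println(retries); // prints -128
        retries++;
        System.out.println(retries); // prints -127
    }
}
```

Widening the counter to an int, as this issue proposes, removes the wrap-around at 127.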

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-14 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186217#comment-13186217
 ] 

Markus Jelsma commented on NUTCH-1247:
--

Alright, then I think this must be related to NUTCH-1245. In that case the 
record is set to DB_GONE but generated anyway so this counter would continue to 
increase forever.

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-01-13 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185499#comment-13185499
 ] 

Markus Jelsma commented on NUTCH-1246:
--

This means the Jackson dependency can be removed again, as it is now fixed in 1.0.0:

* HADOOP-7461. Fix to add jackson dependency to hadoop pom.

 Upgrade to Hadoop 1.0.0
 ---

 Key: NUTCH-1246
 URL: https://issues.apache.org/jira/browse/NUTCH-1246
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.5
Reporter: Julien Nioche







[jira] [Commented] (NUTCH-1177) Generator to select on retry interval

2012-01-13 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185560#comment-13185560
 ] 

Markus Jelsma commented on NUTCH-1177:
--

I'll commit this one if there are no objections.

 Generator to select on retry interval
 -

 Key: NUTCH-1177
 URL: https://issues.apache.org/jira/browse/NUTCH-1177
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1177-1.5-1.patch


 The generator already has a mechanism to select entries with a score larger 
 than a specified threshold, but it should also have a means to select entries 
 with a retry interval lower than a value specified by a configuration option.
 Such a feature is particularly useful when dealing with overly large CrawlDBs 
 where you still want a crawl to fetch rapidly changing URLs first.
 This issue should also add the missing generate.min.score configuration to 
 nutch-default.





[jira] [Commented] (NUTCH-1247) CrawlDatum.retries should be int

2012-01-13 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185826#comment-13185826
 ] 

Markus Jelsma commented on NUTCH-1247:
--

Hints and thoughts are much appreciated, messing with CrawlDatum is pretty 
invasive.

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.5


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1





[jira] [Commented] (NUTCH-1244) CrawlDBDumper to filter by regex

2012-01-09 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182530#comment-13182530
 ] 

Markus Jelsma commented on NUTCH-1244:
--

I'll commit shortly if there are no objections.

 CrawlDBDumper to filter by regex
 

 Key: NUTCH-1244
 URL: https://issues.apache.org/jira/browse/NUTCH-1244
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1244-1.5-1.patch, NUTCH-1244-1.5-2.patch


 The CrawlDBDumper tool should be able to filter records by an optional 
 regular expression.





[jira] [Commented] (NUTCH-1244) CrawlDBDumper to filter by regex

2012-01-05 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180442#comment-13180442
 ] 

Markus Jelsma commented on NUTCH-1244:
--

Almost; this doesn't allow for creating mini-CrawlDBs using that feature. 
Perhaps a -format crawldb option and setting a MapFileOutputFormat would do the trick.

 CrawlDBDumper to filter by regex
 

 Key: NUTCH-1244
 URL: https://issues.apache.org/jira/browse/NUTCH-1244
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1244-1.5-1.patch


 The CrawlDBDumper tool should be able to filter records by an option regular 
 expression.





[jira] [Commented] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-01-05 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180515#comment-13180515
 ] 

Markus Jelsma commented on NUTCH-1245:
--

Thanks! This must be the same issue as NUTCH-578, but it is marked as related for now. 
Can you provide a patch?

 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
 and is generated over and over again
 

 Key: NUTCH-1245
 URL: https://issues.apache.org/jira/browse/NUTCH-1245
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel

 A document gone with 404 after db.fetch.interval.max (90 days) has passed
 is fetched over and over again: although the fetch status is fetch_gone,
 its status in the CrawlDb stays db_unfetched. Consequently, this document
 will be generated and fetched in every cycle from now on.
 To reproduce:
 # create a CrawlDatum in the CrawlDb whose retry interval hits 
 db.fetch.interval.max (I manipulated the shouldFetch() in 
 AbstractFetchSchedule to achieve this)
 # now this URL is fetched again
 # but when updating CrawlDb with the fetch_gone the CrawlDatum is reset to 
 db_unfetched, the retry interval is fixed to 0.9 * db.fetch.interval.max (81 
 days)
 # this does not change with every generate-fetch-update cycle, here for two 
 segments:
 {noformat}
 /tmp/testcrawl/segments/20120105161430
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:14:21 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:14:48 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 /tmp/testcrawl/segments/20120105161631
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:16:23 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:20:05 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 {noformat}
 As far as I can see it's caused by setPageGoneSchedule() in 
 AbstractFetchSchedule. Some pseudo-code:
 {code}
 setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
 datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * 
 maxInterval
 datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
 if (maxInterval < datum.fetchInterval) // necessarily true
forceRefetch()
 forceRefetch:
 if (datum.fetchInterval > maxInterval) // true because it's 1.35 * 
 maxInterval
datum.fetchInterval = 0.9 * maxInterval
 datum.status = db_unfetched // 
 shouldFetch (called from generate / Generator.map):
 if ((datum.fetchTime - curTime) > maxInterval)
// always true if the crawler is launched in short intervals
// (lower than 0.35 * maxInterval)
datum.fetchTime = curTime // forces a refetch
 {code}
 After setPageGoneSchedule is called via update the state is db_unfetched and 
 the retry interval 0.9 * db.fetch.interval.max (81 days). 
 Although the fetch time in the CrawlDb is far in the future
 {noformat}
 % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
 URL: http://localhost/page_gone
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Sun May 06 05:20:05 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Score: 1.0
 Signature: null
 Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
 {noformat}
 the URL is generated again because (fetch time - current time) is larger than 
 db.fetch.interval.max.
 The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 
 times db.fetch.interval.max, and the fetch time is always close to current 
 time + 1.35 * db.fetch.interval.max.
 It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
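 The oscillation can be reproduced outside Nutch with a small standalone
 simulation of the pseudo-code above (class name and the use of long seconds
 are illustrative assumptions, not Nutch code):

```java
public class GoneScheduleOscillation {
    static final long MAX_INTERVAL = 7776000; // db.fetch.interval.max: 90 days in seconds

    public static void main(String[] args) {
        long interval = Math.round(0.9 * MAX_INTERVAL); // 6998400 s (81 days), as in the dumps
        for (int cycle = 1; cycle <= 3; cycle++) {
            // setPageGoneSchedule: grow the interval by 1.5x -> 1.35 * maxInterval
            interval = Math.round(1.5 * interval);
            if (MAX_INTERVAL < interval) {              // necessarily true
                // forceRefetch: clamp back down and mark db_unfetched
                interval = Math.round(0.9 * MAX_INTERVAL);
            }
            System.out.println("cycle " + cycle + ": " + interval + " s");
        }
        // every cycle ends back at 6998400 s (81 days); the intermediate value
        // is 10497600 s (1.35 * max), so the interval oscillates between the two
    }
}
```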


[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records

2012-01-04 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179374#comment-13179374
 ] 

Markus Jelsma commented on NUTCH-1241:
--

Yes, of course, but it is not user-friendly. You cannot search for product in 
http://host/product/123 in a user-friendly manner. Also, using a Matcher would 
slightly boost performance.

 CrawlDBScanner should also be able to find records
 --

 Key: NUTCH-1241
 URL: https://issues.apache.org/jira/browse/NUTCH-1241
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5


 The CrawlDBScanner cannot find partial matches because it uses 
 String.matches(). Instead, it should be able to use Matcher.find() to find 
 partial matches. Right now the regex http will never match any records. It can 
 then also reuse a compiled pattern.
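 The matches()/find() difference is easy to demonstrate in isolation
 (a standalone sketch, not the CrawlDBScanner code itself):

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    public static void main(String[] args) {
        String url = "http://host/product/123";
        // String.matches() implicitly anchors the regex to the whole input:
        System.out.println(url.matches("product"));    // false
        // Matcher.find() looks for the pattern anywhere in the input,
        // and the compiled Pattern can be reused across all records:
        Pattern p = Pattern.compile("product");
        System.out.println(p.matcher(url).find());     // true
    }
}
```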





[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records

2012-01-04 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179386#comment-13179386
 ] 

Markus Jelsma commented on NUTCH-1241:
--

Hmm, yes: if we add a -regex option that goes with the -dump option in the 
reader we can also have CSV output! However, due to NUTCH-1029 I cannot test it 
properly in a production environment. Care to have a look?

 CrawlDBScanner should also be able to find records
 --

 Key: NUTCH-1241
 URL: https://issues.apache.org/jira/browse/NUTCH-1241
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5


 The CrawlDBScanner cannot find partial matches because it uses 
 String.matches(). Instead, it should be able to use Matcher.find() to find 
 partial matches. Right now the regex http will never match any records. It can 
 then also reuse a compiled pattern.





[jira] [Commented] (NUTCH-1241) CrawlDBScanner should also be able to find records

2012-01-04 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179517#comment-13179517
 ] 

Markus Jelsma commented on NUTCH-1241:
--

Ugh, that's NUTCH-1084 instead. I don't need that ticket to test the regex dump, 
because dumping the DB still works; reading a single record doesn't. 



 CrawlDBScanner should also be able to find records
 --

 Key: NUTCH-1241
 URL: https://issues.apache.org/jira/browse/NUTCH-1241
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5


 The CrawlDBScanner cannot find partial matches because it uses 
 String.matches(). Instead, it should be able to use Matcher.find() to find 
 partial matches. Right now the regex http will never match any records. It can 
 then also reuse a compiled pattern.





[jira] [Commented] (NUTCH-1239) Webgraph should remove deleted pages from segment input

2012-01-02 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178358#comment-13178358
 ] 

Markus Jelsma commented on NUTCH-1239:
--

I'll commit shortly unless there are objections. Thanks.


 Webgraph should remove deleted pages from segment input
 ---

 Key: NUTCH-1239
 URL: https://issues.apache.org/jira/browse/NUTCH-1239
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Attachments: NUTCH-1239-1.5-1.patch


 Webgraph's outlink job is currently unable to remove links. It should expand 
 its segment input and be able to remove nodes for pages that no longer exist.





[jira] [Commented] (NUTCH-1232) Remove host|site fields from index-basic

2012-01-02 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178373#comment-13178373
 ] 

Markus Jelsma commented on NUTCH-1232:
--

I'll remove the site field and commit unless there are objections. Users that 
have application software relying on that field can simply use a copyField to 
resolve the issue.
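For instance, a copyField along these lines would keep such installations working (a hypothetical Solr schema.xml fragment; field names and types are assumptions, not the shipped Nutch schema):

```xml
<!-- keep populating "site" from "host" after index-basic stops emitting it -->
<field name="host" type="string" stored="true" indexed="true"/>
<field name="site" type="string" stored="true" indexed="true"/>
<copyField source="host" dest="site"/>
```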

 Remove host|site fields from index-basic
 

 Key: NUTCH-1232
 URL: https://issues.apache.org/jira/browse/NUTCH-1232
 Project: Nutch
  Issue Type: Bug
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5


 Either field needs to be removed; it makes no sense to have two identical 
 values in separate fields. I propose to get rid of the site field and keep 
 the host field. This may be a breaking change for some installations, however.





[jira] [Commented] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2012-01-02 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178374#comment-13178374
 ] 

Markus Jelsma commented on NUTCH-1138:
--

Lewis, isn't this issue resolved now?

 remove LogUtil from trunk and nutch gora
 

 Key: NUTCH-1138
 URL: https://issues.apache.org/jira/browse/NUTCH-1138
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5

 Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch


 This should move towards the removal of the LogUtil class from both codebases 
 as per comments in NUTCH-1078.





[jira] [Commented] (NUTCH-1238) Fetcher throughput threshold must start before feeder finished

2011-12-29 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177116#comment-13177116
 ] 

Markus Jelsma commented on NUTCH-1238:
--

Tested and works. Unit tests also still pass. I'll commit shortly if there are 
no objections.

 Fetcher throughput threshold must start before feeder finished
 --

 Key: NUTCH-1238
 URL: https://issues.apache.org/jira/browse/NUTCH-1238
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1238-1.5-1.patch


 Right now the fetcher's minimum throughput threshold is activated only when 
 the feeder has finished. However, for various reasons a running fetch can be 
 slow. This issue must change the feature to start checking earlier, but not 
 right after initialization.





[jira] [Commented] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-27 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176164#comment-13176164
 ] 

Markus Jelsma commented on NUTCH-1225:
--

Old mapred version restored per rev. 1224905.

 Migrate CrawlDBScanner to MapReduce API
 ---

 Key: NUTCH-1225
 URL: https://issues.apache.org/jira/browse/NUTCH-1225
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1225-1.5-1.patch, NUTCH-1225-1.5-2.patch








[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2011-12-27 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176194#comment-13176194
 ] 

Markus Jelsma commented on NUTCH-961:
-

Fixed already. See NUTCH-1233 for a patch!

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
 NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961-1.5-1.patch, NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
 strip boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.





[jira] [Commented] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2011-12-27 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176166#comment-13176166
 ] 

Markus Jelsma commented on NUTCH-1222:
--

Reverted per rev. 1224906.

 Upgrade to new Hadoop 0.22.0
 

 Key: NUTCH-1222
 URL: https://issues.apache.org/jira/browse/NUTCH-1222
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.5

 Attachments: NUTCH-1222-1.5-1.patch








[jira] [Commented] (NUTCH-1235) Upgrade to new Hadoop 0.20.205.0

2011-12-27 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176181#comment-13176181
 ] 

Markus Jelsma commented on NUTCH-1235:
--

Forgot to add the Jackson ASL mapper as a dependency. The new Hadoop needs 
Avro, and Avro needs Jackson; for some reason it is not specified as a 
dependency in 0.20.205.0.

Committed in rev. 1224912.

 Upgrade to new Hadoop 0.20.205.0
 

 Key: NUTCH-1235
 URL: https://issues.apache.org/jira/browse/NUTCH-1235
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5








[jira] [Commented] (NUTCH-1230) MimeType utils broken with Tika 1.1

2011-12-21 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174021#comment-13174021
 ] 

Markus Jelsma commented on NUTCH-1230:
--

It seems the method became deprecated in 1.0 and is inaccessible in my 
1.1-SNAPSHOT. I'll try an upgrade to 1.0.

 MimeType utils broken with Tika 1.1
 ---

 Key: NUTCH-1230
 URL: https://issues.apache.org/jira/browse/NUTCH-1230
 Project: Nutch
  Issue Type: Bug
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5


 We used Tika 1.0-SNAPSHOT in production and just switched to 1.1-SNAPSHOT. 
 The new version triggers the following error:
 {code}
 2011-12-21 12:29:56,665 ERROR http.Http - java.lang.IllegalAccessError: tried 
 to access method 
 org.apache.tika.mime.MimeTypes.getMimeType([B)Lorg/apache/tika/mime/MimeType; 
 from class org.apache.nutch.util.MimeUtil
 2011-12-21 12:29:56,665 ERROR http.Http - at 
 org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:169)
 2011-12-21 12:29:56,665 ERROR http.Http - at 
 org.apache.nutch.protocol.Content.getContentType(Content.java:292)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.protocol.Content.<init>(Content.java:88)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:82)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
 {code}





[jira] [Commented] (NUTCH-1230) MimeType API deprecated and breaks with Tika 1.0

2011-12-21 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174071#comment-13174071
 ] 

Markus Jelsma commented on NUTCH-1230:
--

Actually, Tika now returns application/octet-stream for that data. Please advise!

 MimeType API deprecated and breaks with Tika 1.0
 

 Key: NUTCH-1230
 URL: https://issues.apache.org/jira/browse/NUTCH-1230
 Project: Nutch
  Issue Type: Bug
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Blocker
 Fix For: 1.5

 Attachments: NUTCH-1230-1.5-2.patch


 We used Tika 1.0-SNAPSHOT in production and just switched to 1.1-SNAPSHOT. 
 The new version triggers the following error:
 {code}
 2011-12-21 12:29:56,665 ERROR http.Http - java.lang.IllegalAccessError: tried 
 to access method 
 org.apache.tika.mime.MimeTypes.getMimeType([B)Lorg/apache/tika/mime/MimeType; 
 from class org.apache.nutch.util.MimeUtil
 2011-12-21 12:29:56,665 ERROR http.Http - at 
 org.apache.nutch.util.MimeUtil.autoResolveContentType(MimeUtil.java:169)
 2011-12-21 12:29:56,665 ERROR http.Http - at 
 org.apache.nutch.protocol.Content.getContentType(Content.java:292)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.protocol.Content.<init>(Content.java:88)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:82)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
 2011-12-21 12:29:56,666 ERROR http.Http - at 
 org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
 {code}





[jira] [Commented] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-19 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172209#comment-13172209
 ] 

Markus Jelsma commented on NUTCH-1225:
--

I'll commit shortly if there are no objections

 Migrate CrawlDBScanner to MapReduce API
 ---

 Key: NUTCH-1225
 URL: https://issues.apache.org/jira/browse/NUTCH-1225
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1225-1.5-1.patch, NUTCH-1225-1.5-2.patch








[jira] [Commented] (NUTCH-1222) Upgrade to newer Hadoop versions

2011-12-19 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172211#comment-13172211
 ] 

Markus Jelsma commented on NUTCH-1222:
--

If there are no objections, I'll upgrade the Ivy dependencies to Hadoop 0.22.0.

 Upgrade to newer Hadoop versions
 

 Key: NUTCH-1222
 URL: https://issues.apache.org/jira/browse/NUTCH-1222
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Priority: Critical
 Fix For: 1.5








[jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-12-19 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172350#comment-13172350
 ] 

Markus Jelsma commented on NUTCH-1184:
--

If there are no further objections I will commit this one tomorrow.

 Fetcher to parse and follow Nth degree outlinks
 ---

 Key: NUTCH-1184
 URL: https://issues.apache.org/jira/browse/NUTCH-1184
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch, 
 NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch, 
 NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch, 
 NUTCH-1184-1.5-9-ParseOutputFormat.patch, NUTCH-1185-1.5-6.patch, 
 NUTCH-1185-1.5-7.patch, NUTCH-1185-1.5-8.patch, NUTCH-1185-1.5-9.patch


 Fetcher improvements to parse and follow outlinks up to a specified depth. 
 The number of outlinks to follow can be decreased by depth using a divisor. 
 This patch introduces three new configuration directives:
 {code}
 <property>
   <name>fetcher.follow.outlinks.depth</name>
   <value>-1</value>
   <description>(EXPERT) When fetcher.parse is true and this value is greater
   than 0 the fetcher will extract outlinks and follow them until the desired
   depth is reached. A value of 1 means all generated pages are fetched and
   their first degree outlinks are fetched and parsed too. Be careful: this
   feature is in itself agnostic of the state of the CrawlDB and does not
   know about already fetched pages. A setting larger than 2 will most likely
   fetch home pages twice in the same fetch cycle. It is highly recommended
   to set db.ignore.external.links to true to restrict the outlink follower
   to URLs within the same domain. When disabled (false) the feature is
   likely to follow duplicates even when depth=1. A value of -1 or 0 disables
   this feature.
   </description>
 </property>
 <property>
   <name>fetcher.follow.outlinks.num.links</name>
   <value>4</value>
   <description>(EXPERT) The number of outlinks to follow when
   fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply
   the total number of pages to fetch. This works with
   fetcher.follow.outlinks.depth.divisor; with the default settings the
   number of followed outlinks at depth 1 is 8, not 4.
   </description>
 </property>
 <property>
   <name>fetcher.follow.outlinks.depth.divisor</name>
   <value>2</value>
   <description>(EXPERT) The divisor of fetcher.follow.outlinks.num.links per
   fetcher.follow.outlinks.depth. This decreases the number of outlinks to
   follow by increasing depth. The formula used is: outlinks =
   floor(divisor / depth * num.links). This prevents exponential growth of
   the fetch list.
   </description>
 </property>
 {code}
 Please, do not use this unless you know what you're doing. This feature does 
 not consider the state of the CrawlDB nor does it consider generator settings 
 such as limiting the number of pages per (domain|host|ip) queue. It is not 
 polite to use this feature with high settings as it can fetch many pages from 
 the same domain including duplicates.
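 The depth divisor formula in the description above can be illustrated with a
 small standalone sketch (assumption: `OutlinkBudget` and `outlinksAt` are
 hypothetical names for illustration only, not part of the actual Fetcher code):

```java
// Illustrates outlinks = floor(divisor / depth * num.links) from the
// fetcher.follow.outlinks.depth.divisor description. Hypothetical helper,
// not the real Nutch Fetcher implementation.
public class OutlinkBudget {
    static int outlinksAt(int depth, int numLinks, int divisor) {
        return (int) Math.floor((double) divisor / depth * numLinks);
    }

    public static void main(String[] args) {
        // With the default settings num.links=4 and divisor=2 this yields
        // 8 outlinks at depth 1 (matching "at depth 1 is 8, not 4" above),
        // then 4 at depth 2 and 2 at depth 3.
        for (int depth = 1; depth <= 3; depth++) {
            System.out.println("depth " + depth + " -> "
                + outlinksAt(depth, 4, 2) + " outlinks");
        }
    }
}
```

 Note how the budget shrinks as depth grows, which is what keeps the fetch list
 from growing exponentially.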

--




[jira] [Commented] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13170313#comment-13170313
 ] 

Markus Jelsma commented on NUTCH-1225:
--

I removed the Hadoop deps from Ivy and manually added Hadoop 0.21 jars to the 
lib directory. Next, two other deps must be added to Ivy:

{code}
<!-- need to compile webgraph -->
<dependency org="commons-cli" name="commons-cli"
    rev="20040117.00"
    conf="*->default" />
<!-- avro -->
<dependency org="org.apache.avro" name="avro" rev="1.6.1"
    conf="*->default" />
{code}


 Migrate CrawlDBScanner to MapReduce API
 ---

 Key: NUTCH-1225
 URL: https://issues.apache.org/jira/browse/NUTCH-1225
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1225-1.5-1.patch




--




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-06 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13163515#comment-13163515
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Ah yes it makes sense now!
 
If you look at the patch for NUTCH-1139 you can see that the endpoint, Solr in 
this case, implements the delete method as called from NutchIndexAction. 
Another endpoint could simply ignore and do nothing but write out WARC or Solr 
XML files.

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.5


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.

--




[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2011-12-05 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162977#comment-13162977
 ] 

Markus Jelsma commented on NUTCH-1047:
--

Hi Julien,

I'm not sure I get your point exactly, but if we don't generate WARC files we:
- don't have to think about the problem you state
- don't create an additional process between Nutch and a search engine

If you'd need WARC files, for some reason, I'd rather have an endpoint for them 
just like for ES and Solr instead of using WARC files as an intermediate format.

Does your suggestion imply: segment+crawldb -> WARC files -> search engine? 


 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.5


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.

--




[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13161501#comment-13161501
 ] 

Markus Jelsma commented on NUTCH-1206:
--

fetching: https://issues.apache.org/jira/secure/attachment/12505323/direct.pdf
Can't fetch URL successfully

This is obviously not a parser problem: the output tells you it's a fetcher 
problem. Also, can you fetch HTTPS URLs at all with the protocol plugin you use?


 tika parser of nutch 1.3 is failing to prcess pdfs
 --

 Key: NUTCH-1206
 URL: https://issues.apache.org/jira/browse/NUTCH-1206
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann
 Attachments: direct.pdf


 Please refer to this message: 
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. Old 
 parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) 
 though it is not able to parse acrobat 9.0 version of pdfs. nutch 1.3 does 
 not have parse-pdf plugin and it is not able to parse even older pdfs.
 my code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;
 public class TestParse {
 private static Configuration conf = NutchConfiguration.create();
 public TestParse() {
 }
 public static void main(String[] args) {
 String filename = args[0];
 convert(filename);
 }
 public static String convert(String fileName) {
 String newName = "abc.html";
 try {
 System.out.println("Converting " + fileName + " to html.");
 if (convertToHtml(fileName, newName))
 return newName;
 } catch (Exception e) {
 (new File(newName)).delete();
 System.out.println("General exception " + e.getMessage());
 }
 return null;
 }
 private static boolean convertToHtml(String fileName, String newName)
 throws Exception {
 // Read the file
 FileInputStream in = new FileInputStream(fileName);
 byte[] buf = new byte[in.available()];
 in.read(buf);
 in.close();
 // Parse the file
 Content content = new Content("file:" + fileName, "file:" +
 fileName,
   buf, "", new Metadata(), conf);
 ParseResult parseResult = new ParseUtil(conf).parse(content);
 parseResult.filter();
 if (parseResult.isEmpty()) {
 System.out.println("All parsing attempts failed");
 return false;
 }
 Iterator<Map.Entry<Text,Parse>> iterator =
 parseResult.iterator();
 if (iterator == null) {
 System.out.println("Cannot iterate over successful parse results");
 return false;
 }
 Parse parse = null;
 ParseData parseData = null;
 while (iterator.hasNext()) {
 parse = parseResult.get((Text)iterator.next().getKey());
 parseData = parse.getData();
 ParseStatus status = parseData.getStatus();
 // If Parse failed then bail
 if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
 System.out.println("Could not parse " + fileName + ". " +
 status.getMessage());
 return false;
 }
 }
 // Start writing to newName
 FileOutputStream fout = new FileOutputStream(newName);
 PrintStream out = new PrintStream(fout, true, "UTF-8");
 // Start Document
 out.println("<html>");
 // Start Header
 out.println("<head>");
 // Write Title
 String title = parseData.getTitle();
 if (title != null && title.trim().length() > 0) {
 out.println("<title>" + parseData.getTitle() + "</title>");
 }
 // Write out Meta tags
 Metadata metaData = parseData.getContentMeta();
 String[] names = metaData.names();
 for (String name : names) {
 String[] subvalues = metaData.getValues(name);
 String values = null;
 for (String subvalue : 

[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-12-02 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13161521#comment-13161521
 ] 

Markus Jelsma commented on NUTCH-1206:
--

I see. Check your logs for anything peculiar. I can fetch and parse this file 
with Nutch 1.4 with protocol-httpclient. 

 tika parser of nutch 1.3 is failing to prcess pdfs
 --

 Key: NUTCH-1206
 URL: https://issues.apache.org/jira/browse/NUTCH-1206
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh
Assignee: Chris A. Mattmann
 Attachments: direct.pdf


 Please refer to this message: 
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. Old 
 parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) 
 though it is not able to parse acrobat 9.0 version of pdfs. nutch 1.3 does 
 not have parse-pdf plugin and it is not able to parse even older pdfs.
 my code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;
 public class TestParse {
 private static Configuration conf = NutchConfiguration.create();
 public TestParse() {
 }
 public static void main(String[] args) {
 String filename = args[0];
 convert(filename);
 }
 public static String convert(String fileName) {
 String newName = "abc.html";
 try {
 System.out.println("Converting " + fileName + " to html.");
 if (convertToHtml(fileName, newName))
 return newName;
 } catch (Exception e) {
 (new File(newName)).delete();
 System.out.println("General exception " + e.getMessage());
 }
 return null;
 }
 private static boolean convertToHtml(String fileName, String newName)
 throws Exception {
 // Read the file
 FileInputStream in = new FileInputStream(fileName);
 byte[] buf = new byte[in.available()];
 in.read(buf);
 in.close();
 // Parse the file
 Content content = new Content("file:" + fileName, "file:" +
 fileName,
   buf, "", new Metadata(), conf);
 ParseResult parseResult = new ParseUtil(conf).parse(content);
 parseResult.filter();
 if (parseResult.isEmpty()) {
 System.out.println("All parsing attempts failed");
 return false;
 }
 Iterator<Map.Entry<Text,Parse>> iterator =
 parseResult.iterator();
 if (iterator == null) {
 System.out.println("Cannot iterate over successful parse results");
 return false;
 }
 Parse parse = null;
 ParseData parseData = null;
 while (iterator.hasNext()) {
 parse = parseResult.get((Text)iterator.next().getKey());
 parseData = parse.getData();
 ParseStatus status = parseData.getStatus();
 // If Parse failed then bail
 if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
 System.out.println("Could not parse " + fileName + ". " +
 status.getMessage());
 return false;
 }
 }
 // Start writing to newName
 FileOutputStream fout = new FileOutputStream(newName);
 PrintStream out = new PrintStream(fout, true, "UTF-8");
 // Start Document
 out.println("<html>");
 // Start Header
 out.println("<head>");
 // Write Title
 String title = parseData.getTitle();
 if (title != null && title.trim().length() > 0) {
 out.println("<title>" + parseData.getTitle() + "</title>");
 }
 // Write out Meta tags
 Metadata metaData = parseData.getContentMeta();
 String[] names = metaData.names();
 for (String name : names) {
 String[] subvalues = metaData.getValues(name);
 String values = null;
 for (String subvalue : subvalues) {
 values += subvalue;
 }
 if (values.length() > 0)
 out.printf("<meta 

[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to prcess pdfs

2011-11-21 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154180#comment-13154180
 ] 

Markus Jelsma commented on NUTCH-1206:
--

Have you tried the Nutch trunk or the most recent Tika as suggested? 

 tika parser of nutch 1.3 is failing to prcess pdfs
 --

 Key: NUTCH-1206
 URL: https://issues.apache.org/jira/browse/NUTCH-1206
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: Solaris/Linux/Windows
Reporter: dibyendu ghosh

 Please refer to this message: 
 http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html. Old 
 parse-pdf parser seems to be able to parse old pdfs (checked with nutch 1.2) 
 though it is not able to parse acrobat 9.0 version of pdfs. nutch 1.3 does 
 not have parse-pdf plugin and it is not able to parse even older pdfs.
 my code (TestParse.java):
 
 bash-2.00$ cat TestParse.java
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.FileOutputStream;
 import java.io.PrintStream;
 import java.util.Iterator;
 import java.util.Map;
 import java.util.Map.Entry;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.ParseUtil;
 import org.apache.nutch.parse.ParseData;
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;
 public class TestParse {
 private static Configuration conf = NutchConfiguration.create();
 public TestParse() {
 }
 public static void main(String[] args) {
 String filename = args[0];
 convert(filename);
 }
 public static String convert(String fileName) {
 String newName = "abc.html";
 try {
 System.out.println("Converting " + fileName + " to html.");
 if (convertToHtml(fileName, newName))
 return newName;
 } catch (Exception e) {
 (new File(newName)).delete();
 System.out.println("General exception " + e.getMessage());
 }
 return null;
 }
 private static boolean convertToHtml(String fileName, String newName)
 throws Exception {
 // Read the file
 FileInputStream in = new FileInputStream(fileName);
 byte[] buf = new byte[in.available()];
 in.read(buf);
 in.close();
 // Parse the file
 Content content = new Content("file:" + fileName, "file:" +
 fileName,
   buf, "", new Metadata(), conf);
 ParseResult parseResult = new ParseUtil(conf).parse(content);
 parseResult.filter();
 if (parseResult.isEmpty()) {
 System.out.println("All parsing attempts failed");
 return false;
 }
 Iterator<Map.Entry<Text,Parse>> iterator =
 parseResult.iterator();
 if (iterator == null) {
 System.out.println("Cannot iterate over successful parse results");
 return false;
 }
 Parse parse = null;
 ParseData parseData = null;
 while (iterator.hasNext()) {
 parse = parseResult.get((Text)iterator.next().getKey());
 parseData = parse.getData();
 ParseStatus status = parseData.getStatus();
 // If Parse failed then bail
 if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
 System.out.println("Could not parse " + fileName + ". " +
 status.getMessage());
 return false;
 }
 }
 // Start writing to newName
 FileOutputStream fout = new FileOutputStream(newName);
 PrintStream out = new PrintStream(fout, true, "UTF-8");
 // Start Document
 out.println("<html>");
 // Start Header
 out.println("<head>");
 // Write Title
 String title = parseData.getTitle();
 if (title != null && title.trim().length() > 0) {
 out.println("<title>" + parseData.getTitle() + "</title>");
 }
 // Write out Meta tags
 Metadata metaData = parseData.getContentMeta();
 String[] names = metaData.names();
 for (String name : names) {
 String[] subvalues = metaData.getValues(name);
 String values = null;
 for (String subvalue : subvalues) {
 values += subvalue;
 }
 if (values.length() > 0)
 out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
name, values);
 }
 out.println("<meta http-equiv=\"Content-Type\"
 

[jira] [Commented] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-15 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150402#comment-13150402
 ] 

Markus Jelsma commented on NUTCH-1184:
--

Any comments? Objections? I'd like to push this in and mark the new config 
directives as expert.

 Fetcher to parse and follow Nth degree outlinks
 ---

 Key: NUTCH-1184
 URL: https://issues.apache.org/jira/browse/NUTCH-1184
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1184-1.5-1.patch, NUTCH-1184-1.5-2.patch, 
 NUTCH-1184-1.5-3.patch, NUTCH-1184-1.5-4.patch, 
 NUTCH-1184-1.5-5-ParseData.patch, NUTCH-1184-1.5-5.patch


 Improvements to fetcher to follow Nth degree outlinks of fetched items:
 - fetch
 - parse
 - normalize and filter outlinks
 - create new FetchItem and inject in the queue

--




[jira] [Commented] (NUTCH-1202) Fetcher timebomb kills long waiting fetch jobs

2011-11-14 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149708#comment-13149708
 ] 

Markus Jelsma commented on NUTCH-1202:
--

I haven't looked at implementation details and cannot offer a suggestion (yet). 
Doing it in configure/setup is not pretty for the reason in the comments. On 
the other hand, the current implementation does not allow one to submit several 
jobs at once without risking a lot of records being hit by the time limit.

 Fetcher timebomb kills long waiting fetch jobs
 --

 Key: NUTCH-1202
 URL: https://issues.apache.org/jira/browse/NUTCH-1202
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Markus Jelsma
 Fix For: 1.5


 The timebomb feature kills off mappers of jobs that have been waiting too long 
 in the job queue. The timebomb feature should start at mapper initialization 
 instead, not in job init.
 Thoughts?

--




[jira] [Commented] (NUTCH-1180) UpdateDB to backup previous CrawlDB

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147641#comment-13147641
 ] 

Markus Jelsma commented on NUTCH-1180:
--

I'll send this in if there are no objections.

 UpdateDB to backup previous CrawlDB
 ---

 Key: NUTCH-1180
 URL: https://issues.apache.org/jira/browse/NUTCH-1180
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1180-1.5.-1.patch


 Nutch currently replaces an existing CrawlDB with the new CrawlDB. By 
 optionally keeping a previous version on HDFS, users can easily revert in case 
 of a mistake without relying on external backup mechanisms.
 This should be enabled by default.

--




[jira] [Commented] (NUTCH-1178) Incorrect CSV header CrawlDatumCsvOutputFormat

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147642#comment-13147642
 ] 

Markus Jelsma commented on NUTCH-1178:
--

Objections?

 Incorrect CSV header CrawlDatumCsvOutputFormat
 --

 Key: NUTCH-1178
 URL: https://issues.apache.org/jira/browse/NUTCH-1178
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.5

 Attachments: NUTCH-1178-1.5-1.patch, NUTCH-1178-1.5-2.patch


 The CSV header doesn't mention both retry interval fields (seconds + days). 
 We should either add another field to the header or get rid of one retry 
 interval field. I prefer the former as people may already rely on the current 
 format.

--




[jira] [Commented] (NUTCH-1142) Normalization and filtering in WebGraph

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147644#comment-13147644
 ] 

Markus Jelsma commented on NUTCH-1142:
--

I'll send this in today.

 Normalization and filtering in WebGraph
 ---

 Key: NUTCH-1142
 URL: https://issues.apache.org/jira/browse/NUTCH-1142
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1142-1.4.patch, NUTCH-1142-1.5-2.patch, 
 NUTCH-1142-1.5-3.patch


 The WebGraph program performs URL normalization. Since normalization of 
 outlinks is already performed during the parse, it should become optional. 
 There is also no URL filtering mechanism in the web graph program. When a 
 CrawlDatum is removed from the CrawlDB by a URL filter it should be possible 
 to remove it from the web graph as well.

--




[jira] [Commented] (NUTCH-1174) Outlinks are not properly normalized

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147654#comment-13147654
 ] 

Markus Jelsma commented on NUTCH-1174:
--

Will commit if there are no objections.

 Outlinks are not properly normalized
 

 Key: NUTCH-1174
 URL: https://issues.apache.org/jira/browse/NUTCH-1174
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1174-1.5-1.patch


 In ParseOutputFormat, the toUrl is read from the Outlink and processed. This 
 String object is filtered, normalized etc., but the original Outlink object is 
 actually added. The normalized URL in toUrl is not written back to the 
 Outlink object.
 This issue adds a setUrl method to Outlink which is used in ParseOutputFormat 
 to overwrite the unnormalized URL.
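 A minimal sketch of the change described above (assumption: a simplified
 stand-in for org.apache.nutch.parse.Outlink; the real class is a Hadoop
 Writable with more fields and methods):

```java
// Simplified stand-in for org.apache.nutch.parse.Outlink, showing only the
// piece this issue adds: a setter so ParseOutputFormat can write the
// normalized URL back instead of emitting the original, unnormalized one.
public class Outlink {
    private String toUrl;
    private final String anchor;

    public Outlink(String toUrl, String anchor) {
        this.toUrl = toUrl;
        this.anchor = anchor;
    }

    public String getToUrl() { return toUrl; }
    public String getAnchor() { return anchor; }

    // New in this patch: overwrite the URL after filtering/normalization.
    public void setUrl(String toUrl) { this.toUrl = toUrl; }
}
```

 ParseOutputFormat would then call setUrl with the filtered, normalized String
 before writing the Outlink out.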

--




[jira] [Commented] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147655#comment-13147655
 ] 

Markus Jelsma commented on NUTCH-1061:
--

Any comments on this one?

 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
 -

 Key: NUTCH-1061
 URL: https://issues.apache.org/jira/browse/NUTCH-1061
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1061-1.4-1.patch


 Here's a patch migrating the resetTitle method from Apache ORO to 
 java.util.regex. There was no unit test for this method so I added one. The 
 test passes with the old Apache ORO impl. and with the new j.u.regex impl.
 Please comment.

--




[jira] [Commented] (NUTCH-1139) Indexer to delete documents

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147656#comment-13147656
 ] 

Markus Jelsma commented on NUTCH-1139:
--

Comments please? 

 Indexer to delete documents
 ---

 Key: NUTCH-1139
 URL: https://issues.apache.org/jira/browse/NUTCH-1139
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1139-1.4-1.patch


 Add an option -delete to the solrindex command. With this feature enabled 
 documents of the currently processing segment with status FETCH_GONE or 
 FETCH_REDIR_PERM are deleted, a following SolrClean is not required anymore.
 This issue is a follow up of NUTCH-1052.

--




[jira] [Commented] (NUTCH-1180) UpdateDB to backup previous CrawlDB

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147715#comment-13147715
 ] 

Markus Jelsma commented on NUTCH-1180:
--

Config directive:

{code}
<property>
  <name>db.preserve.backup</name>
  <value>true</value>
  <description>If true, updatedb will keep a backup of the previous CrawlDB
  version in the old directory. In case of disaster, one can rename old to 
  current and restore the CrawlDB to its previous state.
  </description>
</property>
{code}

Fine? Wrong? 
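A sketch of the disaster-recovery rename the description mentions: promote the preserved "old" CrawlDB back to "current". Local temp paths stand in for HDFS here; on a real cluster this would be `hadoop fs -mv` on the crawldb directory.

```shell
# Sketch only: simulate the CrawlDB layout with local directories.
CRAWLDB=$(mktemp -d)/crawldb
mkdir -p "$CRAWLDB/current" "$CRAWLDB/old"
# ... suppose a bad updatedb run has corrupted "current" ...
rm -rf "$CRAWLDB/current"
# roll back by promoting the preserved backup
mv "$CRAWLDB/old" "$CRAWLDB/current"
ls "$CRAWLDB"
```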

 UpdateDB to backup previous CrawlDB
 ---

 Key: NUTCH-1180
 URL: https://issues.apache.org/jira/browse/NUTCH-1180
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1180-1.5.-1.patch


 Nutch currently replaces an existing CrawlDB with the new CrawlDB. By 
 optionally keeping the previous version on HDFS, users can easily revert in case 
 of a mistake without relying on external backup mechanisms.
 This should be enabled by default.

--




[jira] [Commented] (NUTCH-1139) Indexer to delete documents

2011-11-10 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147726#comment-13147726
 ] 

Markus Jelsma commented on NUTCH-1139:
--

Yes, but does that also cover the indexer deleting PERM_REDIR? If so, then 
agreed.

 Indexer to delete documents
 ---

 Key: NUTCH-1139
 URL: https://issues.apache.org/jira/browse/NUTCH-1139
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1139-1.4-1.patch


 Add an option -delete to the solrindex command. With this feature enabled, 
 documents of the currently processing segment with status FETCH_GONE or 
 FETCH_REDIR_PERM are deleted, so a subsequent SolrClean run is no longer required.
 This issue is a follow-up of NUTCH-1052.

--




[jira] [Commented] (NUTCH-1186) FreeGenerator always normalizes

2011-11-09 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147069#comment-13147069
 ] 

Markus Jelsma commented on NUTCH-1186:
--

This is actually not the FreeGenerator but the URLPartitioner class doing the 
partitioning-scope normalizing. I'm not sure what the right behaviour would be. The 
common generator is also affected: it uses the partitioner when turning fetch 
lists into segments. Without a scope this means ALL selected URLs are normalized 
at least once, and twice when normalizing is actually in use.

Thoughts?

 FreeGenerator always normalizes
 ---

 Key: NUTCH-1186
 URL: https://issues.apache.org/jira/browse/NUTCH-1186
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5


 The FreeGenerator does not honor the -normalize option; it always normalizes 
 all URLs in the input directory. The -filter option is respected.

--




[jira] [Commented] (NUTCH-1199) unfetched URLs problem

2011-11-08 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13146144#comment-13146144
 ] 

Markus Jelsma commented on NUTCH-1199:
--

And what exactly is the problem definition?



 unfetched URLs problem
 --

 Key: NUTCH-1199
 URL: https://issues.apache.org/jira/browse/NUTCH-1199
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, generator
Reporter: behnam nikbakht
Priority: Critical
  Labels: db_unfetched, fetch, freegen, generate, unfetched, 
 updatedb

 we write a script to fetch unfetched urls:
 # first, dump the CrawlDB to a text file and extract the unfetched urls:
 bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
 cat $SITE_DIR/tmp/dump_urls.txt/part-0 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
 unfetched_urls_file=$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt
 cat $SITE_DIR/tmp/dump_unf | awk -F '' '{print $2}' > $unfetched_urls_file
 unfetched_count=`cat $unfetched_urls_file | wc -l`
 # next we have a list of unfetched urls in unfetched_urls.txt; we use freegen to
 # create segments for these urls (we cannot use generate because these urls
 # were generated previously)
 if [[ $unfetched_count -lt $it_size ]]
 then
 echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
 ((J++))
 bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
 s2=`ls -d $crawlseg/2* | tail -1`
 bin/nutch fetch $s2
 bin/nutch parse $s2
 bin/nutch updatedb $crawldb $s2
 echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
 get_new_links
 exit
 fi
 # if the number of urls is greater than it_size, split them into packages
 ij=1
 while read line
 do
 let ind=$ij/$it_size
 mkdir -p $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
 echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
 echo $ind
 ((ij++))
 let completed=$ij%$it_size
 if [[ $completed -eq 0 ]]
 then
 echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
 ((J++))
 bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
 # finally fetch, parse and update the new segment
 s2=`ls -d $crawlseg/2* | tail -1`
 bin/nutch fetch $s2
 bin/nutch parse $s2
 rm $crawldb/.locked
 bin/nutch updatedb $crawldb $s2
 echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
 fi
 done < $unfetched_urls_file

--



