[jira] [Commented] (NUTCH-2270) Solr indexer Failed i

2016-06-07 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318973#comment-15318973
 ] 

kaveh minooie commented on NUTCH-2270:
--

Also a duplicate of NUTCH-2267.

> Solr indexer Failed i
> -
>
> Key: NUTCH-2270
> URL: https://issues.apache.org/jira/browse/NUTCH-2270
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: Hadoop 2.7.2, Solr 6.0.0, Nutch 1.12 on a single node 
>Reporter: narendra
>
> When I run this command:
>  bin/nutch solrindex http://localhost:8983/solr/#/gettingstarted 
> crawl_Test1/crawldb -linkdb crawl_Test1/linkdb  crawl_Test1/segments/*
> 16/05/31 22:21:47 WARN segment.SegmentChecker: The input path at * is not a 
> segment... skipping
> 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: starting at 2016-05-31 
> 22:21:47
> 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: deleting gone documents: 
> false
> 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: URL filtering: false
> 16/05/31 22:21:47 INFO indexer.IndexingJob: Indexer: URL normalizing: false
> 16/05/31 22:21:47 INFO plugin.PluginRepository: Plugins: looking in: 
> /tmp/hadoop-unjar8621976524622577403/classes/plugins
> 16/05/31 22:21:47 INFO plugin.PluginRepository: Plugin Auto-activation mode: 
> [true]
> 16/05/31 22:21:47 INFO plugin.PluginRepository: Registered Plugins:
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Regex URL Filter 
> (urlfilter-regex)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Html Parse Plug-in 
> (parse-html)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   HTTP Framework 
> (lib-http)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   the nutch core 
> extension points (nutch-extensionpoints)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Basic Indexing Filter 
> (index-basic)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Anchor Indexing Filter 
> (index-anchor)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Tika Parser Plug-in 
> (parse-tika)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Basic URL Normalizer 
> (urlnormalizer-basic)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Regex URL Filter 
> Framework (lib-regex-filter)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Regex URL Normalizer 
> (urlnormalizer-regex)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   CyberNeko HTML Parser 
> (lib-nekohtml)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   OPIC Scoring Plug-in 
> (scoring-opic)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Pass-through URL 
> Normalizer (urlnormalizer-pass)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Http Protocol Plug-in 
> (protocol-http)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   SolrIndexWriter 
> (indexer-solr)
> 16/05/31 22:21:47 INFO plugin.PluginRepository: Registered Extension-Points:
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch Content Parser 
> (org.apache.nutch.parse.Parser)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch URL Filter 
> (org.apache.nutch.net.URLFilter)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   HTML Parse Filter 
> (org.apache.nutch.parse.HtmlParseFilter)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch Scoring 
> (org.apache.nutch.scoring.ScoringFilter)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch URL Normalizer 
> (org.apache.nutch.net.URLNormalizer)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch Protocol 
> (org.apache.nutch.protocol.Protocol)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch URL Ignore 
> Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch Index Writer 
> (org.apache.nutch.indexer.IndexWriter)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch Segment Merge 
> Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 16/05/31 22:21:47 INFO plugin.PluginRepository:   Nutch Indexing Filter 
> (org.apache.nutch.indexer.IndexingFilter)
> 16/05/31 22:21:47 INFO indexer.IndexWriters: Adding 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 16/05/31 22:21:47 INFO indexer.IndexingJob: Active IndexWriters :
> SOLRIndexWriter
>   solr.server.url : URL of the SOLR instance
>   solr.zookeeper.hosts : URL of the Zookeeper quorum
>   solr.commit.size : buffer size when sending to SOLR (default 1000)
>   solr.mapping.file : name of the mapping file

[jira] [Commented] (NUTCH-2268) SolrIndexerJob: java.lang.RuntimeException

2016-06-07 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15318948#comment-15318948
 ] 

kaveh minooie commented on NUTCH-2268:
--

The reporter's problem has been solved in the same Stack Overflow question that 
is referenced. This issue should be closed.

> SolrIndexerJob: java.lang.RuntimeException
> --
>
> Key: NUTCH-2268
> URL: https://issues.apache.org/jira/browse/NUTCH-2268
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3.1
> Environment: I am using:
> HBase: hbase-0.98.19-hadoop2
> Solr: 6.0.0
> Nutch: 2.3.1
> Java: 8
>Reporter: narendra
>  Labels: indexing
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Could you please help with this error:
> SolrIndexerJob: java.lang.RuntimeException: job 
> failed: name=apache-nutch-2.3.1.jar
> which I get when I run this command:
> local/bin/nutch solrindex http://localhost:8983/solr/ -all
> I tried with Solr 4.10.3 but I am getting the same error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2267) Solr indexer fails at the end of the job with a java error message

2016-05-23 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-2267:


 Summary: Solr indexer fails at the end of the job with a java 
error message
 Key: NUTCH-2267
 URL: https://issues.apache.org/jira/browse/NUTCH-2267
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.12
 Environment: hadoop v2.7.2  solr6 in cloud configuration with 
zookeeper 3.4.6. I use the master branch from github currently on commit 
da252eb7b3d2d7b70 (NUTCH-2263 mingram and maxgram support for Unigram 
Cosine Similarity Model is provided).
Reporter: kaveh minooie
 Fix For: 1.13


This is what I was getting at first:

16/05/23 13:52:27 INFO mapreduce.Job:  map 100% reduce 100%
16/05/23 13:52:27 INFO mapreduce.Job: Task Id : 
attempt_1462499602101_0119_r_00_0, Status : FAILED
Error: Bad return type
Exception Details:
  Location:

org/apache/solr/client/solrj/impl/HttpClientUtil.createClient(Lorg/apache/solr/common/params/SolrParams;Lorg/apache/http/conn/ClientConnectionManager;)Lorg/apache/http/impl/client/CloseableHttpClient;
 @58: areturn
  Reason:
Type 'org/apache/http/impl/client/DefaultHttpClient' (current frame, 
stack[0]) is not assignable to 
'org/apache/http/impl/client/CloseableHttpClient' (from method signature)
  Current Frame:
bci: @58
flags: { }
locals: { 'org/apache/solr/common/params/SolrParams', 
'org/apache/http/conn/ClientConnectionManager', 
'org/apache/solr/common/params/ModifiableSolrParams', 
'org/apache/http/impl/client/DefaultHttpClient' }
stack: { 'org/apache/http/impl/client/DefaultHttpClient' }
  Bytecode:
0x000: bb00 0359 2ab7 0004 4db2 0005 b900 0601
0x010: 0099 001e b200 05bb 0007 59b7 0008 1209
0x020: b600 0a2c b600 0bb6 000c b900 0d02 002b
0x030: b800 104e 2d2c b800 0f2d b0
  Stackmap Table:
append_frame(@47,Object[#143])

16/05/23 13:52:28 INFO mapreduce.Job:  map 100% reduce 0% 
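The "Bad return type" above is the JVM bytecode verifier rejecting the `areturn`: the method's signature promises a `CloseableHttpClient`, but the httpclient version actually on the classpath hands back a `DefaultHttpClient` that is not a subclass of it. A minimal sketch with stand-in classes (the class names and the 4.2-vs-4.3 hierarchy here are assumptions mirroring the error message, not Solr's actual code):

```java
// Stand-ins for the two httpclient hierarchies. Assumption for illustration:
// CloseableHttpClient only exists, and is only a superclass of
// DefaultHttpClient, from httpclient 4.3 on; an older 4.2-era jar on the
// classpath gives the verifier a class with no such superclass.
public class VerifierDemo {
    static class Old42DefaultHttpClient {}                        // 4.2-era: unrelated class
    static abstract class CloseableLike {}                        // stands in for CloseableHttpClient
    static class New43DefaultHttpClient extends CloseableLike {}  // 4.3+: proper subclass

    public static void main(String[] args) {
        // On `areturn`, the verifier performs exactly this assignability check
        // between the value on the stack and the declared return type:
        System.out.println(CloseableLike.class.isAssignableFrom(Old42DefaultHttpClient.class)); // false -> "Bad return type"
        System.out.println(CloseableLike.class.isAssignableFrom(New43DefaultHttpClient.class)); // true  -> verifies fine
    }
}
```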

as you can see, the failed reducer gets re-spawned. Then I found this issue: 
https://issues.apache.org/jira/browse/SOLR-7657 and updated my Hadoop config 
file. After that, the indexer seems able to finish (I got the documents in 
Solr, it seems), but I still get the error message at the end of the job:

16/05/23 16:39:26 INFO mapreduce.Job:  map 100% reduce 99%
16/05/23 16:39:44 INFO mapreduce.Job:  map 100% reduce 100%
16/05/23 16:39:57 INFO mapreduce.Job: Job job_1464045047943_0001 completed 
successfully
16/05/23 16:39:58 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=42700154855
FILE: Number of bytes written=70210771807
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=8699202825
HDFS: Number of bytes written=0
HDFS: Number of read operations=537
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters 
Launched map tasks=134
Launched reduce tasks=1
Data-local map tasks=107
Rack-local map tasks=27
Total time spent by all maps in occupied slots (ms)=49377664
Total time spent by all reduces in occupied slots (ms)=32765064
Total time spent by all map tasks (ms)=3086104
Total time spent by all reduce tasks (ms)=1365211
Total vcore-milliseconds taken by all map tasks=3086104
Total vcore-milliseconds taken by all reduce tasks=1365211
Total megabyte-milliseconds taken by all map tasks=12640681984
Total megabyte-milliseconds taken by all reduce tasks=8387856384
Map-Reduce Framework
Map input records=25305474
Map output records=25305474
Map output bytes=27422869763
Map output materialized bytes=27489888004
Input split bytes=15225
Combine input records=0
Combine output records=0
Reduce input groups=16061459
Reduce shuffle bytes=27489888004
Reduce input records=25305474
Reduce output records=230
Spilled Records=54688613
Shuffled Maps =134
Failed Shuffles=0
Merged Map outputs=134
GC time elapsed (ms)=88103
CPU time spent (ms)=3361270
Physical memory (bytes) snapshot=144395186176
Virtual memory (bytes) snapshot=751590166528
Total committed heap usage (bytes)=156232056832
IndexerStatus
in

[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

2016-05-23 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297359#comment-15297359
 ] 

kaveh minooie commented on NUTCH-1084:
--

For the next person who ends up here: this is fixed in Hadoop 2.7.3; everyone 
is waiting for that version to be released.

> ReadDB url throws exception
> ---
>
> Key: NUTCH-1084
> URL: https://issues.apache.org/jira/browse/NUTCH-1084
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-1084.patch
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop version
> 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
> The first problem can be remedied by not allowing the injector or updater to 
> write the _SUCCESS file. Until now that's the solution implemented for 
> similar issues. I've not been successful as to make the Hadoop readers simply 
> skip the file.
> The second issue seems a bit strange and did not happen on a local check out. 
> I'm not yet sure whether this is a Hadoop issue or something being corrupt in 
> the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: 
> org.apache.nutch.protocol.ProtocolStatus because 
> org.apache.nutch.protocol.ProtocolStatus
> at 
> org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
> at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
> at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
> at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
> at 
> org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
> at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
> at 
> org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
> at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}





[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

2016-05-19 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292406#comment-15292406
 ] 

kaveh minooie commented on NUTCH-1084:
--

Has there been any update on this issue? I am running master and this is what I 
get:

crawler@d1r2n2:/2locos/nutch/deploy$ NUTCH_HOME=$( pwd ) bin/nutch readseg 
-list /2locos/segments/20160511193155
WARNING: Use "yarn jar" to launch YARN applications.
16/05/19 16:55:42 INFO zlib.ZlibFactory: Successfully loaded & initialized 
native-zlib library
16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
16/05/19 16:55:42 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Exception in thread "main" java.io.IOException: can't find class: 
org.apache.nutch.protocol.ProtocolStatus because 
org.apache.nutch.protocol.ProtocolStatus
at 
org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:212)
at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:167)
at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:317)
at 
org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2256)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2384)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:673)
at 
org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:534)
at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:477)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:655)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

As you can see, setting NUTCH_HOME has no effect. And if I set HADOOP_CLASSPATH 
I get the same error that [~markus17] posted two comments above, and 
[~ndouba]'s branch is way behind master. Does anybody have any idea what I 
should do (short of converting [~ndouba]'s changes into a patch and trying to 
apply them to master)?


[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2014-11-10 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1140:
-
Attachment: 0001-NUTCH-1140-trunk.patch
0001-NUTCH-1140-2.x.patch

Sorry, there was a typo in both patch files.

> index-more plugin, resetTitle method creates multiple values in the Title 
> field
> ---
>
> Key: NUTCH-1140
> URL: https://issues.apache.org/jira/browse/NUTCH-1140
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3
>Reporter: Joe Liedtke
>Priority: Minor
> Fix For: 1.10
>
> Attachments: 0001-NUTCH-1140-2.x.patch, 0001-NUTCH-1140-trunk.patch, 
> MoreIndexingFilter.093011.patch
>
>
> From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
> to reset the Title field of a document if it contains a Content-Disposition 
> header. The current behavior is to add a Title regardless of whether one 
> exists or not, which can cause issues down the line with the Solr Indexing 
> process, and based on a thread in the nutch user list it appears that this is 
> causing some users to mark the title as multi-valued in the schema:
>   
> http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
> The following patch removes the title field before adding a new one, which 
> has resolved the issue for me:
> --- MoreIndexingFilter.old    2011-09-30 11:44:35.0 +
> +++ MoreIndexingFilter.java   2011-09-30 09:58:48.0 +
> @@ -276,6 +276,7 @@
>      for (int i = 0; i < patterns.length; i++) {
>        if (matcher.contains(contentDisposition, patterns[i])) {
>          result = matcher.getMatch();
> +        doc.removeField("title");
>          doc.add("title", result.group(1));
>          break;
>        }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2014-11-10 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1140:
-
Attachment: (was: 0001-NUTCH-1140-trunk.patch)



[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2014-11-10 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1140:
-
Attachment: (was: 0001-NUTCH-1140-2.x.patch)



[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2014-11-07 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1140:
-
Attachment: 0001-NUTCH-1140-trunk.patch
0001-NUTCH-1140-2.x.patch

So this is still an issue. Here is a sample list of URLs in the wild that 
trigger this problem:

http://www.10-s.com/site/tennis-supply/site-map.html
http://www.bigappleherp.com/site/content/big_apple_cares.html
http://www.bigappleherp.com/site/content/CareSheets.html
http://www.bigappleherp.com/site/content/company_information.html
http://www.bigappleherp.com/site/content/customer_service.html
http://www.bigappleherp.com/site/content/LiveAnimals.html
http://www.bigappleherp.com/site/content/testimonials_02.html
http://www.magellangps.com/lp/truckfamily/screens.html

Now, based on a bit of reading I did on Content-Disposition, it is a reasonable 
alternative way of determining a title, which would mostly just be the file 
name, but it should NOT override the actual title if one exists, as the 
information in the title is far more valuable than the file name. Not to 
mention that the title is the actual title and should not be replaced if some 
other value exists.
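The preferred behavior can be sketched as follows (titleFrom is an illustrative helper, not the plugin's actual API, and its filename regex is a simplification of real Content-Disposition parsing):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: use the Content-Disposition filename as the title only when no
// title already exists, never overriding a real one.
public class TitleFallback {
    private static final Pattern FILENAME = Pattern.compile("filename=\"?([^\";]+)\"?");

    static String titleFrom(String contentDisposition, String existingTitle) {
        if (existingTitle != null && !existingTitle.isEmpty()) {
            return existingTitle; // never override a real title
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        return m.find() ? m.group(1) : null; // fall back to the header's file name
    }

    public static void main(String[] args) {
        System.out.println(titleFrom("attachment; filename=\"report.pdf\"", null));   // report.pdf
        System.out.println(titleFrom("attachment; filename=\"report.pdf\"", "Spec")); // Spec
    }
}
```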



[jira] [Created] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly

2014-09-16 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1842:


 Summary: crawl.gen.delay has a wrong default value in 
nutch-default.xml or is being parsed incorrectly 
 Key: NUTCH-1842
 URL: https://issues.apache.org/jira/browse/NUTCH-1842
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.9
Reporter: kaveh minooie
Priority: Minor


this is from nutch-default.xml:


<property>
  <name>crawl.gen.delay</name>
  <value>604800000</value>
  <description>
   This value, expressed in milliseconds, defines how long we should keep the
   lock on records in CrawlDb that were just selected for fetching. If these
   records are not updated in the meantime, the lock is canceled, i.e. they
   become eligible for selecting. Default value of this is 7 days
   (604800000 ms).
  </description>
</property>



and this is from o.a.n.crawl.Generator.configure(JobConf job):

genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;

The value in the config file is in milliseconds, but the code expects it to be 
in days. I reported this a couple of years ago on the mailing list as well. I 
didn't post a patch because I am not sure which one needs to be fixed: 
considering that all the other values in the config file are in milliseconds, 
it can be argued that consistency matters, but 'day' is a much more reasonable 
unit for this property.

Also, is this value not being used in 2.x?
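To make the mismatch concrete, here is a minimal sketch of the arithmetic (GenDelayDemo is an illustrative stand-in; the multiplication is the line quoted above from Generator.configure):

```java
// Sketch of the crawl.gen.delay unit mismatch described above.
public class GenDelayDemo {
    // What the code does: it treats the configured value as DAYS.
    static long genDelay(long configuredValue) {
        return configuredValue * 3600L * 24L * 1000L;
    }

    public static void main(String[] args) {
        // Intended: the default of 7 (days) -> 604,800,000 ms, i.e. one week.
        System.out.println(genDelay(7L));         // 604800000
        // But nutch-default.xml ships the value already in milliseconds, so
        // that millisecond count gets multiplied by 86,400,000 again:
        System.out.println(genDelay(604800000L)); // 52254720000000000 ms (~1.65 million years)
    }
}
```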






[jira] [Updated] (NUTCH-1480) SolrIndexer to write to multiple servers.

2014-09-11 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1480:
-
Attachment: adding-support-for-sharding-indexer-for-solr.patch

I just found this issue today when I was checking whether what I am about to 
upload would be a duplicate, and good thing I did, since apparently there are 
quite a few issues about this. But considering that this is the latest one, I 
will post it here.

This patch adds another plugin, indexer-solrshard, that allows sharding the 
index data on the Nutch side. It is mostly geared toward Solr 3.x, as there 
are still a few of those around (including in our production environment), but 
it can have benefits even with Solr 4.x, to which I will get.

It adds two new properties to the Nutch config file (solr.shardkey and 
solr.server.urls). solr.shardkey is the name of the field that should be used 
to generate the hash code (if used against Solr 3.x it should be the uniqueKey 
field in the schema file, otherwise deletes would not work properly), and 
solr.server.urls is a comma-separated list of Solr core URLs or instance URLs.

The plugin divides the hash value by the number of URLs to figure out in which 
core it should put the document. It also uses the rest of the Solr properties 
(commit size, etc.); the code is really the same. The idea behind having 
solr.server.urls instead of just reusing solr.server.url was so that both 
plugins could be used simultaneously, which can help in migrating from 3.x to 
4.x as well, though I guess the same argument can be made for the other 
properties.

The code uses the String.hashCode function, which is really good enough in 
terms of evenly distributing docs across multiple cores (in our case, with 
about 85 million docs over 8 cores, the difference between the number of docs 
in each core is less than 5%), but changing the hash function, or even making 
it customizable as was suggested in NUTCH-945, is trivial.

Turning the hashing mechanism off is also trivial (again, I didn't know about 
this issue when I was writing this, otherwise I would have done it already): 
we could add another property such as solr.usehash and, by setting it to 
false, have the plugin just post the documents to all the servers, which could 
also be quite useful.

As for using it against Solr 4.x, it can function as a load balancer. Believe 
me when I say watching 40 reduce jobs try to write to a single Solr instance 
is rather horrifying.

The patch is against trunk, but porting it to 2.x is trivial (I actually think 
it can probably be applied as is, but I haven't tested it yet).
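The shard-selection scheme described above can be sketched as follows (the class name, constructor, and property parsing here are illustrative, not the patch's actual code):

```java
// Sketch: map each document onto one of the cores listed in solr.server.urls
// by hashing the value of the configured solr.shardkey field.
public class SolrShardPicker {
    private final String[] serverUrls; // parsed from the solr.server.urls property

    SolrShardPicker(String commaSeparatedUrls) {
        this.serverUrls = commaSeparatedUrls.trim().split("\\s*,\\s*");
    }

    // Pick the core a document belongs to from its shard-key value.
    // Math.floorMod keeps the index non-negative even when hashCode() is negative.
    String serverFor(String shardKeyValue) {
        int idx = Math.floorMod(shardKeyValue.hashCode(), serverUrls.length);
        return serverUrls[idx];
    }

    public static void main(String[] args) {
        SolrShardPicker picker =
            new SolrShardPicker("http://solr1:8983/solr, http://solr2:8983/solr");
        // The same key always lands on the same core, so updates and deletes
        // for a document reach the shard that actually holds it.
        System.out.println(picker.serverFor("http://example.com/page"));
    }
}
```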

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.10
>
> Attachments: NUTCH-1480-1.6.1.patch, 
> adding-support-for-sharding-indexer-for-solr.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this 
> issue allows you to index to multiple SolrCloud clusters at the same time.





[jira] [Updated] (NUTCH-1840) the describe function in SolrIndexWriter is not correct

2014-09-11 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1840:
-
Attachment: 2.x-updated-description-in-SolrIndexWriter.patch
trunk-1.10-updated-description-in-SolrIndexWriter.patch

> the describe function in SolrIndexWriter is not correct
> ---
>
> Key: NUTCH-1840
> URL: https://issues.apache.org/jira/browse/NUTCH-1840
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3, 1.9
>    Reporter: kaveh minooie
>Priority: Trivial
> Attachments: 2.x-updated-description-in-SolrIndexWriter.patch, 
> trunk-1.10-updated-description-in-SolrIndexWriter.patch
>
>
> the describe function in SolrIndexWriter is not correct





[jira] [Created] (NUTCH-1840) the describe function in SolrIndexWriter is not correct

2014-09-11 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1840:


 Summary: the describe function in SolrIndexWriter is not correct
 Key: NUTCH-1840
 URL: https://issues.apache.org/jira/browse/NUTCH-1840
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.9, 2.3
Reporter: kaveh minooie
Priority: Trivial


the describe function in SolrIndexWriter is not correct





[jira] [Updated] (NUTCH-1831) compiling against gora-0.5 fails

2014-08-27 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1831:
-

Attachment: NUTCH-1831.patch

This seems to fix the problem, but I would appreciate it if someone could verify it.

> compiling against gora-0.5 fails
> 
>
> Key: NUTCH-1831
> URL: https://issues.apache.org/jira/browse/NUTCH-1831
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.3
>    Reporter: kaveh minooie
> Attachments: NUTCH-1831.patch
>
>
> currently, if you try to compile Nutch against Gora 0.5, you will get the 
> following errors:
> clean-lib:
> resolve-default:
> [ivy:resolve] :: Apache Ivy 2.3.0-local-20140109133456 - 20140109133456 :: 
> http://ant.apache.org/ivy/ ::
> [ivy:resolve] :: loading settings :: file = /sources/nutch/ivy/ivysettings.xml
>   [taskdef] Could not load definitions from resource 
> org/sonar/ant/antlib.xml. It could not be found.
> copy-libs:
>  [copy] Copying 128 files to /sources/nutch/build/lib
> compile-core:
> [javac] Compiling 200 source files to /sources/nutch/build/classes
> [javac] warning: [options] bootstrap class path not set in conjunction 
> with -source 1.6
> [javac] /sources/nutch/src/java/org/apache/nutch/storage/WebPage.java:8: 
> error: WebPage is not abstract and does not override abstract method 
> getFieldsCount() in PersistentBase
> [javac] public class WebPage extends 
> org.apache.gora.persistency.impl.PersistentBase implements 
> org.apache.avro.specific.SpecificRecord, 
> org.apache.gora.persistency.Persistent {
> [javac]^
> [javac] 
> /sources/nutch/src/java/org/apache/nutch/storage/ProtocolStatus.java:11: 
> error: ProtocolStatus is not abstract and does not override abstract method 
> getFieldsCount() in PersistentBase
> [javac] public class ProtocolStatus extends 
> org.apache.gora.persistency.impl.PersistentBase implements 
> org.apache.avro.specific.SpecificRecord, 
> org.apache.gora.persistency.Persistent {
> [javac]^
> [javac] 
> /sources/nutch/src/java/org/apache/nutch/storage/ParseStatus.java:8: error: 
> ParseStatus is not abstract and does not override abstract method 
> getFieldsCount() in PersistentBase
> [javac] public class ParseStatus extends 
> org.apache.gora.persistency.impl.PersistentBase implements 
> org.apache.avro.specific.SpecificRecord, 
> org.apache.gora.persistency.Persistent {
> [javac]^
> [javac] /sources/nutch/src/java/org/apache/nutch/storage/Host.java:12: 
> error: Host is not abstract and does not override abstract method 
> getFieldsCount() in PersistentBase
> [javac] public class Host extends 
> org.apache.gora.persistency.impl.PersistentBase implements 
> org.apache.avro.specific.SpecificRecord, 
> org.apache.gora.persistency.Persistent {
> [javac]^
> [javac] Note: Some input files use unchecked or unsafe operations.
> [javac] Note: Recompile with -Xlint:unchecked for details.
> [javac] 4 errors
> [javac] 1 warning
> BUILD FAILED
> /sources/nutch/build.xml:101: Compile failed; see the compiler error output 
> for details.
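The compile log shows the root cause: Gora 0.5's PersistentBase declares an abstract getFieldsCount() that the generated storage classes do not yet override. A minimal stand-alone illustration, using stand-in classes rather than the real Gora API (the field count of 2 is an arbitrary placeholder):

```java
// Stand-in for org.apache.gora.persistency.impl.PersistentBase in Gora 0.5,
// which added this abstract method.
abstract class PersistentBase {
    public abstract int getFieldsCount();
}

// Stand-in for a regenerated storage class such as WebPage, ProtocolStatus,
// ParseStatus, or Host: without the override below, compilation fails with
// "WebPage is not abstract and does not override abstract method
// getFieldsCount() in PersistentBase".
class WebPage extends PersistentBase {
    // A regenerated class returns the number of fields in its Avro schema;
    // 2 here is just a placeholder.
    private static final int FIELD_COUNT = 2;

    @Override
    public int getFieldsCount() {
        return FIELD_COUNT;
    }
}

public class GoraUpgradeSketch {
    public static void main(String[] args) {
        System.out.println(new WebPage().getFieldsCount());
    }
}
```

Regenerating the Avro-backed classes against Gora 0.5 (presumably what the attached patch does) adds this kind of override to each affected class.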



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (NUTCH-1831) compiling against gora-0.5 fails

2014-08-27 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1831:


 Summary: compiling against gora-0.5 fails
 Key: NUTCH-1831
 URL: https://issues.apache.org/jira/browse/NUTCH-1831
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
Reporter: kaveh minooie


currently, if you try to compile Nutch against Gora 0.5, you will get the following 
errors:

clean-lib:

resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0-local-20140109133456 - 20140109133456 :: 
http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /sources/nutch/ivy/ivysettings.xml
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. 
It could not be found.

copy-libs:
 [copy] Copying 128 files to /sources/nutch/build/lib

compile-core:
[javac] Compiling 200 source files to /sources/nutch/build/classes
[javac] warning: [options] bootstrap class path not set in conjunction with 
-source 1.6
[javac] /sources/nutch/src/java/org/apache/nutch/storage/WebPage.java:8: 
error: WebPage is not abstract and does not override abstract method 
getFieldsCount() in PersistentBase
[javac] public class WebPage extends 
org.apache.gora.persistency.impl.PersistentBase implements 
org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent 
{
[javac]^
[javac] 
/sources/nutch/src/java/org/apache/nutch/storage/ProtocolStatus.java:11: error: 
ProtocolStatus is not abstract and does not override abstract method 
getFieldsCount() in PersistentBase
[javac] public class ProtocolStatus extends 
org.apache.gora.persistency.impl.PersistentBase implements 
org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent 
{
[javac]^
[javac] 
/sources/nutch/src/java/org/apache/nutch/storage/ParseStatus.java:8: error: 
ParseStatus is not abstract and does not override abstract method 
getFieldsCount() in PersistentBase
[javac] public class ParseStatus extends 
org.apache.gora.persistency.impl.PersistentBase implements 
org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent 
{
[javac]^
[javac] /sources/nutch/src/java/org/apache/nutch/storage/Host.java:12: 
error: Host is not abstract and does not override abstract method 
getFieldsCount() in PersistentBase
[javac] public class Host extends 
org.apache.gora.persistency.impl.PersistentBase implements 
org.apache.avro.specific.SpecificRecord, org.apache.gora.persistency.Persistent 
{
[javac]^
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 4 errors
[javac] 1 warning

BUILD FAILED
/sources/nutch/build.xml:101: Compile failed; see the compiler error output for 
details.






[jira] [Commented] (NUTCH-1791) Null pointer exceptions with gora-cassandra-0.4

2014-06-12 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028977#comment-14028977
 ] 

kaveh minooie commented on NUTCH-1791:
--

It would have helped if you could post a bit more of those stack traces, 
specifically the opening lines of the sections that start with 'Caused by'.

> Null pointer exceptions with gora-cassandra-0.4
> ---
>
> Key: NUTCH-1791
> URL: https://issues.apache.org/jira/browse/NUTCH-1791
> Project: Nutch
>  Issue Type: Bug
>  Components: generator, storage
>Affects Versions: 2.3
> Environment: dsc-cassandra-2.0.2, dsc-cassandra-2.0.7
>Reporter: Koen Smets
> Fix For: 2.3
>
>
> Latest nutch-2.x source checkout fails to run with Cassandra 2.0.2 (and also 
> Cassandra 2.0.7) as the storage backend, both in the normal Nutch operation 
> cycle (inject, generate, fetch) and in the junit tests {{TestGoraStorage}}
> {code}
> 2014-06-03 11:24:23,495 INFO  connection.CassandraHostRetryService 
> (CassandraHostRetryService.java:(48)) - Downed Host Retry service 
> started with queue size -1 and retry delay 10s
> 2014-06-03 11:24:23,535 INFO  service.JmxMonitor 
> (JmxMonitor.java:registerMonitor(52)) - Registering JMX 
> me.prettyprint.cassandra.service_Test 
> Cluster:ServiceType=hector,MonitorType=hector
> Exception in thread "main" java.lang.NullPointerException
>   at 
> org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
>   at 
> org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
>   at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
>   at 
> org.apache.nutch.storage.TestGoraStorage.readWrite(TestGoraStorage.java:93)
>   at 
> org.apache.nutch.storage.TestGoraStorage.main(TestGoraStorage.java:230)
> {code}
> After injecting:
> {code}
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch inject urls
> InjectorJob: starting at 2014-06-03 11:55:11
> InjectorJob: Injecting urlDir: urls
> InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as 
> the Gora storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and filtering: 
> 1
> Injector: finished at 2014-06-03 11:55:13, elapsed: 00:00:02
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
> WebTable statistics start
> Statistics for WebTable:
> min score:1.0
> retry 0:  1
> jobs: {db_stats-job_local1403358409_0001={jobID=job_local1403358409_0001, 
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=97, MAP_INPUT_RECORDS=1, 
> REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=12, MAP_OUTPUT_BYTES=53, 
> COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, 
> COMBINE_INPUT_RECORDS=4, REDUCE_INPUT_RECORDS=6, REDUCE_INPUT_GROUPS=6, 
> COMBINE_OUTPUT_RECORDS=6, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=6, 
> VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4}, 
> FileSystemCounters={FILE_BYTES_READ=974145, FILE_BYTES_WRITTEN=1144369}, File 
> Output Format Counters ={BYTES_WRITTEN=225
> max score:1.0
> TOTAL urls:   1
> status 0 (null):  1
> avg score:1.0
> WebTable statistics: done
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
> key:  http://example.com/
> baseUrl:  null
> status:   0 (null)
> fetchTime:1401789311270
> prevFetchTime:0
> fetchInterval:2592000
> retriesSinceFetch:0
> modifiedTime: 0
> prevModifiedTime: 0
> protocolStatus:   (null)
> parseStatus:  (null)
> title:null
> score:1.0
> markers:  org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
> reprUrl:  null
> metadata _csh_ :  ?�
> {code}
> After generating,
> {code}
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch generate -topN 1
> GeneratorJob: starting at 2014-06-03 11:55:38
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: true
> GeneratorJob: normalizing: true
> GeneratorJob: topN: 1
> GeneratorJob: finished at 2014-06-03 11:55:40, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1401789338-222512082 containing 1 URLs
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
> WebTable statistics start
> Statistics for WebTable:
> jobs: {db_stats-job_local73029265_0001={jobID=job_loc

[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1780:
-

Attachment: NUTCH-1780.patch

> ttl and gc_grace_seconds attributes are missing from 
> gora-cassandra-mapping.xml file
> 
>
> Key: NUTCH-1780
> URL: https://issues.apache.org/jira/browse/NUTCH-1780
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.3
>    Reporter: kaveh minooie
> Attachments: NUTCH-1780.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> after upgrading to gora 0.4 (NUTCH-1714) we need extra properties in the C* 
> mapping file. I also added a few, IMHO, helpful hints. 





[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1780:
-

Attachment: (was: NUTCH-1780.patch)

> ttl and gc_grace_seconds attributes are missing from 
> gora-cassandra-mapping.xml file
> 
>
> Key: NUTCH-1780
> URL: https://issues.apache.org/jira/browse/NUTCH-1780
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.3
>    Reporter: kaveh minooie
> Attachments: NUTCH-1780.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> after upgrading to gora 0.4 (NUTCH-1714) we need extra properties in the C* 
> mapping file. I also added a few, IMHO, helpful hints. 





[jira] [Created] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1780:


 Summary: ttl and gc_grace_seconds attributes are missing from 
gora-cassandra-mapping.xml file
 Key: NUTCH-1780
 URL: https://issues.apache.org/jira/browse/NUTCH-1780
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.3
Reporter: kaveh minooie
 Attachments: NUTCH-1780.patch

after upgrading to gora 0.4 (NUTCH-1714) we need extra properties in the C* 
mapping file. I also added a few, IMHO, helpful hints. 





[jira] [Updated] (NUTCH-1780) ttl and gc_grace_seconds attributes are missing from gora-cassandra-mapping.xml file

2014-05-16 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1780:
-

Attachment: NUTCH-1780.patch

There is really no good default value for gc_grace_seconds. We could use 
Cassandra's default value, which is 10 days, but since the out-of-the-box setting 
is for a single-node cluster, I used 0, which is the best value for that setup. 
Also, using 0 forces people to actually change this before using it in a 
real cluster, which I think is appropriate here. 

> ttl and gc_grace_seconds attributes are missing from 
> gora-cassandra-mapping.xml file
> 
>
> Key: NUTCH-1780
> URL: https://issues.apache.org/jira/browse/NUTCH-1780
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.3
>    Reporter: kaveh minooie
> Attachments: NUTCH-1780.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> after upgrading to gora 0.4 (NUTCH-1714) we need extra properties in the C* 
> mapping file. I also added a few, IMHO, helpful hints. 





[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-15 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998118#comment-13998118
 ] 

kaveh minooie commented on NUTCH-1714:
--

Hi everyone,

it seems that the ttl and gc_grace_seconds attributes have not been added to 
the gora-cassandra-mapping.xml file. ttl is especially important since, for some 
reason, gora-cassandra sets it to 60 seconds if it is not defined.
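For reference, a sketch of where such attributes could sit in gora-cassandra-mapping.xml; the element layout and values below are assumptions based on this thread, not the committed patch (ttl is in seconds; Cassandra's own gc_grace_seconds default is 10 days):

```xml
<!-- Hypothetical fragment of gora-cassandra-mapping.xml; names and values
     are illustrative only. -->
<gora-orm>
  <keyspace name="webpage" cluster="Test Cluster" host="localhost">
    <!-- gc_grace_seconds="0" only suits a single-node cluster; raise it
         before going multi-node. -->
    <family name="f" gc_grace_seconds="0"/>
  </keyspace>
  <class name="org.apache.nutch.storage.WebPage"
         keyClass="java.lang.String" keyspace="webpage">
    <!-- Without an explicit ttl, gora-cassandra reportedly defaults to 60s;
         2592000 is 30 days. -->
    <field name="baseUrl" family="f" qualifier="bas" ttl="2592000"/>
  </class>
</gora-orm>
```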

> Nutch 2.x upgrade to Gora 0.4
> -
>
> Key: NUTCH-1714
> URL: https://issues.apache.org/jira/browse/NUTCH-1714
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Alparslan Avcı
>Assignee: Alparslan Avcı
> Fix For: 2.3
>
> Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
> NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch, NUTCH-1714v6.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
> details in this issue.





[jira] [Commented] (NUTCH-1642) mvn compile fails on Centos6.3

2013-09-16 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768532#comment-13768532
 ] 

kaveh minooie commented on NUTCH-1642:
--

I don't know if you are aware of this, but Nutch is built using 'ant' and the 
dependencies are resolved through Ivy, not Maven. I know there is a pom file in 
the root, but I am pretty sure it is not being maintained. Other people here 
will be able to give you more detailed information, and they might actually 
update the pom file as well, but for now just use ant: go to the root of the 
project and type ant (assuming you have ant installed on your system). 

> mvn compile fails on Centos6.3
> --
>
> Key: NUTCH-1642
> URL: https://issues.apache.org/jira/browse/NUTCH-1642
> Project: Nutch
>  Issue Type: Bug
> Environment: Apache Maven 3.1.0 
> (893ca28a1da9d5f51ac03827af98bb730128f9f2; 2013-06-28 10:15:32+0800)
> Java version: 1.7.0_25, vendor: Oracle Corporation
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "2.6.32-279.el6.x86_64", arch: "amd64", family: 
> "unix"
>Reporter: Xibao.Lv
> Attachments: NUTCH-1642.patch
>
>
> Hi all,
> I am new. When I run 'mvn compile', it returns some errors, like the following:
> 1.[ERROR] Failed to execute goal on project nutch: Could not resolve 
> dependencies for project org.apache.nutch:nutch:jar:2.2: The following 
> artifacts could not be resolved: javax.jms:jms:jar:1.1, 
> com.sun.jdmk:jmxtools:jar:1.2.1, com.sun.jmx:jmxri:jar:1.2.1, 
> org.restlet.jse:org.restlet:jar:2.0.5, 
> org.restlet.jse:org.restlet.ext.jackson:jar:2.0.5: Could not transfer 
> artifact javax.jms:jms:jar:1.1 from/to java.net 
> (https://maven-repository.dev.java.net/nonav/repository): No connector 
> available to access repository java.net 
> (https://maven-repository.dev.java.net/nonav/repository) of type legacy using 
> the available factories WagonRepositoryConnectorFactory
> It means that org.restlet.jse cannot be found in the main repository. Then I added 
> the repository address to pom.xml.
> 2.[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' 
> must be unique: org.jdom:jdom:jar -> duplicate declaration of version 1.1 @ 
> line 268, column 29
> 3.[ERROR] Failed to execute goal on project nutch: Could not resolve 
> dependencies for project org.apache.nutch:nutch:jar:2.2: The following 
> artifacts could not be resolved: javax.jms:jms:jar:1.1, 
> com.sun.jdmk:jmxtools:jar:1.2.1, com.sun.jmx:jmxri:jar:1.2.1: Could not 
> transfer artifact javax.jms:jms:jar:1.1 from/to java.net 
> (https://maven-repository.dev.java.net/nonav/repository): No connector 
> available to access repository java.net 
> (https://maven-repository.dev.java.net/nonav/repository) of type legacy using 
> the available factories WagonRepositoryConnectorFactory
> It means that we cannot find javax.jms 
> there (https://maven-repository.dev.java.net/nonav/repository). Google says 
> log4j-1.2.15 depends on javax.jms, so we can use a higher log4j version, such 
> as 1.2.16+.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1634) readdb -stats show the result twice

2013-08-28 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1634:
-

Attachment: NUTCH-1634-2.x.patch

I'll check the trunk as well and will post a patch for it if needed.

> readdb -stats show the result twice
> ---
>
> Key: NUTCH-1634
> URL: https://issues.apache.org/jira/browse/NUTCH-1634
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.2.1
>Reporter: kaveh minooie
>Priority: Minor
> Attachments: NUTCH-1634-2.x.patch
>
>
> right now this is the output:
> WebTable statistics start
> Statistics for WebTable: 
> status 2 (status_fetched):1115
> min score:0.0
> retry 2:  2
> retry 0:  11369
> jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, 
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, 
> MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
> MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, 
> SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, 
> REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
> REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, 
> FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File 
> Output Format Counters ={BYTES_WRITTEN=373
> retry 1:  5
> status 5 (status_redir_perm): 69
> max score:1.0
> TOTAL urls:   11376
> status 3 (status_gone):   3
> status 4 (status_redir_temp): 3
> status 1 (status_unfetched):  10186
> avg score:0.00342827
> WebTable statistics: done
> status 2 (status_fetched):1115
> min score:0.0
> retry 2:  2
> retry 0:  11369
> jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, 
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, 
> MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
> MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, 
> SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, 
> REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
> REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, 
> FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File 
> Output Format Counters ={BYTES_WRITTEN=373
> retry 1:  5
> status 5 (status_redir_perm): 69
> max score:1.0
> TOTAL urls:   11376
> status 3 (status_gone):   3
> status 4 (status_redir_temp): 3
> status 1 (status_unfetched):  10186
> avg score:0.00342827
> imho, it should be this:
> WebTable statistics start
> Statistics for WebTable: 
> status 2 (status_fetched):1115
> min score:0.0
> retry 2:  2
> retry 0:  11369
> jobs: {db_stats-job_local801282144_0001={jobID=job_local801282144_0001, 
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, 
> MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
> MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1122631680, CPU_MILLISECONDS=0, 
> SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, 
> REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
> REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, 
> FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166191}, File 
> Output Format Counters ={BYTES_WRITTEN=373
> retry 1:  5
> status 5 (status_redir_perm): 69
> max score:1.0
> TOTAL urls:   11376
> status 3 (status_gone):   3
> status 4 (status_redir_temp): 3
> status 1 (status_unfetched):  10186
> avg score:0.00342827
> WebTable statistics: done



[jira] [Updated] (NUTCH-1634) readdb -stats show the result twice

2013-08-28 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1634:
-

Description: 
right now this is the output:

WebTable statistics start
Statistics for WebTable: 
status 2 (status_fetched):  1115
min score:  0.0
retry 2:2
retry 0:11369
jobs:   {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, 
jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, 
MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, 
SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, 
REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, 
FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File 
Output Format Counters ={BYTES_WRITTEN=373
retry 1:5
status 5 (status_redir_perm):   69
max score:  1.0
TOTAL urls: 11376
status 3 (status_gone): 3
status 4 (status_redir_temp):   3
status 1 (status_unfetched):10186
avg score:  0.00342827
WebTable statistics: done
status 2 (status_fetched):  1115
min score:  0.0
retry 2:2
retry 0:11369
jobs:   {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, 
jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, 
MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, 
SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, 
REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, 
FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File 
Output Format Counters ={BYTES_WRITTEN=373
retry 1:5
status 5 (status_redir_perm):   69
max score:  1.0
TOTAL urls: 11376
status 3 (status_gone): 3
status 4 (status_redir_temp):   3
status 1 (status_unfetched):10186
avg score:  0.00342827


imho, it should be this:

WebTable statistics start
Statistics for WebTable: 
status 2 (status_fetched):  1115
min score:  0.0
retry 2:2
retry 0:11369
jobs:   {db_stats-job_local801282144_0001={jobID=job_local801282144_0001, 
jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, 
MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1122631680, CPU_MILLISECONDS=0, 
SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, 
REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, 
FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166191}, File 
Output Format Counters ={BYTES_WRITTEN=373
retry 1:5
status 5 (status_redir_perm):   69
max score:  1.0
TOTAL urls: 11376
status 3 (status_gone): 3
status 4 (status_redir_temp):   3
status 1 (status_unfetched):10186
avg score:  0.00342827
WebTable statistics: done



  was://TODO


> readdb -stats show the result twice
> ---
>
> Key: NUTCH-1634
> URL: https://issues.apache.org/jira/browse/NUTCH-1634
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.2.1
>    Reporter: kaveh minooie
>Priority: Minor
>
> right now this is the output:
> WebTable statistics start
> Statistics for WebTable: 
> status 2 (status_fetched):1115
> min score:0.0
> retry 2:  2
> retry 0:  11369
> jobs: {db_stats-job_local1037462743_0001={jobID=job_local1037462743_0001, 
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, 
> Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=209, 
> MAP_INPUT_RECORDS=11376, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
> MAP_OUTPUT_BYTES=602928, COMMITTED_HEAP_BYTES=1181220864, CPU_MILLISECONDS=0, 
> SPLIT_RAW_BYTES=757, COMBINE_INPUT_RECORDS=45504, REDUCE_INPUT_RECORDS=12, 
> REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
> REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=45504}, 
> FileSystemCounters={FILE_BYTES_READ=1819, FILE_BYTES_WRITTEN=166199}, File 
> Output Format Counters ={BYTES_WRITTEN=373
> retry 1:  5
> status 5 (status_redir_perm): 69
> max score:1.0
>

[jira] [Created] (NUTCH-1634) readdb -stats show the result twice

2013-08-28 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1634:


 Summary: readdb -stats show the result twice
 Key: NUTCH-1634
 URL: https://issues.apache.org/jira/browse/NUTCH-1634
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 2.2.1
Reporter: kaveh minooie
Priority: Minor


//TODO



[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-27 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1556:
-

Attachment: NUTCH-1556-v3.patch

There are typos (fetch instead of update) in v2 :)

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>    Reporter: kaveh minooie
> Fix For: 2.3
>
> Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, 
> NUTCH-1556-v3.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchIds simultaneously. I put together a patch. It seems to be working (it 
> does skip the rows that do not match the batchId), but I am worried about if 
> and how it might affect the sorting in the reduce part. Anyway, check it out. 
> It also changes the command line usage to this:
> Usage: DbUpdaterJob ( | -all) [-crawlId ]



[jira] [Updated] (NUTCH-1633) slf4j is provided by hadoop and should not be included in the job file.

2013-08-26 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1633:
-

Attachment: not-include-slf4j-in-job-file.trunk.patch
not-include-slf4j-in-job-file.2.x.patch

> slf4j is provided by hadoop and should not be included in the job file.
> ---
>
> Key: NUTCH-1633
> URL: https://issues.apache.org/jira/browse/NUTCH-1633
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.7, 2.2.1
>    Reporter: kaveh minooie
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: not-include-slf4j-in-job-file.2.x.patch, 
> not-include-slf4j-in-job-file.trunk.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> There are two issues with including slf4j in the job file. The minor of the 
> two is that slf4j starts issuing warnings when it finds more than one 
> instance in the classpath (GORA-272). The bigger issue happens when the 
> versions of slf4j in Hadoop and Nutch are not compatible (e.g. Hadoop 
> 1.1.1 & Nutch 2.1), which results in all Nutch jobs crashing. 



[jira] [Created] (NUTCH-1633) slf4j is provided by hadoop and should not be included in the job file.

2013-08-26 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1633:


 Summary: slf4j is provided by hadoop and should not be included in 
the job file.
 Key: NUTCH-1633
 URL: https://issues.apache.org/jira/browse/NUTCH-1633
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 2.2.1, 1.7
Reporter: kaveh minooie
Priority: Minor
 Fix For: 2.3


There are two issues with including slf4j in the job file. The minor of the two 
is that slf4j starts issuing warnings when it finds more than one instance in 
the classpath (GORA-272). The bigger issue happens when the versions of slf4j 
in Hadoop and Nutch are not compatible (e.g. Hadoop 1.1.1 & Nutch 2.1), which 
results in all Nutch jobs crashing. 



[jira] [Commented] (NUTCH-1632) add batchId argument for DbUpdaterJob

2013-08-26 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750268#comment-13750268
 ] 

kaveh minooie commented on NUTCH-1632:
--

If this one is accepted, make sure to close NUTCH-1556 as well.

> add batchId argument for DbUpdaterJob
> -
>
> Key: NUTCH-1632
> URL: https://issues.apache.org/jira/browse/NUTCH-1632
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 2.2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1632.patch
>
>
> add a batchId argument for DbUpdaterJob, so you can pass the batchId to 
> DbUpdaterJob.



[jira] [Commented] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-22 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747879#comment-13747879
 ] 

kaveh minooie commented on NUTCH-1629:
--

:) For future reference, the magic switch in git format-patch that does this 
(not putting 'a/' and 'b/' before the path) is --no-prefix

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7, 2.2.1
> Environment: Java 1.7.0_25
>Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1629--2.x.svn.patch, NUTCH-1629--trunk.svn.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-22 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1629:
-

Attachment: (was: 
NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-files-1.x.patch)

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 2.2.1
> Environment: Java 1.7.0_25
>    Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: NUTCH-1629--2.x.svn.patch, NUTCH-1629--trunk.svn.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-22 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1629:
-

Attachment: NUTCH-1629--trunk.svn.patch
NUTCH-1629--2.x.svn.patch

so like this? 

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 2.2.1
> Environment: Java 1.7.0_25
>    Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: NUTCH-1629--2.x.svn.patch, NUTCH-1629--trunk.svn.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-22 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1629:
-

Attachment: (was: 
NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-fi.patch)

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 2.2.1
> Environment: Java 1.7.0_25
>    Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: NUTCH-1629--2.x.svn.patch, NUTCH-1629--trunk.svn.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



[jira] [Commented] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-22 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747722#comment-13747722
 ] 

kaveh minooie commented on NUTCH-1629:
--

Sorry Julien, could you tell me what I am missing here? Do you mean that I 
should have only one file for both branches? 

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 2.2.1
> Environment: Java 1.7.0_25
>Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: 
> NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-files-1.x.patch, 
> NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-fi.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-21 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1629:
-

Attachment: 
NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-files-1.x.patch
NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-fi.patch

@Julien these should do it.

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 2.2.1
> Environment: Java 1.7.0_25
>    Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: 
> NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-files-1.x.patch, 
> NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-fi.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-21 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1629:
-

Attachment: (was: 
0001-no-need-to-fail-on-empty-lines-in-seed-files.patch)

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 2.2.1
> Environment: Java 1.7.0_25
>    Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: 
> NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-files-1.x.patch, 
> NUTCH-1629--no-need-to-fail-on-empty-lines-in-seed-fi.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



[jira] [Created] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-20 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1629:


 Summary: there is no need to fail on empty lines in seed file when 
injecting.
 Key: NUTCH-1629
 URL: https://issues.apache.org/jira/browse/NUTCH-1629
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 2.2.1
 Environment: Java 1.7.0_25
Reporter: kaveh minooie
 Fix For: 2.3
 Attachments: 0001-no-need-to-fail-on-empty-lines-in-seed-files.patch

right now, if there is an empty line in a seed file, TableUtil.reverseUrl would 
throw an exception that would kill the inject job. 
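
A minimal sketch of the behavior the issue asks for: skip blank seed lines instead of passing them on to URL reversal. The class and method names below are illustrative, not taken from the attached patches:

```java
// Sketch: filter seed-file lines before handing them to URL reversal.
// SeedLineFilter and isUsableSeedLine are hypothetical names, not the
// code from the attached NUTCH-1629 patches.
public class SeedLineFilter {

    // A line is worth injecting only if something is left after trimming;
    // empty and whitespace-only lines are silently skipped.
    public static boolean isUsableSeedLine(String line) {
        return line != null && !line.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isUsableSeedLine("http://example.com/")); // true
        System.out.println(isUsableSeedLine("   "));                 // false
    }
}
```

Guarding the mapper this way means a stray blank line degrades to a no-op instead of killing the whole inject job.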



[jira] [Updated] (NUTCH-1629) there is no need to fail on empty lines in seed file when injecting.

2013-08-20 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1629:
-

Attachment: 0001-no-need-to-fail-on-empty-lines-in-seed-files.patch

> there is no need to fail on empty lines in seed file when injecting.
> 
>
> Key: NUTCH-1629
> URL: https://issues.apache.org/jira/browse/NUTCH-1629
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 2.2.1
> Environment: Java 1.7.0_25
>    Reporter: kaveh minooie
>  Labels: easyfix
> Fix For: 2.3
>
> Attachments: 0001-no-need-to-fail-on-empty-lines-in-seed-files.patch
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> right now, if there is an empty line in a seed file, TableUtil.reverseUrl 
> would throw an exception that would kill the inject job. 



Re: problem with running 2.x in eclipse

2013-08-16 Thread kaveh minooie
Never mind. Using an absolute path took care of the issue. May I suggest 
that in the wiki ( 
http://wiki.apache.org/nutch/RunNutchInEclipse#Troubleshooting ), in the 
troubleshooting section, the sample absolute path should not be 
src/plugin, it should be build/plugins, as in:



<property>
  <name>plugin.folders</name>
  <value>/home/../trunk/build/plugin</value>
</property>




On 08/16/2013 04:11 PM, kaveh minooie wrote:

Hi everyone
  so I am trying to run in eclipse and I followed this
https://wiki.apache.org/nutch/RunNutchInEclipse

Now when I am trying to run inject command I get this in the hadoop.log

2013-08-16 15:57:18,184 WARN  snappy.LoadSnappy - Snappy native library
not loaded
2013-08-16 15:57:18,716 INFO  mapreduce.GoraRecordWriter -
gora.buffer.write.limit = 1
2013-08-16 15:57:18,736 WARN  plugin.PluginRepository - Plugins:
directory not found: plugins
2013-08-16 15:57:18,738 WARN  mapred.FileOutputCommitter - Output path
is null in cleanup
2013-08-16 15:57:18,739 WARN  mapred.LocalJobRunner -
job_local1120981005_0001
java.lang.Exception: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.
 at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.
 at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)

as you can see, nutch cannot see the plugins directory. my eclipse is
4.3, so the Order and Export tab in the Java Build Path dialog box is a
bit different, but I think everything is set correctly. Now, I know that
extensions and extension points are a bit new in nutch, so I was
wondering if, in addition to what is said in the tutorial, there is
other stuff that needs to be configured in eclipse to be able to run the
code within eclipse?

thanks,


--
Kaveh Minooie


problem with running 2.x in eclipse

2013-08-16 Thread kaveh minooie

Hi everyone
 so I am trying to run in eclipse and I followed this 
https://wiki.apache.org/nutch/RunNutchInEclipse


Now when I am trying to run inject command I get this in the hadoop.log

2013-08-16 15:57:18,184 WARN  snappy.LoadSnappy - Snappy native library 
not loaded
2013-08-16 15:57:18,716 INFO  mapreduce.GoraRecordWriter - 
gora.buffer.write.limit = 1
2013-08-16 15:57:18,736 WARN  plugin.PluginRepository - Plugins: 
directory not found: plugins
2013-08-16 15:57:18,738 WARN  mapred.FileOutputCommitter - Output path 
is null in cleanup
2013-08-16 15:57:18,739 WARN  mapred.LocalJobRunner - 
job_local1120981005_0001
java.lang.Exception: java.lang.RuntimeException: x point 
org.apache.nutch.net.URLNormalizer not found.

at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: x point 
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)

as you can see, nutch cannot see the plugins directory. my eclipse is 
4.3, so the Order and Export tab in the Java Build Path dialog box is a 
bit different, but I think everything is set correctly. Now, I know that 
extensions and extension points are a bit new in nutch, so I was 
wondering if, in addition to what is said in the tutorial, there is 
other stuff that needs to be configured in eclipse to be able to run the 
code within eclipse?


thanks,
--
Kaveh Minooie


crawl.gen.delay

2013-08-15 Thread kaveh minooie
 is 'crawl.gen.delay' still being used anywhere? Because I can't find 
anything in the source code except for here:


package org.apache.nutch.crawl;

public class GeneratorJob extends NutchTool implements Tool {
  public static final String GENERATOR_TOP_N = "generate.topN";
  public static final String GENERATOR_CUR_TIME = "generate.curTime";
  public static final String GENERATOR_DELAY = "crawl.gen.delay";

, and I think it has the wrong value in the nutch-default.xml file (the 
value is in seconds; it should be in days)




[jira] [Updated] (NUTCH-1624) Typo in WebTableReader line 486

2013-08-13 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1624:
-

Attachment: 0001-NUTCH-1624.patch

> Typo in WebTableReader  line 486
> 
>
> Key: NUTCH-1624
> URL: https://issues.apache.org/jira/browse/NUTCH-1624
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.2.1
> Environment: this was seen in 2.X HEAD 
>    Reporter: kaveh minooie
>Priority: Minor
> Fix For: 2.3
>
> Attachments: 0001-NUTCH-1624.patch
>
>
> the error message suggests that the user use, among other things, '-stat'; 
> it should be '-stats' 



[jira] [Updated] (NUTCH-1624) Typo in WebTableReader line 486

2013-08-13 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1624:
-

Summary: Typo in WebTableReader  line 486  (was: Type in WebTableReader  
line 486)

> Typo in WebTableReader  line 486
> 
>
> Key: NUTCH-1624
> URL: https://issues.apache.org/jira/browse/NUTCH-1624
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.2.1
> Environment: this was seen in 2.X HEAD 
>    Reporter: kaveh minooie
>Priority: Minor
> Fix For: 2.3
>
>
> the error message suggests that the user use, among other things, '-stat'; 
> it should be '-stats' 



[jira] [Created] (NUTCH-1624) Type in WebTableReader line 486

2013-08-13 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1624:


 Summary: Type in WebTableReader  line 486
 Key: NUTCH-1624
 URL: https://issues.apache.org/jira/browse/NUTCH-1624
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.2.1
 Environment: this was seen in 2.X HEAD 
Reporter: kaveh minooie
Priority: Minor
 Fix For: 2.3


the error message suggests that the user use, among other things, '-stat'; it 
should be '-stats' 



Re: so why does solrindex-mapping.xml get ignored?

2013-04-12 Thread kaveh minooie
the code puts the value under the original key anyway. there is no 
'mapping'. it just copies. we have other instructions for copying fields. 
I think the code should strictly follow the mapping file, and that the 
whole if statement should not be there.



On 04/12/2013 02:54 PM, Lewis John Mcgibbney wrote:

Hi Kaveh,


On Thu, Apr 11, 2013 at 11:53 PM, <mailto:dev-digest-h...@nutch.apache.org>> wrote:



so why does solrindex-mapping.xml get ignored?
        23089 by: kaveh minooie

why are we doing this?


I have no idea.
What is wrong?




so why does solrindex-mapping.xml get ignored?

2013-04-11 Thread kaveh minooie
this is from 
nutch/src/java/org/apache/nutch/indexer/solr/SolrWriter.java write function:


@Override
public void write(NutchDocument doc) throws IOException {
  final SolrInputDocument inputDoc = new SolrInputDocument();
  for (final Entry<String, List<String>> e : doc) {
    for (final String val : e.getValue()) {

      Object val2 = val;
      if (e.getKey().equals("content") || e.getKey().equals("title")) {
        val2 = stripNonCharCodepoints(val);
      }

      inputDoc.addField(solrMapping.mapKey(e.getKey()), val2);
      String sCopy = solrMapping.mapCopyKey(e.getKey());
      if (sCopy != e.getKey()) {
        inputDoc.addField(sCopy, val2);
      }
    }
  }


as you can see it checks to see if the field is mapped to a different 
name and if that is the case, it adds it under the original key in 
addition to the mapped key. why are we doing this?
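
The strict alternative being argued for can be sketched with plain maps standing in for solrindex-mapping.xml and the Solr document; none of the names below are the actual Nutch API:

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of "strictly follow the mapping file": a field goes
// ONLY under its mapped name, never additionally under the original key.
// MAPPING and mapKeyStrict are stand-ins, not SolrWriter/solrMapping code.
public class MappingSketch {

    // solrindex-mapping.xml reduced to a map: nutch field -> solr field
    static final Map<String, String> MAPPING = Map.of("content", "text");

    static String mapKeyStrict(String key) {
        // unmapped fields keep their original name; mapped fields are renamed
        return MAPPING.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        Map<String, String> solrDoc = new HashMap<>();
        solrDoc.put(mapKeyStrict("content"), "body text");
        solrDoc.put(mapKeyStrict("title"), "NBA.com");
        // "content" lands only under "text"; "title" is unmapped and unchanged
        System.out.println(solrDoc.keySet());
    }
}
```

Under this behavior the document never contains both "content" and "text", which is the duplication the quoted SolrWriter code produces.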


--
Kaveh Minooie


[jira] [Commented] (NUTCH-1555) bug in 2.x ParserJob command line parsing

2013-04-10 Thread kaveh minooie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628051#comment-13628051
 ] 

kaveh minooie commented on NUTCH-1555:
--

Commons CLI uses Maven to build. Considering that Nutch uses Ivy, wouldn't 
that be an issue?

> bug in 2.x ParserJob command line parsing 
> --
>
> Key: NUTCH-1555
> URL: https://issues.apache.org/jira/browse/NUTCH-1555
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
> Fix For: 2.2
>
>
> I just accidentally passed in the following argument to parser job
> {code}
> law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse 
> updatedb
> ParserJob: starting
> ParserJob: resuming:  false
> ParserJob: forced reparse:false
> ParserJob: batchId:   updatedb
> ParserJob: success
> {code}
> This is a bug for sure



[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId

2013-04-10 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1556:
-

Description: 
So the idea here is to be able to run updatedb and fetch for different batchIds 
simultaneously. I put together a patch. It seems to be working (it does skip 
the rows that do not match the batchId), but I am worried about if and how it 
might affect the sorting in the reduce part. Anyway, check it out. 

It also changes the command line usage to this:
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]

  was:So the idea here is to be able to run updatedb and fetch for different 
batchId simultaneously. I put together a patch. it seems to be working ( it 
does skip the rows that do not match the batchId), but I am worried if and how 
it might affect the sorting in the reduce part. anyway check it out. 


> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>    Reporter: kaveh minooie
> Attachments: NUTCH-1556.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchIds simultaneously. I put together a patch. It seems to be working (it 
> does skip the rows that do not match the batchId), but I am worried about if 
> and how it might affect the sorting in the reduce part. Anyway, check it out. 
> It also changes the command line usage to this:
> Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]



[jira] [Created] (NUTCH-1556) enabling updatedb to accept batchId

2013-04-10 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1556:


 Summary: enabling updatedb to accept batchId 
 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Attachments: NUTCH-1556.patch

So the idea here is to be able to run updatedb and fetch for different batchIds 
simultaneously. I put together a patch. It seems to be working (it does skip 
the rows that do not match the batchId), but I am worried about if and how it 
might affect the sorting in the reduce part. Anyway, check it out. 
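
The row filter the patch describes can be sketched as a single predicate; the names below are illustrative, not the attached NUTCH-1556.patch:

```java
// Sketch: process a webpage row only when its batch marker matches the
// requested batchId, or when "-all" was requested. BatchFilter and
// shouldProcess are hypothetical names, not the attached patch.
public class BatchFilter {

    static boolean shouldProcess(String rowBatchId, String requested) {
        if ("-all".equals(requested)) {
            return true;                     // no filtering: touch every row
        }
        // rows written by a different generate/fetch round are skipped
        return requested.equals(rowBatchId);
    }

    public static void main(String[] args) {
        System.out.println(shouldProcess("1365031584-127026", "-all"));  // true
        System.out.println(shouldProcess("1365031584-127026", "999-1")); // false
    }
}
```

A mapper that applies this check before emitting rows is what lets two crawl rounds with different batchIds run updatedb concurrently without stepping on each other.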



[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId

2013-04-10 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1556:
-

Attachment: NUTCH-1556.patch

> enabling updatedb to accept batchId 
> 
>
> Key: NUTCH-1556
> URL: https://issues.apache.org/jira/browse/NUTCH-1556
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>    Reporter: kaveh minooie
> Attachments: NUTCH-1556.patch
>
>
> So the idea here is to be able to run updatedb and fetch for different 
> batchIds simultaneously. I put together a patch. It seems to be working (it 
> does skip the rows that do not match the batchId), but I am worried about if 
> and how it might affect the sorting in the reduce part. Anyway, check it out. 



Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed

2013-04-03 Thread kaveh minooie

so I rewrote the beginning of the index method in IndexUtil:

public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  LOG.info("key: " + key);
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
  //doc.add("batchId", page.getBatchId().toString());

  if (null == page.getBatchId()) {
    LOG.info("batchId is null");
  } else {
    doc.add("batchId", page.getBatchId().toString());
  }
  try {
    LOG.info("page is: " + page);
  } catch (Exception e) {
    LOG.info("error: " + e);
  }


and here is an example of what I am getting:

key: com.nba.www:http/
batchId is null
page is:org.apache.nutch.storage.WebPage@ba6fc739 {
  "baseUrl":"null"
  "status":"0"
  "fetchTime":"1366846228423"
  "prevFetchTime":"0"
  "fetchInterval":"0"
  "retriesSinceFetch":"0"
  "modifiedTime":"0"
  "prevModifiedTime":"0"
  "protocolStatus":"null"
  "content":"null"
  "contentType":"text/html"
  "prevSignature":"null"
  "signature":"java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]"
  "title":"NBA.com"
  "text":"NBA.com Skip to , part of the Turner Sports & 
Entertainment Digital Network."

  "parseStatus":"org.apache.nutch.storage.ParseStatus@7821 {
  "majorCode":"1"
  "minorCode":"0"
  "args":"[]"
}"
  "score":"1.0"
  "reprUrl":"null"
  "headers":"{Content-Encoding=gzip, Connection=close, 
Content-Type=text/html;charset=UTF-8, Content-Length=19526, 
Cache-Control=max-age=31, Date=Wed, 03 Apr 2013 23:30:28 GMT, 
Expires=Wed, 03 Apr 2013 23:30:59 GMT, Server=nginx, 
X-UA-Device=desktop, Vary=User-Agent, X-UA-Profile=desktop}"

  "outlinks":"{}"
  "inlinks":"{}"
  "markers":"{dist=0, _injmrk_=y, _idxmrk_=1365031584-127026, 
_updmrk_=1365031584-127026}"

  "metadata":"{}"
  "batchId":"null"
}

the fields baseUrl, protocolStatus, reprUrl, and batchId are null, and the 
outlinks is empty. I am still in the process of familiarizing myself 
with the code, so I can't say it for sure, and I apologize for asking stupid 
questions while we are at it, but this doesn't seem right to me. Am I 
right to assume that the mentioned fields, or at least most of them, 
should have values?


also, the example that I am showing here is not a one-off: these fields 
have the same value for all, emphasis on ALL, of the few thousand URLs 
that I have fetched and with which I am playing to test the code.


the field "text" was a lot longer; I removed the extra text since it was 
irrelevant here. Everything else I copied directly from the log file.


thanks,



On 04/03/2013 02:32 PM, Lewis John Mcgibbney wrote:

Hi Kaveh,

On Wed, Apr 3, 2013 at 1:30 PM, mailto:dev-digest-h...@nutch.apache.org>> wrote:

Hi

so I am not sure if Binoy is talking about this, but here it is:

the original exception comes from
src/java/org/apache/nutch/indexer/IndexUtil.java line 66

  public NutchDocument index(String key, WebPage page) {
    NutchDocument doc = new NutchDocument();
    doc.add("id", key);
    doc.add("digest",
        StringUtil.toHexString(page.getSignature().array()));
==>>doc.add("batchId", page.getBatchId().toString());

page.getBatchId() returns null for every URL. My guess is that
updatedb removes the batchId from the rows in webpage, since
generate and fetch work fine with batchId, but after updatedb
(which, by the way, does not accept batchId as one of its parameters,
which means that it is going over the entire webpage table every time
you run it, but that is a different issue) solrindex can't find the
batchIds

I've reopened NUTCH-1532 and attached a trivial patch which should now
protect against the NPE people have been getting.
Can you please check it out and get back to us?
Thank you Kaveh


--
Kaveh Minooie


Re: dev Digest 2 Apr 2013 18:42:33 -0000 Issue 1587

2013-04-02 Thread kaveh minooie

Hi

so I am not sure if Binoy is talking about this, but here it is:

the original exception comes from
src/java/org/apache/nutch/indexer/IndexUtil.java  line 66

 public NutchDocument index(String key, WebPage page) {
   NutchDocument doc = new NutchDocument();
   doc.add("id", key);
   doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
==>>doc.add("batchId", page.getBatchId().toString());

page.getBatchId() returns null for every URL. My guess is that updatedb 
removes the batchId from the rows in webpage, since generate and 
fetch work fine with batchId, but after updatedb (which, by the way, 
does not accept batchId as one of its parameters, which means that it is 
going over the entire webpage table every time you run it, but that is a 
different issue) solrindex can't find the batchIds


though I am not sure; I am going over the code right after I hit send :)


On 04/02/2013 01:55 PM, Lewis John Mcgibbney wrote:

Hi Binoy,


On Tue, Apr 2, 2013 at 11:42 AM, mailto:dev-digest-h...@nutch.apache.org>> wrote:


Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh
crawl with One Seed.
 22979 by: Binoy d

Hi Lewis,
I understand the HEAD branch can be unstable some of the time. I was
trying to point out that I was not able to reproduce the issue with
HEAD for 2.x. I will try and create the JIRA after I am back from
the office. I try not to create JIRAs without confirming the issue;
they just tend to add noise. I haven't used the crawl scripts much,
so it might take some time for me to get logs from there.


Anything you can do to help us better understand the source of the issue
is greatly appreciated Binoy. Thank you for your perseverance (and
others who are helping on these issues) it is of real value to the Nutch
community.
Best
Lewis


--
Kaveh Minooie


[jira] [Updated] (NUTCH-1552) possibility of a NPE in index-more plugin

2013-04-02 Thread kaveh minooie (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kaveh minooie updated NUTCH-1552:
-

Attachment: NUTCH-1552.patch

> possibility of a NPE in index-more plugin
> -
>
> Key: NUTCH-1552
> URL: https://issues.apache.org/jira/browse/NUTCH-1552
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.2
>    Reporter: kaveh minooie
> Attachments: NUTCH-1552.patch
>
>
> in line 203 of src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java 
> the code attempts to read from the variable contentType even though it is 
> possible for it to be null. For me, it happened when I tried to index 
> http://www.pscars.com/ 



[jira] [Created] (NUTCH-1552) possibility of a NPE in index-more plugin

2013-04-02 Thread kaveh minooie (JIRA)
kaveh minooie created NUTCH-1552:


 Summary: possibility of a NPE in index-more plugin
 Key: NUTCH-1552
 URL: https://issues.apache.org/jira/browse/NUTCH-1552
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.2
Reporter: kaveh minooie


in line 203 of src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java 
the code attempts to read from the variable contentType even though it is possible for 
it to be null. for me, it happened when I tried to index  
http://www.pscars.com/ 
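A minimal illustration of the failure mode and the guard a patch would need (hypothetical stand-in code, not the actual MoreIndexingFilter source):

```java
public class ContentTypeGuard {
    // Hypothetical stand-in for the logic around line 203 of
    // MoreIndexingFilter: only derive sub-types when contentType is non-null.
    static String[] splitContentType(String contentType) {
        if (contentType == null) {
            // Without this guard, contentType.split(...) throws the NPE
            // reported above for pages whose content type was never resolved.
            return new String[0];
        }
        return contentType.split("/");
    }

    public static void main(String[] args) {
        System.out.println(splitContentType(null).length);        // 0, no NPE
        System.out.println(splitContentType("text/html").length); // 2
    }
}
```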



Re: error using generate in 2.x

2013-04-01 Thread kaveh minooie

ok so i got
gora-core-0.3-20130401.060419-325.jar
gora-hbase-0.3-20130401.065448-305.jar

and when I ran generate, the code finished without any exception, but the 
log file was full of lines like this (one for every url that I had in the 
webpage table):


INFO  mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class 
while writing to datastore.HBase mapping for field 
[org.apache.nutch.storage.WebPage#batchId] not found. Wrong 
gora-hbase-mapping.xml?



when i checked gora-hbase-mapping.xml there was no field for batchId

so I copied this line from  gora-cassandra-mapping.xml
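(The copied line itself was eaten by the list archive. It would have been a field entry along these lines; the family and qualifier names here are an assumption following the pattern of the other WebPage fields, not the actual file:)

```xml
<!-- hypothetical reconstruction of the missing batchId mapping -->
<field name="batchId" family="f" qualifier="bid"/>
```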



after that everything (and by that I mean generate fetch updatedb) 
worked fine. So now here are my questions:


1- as I said, that line is missing from gora-hbase-mapping.xml. does this 
need a jira issue, or can you guys just add it and commit without going 
through all the hoops?


2- is the trunk version supposed to be compiled against the gora trunk? 
because the current HEAD is not working with 0.2.1.


P.S this by the way worked the same with and without NUTCH-1551 patch



On 04/01/2013 03:28 PM, Lewis John Mcgibbney wrote:

You're right, this is a dev issue for sure.


On Mon, Apr 1, 2013 at 2:45 PM, kaveh minooie <ka...@plutoz.com> wrote:

The patch NUTCH-1551 didn't solve my issue. I am still getting the
same exact error when i try to run generate. (this was run in local
mode) :


NUTCH-1551 is not supposed to fix this problem entirely. It merely
attempts to make the WebTableReader tool backwards compatible and
permits you to check whether the accessor methods WebPage.getBatchID() and
WebPage.getPrevModifiedTime() actually work for your use case. If you
are able to check and provide feedback of the webtable dump for the URL
causing the NPE it would be very valuable indeed.


now the likely variable that is null seems to be 'mapkey', which is
probably the result of a malformed URL (though I can't say that for
sure)

now the put function is being called from here

this is from gora 0.2.1:


gora/blob/0.2.1/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordWriter.java:

...


the same function in gora trunk is like this:
...
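Since the archive ate both quoted snippets, here is a self-contained paraphrase of the behavioural difference being described (assumed shapes, not the actual Gora source):

```java
public class RecordWriterSketch {
    // Paraphrase of the two write() shapes being compared (an assumption,
    // not copied from Gora): the 0.2.1-style writer lets any exception
    // propagate and abort the task, while the trunk-style writer catches and
    // logs, so one bad record (e.g. a null key) does not stop the whole run.
    static int writeAll(String[] keys, boolean trunkStyle) {
        int written = 0;
        for (String key : keys) {
            try {
                if (key == null) throw new NullPointerException("null key");
                written++; // stand-in for store.put(key, value)
            } catch (RuntimeException e) {
                if (!trunkStyle) throw e; // 0.2.1 behaviour: the job dies here
                // trunk behaviour: log the exception and keep going
            }
        }
        return written;
    }

    public static void main(String[] args) {
        String[] keys = {"a", null, "b"};
        System.out.println(writeAll(keys, true)); // 2: the null key is skipped
    }
}
```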

which seems to me would allow the code to recover from this
kind of error. now, I get gora through ivy and I don't know how, or
if, I can have ivy fetch the trunk, but regardless I still think
the question remains whether it is a nutch issue or a gora one.

So it appears that some issues have been addressed and improved within
Gora trunk (which is nice). You can pull a Gora SNAPSHOT from here [0]
and place it on your class path then try it out. Feedback would be
greatly appreciated.

The underlying problem here is that not everyone using and developing
Gora is using and developing Nutch. We have been making good progress
towards building diversity over in Gora so that it is not so heavily
reliant upon Nutch users. This means the project can stand on its own
two feet. The downside of this, is that *some* bugs arising from *some*
use cases are not discovered until a little later than we would like.
Your feedback is really really helpful.

It should be noted that you can also patch your local copy of 2.x HEAD
to not contain the two offending issues we've previously discussed.

[0]
https://repository.apache.org/content/repositories/snapshots/org/apache/gora/


--
Kaveh Minooie


Re: error using generate in 2.x

2013-04-01 Thread kaveh minooie
…the question remains whether it is a nutch issue or gora?



sorry for the long email.


On 03/30/2013 04:03 PM, Lewis John Mcgibbney wrote:

I think we also may need to add the BATCH_ID to one of the Job's HashSets:

private static final Collection<WebPage.Field> FIELDS = new
HashSet<WebPage.Field>();
static {
...
   FIELDS.add(WebPage.Field.BATCH_ID);
}


On Sat, Mar 30, 2013 at 3:55 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:


Hi,
I've tried to sort this out locally this morning...
I can almost replicate this behaviour with gora-cassandra and it looks
most likely that the patch(es) applied in
* NUTCH-1533 - NUTCH-1532 Implement getPrevModifiedTime(),
setPrevModifiedTime(), getBatchId() and setBatchId() accessors in
o.a.n.storage.WebPage, and
* NUTCH-1532 - Replace 'segment' mapping field with batchId,
respectively, are not backwards compatible because some URLs within the web
database do not contain values for the batchId.
Of course this is a major problem.
I opened NUTCH-1551 [0] and submitted a patch to make WebTableReader
backwards compatible with the above patches. Please try out the patch if
you can and comment so I can commit.

We have a couple options here.
1) Revert both of the above until we can get a fix
2) Get a fix just now and commit it.
What do you guys want to do?

I have a question about whether or not we can dynamically add fields to
existing database entries by injecting them.
Say, for example, you inject URLs without the batchId field in your mapping
file, then add the field and inject some more URLs... will the field be
added to your database? If so, then why are we getting the NPE?
There must be some other location in the Nutch code where an asserted
attempt is being made to obtain the batchId for some given key... it
cannot be obtained and we receive the NPE.

[0] https://issues.apache.org/jira/browse/NUTCH-1551


On Fri, Mar 29, 2013 at 5:05 PM, kaveh minooie  wrote:


I use git and i fetch from github (https://github.com/apache/nutch.git); 
currently I am on this commit:

commit 4bb01d6b908dc230c8be89d398b03a86581ec42b
Author: lufeng 
Date:   Thu Mar 28 13:09:09 2013 +

 NUTCH-1547 BasicIndexingFilter - Problem to index full title

 git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/2.x@1462079 13f79535-47bb-0310-9956-ffa450edef68


before I was on this commit :


commit f02dcf62566583551426c08bd388080e5b2bc93e


  f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml



On 03/29/2013 04:35 PM, alx...@aim.com wrote:


Yes, with hbase. Here is the error

13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f
closed
13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader:
java.lang.NullPointerException
    at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398)
    at org.apache.gora.hbase.store.HBaseStore.execute(HBaseStore.java:360)
    at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234)
    at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


If I revert to previous release it works fine.

Thanks.
Alex.





-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Fri, Mar 29, 2013 4:30 pm
Subject: Re: error using generate in 2.x


Hi Alex,
With HBase also?
There 'was' a bug in gora-cassandra module for this command + params
however I thought it had been addressed and therefore resolved it.
Lewis


On Fri, Mar 29, 2013 at 4:00 PM,  wrote:

  Hi,


It seems that trunk has a few bugs. I found out that readdb -url urlname
also gives errors.

Thanks.
Alex.







-Original Message-
From: kaveh minooie 
To: user 
Sent: Fri, Mar 29, 2013 1:53 pm
Subject: Re: error using generate in 2.x


Hi lewis

the mapping file that I am using is the one that comes with nutch, and I
haven't touched it. this message in the log is caused by using the
-crawlId on the command line. for example this log was the result of
this command :

bin/nutch generate -topN 1000 -crawlId t1

which causes nutch (or I guess technically gora) to use a table
name 't1_webpage'. though, I have to say that I don't understand the
rationale behind the code generating a warning like this (I mean I know
it is not actually a warning, just tha

Re: slf4j issue with nutch 2.x over hadoop 1.1.1

2013-02-16 Thread kaveh minooie
So when you say "prune the dependencies", I am not sure what you are 
talking about, because what I could think of is not working. let me 
explain the situation again. the nutch 2.x ivy file ( ivy/ivy.xml ) has this 
in it:


 




hadoop 1.1.1 ships with slf4j 1.4.3. these two are not compatible. now, I'd 
rather not mess with my hadoop cluster, so I tried to downgrade slf4j in 
nutch. I changed the above lines to:


 



as you can see, I am upgrading solr and zookeeper and removing 
elasticsearch, and all of these changes work fine, since I can see the 
appropriate files in the build/lib directory after ant is done. but it 
doesn't work for slf4j, and the files copied to build/lib (and 
subsequently into my job file) are:

kaveh@d1r2n2:/source/nutch/nutch$ ll build/lib/slf*
-rw-r--r-- 1 kaveh kaveh 25496 Jul  5  2010 build/lib/slf4j-api-1.6.1.jar
-rw-r--r-- 1 kaveh kaveh  9753 Jul  5  2010 
build/lib/slf4j-log4j12-1.6.1.jar


since I need the job file, manually changing the files in 
build/lib won't do me any good. now, I don't know ant very well, and 
that is mostly why I am asking this of you guys. I have to say that I 
also changed the same thing in pom.xml as well:


 
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.4.3</version>
  <optional>true</optional>
</dependency>


but I still end up with the 1.6.1 version. I don't know how exactly ant, 
ivy, and pom work together, so I am asking if there is any other 
config file that I am missing, or why, while this works fine for solr 
and zookeeper, it has no effect on slf4j.


thanks,


On 02/16/2013 09:42 AM, Lewis John Mcgibbney wrote:

A solution would be to manually prune the dependencies which are fetched
via Ivy. If old slf4j dependencies are fetched for Hadoop via Ivy then
maybe we need to make the exclusions explicit within ivy.xml. if you are
able , then please provide a patch which fixes this if it is really a
problem.
It is important to note that pom.xml will most likely be outdated. You
should build nutch with ant + ivy for the time being as this is stable.
Thank you
Lewis
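For what it's worth, an explicit prune in ivy.xml would look roughly like this (a hedged sketch only: the slf4j rev and conf come from the thread, but the module names and revs on the carrier dependency are illustrative, not taken from the real nutch ivy.xml):

```xml
<!-- hypothetical ivy.xml fragment: keep a transitively-pulled slf4j out of
     one dependency, then pin the Hadoop-compatible binding explicitly -->
<dependency org="org.apache.gora" name="gora-core" rev="0.2" conf="*->default">
  <exclude org="org.slf4j"/>
</dependency>
<dependency org="org.slf4j" name="slf4j-log4j12" rev="1.4.3" conf="*->master"/>
```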

On Saturday, February 16, 2013, kaveh minooie  wrote:

unfortunately your links have been removed from the email that i got so i

am not sure what [0] and [1] are, but this is what i am using :

kaveh@d1r2n2:/source/nutch/nutch.git$ git remote -v
origin  https://github.com/apache/nutch.git (fetch)
origin  https://github.com/apache/nutch.git (push)
kaveh@d1r2n2:/source/nutch/nutch.git$ git branch -v
* 2.x   f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml
  trunk a7a1b41 NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers
kaveh@d1r2n2:/2locos/source/nutch/nutch.git$


i am using branch 2.x

On 02/15/2013 06:02 PM, Lewis John Mcgibbney wrote:

Hi Kaveh,

Two seconds please. First lets set some thing straight.
Nutch trunk is from here [0]
Nutch 2.x is from here [1]
Which one do you use?

On Fri, Feb 15, 2013 at 4:53 PM, kaveh minooie  wrote:


but here is my problem. I tried to build the nutch using ver 1.4.3 of

the

slf4j. i changed the version in both ivy.xml and pom.xml and cleaned my

ivy

cache but ant still fetches the version 1.6.1 when it builds the

project.

what am I missing?



We can progress with the problem once we know what's actually going on.
Thanks
Lewis







slf4j issue with nutch 2.x over hadoop 1.1.1

2013-02-15 Thread kaveh minooie

Hi everyone
   I recently built nutch 2.x from the trunk, but it crashes almost 
immediately at run time. it seems that there is a version 
incompatibility between the slf4j in hadoop, which is (1.4.3), and the one 
in nutch (1.6.1): (actually it's between versions above 1.6 and below it)


$ PATH="$(pwd)/bin:$PATH" bin/nutch inject /temp/urls/
Error: Could not find or load main class org.apache.hadoop.util.PlatformName
13/02/15 15:47:15 INFO crawl.InjectorJob: InjectorJob: starting at 
2013-02-15 15:47:15
13/02/15 15:47:15 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: 
/temp/urls
Exception in thread "main" java.lang.NoSuchMethodError: 
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
	at 
org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
	at 
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:139)
	at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:205)
	at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:184)
	at 
org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:236)
	at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:477)
	at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)

at org.apache.hadoop.mapreduce.JobContext.<init>(JobContext.java:80)
at org.apache.hadoop.mapreduce.Job.<init>(Job.java:50)
at org.apache.hadoop.mapreduce.Job.<init>(Job.java:54)
at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:37)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



but here is my problem. I tried to build the nutch using ver 1.4.3 of 
the slf4j. i changed the version in both ivy.xml and pom.xml and cleaned 
my ivy cache but ant still fetches the version 1.6.1 when it builds the 
project. what am I missing?


thanks,
--
Kaveh Minooie

www.plutoz.com


I think I found a bug --> multiple_values_encountered_for_non_multiValued_field_title

2012-02-21 Thread kaveh minooie
so I've been getting this error 
"multiple_values_encountered_for_non_multiValued_field_title" every once 
in a while when I try to run solrindex. I can now say that this is 
being caused by the index-more plugin (MoreIndexingFilter.java):


private NutchDocument resetTitle(NutchDocument doc, ParseData data, 
String url) {

String contentDisposition = 
data.getMeta(Metadata.CONTENT_DISPOSITION);
if (contentDisposition == null)
  return doc;

for (int i=0; i<…

the problem here is that in my case this function is not resetting the 
title, but is just adding a new one. it seems that the original idea was that if 
CONTENT_DISPOSITION exists then the document will not have a title set 
by other plugins (namely index-basic). unfortunately this seems not 
to always be the case, as you can see by running this command:


bin/nutch indexchecker http://www.2modern.com/site/gift-registry.html

what i do get (the part that is relevant) is:


tstamp :Tue Feb 21 13:18:13 PST 2012
type :  text/html
type :  text
type :  html
date :  Tue Feb 21 13:18:13 PST 2012
url :   http://www.2modern.com/site/gift-registry.html
content :	2Modern Gift Registry  Modern Furniture & Lighting items 
in cart 0 checkout Returning 2Modern cu

user_ranking :  25.0
title : 2Modern Gift Registry
title : gift-registry.html
plutoz_ranking :10.0
categories :Furniture Home
contentLength : 12924

and as you can see there are 2 titles. I think it would be very easy to 
fix that: just check to see if a title exists already before setting the 
name of the file as the title:


if (contentDisposition == null || null != doc.getField("title"))
  return doc;


 or if the substitution must happen in the presence of CONTENT_DISPOSITION, 
at least remove the old one first:


if (matcher.find()) {
  doc.remove("title");
  doc.add("title", matcher.group(1));
  break;
}


 now that being said, the real problem here is: why does NutchDocument 
not observe the schema.xml file, and why does it always assume that all fields 
are multi-valued?


public void add(String name, Object value) {
  NutchField field = fields.get(name);
  if (field == null) {
    field = new NutchField(value);
    fields.put(name, field);
  } else {
    field.add(value);  // <--- always appends, never consults the schema
  }
}
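The always-append behaviour is easy to reproduce with a toy stand-in for NutchDocument (simplified types, not the Nutch classes):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiValueDemo {
    // Simplified stand-in for NutchDocument (not the Nutch class): add()
    // always appends, mirroring the field.add(value) branch quoted above.
    static final Map<String, List<Object>> fields = new HashMap<>();

    static void add(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        add("title", "2Modern Gift Registry"); // set by index-basic
        add("title", "gift-registry.html");    // "reset" by index-more
        // Two values for a non-multiValued Solr field: exactly the
        // multiple_values_encountered... error solrindex reports.
        System.out.println(fields.get("title").size()); // 2
    }
}
```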

--
Kaveh Minooie

www.plutoz.com


slf4j-log4j12 new version causes runtime error

2012-02-21 Thread kaveh minooie

I hope I am sending this to the correct list :)

I just saw this issue when I was trying to run indexchecker  in local mode.

I solved this by changing this line in the ivy.xml:

<dependency … conf="*->master" />

to

<dependency … conf="*->master" />


the pom.xml has the "correct" version but ivy.xml seems to be overriding 
it. (if that was an obvious statement, I apologize, but I am new to ivy 
and the whole Maven stuff)


this would also be helpful:
 http://www.slf4j.org/faq.html#IllegalAccessError


and for the record this is the error I got
kaveh@index9:~/build/nutch/runtime/local$ bin/nutch indexchecker
Exception in thread "main" java.lang.IllegalAccessError: tried to access 
field org.slf4j.impl.StaticLoggerBinder.SINGLETON from class 
org.slf4j.LoggerFactory

at org.slf4j.LoggerFactory.staticInitialize(LoggerFactory.java:83)
at org.slf4j.LoggerFactory.<clinit>(LoggerFactory.java:73)
	at 
org.apache.nutch.indexer.IndexingFiltersChecker.<clinit>(IndexingFiltersChecker.java:36)




--
Kaveh Minooie

www.plutoz.com


Re: issue in nutch-default.xml

2012-02-17 Thread kaveh minooie

I know, but the code expects to read the number of days:

genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;

and as you can see, the default value, as mentioned in the 
description, is 7, and it is in days, not milliseconds;
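Plugging the shipped value into that conversion shows the scale of the mismatch (a quick back-of-the-envelope check; the method wrapper is just for illustration):

```java
public class GenDelayCheck {
    // The generator multiplies crawl.gen.delay by 3600 * 24 * 1000,
    // i.e. it treats the configured value as a number of days.
    static long genDelayMs(long configuredValue) {
        return configuredValue * 3600L * 24L * 1000L;
    }

    public static void main(String[] args) {
        long shipped = 604800000L; // nutch-default.xml value: 7 days in ms
        double years = genDelayMs(shipped) / (365.25 * 24 * 3600 * 1000);
        // Read as days, the shipped value locks records for roughly
        // 1.65 million years -- the "1.6 thousand millenniums" complaint.
        System.out.printf("%.0f millennia%n", years / 1000.0);
    }
}
```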


On 02/17/2012 12:33 AM, Markus Jelsma wrote:

this is actually 7 days in milliseconds.


so I checked the source code. it seems the value should in fact be 7. 
the current default value means 1.6 thousand millennia.


<property>
  <name>crawl.gen.delay</name>
  <value>604800000</value>
  <description>
    This value, expressed in days, defines how long we should keep the lock
    on records in CrawlDb that were just selected for fetching. If these
    records are not updated in the meantime, the lock is canceled, i.e. they
    become eligible for selecting. Default value of this is 7 days.
  </description>
</property>



--
Kaveh Minooie

www.plutoz.com


Re: make nutch plugin to get termfreqvectors

2012-01-21 Thread kaveh minooie

Hi

I am having a similar problem in that I have to update a bunch of plugins 
that were written for nutch 1.1. It would be great if we could get some 
hints.


Thanks,

On 01/19/2012 03:36 PM, Ale wrote:



Hi,
I'm quite new working with nutch plugins. I'm trying to save the 
termfreqvectors of the documents.
I'm using nutch 1.4

I've seen that I had to use, in the plugin class, the method addFieldOption, 
like:
--
public void addIndexBackendOptions(Configuration conf) {

  // add lucene options

  // host is un-stored, indexed and tokenized
  LuceneWriter.addFieldOptions("host", LuceneWriter.STORE.NO,
      LuceneWriter.INDEX.TOKENIZED, conf);

  // site is un-stored, indexed and un-tokenized
  LuceneWriter.addFieldOptions("site", LuceneWriter.STORE.NO,
      LuceneWriter.INDEX.UNTOKENIZED, conf);

  // url is both stored and indexed, so it's both searchable and returned
  LuceneWriter.addFieldOptions("url", LuceneWriter.STORE.YES,
      LuceneWriter.INDEX.TOKENIZED, conf);

  // content is indexed, so that it's searchable, but not stored in index
  LuceneWriter.addFieldOptions("content", LuceneWriter.STORE.NO,
      LuceneWriter.INDEX.TOKENIZED, conf);

  // anchors are indexed, so they're searchable, but not stored in index
  LuceneWriter.addFieldOptions("anchor", LuceneWriter.STORE.NO,
      LuceneWriter.INDEX.TOKENIZED, conf);

  // title is indexed and stored so that it can be displayed
  LuceneWriter.addFieldOptions("title", LuceneWriter.STORE.YES,
      LuceneWriter.INDEX.TOKENIZED, conf);
}


The problem is, as far as I have seen, that LuceneWriter no longer exists in 
1.4 (Lucene 3.5).
Which is the correct way to do it?

Thank you very much in advance !

--



--
Kaveh Minooie

www.plutoz.com