[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-27 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667770#comment-13667770
 ] 

Luca Cavanna commented on NUTCH-1527:
-

OK guys, I will look into this in the coming days.

 Port nutch-elasticsearch-indexer to Nutch
 -

 Key: NUTCH-1527
 URL: https://issues.apache.org/jira/browse/NUTCH-1527
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: lufeng
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1527.patch


 The source repo for this can be found here [0].
 This issue should be in line with the work already done by Julien and others 
 over at NUTCH-1047.
 [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch

2013-05-24 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1316#comment-1316
 ] 

Luca Cavanna commented on NUTCH-1527:
-

I just ran into this issue and thought it would be nice if Nutch supported 
Elasticsearch out of the box. I had a look at the code and saw a few things 
that I would do differently:
- You can use the BulkProcessor instead of creating and handling the 
BulkRequest manually. It automatically executes the bulk request when needed, 
and it's also flexible and configurable. That way you could remove a lot of 
boilerplate code.
- I know multicast discovery is fancy: as the code works now, you don't need 
to specify any URL, and the client node joins an existing cluster with the same 
name. Still, I would go for the other type of client here, the 
TransportClient, which is more lightweight and just sends requests to the 
configured URLs in a round-robin fashion, using the internal binary protocol 
that Elasticsearch uses for inter-node communication.

Let me know if I can help more, I'm certainly willing to get my hands dirty 
here if you want ;)
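
To illustrate the two suggestions above without depending on the Elasticsearch client library, here is a minimal plain-Java sketch. It is not the Elasticsearch API: AutoFlushBatcher and RoundRobinHosts are hypothetical stand-ins that only mimic the behaviour described (a buffer that sends a batch on its own once it is full, and a client that cycles through a fixed list of configured addresses). The batch size and host names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a bulk processor: buffers items and flushes when full. */
final class AutoFlushBatcher<T> {
    private final int maxActions;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    AutoFlushBatcher(int maxActions, Consumer<List<T>> sink) {
        if (maxActions < 1) {
            throw new IllegalArgumentException("maxActions must be >= 1");
        }
        this.maxActions = maxActions;
        this.sink = sink;
    }

    /** Adds one item; the caller never builds or sends a batch by hand. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= maxActions) {
            flush();
        }
    }

    /** Sends whatever is buffered; call once at the end to push the remainder. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // hand over a copy, then reset
        buffer.clear();
    }
}

/** Hypothetical round-robin picker over a fixed list of configured addresses. */
final class RoundRobinHosts {
    private final List<String> hosts;
    private int next = 0;

    RoundRobinHosts(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("need at least one host");
        }
        this.hosts = new ArrayList<>(hosts);
    }

    /** Returns the next address in order, wrapping around at the end. */
    String pick() {
        String host = hosts.get(next);
        next = (next + 1) % hosts.size();
        return host;
    }
}
```

With a batch size of 2, adding three documents triggers one automatic flush, and a final flush() call sends the remaining document. The indexing code then only calls add() per document, which is where the boilerplate saving would come from.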



[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-08-31 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445930#comment-13445930
 ] 

Luca Cavanna commented on NUTCH-1100:
-

I agree, it would make even more sense to filter the query like this: 
digest:[* TO *]
This way Nutch wouldn't even iterate over documents that have no value for 
the digest field.
Unfortunately this problem is pretty common: it happens whenever Solr holds 
documents that don't come from Nutch alongside the crawled documents.
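
As a rough illustration of the filter suggested above, the following plain-Java sketch builds the URL-encoded parameters of a Solr select request that matches everything (q=*:*) but is restricted by the filter query digest:[* TO *]. How SolrDeleteDuplicates actually builds its query is not shown in this thread, so the class and method names here are hypothetical; only the standard q and fq request parameters and the digest field name are taken as given.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds select parameters limited to docs with a digest. */
final class DigestFilterQuery {

    /** URL-encodes a single parameter value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a mandatory charset on every Java platform.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    /** Match all documents, but filter out those without a digest value. */
    static String selectParams() {
        return "q=" + encode("*:*") + "&fq=" + encode("digest:[* TO *]");
    }

    public static void main(String[] args) {
        // prints q=*%3A*&fq=digest%3A%5B*+TO+*%5D
        System.out.println(selectParams());
    }
}
```

The resulting string would be appended to the select handler URL of whichever Solr instance is being deduplicated.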

 SolrDedup broken
 

 Key: NUTCH-1100
 URL: https://issues.apache.org/jira/browse/NUTCH-1100
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1100-1.6-1.patch


 Some Solr indices cannot be deduplicated from Nutch. For unknown reasons 
 Nutch throws the exception below. Nothing unusual shows up in the Solr 
 logs; the queries are normal and seem to succeed.
 {code}
 java.lang.NullPointerException
 at org.apache.hadoop.io.Text.encode(Text.java:388)
 at org.apache.hadoop.io.Text.set(Text.java:178)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 {code}



[jira] [Created] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2012-08-31 Thread Luca Cavanna (JIRA)
Luca Cavanna created NUTCH-1464:
---

 Summary: index-static plugin doesn't allow the colon within the 
field value
 Key: NUTCH-1464
 URL: https://issues.apache.org/jira/browse/NUTCH-1464
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.5
Reporter: Luca Cavanna
Priority: Minor


If I want to configure a static field with a value containing a colon, the 
index-static plugin does nothing. There's a string split based on the colon 
character and if the result is an array of length everything is fine, otherwise 
nothing happens, the static field is not set.
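
The split behaviour described above can be reproduced in a few lines of plain Java. This is only a sketch of the general technique, not the plugin's actual code or the attached patch: the class name StaticFieldParser and the "name:value" entry format are assumptions for illustration. Splitting with a limit of 2 keeps everything after the first colon as the value, whereas an unlimited split breaks a value such as a URL into extra parts.

```java
import java.util.AbstractMap;
import java.util.Map;

/** Hypothetical parser for a single "name:value" static field entry. */
final class StaticFieldParser {

    /**
     * Splits on the first colon only, so the value may itself contain colons.
     * Returns null when the entry has no colon at all.
     */
    static Map.Entry<String, String> parse(String entry) {
        String[] parts = entry.split(":", 2); // limit 2: at most name + rest
        if (parts.length != 2) {
            return null;
        }
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), parts[1].trim());
    }

    public static void main(String[] args) {
        String entry = "source:http://example.com";
        // An unlimited split yields three parts, which a length == 2 check rejects.
        System.out.println(entry.split(":").length);
        // The limited split keeps the URL intact as the value.
        System.out.println(parse(entry).getValue());
    }
}
```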



[jira] [Updated] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2012-08-31 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated NUTCH-1464:


Description: If I want to configure a static field with a value containing 
a colon, the index-static plugin does nothing. There's a string split based on 
the colon character and if the result is an array of length 2 everything is 
fine, otherwise nothing happens, the static field is not set.  (was: If I want 
to configure a static field with a value containing a colon, the index-static 
plugin does nothing. There's a string split based on the colon character and if 
the result is an array of length everything is fine, otherwise nothing happens, 
the static field is not set.)

 index-static plugin doesn't allow the colon within the field value
 --

 Key: NUTCH-1464
 URL: https://issues.apache.org/jira/browse/NUTCH-1464
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.5
Reporter: Luca Cavanna
Priority: Minor

 If I want to configure a static field with a value containing a colon, the 
 index-static plugin does nothing. There's a string split based on the colon 
 character and if the result is an array of length 2 everything is fine, 
 otherwise nothing happens, the static field is not set.



[jira] [Updated] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2012-08-31 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated NUTCH-1464:


Attachment: NUTCH-1464.patch

I do have a patch, but it's against the 1.5 branch. Anyway, it's really tiny 
and hopefully easy to integrate.



[jira] [Commented] (NUTCH-1100) SolrDedup broken

2012-08-31 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446002#comment-13446002
 ] 

Luca Cavanna commented on NUTCH-1100:
-

The problem with the approach I mentioned before is that the digest field would 
need to be indexed in the Solr schema; otherwise that query would always 
return 0 results.




[jira] [Commented] (NUTCH-923) Multilingual support for Solr-index-mapping

2012-08-03 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428003#comment-13428003
 ] 

Luca Cavanna commented on NUTCH-923:


That's brilliant. Thanks Markus for your insight.

 Multilingual support for Solr-index-mapping
 ---

 Key: NUTCH-923
 URL: https://issues.apache.org/jira/browse/NUTCH-923
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor
 Attachments: patch-923-nutch-release-1.2.txt


 It would be useful to extend the mapping possibilities when indexing to Solr.
 One useful feature would be to use the detected language of the HTML page 
 (for example via the language-identifier plugin) and send the content to 
 corresponding language-aware Solr fields.
 The mapping file could be as follows:
 <field dest="lang" source="lang"/>
 <field dest="title_${lang}" source="title"/>
 so that the title field gets mapped to title_en for English pages and 
 title_fr for French pages.
 What do you think? Could this also be useful to others?
 Or are there already other solutions out there?
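
 The placeholder idea in the proposal above could be sketched roughly as follows in plain Java. This is not code from the attached patch; the class LangFieldMapper, its method, and the fallback behaviour for pages without a detected language are all assumptions made for illustration.

```java
/** Hypothetical resolver for destination field names containing ${lang}. */
final class LangFieldMapper {

    private static final String PLACEHOLDER = "${lang}";

    /**
     * Replaces ${lang} in the destination template with the detected language.
     * Templates without the placeholder are returned unchanged. When no
     * language was detected, the given fallback is used instead (an assumption;
     * the proposal does not say what should happen in that case).
     */
    static String resolveDest(String destTemplate, String lang, String fallbackLang) {
        if (!destTemplate.contains(PLACEHOLDER)) {
            return destTemplate;
        }
        String effective = (lang == null || lang.isEmpty()) ? fallbackLang : lang;
        return destTemplate.replace(PLACEHOLDER, effective);
    }

    public static void main(String[] args) {
        System.out.println(resolveDest("title_${lang}", "en", "unknown")); // title_en
        System.out.println(resolveDest("title_${lang}", "fr", "unknown")); // title_fr
        System.out.println(resolveDest("lang", "en", "unknown"));          // lang
    }
}
```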
