[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14714492#comment-14714492
 ] 

Lewis John McGibbney commented on NUTCH-1517:
-

Nice patch folks.

> CloudSearch indexer
> ---
>
> Key: NUTCH-1517
> URL: https://issues.apache.org/jira/browse/NUTCH-1517
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
> 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch
>
>
> Once we have made the indexers pluggable, we should add a plugin for Amazon 
> CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
> JSON based representation Search Data Format (SDF), which we could reuse for 
> a file based indexer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712988#comment-14712988
 ] 

Julien Nioche commented on NUTCH-1517:
--

Thanks [~jorgelbg]. Will commit soon unless someone objects.



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707078#comment-14707078
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1517:
---

+1. I haven't been able to run any tests (no access to CloudSearch), but so far 
it is looking good! Does anyone else want to comment?



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-06-25 Thread Ji Kwon Lim (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601640#comment-14601640
 ] 

Ji Kwon Lim commented on NUTCH-1517:


Hi,

We are attempting to use nutch with CloudSearch, and we are using the patch 
provided in this ticket. However, we noticed that the patch seems to be 
incomplete, requiring a manual change to 
org.apache.nutch.parse,MetaTagsParser.java to replace all references to 
'metadata.add("metatag."' with 'metadata.add("metatag_"', changing out the 
period with an underscore. Is there a newer patch out that addresses this issue 
or a newer process altogether for getting nutch to work with CloudSearch? If 
not, could we get an update to the patch to include the change to 
org.apache.nutch.parse,MetaTagsParser.java that's necessary for the indexer to 
work properly?


Regards,

Ji Kwon Lim
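
The field-name change described in the comment above amounts to rewriting names 
so that they contain only characters CloudSearch accepts. A minimal sketch in 
Java, assuming that field names may contain only a-z, 0-9 and underscore (check 
this against the CloudSearch documentation for your API version). The class 
FieldNameSanitizer and its sanitize method are hypothetical and are not part of 
the attached patches.

```java
import java.util.Locale;

/** Hypothetical helper; not part of the attached patches. */
final class FieldNameSanitizer {

    private FieldNameSanitizer() {}

    /**
     * Lower-cases the name and replaces every character outside
     * [a-z0-9_] with an underscore (assumed CloudSearch naming rule).
     */
    static String sanitize(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.replaceAll("[^a-z0-9_]", "_");
    }

    public static void main(String[] args) {
        // The metatag field name mentioned in the comment above.
        System.out.println(sanitize("metatag.keywords"));
    }
}
```

Doing this in the indexer plugin rather than in MetaTagsParser would keep the 
parser output unchanged for other index writers, at the cost of one extra 
mapping step per field.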

> CloudSearch indexer
> ---
>
> Key: NUTCH-1517
> URL: https://issues.apache.org/jira/browse/NUTCH-1517
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
> 0025666929_1382393138_indexer-cloudsearch.20131021.patch
>
>
> Once we have made the indexers pluggable, we should add a plugin for Amazon 
> CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
> JSON based representation Search Data Format (SDF), which we could reuse for 
> a file based indexer.





[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-20 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773420#comment-13773420
 ] 

Tom Hill commented on NUTCH-1517:
-

And I'll try to get mapping, and writing to file covered in my version. Going 
forward, it seems like this might be common functionality for a base class for 
all the indexers.

> CloudSearch indexer
> ---
>
> Key: NUTCH-1517
> URL: https://issues.apache.org/jira/browse/NUTCH-1517
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
> Fix For: 1.9
>
> Attachments: 0023883254_1377197869_indexer-cloudsearch.patch
>
>
> Once we have made the indexers pluggable, we should add a plugin for Amazon 
> CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
> JSON based representation Search Data Format (SDF), which we could reuse for 
> a file based indexer.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-20 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773416#comment-13773416
 ] 

Tom Hill commented on NUTCH-1517:
-

Thanks for the thorough review. I've already got the serious bug fixed, and am 
just doing some testing before uploading a fixed version. I'll try to get the 
things you mentioned covered in the next version.



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773053#comment-13773053
 ] 

Julien Nioche commented on NUTCH-1517:
--

I had another look at the code. It should handle documents marked for deletion 
and have a more robust handling of the fields (e.g. with a mapping mechanism as 
in SOLR). It currently fails to remove unsupported characters if they are in 
fields other than the 2 you hardcoded. The regex which checks the validity of a 
field name is not correct either: it lets through strings starting with a _, 
which is not allowed.
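
The field-name check discussed above can be sketched as follows. This is only 
an illustration, not the regex from the patch: it assumes a name must start 
with a lower-case letter and may otherwise contain a-z, 0-9 and underscore, and 
it does not check any length limits. FieldNameValidator is a hypothetical name.

```java
import java.util.regex.Pattern;

/** Hypothetical validator; the rule it encodes is an assumption. */
final class FieldNameValidator {

    // First character: a lower-case letter (so a leading '_' is rejected).
    // Remaining characters: lower-case letters, digits or underscores.
    private static final Pattern VALID = Pattern.compile("[a-z][a-z0-9_]*");

    private FieldNameValidator() {}

    static boolean isValid(String name) {
        // matches() requires the whole string to match, so no ^ or $ is needed.
        return name != null && VALID.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"title", "_title", "metatag.keywords"}) {
            System.out.println(name + " -> " + isValid(name));
        }
    }
}
```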



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771941#comment-13771941
 ] 

Julien Nioche commented on NUTCH-1517:
--

And maybe add an option to dump the JSON batch to a file? That would be useful 
for debugging, and the file could also be used to detect the fields 
automatically with cs-configure-from-sdf.

Julien
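
A rough sketch of what such a dump option could look like, in Java. The layout 
of each "add" operation (type, id, version, lang, fields) is an assumption 
based on the 2011-02-01 SDF format and should be checked against the 
CloudSearch documentation. SdfBatchDumper and its methods are hypothetical 
names; the language falls back to "en" when none is given, as suggested 
elsewhere in this thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that renders SDF-like JSON and writes it to a file. */
final class SdfBatchDumper {

    private SdfBatchDumper() {}

    /** Wraps a string in quotes and escapes what JSON requires. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Renders one "add" operation; lang defaults to "en" when missing. */
    static String addOperation(String id, long version, String lang,
                               Map<String, String> fields) {
        StringJoiner f = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            f.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        String language = (lang == null || lang.isEmpty()) ? "en" : lang;
        return "{\"type\":\"add\",\"id\":" + quote(id)
            + ",\"version\":" + version
            + ",\"lang\":" + quote(language)
            + ",\"fields\":" + f + "}";
    }

    /** Writes the operations as one JSON array, encoded as UTF-8. */
    static void dump(List<String> operations, Path file) throws IOException {
        String json = "[" + String.join(",", operations) + "]";
        Files.write(file, json.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Example page");
        Path out = Files.createTempFile("cloudsearch-batch", ".json");
        dump(Arrays.asList(addOperation("doc1", 1L, null, fields)), out);
        System.out.println("Batch written to " + out);
    }
}
```

A real plugin would more likely use a JSON library than hand-written escaping; 
the manual version is used here only to keep the sketch free of dependencies.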



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771742#comment-13771742
 ] 

Julien Nioche commented on NUTCH-1517:
--

Tom, 

I had a quick look at your plugin. Here are a few things I found : 

* serious bug : the batch doesn't get cleared in the CloudSearchBatcher. The 
batch gets larger and larger as a result with the same docs sent multiple times
* build your patch against the trunk - not a released version - some things 
might have moved since
* move the README file to the src/plugin/indexer-cloudsearch dir
* populate the language field from the value generated by the 
LanguageIndexingFilter with a default to 'en' if it's not there
* in the readme and in your comments above maybe explain where to find the doc 
for CloudSearch, how to create a domain and declare the fields etc... People 
usually know how to apply a patch and run Nutch, but not how to deal with 
CloudSearch
* use the generic indexer - not the solr command 

Thanks

Julien
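
The batching bug in the first point above can be illustrated with a small 
sketch. This is not the CloudSearchBatcher from the patch: DocumentBatcher, its 
constructor arguments and the sender callback are hypothetical stand-ins, and 
documents are plain strings for simplicity. The point is only that the batch 
must be emptied once it has been sent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical batcher; stands in for the class discussed above. */
final class DocumentBatcher {

    private final int maxDocs;
    private final Consumer<List<String>> sender;
    private final List<String> batch = new ArrayList<>();

    DocumentBatcher(int maxDocs, Consumer<List<String>> sender) {
        this.maxDocs = maxDocs;
        this.sender = sender;
    }

    /** Buffers a document and sends the batch once it is full. */
    void add(String doc) {
        batch.add(doc);
        if (batch.size() >= maxDocs) {
            flush();
        }
    }

    /** Sends whatever is buffered, then empties the buffer. */
    void flush() {
        if (batch.isEmpty()) {
            return;
        }
        // Hand over a copy so the sender never sees later modifications.
        sender.accept(new ArrayList<>(batch));
        // Without this call every later flush would resend the earlier docs.
        batch.clear();
    }

    public static void main(String[] args) {
        DocumentBatcher batcher = new DocumentBatcher(2, b -> System.out.println(b));
        batcher.add("a");
        batcher.add("b");
        batcher.add("c");
        batcher.flush();
    }
}
```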






[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-17 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769470#comment-13769470
 ] 

Julien Nioche commented on NUTCH-1517:
--

Hi Tom. It is not currently called from multiple threads but that could be the 
case in the future so it would be safer to make your code thread safe.
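
One simple way to make a batching writer safe for concurrent callers is to 
guard the shared buffer with a single lock. The sketch below assumes nothing 
about the real CloudSearchIndexWriter: SynchronizedWriter, its write and close 
methods and the sender callback are hypothetical, and documents are plain 
strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical writer whose buffer is protected by one lock. */
final class SynchronizedWriter {

    private final Object lock = new Object();
    private final List<String> pending = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> sender;

    SynchronizedWriter(int batchSize, Consumer<List<String>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** May be called from several threads at once. */
    void write(String doc) {
        synchronized (lock) {
            pending.add(doc);
            if (pending.size() >= batchSize) {
                flushLocked();
            }
        }
    }

    /** Sends any remaining documents. */
    void close() {
        synchronized (lock) {
            flushLocked();
        }
    }

    // Must only be called while holding the lock.
    private void flushLocked() {
        if (pending.isEmpty()) {
            return;
        }
        sender.accept(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedWriter writer =
            new SynchronizedWriter(3, b -> System.out.println("sent " + b.size()));
        Thread t1 = new Thread(() -> { for (int i = 0; i < 5; i++) writer.write("x"); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 5; i++) writer.write("y"); });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        writer.close();
    }
}
```

Because the sender is invoked while the lock is held, a slow network call 
blocks every other writer. That is the simplest correct choice; swapping the 
buffer under the lock and sending outside it would reduce contention.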



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-16 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769012#comment-13769012
 ] 

Tom Hill commented on NUTCH-1517:
-

Can CloudSearchIndexWriter.write() be called from multiple threads? If so, I 
need to synchronize some methods. (And I think the Solr indexer does too.) 



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-11 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764554#comment-13764554
 ] 

Tom Hill commented on NUTCH-1517:
-

I don't think that has any relation to the CloudSearch indexer. Googling the 
error "Split metadata size exceeded 1000." gets a number of discussions of 
this, and how to fix it. How much data do you have?



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-11 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764501#comment-13764501
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

Current Error Message
java.io.IOException: Split metadata size exceeded 1000. Aborting job job_201309111525_0003
    at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
    at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:1079)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:969)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4237)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

13/09/11 17:10:46 ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1320)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)




[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-10 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763185#comment-13763185
 ] 

Tom Hill commented on NUTCH-1517:
-

@Daniel, per Julien's comment, you should probably be editing nutch-site.xml, 
instead of nutch-default.xml. That should make it easier, as you can just keep 
the edited version around and copy it over. 



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-10 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763176#comment-13763176
 ] 

Tom Hill commented on NUTCH-1517:
-

I was just trying to change as little as possible from the example. I'll take a 
look.



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-10 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763267#comment-13763267
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

Updated my script. 

I am running into errors now if I use "segments/*" when trying to run in 
deployed HDFS mode. If I select an individual segment then it works fine.



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-10 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763050#comment-13763050
 ] 

Julien Nioche commented on NUTCH-1517:
--

Why are you using the solrindex command? The generic 'nutch index' one would 
make more sense. Look at the content of the nutch script to see how the 
solrindex command is converted into the generic one.



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-10 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763015#comment-13763015
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

git clone https://github.com/apache/nutch
wget https://issues.apache.org/jira/secure/attachment/12601469/0023883254_1377197869_indexer-cloudsearch.patch
cd nutch/
git checkout -t origin/branch-1.7
patch -p0 -i ~/0023883254_1377197869_indexer-cloudsearch.patch
vi conf/nutch-default.xml
ant
cd runtime/local/
mkdir -p urls
echo "http://www.princeton.edu/" > ./urls/seeds.txt
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

The vi step is where I add my crawler name, change solr to cloudsearch, and add 
my endpoint URL. I tried to do this with sed to replace lines but couldn't 
figure it out. 



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-09 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762505#comment-13762505
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

Sorry. You are right I have 10 documents in there. 

Now to figure out how to get this running with runtime/deploy so that I can 
index my items on HDFS. I am not sure if I am getting errors because of HDFS, 
or because I am running in Amazon EMR. 
Thanks for the help! Once I have finished my "install" script I will post it.



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-09 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762251#comment-13762251
 ] 

Tom Hill commented on NUTCH-1517:
-

It seems to print that message for me, even when it works. I may have not done 
something correctly. 

Please check your logs in the logs directory, and see what it says. Or check 
your cloudsearch domain, and see if the documents made it there.



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-09 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762066#comment-13762066
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

Does this process work with data stored in HDFS, or does it have to be stored 
on the local file system? I am still not able to get Nutch to save segments, 
though. And when I tried to index my previously crawled data, I still got the 
"matched 0 files" errors.




[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-09 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762067#comment-13762067
 ] 

Tom Hill commented on NUTCH-1517:
-

I don't believe my patch affects processing at that point. Could you try the 
steps on an unpatched nutch 1.7, and make sure the crawl is working properly? 



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-09 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762115#comment-13762115
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

Okay, so the issue of no segments being placed in HDFS was because I was using 
runtime/deploy/ instead of runtime/local/, so I'll worry about that later. 
I got it to run, but now I am running into this error: 

CloudSearchIndexWriter
cloudsearch.endpoint : URL of the CloudSearch domain's document 
endpoint. (mandatory)

I have set my value in the conf/nutch-default.xml like

http://doc-placesearch-BLAHBLAHBLAH.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch




[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-09 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761873#comment-13761873
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

I did, but I noticed that this is not creating a crawl/segments/ folder after 
running.




[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-06 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760593#comment-13760593
 ] 

Tom Hill commented on NUTCH-1517:
-

Did you do step 3a: "bin/nutch crawl urls -dir crawl -depth 3 -topN 5"?



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-06 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760448#comment-13760448
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

I have followed the above process, but I am getting errors when trying to run 
"bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb 
crawl/linkdb crawl/segments/*"

13/09/06 18:03:19 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_fetch matches 0 files
Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_parse matches 0 files
Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_data matches 0 files
Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_text matches 0 files
Input path does not exist: hdfs://10.148.178.153:9000/user/hadoop/crawl/linkdb/current
13/09/06 18:03:19 ERROR indexer.IndexingJob: Indexer: org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_fetch matches 0 files
Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/crawl_parse matches 0 files
Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_data matches 0 files
Input Pattern hdfs://10.148.178.153:9000/user/hadoop/crawl/segments/*/parse_text matches 0 files
Input path does not exist: hdfs://10.148.178.153:9000/user/hadoop/crawl/linkdb/current

Any suggestions?
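
The log says the indexer found no segment containing crawl_fetch, crawl_parse, parse_data and parse_text. A quick pre-flight check along those lines might look like the sketch below. It only covers a crawl directory on the local filesystem (runtime/local); the function name has_complete_segment is made up for illustration, and a crawl stored in HDFS would need the equivalent listing through the hadoop command instead.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a local (non-HDFS) crawl directory.
# Succeeds if at least one segment contains all four parts named in the log.
has_complete_segment() {
  segments_dir="$1"
  for seg in "$segments_dir"/*; do
    # Skip the unexpanded glob and anything that is not a directory.
    [ -d "$seg" ] || continue
    complete=1
    for part in crawl_fetch crawl_parse parse_data parse_text; do
      [ -d "$seg/$part" ] || complete=0
    done
    if [ "$complete" -eq 1 ]; then
      return 0
    fi
  done
  return 1
}

# Example: report on ./crawl/segments without aborting the script.
if has_complete_segment crawl/segments; then
  echo "segments look complete"
else
  echo "no complete segment under crawl/segments"
fi
```

If the check fails, the crawl step has most likely not run (or not finished) in the directory the indexer is pointed at.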



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-06 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760105#comment-13760105
 ] 

Tom Hill commented on NUTCH-1517:
-

Thanks for the clarification!



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-06 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760035#comment-13760035
 ] 

Julien Nioche commented on NUTCH-1517:
--

Tom - by convention changes made by the users are set in nutch-site.xml whereas 
nutch-default.xml is used to list the parameters and their default values. It 
is just a convention but it helps finding what has been changed specifically 
for a given setup.

You can have multiple indexing backends used at the same time and e.g. index 
with both solr and elasticsearch provided of course that their respective 
plugins are activated in nutch-site.xml and that they are properly configured.
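
As a rough illustration of the above, the sketch below widens a plugin.includes value so that two indexing plugins are active, and prints it as an override for nutch-site.xml. The helper name enable_both_indexers, the shortened example value and the plugin id indexer-elastic are assumptions made for this example; the real plugin ids should be checked against the plugin directory of the Nutch version in use.

```shell
#!/bin/sh
# Hypothetical helper: widen a plugin.includes value so that two indexing
# back-ends are active. The plugin ids are assumptions; verify them locally.
enable_both_indexers() {
  printf '%s\n' "$1" | sed 's/indexer-solr/indexer-(solr|elastic)/'
}

# Shortened example value; a real plugin.includes lists many more plugins.
current='protocol-http|index-basic|indexer-solr|scoring-opic'
updated="$(enable_both_indexers "$current")"

# Print an override meant for conf/nutch-site.xml, leaving
# nutch-default.xml untouched as the convention suggests.
cat <<EOF
<property>
  <name>plugin.includes</name>
  <value>${updated}</value>
</property>
EOF
```

Each activated back-end still needs its own settings (server URL, endpoint and so on) configured in the same file.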



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-05 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759865#comment-13759865
 ] 

Tom Hill commented on NUTCH-1517:
-

I believe you can configure either in nutch-default.xml, but not both.





[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-05 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759439#comment-13759439
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

Does the above patch disable solr indexing?



[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-09-04 Thread Tom Hill (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758358#comment-13758358
 ] 

Tom Hill commented on NUTCH-1517:
-

I've attached a patch that adds CloudSearch as a pluggable indexing back-end. 

Slightly verbose description of how to test:

1. Create a CloudSearch domain
   Note the document endpoint.
   I created the following fields in the domain:

  anchor          Active  text     (Result)
  author          Active  literal  (Search Result)
  boost           Active  literal  (Search Result)
  cache           Active  literal  (Search Result)
  content         Active  text     (Result)
  content_length  Active  literal  (Search Result)
  digest          Active  literal  (Search Result)
  feed            Active  literal  (Search Result)
  host            Active  literal  (Search Result)
  id              Active  literal  (Search Result)
  lang            Active  literal  (Search Result)
  published_date  Active  uint     ()
  segment         Active  literal  (Search Result)
  subcollection   Active  literal  (Search Result)
  tag             Active  literal  (Search Result)
  text            Active  text     (Result)
  title           Active  text     (Result)
  tstamp          Active  uint     ()
  type            Active  literal  (Search Result)
  updated_date    Active  uint     ()
  url             Active  text     (Result)

2. Check out Nutch
   git clone https://github.com/apache/nutch
3. Switch to the 1.7 branch
   git checkout -t origin/branch-1.7
4. Apply the attached patch
   I created it with: git diff remotes/origin/branch-1.7 --no-prefix > indexer-cloudsearch.patch
   Applied with: patch -p0 -i ~/code/nutch/indexer-cloudsearch.patch
5. Edit conf/nutch-default.xml
   Add the document endpoint under the cloudsearch parameters (add http:// on 
   the front and /2011-02-01/documents/batch on the end).
   Change the line with "indexer-solr" to "indexer-cloudsearch".
6. Build Nutch
   Just run "ant" in the top directory.
   This builds the "runtime" directory, and "local" under that.
7. cd to nutch/runtime/local
8. Do step three of the tutorial at http://wiki.apache.org/nutch/NutchTutorial
   1) You've done step 1 already.
   2) Step 2 I didn't have to do; it was all correct already.
   3) Do step 3, stopping before 3.1.
      a) Then do this: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
      b) Skip 3.2 through 5.x.
   4) Skip tutorial step 4.
   5) Skip tutorial step 5.
   6) Do parts of step 6:
      Check that the domain is ready.
      Then run just this one line:
      bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
      Don't worry about the URL; it's ignored. The real URL comes from 
      nutch-default.xml (set above).
      (This is a hack, since I'm not sure how to integrate properly. 
      Hopefully someone can help here.)
9. Check logs/hadoop.log
   It should show the adds sent to CloudSearch. Errors show there, too.
   You might have to set the logging level to info in 
   nutch/runtime/local/conf/log4j.properties.
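
The command-line parts of the steps above can be strung together roughly as follows. This is a sketch, not a tested installer: it defaults to a dry run that only prints the commands, the function names run, manual_step and test_cloudsearch_patch and the DRY_RUN and PATCH_FILE variables are invented for this example, and the edits to the config file and the tutorial's seed-URL setup remain manual.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them (needs git, patch, ant and network access).
DRY_RUN="${DRY_RUN:-1}"
PATCH_FILE="${PATCH_FILE:-$HOME/code/nutch/indexer-cloudsearch.patch}"

# Print the command in dry-run mode, otherwise execute it in this shell.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Announce a manual step; in a real run, wait until the user confirms.
manual_step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "# manual: $*"
  else
    printf '%s, then press Enter: ' "$*"
    read -r reply
  fi
}

test_cloudsearch_patch() {
  run git clone https://github.com/apache/nutch
  run cd nutch
  run git checkout -t origin/branch-1.7
  run patch -p0 -i "$PATCH_FILE"
  manual_step "edit conf/nutch-default.xml (endpoint, indexer-cloudsearch)"
  run ant
  run cd runtime/local
  manual_step "do tutorial step 3 up to 3.1 (seed URLs in ./urls)"
  run bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  # The Solr URL is ignored by the patch, per the notes above.
  # The glob is expanded by the shell once segments exist.
  run bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
}

test_cloudsearch_patch
```

Running it unchanged gives a printable checklist; a real run stops at the first failing command only if the caller adds its own error handling.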





[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2013-08-22 Thread Daniel Ciborowski (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747893#comment-13747893
 ] 

Daniel Ciborowski commented on NUTCH-1517:
--

Found this blog entry. This guy seems to have written the basic code needed? 
Going to try it out.

http://www.ikanow.com/blog/04/15/leveraging-nutch-with-infinit-e-mongodb-and-elasticsearch/

> CloudSearch indexer
> ---
>
> Key: NUTCH-1517
> URL: https://issues.apache.org/jira/browse/NUTCH-1517
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
> Fix For: 1.9
>
>
> Once we have made the indexers pluggable, we should add a plugin for Amazon 
> CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
> JSON based representation Search Data Format (SDF), which we could reuse for 
> a file based indexer.
