[ https://issues.apache.org/jira/browse/NUTCH-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906488#action_12906488 ]
Markus Jelsma edited comment on NUTCH-716 at 9/6/10 9:32 AM:
-------------------------------------------------------------

This patch concatenates multiple values into a single string instead of adding each value separately to the multi-valued field.

For a test crawl I defined the following two subcollections:

<subcollection>
  <name>asdf</name>
  <id>asdf-site</id>
  <whitelist>http://asdf/</whitelist>
  <blacklist/>
</subcollection>
<subcollection>
  <name>news</name>
  <id>asdf-nieuws</id>
  <whitelist>http://asdf/news/</whitelist>
  <blacklist/>
</subcollection>

Reindexing the segments by sending them to Solr yields the following results; the second document is the news URL:

<doc>
  <arr name="subcollection">
    <str>asdf</str>
  </arr>
  <str name="url">http://asdf/home/</str>
</doc>
<doc>
  <arr name="subcollection">
    <str>asdf news</str>
  </arr>
  <str name="url">http://asdf/news/</str>
</doc>

Instead, I expected the following result for the second document:

<doc>
  <arr name="subcollection">
    <str>asdf</str>
    <str>news</str>
  </arr>
  <str name="url">http://asdf/news/</str>
</doc>

My Solr schema.xml has the following declaration for the subcollection field:

<field name="subcollection" type="string" stored="true" indexed="true" multiValued="true" />

The latest nightly build I could find is nutch-2010-07-07_04-49-04.

> Make subcollection index filed multivalued
> ------------------------------------------
>
>                 Key: NUTCH-716
>                 URL: https://issues.apache.org/jira/browse/NUTCH-716
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Dmitry Lihachev
>             Fix For: 1.2, 2.0
>
>         Attachments: NUTCH-716-1_2.patch, NUTCH-716_multivalued_subcollection.patch
>
>
> Looks like a reasonable thing to do. Marking as 1.2 and will commit if no one objects

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
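
For illustration only, the difference reported in the comment (one concatenated string versus one value per matching subcollection) can be sketched in plain Java. This is not the patch and does not use any Nutch or Solr classes: the class `SubcollectionFieldSketch`, its methods, and the `Map<String, List<String>>` stand-in for an index document are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubcollectionFieldSketch {

    // Hypothetical stand-in for an index document: field name -> list of values.
    // This is an assumption for the sketch, not a Nutch or Solr type.
    public static Map<String, List<String>> newDoc() {
        return new LinkedHashMap<>();
    }

    // Reported behavior: all matching names are joined into one string,
    // so the field receives a single value such as "asdf news".
    public static void addConcatenated(Map<String, List<String>> doc,
                                       String field, List<String> names) {
        if (names.isEmpty()) {
            return; // nothing matched, leave the field untouched
        }
        doc.computeIfAbsent(field, k -> new ArrayList<>())
           .add(String.join(" ", names));
    }

    // Expected behavior: each matching name becomes its own field value.
    public static void addMultiValued(Map<String, List<String>> doc,
                                      String field, List<String> names) {
        if (names.isEmpty()) {
            return;
        }
        doc.computeIfAbsent(field, k -> new ArrayList<>()).addAll(names);
    }

    public static void main(String[] args) {
        // Both subcollections from the comment match http://asdf/news/.
        List<String> matches = Arrays.asList("asdf", "news");

        Map<String, List<String>> reported = newDoc();
        addConcatenated(reported, "subcollection", matches);

        Map<String, List<String>> expected = newDoc();
        addMultiValued(expected, "subcollection", matches);

        System.out.println(reported); // {subcollection=[asdf news]}
        System.out.println(expected); // {subcollection=[asdf, news]}
    }
}
```

With a `string` field type, the concatenated value "asdf news" is indexed as one token, so an exact-match query for the value "news" alone would presumably not match that document, whereas the multi-valued form would.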