[jira] [Commented] (SOLR-7193) Concatenate words from token stream

2018-06-11 Thread Alexandre Rafalovitch (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508806#comment-16508806
 ] 

Alexandre Rafalovitch commented on SOLR-7193:
-

This seems to be satisfied by LUCENE-8332 and SOLR-12376, both coming in 7.4.

> Concatenate words from token stream
> ---
>
> Key: SOLR-7193
> URL: https://issues.apache.org/jira/browse/SOLR-7193
> Project: Solr
>  Issue Type: New Feature
>  Components: Schema and Analysis
>Reporter: Abhishek Bafna
>Priority: Major
> Attachments: concatenate_words.patch
>
>
> User-entered data often lacks proper spacing between words, and spelling and 
> format also vary across data such as business names, addresses, etc. 
> After tokenizing the data, we might perform pattern replacement, stop word 
> filtering, etc. We then want to concatenate all the tokens and generate 
> n-gram tokens for indexing business names and performing fuzzy matching.
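
The pipeline described in the issue (tokenize, clean up, drop stop words, concatenate, then n-gram) can be sketched in plain Java. This is only an illustration under stated assumptions: it is not the code in concatenate_words.patch and does not use the Lucene analysis API. The class name `ConcatNgramSketch`, the stop word list, the cleanup pattern, and the gram size are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the implementation attached to SOLR-7193.
final class ConcatNgramSketch {

    // Assumed stop words for this example only.
    static final Set<String> STOP_WORDS = Set.of("the", "and", "of");

    // Lowercase the input, replace runs of non-alphanumeric characters with a
    // space (a stand-in for "pattern replacement"), split, and drop stop words.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String cleaned = input.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", " ")
                .trim();
        for (String token : cleaned.split(" ")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Join all remaining tokens into one token, with no separator.
    static String concatenate(List<String> tokens) {
        return String.join("", tokens);
    }

    // Produce every character n-gram of exactly n characters.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Two spellings of the same business name collapse to one token.
        String a = concatenate(tokenize("Wal-Mart Stores"));
        String b = concatenate(tokenize("Walmart stores"));
        System.out.println(a.equals(b)); // prints "true"
        System.out.println(ngrams(a, 3));
    }
}
```

Because both spellings reduce to the same concatenated token, their n-grams are identical, which is what makes fuzzy matching on inconsistently spaced names workable.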



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7193) Concatenate words from token stream

2015-03-12 Thread abhishek bafna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14359939#comment-14359939
 ] 

abhishek bafna commented on SOLR-7193:
--

[~jmtd890917] Did you get the point I was trying to convey? Could you please 
share any further comments on the patch?




[jira] [Commented] (SOLR-7193) Concatenate words from token stream

2015-03-12 Thread abhishek bafna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358173#comment-14358173
 ] 

abhishek bafna commented on SOLR-7193:
--

The ConcatenateWordsFilter takes all the input tokens (words) and generates a 
single token. Its CPU time and memory use depend on the number and size of the 
tokens in the stream. The intended use case is an input stream that contains 
business names, addresses, etc., which usually have a small number of tokens. 
My guess is that the input data here (in the test environment) contains long 
paragraphs or documents, and that might be causing the issue.
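
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes uniform token lengths and that the concatenated token is later split into character n-grams of sizes 2 to 4; those sizes, the token counts, and the class name `ConcatCostSketch` are hypothetical and do not come from the patch.

```java
// Hypothetical estimate of how much work a concatenated token creates downstream.
final class ConcatCostSketch {

    // Length of the single token produced from tokenCount tokens of tokenLength chars.
    static long concatenatedLength(long tokenCount, long tokenLength) {
        return tokenCount * tokenLength;
    }

    // Number of character n-grams of sizes minGram..maxGram over a string of the given length.
    static long ngramCount(long length, int minGram, int maxGram) {
        long total = 0;
        for (int n = minGram; n <= maxGram; n++) {
            total += Math.max(0, length - n + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        // A short business name: 3 tokens of about 6 characters each.
        long nameLength = concatenatedLength(3, 6);
        System.out.println(ngramCount(nameLength, 2, 4)); // prints 48

        // A long paragraph: 500 tokens of about 6 characters each.
        long paragraphLength = concatenatedLength(500, 6);
        System.out.println(ngramCount(paragraphLength, 2, 4)); // prints 8994
    }
}
```

Under these assumptions the n-gram count grows linearly with the concatenated length, so a field holding paragraphs produces a single very long token and far more n-grams than a field holding a short name.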




[jira] [Commented] (SOLR-7193) Concatenate words from token stream

2015-03-05 Thread abhishek bafna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348403#comment-14348403
 ] 

abhishek bafna commented on SOLR-7193:
--

The ConcatenateWordsFilter takes all the input tokens (words) and generates a 
single token. Its CPU time and memory use depend on the number and size of the 
tokens in the stream. The intended use case is an input stream that contains 
business names, addresses, etc., which usually have a small number of tokens. 
My guess is that the input data here (in the test environment) contains long 
paragraphs or documents, and that might be causing the issue.




[jira] [Commented] (SOLR-7193) Concatenate words from token stream

2015-03-04 Thread chengyunyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14348272#comment-14348272
 ] 

chengyunyun commented on SOLR-7193:
---

Pressure test environment:
•   Client: Spring RESTful
•   Users: 500
•   Total data: 3 million
•   Problem: the Solr log shows a search time of only 30 ms, but the search 
client's call to server.query(SolrQuery query) takes much longer, up to 30 s; 
the gap is very large.
Is this related to CPU or memory?
