Re: ExtractRequestHandler and Tika. Get only plain text

2018-11-14 Thread Sergio García Maroto
Thanks Erick.
I do use this strategy for indexing data from the DB. It is very flexible for
me.
I work in a company where .NET is the main dev platform, so it is even more
important to separate things.

Does your post mean that the functionality for indexing documents in Solr using
ExtractRequestHandler doesn't provide the option of indexing plain text?

On Wed, 14 Nov 2018 at 16:14, Erick Erickson 
wrote:

> While ERH is fine for getting started, as you go toward production
> you'll want to consider parsing the data outside of Solr for the
> reasons (and example) outlined here:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
> On Wed, Nov 14, 2018 at 6:46 AM Sergio García Maroto 
> wrote:
> >
> > Thanks a lot Jan.
> > That works very well.
> >
> > I am now trying to index the doc in Solr, removing the extractOnly
> > parameter, and can't find any similar option to get the data indexed as
> > plain text. I am still getting the metadata as well.
> > This is my request:
> >
> > http://localhost:8983/solr/document/update/extract?literal.id=DDOC001&stream.file=C:\TIKA\FileTest\Test.txt&commit=true&fmap.content=DocContentS
> >
> > My DocContentS contains:
> > \n \n stream_size 13 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
> > X-Parsed-By org.apache.tika.parser.txt.TXTParser \n stream_name Test.txt
> > \n stream_source_info file:/C:/TIKA/FileTest/Test.txt \n Content-Encoding
> > ISO-8859-1 \n Content-Type text/plain; charset=ISO-8859-1 \n \n \n Prueba
> > Sergio \n
> >
> > I can't find anywhere how to modify this behaviour.
> >
> >
> >
> >
> > On Wed, 14 Nov 2018 at 13:06, Jan Høydahl  wrote:
> >
> > > Have you tried to specify &extractFormat=text
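> > >
> > > For example, based on your earlier request (note that extractFormat
> > > only applies to the extractOnly response):
> > >
> > > http://localhost:8983/solr/document/update/extract?extractOnly=true&extractFormat=text&stream.file=C:\TIKA\FileTest\Test.txt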
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > >
> > > > On 14 Nov 2018, at 12:09, marotosg wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Currently I am trying to index documents of different kinds with Solr
> > > > and Tika. It's working fine, but when Solr returns the content of the
> > > > document, it doesn't return just the plain text. It comes back with
> > > > some metadata as well.
> > > >
> > > > For instance, my request:
> > > >
> > > > http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt
> > > >
> > > > Content of Test.txt file is just "*Test File*".
> > > >
> > > > Response from Solr, as you can see below, returns plenty of
> > > > information. I would like the answer to be something like this,
> > > > without noise for the search:
> > > >
> > > > <response>
> > > >   <str name="Test.txt">Test File</str>
> > > > </response>
> > > >
> > > > <response>
> > > >   <lst name="responseHeader">
> > > >     <int name="status">0</int>
> > > >     <int name="QTime">135</int>
> > > >   </lst>
> > > >   <str name="Test.txt">
> > > >     <?xml version="1.0" encoding="UTF-8"?>
> > > >     <html xmlns="http://www.w3.org/1999/xhtml">
> > > >     <head>
> > > >       <meta name="stream_size" content="13"/>
> > > >       <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> > > >       <meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser"/>
> > > >       <meta name="stream_name" content="Test.txt"/>
> > > >       <meta name="stream_source_info" content="file:/C:/TIKA/FileTest/Test.txt"/>
> > > >       <meta name="Content-Encoding" content="ISO-8859-1"/>
> > > >       <meta name="Content-Type" content="text/plain; charset=ISO-8859-1"/>
> > > >     </head>
> > > >     <body><p>Test File</p></body>
> > > >     </html>
> > > >   </str>
> > > >   <lst name="Test.txt_metadata">
> > > >     <arr name="stream_size"><str>13</str></arr>
> > > >     <arr name="X-Parsed-By">
> > > >       <str>org.apache.tika.parser.DefaultParser</str>
> > > >       <str>org.apache.tika.parser.txt.TXTParser</str>
> > > >     </arr>
> > > >     <arr name="stream_name"><str>Test.txt</str></arr>
> > > >     <arr name="stream_source_info"><str>file:/C:/TIKA/FileTest/Test.txt</str></arr>
> > > >     <arr name="Content-Encoding"><str>ISO-8859-1</str></arr>
> > > >     <arr name="Content-Type"><str>text/plain; charset=ISO-8859-1</str></arr>
> > > >   </lst>
> > > > </response>
> > > >
> > > > Can anyone shed some light here?
> > > > Thanks a lot.
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> > >
> > >
>


Re: ExtractRequestHandler and Tika. Get only plain text

2018-11-14 Thread Sergio García Maroto
Thanks a lot Jan.
That works very well.

I am now trying to index the doc in Solr, removing the extractOnly parameter,
and can't find any similar option to get the data indexed as plain text. I
am still getting the metadata as well.
This is my request:
http://localhost:8983/solr/document/update/extract?literal.id=DDOC001&stream.file=C:\TIKA\FileTest\Test.txt&commit=true&fmap.content=DocContentS

My DocContentS contains:
\n \n stream_size 13 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
X-Parsed-By org.apache.tika.parser.txt.TXTParser \n stream_name Test.txt
\n stream_source_info file:/C:/TIKA/FileTest/Test.txt \n Content-Encoding
ISO-8859-1 \n Content-Type text/plain; charset=ISO-8859-1 \n \n \n Prueba
Sergio \n

I can't find anywhere how to modify this behaviour.




On Wed, 14 Nov 2018 at 13:06, Jan Høydahl  wrote:

> Have you tried to specify &extractFormat=text
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 14 Nov 2018, at 12:09, marotosg wrote:
> >
> > Hi all,
> >
> > Currently I am trying to index documents of different kinds with Solr
> > and Tika. It's working fine, but when Solr returns the content of the
> > document, it doesn't return just the plain text. It comes back with some
> > metadata as well.
> >
> > For instance, my request:
> >
> > http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt
> >
> > Content of Test.txt file is just "*Test File*".
> >
> > Response from Solr, as you can see below, returns plenty of information.
> > I would like the answer to be something like this, without noise for the
> > search:
> >
> > <response>
> >   <str name="Test.txt">Test File</str>
> > </response>
> >
> > <response>
> >   <lst name="responseHeader">
> >     <int name="status">0</int>
> >     <int name="QTime">135</int>
> >   </lst>
> >   <str name="Test.txt">
> >     <?xml version="1.0" encoding="UTF-8"?>
> >     <html xmlns="http://www.w3.org/1999/xhtml">
> >     <head>
> >       <meta name="stream_size" content="13"/>
> >       <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> >       <meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser"/>
> >       <meta name="stream_name" content="Test.txt"/>
> >       <meta name="stream_source_info" content="file:/C:/TIKA/FileTest/Test.txt"/>
> >       <meta name="Content-Encoding" content="ISO-8859-1"/>
> >       <meta name="Content-Type" content="text/plain; charset=ISO-8859-1"/>
> >     </head>
> >     <body><p>Test File</p></body>
> >     </html>
> >   </str>
> >   <lst name="Test.txt_metadata">
> >     <arr name="stream_size"><str>13</str></arr>
> >     <arr name="X-Parsed-By">
> >       <str>org.apache.tika.parser.DefaultParser</str>
> >       <str>org.apache.tika.parser.txt.TXTParser</str>
> >     </arr>
> >     <arr name="stream_name"><str>Test.txt</str></arr>
> >     <arr name="stream_source_info"><str>file:/C:/TIKA/FileTest/Test.txt</str></arr>
> >     <arr name="Content-Encoding"><str>ISO-8859-1</str></arr>
> >     <arr name="Content-Type"><str>text/plain; charset=ISO-8859-1</str></arr>
> >   </lst>
> > </response>
> >
> > Can anyone shed some light here?
> > Thanks a lot.
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
>


Re: Query with exact number of tokens

2018-09-24 Thread Sergio García Maroto
Thanks all for your ideas. It was very useful information.

On Fri, 21 Sep 2018 at 19:04, Jan Høydahl  wrote:

> I have made a FieldType specially for this
> https://github.com/cominvent/exactmatch/ <
> https://github.com/cominvent/exactmatch/>
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > > On 21 Sep 2018, at 18:14, Steve Rowe wrote:
> >
> > Link correction - wrong fragment identifier in ref #5 - should be:
> >
> > [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#function-range-query-parser
> >
> > --
> > Steve
> > www.lucidworks.com
> >
> >> On Sep 21, 2018, at 12:04 PM, Steve Rowe  wrote:
> >>
> >> Hi Sergio,
> >>
> >> Chris “Hoss” Hostetter has a solution to this kind of problem here:
> >> https://lists.apache.org/thread.html/6b0f0cb864aa55f0a9eadfd92d27d374ab8deb16e8131ed2b7234463@%3Csolr-user.lucene.apache.org%3E
> >> See also the suggestions in comments on SOLR-12673[1], which include a
> >> version of Hoss’s solution.
> >>
> >> Hoss’s solution assumes a multivalued StrField with values counted
> >> using CountFieldValuesUpdateProcessorFactory, which doesn’t apply to you.
> >> You could instead count unique tokens in an analyzed field using the
> >> StatelessScriptUpdateProcessorFactory[2][3]; e.g., see slides 10-11 of
> >> Erik Hatcher’s Lucene/Solr Revolution 2013 talk[4].
> >>
> >> Your script could look something like this (untested; replace
> >> "<field type>" with your field type and "<field>" with your field name):
> >>
> >> =
> >> function getUniqueTokenCount(analyzer, fieldName, fieldValue) {
> >>   var uniqueTokens = {};
> >>   var stream = analyzer.tokenStream(fieldName, fieldValue);
> >>   var termAttr = stream.getAttribute(
> >>     Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
> >>   stream.reset();
> >>   while (stream.incrementToken()) { uniqueTokens[termAttr.toString()] = 1; }
> >>   stream.end();
> >>   stream.close();
> >>   return Object.keys(uniqueTokens).length;
> >> }
> >> function processAdd(cmd) {
> >>   var doc = cmd.solrDoc;  // the incoming SolrInputDocument
> >>   var analyzer = req.getCore().getLatestSchema()
> >>     .getFieldTypeByName("<field type>").getIndexAnalyzer();
> >>   doc.setField("unique_token_count_i",
> >>     getUniqueTokenCount(analyzer, null, doc.getFieldValue("<field>")));
> >> }
> >> function processDelete(cmd) { }
> >> function processMergeIndexes(cmd) { }
> >> function processCommit(cmd) { }
> >> function processRollback(cmd) { }
> >> function finish() { }
> >> =
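> >>
> >> To wire the script up, you'd register it in solrconfig.xml; a minimal
> >> sketch (the chain and script file names are made up):
> >>
> >> =
> >> <updateRequestProcessorChain name="unique-token-count">
> >>   <processor class="solr.StatelessScriptUpdateProcessorFactory">
> >>     <str name="script">unique-token-count.js</str>
> >>   </processor>
> >>   <processor class="solr.LogUpdateProcessorFactory"/>
> >>   <processor class="solr.RunUpdateProcessorFactory"/>
> >> </updateRequestProcessorChain>
> >> =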
> >>
> >> And your query could then look something like this (replace "<field>"
> >> with your field name)[5][6][7]:
> >>
> >> =
> >> fq={!frange l=0 u=0}sub(unique_token_count_i,sum(termfreq(<field>,'CENTURY'),termfreq(<field>,'BANCORP'),termfreq(<field>,'INC')))
> >> =
> >>
> >> Note that to construct the query above you’ll need to tokenize and
> >> uniquify terms on the client side; if tokenization is non-trivial, you
> >> could use Solr's Field Analysis API[8] to perform tokenization for you.
> >>
> >> [1] https://issues.apache.org/jira/browse/SOLR-12673
> >> [2] https://wiki.apache.org/solr/ScriptUpdateProcessor
> >> [3] https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
> >> [4] https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
> >> [5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#OtherParsers-FunctionRangeQueryParser
> >> [6] https://lucene.apache.org/solr/guide/7_4/function-queries.html#termfreq-function
> >> [7] https://lucene.apache.org/solr/guide/7_4/function-queries.html#sub-function
> >> [8] https://lucene.apache.org/solr/guide/7_4/implicit-requesthandlers.html#analysis-handlers
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>> On Sep 21, 2018, at 10:45 AM, Erick Erickson 
> wrote:
> >>>
> >>> A variant on Alexandre's approach is:
> >>> at index time, count the tokens that will be produced yourself (this
> >>> may be a little tricky; you shouldn't have WordDelimiterFilterFactory
> >>> in your analysis, for instance).
> >>> Put the number of tokens in a separate field.
> >>> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> >>> +tokens_in_company_name_field:3
> >>>
> >>> You don't need phrase queries with this approach; order doesn't matter.
> >>>
> >>> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> >>> BANCORP, INCORPORATED." match?
> >>>
> >>> Again, though, this means your indexing code has to do the same thing
> >>> as your analysis chain. Which isn't very hard if the analysis chain is
> >>> simple. I might use a char _filter_ factory to remove all
> >>> non-alphanumeric characters, then a whitespace tokenizer and
> >>> (probably) a lowercase filter. That's pretty easy to replicate in order
> >>> to count tokens; see the sketch below.
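> >>>
> >>> A minimal sketch of that chain (the field type name is made up):
> >>>
> >>> <fieldType name="exact_count_text" class="solr.TextField">
> >>>   <analyzer>
> >>>     <!-- remove everything that isn't a letter, digit or space -->
> >>>     <charFilter class="solr.PatternReplaceCharFilterFactory"
> >>>                 pattern="[^A-Za-z0-9 ]" replacement=" "/>
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>   </analyzer>
> >>> </fieldType>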
> >>>
> >>> Best,
> >>> Erick
> >>> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> >>>  wrote:
> 
>  I think you can match everything in the query to the field using either:
>  1) disMax/eDisMax with mm=100%
> 
> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-par
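>
>  For example (a sketch):
>  q=century bancorp inc&defType=edismax&qf=company_name&mm=100%25
>  Note that mm=100% requires every query term to match, but by itself it
>  doesn't forbid extra tokens in the field; hence the token-count ideas
>  above.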

Re: Java 9 and Solr 6.6

2017-12-04 Thread Sergio García Maroto
Thanks. It's very clear that we should not go with Java 9.

On 2 December 2017 at 00:37, Shawn Heisey  wrote:

> On 12/1/2017 12:32 PM, marotosg wrote:
> > Would you recommend installing Solr 6.6.1 with Java 9 for a production
> > environment?
>
> Solr 7.x has been tested with Java 9 and should work with no problems.
> I believe that code changes were required to achieve this compatibility,
> so 6.6 might have issues with Java 9.
>
> The release information for 6.6 only mentions Java 8, while the release
> information for 7.0 explicitly says that it works with Java 9.
>
> I would not try running 6.6 with Java 9 in production without first
> testing every part of the implementation on a dev server ... and based
> on the limited information I know about, I'm not confident that those
> tests would pass.
>
> Thanks,
> Shawn
>
>


Re: Strip out punctuation at the end of token

2017-11-24 Thread Sergio García Maroto
Yes. You are right. I understand now.
Let me explain my issue a bit better with the exact problem I have.

I have this text: "Information number 61149-008."
Using the tokenizers and filters described previously, I get this list of
tokens:
information
number
61149-008.
61149
008

Basically, the last token, "61149-008.", gets tokenized as:
61149-008.
61149
008
The user is searching for "61149-008" without the dot, so this is not a match.
I don't want to change the tokenization on the query side, to avoid altering
the matches for other cases.

I would like to delete the dot at the end, basically generating this extra
token:
information
number
61149-008.
61149
008
61149-008

Not sure if what I am saying makes sense, or if there is another way to do
this right. One thing I may try is stripping the trailing dot with a pattern
filter before the word delimiter filter; see the sketch below.
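
A minimal, untested sketch of that idea (it assumes a pattern filter placed
before the WordDelimiterFilterFactory fits the rest of my chain):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- strip a single trailing dot from each token: "61149-008." -> "61149-008" -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="\.$" replacement="" replace="all"/>
  <filter class="solr.WordDelimiterFilterFactory"
          preserveOriginal="1" generateWordParts="1" generateNumberParts="1"/>
</analyzer>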

Thanks a lot
Sergio


On 24 November 2017 at 15:31, Shawn Heisey  wrote:

> On 11/24/2017 2:32 AM, marotosg wrote:
>
>> Hi Shawn.
>> Thanks for your reply. Actually my issue is with the last token: it looks
>> like, for the last token of a string, it keeps the dot.
>>
>> In your case "Testing. This is a test. Test."
>>
>> It keeps the "Test."
>>
>> Is there any reason I can't see for that behaviour?
>>
>
> I am really not sure what you're saying here.
>
> Every token is duplicated, one has the dot and one doesn't.  This is what
> you wanted based on what I read in your initial email.
>
> Making a guess as to what you're asking about this time: If you're
> noticing that there isn't a "Test" as the last token on the line for WDF,
> then I have to tell you that it actually is there, the display was simply
> too wide for the browser window. Scrolling horizontally would be required
> to see the whole thing.
>
> Thanks,
> Shawn
>
>


Re: _version_ / Versioning using timespan

2017-06-02 Thread Sergio García Maroto
You are right about that, but in some cases I may need to reindex my data
and wanted to avoid deleting the full index so
I can still serve queries. I thought reindexing the same version would be
handy, or at least to have the flexibility.

On 2 June 2017 at 14:53, Susheel Kumar  wrote:

> I see the difference now between using _version_ vs. a custom versionField.
> They seem to behave differently. The _version_ field, if used, allows the
> same version to be updated, and that's the perception I had in mind for the
> custom versionField.
>
> My question is: why do you want to update the document if it has the same
> version? Shouldn't you pass a higher version if the doc has changed, so
> that the update is accepted?
>
> On Fri, Jun 2, 2017 at 8:13 AM, Susheel Kumar 
> wrote:
>
> > Just to confirm again before going too far: are you able to execute these
> > examples and see the same output given under "Optimistic Concurrency"?
> > https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents#UpdatingPartsofDocuments-In-PlaceUpdates
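> >
> > For example, with the stock _version_-based optimistic concurrency (a
> > sketch; the id, field, and version values are made up), an update that
> > supplies a _version_ that no longer matches the indexed one should come
> > back with HTTP 409:
> >
> > curl -H 'Content-Type: application/json' \
> >   'http://localhost:8983/solr/<core>/update' \
> >   -d '[{"id":"1","title":"test","_version_":1234567890123456789}]'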
> >
> > Let me know for which example you fail to get the same output as described.
> >
> > On Fri, Jun 2, 2017 at 5:11 AM, Sergio García Maroto  >
> > wrote:
> >
> >> I had a look at the source code of
> >> DocBasedVersionConstraintsProcessorFactory and I see:
> >>
> >> if (0 < ((Comparable)newUserVersion).compareTo((Comparable)
> >> oldUserVersion)) {
> >>   // log.info("VERSION returning true (proceed with update)" );
> >>   return true;
> >> }
> >>
> >> I can't find a way of overwriting the same version without changing that
> >> piece of code.
> >> Would it be possible to add a parameter to
> >> "DocBasedVersionConstraintsProcessorFactory", something like
> >> "overwrite.same.version=true",
> >> so the new code would look like:
> >>
> >> int compareTo = ((Comparable)newUserVersion).compareTo((Comparable)
> >> oldUserVersion);
> >> if ( ((overwritesameversion) && 0 <= compareTo) || (0 < compareTo)) {
> >>   // log.info("VERSION returning true (proceed with update)" );
> >>   return true;
> >> }
> >>
> >> Is that going to break anything? Can I make that change?
> >>
> >> Thanks
> >> Sergio
> >>
> >>
> >> On 2 June 2017 at 10:10, Sergio García Maroto 
> wrote:
> >>
> >> > I am using 6.1.0.
> >> > I tried with two different field types, long and date:
> >> > <field name="..." type="long" indexed="true" stored="true"/>
> >> > <field name="UpdatedDateSD" type="date" indexed="true" stored="true"/>
> >> >
> >> > I am using this configuration in solrconfig.xml:
> >> >
> >> > <updateRequestProcessorChain name="...">
> >> >   <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
> >> >     <bool name="ignoreOldUpdates">false</bool>
> >> >     <str name="versionField">UpdatedDateSD</str>
> >> >   </processor>
> >> >   <processor class="solr.RunUpdateProcessorFactory"/>
> >> > </updateRequestProcessorChain>
> >> >
> >> > I had a look at the wiki page
> >> > (https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents)
> >> > and it says:
> >> >
> >> > *Once configured, this update processor will reject (HTTP error code
> >> > 409) any attempt to update an existing document where the value of
> >> > the my_version_l field in the "new" document is not greater than the
> >> > value of that field in the existing document.*
> >> >
> >> > Do you have any tip on how to keep updates with the same version from
> >> > being rejected?
> >> >
> >> > Thanks a lot.
> >> >
> >> >
> >> > On 1 June 2017 at 19:04, Susheel Kumar  wrote:
> >> >
> >> >> Which version of Solr are you using? I tested on 6.0 and if I supply
> >> >> the same version, it overwrites/updates the document exactly as per
> >> >> the wiki documentation.
> >> >>
> >> >> Thanks,
> >> >> Susheel
> >> >>
> >> >> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
> >> >>
> >> >> > Thanks a lot Susheel.
> >> >> > I see this is actually what I need. I have been testing it and
> >> >> > noticed the value of the field always has to be greater for a new
> >> >> > document to get indexed. If you send the same version number it
> >> >> > doesn't work.
> >> >> >
> >> >> > Is it possible somehow to overwrite documents with the same version?
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > View this message in context: http://lucene.472066.n3.nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
> >> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
> >
>


Re: _version_ / Versioning using timespan

2017-06-02 Thread Sergio García Maroto
I had a look at the source code of
DocBasedVersionConstraintsProcessorFactory and I see:

if (0 < ((Comparable)newUserVersion).compareTo((Comparable)
oldUserVersion)) {
  // log.info("VERSION returning true (proceed with update)" );
  return true;
}

I can't find a way of overwriting the same version without changing that piece
of code.
Would it be possible to add a parameter to
"DocBasedVersionConstraintsProcessorFactory", something like
"overwrite.same.version=true",
so the new code would look like:


int compareTo = ((Comparable)newUserVersion).compareTo((Comparable)
oldUserVersion);
if ( ((overwritesameversion) && 0 <= compareTo) || (0 < compareTo)) {
  // log.info("VERSION returning true (proceed with update)" );
  return true;
}


Is that going to break anything? Can I make that change?

Thanks
Sergio


On 2 June 2017 at 10:10, Sergio García Maroto  wrote:

> I am using 6.1.0.
> I tried with two different field types, long and date:
> <field name="..." type="long" indexed="true" stored="true"/>
> <field name="UpdatedDateSD" type="date" indexed="true" stored="true"/>
>
> I am using this configuration in solrconfig.xml:
>
> <updateRequestProcessorChain name="...">
>   <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
>     <bool name="ignoreOldUpdates">false</bool>
>     <str name="versionField">UpdatedDateSD</str>
>   </processor>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
>
> I had a look at the wiki page
> (https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents)
> and it says:
>
> *Once configured, this update processor will reject (HTTP error code 409)
> any attempt to update an existing document where the value of
> the my_version_l field in the "new" document is not greater than the value
> of that field in the existing document.*
>
> Do you have any tip on how to keep updates with the same version from
> being rejected?
>
> Thanks a lot.
>
>
> On 1 June 2017 at 19:04, Susheel Kumar  wrote:
>
>> Which version of Solr are you using? I tested on 6.0 and if I supply the
>> same version, it overwrites/updates the document exactly as per the wiki
>> documentation.
>>
>> Thanks,
>> Susheel
>>
>> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
>>
>> > Thanks a lot Susheel.
>> > I see this is actually what I need. I have been testing it and noticed
>> > the value of the field always has to be greater for a new document to
>> > get indexed. If you send the same version number it doesn't work.
>> >
>> > Is it possible somehow to overwrite documents with the same version?
>> >
>> > Thanks
>> >
>> >
>> >
>> > --
>> > View this message in context: http://lucene.472066.n3.nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>>
>
>


Re: _version_ / Versioning using timespan

2017-06-02 Thread Sergio García Maroto
I am using 6.1.0.
I tried with two different field types, long and date:

<field name="..." type="long" indexed="true" stored="true"/>
<field name="UpdatedDateSD" type="date" indexed="true" stored="true"/>

I am using this configuration in solrconfig.xml:

<updateRequestProcessorChain name="...">
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <bool name="ignoreOldUpdates">false</bool>
    <str name="versionField">UpdatedDateSD</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

I had a look at the wiki page
(https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents)
and it says:

*Once configured, this update processor will reject (HTTP error code 409)
any attempt to update an existing document where the value of
the my_version_l field in the "new" document is not greater than the value
of that field in the existing document.*

Do you have any tip on how to keep updates with the same version from being
rejected?
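
For instance (a sketch; the core name and values are made up), resending a
document with the same UpdatedDateSD value currently gets rejected with
HTTP 409:

curl -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/<core>/update' \
  -d '[{"id":"1","UpdatedDateSD":"2017-06-01T10:00:00Z"}]'
# first time: indexed; second time with the same value: HTTP 409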

Thanks a lot.


On 1 June 2017 at 19:04, Susheel Kumar  wrote:

> Which version of Solr are you using? I tested on 6.0 and if I supply the
> same version, it overwrites/updates the document exactly as per the wiki
> documentation.
>
> Thanks,
> Susheel
>
> On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:
>
> > Thanks a lot Susheel.
> > I see this is actually what I need. I have been testing it and noticed
> > the value of the field always has to be greater for a new document to
> > get indexed. If you send the same version number it doesn't work.
> >
> > Is it possible somehow to overwrite documents with the same version?
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: Classify document using bag of words

2017-03-26 Thread Sergio García Maroto
Hi John.
Thanks for that.

That's actually a good option, but I would need the category text on the
field so I can facet on the field and get every category and its count.

On 26 March 2017 at 18:27, John Blythe  wrote:

> You could use keepwords to filter out any other words besides your bag and
> then have a synonym filter that translates the remaining word(s) to a
> corresponding category/classification.
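>
> A minimal sketch of that analysis chain (file, field, and type names are
> made up):
>
> <fieldType name="category_class" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <!-- keep only the words in your bag -->
>     <filter class="solr.KeepWordFilterFactory" words="bagwords.txt"
>             ignoreCase="true"/>
>     <!-- map each kept word to its category, e.g. "invoice => billing" -->
>     <filter class="solr.SynonymFilterFactory" synonyms="word2category.txt"
>             ignoreCase="true"/>
>   </analyzer>
> </fieldType>
>
> You could then copyField your text into a field of this type and facet on
> it to get each category and its count.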
>
> On Sun, Mar 26, 2017 at 12:05 PM marotosg  wrote:
>
> > Hi,
> >
> > I have a very simple use case where I would need to classify a document
> > using a bag of words. Basically, if a field within the document contains
> > any of the words in my bag, then I use a new field to assign a category
> > to the document.
> >
> > Is this something achievable on Solr?
> >
> > I was thinking of using Lucene Document
> > classification: https://wiki.apache.org/solr/SolrClassification.
> > From what I understand, I need to feed the category on some documents
> > first. New documents would then be classified.
> >
> > Is there anything else I am missing?
> >
> > Thanks a lot.
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/Classify-document-using-bag-of-words-tp4326865.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> --
> --
> *John Blythe*
> Product Manager & Lead Developer
>
> 251.605.3071 | j...@curvolabs.com
> www.curvolabs.com
>
> 58 Adams Ave
> Evansville, IN 47713
>


Re: Classify document using bag of words

2017-03-26 Thread Sergio García Maroto
Sorry, it actually works. Thanks a lot.

On 26 March 2017 at 21:45, Sergio García Maroto  wrote:

> Hi John.
> Thanks for that.
>
> That's actually a good option, but I would need the category text on the
> field so I can facet on the field and get every category and its count.
>
> On 26 March 2017 at 18:27, John Blythe  wrote:
>
>> You could use keepwords to filter out any other words besides your bag and
>> then have a synonym filter that translates the remaining word(s) to a
>> corresponding category/classification
>>
>> On Sun, Mar 26, 2017 at 12:05 PM marotosg  wrote:
>>
>> > Hi,
>> >
>> > I have a very simple use case where I would need to classify a document
>> > using a bag of words. Basically, if a field within the document contains
>> > any of the words in my bag, then I use a new field to assign a category
>> > to the document.
>> >
>> > Is this something achievable on Solr?
>> >
>> > I was thinking of using Lucene Document
>> > classification: https://wiki.apache.org/solr/SolrClassification.
>> > From what I understand, I need to feed the category on some documents
>> > first. New documents would then be classified.
>> >
>> > Is there anything else I am missing?
>> >
>> > Thanks a lot.
>> >
>> >
>> >
>> > --
>> > View this message in context: http://lucene.472066.n3.nabble.com/Classify-document-using-bag-of-words-tp4326865.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>> --
>> --
>> *John Blythe*
>> Product Manager & Lead Developer
>>
>> 251.605.3071 | j...@curvolabs.com
>> www.curvolabs.com
>>
>> 58 Adams Ave
>> Evansville, IN 47713
>>
>
>


Re: After migrating to SolrCloud

2017-01-30 Thread Sergio García Maroto
Thanks a lot guys. It was the F5 load balancer.

Regards,
Sergio

On 28 January 2017 at 01:50, Chris Hostetter 
wrote:

>
> That error means that some client talking to your server is attempting to
> use an antiquated HTTP protocol version, which was (evidently) supported
> by the jetty used in 3.6, but is no longer supported by the jetty used in
> 6.2.
>
> (some details: https://stackoverflow.com/a/32302263/689372 )
>
> If it's happening once a second, that sounds like perhaps some sort of
> monitoring agent? Or perhaps you have a load balancer with an antiquated
> health-check mechanism?
>
>
> NUANCE NOTE: even though the error specifically mentions "HTTP/0.9", it's
> possible that the problematic client is actually attempting to use
> HTTP/1.0; but for a variety of esoteric reasons related to how loosely
> HTTP/0.9 requests can be formatted, some HTTP/1.0 requests will look like
> "valid" (but unsupported) HTTP/0.9 requests to the jetty server -- hence
> that error message...
>
> https://github.com/eclipse/jetty.project/issues/370
>
>
>
>
> : Date: Thu, 26 Jan 2017 08:49:06 -0700 (MST)
> : From: marotosg 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: After migrating to SolrCloud
> :
> : Hi All,
> : I have migrated Solr from the older version 3.6 to SolrCloud 6.2 and all
> : is good, but there are WARN messages in the logs almost every second:
> :
> : HttpParser
> : bad HTTP parsed: 400 HTTP/0.9 not supported for
> : HttpChannelOverHttp@16a84451{r=0,c=false,a=IDLE,uri=null}
> :
> : Anyone know where these are coming from?
> :
> : Thanks
> :
> :
> :
> :
> : --
> : View this message in context: http://lucene.472066.n3.nabble.com/After-migrating-to-SolrCloud-tp4315943.html
> : Sent from the Solr - User mailing list archive at Nabble.com.
> :
>
> -Hoss
> http://www.lucidworks.com/
>


Re: RTF Rich text format

2016-11-14 Thread Sergio García Maroto
Thanks for the response.

I am afraid I can't use the DataImportHandler. I do the indexing using an
indexing service that joins data from several places.

I have a final XML with plenty of data, and one of its fields is the RTF
field. That's the XML I send to Solr using /update. I am wondering whether
Solr could do it with a tokenizer or filter or something like that.

On 14 November 2016 at 16:24, Alexandre Rafalovitch 
wrote:

> I think DataImportHandler with a nested entity (JDBC, then Tika with
> FieldReaderDataSource) should do the trick; see the sketch below.
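>
> A minimal, untested sketch of such a config (table, column, and field names
> are made up; if Tika needs a binary stream rather than a Reader, swap in
> FieldStreamDataSource):
>
> <dataConfig>
>   <dataSource name="db" type="JdbcDataSource" driver="..." url="..."/>
>   <dataSource name="fld" type="FieldReaderDataSource"/>
>   <document>
>     <entity name="doc" dataSource="db"
>             query="SELECT id, rtf_content FROM docs">
>       <field column="id" name="id"/>
>       <!-- Tika parses the RTF column and emits plain text -->
>       <entity name="rtf" processor="TikaEntityProcessor" dataSource="fld"
>               dataField="doc.rtf_content" format="text">
>         <field column="text" name="content"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>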
>
> Have you tried that?
>
> Regards,
>Alex.
> 
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 15 November 2016 at 03:19, marotosg  wrote:
> > Hi,
> >
> > I have a use case where I need to index information coming from a
> > database where there is a field which contains rich text format (RTF). I
> > would like to convert that text into simple plain text, same as Tika
> > does when indexing documents.
> >
> > Is there any way to achieve that, having a field where I only send this
> > rich text and then Solr cleans that data? I can't find anything so far.
> >
> > Thanks
> > Sergio
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/RTF-Rich-text-format-tp4305778.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Query by distance

2016-10-19 Thread Sergio García Maroto
Thanks a lot. I will try it and let you know.

Thanks again
Sergio


On 18 October 2016 at 17:02, John Bickerstaff 
wrote:

> Just in case it helps, I had good success on multi-word synonyms using this
> plugin...
>
> https://github.com/healthonnet/hon-lucene-synonyms
>
> IIRC, the instructions are clear and fairly easy to follow - especially for
> Solr 6.x
>
> Ping back if you run into any problems setting it up...
>
>
>
> On Tue, Oct 18, 2016 at 7:12 AM, marotosg  wrote:
>
> > This is my field type. I was reading about this and it looks like the
> > issue is in it:
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="300">
> >   <analyzer type="index">
> >     <tokenizer class="..."/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
> >             preserveOriginal="1" protected="protwordscompany.txt"/>
> >     <filter class="..." preserveOriginal="false"/>
> >     <filter class="..."/>
> >     <filter class="solr.SynonymFilterFactory"
> >             synonyms="positionsynonyms.txt" ignoreCase="true" expand="true"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="..."/>
> >     <filter class="solr.WordDelimiterFilterFactory"
> >             generateWordParts="0" generateNumberParts="0" catenateWords="0"
> >             catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
> >             preserveOriginal="1" protected="protwordscompany.txt"/>
> >     <filter class="..." preserveOriginal="false"/>
> >     <filter class="..."/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > I have been reading and it looks like the issue is about multi-term
> > synonyms:
> > http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/
> >
> > I may try this plugin to check if it works.
> >
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.nabble.com/Query-by-distance-tp4300660p4301697.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: join and NOT together

2016-02-18 Thread Sergio García Maroto
Hi Mikhail. Sorry for all the confusion.

This is the original query, which doesn't work:
q=PersonName:peter AND {!type=join from=DocPersonID to=PersonID
fromIndex=document v='(*:* -DocType:pdf)' }

I figured out that negating outside the cross-join query does the trick for
me: I take the negation out of the v='' and put it in the person-collection
part of the query. The join with v='(*:* -DocType:pdf)' matches every person
that has at least one non-pdf document, whereas negating the whole join
excludes every person that has any pdf document, which is what I wanted.

q=PersonName:peter AND (*:* - {!type=join from=DocPersonID to=PersonID
fromIndex=document v='(DocType:pdf)' })


On 17 February 2016 at 12:13, Mikhail Khludnev 
wrote:

> Sergio,
>
> Please provide more debug output; I want to see how the query was parsed.
>
> On Tue, Feb 16, 2016 at 1:20 PM, Sergio García Maroto 
> wrote:
>
> > The part of my debugQuery=true output related to the NOT returns:
> >
> > 0.06755901 = (MATCH) sum of: 0.06755901 = (MATCH) MatchAllDocsQuery,
> > product of: 0.06755901 = queryNorm
> >
> > I tried changing v='(*:* -DocType:pdf)' to v='(-DocType:pdf)'
> > and it worked.
> >
> > Could anyone explain the difference?
> >
> > Thanks
> > Sergio
> >
> >
> > On 15 February 2016 at 21:12, Mikhail Khludnev <
> mkhlud...@griddynamics.com
> > >
> > wrote:
> >
> > > Hello Sergio,
> > >
> > > What does the debugQuery=true output look like?
> > >
> > > On Mon, Feb 15, 2016 at 7:10 PM, marotosg  wrote:
> > >
> > > > Hi,
> > > >
> > > > I am trying to solve an issue when doing a search joining two
> > > > collections and negating the cross-core query.
> > > >
> > > > Let's say I have one collection, person, and another collection,
> > > > documents, and I can join them using the local param !join because I
> > > > have PersonIDs in the document collection.
> > > >
> > > > My query is like below, executed against the Person core. I want to
> > > > retrieve people with the name Peter that do not have attached
> > > > documents of type pdf:
> > > >
> > > > q=PersonName:peter AND {!type=join from=DocPersonID to=PersonID
> > > > fromIndex=document v='(*:* -DocType:pdf)' }
> > > >
> > > > If I have person 1, called Peter, with two documents, one of type pdf
> > > > and the other of type word,
> > > > then this person will still come back.
> > > >
> > > > Is there any way of excluding that person if any of the docs match
> > > > the NOT?
> > > >
> > > > Thanks
> > > > Sergio
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > > >
> > http://lucene.472066.n3.nabble.com/join-and-NOT-together-tp4257411.html
> > > > Sent from the Solr - User mailing list archive at Nabble.com.
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > > 
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> 
>


Re: join and NOT together

2016-02-16 Thread Sergio García Maroto
The part of my debugQuery=true output related to the NOT returns:

0.06755901 = (MATCH) sum of: 0.06755901 = (MATCH) MatchAllDocsQuery,
product of: 0.06755901 = queryNorm

I tried changing v='(*:* -DocType:pdf)' to v='(-DocType:pdf)'
and it worked.

Could anyone explain the difference?

Thanks
Sergio


On 15 February 2016 at 21:12, Mikhail Khludnev 
wrote:

> Hello Sergio,
>
> What does the debugQuery=true output look like?
>
> On Mon, Feb 15, 2016 at 7:10 PM, marotosg  wrote:
>
> > Hi,
> >
> > I am trying to solve an issue when doing a search joining two
> > collections and negating the cross-core query.
> >
> > Let's say I have one collection, person, and another collection,
> > documents, and I can join them using the local param !join because I
> > have PersonIDs in the document collection.
> >
> > My query is like below, executed against the Person core. I want to
> > retrieve people with the name Peter that do not have attached documents
> > of type pdf:
> >
> > q=PersonName:peter AND {!type=join from=DocPersonID to=PersonID
> > fromIndex=document v='(*:* -DocType:pdf)' }
> >
> > If I have person 1, called Peter, with two documents, one of type pdf
> > and the other of type word,
> > then this person will still come back.
> >
> > Is there any way of excluding that person if any of the docs match the
> > NOT?
> >
> > Thanks
> > Sergio
> >
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/join-and-NOT-together-tp4257411.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>


Re: Solr Consultant for remote project - on NLP and Solr Faceted Search

2015-02-02 Thread Sergio García Maroto
Hi,

My name is Sergio Garcia.
I would be interested in this role. Attached you can find a copy of my CV.

Regards,
Sergio

On 31 January 2015 at 14:18, MKGoose  wrote:

> We are looking for a remote / freelance consultant to work with us on a
> project related to Solr faceted search and NLP. It involves data extraction
> / summarisation and custom faceted search on Solr.
>
> Please contact me if you have expertise in this area and can work remotely
> with a small team.
>
> Thanks,
> MG.
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Consultant-for-remote-project-on-NLP-and-Solr-Faceted-Search-tp4183236.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>