RE: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.8.0-ea-b99) - Build # 3121 - Failure!

2013-08-08 Thread Uwe Schindler
Hi,

this one looks crazy. Maybe a Windows-only problem; I have never seen that before!

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Policeman Jenkins Server [mailto:jenk...@thetaphi.de]
> Sent: Wednesday, August 07, 2013 6:35 AM
> To: dev@lucene.apache.org; rm...@apache.org; hoss...@apache.org
> Subject: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.8.0-ea-b99) -
> Build # 3121 - Failure!
> 
> Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3121/
> Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC
> 
> 1 tests failed.
> REGRESSION:
> org.apache.lucene.index.TestIndexWriterOutOfFileDescriptors.test
> 
> Error Message:
> unreferenced files: before delete:
> [_0_TestBloomFilteredLucene41Postings_0.doc,
> _0_TestBloomFilteredLucene41Postings_0.pos, _k.fdt, _k.fdx, _k.fnm,
> _k.nvd, _k.nvm, _k.si, _k.tvd, _k.tvx, _k_Lucene41WithOrds_0.doc,
> _k_Lucene41WithOrds_0.pos, _k_Lucene41WithOrds_0.tib,
> _k_Lucene41WithOrds_0.tii, _k_Lucene42_0.dvd, _k_Lucene42_0.dvm,
> _k_MockFixedIntBlock_0.doc, _k_MockFixedIntBlock_0.frq,
> _k_MockFixedIntBlock_0.pos, _k_MockFixedIntBlock_0.pyl,
> _k_MockFixedIntBlock_0.skp, _k_MockFixedIntBlock_0.tib,
> _k_MockFixedIntBlock_0.tii, _k_MockVariableIntBlock_0.doc,
> _k_MockVariableIntBlock_0.frq, _k_MockVariableIntBlock_0.pos,
> _k_MockVariableIntBlock_0.pyl, _k_MockVariableIntBlock_0.skp,
> _k_MockVariableIntBlock_0.tib, _k_MockVariableIntBlock_0.tii,
> _k_TestBloomFilteredLucene41Postings_0.blm,
> _k_TestBloomFilteredLucene41Postings_0.doc,
> _k_TestBloomFilteredLucene41Postings_0.pos,
> _k_TestBloomFilteredLucene41Postings_0.tim,
> _k_TestBloomFilteredLucene41Postings_0.tip, _l.fdt, _l.fdx, _l.fnm, _l.nvd,
> _l.nvm, _l.si, _l.tvd, _l.tvx, _l_Lucene41WithOrds_0.doc,
> _l_Lucene41WithOrds_0.pos, _l_Lucene41WithOrds_0.tib,
> _l_Lucene41WithOrds_0.tii, _l_Lucene42_0.dvd, _l_Lucene42_0.dvm,
> _l_MockFixedIntBlock_0.doc, _l_MockFixedIntBlock_0.frq,
> _l_MockFixedIntBlock_0.pos, _l_MockFixedIntBlock_0.pyl,
> _l_MockFixedIntBlock_0.skp, _l_MockFixedIntBlock_0.tib,
> _l_MockFixedIntBlock_0.tii, _l_MockVariableIntBlock_0.doc,
> _l_MockVariableIntBlock_0.frq, _l_MockVariableIntBlock_0.pos,
> _l_MockVariableIntBlock_0.pyl, _l_MockVariableIntBlock_0.skp,
> _l_MockVariableIntBlock_0.tib, _l_MockVariableIntBlock_0.tii,
> _l_TestBloomFilteredLucene41Postings_0.blm,
> _l_TestBloomFilteredLucene41Postings_0.doc,
> _l_TestBloomFilteredLucene41Postings_0.pos,
> _l_TestBloomFilteredLucene41Postings_0.tim,
> _l_TestBloomFilteredLucene41Postings_0.tip, _m.cfe, _m.cfs, _m.si, _n.cfe,
> _n.cfs, _n.si, _o.fdt, _o.fdx, _o.fnm, _o.nvd, _o.nvm, _o.si, _o.tvd, _o.tvx,
> _o_Lucene41WithOrds_0.doc, _o_Lucene41WithOrds_0.pos,
> _o_Lucene41WithOrds_0.tib, _o_Lucene41WithOrds_0.tii,
> _o_Lucene42_0.dvd, _o_Lucene42_0.dvm, _o_MockFixedIntBlock_0.doc,
> _o_MockFixedIntBlock_0.frq, _o_MockFixedIntBlock_0.pos,
> _o_MockFixedIntBlock_0.pyl, _o_MockFixedIntBlock_0.skp,
> _o_MockFixedIntBlock_0.tib, _o_MockFixedIntBlock_0.tii,
> _o_MockVariableIntBlock_0.doc, _o_MockVariableIntBlock_0.frq,
> _o_MockVariableIntBlock_0.pos, _o_MockVariableIntBlock_0.pyl,
> _o_MockVariableIntBlock_0.skp, _o_MockVariableIntBlock_0.tib,
> _o_MockVariableIntBlock_0.tii,
> _o_TestBloomFilteredLucene41Postings_0.blm,
> _o_TestBloomFilteredLucene41Postings_0.doc,
> _o_TestBloomFilteredLucene41Postings_0.pos,
> _o_TestBloomFilteredLucene41Postings_0.tim,
> _o_TestBloomFilteredLucene41Postings_0.tip, _q.fdt, _q.fdx, _q.fnm,
> _q.nvd, _q.nvm, _q.si, _q.tvd, _q.tvx, _q_Lucene41WithOrds_0.doc,
> _q_Lucene41WithOrds_0.pos, _q_Lucene41WithOrds_0.tib,
> _q_Lucene41WithOrds_0.tii, _q_Lucene42_0.dvd, _q_Lucene42_0.dvm,
> _q_MockFixedIntBlock_0.doc, _q_MockFixedIntBlock_0.frq,
> _q_MockFixedIntBlock_0.pos, _q_MockFixedIntBlock_0.pyl,
> _q_MockFixedIntBlock_0.skp, _q_MockFixedIntBlock_0.tib,
> _q_MockFixedIntBlock_0.tii, _q_MockVariableIntBlock_0.doc,
> _q_MockVariableIntBlock_0.frq, _q_MockVariableIntBlock_0.pos,
> _q_MockVariableIntBlock_0.pyl, _q_MockVariableIntBlock_0.skp,
> _q_MockVariableIntBlock_0.tib, _q_MockVariableIntBlock_0.tii,
> _q_TestBloomFilteredLucene41Postings_0.blm,
> _q_TestBloomFilteredLucene41Postings_0.doc,
> _q_TestBloomFilteredLucene41Postings_0.pos,
> _q_TestBloomFilteredLucene41Postings_0.tim,
> _q_TestBloomFilteredLucene41Postings_0.tip, _r.fdt, _r.fdx, _r.fnm, _r.nvd,
> _r.nvm, _r.si, _r.tvd, _r.tvx, _r_Lucene41WithOrds_0.doc,
> _r_Lucene41WithOrds_0.pay, _r_Lucene41WithOrds_0.pos,
> _r_Lucene41WithOrds_0.tib, _r_Lucene41WithOrds_0.tii,
> _r_Lucene42_0.dvd, _r_Lucene42_0.dvm, _r_MockFixedIntBlock_0.doc,
> _r_MockFixedIntBlock_0.frq, _r_MockFixedIntBlock_0.pos,
> _r_MockFixedIntBlock_0.pyl, _r_MockFixedIntBlock_0.skp,
> _r_MockFixedIntBlock_0.tib, _r_MockFixedIntBlock_0.tii,
> _r_MockVariableIntBlock_0.doc, _r_MockVariableIntBlock_0.frq,
> _r_MockVariableIntBlock_0.pos, _r_MockVariableIntBlock_0.pyl,
> _r_MockVaria

[jira] [Created] (SOLR-5123) NullPointerException on JdbcDataSource

2013-08-08 Thread Thomas SZADEL (JIRA)
Thomas SZADEL created SOLR-5123:
---

 Summary: NullPointerException on JdbcDataSource
 Key: SOLR-5123
 URL: https://issues.apache.org/jira/browse/SOLR-5123
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 4.3
 Environment: Linux
Reporter: Thomas SZADEL
Priority: Minor


We got an NPE with Solr 4.3 when getting a database connection (in a case 
where JBoss fails to provide the connection).

Solr runs on JBoss 7.1 and gets its connections from a JNDI call (the 
connection is provided by JBoss).


Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:38)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
... 5 more
Caused by: java.lang.NullPointerException
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:241)
... 12 more


In the code, the possible null value is not checked:
239    try {
240      Connection c = getConnection();
241      stmt = c.createStatement(ResultSet.TYPE_FORWARD_ONLY,
             ResultSet.CONCUR_READ_ONLY);

... maybe a null check would be safer:
if (c == null) {
    throw new XXXException();
}
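
For illustration, a minimal self-contained sketch of such a guard (hypothetical 
class and method names; SQLException stands in for the XXXException placeholder 
above, and getConnection() is stubbed to return null to mimic the failing JNDI 
lookup -- this is not the actual Solr patch):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class GuardedStatementFactory {

    // Stand-in for JdbcDataSource.getConnection(); returns null to mimic
    // the case where the JNDI lookup / JBoss pool cannot provide one.
    private Connection getConnection() {
        return null;
    }

    public Statement createForwardOnlyStatement() throws SQLException {
        Connection c = getConnection();
        if (c == null) {
            // Fail fast with a descriptive error instead of letting the
            // next line throw an opaque NullPointerException.
            throw new SQLException(
                "Unable to obtain a JDBC connection (lookup returned null)");
        }
        return c.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                 ResultSet.CONCUR_READ_ONLY);
    }
}

Failing fast this way surfaces the real cause (the container could not provide 
a connection) instead of an opaque NPE deep inside ResultSetIterator.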

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733281#comment-13733281
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511633 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1511633 ]

SOLR-5113

> CollectionsAPIDistributedZkTest fails all the time
> --
>
> Key: SOLR-5113
> URL: https://issues.apache.org/jira/browse/SOLR-5113
> Project: Solr
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.5, 5.0
>Reporter: Uwe Schindler
>Assignee: Noble Paul
>Priority: Blocker
> Attachments: SOLR-5113.patch, SOLR-5113.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733292#comment-13733292
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511635 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1511635 ]

SOLR-5113

> CollectionsAPIDistributedZkTest fails all the time
> --
>
> Key: SOLR-5113
> URL: https://issues.apache.org/jira/browse/SOLR-5113
> Project: Solr
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.5, 5.0
>Reporter: Uwe Schindler
>Assignee: Noble Paul
>Priority: Blocker
> Attachments: SOLR-5113.patch, SOLR-5113.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul resolved SOLR-5113.
--

   Resolution: Fixed
Fix Version/s: 5.0
   4.5

> CollectionsAPIDistributedZkTest fails all the time
> --
>
> Key: SOLR-5113
> URL: https://issues.apache.org/jira/browse/SOLR-5113
> Project: Solr
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.5, 5.0
>Reporter: Uwe Schindler
>Assignee: Noble Paul
>Priority: Blocker
> Fix For: 4.5, 5.0
>
> Attachments: SOLR-5113.patch, SOLR-5113.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733297#comment-13733297
 ] 

Uwe Schindler commented on SOLR-5113:
-

Hi Noble,
thanks for committing! I think it is now up to jenkins to verify that it works! 



> CollectionsAPIDistributedZkTest fails all the time
> --
>
> Key: SOLR-5113
> URL: https://issues.apache.org/jira/browse/SOLR-5113
> Project: Solr
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.5, 5.0
>Reporter: Uwe Schindler
>Assignee: Noble Paul
>Priority: Blocker
> Fix For: 4.5, 5.0
>
> Attachments: SOLR-5113.patch, SOLR-5113.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA
Christoph Straßer created SOLR-5124:
---

 Summary: Solr glues words when parsing PDFs under certain circumstances
 Key: SOLR-5124
 URL: https://issues.apache.org/jira/browse/SOLR-5124
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.4
 Environment: Windows 7 (don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor


For some kinds of PDF documents, Solr glues words together at line breaks under 
some circumstances (e.g. the last word of line 1 and the first word of line 2 
are merged into one word).
(Stand-alone) Tika extracts the text correctly. Attached you find a sample PDF 
and screenshots of the Tika output and of the corrupted content indexed by Solr.
(This issue does not occur with all PDF documents. I tried to recreate the 
issue with new Word documents that I converted to PDF in multiple ways, without 
success.) The attached PDF document has a really weird internal structure, but 
Tika seems to do its work right, even with this weird document.
In our Solr indices we have a good number of these weird documents. This 
results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Straßer updated SOLR-5124:


Attachment: 04_Solr.png
03_TikaOutput_GUI_StructuredText.png
03_TikaOutput_GUI_PlainText.png
03_TikaOutput_GUI_MainContent.png
03_TikaOutput.png
02_PDF.png
01_alz_2009_folge11_2009_05_28.pdf

Added a sample PDF, screenshots of the Tika output, and a screenshot of the Solr index.

> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks 
> under some circumstances (e.g. the last word of line 1 and the first word of 
> line 2 are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find a sample 
> PDF and screenshots of the Tika output and of the corrupted content indexed 
> by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted to PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This 
> results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Straßer updated SOLR-5124:


Description: 
For some kind of PDF-documents Solr glues words at linebreaks under some 
circumstances. (eg the last word of line 1 and the first word of line 2 are 
merged into one word)
(Stand-alone-)Tika extracts the text correct. Attached you find one sample-PDF 
and screenshots of tika-output and the corrupted content indexed by solr.
(This issue does not occur with all PDF-documents. Tried to recreate the issue 
with new word-documents, I converted into PDF on multiple ways without 
success.) The attached PDF-document has a real weird internal structure. But 
Tika seems to do it´s work right. Even with this weird document.
In our Solr-indices we have a good amount of this weird documents. This results 
in worse suggestions by the Suggester.

  was:
For some kinds of PDF documents, Solr glues words together at line breaks under 
some circumstances (e.g. the last word of line 1 and the first word of line 2 
are merged into one word).
(Stand-alone) Tika extracts the text correctly. Attached you find a sample PDF 
and screenshots of the Tika output and of the corrupted content indexed by Solr.
(This issue does not occur with all PDF documents. I tried to recreate the 
issue with new Word documents that I converted to PDF in multiple ways, without 
success.) The attached PDF document has a really weird internal structure, but 
Tika seems to do its work right, even with this weird document.
In our Solr indices we have a good number of these wird documents. This 
results in worse suggestions from the Suggester.


> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks 
> under some circumstances (e.g. the last word of line 1 and the first word of 
> line 2 are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find a sample 
> PDF and screenshots of the Tika output and of the corrupted content indexed 
> by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted to PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This 
> results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Problem using Benchmark

2013-08-08 Thread Abhishek Gupta
Anyone, please help!!



On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta wrote:

> Hi,
> I am using PyLucene, and there I tried to use Lucene's Benchmark module to
> evaluate TREC data. I had a doubt which I first asked on the
> pylucene-dev mailing list. After solving the first problem I got another
> problem, which Andi said is a Java error. You can see the thread here
> (
> http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E
> )
>
> I am getting a ClassNotFoundException for Compressor (
> http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html).
> I am a newbie to Java development, so I don't know much about Ant. Please
> help in solving this issue.
>
>
> Thanking You
> Abhishek Gupta,
> 9624799165
>



-- 
Abhishek Gupta,
897876422, 9416106204, 9624799165


[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733312#comment-13733312
 ] 

Uwe Schindler commented on SOLR-5124:
-

I have not looked into DIH's code, but I know that TIKA adds the extra 
whitespace as "ignorable whitespace" XML data. It might be "ignored" by the 
extraction content handler when it consumes the SAX events.

> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks 
> under some circumstances (e.g. the last word of line 1 and the first word of 
> line 2 are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find a sample 
> PDF and screenshots of the Tika output and of the corrupted content indexed 
> by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted to PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This 
> results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733321#comment-13733321
 ] 

Christoph Straßer commented on SOLR-5124:
-

Maybe it's in some way related to SOLR-4679. (But I'm not sure; we use the 
ExtractingRequestHandler.)

> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks 
> under some circumstances (e.g. the last word of line 1 and the first word of 
> line 2 are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find a sample 
> PDF and screenshots of the Tika output and of the corrupted content indexed 
> by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted to PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This 
> results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733325#comment-13733325
 ] 

Uwe Schindler commented on SOLR-5124:
-

Hi, this is a duplicate of two other issues; SOLR-4679 is the main one. I will 
close this as a duplicate.

> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks 
> under some circumstances (e.g. the last word of line 1 and the first word of 
> line 2 are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find a sample 
> PDF and screenshots of the Tika output and of the corrupted content indexed 
> by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted to PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This 
> results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Closed] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed SOLR-5124.
---

Resolution: Duplicate

> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks 
> under some circumstances (e.g. the last word of line 1 and the first word of 
> line 2 are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find a sample 
> PDF and screenshots of the Tika output and of the corrupted content indexed 
> by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted to PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This 
> results in worse suggestions from the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Problem using Benchmark

2013-08-08 Thread Abhishek Gupta
You can see the complete error I am getting here.


On Thu, Aug 8, 2013 at 3:10 PM, Abhishek Gupta wrote:

> Anyone, please help!!
>
>
>
> On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta 
> wrote:
>
>> Hi,
>> I am using PyLucene, and there I tried to use Lucene's Benchmark module to
>> evaluate TREC data. I had a doubt which I first asked on the
>> pylucene-dev mailing list. After solving the first problem I got another
>> problem, which Andi said is a Java error. You can see the thread here
>> (
>> http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E
>> )
>>
>> I am getting a ClassNotFoundException for Compressor (
>> http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html).
>> I am a newbie to Java development, so I don't know much about Ant. Please
>> help in solving this issue.
>>
>>
>> Thanking You
>> Abhishek Gupta,
>> 9624799165
>>
>
>
>
> --
> Abhishek Gupta,
> 897876422, 9416106204, 9624799165
>



-- 
Abhishek Gupta,
897876422, 9416106204, 9624799165


[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler commented on SOLR-4679:
-

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so some additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and I have been discussing it since the very beginning in 
TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to "correctly produce" ignorable whitespace in some parsers that 
were previously missing it).

FYI: "ignorable whitespace" is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to "reuse" (it's a bit "incorrect") the ignorableWhitespace 
SAX event to report this "added whitespace". The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you "understand" block tags and 
<br>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.
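
To make the first rule concrete, here is a minimal sketch of a plain-text 
extraction handler that treats ignorable whitespace as significant (plain 
org.xml.sax API; the class name and buffering are illustrative, not Tika's 
actual TextOnlyContentHandler):

import org.xml.sax.helpers.DefaultHandler;

// Collects plain text from SAX events. Because TIKA reports its synthetic
// "glue" whitespace via ignorableWhitespace(), a text-only consumer must
// append it just like regular character data, otherwise words around <br>
// and block elements get glued together.
public class PlainTextHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Treat the "ignorable" whitespace as significant.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }
}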

> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> vienna
> </body>
> </html>
> The Solr content attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:12 AM:
--

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so some additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and I have been discussing it since the very beginning in 
TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to "correctly produce" ignorable whitespace in some parsers that 
were previously missing it. I also added the XHTMLContentHandler stuff that 
makes "block" XHTML elements like <p>, <div> also emit a newline as ignorable 
whitespace on the closing element).

FYI: "ignorable whitespace" is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to "reuse" (it's a bit "incorrect") the ignorableWhitespace 
SAX event to report this "added whitespace". The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you "understand" block tags and 
<br>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.

  was (Author: thetaphi):
There is another occurrence of this bug with PDF files (SOLR-5124). I think 
we should apply the workaround and make the ignorable whitespace significant. 
In my opinion this is not a problem at all, because the Analyzer will remove 
this stuff in any case, so some additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and I have been discussing it since the very beginning in 
TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to "correctly produce" ignorable whitespace in some parsers that 
were previously missing it).

FYI: "ignorable whitespace" is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to "reuse" (it's a bit "incorrect") the ignorableWhitespace 
SAX event to report this "added whitespace". The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you "understand" block tags and 
<br>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.
  
> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Tes

[jira] [Assigned] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned SOLR-4679:
---

Assignee: Uwe Schindler

> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> vienna
> </body>
> </html>
> The Solr content attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:17 AM:
--

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so some additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and I have been discussing it since the very beginning in 
TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to "correctly produce" ignorable whitespace in some parsers that 
were previously missing it. I also added the XHTMLContentHandler stuff that 
makes "block" XHTML elements like <p>, <div> also emit a newline as ignorable 
whitespace on the closing element, see TIKA-171).

FYI: "ignorable whitespace" is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to "reuse" (it's a bit "incorrect") the ignorableWhitespace 
SAX event to report this "added whitespace". The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you "understand" block tags and 
<br>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.

  was (Author: thetaphi):
There is another occurrence of this bug with PDF files (SOLR-5124). I think 
we should apply the workaround and make the ignorable whitespace significant. 
In my opinion this is not a problem at all, because the Analyzer will remove 
this stuff in any case, so some additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and I have been discussing it since the very beginning in 
TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to "correctly produce" ignorable whitespace in some parsers that 
were previously missing it. I also added the XHTMLContentHandler stuff that 
makes "block" XHTML elements like <p>, <div> also emit a newline as ignorable 
whitespace on the closing element).

FYI: "ignorable whitespace" is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but is never reported by HTML parsers), so 
the idea in TIKA is to "reuse" (it's a bit "incorrect") the ignorableWhitespace 
SAX event to report this "added whitespace". The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you "understand" block tags and 
<br>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.
  
> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotF

[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1377#comment-1377
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:25 AM:
--

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in 
TIKA-171. I think this was the issue where we decided to emit 
ignorableWhitespace for all "synthetic" whitespace added to support text-only 
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your 
current patch, because it makes use of the stuff we decided on in TIKA-171. In 
my opinion, TIKA-1134 is obsolete, but you/I can add a comment there to explain 
"one more time" and document under which circumstances TIKA emits 
ignorableWhitespace.

  was (Author: thetaphi):
The stuff with ignorableWhitespace was discussed between [~jukkaz] and me 
in TIKA-171. I think this was the issue where we decided to emit 
ignorableWhitespace for all "synthetic" whitespace added to support-text only 
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your 
current patch, because it makes use of the stuff we decided on in TIKA-171. In 
my opinion, TIKA-1134 is obsolete, but you/I can add a comment there to explain 
"one more time" and document under which circumstances TIKA emits 
ignorableWhitespace.
  
> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> vienna
> </body>
> </html>
> The Solr content attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1377#comment-1377
 ] 

Uwe Schindler commented on SOLR-4679:
-

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in 
TIKA-171. I think this was the issue where we decided to emit 
ignorableWhitespace for all "synthetic" whitespace added to support text-only 
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your 
current patch, because it makes use of the stuff we decided on in TIKA-171. In 
my opinion, TIKA-1134 is obsolete, but you/I can add a comment there to explain 
"one more time" and document under which circumstances TIKA emits 
ignorableWhitespace.

> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> vienna
> </body>
> </html>
> The Solr content attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



jar-checksums generates extra files?

2013-08-08 Thread Dawid Weiss
When I do this on trunk:

ant jar-checksums
svn stat

I get:
?   solr\licenses\jcl-over-slf4j.jar.sha1
?   solr\licenses\jul-to-slf4j.jar.sha1
?   solr\licenses\log4j.jar.sha1
?   solr\licenses\slf4j-api.jar.sha1
?   solr\licenses\slf4j-log4j12.jar.sha1

Where should this be fixed?  Should we svn-ignore those files or
should they be somehow excluded from the re-generation of SHA
checksums?

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: jar-checksums generates extra files?

2013-08-08 Thread Dawid Weiss
Never mind, these were local files and they were svn-ignored; when I
removed everything and checked out from scratch, the problem was no
longer there.

I really wish svn had an equivalent of git clean -xfd .

Dawid

On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss  wrote:
> When I do this on trunk:
>
> ant jar-checksums
> svn stat
>
> I get:
> ?   solr\licenses\jcl-over-slf4j.jar.sha1
> ?   solr\licenses\jul-to-slf4j.jar.sha1
> ?   solr\licenses\log4j.jar.sha1
> ?   solr\licenses\slf4j-api.jar.sha1
> ?   solr\licenses\slf4j-log4j12.jar.sha1
>
> Where should this be fixed?  Should we svn-ignore those files or
> should they be somehow excluded from the re-generation of SHA
> checksums?
>
> Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: jar-checksums generates extra files?

2013-08-08 Thread Uwe Schindler
Hi,

Some GUIs like TortoiseSVN have this. I use this to delete all unversioned 
files in milliseconds(TM). But native svn does not have it, unfortunately. 

Uwe



Dawid Weiss wrote:
>Never mind, these were local files and they were svn-ignored; when I
>removed everything and checked out from scratch, the problem was no
>longer there.
>
>I really wish svn had an equivalent of git clean -xfd .
>
>Dawid
>
>On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss 
>wrote:
>> When I do this on trunk:
>>
>> ant jar-checksums
>> svn stat
>>
>> I get:
>> ?   solr\licenses\jcl-over-slf4j.jar.sha1
>> ?   solr\licenses\jul-to-slf4j.jar.sha1
>> ?   solr\licenses\log4j.jar.sha1
>> ?   solr\licenses\slf4j-api.jar.sha1
>> ?   solr\licenses\slf4j-log4j12.jar.sha1
>>
>> Where should this be fixed?  Should we svn-ignore those files or
>> should they be somehow excluded from the re-generation of SHA
>> checksums?
>>
>> Dawid
>
>-
>To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>For additional commands, e-mail: dev-h...@lucene.apache.org

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

Re: jar-checksums generates extra files?

2013-08-08 Thread Dawid Weiss
I kind of use a workaround of removing everything except the .svn
folder and then running svn revert -R .
But this is a dumb solution :)

D.

On Thu, Aug 8, 2013 at 1:12 PM, Uwe Schindler  wrote:
> Hi,
>
> Some GUIs like TortoiseSVN have this. I use this to delete all unversioned
> files in milliseconds(TM). But native svn does not have it, unfortunately.
>
> Uwe
>
>
>
Dawid Weiss wrote:
>>
>> Never mind, these were local files and they were svn-ignored; when I
>> removed everything and checked out from scratch, the problem was no
>> longer there.
>>
>> I really wish svn had an equivalent of git clean -xfd .
>>
>> Dawid
>>
>> On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss 
>> wrote:
>>>
>>>  When I do this on trunk:
>>>
>>>  ant jar-checksums
>>>  svn stat
>>>
>>>  I get:
>>>  ?   solr\licenses\jcl-over-slf4j.jar.sha1
>>>  ?   solr\licenses\jul-to-slf4j.jar.sha1
>>>  ?   solr\licenses\log4j.jar.sha1
>>>  ?   solr\licenses\slf4j-api.jar.sha1
>>>  ?   solr\licenses\slf4j-log4j12.jar.sha1
>>>
>>>  Where should this be fixed?  Should we svn-ignore those files or
>>>  should they be somehow excluded from the re-generation of SHA
>>>  checksums?
>>>
>>>  Dawid
>>
>>
>> 
>>
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-Tests-trunk-Java7 - Build # 4219 - Failure

2013-08-08 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java7/4219/

All tests passed

Build Log:
[...truncated 34909 lines...]
BUILD FAILED
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/build.xml:389:
 The following error occurred while executing this line:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/build.xml:328:
 The following error occurred while executing this line:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/extra-targets.xml:66:
 The following error occurred while executing this line:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/extra-targets.xml:139:
 The following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 80 minutes 23 seconds
Build step 'Invoke Ant' marked build as failure
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[JENKINS] Lucene-Solr-trunk-Linux (64bit/jdk1.7.0_25) - Build # 6924 - Failure!

2013-08-08 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/6924/
Java: 64bit/jdk1.7.0_25 -XX:-UseCompressedOops -XX:+UseParallelGC

All tests passed

Build Log:
[...truncated 34822 lines...]
BUILD FAILED
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:389: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:328: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/extra-targets.xml:66: The 
following error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/extra-targets.xml:139: The 
following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 52 minutes 24 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 64bit/jdk1.7.0_25 -XX:-UseCompressedOops 
-XX:+UseParallelGC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.7.0_25) - Build # 3125 - Failure!

2013-08-08 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3125/
Java: 32bit/jdk1.7.0_25 -server -XX:+UseSerialGC

All tests passed

Build Log:
[...truncated 31514 lines...]
BUILD FAILED
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:389: The 
following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:328: The 
following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\extra-targets.xml:66: 
The following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\extra-targets.xml:139:
 The following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 103 minutes 30 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 32bit/jdk1.7.0_25 -server -XX:+UseSerialGC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733406#comment-13733406
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511715 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1511715 ]

SOLR-5113 setting svn:eol-style native

> CollectionsAPIDistributedZkTest fails all the time
> --
>
> Key: SOLR-5113
> URL: https://issues.apache.org/jira/browse/SOLR-5113
> Project: Solr
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.5, 5.0
>Reporter: Uwe Schindler
>Assignee: Noble Paul
>Priority: Blocker
> Fix For: 4.5, 5.0
>
> Attachments: SOLR-5113.patch, SOLR-5113.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733408#comment-13733408
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511717 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1511717 ]

SOLR-5113 setting svn:eol-style native

> CollectionsAPIDistributedZkTest fails all the time
> --
>
> Key: SOLR-5113
> URL: https://issues.apache.org/jira/browse/SOLR-5113
> Project: Solr
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.5, 5.0
>Reporter: Uwe Schindler
>Assignee: Noble Paul
>Priority: Blocker
> Fix For: 4.5, 5.0
>
> Attachments: SOLR-5113.patch, SOLR-5113.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-4.x-Linux (32bit/jdk1.8.0-ea-b99) - Build # 6841 - Still Failing!

2013-08-08 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Linux/6841/
Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC

All tests passed

Build Log:
[...truncated 31168 lines...]
BUILD FAILED
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/build.xml:395: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/build.xml:334: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/extra-targets.xml:66: The 
following error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/extra-targets.xml:139: The 
following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 45 minutes 31 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Problem using Benchmark

2013-08-08 Thread Andi Vajda
 Abhishek,

On Aug 8, 2013, at 12:08, Abhishek Gupta  wrote:

> You can see the complete error I am getting here.

Like I told you on pylucene-dev, you need to set up your classpath correctly so 
that these classes are found. If you are a Java newbie (as you said) and don't 
know what that means or how to achieve it, you need to research the issue 
yourself first.

This mailing list is not the right forum for this question. Try a general Java 
programming forum first.

Andi..

> 
> 
> On Thu, Aug 8, 2013 at 3:10 PM, Abhishek Gupta  
> wrote:
>> Anyone pls help!!
>> 
>> 
>> 
>> On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta  
>> wrote:
>>> Hi,
>>> I am using PyLucene, and there I tried to use Lucene's Benchmark to evaluate 
>>> TREC data. I had a doubt which I first asked on the pylucene-dev mailing 
>>> list. After solving the first problem I got another problem, which Andi 
>>> identified as a Java error. You can see the thread here 
>>> (http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E)
>>> 
>>> I am getting a class-not-found exception for 
>>> Compressor (http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html).
>>>  I am a newbie to Java development, so I don't know much about Ant. Please 
>>> help in solving this issue.
>>>  
>>> 
>>> Thanking You
>>> Abhishek Gupta,
>>> 9624799165
>> 
>> 
>> 
>> -- 
>> Abhishek Gupta,
>> 897876422, 9416106204, 9624799165
> 
> 
> 
> -- 
> Abhishek Gupta,
> 897876422, 9416106204, 9624799165


[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable

2013-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733454#comment-13733454
 ] 

Michael McCandless commented on LUCENE-5152:


bq. can you elaborate what you are concerned about?

I'm worried about the O(N^2) cost of the assert: for every arc (single
byte of each term in a seekExact) we are iterating over all root arcs
(up to 256 arcs) in this assert.

bq. findTargetArc is the only place where we actually use this cache?

Ahh that's true, I hadn't realized that.

Maybe, instead, we can move the assert just inside the if that
actually uses the cached arcs?  I.e., put it here:

{code}
  if (follow.target == startNode && labelToMatch < cachedRootArcs.length) {
assert assertRootArcs();
...
  }
{code}

This would address my concern: the cost becomes O(N) not O(N^2).  And
the coverage is the same?


> Lucene FST is not immutable
> ---
>
> Key: LUCENE-5152
> URL: https://issues.apache.org/jira/browse/LUCENE-5152
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/FSTs
>Affects Versions: 4.4
>Reporter: Simon Willnauer
>Priority: Blocker
> Fix For: 5.0, 4.5
>
> Attachments: LUCENE-5152.patch, LUCENE-5152.patch, LUCENE-5152.patch
>
>
> a spin-off from LUCENE-5120, where the analyzing suggester modified a returned 
> output from an FST (BytesRef), which caused side effects in later execution. 
> I added an assertion into the FST that checks if a cached root arc is 
> modified, and in fact this happens for instance in our MemoryPostingsFormat, 
> and I bet we find more places. We need to think about how to make this less 
> trappy since it can cause bugs that are super hard to find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [PROPOSAL] Make Luke a Lucene/Solr Module

2013-08-08 Thread Ajay Bhat
Hello,

Thanks so much for accepting the project proposal. I've started the coding
work. I'll keep you all posted on my progress.


On Fri, Jul 26, 2013 at 1:48 PM, Ajay Bhat  wrote:

> Hi,
>
> I have a question regarding one of the interfaces in the orig version.
>
> The IOReporter.java [1] is used by the Hadoop Plugin [2], and it only has 2
> functions, which are implemented by the Hadoop Plugin. Is this interface really
> needed? Can't I just use the functions as-is in the Hadoop class without
> needing to get the IOReporter?
>
> [1]
> http://luke.googlecode.com/svn/trunk/src/org/getopt/luke/plugins/IOReporter.java
>
> [2]
> http://luke.googlecode.com/svn/trunk/src/org/getopt/luke/plugins/HadoopPlugin.java
>
>
> On Sat, Jul 20, 2013 at 11:12 PM, SUJIT PAL  wrote:
>
>> Hi Ajay,
>>
>> Thanks for the reply and the links to the email threads. I saw a response
>> on this thread from Shawn Heisey about this as well. I didn't realize your
>> focus was Luke, then Lucene, then Solr - the proposal title and the JIRA
>> both mention Lucene/Solr module, which probably misled me - I guess I
>> should have read the doc more carefully... Thank you for the clarification
>> and good luck with your project.
>>
>> -sujit
>>
>> On Jul 20, 2013, at 9:09 AM, Ajay Bhat wrote:
>>
>> > Hi Sujit,
>> >
>> > Thanks for your comments. There was actually some discussion earlier
>> about whether or not Solr was the highest priority.
>> >
>> >
>> http://mail-archives.apache.org/mod_mbox/lucene-dev/201307.mbox/%3C0F7176D08A99494EBF1E129298E12904%40JackKrupansky%3E
>> >
>> http://mail-archives.apache.org/mod_mbox/lucene-dev/201307.mbox/%3CCAOdYfZVQ1WzWhYVeKgwpA%3DmQVONxo4XiLza28geV2L1PCpcQJg%40mail.gmail.com%3E
>> >
>> > Right now I don't think I could do the integration with Solr since (a)
>> I don't know enough Javascript to work with Solr and (b) The time for
>> submitting proposals for the program is over.
>> >
>> > The project duration is scheduled till the end of October. After that, or if
>> I get time during the project period, I'll try to work on other
>> functionality of Luke and then try for Solr. I think it's best to make
>> Luke completely functional before integrating with the trunk, and this is
>> better done in incremental steps.
>> >
>> >
>> > On Fri, Jul 19, 2013 at 9:59 PM, SUJIT PAL 
>> wrote:
>> > Hi Ajay,
>> >
>> > Since you asked for feedback from the community... a lot of what Luke
>> used to do is now already available in Solr's admin tool. From Luke's
>> feature set that you had in your proposal Google doc, the only ones I think
>> are /not/ present are the following:
>> >
>> > * Browse by document number
>> > * selectively delete documents from the index - there is no delete
>> document page AFAIK, but you can still do this from the URL.
>> > * reconstruct the original document fields, edit them and re-insert to
>> the index - you can do this using code as long as the fields are stored,
>> but there is no reconstruct page.
>> > * optimize indexes - can be done from the URL but probably no
>> page/button for this.
>> >
>> > As a Solr user, for me your tool would be most useful if it
>> concentrated on these areas, and if it could be integrated into the
>> existing admin tool (the Solr 4 one, of course). I am not sure what the
>> Solr 4 admin tool uses; if it's Pivot then I guess that's what you should use
>> (and by extension, if not, you probably should use what the current tool
>> uses so it's easy to maintain going forward). The benefit to users such as
>> myself would be a unified look-and-feel, so not much of a learning
>> curve/barrier to adoption.
>> >
>> > Just my $0.02...
>> >
>> > -sujit
>> >
>> > On Jul 19, 2013, at 8:06 AM, Ajay Bhat wrote:
>> >
>> > > Hi Mark,
>> > >
>> > > I've added the proposal to the ASF-ICFOSS proposals page [1].
>> According to the ICFOSS programme [2] the last date for submission of
>> project proposal is July 19th (today)
>> > > The time period for mentors to review and rank students project
>> proposals is July 22nd to August 2nd, i.e next week onwards.
>> > >
>> > > I'd like some feedback on my proposal from the community as well.
>> > > Link to proposal on Google Docs :
>> https://docs.google.com/document/d/18Vu5YB6C7WLDxnG01BnZXFEKUC3EQYb0Y5_tCJFb_sc
>> > > Link to proposal on CWiki page :
>> https://cwiki.apache.org/confluence/display/COMDEV/Proposal+for+Apache+Lucene+-+Ajay+Bhat
>> > >
>> > > [1]
>> https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme
>> > >
>> > > [2] http://community.apache.org/mentoringprogramme-icfoss-pilot.html
>> > >
>> > >
>> > > On Thu, Jul 18, 2013 at 12:04 AM, Ajay Bhat 
>> wrote:
>> > > Thanks Mark. I've given you comment access as well so you can comment
>> on specific parts of the proposal
>> > >
>> > >
>> > > On Wed, Jul 17, 2013 at 11:51 PM, Mark Miller 
>> wrote:
>> > > You can put my down for the mentor.
>> > >
>> > > - Mark
>> > >
>> > > On Jul 17, 2013, at 2:04 PM, Ajay Bhat  wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I want to do

[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable

2013-08-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733474#comment-13733474
 ] 

Simon Willnauer commented on LUCENE-5152:
-

bq. This would address my concern: the cost becomes O(N) not O(N^2). And the 
coverage is the same?

The problem here is that we really need to check after we returned from the 
cache, and that might be the case only once in a certain test. Yet, I think it's 
OK to do it there. I still don't get what you are concerned about: we only have -ea 
in tests, and the tests don't seem to be any slower. Can you elaborate on what you 
are afraid of?

> Lucene FST is not immutable
> ---
>
> Key: LUCENE-5152
> URL: https://issues.apache.org/jira/browse/LUCENE-5152
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/FSTs
>Affects Versions: 4.4
>Reporter: Simon Willnauer
>Priority: Blocker
> Fix For: 5.0, 4.5
>
> Attachments: LUCENE-5152.patch, LUCENE-5152.patch, LUCENE-5152.patch
>
>
> a spin-off from LUCENE-5120, where the analyzing suggester modified a returned 
> output from an FST (BytesRef), which caused side effects in later execution. 
> I added an assertion into the FST that checks if a cached root arc is 
> modified, and in fact this happens for instance in our MemoryPostingsFormat, 
> and I bet we find more places. We need to think about how to make this less 
> trappy since it can cause bugs that are super hard to find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733547#comment-13733547
 ] 

Jack Krupansky commented on SOLR-5124:
--

Try doing the update with the extractOnly=true parameter and look at the actual 
bytes where the two adjacent terms meet - it may be some odd Unicode value 
that Solr's filters ignore rather than treat as whitespace.
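
For illustration, a minimal SolrJ sketch of such an extract-only request (the 
server URL is made up; the file is the attachment on this issue):

{code:java}
import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("01_alz_2009_folge11_2009_05_28.pdf"), "application/pdf");
req.setParam("extractOnly", "true");  // return the extraction result instead of indexing it
NamedList<Object> result = server.request(req);
System.out.println(result);  // inspect the raw extracted content and metadata here
{code}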

> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents Solr glues words together at line breaks under 
> some circumstances (e.g. the last word of line 1 and the first word of line 2 
> are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find one 
> sample PDF and screenshots of the Tika output and of the corrupted content 
> indexed by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted into PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good number of these weird documents. This 
> results in worse suggestions by the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5120) Solrj Query response error with result number

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733589#comment-13733589
 ] 

Shawn Heisey commented on SOLR-5120:


[~lukasw44] I have a question to ask you:

What resources did you look at in order to decide that you should file a bug to 
get an answer to your question?  The reason that I ask is that we have been 
seeing an increase recently in the number of people who file a bug for support 
issues instead of asking for help via our discussion resources like the mailing 
list.  This suggests that there might be some incorrect support information out 
there that needs correction.

Related to your issue: If setting the start parameter to 0 or omitting the 
parameter didn't fix your issue, then this issue can be reopened, but I'm 
confident that this is the problem.
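
For reference, a minimal SolrJ sketch of the fix (the server URL and paging 
values are illustrative):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

SolrServer solrServer = new HttpSolrServer("http://localhost:8983/solr");
// start=1 skips the first hit, so a single-hit result shows numFound=1 but docs=[].
SolrQuery q = new SolrQuery("anna");
q.setStart(0);  // first page of results; 0 is also the default when omitted
q.setRows(10);
QueryResponse rsp = solrServer.query(q);
System.out.println(rsp.getResults().getNumFound() + " found, "
    + rsp.getResults().size() + " returned");
{code}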


> Solrj Query response error with result number 
> --
>
> Key: SOLR-5120
> URL: https://issues.apache.org/jira/browse/SOLR-5120
> Project: Solr
>  Issue Type: Bug
> Environment: linux, lubuntu, java version "1.7.0_13".
>Reporter: Łukasz Woźniczka
>Priority: Critical
>
> This is my simple code : 
>  QueryResponse qr;
> try {
> qr = fs.execute(solrServer);
> System.out.println("QUERY RESPONSE : " + qr);
> for (Entry<String, Object> r : qr.getResponse()) {
> System.out.println("RESPONSE: " + r.getKey() + " -> " + 
> r.getValue());
> }
> SolrDocumentList dl = qr.getResults();
> System.out.println("--RESULT SIZE:[ " + dl.size() );
> } catch (SolrServerException e) {
> e.printStackTrace();
> }
> I am using solrj and solr-core version 4.4.0, and there is probably a bug in 
> solrj in the query result. I create one simple txt doc with content 'anna', 
> then I restart Solr and try to search for this phrase. Nothing is found, but 
> this is my query response system out: {numFound=1,start=1,docs=[]}.
> So as you can see there is info that numFound=1 but docs=[] <-- is empty. Next 
> I add another document with only the one word 'anna' and then search for that 
> string, and this is the sysout: 
> {numFound=2,start=1,docs=[SolrDocument{file_id=9882, 
> file_name=luk-search2.txt, file_create_user=-1, file_department=10, 
> file_mime_type=text/plain, file_extension=.txt, file_parents_folder=[5021, 
> 4781, 341, -20, -1], _version_=1442647024934584320}]}
> So as you can see there is numFound = 2 but only one document is listed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable

2013-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733598#comment-13733598
 ] 

Michael McCandless commented on LUCENE-5152:


bq. Can you elaborate what you are afraid of?

In general I think it's bad if an assert changes too much how the code
would run without asserts.  E.g., maybe this O(N^2) assert alters how
threads are scheduled and changes how / whether an issue appears in
practice.

Similarly, if a user is having trouble, I'll recommend turning on
asserts to see if one trips, but if this causes a change in how the
code runs then this can change whether the issue reproduces.

I also just don't like O(N^2) code, even when it's under an assert :)

I think asserts should minimize their impact on the real code when
possible, and it certainly seems possible in this case.

Separately, we really should run our tests w/o asserts, too, since
this is how our users typically run (I know some tests fail if
assertions are off ... we'd have to fix them).  What if we accidentally
commit "real" code behind an assert?  Our tests wouldn't catch it ...


> Lucene FST is not immutable
> ---
>
> Key: LUCENE-5152
> URL: https://issues.apache.org/jira/browse/LUCENE-5152
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/FSTs
>Affects Versions: 4.4
>Reporter: Simon Willnauer
>Priority: Blocker
> Fix For: 5.0, 4.5
>
> Attachments: LUCENE-5152.patch, LUCENE-5152.patch, LUCENE-5152.patch
>
>
> a spin-off from LUCENE-5120, where the analyzing suggester modified a returned 
> output from an FST (BytesRef), which caused side effects in later execution. 
> I added an assertion into the FST that checks if a cached root arc is 
> modified, and in fact this happens for instance in our MemoryPostingsFormat, 
> and I bet we find more places. We need to think about how to make this less 
> trappy since it can cause bugs that are super hard to find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3076) Solr(Cloud) should support block joins

2013-08-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733601#comment-13733601
 ] 

Yonik Seeley commented on SOLR-3076:


Making progress... currently working on randomized testing (using our current 
join implementation to cross-check this implementation).  I've hit some snags 
and am working through them...

bq. one of inconveniences is the necessity to provide user cache for BJQParser 

Yeah, I had some things in mind to handle that as well.

> Solr(Cloud) should support block joins
> --
>
> Key: SOLR-3076
> URL: https://issues.apache.org/jira/browse/SOLR-3076
> Project: Solr
>  Issue Type: New Feature
>Reporter: Grant Ingersoll
>Assignee: Yonik Seeley
> Fix For: 4.5, 5.0
>
> Attachments: 27M-singlesegment-histogram.png, 27M-singlesegment.png, 
> bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, 
> child-bjqparser.patch, dih-3076.patch, dih-config.xml, 
> parent-bjq-qparser.patch, parent-bjq-qparser.patch, Screen Shot 2012-07-17 at 
> 1.12.11 AM.png, SOLR-3076-childDocs.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-3076.patch, SOLR-3076.patch, 
> SOLR-7036-childDocs-solr-fork-trunk-patched, 
> solrconf-bjq-erschema-snippet.xml, solrconfig.xml.patch, 
> tochild-bjq-filtered-search-fix.patch
>
>
> Lucene has the ability to do block joins, we should add it to Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5120) Solrj Query response error with result number

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733606#comment-13733606
 ] 

Łukasz Woźniczka commented on SOLR-5120:


Shawn Heisey, it's my fault, sorry. I was setting the start parameter to 1. 

> Solrj Query response error with result number 
> --
>
> Key: SOLR-5120
> URL: https://issues.apache.org/jira/browse/SOLR-5120
> Project: Solr
>  Issue Type: Bug
> Environment: linux, lubuntu, java version "1.7.0_13".
>Reporter: Łukasz Woźniczka
>Priority: Critical
>
> This is my simple code : 
>  QueryResponse qr;
> try {
> qr = fs.execute(solrServer);
> System.out.println("QUERY RESPONSE : " + qr);
> for (Entry<String, Object> r : qr.getResponse()) {
> System.out.println("RESPONSE: " + r.getKey() + " -> " + 
> r.getValue());
> }
> SolrDocumentList dl = qr.getResults();
> System.out.println("--RESULT SIZE:[ " + dl.size() );
> } catch (SolrServerException e) {
> e.printStackTrace();
> }
> I am using solrj and solr-core version 4.4.0, and there is probably a bug in 
> solrj in the query result. I create one simple txt doc with content 'anna', 
> then I restart Solr and try to search for this phrase. Nothing is found, but 
> this is my query response system out: {numFound=1,start=1,docs=[]}.
> So as you can see there is info that numFound=1 but docs=[] <-- is empty. Next 
> I add another document with only the one word 'anna' and then search for that 
> string, and this is the sysout: 
> {numFound=2,start=1,docs=[SolrDocument{file_id=9882, 
> file_name=luk-search2.txt, file_create_user=-1, file_department=10, 
> file_mime_type=text/plain, file_extension=.txt, file_parents_folder=[5021, 
> 4781, 341, -20, -1], _version_=1442647024934584320}]}
> So as you can see there is numFound = 2 but only one document is listed. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733656#comment-13733656
 ] 

Hoss Man commented on SOLR-4679:


Uwe: I defer to your judgement on this.  If you think the patch is the right 
way to go, then +1 from me.

> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br></br>, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> 
> 
> Test mit HTML-Zeilenschaltungen
> 
> 
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> vienna
> 
> 
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2548) Multithreaded faceting

2013-08-08 Thread Gun Akkor (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733690#comment-13733690
 ] 

Gun Akkor commented on SOLR-2548:
-

I would like to revive this ticket, if possible. We have an index with about 10 
fields that we regularly facet on. These fields are either multi-valued or are 
of type TextField, so the facet code chooses FC as the facet method and uses the 
UnInvertedField instances to count each facet field, which takes several 
seconds per field in our case. So multi-threaded execution of getTermCounts() 
reduces the overall facet time considerably.

I started with the patch that was posted against 3.1 and modified it a little 
bit to take into account previous comments made by Yonik and Adrien. The new 
patch applies against 4.2.1, uses the already existing facetExecutor thread 
pool, and is configured per request via a facet.threads request param. If the 
param is not supplied, the code defaults to the directExecutor and runs 
sequentially as before, so the code behaves as it does today if the user 
chooses not to specify a number of threads.

Also, in the process of testing, I noticed that the 
UnInvertedField.getUnInvertedField() call was synchronized too early, before 
the call to new UnInvertedField(field, searcher) when the field is not in the 
field value cache. Because its init can take several seconds, synchronizing on 
the cache for that duration effectively serialized the execution of the 
multiple threads.
So I modified it (albeit inelegantly) to synchronize later (in our case the cache 
hit ratio is low, so this makes a difference).

The patch is still incomplete, as it does not extend this framework to possibly 
other calls like ranges and dates, but it is a start.
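
For illustration, a minimal SolrJ sketch of a request using the proposed param 
(the field names are made up):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("*:*");
q.setFacet(true);
q.addFacetField("author", "category", "tags");  // hypothetical field names
q.set("facet.threads", 4);  // proposed param: facet fields are counted in parallel
// leaving facet.threads unset keeps the sequential (direct executor) behavior
{code}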

> Multithreaded faceting
> --
>
> Key: SOLR-2548
> URL: https://issues.apache.org/jira/browse/SOLR-2548
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 3.1
>Reporter: Janne Majaranta
>Priority: Minor
>  Labels: facet
> Attachments: SOLR-2548_for_31x.patch, SOLR-2548.patch
>
>
> Add multithreading support for faceting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2548) Multithreaded faceting

2013-08-08 Thread Gun Akkor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gun Akkor updated SOLR-2548:


Attachment: SOLR-2548_4.2.1.patch

Patch against 4.2.1

> Multithreaded faceting
> --
>
> Key: SOLR-2548
> URL: https://issues.apache.org/jira/browse/SOLR-2548
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 3.1
>Reporter: Janne Majaranta
>Priority: Minor
>  Labels: facet
> Attachments: SOLR-2548_4.2.1.patch, SOLR-2548_for_31x.patch, 
> SOLR-2548.patch
>
>
> Add multithreading support for faceting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.

2013-08-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733701#comment-13733701
 ] 

Adrien Grand commented on LUCENE-5157:
--

I discussed this issue with Robert to see how we can move forward:
 - moving OrdinalMap to MultiTermsEnum can be controversial, as Robert explained, 
so let's only tackle the naming and getSegmentOrd API issues here,
 - another option to make getSegmentOrd less trappy is to add an assertion that 
the provided segment number is the same as the one returned by 
{{getSegmentNumber}}; this would allow returning the segment ordinals of 
any segment in the future without changing the API,
 - renaming subIndex to segment is OK as it makes the naming more consistent.

Robert, please correct me if you think it doesn't reflect correctly what we 
said.
Boaz, what do you think?

> Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
> 
>
> Key: LUCENE-5157
> URL: https://issues.apache.org/jira/browse/LUCENE-5157
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Boaz Leskes
>Priority: Minor
> Attachments: LUCENE-5157.patch
>
>
> I refactored MultiDocValues.OrdinalMap, removing one unused parameter and 
> renaming some methods to more clearly communicate what they do. Also I 
> renamed subIndex references to segmentIndex.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733703#comment-13733703
 ] 

Shawn Heisey commented on SOLR-4414:


[~shalinmangar] I came across this issue while looking into my problems with 
distributed MoreLikeThis.  Things look a little off, so I'm writing this.

At a quick glance, the commit comment doesn't seem to be related to this issue, 
because it doesn't mention MLT at all.  Also, you have never commented on this 
issue outside the commit comment.  This is the issue number in CHANGES.txt, 
though.  Is the commit for this issue or another one?

If the commit is for this issue, I think this probably needs to be closed, 
fixed in 4.2 and 5.0.  If not, CHANGES.txt probably needs some cleanup.


> MoreLikeThis on a shard finds no interesting terms if the document queried is 
> not in that shard
> ---
>
> Key: SOLR-4414
> URL: https://issues.apache.org/jira/browse/SOLR-4414
> Project: Solr
>  Issue Type: Bug
>  Components: MoreLikeThis, SolrCloud
>Affects Versions: 4.1
>Reporter: Colin Bartolome
>
> Running a MoreLikeThis query in a cloud works only when the document being 
> queried exists in whatever shard serves the request. If the document is not 
> present in the shard, no "interesting terms" are found and, consequently, no 
> matches are found.
> h5. Steps to reproduce
> * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with 
> the rest of the request handlers:
> {code:xml}
> <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />
> {code}
> * Follow the [simplest SolrCloud 
> example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster]
>  to get two shards running.
> * Hit this URL: 
> [http://localhost:8983/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
> * Compare that output to that of this URL: 
> [http://localhost:7574/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
> The former URL will return a result and list some interesting terms. The 
> latter URL will return no results and list no interesting terms. It will also 
> show this odd XML element:
> {code:xml}
> 
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.

2013-08-08 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5157:
-

Assignee: Adrien Grand

> Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
> 
>
> Key: LUCENE-5157
> URL: https://issues.apache.org/jira/browse/LUCENE-5157
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Boaz Leskes
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-5157.patch
>
>
> I refactored MultiDocValues.OrdinalMap, removing one unused parameter and 
> renaming some methods to more clearly communicate what they do. Also I 
> renamed subIndex references to segmentIndex.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5150) WAH8DocIdSet: dense sets compression

2013-08-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733704#comment-13733704
 ] 

Adrien Grand commented on LUCENE-5150:
--

I'll commit soon if there is no objection. These dense sets can be common in 
cases where e.g. users are allowed to see everything except a few documents.

> WAH8DocIdSet: dense sets compression
> 
>
> Key: LUCENE-5150
> URL: https://issues.apache.org/jira/browse/LUCENE-5150
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-5150.patch
>
>
> In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be 
> able to encode the inverse set to also compress very dense sets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-5159:
---

 Summary: compressed diskdv sorted/sortedset termdictionaries
 Key: LUCENE-5159
 URL: https://issues.apache.org/jira/browse/LUCENE-5159
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Robert Muir


Sorted/SortedSet give you ordinal(s) per document, but then separately have a 
"term dictionary" of all the values.

You can do a few operations on these:
* ord -> term lookup (e.g. retrieving facet labels)
* term -> ord lookup (reverse lookup: e.g. fieldcacherangefilter)
* get a term enumerator (e.g. merging, ordinalmap construction)

The current implementation for diskdv was the simplest thing that can possibly 
work: under the hood it just makes a binary DV for these (treating ordinals as 
document ids). When the terms are fixed length, you can address a term directly 
with multiplication. When they are variable length though, we have to store a 
packed ints structure in RAM.

This variable length case is overkill and chews up a lot of RAM if you have 
many unique values. It also chews up a lot of disk since all the values are 
just concatenated (no sharing).
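
For orientation, a rough sketch of those three operations against the 4.x doc 
values APIs (the reader and field name are made up; this is not code from this 
issue):

{code:java}
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// assume "reader" is an open AtomicReader over one segment
SortedDocValues dv = reader.getSortedDocValues("category");

// ord -> term (e.g. resolving a facet label)
BytesRef label = new BytesRef();
dv.lookupOrd(dv.getOrd(42), label);

// term -> ord (reverse lookup, e.g. FieldCacheRangeFilter)
int ord = dv.lookupTerm(new BytesRef("books"));

// enumerate all values in order (e.g. merging, OrdinalMap construction)
TermsEnum values = dv.termsEnum();
{code}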



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard

2013-08-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733713#comment-13733713
 ] 

Mark Miller commented on SOLR-4414:
---

I think it was simply mis-tagged.

> MoreLikeThis on a shard finds no interesting terms if the document queried is 
> not in that shard
> ---
>
> Key: SOLR-4414
> URL: https://issues.apache.org/jira/browse/SOLR-4414
> Project: Solr
>  Issue Type: Bug
>  Components: MoreLikeThis, SolrCloud
>Affects Versions: 4.1
>Reporter: Colin Bartolome
>
> Running a MoreLikeThis query in a cloud works only when the document being 
> queried exists in whatever shard serves the request. If the document is not 
> present in the shard, no "interesting terms" are found and, consequently, no 
> matches are found.
> h5. Steps to reproduce
> * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with 
> the rest of the request handlers:
> {code:xml}
> <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />
> {code}
> * Follow the [simplest SolrCloud 
> example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster]
>  to get two shards running.
> * Hit this URL: 
> [http://localhost:8983/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
> * Compare that output to that of this URL: 
> [http://localhost:7574/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
> The former URL will return a result and list some interesting terms. The 
> latter URL will return no results and list no interesting terms. It will also 
> show this odd XML element:
> {code:xml}
> 
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5159:


Attachment: LUCENE-5159.patch

Here's an in-progress patch... all the core/codec tests pass, but I'm sure 
there are a few bugs to knock out (improving the tests is the way to go here).

I'm also unhappy with the complexity.

The idea is that for the variable case, we just prefix-share (I set interval=16), 
like the Lucene 3.x dictionary. The current patch specializes the termsenum and 
reverse lookup for this case (but again, I'm sure there are bugs, it's hairy).

> compressed diskdv sorted/sortedset termdictionaries
> ---
>
> Key: LUCENE-5159
> URL: https://issues.apache.org/jira/browse/LUCENE-5159
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
> Attachments: LUCENE-5159.patch
>
>
> Sorted/SortedSet give you ordinal(s) per document, but then separately have a 
> "term dictionary" of all the values.
> You can do a few operations on these:
> * ord -> term lookup (e.g. retrieving facet labels)
> * term -> ord lookup (reverse lookup: e.g. fieldcacherangefilter)
> * get a term enumerator (e.g. merging, ordinalmap construction)
> The current implementation for diskdv was the simplest thing that can 
> possibly work: under the hood it just makes a binary DV for these (treating 
> ordinals as document ids). When the terms are fixed length, you can address a 
> term directly with multiplication. When they are variable length though, we 
> have to store a packed ints structure in RAM.
> This variable length case is overkill and chews up a lot of RAM if you have 
> many unique values. It also chews up a lot of disk since all the values are 
> just concatenated (no sharing).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard

2013-08-08 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733717#comment-13733717
 ] 

Shalin Shekhar Mangar commented on SOLR-4414:
-

[~elyograg] - That was a mistake. The commit mentioned here actually belonged 
to SOLR-4415. I fixed the issue number in the change log but I forgot to put a 
comment here.

> MoreLikeThis on a shard finds no interesting terms if the document queried is 
> not in that shard
> ---
>
> Key: SOLR-4414
> URL: https://issues.apache.org/jira/browse/SOLR-4414
> Project: Solr
>  Issue Type: Bug
>  Components: MoreLikeThis, SolrCloud
>Affects Versions: 4.1
>Reporter: Colin Bartolome
>
> Running a MoreLikeThis query in a cloud works only when the document being 
> queried exists in whatever shard serves the request. If the document is not 
> present in the shard, no "interesting terms" are found and, consequently, no 
> matches are found.
> h5. Steps to reproduce
> * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with 
> the rest of the request handlers:
> {code:xml}
> <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />
> {code}
> * Follow the [simplest SolrCloud 
> example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster]
>  to get two shards running.
> * Hit this URL: 
> [http://localhost:8983/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
> * Compare that output to that of this URL: 
> [http://localhost:7574/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
> The former URL will return a result and list some interesting terms. The 
> latter URL will return no results and list no interesting terms. It will also 
> show this odd XML element:
> {code:xml}
> 
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733745#comment-13733745
 ] 

Robert Muir commented on LUCENE-5157:
-

+1, let's improve it for now and not expand it to try to be a "general" 
termsenum merger. But on the other hand, I am still not convinced we can't 
improve the efficiency of this thing, so it's good if we can prevent innards 
from being too exposed (unless it's causing some use case an actual problem).

> Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
> 
>
> Key: LUCENE-5157
> URL: https://issues.apache.org/jira/browse/LUCENE-5157
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Boaz Leskes
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-5157.patch
>
>
> I refactored MultiDocValues.OrdinalMap, removing one unused parameter and 
> renaming some methods to more clearly communicate what they do. Also I 
> renamed subIndex references to segmentIndex.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5159:


Attachment: LUCENE-5159.patch

Fixes an off-by-one (OB1) bug. I'll beef up the DV base test case to really 
exercise this termsenum...

> compressed diskdv sorted/sortedset termdictionaries
> ---
>
> Key: LUCENE-5159
> URL: https://issues.apache.org/jira/browse/LUCENE-5159
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
> Attachments: LUCENE-5159.patch, LUCENE-5159.patch
>
>
> Sorted/SortedSet give you ordinal(s) per document, but then separately have a 
> "term dictionary" of all the values.
> You can do a few operations on these:
> * ord -> term lookup (e.g. retrieving facet labels)
> * term -> ord lookup (reverse lookup: e.g. fieldcacherangefilter)
> * get a term enumerator (e.g. merging, ordinalmap construction)
> The current implementation for diskdv was the simplest thing that can 
> possibly work: under the hood it just makes a binary DV for these (treating 
> ordinals as document ids). When the terms are fixed length, you can address a 
> term directly with multiplication. When they are variable length though, we 
> have to store a packed ints structure in RAM.
> This variable length case is overkill and chews up a lot of RAM if you have 
> many unique values. It also chews up a lot of disk since all the values are 
> just concatenated (no sharing).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733776#comment-13733776
 ] 

Uwe Schindler commented on SOLR-4679:
-

Hoss: I just took this issue because it was unassigned and I was the one 
who mandated adding the ignorable whitespace in TIKA at that time. So Jukka and I 
decided this would be best.

Because you are still not convinced by my argumentation, let me recapitulate 
TIKA's problems:

- TIKA decided to use XHTML as its output format to report the parsed documents 
to the consumer. This is nice, because it allows preserving some of the 
formatting (like bold fonts, paragraphs, ...) originating from the original 
document. Of course most of this formatting is lost, but you can still "detect" 
things like emphasized text. By choosing XHTML as the output format, TIKA of 
course must use XHTML formatting for newlines and similar. So whenever a line 
break is needed, the TIKA parser emits a <br/> tag or places the "paragraph" (in 
a PDF) inside a <p> element. As we all know, HTML ignores formatting like 
newlines and tabs - all are treated as one single whitespace, like the regex 
replace {{s/\s+/ /}}.
- On the other hand, TIKA wants to make it simple for people to extract the 
*plain text* contents. With the XHTML-only approach this would be hard for the 
consumer, because to add the correct newlines, the consumer would have to fully 
understand XHTML, detect block elements, and replace them with \n.

To support both usages of TIKA, the idea was to embed this information, which is 
unimportant to HTML (as HTML ignores whitespace completely), as 
ignorableWhitespace, as a "convenience" for the user. A fully compliant XHTML 
consumer would not parse the ignorable stuff. As it understands HTML, it would 
detect a <p> element as a block element and format the output.

Solr unfortunately has a somewhat strange approach: It is mainly interested in 
the text-only contents, so ideally, when consuming the HTML, it could use a 
{{BodyContentHandler}} wrapping a {{WriteOutContentHandler(StringBuilder)}}. In 
that case TIKA would do the right thing automatically: It would extract only 
text from the body element and would use the "convenience whitespace" to format 
the text in an ASCII-ART-like way (using tabs, newlines, ...) :-)
Solr has a hybrid approach: It collects everything into a content tag (which is 
similar to the above approach), but the bug is that, in contrast to TIKA's 
official WriteOutContentHandler, it does not use the ignorable whitespace 
inserted for convenience. In addition, TIKA also has a stack that allows 
processing parts of the documents (like the title element or all <em> 
elements). In that case it has several StringBuilders in parallel that are 
populated with the contents. The problems exist there too, but cannot be solved 
by using ignorable whitespace: e.g. if one indexes only all <em> elements (which 
are inline HTML elements, not block elements), there is no whitespace, so all em 
elements would be glued together in the em field of your index... I just mention 
this; in my opinion the SolrContentHandler needs more work to "correctly" 
understand HTML and not just collect element names in a map!

Now to your complaint: You proposed to report the newlines as real 
{{characters()}} events - but this is not the right thing to do here. As I said, 
HTML does not know these characters; they are ignored. The "formatting" is done 
by the element names (like <br/>, <p>, <div>). So the "helper" whitespace for 
text-only consumers should be inserted as ignorableWhitespace only; if we 
added it to the real character data, we would report things that every HTML 
parser (like NekoHTML) would never report to the consumer. NekoHTML would also 
report this useless extra whitespace as ignorable.

The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is 
"configured" to help the text-only user but not hurt the HTML-only user. 
This differentiation is done by reporting the HTML element names (p, div, 
table, th, td, tr, abbr, em, strong, ...) but also reporting the 
ASCII-ART text-only content like TABs inside tables, newlines after block 
elements, ... This is always done as ignorableWhitespace (for convenience); a 
real HTML parser must ignore it - and it's correct to do this.
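
As a concrete illustration of the text-only usage sketched above, a minimal 
example (assuming a Tika 1.x classpath; "stream" stands for any document 
InputStream):

{code:java}
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.ContentHandler;

// BodyContentHandler keeps only body content; WriteOutContentHandler writes
// both characters() and ignorableWhitespace() to its sink, so the
// "convenience whitespace" becomes newlines/tabs in the extracted plain text.
StringWriter out = new StringWriter();
ContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(out));
new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
String plainText = out.toString();
{code}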



> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz

[jira] [Commented] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733787#comment-13733787
 ] 

Michael McCandless commented on LUCENE-5159:


+1, patch looks great.

> compressed diskdv sorted/sortedset termdictionaries
> ---
>
> Key: LUCENE-5159
> URL: https://issues.apache.org/jira/browse/LUCENE-5159
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Robert Muir
> Attachments: LUCENE-5159.patch, LUCENE-5159.patch
>
>
> Sorted/SortedSet give you ordinal(s) per document, but they separately have a 
> "term dictionary" of all the values.
> You can do a few operations on these:
> * ord -> term lookup (e.g. retrieving facet labels)
> * term -> ord lookup (reverse lookup: e.g. fieldcacherangefilter)
> * get a term enumerator (e.g. merging, ordinalmap construction)
> The current implementation for diskdv was the simplest thing that can 
> possibly work: under the hood it just makes a binary DV for these (treating 
> ordinals as document ids). When the terms are fixed length, you can address a 
> term directly with multiplication. When they are variable length though, we 
> have to store a packed ints structure in RAM.
> This variable length case is overkill and chews up a lot of RAM if you have 
> many unique values. It also chews up a lot of disk since all the values are 
> just concatenated (no sharing).
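
For orientation, the three operations listed above map onto the 4.x-era 
SortedDocValues API roughly as in this hedged sketch (method shapes from memory 
and illustrative only; the field name "category" is made up):

{code}
import java.io.IOException;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.util.BytesRef;

public class OrdOps {
  static void demo(AtomicReader reader) throws IOException {
    SortedDocValues dv = reader.getSortedDocValues("category"); // null if absent
    BytesRef term = new BytesRef();
    // ord -> term lookup (e.g. retrieving facet labels)
    dv.lookupOrd(0, term);
    // term -> ord reverse lookup (e.g. FieldCacheRangeFilter endpoints)
    int ord = dv.lookupTerm(new BytesRef("books"));
    // enumerate all values in order (e.g. merging, OrdinalMap construction)
    for (int i = 0; i < dv.getValueCount(); i++) {
      dv.lookupOrd(i, term);
    }
  }
}
{code}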

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733791#comment-13733791
 ] 

Hoss Man commented on SOLR-4679:


bq. Because you are still not convinced with my argumentation, let me 
recapitulate TIKA's problems:

I never said that ... you said "I can take the issue if you like." and you 
explained why the existing patch should be committed -- i'm totally willing to 
go along with that, so have at it.  it seems sketchy to me, but if that's the 
way Tika works, that's the way Tika works; you certainly understand it better 
than me, so i defer to your assessment.

(as mentioned in TIKA-1134 it would be nice if this type of behavior was better 
documented for people implementing their own ContentHandlers, but that's a Tika 
issue not a Solr issue.)

> HTML line breaks (<br>) are removed during indexing; causes wrong search results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head><title>Test mit HTML-Zeilenschaltungen</title></head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> <br>vienna
> </body>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException

2013-08-08 Thread Shawn Heisey (JIRA)
Shawn Heisey created SOLR-5125:
--

 Summary: Distributed MoreLikeThis fails with NullPointerException, 
shard query gives EarlyTerminatingCollectorException
 Key: SOLR-5125
 URL: https://issues.apache.org/jira/browse/SOLR-5125
 Project: Solr
  Issue Type: Bug
  Components: MoreLikeThis
Affects Versions: 4.4
Reporter: Shawn Heisey
 Fix For: 4.5, 5.0


A distributed MoreLikeThis query that works perfectly on 4.2.1 is failing on 
4.4.0.  The original query returns a NullPointerException.  The Solr log shows 
that the shard queries are throwing EarlyTerminatingCollectorException.  Full 
details to follow in the comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733816#comment-13733816
 ] 

Shawn Heisey commented on SOLR-5125:


The query that works fine in 4.2.1 has the following URL:

/solr/ncmain/ncdismax?q=tag_id:ugphotos000996&mlt=true&mlt.fl=catchall&mlt.count=100

The ncmain handler has the shards parameter in solrconfig.xml and is set up for 
edismax. The shards.qt parameter is /search, a handler using the default query 
parser.  On 4.2.1, it had a QTime of 49641, a performance issue that I 
mentioned on the mailing list and will be pursuing there.  Here's a server log 
excerpt, showing a shard request, the shard exception, the original query, and 
the final exception.

{noformat}
INFO  - 2013-08-08 12:18:20.030; org.apache.solr.core.SolrCore; [s3live] 
webapp=/solr path=/search 
params={mlt.fl=catchall&sort=score+desc&tie=0.1&shards.qt=/search&mlt.dist.id=ugphotos000996&mlt=true&q.alt=*:*&distrib=false&shards.tolerant=true&version=2&NOW=1375985885078&shard.url=bigindy5.REDACTED.com:8982/solr/s3live&df=catchall&fl=score,tag_id&qs=3&qt=/search&lowercaseOperators=false&mm=100%25&qf=catchall&wt=javabin&rows=100&defType=edismax&pf=catchall^2&mlt.count=100&start=0&q=%2B(catchall:arabian+catchall:close-up+catchall:horse+catchall:closeup+catchall:close+catchall:white+catchall:up+catchall:sassy+catchall:154+catchall:equestrian+catchall:domestic+catchall:animals+catchall:of)+-tag_id:ugphotos000996&shards.info=true&boost=min(recip(abs(ms(NOW/HOUR,pd)),1.92901e-10,1.5,1.5),0.85)&isShard=true&ps=3}
 6815483 status=500 QTime=14639
ERROR - 2013-08-08 12:18:20.030; org.apache.solr.common.SolrException; 
null:org.apache.solr.search.EarlyTerminatingCollectorException
at 
org.apache.solr.search.EarlyTerminatingCollector.collect(EarlyTerminatingCollector.java:62)
at 
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:289)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:624)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1494)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
at 
org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:1226)
at 
org.apache.solr.handler.MoreLikeThisHandler$MoreLikeThisHelper.getMoreLikeThis(MoreLikeThisHandler.java:365)
at 
org.apache.solr.handler.component.MoreLikeThisComponent.getMoreLikeThese(MoreLikeThisComponent.java:356)
at 
org.apache.solr.handler.component.MoreLikeThisComponent.process(MoreLikeThisComponent.java:107)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(A

[jira] [Commented] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733833#comment-13733833
 ] 

Shawn Heisey commented on SOLR-5125:


Here's someone else having the same problem.  They don't say whether it's a 
single index or distributed, though.

http://stackoverflow.com/questions/17866313/earlyterminatingcollectorexception-in-mlt-component-of-solr-4-4
 


> Distributed MoreLikeThis fails with NullPointerException, shard query gives 
> EarlyTerminatingCollectorException
> --
>
> Key: SOLR-5125
> URL: https://issues.apache.org/jira/browse/SOLR-5125
> Project: Solr
>  Issue Type: Bug
>  Components: MoreLikeThis
>Affects Versions: 4.4
>Reporter: Shawn Heisey
> Fix For: 4.5, 5.0
>
>
> A distributed MoreLikeThis query that works perfectly on 4.2.1 is failing on 
> 4.4.0.  The original query returns a NullPointerException.  The Solr log 
> shows that the shard queries are throwing EarlyTerminatingCollectorException. 
>  Full details to follow in the comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4952) audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733844#comment-13733844
 ] 

ASF subversion and git services commented on SOLR-4952:
---

Commit 1511954 from hoss...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1511954 ]

SOLR-4952: TestIndexSearcher.testReopen needs fixed segment merging

> audit test configs to use solrconfig.snippet.randomindexconfig.xml in more 
> tests
> 
>
> Key: SOLR-4952
> URL: https://issues.apache.org/jira/browse/SOLR-4952
> Project: Solr
>  Issue Type: Sub-task
>Reporter: Hoss Man
>Assignee: Hoss Man
>
> in SOLR-4942 i updated every solrconfig.xml to either...
> * include solrconfig.snippet.randomindexconfig.xml where it was easy to do so
> * use the useCompoundFile sys prop if it already had an {{<indexConfig>}} 
> section, or if including the snippet wasn't going to be easy (ie: contrib 
> tests)
> As an improvement on this:
> * audit all core configs not already using 
> solrconfig.snippet.randomindexconfig.xml and either:
> ** make them use it, ignoring any previously unimportant explicit 
> indexConfig settings
> ** make them use it, using explicit sys props to overwrite random values in 
> cases where explicit indexConfig values are important for the test
> ** add a comment why it's not using the include snippet in cases where the 
> explicit parsing is part of the test
> * try to figure out a way for contrib tests to easily include the same file 
> and/or apply the same rules as above

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4952) audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733851#comment-13733851
 ] 

ASF subversion and git services commented on SOLR-4952:
---

Commit 1511958 from hoss...@apache.org in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1511958 ]

SOLR-4952: TestIndexSearcher.testReopen needs fixed segment merging (merge 
r1511954)

> audit test configs to use solrconfig.snippet.randomindexconfig.xml in more 
> tests
> 
>
> Key: SOLR-4952
> URL: https://issues.apache.org/jira/browse/SOLR-4952
> Project: Solr
>  Issue Type: Sub-task
>Reporter: Hoss Man
>Assignee: Hoss Man
>
> in SOLR-4942 i updated every solrconfig.xml to either...
> * include solrconfig.snippet.randomindexconfig.xml where it was easy to do so
> * use the useCompoundFile sys prop if it already had an {{<indexConfig>}} 
> section, or if including the snippet wasn't going to be easy (ie: contrib 
> tests)
> As an improvement on this:
> * audit all core configs not already using 
> solrconfig.snippet.randomindexconfig.xml and either:
> ** make them use it, ignoring any previously unimportant explicit 
> indexConfig settings
> ** make them use it, using explicit sys props to overwrite random values in 
> cases where explicit indexConfig values are important for the test
> ** add a comment why it's not using the include snippet in cases where the 
> explicit parsing is part of the test
> * try to figure out a way for contrib tests to easily include the same file 
> and/or apply the same rules as above

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)
Grant Ingersoll created LUCENE-5160:
---

 Summary: NIOFSDirectory, SimpleFSDirectory (others?) don't 
properly handle valid file and FileChannel read conditions
 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.4, 5.0
Reporter: Grant Ingersoll


Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
handle the -1 condition that can be returned from FileChannel.read().  If it 
returns -1, then it will move the file pointer back and you will enter an 
infinite loop.  SimpleFSDirectory displays the same characteristics, although I 
have only seen the issue on NIOFSDirectory.

The code in question from NIOFSDirectory:
{code}
try {
while (readLength > 0) {
  final int limit;
  if (readLength > chunkSize) {
// LUCENE-1566 - work around JVM Bug by breaking
// very large reads into chunks
limit = readOffset + chunkSize;
  } else {
limit = readOffset + readLength;
  }
  bb.limit(limit);
  int i = channel.read(bb, pos);
  pos += i;
  readOffset += i;
  readLength -= i;
}
{code}
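
A hedged sketch of the guard discussed in the comments below (a -1 from 
FileChannel.read() should surface as an EOFException instead of corrupting the 
loop state; variable names follow the excerpt above, and the actual patch may 
differ in detail):

{code}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class GuardedChunkedRead {
  static void read(FileChannel channel, ByteBuffer bb, long pos,
                   int readOffset, int readLength, int chunkSize) throws IOException {
    while (readLength > 0) {
      // LUCENE-1566: break very large reads into chunks
      final int limit = (readLength > chunkSize)
          ? readOffset + chunkSize
          : readOffset + readLength;
      bb.limit(limit);
      final int i = channel.read(bb, pos);
      if (i < 0) {
        // end of file: with i == -1, pos moves backwards and readLength
        // grows, so the loop would never terminate
        throw new EOFException("read past EOF: pos=" + pos);
      }
      pos += i;
      readOffset += i;
      readLength -= i;
    }
  }
}
{code}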

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733865#comment-13733865
 ] 

Uwe Schindler commented on LUCENE-5160:
---

This is a bug which is never hit by Lucene, because we never read sequentially 
until the end of the file.

+1 to fix this. Theoretically, to comply with MMapDirectory, it should throw an 
EOFException if it gets -1, because Lucene code should not read beyond the file end.

> NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
> and FileChannel read conditions
> 
>
> Key: LUCENE-5160
> URL: https://issues.apache.org/jira/browse/LUCENE-5160
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.4
>Reporter: Grant Ingersoll
>
> Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
> handle the -1 condition that can be returned from FileChannel.read().  If it 
> returns -1, then it will move the file pointer back and you will enter an 
> infinite loop.  SimpleFSDirectory displays the same characteristics, although 
> I have only seen the issue on NIOFSDirectory.
> The code in question from NIOFSDirectory:
> {code}
> try {
> while (readLength > 0) {
>   final int limit;
>   if (readLength > chunkSize) {
> // LUCENE-1566 - work around JVM Bug by breaking
> // very large reads into chunks
> limit = readOffset + chunkSize;
>   } else {
> limit = readOffset + readLength;
>   }
>   bb.limit(limit);
>   int i = channel.read(bb, pos);
>   pos += i;
>   readOffset += i;
>   readLength -= i;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4774) Add FieldComparator that allows sorting parent docs based on field inside the child docs

2013-08-08 Thread Mikhail Khludnev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733872#comment-13733872
 ] 

Mikhail Khludnev commented on LUCENE-4774:
--

fwiw something like 
http://www.gossamer-threads.com/lists/lucene/java-dev/199372?do=post_view_threaded
 happens to me 

NOTE: reproduce with: ant test  -Dtestcase=TestBlockJoinSorting 
-Dtests.method=testNestedSorting -Dtests.seed=FB4F1BE85579255B 
-Dtests.slow=true -Dtests.locale=da_DK -Dtests.timezone=Asia/Qatar 
-Dtests.file.encoding=UTF-8
NOTE: test params are: codec=Asserting, 
sim=RandomSimilarityProvider(queryNorm=true,coord=crazy): {}, locale=da_DK, 
timezone=Asia/Qatar
NOTE: Linux 2.6.32-131.0.15.el6.x86_64 amd64/Sun Microsystems Inc. 1.6.0_29 
(64-bit)/cpus=4,threads=1,free=317130512,total=349241344
NOTE: All tests run in this JVM: [TestJoinUtil, TestBlockJoin, 
TestBlockJoinSorting]

---
Test set: org.apache.lucene.search.join.TestBlockJoinSorting
---
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec <<< 
FAILURE!
testNestedSorting(org.apache.lucene.search.join.TestBlockJoinSorting)  Time 
elapsed: 0.021 sec  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<28>
at 
__randomizedtesting.SeedInfo.seed([FB4F1BE85579255B:F3A6F6A915D02835]:0)
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.junit.Assert.assertEquals(Assert.java:456)
at 
org.apache.lucene.search.join.TestBlockJoinSorting.testNestedSorting(TestBlockJoinSorting.java:226)

> Add FieldComparator that allows sorting parent docs based on field inside the 
> child docs
> 
>
> Key: LUCENE-4774
> URL: https://issues.apache.org/jira/browse/LUCENE-4774
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/join
>Reporter: Martijn van Groningen
>Assignee: Martijn van Groningen
> Fix For: 5.0, 4.3
>
> Attachments: LUCENE-4774.patch, LUCENE-4774.patch, LUCENE-4774.patch
>
>
> A field comparator for sorting block join parent docs based on the a field in 
> the associated child docs. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733901#comment-13733901
 ] 

Uwe Schindler commented on SOLR-4679:
-

bq. I never said that ...

You somehow said:

bq. I defer to your judgement on this

So I assumed that you were still not 100% convinced. Sorry.

In any case I will take the issue. In my opinion there is more work to be done 
with this crazy stack of StringBuilders to better handle the ignorableWhitespace 
when a new field begins/ends. Currently it's inserted after the block's end tag, 
so it would only go one level up in the stack. I have to think a little bit about 
it, but the fix in your patch is the easiest for now. And the maybe-useless 
whitespace on some lower stacked StringBuilders is generally removed by text 
analysis anyway.

> HTML line breaks (<br>) are removed during indexing; causes wrong search results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head><title>Test mit HTML-Zeilenschaltungen</title></head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> <br>vienna
> </body>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-5160:
---

Assignee: Grant Ingersoll

> NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
> and FileChannel read conditions
> 
>
> Key: LUCENE-5160
> URL: https://issues.apache.org/jira/browse/LUCENE-5160
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.4
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>
> Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
> handle the -1 condition that can be returned from FileChannel.read().  If it 
> returns -1, then it will move the file pointer back and you will enter an 
> infinite loop.  SimpleFSDirectory displays the same characteristics, although 
> I have only seen the issue on NIOFSDirectory.
> The code in question from NIOFSDirectory:
> {code}
> try {
> while (readLength > 0) {
>   final int limit;
>   if (readLength > chunkSize) {
> // LUCENE-1566 - work around JVM Bug by breaking
> // very large reads into chunks
> limit = readOffset + chunkSize;
>   } else {
> limit = readOffset + readLength;
>   }
>   bb.limit(limit);
>   int i = channel.read(bb, pos);
>   pos += i;
>   readOffset += i;
>   readLength -= i;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-5160:


Attachment: LUCENE-5160.patch

Patch adds the -1 check and throws an EOFException

> NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
> and FileChannel read conditions
> 
>
> Key: LUCENE-5160
> URL: https://issues.apache.org/jira/browse/LUCENE-5160
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.4
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: LUCENE-5160.patch
>
>
> Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
> handle the -1 condition that can be returned from FileChannel.read().  If it 
> returns -1, then it will move the file pointer back and you will enter an 
> infinite loop.  SimpleFSDirectory displays the same characteristics, although 
> I have only seen the issue on NIOFSDirectory.
> The code in question from NIOFSDirectory:
> {code}
> try {
> while (readLength > 0) {
>   final int limit;
>   if (readLength > chunkSize) {
> // LUCENE-1566 - work around JVM Bug by breaking
> // very large reads into chunks
> limit = readOffset + chunkSize;
>   } else {
> limit = readOffset + readLength;
>   }
>   bb.limit(limit);
>   int i = channel.read(bb, pos);
>   pos += i;
>   readOffset += i;
>   readLength -= i;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733912#comment-13733912
 ] 

Uwe Schindler commented on LUCENE-5160:
---

+1 to commit. Looks good. Writing a test is a bit hard.

MMapDirectory is not affected as it already has a check for the length of the 
MappedByteBuffers.

> NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
> and FileChannel read conditions
> 
>
> Key: LUCENE-5160
> URL: https://issues.apache.org/jira/browse/LUCENE-5160
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.4
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: LUCENE-5160.patch
>
>
> Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
> handle the -1 condition that can be returned from FileChannel.read().  If it 
> returns -1, then it will move the file pointer back and you will enter an 
> infinite loop.  SimpleFSDirectory displays the same characteristics, although 
> I have only seen the issue on NIOFSDirectory.
> The code in question from NIOFSDirectory:
> {code}
> try {
> while (readLength > 0) {
>   final int limit;
>   if (readLength > chunkSize) {
> // LUCENE-1566 - work around JVM Bug by breaking
> // very large reads into chunks
> limit = readOffset + chunkSize;
>   } else {
> limit = readOffset + readLength;
>   }
>   bb.limit(limit);
>   int i = channel.read(bb, pos);
>   pos += i;
>   readOffset += i;
>   readLength -= i;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733914#comment-13733914
 ] 

ASF subversion and git services commented on LUCENE-5160:
-

Commit 1512011 from [~gsingers] in branch 'dev/trunk'
[ https://svn.apache.org/r1512011 ]

LUCENE-5160: check for -1 return conditions in file reads

> NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
> and FileChannel read conditions
> 
>
> Key: LUCENE-5160
> URL: https://issues.apache.org/jira/browse/LUCENE-5160
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.4
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: LUCENE-5160.patch
>
>
> Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
> handle the -1 condition that can be returned from FileChannel.read().  If it 
> returns -1, then it will move the file pointer back and you will enter an 
> infinite loop.  SimpleFSDirectory displays the same characteristics, although 
> I have only seen the issue on NIOFSDirectory.
> The code in question from NIOFSDirectory:
> {code}
> try {
> while (readLength > 0) {
>   final int limit;
>   if (readLength > chunkSize) {
> // LUCENE-1566 - work around JVM Bug by breaking
> // very large reads into chunks
> limit = readOffset + chunkSize;
>   } else {
> limit = readOffset + readLength;
>   }
>   bb.limit(limit);
>   int i = channel.read(bb, pos);
>   pos += i;
>   readOffset += i;
>   readLength -= i;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-5161:
---

 Summary: review FSDirectory chunking defaults and test the chunking
 Key: LUCENE-5161
 URL: https://issues.apache.org/jira/browse/LUCENE-5161
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Robert Muir


Today there is a loop in SimpleFS/NIOFS:
{code}
try {
  do {
final int readLength;
if (total + chunkSize > len) {
  readLength = len - total;
} else {
  // LUCENE-1566 - work around JVM Bug by breaking very large reads 
into chunks
  readLength = chunkSize;
}
final int i = file.read(b, offset + total, readLength);
total += i;
  } while (total < len);
} catch (OutOfMemoryError e) {
{code}

I bet if you look at the clover report it's untested, because it's fixed at 100MB 
for 32-bit users and 2GB for 64-bit users (are these defaults even good?!).

Also, if you call the setter on a 64-bit machine to change the size, it just 
totally ignores it. We should remove that; the setter should always work.

And we should set it to small values in tests so this loop is actually executed.
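
For illustration, a hedged sketch of what "the setter should always work" could 
look like (modeled on FSDirectory's setReadChunkSize; the class shape and the 
default value here are made up):

{code}
public class ChunkedDirectory {
  private static final int DEFAULT_READ_CHUNK_SIZE = 8192; // illustrative only
  private volatile int chunkSize = DEFAULT_READ_CHUNK_SIZE;

  public void setReadChunkSize(int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    // honor the caller on every platform: no 32/64-bit special case that
    // silently ignores the argument, so tests can force tiny chunks and
    // drive the read loop through many iterations
    this.chunkSize = chunkSize;
  }

  public int getReadChunkSize() {
    return chunkSize;
  }
}
{code}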

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733921#comment-13733921
 ] 

ASF subversion and git services commented on LUCENE-5160:
-

Commit 1512016 from [~gsingers] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1512016 ]

LUCENE-5160: merge from trunk

> NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
> and FileChannel read conditions
> 
>
> Key: LUCENE-5160
> URL: https://issues.apache.org/jira/browse/LUCENE-5160
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.4
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: LUCENE-5160.patch
>
>
> Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
> handle the -1 condition that can be returned from FileChannel.read().  If it 
> returns -1, then it will move the file pointer back and you will enter an 
> infinite loop.  SimpleFSDirectory displays the same characteristics, although 
> I have only seen the issue on NIOFSDirectory.
> The code in question from NIOFSDirectory:
> {code}
> try {
> while (readLength > 0) {
>   final int limit;
>   if (readLength > chunkSize) {
> // LUCENE-1566 - work around JVM Bug by breaking
> // very large reads into chunks
> limit = readOffset + chunkSize;
>   } else {
> limit = readOffset + readLength;
>   }
>   bb.limit(limit);
>   int i = channel.read(bb, pos);
>   pos += i;
>   readOffset += i;
>   readLength -= i;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-5160.
-

   Resolution: Fixed
Fix Version/s: 4.5
   5.0
Lucene Fields:   (was: New)

> NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
> and FileChannel read conditions
> 
>
> Key: LUCENE-5160
> URL: https://issues.apache.org/jira/browse/LUCENE-5160
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 5.0, 4.4
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 5.0, 4.5
>
> Attachments: LUCENE-5160.patch
>
>
> Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
> handle the -1 condition that can be returned from FileChannel.read().  If it 
> returns -1, then it will move the file pointer back and you will enter an 
> infinite loop.  SimpleFSDirectory displays the same characteristics, although 
> I have only seen the issue on NIOFSDirectory.
> The code in question from NIOFSDirectory:
> {code}
> try {
> while (readLength > 0) {
>   final int limit;
>   if (readLength > chunkSize) {
> // LUCENE-1566 - work around JVM Bug by breaking
> // very large reads into chunks
> limit = readOffset + chunkSize;
>   } else {
> limit = readOffset + readLength;
>   }
>   bb.limit(limit);
>   int i = channel.read(bb, pos);
>   pos += i;
>   readOffset += i;
>   readLength -= i;
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5161:


Attachment: LUCENE-5161.patch

This patch makes the setter always work, and changes lucenetestcase to use 
small values for the chunking.

I didnt adjust any defaults (maybe Uwe can help, he knows about the code in 
question)

> review FSDirectory chunking defaults and test the chunking
> --
>
> Key: LUCENE-5161
> URL: https://issues.apache.org/jira/browse/LUCENE-5161
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
> Attachments: LUCENE-5161.patch
>
>
> Today there is a loop in SimpleFS/NIOFS:
> {code}
> try {
>   do {
> final int readLength;
> if (total + chunkSize > len) {
>   readLength = len - total;
> } else {
>   // LUCENE-1566 - work around JVM Bug by breaking very large 
> reads into chunks
>   readLength = chunkSize;
> }
> final int i = file.read(b, offset + total, readLength);
> total += i;
>   } while (total < len);
> } catch (OutOfMemoryError e) {
> {code}
> I bet if you look at the clover report it's untested, because it's fixed at 
> 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
> good?!).
> Also, if you call the setter on a 64-bit machine to change the size, it just 
> totally ignores it. We should remove that; the setter should always work.
> And we should set it to small values in tests so this loop is actually 
> executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

2013-08-08 Thread Grant Ingersoll
I seem to recall seeing this on my cluster when we didn't have clocks in sync, 
but perhaps my memory is fuzzy as well.

-Grant

On Aug 7, 2013, at 7:41 AM, Erick Erickson  wrote:

> Well, we're reconstructing a chain of _possibilities_ post-mortem,
> so there's not much I can say for sure. Mostly just throwing this 
> out there in case it sparks some "aha" moments. Not knowing
> ZK well, anything I say is speculation.
> 
> But I speculate that this isn't really the root of the problem given
> that we haven't been seeing the "ClusterState says we are the leader..."
> error go by the user lists for a while. It may well be a coincidence. The
> place that this happened reported that the problem "seemed to 
> be better" after adjusting the ZK nodes' times. I know when I
> reconstruct events like this I'm never sure about cause and
> effect since I'm usually doing several things at once.
> 
> Erick
> 
> 
> On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter  
> wrote:
> 
> : > When the times were coordinated, many of the problems with recovery went
> : > away. We're trying to reconstruct the scenario from memory, but it
> : > prompted me to pass the incident in case it sparked any thoughts.
> : > Specifically, I wonder if there's anything that comes to mind if the ZK
> : > nodes are significantly out of synch with each other time-wise.
> :
> : Does this mean that ntp or other strict time synchronization is important 
> for
> : SolrCloud?  I strive for this anyway, just to ensure that when I'm 
> researching
> : log files between two machines that I can match things up properly.
> 
> I don't know if/how Solr/ZK is affected by having machines with clocks out
> of sync, but i do remember seeing discussions a while back about weird
> things happening to ZK client apps *while* time adjustments are taking
> place to get back in sync.
> 
> IIRC: as the local clock starts accelerating and jumping ahead in
> increments to "correct" itself with ntp, those jumps can confuse the
> ZK code into thinking it's been waiting a lot longer than it really
> has for the zk heartbeat (or whatever it's called), and it can trigger a
> timeout situation.
> 
> 
> -Hoss
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 
> 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734047#comment-13734047
 ] 

Uwe Schindler commented on LUCENE-5161:
---

Thanks Robert for opening.

It is too late today, so I will respond tomorrow morning about the NIO stuff. I 
am now aware of the cause and have inspected the JVM code, so I can explain why 
the OOMs occur in SimpleFSDir and NIOFSDir if you read into large buffers. More 
details tomorrow; just one thing before: it has nothing to do with 32 or 64 bits, 
it is more about limitations of the JVM with direct memory and heap size leading 
to the OOM under certain conditions. But the Integer.MAX_VALUE default for 64-bit 
JVMs is just wrong, too (it could also lead to OOM).

In general I would not make the buffers too large, so the chunk size should be 
limited to not more than a few megabytes. Making them larger brings no 
performance improvement at all; it just wastes memory in thread-local direct 
buffers allocated internally by the JVM's NIO code.

> review FSDirectory chunking defaults and test the chunking
> --
>
> Key: LUCENE-5161
> URL: https://issues.apache.org/jira/browse/LUCENE-5161
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
> Attachments: LUCENE-5161.patch
>
>
> Today there is a loop in SimpleFS/NIOFS:
> {code}
> try {
>   do {
> final int readLength;
> if (total + chunkSize > len) {
>   readLength = len - total;
> } else {
>   // LUCENE-1566 - work around JVM Bug by breaking very large 
> reads into chunks
>   readLength = chunkSize;
> }
> final int i = file.read(b, offset + total, readLength);
> total += i;
>   } while (total < len);
> } catch (OutOfMemoryError e) {
> {code}
> I bet if you look at the clover report it's untested, because it's fixed at 
> 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
> good?!).
> Also, if you call the setter on a 64-bit machine to change the size, it just 
> totally ignores it. We should remove that; the setter should always work.
> And we should set it to small values in tests so this loop is actually 
> executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734047#comment-13734047
 ] 

Uwe Schindler edited comment on LUCENE-5161 at 8/8/13 9:44 PM:
---

Thanks Robert for opening.

It is too late today, so I will respond tomorrow morning about the NIO stuff. I 
am now aware of the cause and have inspected the JVM code, so I can explain why 
the OOMs occur in SimpleFSDir and NIOFSDir if you read into large buffers. More 
details tomorrow; just one thing before: it has nothing to do with 32 or 64 bits, 
it is more about limitations of the JVM with direct memory and heap size leading 
to the OOM under certain conditions. But the Integer.MAX_VALUE default for 64-bit 
JVMs is just wrong, too (it could also lead to OOM).

In general I would not make the buffers too large, so the chunk size should be 
limited to not more than a few megabytes. Making them larger brings no 
performance improvement at all; it just wastes memory in large *thread-local* 
direct buffers allocated internally by the JVM's NIO code.

  was (Author: thetaphi):
Thanks Robert for opening.

It is too late today, so I will respond tomorrow morning about the NIO stuff. I 
am now aware and inspected the JVM code, so I can explain why the OOMs occur in 
SimpleFSDir and NIOFSDir if you read large buffers. More details tomorrow, just 
one thing before: It has nothing to do with 32 or 64 bits, it is more 
limitations of the JVM with direct memory and heap size leading to the OOM 
under certain conditions. But the Integer.MAX_VALUE for 64 bit JVMs is just 
wrong, too (could also lead to OOM).

In general I would not make the buffers too large, so the junk size should be 
limited to not more than a few megabytes. Making them large brings no 
performance improvement at all, it just wastes emory in thread-local direct 
buffers allocated internally by the JVM's NIO code.
  
> review FSDirectory chunking defaults and test the chunking
> --
>
> Key: LUCENE-5161
> URL: https://issues.apache.org/jira/browse/LUCENE-5161
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
> Attachments: LUCENE-5161.patch
>
>
> Today there is a loop in SimpleFS/NIOFS:
> {code}
> try {
>   do {
> final int readLength;
> if (total + chunkSize > len) {
>   readLength = len - total;
> } else {
>   // LUCENE-1566 - work around JVM Bug by breaking very large 
> reads into chunks
>   readLength = chunkSize;
> }
> final int i = file.read(b, offset + total, readLength);
> total += i;
>   } while (total < len);
> } catch (OutOfMemoryError e) {
> {code}
> I bet if you look at the clover report it's untested, because it's fixed at 
> 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
> good?!).
> Also, if you call the setter on a 64-bit machine to change the size, it just 
> totally ignores it. We should remove that; the setter should always work.
> And we should set it to small values in tests so this loop is actually 
> executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734058#comment-13734058
 ] 

Robert Muir commented on LUCENE-5161:
-

Thanks Uwe, I will leave the issue for you tomorrow to fix the defaults.

I can only say the chunking does not seem buggy (all tests pass with the 
randomization in the patch), so at least we have that.

> review FSDirectory chunking defaults and test the chunking
> --
>
> Key: LUCENE-5161
> URL: https://issues.apache.org/jira/browse/LUCENE-5161
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
> Attachments: LUCENE-5161.patch
>
>
> Today there is a loop in SimpleFS/NIOFS:
> {code}
> try {
>   do {
> final int readLength;
> if (total + chunkSize > len) {
>   readLength = len - total;
> } else {
>   // LUCENE-1566 - work around JVM Bug by breaking very large 
> reads into chunks
>   readLength = chunkSize;
> }
> final int i = file.read(b, offset + total, readLength);
> total += i;
>   } while (total < len);
> } catch (OutOfMemoryError e) {
> {code}
> I bet if you look at the clover report it's untested, because it's fixed at 
> 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
> good?!).
> Also, if you call the setter on a 64-bit machine to change the size, it just 
> totally ignores it. We should remove that; the setter should always work.
> And we should set it to small values in tests so this loop is actually 
> executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4774) Add FieldComparator that allows sorting parent docs based on field inside the child docs

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734138#comment-13734138
 ] 

Hoss Man commented on LUCENE-4774:
--

Mikhail: can you please open a new bug with the details of your test failure -- 
specifically: what branch/revision you are testing and whether or not that seed 
reproduces for you.

(it's not really appropriate to comment on closed issues that added features 
with concerns about bugs in that feature -- that's what Jira issue linking can 
be helpful for).



> Add FieldComparator that allows sorting parent docs based on field inside the 
> child docs
> 
>
> Key: LUCENE-4774
> URL: https://issues.apache.org/jira/browse/LUCENE-4774
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/join
>Reporter: Martijn van Groningen
>Assignee: Martijn van Groningen
> Fix For: 5.0, 4.3
>
> Attachments: LUCENE-4774.patch, LUCENE-4774.patch, LUCENE-4774.patch
>
>
> A field comparator for sorting block join parent docs based on the a field in 
> the associated child docs. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734159#comment-13734159
 ] 

Hoss Man commented on SOLR-5084:


Elran:

1) there are still several sections in your patch that have a lot of reformatting, 
making it hard to see what exactly you've added.  (I realize that the 
formatting may not be 100% uniform in all of these files, but the key to making 
patches easy to read is not to change anything that doesn't *have* to be changed 
... formatting changes should be done separately and independently from 
functionality changes)

2) could you please add a few unit tests to show how the type can be used when 
indexing/querying/faceting/returning stored fields so it's more clear what this 
patch does?

3) I'm not sure that it makes sense to customize the response writers and the 
JavaBinCodec to know about the enum values -- it seems like it would make a lot 
more sense (and be much simpler) to have clients just treat the enum values as 
strings

4) a lot of your code seems to be cut/paste from TrieField ... why can't the 
EnumField class subclass TrieField to re-use this behavior (or worst case: wrap 
a TrieIntField similar to how TrieDateField works)

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734178#comment-13734178
 ] 

Robert Muir commented on SOLR-5084:
---

I agree with Hossman... stick with it though, I really like the idea of an 
efficient enumerated type.

A few other ideas/questions (just from a glance, I could be wrong):
* should we enforce from the enum config that the integer values are 0-N or 
something simple? This way, things like value sources don't have to do hashing 
but simple array lookups (see the sketch after this list).
* it isn't clear to me what happens if you send a bogus value. I think an 
enumerated type would be best if it's "strongly-typed" and just throws an 
exception if the value is bogus.
* should the config, instead of being a separate config file, just be a nested 
element underneath the field type? I don't know if this is even possible or a 
good idea, but it's an option that would remove some xml files.
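
A hedged sketch of the first two points above (all names and values here are 
illustrative, not from the patch): with values forced to the contiguous ordinals 
0..N-1, a value source can map ordinals to labels by plain array indexing, and a 
bogus value can fail fast:

{code}
public class EnumLabels {
  // ordinal i maps to labels[i]; enforcing 0..N-1 makes this possible
  private final String[] labels = {"Not Available", "Low", "High", "Critical"};

  public String labelForOrd(int ord) {
    return labels[ord];                  // array lookup, no hashing
  }

  public int ordForLabel(String label) {
    for (int i = 0; i < labels.length; i++) {
      if (labels[i].equals(label)) {
        return i;
      }
    }
    // "strongly-typed": reject bogus values instead of silently indexing them
    throw new IllegalArgumentException("bogus enum value: " + label);
  }
}
{code}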


> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5150) WAH8DocIdSet: dense sets compression

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734202#comment-13734202
 ] 

Robert Muir commented on LUCENE-5150:
-

Thanks Adrien, i am curious too whether it's possible for you to re-run 
http://people.apache.org/~jpountz/doc_id_sets.html

Because now, with smaller sets in the dense case, maybe there is no need for 
wacky heuristics in CachingWrapperFilter and we could just always cache (i am 
sure some cases would be slower, but if in general it's faster...). This would 
really simplify LUCENE-5101.

> WAH8DocIdSet: dense sets compression
> 
>
> Key: LUCENE-5150
> URL: https://issues.apache.org/jira/browse/LUCENE-5150
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-5150.patch
>
>
> In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be 
> able to encode the inverse set to also compress very dense sets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734210#comment-13734210
 ] 

Hoss Man commented on SOLR-5084:


bq. ...nested element underneath the field type? I don't know if this is even 
possible or a good idea, but it's an idea that would remove some xml files.

i don't think the schema parsing code can handle that -- it's attribute based, 
not nested element based

bq. should we enforce from the enum config that the integer values are 0-N or 
something simple? ...

yeah ... it would be tempting to not even let the config specify numeric values 
-- just an ordered list, except:

1) all hell would break loose if someone accidentally inserted a new element 
anywhere other than the end of the list
2) you'd need/want a way to "disable" values in the middle of the list so they 
stop working.

#2 is a problem you'd need to worry about even if we keep the mappings explicit 
but enforce 0-N ... there needs to be something like...

{code}
<enum name="severity">
  <value id="0">Not Available</value>
  <value id="1">Low</value>
  <!-- id="2" ("Medium") removed on purpose -->
  <value id="3">High</value>
  <value id="4">Critical</value>
</enum>
{code}

bq. ... This way, things like value sources don't have to do hashing, just 
simple array lookups.

I was actually thinking it would be nice to support multiple legal names (with 
one canonical for responses) per value, but that would prevent the simple array 
lookups...

{code}
<enum name="severity">
  <value id="0">Not Available</value>
  <value id="1">Low</value>
  <!-- id="2" ("Medium") removed on purpose -->
  <value id="3">High</value>
  <value id="4">
    <label canonical="true">Critical</label>
    <label>Highest</label>
  </value>
</enum>
{code}


> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734223#comment-13734223
 ] 

Robert Muir commented on SOLR-5084:
---

{quote}
...nested element underneath the field type? I don't know if this is even 
possible or a good idea, but it's an idea that would remove some xml files.

i don't think the schema parsing code can handle that -- it's attribute based, 
not nested element based
{quote}

Right, but code can change. Other parts of solr allow this kind of stuff.

{quote}
yeah ... it would be tempting to not even let the config specify numeric values 
-- just an ordered list, except:

1) all hell would break loose if someone accidentally inserted a new element 
anywhere other than the end of the list
2) you'd need/want a way to "disable" values in the middle of the list so they 
stop working.
{quote}

Well, I guess I look at it differently: this is, in a sense, like an analyzer. 
You can't change the config without reindexing.

{quote}
I was actually thinking it would be nice to support multiple legal names (with 
one canonical for responses) per value, but that would prevent the simple array 
lookups...
{quote}

Why? I'm talking about int->canonical name (e.g. in the valuesource impl), not 
anything else. As far as name->int goes, you want a hash anyway (see the sketch 
below).
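
A tiny sketch of that split (hypothetical names): aliases would only add 
entries to the name->int hash, while the int->canonical array a value source 
would use stays a plain array lookup.

{code}
// Hypothetical sketch: aliases live only in the name->int hash; the
// int->canonical array (what a value source impl would read) is untouched.
import java.util.HashMap;
import java.util.Map;

public class EnumAliases {
  public static void main(String[] args) {
    String[] ordToCanonical = {"Not Available", "Low", "Medium", "High", "Critical"};
    Map<String, Integer> labelToOrd = new HashMap<>();
    for (int ord = 0; ord < ordToCanonical.length; ord++) {
      labelToOrd.put(ordToCanonical[ord], ord);
    }
    labelToOrd.put("Highest", 4); // alias: one extra hash entry, same ordinal

    System.out.println(labelToOrd.get("Highest")); // 4
    System.out.println(ordToCanonical[4]);         // Critical (canonical in responses)
  }
}
{code}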

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734228#comment-13734228
 ] 

Hoss Man commented on SOLR-5084:


bq. Well, I guess I look at it differently: this is, in a sense, like an 
analyzer. You can't change the config without reindexing.

i dunno ... that seems like it would really kill the utility of such a field for 
a lot of use cases -- if it had that kind of limitation, i would just use an 
"int" field and manage the mappings myself so i'd always know i could 
add/remove fields w/o needing to reindex.

to follow your example: if i completely change the analyzer, then yes, i have to 
reindex -- but if i want to stop using a synonym, i don't have to re-index every 
doc, just the ones that used that synonym.

bq. as far as name->int, you want a hash anyway.

right ... never mind, i was thinking about it backwards.

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734228#comment-13734228
 ] 

Hoss Man edited comment on SOLR-5084 at 8/9/13 12:14 AM:
-

bq. Well, I guess I look at it differently: this is, in a sense, like an 
analyzer. You can't change the config without reindexing.

i dunno ... that seems like it would really kill the utility of such a field for 
a lot of use cases -- if it had that kind of limitation, i would just use an 
"int" field and manage the mappings myself so i'd always know i could 
add/remove (EDIT) -fields- values w/o needing to reindex.

to follow your example: if i completely change the analyzer, then yes, i have to 
reindex -- but if i want to stop using a synonym, i don't have to re-index every 
doc, just the ones that used that synonym.

bq. as far as name->int, you want a hash anyway.

right ... never mind, i was thinking about it backwards.

  was (Author: hossman):
bq. Well, I guess I look at it differently: this is, in a sense, like an 
analyzer. You can't change the config without reindexing.

i dunno ... that seems like it would really kill the utility of such a field for 
a lot of use cases -- if it had that kind of limitation, i would just use an 
"int" field and manage the mappings myself so i'd always know i could 
add/remove fields w/o needing to reindex.

to follow your example: if i completely change the analyzer, then yes, i have to 
reindex -- but if i want to stop using a synonym, i don't have to re-index every 
doc, just the ones that used that synonym.

bq. as far as name->int, you want a hash anyway.

right ... never mind, i was thinking about it backwards.
  
> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734239#comment-13734239
 ] 

Robert Muir commented on SOLR-5084:
---

{quote}
i dunno ... that seems like it would really kill the utility of such a field for 
a lot of use cases -- if it had that kind of limitation, i would just use an 
"int" field and manage the mappings myself so i'd always know i could 
add/remove values w/o needing to reindex.
{quote}

This isn't really going to work here, because the idea is that you want to 
assign sort order (not just values mapped to ints). If you want to rename a 
label, that's fine, but you can't really change the sort order without 
reindexing.

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734248#comment-13734248
 ] 

Hoss Man commented on SOLR-5084:


bq. If you want to rename a label, that's fine, but you can't really change the 
sort order without reindexing.

No, no ... of course not ... i wasn't suggesting you could change the order, 
just:
* *remove* a legal value from the list (w/o causing the validation to complain)
* add new values to the end of the list
* (as you mentioned) modify the label on an existing value

See the example i posted before about removing "Medium" but keeping "High" & 
"Critical" exactly as they are -- no change in indexed data, just a way to tell 
the validation logic you were talking about adding "skip this value, i removed 
it on purpose" (or i suppose: "skip this value, i'm reserving it as a 
placeholder for future use")

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734254#comment-13734254
 ] 

Robert Muir commented on SOLR-5084:
---

I think adding new values to the end of the list is no issue at all. Neither is 
renaming labels.

But removing a legal value from the list -- i think you need to reindex, because 
what do you do with documents that have that integer value?

In general I'm just trying to make sure we keep things sane here, so that the 
underlying implementation is efficient.

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734259#comment-13734259
 ] 

Hoss Man commented on SOLR-5084:


bq. but removing a legal value from the list, i think you need to reindex, 
because what do you do with documents that have that integer value?

For sorting and value sources etc. ... nothing special happens -- they still have 
the same numeric value under the covers; it's just that when writing out the 
"stored" values (ie: the label) you act as if they have no value in the field at 
all (shouldn't affect efficiency at all; see the sketch below).

If the user wants some other behavior, the burden is on them to re-index or 
delete the affected docs -- but the simple stuff stays just as simple as if 
they were dealing with the int<->label mappings in their own code; the 
validation of legal labels just moves from the client to solr.
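
A minimal sketch of that behavior (hypothetical names, not from the patch):

{code}
// Hypothetical sketch: a removed ordinal keeps its numeric value for
// sorting/value sources, but the stored/label output treats it as unset.
public class RemovedEnumValues {
  private final String[] ordToLabel; // null == value removed from the config

  public RemovedEnumValues(String... ordToLabel) {
    this.ordToLabel = ordToLabel.clone();
  }

  /** Sorting and value sources: nothing special, the int stays the int. */
  public int sortValue(int ord) { return ord; }

  /** Stored output: null means "act as if the field has no value". */
  public String storedLabel(int ord) { return ordToLabel[ord]; }

  public static void main(String[] args) {
    // "Medium" (ord 2) removed on purpose; High and Critical keep their ordinals.
    RemovedEnumValues rv =
        new RemovedEnumValues("Not Available", "Low", null, "High", "Critical");
    System.out.println(rv.sortValue(2));   // 2 -- sort order unchanged
    System.out.println(rv.storedLabel(2)); // null -- omitted from the response
  }
}
{code}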

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734266#comment-13734266
 ] 

Robert Muir commented on SOLR-5084:
---

{quote}
For sorting and value sources etc. ... nothing special happens -- they still have 
the same numeric value under the covers; it's just that when writing out the 
"stored" values (ie: the label) you act as if they have no value in the field at 
all (shouldn't affect efficiency at all).
{quote}

Then this is just renaming a label to some special value.

I really think the best thing is to keep it simple, like java.lang.Enum: just 
give a list of values (see the sketch below). This way it will be efficient 
everywhere, since the values will be dense. It's also conceptually simple.

Otherwise, things get complicated and the implementation may suffer due to 
sparse "ordinals". Really, I don't care much, as docvalues will do the right 
thing as long as you have < 256 values (regardless of sparsity). Fieldcache 
won't, but that doesn't bother me a bit.

But still, there is no sense in making things complicated and inefficient for 
no good reason. Someone could make a HairyComplicatedAndInefficientEnumType for 
that.
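
Spelling out the java.lang.Enum analogy (plain Java, nothing Solr-specific):

{code}
// java.lang.Enum: an ordered list of values with dense ordinals assigned
// implicitly by position -- no holes, and strongly typed.
public class DenseOrdinals {
  enum Severity { NOT_AVAILABLE, LOW, MEDIUM, HIGH, CRITICAL }

  public static void main(String[] args) {
    System.out.println(Severity.HIGH.ordinal());      // 3 -- position defines sort order
    System.out.println(Severity.valueOf("CRITICAL")); // bogus names throw IllegalArgumentException
  }
}
{code}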

> new field type - EnumField
> --
>
> Key: SOLR-5084
> URL: https://issues.apache.org/jira/browse/SOLR-5084
> Project: Solr
>  Issue Type: New Feature
>Reporter: Elran Dvir
> Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
> Solr-5084.patch
>
>
> We have encountered a use case in our system where we have a few fields 
> (Severity, Risk, etc.) with a closed set of values, where the sort order for 
> these values is pre-determined but not lexicographic (Critical is higher than 
> High). Generically, this is very close to how enums work.
> To implement, I have prototyped a new type of field: EnumField, where the 
> inputs are a closed, predefined set of strings in a special configuration 
> file (similar to currency.xml).
> The code is based on 4.2.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: jar-checksums generates extra files?

2013-08-08 Thread Robert Muir
If you google 'svn remove unversioned' you find a couple one-liners
you can alias.

I also found 
http://svn.apache.org/repos/asf/subversion/trunk/contrib/client-side/svn-clean

Weird that it has a GPL license though!

On Thu, Aug 8, 2013 at 4:14 AM, Dawid Weiss
 wrote:
> I kind of use a workaround of removing everything except the .svn
> folder and then svn revert -R .
> But this is a dumb solution :)
>
> D.
>
> On Thu, Aug 8, 2013 at 1:12 PM, Uwe Schindler  wrote:
>> Hi,
>>
>> Some GUIs like TortoiseSVN have this. I use this to delete all unversioned
>> files in milliseconds(TM). But native svn does not have it, unfortunately.
>>
>> Uwe
>>
>>
>>
>> Dawid Weiss  wrote:
>>>
>>> Never mind, these were local files and they were svn-ignored, when I
>>> removed everything and checked out from scratch this problem is no
>>> longer there.
>>>
>>> I really wish svn had an equivalent of git clean -xfd .
>>>
>>> Dawid
>>>
>>> On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss 
>>> wrote:

  When I do this on trunk:

  ant jar-checksums
  svn stat

  I get:
  ?   solr\licenses\jcl-over-slf4j.jar.sha1
  ?   solr\licenses\jul-to-slf4j.jar.sha1
  ?   solr\licenses\log4j.jar.sha1
  ?   solr\licenses\slf4j-api.jar.sha1
  ?   solr\licenses\slf4j-log4j12.jar.sha1

  Where should this be fixed?  Should we svn-ignore those files or
  should they be somehow excluded from the re-generation of SHA
  checksums?

  Dawid
>>>
>>>
>>> 
>>>
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>
>> --
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, 28213 Bremen
>> http://www.thetaphi.de
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734463#comment-13734463
 ] 

Christoph Straßer commented on SOLR-5124:
-

@Jack: No issue with odd unicode characters. (Fiddler Raw View - screenshot of 
extractOnly=true attached.)
@Uwe: Big thanks for taking care of this issue! :-)

> Solr glues words when parsing PDFs under certain circumstances
> --
>
> Key: SOLR-5124
> URL: https://issues.apache.org/jira/browse/SOLR-5124
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.4
> Environment: Windows 7 (don't think this is relevant)
>Reporter: Christoph Straßer
>Priority: Minor
>  Labels: tika,text-extraction
> Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
> 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
> 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png
>
>
> For some kinds of PDF documents, Solr glues words together at line breaks under 
> some circumstances (e.g. the last word of line 1 and the first word of line 2 
> are merged into one word).
> (Stand-alone) Tika extracts the text correctly. Attached you find one 
> sample PDF and screenshots of the Tika output and the corrupted content indexed 
> by Solr.
> (This issue does not occur with all PDF documents. I tried to recreate the 
> issue with new Word documents that I converted into PDF in multiple ways, 
> without success.) The attached PDF document has a really weird internal 
> structure, but Tika seems to do its work right, even with this weird document.
> In our Solr indices we have a good amount of these weird documents. This 
> results in worse suggestions by the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734469#comment-13734469
 ] 

Christoph Straßer edited comment on SOLR-4679 at 8/9/13 6:41 AM:
-

@Uwe: Big thanks for taking care of this issue! 
@Hoss Man: Thank you for your input!

  was (Author: christophs78):
@Uwe: Big thanks for taking care of this issue! 
@Hoss Man: Thank you for your input'!
  
> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> vienna<br/>
> </body>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734469#comment-13734469
 ] 

Christoph Straßer commented on SOLR-4679:
-

@Uwe: Big thanks for taking care of this issue! 
@Hoss Man: Thank you for your input'!

> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> 
>
> Key: SOLR-4679
> URL: https://issues.apache.org/jira/browse/SOLR-4679
> Project: Solr
>  Issue Type: Bug
>  Components: update
>Affects Versions: 4.2
> Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>Reporter: Christoph Straßer
>Assignee: Uwe Schindler
> Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <br/>, <br />, ...) seem to be removed during 
> extraction of content from HTML files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <body>
> word1<br>word2
> Some other words, a special name like linz<br>and another special name - 
> vienna<br/>
> </body>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org