[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread Thomas Neidhart (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030450#comment-14030450
 ] 

Thomas Neidhart commented on CODEC-187:
---

The rule file for Ashkenazi approx is very different from the original version, 
maybe by mistake or for another reason.

We should create a separate issue to upgrade the rule files to the latest 
version, i.e. 3.02.

> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
>              Key: CODEC-187
>              URL: https://issues.apache.org/jira/browse/CODEC-187
>          Project: Commons Codec
>       Issue Type: Bug
> Affects Versions: 1.9
>         Reporter: michael tobias
>         Priority: Minor
>          Fix For: 1.10
>
>      Attachments: CODEC-187.patch
>
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons 
> Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently on version 3.02, 
> though it had been static since version 3.01 dated 19 Dec 2011 (it was first 
> available as open source as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec docs to say which version of BMPM was 
> implemented, so I am not sure whether the problem with the algorithm as coded in 
> the Codec is simply an old version or whether there are more basic problems 
> with the implementation.
> How do I determine the version of the algorithm that was implemented in the 
> Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm 
> changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate 
> and working as expected?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MATH-1127) 2.0 equal to -2.0

2014-06-13 Thread Luc Maisonobe (JIRA)
Luc Maisonobe created MATH-1127:
---

         Summary: 2.0 equal to -2.0
             Key: MATH-1127
             URL: https://issues.apache.org/jira/browse/MATH-1127
         Project: Commons Math
      Issue Type: Bug
Affects Versions: 3.3
     Environment: Linux, Java 5
        Reporter: Luc Maisonobe


The following test fails:

{code}
@Test
public void testMath1127() {
Assert.assertFalse(Precision.equals(2.0, -2.0, 1));
}
{code}
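
One way a test like this can fail is a ULP-based comparison that maps each double to an ordered integer and then subtracts the two. The sketch below assumes that approach; it is an illustration with hypothetical class and method names, not a copy of the `Precision.equals` source. For 2.0 and -2.0 the subtraction overflows to `Long.MIN_VALUE`, whose absolute value is still negative, so the naive check reports equality.

```java
final class UlpOverflowSketch {

    /** Sign bit of a 64-bit IEEE 754 value. */
    private static final long SGN_MASK = 0x8000000000000000L;

    /** Maps the raw bits to a signed integer whose ordering matches the ordering of the doubles. */
    static long ordered(double value) {
        long bits = Double.doubleToLongBits(value);
        return bits < 0 ? SGN_MASK - bits : bits;
    }

    /** Naive check: the subtraction can overflow when the operands have opposite signs. */
    static boolean naiveEquals(double x, double y, int maxUlps) {
        return Math.abs(ordered(x) - ordered(y)) <= maxUlps;
    }

    /** Guarded check: an overflowing difference is treated as "too far apart". */
    static boolean guardedEquals(double x, double y, int maxUlps) {
        if (Double.isNaN(x) || Double.isNaN(y)) {
            return false;
        }
        long a = ordered(x);
        long b = ordered(y);
        long diff = a - b;
        // a - b overflowed iff a and b differ in sign and diff's sign differs from a's.
        boolean overflow = ((a ^ b) & (a ^ diff)) < 0;
        // Range test instead of Math.abs, because abs(Long.MIN_VALUE) is negative.
        return !overflow && diff >= -(long) maxUlps && diff <= maxUlps;
    }

    public static void main(String[] args) {
        System.out.println(naiveEquals(2.0, -2.0, 1));               // true: the overflow hides the gap
        System.out.println(guardedEquals(2.0, -2.0, 1));             // false
        System.out.println(guardedEquals(1.0, Math.nextUp(1.0), 1)); // true: adjacent doubles
    }
}
```

Whether this is the exact cause in the affected release would need to be confirmed against the library source.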






[jira] [Commented] (COMPRESS-284) Multi Thread Uncompress TGZ - CRC32 ERROR

2014-06-13 Thread Stefan Bodewig (JIRA)

[ 
https://issues.apache.org/jira/browse/COMPRESS-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030606#comment-14030606
 ] 

Stefan Bodewig commented on COMPRESS-284:
-

GzipCompressorInputStream is the one throwing the exception, and it is not 
thread-safe in the sense that you can't have two threads reading from the same 
GzipCompressorInputStream instance at the same time.  Reading from different 
instances shouldn't cause any problems; each instance has a CRC of its own.  
Are you sharing the same Gzip input stream with multiple consumers?
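
The "one instance per thread" pattern can be sketched as follows. To stay self-contained, this example uses the JDK's `java.util.zip.GZIPInputStream` as a stand-in for GzipCompressorInputStream, and the class and method names are hypothetical. The point is only that each task opens its own stream, so no CRC state is shared.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class PerThreadGzipSketch {

    /** Compresses a byte array so the example needs no files on disk. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decompresses through a stream instance that only the calling thread touches. */
    static byte[] gunzip(byte[] compressed) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
            return sink.toByteArray();
        }
    }

    /** Runs one decompression per worker; every task opens its own stream. */
    static List<byte[]> gunzipInParallel(final byte[] compressed, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                Callable<byte[]> task = () -> gunzip(compressed);
                futures.add(pool.submit(task));
            }
            List<byte[]> results = new ArrayList<>();
            for (Future<byte[]> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If CRC errors still appear with this layout, the underlying source (file handle, socket, shared buffer) is a likely place to look for sharing.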


> Multi Thread Uncompress TGZ - CRC32 ERROR
> -----------------------------------------
>
>              Key: COMPRESS-284
>              URL: https://issues.apache.org/jira/browse/COMPRESS-284
>          Project: Commons Compress
>       Issue Type: Bug
> Affects Versions: 1.8.1
>      Environment: Linux
>         Reporter: Inspico
>
> We have to uncompress ".tar.gz" files.
> So we use a "TarArchiveInputStream(GzipCompressorInputStream)".
> An archive extracted alone works perfectly.
> But when we launch parallel threads to extract many archives at the 
> same time, we get the same error for each thread:
> java.lang.Exception: Error while extracting list of files from Archive : 
> java.io.IOException: Gzip-compressed data is corrupt (CRC32 error)
> Sometimes only one archive among all of them is extracted successfully.
> Are there any problems with using TarInputStream or GZIPInputStream from 
> multiple threads?





[jira] [Commented] (COMPRESS-263) Add DEFLATE support

2014-06-13 Thread Stefan Bodewig (JIRA)

[ 
https://issues.apache.org/jira/browse/COMPRESS-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030613#comment-14030613
 ] 

Stefan Bodewig commented on COMPRESS-263:
-

Thanks, I'll probably commit this during the weekend (minus the changes to the 
POM ;-) )

One thing I might quibble about is the name of isZlibHeaderPresent - this reads 
the wrong way when applied to the writing side, where should...BePresent would be 
more appropriate.  How about withZlibHeader?

Also, do you think it is possible to auto-detect the format, at least in the case 
where the stream contains a ZLIB header?
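
A minimal sketch of what such auto-detection could look like, based on the two-byte header defined in RFC 1950 (compression method 8, window size field at most 7, and the 16-bit header value a multiple of 31). The class and method names are hypothetical and are not taken from the patch; the `withZlibHeader` parameter simply mirrors the name suggested above. The check is a heuristic: a raw DEFLATE stream could start with two bytes that happen to pass it.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

final class ZlibHeaderSketch {

    /** Checks the two-byte zlib header described in RFC 1950. */
    static boolean looksLikeZlibHeader(byte[] data) {
        if (data == null || data.length < 2) {
            return false;
        }
        int cmf = data[0] & 0xFF;
        int flg = data[1] & 0xFF;
        boolean deflateMethod = (cmf & 0x0F) == 8;          // CM = 8 selects DEFLATE
        boolean validWindow = (cmf >> 4) <= 7;              // CINFO above 7 is not allowed
        boolean checkBitsOk = ((cmf << 8) | flg) % 31 == 0; // FCHECK makes the pair a multiple of 31
        return deflateMethod && validWindow && checkBitsOk;
    }

    /** Compresses with or without the zlib wrapper, using only java.util.zip. */
    static byte[] deflate(byte[] raw, boolean withZlibHeader) {
        // The JDK flag is "nowrap": true means no zlib header and no trailer.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, !withZlibHeader);
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                sink.write(buffer, 0, n);
            }
            return sink.toByteArray();
        } finally {
            deflater.end();
        }
    }
}
```

A reader that wants to auto-detect would peek at the first two bytes (for example through a buffered or pushback stream) and choose the wrapped or raw mode accordingly.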


> Add DEFLATE support
> -------------------
>
>              Key: COMPRESS-263
>              URL: https://issues.apache.org/jira/browse/COMPRESS-263
>          Project: Commons Compress
>       Issue Type: New Feature
>       Components: Compressors
>         Reporter: Matthias Stevens
>           Labels: features
>          Fix For: 1.9
>
>      Attachments: COMPRESS-263_DeflateSupport.patch, 
> COMPRESS-263_DeflateSupport_v1.1.patch, bla.tar.deflate, bla.tar.deflatez
>
>
> GZIP is not a compression algorithm "as such". The de facto (and currently 
> the only supported) compression algorithm it uses is DEFLATE.
> GZIP adds a header of minimum 10 bytes and a footer of 8 bytes to a 
> "deflated" data stream. Find out more here: 
> http://en.wikipedia.org/wiki/Gzip#File_format
> I have no problem with the current GZIP support, but it would be nice if 
> CommonsCompress would also have compression and decompression support for 
> "raw" DEFLATE streams and DEFLATE streams with the zlib header.
> Similarly to the GZIP support in CommonsCompress, this functionality can be 
> implemented very easily using the standard java.util.zip package, as done in 
> the provided patch.





[jira] [Resolved] (MATH-1127) 2.0 equal to -2.0

2014-06-13 Thread Luc Maisonobe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MATH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luc Maisonobe resolved MATH-1127.
-

   Resolution: Fixed
Fix Version/s: 3.4

Fixed in subversion repository as of r1602438.

This was a fun one!






[jira] [Resolved] (MATH-1126) "LevenbergMarquardtOptimizer": Divergent behavior of new code

2014-06-13 Thread Gilles (JIRA)

 [ 
https://issues.apache.org/jira/browse/MATH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilles resolved MATH-1126.
--

Resolution: Cannot Reproduce

It seems that after several cycles of recompiling, I cannot reproduce different 
outputs from the two implementations!  Maybe I was using "stale" files 
somewhere...
The results are almost exactly the same.

The performance degradation remains, although down to about 20% (vs 35% as 
initially observed).  I still get a different number of evaluations but it's 
not caused by the core of the optimization algorithm...
Thus closing this report. Sorry for the noise.


> "LevenbergMarquardtOptimizer": Divergent behavior of new code
> -------------------------------------------------------------
>
>              Key: MATH-1126
>              URL: https://issues.apache.org/jira/browse/MATH-1126
>          Project: Commons Math
>       Issue Type: Bug
> Affects Versions: 3.3
>         Reporter: Gilles
>           Labels: regression
>      Attachments: LM_cost_NEW, LM_cost_OLD
>
>
> The new implementation of "LevenbergMarquardtOptimizer" (package 
> "o.a.c.m.fitting.leastsquares") behaves differently from the previous one 
> (package "o.a.c.m.optim.nonlinear.vector.jacobian").
> This shows up not so much in the solutions respectively found by one and the 
> other implementation; they are fairly similar, but in my use-case, the 
> number of function evaluations is quite different. And this could explain an 
> observed 35% performance degradation.





[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread michael tobias (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030762#comment-14030762
 ] 

michael tobias commented on CODEC-187:
--

But the rules for Ashkenazi approx have not changed since 2009.  It was 
clearly wrong at the first implementation.

Do you want me to re-open this issue or create a new one?

I am happy to spend time working with you to test the tokens produced by any 
code update.

There is another issue, however.  If we are fixing bugs in the BMPM algorithm 
and then also updating the rules to the latest version (which should not really 
be very different from the original version coded), then any indexes generated 
using BMPM should really be re-created, because anybody updating their Commons 
Codec will find that new indexing and queries produce different 
tokens from those in existing indexes, and so queries might not find existing 
records.

Will the issue of a new commons codec be accompanied by detailed information 
advising all existing indexes using BMPM to be re-created?

Michael
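
The re-indexing concern can be illustrated with a toy example. The two encoders below are invented stand-ins for an "old" and a "new" rule set and do not produce real BMPM tokens; BMPM also emits several tokens per name, which this single-token sketch ignores. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class ReindexSketch {

    // Toy encoders standing in for two rule-set versions; NOT real BMPM output.
    static final Function<String, String> OLD_ENCODER =
            s -> s.toLowerCase(Locale.ROOT).replace("w", "v");
    static final Function<String, String> NEW_ENCODER =
            s -> s.toLowerCase(Locale.ROOT).replace("w", "v").replace("tz", "ts");

    /** Builds a token -> names index with the given encoder. */
    static Map<String, List<String>> buildIndex(List<String> names,
                                                Function<String, String> encoder) {
        Map<String, List<String>> index = new HashMap<>();
        for (String name : names) {
            index.computeIfAbsent(encoder.apply(name), k -> new ArrayList<>()).add(name);
        }
        return index;
    }

    /** Encodes the query and looks the token up; empty when the token is absent. */
    static List<String> lookup(Map<String, List<String>> index, String query,
                               Function<String, String> encoder) {
        return index.getOrDefault(encoder.apply(query), Collections.<String>emptyList());
    }
}
```

An index built with `OLD_ENCODER` and queried with `NEW_ENCODER` misses "Schwartz" because the stored token and the query token differ; rebuilding the index with the new encoder restores the match.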



[jira] [Comment Edited] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread michael tobias (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030762#comment-14030762
 ] 

michael tobias edited comment on CODEC-187 at 6/13/14 3:41 PM:
---

But the rules for Ashkenazi approx have not changed since 2009.  It was 
clearly wrong at the first implementation.

Do you want me to re-open this issue or create a new one?

I am happy to spend time working with you to test the tokens produced by any 
code update.

There is another issue however.  If we are fixing bugs in the BMPM algorithm 
and then also updating the rules to the latest version (which should not really 
be very different from the original version coded) then any indexes generated 
using BMPM should really be re-created because anybody updating their commons 
codec will find that new indexing and queries will be producing different 
tokens from those in existing indexes and so queries might not find existing 
records.

Will the issue of a new commons codec be accompanied by detailed information 
advising all existing indexes using BMPM to be re-created?

Michael





[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030773#comment-14030773
 ] 

Gary Gregory commented on CODEC-187:


Feel free to write up any documentation you deem useful and attach it here as a 
patch file or a plain text file. We can add this to the release notes, I would 
guess, as this is the only place that makes sense.



[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread michael tobias (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030799#comment-14030799
 ] 

michael tobias commented on CODEC-187:
--

When we get to that point I will happily do so.

Gary, should I start a new issue for the continuing bug(s)?

I was also wondering (dangerous, I know).

Because of the potential existing index issues after the revised code is issued 
(though it looks like anybody using EXACT is probably ok), would it be possible 
/ better for us to leave the current BMPM coding untouched from 1.9 and issue 
the bug-fixed version as ADDITIONAL BMPM3.02 functionality?  In that way 
existing users could continue to use the current (buggy) version if it works 
fine for them, while those wanting/needing the full correct implementation could 
use the 3.02 version.  This would also make it 100% clear which version of the 
algorithm is coded/being used.

I realise this means we are 'bloating' the Codec by having 2 versions of the code, 
but it does actually keep things quite clean and allows users to ignore the 
bug-fixes and/or move over to the fixed 3.02 version in their own time.

What do you think?

Michael



[jira] [Comment Edited] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread michael tobias (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030799#comment-14030799
 ] 

michael tobias edited comment on CODEC-187 at 6/13/14 4:21 PM:
---

When we get to that point I will happily do so.

Gary, should I start a new issue for the continuing bug(s)?

I was also wondering (dangerous, I know).

Because of the potential existing index issues after the revised code is issued 
(though it looks like anybody using EXACT is probably ok), would it be possible 
/ better for us to leave the current BMPM coding untouched from 1.9 and issue 
the bug-fixed version as ADDITIONAL BMPM3.02 functionality?  In that way 
existing users could continue to use the current (buggy) version if it works 
fine for them, while those wanting/needing the full correct implementation could 
use the 3.02 version.  This would also make it 100% clear which version of the 
algorithm is coded/being used.

I realise this means we are 'bloating' the Codec by having 2 versions of the code, 
but it does actually keep things quite clean and allows users to ignore the 
bug-fixes and/or move over to the fixed 3.02 version in their own time.

It could also be made clear that eventually the original buggy BMPM will be 
dropped and users would be encouraged to adopt the 3.02 version.

What do you think?

Michael





[jira] [Created] (EXEC-87) Watchdog should allow both SIGTERM and SIGKILL to be sent to a process

2014-06-13 Thread Corey J. Nolet (JIRA)
Corey J. Nolet created EXEC-87:
--

         Summary: Watchdog should allow both SIGTERM and SIGKILL to be sent to a process
             Key: EXEC-87
             URL: https://issues.apache.org/jira/browse/EXEC-87
         Project: Commons Exec
      Issue Type: Wish
        Reporter: Corey J. Nolet
         Fix For: 1.3


It would be nice to allow a process to clean itself up gracefully after being 
killed. 





[jira] [Commented] (EXEC-87) Watchdog should allow both SIGTERM and SIGKILL to be sent to a process

2014-06-13 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/EXEC-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031032#comment-14031032
 ] 

Corey J. Nolet commented on EXEC-87:


Also thinking it would be useful if a SIGKILL could be sent if the process is 
still running some interval of time after the SIGTERM. This would be a worst case.
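
The escalation could be sketched with the plain JDK `Process` API (Java 8 or later) rather than the Commons Exec watchdog, whose API is not shown in this thread. The class and method names are hypothetical. Note that what `Process.destroy()` does is platform-dependent; on common Unix JVMs it sends SIGTERM, while `destroyForcibly()` corresponds to SIGKILL.

```java
import java.util.concurrent.TimeUnit;

final class GracefulStopSketch {

    /**
     * Asks the process to stop, waits up to graceMillis, then kills it forcibly.
     *
     * @return true if the process exited within the grace period,
     *         false if the forcible kill had to be used
     */
    static boolean stop(Process process, long graceMillis) {
        process.destroy(); // polite request; typically SIGTERM on Unix
        try {
            if (process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt and fall through to the forcible kill.
            Thread.currentThread().interrupt();
        }
        process.destroyForcibly(); // worst case; typically SIGKILL on Unix
        return false;
    }
}
```

A caller would pass the grace period that its child process needs for cleanup, for example `GracefulStopSketch.stop(process, 5000)`.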



[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031171#comment-14031171
 ] 

Gary Gregory commented on CODEC-187:


Well... that's a good question and I'd love to have Matthew P's opinion.

Should we support more than one version of the algo in the first place?

This needs the thoughts of some SMEs.

From a user's perspective, I could see using a factory method with a version 
argument that returns an implementation.

First, we need to decide whether multiple versions should be supported within 
the same code base.
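
For discussion, such a factory could look roughly like this. Every name here (`RuleSetVersion`, `Encoder`, `forVersion`) is hypothetical and nothing like it exists in Commons Codec today; the encoder bodies are placeholders that only tag the input so the selection is visible.

```java
final class VersionedEncoderFactorySketch {

    /** Hypothetical version selector; no such type exists in Commons Codec. */
    enum RuleSetVersion { LEGACY, V3_02 }

    /** Minimal stand-in for an encoder; a real one would return phonetic tokens. */
    interface Encoder {
        String encode(String source);
    }

    /** Factory method: the version argument decides which implementation is returned. */
    static Encoder forVersion(RuleSetVersion version) {
        switch (version) {
            case LEGACY:
                // Placeholder body: tags the input instead of encoding it.
                return source -> "legacy:" + source;
            case V3_02:
                return source -> "3.02:" + source;
            default:
                throw new IllegalArgumentException("Unsupported version: " + version);
        }
    }
}
```

The appeal of this shape is that existing callers keep asking for the legacy behaviour explicitly, and the version in use is visible at the call site.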



[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread Thomas Neidhart (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031191#comment-14031191
 ] 

Thomas Neidhart commented on CODEC-187:
---

This has to be further analysed, but I doubt that the algorithm / code has 
changed at all. The Beider Morse phonetic encoder is a generic rule-based 
replacement algorithm with domain-specific rules.

Having said this, I could imagine that we add different versions of the rules 
and allow the user to create instances of the BeiderMorseEncoder using 
different rulesets. Just keep in mind that the current ruleset is approx. 548kB 
uncompressed and ~115kB compressed, which means if we add multiple versions 
this would further increase the size of the jar file.

Furthermore, if updates to the rules just result in more tokens being returned, 
no re-indexing would be necessary imho (it might create better results though).
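
To make the "generic engine, swappable rules" idea concrete, here is a deliberately tiny rule-based replacer. It is not the BeiderMorseEncoder implementation, and the two rule sets are invented for this example; they are not BMPM rules. The same `encode` code produces different tokens depending only on the rule list it is constructed with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class RuleBasedEncoderSketch {

    /** One rewrite rule: a literal pattern and the text that replaces it. */
    static final class Rule {
        final String pattern;
        final String replacement;

        Rule(String pattern, String replacement) {
            if (pattern.isEmpty()) {
                throw new IllegalArgumentException("pattern must not be empty");
            }
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    // Toy rule sets, invented for this example; they are NOT real BMPM rules.
    static final List<Rule> TOY_RULES_OLD = Arrays.asList(
            new Rule("sch", "S"), new Rule("w", "v"));
    static final List<Rule> TOY_RULES_NEW = Arrays.asList(
            new Rule("sch", "S"), new Rule("tz", "ts"), new Rule("w", "v"));

    private final List<Rule> rules;

    RuleBasedEncoderSketch(List<Rule> rules) {
        this.rules = new ArrayList<>(rules);
    }

    /** Scans left to right and applies the first rule that matches at each position. */
    String encode(String input) {
        String s = input.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            boolean matched = false;
            for (Rule rule : rules) {
                if (s.startsWith(rule.pattern, i)) {
                    out.append(rule.replacement);
                    i += rule.pattern.length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(s.charAt(i)); // no rule applies: copy the character
                i++;
            }
        }
        return out.toString();
    }
}
```

With the old toy rules "Schwartz" becomes "Svartz"; with the new ones it becomes "Svarts", without any change to the engine.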



[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

2014-06-13 Thread michael tobias (JIRA)

[ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031233#comment-14031233
 ] 

michael tobias commented on CODEC-187:
--

I agree that you probably don't want to add TOO many versions into the code - 
because of code size issues.

I have done very limited testing of the Codec implementation.  It APPEARS that 
the EXACT algorithm is working fine and also SEPHARDIC APPROX.  GENERIC APPROX 
- APPEARS to be missing some tokens which you argue might not require 
re-indexing, but the ASHKENAZI APPROX results are just downright WRONG and 
anybody who has created such tokens/indexes is not getting good results.

The BMPM rules/algorithm is fairly stable/static now so I am not sure whether 
it will ever be necessary to implement further versions into the CODEC.

Can I suggest that you consider a 2-version approach? Version 1 would be the 
current existing faulty code, kept for backwards compatibility with existing 
indexes, and version 2 would be 3.02, the most current and likely to remain 
adequate for the foreseeable future.

Michael

 



[jira] [Commented] (COMPRESS-263) Add DEFLATE support

2014-06-13 Thread Stefan Bodewig (JIRA)

[ 
https://issues.apache.org/jira/browse/COMPRESS-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031475#comment-14031475
 ] 

Stefan Bodewig commented on COMPRESS-263:
-

I've committed your patch unchanged as svn revision 1602546

Apart from the isZlibHeaderPresent name already mentioned there are three 
things I will change but you may want to discuss or provide a patch for:

* we need docs :-)
* the count invocations in the input stream are counting uncompressed bytes where 
they should be counting the compressed amount.  I think wrapping the original 
stream in a CountingInputStream is the way I'd go.
* add counting to the output stream
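
A sketch of the wrapping idea using only the JDK: a small counting filter sits underneath the inflater, so it sees compressed bytes, while the bytes returned to the caller are the uncompressed ones. `ByteCountingInputStream` is a hypothetical stand-in written for this example and is not the Commons Compress class. One caveat: the inflater reads ahead in chunks, so if other data follows the compressed stream, the count can include bytes that were buffered but not consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class CompressedCountSketch {

    /** Counts the bytes that pass through read(); skip() is not tracked in this sketch. */
    static final class ByteCountingInputStream extends FilterInputStream {
        private long count;

        ByteCountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }

        long getCount() {
            return count;
        }
    }

    /** Produces zlib-wrapped DEFLATE data for the demonstration. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Returns { compressed bytes read from the source, uncompressed bytes delivered }. */
    static long[] countBoth(byte[] zlibData) {
        ByteCountingInputStream counted =
                new ByteCountingInputStream(new ByteArrayInputStream(zlibData));
        long uncompressed = 0;
        try (InputStream in = new InflaterInputStream(counted)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                uncompressed += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] { counted.getCount(), uncompressed };
    }
}
```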

