Re: joshua_api

2016-04-27 Thread Matt Post
;>> synchronize and compare better earlier than later.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Felix
>>>>>>> 
>>>>>>>> On 25.04.2016, at 07:44, kellen sunderland <
>>>> kellen.sunderl...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hey Matt,
>>>>>>>> 
>>>>>>>> Sorry for the late reply.  The Joshua-6 folder and tst may have just
>>>>>> been
>>>>>>>> artifacts of some symlinks I have locally.  Sorry they may have been
>>>>>> pushed
>>>>>>>> by mistake, I can clean that up.
>>>>>>>> 
>>>>>>>> Good idea to have the api code in a separate branch.  We can merge
>> the
>>>>>> work
>>>>>>>> that we've done some time next week.
>>>>>>>> 
>>>>>>>> KBestExtractor is one of the things we want to return via the API.
>> We
>>>>>>>> already have some of this implemented though as you suggest.  I'll
>> try
>>>>>> and
>>>>>>>> push the remaining work we've done into my github branch so you can
>>>>>> compare.
>>>>>>>> 
>>>>>>>> -Kellen
>>>>>>>> 
>>>>>>>>> On Mon, Apr 25, 2016 at 6:11 AM, Matt Post <p...@cs.jhu.edu>
>> wrote:
>>>>>>>>> 
>>>>>>>>> Okay, after looking at this a bit more, I have a better
>>>> understanding,
>>>>>> and
>>>>>>>>> an idea for how to move forward.
>>>>>>>>> 
>>>>>>>>> First, I see that Translation.java has provisions for structured
>>>>>> output.
>>>>>>>>> I'm guessing StructuredTranslation was added by mistake?
>>>>>>>>> 
>>>>>>>>> Moving forward, on the joshua_api branch, I was thinking of the
>>>>>> following,
>>>>>>>>> but want to make sure it doesn't collide with what you've done or
>> are
>>>>>> doing:
>>>>>>>>> 
>>>>>>>>> - Factor KBestExtractor to return Translation objects instead of
>>>>>> printing,
>>>>>>>>> and also turn it into an iterator
>>>>>>>>> 
>>>>>>>>> - There's a real discrepancy with competing forest representations.
>>>>>> There
>>>>>>>>> are operations on the hypergraph (via WalkerFunction), and then
>> also
>>>>>>>>> operations on Derivations. This leads to code that operates on
>> both.
>>>> It
>>>>>>>>> would be nice if the KBestExtractor just returned something like a
>>>>>> reduced
>>>>>>>>> "slice" of a forest forest new nodes containing only single back
>>>>>> pointers,
>>>>>>>>> representing exactly the nth-best derivation. Then we could
>>>>>> generically use
>>>>>>>>> the WalkerFunctions on that (e.g., viterbi extraction), and get rid
>>>> of
>>>>>> many
>>>>>>>>> of the DerivationVisitor classes
>>>>>>>>> 
>>>>>>>>> - Related: constructing the k-best list is expensive, even for just
>>>> the
>>>>>>>>> first item, since you have to set up all the candidate lists and so
>>>> on.
>>>>>>>>> This led to me implementing top-n = 0, where you can get the
>>>>>> translation
>>>>>>>>> and some limited information (not replayed features) via Viterbi
>>>>>> extractors
>>>>>>>>> on the hypergraph, and you only have to call KBestExtractor if you
>>>>>> actually
>>>>>>>>> want k-best lists. This leads to dual code, e.g., substitutions of
>>>>>>>>> output_format in multiple places. The first item the KBestIterator
>>>>>> returns
>>>>>>>>> should be constructed more efficiently, on the assumption that the
>>>>>> caller
>>>>>>>>> might not ask for mor

Re: joshua_api

2016-04-27 Thread kellen sunderland
Hey Matt,

If you had time that would be fantastic.  I've created a new PR in case you
want to pull it in.  There's actually 4 tests failing for me currently
(casing issues causing at least one).  If you want to wait until we fix
these tests that's also completely fine.

-Kellen

On Wed, Apr 27, 2016 at 11:32 AM, Matt Post <p...@cs.jhu.edu> wrote:

> Do you want me to fix the recapitalization? Or are you going to do that? I
> looked a bit, and it seems I'll have to add a method to get a word
> alignment object instead of just the string, so that I can poke through
> them. This approach is as good as true-casing in some languages.
>
> A few other things:
>
> - I saw a comment in the commit about the changes not working for
> phrase-based translation. Can you (or Felix) elaborate? What exactly will
> no longer work?
>
> - Currently, there are multiple places where the "output-format" string
> has to get edited (KBestExtractor and in Translation). After you push your
> changes in, I'm going to make some edits so that this all occurs in one
> place.
>
> matt
>
>
> > On Apr 27, 2016, at 2:25 PM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> >
> > Thanks for taking a look Matt,
> >
> > I think this is all we've got planned as far as changes relating to an
> API
> > would go.  We have a few more commits coming but they're just performance
> > improvements and they don't change too much in the way of interfaces or
> > method signatures.
> >
> > -Kellen
> >
> > On Wed, Apr 27, 2016 at 4:47 AM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> >> Kellen,
> >>
> >> Great. I had a chance to start looking over the ReworkedExtractions
> >> branch. I'll have some more time today. It looks good to me so far. Is
> >> there anything else you plan to do, or does that branch contain
> basically
> >> all of it (apart from the recapitalization fix, which I see should be
> >> applied more selectively, maybe only when a -recapitalize flag is
> present,
> >> to save on time).
> >>
> >> matt
> >>
> >>
> >>> On Apr 26, 2016, at 1:56 AM, kellen sunderland <
> >> kellen.sunderl...@gmail.com> wrote:
> >>>
> >>> Hey Matt,
> >>>
> >>> I've opened a new pull request with a few of our commits, feel free to
> >> take
> >>> a look when you have some time.
> >>>
> >>> More importantly I've pushed our queue of upcoming commits to the
> >> following
> >>> branch in my fork:
> >>>
> >>
> https://github.com/KellenSunderland/incubator-joshua/commits/ReworkedExtractions
> >>> .  From there you can get an idea for the work we've done so far.  I
> >>> haven't opened a PR yet for these commits because there's still some
> >>> merging I have to do (there's a few failing tests and I had to
> >> temporarily
> >>> comment out some of your casing code).  Once that's fixed I'll do a
> >> proper
> >>> PR for these commits.
> >>>
> >>> -Kellen
> >>>
> >>> On Mon, Apr 25, 2016 at 1:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >>>
> >>>> Great. On that first point, I meant that translate() would return a
> >>>> Translation object, which would know its hypergraph and could iterate
> >> over
> >>>> a KBestExtractor. In any case, though, it sounds like you are a bit
> >> ahead
> >>>> of me on this, so I'll wait for a push that I can see, and then we can
> >>>> converge on the design.
> >>>>
> >>>> matt
> >>>>
> >>>>
> >>>>> On Apr 25, 2016, at 4:10 PM, Hieber, Felix <fhie...@amazon.de>
> wrote:
> >>>>>
> >>>>> Hi Matt,
> >>>>>
> >>>>> These are some nice suggestions. Most of the work we have done is in
> >>>> line of what you propose so I would agree with Kellen that we should
> >>>> synchronize and compare better earlier than later.
> >>>>>
> >>>>> Best,
> >>>>> Felix
> >>>>>
> >>>>>> On 25.04.2016, at 07:44, kellen sunderland <
> >> kellen.sunderl...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>> Hey Matt,
> >>>>>>
> >>>>>> Sorry for the late reply.  The Joshua-6 folder and tst may have jus

Re: joshua_api

2016-04-27 Thread Matt Post
Do you want me to fix the recapitalization? Or are you going to do that? I 
looked a bit, and it seems I'll have to add a method to get a word alignment 
object instead of just the string, so that I can poke through them. This 
approach is as good as true-casing in some languages.

A few other things:

- I saw a comment in the commit about the changes not working for phrase-based 
translation. Can you (or Felix) elaborate? What exactly will no longer work?

- Currently, there are multiple places where the "output-format" string has to 
get edited (KBestExtractor and in Translation). After you push your changes in, 
I'm going to make some edits so that this all occurs in one place.

matt


> On Apr 27, 2016, at 2:25 PM, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Thanks for taking a look Matt,
> 
> I think this is all we've got planned as far as changes relating to an API
> would go.  We have a few more commits coming but they're just performance
> improvements and they don't change too much in the way of interfaces or
> method signatures.
> 
> -Kellen
> 
> On Wed, Apr 27, 2016 at 4:47 AM, Matt Post <p...@cs.jhu.edu> wrote:
> 
>> Kellen,
>> 
>> Great. I had a chance to start looking over the ReworkedExtractions
>> branch. I'll have some more time today. It looks good to me so far. Is
>> there anything else you plan to do, or does that branch contain basically
>> all of it (apart from the recapitalization fix, which I see should be
>> applied more selectively, maybe only when a -recapitalize flag is present,
>> to save on time).
>> 
>> matt
>> 
>> 
>>> On Apr 26, 2016, at 1:56 AM, kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>>> 
>>> Hey Matt,
>>> 
>>> I've opened a new pull request with a few of our commits, feel free to
>> take
>>> a look when you have some time.
>>> 
>>> More importantly I've pushed our queue of upcoming commits to the
>> following
>>> branch in my fork:
>>> 
>> https://github.com/KellenSunderland/incubator-joshua/commits/ReworkedExtractions
>>> .  From there you can get an idea for the work we've done so far.  I
>>> haven't opened a PR yet for these commits because there's still some
>>> merging I have to do (there's a few failing tests and I had to
>> temporarily
>>> comment out some of your casing code).  Once that's fixed I'll do a
>> proper
>>> PR for these commits.
>>> 
>>> -Kellen
>>> 
>>> On Mon, Apr 25, 2016 at 1:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> Great. On that first point, I meant that translate() would return a
>>>> Translation object, which would know its hypergraph and could iterate
>> over
>>>> a KBestExtractor. In any case, though, it sounds like you are a bit
>> ahead
>>>> of me on this, so I'll wait for a push that I can see, and then we can
>>>> converge on the design.
>>>> 
>>>> matt
>>>> 
>>>> 
>>>>> On Apr 25, 2016, at 4:10 PM, Hieber, Felix <fhie...@amazon.de> wrote:
>>>>> 
>>>>> Hi Matt,
>>>>> 
>>>>> These are some nice suggestions. Most of the work we have done is in
>>>> line of what you propose so I would agree with Kellen that we should
>>>> synchronize and compare better earlier than later.
>>>>> 
>>>>> Best,
>>>>> Felix
>>>>> 
>>>>>> On 25.04.2016, at 07:44, kellen sunderland <
>> kellen.sunderl...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>> Hey Matt,
>>>>>> 
>>>>>> Sorry for the late reply.  The Joshua-6 folder and tst may have just
>>>> been
>>>>>> artifacts of some symlinks I have locally.  Sorry they may have been
>>>> pushed
>>>>>> by mistake, I can clean that up.
>>>>>> 
>>>>>> Good idea to have the api code in a separate branch.  We can merge the
>>>> work
>>>>>> that we've done some time next week.
>>>>>> 
>>>>>> KBestExtractor is one of the things we want to return via the API.  We
>>>>>> already have some of this implemented though as you suggest.  I'll try
>>>> and
>>>>>> push the remaining work we've done into my github branch so you can
>>>> compare.
>>>>>> 
>>>>>> -Kellen
>>>>&g

Re: joshua_api

2016-04-27 Thread kellen sunderland
Thanks for taking a look Matt,

I think this is all we've got planned as far as changes relating to an API
would go.  We have a few more commits coming but they're just performance
improvements and they don't change too much in the way of interfaces or
method signatures.

-Kellen

On Wed, Apr 27, 2016 at 4:47 AM, Matt Post <p...@cs.jhu.edu> wrote:

> Kellen,
>
> Great. I had a chance to start looking over the ReworkedExtractions
> branch. I'll have some more time today. It looks good to me so far. Is
> there anything else you plan to do, or does that branch contain basically
> all of it (apart from the recapitalization fix, which I see should be
> applied more selectively, maybe only when a -recapitalize flag is present,
> to save on time).
>
> matt
>
>
> > On Apr 26, 2016, at 1:56 AM, kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
> >
> > Hey Matt,
> >
> > I've opened a new pull request with a few of our commits, feel free to
> take
> > a look when you have some time.
> >
> > More importantly I've pushed our queue of upcoming commits to the
> following
> > branch in my fork:
> >
> https://github.com/KellenSunderland/incubator-joshua/commits/ReworkedExtractions
> > .  From there you can get an idea for the work we've done so far.  I
> > haven't opened a PR yet for these commits because there's still some
> > merging I have to do (there's a few failing tests and I had to
> temporarily
> > comment out some of your casing code).  Once that's fixed I'll do a
> proper
> > PR for these commits.
> >
> > -Kellen
> >
> > On Mon, Apr 25, 2016 at 1:35 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> >> Great. On that first point, I meant that translate() would return a
> >> Translation object, which would know its hypergraph and could iterate
> over
> >> a KBestExtractor. In any case, though, it sounds like you are a bit
> ahead
> >> of me on this, so I'll wait for a push that I can see, and then we can
> >> converge on the design.
> >>
> >> matt
> >>
> >>
> >>> On Apr 25, 2016, at 4:10 PM, Hieber, Felix <fhie...@amazon.de> wrote:
> >>>
> >>> Hi Matt,
> >>>
> >>> These are some nice suggestions. Most of the work we have done is in
> >> line of what you propose so I would agree with Kellen that we should
> >> synchronize and compare better earlier than later.
> >>>
> >>> Best,
> >>> Felix
> >>>
> >>>> On 25.04.2016, at 07:44, kellen sunderland <
> kellen.sunderl...@gmail.com>
> >> wrote:
> >>>>
> >>>> Hey Matt,
> >>>>
> >>>> Sorry for the late reply.  The Joshua-6 folder and tst may have just
> >> been
> >>>> artifacts of some symlinks I have locally.  Sorry they may have been
> >> pushed
> >>>> by mistake, I can clean that up.
> >>>>
> >>>> Good idea to have the api code in a separate branch.  We can merge the
> >> work
> >>>> that we've done some time next week.
> >>>>
> >>>> KBestExtractor is one of the things we want to return via the API.  We
> >>>> already have some of this implemented though as you suggest.  I'll try
> >> and
> >>>> push the remaining work we've done into my github branch so you can
> >> compare.
> >>>>
> >>>> -Kellen
> >>>>
> >>>>> On Mon, Apr 25, 2016 at 6:11 AM, Matt Post <p...@cs.jhu.edu> wrote:
> >>>>>
> >>>>> Okay, after looking at this a bit more, I have a better
> understanding,
> >> and
> >>>>> an idea for how to move forward.
> >>>>>
> >>>>> First, I see that Translation.java has provisions for structured
> >> output.
> >>>>> I'm guessing StructuredTranslation was added by mistake?
> >>>>>
> >>>>> Moving forward, on the joshua_api branch, I was thinking of the
> >> following,
> >>>>> but want to make sure it doesn't collide with what you've done or are
> >> doing:
> >>>>>
> >>>>> - Factor KBestExtractor to return Translation objects instead of
> >> printing,
> >>>>> and also turn it into an iterator
> >>>>>
> >>>>> - There's a real discrepancy with competing forest representations.
> >> There
> >>>>> are operatio

Re: joshua_api

2016-04-25 Thread Hieber, Felix
Hi Matt, 

These are some nice suggestions. Most of the work we have done is in line of 
what you propose so I would agree with Kellen that we should synchronize and 
compare better earlier than later.

Best,
Felix

> On 25.04.2016, at 07:44, kellen sunderland <kellen.sunderl...@gmail.com> 
> wrote:
> 
> Hey Matt,
> 
> Sorry for the late reply.  The Joshua-6 folder and tst may have just been
> artifacts of some symlinks I have locally.  Sorry they may have been pushed
> by mistake, I can clean that up.
> 
> Good idea to have the api code in a separate branch.  We can merge the work
> that we've done some time next week.
> 
> KBestExtractor is one of the things we want to return via the API.  We
> already have some of this implemented though as you suggest.  I'll try and
> push the remaining work we've done into my github branch so you can compare.
> 
> -Kellen
> 
>> On Mon, Apr 25, 2016 at 6:11 AM, Matt Post <p...@cs.jhu.edu> wrote:
>> 
>> Okay, after looking at this a bit more, I have a better understanding, and
>> an idea for how to move forward.
>> 
>> First, I see that Translation.java has provisions for structured output.
>> I'm guessing StructuredTranslation was added by mistake?
>> 
>> Moving forward, on the joshua_api branch, I was thinking of the following,
>> but want to make sure it doesn't collide with what you've done or are doing:
>> 
>> - Factor KBestExtractor to return Translation objects instead of printing,
>> and also turn it into an iterator
>> 
>> - There's a real discrepancy with competing forest representations. There
>> are operations on the hypergraph (via WalkerFunction), and then also
>> operations on Derivations. This leads to code that operates on both. It
>> would be nice if the KBestExtractor just returned something like a reduced
>> "slice" of a forest forest new nodes containing only single back pointers,
>> representing exactly the nth-best derivation. Then we could generically use
>> the WalkerFunctions on that (e.g., viterbi extraction), and get rid of many
>> of the DerivationVisitor classes
>> 
>> - Related: constructing the k-best list is expensive, even for just the
>> first item, since you have to set up all the candidate lists and so on.
>> This led to me implementing top-n = 0, where you can get the translation
>> and some limited information (not replayed features) via Viterbi extractors
>> on the hypergraph, and you only have to call KBestExtractor if you actually
>> want k-best lists. This leads to dual code, e.g., substitutions of
>> output_format in multiple places. The first item the KBestIterator returns
>> should be constructed more efficiently, on the assumption that the caller
>> might not ask for more items. The StructuredTranslation object already is
>> lazy about returning things that are asked for (e.g., it will only replay
>> features if you ask for the feature functions).
>> 
>> I will probably implement most of these tonight and tomorrow unless there
>> are objections from anyone (including an objection asking for more time to
>> evaluate!)
>> 
>> matt
>> 
>> 
>>> On Apr 23, 2016, at 7:22 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>> Hi,
>>> 
>>> Kellen suggested we create a Joshua API, which I think is an excellent
>> idea. I've just made a start at this. It is not done and needs more work,
>> but I know that the Amazon folks have done some things on the backend, and
>> I wanted to make sure not to duplicate any work they might have done. Also,
>> it's something we should discuss.
>>> 
>>> First, I was a bit confused about the joshua-6 subdirectory, and the
>> files there (also, what is tst/? Both of these were from a recent commit).
>> I moved those over and then things didn't compile. I got things compiling
>> and then made a few changes to StructuredTranslation.
>>> 
>>> The biggest change I hope doesn't create problems is that I simplified
>> StructuredTranslation to no longer contain the Hypergraph object; instead,
>> it contains a DerivationState object. This represents a particular k-best
>> derivation, using Huang & Chiang (2005)-style ranked back pointers. The
>> nice thing is that you can simplify define a DerivationVisitor class and
>> pass it to DeriviationState::visit, and it will see every node in a
>> particular derivation.
>>> 
>>> This is distinct from WalkerFunction, which walks an entire *HyperGraph*.
>>> 
>>> Let me know what you guys thing about these changes, and maybe we can
>> spe

Re: joshua_api

2016-04-25 Thread kellen sunderland
Hey Matt,

Sorry for the late reply.  The Joshua-6 folder and tst may have just been
artifacts of some symlinks I have locally.  Sorry they may have been pushed
by mistake, I can clean that up.

Good idea to have the api code in a separate branch.  We can merge the work
that we've done some time next week.

KBestExtractor is one of the things we want to return via the API.  We
already have some of this implemented though as you suggest.  I'll try and
push the remaining work we've done into my github branch so you can compare.

-Kellen

On Mon, Apr 25, 2016 at 6:11 AM, Matt Post <p...@cs.jhu.edu> wrote:

> Okay, after looking at this a bit more, I have a better understanding, and
> an idea for how to move forward.
>
> First, I see that Translation.java has provisions for structured output.
> I'm guessing StructuredTranslation was added by mistake?
>
> Moving forward, on the joshua_api branch, I was thinking of the following,
> but want to make sure it doesn't collide with what you've done or are doing:
>
> - Factor KBestExtractor to return Translation objects instead of printing,
> and also turn it into an iterator
>
> - There's a real discrepancy with competing forest representations. There
> are operations on the hypergraph (via WalkerFunction), and then also
> operations on Derivations. This leads to code that operates on both. It
> would be nice if the KBestExtractor just returned something like a reduced
> "slice" of a forest forest new nodes containing only single back pointers,
> representing exactly the nth-best derivation. Then we could generically use
> the WalkerFunctions on that (e.g., viterbi extraction), and get rid of many
> of the DerivationVisitor classes
>
> - Related: constructing the k-best list is expensive, even for just the
> first item, since you have to set up all the candidate lists and so on.
> This led to me implementing top-n = 0, where you can get the translation
> and some limited information (not replayed features) via Viterbi extractors
> on the hypergraph, and you only have to call KBestExtractor if you actually
> want k-best lists. This leads to dual code, e.g., substitutions of
> output_format in multiple places. The first item the KBestIterator returns
> should be constructed more efficiently, on the assumption that the caller
> might not ask for more items. The StructuredTranslation object already is
> lazy about returning things that are asked for (e.g., it will only replay
> features if you ask for the feature functions).
>
> I will probably implement most of these tonight and tomorrow unless there
> are objections from anyone (including an objection asking for more time to
> evaluate!)
>
> matt
>
>
> > On Apr 23, 2016, at 7:22 PM, Matt Post <p...@cs.jhu.edu> wrote:
> >
> > Hi,
> >
> > Kellen suggested we create a Joshua API, which I think is an excellent
> idea. I've just made a start at this. It is not done and needs more work,
> but I know that the Amazon folks have done some things on the backend, and
> I wanted to make sure not to duplicate any work they might have done. Also,
> it's something we should discuss.
> >
> > First, I was a bit confused about the joshua-6 subdirectory, and the
> files there (also, what is tst/? Both of these were from a recent commit).
> I moved those over and then things didn't compile. I got things compiling
> and then made a few changes to StructuredTranslation.
> >
> > The biggest change I hope doesn't create problems is that I simplified
> StructuredTranslation to no longer contain the Hypergraph object; instead,
> it contains a DerivationState object. This represents a particular k-best
> derivation, using Huang & Chiang (2005)-style ranked back pointers. The
> nice thing is that you can simplify define a DerivationVisitor class and
> pass it to DeriviationState::visit, and it will see every node in a
> particular derivation.
> >
> > This is distinct from WalkerFunction, which walks an entire *HyperGraph*.
> >
> > Let me know what you guys thing about these changes, and maybe we can
> spec out the API, and then clean things up inside a bit to use it (there's
> no reason to be passing output stream writers to KBestExtractor, for
> example...).
> >
> > matt
> >
> >
> >
> >> Begin forwarded message:
> >>
> >> From: mjp...@apache.org
> >> Subject: incubator-joshua git commit: Simplified StructuredTranslation
> to use derivations instead of hypergraphs, now using in KBestExtractor
> >> Date: April 23, 2016 at 7:12:19 PM EDT
> >> To: comm...@joshua.incubator.apache.org
> >> Reply-To: dev@joshua.incubator.apache.org
> >>
> >> Repository: incubator-joshua
&g