Re:

2017-12-01 Thread NerdyNick
Assuming you're in Java, you could just follow on in your main method,
checking the state of the PipelineResult.

Example:
PipelineResult result = pipeline.run();
try {
    // Block until the pipeline terminates, then check how it finished.
    result.waitUntilFinish();
    if (result.getState() == PipelineResult.State.DONE) {
        // Pipeline completed successfully; do the ES work here.
    }
} catch (Exception e) {
    // Waiting failed or was interrupted; cancel the pipeline and rethrow.
    result.cancel();
    throw e;
}
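
For the "do ES work" part in the alias-swap scenario discussed below, a minimal
sketch of the single POST to the Elasticsearch _aliases endpoint could look like
this helper (plain java.net, no client library; the host, alias, and index names
are placeholders):

private static void swapAlias(String esHost, String alias,
                              String oldIndex, String newIndex) throws Exception {
    // Atomically remove the alias from the old index and add it to the new one.
    String body = "{\"actions\":["
        + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
        + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}]}";
    java.net.HttpURLConnection conn = (java.net.HttpURLConnection)
        new java.net.URL(esHost + "/_aliases").openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (java.io.OutputStream os = conn.getOutputStream()) {
        os.write(body.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }
    if (conn.getResponseCode() != 200) {
        throw new RuntimeException("Alias swap failed: HTTP " + conn.getResponseCode());
    }
}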

Otherwise, you could also use Oozie to construct a workflow.

On Fri, Dec 1, 2017 at 2:02 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi,
>
> yes, we had a similar question some days ago.
>
> We could imagine having a user callback fn fired when the sink batch is
> complete.
>
> Let me think about that.
>
> Regards
> JB
>
> On 12/01/2017 09:04 AM, Philip Chan wrote:
>
>> Hey JB,
>>
>> Thanks for getting back so quickly.
>> I suppose in that case I would need a way of monitoring when the ES
>> transform completes successfully before I can proceed with doing the swap.
>> The problem with this is that I can't think of a good way to determine
>> that termination state short of polling the new index to check the document
>> count compared to the size of the input PCollection.
>> That, or maybe I'd need to use an external system like you mentioned to
>> poll on the state of the pipeline (I'm using Google Dataflow, so maybe
>> there's a way to do this with some API).
>> But I would have thought that there would be an easy way of simply saying
>> "do not process this transform until this other transform completes".
>> Is there no established way of "signaling" between pipelines when some
>> pipeline completes, or some way of declaring a dependency of one transform
>> on another?
>>
>> Thanks again,
>> Philip
>>
>> On Thu, Nov 30, 2017 at 11:44 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Philip,
>>
>> You won't be able to do (3) in the same pipeline, as the Elasticsearch Sink
>> PTransform ends the pipeline with PDone.
>>
>> So, (3) has to be done in another pipeline (using a DoFn) or in another
>> "system" (like Camel, for instance). I would do a check of the data in the
>> index and then trigger the swap there.
>>
>> Regards
>> JB
>>
>> On 12/01/2017 08:41 AM, Philip Chan wrote:
>>
>> Hi,
>>
>> I'm pretty new to Beam, and I've been trying to use the
>> ElasticSearchIO
>> sink to write docs into ES.
>> With this, I want to be able to
>> 1. ingest and transform rows from DB (done)
>> 2. write JSON docs/strings into a new ES index (done)
>> 3. After (2) is complete and all documents are written into a new index,
>> trigger an atomic index swap under an alias to replace the current
>> aliased index with the new index generated in step 2. This is basically
>> a single POST request to the ES cluster.
>>
>> The problem I'm facing is that I don't seem to be able to find a way to
>> have (3) happen after step (2) is complete.
>>
>> The ElasticSearchIO.Write transform returns a PDone, and I'm not sure how
>> to proceed from there because it doesn't seem to let me do another apply
>> on it to "define" a dependency.
>> https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html
>>
>> Is there a recommended way to construct pipelines workflows like
>> this?
>>
>> Thanks in advance,
>> Philip
>>
>>
>> -- Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
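
For the document-count check Philip mentions above (comparing the new index's
count against the expected input size), a rough sketch against the Elasticsearch
_count endpoint might look like the following; the host and index names are
placeholders, and the crude string parsing is just for illustration:

private static long countDocs(String esHost, String index) throws Exception {
    java.net.HttpURLConnection conn = (java.net.HttpURLConnection)
        new java.net.URL(esHost + "/" + index + "/_count").openConnection();
    conn.setRequestMethod("GET");
    String response;
    try (java.util.Scanner s = new java.util.Scanner(conn.getInputStream(), "UTF-8")) {
        // Read the whole response body as one string.
        response = s.useDelimiter("\\A").next();
    }
    // The response looks like {"count":12345,...}; pull the number out with a regex.
    java.util.regex.Matcher m =
        java.util.regex.Pattern.compile("\"count\":(\\d+)").matcher(response);
    if (!m.find()) {
        throw new RuntimeException("Unexpected _count response: " + response);
    }
    return Long.parseLong(m.group(1));
}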



-- 
Nick Verbeck - NerdyNick

NerdyNick.com
TrailsOffroad.com
NoKnownBoundaries.com


Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-16 Thread NerdyNick
> beans
>>>> and have a String id field. Matching Java types to ES types is quite
>>>> tricky, so, for now, we just relied on the user to serialize his beans
>>>> into JSON and let ES match the types automatically.
>>>>
>>>> Related to the problems you raise below:
>>>>
>>>> 1. Well, the bundle is the commit entity of Beam. Consider the case of
>>>> ESIO.batchSize being smaller than the bundle size. While processing
>>>> records, when the number of elements reaches batchSize, an ES bulk
>>>> insert will be issued, but no finishBundle. If there is a problem later
>>>> on in the bundle processing, before the finishBundle, the checkpoint
>>>> will still be at the beginning of the bundle, so the whole bundle will
>>>> be retried, leading to duplicate documents. Thanks for raising that! I'm
>>>> CCing the dev list so that someone can correct me on the checkpointing
>>>> mechanism if I'm missing something. Besides, I'm thinking about forcing
>>>> the user to provide an id in all cases to work around this issue.
>>>>
>>>> 2. Correct.
>>>>
>>>> Best,
>>>> Etienne
>>>>
>>>> Le 15/11/2017 à 02:16, Chet Aldrich a écrit :
>>>>
>>>>> Hello all!
>>>>>
>>>>> So I’ve been using the ElasticSearchIO sink for a project
>>>>> (unfortunately
>>>>> it’s Elasticsearch 5.x, and so I’ve been messing around with the
>>>>> latest
>>>>> RC) and I’m finding that it doesn’t allow for changing the
>>>>> document ID,
>>>>> but only lets you pass in a record, which means that the document
>>>>> ID is
>>>>> auto-generated. See this line for what specifically is happening:
>>>>>
>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L838
>>>>>
>>>>> Essentially the data part of the document is being placed but it
>>>>> doesn’t
>>>>> allow for other properties, such as the document ID, to be set.
>>>>>
>>>>> This leads to two problems:
>>>>>
>>>>> 1. Beam doesn’t necessarily guarantee exactly-once execution for a
>>>>> given
>>>>> item in a PCollection, as I understand it. This means that you may
>>>>> get
>>>>> more than one record in Elastic for a given item in a PCollection
>>>>> that
>>>>> you pass in.
>>>>>
>>>>> 2. You can’t do partial updates to an index. If you run a batch job
>>>>> once, and then run the batch job again on the same index without
>>>>> clearing it, you just double everything in there.
>>>>>
>>>>> Is there any good way around this?
>>>>>
>>>>> I’d be happy to try writing up a PR for this in theory, but not
>>>>> sure how
>>>>> to best approach it. Also would like to figure out a way to get
>>>>> around
>>>>> this in the meantime, if anyone has any ideas.
>>>>>
>>>>> Best,
>>>>>
>>>>> Chet
>>>>>
>>>>> P.S. CCed echauc...@gmail.com because
>>>>> it
>>>>> seems like he’s been doing work related to the elastic sink.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
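
As an illustration of the explicit-id workaround Etienne describes above, here
is a sketch of supplying a deterministic _id through the plain Elasticsearch
bulk format (this is not the ElasticsearchIO API, which currently exposes no id
hook; the index/type names and document are placeholders):

private static String bulkLineWithId(String index, String docJson) throws Exception {
    // Derive a stable id from the document content, so a retried bundle
    // overwrites the same document instead of creating a duplicate.
    byte[] hash = java.security.MessageDigest.getInstance("SHA-256")
        .digest(docJson.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    StringBuilder id = new StringBuilder();
    for (byte b : hash) {
        id.append(String.format("%02x", b & 0xff));
    }
    // Bulk action metadata line carrying the explicit _id, followed by the document line.
    return "{\"index\":{\"_index\":\"" + index + "\",\"_type\":\"doc\",\"_id\":\""
        + id + "\"}}\n" + docJson + "\n";
}

The resulting lines can be concatenated and POSTed to the cluster's /_bulk
endpoint, at which point re-sending the same document simply overwrites it.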



-- 
Nick Verbeck - NerdyNick

NerdyNick.com
TrailsOffroad.com
NoKnownBoundaries.com


KafkaIO Autocommit

2017-11-06 Thread NerdyNick
There seem to be a lot of oddities with the auto offset committer and the
watermark management, as well as Kafka offsets in general.

One issue I keep having is that the auto committer will just not commit any
offsets, so the topic looks like it's backing up. From what I've been able to
trace, it appears to be related to the executor/thread shutting down before
the auto commit has a chance to run. Even though the min read times are set,
it still shuts down prematurely. Turning the auto commit interval down seems
to help but doesn't resolve the issue; it just seems to let it correct itself
much quicker.

Another issue I just had happen: after restarting a pipeline, the
auto-committed offsets reset to the earliest record and the pipeline appears
to be working on those records, which is odd and contrary to a lot of things.
When I shut the pipeline down it was only a few thousand records behind, and
the consumer is configured to start at the latest offset, not the earliest.
Given that, it would appear the recorded watermarks had some odd corruption
where they believed they were in the past.
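
For context, a minimal sketch of the kind of consumer setup being described,
assuming the updateConsumerProperties hook that KafkaIO exposes for passing raw
Kafka consumer properties through (broker, topic, and interval values are
placeholders):

// Assumes: import org.apache.beam.sdk.io.kafka.KafkaIO;
//          import org.apache.kafka.common.serialization.StringDeserializer;
java.util.Map<String, Object> consumerProps = new java.util.HashMap<>();
consumerProps.put("auto.offset.reset", "latest");    // start from the latest offset
consumerProps.put("enable.auto.commit", true);       // rely on Kafka's auto committer
consumerProps.put("auto.commit.interval.ms", 5000);  // shorter commit interval as a mitigation

KafkaIO.Read<String, String> read = KafkaIO.<String, String>read()
    .withBootstrapServers("broker1:9092")
    .withTopic("my-topic")
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(consumerProps);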

-- 
Nick Verbeck - NerdyNick

NerdyNick.com
Coloco.ubuntu-rocks.org


Regarding Beam Slack Channel

2017-10-13 Thread NerdyNick
Hello

Can someone please add me to the Beam Slack channel?

Thanks.