Hey Alex,

Yeah, it looks like this could be documented better.

> trying to dump one with the script doesn't work with hello-samza.

My guess is that you tried to use wikipedia-feed.properties. That job doesn't
have checkpointing enabled, because IRC (which is where we're getting the
wikipedia feed from) isn't replayable. If you run the
wikipedia-parser.properties job, the checkpoint tool works:

  deploy/samza/bin/checkpoint-tool.sh --config-path=file:!/Code/incubator-samza-hello-samza/samza-job-package/src/main/config/wikipedia-parser.properties

You'll see:

  2014-11-10 12:20:24 CheckpointTool [INFO] Current checkpoint: systems.kafka.streams.wikipedia-raw.partitions.0 = 1533

The format for the file is:

  systems.<system name>.streams.<stream name>.partitions.<partition number>=<offset>

For example:

  systems.kafka.streams.wikipedia-raw.partitions.0=1533

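In case it's useful, here's a quick sketch of generating a file in that
format programmatically. This is plain Python I wrote for illustration;
the helper name and the new-offsets.properties filename are mine, not part
of Samza:

```python
# Sketch: build checkpoint-tool properties lines from desired offsets.
# The key layout follows the format shown above; checkpoint_line() is a
# hypothetical helper, not a Samza API.

def checkpoint_line(system, stream, partition, offset):
    """Render one systems.<system>.streams.<stream>.partitions.<n>=<offset> line."""
    return "systems.%s.streams.%s.partitions.%d=%s" % (system, stream, partition, offset)

# Example: write offsets for two partitions of the wikipedia-raw stream.
offsets = {0: 1533, 1: 987}
with open("new-offsets.properties", "w") as f:
    for partition, offset in sorted(offsets.items()):
        f.write(checkpoint_line("kafka", "wikipedia-raw", partition, offset) + "\n")
```

You could then point the checkpoint tool at the resulting file to set
offsets before starting the job.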

Cheers,
Chris

PS: you'll need to tweak the log4j.xml file to look like this:

<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <param name="Target" value="System.out" />
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n" />
    </layout>
  </appender>
  <root>
    <priority value="info" />
    <appender-ref ref="console"/>
  </root>
</log4j:configuration>

By default, the log4j config outputs to disk. You'll want it to output to the
console instead. This is fixed in 0.8.0.



On 11/10/14 10:33 AM, "Alexander Taggart" <[email protected]> wrote:

>Thanks, Chris.
>
>It's not clear to me from the documentation whether the checkpoint tool can
>be used to control the starting offset for a job that has not yet ever been
>run, and if so, how the properties file would need to be crafted.  The
>checkpoint doc page doesn't show what the properties file looks like, and
>trying to dump one with the script doesn't work with hello-samza.
>
>On Mon, Nov 10, 2014 at 11:36 AM, Chris Riccomini <
>[email protected]> wrote:
>
>> Hey Alexander,
>>
>> We have a checkpoint offset tool
>> (./samza-shell/src/main/bash/checkpoint-tool.sh), which allows you to read
>> and write offsets for all input partitions. This tool will allow you to
>> arbitrarily set offsets before a job starts.
>>
>> We also support the samza.offset.default, and samza.reset.offset
>> configurations:
>>
>>
>> 
>> http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/configuration-table.html#streams
>>
>> These allow you to specify whether a job should read from the head or tail
>> of an input stream when the job first starts.
>>
>> We don't currently support a way to change offsets once a job has already
>> started. If you can get more specific about your use case, we can suggest
>> the best approach.
>>
>> Cheers,
>> Chris
>>
>> On 11/10/14 6:53 AM, "Alexander Taggart" <[email protected]> wrote:
>>
>> >We're investigating using Samza, and one aspect of our usage would require
>> >being able to start a job such that it begins reading from a specified
>> >Kafka offset.  If I understand correctly, each job being bound to a
>> >specific partition would need to be provided with a specific offset.  Is
>> >there any facility for providing such values, either via config or via
>> >API?  If not, what might be a good approach to implementing it (e.g., a
>> >custom kafka consumer)?
>>
>>
