Re: Kafka Connector for Solr

2016-04-24 Thread Gwen Shapira
Thank you, Surendra.

I've added your connector to the Connector Hub page:
http://www.confluent.io/developers/connectors


On Fri, Apr 22, 2016 at 10:11 PM, Surendra Manchikanti wrote:
> Hi Jay,
>
> Thanks!! Can you please share the contact person to include this in the
> Confluent Connector Hub page?
>
> Regards,
> Surendra M
>
> -- Surendra Manchikanti
>
> On Fri, Apr 22, 2016 at 4:32 PM, Jay Kreps wrote:
>
>> This is great!
>>
>> -Jay
>>
>> On Fri, Apr 22, 2016 at 2:28 PM, Surendra Manchikanti <surendra.manchika...@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I have implemented a Kafka Connector for Solr. Please find the GitHub
>> > link below.
>> >
>> > https://github.com/msurendra/kafka-connect-solr
>> >
>> > The initial release has the SolrSinkConnector only; the SolrSourceConnector
>> > is under development and will be added soon.
>> >
>> > Regards,
>> > Surendra M
>> >
>>
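
For readers curious what such a sink involves, here is a minimal, hypothetical
sketch of a Kafka Connect SinkTask that indexes records into Solr via SolrJ's
(5.x-era) HttpSolrClient. It is not the kafka-connect-solr code linked above;
the "solr.url" config key and the field mapping are illustrative assumptions.

    import java.io.IOException;
    import java.util.Collection;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.errors.ConnectException;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Hypothetical sketch only; see the GitHub link above for the real connector.
    public class SolrSinkTaskSketch extends SinkTask {

        private HttpSolrClient client;

        @Override
        public void start(Map<String, String> props) {
            // "solr.url" is an assumed config key for illustration.
            client = new HttpSolrClient(props.get("solr.url"));
        }

        @Override
        public void put(Collection<SinkRecord> records) {
            try {
                for (SinkRecord record : records) {
                    SolrInputDocument doc = new SolrInputDocument();
                    // Illustrative mapping: key becomes the Solr id, value a field.
                    doc.addField("id", String.valueOf(record.key()));
                    doc.addField("value", String.valueOf(record.value()));
                    client.add(doc);
                }
            } catch (SolrServerException | IOException e) {
                throw new ConnectException("Failed to index records in Solr", e);
            }
        }

        @Override
        public void flush(Map<TopicPartition, OffsetAndMetadata> offsets) {
            try {
                client.commit(); // make documents visible before offsets commit
            } catch (SolrServerException | IOException e) {
                throw new ConnectException("Failed to commit to Solr", e);
            }
        }

        @Override
        public void stop() {
            try {
                client.close();
            } catch (IOException e) {
                // best-effort shutdown
            }
        }

        @Override
        public String version() {
            return "sketch";
        }
    }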


Re: How to "buffer" a stream with high churn and output only at the end of a window?

2016-04-24 Thread Guozhang Wang
Henry,

Thanks for the great feedback. I'm drafting a proposal for adding a
control mechanism for the latency vs. data volume tradeoff, which I will put
up on the wiki once it is done:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Discussions

We can continue the discussion from there. In the end we need to propose a
KIP for this feature.


Guozhang


On Wed, Apr 20, 2016 at 10:57 PM, Henry Cai wrote:

> My use case is actually myTable.aggregate().to("output_topic"), so I need a
> way to reduce the number of outputs.
>
> I don't think correlating the internal cache flush with the output window
> emit frequency is ideal.  It's too hard for an application developer to see
> when the cache will be flushed; we would like to see a clearly defined
> window emit policy.  I think emit on window end (or plus an application-
> defined delay) is very easy to understand, and it's OK to re-emit when we have
> late-arrival events since this doesn't happen that often (if we use the Google
> Dataflow concept, the window end is based on a user-defined watermark
> function which should already add the buffer for most common processing
> delays).
>
> Another problem with the cache flush is that it actually happens quite often,
> e.g. since the in-memory cache doesn't support range scans and most lookups
> on a window-based store need to do a range scan, which would trigger the flush
> first.
>
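
To make the emit-on-window-end policy Henry is asking for concrete, here is a
hypothetical sketch against the 0.10-era low-level Processor API: it buffers
the latest value per key and forwards everything only when punctuate() fires
at a fixed interval. This is not an existing Kafka Streams feature; the
interval, the key/value types, and the forwarding logic are illustrative
assumptions.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.streams.processor.Processor;
    import org.apache.kafka.streams.processor.ProcessorContext;

    // Hypothetical emit-on-interval processor, not an existing Streams feature.
    public class EmitOnIntervalProcessor implements Processor<String, Long> {

        private static final long EMIT_INTERVAL_MS = 60_000L; // assumed window length

        private ProcessorContext context;
        private final Map<String, Long> pending = new HashMap<>();

        @Override
        public void init(ProcessorContext context) {
            this.context = context;
            context.schedule(EMIT_INTERVAL_MS); // request periodic punctuate() calls
        }

        @Override
        public void process(String key, Long value) {
            // Suppress intermediate updates: keep only the latest value per key.
            pending.put(key, value);
        }

        @Override
        public void punctuate(long timestamp) {
            // Emit everything buffered so far, once per interval.
            for (Map.Entry<String, Long> entry : pending.entrySet()) {
                context.forward(entry.getKey(), entry.getValue());
            }
            pending.clear();
        }

        @Override
        public void close() {}
    }

Note that, as Jay points out below, late arrivals would still force
re-emission, and an in-memory buffer like this is lost on failure unless
backed by a state store.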
> On Wed, Apr 20, 2016 at 9:13 PM, Jay Kreps wrote:
>
> > To summarize a chat session Guozhang and I just had on slack:
> >
> > We currently do dedupe the output for stateful operations (e.g. join,
> > aggregate). They hit an in-memory cache and only produce output to
> > rocksdb/kafka when either that cache fills up or the commit period occurs.
> > So the changelog for these operations, which is often also the output,
> > already gets this deduplication. Controlling the commit frequency and cache
> > size is probably the right way to trade off latency of the update vs. update
> > volume.
> >
> > The operation we don't dedupe is to()/through(). So, say you do an
> > operation like
> >    myTable.aggregate().filter().map().to("output_topic")
> > Here the aggregation itself (and hence its changelog) isn't the intended
> > output, but rather some transformed version of it. In this case the issue
> > you describe is correct: we don't dedupe. There might be several options
> > here. One would be for the aggregate to produce deduped output lazily. The
> > other would be for the to() operator to also dedupe.
> >
> > Personally I feel this idea of caching to suppress outputs is actually a
> > better way to model and think about what's going on than trying to have a
> > triggering policy. If you set a triggering policy that says "only output at
> > the end of the window", the reality is that if late data comes you still
> > have to produce additional outputs. So you don't produce one output at the
> > end but rather potentially any number of outputs. So a simple way to think
> > about this is that you produce all updates but optimistically suppress some
> > duplicates for efficiency.
> >
> > -Jay
> >
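
To make the cache behavior Jay describes above concrete, here is a plain-Java
sketch (not the actual Streams internals; the size bound and the downstream
callback are illustrative): updates overwrite in place, and downstream only
sees the latest value per key when the cache fills up or the commit interval
elapses.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.BiConsumer;

    // Plain-Java sketch of the dedup cache idea, not the actual Streams internals.
    public class DedupCache<K, V> {

        private final int maxEntries;
        private final long commitIntervalMs;
        private final BiConsumer<K, V> downstream;
        private final Map<K, V> cache = new HashMap<>();
        private long lastFlushMs = System.currentTimeMillis();

        public DedupCache(int maxEntries, long commitIntervalMs,
                          BiConsumer<K, V> downstream) {
            this.maxEntries = maxEntries;
            this.commitIntervalMs = commitIntervalMs;
            this.downstream = downstream;
        }

        public void put(K key, V value) {
            // Newer updates overwrite older ones in place: this is the dedup.
            cache.put(key, value);
            long now = System.currentTimeMillis();
            if (cache.size() >= maxEntries || now - lastFlushMs >= commitIntervalMs) {
                flush(now);
            }
        }

        private void flush(long now) {
            // Only here does downstream see output: one record per key,
            // carrying the latest value, as described above.
            for (Map.Entry<K, V> entry : cache.entrySet()) {
                downstream.accept(entry.getKey(), entry.getValue());
            }
            cache.clear();
            lastFlushMs = now;
        }
    }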
> > On Wed, Apr 20, 2016 at 5:24 PM, Henry Cai wrote:
> >
> > > 0.10.0.1 is fine for me; I am actually building from trunk head for the
> > > streams package.
> > >
> > > On Wed, Apr 20, 2016 at 5:06 PM, Guozhang Wang wrote:
> > >
> > > > I saw that note, thanks for commenting.
> > > >
> > > > We are cutting the next 0.10.0.0 RC next week, so I am not certain if it
> > > > will make it into 0.10.0.0. But we can push it to be in 0.10.0.1.
> > > >
> > > > Guozhang
> > > >
> > > > On Wed, Apr 20, 2016 at 4:57 PM, Henry Cai wrote:
> > > >
> > > > > Thanks.
> > > > >
> > > > > Do you know when KAFKA-3101 will be implemented?
> > > > >
> > > > > I also added a note to that JIRA for a left outer join use case, which
> > > > > also needs buffer support.
> > > > >
> > > > >
> > > > > On Wed, Apr 20, 2016 at 4:42 PM, Guozhang Wang wrote:
> > > > >
> > > > > > Henry,
> > > > > >
> > > > > > I thought you were concerned about consumer memory contention. That's a
> > > > > > valid point, and yes, you need to keep those buffered records in a
> > > > > > persistent store.
> > > > > >
> > > > > > As I mentioned, we are trying to optimize the aggregation outputs as in
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/KAFKA-3101
> > > > > >
> > > > > > Its idea is very similar to buffering: while we keep the aggregated values
> > > > > > in RocksDB, we do not send the updated values for each received record but
> > > > > > only do so based on some policy. More generally, we can have a trigger
> > > > > > mechanism for users to customize when to emit.
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 20, 2016 at 4:03 PM, Henry Cai wrote:
> > > > > >
> > > > > > > I thin

Re: poll() semantics

2016-04-24 Thread Richard Rodseth
Thanks. Yes, I get that it's bytes.
Good to know about the new setting.

On Sun, Apr 24, 2016 at 10:19 AM, Jens Rantil  wrote:

> Hi Richard,
>
> > which defaults to a very large number, will affect the number of
> > records returned by each call to poll()
>
> No, it will affect the total sum of the message sizes fetched. This is not
> the same as "number of messages". The upcoming 0.9.1 release (not out yet)
> will contain a setting that allows you to set a cap on the maximum number
> of messages that poll() returns. See also
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records
> .
>
> Cheers,
> Jens
>
> On Sat, Apr 23, 2016 at 2:20 AM Richard Rodseth wrote:
>
> > To answer my own question (partially), I have learned that
> >
> > max.partition.fetch.bytes
> >
> > , which defaults to a very large number, will affect the number of
> > records returned by each call to poll()
> >
> > I also learned that seekToBeginning is a partition-level thing, but
> >
> >  props.put("auto.offset.reset","earliest")
> > has the desired effect.
> >
> > On Fri, Apr 22, 2016 at 11:08 AM, Richard Rodseth wrote:
> >
> > > Do I understand correctly that poll() will return a subset of the messages
> > > in a topic each time it is called? So if I want to replay all messages, I
> > > would seek to the beginning and call poll in a loop? Not easily knowing
> > > when I was done, without a high watermark
> > >
> > > https://issues.apache.org/jira/browse/KAFKA-2076
> > >
> > > This is a pretty basic question, but I don't think it is explained in the
> > > JavaDoc:
> > >
> > > http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
> > >
> > > Thanks
> > >
> >
> --
>
> Jens Rantil
> Backend Developer @ Tink
>
> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
> For urgent matters you can reach me at +46-708-84 18 32.
>
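
Putting the pieces of this thread together, here is a hedged sketch of
replaying a topic from the beginning with a poll loop. The broker address,
group id, topic name, and the empty-poll stop heuristic are assumptions, and
max.poll.records only exists once KIP-41 ships:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReplayFromBeginning {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "replay-example");          // assumed group id
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // Start from the earliest offset when the group has no committed offset.
            props.put("auto.offset.reset", "earliest");
            // Bound the records returned per poll(); only available once KIP-41 ships.
            props.put("max.poll.records", "500");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic
                int emptyPolls = 0;
                // Each poll() returns only a subset of the topic, so loop to replay all.
                // There is no high-watermark API to know when replay is "done"
                // (KAFKA-2076), so a few consecutive empty polls is a crude heuristic.
                while (emptyPolls < 3) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    if (records.count() == 0) {
                        emptyPolls++;
                        continue;
                    }
                    emptyPolls = 0;
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("%s-%d@%d: %s%n", record.topic(),
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }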


Re: poll method thread

2016-04-24 Thread Liquan Pei
Hi Spico,

The Kafka consumer is single-threaded, which means all operations, such as
sending heartbeats, fetching records, and maintaining group membership, are
done in the same thread as the caller.

Also, poll() is a blocking method with a timeout, and you can interrupt it
with the wakeup() method on KafkaConsumer.

Thanks,
Liquan
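
As a small sketch of the wakeup() mechanism (broker address, topic, and
timings are assumptions): wakeup() is the one KafkaConsumer method that is
safe to call from another thread, and it makes a blocked poll() throw
WakeupException so the polling loop can shut down cleanly.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.errors.WakeupException;

    public class WakeupExample {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "wakeup-example");          // assumed group id
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("my-topic")); // assumed topic

            Thread poller = new Thread(() -> {
                try {
                    while (true) {
                        // Blocks in this thread until the timeout elapses, records
                        // arrive, or wakeup() is called from another thread.
                        ConsumerRecords<String, String> records = consumer.poll(1000);
                        records.forEach(r -> System.out.println(r.value()));
                    }
                } catch (WakeupException e) {
                    // Expected: signals that wakeup() was invoked.
                } finally {
                    consumer.close();
                }
            });
            poller.start();

            Thread.sleep(10_000); // let it poll for a while
            consumer.wakeup();    // thread-safe; makes the blocked poll() throw
            poller.join();
        }
    }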


On Sun, Apr 24, 2016 at 11:43 AM, Spico Florin wrote:

> Hi!
> I would like to ask whether the Kafka consumer poll method runs in a separate
> thread from the caller or in the same thread as the caller.
> Is it a synchronous blocking method or asynchronous?
> Thank you,
> Florin
>



-- 
Liquan Pei
Software Engineer, Confluent Inc


poll method thread

2016-04-24 Thread Spico Florin
Hi!
I would like to ask whether the Kafka consumer poll method runs in a separate
thread from the caller or in the same thread as the caller.
Is it a synchronous blocking method or asynchronous?
Thank you,
Florin


Re: poll() semantics

2016-04-24 Thread Jens Rantil
Hi Richard,

> which defaults to a very large number, will affect the number of
> records returned by each call to poll()

No, it will affect the total sum of the message sizes fetched. This is not
the same as "number of messages". The upcoming 0.9.1 release (not out yet)
will contain a setting that allows you to set a cap on the maximum number
of messages that poll() returns. See also
https://cwiki.apache.org/confluence/display/KAFKA/KIP-41%3A+KafkaConsumer+Max+Records
.

Cheers,
Jens

On Sat, Apr 23, 2016 at 2:20 AM Richard Rodseth wrote:

> To answer my own question (partially), I have learned that
>
> max.partition.fetch.bytes
>
> , which defaults to a very large number, will affect the number of
> records returned by each call to poll()
>
> I also learned that seekToBeginning is a partition-level thing, but
>
>  props.put("auto.offset.reset","earliest")
> has the desired effect.
>
> On Fri, Apr 22, 2016 at 11:08 AM, Richard Rodseth wrote:
>
> > Do I understand correctly that poll() will return a subset of the messages
> > in a topic each time it is called? So if I want to replay all messages, I
> > would seek to the beginning and call poll in a loop? Not easily knowing
> > when I was done, without a high watermark
> >
> > https://issues.apache.org/jira/browse/KAFKA-2076
> >
> > This is a pretty basic question, but I don't think it is explained in the
> > JavaDoc:
> >
> > http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
> >
> > Thanks
> >
>
-- 

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.